Utilization of Association Rule Thresholds Considering Frequency and Occurrence/Nonoccurrence Rates

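Data mining is a technique for discovering the knowledge and patterns implicit in big data. Association rules, the most actively studied of the data mining techniques, are used to explore relationships among items on the basis of evaluation criteria such as support, confidence, and lift. To address the shortcomings of existing association rules, this paper proposes an association rule evaluation model that simultaneously considers an item's occurrence frequency, relative occurrence rate, and relative nonoccurrence rate. In addition, four kinds of support, confidence, and lift are compared using example data. The proposed model can prevent errors caused by the information loss that results from ignoring the magnitude of occurrence frequency, and it allows association rules between items with different occurrence rates to be compared reasonably and fairly. The model is therefore considered useful for adjusting for the influence of product price or sales volume in a large shopping mall, or for considering a professional baseball player's batting average together with the number of hits in a single game.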

Data mining is a powerful technology with great potential to help companies focus on the most important information in a big database. One of the most thoroughly studied techniques in data mining is association rule mining. An association rule technique finds relations among items in a huge database and has been applied in various fields. It is intended to identify strong rules in large databases using different measures of interestingness. There are three primary quality measures for meaningful association rules: support, confidence, and lift. In this paper, we propose association thresholds that consider frequency together with relative occurrence and nonoccurrence rates for association rule exploration. Comparative studies with four kinds of supports and confidences are shown through numerical examples. As a result, we have confirmed that the proposed model can prevent errors due to the loss of information caused by ignoring the size of the occurrence frequency, and that it allows association rules between itemsets with different occurrence and nonoccurrence rates to be compared reasonably and fairly.
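For readers unfamiliar with the three conventional measures named above, the following is a minimal sketch of their textbook definitions: support as the fraction of transactions containing an itemset, confidence as the conditional relative frequency of the consequent given the antecedent, and lift as confidence divided by the support of the consequent. It does not implement the thresholds proposed in this paper, which are not defined in the abstract. The function names and the small `transactions` list are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """support(A and B) / support(A); raises ZeroDivisionError if A never occurs."""
    a, b = frozenset(antecedent), frozenset(consequent)
    return support(transactions, a | b) / support(transactions, a)


def lift(transactions: List[FrozenSet[str]],
         antecedent: Iterable[str],
         consequent: Iterable[str]) -> float:
    """confidence(A -> B) / support(B); a value of 1 indicates independence."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical toy data, not taken from the paper.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
    frozenset({"bread", "milk"}),
]

if __name__ == "__main__":
    print("support    :", support(transactions, {"bread", "milk"}))
    print("confidence :", confidence(transactions, {"bread"}, {"milk"}))
    print("lift       :", lift(transactions, {"bread"}, {"milk"}))
```

Because these measures are computed from relative frequencies alone, two rules with the same ratios but very different absolute counts receive identical scores, which is the kind of information loss the proposed thresholds are meant to address.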