The Similarity for Nominal Variables Based on F-Divergence

Abstract (English)

Measuring the similarity between nominal variables is an important problem in data mining, as it is the basis for measuring the similarity of data objects that contain nominal variables. There are two kinds of traditional methods for this task: the first simply distinguishes variables as identical or not, while the second measures similarity based on co-occurrence with variables of other attributes. Although these methods perform well under some conditions, their accuracy is still insufficient. This paper proposes an algorithm for measuring the similarity between nominal variables of the same attribute, based on the observation that this similarity depends on the relationship between the subsets of the dataset that contain them. The algorithm uses the differences between these distributions, quantified by f-divergence, to form feature vectors for the nominal variables. Theoretical analysis guides the choice of the best metric among the four most commonly used forms of f-divergence. The time complexity of the method is linear in the size of the dataset, which makes it suitable for processing large-scale data. Experiments applying the derived similarity metrics with K-modes on extensive UCI datasets demonstrate the effectiveness of the proposed method.
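The abstract does not spell out the full construction, but the core idea — comparing the conditional distributions that each nominal value induces over the other attributes, and quantifying their difference with the Hellinger distance (the f-divergence the paper settles on in Section 3.1) — can be sketched as follows. This is a minimal illustration, not the authors' exact algorithm; all function names and the toy attributes are hypothetical:

```python
import math
from collections import Counter

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts
    mapping outcome -> probability; result lies in [0, 1]."""
    keys = set(p) | set(q)
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
            for k in keys)
    return math.sqrt(s) / math.sqrt(2)

def conditional_dist(rows, attr, value, other_attr):
    """Distribution of other_attr over the subset of rows where attr == value."""
    counts = Counter(r[other_attr] for r in rows if r[attr] == value)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def nominal_distance(rows, attr, v1, v2, other_attrs):
    """Distance between two values of one nominal attribute: the average
    Hellinger distance between the conditional distributions they induce
    over each of the other attributes. One pass per attribute, so the cost
    is linear in the number of rows."""
    dists = [hellinger(conditional_dist(rows, attr, v1, a),
                       conditional_dist(rows, attr, v2, a))
             for a in other_attrs]
    return sum(dists) / len(dists)

# Toy example: "red" and "crimson" co-occur with the same shapes,
# so they come out closer to each other than either is to "blue".
rows = [
    {"color": "red",     "shape": "round"},
    {"color": "red",     "shape": "round"},
    {"color": "crimson", "shape": "round"},
    {"color": "blue",    "shape": "square"},
]
print(nominal_distance(rows, "color", "red", "crimson", ["shape"]))  # 0.0
print(nominal_distance(rows, "color", "red", "blue", ["shape"]))     # 1.0
```

Such a distance is finer-grained than the "same or not same" baseline the abstract mentions: it can rank "crimson" as closer to "red" than to "blue" even though all three are distinct symbols.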

Table of Contents

Abstract
 1. Introduction
 2. Proposed Algorithm
  2.1 Definition of Similarity
  2.2 Hellinger Distance
  2.3 Distance in Unsupervised Learning
 3. Theoretical Analysis
  3.1 Why Hellinger Distance
  3.2 Complexity of the Algorithm
 4. Experiments
  4.1 Intrinsic Method
  4.2 Extrinsic Method
 5. Conclusion
 References

Author Information

  • Zhao Liang — Institute of Graduate, Liaoning Technical University, Fuxin, Liaoning, 123000, P.R. China
  • Liu Jianhui — School of Electronic and Information Engineering, Liaoning Technical University, Huludao, Liaoning, 125000, P.R. China
