Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing

Yong-Il Kim; Yoo-Kang Ji; Sun Park

Security Engineering Research Support Center (IJSEIA), International Journal of Software Engineering and Its Applications, Vol. 8, No. 4, April 2014, pp. 1-10 (SCOPUS)

Abstract

Class labels for clustering can be generated automatically, but such labels are of much lower quality than labels specified by humans. If class labels are provided, clustering is more effective. In classic document clustering based on the vector model, documents are represented by term frequencies, without considering the semantic information of each document. Because of this property, the vector model may incorrectly assign documents to different clusters when documents of the same cluster lack shared terms. Knowledge-based approaches have been applied to overcome this problem. However, in these approaches the clustering is influenced by the inherent structure of the documents, and constructing an ontology is costly. In addition, these methods are limited in their ability to cluster the rapidly growing volume of big text data in cloud environments. In this paper, we propose a big text document clustering method that uses class label terms and semantic features, based on Hadoop. Class label terms, extracted by Hadoop-based non-negative matrix factorization (NMF), represent the inherent structure of document clusters well. The proposed method improves the quality of document clustering at little cost by using the class label terms and term weights based on term mutual information (TMI) with WordNet. It can also cluster big-data-scale document collections using distributed parallel processing based on Hadoop. The experimental results demonstrate that the proposed method achieves better performance than other document clustering methods.
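
The pipeline outlined in the abstract (NMF to extract class label terms, then similarity-based assignment of documents) can be illustrated at toy scale. The following is a minimal single-machine sketch, not the authors' implementation: it leaves out Hadoop, WordNet, and TMI weighting entirely, and it uses plain multiplicative-update NMF. The function names (`nmf`, `label_terms`, `assign`), the toy vocabulary and counts, and the choice of cosine similarity against binary label-term vectors are all assumptions made for illustration.

```python
import math
import random


def transpose(M):
    return [list(row) for row in zip(*M)]


def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def nmf(A, k, iters=200, seed=0):
    """Factor a non-negative term-document matrix A (m x n) into
    W (m x k) and H (k x n) using multiplicative updates."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    eps = 1e-9  # guards against division by zero
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[r][j] * num[r][j] / (den[r][j] + eps) for j in range(n)]
             for r in range(k)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][r] * num[i][r] / (den[i][r] + eps) for r in range(k)]
             for i in range(m)]
    return W, H


def label_terms(W, vocab, top=2):
    """For each column of W, take the highest-weighted terms as
    that cluster's label terms (an illustrative assumption)."""
    labels = []
    for r in range(len(W[0])):
        ranked = sorted(range(len(vocab)), key=lambda i: W[i][r], reverse=True)
        labels.append([vocab[i] for i in ranked[:top]])
    return labels


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def assign(A, vocab, labels):
    """Assign each document (a column of A) to the label set whose
    binary term vector is most similar to it."""
    docs = transpose(A)
    label_vecs = [[1.0 if t in lab else 0.0 for t in vocab] for lab in labels]
    return [max(range(len(labels)), key=lambda c: cosine(d, label_vecs[c]))
            for d in docs]


# Hypothetical toy data: rows are terms, columns are four documents.
vocab = ["hadoop", "cluster", "goal", "match"]
A = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
W, H = nmf(A, k=2)
labels = label_terms(W, vocab, top=2)
clusters = assign(A, vocab, labels)
print(labels)
print(clusters)
```

On clearly separated toy data such as this, the two label sets would be expected to split along the two topic groups, although NMF results depend on the random initialization. In a Hadoop setting the matrix products inside `nmf` are the natural candidates for distribution, but how the paper partitions that work is not stated in the abstract.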

Abstract
1. Introduction
2. Hadoop Framework
3. Non-negative Matrix Factorization
4. Proposed Document Clustering Method
  4.1. Preprocessing
  4.2. Extracting Class Label Terms
  4.3. Computing Term Weights by TMI based on WordNet
  4.4. Clustering Document using Similarity
5. Experiments and Evaluation
6. Conclusion
References

Author information

Yong-Il Kim, Honam University, South Korea
Yoo-Kang Ji, DongShin University, South Korea
Sun Park, GIST, South Korea
