Efficient Document Similarity Detection Using Weighted Phrase Indexing

Abstract

Document similarity techniques mostly rely on single-term analysis of the documents in a data set. To improve the efficiency and effectiveness of document similarity detection, many researchers have developed and presented more informative feature terms. In this paper, we present the phrase weight index, which indexes the documents in a data set based on their important phrases. Phrasal indexing aims to reduce the ambiguity inherent in words considered in isolation and thereby improve the effectiveness of document similarity computation. The presented method inherits the term tf-idf weighting scheme to compute the important phrases in the collection: it computes the weight of each phrase in the document collection, and the important phrases are identified according to a given threshold and then indexed. The data dimensionality that hinders the performance of various document similarity methods is addressed by creating, offline, an index of important phrases for every document. The evaluation experiments indicate that the presented method is effective for document similarity detection and that its quality surpasses both the traditional phrase-based approach, which ignores dimensionality reduction, and other methods that use single-word tf-idf.
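The pipeline described in the abstract (weight phrases with a tf-idf style score, keep those selected by a threshold, index them per document, then compare documents through the index) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the phrase extraction rule or the exact weighting formula, so the sketch assumes word bigrams as phrases, a weight of relative frequency times log(N/df), and a strict "greater than" threshold test. The names `phrases`, `build_phrase_index`, and `cosine` are hypothetical.

```python
import math
from collections import Counter


def phrases(text, n=2):
    """Split text into lowercase word n-grams (assumed stand-in for phrase extraction)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_phrase_index(docs, threshold=0.0, n=2):
    """Build {doc_id: {phrase: weight}}, keeping phrases whose weight exceeds threshold.

    The weight is an assumed tf-idf variant: (count / phrases in doc) * log(N / df).
    """
    counts = {doc_id: Counter(phrases(text, n)) for doc_id, text in docs.items()}

    # Document frequency: number of documents that contain each phrase.
    df = Counter()
    for phrase_counts in counts.values():
        df.update(phrase_counts.keys())

    n_docs = len(docs)
    index = {}
    for doc_id, phrase_counts in counts.items():
        total = sum(phrase_counts.values()) or 1
        weights = {
            p: (tf / total) * math.log(n_docs / df[p])
            for p, tf in phrase_counts.items()
        }
        # Dimensionality reduction step: drop phrases at or below the threshold.
        index[doc_id] = {p: w for p, w in weights.items() if w > threshold}
    return index


def cosine(a, b):
    """Cosine similarity between two sparse phrase-weight vectors."""
    dot = sum(w * b[p] for p, w in a.items() if p in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps",
        "b": "the quick brown fox sleeps",
        "c": "an unrelated sentence about indexing",
    }
    index = build_phrase_index(docs)  # built once, offline
    print(cosine(index["a"], index["b"]), cosine(index["a"], index["c"]))
```

Because the index is built once ahead of time, a similarity query only touches the retained phrases of each document. Raising `threshold` shrinks each vector further, at the cost of discarding phrases shared by many documents.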

Table of Contents

Abstract
 1. Introduction
 2. Related Works
 3. Phrase-based Searching for Document Similarity
  3.1. Efficient Detection of Good Phrases
  3.2. Duplicate or Near-Duplicate Document Detection
  3.3. Pairwise similarity
  3.4. Similarity of Document Compared to the Document Collection
  3.5. Comparing a Document to the Corpus
 4. Experimental Results
  4.1. Document Collection
  4.2. Efficiency
 5. Conclusion
 References

Author Information

  • Papias Niyigena, School of Information Science and Engineering, Central South University, Changsha, 410083, PR China
  • Zhang Zuping, School of Information Science and Engineering, Central South University, Changsha, 410083, PR China
  • Mansoor Ahmed Khuhro, School of Information Science and Engineering, Central South University, Changsha, 410083, PR China
  • Damien Hanyurwimfura, College of Science and Technology, University of Rwanda, Kigali, Rwanda
