
Focused Crawler Research for Business Intelligence Acquisition

Abstract


The internet has become an indispensable part of people's lives. For enterprises, it holds a large amount of valuable information, including not only competitor information but also customers' evaluations of products. This information is an important source of business intelligence. This paper aims to build a focused crawler that filters business intelligence out of the vast amount of information on the internet. The crawler takes a certain number of web pages as seeds, extracts the URLs in these pages, and parses the main text of the page behind every URL. It then calculates the relevancy between each main text and the crawler's topic based on the VSM (vector space model) and TF-IDF (Term Frequency-Inverse Document Frequency). If a web page is relevant, it is saved; otherwise, it is discarded. Finally, an experiment is conducted to test the crawler's performance. The results show that the crawler's recall rate and accuracy are very high.
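
The relevancy step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the 0.1 threshold, and the function names (tokenize, tf_idf_vectors, cosine, filter_relevant) are all assumptions chosen for illustration. Each text becomes a TF-IDF weighted vector, and a page is kept when the cosine similarity between its vector and the topic vector reaches the threshold.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document.

    Assumption: IDF is smoothed as ln((1 + N) / (1 + df)) + 1 so that terms
    occurring in every document still get a non-zero weight.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors in the vector space model."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_relevant(topic, pages, threshold=0.1):
    """Keep (page, score) pairs whose similarity to the topic meets the threshold.

    The threshold value is an illustrative assumption, not taken from the paper.
    """
    vectors = tf_idf_vectors([topic] + pages)
    topic_vec, page_vecs = vectors[0], vectors[1:]
    kept = []
    for page, vec in zip(pages, page_vecs):
        score = cosine(topic_vec, vec)
        if score >= threshold:
            kept.append((page, score))
    return kept
```

With a topic such as "smartphone product review customer evaluation", a page reading "customer review of the new smartphone product" shares several weighted terms with the topic and is saved, while a page that shares no terms with the topic scores 0 and is discarded.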

Table of Contents

Abstract
 1. Introduction
 2. The Process of Focused Crawler
 3. Key Technologies
  3.1 Web page pretreatment
  3.2 Web page analysis
  3.3 The algorithm of calculate website’s weight
  3.4 The TF-IDF text relevancy analysis based on the VSM
 4. Experiment
  4.1 Evaluation index
  4.2 Results analysis
 5. Conclusion
 References

Author Information

  • Peng Xin, School of Economics and Management, Beijing Jiaotong University, No.3 Shang Yuan Cun, Hai Dian District, Beijing 100044, P.R. China
  • Qin Qiuli, School of Economics and Management, Beijing Jiaotong University, No.3 Shang Yuan Cun, Hai Dian District, Beijing 100044, P.R. China
