Source Information
Abstract
English
The internet has become an indispensable part of people's lives. For enterprises, the internet holds a mass of valuable information, including not only competitor information but also customers' evaluations of products. This information is an important source of business intelligence. This paper aims to build a focused crawler that filters business intelligence from the vast amount of information on the internet. The crawler takes a certain number of web pages as seeds, extracts the URLs in these pages, and parses the main text of the page behind every URL. It then calculates the relevancy between each main text and the crawler's topic based on the VSM (vector space model) and TF-IDF (Term Frequency-Inverse Document Frequency). If a web page is relevant, it is saved; otherwise, it is discarded. Finally, an experiment is conducted to test the performance of the crawler. The results of this experiment show that the recall rate and accuracy of the crawler are very high.
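The relevance-filtering step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact weighting formula or threshold, so everything here is an assumption for illustration: the function names (`tokenize`, `tf_idf_vectors`, `cosine`, `filter_relevant`), the smoothed IDF variant, and the threshold value of 0.1 are all hypothetical. Page fetching and main-text extraction are omitted; the sketch assumes the main text of each page is already available as a string.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Assumed smoothed IDF, so terms present in every document keep a weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty text
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors in the vector space model."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_relevant(topic, pages, threshold=0.1):
    """Keep pages whose main text is similar enough to the topic description.

    `pages` maps URL -> main text; the threshold is an assumed value.
    Returns a dict URL -> relevance score for the pages that are kept.
    """
    vectors = tf_idf_vectors([topic] + list(pages.values()))
    topic_vec = vectors[0]
    kept = {}
    for url, vec in zip(pages.keys(), vectors[1:]):
        score = cosine(topic_vec, vec)
        if score >= threshold:
            kept[url] = score  # relevant: save; otherwise the page is discarded
    return kept
```

For example, calling `filter_relevant("product review customer price", pages)` on a dictionary with one page of product-review text and one page about football keeps only the first, since the second shares no terms with the topic and scores zero. A real crawler would also need to tune the threshold against labeled pages.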
Table of Contents
1. Introduction
2. The Process of Focused Crawler
3. Key Technologies
3.1 Web page pretreatment
3.2 Web page analysis
3.3 The algorithm for calculating a website’s weight
3.4 The TF-IDF text relevancy analysis based on the VSM
4. Experiment
4.1 Evaluation index
4.2 Results analysis
5. Conclusion
References