Article Information
Abstract
(English)
In research on Web crawlers, the most important issues are the structural design and the solutions to the key technologies. Building on prior work, we describe the structural design of a distributed Web crawler, including the organization of the hardware and the module partitioning of the software. In this paper, one PC is used as the main node and the other PCs, connected in a LAN, serve as common nodes. The software architecture covers both the main-node design and the common-node design. We then analyze solutions to the major techniques of a distributed Web crawler, such as how the crawler nodes cooperate with each other, how tasks are distributed, and how to keep important Web pages fresh, and we propose practical algorithms to solve these problems. In addition, we implemented a robust, extensible, customizable, distributed Web crawler and analyze it in detail. Finally, we present the results of two experiments: a common test and a site-download test.
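The abstract mentions distributing crawl tasks among the common nodes, but the paper's exact scheme is not reproduced here. As an assumption for illustration, a common approach is hash-based partitioning, in which the main node assigns each URL to a node by hashing its host name; the function and constants below are hypothetical, not the authors' implementation:

```python
import hashlib

def assign_node(url: str, num_nodes: int) -> int:
    """Map a URL to a crawler node by hashing its host name.

    Hashing the host (rather than the full URL) keeps all pages of
    one site on the same node, so per-site politeness limits can be
    enforced locally without cross-node coordination.
    """
    # Strip the scheme and path to isolate the host component.
    host = url.split("//", 1)[-1].split("/", 1)[0]
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

urls = [
    "http://example.com/a.html",
    "http://example.com/b/c.html",
    "http://example.org/index.html",
]
nodes = [assign_node(u, 4) for u in urls]
# URLs from the same host always land on the same node.
assert nodes[0] == nodes[1]
```

A scheme like this lets the main node route newly discovered URLs deterministically, so any node can forward an out-of-partition URL to its owner without consulting a central table.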
Table of Contents
1. Introduction
2. Structural Design of Distributed Web Crawler
2.1 URL Distribution Module
2.2 Node Communication Module
2.3 URL Analysis Module
2.4 Download Module
2.5 Web Analysis Module
3. Key Technology of Web Crawler
3.1 Selection of Seed Set
3.2 Distributed Strategy
4. Experiment Implementation and Evaluation
4.1 Implementation of the System
4.2 Realization of Distributed Task Allocation
4.3 Realization of Single-node Downloading Tasks
4.4 System Evaluation
5. Conclusion
References