Article Information
Abstract
English
Because the encoding of Mongolian web pages is not unified and the number of such pages is small, a method that combines a language model with hyperlink analysis is designed to address the problem. First, the language of each web page is identified with an N-Gram language model, and the average distance obtained from language identification is used as one component of the hyperlink correlation degree. Second, the hyperlink correlation degree is calculated from the anchor text, hyperlink increasing, and hyperlink depth. Finally, the hyperlinks, sorted by hyperlink correlation degree, become the seeds for collecting the next web pages. The experimental results show that collecting Mongolian web pages based on hyperlink correlation degree can effectively improve the amount of information collected, the collection speed, and the accuracy rate.
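The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives no formulas, so the following are all assumptions made for illustration: the rank-based "out-of-place" distance between character n-gram profiles, the weighted sum in `correlation_degree`, the weight values, and every function and class name. The "hyperlink increasing" factor is left out because the abstract does not define it.

```python
import heapq
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (1..n_max), ranked."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending count, then lexicographically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def average_distance(doc_profile, lang_profile):
    """Average rank displacement between two profiles, scaled to [0, 1].

    Assumed measure: 0.0 means identical rankings, 1.0 means no overlap.
    """
    if not doc_profile or not lang_profile:
        return 1.0
    max_penalty = len(lang_profile)
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    total = 0
    for r, gram in enumerate(doc_profile):
        if gram in rank:
            total += min(abs(r - rank[gram]), max_penalty)
        else:
            total += max_penalty  # n-gram unseen in the language profile
    return total / (len(doc_profile) * max_penalty)


def correlation_degree(lang_distance, anchor_score, depth,
                       weights=(0.5, 0.3, 0.2)):
    """Hypothetical weighted score; higher means a more promising link."""
    w_lang, w_anchor, w_depth = weights
    return (w_lang * (1.0 - lang_distance)
            + w_anchor * anchor_score
            + w_depth / (1 + depth))


class Frontier:
    """Crawl frontier that yields the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score.
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)
```

In this sketch a crawler would build `lang_profile` once from a Mongolian reference corpus, score every outgoing link of a fetched page with `correlation_degree`, push the links into the `Frontier`, and take its next seed from `pop()`.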
Table of Contents
1. Introduction
2. N-Gram-Based Mongolian Topic Recognition
3. Hyperlink Correlation Degree of Node
4. The Mongolian Web Page Collection Model
5. Experiment Design and Result Analysis
6. Conclusion
References