The Spam Detection Model for Web Forums using Text Mining Techniques

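Spam postings in web forums increase user inconvenience and reduce the value of web forums as a source for gathering user opinions. Because the importance of a posting is generally measured by the number of users who participated in it or by its number of replies, spam postings introduce a large amount of unnecessary data when opinions are collected from a forum, and analyses based on such data can yield distorted results. This study proposes a detection model that automatically classifies spam postings in web forums. We apply text mining techniques to forum postings to extract their structural, semantic, and formal text features, and train a classification model with two classes, normal and spam. Because training such a model requires data labeled in advance as spam or normal, human evaluators assess whether each posting is spam. To build the automated learning model, we used Naive Bayesian, Support Vector Machine (SVM), and decision tree classifiers, whose performance has been demonstrated in data mining and text mining. This makes it possible to identify the features that distinguish spam from normal postings, and the trained classification model can be used to classify new postings automatically. We applied the proposed model to the Walmart forum on Yahoo Finance, the largest Walmart-related forum.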

Spam in web discussion forums inconveniences users and lowers the value of a forum as an open source of user opinion. Because the importance of a posting is evaluated in terms of the number of authors involved, spam distorts opinion analysis by adding unnecessary data. We propose a model that automatically detects spam postings in web forums. We extract text features from posting contents using text mining techniques from a linguistic perspective, and then apply supervised learning to distinguish spam from normal postings. Significant features are derived through the learning process, and the automatic detection model is built on those features. To obtain labeled training data, four evaluators were asked to identify spam postings in advance. We adopted Naive Bayesian, Support Vector Machine (SVM), and decision tree classifiers, which are known to perform well in data and text mining tasks. The model lets us extract the text features that characterize spam and automatically detect newly posted spam. We apply the proposed model to the Yahoo Finance Walmart forum, which is the world's largest Walmart-related web forum.
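The supervised step described above can be illustrated with a minimal sketch of one of the three named classifiers, Naive Bayes, written in plain Python. Everything specific here is an assumption made for illustration: the toy postings and their labels are invented, the whitespace `tokenize` function stands in for the paper's linguistic features (which the abstract does not enumerate), and the class name `NaiveBayesSpamClassifier` is hypothetical. It is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real text features)."""
    return text.lower().split()


class NaiveBayesSpamClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, postings, labels):
        # Count documents per class and word occurrences per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(postings, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior for each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # class prior
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # skip words never seen during training
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy postings, labeled by hand for illustration only.
train_postings = [
    "buy cheap pills online now",
    "click here for free money now",
    "walmart quarterly earnings beat estimates",
    "store traffic and earnings outlook for walmart",
]
train_labels = ["spam", "spam", "normal", "normal"]

model = NaiveBayesSpamClassifier().fit(train_postings, train_labels)
print(model.predict("free pills click now"))
print(model.predict("walmart earnings outlook"))
```

In this sketch the labels play the role of the evaluators' judgments, and the per-class word counts are the learned "features". A real system would replace `tokenize` with richer feature extraction and would compare the result against SVM and decision tree models on held-out data.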




This paper has been cited by 3 papers in KCI.
