A Named Entity Recognition Model in Criminal Investigation Domain using Pretrained Language Model

Hee-Dou Kim; Heuiseok Lim

Korea Convergence Society, Journal of the Korea Convergence Society, Vol. 13, No. 2, 2022.02, pp. 13-20, KCI-indexed

Abstract

English

This study develops a named entity recognition (NER) model specialized for the criminal investigation domain using deep learning techniques. We propose a system that automatically extracts and categorizes crime-related information from text-based data such as criminal judgments and investigation documents, so that it can support data-driven crime analysis for prevention and investigation in the future. For this study, we collected criminal investigation domain text and newly defined the required entity types from the perspective of crime analysis. The proposed model applies KoELECTRA, a pretrained language model that has recently shown high performance in natural language processing. On the crime domain NER experiment data, it achieves a micro average (hereafter micro avg) F1-score of 98% and a macro average (hereafter macro avg) F1-score of 95% on the 9 main categories, and a micro avg F1-score of 98% and a macro avg F1-score of 62% on the 56 sub categories. We analyze the proposed model from the perspective of future improvement and utilization.

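Korean (translated)

This study develops a named entity recognition model specialized for the criminal investigation domain using deep learning techniques. We propose a system that automatically extracts and categorizes criminal methods and crime-related information from unstructured text-based data such as criminal judgments and investigation documents, so that it can contribute to crime prevention analysis and investigation based on data analysis techniques in the future. In this study, we collected criminal investigation domain text and newly defined the named entity categories required from the perspective of crime analysis. The proposed model applies KoELECTRA, a pretrained language model that has recently shown high performance in natural language processing. On the crime domain named entity experiment data defined in this study, it achieves a micro average (hereafter micro avg) F1-score of 99% and a macro average (hereafter macro avg) F1-score of 96% on the 9 main categories, and a micro avg F1-score of 98% and a macro avg F1-score of 62% on the 56 sub categories. We analyze the proposed model from the perspective of future improvement and utilization.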
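The gap between the micro avg and macro avg F1-scores reported in the abstract (98% versus 62% on the 56 sub categories) is the kind of gap that typically appears when a few frequent categories dominate the data and many rare ones are recognized poorly. The following is a minimal sketch of the two averages, not the paper's evaluation code. It assumes a simplified label-level setting in which each item has exactly one gold label and one predicted label; the label names (`PERSON`, `WEAPON`, `DRUG`) and the toy data are hypothetical.

```python
from collections import Counter


def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for two equal-length label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1      # correct prediction for label g
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[g] += 1      # g was missed
    labels = sorted(set(gold) | set(pred))

    def f1(t, f_pos, f_neg):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when nothing to score.
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: one frequent label and two rare ones.
gold = ["PERSON"] * 96 + ["WEAPON"] * 2 + ["DRUG"] * 2
pred = ["PERSON"] * 98 + ["DRUG"] * 1 + ["PERSON"] * 1

micro, macro = f1_scores(gold, pred)
print(f"micro avg F1 = {micro:.2f}")  # 0.97
print(f"macro avg F1 = {macro:.2f}")  # 0.55
```

In this toy case the frequent label is almost always correct, so the pooled (micro) score stays high, while the two rare labels pull the unweighted (macro) mean down. NER systems are often scored at the entity-span level rather than per label as done here, so the sketch only illustrates the averaging, not the matching criterion.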

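Summary
Abstract
1. Introduction
1.1 Introduction
2. Related Work
2.1 Crime Information Extraction
3. Named Entity Recognition Model for the Criminal Investigation Domain
3.1 Definition of Crime Domain Named Entities
3.2 Bi-LSTM
3.3 Pretrained Language Model
4. Experiments and Analysis
4.1 Data
4.2 Named Entity Tagging
4.3 Evaluation Metrics
4.4 Experiments
4.5 Experimental Results
4.6 Discussion
5. Conclusion and Implications
REFERENCES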

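Author Information

Hee-Dou Kim (김희두). Master's student, Department of Big Data Convergence, Korea University
Heuiseok Lim (임희석). Professor, Department of Computer Science, Korea University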
