Predicting Default through Machine Learning-Based Analysis of Corporate Credit Information

Minchan Song; Doojin Ryu

Article information

Predicting Loan Delinquency by Analyzing Sample DB with Machine Learning

Korean Finance Association (한국재무학회), 재무연구, Vol. 34, No. 4, November 2021, pp. 199-234, KCI-listed

Abstract

English

This paper investigates the ability to predict corporate default using loan-sample data from the Korea Credit Information Service's financial big data open system (CreDB). Corporate loans from a financial institution increase that institution's credit exposure. Because the measured impact on credit risk is used to determine the pricing model and structure of loan products, it is an essential factor in the institution's profit structure. In terms of risk management, predicting delinquency from loan data is necessary for 5,000 Korean financial institutions. Several earlier studies forecast bankruptcy for listed companies that disclose financial and stock price information. This study increases practical utility by extending the analysis target to individual entrepreneurs and small and medium-sized enterprises (SMEs). In addition, it presents representative big data analysis results using the loan, delinquency, and technology credit information of approximately 1.1 million corporations, which is 20% of the almost 5.6 million domestic sole proprietors and non-listed corporations. The loan data include ten monthly loan type codes and eleven overdue reason codes. Prediction targets are separated into individual and corporate entrepreneurs, and the analyses are further divided according to whether the processed dataset is used. For efficient analysis, the data dimension was reduced by changing the table structure through nested iterative operations, expanding the variable composition from a table consisting of N rows to one column. To reflect the characteristics of the data as fully as possible, exploratory data analysis and feature engineering were performed. Nine classification models are trained and divided into four groups according to their parameter estimation method.
Group 1 consists of logistic regression and linear discriminant analysis, which are parametric methods. Group 2 consists of several algorithms that calculate distances for model learning. Group 3 consists of tree-based algorithms, which are also non-parametric methods. Group 4 consists of a deep neural network, which is a semi-parametric method. However, of the total 438,697 corporations, only 810 defaulted, about 0.2% of the prediction targets, so the target distribution is severely imbalanced. For this reason, the imbalanced data were undersampled before model fitting. The bias of the sampled training and validation data is minimized by performing K-fold cross-validation with K = 5. The analysis results suggest a significant effect on classification performance when the processed data are used, but no significant effect when the loan owner's characteristics are included. Moreover, technology credit rating (TCB) information has no meaningful effect with respect to the type of corporation. Classification with the deep neural network (DNN), a semi-parametric method, achieves the best binary classification performance, whereas non-parametric, non-tree-based models are not appropriate for analyzing loan data. The DNN showed the highest classification performance across all analyses and entrepreneur classifications performed in this study. The neural network used in this study consists of 14 hidden layers. Following the neural network baseline design, the sigmoid function was applied to the activation function's initial value, the ReLU function was applied to the hidden layers, and optimization was performed with the Adam optimizer.
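The undersampling and 5-fold cross-validation steps described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function names `undersample` and `stratified_kfold`, the 1:1 class ratio, the fixed seed, and the stratified fold assignment are all assumptions. Only the counts (810 defaults among 438,697 corporations) and K = 5 come from the abstract.

```python
import random

def undersample(labels, ratio=1.0, seed=0):
    """Keep every default (1) and a random subset of non-defaults (0).

    ratio is the assumed number of non-defaults kept per default.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(len(pos) * ratio))
    return sorted(pos + rng.sample(neg, n_keep))

def stratified_kfold(indices, labels, k=5, seed=0):
    """Yield (train, valid) index lists with each class spread evenly over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted({labels[i] for i in indices}):
        members = [i for i in indices if labels[i] == cls]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin into the folds.
        for n, idx in enumerate(members):
            folds[n % k].append(idx)
    for f in range(k):
        valid = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, valid

# Synthetic labels with the class counts reported in the abstract.
labels = [1] * 810 + [0] * (438_697 - 810)
kept = undersample(labels)                      # balanced subset of row indices
splits = list(stratified_kfold(kept, labels))   # five (train, valid) pairs
```

Each `(train, valid)` pair would then be used to fit and evaluate one model. Whether the study undersampled once before splitting, as here, or inside each training fold is not stated in the abstract.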
In particular, the analysis uses credit transaction information based on the credit information of all financial institutions in Korea, which may help alleviate the information asymmetry that individual credit institutions face regarding risk management targets. In addition, the parametric methodologies used in classical studies and most widely used in practice showed lower average classification performance for major segments than the semi-parametric methodology, with a difference of up to 16 percent. This paper suggests a direction for using loan-sample data and serves as foundational research for financial institutions that use loan data for credit risk management. Research on corporate credit information analysis that focuses on semi-parametric methodologies should be expanded.

Korean

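This study predicts corporate default by analyzing the corporate credit information provided by the credit information sample DB remote analysis system. The analysis targets are divided by type of business operator, and experiments are organized according to how the corporate credit information provided in the sample DB is used. In addition, various machine learning techniques are grouped by parameter estimation approach into parametric, non-parametric, and semi-parametric methodologies, and their predictive performance is compared. Compared with analysis of the raw sample DB data, using data processed by loan and delinquency type improved the performance of each machine learning model, whereas using the characteristic information of corporate borrowers and technology credit rating information did not contribute to performance improvement. In all segments, the deep neural network model, which belongs to the semi-parametric methodology, showed the best performance, while non-parametric methods other than tree-based ones showed low recall and were not suitable for the default prediction problem. Classification performance improved when the semi-parametric methodology was used rather than the parametric methodology used in existing practice. This study is the first to attempt corporate distress prediction using the sample DB constructed from actual corporate credit information, and it offers direction on data use and model building for lending financial institutions and credit information companies that use corporate credit information.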

Author information

Minchan Song (송민찬). Manager, NICE D&B
Doojin Ryu (류두진). Professor, Department of Economics, Sungkyunkwan University
