A Corpus-linguistic Study of Random Samples (무작위 표본에 대한 코퍼스 언어학적 연구)

Hong, Jungha (홍정하)

Article information

A Corpus-linguistic Approach to Random Samples

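Korea University Research Institute for Language and Information (고려대학교 언어정보연구소), Language Information (언어정보), No. 18, March 2014, pp. 137-162, KCI-indexed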

Abstract (English)

In quantitative studies, a random sample is supposed to be selected by probability sampling in such a way that it represents a population. The statistical analysis of corpus frequency data is based on a random sample model, which assumes that the corpus was randomly selected from the language. However, Kilgarriff (2005), Evert (2006), and Goh (2011) show that typical corpus data severely violate the randomness assumption. This paper aims to evaluate random sampling methods for corpus linguistics and to explore their characteristics and applicability. The methods are evaluated on the relative frequencies of 30 morphemes and on the frequencies of all morpheme types that occur in each sample, observed over 1,000 resampling trials, according to how close each random sample is to the normal distribution and to the Zipf-Mandelbrot law (Mandelbrot 1977). The study yields three findings. First, systematic sampling at the unit of measurement, i.e. individual words drawn from the entire corpus, is the best way to construct random samples for corpus linguistics. Second, the closer the relative frequencies of the 30 morphemes in a sample lie to the normal distribution, the closer the frequency distribution of all morpheme types lies to the Zipf-Mandelbrot distribution. Third, random samples are an effective way to solve problems that stem from differing sample sizes and data sparseness. Moreover, using them makes it easier to detect rather large differences in word frequencies obtained from different corpora.
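The sampling and resampling procedure summarized in the abstract can be illustrated with a minimal sketch. It assumes the corpus is already tokenized into a Python list of morphemes or words. The function names (`systematic_sample`, `relative_frequencies`, `resample`) and the fixed-interval, random-offset design are illustrative assumptions, not the paper's actual implementation.

```python
import random
from collections import Counter


def systematic_sample(tokens, sample_size, rng):
    """Pick every k-th token from a random offset, with k = len(tokens) // sample_size.

    Hypothetical helper: the sampling unit is the individual token,
    and the whole corpus is the sampling frame.
    """
    if sample_size <= 0 or sample_size > len(tokens):
        raise ValueError("sample_size must be between 1 and len(tokens)")
    step = len(tokens) // sample_size
    start = rng.randrange(step)  # random offset in [0, step)
    return [tokens[start + i * step] for i in range(sample_size)]


def relative_frequencies(sample, targets):
    """Relative frequency of each target item within one sample."""
    counts = Counter(sample)
    return {t: counts[t] / len(sample) for t in targets}


def resample(tokens, sample_size, targets, trials=1000, seed=0):
    """Repeat the sampling and collect relative frequencies per trial."""
    rng = random.Random(seed)
    return [
        relative_frequencies(systematic_sample(tokens, sample_size, rng), targets)
        for _ in range(trials)
    ]
```

With a fixed interval k, only k distinct systematic samples exist (one per offset), so repeated trials can return identical samples when k is small. The per-trial relative frequencies returned by `resample` could then be compared against a normal distribution, as the abstract describes, using whatever goodness-of-fit test the analysis calls for.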

Author information

Hong, Jungha (홍정하), Department of Linguistics, Korea University
