Article Information
A Corpus-linguistic Approach to Random Samples
Abstract
English
In quantitative studies, a random sample is supposed to be selected by probability sampling in such a way that it represents a population. The statistical analysis of corpus frequency data is based on a random sample model, which assumes that the corpus was randomly selected from the language. However, Kilgarriff (2005), Evert (2006), and Goh (2011) show that typical corpus data severely violate the randomness assumption. This paper aims to evaluate random sampling methods for corpus linguistics and to explore their characteristics and applicability. The methods are evaluated over 1,000 resampling trials, using the relative frequencies of 30 morphemes and the frequencies of all morpheme types that occur in each sample, according to how closely each random sample follows the normal distribution and the Zipf-Mandelbrot law (Mandelbrot 1977). The present study yields three findings. First, systematic sampling at the unit of measurement, i.e. individual words drawn from an entire corpus, is the best way to construct random samples for corpus linguistics. Second, the closer the relative frequencies of the 30 morphemes in a sample lie to the normal distribution, the closer the frequency distribution of all morpheme types lies to the Zipf-Mandelbrot distribution. Third, random samples are an effective way to address problems that stem from differing sample sizes and data sparseness. Moreover, using them makes it easier to detect relatively large differences in word frequencies obtained from different corpora.
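To illustrate the idea of systematic sampling at the level of individual words, a minimal Python sketch follows. The function name `systematic_sample`, its parameters, and the use of an integer sampling interval are assumptions made for this example; the abstract does not specify the paper's actual procedure.

```python
import random
from collections import Counter


def systematic_sample(tokens, sample_size, seed=None):
    """Pick every k-th token from a random starting offset.

    Assumption for this sketch: the interval k is the integer
    len(tokens) // sample_size, so a few tokens at the end of the
    corpus may never be selected when the division is not exact.
    """
    if not 0 < sample_size <= len(tokens):
        raise ValueError("sample_size must be between 1 and len(tokens)")
    rng = random.Random(seed)
    step = len(tokens) // sample_size   # fixed sampling interval k
    start = rng.randrange(step)         # random offset in [0, k)
    return [tokens[start + i * step] for i in range(sample_size)]


def relative_frequencies(sample):
    """Relative frequency of each type in the sample."""
    counts = Counter(sample)
    total = len(sample)
    return {item: n / total for item, n in counts.items()}


if __name__ == "__main__":
    # Toy "corpus" of word tokens, invented purely for illustration.
    corpus = ("the cat sat on the mat and the dog sat on the rug " * 50).split()
    sample = systematic_sample(corpus, sample_size=100, seed=42)
    print(len(sample))
    print(relative_frequencies(sample))
```

Repeating the call with different seeds gives a set of resampled frequency estimates whose spread can then be compared against a theoretical distribution.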
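Table of Contents
1. Introduction
2. Problem Statement
3. Research Method
4. Evaluation against Theoretical Distributions
5. Applications of Random Samples
6. Conclusion
References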
