Source Information
Abstract
English
Analyzing the unstructured information in source code (that is, the comments and identifiers) rests on the idea that this information reveals, to some extent, the concepts of the software's problem domain. It adds a new layer of semantic information to the source code and captures the domain semantics of the software. Developers use identifiers, method names, and comments to incorporate components of the solution domain of the software. By analyzing words that frequently co-occur, topic models reveal topics in a corpus that embody real-world concepts. These topics have been found to be effective mechanisms for describing the major themes spanning a corpus. Recently, software engineering researchers have established that topic models can be effective in structuring various software artifacts, such as bug reports and requirements documents. In this paper, we extract topic models from the textual content of source code through a case study of four Java-based open-source systems: ArgoUML, Checkstyle, JHotDraw, and jEdit. The paper investigates the effectiveness of Latent Dirichlet Allocation (LDA) in comprehending large open-source software systems.
Table of Contents
1. Introduction
2. Latent Dirichlet Allocation (LDA)
3. Study Methodology
3.1. Unstructured Source Code
3.2. Systems Under Study
3.3. Study Setup
3.4. Choosing K
4. The Discovered Topics
5. Threats to Validity
6. Related Work
7. Conclusion
References
