Characterization of Source Code Defects by Data Mining Conducted on GitHub

Abstract

Coding errors are unavoidable in software systems because of frequent source changes, tight deadlines, and inaccurate specifications. Tools that help find these errors are therefore important. One way to support bug prediction is to analyze the characteristics of previously found errors and to use those characteristics to identify unknown ones. This paper aims to characterize the known coding errors.

The popularity of source code hosting services such as GitHub is increasing rapidly. They provide a variety of services, the most important of which are version control and bug tracking systems. Version control systems store all versions of the source code, and bug tracking systems provide a unified interface for reporting errors. Bug reports can be used to identify the faulty and the previously fixed parts of the source code, so bugs can be characterized by static source code metrics or by other quantitatively measured properties computed from the gathered data.

We chose GitHub as the basis for data collection and selected 13 Java projects for analysis. The result is a database that characterizes the bugs of the examined projects and can therefore be used, inter alia, to improve the automatic detection of software defects.
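
As an illustration of how bug reports might be connected to the commits that fixed them, the following minimal sketch scans commit messages for GitHub's closing keywords (for example "Fixes #12") and groups commit identifiers by the bug they reference. The abstract does not describe the linking procedure used in the paper, so this is only an assumption about one plausible approach; the function names, the input shapes, and the idea that the set of bug issue numbers is known in advance (for instance from issue labels) are all hypothetical.

```python
import re
from collections import defaultdict

# GitHub's closing keywords (close/closes/closed, fix/fixes/fixed,
# resolve/resolves/resolved) followed by an issue reference such as "#42".
CLOSING_REF = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\b:?\s+#(\d+)",
    re.IGNORECASE,
)


def linked_issue_numbers(message):
    """Return the sorted issue numbers a commit message claims to close."""
    return sorted({int(n) for n in CLOSING_REF.findall(message)})


def map_bugs_to_commits(commits, bug_issue_numbers):
    """Group commit ids by the bug issue they reference.

    commits: iterable of (commit_id, message) pairs (hypothetical input shape).
    bug_issue_numbers: set of issue numbers already known to be bug reports
    (assumption: obtained elsewhere, e.g. from issue labels).
    """
    fixes = defaultdict(list)
    for commit_id, message in commits:
        for number in linked_issue_numbers(message):
            if number in bug_issue_numbers:
                fixes[number].append(commit_id)
    return dict(fixes)
```

Given such a mapping, the files touched by each fixing commit could be measured before and after the fix to obtain the kind of metric-based characterization the abstract mentions. Keyword matching of this sort misses fixes whose commit messages do not mention the issue, so it should be read as a lower bound on the real links.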

Publication
Proceedings of the 15th International Conference on Computational Science and Its Applications (ICCSA 2015), Banff, Alberta, Canada, Pages 47–62

BibTeX:

@InProceedings{GGT15,
    author    = {Gyimesi, P\'eter and Gyimesi, G\'abor and T\'oth, Zolt\'an and Ferenc, Rudolf},
    booktitle = {Proceedings of the 15th International Conference on Computational Science and Its Applications (ICCSA 2015)},
    title     = {Characterization of Source Code Defects by Data Mining Conducted on {GitHub}},
    year      = {2015},
    address   = {Banff, Alberta, Canada},
    month     = jun,
    pages     = {47--62},
    publisher = {Springer-Verlag},
    series    = {Lecture Notes in Computer Science (LNCS)},
    volume    = {9159},
    doi       = {10.1007/978-3-319-21413-9_4},
    keywords  = {Bug database, GitHub, Data mining},
    url       = {https://link.springer.com/chapter/10.1007%2F978-3-319-21413-9_4},
}