Bug Database of GitHub Projects

A Public Bug Database of GitHub Projects and its Application in Bug Prediction.

Online appendix for the ICCSA 2016 paper
(7th International Symposium on Software Quality).


Zoltán Tóth, Péter Gyimesi, and Rudolf Ferenc.


Detecting defects in software systems is an evergreen topic, since there is no real-world software without bugs. Many bug-locating algorithms have been presented recently that can help detect hidden and newly occurring bugs in software. Papers that try to predict faulty source code elements or code segments in a system always rely on experience from the past. In most cases, these studies construct a database for their own purposes and do not make the gathered data publicly available. Public datasets are rare; however, a well-constructed dataset could serve as a benchmark test input. Furthermore, open-source software development is growing rapidly, which also provides an opportunity to work with public data.
In this study, we selected 15 Java projects from GitHub from which to construct a public bug database. We matched the already known and fixed bugs with the corresponding source code elements (classes and files) and calculated a wide set of product metrics on these elements. After creating the bug database, we investigated whether it is usable for bug prediction. We used 13 machine learning algorithms to address this research question and achieved F-measure values between 0.7 and 0.8. Besides the F-measure values, we calculated the bug coverage ratio on every project for every machine learning algorithm. We obtained very high and promising bug coverage values (up to 100%).
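
To make the two evaluation measures concrete, here is a minimal Python sketch. It is not the authors' implementation. The F-measure is the standard harmonic mean of precision and recall. The bug coverage function assumes that coverage means the share of all known bugs located in elements the model flags as buggy; that definition, the function names, and the sample data are illustrative assumptions, not taken from the paper or the dataset.

```python
def f_measure(actual, predicted):
    """Harmonic mean of precision and recall for boolean labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def bug_coverage(bug_counts, predicted):
    """Assumed definition: fraction of all bugs that sit in flagged elements.

    Simplification: a bug touching several elements is counted once per element.
    """
    total = sum(bug_counts)
    if total == 0:
        return 0.0
    covered = sum(n for n, p in zip(bug_counts, predicted) if p)
    return covered / total


# Made-up example: five classes with their known bug counts and a model's verdicts.
bug_counts = [3, 0, 1, 0, 2]
predicted = [True, False, False, True, True]
actual = [n > 0 for n in bug_counts]

print(f_measure(actual, predicted))
print(bug_coverage(bug_counts, predicted))
```

In this made-up example the model misses one buggy class that holds a single bug, so bug coverage (5 of 6 bugs) is higher than recall (2 of 3 buggy classes). This illustrates why the two measures can diverge.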


Keywords: bug prediction, bug database

Online appendix:

Download link for the GitHub Bug DataSet 1.0.

Download link for the GitHub Bug DataSet 1.1.