Class Balancing Approaches to Improve Software Defect Prediction Estimations
https://doi.org/10.15514/ISPRAS-2024-36(6)-2
Abstract
Addressing software defects is an ongoing challenge in software development, and effectively managing and resolving defects is vital for ensuring software reliability, which is in turn a crucial quality attribute of any software system. Software defect prediction supported by Machine Learning (ML) methods offers a promising approach to this problem. However, a common challenge in ML-based software defect prediction is data imbalance. In this paper, we present an empirical study that assesses the impact of various class balancing methods on the class imbalance issue in software defect prediction. We conducted a set of experiments involving nine distinct class balancing methods across seven different classifiers, using datasets from the PROMISE repository that originate from NASA software projects. We employed several metrics, including AUC, Accuracy, Precision, Recall, and the F1 measure, to gauge the effectiveness of the different class balancing methods. Furthermore, we applied hypothesis testing to determine whether there are significant differences in metric results between datasets with balanced and unbalanced classes. Based on our findings, we conclude that balancing the classes in software defect prediction yields significant improvements in overall performance. Therefore, we strongly advocate for the inclusion of class balancing as a pre-processing step in this domain.
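To make the pipeline concrete, the following minimal Python sketch illustrates class balancing applied as a pre-processing step before training a defect classifier and computing the metrics listed above. It is illustrative only: SMOTE (from imbalanced-learn) and a random forest stand in for the nine balancing methods and seven classifiers studied in the paper, and the synthetic feature matrix is a placeholder for the PROMISE/NASA datasets; it is not the authors' exact experimental setup.

```python
# Minimal sketch: class balancing as a pre-processing step in defect prediction.
# The data below is synthetic and only stands in for a PROMISE/NASA-style dataset.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 21))            # 21 static code metrics (placeholder values)
y = (rng.random(500) < 0.15).astype(int)  # ~15% defective modules: imbalanced classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Balance only the training split; the test split keeps its natural distribution.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(random_state=42).fit(X_bal, y_bal)
proba = clf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

print("AUC      :", roc_auc_score(y_test, proba))
print("Accuracy :", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred, zero_division=0))
print("Recall   :", recall_score(y_test, pred, zero_division=0))
print("F1       :", f1_score(y_test, pred, zero_division=0))
```

Balancing only the training split, and never the test split, mirrors the usual experimental protocol and avoids leaking synthetic samples into the evaluation.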
About the Authors
Ángel Juan SÁNCHEZ-GARCÍA
Mexico
PhD in artificial intelligence, associate professor at Facultad de Estadística e Informática, Universidad Veracruzana, Mexico (School of Statistics and Informatics, University of Veracruz). Research interests: software measurement, machine learning, effort prediction, evolutionary computation.
Héctor Xavier LIMÓN RIAÑO
Mexico
PhD in artificial intelligence, associate professor at Facultad de Estadística e Informática, Universidad Veracruzana, Mexico (School of Statistics and Informatics, University of Veracruz). Research interests: data mining, multi-agent systems, distributed systems, and software architecture.
Saúl DOMÍNGUEZ-ISIDRO
Mexico
PhD in artificial intelligence, associate professor at Facultad de Estadística e Informática, Universidad Veracruzana, Mexico (School of Statistics and Informatics, University of Veracruz). Research interests: distributed systems, software development, computational intelligence, and machine learning.
Dan Javier OLVERA-VILLEDA
Mexico
Bachelor’s degree in software engineering from Facultad de Estadística e Informática, Universidad Veracruzana, Mexico (School of Statistics and Informatics, University of Veracruz). Research interests: software development, machine learning, and computational intelligence.
Juan Carlos PÉREZ-ARRIAGA
Mexico
Holds a master’s degree in computer science and is an associate professor at Facultad de Estadística e Informática, Universidad Veracruzana, Mexico (School of Statistics and Informatics, University of Veracruz). Research interests: software accessibility, software security, and software construction.
References
1. D. J. Olvera-Villeda, A. J. Sánchez-García, X. Limón, and S. Domínguez Isidro, “Class balancing approaches in dataset for software defect prediction: A systematic literature review” in 2023 11th International Conference on Software Engineering Research and Innovation (CONISOFT). IEEE, 2023, pp. 1–6.
2. M. Glinz, “A glossary of requirements engineering terminology”, Standard Glossary of the Certified Professional for Requirements Engineering (CPRE) Studies and Exam, Version 1, p. 56, 2011.
3. J. D. Musa, “Software reliability measurement”, Journal of Systems and Software, vol. 1, pp. 223–241, 1979.
4. ISO/IEC/IEEE, “ISO/IEC/IEEE International Standard – Systems and software engineering – Vocabulary”, pp. 1–541, 2017.
5. P. D. Singh and A. Chug, “Software defect prediction analysis using machine learning algorithms” in 2017 7th international conference on cloud computing, data science & engineering confluence. IEEE, 2017, pp. 775–781.
6. J. Sayyad Shirabad and T. Menzies, “The PROMISE Repository of Software Engineering Databases”, School of Information Technology and Engineering, University of Ottawa, Canada, 2005. [Online]. Available: http://promise.site.uottawa.ca/SERepository
7. T. McCabe, “A complexity measure”, IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308–320, December 1976.
8. M. Halstead, Elements of Software Science. Elsevier, 1977.
9. D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization”, IEEE transactions on evolutionary computation, vol. 1, no. 1, pp. 67–82, 1997.
10. Y. Zhang, X. Yan, and A. A. Khan, “A kernel density estimation-based variation sampling for class imbalance in defect prediction” in 2020 IEEE Intl Conf. on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). IEEE, 2020, pp. 1058–1065.
11. E. Elahi, S. Kanwal, and A. N. Asif, “A new ensemble approach for software fault prediction” in 2020 17th international Bhurban conference on applied sciences and technology (IBCAST). IEEE, 2020, pp. 407–412.
12. J. Zheng, X. Wang, D. Wei, B. Chen, and Y. Shao, “A novel imbalanced ensemble learning in software defect predication”, IEEE Access, vol. 9, pp. 86855–86868, 2021.
13. Q. Zha, X. Yan, and Y. Zhou, “Adaptive centre-weighted oversampling for class imbalance in software defect prediction”, in 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom). IEEE, 2018, pp. 223–230.
14. S. Huda, K. Liu, M. Abdelrazek, A. Ibrahim, S. Alyahya, H. Al-Dossari, and S. Ahmad, “An ensemble oversampling model for class imbalance problem in software defect prediction”, IEEE Access, vol. 6, pp. 24184–24195, 2018.
15. R. Malhotra, N. Nishant, S. Gurha, and V. Rathi, “Application of particle swarm optimization for software defect prediction using object oriented metrics” in 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE, 2021, pp. 88–93.
16. Z. Li, X. Zhang, J. Guo, and Y. Shang, “Class imbalance data generation for software defect prediction”, in 2019 26th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 2019, pp. 276–283.
17. S. Ghosh, A. Rana, and V. Kansal, “Combining integrated sampling with nonlinear manifold detection techniques for software defect prediction” in 2018 3rd International Conference on Contemporary Computing and Informatics (IC3I). IEEE, 2018, pp. 147–154.
18. S. A. Putri et al., “Combining integreted sampling technique with feature selection for software defect prediction” in 2017 5th International Conference on Cyber and IT Service Management (CITSM). IEEE, 2017, pp. 1–6.
19. T. Thaher and N. Arman, “Efficient Multi-Swarm Binary Harris Hawks Optimization as a Feature Selection Approach for Software Fault Prediction” in 2020 11th International conference on information and communication systems (ICICS). IEEE, 2020, pp. 249–254.
20. K. Bashir, T. Li, C. W. Yohannese, and Y. Mahama, “Enhancing software defect prediction using supervised-learning based framework” in 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE). IEEE, 2017, pp. 1–6.
21. S. S. Rathore, S. S. Chouhan, D. K. Jain, and A. G. Vachhani, “Generative oversampling methods for handling imbalanced data in software fault prediction”, IEEE Transactions on Reliability, vol. 71, no. 2, pp. 747–762, 2022.
22. Z. Eivazpour and M. R. Keyvanpour, “Improving performance in software defect prediction using variational autoencoder” in 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI). IEEE, 2019, pp. 644–649.
23. A. Bispo, R. Prudêncio, and D. Véras, “Instance selection and class balancing techniques for cross project defect prediction” in 2018 7th Brazilian Conference on Intelligent Systems (BRACIS). IEEE, 2018, pp. 552–557.
24. K. E. Bennin, J. Keung, P. Phannachitta, A. Monden, and S. Mensah, “Mahakil: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction”, IEEE Transactions on Software Engineering, vol. 44, no. 6, pp. 534–550, 2017.
25. R. Malhotra, R. Kapoor, P. Saxena, and P. Sharma, “Saga: A hybrid technique to handle imbalance data in software defect prediction” in 2021 IEEE 11th IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE). IEEE, 2021, pp. 331–336.
26. D. Wang and X. Xiong, “Software defect prediction based on combined sampling and feature selection” in ICMLCA 2021; 2nd International Conference on Machine Learning and Computer Application. VDE, 2021, pp. 1–5.
27. Y. Liu, F. Sun, J. Yang, and D. Zhou, “Software defect prediction model based on improved bp neural network” in 2019 6th International Conference on Dependable Systems and Their Applications (DSA). IEEE, 2020, pp. 521–522.
28. R. B. Bahaweres, F. Agustian, I. Hermadi, A. I. Suroso, and Y. Arkeman, “Software defect prediction using neural network based smote” in 2020 7th International Conference on Electrical Engineering, Computer Sciences and Informatics (EECSI). IEEE, 2020, pp. 71–76.
29. S. Choirunnisa, B. Meidyani, and S. Rochimah, “Software defect prediction using oversampling algorithm: A-suwo” in 2018 Electrical Power, Electronics, Communications, Controls and Informatics Seminar (EECCIS). IEEE, 2018, pp. 337–341.
30. W. A. Dipa and W. D. Sunindyo, “Software defect prediction using smote and artificial neural network” in 2021 International Conference on Data and Software Engineering (ICoDSE). IEEE, 2021, pp. 1–4.
31. R. Malhotra, V. Agrawal, V. Pal, and T. Agarwal, “Support vector based oversampling technique for handling class imbalance in software defect prediction” in 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE, 2021, pp. 1078–1083.
32. L. Gong, S. Jiang, and L. Jiang, “Tackling class imbalance problem in software defect prediction through cluster-based over-sampling with filtering”, IEEE Access, vol. 7, pp. 145725–145737, 2019.
33. R. Malhotra and S. Kamal, “Tool to handle imbalancing problem in software defect prediction using oversampling methods” in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2017, pp. 906–912.
34. S. K. Pandey and A. K. Tripathi, “Class imbalance issue in software defect prediction models by various machine learning techniques: an empirical study” in 2021 8th International Conference on Smart Computing and Communications (ICSCC). IEEE, 2021, pp. 58–63.
35. W. Zhang, Y. Li, M. Wen, and R. He, “Comparative study of ensemble learning methods in just-in-time software defect prediction” in 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion (QRSC), 2023, pp. 83–92.
36. X. Yang, S. Wang, Y. Li, and S. Wang, “Does data sampling improve deep learning-based vulnerability detection? yeas! and nays!” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, pp. 2287–2298.
37. R. Kumar and A. Chaturvedi, “Software bug prediction using reward-based weighted majority voting ensemble technique”, IEEE Transactions on Reliability, vol. 73, no. 1, pp. 726–740, 2024.
38. M. Devi, T. Rajkumar, and D. Balakrishnan, “Prediction of software defects by employing optimized deep learning and oversampling approaches” in 2024 2nd International Conference on Computer, Communication and Control (IC4), 2024, pp. 1–5.
39. W. Wei, F. Jiang, X. Yu, and J. Du, “An under-sampling algorithm based on weighted complexity and its application in software defect prediction” in Proceedings of the 2022 5th International Conference on Software Engineering and Information Management, 2022, pp. 38–44.
40. G. Abaei, W. Z. Tah, J. Z. W. Toh, and E. S. J. Hor, “Improving software fault prediction in imbalanced datasets using the under-sampling approach” in Proceedings of the 2022 11th International Conference on Software and Computer Applications, 2022, pp. 41–47.
41. Z.-W. Zhang, X.-Y. Jing, and T.-J. Wang, “Label propagation based semi-supervised learning for software defect prediction”, Automated Software Engineering, vol. 24, pp. 47–69, 2017.
42. X. Du, H. Yue, and H. Dong, “Software defect prediction method based on hybrid sampling” in International Conference on Frontiers of Electronics, Information and Computation Technologies, ser. ICFEICT 2021. New York, NY, USA: Association for Computing Machinery, 2022. [Online]. Available: https://doi.org/10.1145/3474198.3478215.
43. D. Ryu, J.-I. Jang, and J. Baik, “A transfer cost-sensitive boosting approach for cross-project defect prediction”, Software Quality Journal, vol. 25, pp. 235–272, 2017.
44. L. Zhou, R. Li, S. Zhang, and H. Wang, “Imbalanced data processing model for software defect prediction”, Wireless Personal Communications, vol. 102, pp. 937–950, 2018.
45. H. He, X. Zhang, Q. Wang, J. Ren, J. Liu, X. Zhao, and Y. Cheng, “Ensemble multiboost based on ripper classifier for prediction of imbalanced software defect data”, IEEE Access, vol. 7, pp. 110333–110343, 2019.
46. C. Zeng, C. Y. Zhou, S. K. Lv, P. He, and J. Huang, “Gcn2defect: Graph convolutional networks for smotetomek based software defect prediction” in 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2021, pp. 69–79.
47. A. Joon, R. K. Tyagi, and K. Kumar, “Noise filtering and imbalance class distribution removal for optimizing software fault prediction using best software metrics suite” in 2020 5th International Conference on Communication and Electronics Systems (ICCES). IEEE, 2020, pp. 1381–1389.
48. L. Chen, B. Fang, Z. Shang, and Y. Tang, “Tackling class overlap and imbalance problems in software defect prediction”, Software Quality Journal, vol. 26, pp. 97–125, 2018.
49. S. Riaz, A. Arshad, and L. Jiao, “Rough noise-filtered easy ensemble for software fault prediction”, IEEE Access, vol. 6, pp. 46886–46899, 2018.
50. X. Wan, Z. Zheng, and Y. Liu, “Spe2: Self-paced ensemble of ensembles for software defect prediction”, IEEE Transactions on Reliability, vol. 71, no. 2, pp. 865–879, 2022.
51. G. Menardi and N. Torelli, “Training and assessing classification rules with imbalanced data”, Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 92–122, 2012. [Online]. Available: http://dx.doi.org/10.1007/s10618-012-0295-5.
52. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: Synthetic minority over-sampling technique”, Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002. [Online]. Available: http://dx.doi.org/10.1613/jair.953.
53. H. He, Y. Bai, E. A. Garcia, and S. Li, “Adasyn: Adaptive synthetic sampling approach for imbalanced learning” in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), June 2008. [Online]. Available: http://dx.doi.org/10.1109/IJCNN.2008.4633969.
54. G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data”, ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20–29, 2004. [Online]. Available: http://dx.doi.org/10.1145/1007730.1007735.
55. I. Mani and I. Zhang, “kNN approach to unbalanced data distributions: a case study involving information extraction” in Proceedings of Workshop on Learning from Imbalanced Datasets, vol. 126, no. 1. ICML, 2003, pp. 1–7.
56. D. L. Wilson, “Asymptotic properties of nearest neighbor rules using edited data”, IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-2, no. 3, pp. 408–421, 1972.
57. I. Tomek, “An experiment with the edited nearest- neighbor rule”, IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, no. 6, pp. 448–452, 1976. [Online]. Available: http://dx.doi.org/10.1109/TSMC.1976.4309523.
58. B. R. Manju and A. R. Nair, “Classification of Cardiac Arrhythmia of 12 Lead ECG Using Combination of SMOTEENN, XGBoost and Machine Learning Algorithms” in 2019 9th International Symposium on Embedded Computing and System Design (ISED), 2019, pp. 1–7.
59. G. E. A. P. A. Batista, A. L. C. Bazzan, and M. C. Monard, “Balancing training data for automated annotation of keywords: a case study” in WOB, 2003. [Online]. Available: https://api.semanticscholar.org/CorpusID:1579194
60. I. Tomek, “Two modifications of cnn”, IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, no. 11, pp. 769–772, 1976.
61. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Routledge, 2017. [Online]. Available: http://dx.doi.org/10.1201/9781315139470.
62. D. A. Cieslak and N. V. Chawla, “Learning decision trees for unbalanced data” in Machine Learning and Knowledge Discovery in Databases, W. Daelemans, B. Goethals, and K. Morik, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 241–256.
63. E. Fix and J. L. Hodges, “Discriminatory analysis. nonparametric discrimination: Consistency properties”, International Statistical Review / Revue Internationale de Statistique, vol. 57, no. 3, p. 238, 1989. [Online]. Available: http://dx.doi.org/10.2307/1403797.
64. L. Breiman, “Random forests”, Machine Learning, vol. 45, no. 1, pp. 5–32, 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1010933404324
65. Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm” in Proceedings of the Thirteenth International Conference on International Conference on Machine Learning, ser. ICML’96. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1996, pp. 148–156.
66. J. H. Friedman, “Stochastic gradient boosting”, Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002 (Nonlinear Methods and Data Mining). [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167947301000652.
67. D. J. Hand and K. Yu, “Idiot’s Bayes – not so stupid after all?”, International Statistical Review, vol. 69, no. 3, pp. 385–398, 2001. [Online]. Available: http://dx.doi.org/10.1111/j.1751-5823.2001.tb00465.x.
68. G. E. Hinton, Connectionist learning procedures, ser. Machine Learning. Elsevier, 1990, pp. 555–610. [Online]. Available: http://dx.doi.org/10.1016/B978-0-08-051055-2.50029-8.
69. T. Dybå, V. B. Kampenes, and D. I. K. Sjøberg, “A systematic review of statistical power in software engineering experiments”, Information and Software Technology, vol. 48, no. 8, pp. 745–755, 2006.
70. J. Sánchez-García, “Statistical tests among groups”, [Data set], 2024. [Online]. Available: https://doi.org/10.5281/zenodo.13239734.
71. D. S. Moore and G. P. McCabe, Introduction to the practice of statistics. WH Freeman/Times Books/Henry Holt & Co, 1989.
72. J. Sánchez-García, “Statistical tests results”, [Data set], 2024. [Online]. Available: https://doi.org/10.5281/zenodo.13240040.
73. R. Malhotra and M. Khanna, “Threats to validity in search based predictive modelling for software engineering”, IET Software, vol. 12, no. 4, pp. 293–305, 2018.
74. I. Bronshteyn, “Study of defects in a program code in python”, Programming and Computer Software, vol. 39, pp. 279–284, 2013.
75. A. Belevantsev, “Multilevel static analysis for improving program quality”, Programming and Computer Software, vol. 43, pp. 321–336, 2017.
For citations:
SÁNCHEZ-GARCÍA Á., LIMÓN R., DOMÍNGUEZ-ISIDRO S., OLVERA-VILLEDA D., PÉREZ-ARRIAGA J. Class Balancing Approaches to Improve Software Defect Prediction Estimations. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(6):19-38. https://doi.org/10.15514/ISPRAS-2024-36(6)-2