Class Balancing Approaches to Improve Software Defect Prediction Estimations
https://doi.org/10.15514/ISPRAS-2024-36(6)-2
Abstract
A persistent challenge in software development is the elimination of defects, and effective defect management and removal are vital to software reliability, which in turn is a key quality attribute of any software system. Software defect prediction supported by machine learning (ML) techniques is a promising approach to this problem. However, one of the common obstacles in ML-based software defect prediction is data imbalance. In this paper we present an empirical study that assesses the impact of different class balancing techniques on the class imbalance problem in software defect prediction. We conducted a series of experiments covering nine class balancing techniques combined with seven classifiers, using NASA software project datasets from the PROMISE repository. We evaluated the effectiveness of the class balancing techniques with several metrics, including AUC, accuracy, precision, recall, and the F1 measure, and applied hypothesis testing to determine whether the metric results differ significantly between class-balanced and imbalanced datasets. Based on our findings, we conclude that class balancing yields a significant improvement in the overall performance of software defect prediction. We therefore strongly advocate including class balancing as a preprocessing step in this field.
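The experimental pipeline outlined above (rebalance the training data, fit a classifier, score it with AUC and threshold-based metrics, then test for significant differences) can be illustrated with a minimal sketch. The sketch below is not the authors' code: it assumes scikit-learn, imbalanced-learn and SciPy, substitutes a synthetic dataset for the PROMISE/NASA data, and uses SMOTE and a random forest as stand-ins for one of the nine balancing techniques and one of the seven classifiers.

```python
# Illustrative sketch only: synthetic data stands in for the PROMISE/NASA datasets,
# SMOTE for one of the balancing techniques, a random forest for one of the classifiers.
import numpy as np
from imblearn.over_sampling import SMOTE
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset: roughly 10% "defective" modules.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

def run(balance: bool, seed: int) -> float:
    """Train on one split, optionally rebalanced, and return the test AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=seed)
    if balance:
        # Oversample the minority (defective) class in the training data only.
        X_tr, y_tr = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# Paired runs over repeated splits, then a Wilcoxon signed-rank test to check
# whether balancing changes the metric significantly (the hypothesis-testing step).
auc_balanced = [run(True, s) for s in range(10)]
auc_imbalanced = [run(False, s) for s in range(10)]
stat, p_value = wilcoxon(auc_balanced, auc_imbalanced)
print(f"mean AUC balanced={np.mean(auc_balanced):.3f}, "
      f"imbalanced={np.mean(auc_imbalanced):.3f}, p={p_value:.4f}")
```

Repeating the paired runs for each balancing technique and classifier, and for accuracy, precision, recall and F1 in addition to AUC, reproduces the shape of the comparison described in the abstract.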
About the Authors
Ángel Juan SÁNCHEZ-GARCÍA
Mexico
Holds a PhD in Artificial Intelligence; Associate Professor at the Faculty of Statistics and Informatics, Universidad Veracruzana (Mexico). Research interests: software measurement, machine learning, cost estimation, evolutionary computation.
Riaño Héctor Xavier LIMÓN
Mexico
Holds a PhD in Artificial Intelligence; Associate Professor at the Faculty of Statistics and Informatics, Universidad Veracruzana (Mexico). Research interests: data mining, multi-agent and distributed systems, software architecture.
Saúl DOMÍNGUEZ-ISIDRO
Mexico
Holds a PhD in Artificial Intelligence; Associate Professor at the Faculty of Statistics and Informatics, Universidad Veracruzana (Mexico). Research interests: distributed systems, software engineering, computational intelligence, machine learning.
Dan Javier OLVERA-VILLEDA
Mexico
Holds a bachelor's degree in Software Engineering; member of staff at the Faculty of Statistics and Informatics, Universidad Veracruzana (Mexico). Research interests: software development, computational intelligence, machine learning.
Juan Carlos PÉREZ-ARRIAGA
Mexico
Holds a master's degree in programming; Associate Professor at the Faculty of Statistics and Informatics, Universidad Veracruzana (Mexico). Research interests: software accessibility, software security, software construction.
References
1. D. J. Olvera-Villeda, A. J. Sánchez-García, X. Limón, and S. Domínguez Isidro, “Class balancing approaches in dataset for software defect prediction: A systematic literature review” in 2023 11th International Conference in Software Engineering Research and Innovation (CONISOFT). IEEE, 2023, pp. 1–6.
2. M. Glinz, “A glossary of requirements engineering terminology”, Standard Glossary of the Certified Professional for Requirements Engineering (CPRE) Studies and Exam, Version, vol. 1, p. 56, 2011.
3. J. D. Musa, “Software reliability measurement”, Journal of Systems and Software, vol. 1, pp. 223–241, 1979.
4. ISO/IEC/IEEE, “IEEE International Standard – Systems and software engineering – Vocabulary”, pp. 1–541, 2017.
5. P. D. Singh and A. Chug, “Software defect prediction analysis using machine learning algorithms” in 2017 7th international conference on cloud computing, data science & engineering confluence. IEEE, 2017, pp. 775–781.
6. J. Sayyad Shirabad and T. Menzies, “The PROMISE Repository of Software Engineering Databases”, School of Information Technology and Engineering, University of Ottawa, Canada, 2005. [Online]. Available: http://promise.site.uottawa.ca/SERepository
7. T. McCabe, “A complexity measure”, IEEE Transactions on Software Engineering, vol. 2, no. 4, pp. 308–320, December 1976.
8. M. Halstead, Elements of Software Science. Elsevier, 1977.
9. D. H. Wolpert and W. G. Macready, “No free lunch theorems for optimization”, IEEE transactions on evolutionary computation, vol. 1, no. 1, pp. 67–82, 1997.
10. Y. Zhang, X. Yan, and A. A. Khan, “A kernel density estimation-based variation sampling for class imbalance in defect prediction” in 2020 IEEE Intl Conf. on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). IEEE, 2020, pp. 1058–1065.
11. E. Elahi, S. Kanwal, and A. N. Asif, “A new ensemble approach for software fault prediction” in 2020 17th international Bhurban conference on applied sciences and technology (IBCAST). IEEE, 2020, pp. 407–412.
12. J. Zheng, X. Wang, D. Wei, B. Chen, and Y. Shao, “A novel imbalanced ensemble learning in software defect predication”, IEEE Access, vol. 9, pp. 86855–86868, 2021.
13. Q. Zha, X. Yan, and Y. Zhou, “Adaptive centre-weighted oversampling for class imbalance in software defect prediction”, in 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom). IEEE, 2018, pp. 223–230.
14. S. Huda, K. Liu, M. Abdelrazek, A. Ibrahim, S. Alyahya, H. Al-Dossari, and S. Ahmad, “An ensemble oversampling model for class imbalance problem in software defect prediction”, IEEE Access, vol. 6, pp. 24184–24195, 2018.
15. R. Malhotra, N. Nishant, S. Gurha, and V. Rathi, “Application of particle swarm optimization for software defect prediction using object oriented metrics” in 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE, 2021, pp. 88–93.
16. Z. Li, X. Zhang, J. Guo, and Y. Shang, “Class imbalance data generation for software defect prediction”, in 2019 26th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 2019, pp. 276–283.
17. S. Ghosh, A. Rana, and V. Kansal, “Combining integrated sampling with nonlinear manifold detection techniques for software defect prediction” in 2018 3rd International Conference on Contemporary Computing and Informatics (IC3I). IEEE, 2018, pp. 147–154.
18. S. A. Putri et al., “Combining integreted sampling technique with feature selection for software defect prediction” in 2017 5th International Conference on Cyber and IT Service Management (CITSM). IEEE, 2017, pp. 1–6.
19. T. Thaher and N. Arman, “Efficient Multi-Swarm Binary Harris Hawks Optimization as a Feature Selection Approach for Software Fault Prediction” in 2020 11th International conference on information and communication systems (ICICS). IEEE, 2020, pp. 249–254.
20. K. Bashir, T. Li, C. W. Yohannese, and Y. Mahama, “Enhancing software defect prediction using supervised-learning based framework” in 2017 12th International Conference on Intelligent Systems and Knowledge Engineering (ISKE). IEEE, 2017, pp. 1–6.
21. S. S. Rathore, S. S. Chouhan, D. K. Jain, and A. G. Vachhani, “Generative oversampling methods for handling imbalanced data in software fault prediction”, IEEE Transactions on Reliability, vol. 71, no. 2, pp. 747–762, 2022.
22. Z. Eivazpour and M. R. Keyvanpour, “Improving performance in software defect prediction using variational autoencoder” in 2019 5th Conference on Knowledge Based Engineering and Innovation (KBEI). IEEE, 2019, pp. 644–649.
23. A. Bispo, R. Prudêncio, and D. Véras, “Instance selection and class balancing techniques for cross project defect prediction” in 2018 7th Brazilian Conference on Intelligent Systems (BRACIS). IEEE, 2018, pp. 552–557.
24. K. E. Bennin, J. Keung, P. Phannachitta, A. Monden, and S. Mensah, “Mahakil: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction” IEEE Transactions on Software Engineering, vol. 44, no. 6, pp. 534–550, 2017.
25. R. Malhotra, R. Kapoor, P. Saxena, and P. Sharma, “Saga: A hybrid technique to handle imbalance data in software defect prediction” in 2021 IEEE 11th IEEE Symposium on Computer Applications & Industrial Electronics (ISCAIE). IEEE, 2021, pp. 331–336.
26. D. Wang and X. Xiong, “Software defect prediction based on combined sampling and feature selection” in ICMLCA 2021; 2nd International Conference on Machine Learning and Computer Application. VDE, 2021, pp. 1–5.
27. Y. Liu, F. Sun, J. Yang, and D. Zhou, “Software defect prediction model based on improved bp neural network” in 2019 6th International Conference on Dependable Systems and Their Applications (DSA). IEEE, 2020, pp. 521–522.
28. R. B. Bahaweres, F. Agustian, I. Hermadi, A. I. Suroso, and Y. Arkeman, “Software defect prediction using neural network based smote” in 2020 7th International Conference on Electrical Engineering, Computer Sciences and Informatics (EECSI). IEEE, 2020, pp. 71–76.
29. S. Choirunnisa, B. Meidyani, and S. Rochimah, “Software defect prediction using oversampling algorithm: A-suwo” in 2018 Electrical Power, Electronics, Communications, Controls and Informatics Seminar (EECCIS). IEEE, 2018, pp. 337–341.
30. W. A. Dipa and W. D. Sunindyo, “Software defect prediction using smote and artificial neural network” in 2021 International Conference on Data and Software Engineering (ICoDSE). IEEE, 2021, pp. 1–4.
31. R. Malhotra, V. Agrawal, V. Pal, and T. Agarwal, “Support vector based oversampling technique for handling class imbalance in software defect prediction” in 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence). IEEE, 2021, pp. 1078–1083.
32. L. Gong, S. Jiang, and L. Jiang, “Tackling class imbalance problem in software defect prediction through cluster-based over-sampling with filtering” IEEE Access, vol. 7, pp. 145725–145737, 2019.
33. R. Malhotra and S. Kamal, “Tool to handle imbalancing problem in software defect prediction using oversampling methods” in 2017 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE, 2017, pp. 906–912.
34. S. K. Pandey and A. K. Tripathi, “Class imbalance issue in software defect prediction models by various machine learning techniques: an empirical study” in 2021 8th International Conference on Smart Computing and Communications (ICSCC). IEEE, 2021, pp. 58–63.
35. W. Zhang, Y. Li, M. Wen, and R. He, “Comparative study of ensemble learning methods in just-in-time software defect prediction” in 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security Companion (QRSC), 2023, pp. 83–92.
36. X. Yang, S. Wang, Y. Li, and S. Wang, “Does data sampling improve deep learning-based vulnerability detection? yeas! and nays!” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, pp. 2287–2298.
37. R. Kumar and A. Chaturvedi, “Software bug prediction using reward-based weighted majority voting ensemble technique”, IEEE Transactions on Reliability, vol. 73, no. 1, pp. 726–740, 2024.
38. M. Devi, T. Rajkumar, and D. Balakrishnan, “Prediction of software defects by employing optimized deep learning and oversampling approaches” in 2024 2nd International Conference on Computer, Communication and Control (IC4), 2024, pp. 1–5.
39. W. Wei, F. Jiang, X. Yu, and J. Du, “An under-sampling algorithm based on weighted complexity and its application in software defect prediction” in Proceedings of the 2022 5th International Conference on Software Engineering and Information Management, 2022, pp. 38–44.
40. G. Abaei, W. Z. Tah, J. Z. W. Toh, and E. S. J. Hor, “Improving software fault prediction in imbalanced datasets using the under-sampling approach” in Proceedings of the 2022 11th International Conference on Software and Computer Applications, 2022, pp. 41–47.
41. Z.-W. Zhang, X.-Y. Jing, and T.-J. Wang, “Label propagation based semi-supervised learning for software defect prediction”, Automated Software Engineering, vol. 24, pp. 47–69, 2017.
42. X. Du, H. Yue, and H. Dong, “Software defect prediction method based on hybrid sampling” in International Conference on Frontiers of Electronics, Information and Computation Technologies, ser. ICFEICT 2021. New York, NY, USA: Association for Computing Machinery, 2022. [Online]. Available: https://doi.org/10.1145/3474198.3478215.
43. D. Ryu, J.-I. Jang, and J. Baik, “A transfer cost-sensitive boosting approach for cross-project defect prediction”, Software Quality Journal, vol. 25, pp. 235–272, 2017.
44. L. Zhou, R. Li, S. Zhang, and H. Wang, “Imbalanced data processing model for software defect prediction”, Wireless Personal Communications, vol. 102, pp. 937–950, 2018.
45. H. He, X. Zhang, Q. Wang, J. Ren, J. Liu, X. Zhao, and Y. Cheng, “Ensemble multiboost based on ripper classifier for prediction of imbalanced software defect data”, IEEE Access, vol. 7, pp. 110333–110343, 2019.
46. C. Zeng, C. Y. Zhou, S. K. Lv, P. He, and J. Huang, “Gcn2defect: Graph convolutional networks for smotetomek based software defect prediction” in 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2021, pp. 69–79.
47. A. Joon, R. K. Tyagi, and K. Kumar, “Noise filtering and imbalance class distribution removal for optimizing software fault prediction using best software metrics suite” in 2020 5th International Conference on Communication and Electronics Systems (ICCES). IEEE, 2020, pp. 1381–1389.
48. L. Chen, B. Fang, Z. Shang, and Y. Tang, “Tackling class overlap and imbalance problems in software defect prediction”, Software Quality Journal, vol. 26, pp. 97–125, 2018.
49. S. Riaz, A. Arshad, and L. Jiao, “Rough noise-filtered easy ensemble for software fault prediction”, IEEE Access, vol. 6, pp. 46886–46899, 2018.
50. X. Wan, Z. Zheng, and Y. Liu, “Spe2: Self-paced ensemble of ensembles for software defect prediction”, IEEE Transactions on Reliability, vol. 71, no. 2, pp. 865–879, 2022.
51. G. Menardi and N. Torelli, “Training and assessing classification rules with imbalanced data”, Data Mining and Knowledge Discovery, vol. 28, no. 1, pp. 92–122, 2012. [Online]. Available: http://dx.doi.org/10.1007/s10618-012-0295-5.
52. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: Synthetic minority over-sampling technique”, Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002. [Online]. Available: http://dx.doi.org/10.1613/jair.953.
53. H. He, Y. Bai, E. A. Garcia, and S. Li, “Adasyn: Adaptive synthetic sampling approach for imbalanced learning” in 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), June 2008. [Online]. Available: http://dx.doi.org/10.1109/IJCNN.2008.4633969.
54. G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data”, ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 20–29, 2004. [Online]. Available: http://dx.doi.org/10.1145/1007730.1007735.
55. I. Mani and I. Zhang, “knn approach to unbalanced data distributions: a case study involving information extraction” in Proceedings of workshop on learning from imbalanced datasets, vol. 126, no. 1. ICML, 2003, pp. 1–7.
56. D. L. Wilson, “Asymptotic properties of nearest neighbor rules using edited data”, IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-2, no. 3, pp. 408–421, 1972.
57. I. Tomek, “An experiment with the edited nearest- neighbor rule”, IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, no. 6, pp. 448–452, 1976. [Online]. Available: http://dx.doi.org/10.1109/TSMC.1976.4309523.
58. B. R. Manju and A. R. Nair, “Classification of Cardiac Arrhythmia of 12 Lead ECG Using Combination of SMOTEENN, XGBoost and Machine Learning Algorithms” in 2019 9th International Symposium on Embedded Computing and System Design (ISED), 2019, pp. 1–7.
59. G. E. A. P. A. Batista, A. L. C. Bazzan, and M. C. Monard, “Balancing training data for automated annotation of keywords: a case study” in WOB, 2003. [Online]. Available: https://api.semanticscholar.org/CorpusID:1579194
60. I. Tomek, “Two modifications of cnn”, IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-6, no. 11, pp. 769–772, 1976.
61. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Routledge, 2017. [Online]. Available: http://dx.doi.org/10.1201/9781315139470.
62. D. A. Cieslak and N. V. Chawla, “Learning decision trees for unbalanced data” in Machine Learning and Knowledge Discovery in Databases, W. Daelemans, B. Goethals, and K. Morik, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 241–256.
63. E. Fix and J. L. Hodges, “Discriminatory analysis. nonparametric discrimination: Consistency properties”, International Statistical Review / Revue Internationale de Statistique, vol. 57, no. 3, p. 238, 1989. [Online]. Available: http://dx.doi.org/10.2307/1403797.
64. L. Breiman, “Random forests”, Machine Learning, vol. 45, no. 1, pp. 5–32, 2001. [Online]. Available: http://dx.doi.org/10.1023/A:1010933404324
65. Y. Freund and R. E. Schapire, “Experiments with a new boosting algorithm” in Proceedings of the Thirteenth International Conference on International Conference on Machine Learning, ser. ICML’96. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1996, pp. 148–156.
66. J. H. Friedman, “Stochastic gradient boosting”, Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002, Nonlinear Methods and Data Mining. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167947301000652.
67. D. J. Hand and K. Yu, “Idiot’s Bayes – not so stupid after all?”, International Statistical Review, vol. 69, no. 3, pp. 385–398, 2001. [Online]. Available: http://dx.doi.org/10.1111/j.1751-5823.2001.tb00465.x.
68. G. E. Hinton, Connectionist learning procedures, ser. Machine Learning. Elsevier, 1990, pp. 555–610. [Online]. Available: http://dx.doi.org/10.1016/B978-0-08-051055-2.50029-8.
69. T. Dybå, V. B. Kampenes, and D. I. Sjøberg, “A systematic review of statistical power in software engineering experiments”, Information and Software Technology, vol. 48, no. 8, pp. 745–755, 2006.
70. J. Sánchez-García, “Statistical tests among groups”, [Data set], 2024. [Online]. Available: https://doi.org/10.5281/zenodo.13239734.
71. D. S. Moore and G. P. McCabe, Introduction to the practice of statistics. WH Freeman/Times Books/Henry Holt & Co, 1989.
72. J. Sánchez-García, “Statistical tests results”, [Data set], 2024. [Online]. Available: https://doi.org/10.5281/zenodo.13240040.
73. R. Malhotra and M. Khanna, “Threats to validity in search based predictive modelling for software engineering”, IET Software, vol. 12, no. 4, pp. 293-305, 2018.
74. I. Bronshteyn, “Study of defects in a program code in python”, Programming and Computer Software, vol. 39, pp. 279–284, 2013.
75. A. Belevantsev, “Multilevel static analysis for improving program quality”, Programming and Computer Software, vol. 43, pp. 321–336, 2017.
For citation (in Russian):
САНЧЕС-ГАРСИЯ А., ЛИМОН Р., ДОМИНГЕС-ИСИДРО С., ОЛВЕРА-ВИЙЕДА Д., ПЕРЕС-АРРИАГА Х. Подходы к балансировке классов для улучшения оценок прогнозирования дефектов программного обеспечения. Труды Института системного программирования РАН. 2024;36(6):19-38. https://doi.org/10.15514/ISPRAS-2024-36(6)-2
For citation:
SÁNCHEZ-GARCÍA Á., LIMÓN R., DOMÍNGUEZ-ISIDRO S., OLVERA-VILLEDA D., PÉREZ-ARRIAGA J. Class Balancing Approaches to Improve Software Defect Prediction Estimations. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(6):19-38. https://doi.org/10.15514/ISPRAS-2024-36(6)-2