Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Named Entity Recognition for Code Review Comments

https://doi.org/10.15514/ISPRAS-2022-35(5)-13

Abstract

This paper addresses the problem of named entity recognition in source code review comments. It provides a comparative analysis of existing approaches and proposes new methods that improve recognition quality. The proposed and implemented improvements include methods for handling class imbalance in the data, improved tokenization of the input, the use of large collections of unlabeled data, and the use of additional binary classifiers. To assess quality, a new dataset of 3,000 user code review comments was collected and manually labeled. The proposed improvements are shown to significantly increase performance as measured by quality metrics computed both at the token level (+22%) and at the whole-entity level (+13%).
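As a rough illustration of the class-imbalance point (in code review comments most tokens are non-entities tagged 'O', and that class dominates a plain cross-entropy loss), the sketch below applies a focal-loss-style weighting, after Lin et al., to token-classification logits. This is a minimal, hypothetical sketch, not the authors' implementation; the function name, tensor shapes, gamma value, and ignore_index convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def focal_token_loss(logits: torch.Tensor, labels: torch.Tensor,
                     gamma: float = 2.0, ignore_index: int = -100) -> torch.Tensor:
    """Focal-loss-style objective for token-level NER (illustrative sketch).

    Down-weights easy, confidently classified tokens (typically the
    dominant non-entity 'O' class) so the gradient signal concentrates
    on rare entity tokens.

    logits: (batch, seq_len, num_labels) raw per-token scores.
    labels: (batch, seq_len) gold label ids; padding marked ignore_index.
    """
    num_labels = logits.size(-1)
    flat_logits = logits.reshape(-1, num_labels)
    flat_labels = labels.reshape(-1)

    # Per-token cross entropy; ignored (padding) positions contribute 0.
    ce = F.cross_entropy(flat_logits, flat_labels,
                         reduction="none", ignore_index=ignore_index)
    pt = torch.exp(-ce)                # model's probability of the true label
    focal = (1.0 - pt) ** gamma * ce   # (1 - p_t)^gamma modulating factor

    mask = flat_labels != ignore_index
    return focal[mask].mean()

# Illustrative usage with random tensors standing in for model output.
if __name__ == "__main__":
    batch, seq_len, num_labels = 2, 8, 5
    logits = torch.randn(batch, seq_len, num_labels)
    labels = torch.randint(0, num_labels, (batch, seq_len))
    labels[:, -2:] = -100              # pretend the last tokens are padding
    print(focal_token_loss(logits, labels))
```

Relative to plain cross entropy, the modulating factor shrinks the contribution of tokens the model already classifies confidently, which in this setting is typically the abundant 'O' class.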

About the Authors

Vladimir Vladimirovich KACHANOV
Institute for System Programming of the Russian Academy of Sciences, Moscow Institute of Physics and Technology
Russian Federation

Postgraduate student. Research interests: machine learning, software engineering.



Ariana Sergeevna KHITROVA
Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Holds a bachelor's degree. Research interests: machine learning, natural language processing.



Sergei Igorevich MARKOV
Institute for System Programming of the Russian Academy of Sciences
Russian Federation

Holds a Specialist degree; senior researcher. Research interests: static program analysis, dynamic program analysis, software engineering, machine learning.



For citations:


KACHANOV V.V., KHITROVA A.S., MARKOV S.I. Named Entity Recognition for Code Review Comments. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2023;35(5):193-214. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-35(5)-13



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)