Named Entity Recognition for Code Review Comments
https://doi.org/10.15514/ISPRAS-2022-35(5)-13
Abstract
This paper addresses the task of named entity recognition in code review comments. We present a comparative analysis of existing approaches and propose our own methods for improving the quality of the solution. The proposed and implemented improvements include methods for handling class imbalance, improved tokenization of the input data, the use of large volumes of unlabeled data, and the application of auxiliary binary classifiers. To evaluate quality, a new dataset of 3000 user review comments was collected and manually annotated. We show that the proposed improvements yield a significant gain in quality metrics computed both at the token level (+22%) and at the level of whole entities (+13%).
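The abstract distinguishes metrics computed at the token level from metrics computed over whole entities. As an illustrative sketch (not the authors' code; the entity types `VAR` and `FUNC` are made up for the example), the difference can be shown by scoring the same BIO-tagged prediction both ways: token-level F1 credits every correctly labeled token, while entity-level F1 credits a span only when its boundaries and type match exactly.

```python
# Illustrative sketch: token-level vs entity-level F1 for BIO tag sequences.
# Entity types "VAR" and "FUNC" are hypothetical, chosen for the example.

def spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    out, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                out.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        # an "I-" tag simply continues the current span
    return out

def f1(pred, gold):
    """F1 over two sets of items (tokens or spans)."""
    tp = len(pred & gold)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-VAR", "I-VAR", "O", "B-FUNC"]
pred = ["B-VAR", "O",     "O", "B-FUNC"]

# Token level: each non-O token is scored individually.
tok_f1 = f1({(i, t) for i, t in enumerate(pred) if t != "O"},
            {(i, t) for i, t in enumerate(gold) if t != "O"})

# Entity level: a span counts only if boundaries and type match exactly.
ent_f1 = f1(set(spans(pred)), set(spans(gold)))

print(round(tok_f1, 2), round(ent_f1, 2))  # → 0.8 0.5
```

Here the prediction recovers 2 of 3 labeled tokens (token-level F1 = 0.8) but only 1 of 2 entities, since the truncated `VAR` span does not match its gold boundaries (entity-level F1 = 0.5), which is why entity-level scores are typically the stricter of the two.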
About the authors
Vladimir Vladimirovich KACHANOV
Russia
PhD student. Research interests: machine learning, software engineering.
Ariana Sergeevna KHITROVA
Russia
Bachelor's degree. Research interests: machine learning, natural language processing.
Sergey Igorevich MARKOV
Russia
Specialist degree, senior researcher. Research interests: static code analysis, dynamic code analysis, software engineering, machine learning.
For citation:
KACHANOV V.V., KHITROVA A.S., MARKOV S.I. Named Entity Recognition for Code Review Comments. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2023;35(5):193-214. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-35(5)-13