
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Evaluation of Similarity of Javadoc Comments

https://doi.org/10.15514/ISPRAS-2023-35(4)-10

Abstract

Code comments are an essential part of software documentation. Many software projects suffer from low-quality comments, which are often produced by copy-paste. For similar methods, classes, and other entities, copy-pasted comments with minor modifications are justified. In many other cases, however, this practice degrades documentation quality and, in turn, complicates maintenance and development of the project. In this study, we address the detection of near-duplicate code comments, which can potentially improve software documentation. We conducted a thorough evaluation of traditional string similarity metrics and modern machine learning methods. In our experiment, we used a collection of Javadoc comments from four industrial open-source Java projects. We found that LCS (Longest Common Subsequence) is the best similarity algorithm when both quality (Precision 94%, Recall 74%) and performance are taken into account.
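To make the idea of an LCS-based similarity score concrete, the following is a minimal sketch of how two comments could be compared. It is not the authors' implementation: the word-level tokenization, the normalization by the longer comment, the helper names (`tokenize`, `lcs_length`, `lcs_similarity`, `is_near_duplicate`), and the threshold of 0.8 are all assumptions made for illustration.

```python
import re


def tokenize(comment: str) -> list[str]:
    """Lowercase a comment and split it into word tokens (assumed preprocessing)."""
    return re.findall(r"[a-z0-9@]+", comment.lower())


def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two sequences.

    Classic dynamic programming that keeps only the previous row,
    so memory use is O(len(b)) and time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(c1: str, c2: str) -> float:
    """LCS length over word tokens, normalized by the longer comment (an assumption)."""
    t1, t2 = tokenize(c1), tokenize(c2)
    if not t1 and not t2:
        return 1.0  # two empty comments are treated as identical
    return lcs_length(t1, t2) / max(len(t1), len(t2))


def is_near_duplicate(c1: str, c2: str, threshold: float = 0.8) -> bool:
    """Flag a pair as near-duplicate; the threshold value is hypothetical."""
    return lcs_similarity(c1, c2) >= threshold


if __name__ == "__main__":
    a = "Returns the name of the user."
    b = "Returns the id of the user."
    print(lcs_similarity(a, b), is_near_duplicate(a, b))
```

In this sketch the two sample comments differ in a single word, so five of their six tokens form a common subsequence. A real evaluation would tune the threshold against labeled pairs and measure precision and recall, as the study does.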

About the Authors

Dmitry Vladimirovich KOZNOV
Saint Petersburg State University
Russian Federation

Doctor of Technical Sciences, Professor of the System Programming Department. Research interests: software engineering, model-driven software development, program data, machine learning.



Ekaterina Iurevna LEDENEVA
Yandex LLC
Russian Federation

Software engineer at Yandex LLC, Saint Petersburg State University alumna. Research interests: software data analysis, technical documentation analysis.



Dmitry Vadimovich LUCIV
Saint Petersburg State University
Russian Federation

PhD in computer science, associate professor of the System Programming Department at Saint Petersburg State University, Russia. Research interests: software engineering, software data analysis, documentation analysis, systems programming.



Pavel Isaakovich BRASLAVSKI
HSE University
Russian Federation

PhD in computer science, senior researcher at the Laboratory for Models and Methods of Computational Pragmatics, HSE University. Research interests: resources and methods for evaluation of NLP and IR models, computational humor.





For citations:


KOZNOV D.V., LEDENEVA E.I., LUCIV D.V., BRASLAVSKI P.I. Evaluation of Similarity of Javadoc Comments. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2023;35(4):177-186. (In Russ.) https://doi.org/10.15514/ISPRAS-2023-35(4)-10



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)