
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Discovering Near Duplicate Text in Software Documentation

https://doi.org/10.15514/ISPRAS-2017-29(4)-21

Abstract

Development of software documentation often involves copy-pasting, which produces a lot of duplicate text. Such duplicates make documentation maintenance difficult and expensive, especially when the software and its documentation have a long life cycle. The situation is further complicated by the fact that duplicated information is frequently near duplicate, i.e., the same information may be presented many times with different levels of detail, in various contexts, etc. There are a number of approaches to dealing with duplicates in software documentation, but most of them use software clone detection techniques, which makes efficient near duplicate detection difficult: source code algorithms ignore document structure and produce a lot of false positives. In this paper, we present an algorithm that detects near duplicates in software documentation using a natural language processing technique called the N-gram model. The algorithm has a considerable limitation: it only detects single sentences as near duplicates. However, it is very simple and may easily be improved in the future. It is implemented using the Natural Language Toolkit (NLTK). Evaluation results are presented for five real-life documents from various industrial projects. Manual analysis shows that 39% of the automatically detected duplicates are false positives. The algorithm demonstrates reasonable performance: documents of 0.8-3 MB are processed in 5-22 min.
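The idea of sentence-level near duplicate detection with N-grams can be illustrated by a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure or the parameters, so the Jaccard coefficient over word N-gram sets, the values of `n` and `threshold`, and all function names below are assumptions made for illustration. The sketch also uses only the Python standard library in place of NLTK, with a deliberately naive regular-expression sentence splitter.

```python
import re
from itertools import combinations


def split_sentences(text):
    """Naive sentence splitter (hypothetical helper, stands in for an NLP tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def word_ngrams(sentence, n=3):
    """Return the set of lowercased word n-grams of a sentence."""
    words = re.findall(r"\w+", sentence.lower())
    if len(words) < n:
        # Sentences shorter than n words are kept as a single short gram.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_sentences(text, n=3, threshold=0.5):
    """Return (sentence, sentence, score) for every pair scoring >= threshold."""
    sentences = split_sentences(text)
    grams = [word_ngrams(s, n) for s in sentences]
    pairs = []
    # Pairwise comparison is quadratic in the number of sentences.
    for i, j in combinations(range(len(sentences)), 2):
        score = jaccard(grams[i], grams[j])
        if score >= threshold:
            pairs.append((sentences[i], sentences[j], score))
    return pairs


if __name__ == "__main__":
    sample = (
        "The function returns the number of bytes written to the buffer. "
        "It never blocks. "
        "The function returns the number of bytes read from the buffer."
    )
    for first, second, score in near_duplicate_sentences(sample, threshold=0.3):
        print(f"{score:.2f}: {first!r} ~ {second!r}")
```

Comparing every pair of sentences keeps the sketch short but scales quadratically, so a large document would need an index over N-grams to shortlist candidate pairs before scoring them.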

About the Authors

L. D. Kanteev
Saint Petersburg State University
Russian Federation


Yu. O. Kostyukov
Saint Petersburg State University
Russian Federation


D. V. Luciv
Saint Petersburg State University
Russian Federation


D. V. Koznov
Saint Petersburg State University
Russian Federation


M. N. Smirnov
Saint Petersburg State University
Russian Federation





For citations:


Kanteev L.D., Kostyukov Yu.O., Luciv D.V., Koznov D.V., Smirnov M.N. Discovering Near Duplicate Text in Software Documentation. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2017;29(4):303-314. https://doi.org/10.15514/ISPRAS-2017-29(4)-21



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)