Hierarchical Rubrication of Text Documents

Dmitry Igorevich SOROKIN; Anton Sergeevich NUZHNY; Elena Alexandrovna SAVELEVA

doi:10.15514/ISPRAS-2020-32(6)-10

Abstract

Topic modeling is an important and widely used method for analyzing large collections of documents. It lets us grasp the content of the documents by examining the extracted topics. However, it has drawbacks, such as the need to specify the number of topics in advance. Depending on that number, the topics can become too local or too global. In addition, it does not provide a relation between local and global topics. Here we present an algorithm and a computer program for the hierarchical rubrication of text documents. The program solves these problems by creating a hierarchy of automatically selected topics in which local topics are connected to global topics. The program processes PDF documents, splits them into text segments, builds vector representations of the segments using a word2vec model, and stores them in a database. The vector embeddings are organized into a hierarchy of automatically constructed categories. Each category is identified by automatically selected keywords. The result is visualized as an interactive map, and the hierarchy of topics is traversed by zooming the map. An analysis of the constructed hierarchy of categories allows us to estimate the minimum and maximum depth of the hierarchy, which correspond to the minimum and maximum number of distinct topics contained in the collection of documents. The program was tested on documents on the deep disposal of nuclear waste.
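The pipeline described in the abstract (segment vectors grouped into a hierarchy of categories, each labelled by keywords) can be illustrated with a minimal sketch. It is not the authors' program: the abstract does not say which clustering or keyword-selection method is used, so the centroid-linkage agglomerative clustering and the frequency-based keyword score below are assumptions, and the hand-written two-dimensional vectors merely stand in for word2vec-based segment embeddings. All function names are hypothetical.

```python
import math
from collections import Counter

# Toy text segments and hand-made 2-D vectors standing in for
# word2vec-based segment embeddings (assumption, for illustration only).
segments = [
    "clay barrier sorption",
    "clay barrier diffusion",
    "canister corrosion rate",
    "canister corrosion steel",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 1.0], [0.3, 0.9]]


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def leaves(tree):
    """Segment indices under a node; a leaf is a bare int."""
    if isinstance(tree, int):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def centroid(tree, vecs):
    """Mean vector of all segments under a node."""
    members = [vecs[i] for i in leaves(tree)]
    return [sum(col) / len(members) for col in zip(*members)]


def build_hierarchy(vecs):
    """Agglomerative clustering: repeatedly merge the two closest nodes.

    Returns a binary tree of nested tuples whose leaves are segment indices.
    """
    nodes = list(range(len(vecs)))
    while len(nodes) > 1:
        best = None
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                d = cosine_distance(centroid(nodes[a], vecs),
                                    centroid(nodes[b], vecs))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = (nodes[a], nodes[b])
        nodes = [n for k, n in enumerate(nodes) if k not in (a, b)] + [merged]
    return nodes[0]


def keywords(texts, members, top_n=2):
    """Words most frequent inside a category relative to the rest."""
    inside = Counter(w for i in members for w in texts[i].split())
    outside = Counter(w for i in range(len(texts)) if i not in members
                      for w in texts[i].split())
    ranked = sorted(inside, key=lambda w: (outside[w] - inside[w], w))
    return ranked[:top_n]


tree = build_hierarchy(vectors)
# Print the keywords of the two categories directly below the root.
for child in tree:
    print(leaves(child), keywords(segments, leaves(child)))
```

In this sketch the root of the tree plays the role of the most global topic and the nodes below it are progressively more local topics, which mirrors the zoomable map described above. The pairwise search is quadratic per merge and is only suitable for toy inputs.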

Keywords

rubrication, hierarchical clustering, natural language processing, machine learning

About the Authors

Dmitry Igorevich SOROKIN

Nuclear Safety Institute of the Russian Academy of Sciences
Russian Federation
Engineer

Anton Sergeevich NUZHNY

Nuclear Safety Institute of the Russian Academy of Sciences
Russian Federation
Ph.D. in Physical and Mathematical Sciences, senior researcher. Research interests: theory of machine learning

Elena Alexandrovna SAVELEVA

Nuclear Safety Institute of the Russian Academy of Sciences
Russian Federation
Ph.D. in Physical and Mathematical Sciences, head of the geostatistical laboratory

For citations:

SOROKIN D.I., NUZHNY A.S., SAVELEVA E.A. Hierarchical Rubrication of Text Documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(6):127-136. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(6)-10

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
