Hierarchical Rubrication of Text Documents
https://doi.org/10.15514/ISPRAS-2020-32(6)-10
Abstract
About the Authors
Dmitry Igorevich SOROKIN
Russian Federation
Engineer
Anton Sergeevich NUZHNY
Russian Federation
Ph.D. in Physical and Mathematical Sciences, senior researcher. Research interests: theory of machine learning
Elena Alexandrovna SAVELEVA
Russian Federation
Ph.D. in Physical and Mathematical Sciences, head of geostatistical laboratory
References
1. Mikolov T., Chen K., Corrado G., Dean J. Efficient Estimation of Word Representations in Vector Space. In Proc. of the International Conference on Learning Representations, Workshop Track, 2013, 12 p.
2. Bojanowski P., Grave E., Joulin A., Mikolov T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, vol. 5, 2017, pp. 135-146.
3. Pennington J., Socher R., Manning C. GloVe: Global Vectors for Word Representation. In Proc. of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532-1543.
4. Le Q., Mikolov T. Distributed representations of sentences and documents. In Proc. of the 31st International Conference on Machine Learning, 2014, pp. 1188-1196.
5. Нужный А.С., Сорокин Д.И. Создание программы интеллектуального анализа текстовой документации по вопросам захоронения РАО. Труды МФТИ, том 12, № 1(45), 2020 г., стр. 104-111 / Nuzhny A.S., Sorokin D.I. Development of a text-mining program for analysis of documentation on the radioactive waste disposal problem. Proceedings of MIPT, vol. 12, № 1(45), 2020, pp. 104-111 (in Russian).
6. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, 2019, pp. 4171-4186.
7. Sia S., Dalmia A., Mielke S.J. Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! In Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 1728–1736.
8. Mullner D. Modern hierarchical, agglomerative clustering algorithms. arXiv:1109.2378v1, 2011, 29 p.
9. Свительман В.С., Савельева Е.А., Бутов Р.А., Линге Ин.И., Дорофеев А.Н., Тихоновский В.Л. Информационно-аналитическая платформа программы исследований по обоснованию долговременной безопасности российского ПГЗРО. Радиоактивные отходы, № 2 (3), 2018 г., стр. 79-87 / Svitelman V.S., Dorofeev A.N., Saveleva E.A., Butov R.A., Linge I.I., Tikhonovsky V.L. Informational and Software Environment of the Russian Deep Geological Repository Research Program. Radioactive Waste, № 2 (3), 2018, pp. 79-87 (in Russian).
10. Jin X., Han J. K-Means Clustering. In Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining, Springer, 2011.
11. Bouguettaya A., Yu Q., Liu X., Zhou X., Song A. Efficient agglomerative hierarchical clustering. Expert Systems with Applications, vol. 42, issue 5, 2015, pp. 2785-2797.
12. Peng T., Liu L. A novel incremental conceptual hierarchical text clustering method using CFu-tree. Applied Soft Computing, vol. 27, 2015, pp. 268-278.
13. Nagarajan R., Nair S.A.H., Puviarasan N., Aruna P. Document clustering using agglomerative hierarchical clustering approach (AHDC) and proposed TSG keywords extraction method. IJRET: International Journal of Research in Engineering and Technology, vol. 05, issue 18, 2016, pp. 118-124.
14. Ester M., Kriegel H., Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. of the 2nd ACM International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226-231.
15. Астраханцев Н.А., Федоренко Д.Г., Турдаков Д.Ю. Методы автоматического извлечения терминов из коллекции текстов предметной области. Программирование, том 41, № 6, 2015 г., стр. 33-52 / Astrakhantsev N.A., Fedorenko D.G., Turdakov D.Yu. Methods for automatic term recognition in domain-specific text collections: A survey. Programming and Computer Software, vol. 41, № 6, 2015, pp. 336-349.
16. Peganova I., Rebrova A., Nedumov Y. Labelling Hierarchical Clusters of Scientific Articles. In Proc. of the 2019 Ivannikov Memorial Workshop (IVMEM), 2019, pp. 26-32.
17. Kohonen T. Self-Organizing Maps. Springer, 1997, 426 p.
18. van der Maaten L., Hinton G. Visualizing Data using t-SNE. Journal of Machine Learning Research, vol. 9, 2008, pp. 2579-2605.
19. Robertson S.E., Walker S., Beaulieu M. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive. In Proc. of the Seventh Text Retrieval Conference, 1998, pp. 253-264.
20. Рукавичникова А.А., Валетов Д.К., Бутов Р.А., Свительман В.С. Средства тематической кластеризации документов для систематизации библиографической информации по вопросам ПГЗРО. Сборник трудов XIX научной школы молодых ученых ИБРАЭ РАН, 2018 г., стр. 145-148 / Rukavichnikova A.A., Valetov D.K., Butov R.A., Svitelman V.S. Tools for thematic clustering of documents for systematization of bibliographic information on the issues of PGWDF. In Proc. of the XIX Scientific School of Young Scientists IBRAE RAS, 2018, pp. 145-148 (in Russian).
Review
For citations:
SOROKIN D.I., NUZHNY A.S., SAVELEVA E.A. Hierarchical Rubrication of Text Documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(6):127-136. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(6)-10