Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Automatic search for fragments containing biographical information in a natural language text

https://doi.org/10.15514/ISPRAS-2018-30(6)-12

Abstract

Searching and classifying text documents are key tasks of information retrieval with many practical applications. Text search and classification methods are used in search engines, electronic libraries and catalogs, systems for collecting and processing information, online education, and many other areas. These methods have a large number of specific applications, but each practical task is, as a rule, hard to formalize and narrow in subject focus, so it requires individual study and its own approach to the solution. This paper addresses the task of automatically finding and classifying text fragments that contain biographical information. The central difficulty in this task is the multi-class classification of text fragments according to the presence and type of biographical information they contain. After reviewing related work, the author concluded that neural network methods are promising and widely used for such problems. Based on this conclusion, the paper compares several neural network architectures, as well as basic text representation methods (Bag-of-Words, TF-IDF, Word2Vec), on a previously collected and annotated corpus of biographical texts. The article describes how the training set of text fragments was prepared, along with the text representation and classification methods chosen to solve the problem. The results of the multi-class classification of text fragments are also presented. Examples of automatic search for fragments containing biographical information are shown for texts that were not used in training the models.
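
To make the Bag-of-Words and TF-IDF representations named in the abstract concrete, the following is a minimal illustrative sketch in plain Python. It is not the author's implementation: the paper compares neural network models, whereas this sketch substitutes a simple nearest-centroid classifier with cosine similarity so that it stays self-contained. The toy fragments, the class labels ("birth", "education", "none"), and all function names are hypothetical and are not taken from the paper's corpus.

```python
import math
from collections import Counter

# Hypothetical toy fragments and labels (NOT from the paper's corpus).
TRAIN = [
    ("he was born in 1900 in moscow", "birth"),
    ("she was born in a small village", "birth"),
    ("he graduated from the university in 1925", "education"),
    ("she studied physics at the institute", "education"),
    ("the weather was cold that winter", "none"),
    ("the city stands on a wide river", "none"),
]


def tokenize(text):
    """Lowercase and split on whitespace (no lemmatization in this sketch)."""
    return text.lower().split()


def bag_of_words(tokens):
    """Bag-of-Words: raw term counts for one fragment."""
    return Counter(tokens)


def fit_idf(docs):
    """Smoothed inverse document frequency over tokenized fragments."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    """TF-IDF vector as a sparse dict; tokens unseen in training are dropped."""
    counts = bag_of_words(tokens)
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(samples):
    """Fit IDF weights and sum the TF-IDF vectors of each class into a centroid."""
    docs = [tokenize(text) for text, _ in samples]
    idf = fit_idf(docs)
    centroids = {}
    for tokens, (_, label) in zip(docs, samples):
        centroid = centroids.setdefault(label, Counter())
        for term, weight in tfidf(tokens, idf).items():
            centroid[term] += weight
    return idf, centroids


def classify(text, idf, centroids):
    """Assign the class whose centroid is most similar to the fragment."""
    vec = tfidf(tokenize(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


if __name__ == "__main__":
    idf, centroids = train(TRAIN)
    print(classify("he was born in kazan", idf, centroids))
    print(classify("she studied at the university", idf, centroids))
```

In a realistic setting the fragments would first be lemmatized, and the sparse vectors (or Word2Vec embeddings) would be fed to a trained classifier rather than compared with class centroids; the sketch only shows how the representations are built and used for a multi-class decision.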

About the Author

A. V. Glazkova
University of Tyumen
Russian Federation



For citations:


Glazkova A.V. Automatic search for fragments containing biographical information in a natural language text. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2018;30(6):221-236. (In Russ.) https://doi.org/10.15514/ISPRAS-2018-30(6)-12



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)