Text sampling strategies for predicting missing bibliographic links
https://doi.org/10.15514/ISPRAS-2022-34(2)-7
Abstract
The paper proposes strategies for sampling text data in automatic sentence classification aimed at detecting missing bibliographic links. We construct samples from sentences, treated as semantic units of the text, and add their immediate context, which consists of several neighbouring sentences. We examine a number of sampling strategies that differ in context size and position. The experiment is carried out on a collection of STEM scientific papers. Including sentence context in the samples improves classification results. We automatically determine the optimal sampling strategy for a given text collection by applying ensemble voting to classifications of the same data sampled in different ways. A sampling strategy that takes sentence context into account, combined with a hard voting procedure, achieves an F1-score of 98%. This method of detecting missing bibliographic links can be used in the recommendation engines of applied intelligent information systems.
Keywords: text sampling, sampling strategy, citation analysis, bibliographic link prediction, sentence classification.
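The two ideas in the abstract, context-based sampling and hard voting, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The strategy names, the context sizes, and the helper functions `build_samples` and `hard_vote` are hypothetical and chosen only for illustration, and the abstract does not state which classifier or which context sizes were actually used.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical strategies: (sentences taken before, sentences taken after).
# The real context sizes and positions are not given in the abstract.
STRATEGIES: Dict[str, Tuple[int, int]] = {
    "sentence_only": (0, 0),
    "left_context": (1, 0),
    "right_context": (0, 1),
    "symmetric_context": (1, 1),
}


def build_samples(sentences: Sequence[str], before: int, after: int) -> List[str]:
    """Return one sample per sentence: the sentence joined with its neighbours.

    The window is clipped at the document boundaries, so the first and last
    sentences simply receive less context.
    """
    samples = []
    for i in range(len(sentences)):
        start = max(0, i - before)
        end = min(len(sentences), i + after + 1)
        samples.append(" ".join(sentences[start:end]))
    return samples


def hard_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Majority label per sentence across strategies.

    predictions[k][i] is the label that the classifier trained on strategy k
    assigned to sentence i. On a tie, the label seen first wins.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


if __name__ == "__main__":
    doc = [
        "Transformers are widely used.",
        "This was shown earlier.",
        "We follow that setup.",
    ]
    for name, (before, after) in STRATEGIES.items():
        print(name, build_samples(doc, before, after))

    # Toy labels (1 = sentence lacks a bibliographic link) from three strategies.
    print(hard_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]]))
```

In this sketch each strategy produces its own view of the same sentences. A separate classifier would be trained on each view, and `hard_vote` combines their per-sentence labels by simple majority.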
About the Authors
Fedor Vladimirovich KRASNOV
Russian Federation
Doctor of Technical Sciences, expert of the Department of Information Technologies of Management
Irina Sergeevna SMAZNEVICH
Russian Federation
Business Analyst, Department of Semantic Systems
Elena Nikolaevna BASKAKOVA
Russian Federation
Leading Systems Analyst, Department of Semantic Systems
For citations:
KRASNOV F.V., SMAZNEVICH I.S., BASKAKOVA E.N. Text sampling strategies for predicting missing bibliographic links. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(2):77-88. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-34(2)-7