Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Text sampling strategies for predicting missing bibliographic links

https://doi.org/10.15514/ISPRAS-2022-34(2)-7

Abstract

The paper proposes several strategies for sampling text data in automatic sentence classification aimed at detecting missing bibliographic links. We construct samples from sentences, treated as semantic units of the text, and add their immediate context, which consists of several neighbouring sentences. We examine a number of sampling strategies that differ in the size and position of that context. The experiment is carried out on a collection of STEM scientific papers. Including the context of sentences in the samples improves the classification results. We automatically determine the optimal sampling strategy for a given text collection by applying ensemble voting to classifications of the same data sampled in different ways. A sampling strategy that takes the sentence context into account, combined with a hard voting procedure, achieves a classification quality of 98% (F1-score). This method of detecting missing bibliographic links can be used in the recommendation engines of applied intelligent information systems.

Keywords: text sampling, sampling strategy, citation analysis, bibliographic link prediction, sentence classification.
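The two ideas in the abstract, sampling a sentence together with neighbouring sentences and combining classifiers by hard voting, can be illustrated with a minimal Python sketch. The function names, the `left` and `right` parameters, and the tie-breaking rule are assumptions made for illustration only; the abstract does not specify the authors' implementation, their context sizes, or their classifier.

```python
from collections import Counter
from typing import List, Sequence


def sample_with_context(sentences: Sequence[str], left: int, right: int) -> List[str]:
    """Build one sample per sentence: the sentence itself plus up to `left`
    preceding and `right` following sentences (hypothetical parameters)."""
    samples = []
    for i in range(len(sentences)):
        start = max(0, i - left)                  # clip at the start of the text
        end = min(len(sentences), i + right + 1)  # clip at the end of the text
        samples.append(" ".join(sentences[start:end]))
    return samples


def hard_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Return the majority label for each sentence across several classifiers,
    one list of labels per sampling strategy. Ties go to the label that
    appears first (an assumption of this sketch)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


sentences = ["A.", "B.", "C."]
left_context = sample_with_context(sentences, left=1, right=0)
right_context = sample_with_context(sentences, left=0, right=1)

# Labels that three classifiers, each trained on a different sampling,
# might assign to the same three sentences (made-up values).
votes = hard_vote([[1, 0, 0], [1, 1, 0], [0, 1, 0]])
```

Here `left_context` pairs each sentence with the one before it, `right_context` pairs it with the one after it, and `votes` holds the label that most classifiers agreed on for each sentence.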

About the Authors

Fedor Vladimirovich KRASNOV
NAUMEN
Russian Federation

Doctor of Technical Sciences, expert of the Department of Information Technologies of Management



Irina Sergeevna SMAZNEVICH
NAUMEN
Russian Federation

Business Analyst, Department of Semantic Systems



Elena Nikolaevna BASKAKOVA
NAUMEN
Russian Federation

Leading Systems Analyst, Department of Semantic Systems



For citations:


KRASNOV F.V., SMAZNEVICH I.S., BASKAKOVA E.N. Text sampling strategies for predicting missing bibliographic links. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(2):77-88. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-34(2)-7



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)