Cross-lingual transfer learning in drug-related information extraction from user-generated texts
https://doi.org/10.15514/ISPRAS-2021-33(6)-15
Abstract
Aggregating knowledge about drug, disease, and drug reaction entities across a broader range of domains and languages is critical for information extraction (IE) applications. In this work, we present a fine-grained evaluation of how effectively multilingual BERT-based models handle biomedical named entity recognition (NER) and multi-label sentence classification. We investigate the role of transfer learning (TL) strategies between two English corpora and a novel annotated corpus of Russian reviews about drug therapy. Each sentence is labelled with the health-related issues it mentions or with their absence. Sentences that mention a health-related issue are additionally labelled at the expression level to identify fine-grained subtypes such as drug names, drug indications, and drug reactions. Evaluation results demonstrate that BERT trained on raw Russian and English reviews (5M in total) shows the best transfer capabilities when evaluated on adverse drug reactions in Russian data. Our RuDR-BERT model achieves a macro F1 score of 74.85% on the NER task. On the classification task, our EnRuDR-BERT model achieves a macro F1 score of 70%, a gain of 8.64% over a general-domain BERT model.
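The abstract reports macro F1 for multi-label sentence classification. As a minimal sketch of how such a score is commonly computed (not necessarily the authors' evaluation script), the example below averages per-label F1 over a label set. The label names `ADR` and `DI`, the toy sentences, and the convention of scoring 0.0 for a label with no positives are assumptions made for illustration only.

```python
from typing import List, Sequence, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one label; returns 0.0 when tp, fp and fn are all zero (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]], labels: List[str]) -> float:
    """Unweighted mean of per-label F1 over multi-label sentence annotations."""
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Hypothetical toy data: each set holds the labels assigned to one sentence;
# an empty set means the sentence mentions no health-related issue.
gold = [{"ADR"}, {"ADR", "DI"}, set(), {"DI"}]
pred = [{"ADR"}, {"DI"}, {"ADR"}, {"DI"}]

print(macro_f1(gold, pred, ["ADR", "DI"]))  # 0.75
```

Here `ADR` has one true positive, one false positive, and one false negative (F1 = 0.5), while `DI` is predicted perfectly (F1 = 1.0), so the macro average is 0.75. Because every label contributes equally, rare labels weigh as much as frequent ones.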
About the Authors
Andrey Sergeyevich SAKHOVSKIY
Russian Federation
Laboratory Assistant at the "Chemoinformatics and Molecular Modeling" Research Laboratory of Kazan Federal University; first-year graduate student at the Department of Mathematical Forecasting Methods, Faculty of Computational Mathematics and Cybernetics, Moscow State University
Elena Viktorovna TUTUBALINA
Russian Federation
Candidate of Physical and Mathematical Sciences, Researcher at the "Models and Methods of Computational Pragmatics" Research Laboratory of the Faculty of Computer Science, Higher School of Economics; Senior Researcher at the "Chemoinformatics and Molecular Modeling" Research Laboratory, Kazan Federal University; Executive Director of Data Science at Sber AI
For citations:
SAKHOVSKIY A.S., TUTUBALINA E.V. Cross-lingual transfer learning in drug-related information extraction from user-generated texts. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(6):217-228. (In Russ.) https://doi.org/10.15514/ISPRAS-2021-33(6)-15