Weakly Supervised Word Sense Disambiguation Using Automatically Labelled Collections
https://doi.org/10.15514/ISPRAS-2021-33(6)-13
Abstract
State-of-the-art word sense disambiguation (WSD) systems are based on supervised learning, but building such models requires large amounts of sense-annotated data, which are unavailable for most low-resource languages. To address the shortage of annotated data for Russian, this paper proposes an approach that automatically labels the senses of ambiguous words using an ensemble of weakly supervised models. The initial labelling is produced by an automatic method based on the concept of monosemous relatives: related words that have only one sense and can therefore tag a context unambiguously. Three WSD models are trained on these synthetic data and then combined in an ensemble to predict the senses of the target ambiguous words. The experiments show that models trained on data labelled by the pre-trained models achieve higher disambiguation quality. The paper also examines how different text data augmentation techniques affect prediction quality.
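The two key steps of the pipeline can be illustrated schematically. The following Python sketch is a minimal illustration under stated assumptions, not the authors' implementation: the toy sense inventory RELATIVES, the whitespace-tokenized corpus, and the models-as-callables interface are all invented for the example. It shows harvesting pseudo-labelled training examples via monosemous relatives and combining several trained models by weighted voting.

```python
from collections import Counter

# Toy sense inventory (an assumption for this sketch): each sense of the
# ambiguous target word maps to monosemous relatives, i.e. related words
# that have only one sense and therefore label a context unambiguously.
RELATIVES = {
    "sense_1": ["relative_1a", "relative_1b"],
    "sense_2": ["relative_2a", "relative_2b"],
}

def harvest_pseudo_labelled(corpus, target):
    """Collect pseudo-annotated training examples: whenever a sentence
    contains a monosemous relative, substitute the target word for it
    and label the sentence with the sense tied to that relative."""
    examples = []
    for sentence in corpus:  # corpus: iterable of token lists
        for sense, relatives in RELATIVES.items():
            for rel in relatives:
                if rel in sentence:
                    context = [target if tok == rel else tok for tok in sentence]
                    examples.append((context, sense))
    return examples

def ensemble_predict(models, context, weights=None):
    """Weighted majority vote over the sense predictions of several
    independently trained WSD models (three in the paper); with no
    weights this reduces to a plain majority vote."""
    weights = weights or [1.0] * len(models)
    votes = Counter()
    for model, weight in zip(models, weights):
        votes[model(context)] += weight
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    corpus = [["a", "relative_1a", "b"], ["c", "relative_2b", "d"]]
    print(harvest_pseudo_labelled(corpus, target="target_word"))
    # Three dummy "models", each a callable returning a sense label.
    models = [lambda ctx: "sense_1", lambda ctx: "sense_1", lambda ctx: "sense_2"]
    print(ensemble_predict(models, ["a", "target_word", "b"]))
```

In practice the vote weights could, for instance, reflect each model's accuracy on held-out data, so that stronger models contribute more to the final sense prediction.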
About the authors
Angelina Sergeevna BOLSHINA
Russia
Postgraduate student, Department of Theoretical and Applied Linguistics
Natalia Valentinovna LOUKACHEVITCH
Russia
Doctor of Technical Sciences, Leading Researcher
For citation:
BOLSHINA A.S., LOUKACHEVITCH N.V. Weakly Supervised Word Sense Disambiguation Using Automatically Labelled Collections. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(6):193-204. https://doi.org/10.15514/ISPRAS-2021-33(6)-13