Cross-lingual similar document retrieval methods

Denis Vladimirovich Zubarev; Ilya Vladimirovich Sochenkov

doi:10.15514/ISPRAS-2019-31(5)-9

Abstract

In this paper, we compare methods for cross-lingual similar document retrieval, focusing on the Russian-English language pair. We compare well-known methods such as Cross-Lingual Explicit Semantic Analysis (CL-ESA) with methods based on cross-lingual embeddings. In one approach, we use approximate nearest neighbor (ANN) search to retrieve documents based entirely on distances between learned document embeddings. We also employ a more traditional approach that uses an inverted index, with an extra step of mapping the top keywords from one language to the other with the help of cross-lingual word embeddings. We evaluate all approaches on Russian-English aligned Wikipedia articles. The experiments show that the inverted-index approach achieves better recall and MAP than the other methods.
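The inverted-index approach and the two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy two-dimensional vectors, the transliterated Russian keywords, the overlap-count scoring, and all function names (`map_keywords`, `build_inverted_index`, `retrieve`, `recall_at_k`, `average_precision`) are assumptions made for illustration. It only assumes that source and target words already live in one shared embedding space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def map_keywords(keywords, src_vecs, tgt_vecs, n_neighbors=1):
    """Map each source-language keyword to its nearest target-language words."""
    mapped = []
    for kw in keywords:
        if kw not in src_vecs:
            continue  # keyword has no embedding, so it cannot be mapped
        ranked = sorted(tgt_vecs,
                        key=lambda w: cosine(src_vecs[kw], tgt_vecs[w]),
                        reverse=True)
        mapped.extend(ranked[:n_neighbors])
    return mapped

def build_inverted_index(docs):
    """Build a term -> set of document ids index from tokenized documents."""
    index = {}
    for doc_id, tokens in docs.items():
        for tok in set(tokens):
            index.setdefault(tok, set()).add(doc_id)
    return index

def retrieve(query_terms, index, top_k=10):
    """Rank documents by the number of matching query terms (a simplification)."""
    scores = {}
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_k]

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def average_precision(ranked, relevant):
    """Average precision for one query; MAP is its mean over all queries."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy shared embedding space (hypothetical values, transliterated Russian keys).
src_vecs = {"koshka": [1.0, 0.0], "sobaka": [0.0, 1.0]}
tgt_vecs = {"cat": [0.9, 0.1], "dog": [0.1, 0.9], "car": [0.7, 0.7]}
docs = {"en1": ["cat", "sat"], "en2": ["dog", "barked"], "en3": ["car", "drove"]}

index = build_inverted_index(docs)
query = map_keywords(["koshka"], src_vecs, tgt_vecs)
results = retrieve(query, index)
```

In this toy setup the Russian keyword is mapped to a single English term before the index lookup. Raising `n_neighbors` widens the query at the cost of noisier matches.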

Keywords

cross-lingual document retrieval, cross-lingual plagiarism detection, cross-lingual word embeddings

About the Authors

Denis Vladimirovich Zubarev

FRC CSC RAS
Russian Federation

Engineer of FRC CSC RAS.

Ilya Vladimirovich Sochenkov

FRC CSC RAS
Russian Federation

PhD, Head of the Department of Intelligent Technologies and Systems of FRC CSC RAS.

For citations:

Zubarev D.V., Sochenkov I.V. Cross-lingual similar document retrieval methods. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2019;31(5):127-136. https://doi.org/10.15514/ISPRAS-2019-31(5)-9

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
