H1: Hybrid Retrieval System for Product Search in E-Commerce
https://doi.org/10.15514/ISPRAS-2024-36(5)-16
Abstract
This paper demonstrates the effectiveness of the H1 system for searching products from multiple suppliers on an online marketplace. Like all modern product search systems, the hybrid system H1 combines the advantages of lexical retrieval methods with those of semantic methods based on multidimensional vector representations. The novelty of the proposed approach lies in combining the retrieval methods at the token level. An additional advantage of H1 over other industrial systems is its handling of multi-word search queries. For example, in the queries «конфеты рот фронт» ("Rot Front candies") and «gloria jeans детская одежда» ("Gloria Jeans children's clothing"), the brand entity is extracted as a single token, «рот фронт» and «gloria jeans» respectively, which reduces the model size and improves the offline metrics of the retrieval system. On the public WANDS dataset, H1 achieves an averaged threshold precision of mAP@12 = 56.1% and a threshold recall of R@1k = 86.6%, exceeding the most recent comparable systems.
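The idea of keeping a multi-word brand as a single token can be illustrated with a minimal sketch. It is not the H1 implementation: the brand dictionary, the function names `build_phrase_set` and `tokenize_with_brands`, and the greedy longest-match strategy are all assumptions made for illustration only.

```python
from typing import Iterable, List, Set, Tuple


def build_phrase_set(brands: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Normalize brand names into tuples of lowercase words (hypothetical helper)."""
    return {tuple(brand.lower().split()) for brand in brands}


def tokenize_with_brands(query: str, phrases: Set[Tuple[str, ...]]) -> List[str]:
    """Split a query on whitespace, merging known multi-word brands into one token.

    Assumption: greedy longest match against a brand dictionary; the paper
    does not state which matching strategy H1 uses.
    """
    words = query.lower().split()
    max_len = max((len(p) for p in phrases), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, fall back to a single word.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if n == 1 or candidate in phrases:
                tokens.append(" ".join(candidate))
                i += n
                break
    return tokens


# Illustrative brand dictionary built from the two examples in the abstract.
BRANDS = build_phrase_set(["рот фронт", "gloria jeans"])

print(tokenize_with_brands("конфеты рот фронт", BRANDS))
print(tokenize_with_brands("gloria jeans детская одежда", BRANDS))
```

With this dictionary the first query yields two tokens, `конфеты` and `рот фронт`, instead of three. A vocabulary with one entry per brand avoids learning separate representations for the brand's component words, which is one plausible reading of the model-size reduction the abstract mentions.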
About the Author
Fedor Vladimirovich KRASNOV
Russia
Doctor of Technical Sciences, specialist in search and recommender systems in e-commerce, staff member of the Research Center of WB SK LLC at the Skolkovo Innovation Center.
For citation:
KRASNOV F.V. H1: Hybrid Retrieval System for Product Search in E-Commerce. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(5):227-240. (In Russ.) https://doi.org/10.15514/ISPRAS-2024-36(5)-16