Neural Vector Compression In Approximate Nearest Neighbor Search On Large Datasets

Igor Olegovich BUYANOV; Vasiliy Vladimirovich YADRINSEV; Ilia Vladimirovich SOCHENKOV

doi:10.15514/ISPRAS-2024-36(1)-1

Neural Vector Compression In Approximate Nearest Neighbor Search On Large Datasets

Igor Olegovich BUYANOV, Vasiliy Vladimirovich YADRINSEV, Ilia Vladimirovich SOCHENKOV

https://doi.org/10.15514/ISPRAS-2024-36(1)-1

Full Text:

PDF (Rus)

Generate QR code

Abstract

The paper examines the hypothesis of the applicability of neural autoencoders as a method of vector compression in the pipeline of approximate nearest neighbor search. The evaluation was conducted on several large datasets using various autoencoder architectures and indexes. It has been demonstrated that, although none of the combinations of autoencoders and indexes can fully outperform pure solutions, in some cases, they can be useful. Additionally, we have identified some empirical relationships between the optimal dimensionality of the hidden layer and the internal dimensionality of the datasets. It has also been shown that the loss function is a determining factor for compression quality.

Keywords

approximate nearest neighbor search, autoencoders, large datasets

About the Authors

Igor Olegovich BUYANOV

Federal Research Center "Computer Science and Control" of the Russian
Russian Federation

Postgraduate student at FRC CSC RAS, senior developer at MTS AI. Research interests: natural language processing, embedding space analysis, computational psychology.

Vasiliy Vladimirovich YADRINSEV

Federal Research Center "Computer Science and Control" of the Russian
Russian Federation

Junior researcher at department 73 intellectual technologies and systems of FRC CSC RAS. Research interests: data vector indexing, natural language processing.

Ilia Vladimirovich SOCHENKOV

Federal Research Center "Computer Science and Control" of the Russian, Institute for System Programming of the Russian Academy of Sciences, Innopolis University, Sechenov University,
Russian Federation

Cand. Sci. (Phys.-Math.), lead researcher at FRC CSC RAS, lead researcher at ISP RAS, lead expert at Innopolis University. Research interests: natural language processing, plagiarism detection.

References

1. J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM 18, 509–517 (1975).

2. S. M. Omohundro. Five balltree construction algorithms (2009).

3. H. J´egou, M. Douze, C. Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 117–128 (2011).

4. P. Alexander, Y. Malkov, L. Andrey, V. Krylov. Approximate nearest neighbor search small world approach (2011).

5. Y. A. Malkov, D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchicalnavigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 824–836 (2016).

6. H. Zhang, H. Shen, Y. Qiu, Y. Jiang, S. Wang, S. Xu, Y. Xiao, B. Long, W. Yang. Joint learning of deep retrieval model and product quantization based embedding index. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (2021).

7. D. E. Rumelhart, G. E. Hinton, R. J. Williams. Learning internal representations by error propagation (1986).

8. Y. Lee, H. Kwon, F. C. Park. Neighborhood reconstructing autoencoders. In Neural Information Processing Systems (2021).

9. X. Guo, X. Liu, E. Zhu, J. Yin. Deep clustering with convolutional autoencoders. In International Conference on Neural Information Processing (2017).

10. L. Mirvakhabova, E. Frolov, V. Khrulkov, I. Oseledets, A. Tuzhilin. Performance of hyperbolic geometry models on top-n recommendation tasks. Proceedings of the 14th ACM Conference on Recommender Systems (2020).

11. H. Gong, H. Schwenk. Multimodal and multilingual embeddings for large-scale speech mining. In Neural Information Processing Systems (2021).

12. A. Kuznetsova, H. Rom, N. G. Alldrin, J. R. R. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, V. Ferrari. The open images dataset v4. International Journal of Computer Vision 128, 1956–1981 (2018).

13. J. Johnson, M. Douze, H. J´egou. Billion-scale similarity search with gpus. IEEE Transactions on Big Data 7, 535–547 (2017).

14. V. Erba, M. Gherardi, P. Rotondo. Intrinsic dimension estimation for locally undersampled data. Scientific Reports 9 (2019).

15. J. Bac, E. M. Mirkes, A. N. Gorban, I. Y. Tyukin, A. Y. Zinovyev. Scikit-dimension: A python package for intrinsic dimension estimation. Entropy 23 (2021).

16. N. Chen, P. van der Smagt, B. Cseke. Local distance preserving auto-encoders using continuous k-nearest neighbours graphs. ArXiv abs/2206.05909 (2022), https://api.semanticscholar.org/CorpusID:249626370.

Review

For citations:

BUYANOV I.O., YADRINSEV V.V., SOCHENKOV I.V. Neural Vector Compression In Approximate Nearest Neighbor Search On Large Datasets. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(1):7-22. (In Russ.) https://doi.org/10.15514/ISPRAS-2024-36(1)-1

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)

Username
Password
	Remember me
Not a user? Register with this site Forgot your password?

User

Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Neural Vector Compression In Approximate Nearest Neighbor Search On Large Datasets

Full Text:

Abstract

Keywords

About the Authors

References

Review

For citations:

Cookies policy