Effective implementations of topic modeling algorithms
https://doi.org/10.15514/ISPRAS-2020-32(1)-8
Abstract
Topic modeling is an area of natural language processing that has been actively developed over the last 15 years. A probabilistic topic model extracts a set of hidden topics from a collection of text documents: it defines each topic by a probability distribution over words and describes each document by a probability distribution over topics. The exploding volume of text data motivates the community to constantly improve topic modeling algorithms for multiprocessor systems. In this paper, we provide an overview of effective EM-like algorithms for learning latent Dirichlet allocation (LDA) and additively regularized topic models (ARTM). First, we review 11 techniques for efficient topic modeling based on synchronous and asynchronous parallel computing, distributed data storage, streaming, batch processing, RAM optimization, and fault-tolerance improvements. Second, we review 14 effective implementations of topic modeling algorithms proposed in the literature over the past 10 years, which use different combinations of the techniques above. Their comparison shows that no single implementation is universally best. All of the improvements described apply to every kind of topic modeling algorithm considered: the PLSA, LDA, and ARTM models and the MAP, variational Bayes (VB), and Gibbs sampling (GS) inference schemes.
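For reference, here is a minimal sketch of the model and the EM-like iteration behind the algorithms surveyed; it uses the standard PLSA/ARTM notation of [5] and [9] and is not reproduced from the paper itself. The model factorizes word-in-document probabilities over a set of topics T:

    p(w \mid d) = \sum_{t \in T} \varphi_{wt}\,\theta_{td}, \qquad \varphi_{wt} = p(w \mid t), \quad \theta_{td} = p(t \mid d).

One EM iteration alternates an E-step, which estimates topic responsibilities, and an M-step, which renormalizes the resulting counters:

    \text{E-step:} \quad p(t \mid d, w) = \frac{\varphi_{wt}\,\theta_{td}}{\sum_{s \in T} \varphi_{ws}\,\theta_{sd}},

    \text{M-step:} \quad n_{wt} = \sum_{d} n_{dw}\, p(t \mid d, w), \qquad \varphi_{wt} = \frac{n_{wt}}{\sum_{w'} n_{w't}},

with n_{td} and \theta_{td} updated symmetrically. ARTM modifies only the M-step, adding regularizer gradients: \varphi_{wt} \propto \left( n_{wt} + \varphi_{wt}\, \partial R / \partial \varphi_{wt} \right)_{+}.

The Python sketch below shows one such iteration as a dense single-machine NumPy computation; all names (em_step, n_dw, phi, theta) are illustrative and are not taken from any of the implementations surveyed. The surveyed systems differ precisely in how they batch, distribute, and synchronize the n_wt counters that this sketch computes in one pass.

    import numpy as np

    def em_step(n_dw, phi, theta):
        """One EM iteration of PLSA on a dense count matrix.

        n_dw  : (D, W) document-term counts
        phi   : (W, T) p(w|t), each column sums to 1
        theta : (D, T) p(t|d), each row sums to 1
        """
        # E-step folded into the M-step: p(w|d) under the current model
        p_wd = theta @ phi.T                            # (D, W)
        ratio = np.divide(n_dw, p_wd, out=np.zeros_like(p_wd), where=p_wd > 0)
        # M-step: unnormalized counters n_wt and n_td, then renormalization
        n_wt = phi * (ratio.T @ theta)                  # (W, T)
        n_td = theta * (ratio @ phi)                    # (D, T)
        return (n_wt / n_wt.sum(axis=0, keepdims=True),
                n_td / n_td.sum(axis=1, keepdims=True))

Broadly, the online and distributed variants surveyed stream the collection in batches: each worker accumulates its share of n_wt and merges it into the global model, while the per-batch rows of theta are recomputed locally and discarded, which is what makes streaming, parallelism, and a modest RAM footprint compatible.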
About the Author
Murat Azamatovich Apishev, Russian Federation
Postgraduate student at the Faculty of Computational Mathematics and Cybernetics, Department of Mathematical Methods of Forecasting
References
1. Hofmann T. Probabilistic Latent Semantic Indexing. In Proc. of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50-57.
2. Blei D., Ng A., Jordan M. Latent Dirichlet Allocation. Journal of Machine Learning Research, vol. 3, 2003, pp. 993-1022.
3. Kochedykov D., Apishev M., Golitsyn L., Vorontsov K. Fast and Modular Regularized Topic Modeling. In Proc. of the 21st Conference of Open Innovations Association (FRUCT), 2017, pp. 182-193.
4. Vorontsov K.V. Additive regularization for topic models of text collections. Doklady Mathematics, vol. 89, no. 3, 2014, pp. 301-304.
5. Vorontsov K.V., Potapenko A.A. Additive regularization of topic models. Machine Learning, Special Issue on Data Analysis and Intelligent Optimization with Applications, vol. 101, no. 1, 2015, pp. 303-323.
6. Dempster A.P., Laird N.M., Rubin D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), vol. 39, no. 1, 1977, pp. 1-38.
7. Teh Y.W., Newman D., Welling M. A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation. In Proc. of the 19th International Conference on Neural Information Processing Systems (NIPS 2006), 2006, pp. 1353-1360.
8. Griffiths T., Steyvers M. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, vol. 101, Suppl. 1, 2004, pp. 5228-5235.
9. Vorontsov K.V., Potapenko A.A. EM-like Algorithms for Probabilistic Topic Modeling. Machine Learning and Data Analysis, vol. 1, no. 6, 2013, pp. 657-686 (in Russian).
10. Asuncion A., Welling M., Smyth P., Teh Y.W. On Smoothing and Inference for Topic Models. In Proc. of the International Conference on Uncertainty in Artificial Intelligence, 2009, pp. 27-34.
11. Newman D., Asuncion A., Smyth P., Welling M. Distributed Algorithms for Topic Models. The Journal of Machine Learning Research, vol. 10, 2009, pp. 1801-1828.
12. Wang Y., Bai H., Stanton M., Chen W., Chang E. PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications. Lecture Notes in Computer Science, vol. 5564, 2009, pp. 301-314.
13. Asuncion A., Smyth P., Welling M. Asynchronous Distributed Estimation of Topic Models for Document Analysis. Statistical Methodology, vol. 8, no. 1, 2010, pp. 3-17.
14. Liu Z., Zhang Y., Chang E., Sun M. PLDA+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. ACM Transactions on Intelligent Systems and Technology, vol. 2, issue 3, 2011, article no. 26.
15. Smola A., Narayanamurthy S. An architecture for parallel topic models. Proceedings of the VLDB Endowment, vol. 3, issue 1-2, 2010, pp. 703-710.
16. Řehůřek R., Sojka P. Software Framework for Topic Modelling with Large Corpora. In Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45-50.
17. Zhai K., Boyd-Graber J., Asadi N., Alkhouja M. Mr. LDA: A Flexible Large Scale Topic Modeling Package using Variational Inference in MapReduce. In Proc. of the 21st International Conference on World Wide Web, 2012, pp. 879-888.
18. Qiu Z., Wu B., Wang B., Yu L. Collapsed Gibbs Sampling for Latent Dirichlet Allocation on Spark. Proceedings of Machine Learning Research, vol. 36, 2014, pp. 17-28.
19. Wang Y., Zhao X., Sun Z., Yan H., Wang L., Jin Z., Wang L., Gao Y., Law C., Zeng J. Peacock: Learning Long-Tail Topic Features for Industrial Applications. ACM Transactions on Intelligent Systems and Technology, 2015, vol. 6, no. 4, article no. 47.
20. Yuan J., Gao F., Ho Q., Dai W., Wei J., Zheng X., Xing E., Liu T., Ma W. LightLDA: Big Topic Models on Modest Computer Clusters. In Proc. of the 24th International Conference on World Wide Web, 2015, pp. 1351-1361.
21. Zhao B., Zhou H., Li G., Huang Y. ZenLDA: An Efficient and Scalable Topic Model Training System on Distributed Data-Parallel Platform. Big Data Mining and Analytics, vol. 1, issue 1, 2018, pp. 57-74.
22. Vorontsov K., Frei O., Apishev M., Romov P., Suvorova M., Yanina A. Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections. In Proc. of the 2015 Workshop on Topic Models: Post-Processing and Applications, 2015, pp. 29-37.
23. Frei O., Apishev M. Parallel Non-blocking Deterministic Algorithm for Online Topic Modeling. Communications in Computer and Information Science, vol. 661, 2016, pp. 132-144.
24. Zeng J., Liu Z., Cao X. Fast Online EM for Big Topic Modeling. IEEE Transactions on Knowledge and Data Engineering, vol. 28, issue 3, 2016, pp. 675-688.
25. Hoffman M., Blei D., Bach F. Online Learning for Latent Dirichlet Allocation. In Proc. of the 23rd International Conference on Neural Information Processing Systems, vol. 1, 2010, pp. 856-864.
26. Vowpal Wabbit ML library. Available at: https://github.com/JohnLangford/vowpal_wabbit, accessed 15.01.2020.
27. White T. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012, 688 p.
28. Zaharia M., Chowdhury M., Franklin M. Spark: Cluster Computing with Working Sets. In Proc. of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010, 10 p.
29. MPI: A Message-Passing Interface Standard, Version 3.0. Message Passing Interface Forum, 2012. Available at: https://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, accessed 15.01.2020.
30. Dean J., Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. In Proc. of the 6th Symposium on Operating Systems Design and Implementation, 2004, pp. 137-149.
31. Petuum ML platform. Available at: http://www.petuum.com, accessed 15.01.2020.
32. Spark GraphX library. Available at: https://spark.apache.org/graphx, accessed 15.01.2020.
For citations:
Apishev M.A. Effective implementations of topic modeling algorithms. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(1):137-152. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(1)-8