
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Topic modeling in natural language texts

https://doi.org/10.15514/ISPRAS-2012-23-13

Abstract

Topic modeling is a method for building a model of a collection of text documents that determines the topics covered by each document. Shifting from the term space to the space of extracted topics helps resolve synonymy and polysemy of terms; it also enables more efficient topic-sensitive search, classification, summarization, and annotation of document collections and news feeds. The paper traces the evolution of topic modeling techniques. The earliest methods are based on clustering and rely on a similarity function defined over pairs of documents. The next generation of techniques is based on Latent Semantic Analysis (LSA), which analyzes word co-occurrences in documents. Currently, the most popular approaches are based on Bayesian networks: directed probabilistic graphical models that incorporate different kinds of entities and metadata, such as document authorship and connections between words, topics, documents, and authors. The paper contains a comparative survey of different models along with methods for parameter estimation and accuracy measurement. The following topic models are considered: Probabilistic Latent Semantic Indexing, Latent Dirichlet Allocation, non-parametric models, dynamic models, and semi-supervised models. The paper also describes well-known quality evaluation metrics, perplexity and topic coherence, and lists freely available implementations.
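As a minimal illustration of the workflow the abstract describes (word counts in, per-document topic mixtures out, perplexity for evaluation), the sketch below uses scikit-learn's LDA implementation on a tiny invented corpus; the corpus and parameter choices are assumptions for demonstration only, not material from the paper:

```python
# A minimal sketch of LDA topic extraction and perplexity evaluation,
# using scikit-learn; the toy corpus and hyperparameters are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Term-document counts: LDA models word co-occurrence, not raw text.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic distributions

print(doc_topics.shape)            # one topic mixture per document
print(lda.perplexity(X))           # quality metric: lower is better
```

Each row of `doc_topics` is a probability distribution over the two topics, which is the "space of extracted topics" that replaces the raw term space for downstream search or classification.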

About the Authors

Anton Korshunov
ISP RAS, Moscow
Russian Federation


Andrey Gomzin
ISP RAS, Moscow
Russian Federation


References

1. James Allan, Jaime Carbonell, George Doddington, Jonathan Yamron, and Yiming Yang. Topic Detection and Tracking Pilot Study. Final Report. Proceedings of the Broadcast News Transcription and Understanding Workshop (Sponsored by DARPA), Feb. 1998

2. A.K. Jain, M.N. Murty, P.J. Flynn. Data Clustering: A Review; ACM Computing Surveys, Vol. 31, No. 3, September 1999

3. Fabrizio Sebastiani. Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002.

4. Allan, J. and Lavrenko, V. and Malin, D. and Swan, R. Detections, bounds, and timelines: UMass and TDT-3. In Proceedings of Topic Detection and Tracking Workshop, pages 167–174, Vienna, VA, 2000

5. Blei, David M. (April 2012). Introduction to Probabilistic Topic Models. Comm. ACM 55 (4): 77–84.

6. Thomas Hofmann. Probabilistic Latent Semantic Analysis. UAI 1999: 289-296

7. Thomas Hofmann. Probabilistic Latent Semantic Indexing. SIGIR 1999: 50-57

8. T.K. Moon. The expectation-maximization algorithm. IEEE Signal Processing Mag., vol. 13, pp. 47–60, Nov. 1996

9. D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003

10. Gregor Heinrich. Parameter estimation for text analysis. Technical report, Fraunhofer IGD, 2005

11. D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Neural Information Processing Systems 16, 2003

12. Yee Whye Teh, Michael I. Jordan, Matthew J. Beal and David M. Blei. Hierarchical Dirichlet Processes. Journal of the American Statistical Association, 101:476, 1566-1581, 2006

13. C. Wang, J. Paisley, and D. Blei. Online variational inference for the hierarchical Dirichlet process. Artificial Intelligence and Statistics, 2011

14. Mining Text Data (Springer) Ed. Charu Aggarwal, ChengXiang Zhai, March 2012

15. D. Blei and J. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, 2006

16. Xuerui Wang, Andrew McCallum. Topics over time: a non-Markov continuous-time model of topical trends. KDD 2006: 424-433

17. M. Hoffman, D. Blei, and F. Bach. Online learning for latent Dirichlet allocation. Neural Information Processing Systems, 2010

18. Kevin Robert Canini, Lei Shi, Thomas L. Griffiths. Online Inference of Topics with Latent Dirichlet Allocation. Journal of Machine Learning Research - Proceedings Track 5: 65-72 (2009)

19. D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Empirical Methods in Natural Language Processing, pages 248–256, 2009

20. G. Lisowsky and L. Rost. Konkordanz zum hebräischen Alten Testament. Deutsche Bibelgesellschaft, 1958.

21. Lee, S., Song, J., and Kim, Y. An Empirical Comparison of Four Text Mining Methods. Journal of Computer Information Systems, (51:1), 2010, pp. 1-10

22. D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, editors, Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2009

23. David M. Blei's topic modeling page: http://www.cs.princeton.edu/~blei/topicmodeling.html

24. D. Mimno and A. McCallum. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In UAI, 2008

25. Zelong Liu, Maozhen Li, Yang Liu, Mahesh Ponraj. Performance evaluation of Latent Dirichlet Allocation in text mining. FSKD 2011: 2695-2698

26. Steyvers, M. & Griffiths, T. Probabilistic topic models. In T. Landauer, D McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Laurence Erlbaum, 2007

27. Ali Daud, Juanzi Li, Lizhu Zhou, Faqir Muhammad. Knowledge discovery through directed probabilistic topic models: a survey. In Proceedings of Frontiers of Computer Science in China. 2010, 280-301.

28. Buntine W. L. Operations for learning with graphical models. Journal of Artificial Intelligence Research, 1994, 2:159–225

29. S. Choi, S. Cha, C. C. Tappert. A Survey of Binary Similarity and Distance Measures, Journal of Systemics, Cybernetics and Informatics, Vol 8 No 1 2010, pp 43-48

30. Rui Xu, Donald C. Wunsch II. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3): 645-678 (2005)

31. L. Bahl, J. Baker, F. Jelinek, and R. Mercer. Perplexity — a measure of the difficulty of speech recognition tasks. In Program, 94th Meeting of the Acoustical Society of America, volume 62, page S63, 1977

32. H. Wallach, I. Murray, R. Salakhutdinov, and D. Mimno. Evaluation methods for topic models. In Proceedings of the 26th International Conference on Machine Learning (ICML 2009), 2009

33. Jonathan Chang, Jordan L. Boyd-Graber, Sean Gerrish, Chong Wang, David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. NIPS 2009: 288-296

34. Newman, Lau, Grieser, Baldwin. Automatic Evaluation of Topic Coherence. NAACL HLT 2010

35. Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. Proc. of Conf. on Uncertainty in Artificial Intelligence (UAI’04) (pp. 487–494)

36. Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988). Using latent semantic analysis to improve information retrieval. In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM, 281-285

37. Sanjeev Arora, Rong Ge, Ankur Moitra. Learning Topic Models - Going beyond SVD. CoRR abs/1204.1956 (2012)

38. Daniel D. Lee and H. Sebastian Seung (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755): 788–791



For citations:


Korshunov A., Gomzin A. Topic modeling in natural language texts. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2012;23. (In Russ.) https://doi.org/10.15514/ISPRAS-2012-23-13



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)