Topic modeling in natural language texts

Anton Korshunov; Andrey Gomzin

doi:10.15514/ISPRAS-2012-23-13

Abstract

Topic modeling is a way of building a model of a collection of text documents that determines which topics each document belongs to. Moving from the space of terms to the space of discovered topics helps resolve synonymy and polysemy of terms and makes it possible to solve tasks such as topical search, classification, summarization, and annotation of document collections and news streams more effectively. The approaches most widely used in modern applications are based on Bayesian networks: directed probabilistic graphical models that can account for document authorship, relationships among words, topics, documents, and authors, as well as other types of entities and metadata. The article gives a comparative survey of various models, describes methods for estimating their parameters and evaluating the quality of their results, and lists examples of open-source software implementations.
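As a rough illustration of how moving from term space to topic space can resolve synonymy, the sketch below compares two toy documents that share no words. It is not taken from the article: the word-given-topic table `TOPIC_WORD`, the helper names, and the documents are hypothetical, and the topic proportions are estimated by a simple EM procedure with the word distributions held fixed (PLSA-style folding-in), not by fitting a full model.

```python
import math

# Hypothetical p(word | topic) table; the values are made up for illustration.
TOPIC_WORD = {
    "transport": {"car": 0.4, "automobile": 0.4, "engine": 0.2},
    "finance": {"bank": 0.5, "loan": 0.3, "rate": 0.2},
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def term_vector(tokens):
    """Represent a document in term space as raw word counts."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


def topic_vector(tokens, topic_word=TOPIC_WORD, iters=20, eps=1e-9):
    """Estimate p(topic | document) by EM with p(word | topic) held fixed."""
    topics = list(topic_word)
    theta = {t: 1.0 / len(topics) for t in topics}
    if not tokens:
        return theta
    for _ in range(iters):
        expected = {t: 0.0 for t in topics}
        for word in tokens:
            # E-step: posterior over topics for this word occurrence.
            # eps stands in for words a topic has never seen.
            weights = {t: theta[t] * topic_word[t].get(word, eps) for t in topics}
            total = sum(weights.values())
            for t in topics:
                expected[t] += weights[t] / total
        # M-step: renormalize expected topic counts into proportions.
        norm = sum(expected.values())
        theta = {t: expected[t] / norm for t in topics}
    return theta


doc_a = ["car", "engine"]
doc_b = ["automobile"]

print("term-space similarity: ", cosine(term_vector(doc_a), term_vector(doc_b)))
print("topic-space similarity:", cosine(topic_vector(doc_a), topic_vector(doc_b)))
```

The two documents have no terms in common, so they are orthogonal in term space, yet both are assigned almost entirely to the same hypothetical topic and therefore end up close together in topic space.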

Keywords

topic modeling, topical search, document classification, probabilistic graphical models, Bayesian networks, latent Dirichlet allocation, dimensionality reduction, text analysis, information extraction, machine learning

About the authors

Anton Korshunov

ISP RAS
Russia

Andrey Gomzin

ISP RAS
Russia

For citation:

Korshunov A., Gomzin A. Topic modeling in natural language texts. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2012;23. (In Russ.) https://doi.org/10.15514/ISPRAS-2012-23-13

Content is available under the Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
