A survey and an experimental comparison of methods for text clustering

P. A. Parhomenko; A. A. Grigorev; N. A. Astrakhantsev

doi:10.15514/ISPRAS-2017-29(2)-6

Abstract

Clustering of text documents is used in many applications, such as information retrieval, exploratory search, and spam detection. Many papers address this task; however, the effect of the specifics of scientific articles on clustering quality remains insufficiently studied, in particular the case where documents belong to a single subject domain or where full texts are unavailable. This paper presents a survey and an experimental comparison of text document clustering methods as applied to scientific articles. It examines methods based on bag of words, terminology extraction, topic modeling, and vector representations of words (word embeddings) and documents obtained with artificial neural networks (word2vec, paragraph2vec).

Keywords

text document clustering, bag of words, terminology extraction, topic modeling, vector representation, artificial neural networks

About the authors

P. A. Parhomenko

Institute for System Programming of the RAS; Lomonosov Moscow State University
Russia

A. A. Grigorev

Institute for System Programming of the RAS; National Research University Higher School of Economics
Russia

N. A. Astrakhantsev

Institute for System Programming of the RAS
Russia

For citation:

Parhomenko P.A., Grigorev A.A., Astrakhantsev N.A. A survey and an experimental comparison of methods for text clustering: application to scientific articles. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2017;29(2):161-200. (In Russ.) https://doi.org/10.15514/ISPRAS-2017-29(2)-6

Content is available under the Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
