Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Two Step Method for Grouping News with Similar Topics

https://doi.org/10.15514/ISPRAS-2020-32(4)-12

Abstract

The volume of news has grown rapidly in recent years, and readers cannot process it effectively. For this reason, automatic methods of news stream analysis have become an important research area. This paper addresses the part of news stream analysis known as “event detection”, where an “event” is a group of news items devoted to one real-world event. We study news from Russian news agencies. We treat the task as clustering of news items and compare algorithms using external clustering metrics. The paper introduces a novel approach to detecting events in Russian-language news. We propose a two-stage clustering method that comprises a “rough” clustering algorithm at the first stage and a refining classifier at the second stage. The first stage combines the shingles method with naive clustering based on named entities. We also present a labeled dataset for news event detection based on the «Yandex News» service. This manually labeled dataset can be used to evaluate the performance of event detection methods. Empirical evaluation on these corpora demonstrated the effectiveness of the proposed method for event detection in news texts.
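The shingle-based part of the first stage can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the shingle size, the similarity measure, the threshold, or how the shingle step is combined with named entities. The word-level shingles of length 3, the Jaccard similarity, the 0.3 threshold, the union-find grouping, and all function names below are therefore assumptions chosen only for illustration.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        # Texts shorter than k words yield one shingle (or none if empty).
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rough_clusters(texts, k=3, threshold=0.3):
    """Group texts into rough clusters: any pair whose shingle sets reach the
    (assumed) similarity threshold is merged, using a simple union-find."""
    sets = [shingles(t, k) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(texts)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    # Each cluster is a list of indices into `texts`.
    return sorted(groups.values())
```

Calling `rough_clusters` on a list of news texts returns lists of indices, one list per rough cluster. In a two-stage setup such as the one the abstract describes, these rough clusters would then be passed to a second-stage classifier that decides which items truly belong to the same event. That stage is not sketched here because the abstract gives no details about it.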

About the Authors

Kirill Andreevich SKORNYAKOV
Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow Institute of Physics and Technology
Russian Federation
postgraduate student


Anna Sergeevna LASKINA
Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow Institute of Physics and Technology
Russian Federation
Master's student


Denis Yurievich TURDAKOV
Ivannikov Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation
Ph.D., head of the "Information Systems" Department at ISP RAS, associate professor at MSU


For citations:


SKORNYAKOV K.A., LASKINA A.S., TURDAKOV D.Yu. Two Step Method for Grouping News with Similar Topics. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(4):165-174. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(4)-12



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)