Detecting Content Spam on the Web through Text Diversity Analysis
Abstract
About the Authors
Anton S. PavlovRussian Federation
Boris V. Dobrov
Russian Federation
References
1. Z. Gyongyi and H. Garcia-Molina. Web Spam Taxonomy. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
2. C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, S. Vigna, A reference collection for web spam, ACM SIGIR Forum, v.40 n.2, p.11-24, December 2006.
3. M. Henzinger, R. Motwani, C. Silverstein. Challenges in Web Search Engines. SIGIR Forum 36(2), 2002.
4. Web Spam Challenge. http://webspam.lip6.fr/wiki/pmwiki.php, 2008.
5. A. Ntoulas , M. Najork , M. Manasse , D. Fetterly, Detecting spam web pages through content analysis, Proceedings of the 15th international conference on World Wide Web, May 23-26, 2006, Edinburgh, Scotland.
6. J. Piskorski , M. Sydow , D. Weiss, Exploring linguistic features for web spam detection: a preliminary study, Proceedings of the 4th international workshop on Adversarial information retrieval on the web, April 22, 2008, Beijing, China.
7. D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993–1022, 2003.
8. I. Biro, J. Szabo, A. A. Benczur, Latent Dirichlet allocation in web spam filtering, Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, April 22, 2008, Beijing, China.
9. I. Biro, D. Siklosi, J. Szabo, A. A. Benczur, Linked latent Dirichlet allocation in web spam filtering, Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, April 21-21, 2009, Madrid, Spain.
10. Z. Gyongyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. In 30th International Conference on Very Large Data Bases, Aug. 2004.
11. B. Wu, B. D. Davison. Identifying link farm spam pages. Special interest tracks and posters of the 14th international conference on World Wide Web - WWW ’05. 2005.
12. J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
13. D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
14. H. Dang. Overview of DUC 2006. Proceedings of the Document Understanding. 2006.
15. Yahoo! Research: "Web Spam Collections". http://barcelona.research.yahoo.net/webspam/datasets/ Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/. URLs retrieved May 2007.
16. A. Bratko, G. V. Cormack, B. Filipic, T. R. Lynam, and B. Zupan. Spam filtering using statistical data compression models. Journal of Machine Learning Research, 7(Dec):2673–2698, 2006.
17. G. Zipf, Selective Studies and the Principle of Relative Frequency in Language (Cambridge, Mass, 1932).
18. C. Andrieu, N. de Freitas, A. Doucet, M. Jordan, An introduction to MCMC for machine learning. Machine Learning, 50: 5–43, 2003.
19. X.-H. Phan, C.-T. Nguyen, Gibbs LDA++: A C/C++ Implementation of Latent Dirichlet Allocation (LDA) using Gibbs Sampling for Parameter Estimation and Inference. http://gibbslda.sourceforge.net/, 2008.
20. G. Geng, X. Jin, C.-H. Wang. CASIA at Web Spam Challenge 2008 Track III. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
21. N. Dai, B.D. Davison, X. Qi. Looking into the past to better classify web spam. Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web - AIRWeb ’09. 2009.
Review
For citations:
Pavlov A.S., Dobrov B.V. Detecting Content Spam on the Web through Text Diversity Analysis. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2011;21. (In Russ.)