Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Detection of demographic attributes of microblog users

Abstract

Users of internet services often make errors or intentionally provide misleading information about their demographic attributes, including gender, age, marital status, education, and religious and political views. At the same time, knowing the values of these attributes makes it possible to improve the performance of recommender systems, internet marketing solutions, and other applications that rely on personalized results. This paper proposes a method for automatically detecting demographic attributes of Twitter users by analyzing their text messages and other data from their profiles. The method is based on a machine learning algorithm. Its distinctive features are the fully automatic compilation of training and test data sets and support for a broad, extensible range of languages and demographic attributes. An experimental study showed high accuracy of gender, age, and marital status detection for the most popular languages: English, Russian, German, French, Italian, and Spanish. Detection of education, religious views, and political views is additionally supported for English.
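
To illustrate the general idea of detecting an attribute from message text, here is a minimal sketch of a text classifier. It is not the authors' method: the abstract does not name the learning algorithm or the features, so the choice of multinomial Naive Bayes over character trigrams, the class name `NaiveBayesTextClassifier`, and the four toy messages with their labels are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing.

    Hypothetical illustration only; the paper's actual algorithm is not
    specified in the abstract.
    """

    def __init__(self, n=3):
        self.n = n
        self.class_counts = Counter()               # messages seen per label
        self.feature_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocabulary = set()                     # all n-grams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_counts[label] += 1
            self.feature_counts[label].update(grams)
            self.vocabulary.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text, self.n):
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data (assumption: not taken from the paper).
train_texts = [
    "my husband and I went hiking",
    "dinner with my wife tonight",
    "single and loving it",
    "looking for a date this weekend",
]
train_labels = ["married", "married", "single", "single"]

model = NaiveBayesTextClassifier(n=3).fit(train_texts, train_labels)
print(model.predict("my wife and I"))
print(model.predict("looking for a date"))
```

Character n-grams are used in the sketch because they need no language-specific tokenizer, which fits the multilingual setting the abstract describes. A real system would need far more data and a held-out test set to estimate accuracy.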

About the Authors

Anton Korshunov
ISP RAS, Moscow
Russian Federation


Ivan Beloborodov
ISP RAS, Moscow
Russian Federation


Andrey Gomzin
ISP RAS, Moscow
Russian Federation


Christina Chuprina
ISP RAS, Moscow
Russian Federation


Nikita Astrakhantsev
ISP RAS, Moscow
Russian Federation


Yaroslav Nedumod
ISP RAS, Moscow
Russian Federation


Denis Turdakov
ISP RAS, Moscow
Russian Federation


For citations:


Korshunov A., Beloborodov I., Gomzin A., Chuprina Ch., Astrakhantsev N., Nedumod Ya., Turdakov D. Detection of demographic attributes of microblog users. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2013;25:179-194. (In Russ.)



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)