Applying Time Series to The Task of Background User Identification Based on Their Text Data Analysis

V. Y. Korolev; A. Y. Korchagin; I. V. Mashechkin; M. I. Petrovskiy; D. V. Tsarev

doi:10.15514/ISPRAS-2015-27(1)-8

Abstract

This paper presents a novel approach to user identification based on behavioral analysis of how users work with text information. User behavior is described by the content of the user's text documents. A structured representation of this behavioral information is obtained by mapping the text of each document into the user's topic space, which is built by nonnegative matrix factorization. The topic weights of a document characterize the user's topical focus during the time spent working with that document. The variation of the topic weights over time forms a multidimensional time series that describes the history of the user's behavior when working with text data. Forecasting this time series makes it possible to identify the user by estimating how far the observed topic weights deviate from the predicted ones. The paper also presents a new time series forecasting method based on orthogonal nonnegative matrix factorization (ONMF), which is used within the proposed identification approach. Notably, nonnegative matrix factorization methods have not previously been used for time series forecasting. The proposed approach was verified experimentally on real corporate email correspondence drawn from the Enron dataset. In addition, experiments with other currently popular forecasting methods showed that the proposed method gives better quality when classifying a user's topic weights. Two ways of estimating the deviation of a time series point from its predicted value were also investigated: absolute deviation and p-value estimation. The experiments showed that both are applicable within the proposed identification approach.
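The deviation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: a simple moving average stands in for the ONMF-based forecaster, the p-value is a plain empirical estimate over earlier deviation scores, and all function names, the window size, and the sample topic weights are assumptions made for illustration.

```python
from statistics import mean


def moving_average_forecast(history, window=3):
    """Predict the next topic-weight vector as the mean of the last `window` vectors.

    Stand-in forecaster (assumption); the paper uses an ONMF-based method instead.
    """
    recent = history[-window:]
    return [mean(column) for column in zip(*recent)]


def absolute_deviation(observed, predicted):
    """Sum of absolute differences between observed and predicted topic weights."""
    return sum(abs(o - p) for o, p in zip(observed, predicted))


def empirical_p_value(score, past_scores):
    """Smoothed share of past deviation scores that are at least as large as `score`.

    Illustrative estimator (assumption): (1 + count) / (1 + n).
    """
    at_least_as_large = sum(1 for s in past_scores if s >= score)
    return (1 + at_least_as_large) / (1 + len(past_scores))


def score_series(series, window=3):
    """Walk forward through the series and score each point against its forecast.

    Returns a list of (absolute deviation, empirical p-value) pairs, one for
    every point from index `window` onward.
    """
    results = []
    past_scores = []
    for t in range(window, len(series)):
        predicted = moving_average_forecast(series[:t], window)
        deviation = absolute_deviation(series[t], predicted)
        results.append((deviation, empirical_p_value(deviation, past_scores)))
        past_scores.append(deviation)
    return results


# Hypothetical topic weights for one user over seven documents (two topics).
# The last document is dominated by a topic the user rarely writes about.
series = [
    [0.8, 0.2], [0.7, 0.3], [0.9, 0.1], [0.8, 0.2],
    [0.7, 0.3], [0.8, 0.2], [0.1, 0.9],
]

for deviation, p_value in score_series(series):
    print(f"deviation={deviation:.3f}  p-value={p_value:.3f}")
```

A large absolute deviation, or equivalently a small p-value, marks a document whose topic weights do not match the forecast for the legitimate user. Either score can then be compared with a threshold, which corresponds to the two deviation estimates mentioned in the abstract.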

Keywords

computer security, user identification, topic modeling, orthogonal nonnegative matrix factorization, time series forecasting.

About the Authors

V. Y. Korolev

Lomonosov Moscow State University, Moscow
Russian Federation

Department of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow, 119991, Russia.

A. Y. Korchagin

Lomonosov Moscow State University, Moscow
Russian Federation
Department of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow, 119991, Russia.

I. V. Mashechkin

Lomonosov Moscow State University, Moscow
Russian Federation
Department of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow, 119991, Russia.

M. I. Petrovskiy

Lomonosov Moscow State University, Moscow
Russian Federation
Department of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow, 119991, Russia.

D. V. Tsarev

Lomonosov Moscow State University, Moscow
Russian Federation
Department of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, GSP-1, Leninskie Gory, Moscow, 119991, Russia.

For citations:

Korolev V.Y., Korchagin A.Y., Mashechkin I.V., Petrovskiy M.I., Tsarev D.V. Applying Time Series to The Task of Background User Identification Based on Their Text Data Analysis. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2015;27(1):151-172. (In Russ.) https://doi.org/10.15514/ISPRAS-2015-27(1)-8

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
