Methodology for Collecting a Training Dataset for an Intrusion Detection Model
https://doi.org/10.15514/ISPRAS-2021-33(5)-5
Abstract
The paper addresses the training of machine-learning models for detecting computer attacks. It presents, in turn, an analysis of publicly available training datasets and of tools for analyzing network traffic and extracting features of network sessions. The drawbacks of existing tools, and the errors they may introduce into datasets built with them, are noted. The authors conclude that it is necessary to collect one's own training data, because the reliability of public datasets is not guaranteed and pre-trained models are of limited use in networks whose characteristics differ from those of the network in which the training traffic was collected. A practical approach to generating training data for computer attack detection models is proposed. The proposed solutions have been tested to evaluate both the quality of model training on the collected data and the quality of attack detection in a real network infrastructure.
About the Authors
Aleksandr Igorevich GETMAN
Russian Federation
PhD in physical and mathematical sciences, senior researcher at ISP RAS, associate professor at HSE
Maxim Nikolaevich GORYUNOV
Russian Federation
Ph.D.
Andrey Georgievich MATSKEVICH
Russian Federation
Ph.D., associate professor
Dmitry Aleksandrovich RYBOLOVLEV
Russian Federation
Ph.D.
For citations:
GETMAN A.I., GORYUNOV M.N., MATSKEVICH A.G., RYBOLOVLEV D.A. Methodology for Collecting a Training Dataset for an Intrusion Detection Model. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(5):83-104. (In Russ.) https://doi.org/10.15514/ISPRAS-2021-33(5)-5