Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Methodology for Collecting a Training Dataset for an Intrusion Detection Model

https://doi.org/10.15514/ISPRAS-2021-33(5)-5

Abstract

The paper addresses the training of machine learning models for detecting computer attacks. It reviews, in turn, publicly available training datasets and the tools used to analyze network traffic and extract features of network sessions, noting the drawbacks of existing tools and the errors they can introduce into datasets built with them. The authors conclude that collecting one's own training data is necessary, because the reliability of public datasets is not guaranteed and pre-trained models are of limited use in networks whose characteristics differ from those of the network where the training traffic was collected. A practical approach to generating training data for computer attack detection models is proposed. The proposed solutions were tested to assess both the quality of model training on the collected data and the quality of attack detection in a real network infrastructure.

About the Authors

Aleksandr Igorevich GETMAN
Ivannikov Institute for System Programming of the Russian Academy of Sciences, HSE University
Russian Federation

PhD in physical and mathematical sciences, senior researcher at ISP RAS, associate professor at HSE



Maxim Nikolaevich GORYUNOV
The Academy of Federal Security Guard Service of the Russian Federation
Russian Federation

Ph.D. 



Andrey Georgievich MATSKEVICH
The Academy of Federal Security Guard Service of the Russian Federation
Russian Federation

Ph.D., associate professor



Dmitry Aleksandrovich RYBOLOVLEV
The Academy of Federal Security Guard Service of the Russian Federation
Russian Federation

Ph.D.




For citations:


GETMAN A.I., GORYUNOV M.N., MATSKEVICH A.G., RYBOLOVLEV D.A. Methodology for Collecting a Training Dataset for an Intrusion Detection Model. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(5):83-104. (In Russ.) https://doi.org/10.15514/ISPRAS-2021-33(5)-5



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)