Methodology for Collecting a Training Dataset for an Intrusion Detection Model

Aleksandr Igorevich GETMAN; Maxim Nikolaevich GORYUNOV; Andrey Georgievich MATSKEVICH; Dmitry Aleksandrovich RYBOLOVLEV

doi:10.15514/ISPRAS-2021-33(5)-5

Abstract

The paper addresses the training of machine learning models for detecting computer attacks. It first analyzes publicly available training datasets, then the tools used to analyze network traffic and extract features of network sessions. The drawbacks of existing tools, and the possible errors in the datasets built with them, are noted. The paper concludes that one's own training data must be collected, for two reasons: the reliability of public datasets is not guaranteed, and pre-trained models are of limited use in networks whose characteristics differ from those of the network where the training traffic was collected. A practical approach to generating training data for computer attack detection models is proposed. The proposed solutions were tested to evaluate both the quality of model training on the collected data and the quality of attack detection in a real network infrastructure.

Keywords

information security, network intrusion detection system, machine learning, dataset, transfer learning, random forest, network traffic, computer attack

About the Authors

Aleksandr Igorevich GETMAN

Ivannikov Institute for System Programming of the Russian Academy of Sciences, HSE University
Russian Federation

PhD in physical and mathematical sciences, senior researcher at ISP RAS, associate professor at HSE

Maxim Nikolaevich GORYUNOV

The Academy of Federal Security Guard Service of the Russian Federation
Russian Federation

Ph.D.

Andrey Georgievich MATSKEVICH

The Academy of Federal Security Guard Service of the Russian Federation
Russian Federation

Ph.D., associate professor

Dmitry Aleksandrovich RYBOLOVLEV

The Academy of Federal Security Guard Service of the Russian Federation
Russian Federation

Ph.D.

For citations:

GETMAN A.I., GORYUNOV M.N., MATSKEVICH A.G., RYBOLOVLEV D.A. Methodology for Collecting a Training Dataset for an Intrusion Detection Model. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(5):83-104. (In Russ.) https://doi.org/10.15514/ISPRAS-2021-33(5)-5

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
