Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Model and declarative specification language of binary data formats

https://doi.org/10.15514/ISPRAS-2021-33(6)-3

Abstract

Tasks involving binary data formats include parsing, generation, and joint analysis of code and data. A key element for all of these tasks is a universal data format model. This paper proposes an approach to modeling binary data formats. The described model is expressive enough to specify the most common data formats. Its distinctive features are flexibility in specifying field locations and the ability to describe external fields, which are not resolved into a detailed structure during parsing. The implemented infrastructure allows a model to be created and modified through application programming interfaces. An algorithm is proposed for parsing binary data against a model, based on the concept of field computability. The paper also presents a domain-specific language for data format specification. The formats specified so far are listed, and potential applications of the model to programmatic analysis of formatted data are indicated.
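
The abstract does not give the parsing algorithm itself, so the following is only a minimal sketch of one possible reading of "field computability": a field can be parsed once every field that its location and size depend on already has a value. The names `Field`, `parse`, `FORMAT`, and `sample`, and the toy three-field format, are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Values = Dict[str, int]


@dataclass
class Field:
    """Hypothetical field description; not the paper's model."""
    name: str
    deps: Tuple[str, ...]            # fields whose values must be known first
    offset: Callable[[Values], int]  # byte offset computed from parsed values
    size: Callable[[Values], int]    # byte size computed from parsed values


def parse(data: bytes, fields: List[Field]) -> Values:
    """Parse fields in whatever order their dependencies allow."""
    values: Values = {}
    pending = list(fields)
    while pending:
        # A field is treated as computable once all its dependencies have values.
        ready = [f for f in pending if all(d in values for d in f.deps)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable field dependencies")
        for f in ready:
            off, size = f.offset(values), f.size(values)
            if off < 0 or size <= 0 or off + size > len(data):
                raise ValueError(f"field {f.name!r} lies outside the data")
            # Assumption: every field is an unsigned little-endian integer.
            values[f.name] = int.from_bytes(data[off:off + size], "little")
            pending.remove(f)
    return values


# Toy format: two one-byte header fields give the location and size of "body".
# "body" is listed first to show that declaration order does not matter.
FORMAT = [
    Field("body", ("body_offset", "body_size"),
          lambda v: v["body_offset"], lambda v: v["body_size"]),
    Field("body_offset", (), lambda v: 0, lambda v: 1),
    Field("body_size", (), lambda v: 1, lambda v: 1),
]

sample = bytes([4, 2, 0xFF, 0xFF, 0x34, 0x12])
result = parse(sample, FORMAT)
```

In this sketch the location of `body` is not fixed in advance: it is read from the data, which is the kind of flexible field placement the abstract mentions. A real model would also need typed values, nested structures, and external fields, none of which are attempted here.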

About the Authors

Alexander Aleksandrovich EVGIN
Ivannikov Institute for System Programming of the Russian Academy of Sciences
Russian Federation

PhD Student



Mikhail Aleksandrovich SOLOVEV
Ivannikov Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Candidate of physical and mathematical sciences, senior researcher at the compiler technologies department of ISP RAS; senior lecturer at the system programming department of the faculty of Computational Mathematics and Cybernetics of MSU



Vartan Andronikovich PADARYAN
Ivannikov Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Candidate of physical and mathematical sciences, leading researcher at the compiler technologies department of ISP RAS; associate professor of the system programming department of the faculty of Computational Mathematics and Cybernetics of MSU




For citations:


EVGIN A.A., SOLOVEV M.A., PADARYAN V.A. Model and declarative specification language of binary data formats. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(6):27-50. (In Russ.) https://doi.org/10.15514/ISPRAS-2021-33(6)-3



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)