C# parser for extracting cryptographic protocols structure from source code

Abstract. Cryptographic protocols are the core of any secure system. They allow data to be transmitted securely and protect it from interference by third parties. As a rule, a cryptographic protocol is designed, analyzed by means of formal verification and, if it is found to be secure, implemented in the programming language in which the system is developed. In a practical implementation, however, errors may arise from the human factor and from the assumptions needed to make the protocol implementable, and these can undermine its security. As a result, a protocol that was initially considered secure may have an implementation that is in fact insecure. In addition, formal verification operates on rather abstract concepts and does not allow the protocol to be analyzed in full. This paper presents an algorithm that analyzes C# source code to extract the structure of cryptographic protocols. The features of implementing protocols in practice are described. The algorithm searches for important code sections that contain constructions specific to cryptographic protocols and traces the chain of variable transformations from the point where a message is sent or received back to the initial initialization, taking possible cryptographic transformations into account. From these chains it composes a tree, from which a simplified structure of the cryptographic protocol is extracted. The algorithm is implemented in C# using the Roslyn parser. As an example, a cryptographic protocol is presented that contains the basic operations and functions: asymmetric and symmetric encryption, hashing, signature, random number generation, and data concatenation. The operation of the analyzer is shown on this protocol. Future work is described.


Introduction
The problem of verifying the security of cryptographic protocols remains relevant despite the large number of protocols that have already been verified. The need for custom protocols that use lightweight cryptography for IoT devices and mobile robots, together with the imperfection of formal protocol verification, poses a new challenge for verification methods: verifying the security of cryptographic protocol implementations. Nearly all protocols are changed and extended during implementation, and the initial analysis, for example by formal verification, does not take these changes into account. The source code can also contain programming mistakes and logic flaws. Protocols therefore need to be verified at the last stage of development, at the implementation level, so that more attacks can be found and the system can be made more secure. This makes the present work relevant. The primary task is to extract the structure of the protocol from the source code. Existing works consider extracting an abstract model from source code written in C [1][2][3], Java [4][5][6], and F# [7][8][9][10][11][12]. Most of them require a special programming style or additional annotations in the source code. This paper proposes to analyze C# source code. There are no other works in which the code analysis is carried out without annotations or a special programming style.

Cryptographic protocols
A cryptographic protocol is a set of cryptographic algorithms and functions whose correct combination yields a secure process of transferring messages between the parties. Protocol security is defined as compliance with security requirements, the main ones being mutual authentication of the parties, protection against time-based attacks such as replay attacks, and the privacy and integrity of the transmitted data. Below is an example of a test protocol that has no special meaning but contains all the basic cryptographic algorithms and functions: asymmetric and symmetric encryption, hashing, signature, and random number generation.
[Protocol listing, messages 1-6: not recoverable from the source text.]
Messages 1-3 of this protocol use the Needham-Schroeder public key protocol (NSPK) [13] for mutual authentication of the parties. In message 3, in addition to the random number, a key is transmitted for further communication between the parties using a symmetric cipher. In message 4, data 2, asymmetrically encrypted with the party's public key, is transmitted together with data 1. All of this is encrypted symmetrically with the key, after which the hash of data 1 is appended. In message 5, the sending party appends its data 3 to the previously sent data 1 and data 2, encrypts all of this symmetrically with the key, applies a signature, and sends the message to the other party. In message 6, data 3 is sent encrypted symmetrically with the key.

Features of the cryptographic protocols implementation
Implementing cryptographic protocols raises a number of problems. One of them is the dynamic size of messages. In a programming language, message transfer between the parties is implemented with sockets, and the receiving party must know in advance the size of the buffer to receive. In the protocol described in the previous section, for example, the first three messages contain random numbers and party identifiers of fixed length. This case is simple: the receiving party expects a message of a previously calculated static length. However, messages 4-6 contain data 1, 2, and 3, which may have different lengths. For example, data 1 in message 4 can be a video file whose length varies from 1 MB to several GB. The question is how to tell the receiving party the size of the receive buffer. There are various options, for example adding information about the length to the beginning of the message or putting a marker at the end of the message.
Let us consider the length-prefix option in more detail. It places additional data before the main message that contains the size of the message to follow. An example of a message with additional size information is shown in fig. 1. The receiving party first receives a fixed-size array of bytes that contains the size of the message and then, knowing the length in advance, receives the rest of the message in a second step. Since Message is usually encrypted and its transmission is protected within the protocol, the question arises of how to protect the information in Buffer size. All security requirements except secrecy matter here. To meet them, one can, for example, sign this area together with a timestamp.
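The length-prefix option can be sketched as follows. This is a minimal illustration in Python rather than the C# of the implementation, and everything in it is an assumption made for the example: the header layout (4-byte length plus 8-byte timestamp), the 30-second freshness window, the function names, and the use of HMAC-SHA256 with a shared key as a stand-in for the signature mentioned above.

```python
import hashlib
import hmac
import struct

HEADER = struct.Struct("!IQ")   # assumed layout: 4-byte body length, 8-byte timestamp
TAG_LEN = 32                    # size of an HMAC-SHA256 tag
MAX_AGE = 30                    # assumed freshness window, in seconds


def frame(body: bytes, key: bytes, now: int) -> bytes:
    """Prefix body with an authenticated (length, timestamp) header."""
    header = HEADER.pack(len(body), now)
    tag = hmac.new(key, header, hashlib.sha256).digest()
    return header + tag + body


def read_exact(read, n: int) -> bytes:
    """Call read(k) repeatedly until exactly n bytes have been collected."""
    chunks = []
    while n > 0:
        chunk = read(n)
        if not chunk:
            raise ConnectionError("stream closed before the message was complete")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def unframe(read, key: bytes, now: int) -> bytes:
    """Read the fixed-size header, verify it, then read exactly the body."""
    header = read_exact(read, HEADER.size)
    tag = read_exact(read, TAG_LEN)
    expected = hmac.new(key, header, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("buffer-size header failed authentication")
    length, stamp = HEADER.unpack(header)
    if abs(now - stamp) > MAX_AGE:
        raise ValueError("stale header (possible replay)")
    return read_exact(read, length)
```

Here `read` is any callable that returns up to `n` bytes, such as the `recv` method of a socket. Because the receiver reads exactly the announced number of bytes, anything that follows the body is left in the stream instead of being merged into the message.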
Thus, when the protocol is implemented, the transmission of, for example, message 4 takes the following form: [expression not recoverable from the source text].
Another way is to receive data into a fixed-length buffer repeatedly until no more data arrives. Problems can arise in this case as well, as shown in fig. 2.

[Diagram labels: Message part 1, Message part 2, Intruder's part; Receive in buffer 1, Receive in buffer 2.]
Fig. 2. Intruder's attack by adding to the real data
As a result, the received message is longer than it should be. In implementations where further processing by the receiving party depends on the message length, some data may be silently corrupted when the message is decrypted and split into its elements (random numbers, keys, etc.). To avoid this, various methods of controlling the message length are also used.
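The effect can be illustrated with a toy model in Python. The message layout (an 8-byte nonce followed by a 16-byte key), the function names, and the parsing strategy are assumptions chosen for the example, not taken from the implementation discussed in the paper.

```python
NONCE_LEN = 8    # assumed field sizes for a toy message: nonce || key
KEY_LEN = 16


def receive_until_empty(chunks):
    """Model of reading into a fixed buffer until no more data arrives."""
    return b"".join(chunks)


def split_from_end(message: bytes):
    """Length-dependent parsing: the key is taken as the last KEY_LEN bytes."""
    return message[:-KEY_LEN], message[-KEY_LEN:]


def split_checked(message: bytes):
    """The same split, but the total length is verified first."""
    if len(message) != NONCE_LEN + KEY_LEN:
        raise ValueError("unexpected message length")
    return message[:NONCE_LEN], message[NONCE_LEN:]


nonce, key = b"N" * NONCE_LEN, b"K" * KEY_LEN
honest = [nonce + key[:4], key[4:]]      # the message arrives in two parts
attacked = honest + [b"XXXX"]            # the intruder appends four bytes

# Without a length check the split succeeds, but the key is silently wrong.
_, corrupted_key = split_from_end(receive_until_empty(attacked))
```

With the honest chunks both functions return the original nonce and key. With the appended bytes, `split_from_end` returns a shifted key without any error, while `split_checked` rejects the message.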

Source code analysis algorithm
To describe the operation of the algorithm, the protocol considered above was implemented in the C# programming language as a client-server application.
[Protocol listing: not recoverable from the source text.]
The analysis algorithm uses Roslyn, the C# source code parser [14]. It provides the tree structure of the source code and allows filters to be applied to it. We need the following filters:
1) InvocationExpressionSyntax: call expressions;
2) VariableDeclarationSyntax: variable declarations;
3) AssignmentExpressionSyntax: assignment expressions;
4) IfStatementSyntax: conditional statements.
A filter returns the desired expression, whose tree structure can then be examined. For example, «AssignmentExpressionSyntax» finds the expression «… = ….Encrypt(…, …)». The derived linear tree structure of the expression is shown in fig. 3. The main purpose of using the parser is to find the transition from one variable to another, here the transition from the variable on the left-hand side to the variable being encrypted. This is done by searching for nodes of type «IdentifierName» and applying a black list of expressions. In this example, the call of the «Encrypt» method and the previously declared object of the asymmetric encryption class «RSA» are on the black list. Of the remaining identifiers, the first is the variable to which the value is assigned, and the rest, which are lower in the tree and not on the black list, form the new value being assigned.
The algorithm is based on identifying important code sections that contain constructs specific to cryptographic protocols. Ultimately, the task is to find the chains of variable transformations from the point where a message is sent or received (socket send/receive) back to the initial initialization (static initialization, loading from a file, etc.), taking possible cryptographic transformations (hashing, encryption, etc.) into account. While a chain is being built, a tree is constructed whose nodes are variables with additional information about them, including data type definitions for the final leaves and cryptographic algorithms in the inner nodes. A tree can describe all chains of data transformations: since the data in a message is combined in various ways, the chains can branch and join heavily.
Below is a fragment of the source code that implements a part of the cryptographic protocol (messages 1-3) for participant A. [Source code listing not recoverable from the source text.] To find the variable that holds the Socket class object, the calls that send and receive messages are searched for. In this case, there are 3 such constructions. At this stage, an interaction scheme of the following form can be constructed: [scheme not recoverable from the source text].
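The search for socket calls and for variable transitions can be sketched roughly as follows. This is not the Roslyn-based implementation: it is a Python illustration that replaces the syntax tree with regular expressions over C# statement strings. The black list contents, the function names, and the variable names M2enc and M3enc are assumptions made for the example.

```python
import re

# Assumed black list: API names and literals that are not protocol variables.
BLACK_LIST = {"rsa", "Encrypt", "true", "false", "new", "byte"}
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")
SOCKET_CALL = re.compile(r"(\w+)\.(Send|Receive)\s*\(\s*(\w+)")


def socket_calls(source: str):
    """Return (socket variable, direction, message variable) for each call."""
    return SOCKET_CALL.findall(source)


def transition(assignment: str):
    """Return (target, sources) for a simple 'target = expression;' statement."""
    left, _, right = assignment.partition("=")
    target = IDENTIFIER.search(left).group()
    sources = [name for name in IDENTIFIER.findall(right)
               if name not in BLACK_LIST]
    return target, sources


calls = socket_calls("socA.Send(M1enc); socA.Receive(M2enc); socA.Send(M3enc);")
step = transition("M1enc = rsa.Encrypt(M1, true);")
```

For the assignment above, `step` is the pair of the assigned variable `M1enc` and the list containing only `M1`, which is the transition the analysis is looking for. A real syntax tree is more robust than this text matching, for example against comments and string literals.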

To determine the structure of a message, a tree is built whose nodes contain variables with additional information. Consider how the content of the first message is determined. The algorithm proceeds as follows:
1. The expression of the first message, socA.Send (M1enc), is taken as the root of the tree. The contents of the variable M1enc must now be determined.
2. First, the declaration of M1enc is looked up with the VariableDeclarationSyntax filter. In our case the variable is declared but not initialized (line 23), so the AssignmentExpressionSyntax filter is used, which finds the assignment to the variable in line 29. M1enc is added as a child node with the «var» tag, meaning that it is just a variable.
3. The simplest case of assignment is when the value of one variable is assigned to another. Here the situation is more complicated: M1enc is assigned the result of the Encrypt method of an object of the asymmetric encryption class RSACryptoServiceProvider, which takes two parameters: the data to encrypt and a flag indicating whether to use Optimal Asymmetric Encryption Padding (OAEP). At this stage we record that the content of the variable M1 was asymmetrically encrypted and assigned to the variable that is sent as message 1. In the tree, this is represented by adding a child node M1 with the note «AsymENC», meaning that the value of M1 is encrypted with an asymmetric cipher.
4. As in step 2, we look for the initialization of M1. The VariableDeclarationSyntax filter shows that the variable is a one-dimensional array (line 17). The AssignmentExpressionSyntax filter then finds the assignments of values to the array, in lines 19 and 20. Two children, Na and A, with the «var» mark are added to node M1.
5. For the variable A, the final value is found with the VariableDeclarationSyntax filter (line 11), where static initialization occurs in the source code. A person easily sees that this is the initial value, but automated detection must establish that the right-hand side is not a variable. One way to solve this is to search again on the right-hand side of the expression; since no further assignment constructs are found in the code, the value is final. In the tree, the initialization leaf «new byte [] {132, 114};» marked «DATA» is added to node A, meaning that the variable A holds some semantic data.
6. For the variable Na, the search continues. Using the filters, we look for the declaration of the array and its initialization. The declaration is in line 14, and the initialization is in line 15, by calling a method of the rng variable, which is an object of the random number class RNGCryptoServiceProvider. The value of this variable is therefore defined as a random number. The last leaf, «rng.GetBytes (NaPrev);», marked «RANDOM», is added to the tree, meaning that a random number is generated.
7. Further search for initializations of the current leaves yields nothing, so the tree structure is considered final. The output tree is shown in fig. 4.
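The tree built in steps 1-7 and its reduction to a short message structure can be sketched as follows. This is a Python illustration under stated assumptions: the `FACTS` table stands in for the results of the Roslyn filter searches, and the names `Node`, `build`, and `simplify` as well as the rule for collapsing «var» nodes are hypothetical choices for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    tag: str                                  # "root", "var", "AsymENC", "DATA", "RANDOM"
    children: list = field(default_factory=list)


# Assumed search results for message 1: parent -> [(child, tag), ...].
FACTS = {
    "socA.Send(M1enc)": [("M1enc", "var")],
    "M1enc": [("M1", "AsymENC")],
    "M1": [("Na", "var"), ("A", "var")],
    "A": [("new byte[] {132, 114};", "DATA")],
    "Na": [("rng.GetBytes(NaPrev);", "RANDOM")],
}


def build(name: str, tag: str = "root") -> Node:
    """Expand a node until no further initialization is found."""
    node = Node(name, tag)
    for child_name, child_tag in FACTS.get(name, []):
        node.children.append(build(child_name, child_tag))
    return node


def simplify(node: Node) -> str:
    """Collapse plain variables and keep only operations and leaf types."""
    if not node.children:
        return node.tag
    inner = ",".join(simplify(child) for child in node.children)
    if node.tag in ("root", "var"):
        return inner
    return f"{node.tag}({inner})"


tree = build("socA.Send(M1enc)")
structure = simplify(tree)
```

For the facts above, `structure` is the string `AsymENC(RANDOM,DATA)`: a random number and some semantic data, concatenated and encrypted with an asymmetric cipher.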

Return data problem
At the moment, determining returned data is a problem. For example, in message 1 a random number Na is sent, and in the second message it is sent back. By default, there are currently two data concepts: DATA and RANDOM. Everything that is not a random number is considered semantic data, for example keys, identifiers, and transferred files. At this stage, all values are considered different. For example, for the following protocol: [protocol listing not recoverable from the source text] the result is as follows: [output listing not recoverable from the source text]. In our context, the DATA in the first message is by default different from the DATA in the second message. If the protocol takes the following form: [protocol listing not recoverable from the source text] a problem arises.
Na simply comes back, and on the receiving side we need to understand that this is the same data. For example, when processing message 2 (lines 34-58), we can trace the separated parts. In line 50, the value of the random number Na is obtained, after which it is checked against what was sent in line 52. In cryptographic protocols, returned values are most often used for mutual authentication. There are two kinds: the return of the same number or the return of a function of this number. In both cases, the returned value is checked for a match with the one sent earlier. In our case, this is line 53. However, another value is also checked here: the identifier B. One solution is therefore to look for the situation in which a variable was sent and a value is later checked for a match with this variable. In that case, one can assume that the value is a returned one. However, a number of problems remain, in particular a coding error or simply the absence of such a check of the returned value. At the moment, an abstract variable type RETURN is used, which means that a variable of this type was returned in the current message.
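The heuristic described above (a value counts as returned when it is compared with a variable that was sent earlier) can be sketched as follows. This is a Python illustration only: the event format, the `UNKNOWN` placeholder tag, and the variable names `NaBack`, `Nb`, and `Bid` are assumptions made for the example.

```python
def mark_returns(events):
    """Tag received values that are compared with a previously sent variable.

    events is a list of tuples in program order:
      ("send", [names]), ("recv", [names]), ("compare", received, other)
    """
    sent = set()
    tags = {}
    for event in events:
        kind = event[0]
        if kind == "send":
            sent.update(event[1])
        elif kind == "recv":
            for name in event[1]:
                tags.setdefault(name, "UNKNOWN")
        elif kind == "compare":
            _, received, other = event
            if received in tags and other in sent:
                tags[received] = "RETURN"
    return tags


events = [
    ("send", ["Na", "A"]),               # message 1
    ("recv", ["NaBack", "Nb", "Bid"]),   # parts of message 2 after decryption
    ("compare", "NaBack", "Na"),         # check against the value sent earlier
    ("compare", "Bid", "B"),             # B was never sent, so this is no return
]
tags = mark_returns(events)
```

Only `NaBack` is tagged `RETURN`. The check of the identifier against `B` does not qualify because `B` was not sent earlier, which mirrors the case of line 53 discussed above. As noted in the text, the heuristic fails when the check is missing or wrong in the code.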

Protocol output structure
Using the algorithm presented in the preceding sections, the complete output structure of the protocol is built message by message. It is produced both in a short form for formal verification and in a full form for dynamic verification. The full form contains the last variable before it is passed to a cryptographic function, the names of the final variables, and their initial initialization, for example static initialization in the code or loading data from a file. Dynamic analysis will be considered in further work, so the contents of the full form may change.

Experiments
To test the parser on a real project, we took our previous project, an e-voting system based on blinded intermediaries [15], which is implemented in C#. It consists of 3 main components: Voter application, Authentication server, and Voting server. The protocol of the main voting stage is: [protocol listing not recoverable from the source text].
Before the protocol starts, the session keys vas, vvs, asvs are generated with the ECDHE protocol (the Diffie-Hellman protocol on elliptic curves using ephemeral keys and signing of the secret parts), so the session keys already exist when the main voting protocol begins. Nb is a blinding number, a non-semantic random number that is regenerated each time. It is introduced to add some data before the semantic random number in order to make exhaustive search more difficult (in particular, two encryption keys must be found for message 7 in order to recover userData). Randomly generated numbers are sent to authenticate the parties, as shown in (1)-(3). Message (4) uses the principle of blinded intermediaries. The voter encrypts his vote filledBallot with the session key shared with VS, attaches his personal data to the ciphertext, and encrypts the result with the session key shared with AS. AS hashes the received personal data, searches for the hash in the database and, if it is found, redirects the message to the VS component. VS stores the vote, generates a checkID with which the user can check his vote after the end of the election, and sends it to the user.
The code organization of the cryptographic protocols in this project is simple. Message sending and receiving are located within a single method body, so there is no complex code structure. We ran our parser on this project and obtained the following result: [output listing not recoverable from the source text].
As the output shows, the structure of the cryptographic protocol was extracted correctly. Note that in message 4 A receives «SymENC(RETURN,RANDOM,DATA)», but in message 5 sends it on as «RETURN». Side A does not know the decryption key, so for A this is simply some data that was sent to it and that it forwards to another side; hence there is 1 «RETURN» element instead of 3.
Pisarev I.A., Babenko L.K. C# parser for extracting cryptographic protocols structure from source code. Trudy ISP RAN/Proc. ISP RAS, vol. 31, issue 3, 2019. pp. 191-202

Future work
Future work primarily includes segmenting the DATA semantic data into classes: 1) party identifiers; 2) keys; 3) timestamps; 4) authentication codes; 5) data received from the user. It is also important to determine which party owns a key in the case of asymmetric encryption, and which list of parties shares it in the case of symmetric encryption. Support for protocols involving more than two parties will also be needed. In addition, a complete solution to the problem of accurately determining returned data is necessary in order to build a complete structure of a cryptographic protocol and analyze it further with formal verification tools. Once the structure of the cryptographic protocol is obtained, an algorithm must be developed for automated translation into the specification languages of the best-known protocol verification tools, such as Avispa [16], Scyther [17], ProVerif [18], and others. The parser itself also needs to be improved. At the moment, the structure can only be retrieved from code in which all functions for sending and receiving messages are combined into one block, for example the body of a function or class method. In the future, it is planned to improve the parser so that it works with complex code structures.

Conclusion
An algorithm was presented for analyzing C# source code to extract the structure of cryptographic protocols. It is based on identifying important code sections that contain constructions specific to cryptographic protocols and on determining the chain of variable transformations from the point of sending or receiving back to the initial initialization, taking possible cryptographic transformations into account, in order to compose a tree from which a simplified structure of the cryptographic protocol can be obtained. An example of a protocol containing all the cryptographic functions was given, and the output structure of the cryptographic protocol was shown. The parser was successfully tested in practice on a real e-voting system based on blinded intermediaries. For the further possibility of the application of formal verification of protocols and dynamic analysis, it is necessary to make [...]
Liudmila Klimentevna BABENKO is currently a professor at the Department of Information Technology Security at the Southern Federal University. The area of scientific interests includes cryptographic methods and means of ensuring information security, technology of parallel-vector computing, evaluation of the strength of cryptographic methods of information protection.