REDoS Detection in “Domino” Regular Expressions by Ambiguity Analysis

Abstract—The Regular Expression Denial of Service (REDoS) problem refers to a time explosion caused by the high computational complexity of matching a string against a regex pattern. This issue is prevalent in popular regex engines, such as those of PYTHON, JAVASCRIPT, and C++. In this paper, we examine several existing open-source tools for detecting REDoS and identify a class of regexes that can create REDoS situations in popular regex engines but are not detected by these tools. To address this gap, we propose a new approach based on ambiguity analysis, which combines a strong star-normal form test with an analysis of the transformation monoids of Glushkov automata orbits. Our experiments demonstrate that our implementation outperforms the existing tools on regexes with polynomial matching complexity and complex subexpression overlap structures.


I. INTRODUCTION
Popular regular expression (regex) engines typically use non-deterministic finite automata (NFA) as their internal representation for regexes. This choice is motivated by the flexibility of the NFA concept, which can be extended to support a wider range of regex operations with little effort. For instance, back-references and lookaheads can be easily added to the NFA model. Although, in theory, every string can be matched against a regex in linear time via conversion to a deterministic finite automaton (DFA), popular regex engines may admit exponential matching time due to a phenomenon called "catastrophic backtracking". This phenomenon occurs only for a specific class of regular expressions. For example, consider the regex (a|b)*a, which is non-deterministic due to the unavoidable non-determinism in the transition to the last occurrence of the letter a. However, every string has a unique parse tree with respect to this regex. In contrast, the regex (a*b*)* has an infinite number of accepting parse trees for any given string, as the inner Kleene stars can degenerate to the empty word, causing a combinatorial explosion of parse paths. Intuitively, the latter regex can be considered "bad", while the former is considered "good".
Matching against "bad" regexes can lead to a situation called a Regular Expression Denial of Service (REDoS), when the matching time grows super-linearly and can cause performance issues in, for instance, a web service that uses such a regex to parse user input. To avoid these situations, it is essential to detect unsafe regexes and replace them with safe equivalents.
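The difference between "good" and "bad" regexes is easy to observe directly in a backtracking engine. The following sketch (our own illustration; the regex (a|a)*b and the timing loop are not from the paper) measures the matching time for a failing input, which roughly doubles with every extra letter:

```python
import re
import time

# The pattern (a|a)* is ambiguous: every 'a' can be matched by either
# branch, so a failing match forces the backtracking engine to explore
# about 2^n parse paths before rejecting the input.
def match_time(pattern: str, text: str) -> float:
    start = time.perf_counter()
    re.match(pattern, text)
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (10, 14, 18):
        t = match_time(r"(a|a)*b", "a" * n + "c")
        print(f"n={n}: {t:.4f}s")  # the trend, not the absolute value, matters
```

The absolute timings are machine-dependent; the point is the exponential growth trend, which disappears if the regex is rewritten without the redundant alternative.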
The number of research papers mentioning the REDoS problem has increased rapidly in the last decade [1]–[7]. Several tools have been developed to detect REDoS, using both static analysis and random search. Some of these tools aim to cover the entire class of extended regexes, while others focus on academic ones. However, for a class of simple regexes that are not safe in theory, the tools considered either take too long to process them or give an incorrect answer, falsely witnessing their safety. These regexes usually have an overlapping, but not completely coinciding, structure of the expressions under the Kleene stars (a simple analogue of dominoes in the Post Correspondence Problem). An example of such a regex is (baa|ab)*(b|ε)(a(ba|a)ba*b)*(aab)*: the ambiguity occurs both in prefixes of the form (baa)^n and (ab)^n, which can be constructed in several ways from the primitive "dominoes". Thus, two natural research questions arise: do the "domino" regexes really contain REDoS situations w.r.t. modern regex engines?

Fig. 1: Thompson automaton for (a|b)*a
If the answer is yes, what methods can deal with such regexes in order to analyse them without a blow-up of the analysis time caused by the overlaps? The main contributions of the paper are:
a method for detecting REDoS situations that utilizes properties of non-deterministic finite automata and their transition monoids. This approach is novel, since previous static-analysis-based methods use NFA intersection. For "domino" regexes our method is shown to perform better than the open-source analogues REGEX STATIC ANALYZER [3], RESCUE [5], and REVEALER [2];
experimental testing of the relevance of the NFA model used and of the vulnerabilities found, by investigating the behaviour of real regex engines on the attack strings. The method is implemented only for academic regexes for now. Surprisingly, for this case, the tested open-source tools perform significantly worse on domino tests, especially for polynomial REDoS situations.
The paper is organized as follows. Section II contains preliminaries on finite automata and the theoretical concepts used further. The proposed REDoS detection method is given in Section III, preceded by the lemmas used for its optimisation. Section IV discusses the relevance of the chosen model with respect to real regex matching engines, and provides the results of comparative testing of our method and three other open-source REDoS detection tools. We discuss the results of the experiments and the related works in more detail in Section V. Section VI concludes the paper.

II. PRELIMINARIES
We denote automata with calligraphic A; states are denoted with the letters q and s, or with sets of these letters (if an automaton is a result of a closure operation). The empty word is denoted by ε; concrete elements of the input alphabet are denoted by a, b, c, ..., and letter parameters are denoted by σ; α and β denote word parameters. We use only the basic academic regular expression constructing operations: concatenation (which is omitted in notation), alternation (denoted by |), and Kleene star (denoted by *). If R is a regex, L(R) denotes its language.
Let us recall basic definitions and describe the finite automata models used in this paper.

A. Finite Automata
Definition II.1. A non-deterministic finite automaton (NFA) is a tuple ⟨Q, Σ, q_0, Δ, F⟩, where: Q is a set of states; Δ is a set of transitions of the form ⟨q_i, (σ|ε), Q_j⟩, where q_i ∈ Q, σ ∈ Σ, Q_j ∈ 2^Q; q_0 ∈ Q is the initial state; F ⊆ Q is a set of final states.

Every transition in an NFA maps a pair ⟨q_i, (σ|ε)⟩ into a set of states, contrary to transitions in a deterministic finite automaton (DFA), which map every pair ⟨q_i, σ⟩ (where σ is essentially not equal to ε) to a single state. Thus, if a word is parsed by a DFA, the parse trace is always unique (i.e., DFAs are unambiguous); in an NFA, there can be a set of parse traces for a single word. This set can even be infinite in the case of an NFA with ε-transitions. An NFA can be transformed into an equivalent DFA using the textbook subset-construction algorithm Determinize, which generates DFA states corresponding to the sets of states of the initial NFA reached along the same input symbols.
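The subset construction can be sketched as follows (a minimal version for ε-free NFAs; the dict-based encoding `delta[state][symbol] -> set of states` is our own assumption, not the paper's representation):

```python
from collections import deque

# Minimal subset construction (Determinize) for an epsilon-free NFA.
# Each DFA state is a frozenset of NFA states reachable along the
# same input prefix, as described in the text.
def determinize(delta, start, finals):
    start_set = frozenset([start])
    dfa, queue, seen = {}, deque([start_set]), {start_set}
    while queue:
        subset = queue.popleft()
        dfa[subset] = {}
        symbols = {a for q in subset for a in delta.get(q, {})}
        for a in symbols:
            target = frozenset(p for q in subset for p in delta.get(q, {}).get(a, ()))
            dfa[subset][a] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_finals = {s for s in dfa if s & finals}
    return dfa, start_set, dfa_finals
```

For example, the ε-free NFA for (a|b)*a with states {0, 1}, transitions 0 —a→ {0, 1}, 0 —b→ {0}, and final state 1 determinizes into two subset states, {0} and {0, 1}.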
The NFA models used in regex engines are primarily based on the classical Thompson construction, which provides an algorithm for transforming a regex into an NFA that recognizes the same language.While the implementation details of the transformation may vary, the experiments presented in Section IV provide evidence that the Thompson model remains relevant for identifying inefficient regexes with respect to NFA-based parsing engines.
In the following descriptions, we only give details of the constructed NFAs in terms of their states and transitions, without mentioning the alphabet construction.
The Thompson construction algorithm ensures that any NFA produced by the algorithm has a unique final state, and that each state has at most two outgoing and two incoming transition arcs. The uniqueness of the final state implies that the reverse NFA for Thompson(R) is exactly Thompson(R^R), where R^R is the reverse of the regex R. Additionally, all subregex automata can be treated as isolated directed acyclic graphs, which makes the construction easily extensible and decomposable. An example of a Thompson automaton for a regex is shown in Figure 1. The state labels follow the corresponding regex operations given in Definition II.2.
One drawback of the Thompson construction is that it introduces non-deterministic transitions corresponding to alternating operations (i.e., alternatives or Kleene stars), even in the cases when the regex itself imposes no non-determinism (e.g., for the regex a(a|b)*, which is the reverse of the regex shown in Fig. 1). To avoid the redundant non-determinism, the regex engine RE2 [8] processes such strongly deterministic regexes (also known as 1-unambiguous regexes [9]) by constructing another NFA based on the regex structure, but without ε-transitions. This automaton has been known as the Glushkov automaton since the 1960s, and in the last two decades it has attracted considerable interest, having been shown to be efficient and extensible for constructing deterministic parsing engines for a larger class of regexes (such as memory finite automata for regexes with back-references [10]).
The classical Glushkov construction is based on the so-called follow relation on linearised regexes. By construction, every state in the Glushkov automaton except the initial state corresponds to an occurrence of some σ ∈ Σ in the input regex R; conversely, any letter occurrence in the regex R corresponds to exactly one state in Glushkov(R), whose incoming arcs are all marked with σ. Now we can reformulate this property in terms of Thompson and Glushkov automata.
Proposition II.1. There is a bijection from the state set of Glushkov(R) minus the initial state to the state set F_σ in Thompson(R), where F_σ are the final states of the primitive automata reading the letters σ.
In [11], it was shown that Glushkov(R) can also be obtained from Thompson(R) by merging its ε-closures.
Definition II.3. Given an NFA A and its state q, the ε-closure of q is the maximal set of states reachable from q following only ε-transitions.
The closure-merging ε-free automaton (denoted RemEps(A)) is constructed from A by merging each ε-closure into a single state. An example of the closure-merging operation is given in Fig. 2.
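The ε-closure of Definition II.3 is a plain reachability computation over ε-arcs only; a minimal sketch (the adjacency-dict encoding of ε-transitions is our own):

```python
def eps_closure(eps, state):
    # eps: dict mapping a state to the set of states reachable by one
    # epsilon-transition; returns the maximal set reachable from
    # `state` via epsilon-transitions only (Definition II.3).
    closure, stack = {state}, [state]
    while stack:
        q = stack.pop()
        for p in eps.get(q, ()):
            if p not in closure:
                closure.add(p)
                stack.append(p)
    return closure
```

Note that if the returned set contains a cycle of ε-arcs through `state`, the closure "contains a loop" in the sense used by the SSNF test of Section III.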

B. Transformation Monoid of NFA
Let us consider an automaton with no useless states and no ε-transitions. Its transitions over the alphabet Σ and the state set Q form a function δ: Σ × Q → 2^Q taking a pair ⟨σ, q⟩. This function, when curried and specialized in the first argument, becomes δ_σ: Q → 2^Q (where σ ∈ Σ). We can form a monoid over the set of such partially specialized functions (transformations) if we extend them to strings as follows: δ_αβ(q) = ⋃_{s ∈ δ_α(q)} δ_β(s). Then associativity is provided "for free", given the associativity of string concatenation, and δ_ε becomes the monoid unit, because δ_ε(q) = {q}. The formal definition is as follows [12].
Definition II.4. Given an ε-free automaton A over the alphabet Σ, its transformation monoid M = TransMonoid(A) is the monoid of transformations imposed by the elements of Σ* on the states of A.
The monoid construction does not depend on the choice of the final or initial states of A (except for the condition that all the states are useful, i.e. reachable and producing); thus, instead of classical NFAs, the monoid is based on a labelled transition system. Since the set of functions Q → 2^Q is finite, the transformation monoid of an NFA always contains a finite number of equivalence classes. The pair ⟨W, R⟩, where W is a finite set of lexicographically minimal elements of the equivalence classes and R is a set of simplification rules, is considered a standard representation of the transformation monoid. Such a representation for TransMonoid(Glushkov(a(a|b)*)) is given in Fig. 3. The monoid representation uncovers some useful NFA properties. For example, we can immediately conclude that the words aa and ab are synchronizing, since for all q_i we have q_i =aa⇒ q_2 and q_i =ab⇒ q_3, and no other transition is possible.
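The transformation monoid can be computed by closing the one-letter transformations under composition. The sketch below is our own illustration (a transformation is encoded as a tuple of state sets, indexed by sorted state order); run on the Glushkov automaton of a(a|b)*, it confirms the synchronizing words aa and ab mentioned above:

```python
def trans_monoid(delta, states, alphabet):
    # Breadth-first closure of the one-letter transformations under
    # composition; returns a dict mapping each distinct transformation
    # to a shortest word inducing it.
    order = sorted(states)
    idx = {q: i for i, q in enumerate(order)}

    def letter(a):
        return tuple(frozenset(delta.get(q, {}).get(a, frozenset())) for q in order)

    def compose(f, g):  # apply f first, then g
        return tuple(
            frozenset(p for q in f[i] for p in g[idx[q]])
            for i in range(len(order))
        )

    identity = tuple(frozenset([q]) for q in order)  # transformation of the empty word
    monoid = {identity: ""}
    frontier = [identity]
    gens = [(a, letter(a)) for a in alphabet]
    while frontier:
        new = []
        for f in frontier:
            for a, g in gens:
                h = compose(f, g)
                if h not in monoid:
                    monoid[h] = monoid[f] + a
                    new.append(h)
        frontier = new
    return monoid

if __name__ == "__main__":
    # Glushkov automaton of a(a|b)*: 0 is initial, 1 = the leading a,
    # 2 = a under the star, 3 = b under the star.
    glushkov = {
        0: {"a": {1}},
        1: {"a": {2}, "b": {3}},
        2: {"a": {2}, "b": {3}},
        3: {"a": {2}, "b": {3}},
    }
    monoid = trans_monoid(glushkov, {0, 1, 2, 3}, "ab")
    print(len(monoid), "distinct transformations")
```

Since the set of transformations Q → 2^Q is finite, the closure always terminates; on this automaton, the word aa maps every state to {2} and ab maps every state to {3}, exactly the synchronizing behaviour described in the text.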

C. Ambiguity of NFAs and REDoS
Intuitively, the worst-case scenario for backtracking-based matching of a string against a regex R occurs when the matched string has a prefix α_1 with a large set of parse paths, and a suffix α_2 s.t. α_1 α_2 ∉ L(R). In this case, in order to determine that α_1 α_2 is not recognizable by R, a regex engine must backtrack through all the parse variants of α_1. Obviously, we can choose a suffix α_3 s.t. α_1 α_3 ∈ L(R), and α_1 α_3 will still have a large number of parse trees (although the regex engine will report a success after finding the first one). Therefore, the worst-case matching time depends on the upper bound on the number of parse paths in a regex.
In the domain of finite automata, the following definition is used [13], [14].
Definition II.5. The degree of ambiguity of an NFA A is the worst-case bound on the number of paths recognizing an input string (as a function of the length of the string).
The ambiguity of NFAs is known to be either constant, polynomial, or exponential [13]. An NFA satisfying the EDA criterion below has exponential ambiguity, while an NFA satisfying IDA but not EDA has polynomial ambiguity. A minimal EDA-generating regex example is (a|a)*.
A minimal example of a regex producing an IDA but not EDA automaton is a*a*. For regexes such as (a*b*)*, Glushkov(R) is unambiguous, even though Thompson(R) is EDA. We can notice that in Thompson((a*b*)*) a special situation occurs: there is a loop inside an ε-closure of a state (i.e., there is at least one Kleene star in the regex iterating over an expression R_0 s.t. ε ∈ L(R_0)). Further we show that such a case is the only possible situation when Thompson(R) and Glushkov(R) have distinct ambiguity degrees.
The following criterion estimates the degree of ambiguity in any NFA.
Theorem II.2. A satisfies the IDA condition ⇔ there exist states q_1, q_2 in A and a word α s.t. A contains paths from q_1 to itself, from q_2 to itself, and from q_1 to q_2, all accepting the word α.
A satisfies the EDA condition ⇔ there exists a state q in A and a word α s.t. A contains two distinct loops from q to itself, both accepting the word α.
We can also say that if EDA occurs in an NFA, then q =α⇒ q via two distinct loops (see Fig. 4).
Following [9], we use the term "orbit of state q" for the maximal strongly connected component containing q. We assume that orbits are non-trivial, i.e. contain at least one transition. If a state q of A satisfies the EDA criterion for some α, then all states belonging to its orbit also satisfy EDA. Thus, to check the EDA condition, it is sufficient to check whether any state of some strongly connected component of an NFA satisfies EDA; for the IDA condition, it is sufficient to check whether there are two strongly connected components satisfying it.
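Orbits are the non-trivial strongly connected components of the transition graph, so any linear-time SCC algorithm applies; a sketch using Kosaraju's algorithm (our own illustration, independent of the paper's implementation):

```python
def sccs(graph):
    # Kosaraju's algorithm. graph: dict node -> iterable of successors.
    # Returns the strongly connected components as a list of sets.
    order, seen = [], set()

    def dfs(u):  # iterative DFS recording finish order
        stack = [(u, iter(graph.get(u, ())))]
        seen.add(u)
        while stack:
            node, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(graph.get(v, ()))))
                    break
            else:
                order.append(node)
                stack.pop()

    for u in graph:
        if u not in seen:
            dfs(u)

    rev = {}
    for u, vs in graph.items():
        for v in vs:
            rev.setdefault(v, set()).add(u)

    comps, assigned = [], set()
    for u in reversed(order):  # sweep the reversed graph in finish order
        if u in assigned:
            continue
        comp, stack = {u}, [u]
        assigned.add(u)
        while stack:
            x = stack.pop()
            for y in rev.get(x, ()):
                if y not in assigned:
                    assigned.add(y)
                    comp.add(y)
                    stack.append(y)
        comps.append(comp)
    return comps
```

Singleton components without a self-loop correspond to trivial orbits and would be filtered out before the ambiguity tests.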
The approach to IDA and EDA detection used in the REDoS analysers [3], [4] tests the above criterion by constructing single or double intersections of A with itself. Although the intersection construction can be done in time polynomial in the NFA size, it may lead to large NFAs if there are many crossing components (i.e., components matching the same string sets) in the initial NFA.
The IDA criterion can also be reformulated in terms of transformation monoids.

Proposition II.3. An ε-free A satisfies IDA ⇔ its transformation monoid contains an equivalence class [α] s.t. for some states q_1 ≠ q_2: q_1 ∈ δ_α(q_1), q_2 ∈ δ_α(q_2), and q_2 ∈ δ_α(q_1).
Using this criterion for an initial NFA "as is" is highly impractical: if the NFA contains non-crossing components, the transformation monoid becomes exponentially huge. However, with some refinements, we observed that the monoid criterion can be applied (and even be fast) in the cases when the intersection criterion is slow. Moreover, Proposition II.3 provides an explicit construction of a string with the ambiguity, allowing the analysing algorithm to reconstruct the REDoS situation easily. First, take any NFA path from the initial state of A to q_1, recognizing some prefix α_1. Then pump α to construct an infix with a super-linear number of parse trees, and then take some string α_2 s.t. no path from q_2 recognizing α_2 ends in a final state of A. The string α_1 α^n α_2 will force an NFA parsing device to do super-linear backtracking.
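Given one monoid element (the transformation of some fixed word α), the Proposition II.3 check is a direct scan over state pairs. A minimal sketch (our own encoding: a transformation is a tuple of state sets indexed consistently with `order`); the example below is the transformation δ_a for the Glushkov automaton of a*a*, the minimal IDA regex from Section II:

```python
def has_ida_witness(f, order):
    # f[i] = set of states reachable from order[i] along the fixed word.
    # Proposition II.3-style check: states p != q with p in f(p),
    # q in f(q) and q in f(p) witness IDA (the word is the pump).
    idx = {q: i for i, q in enumerate(order)}
    for p in order:
        for q in order:
            if p != q and p in f[idx[p]] and q in f[idx[q]] and q in f[idx[p]]:
                return (p, q)
    return None
```

For Glushkov(a*a*) with states 0 (initial), 1 (first a*), 2 (second a*), the letter a gives f(0) = f(1) = {1, 2} and f(2) = {2}; the pair (1, 2) is returned, and the word a is the pump of the quadratic ambiguity.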
If the monoid criterion is applied to the orbit automaton of a state q, the REDoS pump can be constructed as well: just choose some α_1, α_2 s.t. q_0 =α_1⇒ q, and no path q =α_2⇒ q_f to a final state q_f of A exists.

III. OUR APPROACH
As a starting point, we prefer to use the Thompson automaton as a preliminary NFA model for a regex, since regex matching engines rely on it in their internal algorithms, and the experiments in Section IV demonstrate that the Thompson construction is suitable for analysing real REDoS. However, in order to apply the monoid criterion, we must first eliminate the ε-transitions and ensure that their removal does not affect the degree of ambiguity.
Let us say that a regex R is in (strong) star-normal form (SSNF) if it does not contain a subexpression (R_0)* s.t. ε ∈ L(R_0) [15]. The following proposition gives an equivalent criterion.
Proposition III.1. R is in SSNF ⇔ no ε-closure of Thompson(R) contains a loop.
Proof. ⇐: Let R contain a subexpression (R_0)*, where ε ∈ L(R_0), and let q_0 and f_0 be the initial and final states of Thompson(R_0). Since there is a path in Thompson(R_0) from q_0 to f_0 recognizing ε, and the star construction adds a backward ε-transition from f_0 to q_0, the ε-closure of q_0 contains a loop.

If Thompson(R) contains an IDA which is not an EDA, then the IDA-producing states belong to distinct Kleene star subexpressions. Moreover, since α ≠ ε, the paths producing the IDA situation contain at least two distinct states q_σ, q'_σ, which are final states of primitive automata for a letter σ ∈ Σ and have distinct orbits; thus, their ε-closures remain distinct. It is therefore sufficient to test R for the strong star-normal form property and then, if necessary, continue the ambiguity analysis operating on the Glushkov automaton, which has significantly fewer states. If there are loops in ε-closures, further analysis is not needed: these loops already produce EDA situations.
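Since SSNF is a purely syntactic property, it can also be tested directly on the regex, without building Thompson(R): a regex violates SSNF iff some starred subexpression is nullable. A minimal recursive-descent sketch for academic regexes (our own illustration; it assumes a well-formed pattern over single-letter symbols using only `(`, `)`, `|`, `*`):

```python
def is_ssnf(regex: str) -> bool:
    # Tracks nullability bottom-up; a '*' applied to a nullable
    # subexpression breaks the strong star-normal form.
    pos = 0
    ok = True

    def peek():
        return regex[pos] if pos < len(regex) else None

    def parse_alt():
        nonlocal pos
        nullable = parse_concat()
        while peek() == "|":
            pos += 1
            nullable = parse_concat() or nullable
        return nullable

    def parse_concat():
        nullable = True
        while peek() not in (None, "|", ")"):
            nullable = parse_star() and nullable
        return nullable

    def parse_star():
        nonlocal pos, ok
        if peek() == "(":
            pos += 1
            nullable = parse_alt()
            pos += 1  # skip ')', assuming balanced parentheses
        else:
            nullable = False  # a single letter
            pos += 1
        while peek() == "*":
            pos += 1
            if nullable:
                ok = False  # starred nullable subexpression: not SSNF
            nullable = True
        return nullable

    parse_alt()
    return ok
```

On the paper's running examples, (a|b)*a and a*a* pass the test, while (a*b*)* and ((a*)*)* fail it, matching the ε-closure-loop characterization of Proposition III.1.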
Given a state q in A and its orbit O, the orbit automaton of q is the automaton A_q that includes all states and transitions of O, has q as the initial state, and whose final states are either final states of A or states with transitions leaving the orbit O in A.
If we choose one state q_i from each strongly connected component C_i of A, then testing the IDA criterion for TransMonoid(A_{q_i}) is enough to reveal all EDA situations. However, in the case of a polynomial IDA, we must test pairs of strongly connected components (together with the transitions from one component to another), and building a monoid for every such pair-generated NFA is too time-consuming. Thus, we use the following simple necessary condition for the polynomial IDA.
Proposition III.3. Let C_1, C_2 be distinct strongly connected components of A. If A contains a polynomial IDA within the components, then there exist two states q_1 ∈ C_1, q_2 ∈ C_2, s.t. Determinize(A) contains a subset state including both q_1 and q_2. Moreover, such a subset state also occurs in Determinize(Reverse(A)).
Although the determinization algorithm is exponentially hard in the worst case, it is known to be fast in most practical cases [16]. Thus, the subset test accelerates the search for polynomial IDA candidates. However, it is not sufficient, which can be shown by analysing the regex (a|b)*(b|c)(a|c)*, whose Thompson automaton contains no IDA.

Fig. 5: Algorithm of ambiguity analysis for regexes
The pseudocode of the complete algorithm is given in Fig. 5. There, A_{12} includes the orbit automata A_1 and A_2 of q_1 and q_2, and all states reachable from A_1 and reaching A_2, together with their transitions. Its initial state coincides with the initial state of A_1, and its final states are the final states of A_2 (ignoring final states of A belonging either to A_1 or to the intermediate states). The condition C_1 ⇒ C_2 ensures that the component C_2 is reachable from C_1 and that they do not coincide. The operator C[1] takes the first state from the component C (since the results of the Ambiguity, TransMonoid and determinization tests do not depend on the choice of the initial state in the orbit automata). The function SCC(A) returns all strongly connected components of A.

IV. EXPERIMENTS

A. Data Set
In order to evaluate the effectiveness of our approach on "domino" regexes, a data set of 100 academic regexes was generated. The regexes satisfy the following properties: their length and alphabet are small (not more than 50 terms and not more than 5 distinct letters); they have iterated elements; all are in SSNF.
The first condition allows a significant overlap of subexpression languages without blowing up the regex length. However, the test set contains not only complex dominoes, but also regexes with simple ambiguity situations like b*c(ac|(aa|a)*d)*. The second condition is necessary for REDoS situations. The third condition mostly excludes regexes caught by the trivial SSNF test, for which our method returns the EDA verdict immediately.
We explored the dependence of regex matching time on the input length for the popular engines of PYTHON, JAVASCRIPT, C++, JAVA 8, JAVA 11, GO, and RUST.
In order to detect super-linear dependencies, it is necessary to generate potentially attacking input, for which the string pumping method is used. The attacking input consists of three components: a prefix that satisfies the regular expression, a pumping core whose repetition leads to a rapid increase in the number of parsing paths (i.e., the malicious pump), and a suffix whose mismatch leads to catastrophic backtracking.
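The three-component pumping scheme can be sketched as follows; the two regexes are our own toy examples (not from the paper's data set), chosen so that a*a*b exhibits the polynomial and (a|aa)*b the exponential behaviour on a backtracking engine:

```python
import re
import time

def attack_string(prefix: str, pump: str, suffix: str, n: int) -> str:
    # Three-part attack input: prefix + pumped core + mismatching suffix.
    return prefix + pump * n + suffix

if __name__ == "__main__":
    # a*a*b on a^n c: quadratically many split points to retry;
    # (a|aa)*b on a^n c: exponentially many decompositions of a^n.
    for name, pattern in (("polynomial", r"a*a*b"), ("exponential", r"(a|aa)*b")):
        rx = re.compile(pattern)
        for n in (8, 16, 24):
            s = attack_string("", "a", "c", n)
            t0 = time.perf_counter()
            rx.match(s)  # fails: the suffix 'c' forces full backtracking
            print(f"{name} n={n}: {time.perf_counter() - t0:.6f}s")
```

The exponential case overtakes the polynomial one as n grows; the same prefix/pump/suffix scheme underlies the attack strings used against the engines below.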
The results obtained with the JAVASCRIPT, PYTHON, C++ and JAVA 8 standard regex engines are the same: according to them, the data set contains 34 exponential, 36 polynomial and 30 safe regexes. The experiments also indicated that the JAVA 11 standard regex engine handles some polynomial and exponential cases, but when the length of the input data increases significantly, it throws a stack overflow exception, which may be due to the introduction of local index storage into the regex module in version 11 of JAVA. The regexes are safe for the GO and RUST engines, which are based on deterministic structures. Nevertheless, we noted frequent single outliers in the trends when matching strings in GO.
During testing, we observed that polynomial regexes only lead to critical matching times (more than 1 minute) for significant input string lengths (approximately more than 500 characters), while expressions that have exponential matching complexity can reach critical time when parsing even relatively small input strings. In the simplest case, such a time explosion can be achieved with regexes that have deep star nesting or multiple alternatives under a star quantifier. For instance, the PYTHON, JAVASCRIPT, JAVA 8, and C++ regex engines are vulnerable to attacks in the case of the ((a*)*)* regex, and even the optimized JAVA 11 engine, which successfully handles double star nesting, reaches critical time processing such an expression.
However, more non-trivial cases were encountered in the proposed data set. For example, the regex b(ab((a|b(a*a)*)a*b*)*|a*aaaa*)*, when matched against an input of 32 characters that follows the pattern with prefix b, pump abab, and suffix bbd, achieves the following timings: PYTHON engine, over 3 minutes; JAVA 8, over 3 minutes; JAVA 11, 0.80 minutes; C++, over 3 minutes; JAVASCRIPT, 1.73 minutes.
In general, the REDoS vulnerability degree coincides with the theoretical expectations, taking into account the asymptotic growth of the ambiguity function for the corresponding Thompson automata. Non-SSNF regexes cause a critical time explosion, which is evidence that the regex engines do not apply the SSNF transformation to their input. In addition to non-SSNF regexes, critical REDoS situations occur on polynomial ambiguities iterated under a Kleene star.

B. Comparison with Other Research Tools
We evaluated the effectiveness of the proposed approach by comparing it with three state-of-the-art open-source tools for detecting vulnerabilities in regexes: RSA [3], [17], a static analysis tool; RESCUE [5], [18], a genetic fuzzing tool; and REVEALER [2], [19], an automated hybrid analysis tool that uses both static and dynamic approaches.
The qualitative results of the experiments are presented in Table II. To evaluate the effectiveness of detection of vulnerable and safe regexes, we used the F_1-score, where the true positives are all vulnerable regular expressions that were classified as exponential or polynomial; the absence of a result due to a timeout is counted as an incorrect answer. We also used two error rates: the total error rate (the cumulative error over all classes of regexes) and the vulnerable error rate (the classification error among vulnerable regexes). It should be noted that RESCUE does not support the exponential/polynomial classification; therefore, not all values were calculated for this tool.
The results of measuring the execution time of the considered tools are shown in Table I. When measuring time, all extended features of the tools were disabled, and their parameters were optimized. For each class of correctly classified regexes (exponential, polynomial, safe, and unsafe, i.e. the union of vulnerable regexes), the average running time (μ) and its standard deviation (σ) were estimated; the number of timeouts was also calculated.
Additionally, we chose 25 regexes with a non-SSNF structure, which are analysed in our method by the preliminary ε-loop test. While our approach proved to be the fastest (which is not a surprise, given the algorithm structure), the static part of REVEALER also had a 100% success rate on this set, although taking on average 4× more time.
It is important to note that the theoretical results obtained by using static analysis methods to determine the ambiguity degree of the Thompson automata completely coincide with the experimental results obtained when testing the domino regexes on the PYTHON, JAVASCRIPT, JAVA 8, and C++ regex engines. This strongly suggests that the regexes declared safe by dynamic or combined methods are their false negatives.
From the test results, we can conclude that the detection efficiency of the static analyser is high, but in non-trivial exponential or polynomial cases such as (baa|ab)*b(a(b|a)ba*b)*(aab)*, timeouts occur. The recognition efficiency of the RESCUE and REVEALER tools on this data set is low. The proposed approach, however, has the maximum quality of vulnerability detection, and its average execution time is also superior to the other implementations. This is partly explained by its narrow domain: testing only academic regexes. But RSA also targets academic regexes and still has several timeouts; on the other hand, it seems that the extension of REDoS-detection tools to non-academic regexes made them miss almost all polynomial REDoS with a domino structure.
V. DISCUSSION AND RELATED WORKS

Initially, our finite automata transformation tool was not designed to reveal REDoS situations. However, attempts to use open-source tools like REGEX STATIC ANALYZER or RESCUE to analyze simple academic regexes with a non-trivial ambiguity structure failed. The main purpose of the work was educational, so we designed our algorithm in such a way that it not only detects vulnerabilities, but also demonstrates them on the automata graphs (Fig. 6), at the cost of a longer execution time. Since the tool was initially designed for demonstrations, only core academic regexes were considered. The algorithms used in the monoid-based approach have poor worst-case complexity, so its efficiency, compared to RSA and RESCUE, was a real surprise.
What features of the analysers caused such a situation? RSA uses the NFA intersection construction, based on the well-known paper of Mohri et al. [14]. To detect polynomial ambiguities, the algorithm requires self-intersecting an NFA twice. The automata intersection problem is known to be PSPACE-complete [20], [21]; thus, every additional intersection results in a significant slowdown. This may be the main reason why polynomial detection results in timeouts in RSA. The monoid and determinization algorithms are known to be worst-case exponential. However, determinization is proven to be fast on average [16], while the monoid representation depends heavily on the automata structure and, applied to orbit automata, generates significantly fewer equivalence classes than in the case when the automata are not cyclic. Another well-known problem in static analysers is dealing with ε-transitions, which can ruin the intersection construction as well as the monoid. Surprisingly, the tools do not use the simple and natural conversion to the Glushkov construction preceded by the SSNF test.
The error rate of static tools is usually much lower than that of tools using genetic algorithms and fuzzing, since REDoS-provoking strings can be disguised, requiring several explicit iterations to construct, or be combined from several alternative subexpressions under an iteration. Even the use of two approaches in REVEALER cannot help to find vulnerabilities if the malicious pump is hidden in overlaps and crossing occurrences. For example, in paper [6], four REDoS classes based on the regex structure are provided, and the regex a*(ab)*a(ba)* satisfies none of them, because the vulnerability appears due to the crossing occurrence of the string ab on the border of the two orbits, whereas the expressions under the Kleene stars have languages with an empty intersection, which makes the regex "seemingly safe". A similar pattern-based approach is used in [7], resulting in the same sort of false negatives. So, regex-based heuristics showed themselves to be too weak, as compared to model NFA analysis, in the domino ambiguity cases.
If a malicious pump for a regex is found, a natural question arises: how to correct the regex? We did not consider a full implementation of regex correction, but implemented a trial algorithm constructing a 1-unambiguous regex, if it exists [9]. However, for most regexes with overlaps, even if an equivalent 1-unambiguous regex can be built, the algorithm given in [9] produces an exponentially longer result, as compared to the input, processing all overlap combinations separately. A more optimistic regex correcting heuristic is the Star Normal Form transformation: it is performed in linear time and produces regexes of approximately the same length. Moreover, the SSNF transformation is rather local, does not require a transition to an NFA, and can be applied even to extended regexes, which is useful, taking into account that non-SSNF regexes cause critical REDoS w.r.t. the PYTHON and JAVASCRIPT regex engines. In general, the question of which theoretical results can be used to fix REDoS regexes is still a subject of research.

VI. CONCLUSION
The research resulted in the following answers to our research questions.
RQ1: how relevant is NFA static analysis w.r.t. popular regex engines? Our experiments demonstrated that the Thompson NFA model is entirely suitable for evaluating REDoS situations concerning the most widely used regex engines, including PYTHON, JAVASCRIPT, JAVA, and C++. Interestingly, although the GO regex machine uses conversion to a DFA, it still produces surges on some ambiguous regexes with complex structures. The RUST DFA engine proved to be the most stable.
RQ2: what features of the considered REDoS analysers cause errors and time explosion on regexes with a complex overlap structure? How can such regexes be processed reliably, with less risk of time explosion? We found that considering orbit automata (instead of performing the ambiguity analysis on the entire NFA) and using the Glushkov construction, preceded by the Strong Star Normal Form test, do not result in any loss of relevance, but significantly speed up the static analysis. Another interesting approach is to use monoid analysis as the primary ambiguity-detecting algorithm instead of NFA intersection analysis. If there are multiple substring overlaps in the orbits, this method performs significantly faster. However, if the overlaps are small, the number of equivalence classes in the monoid increases dramatically, making the intersection method preferable. We also provided experimental evidence that the genetic search REDoS detection methods still miss complex REDoS cases easily detected by static NFA analysis approaches. Although our approach proved itself to be efficient and reliable on the test set of domino regexes, it still requires many refinements. First, the monoid construction may explode if we take large alphabets, so the input regexes may need some alphabet factorization. E.g., if no overlaps are contained within a long string, then this string can sometimes be treated as a single letter. Second, it would be interesting to test the method on approximations of extended regexes, and to combine the monoid-based and intersection-based ambiguity detection algorithms.

Fig. 6: Revealing strongly connected components with ambiguity situations in an NFA graph.

The notation with labelled arrows is overloaded to denote either the NFA transition ⟨q_i, σ, Q_j⟩ (written q_i —σ→ Q_j) or a transition to a single state belonging to Q_j (written q_i —σ→ q_j). The existence of a path from q_i to q_j marked by α ∈ Σ* is denoted q_i =α⇒ q_j.
Fig. 2: Thompson(a(a|b)*) with colored ε-closures.

If A_1 = Thompson(R_1) and A_2 = Thompson(R_2), and q_i and f_i are their initial and final states, respectively, then Thompson(R_1|R_2) is constructed by merging the A_1 and A_2 state sets and transition sets, and introducing the transitions q_alt —ε→ {q_1, q_2}; f_1 —ε→ {f_alt}; f_2 —ε→ {f_alt}. Thompson(R_1 R_2) is again constructed by merging the Thompson(R_i) state and transition sets, making q_1 the initial state and f_2 the final state, with the additional transition f_1 —ε→ {q_2}.

⇒: Any loop must contain a backward arc, and in any Thompson automaton, the only backward arcs are the transitions from f_0 to q_0, where R_0 is a subexpression under a Kleene star. If the loop follows ε-transitions, it also contains a path q_0 =ε⇒ f_0, so ε ∈ L(R_0), and (R_0)* breaks the SSNF condition.

In the following proposition, Ambiguity is valued either EDA, IDA (not EDA), or safe.

Proposition III.2. If R is in SSNF, then Ambiguity(Thompson(R)) = Ambiguity(Glushkov(R)).

Proof. Any strongly connected component in Thompson(R), as well as in Glushkov(R), corresponds to a subexpression R' under a Kleene star in R (by construction). If an IDA occurs in this subexpression in Thompson, say, for a state q and a word α ≠ ε, then there exist two distinct states q_σ, q'_σ, and σ occurs twice in R'; hence these occurrences correspond to distinct states of the Glushkov automaton. Therefore, the paths EpsClosure(q) =α⇒ EpsClosure(q_σ) and EpsClosure(q) =α⇒ EpsClosure(q'_σ) are distinct, and the corresponding ambiguity situation also occurs in the closure-merged automaton.

TABLE I: Time measurements