Localized Lama Gradual Typing

Gradual typing is a modern approach to combining the benefits of static and dynamic typing. Although scientific research aims for soundness of type systems, many languages intentionally make their type systems unsound in order to improve performance. This paper describes an implementation of a dialect of the Lama programming language that supports gradual typing with explicit annotation of dangerous parts of the code. The goal of the current implementation is to grant type safety to programs while keeping the expressive power of untyped code. The paper covers implementation issues and properties of the resulting type system. Finally, some perspectives on improving the precision and soundness of the type system are discussed.


Introduction
There are different approaches to type system implementation. Static type systems are well known for preventing many undesired program behaviors at compile time by reasoning about the values that an expression may or may not take (e.g., Java, Haskell). At the opposite end, dynamic type systems are known to be the most flexible: low compilation prerequisites and delegation of type safety to runtime allow rapid development and prototyping (e.g., Python, Racket).
Kryshtapovich

There is a combination of both approaches called «Gradual Typing». This technique has drawn a lot of attention since the article by Siek and Taha [1] was published. That article presents a sound type system for a Lisp dialect that serves as a partially typed functional language. The existence of a sound system for this model language gave rise to a lot of research in the field. However, the practical application of sound gradual systems is still questionable because of performance issues [2].
The key purpose of this article is to see how gradual typing and explicit unsafe code annotations can be integrated with each other as native language syntax. The desired result is a language that allows the programmer to control the trade-off between performance and type safety. Lama [3] version 1.00 is used as the target language of our research.
Let us imagine typical Python code: most probably it is an untyped piece of code. Surprisingly or not, only . % of repositories had type annotations as of 2020 [4]. But the idea of gradual typing is powerful: let programmers add static type information to the code expression by expression. Thus, we can convert untyped code step by step into fully statically typed code with the corresponding static guarantees. This is what is called gradual typing. On the one hand, we have the power of static annotations, which prevent us from misusing functions and modules and preserve contracts. On the other hand, we can switch the static type system off whenever we are overwhelmed by static type errors.
The most important result of the original article [1] was the soundness of the gradual type system. It was achieved by using a cast calculus and rewriting the original program with casts. A cast can be imagined as a bridge that a value crosses at runtime on its way from an untyped part of the code to a typed part.
This kind of "bridge" is annotated with a static type, and a value must conform to it when moving from a less typed part of the code to a more typed one. So the main idea is to insert casts correctly and obtain a program with the soundness property:
1) If the program does not typecheck, its execution may get stuck on a static type error that emerges at runtime (if it is possible to launch untyped programs at all).
2) If the program typechecks, it can produce only dynamic type errors or cast errors. No errors involving incompatibility of static types may occur at runtime.
In other words, if a program is accepted by a sound typechecker, it can never violate the contracts that the programmer has given to expressions in the form of types. For instance, a variable statically typed as an integer can never hold a string value.
Gradual typing has been presented in several languages and in various forms, such as:
1) Python [5,6] (MyPy [7] and PyType [8] projects);
2) Typed Racket [9];
3) JavaScript: TypeScript;
4) C# 4.0 with the dynamic keyword.
Although all of them have the gradual typing property (in the sense that not all objects have a type known at compile time), their implementations of a gradual type system differ greatly. Some are compiled into a dynamic target language: a TypeScript program, for example, is converted to pure JavaScript by compilation. Some are static by nature, like C#, and add a «dynamic» keyword which marks that an object has an unknown type until runtime. Some incorporate optional type annotations and leave them for documentation and external tools (linters, typecheckers, IDEs), as Python does.
The most noticeable fact about the state of the art in gradual typing is that industrial-level languages do not care much about the soundness of the type system. This is because of performance issues: some real programs exhibit a slowdown of over ×, likely rendering them unusable for their actual purpose.
To increase performance, many of them reduce the number of dynamic casts or remove them altogether. This leads to a trade-off between soundness and performance in a gradually typed language. To sum up, gradual typing provides a mechanism for checking program correctness with the following pros and cons:
• Types can be added ad hoc by the programmers.
• A gradual type system can be sound in certain languages (more frequently academic ones).
• Dynamic typechecks add significant overhead at runtime.
Given the diversity of implementations and approaches, it is interesting to look at the result of implementing gradual typing in a language with a different model of computation and semantics. We will test some new syntactic concepts by experimenting with the Lama programming language.
Lama is a programming language developed by JetBrains Research for educational purposes as an exemplary language to introduce the domain of programming languages, compilers and tools [3]. The most noticeable property of this language is that it is fundamentally untyped. The reference manual says that the lack of a type system is an intentional decision which makes it possible to show the unchained diversity of runtime behaviors. At the same time, the manual says that the language can be used in the future as a raw substrate for applying various ways of software verification (including type systems) [10]. So why not try to implement some kind of type system on top of it? In our work we test a new approach to combining parts of code where different rules of static verification are applied: some parts of the code are gradually typed, and some parts are left untyped. The expected result is a programming language that can mix two kinds of code:
• code with semantics that respects type safety in the necessary parts of the program (e.g., sound);
• code with the original semantics and without overhead.
This should allow the programmer to choose which parts of the program should be gradually typed and which should not be typed. Another expected result is a measurable decrease in the execution speed of gradually typed code. The slowdown may be arbitrary, but we will try to reproduce the results from the article [2] (at least × slowdown).

Examples
To give the reader a proof of concept, we consider the concrete syntax and pragmatics of pieces of code written in Lama, and describe how types are introduced into the language and what they are expected to do. Normally, code in Lama looks as follows. No types, just an anarchy of undefined behaviors:
In this example we see a function that takes x as an argument and returns a function that multiplies its input argument by 2 * x. One expects it to be used on integers, but Lama does not prevent a call like closure ("Hello, ") ("world!"), after which one can only pray that the runtime does not crash. We can use type annotations to state our intentions about the code, like so:
    2*x*y } }
If the types of x and y are known at compile time, then the types of the functions can be inferred: the inner function has type Int -> Int, and the outer function has type Int -> Int -> Int. Moreover, Lama currently supports such operations only on integer constants (Int). If we take a closer look at the untyped example, we can infer that x and y should both have type Int, because they are used in the expression 2 * x * y. From this we can further infer the function types, which makes this concrete piece of code fully typed.
At first glance type inference seems to contradict backward compatibility, because some untyped expressions become implicitly typed, as in the first example. Runtime typechecks are then inserted into parts of the code that were initially untyped, which affects their semantics. Thankfully, the developers of Lama left regression tests that check backward compatibility, so we can introduce type inference features while keeping backward compatibility in check.
Another example of typing Lama programs is pattern matching:
The A(0) notation is a so-called S-expression [11]. A quick Lama-specific introduction: an S-expression can be considered a labeled array of arbitrary values. The label must be capitalized, and the number of values is not bounded.
Two S-expression labels are considered equal in Lama if their first five letters are the same, so Branch(Leaf, Leaf, 3) and Branc(Leaf, Leaf, 3) are equal S-expressions. Note that Leaf is a nested S-expression with zero values in it: brackets are optional for zero-arity S-expressions. Side note: S-expressions such as Int and Str have the types Int :: Int() and Str :: Str(), which distinguishes them from integers (3 :: Int) and strings ("smoothie" :: Str).
Back to our processA function: if a matches A(0), then "1" is produced, and for any other value A(smth), where smth is not 0, the function produces "2". If we call processA(B(0)), we get a runtime error from pattern matching. So, other things that we would like from our type system are:
• Check that all branches cover the matched expression, i.e., that no runtime error can occur in pattern matching.
• Detect branches that can never succeed: those that are either covered by a previous branch or do not conform to the matched expression.
For example, the type system should reject this Lama program:
Here the type system can check two things. First, x = A(1) does not match any branch, so not all possible values of x are covered. Second, A(x, y) can never match values of type x :: A(Int).
Also note that functions in Lama have convenient sugar that incorporates pattern matching and can be used to check input arguments:
    public fun id2 (Abc (x, y)) :: ? { x }
    write(id2(Abc(6, 8)));
    write(id2(Xyz(6, 8))); --static fail
The last example that we should consider relates to runtime checks. Let us look at this simple piece of code:
    fun intStringer(x :: Int) { x.string }
    local dyn :: ? = "Can be anything";
    dyn := intStringer; --forget type
    dyn("input") --should it fail?
At first glance it is unclear where the problem is, because dyn("input") would reduce to "input".string and then to "input". Do we actually care that the function originally takes an Int, and should we keep this information at runtime? The answer is yes:
Of course, if we try to reduce dyn("input") we get "input" + 1, and we end up with a runtime error of casting "input" to Int. But what is the real cause of this error, and whom should we blame [12, 13, 14] for this mess: the plus operator or the input to intStringer? That is why we should check function arguments by wrapping them in appropriate dynamic casts. If we follow the blame ideology, dyn("input") fails for the same reason in both implementations: the function expected Int but was given Str. However, this solution can lead to extra checks and a decrease in execution speed.
After looking at these examples we conclude that such features would be handy in the untyped Lama language. A typechecker would decrease the number of errors made by programmers, and runtime casts would inform the programmer when untyped code does not conform to the contracts of the typed code. In the next section we define the syntax of gradual types and their semantics.
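The argument-wrapping idea discussed above can be illustrated with a minimal sketch. It is written in Python rather than Lama, and all names in it (`CastError`, `cast_function`, `int_stringer_v1`, `int_stringer_v2`) are hypothetical; the second function body is only an assumption about an implementation that uses `x + 1`. The sketch is not the GraduLama implementation, it only shows why a cast at the function boundary makes both variants fail with the same message.

```python
class CastError(TypeError):
    """Raised when a value crossing into typed code violates its annotation."""


def cast_function(fn, param_type, label):
    """Wrap a one-argument function so the argument is checked on every call.

    The check happens at the boundary, so the error blames the caller's
    input rather than whatever operation the body happens to perform.
    """
    def wrapped(arg):
        if not isinstance(arg, param_type):
            raise CastError(
                f"{label}: expected {param_type.__name__}, "
                f"got {type(arg).__name__}"
            )
        return fn(arg)
    return wrapped


def int_stringer_v1(x):
    # Models a body like `x.string`: it happens to work for any value.
    return str(x)


def int_stringer_v2(x):
    # Assumed alternative body that uses `x + 1` before converting.
    return str(x + 1)


# "Forgetting" the type: both variants are stored behind the same cast.
dyn1 = cast_function(int_stringer_v1, int, "intStringer")
dyn2 = cast_function(int_stringer_v2, int, "intStringer")
```

Without the wrapper, the first variant silently accepts a string and the second fails somewhere inside the body. With the wrapper, both calls are rejected at the boundary, at the cost of one extra check per call.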

Type Annotations Definition and Semantics
Gradual typing assumes that the user annotates parts of the program with certain types, so we should provide this feature in the Lama compiler. The syntax rules are described in the Lama specification; see p. 10 of [10] for a more detailed specification of the language syntax. We change them only slightly, because we modify only variable definitions (global and scoped), function definitions and their input parameters. The modified nonterminals are shown in fig. 1: we just add static type annotations to variable and function definitions. The nonterminal functionArguments was also slightly changed in comparison with the specification in order to respect the pattern matching sugar, which for some reason is not included in the concrete syntax definition. The other nonterminals are assumed to be taken from the section "Concrete syntax and semantics" of the specification [10].
The definition of type annotations, typeExpression, is presented in fig. 2. Its semantics (see fig. 3) is almost straightforward:
• typeAny corresponds to the dynamic type TAny, which can hold an arbitrary value;
• typeArray corresponds to an array TArr of a certain type;
• typeSexp corresponds to TSexp with the parsed UIDENT as the name of the S-expression and a list of types forming the type of the S-expression;
• typeArrow corresponds to the arrow TLambda; note that the number of input arguments can vary from zero to an arbitrary amount;
• typeUnion corresponds to TUnion and lists all types that a value can conform to.
Only the typeSexp rule with zero arity has non-straightforward semantics. If the type parameters of the S-expression type are omitted, the meaning depends on UIDENT:
• Int corresponds to integers, τ = TInteger;
• Str corresponds to strings, τ = TString;
• Void corresponds to the empty set of values, τ = TVoid;
• otherwise, it corresponds to an S-expression with the specified name and no arguments.
If typeSexp is specified with brackets, it has the straightforward semantics of an S-expression. So, for example, Cons and Cons() have the same semantics, TSexp("Cons"), but the semantics of Int and Int() differ: they are the integer type and an S-expression type, TConst and TSexp("Int") respectively.
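The special treatment of zero-arity typeSexp annotations can be sketched as follows. This is a Python illustration, not the OCaml code of the implementation: the class names mirror the constructors mentioned in the text (with TInteger standing for the integer type), while `RESERVED` and `sexp_annotation` are hypothetical helper names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TInteger:
    pass


@dataclass(frozen=True)
class TString:
    pass


@dataclass(frozen=True)
class TVoid:
    pass


@dataclass(frozen=True)
class TSexp:
    name: str
    args: Tuple = ()


# Names that denote built-in types when written WITHOUT brackets.
RESERVED = {"Int": TInteger(), "Str": TString(), "Void": TVoid()}


def sexp_annotation(name, args=None):
    """Interpret a typeSexp annotation.

    `args` is None when no brackets were written (e.g. Int, Cons) and a
    possibly empty list when brackets were written (e.g. Int(), Cons(Int)).
    """
    if args is None:
        # Zero arity without brackets: reserved names win, otherwise
        # it is an S-expression type with no arguments.
        return RESERVED.get(name, TSexp(name))
    # With brackets the annotation is always an S-expression type.
    return TSexp(name, tuple(args))
```

For instance, `sexp_annotation("Cons")` and `sexp_annotation("Cons", [])` yield the same type, while `sexp_annotation("Int")` and `sexp_annotation("Int", [])` do not.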

Typechecking Rules
Typechecking is inserted into the compilation pipeline directly after the AST (Abstract Syntax Tree) representation of the program has been built (see "src/Language.ml" and "src/Driver.ml" in the Lama source code [3]). It simultaneously performs the following procedures on the AST: type checking, type inference and cast insertion. For a detailed description of these three type system problems we need to describe the following classes of the language: expressions, values, patterns and types.
• type expressions (see fig. 3);
• expressions (see fig. 4);
• values (see fig. 5);
• patterns (see fig. 6).
There are also additional classes that are built into the implementation language (OCaml). They can be considered value classes:
• integer;
• string.
Variables are represented by OCaml strings. The set of types should be thought of as wider than the set of types induced by the type constructors of fig. 3. In other words, some types may not be expressible with the type constructors. If we simplify the compilation process a little and ignore the resolution of external symbols, the Lama parser generates an expression without Cast constructors, i.e., a pure untyped Lama expression. Notice that an expression can also contain patterns, because of pattern matching in Case expressions.
We then have several options for dealing with the generated AST. The trivial option is to leave the expression untouched and get the semantics of classic Lama. The first option is to try to typecheck the expression statically. If we succeed in obtaining a static type for the program represented as a whole expression, we can conclude that there is no static misuse of typed expressions. The second option is to transform the AST by inserting casts where values pass from untyped parts of the code to typed ones. We will build an algorithm that performs static typechecking and dynamic cast insertion simultaneously.
For typechecking we need to answer the question: does one type conform to another? The answer is given by the relation ∼, named "conforms", which is constructed from the axioms presented in fig. 7. Special attention should be paid to the TUnion type and its rules. It denotes a type that holds every value that any of its constituent types can hold. It comes naturally from such language expressions as If, Case and Return. We have chosen the set-theoretic approach to typing such expressions.
Although there is an algorithm for union contraction, the set-theoretic approach to type combination may lead to certain drawbacks in correctness and to decreased performance at compile time. Speaking about correctness: the rules ConfTUnion1 and ConfTUnion2 generally cannot prove that two type representations conform to each other even when they really do. Thus, the lack of completeness is reflected in false positives generated by the static typechecker. That means correct type-annotated Lama expressions can be rejected by the typechecker with such a definition of the relation ∼. This is a common illness of every static typechecker, because we want to check a nontrivial property of the code: being statically correct [15].

Fig. 7. Rules of conformance to the other type
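Since the rules themselves are given only in the figure, the following Python sketch uses an assumed rule set to illustrate what a conformance check with unions can look like and why it is incomplete. The assumptions are: TAny conforms to everything and everything conforms to TAny, a union on the left must conform member by member, a union on the right needs one matching member, and arrays and S-expressions are compared componentwise. The function name `conforms` and these rules are illustrative and may differ from the rules of the actual implementation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TAny:
    pass


@dataclass(frozen=True)
class TInteger:
    pass


@dataclass(frozen=True)
class TString:
    pass


@dataclass(frozen=True)
class TArr:
    elem: object


@dataclass(frozen=True)
class TSexp:
    name: str
    args: Tuple = ()


@dataclass(frozen=True)
class TUnion:
    members: Tuple


def conforms(a, b):
    """Return True if type `a` conforms to type `b` under the assumed rules."""
    if isinstance(a, TAny) or isinstance(b, TAny):
        return True
    if isinstance(a, TUnion):
        # Every alternative on the left must be acceptable on the right.
        return all(conforms(m, b) for m in a.members)
    if isinstance(b, TUnion):
        # One alternative on the right must accept the left type.
        return any(conforms(a, m) for m in b.members)
    if isinstance(a, TArr) and isinstance(b, TArr):
        return conforms(a.elem, b.elem)
    if isinstance(a, TSexp) and isinstance(b, TSexp):
        return (a.name == b.name
                and len(a.args) == len(b.args)
                and all(conforms(x, y) for x, y in zip(a.args, b.args)))
    return a == b


# Two representations of the same set of values: A(Int | Str).
wide = TSexp("A", (TUnion((TInteger(), TString())),))
split = TUnion((TSexp("A", (TInteger(),)), TSexp("A", (TString(),))))
```

Under these rules `conforms(split, wide)` holds, but `conforms(wide, split)` does not, although both types describe the same values. This is the kind of incompleteness of the union rules discussed in the text: a purely structural check can reject a pair of types that really do conform.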
The good news is that no type intersections (TIntersection) or type subtractions (TSubstraction) are needed: we try to avoid them when building the type system for Lama. Now we could define an analogue of the ∼ relation between an expression and a type. Instead, we will infer the type of the expression. To start with something simple, let us define type inference for patterns (see fig. 8). Notice that we infer both a lower and an upper bound for the pattern type. This interval-style inference is crucial for analyzing case expressions. For a pattern p, consider the set of all possible values captured by p, together with two inferred types. With the chosen type constructors and their semantics we can conclude:
• the upper bound is a type that covers all possible values captured by the pattern;
• the lower bound is a type that is covered by all possible values captured by the pattern.
For example, the value Suc(1) has type TSexp("Suc", TConst), but this value alone covers almost nothing, so TVoid ⊏ {Suc(1)} ⊏ TSexp("Suc", TConst).
Now we are ready to describe the main part of the algorithm: type inference and cast insertion for Lama expressions. We use the notation e ↦ e′ : τ. It means that expression e has type τ, and cast insertion into e produces expression e′, which has the same type τ. In addition, we have two kinds of context: a typing context Γ, which assigns types to variable names, and a set of types R, which collects information about function return types. Given a context Γ and the collected return types R, the typechecker produces another collection of return types R′ (possibly bigger than the original), the expression rewritten with casts, and its type. So the full notation of the algorithm is: Γ, R ⊢ e ↦ R′ ⊢ e′ : τ. The set of return types for an expression is initialized with ∅. Note that the initial context maps every variable occurrence to the type TAny.
The typechecker does not check whether a symbol is defined in an outer scope or correctly imported; the context is expected to provide correct surrounding type information for expressions. The angle-bracket notation in rule [InferLength] means that the top-level constructor of the type must be one of those listed in the angle brackets. In rule InferCall the cast to TAny is optional. It is used in the inference rules for consistency with the InferCall3 rule, which processes a call of an object of union type. Many of the rules, such as InferArr and InferSexp, could be simplified by removing the set of return types, because they do not change it: they recompute it for expressions that never change it in correct Lama programs. There are a few places where it is useful: the InferLambda, InferReturn1 and InferReturn2 rules. Notice that we infer the return type of a function only to confirm that it fits the type declared by the user; the declared interface does not change. But if the type is not specified by the user, the inferred type is used implicitly.
Also notice the rules in InferCase. First, we collect return types from the branches while threading the set of return types through the computation. Second, the typing context is extended with a mapping from the names bound by PNamed patterns to their types. Third, there is a check that the branches together cover the type of the matched expression. Fourth, each pattern is checked for reachability (its type must conform to the type of the matched expression), and at the same time we check that the branch is not hidden by an earlier branch: the upper bound of the pattern must not conform to the lower bound of an earlier pattern. Indeed, if the upper bound of a pattern conforms to the lower bound of an earlier one, then the values captured by the pattern ⊏ its upper bound ⊏ the lower bound of the earlier pattern ⊏ the values captured by the earlier pattern. In other words, when this relation holds, it is certain that the pattern is covered by the earlier one. In this way we eliminate the need to introduce intersection or difference types into our type system.
This does not mean that intersection and difference types cannot be handled; see [17] or [18] for examples of polymorphic type systems that handle them. The most complex rule is [InferScope]. It is intentionally simplified, because its implementation is more subtle. Here it simply overwrites a variable or function definition and updates the context. The implementation, however, also checks that previous usage corresponds to the current typing when no expression is provided for the variable. To describe that strictly we would need to introduce a class for declarations, and the rule would become even more complex. This rule leads to a new language feature, typed usage of an expression inside the scope:
The other typechecking rules are either trivial or common in the corresponding field of study [16], so we will not go deep into them. In the next section we discuss performance issues of our typechecking algorithm.
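The interval-style pattern inference and the hidden-branch check from this section can be sketched in Python as follows. The sketch rests on several assumptions that are not taken from the figures: only wildcard, integer constant and S-expression patterns are modeled, a constant pattern gets the lower bound TVoid (as in the Suc(1) example), and hiding is tested with a simple subset-style relation `covered_by` instead of the full ∼ relation. All function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TAny:
    pass


@dataclass(frozen=True)
class TVoid:
    pass


@dataclass(frozen=True)
class TInteger:
    pass


@dataclass(frozen=True)
class TSexp:
    name: str
    args: Tuple = ()


@dataclass(frozen=True)
class PWildcard:
    pass


@dataclass(frozen=True)
class PConst:
    value: int


@dataclass(frozen=True)
class PSexp:
    name: str
    args: Tuple = ()


def upper(p):
    """A type that covers every value the pattern can capture."""
    if isinstance(p, PWildcard):
        return TAny()
    if isinstance(p, PConst):
        return TInteger()
    return TSexp(p.name, tuple(upper(a) for a in p.args))


def lower(p):
    """A type whose values are all guaranteed to be captured by the pattern."""
    if isinstance(p, PWildcard):
        return TAny()
    if isinstance(p, PConst):
        return TVoid()  # a single constant guarantees no whole type
    parts = tuple(lower(a) for a in p.args)
    if any(isinstance(t, TVoid) for t in parts):
        return TVoid()  # one empty component makes the whole type empty
    return TSexp(p.name, parts)


def covered_by(a, b):
    """Assumed subset-style check: every value of `a` is a value of `b`."""
    if isinstance(a, TVoid) or isinstance(b, TAny):
        return True
    if isinstance(a, TSexp) and isinstance(b, TSexp):
        return (a.name == b.name
                and len(a.args) == len(b.args)
                and all(covered_by(x, y) for x, y in zip(a.args, b.args)))
    return a == b


def hidden_branches(patterns):
    """Indices of branches whose upper bound fits an earlier lower bound."""
    return [i for i, p in enumerate(patterns)
            if any(covered_by(upper(p), lower(q)) for q in patterns[:i])]
```

With the branch order A(0), A(smth) nothing is reported, while the reversed order reports the A(0) branch as hidden. The check is conservative: a repeated A(0) branch is not detected, because the lower bound of a constant pattern is TVoid.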

Cast Performance Analysis
The rules presented in fig. 9 introduce a new kind of expression, Cast(e, τ). Its runtime semantics is simple: when expression e evaluates to a value v, we check that v corresponds to type τ. If v conforms to τ, the result of evaluating Cast(e, τ) is v; otherwise a cast error ⊥ is produced as the result. The runtime check that a value corresponds to a type may be time-consuming, especially when the type and the expression are complex and deeply nested. Thus, we can introduce an explicit syntax for the parts of code where we do not wish casts to be inserted, like this:
The typechecker sees this annotation and completely ignores the annotated part of the code. The implementation of gradual typing for Lama offers three options for controlling the typechecking procedure:
• #NoTypecheck drops the AST from typechecking altogether;
• #StaticTypecheck disables cast insertion into the AST, but static checks are still enabled;
• #GradualTyping enables cast insertion into the AST.
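The runtime semantics of a cast described at the start of this section can be sketched in Python. The value representation is an assumption made for the example: Python integers and strings stand for Lama integers and strings, lists stand for arrays, and a hypothetical `Sexp` class stands for S-expressions. The functions `has_type` and `cast` are illustrative names, not part of the Lama runtime.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TAny:
    pass


@dataclass(frozen=True)
class TInteger:
    pass


@dataclass(frozen=True)
class TString:
    pass


@dataclass(frozen=True)
class TArr:
    elem: object


@dataclass(frozen=True)
class TSexp:
    name: str
    args: Tuple = ()


@dataclass(frozen=True)
class TUnion:
    members: Tuple


@dataclass(frozen=True)
class Sexp:
    """Assumed runtime representation of an S-expression value."""
    name: str
    args: Tuple = ()


class CastError(Exception):
    """Plays the role of the cast error ⊥."""


def has_type(v, t):
    """Recursively check that value `v` corresponds to type `t`."""
    if isinstance(t, TAny):
        return True
    if isinstance(t, TInteger):
        return isinstance(v, int) and not isinstance(v, bool)
    if isinstance(t, TString):
        return isinstance(v, str)
    if isinstance(t, TArr):
        return isinstance(v, list) and all(has_type(x, t.elem) for x in v)
    if isinstance(t, TSexp):
        return (isinstance(v, Sexp)
                and v.name == t.name
                and len(v.args) == len(t.args)
                and all(has_type(x, s) for x, s in zip(v.args, t.args)))
    if isinstance(t, TUnion):
        return any(has_type(v, m) for m in t.members)
    return False


def cast(v, t):
    """Return `v` unchanged if it corresponds to `t`, otherwise fail."""
    if has_type(v, t):
        return v
    raise CastError(f"value {v!r} does not correspond to type {t!r}")
```

The check walks the whole value, so its cost grows with the size and nesting of the value and the type, which is the source of the overhead discussed in this section.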
#StaticTypecheck and #GradualTyping annotations can be nested in order to enable or disable cast insertion while typechecking. But there is no point in nesting type-related annotations inside a #NoTypecheck annotation, because they would be completely ignored by the typechecker. Having all the power of gradual types and the unchained diversity of undefined behaviour, let us use the interpreter mode of the Lama compiler to see the slowdown in code execution. We will use the following sample code:
    if k == 0 then return 0
    elif k == 1 then return 1
    elif k < 0 then return -1
    else return fibonacci(k-1) + fibonacci(k-2)
    fi
    }
    write(fibonacci(read()))
It is not obvious where the casts are in this example, but in section 2 we noticed that the + operator coerces both of its arguments to Const at runtime, so the appropriate casts from an unknown type to TConst are inserted. Hence, this code models a situation where values frequently pass from an untyped part of the code to a typed part.
We compare this code wrapped in #GradualTyping, which is the default, with the same code without cast insertion. The average slowdown is computed from the slowdowns actually registered in the individual measurements. As we can see, the section of code with active gradual-typing runtime checks exhibits a slowdown of almost × . . Thus, we have reproduced the result of the article [2], but for Lama semantics and using this artificially small example.
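The shape of this experiment can be imitated with a small Python sketch. It does not reproduce the measurements above: it runs in Python rather than in the Lama interpreter, the helper names (`cast_int`, `fib_plain`, `fib_cast`, `timed`, `counter`) are hypothetical, and the placement of casts around the operands of + is an assumption based on the description of the sample code. It only shows how often values cross the cast boundary in such a program.

```python
import time


class CastError(TypeError):
    pass


counter = {"casts": 0}


def cast_int(v):
    """Assumed runtime cast to the integer type, counting each check."""
    counter["casts"] += 1
    if not isinstance(v, int):
        raise CastError(f"expected an integer, got {type(v).__name__}")
    return v


def fib_plain(k):
    """Version without inserted casts."""
    if k == 0:
        return 0
    elif k == 1:
        return 1
    elif k < 0:
        return -1
    else:
        return fib_plain(k - 1) + fib_plain(k - 2)


def fib_cast(k):
    """Version with a cast around each operand of +."""
    if k == 0:
        return 0
    elif k == 1:
        return 1
    elif k < 0:
        return -1
    else:
        return cast_int(fib_cast(k - 1)) + cast_int(fib_cast(k - 2))


def timed(fn, k):
    """Return the result of fn(k) and the elapsed wall-clock time."""
    start = time.perf_counter()
    result = fn(k)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    plain, t_plain = timed(fib_plain, 20)
    checked, t_cast = timed(fib_cast, 20)
    print(plain == checked, counter["casts"], t_cast / t_plain)
```

Every non-trivial call performs two casts, so the number of casts grows as fast as the number of recursive calls. The measured ratio depends on the machine and on the interpreter and should not be compared with the figures reported for Lama.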

Conclusion
We introduced a type system with the following properties:
• Monomorphic;
• Gradual.
It would be nice to introduce such features in type system as:

In future work it is desirable to use type equations and Hindley-Milner style inference with a unification algorithm, as presented in [17] and [19]. It is worth mentioning the reproduction of the result of a recent article about industrial-level languages that use gradual types unsoundly [2]. We modeled the situation of values constantly passing from untyped parts of a program to typed parts and, as expected, observed a slowdown of execution.
In addition, we have provided a simple and powerful, yet dangerous, method of controlling the trade-off between type safety and execution performance: the programmer chooses the areas of code where extra performance is needed and the areas where static and runtime type safety guarantees are needed, either with #NoTypecheck or, better, with the #StaticTypecheck and #GradualTyping annotations. The idea goes further. It would be nice to introduce other kinds of static verification sections that programmers can apply as they see fit, for instance live variable analysis (#LiveVarAnalysis) or memory access safety. The programmer would thus get a framework with a set of static verifiers and the ability to choose which guarantees are most important for a given piece of code. To sum up, the programmer controls compilation time and obtains code with the needed guarantees, all expressed in one syntax.
Even though the soundness of the type system is still questionable and should be proved or improved, several tests have been added to the codebase to check the type system, including tests that must not compile, runtime error tests and positive example tests. The introduced type system enhances the coding experience and points out at least the silly and obvious errors that programmers frequently make. Moreover, Lama's facilities have been extended with a logger that generates warning messages, mostly for case expression coverage. The implementation of gradual typing for the Lama language resides in a personal repository, in the branch named "GraduLama" [20].