Friday, September 12, 2008

Tagless Interpreters in C#

Creating DSLs is all the rage these days, and for good reason. Most abstractions are actually a little language struggling to get out. Design consists of creating abstractions with maximum power, and minimum restrictions, and reusing this abstraction as much as possible. Small domain-specific languages are the ticket.

However, language implementations are often written in fairly ad-hoc ways, and most interpreters are difficult to extend to compilers and partial evaluators. Languages are usually described by an "initial algebra", which basically just means that language terms are described by data types. Here's a simple definition of a language with integers, variables and addition:

* Expressions := [0-9]* | e1 + e2 | e1 - e2
type expression = Int of int | Add of expression * expression | Sub of expression * expression

So a program in this language can consist of integers, subtraction or addition expressions. The interpreter for this language unpacks the expression by checking the tags, then evaluates the result by dispatching to the appropriate handler.
let eval exp env = match exp with
| Int(i) -> i
| Sub(e1,e2) -> (eval e1) - (eval e2)
| Add(e1,e2) -> (eval e1) + (eval e2);;

This is fine for simple languages, but it's not very efficient since tag decode and dispatch can expensive. Eliminating tags via program analysis, like partial evaluation, yields a simple compiler of sorts, thus yielding efficient DSLs.

Furthermore, we would like to exploit the type checking of the host language to ensure the type safety of the embedded language's interpreter, and type safety of any expressions of the embedded language. Unfortunately, more sophisticated languages written using this structure make it difficult to ensure that all the cases in the language are properly handled, which then requires the use of runtime errors.

By far the simplest solution to all of these problems that I've come across, is a paper by Jacques Carette, Oleg Kiselyov, and Chung-chieh Shan, Finally Tagless, Partially Evaluated, Tagless Staged Interpreters for Simpler Typed Languages.

If I understand the terminology correctly, they use a final algebra instead of an initial algebra to describe a language. In other words, instead of using data types to describe language expressions, they use functions. The language above is described using four functions provided by a module:
module type L = sig
type exp
val Int : int -> exp
val Add : exp -> exp -> exp
val Sub : exp -> exp -> exp

An interpreter for this language is trivial:
module I : L = struct
let Int i = i
let Add i1 i2 = i1 + i2
let Sub i1 i2 = i1 -i2

Constructing an expression of this little language from within the host language is simple:
let test l:L = 
let i = l.Int 1 in
let sum = l.Add i (l.Int 2);; (* sum is now 3 *)

This embedded program abstracts over the type of the "interpreter" used to evaluate it. In other words, if we build a compiler backend for the module type L, we can use it to evaluate this term. A compiler is overkill for this simple language, but the more sophisticated the embedded language, the more attractive a compiler looks.

This toy arithmetic language just illustrates the main ideas, but the language described in the paper is actually a full-fledged programming language, with if-statements, first-class functions, etc. The authors then extend their tagless interpreter into a full fledged compiler using MetaOCaml's staging facilities, and then further extend the compiler into a partial evaluator. There are also more sections on abstracting evaluation order of the embedded language, ie. making the embedded language lazy even if the host language is strict, and self-interpretation (ie. metacircular interpretation).

It's a very exciting result, but as Oleg describes in the LtU thread linked to above, the abstraction over "interpreter" types L, requires the host language to abstract over type constructors. As I explained before, C# cannot do this. In order to implement a program which abstracts over type constructors in some way, we have to resort to dynamics. In other words, we have to use casts.

I have devised three techniques to structure these interpreters to be mostly safe. The tradeoffs are slightly different in each, but in all cases, the interpreter and all embedded language terms are type-safe as long as the client doesn't try any hijinks. If they do, the resulting runtime errors are a perfectly acceptable compromise IMO. The embedded language I describe is also a full-fledged programming language with if-statements, lambdas, and so on, and all examples are written in C#.

The first implementation uses a ML-module-to-C# translation I described in a previous post on this blog. It's actually rather complicated, and you should probably skip over it unless you're masochistic enough to want to delve into nitty gritty details. This encoding is the safest of the three however, as I believe there is no way to trigger a runtime exception except by resorting to reflection.

The second implementation uses a much more natural functional-style encoding, where the language is described by an interface, and an expression is an extensible data type Exp that must be implemented by an interpreter. All Exp types are "branded" by the type of the interpreter, so it's not possible to accidentally mix expressions from different interpreters.

However, because Exp is extensible, this leaves the interpreter open to client injections, where a client can trigger a runtime exception if they subclass Exp themselves, and inject such an Exp value into a program term. This is only a problem if a trusted component evaluates program terms built by untrusted clients. The trusted component must still be aware that runtime errors are possible if clients can be malicious, which seems like a perfectly acceptable compromise for the simplicity of this solution.

The third implementation is a more object-oriented style encoding, reminiscent of Smalltalk. Each core type is now its own class, and operations on those types are methods on that class. For instance, an if-statement is the If method on the Bool class.

The language interface consists only of constructors for the core language types. This approach is more verbose than the functional-style implementation, but it permits a more natural embedding of program terms in C#, because we can now use operators and ordinary method calls to operate on program terms. Like the second implementation, this too can throw runtime errors based on client-injected Exp types, so the same caveats apply.

It's not yet clear which of the last two implementations is preferable, but I prefer the functional-style for its brevity if you're writing a real interpreter and/or compiler for a language. Functional-style programs excel at symbol manipulation. If you're constructing embedded programs from within C# only, then the last implementation is probably preferable, since the availability of operators can make program terms less verbose.

All three implementations abstract over the type of "interpreter" for program terms. It could be a straight-laced interpreter, or it could be a code generator backend (like System.Runtime.Emit). For instance, in the post for the third implementation, I briefly describe a JavaScript-generating backend for an embedded language I'm playing with. It would allow an embedded program to transparently execute server-side via an interpreter or client-side via JavaScript depending on a client's capabilities.

Consider the possible uses for transparently displaying AJAX pages server-side for thin HTML-only clients, and client side for real browsers. In other words, you write this code only once. Witness the power of DSLs.

No comments: