Monday, November 16, 2009

Extensible, Statically Typed Pratt Parser in C#

I just completed a statically typed Pratt-style single-state extensible lexer+parser, otherwise known as a top-down operator precedence parser, for the Sasa library. The implementation is available in the Sasa.Parsing dll, under Sasa.Parsing.Pratt. Two simple arithmetic calculators are available in the unit tests.

This implementation is novel in two ways:
  1. Aside from an alleged implementation in Ada, this is the only statically typed Pratt parser I'm aware of.

  2. Pratt parsers typically require a pre-tokenized input before parsing semantic tokens, but I've eliminated this step by using the symbol definitions to drive a longest-match priority-based lexer.

Symbols by default match themselves, but you can optionally provide a scanning function used to match arbitrary string patterns. The symbol selected for the current position of the input is the symbol that matches the longest substring. If two symbols match equally, then the symbol with higher precedence is selected.

The design was heavily influenced by this Python implementation, though numerous departures were made to restore static typing and integrate lexing. In particular, symbols and semantic tokens are separate, where symbols are used primarily during lexing, and construct the appropriate semantic tokens on match.

Here is a simple arithmetic calculator:
class Calculator : PrattParser<int>
{
public Calculator()
{
Infix("+", 10, Add); Infix("-", 10, Sub);
Infix("*", 20, Mul); Infix("/", 20, Div);
InfixR("^", 30, Pow); Postfix("!", 30, Fact);
Prefix("-", 100, Neg); Prefix("+", 100, Pos);

// grouping symbols, ie. "(_)"
Group("(", ")", int.MaxValue);

// provide a predicate to identify valid literals
Match("(digit)", char.IsDigit, 1, Int);

// ignore whitespace
SkipWhile(char.IsWhiteSpace);
}

int Int(string lit) { return int.Parse(lit); }
int Add(int lhs, int rhs) { return lhs + rhs; }
int Sub(int lhs, int rhs) { return lhs - rhs; }
int Mul(int lhs, int rhs) { return lhs * rhs; }
int Div(int lhs, int rhs) { return lhs / rhs; }
int Pow(int lhs, int rhs) { return (int)Math.Pow(lhs, rhs); }
int Neg(int arg) { return -arg; }
int Pos(int arg) { return arg; }
int Fact(int arg)
{
return arg == 0 || arg == 1 ? 1 : arg * Fact(arg - 1);
}
}

The above is a simplified version of the calculators used in the unit tests. You can clearly see the definition of left and right-associative infix operators, and postfix operators. The numbers are precedence levels, and the delegates are the semantic actions associated with each token type.

There are pre-defined functions for creating left and right-associative infix, prefix, postfix, and both prefix and infix ternary operators -- prefix ternary is a simple "if _ then _ else _", infix ternary is like C's ternary operation, "_ ? _ : _". You are not constrained to these symbol types however, as you can define arbitrary character sequences as symbols and parse the results at your whim.

Adding variable binding to the above calculator requires a few extensions to support lexically scoped environments, but the parser requires only the following two extensions:
// symbol for identifiers, ie. variable names
Match("(ident)", char.IsLetter, 0,
name => new Var { Name = name });
// let binding, ie. "let x = 1 in x", is a
// prefix-ternary operator
TernaryPrefix("let", "=", "in", 90, Let);

"let" is not matched as an identifier because the "let" symbol has a higher precedence. Parser definitions with ambiguous parses thus require some care to resolve the ambiguity either via match length or precedence relationships.

There are two popular choices when it comes to parsing: parser generator tools and parser combinator libraries. I've never liked generators personally, as they require a separate language and entirely separate tools to define and process the grammar thus increasing the learning and maintenance curve.

I actually like parser combinators, and there are now good techniques for supporting arbitrary left recursion in grammars (see Packrat parsing in OMeta). However, handling operator precedence often involves a series of complex extensions to the otherwise simple recursive descent, or it requires the parsed results to maintain all operator and grouping information so a post-processing phase can readjust the parse tree to reflect the proper priorities. The post-processing phase typically uses a shunting yard or Pratt-style operator precedence algorithm to resolve precedence, so why not just build the parser itself directly on Pratt parsing?

Unlike parser-combinator libraries, of which I've built a few, Pratt parsers natively support arbitrary operator precedence and are efficient as they require no backtracking. Pratt parsers are also Turing-complete, because the semantic actions can perform arbitrary computations on the parser input. Turing completeness carries both benefits and costs of course, but the single-state nature of Pratt parsers keeps it manageable. The "single state" of the parser is simply the current token matched from the input.

For example, if prefix-ternary operators weren't already available, you could simply define one like so:
// "let" is a prefix symbol on a name
var let = Prefix("let", 90, x =>
{
// lexer matched "let" prefix operator on x
// argument, a variable name
Advance("="); // advance past the "="
var xValue = Parse(0); // parse the value to bind to x
Advance("in"); // advance past the "in"
var body = Parse(0); // parse the let expression body
// return a Let expression node
return new Let{ Variable = x, Value = xValue, Body = body};
});
// declare the other symbols involved for the lexer
// but they have no associated semantic action
Symbol("=");
Symbol("in");

The above builds on "Prefix", but you could break this down even further as in Pratt's original paper and define arbitrary symbol behaviour by providing a scanning function, "left denotation", and "null denotation", and so on.

To define a parser, you need only inherit from Sasa.Parsing.Pratt.PrattParser, provide the semantic type T, and override the constructor for your parser to define your symbol table. The only downside is that each parser instance is monolithic and not thread-safe, so it can only parse one input at a time. There is no limit to how many parsers you can create however.

Pratt parsing is simple, compact, and efficient, so I hope it finds more widespread use, particularly now that there is a statically typed implementation for a popular language.

[Edit: the PrattParser will be available in the upcoming v0.9.3 release of the Sasa class library.]

12 comments:

Eoin said...

Asked this out loud on reddit and then realized that you could probably provide a way better answer ;-)

I wonder why you specify a type argument to the Parser super class though? Does it automagically reduce the parsed input stream down to a result of that type? Or is that what it excepts every identifier node to have a has a (base) type for its value?

Sandro Magi said...

Not automagically, no. T is the type of the value parsed from the input. You provide callbacks that transform tokens from strings to T, or transform one T to another T depending on the operator invoked. The type parameter T just ensures that the functions are working with the correct types; it defines a contract that the implementation must fulfill.

A good example is in the unit tests I linked to. There's actually an intermediate MathParser abstract class where I keep T abstract, and merely define the grammar of the calculator's expressions and the required operations, ie. add, subtract, etc. I then inherit from this twice, once to implement the simple calculator operating on int as described in this post, and a second time to implement a more sophisticated calculator with lexically scoped variables which operates on an abstract math expression type Exp. Anyone wanting to implement their own calculator can simply inherit from MathParser like I did, just fill in the details, and the type system will ensure that the contract is fulfilled.

In both cases the structure of the interpreter is the same, and all that changes is the type of value I'm operating on. Generics like this provide a good way to specify contractual obligations.

Rohan said...

Statically typed: While I was teaching myself Haskell last year I translated Douglas Crockford's JavaScript parser directly. The (incorrectly coloured) source can be seen here. To make this a generic parser the function initial_symbol_table should be passed in.

Sandro Magi said...

I had an inkling that I had once come across a Haskell Pratt parser, but couldn't find it when I was searching for it. Thanks for the pointer Rohan.

I believe yours and Crockford's implementation supports a few more features than does mine, in particular some form of operator scoping, but I haven't delved into it yet. I look forward to browsing the code.

Rohan said...

Any excuse to increase my Google rank ;-)

Moving the symbol table from State to Reader monads might work. I was doing something similar early on but ended up stuffing everything into State to avoid monad transformers.

Feel free to change/mutilate that code as you like.

Peter Goodman said...

Last summer I created a Pratt EDSL in C++ where the parser is resumable, i.e. one streams tokens to the parser rather than having it take in tokens. It supports some neat things like grouping denotations into categories, and when specifying binding precedences, one can limit them to a specific category. Finally, it allows new operators to be added at runtime.

Looking back on it, the implementation is ugly, but you can find it here: http://projects.ioreader.com/pag/src/tools/tdop/

An example of it being used is here: http://projects.ioreader.com/pag/src/tools/xy/lib/Grammar.cpp

Sandro Magi said...

I admit my C++fu is weak. I think I understand what you mean, where the tokens are pushed into the parser rather than pulled in, but I'm not sure this is the type of "resumption" I'm after. The existence of a token stream abstraction of some sort is all that's needed to make it resumable in this sense (whether it's pull or push based doesn't really matter).

What I'm after is to make the parser a pure function of sorts, where we can capture its state at any point, modify the input after that point, then resume parsing with the remainder. This sort of incremental/resumable parsing is done in IDEs for instance, ie. for intellisense. Does your approach allow this application?

Re: operators at runtime, I left that for future work. The current structures only allows static operators with custom precedence, or general operators with fixed precedence. General operators with custom precedence will come eventually when I have need.

Peter Goodman said...

Sandro, my parser is not pure and so it cannot be resumed from some point in the past. However, if one were to capture its previous state in its entirety (the parsing stack, whatever other state variables) then it could be resumed/rewound. If, what you mean, is to allow changing of the token stream while parsing then that is possible, but much less interesting.

The goal of streaming was two-fold: First, to separate the lexer from the parser. Second, separate error handling from the parser proper, i.e. allow me to figure out what went wrong outside of the parser, at the same "level" as the code outside of the parser. This last part was nice because the parser has no real understanding of what it's parsing and so it was up to my code to impose a meaning on an error.

One interesting thing I did that allowed for scoped user-defined operators is for denotation categories to have a single fall-back category. That is, if no denotation can be applied in the current category then parser tries to keep going using the fall-back category. Naturally, this linked-list of fall-back categories maps to the stack of lexical scopes. What I did was to have an empty sentinel category with a fall-back category of whatever basic operators I wanted to support. Then, when I would enter a lexical scope, I could push a new category into the list immediately after the sentinel. When the lexical scope was left, I would remove the category immediately following the sentinel. This allowed me to statically specify the sentinel in the grammar but still have the scoped operators.

Sandro Magi said...

If, what you mean, is to allow changing of the token stream while parsing then that is possible

I mean something like this, yes: capture the entire parsing state at certain points of interest, and having the ability to change the downstream tokens, then resume. As I said, this makes incremental parsing applications, like Intellisense, possible using the same parser.

Second, separate error handling from the parser proper, i.e. allow me to figure out what went wrong outside of the parser, at the same "level" as the code outside of the parser.

Can you describe how you accomplish this? Better parse errors is definitely something I'm interested in. I don't quite see how code outside the parser can provide better information using only the unstructured text, because it seems you would need a parser to make sense of everything leading up to it anyway. What's the protocol between this error code and the parser and/or token stream?

Naturally, this linked-list of fall-back categories maps to the stack of lexical scopes.

Thanks, this gave me an idea for supporting something similar in my implementation.

Peter Goodman said...

Can you describe how you accomplish this?

The parser is able to detect the following simple errors, as all errors are really one of the following: missing null denotation, missing left denotation, unexpected terminal, and unexpected end-of-input.

The parser returns an error code for each of these plus as much context as it can safely provide.

What's the protocol between this error code and the parser and/or token stream?

The error context contains whatever tokens were expected (I allowed ordered choice of denotations, where, for example, the first non-left binding environment denoted by a precedence acted as a cut, stopping the parser's ability to backtrack and try the next denotation), the denotation that was being parsed, the current denotation category, and whatever partial "result" the parser has constructed.

The current denotation category, along with the expected and failed tokens were usually enough to give context to an error. I think this shows another advantage of parser categories as they allow one to take a token and explicitly give it a different meaning in a different category.

I was satisfied with this level of parse errors, although if you are constructing an explicit parse tree then the partial tree could also be used.

hyperion said...

Is it possible to parse ((bool) ? : ) construction?

Sandro Magi said...

Yes, e?e:e is a ternary infix operator, ie. three-symbol operator whose first symbol is in infix position. The TernaryInfix method allows you to specify this sort of operator.