Friday, March 15, 2013

Sasa.Parsing - An Overview

Continuing my series of posts describing the new features in Sasa v0.9.4, I'm going to cover the new Sasa.Parsing.dll assembly. As the name implies, the assembly provides abstractions to define grammars and build parsers.

Sasa.Parsing.Lexing is the namespace that provides abstractions for lexical analysis. Sasa.Parsing.Pratt is the namespace that provides an abstraction to specify a grammar that will be lexed and parsed by a Pratt parser, aka a top-down operator precedence parser.

Sasa.Parsing.Lexing

Understanding lexing in Sasa starts with a very simple lexing interface that is general enough to tokenize pretty much any sort of string:

/// <summary>
/// Abstract lexer construction interface.
/// </summary>
/// <typeparam name="T">The type of a built lexer.</typeparam>
/// <typeparam name="TLexerState">The lexer state.</typeparam>
public interface ILexerBasic<T, TLexerState>
    where TLexerState : ILexerState, new()
{
    /// <summary>
    /// Lex nothing.
    /// </summary>
    /// <returns>
    /// A lexer that matches nothing, but returns trivially true as
    /// matched.
    /// </returns>
    T Empty { get; }

    /// <summary>
    /// Lex a character terminal.
    /// </summary>
    /// <param name="character">The character to match.</param>
    /// <returns>
    /// A lexer that matches <paramref name="character"/>.
    /// </returns>
    T Terminal(char character);

    /// <summary>
    /// Choose amongst a set of possible lexes.
    /// </summary>
    /// <param name="lexers">The list of lexers to apply.</param>
    /// <param name="merge">
    /// A function that collapses the set of lexes.
    /// </param>
    /// <returns>
    /// A lexer that matches given any of <paramref name="lexers"/>.
    /// </returns>
    T Alternate(LexMerge<TLexerState> merge, params T[] lexers);

    /// <summary>
    /// Compose a series of lexers.
    /// </summary>
    /// <param name="lexers">The list of lexers that must match, in
    /// sequence.</param>
    /// <returns>
    /// A lexer that matches an ordered sequence of lexers.
    /// </returns>
    T Sequence(params T[] lexers);

    /// <summary>
    /// Construct a recursive lexer.
    /// </summary>
    /// <param name="lexer">A function that builds a recursive lexer
    ///  given the current lexer.</param>
    /// <returns>A recursive lexer.</returns>
    T Recursive(T lexer);

    /// <summary>
    /// Applies a tag to the lexer state if lexing succeeds.
    /// </summary>
    /// <param name="identifier">
    /// The identifier to associate with the lexing rule.
    /// </param>
    /// <param name="lexer">The lexer to wrap.</param>
    /// <returns>
    /// A lexer that sets the identifier on success.
    /// </returns>
    T Tag(string identifier, T lexer);
}

The basic interface defines the operations from which you can construct a lexer that can recognize individual characters, compose a fixed sequence of lexers, choose from a fixed set of alternate lexers, or recursively apply a lexer until it terminates. The "Tag" operation simply adds a descriptive name to the lexer for debugging purposes. This interface is sufficient to lex pretty much any sort of token stream you'd need to, but won't necessarily build the most efficient lexer for complex strings.

The Alternate operation additionally accepts a "merge" parameter of the following delegate type:

/// <summary>
/// A lex merging function used to process all alternates using
/// only the stack.
/// </summary>
/// <typeparam name="T">The type of lexer state.</typeparam>
/// <param name="root">The actual lexer state returned.</param>
/// <param name="original">
/// The original lexer state before applying alternates.
/// </param>
/// <param name="current">
/// The lexer state from the current alternate.
/// </param>
public delegate void LexMerge<T>(ref T root, T original, ref T current)
    where T : ILexerState;

The Alternate lexer processes a series of possible lexes, and then invokes the LexMerge function to choose among them. LexMerge can keep all the lexes and the parser would return a parse forest, or it can apply some set of rules to choose the best lex to use, eg. choose longest match, or enforce only a single match.

We add efficiency to the basic lexing interface with the extended lexer interface:

/// <summary>
/// An extended lexer interface providing complex, performance-optimized
/// lexer combinators.
/// </summary>
/// <typeparam name="T">The type of the built lexer.</typeparam>
/// <typeparam name="TState">The lexer state.</typeparam>
public interface ILexer<T, TState> : ILexerBasic<T, TState>
    where TState : ILexerState, new()
{
    /// <summary>
    /// Lex a string terminal.
    /// </summary>
    /// <param name="x">The string literal to match.</param>
    /// <returns>A lexer that matches <paramref name="x"/>.</returns>
    /// <remarks>
    /// This is merely a performance optimization. The following code
    /// snippet demonstrates the equivalence:
    /// <code>
    /// Literal("foo") == ILexerBasic.Sequence(
    ///                   ILexerBasic.Terminal('f'), 
    ///                   ILexerBasic.Terminal('o'), 
    ///                   ILexerBasic.Terminal('o'))
    /// </code>
    /// </remarks>
    T Literal(string x);

    /// <summary>
    /// Lex a character range.
    /// </summary>
    /// <param name="lower">
    /// The inclusive lower bound on the character range.
    /// </param>
    /// <param name="upper">
    /// The inclusive upper bound on the character range.
    /// </param>
    /// <returns>A lexer that matches a character range.</returns>
    /// <remarks>
    /// This is functionally equivalent to the base lexer interface as 
    /// follows:
    /// <code>
    /// Range('A', 'z') == ILexerBasic.Alternate(
    ///                    ILexerBasic.Terminal('A'),
    ///                    ILexerBasic.Terminal('B'),
    ///                    ...,
    ///                    ILexerBasic.Terminal('z'));
    /// </code>
    /// </remarks>
    T Range(char lower, char upper);

    /// <summary>
    /// Match a character using a predicate.
    /// </summary>
    /// <param name="matches">The predicate to check.</param>
    /// <returns>A lexer that matches the given predicate.</returns>
    /// <remarks>
    /// Whatever predicate is provided can be matched by a composition
    /// of the base lexer combinators.
    /// </remarks>
    T Where(Predicate<char> matches);

    /// <summary>
    /// A lexer that repeatedly attempts to match an underlying lexer.
    /// </summary>
    /// <param name="lexer">The underlying lexer.</param>
    /// <returns>A lexer that matches one or more times.</returns>
    /// <remarks>
    /// This is equivalent to:
    /// <code>
    ///  OneOrMore(X) ==  ILexerBasic.Recursive(x =>
    ///                   ILexerBasic.Sequence(X, x))
    /// </code>
    /// </remarks>
    T OneOrMore(T lexer);

    /// <summary>
    /// A lexer that repeatedly attempts to match an underlying lexer.
    /// </summary>
    /// <param name="lexer">The underlying lexer.</param>
    /// <returns>A lexer that matches zero or more times.</returns>
    /// <remarks>
    /// This is equivalent to:
    /// <code>
    /// ZeroOrMore(X) == ILexerBasic.Recursive(x =>
    ///                  ILexerBasic.Alternate(ILexerBasic.Empty, X, x))
    /// </code>
    /// </remarks>
    T ZeroOrMore(T lexer);

    /// <summary>
    /// A lexer that matches any one of a number of characters.
    /// </summary>
    /// <param name="chars"></param>
    /// <returns></returns>
    /// <remarks>
    /// This is equivalent to:
    /// <code>
    /// Any(c0, c1, ... cN) == ILexerBasic.Alternate(
    ///                        ILexerBasic.Terminal(c0),
    ///                        ILexerBasic.Terminal(c1),
    ///                        ...,
    ///                        ILexerBasic.Terminal(cN))
    /// </code>
    /// </remarks>
    T Any(string chars);

    /// <summary>
    /// A lexer that succeeds whether or not it matches.
    /// </summary>
    /// <param name="lexer">The lexer to apply.</param>
    /// <returns>An optional lexer.</returns>
    /// <remarks>
    /// This is equivalent to:
    /// <code>
    /// Optional(X) == ILexerBasic.Alternate(ILexerBasic.Empty, X)
    /// </code>
    /// </remarks>
    T Optional(T lexer);

    /// <summary>
    /// Lex using a regular expression.
    /// </summary>
    /// <param name="format">The regex to use.</param>
    /// <param name="options">The regex options.</param>
    /// <returns>
    /// A lexer that scans using a regular expression.
    /// </returns>
    T Regex(string format, RegexOptions options = RegexOptions.None);
}

These extended lexers don't add any lexing power, since they can be implemented in terms of the basic lexer operations as the code comments show. However, the implementations were added to optimize performance.

Given any string, here are some examples of building lexers:

ILexer<T, TState> x = ...
var whitespace   = x.OneOrMore(x.Where(char.IsWhiteSpace));
var digits       = x.OneOrMore(x.Where(char.IsDigits));

// the following recognizes valid variable names in some programming
// language ie. variables must start with a letter, followed by a
// sequence of letters or numbers
var variableName = x.Sequence(x.Where(char.IsLetter),
                              x.ZeroOrMore(x.Where(char.IsLetterOrDigit)))

In general, the lexer takes a string as input and produces a series of outputs designating the tokens. For more detailed information, you can review ILexerState and the lexer implementation details, but for the most part you will never need to know these details since lexing is completely abstract when defining grammars.

Sasa.Parsing.Pratt

Having covered lexing and produced a stream of tokens, we can now move to parsing those tokens using a simple, yet efficient Pratt parser. Pratt parsing performs no backtracking and is technically Turing-complete, so you should be able to parse as complicated a token stream as you like. However, we can start simply with the canonical calculator example. Here's an interface that defines the semantics of simple arithmetic:

/// <summary>
/// The semantics of a math language.
/// </summary>
/// <typeparam name="T">The type of math values.</typeparam>
interface IMathSemantics<T>
{
    T Int(string lit);
    T Add(T lhs, T rhs);
    T Sub(T lhs, T rhs);
    T Mul(T lhs, T rhs);
    T Div(T lhs, T rhs);
    T Pow(T lhs, T rhs);
    T Neg(T arg);
    T Pos(T arg);
    T Fact(T arg);
}

This clearly defines simple arithmetic operations on an abstract number type T. Here's a simple implementation for T = int:

/// <summary>
/// A simple arithmetic interpreter.
/// </summary>
sealed class MathInterpreter : IMathSemantics<int>
{
    public int Int(string lit) { return int.Parse(lit); }
    public int Add(int lhs, int rhs) { return lhs + rhs; }
    public int Sub(int lhs, int rhs) { return lhs - rhs; }
    public int Mul(int lhs, int rhs) { return lhs * rhs; }
    public int Div(int lhs, int rhs) { return lhs / rhs; }
    public int Pow(int lhs, int rhs) { return (int)Math.Pow(lhs, rhs); }
    public int Neg(int arg) { return -arg; }
    public int Pos(int arg) { return arg; }
    public int Fact(int arg)
    {
        return arg == 0 || arg == 1 ? 1 : arg * Fact(arg - 1);
    }
}

There shouldn't be anything too surprising here, so let's proceed to defining the parser for this math language. We start defining a grammar by inheriting from Grammar<T>:

/// <summary>
/// The structure of abstract arithmetic parsers.
/// </summary>
/// <typeparam name="T">The type of element being parsed.</typeparam>
class MathGrammar<T> : Sasa.Parsing.Pratt.Grammar<T>

Grammar inherits from a concrete implementation of ILexer, and thus exposes all the lexing operations, and additionally some operations for declaring parsing rules. Here's the full math parser:

/// <summary>
/// The structure of abstract arithmetic parsers.
/// </summary>
/// <typeparam name="T">The type of element being parsed.</typeparam>
class MathGrammar<T> : Sasa.Parsing.Pratt.Grammar<T>
{
    public MathGrammar(IMathSemantics<T> math)
       : base(Lexers.LongestMatch)
    {
        Infix("+", 10, math.Add);   Infix("-", 10, math.Sub);
        Infix("*", 20, math.Mul);   Infix("/", 20, math.Div);
        InfixR("^", 30, math.Pow);  Postfix("!", 30, math.Fact);
        Prefix("-", 100, math.Neg); Prefix("+", 100, math.Pos);

        Group("(", ")", int.MaxValue);
        Match("(digit)", OneOrMore(Where(char.IsDigit)), 0, math.Int);
        Skip("(whitespace)", OneOrMore(Where(char.IsWhiteSpace)));
    }
}

The MathGrammar accepts a math interpreter as an argument, and while declaring various parsing rules, registers callbacks into the interpreter to process the parsed values. The base interface of Sasa.Parsing.Pratt.Grammar accepts a LexMerge function to choose among alternate lexes. Sasa parsing provides a number of choices, though longest match will probably meet the most common needs.

Proceeding with operator declarations, here's the signature for the Infix operators:

/// <summary>
/// Left-associative infix symbol.
/// </summary>
/// <param name="op">The operator symbol.</param>
/// <param name="bindingPower">The binding power of the operator.</param>
/// <param name="selector">
/// The function transforming the infix token.
/// </param>
/// <returns>A right-associative operator symbol.</returns>
Rule<T> Infix(string op, int bindingPower, Func<T, T, T> selector)

Rule<T> is a parsing rule that encapsulates the lexer, the precedence of the operator, and the function to apply when this rule is triggered. The expanded declaration for the addition operator should now be clear:

Infix(op: "+", bindingPower: 10, selector: math.Add);

The "op" parameter is simply a shortcut that declares the operator symbol to associate with this rule, and creates a lexer that recognizes only that symbol. The parser triggers this rule when it recognizes a token stream where this symbol appears in infix position, ie. where this symbol is surrounded by two valid parses.

The "bindingPower" parameter is the absolute precedence of the declared operator. You can see from the math grammar above, that addition has lower precedence than multiplication and equal precedence to subtraction, just as we've all learned in elementary arithmetic. Given the definition of Infix, the Postfix and Prefix should be obvious as declaring postfix and prefix operators. Infix is left-associative by default, and InfixR is simply Infix with a right-associative default.

The Group rule is defined as follows:

/// <summary>
/// A grouping operator, typically parenthesis of some sort.
/// </summary>
/// <param name="bindingPower">The binding power of the operator.</param>
/// <param name="leftGrouping">The left grouping symbol.</param>
/// <param name="rightGrouping">The right grouping symbol.</param>
/// <returns>A grouping operator.</returns>
Rule<T> Group(string leftGrouping, string rightGrouping, int bindingPower)

It simply declares that a token stream that starts with "leftGrouping", followed by some other parse, and terminated by "rightGrouping", should trigger this rule with the given precedence. Brackets have the highest precedence of all symbols in arithmetic, so the math parser declares groupings with precedence = int.MaxValue.

Grammar's Match rule is where the parsing all starts, since it recognizes and parses numbers. Here is the signature:

/// <summary>
/// A rule that matches a custom lex.
/// </summary>
/// <param name="id">The identifier for this symbol.</param>
/// <param name="lex">The lexer for this rule.</param>
/// <param name="bindingPower">Operator's left binding power.</param>
/// <param name="parse">
/// Parsing function taking a string to a <typeparamref name="T"/>.
/// </param>
/// <returns>Literal symbol.</returns>
Rule<T> Match(string id, TLexerState lex, int bindingPower,
                    Func<string, T> parse)

The "id" parameter is simply a unique identifier for this rule, similar to how tagging is used for identifying lex rules. The Match rule accepts an arbitrary lexer as a parameter so you can recognize any sort of token sequence as belonging to this rule. The "bindingPower" is the precedence of this rule, as before, and the "parse" parameter is where we actually take the string recognized by the lexer, and turn it into a real value produced by the parser.

The only values we're concerned with are numbers, so the declaration for parsing numbers should be clear:

Match("(digit)", OneOrMore(Where(char.IsDigit)), 0, math.Int);

This declares a rule called "(digit)" with a lexer that recognizes characters that are digits. The rule has precedence of 0, so it has the lowest of all the rules in math. When a number is recognized, it then dispatches into the math interpreter which can turn a digit string into a value of type T, which are the values operated on by that interpreter.

What's particularly noteworthy in this design is that the types of values produced by the parser are completely abstract, and so we don't even need to produce actual numbers for values. For instance, we can create an implementation of IMathSemantics that produces an abstract syntax tree describing the expression. Furthermore, we can actually extend the parser definition more or less arbitrarily to add new operations, new parseable values, etc.

Extensible Parsers

Starting with IMathSemantics, we can easily extend the math language by inheriting from IMathSemantics and adding variable declarations and applications:

/// <summary>
/// The semantics of an equation language.
/// </summary>
/// <typeparam name="T">The type of equation values.</typeparam>
interface IEquationSemantics<T> : IMathSemantics<T>
{
    T Var(string name);
    T Let(T x, T value, T body);
}

Note that the type T must be able to support both values and bindings to values, so we can't provide an implementation of math semantics where T = int. We must add a notion of an environment into which bindings are injected. This is pretty straightforward. Here's the type we'll use to unify numbers and environments:

delegate Pair<Env<string, int>, int> Eval(Env<string, int> env);

So T will be a delegate that accepts an environment as a parameter, and produces a new environment and an integer value as a result. Here's the corresponding implementation of IEquationSemantics:

class EquationSemantics : IEquationSemantics<Eval>
{
    public Eval Var(string name)
    {
       return e => Tuples.Create(e.Bind(name, 0),
                      e.FirstOrDefault(x => x.Key == name).Value);
    }
    public Eval Int(string lit)
    {
        return e => Tuples.Create(e, int.Parse(lit));
    }
    public Eval Add(Eval lhs, Eval rhs)
    {
        return e => Tuples.Create(e, lhs(e).Second + rhs(e).Second);
    }
    public Eval Sub(Eval lhs, Eval rhs)
    {
        return e => Tuples.Create(e, lhs(e).Second - rhs(e).Second);
    }
    public Eval Mul(Eval lhs, Eval rhs)
    {
        return e => Tuples.Create(e, lhs(e).Second * rhs(e).Second);
    }
    public Eval Div(Eval lhs, Eval rhs)
    {
        return e => Tuples.Create(e, lhs(e).Second / rhs(e).Second);
    }
    public Eval Pow(Eval lhs, Eval rhs)
    {
        return e => Tuples.Create(e, (int)Math.Pow(lhs(e).Second,
                                        rhs(e).Second));
    }
    public Eval Neg(Eval arg)
    {
        return e => Tuples.Create(e, -arg(e).Second);
    }
    public Eval Pos(Eval arg)
    {
        return e => Tuples.Create(e, arg(e).Second);
    }
    int Fact(int arg)
    {
        return arg == 0 || arg == 1 ? 1 : arg * Fact(arg - 1);
    }
    public Eval Fact(Eval arg)
    {
        return e => Tuples.Create(e, Fact(arg(e).Second));
    }
    public Eval Let(Eval x, Eval value, Eval body)
    {
        return e =>
        {
            var name = x(e).First.First();
            var bound = value(e).Second;
            var eval = body(e.Bind(name.Key, bound));
            return Tuples.Create(e, eval.Second);
        };
    }
}

A little more involved than the simple math language, but most of that is due to passing around the environment. The original operations are all essentially the same, just operating on the tuple's second int value. Only the new Let/Var operations are markedly different. Let adds a new binding to the environment, and var looks up the most closest binding with the given name in the current environment.

The parser for this extended language is actually quite trivial:

/// <summary>
/// A parser for math expressions with lexically-scoped variables.
/// </summary>
/// <typeparam name="T">The type of equation values.</typeparam>
class EquationGrammar<T> : MathGrammar<T>
{
    public EquationGrammar(IEquationSemantics<T> eq) : base(eq)
    {
        Match("(ident)", Sequence(Where(char.IsLetter),
              ZeroOrMore(Where(char.IsLetterOrDigit))), 0, eq.Var);
        TernaryPrefix("let", "=", "in", 90, eq.Let);
        Disambiguate("let", "in", "(ident)");
    }
}

Like the semantics, the equation grammar inherits the implementation of the simple math grammar and simply adds a few new rules. The Match rule identifies variable names that must start with a letter, followed by an arbitrarily long sequence of letters and numbers.

The TernaryPrefix rule identifies operators that have the following form: PREFIX expression INFIX expression POSTFIX. This is again a shortcut for defining a common operator type. In this case, we use the let-binding syntax from F# and other ML languages, ie. let x = 3 in x + 1.

The Disambiguate declaration is intended to simplify the common case where we only want a single valid parse. Basically, the keywords "let" and "in" also match the "(ident)" rule for variable names. Without the disambiguate declaration, some parses would be ambiguous, and since we want to produce a single parse for any given string, Disambiguate allows you to specify the priority of the parses. In this case, the "let" operator rule dominates over the identifier rule, so "let" will never be parsed as an identifier.

As with MathGrammar, the concrete type of the parse, T, is completely abstract, so we can produce values, or parse trees, or any other type that can satisfy the constraints of IEquationSemantics. This parsing design allows independent extensions of grammars, and parsed values, the latter of which can include direct interpreters, compilers, or other data structures that satisfy the required semantics.

Turing-Complete Parsing

So far we've reviewed simple parses with a set of pre-defined parse structures, like infix, prefix and ternary operators. However, Pratt parsing is Turing-complete, so the parsing can be arbitrarily complex. Digging into this machinery will demonstrate how the above operator rules were created, and how you can define your own operator rules.

I will demonstrate with a simple example of parsing lists of numbers:

sealed class ListGrammar : Grammar<IEnumerable<int>>
{
    public ListGrammar()
        : base(Lexers.SingleMatch)
    {
        Match("(int)", OneOrMore(Where(char.IsDigit)), 0, Int);
        var rule = Rule(new Rule<IEnumerable<int>>("(list)")
        {
            Lex = Terminal('['),
            LeftBindingPower = 10,
            NullDenotation = parser =>
            {
                var list = new List<int>();
                for (var tok = parser.Token;
                     tok.Id == "(int)";
                     tok = parser.Token)
                {
                    list.AddRange(parser.Parse(0));
                }
                parser.Advance("]");
                return list;
            }
        });
        Symbol("]");
        Skip("(whitespace)", OneOrMore(Where(char.IsWhiteSpace)));
    }
    static IEnumerable<int> Int(string s)
    {
        yield return int.Parse(s);
    }
}

I haven't bothered separating out the semantics for numbers and lists into its own interface for this simple example. The basic idea of defining parsing rules is defining the symbols that should appear in the token stream, and then registering either a NullDenotation, or a LeftDenotation, depending on the type of rule it's suppose to be.

A NullDenotation is a rule that has no parsed values to its left. The rule simply recognizes the starting tokens, and then takes the current parser as an argument and proceeds to manually process a token stream, ultimately returning a parsed value:

NullDenotation: Func<PrattParser<T>, T>

A simple example of a null denotation is a prefix operator like negation. Parses to the right of a negation operator exist, but none to its left.

A LeftDenotation is a rule that expects some parses to its left. The signature of a LeftDenotation is:

LeftDenotation: Func<Token<T>, PrattParser<T>, T, T>

A LeftDenotation accepts the token representing the operator, the current parser, and the parsed value immediately to the left. Infix and Postfix operators are simple examples of a LeftDenotation.

The operation of the Infix, Postfix and Prefix declarations previously covered should now be clear:

Rule<T> Infix(string op, int bindingPower, Func<T, T, T> selector)
{
    var sym = Symbol(op, bindingPower);
        sym.LeftDenotation = (tok, parser, left) =>
                             selector(left, parser.Parse(bindingPower));
    return sym;
}
Rule<T> Postfix(string op, int bindingPower, Func<T, T> selector)
{
    var sym = Symbol(op, bindingPower);
        sym.LeftDenotation = (tok, parser, left) => selector(left);
    return sym;
}
Rule<T> Prefix(string op, int bindingPower, Func<T, T> selector)
{
    // do not set binding power since no left denotation
    var sym = Symbol(op, 0);
        sym.NullDenotation = parser =>
                             selector(parser.Parse(bindingPower));
    return sym;
}

The parser.Parse(int) operation performs a parse according to the expected precedence of the next rule. This makes Pratt parsing a simple and efficient forward-only parser.

We can use any looping or testing constructs within NullDenotation and LeftDenotation, so the token streams that can be recognized are arbitrarily complex. The grammars definable using Sasa.Parsing are declarative, simple and quite efficient, so download and give it a whirl!

No comments: