Wednesday, November 18, 2009

Easy File System Path Manipulation in C#

I came across this type-safe module for handling file paths in the Haskell subreddit this week, and thought it looked kind of neat. Handling paths as strings, even with System.IO.Path always bugged me.

So I created a close C# equivalent of the Haskell type and added it to the Sasa library, to be available in the upcoming v0.9.3 release:
public struct FsPath : IEnumerable<string>, IEnumerable
{
public FsPath(IEnumerable<string> parts);
public FsPath(string path);

public static FsPath operator /(FsPath path1, FsPath path2);
public static FsPath operator /(FsPath path, IEnumerable<string> parts);
public static FsPath operator /(FsPath path, string part);
public static FsPath operator /(FsPath path, string[] parts);
public static FsPath operator /(IEnumerable<string> parts, FsPath path);
public static FsPath operator /(string part, FsPath path);
public static FsPath operator /(string[] parts, FsPath path);
public static implicit operator FsPath(string path);

public FsPath Combine(FsPath file);
public IEnumerator<string> GetEnumerator();
public static FsPath Root(string root);
public override string ToString();
}

The above is just the interface, minus the comments which you can see at the above svn link. This struct tracks path components for you and C#'s operator overloading lets you specify paths declaratively without worrying about combining paths with proper separator characters, etc.:
FsPath root = "/foo/bar";
var baz = root / "blam" / "baz";
var etc = FsPath.Root("/etc/");
var passwd = etc / "passwd";

The library will generate path strings using the platform's directory separator.

One significant departure from the Haskell library is the lack of phantom types used to distinguish the various combinations of relative/absolute and file/directory paths. C# can express these same constraints, but I intentionally left them out for two reasons:
  1. The type safety provided by the file/directory phantom type is rather limited, since it's only an alleged file/directory path; you have to consult the file system to determine whether the alleged type is actually true. The only minor advantage to specifying this in a path type, is as a form of documentation to users of your library that you expect a file path in a particular method, and not a directory path. I could be convinced to add it for this reason.

  2. The relative/absolute phantom type seems a bit useless to me, though the reason may be less obvious to others. IMO, the set of file and directory info classes should not have GetParent() or support ".." directory navigation operations, nor should they support static privilege escalation operations like File.GetAbsolutePath() which ambiently converts a string into an authority on a file/directory, since this inhibits subsystem isolation; without GetParent() or privilege escalation functions, any subsystem you hand a directory object is automatically chroot jailed to operate only in that directory. This has long been known to the capability-security community, and is how they structure all of their IO libraries (see Joe-E and the E programming language). Thus, in a capability-secure file system library, all paths are relative paths interpreted relative to a given directory object. As a result, FsPath also strips all "." and resolve all ".." to within the provided path to supply this isolation.

It goes without saying that .NET does not currently handle files and directories this way, but I plan to handle this eventually as well. FsPath will play a significant role in ensuring that paths cannot escape the chroot jail.

As an interim step along that path, the FsPath interface is a good first step.

[Edit: the reddit thread for this post has some good discussion.]

Monday, November 16, 2009

Extensible, Statically Typed Pratt Parser in C#

I just completed a statically typed Pratt-style single-state extensible lexer+parser, otherwise known as a top-down operator precedence parser, for the Sasa library. The implementation is available in the Sasa.Parsing dll, under Sasa.Parsing.Pratt. Two simple arithmetic calculators are available in the unit tests.

This implementation is novel in two ways:
  1. Aside from an alleged implementation in Ada, this is the only statically typed Pratt parser I'm aware of.

  2. Pratt parsers typically require a pre-tokenized input before parsing semantic tokens, but I've eliminated this step by using the symbol definitions to drive a longest-match priority-based lexer.

Symbols by default match themselves, but you can optionally provide a scanning function used to match arbitrary string patterns. The symbol selected for the current position of the input is the symbol that matches the longest substring. If two symbols match equally, then the symbol with higher precedence is selected.

The design was heavily influenced by this Python implementation, though numerous departures were made to restore static typing and integrate lexing. In particular, symbols and semantic tokens are separate, where symbols are used primarily during lexing, and construct the appropriate semantic tokens on match.

Here is a simple arithmetic calculator:
class Calculator : PrattParser<int>
{
public Calculator()
{
Infix("+", 10, Add); Infix("-", 10, Sub);
Infix("*", 20, Mul); Infix("/", 20, Div);
InfixR("^", 30, Pow); Postfix("!", 30, Fact);
Prefix("-", 100, Neg); Prefix("+", 100, Pos);

// grouping symbols, ie. "(_)"
Group("(", ")", int.MaxValue);

// provide a predicate to identify valid literals
Match("(digit)", char.IsDigit, 1, Int);

// ignore whitespace
SkipWhile(char.IsWhiteSpace);
}

int Int(string lit) { return int.Parse(lit); }
int Add(int lhs, int rhs) { return lhs + rhs; }
int Sub(int lhs, int rhs) { return lhs - rhs; }
int Mul(int lhs, int rhs) { return lhs * rhs; }
int Div(int lhs, int rhs) { return lhs / rhs; }
int Pow(int lhs, int rhs) { return (int)Math.Pow(lhs, rhs); }
int Neg(int arg) { return -arg; }
int Pos(int arg) { return arg; }
int Fact(int arg)
{
return arg == 0 || arg == 1 ? 1 : arg * Fact(arg - 1);
}
}

The above is a simplified version of the calculators used in the unit tests. You can clearly see the definition of left and right-associative infix operators, and postfix operators. The numbers are precedence levels, and the delegates are the semantic actions associated with each token type.

There are pre-defined functions for creating left and right-associative infix, prefix, postfix, and both prefix and infix ternary operators -- prefix ternary is a simple "if _ then _ else _", infix ternary is like C's ternary operation, "_ ? _ : _". You are not constrained to these symbol types however, as you can define arbitrary character sequences as symbols and parse the results at your whim.

Adding variable binding to the above calculator requires a few extensions to support lexically scoped environments, but the parser requires only the following two extensions:
// symbol for identifiers, ie. variable names
Match("(ident)", char.IsLetter, 0,
name => new Var { Name = name });
// let binding, ie. "let x = 1 in x", is a
// prefix-ternary operator
TernaryPrefix("let", "=", "in", 90, Let);

"let" is not matched as an identifier because the "let" symbol has a higher precedence. Parser definitions with ambiguous parses thus require some care to resolve the ambiguity either via match length or precedence relationships.

There are two popular choices when it comes to parsing: parser generator tools and parser combinator libraries. I've never liked generators personally, as they require a separate language and entirely separate tools to define and process the grammar thus increasing the learning and maintenance curve.

I actually like parser combinators, and there are now good techniques for supporting arbitrary left recursion in grammars (see Packrat parsing in OMeta). However, handling operator precedence often involves a series of complex extensions to the otherwise simple recursive descent, or it requires the parsed results to maintain all operator and grouping information so a post-processing phase can readjust the parse tree to reflect the proper priorities. The post-processing phase typically uses a shunting yard or Pratt-style operator precedence algorithm to resolve precedence, so why not just build the parser itself directly on Pratt parsing?

Unlike parser-combinator libraries, of which I've built a few, Pratt parsers natively support arbitrary operator precedence and are efficient as they require no backtracking. Pratt parsers are also Turing-complete, because the semantic actions can perform arbitrary computations on the parser input. Turing completeness carries both benefits and costs of course, but the single-state nature of Pratt parsers keeps it manageable. The "single state" of the parser is simply the current token matched from the input.

For example, if prefix-ternary operators weren't already available, you could simply define one like so:
// "let" is a prefix symbol on a name
var let = Prefix("let", 90, x =>
{
// lexer matched "let" prefix operator on x
// argument, a variable name
Advance("="); // advance past the "="
var xValue = Parse(0); // parse the value to bind to x
Advance("in"); // advance past the "in"
var body = Parse(0); // parse the let expression body
// return a Let expression node
return new Let{ Variable = x, Value = xValue, Body = body};
});
// declare the other symbols involved for the lexer
// but they have no associated semantic action
Symbol("=");
Symbol("in");

The above builds on "Prefix", but you could break this down even further as in Pratt's original paper and define arbitrary symbol behaviour by providing a scanning function, "left denotation", and "null denotation", and so on.

To define a parser, you need only inherit from Sasa.Parsing.Pratt.PrattParser, provide the semantic type T, and override the constructor for your parser to define your symbol table. The only downside is that each parser instance is monolithic and not thread-safe, so it can only parse one input at a time. There is no limit to how many parsers you can create however.

Pratt parsing is simple, compact, and efficient, so I hope it finds more widespread use, particularly now that there is a statically typed implementation for a popular language.

[Edit: the PrattParser will be available in the upcoming v0.9.3 release of the Sasa class library.]