Wednesday, November 18, 2009

Easy File System Path Manipulation in C#

I came across this type-safe module for handling file paths in the Haskell subreddit this week, and thought it look kind of neat. Handling paths as strings, even with System.IO.Path always bugged me.

So I created a close C# equivalent of the Haskell type and added it to the Sasa library, to be available in the upcoming v0.9.3 release:
public struct FsPath : IEnumerable<string>, IEnumerable
{
public FsPath(IEnumerable<string> parts);
public FsPath(string path);

public static FsPath operator /(FsPath path1, FsPath path2);
public static FsPath operator /(FsPath path, IEnumerable<string> parts);
public static FsPath operator /(FsPath path, string part);
public static FsPath operator /(FsPath path, string[] parts);
public static FsPath operator /(IEnumerable<string> parts, FsPath path);
public static FsPath operator /(string part, FsPath path);
public static FsPath operator /(string[] parts, FsPath path);
public static implicit operator FsPath(string path);

public FsPath Combine(FsPath file);
public IEnumerator<string> GetEnumerator();
public static FsPath Root(string root);
public override string ToString();
}

The above is just the interface, minus the comments which you can see at the above svn link. This struct tracks path components for you and C#'s operator overloading lets you specify paths declaratively without worrying about combining paths with proper separator characters, etc.:
FsPath root = "/foo/bar";
var baz = root / "blam" / "baz";
var etc = FsPath.Root("/etc/");
var passwd = etc / "passwd";

The library will generate path strings using the platform's directory separator.

One significant departure from the Haskell library is the lack of phantom types used to distinguish the various combinations of relative/absolute and file/directory paths. C# can express these same constraints, but I intentionally left them out for two reasons:
  1. The type safety provided by the file/directory phantom type is rather limited, since it's only an alleged file/directory path; you have to consult the file system to determine whether the alleged type is actually true. The only minor advantage to specifying this in a path type, is as a form of documentation to users of your library that you expect a file path in a particular method, and not a directory path. I could be convinced to add it for this reason.

  2. The relative/absolute phantom type seems a bit useless to me, though the reason may be less obvious to others. IMO, the set of file and directory info classes should not have GetParent() or support ".." directory navigation operations, nor should they support static privilege escalation operations like File.GetAbsolutePath() which ambiently converts a string into an authority on a file/directory, since this inhibits subsystem isolation; without GetParent() or privilege escalation functions, any subsystem you hand a directory object is automatically chroot jailed to operate only in that directory. This has long been known to the capability-security community, and is how they structure all of their IO libraries (see Joe-E and the E programming language). Thus, in a capability-secure file system library, all paths are relative paths interpreted relative to a given directory object. As a result, FsPath also strips all "." and resolve all ".." to within the provided path to supply this isolation.

It goes without saying that .NET does not currently handle files and directories this way, but I plan to handle this eventually as well. FsPath will play a significant role in ensuring that paths cannot escape the chroot jail.

As an interim step along that path, the FsPath interface is a good first step.

[Edit: the reddit thread for this post has some good discussion.]

Monday, November 16, 2009

Extensible, Statically Typed Pratt Parser in C#

I just completed a statically typed Pratt-style single-state extensible lexer+parser, otherwise known as a top-down operator precedence parser, for the Sasa library. The implementation is available in the Sasa.Parsing dll, under Sasa.Parsing.Pratt. Two simple arithmetic calculators are available in the unit tests.

This implementation is novel in two ways:
  1. Aside from an alleged implementation in Ada, this is the only statically typed Pratt parser I'm aware of.

  2. Pratt parsers typically require a pre-tokenized input before parsing semantic tokens, but I've eliminated this step by using the symbol definitions to drive a longest-match priority-based lexer.

Symbols by default match themselves, but you can optionally provide a scanning function used to match arbitrary string patterns. The symbol selected for the current position of the input is the symbol that matches the longest substring. If two symbols match equally, then the symbol with higher precedence is selected.

The design was heavily influenced by this Python implementation, though numerous departures were made to restore static typing and integrate lexing. In particular, symbols and semantic tokens are separate, where symbols are used primarily during lexing, and construct the appropriate semantic tokens on match.

Here is a simple arithmetic calculator:
class Calculator : PrattParser<int>
{
public Calculator()
{
Infix("+", 10, Add); Infix("-", 10, Sub);
Infix("*", 20, Mul); Infix("/", 20, Div);
InfixR("^", 30, Pow); Postfix("!", 30, Fact);
Prefix("-", 100, Neg); Prefix("+", 100, Pos);

// grouping symbols, ie. "(_)"
Group("(", ")", int.MaxValue);

// provide a predicate to identify valid literals
Match("(digit)", char.IsDigit, 1, Int);

// ignore whitespace
SkipWhile(char.IsWhiteSpace);
}

int Int(string lit) { return int.Parse(lit); }
int Add(int lhs, int rhs) { return lhs + rhs; }
int Sub(int lhs, int rhs) { return lhs - rhs; }
int Mul(int lhs, int rhs) { return lhs * rhs; }
int Div(int lhs, int rhs) { return lhs / rhs; }
int Pow(int lhs, int rhs) { return (int)Math.Pow(lhs, rhs); }
int Neg(int arg) { return -arg; }
int Pos(int arg) { return arg; }
int Fact(int arg)
{
return arg == 0 || arg == 1 ? 1 : arg * Fact(arg - 1);
}
}

The above is a simplified version of the calculators used in the unit tests. You can clearly see the definition of left and right-associative infix operators, and postfix operators. The numbers are precedence levels, and the delegates are the semantic actions associated with each token type.

There are pre-defined functions for creating left and right-associative infix, prefix, postfix, and both prefix and infix ternary operators -- prefix ternary is a simple "if _ then _ else _", infix ternary is like C's ternary operation, "_ ? _ : _". You are not constrained to these symbol types however, as you can define arbitrary character sequences as symbols and parse the results at your whim.

Adding variable binding to the above calculator requires a few extensions to support lexically scoped environments, but the parser requires only the following two extensions:
// symbol for identifiers, ie. variable names
Match("(ident)", char.IsLetter, 0,
name => new Var { Name = name });
// let binding, ie. "let x = 1 in x", is a
// prefix-ternary operator
TernaryPrefix("let", "=", "in", 90, Let);

"let" is not matched as an identifier because the "let" symbol has a higher precedence. Parser definitions with ambiguous parses thus require some care to resolve the ambiguity either via match length or precedence relationships.

There are two popular choices when it comes to parsing: parser generator tools and parser combinator libraries. I've never liked generators personally, as they require a separate language and entirely separate tools to define and process the grammar thus increasing the learning and maintenance curve.

I actually like parser combinators, and there are now good techniques for supporting arbitrary left recursion in grammars (see Packrat parsing in OMeta). However, handling operator precedence often involves a series of complex extensions to the otherwise simple recursive descent, or it requires the parsed results to maintain all operator and grouping information so a post-processing phase can readjust the parse tree to reflect the proper priorities. The post-processing phase typically uses a shunting yard or Pratt-style operator precedence algorithm to resolve precedence, so why not just build the parser itself directly on Pratt parsing?

Unlike parser-combinator libraries, of which I've built a few, Pratt parsers natively support arbitrary operator precedence and are efficient as they require no backtracking. Pratt parsers are also Turing-complete, because the semantic actions can perform arbitrary computations on the parser input. Turing completeness carries both benefits and costs of course, but the single-state nature of Pratt parsers keeps it manageable. The "single state" of the parser is simply the current token matched from the input.

For example, if prefix-ternary operators weren't already available, you could simply define one like so:
// "let" is a prefix symbol on a name
var let = Prefix("let", 90, x =>
{
// lexer matched "let" prefix operator on x
// argument, a variable name
Advance("="); // advance past the "="
var xValue = Parse(0); // parse the value to bind to x
Advance("in"); // advance past the "in"
var body = Parse(0); // parse the let expression body
// return a Let expression node
return new Let{ Variable = x, Value = xValue, Body = body};
});
// declare the other symbols involved for the lexer
// but they have no associated semantic action
Symbol("=");
Symbol("in");

The above builds on "Prefix", but you could break this down even further as in Pratt's original paper and define arbitrary symbol behaviour by providing a scanning function, "left denotation", and "null denotation", and so on.

To define a parser, you need only inherit from Sasa.Parsing.Pratt.PrattParser, provide the semantic type T, and override the constructor for your parser to define your symbol table. The only downside is that each parser instance is monolithic and not thread-safe, so it can only parse one input at a time. There is no limit to how many parsers you can create however.

Pratt parsing is simple, compact, and efficient, so I hope it finds more widespread use, particularly now that there is a statically typed implementation for a popular language.

[Edit: the PrattParser will be available in the upcoming v0.9.3 release of the Sasa class library.]

Thursday, October 1, 2009

Abstracting over Type Constructors using Dynamics in C#

I've written quite a few times about my experiments with the CLR type system [1, 2, 3]. After much exploration and reflection, I had devised a pretty good approach to encoding ML-style modules and abstracting over type constructors in C#.

A recent question on Stack Overflow made me realize that I never actually explained this technique in plain English.

The best encoding of ML modules and type constructor polymorphism requires the use of partly safe casting.
  1. An ML signature maps to a C# interface with a generic type parameter called a "brand". The brand names the class that implements the interface, ie. the module implementation.

  2. An ML module maps to a C# class. If the module implements a signature, then it implements the corresponding interface and specifies itself as the signature's brand.

  3. Since classes and interfaces are first-class values, an ML functor also maps to a class.

  4. An ML type component maps to an abstract class that shares the same brand as the module. This effectively ties the the module data representation and the module implementation together at the interface boundary, and makes the necessary casting partly safe.

I'll use the tagless interpreter from Fig. 2 of tagless staged interpreters as a concrete example:
(* Fig. 2 *)
module type Symantics = sig
type ('c, 'dv) repr
val int : int -> ('c, int) repr
val bool: bool -> ('c, bool) repr
val lam : (('c, 'da) repr -> ('c, 'db) repr) ->
('c, 'da -> 'db) repr
val app : ('c, 'da -> 'db) repr -> ('c, 'da) repr ->
('c, 'db) repr
val fix : ('x -> 'x) -> (('c, 'da -> 'db) repr as 'x)
val add : ('c, int) repr -> ('c, int) repr ->
('c, int) repr
val mul : ('c, int) repr -> ('c, int) repr ->
('c, int) repr
val leq : ('c, int) repr -> ('c, int) repr ->
('c, bool) repr
val if_ : ('c, bool) repr ->
(unit -> 'x) -> (unit -> 'x) ->
(('c, 'da) repr as 'x)
end

In the translation, I omit the 'c type parameter used in OCaml. The type of the expression representation, 'dv, becomes T in C#:
  1. module type Symantics = sig
    maps to
    interface ISymantics<B> where B : ISymantics<B>
    (B is the module's Brand)

  2. type ('c, 'dv) repr
    maps to
    abstract class Repr<T, B> where B : ISymantics<B>
    (B is the module's Brand)

  3. Each signature function maps to a method on ISymantics, ie.
    val int : int -> ('c, int) repr
    maps to
    Repr<int, B> Int(int value)

The final translation will look something like:
// type component
abstract class Repr<T, B> where B : ISymantics<B> { }
// module signature
interface ISymantics<B> where B : ISymantics<B>
{
Repr<int, B> Int(int i);
Repr<int, B> Add(Repr<int, B> left, Repr<int, B> right);
...
}

The implementation undergoes a similar translation:
  1. module R = struct
    maps to
    sealed class R : ISymantics<R>
    (R implements ISymantics and provides itself as the type brand)

  2. type ('c,'dv) repr = 'dv
    maps to
    sealed class ReprR<T> : Repr<T, R>
    (the concrete representation is a sealed class that inherits from Repr, and supplies R as the brand, effectively typing it to the R implementation)

The final mapping looks like:
(* Section 2.2 *)
module R = struct
type ('c,'dv) repr = 'dv (* no wrappers *)
let int (x:int) = x
let add e1 e2 = e1 + e2
...
end
maps to:
// concrete type component for the interpreter
// representation
sealed class ReprR<T> : Repr<T, R>
{
internal T value;
}
sealed class R : ISymantics<R>
{
public Repr<int, R> Int(int i)
{
return new ReprR<int> { value = i };
}
public Repr<int, R> Add(Repr<int, R> left,
Repr<int, R> right)
{
var l = left as ReprR<int>; // semi-safe cast
var r = right as ReprR<int>;// semi-safe cast
return new ReprR<int> { value = l.value + r.value; }; }
}
}

Programs written against tagless interpreters are wrapped in functors in order to properly abstract over the interpreter implementation. As mentioned before, modules and signatures are effectively first-class values in this encoding, so a functor simply becomes a function:
(* Fig. 3 *)
module EX(S: Symantics) = struct
open S
let test1 () = app (lam (fun x -> x)) (bool true)
...
end

maps to:
public static class EX
{
public static Repr<bool, B> Test1<B>(ISymantics<B> s)
{
return s.App(s.Lam(x => x), s.Bool(true));
}
}

The brand/ISymantics type could also be lifted to be a generic class parameter to make it syntactically closer to how it looks in the paper, but the difference is not important.

You can now run EX.Test1 with any conforming implementation of ISymantics, and the type system will prevent you from mixing representations of different implementations just as it would in ML, because the brands will not match. The only way to trigger a type error due to the downcast, is if the client implements his own Repr<T, B> supplying R for the brand, then passing the custom Repr type in to a method on ISymantics<R>. In such cases the client deserves an error.

I think this is a fairly reasonable trade off all things considered. Of course, it would be preferable if the CLR could just support type constructor polymorphism natively. And while all my wishes are coming true, can I have all of these changes too?

Wednesday, September 16, 2009

Disabling EDGE/3G on iPhone OS 3.0

After much struggle and dead ends, this fix worked like a charm. I don't think it's been publicized very much, and there are many people in need of such fixes.

Tuesday, September 15, 2009

Sasa v0.9.2 Released

The newest Sasa release contains a number of bugfixes and increased compliance with MIME standards, including proper MIME MailMessage subject decoding. Perhaps of wider interest to the .NET community, there are a number of extension methods for thread safe and null-safe event raising; safe event raising is a major bane of the .NET developer's existence.

There are specific overloads for all System.Action delegate types. A general RaiseAny extension method is also provided that isn't very efficient, but which should match the signature of any event in the class libraries. More overloads can be provided for more efficient execution if needed.

Here is the complete list of changes from the changelog:
= v0.9.2 =

* fixed bug where quoted printable encoding failed when = was the last
character.
* added Pop3Session.Reset method.
* compact serializer no longer uses stream positions to track cached
objects, so non-indexable streams, like DeflateStream, are now
usable.
* ISerializing interface generalized to support serializing non-primitive
field values.
* added e-mail subject decoding.
* fixed boundary condition on QuotedPrintableEncoding.
* added extension methods to support safely raising events.
* added an extension method to safely generate field and property names
as strings.
* added parameter to Compact serializer to indicate whether
SerializableAttribute should be respected.
* added a boundary check for SliceEquals.
* added a event Raise overload for any event handlers whose arg
inherits from EventArgs, which enables safe event raising within
an object.
* MailAddressCollection already handles parsing comma-delimited e-mail
address strings, so don't attempt to split them manually.
* added a test for an e-mail address that contains a comma.
* default to us-ascii charset if none provided.
* NonNull now throws a ArgumentNullException with a proper
description.
* many FxCop-related fixes and CLS compliance improvements.
* added usage restrictions on Sasa.CodeContracts attributes.

[Edit: generating HTML docs from the XML files used to be such a chore, though Sandcastle at least makes it possible, if convoluted. Mono's mdoc util is barely any help here, as it can only process one Visual Studio generated XML file at a time. I just found ImmDoc.Net which made this process ridiculously easy and stupendously fast, easily an order of magnitude faster than Sandcastle, but it looks like there's a bug handling method ref and out parameters.

In any case, here is some preliminary HTML API documentation for your perusal (CHM file here). Unfortunately, Sandcastle doesn't produce a handy index.html, but hopefully the ImmDoc.Net dev will fix the above bug soon, and I can use that to generate documentation.]

[Edit 2: The ImmDoc.Net dev fixed the bug, so here is a clearer set of docs with a proper index. The previous link will still work for now.]

Sunday, August 16, 2009

SlimTimer Timesheet Processing Utility

I use the very respectable SlimTimer to help me track my hours. Unfortunately, while they display a consistent total across all reports, the entries in each report do not necessarily add up to that total due to the fractional time units and the rounding involved.

Unfortunately, the accounting department tends to frown upon inconsistencies like this, no matter the reason. My process thus far has been to simply export a full timesheet with the report settings specifying time units as precisely as possible, and then performing the sum myself on the resulting chart.

This got a bit tedious, so I wrote a program to compile the necessary tallies over however many timesheet files I wanted to process. The source and binaries are available for download. Simply drag and drop any number of timesheets generated from SlimTimer, and the utility will generate a new csv file for each timesheet, with an extension "-out.csv".

The format of the output csv file is formatted for my invoice structure, but adjusting it for other formats is fairly trivial for anyone familiar with C#.

Tuesday, June 23, 2009

Mobile Code in C# via Finally Tagless Interpreters

Awhile back I described an idea to transparently execute code server-side or client-side given the same program. I've finally gotten around to implementing this using my encoding for type constructor polymorphism in C#.

Here is an example you can run from the original paper showcasing exponentiation which can execute transparently either server-side or client-side. The server-side code is intentionally limited to bases less than 100 and exponents less than 20.

Here is some simplified code that defines an exponentiation function:
void Build<B, R>(B _)
where B : ISymantics<B>, new()
{
var e = _.Lambda<int, Func<int, int>>(
x => _.Fix<int, int>(self => _.Lambda<int, int>(n =>
_.If(_.Lte(n, _.Int(0)), () => _.Int(1),
() => _.Mul(x, _.Apply(self, _.Add(n, _.Int(-1))))))));
}

The parameter "_" is the tagless interpeter and it's type is ISymantics<B>, which defines the semantics of the language being interpreted:
/// <summary>
/// The abstract base class of a typed expression.
/// </summary>
/// <typeparam name="T">The type of the encapsulated expression.</typeparam>
/// <typeparam name="B">The brand of the interpreter that can interpret this expression.</typeparam>
public abstract class E<T, B> where B : ISymantics<B>, new() { }

/// <summary>
/// The interface for interpreters.
/// </summary>
/// <typeparam name="B">The brand of the interpreter. This should be unique for each interpreter.</typeparam>
public interface ISymantics<B>
where B : ISymantics<B>, new()
{
/// <summary>
/// Construct an integer.
/// </summary>
/// <param name="i">The integer value.</param>
/// <returns>An expression encapsulating the value.</returns>
E<int, B> Int(int i);

/// <summary>
/// Add two expressions.
/// </summary>
/// <param name="left">The left expression.</param>
/// <param name="right">The right expression.</param>
/// <returns>The sum of the two expressions</returns>
E<int, B> Add(E<int, B> left, E<int, B> right);

...
}

The code-behind for the .aspx page interprets this code twice, once with a LINQ-based compiler for server-side execution, and once with a JavaScript backend to generate the client-side program.

The client-side program depends only on a utility function to compute the fixed point of a given function:
function fix(f) {
return function self(x) { return f(self)(x); };
}

Clearly any programs written this way are fairly verbose, but it's fairly straightforward to define wrapper structs around E<T, B>, such as Eint, Ebool, etc., to provide overloaded operators such that embedded programs look almost like ordinary C#. The current ISymantics interface is also limited to single-arg functions, but it's possible to add more overloads to support multi-arg functions.

Of course, it's already possible to translate C# expressions to JavaScript if we write a LINQ expression visitor. However, LINQ expressions are in a sense more limited than operating on this lifted data type. For instance, it's not possible to translate a static function to JavaScript using LINQ expressions, but it is possible using the final tagless representation.

It's an open question whether such final tagless interpreters can be made usable enough for real programming. However, the techniques employed here, in particular the structure which enables semi-safe type constructor polymorphism, are definitely useful for other projects that require retargetable abstractions.

The full source code is available, and includes a simple C# evaluator, LINQ-based compiler, a JavaScript backend, and an interpreter that computes the depth of the expression tree, which was also an example in the tagless paper. Also included is a partially working partial evaluator. This proved the trickiest to implement, and the only remaining problem appears to be recursive partial evaluation.

As future work, the partial evaluator can be parameterized by the type of compiler, so it can partially evaluate managed code and JavaScript programs; it is currently needlessly tied to the managed code interpreter. I also plan to implement the aforementioned wrapper structs and other enhancements to improve the clarity of embedded programs, hopefully to the point where they are usable for real programming.

[Edit 2009-08-16: it just came to my attention that the JS failed on the first run, but succeeded after a server-side execution was performed. This was a DOM manipulation bug, and the source and server code have been updated to fix this.]