Saturday, September 21, 2019

Simplest Exceptions Handler Macros for C

More adventures in C: I revisited my exception handler implementation, libex, and realized there's an even simpler version that's more efficient if we give up the requirement that exceptions can be propagated to callers:

#define TRY
#define THROW(E) goto E
#define CATCH(E) goto FINALLY_H; E:
#define FINALLY FINALLY_H:

This implementation requires only a single exception handling block per function, which is probably a good idea as a general rule. Usage looks like:

static void foo(size_t sz) {
  char* x;
  TRY {
    x = (char*)malloc(sz);
    if (x == NULL) THROW(ENoMem);
    // do something with x
  } CATCH(ENoMem) {
    // handle out of memory
  } FINALLY {
    if (x != NULL) free(x);
  }
}

Unlike libex, this version is actually compatible with break and continue, although using them runs the risk of skipping the FINALLY handler. An almost equally simple version, which ensures that break/continue cannot skip the FINALLY block, wraps everything in a loop:

#define TRY do {
#define THROW(E) goto E
#define CATCH(E) goto FINALLY_H; E:
#define FINALLY } while(0); FINALLY_H:

This ensures that any break or continue statements executed within the TRY or CATCH blocks immediately exit to the FINALLY handler.

Of course, abort() and exit() invocations will also bypass FINALLY, so if you intend to use these macros, I suggest using them only as a way to organize your code in a more structured fashion.

Friday, September 20, 2019

async.h - asynchronous, stackless subroutines in C

The async/await idiom is becoming increasingly popular. The first widely used language to include it was C#, and it has now spread into JavaScript and Rust. Now C/C++ programmers don't have to feel left out, because async.h is a header-only library that brings async/await to C!

Features:

  1. It's 100% portable C.
  2. It requires very little state (2 bytes).
  3. It's not dependent on an OS.
  4. It's a bit simpler to understand than protothreads because the async state is caller-saved rather than callee-saved.

Here's an example of how it's used:
#include "async.h"

struct async pt;
struct timer timer;

async example(struct async *pt) {
    async_begin(pt);
    
    while(1) {
        if(initiate_io()) {
            timer_start(&timer);
            await(io_completed() || timer_expired(&timer));
            read_data();
        }
    }
    async_end;
}

This library is basically a modified version of the idioms found in the Protothreads library by Adam Dunkels, so it's not truly groundbreaking. I've made a few tweaks that make it more understandable and generate more compact code, and I also think it maps more cleanly to the async/await semantics than to true threading.

Protothreads and async.h are both based around local continuations, but where protothreads are callee-saved, async.h is caller-saved. This eliminates the need to pass in the local continuation to any async operations except async_begin. This simplifies the macros that implement the async/await idiom, and even simplifies code that uses async.h.

Here's a simple example of fork-join style "parallelism":

#include "async.h"

typedef struct { 
    async_state;
    struct async nested1;
    struct async nested2;
} example_state;
example_state pt;

async nested(struct async *pt){
    async_begin(pt);
    ...
    async_end;
}

async example(example_state *pt) {
    async_begin(pt);

    // fork two nested async subroutines and wait until both complete
    async_init(&pt->nested1);
    async_init(&pt->nested2);
    await(async_call(nested, &pt->nested1) & async_call(nested, &pt->nested2));
    
    // fork two nested async subroutines and wait until at least one completes
    async_init(&pt->nested1);
    async_init(&pt->nested2);
    await(async_call(nested, &pt->nested1) | async_call(nested, &pt->nested2));

    async_end;
}

Tuesday, September 3, 2019

Building a Query DSL in C#

I recently built a REST API prototype where one of the endpoints accepted a string representing a filter to apply to a set of results. For instance, for entities with named properties "Foo" and "Bar", a string like "(Foo = 'some string') or (Bar > 99)" would restrict the results to those where either Foo is "some string" or Bar is greater than 99.

This would translate pretty straightforwardly into a SQL query, but as a masochist I was set on using Google Datastore as the backend, which unfortunately has a limited filtering API:

  1. It does not support disjunctions, ie. "OR" clauses.
  2. It does not support filtering using inequalities on more than one property.
  3. It does not support a not-equal operation.

So in this post, I will describe a design that achieves the following goals:

  1. A backend-agnostic querying API supporting arbitrary clauses, conjunctions ("AND"), and disjunctions ("OR").
  2. Implementations of this API that are amenable to unit testing the parser and the query logic.
  3. A simple implementation of this backend for Google Datastore supporting the full query capabilities.

The Query Language

The query language is supposed to support the following filtering operations:

  1. Clauses on properties should support: =, !=, <, >.
  2. Supported data types are: numbers, dates, strings.
  3. Clauses can be combined by "and" and "or" operators.
  4. Precedence must be specified by parentheses and compound expressions, ie. "(expression)" and "(expression) and (expression)" are valid, but "expression and expression" is not.

Anyone who follows me probably knows I'm a fan of tagless final interpreters, also known as object algebras, so I started by defining the query language in this fashion:

public interface IQueryLanguage<T>
{
    T Id();
    T Const(string val);
    T Const(DateTime date);
    T Const(double num);
    T Property(string propertyName);
    T And(T lhs, T rhs);
    T Or(T lhs, T rhs);
    T LessThan(T lhs, T rhs);
    T GreaterThan(T lhs, T rhs);
    T Equal(T lhs, T rhs);
    T NotEqual(T lhs, T rhs);
}
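
For instance, the sample filter from the introduction can be composed by hand against any implementation of this interface. Here's a rough sketch (the helper name is purely illustrative):

public static T BuildSampleQuery<T>(IQueryLanguage<T> lang) =>
    lang.Or(lang.Equal(lang.Property("Foo"), lang.Const("some string")),
            lang.GreaterThan(lang.Property("Bar"), lang.Const(99.0)));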

You can clearly see the language structure here and how queries can be composed into larger compound expressions. Another good practice is to factor all backend operations into a repository, so this is roughly how it might look:

public interface IRepository<T>
{
   ...

   IQueryLanguage<T> CreateQuery<TEntity>()
        where TEntity : class;

   Task<IEnumerable<TEntity>> Execute<TEntity>(T query, int limit, int page)
        where TEntity : class;
}

So our repository instances will expose a method to create a query for an entity type, and a method used to execute that query with some extra parameters for paging.

The grammar for the language looks roughly like this in pseudo-EBNF:

number   := [0-9]+ [.]? [0-9]*
quoted   := "'" datetime "'" | "'" char* "'"
property := letter+
term     := number | quoted | property
op       := "=" | "!=" | "<" | ">"
e        := term op term | "(" e ")" | "(" e ")" "and" "(" e ")" | "(" e ")" "or" "(" e ")"

Now that we have a general interface and a grammar for the query language, we can build a simple and generic recursive descent parser against that interface to generically construct a query:

public static class QueryParser
{
    public static T Parse<T>(this IQueryLanguage<T> provider, string input)
    {
        if (string.IsNullOrEmpty(input))
            return provider.Id();
        int pos = 0;
        var e = provider.Parse(input, ref pos);
        if (pos < input.Length)
            throw new ArgumentException("Malformed query");
        return e;
    }

    static T Parse<T>(this IQueryLanguage<T> provider, string input, ref int pos)
    {
        var last = pos;
        if (Match(input, "(", ref pos))
        {
            var e = provider.Parse(input, ref pos);
            return !Match(input, ")", ref pos) ? Fail<T>("expected ')'", pos):
                    pos < input.Length         ? provider.Operation(e, input, ref pos):
                                                 e;
        }
        else
        {
            return provider.Operation(provider.Terminal(input, ref pos), input, ref pos);
        }
    }

    static T Operation<T>(this IQueryLanguage<T> provider, T lhs, string input, ref int pos)
    {
        Skip(input, ref pos);
        var last = pos;
        if (Match(input, "=", ref pos))
        {
            return provider.Equal(lhs, provider.Terminal(input, ref pos));
        }
        else if (Match(input, ">", ref pos))
        {
            return provider.GreaterThan(lhs, provider.Terminal(input, ref pos));
        }
        else if (Match(input, "!=", ref pos))
        {
            return provider.NotEqual(lhs, provider.Terminal(input, ref pos));
        }
        else if (Match(input, "<", ref pos))
        {
            return provider.LessThan(lhs, provider.Terminal(input, ref pos));
        }
        else if (Match(input, "and ", ref pos))
        {
            return provider.And(lhs, provider.Parse(input, ref pos));
        }
        else if (Match(input, "or ", ref pos))
        {
            return provider.Or(lhs, provider.Parse(input, ref pos));
        }
        else if (Match(input, ")", ref pos))
        {
            --pos;
            return lhs;
        }
        else
        {
            return Fail<T>("Unrecognized token", pos);
        }
    }

    static T Terminal<T>(this IQueryLanguage<T> provider, string input, ref int pos)
    {
        Skip(input, ref pos);
        var last = pos;
        if (Match(input, "'", ref pos))
        {
            var end = input.IndexOf('\'', pos);
            if (end < 0)
                return Fail<T>("Unrecognized value", pos);
            else if (DateTime.TryParse(input.AsSpan(pos, end - pos), out var date))
            {
                pos = end + 1;
                return provider.Const(DateTime.SpecifyKind(date, DateTimeKind.Utc));
            }
            else
            {
                var x = input.Substring(pos, end - pos);
                pos = end + 1;
                return provider.Const(x);
            }
        }
        else if (Match(input, (x, i) => x[i] == '.' || char.IsDigit(x, i), ref pos))
        {
            return double.TryParse(input.AsSpan(last, pos - last), out var num)
                ? provider.Const(num)
                : Fail<T>("Expected number", last);
        }
        else if (Match(input, char.IsLetter, ref pos))
        {
            return provider.Property(input.Substring(last, pos - last));
        }
        else
        {
            return Fail<T>("Unrecognized token", pos);
        }
    }

    static T Fail<T>(string error, int pos) =>
        throw new ArgumentException($"Parse error @ {pos}: {error}");

    static void Skip(string input, ref int pos)
    {
        while (pos < input.Length && char.IsWhiteSpace(input[pos]))
            ++pos;
    }

    static bool Match(string input, ReadOnlySpan<char> expect, ref int pos)
    {
        if (pos + expect.Length > input.Length ||
           !input.AsSpan(pos, expect.Length).Equals(expect, StringComparison.OrdinalIgnoreCase))
            return false;
        pos += expect.Length;
        return true;
    }

    static bool Match(string input, Func<string, int, bool> expect, ref int pos)
    {
        var start = pos;
        while (pos < input.Length && expect(input, pos))
            ++pos;
        return start < pos;
    }
}

This isn't the most efficient implementation, but it's hopefully understandable since the parser doesn't implement operator precedence and doesn't really do any backtracking. It recursively descends into a bracketed expression, moving a position pointer along the string as it matches characters, and then invokes any query language methods that matched.

The important part is that this parser is agnostic to the query language implementation, so we can reuse it for any query implementation like so:

IRepository<T> repo = ...;
...
var query = repo.CreateQuery<TEntity>().Parse("(Foo = 'some string') or (Bar > 99)");
var results = await repo.Execute<TEntity>(query, limit: 10, page: 1);

The Datastore Implementation

Since the Datastore filtering API has the limitations described above, we need to transform the query into a form that we can actually execute. Here is an outline of the choices I made, which I will then explain in more detail:

  1. No disjunctions: translate a compound query into disjunctive normal form. This consists of a list of queries with no "OR" operators that are executed individually, and whose results are then merged.
  2. No inequalities on more than one property: keep track of the property that was referenced in an inequality, and build a parallel filter using delegates. We can then execute one inequality filter server-side and post-process the results with the delegates to ensure all inequalities are satisfied.
  3. No != operation: translate "x != y" into "(x < y) or (x > y)". All of the supported types have strict orderings, so not-equal is equivalent to a compound inequality defined by greater-than or less-than.

So let's start with a basic filter in our query language:

public struct QueryFilter
{
    // the datastore filter query
    public Google.Cloud.Datastore.V1.Filter Filter { get; set; }

    // the property referenced by any inequality test; null if no inequality test
    public string InequalityProp { get; set; }
    public bool HasInequality => InequalityProp != null;

    // post-processing filter that applies any extra inequalities
    public Func<Google.Cloud.Datastore.V1.Entity, bool> Post { get; set; }
    public QueryFilter(Filter f, string inequalityProp, Func<Google.Cloud.Datastore.V1.Entity, bool> post)
    {
        Filter = f;
        InequalityProp = inequalityProp;
        Post = post;
    }
}

This groups the Datastore filter, the post-processing filter, and the property used for inequality tests (if any). So a simple property test like "Foo = 'hello world!'" would generate:

var foo = new QueryFilter(Filter.Equal("Foo", "hello world!"), null,
                          e => CompareTo(e["Foo"], "hello world!") == 0);

The other basic clauses look similar, just expecting different results from CompareTo. The post-processing filter need not be used for basic queries, but compound queries with multiple inequalities need this post-processing filter since only one inequality filter will be applied server-side.
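
For instance, a clause like "Bar > 99" might generate something along these lines (a sketch following the same pattern, with the inequality property recorded):

var bar = new QueryFilter(Filter.GreaterThan("Bar", 99.0), "Bar",
                          e => CompareTo(e["Bar"], 99.0) > 0);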

The CompareTo method implements an idiomatic IComparable-style comparison for Datastore values:

static int CompareTo(Value x, Value y)
{
    if (x.ValueTypeCase != y.ValueTypeCase)
        throw new InvalidCastException(
            $"Values are not of the same type: {x.ValueTypeCase} != {y.ValueTypeCase}.");
    switch (x.ValueTypeCase)
    {
        case Value.ValueTypeOneofCase.DoubleValue:
            return x.DoubleValue.CompareTo(y.DoubleValue);
        case Value.ValueTypeOneofCase.StringValue:
            return x.StringValue.CompareTo(y.StringValue);
        case Value.ValueTypeOneofCase.TimestampValue:
            return x.TimestampValue.ToDateTime().CompareTo(y.TimestampValue.ToDateTime());
        default:
            throw new InvalidCastException($"Can't compare type: {x.ValueTypeCase}");
    }
}

QueryFilter suffices for basic queries that Google Datastore already supports, but it's insufficient for "OR" queries. So we need another level of indirection to keep track of lists of QueryFilter that represent the query in disjunctive-normal form:

public struct Term
{
    public Term(IEnumerable<QueryFilter> filter) : this()
    {
        Filters = filter;
    }
    public Term(params QueryFilter[] filter) : this()
    {
        Filters = filter;
    }
    public Term(Value value) : this()
    {
        Value = value;
    }
    public Term(string prop) : this()
    {
        Property = prop;
    }
    public IEnumerable<QueryFilter> Filters { get; private set; }
    public Value Value { get; private set; }
    public string Property { get; private set; }
}

These are the real terms generated by our query language. In reality, this should be a set of classes representing the sum term = property | value | query, but it's simpler to keep everything in a struct to avoid casts. The list of QueryFilter values represents the list of queries that themselves contain no "OR" clauses. Implementations of our query combinators must preserve this property, and we'll see later on that it's pretty straightforward to do this. Let's start with the basic value constructors:

public struct DatastoreQueries : IQueryLanguage<Term>
{
    public Term Id() => new Term();

    public Term Const(DateTime date) =>
        new Term(date);

    public Term Const(double num) =>
        new Term(num);

    public Term Const(string val) =>
        new Term((Google.Cloud.Datastore.V1.Value)val);

    public Term Property(string propertyName) =>
        new Term(propertyName);
    ...
}

Pretty straightforward: values are converted into instances of Google.Cloud.Datastore.V1.Value, and string property names are tracked as properties. Now let's look at the basic filters:

public struct DatastoreQueries : IQueryLanguage<Term>
{
    ...

    public Term Equal(Term lhs, Term rhs)
    {
        if (lhs.Filters != null || rhs.Filters != null)
            throw new ArgumentException("Operations can only compare properties against values.");
        if (rhs.Property != null && lhs.Property == null)
            return Equal(rhs, lhs);
        if (lhs.Property == null || rhs.Property != null)
            throw new ArgumentException("Operations can only compare properties against values, not two properties.");
        var filter = Filter.Equal(lhs.Property, rhs.Value);
        return new Term(new QueryFilter(filter, null, e => CompareTo(e[lhs.Property], rhs.Value) == 0));
    }
}

Here we ensure that the terms being compared for equality are a property and a value, and we also ensure that the property name is on the left-hand side so it's easier to operate on. Equality is symmetric, so we can just reverse the order of the parameters and the meaning is preserved. This is not quite the case for inequality checks, but the two inequality operators can be defined in terms of each other:

public struct DatastoreQueries : IQueryLanguage<Term>
{
    ...

    public Term LessThan(Term lhs, Term rhs)
    {
        // insert validations as above
        return
        rhs.Property != null ? GreaterThan(rhs, lhs):
        lhs.Property != null ? new Term(new QueryFilter(Filter.LessThan(lhs.Property, rhs.Value), lhs.Property,
                                                        e => CompareTo(e[lhs.Property], rhs.Value) < 0)):
                               throw new ArgumentException("A property is required for comparisons.");
    }
    public Term GreaterThan(Term lhs, Term rhs)
    {
        // insert validations as above
        return
        rhs.Property != null ? LessThan(rhs, lhs):
        lhs.Property != null ? new Term(new QueryFilter(Filter.GreaterThan(lhs.Property, rhs.Value), lhs.Property,
                                                        e => CompareTo(e[lhs.Property], rhs.Value) > 0)):
                               throw new ArgumentException("A property is required for comparisons.");
    }
}

I've omitted the validations for brevity, but the basic idea is the same. We ensure the term property is on the left-hand side by transforming greater-than and less-than into each other if they're backwards. As before, we also create a Datastore filter and a post-processing filter for the same operations. And now we get to the composite query combinators. The "OR" combinator is very simple:

public struct DatastoreQueries : IQueryLanguage<Term>
{
    ...

    public Term Or(Term lhs, Term rhs) =>
        new Term(lhs.Filters.Concat(rhs.Filters));
}

OR-ing two queries that themselves lack "OR" clauses simply means merging them into a single list: given two lists of queries, each element of which contains no "OR" clauses, we just concatenate the two lists. This preserves the invariant that each individual sub-query contains no "OR" clauses. This leaves only the AND combinator:

public struct DatastoreQueries : IQueryLanguage<Term>
{
    ...

    public Term And(Term lhs, Term rhs)
    {
        var permutations =
        from x in lhs.Filters
        from y in rhs.Filters
        select !x.HasInequality || !y.HasInequality || x.InequalityProp.Equals(y.InequalityProp)
                                 ? new QueryFilter(Filter.And(x.Filter, y.Filter), x.InequalityProp ?? y.InequalityProp, e => x.Post(e) && y.Post(e)):
                x.HasInequality  ? new QueryFilter(x.Filter, x.InequalityProp, e => x.Post(e) && y.Post(e)):
                                   new QueryFilter(y.Filter, y.InequalityProp, e => x.Post(e) && y.Post(e));
        return new Term(permutations);
    }
}

There's a lot going on here, but here's the gist: every filter in the lhs is AND-ed with every filter in the rhs. The various cases being checked handle inequalities in either the lhs or the rhs. If at most one side contains an inequality filter, or if both refer to the same property, then we can combine the queries into a single server-side filter. Otherwise, we keep one of the filters with an inequality server-side and post-process for the other. Technically one branch is redundant here; I'm just showing it for completeness. I also apply the post-processing in all branches, which was a safety net while I was prototyping and testing (see below).

The reason we can just permute the individual lhs and rhs expressions this way is that AND distributes over OR in boolean algebra. That means, given an expression in disjunctive normal form (a series of OR-ed expressions), AND-ing another clause onto that whole expression is equivalent to AND-ing it onto each individual clause:

A and (B or C or D or ...)
= (A and B) or (A and C) or (A and D) or ...

This is exactly what we're doing above: we're tracking a set of OR-ed clauses and then combining them with another expression via AND, which amounts to distributing the AND-ed expression across the list. But what if A itself is an OR-ed expression? Just do it all over again:

A = (X or Y or Z or ...)
(A and B) or (A and C) or ...
= ((X or Y or Z or ...) and B) or ((X or Y or Z or ...) and C) or ...
= (X and B) or (Y and B) or (Z and B) or ... or (X and C) or (Y and C) or (Z and C) or ...

Every element of A and each element of the original "OR" expression were permuted to produce the final expression, still in disjunctive normal form. This is precisely what the LINQ expression above is doing, with some extra logic to handle inequalities.

We can also see here the necessity of carrying the post-processing filters through all the terms. When inequalities are applied to two different properties in the same filter, only one of them can be applied server-side, and the remainder has to be applied as a delegate filter when we get the results back.
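
One combinator I haven't shown is NotEqual. Following choice 3 from the list above, it doesn't need its own Datastore translation at all; here's a sketch of how it can be defined in terms of the combinators we already have:

public struct DatastoreQueries : IQueryLanguage<Term>
{
    ...

    // x != y  <=>  (x < y) or (x > y), since all supported types are strictly ordered
    public Term NotEqual(Term lhs, Term rhs) =>
        Or(LessThan(lhs, rhs), GreaterThan(lhs, rhs));
}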

Let's see how executing a query happens in a repository implementation:

public class DatastoreRepository : IRepository<Term>
{
    DatastoreDb db;
    ...
    public async Task<IEnumerable<Entity>> Execute<TEntity>(Term query, int limit, int page)
        where TEntity : class
    {
        if (page <= 0)
            throw new ArgumentOutOfRangeException(nameof(page),
                "Page number must be greater than or equal to 1.");
        if (limit <= 0)
            throw new ArgumentOutOfRangeException(nameof(limit),
                "Number of results per page must be greater than or equal to 1.");

        // execute each OR-free filter server-side, then apply the post-filter to ensure all
        // inequalities are satisfied, then merge all results into a sorted set to remove duplicates
        // and ensure results are always in the same order for paging purposes
        var results = new SortedSet<Entity>(ecompare);
        foreach (var f in query.Filters ?? Enumerable.Repeat(new QueryFilter(), 1))
        {
            // generate the entity kind from type
            var q = new Query(Kind<TEntity>())
            {
                Filter = f.Filter,
            };
            var r = await db.RunQueryAsync(q);
            var filtered = f.Post != null
                ? r.Entities.Where(f.Post)
                : r.Entities;
            foreach(var x in filtered)
                results.Add(x);
        }
        // rough paging implementation that's not particularly efficient, but suffices for now
        return results.Skip((page - 1) * limit).Take(limit);
    }
    
    static readonly EntityComparer ecompare = new EntityComparer();

    sealed class EntityComparer: IComparer<Entity>
    {
        // Id(Key) is a small helper (elided) that extracts a comparable id from the entity's key
        public int Compare(Entity x, Entity y) =>
            Id(x.Key).CompareTo(Id(y.Key));
    }
}

The gist is simply that each QueryFilter is executed individually, then the post-processing filter is applied to ensure that all remaining inequalities are satisfied, and the results from all individual OR-ed queries are merged into a single sorted set. This set is then paged through using the given paging parameters and the limited results returned.

This is probably among the simplest possible implementations for the approach I took, and so is not particularly efficient, but it suffices for prototyping purposes or programs whose filters would return moderately sized result sets.

Possible Extensions

If you've ever wondered what a query planner in a relational database does, you can start with the above naive implementation and consider:

  1. what choices you can make about which inequality to preserve in order to make the server-side query return the smallest data set, and what sort of metadata you might need to track in order to make such choices
  2. stricter clauses can eliminate other clauses, ie. "(Foo < 'hello') and (Foo = 'hell')" => "Foo = 'hell'"
  3. a more complex equality query might sometimes be preferable to a simpler inequality query, so sometimes the equality query should be executed server-side instead of the inequality
  4. whether you can exploit the paging parameters to limit the results of each OR query and interleave them somehow
  5. ...and so on...

There's also a way to make the parsing and query construction less error-prone. I currently use dynamic checks where I could make a new type distinction that would eliminate these errors statically. Consider the following interface for the query language and see if you can modify the query parser to accommodate this new distinction and what sorts of errors it statically prevents:

public interface IQueryLanguage<TClause, TQuery>
{
    TClause Id();
    TClause Const(string val);
    TClause Const(DateTime date);
    TClause Const(double num);
    TClause Property(string propertyName);
    TClause LessThan(TClause lhs, TClause rhs);
    TClause GreaterThan(TClause lhs, TClause rhs);
    TClause Equal(TClause lhs, TClause rhs);
    TClause NotEqual(TClause lhs, TClause rhs);

    TQuery Lift(TClause clause);
    TQuery And(TQuery lhs, TQuery rhs);
    TQuery Or(TQuery lhs, TQuery rhs);
}
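
For example, with this split a malformed combination no longer compiles, because And and Or only accept queries produced by Lift. A hedged illustration, assuming some implementation instance named lang:

// var bad = lang.And(lang.Property("Foo"), lang.Const(3.0));   // no longer type-checks
var good = lang.And(lang.Lift(lang.Equal(lang.Property("Foo"), lang.Const(3.0))),
                    lang.Lift(lang.GreaterThan(lang.Property("Bar"), lang.Const(99.0))));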

Finally, my actual prototype also passes in a delegate that compares the property name passed in against a type-specific whitelist of allowed property names, so this too would be a simple extension to what I've presented.
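
That extension might look roughly like the following sketch (the validation delegate here is hypothetical and not part of the code above):

public struct DatastoreQueries : IQueryLanguage<Term>
{
    // Hypothetical: a caller-supplied predicate that whitelists property names
    // for the entity type being queried.
    readonly Func<string, bool> isAllowedProperty;

    public DatastoreQueries(Func<string, bool> isAllowedProperty) =>
        this.isAllowedProperty = isAllowedProperty;

    public Term Property(string propertyName) =>
        isAllowedProperty(propertyName)
            ? new Term(propertyName)
            : throw new ArgumentException($"Unknown property: {propertyName}");
    ...
}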

Testing

The abstract interfaces used here make testing pretty straightforward. Here's a simple implementation of a pretty-printer for parsed expressions:

public delegate StringBuilder Eval(StringBuilder buf);

public class PrettyPrinter : IQueryLanguage<Eval>
{
    public Eval Id() => buf => buf;

    public Eval Const(DateTime date) =>
        buf => buf.Append('\'').AppendFormat("{0:yyyy-MM-dd}", date).Append('\'');

    public Eval Const(double num) =>
        buf => buf.Append(num);

    public Eval Const(string val) =>
        buf => buf.Append('\'').Append(val).Append('\'');

    public Eval Property(string propertyName) =>
        buf => buf.Append(propertyName);

    public Eval And(Eval lhs, Eval rhs) => Op(lhs, "and", rhs);
    public Eval Or(Eval lhs, Eval rhs) => Op(lhs, "or", rhs);
    public Eval LessThan(Eval lhs, Eval rhs) => Op(lhs, "<", rhs);
    public Eval GreaterThan(Eval lhs, Eval rhs) => Op(lhs, ">", rhs);
    public Eval Equal(Eval lhs, Eval rhs) => Op(lhs, "=", rhs);
    public Eval NotEqual(Eval lhs, Eval rhs) => Op(lhs, "!=", rhs);

    static Eval Op(Eval lhs, string op, Eval rhs) =>
        buf => rhs(lhs(buf.Append('(')).Append(' ').Append(op).Append(' ')).Append(')');
}

This generates a nicely formatted output from a parsed expression which should be semantically equal to the input. This enables writing simple tests of the parsing algorithm like this:

using Xunit;
...
public class QueryParseTests
{
    static PrettyPrinter provider = new PrettyPrinter();

    static string Parse(string input) =>
        provider.Parse(input)(new StringBuilder()).ToString();

    [Fact]
    static void SimpleQueryParse()
    {
        Assert.Equal("(foo = 3)", Parse("foo = 3"));
        Assert.Equal("(foo != 3)", Parse("foo != 3"));
        Assert.Equal("(foo < 3)", Parse("foo < 3"));
        Assert.Equal("(foo > 3)", Parse("foo > 3"));
        Assert.Equal("(foo > '2019-08-08')", Parse("foo > '2019-08-08'"));
        Assert.Equal("('2019-08-08' = foo)", Parse("'2019-08-08' = foo"));
        Assert.Equal("('2019-08-08' = foo)", Parse("(('2019-08-08' = foo))"));
        Assert.Equal("", Parse(""));
    }

    [Fact]
    static void CompoundQueryParse()
    {
        Assert.Equal("((foo = 3) and (20 != bar))",
               Parse("((foo = 3) and (20 != bar))"));
        Assert.Equal("((foo != 3) or (20 < bar))",
               Parse("((foo != 3) or (20 < bar))"));
        Assert.Equal("((foo != '2018-08-08') or ('2018-08-08' < bar))",
               Parse("((foo != '2018-08-08') or ('2018-08-08' < bar))"));
    }

    [Fact]
    static void CompoundFailParse()
    {
        Assert.Throws<ArgumentException>(() => Parse("((foo = 3) and (20 != bar)"));
        Assert.Throws<ArgumentException>(() => Parse("((foo = 3) not 20 != bar))"));
    }

    [Fact]
    static void SimpleFailParse()
    {
        Assert.Throws<ArgumentException>(() => Parse("foo = 3)"));
        Assert.Throws<ArgumentException>(() => Parse("(foo = 3"));
        Assert.Throws<ArgumentException>(() => Parse("(foo bar 3)"));
    }

    ...
}

I can even test the Datastore query implementation in a similar way, since the implementation above carries all of the post-processing clauses that must be satisfied. For example:

public class QueryTests
{
    static DatastoreQueries provider = new DatastoreQueries();

    static Term Parse(string input) =>
        provider.Parse(input);

    [Fact]
    static void TestComplex()
    {
        var e = new[]
        {
            new Entity { ["Date"] = new DateTime(2016, 05, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 20.0 },
            new Entity { ["Date"] = new DateTime(2016, 05, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 23.0 },
            new Entity { ["Date"] = new DateTime(2016, 04, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 23.0 },
            new Entity { ["Date"] = new DateTime(2016, 04, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 20.0 },
            new Entity { ["Date"] = new DateTime(2016, 05, 01, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 2.0 },
            new Entity { ["Date"] = new DateTime(2016, 04, 30, 0, 0, 0, DateTimeKind.Utc), ["Distance"] = 4.0 },
        };
        var t0 = Parse("(Date = '2016-05-01') AND ((Distance > 20) OR (Distance < 10))");
        Assert.Equal(2, t0.Filters.Count());
        var matches = t0.Filters.SelectMany(x => e.Where(x.Post)).Distinct().ToList();
        Assert.Equal(2, matches.Count);
        Assert.Contains(matches, x => 23 == x["Distance"].DoubleValue);
        Assert.Contains(matches, x => 2 == x["Distance"].DoubleValue);

        var t1 = Parse("((Date != '2016-04-01') or (Date = '2016-05-01')) AND ((Distance > 20) OR (Distance < 10))");
        Assert.Equal(6, t1.Filters.Count());
        var match1 = t1.Filters.SelectMany(x => e.Where(x.Post)).Distinct().ToList();
        Assert.Equal(3, match1.Count);
        Assert.Contains(match1, x => 23 == x["Distance"].DoubleValue);
        Assert.Contains(match1, x => 2 == x["Distance"].DoubleValue);
        Assert.Contains(match1, x => 4 == x["Distance"].DoubleValue);
    }

    ...
}

Final Words

Apologies for any typos or if something appears to be missing from the above; these are slightly adjusted code fragments from a larger operational project. Let me know if anything needs clarification!

My intent here is to illustrate the value of applying some ideas from academia to real-world programming challenges that crop up in our daily work. Object algebras/tagless final interpreters are a powerful idea for organizing flexible, reusable abstractions. In particular, they let me design, build and test individual parts of the querying API, which ensured that they would all work when combined into a larger system. Hopefully you too can get some use out of this!

Saturday, January 26, 2019

Sasa v1.0.0-RC1

I've finally updated Sasa to target .NET Standard 1.3, removed a lot of superfluous abstractions, and rewrote some of the existing API to target third-party packages that are better supported. This may be the last stable release of Sasa, since I'm shifting my focus to other projects. Fortunately, the remaining Sasa features are very stable so there's not much need to evolve them further.

You can find the full API documentation for v1.0.0 here, and obviously you can find Sasa itself via nuget. Here's the current breakdown of features:

Sasa Assembly

The core Sasa assembly contains extensions on standard .NET types, and a few new and useful abstractions that are typical for most programs:

  • Sasa.FilePath: structured file system path handling, including a jail operation ensuring combined paths don't escape a root path.
  • Sasa.Func: extensions to create delegates on methods and for operators with no restrictions, unlike the standard .NET delegate APIs. For instance, open instance delegates to virtual methods are not permitted, but Sasa.Func handles this case transparently.
  • Sasa.Enums: efficient, typed enum operations, contrasted against the untyped System.Enum API.
  • Sasa.Operators: exposes any type's operators as typed delegates.
  • Sasa.Values: numerous useful extension methods on standard types, like formatting streams of values, line/word wrapping, sequence generators, string splitting, etc.
  • Sasa.VFunc/VAction: typical open instance delegates take a reference type as a first parameter, but value types aren't reference types, so they actually accept a pointer/ref to the struct. Sasa.VFunc/VAction are the struct-counterparts of System.Func/Action.
  • Sasa.Emit.Operators: extensions to simplify working with operators during code generation.
  • Sasa.Either: .NET now includes tuples and value tuples, but no sum/variant type. This is satisfied by Sasa.Either.
  • Sasa.Result: encapsulates either a successfully computed value, or an exception.

Sasa.Binary Assembly

This package provides convenient abstractions for working with binary data.

  • Sasa.Binary.Bits: low-level bit shifting, folding, counting ops.
  • Sasa.Binary.Endian: convenient endianness conversions.
  • Sasa.Binary.UnionXX: types that simplify converting between various types of the same size.
  • Sasa.Binary.IO.BinaryReader/Writer: portable binary reader/writers.

Sasa.Collections Assembly

This package provides a number of very efficient immutable collections, and some extensions on standard .NET collection types.

  • Sasa.Collections.Arrays: extension methods on arrays, to immutably insert, duplicate, append, etc.
  • Sasa.Collections.Fifo: an immutable first-in-first-out queue data structure.
  • Sasa.Collections.FingerTree: an immutable finger-tree.
  • Sasa.Collections.Lifo: an immutable last-in-first-out stack data structure.
  • Sasa.Collections.Vector: the fastest immutable vector for .NET.
  • Sasa.Collections.Trie: the fastest immutable dictionary for .NET.

Sasa.Concurrency Assembly

This package provides a few abstractions for writing concurrent programs.

  • Sasa.Atomics: convenient atomic operations, including atomic and lock-free read/write, load-linked/store-conditional, etc.
  • Sasa.LLSC: a convenient abstraction encapsulating a load-linked/store-conditional variable.
  • Sasa.RWLock: the most lightweight read/write lock available for .NET. Only 4 bytes!

Sasa.Linq Assembly

This package provides sophisticated extensions on IEnumerable.

  • Sasa.Linq.Enumerables: extensions to compute the difference between two IEnumerable streams.
  • Sasa.Linq.Change: describes a difference between two enumerable streams at a given position.
  • Sasa.Linq.ChangeType: the type of difference.

Sasa.Linq.Expressions Assembly

This package provides some abstractions for dealing with LINQ expressions.

  • Sasa.Linq.Expressions.Expression: a lightweight typed wrapper around System.Linq.Expressions.Expression.
  • Sasa.Linq.Expressions.Parameter: a lightweight wrapper around System.Linq.Expressions.ParameterExpression.
  • Sasa.Linq.Expressions.ExpressionVisitor: base class for implementing a basic visitor over LINQ expressions.
  • Sasa.Linq.Expressions.IdentityVisitor: an implementation of ExpressionVisitor that by default simply returns the same expression. Useful if you want to transform only a subset of expressions.
  • Sasa.Linq.Expressions.ErrorVisitor: an implementation of ExpressionVisitor that by default simply throws an error. Useful if you want to ensure you handle all expressions.

Sasa.Mime Assembly

This package makes handling media types and file extension associations convenient.

Sasa.Net Assembly

This package exposes abstractions for network protocols and mail handling.

  • Sasa.Net.Texty: base interface for text-based protocols.
  • Sasa.Net.Pop3.*: abstractions implementing the basic POP3 protocol client.
  • Sasa.Net.Mail.*: extensions on MimeKit's mail abstractions.

Sasa.Numerics Assembly

This package provides a few convenient numerical and statistical calculations.

Sasa.Parsing Assembly

This package provides a few abstractions for creating lexers and parsers via embedded DSLs.

  • Sasa.Parsing.Lexing.*: abstractions for lexing.
  • Sasa.Parsing.Pratt.*: abstractions for easy, extensible and declarative Pratt parsing.

Sasa.Web Assembly

This package provides some convenient extensions for internet programs.

  • Sasa.Web.Uris: efficient URI encoding, decoding and URI query parameter extraction.
  • Sasa.Web.Url64: a base64 encoding that's safe to use in URIs. Much more compact than typical encodings.

SasaMetal Package

This package provides a more useful equivalent to sqlmetal for LINQ 2 SQL. It enables you to map columns and associations to typed properties, including using enums.

Deprecations

The final version of Sasa v1.0.0 will probably remove types or methods which are currently marked obsolete because there are now better alternatives:

  • Sasa.Net: Sasa now uses MailKit for all of its parsing methods, so you're better off using that package directly. Some of the extension methods may remain.
  • Sasa.T, Sasa.Pair, Sasa.Triple, Sasa.Quad: the new System.ValueTuple package has better integration with languages and the runtime.

Previous versions of Sasa had a few more experimental and now unnecessary packages:

  • Sasa.Arrow
  • Sasa.FP
  • Sasa.Partial
  • Sasa.IoC
  • Sasa.Contracts
  • Sasa.Dynamics: see my package Dynamics.NET for further work along these lines.
  • Sasa.Data: LINQ to SQL is not supported on .NET standard, so you will have to use the old package if you still need this.

Saturday, October 6, 2018

RazorInterfaces: interfaces for the standard HTTP methods

In playing around with Razor Pages, I was irritated again that Microsoft couldn't just standardize a set of interfaces for the page handler methods so that they wouldn't need so much magic.

So I just quickly hacked up RazorInterfaces, available also on nuget.org. It's probably most useful to people just learning Razor Pages, since it requires you to correctly implement the common HTTP methods you'll need to get things running:

public class CustomerPage : PageModel
                          , IPageGet
                          , IPagePostAsync<Customer>
{
 public IActionResult OnGet()
 {
  return Page();
 }

 public async Task<IActionResult> OnPostAsync(Customer customer)
 {
  await DataLayer.Update(customer);
  return Page();
 }
}

There are interfaces for all of the standard HTTP methods: GET, POST, PUT, DELETE, HEAD, OPTIONS. The interface names conform to the following format: IPage[HTTP method], with IPage[HTTP method]Async for the async variant; there are also variants accepting a single type parameter, IPage[HTTP method]<T> and IPage[HTTP method]Async<T>.
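
A rough sketch of what IPageGet and IPagePostAsync<T> might look like, inferred from the usage above (see the RazorInterfaces source for the actual definitions):

public interface IPageGet
{
    IActionResult OnGet();
}

public interface IPagePostAsync<T>
{
    Task<IActionResult> OnPostAsync(T model);
}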

Sunday, August 12, 2018

Minimal ASP.NET Core Dependencies for Google Cloud

After a long hiatus, I'm slowly starting back on blogging and some personal projects. To get me back in the swing of things, here's a short post on running ASP.NET Core on Google Cloud, since this seems poorly documented online.

The default project created by the Google Cloud tools includes Microsoft.AspNetCore.All, which is a huge dependency. If you want something more minimal:

  1. uninstall-package Microsoft.AspNetCore.All
  2. install-package Microsoft.AspNetCore
  3. install-package Microsoft.AspNetCore.Mvc.Core
  4. install-package Microsoft.AspNetCore.Mvc

This creates a runnable project using the standard Google Cloud Web API project template, although it still isn't ideal as it includes the Razor pages dependencies.
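
For reference, a minimal Startup that works with just these packages might look something like this (a sketch assuming ASP.NET Core 2.x; the template's generated Startup is similar):

using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    // Register the MVC services needed for Web API controllers.
    public void ConfigureServices(IServiceCollection services) => services.AddMvc();

    // Route requests to attribute-routed controllers.
    public void Configure(IApplicationBuilder app) => app.UseMvc();
}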

Friday, December 2, 2016

Investigative Capabilities in a Digital World - My Responses

A thread on reddit pointed to Canada's consultation on what sort of legislative framework they should create to facilitate investigations in the digital world.

Many of the questions are leading and borderline deceptive in their phrasing, but if we're being charitable, this is a good opportunity to inform policy makers. To that end, these are my responses:

How can the Government address challenges to law enforcement and national security investigations posed by the evolving technological landscape in a manner that is consistent with Canadian values, including respect for privacy, provision of security and the protection of economic interests?

Evidence-based policy suggests you first demonstrate the actual need for any such legislative changes, and further, that the proposed changes actually solve the problems they are intended to solve. Impartial third-party studies should be required for any proposed measures. As a computer scientist with some expertise in computer security, I can say that legislators simply do not have the requisite understanding to judge the need for or the effectiveness of any proposed measures.

The greatest dangers to Canadians are automated software systems operating on behalf of criminals looking to compromise servers and personal devices. The most effective measures to protect Canadians would thus secure their devices against any such intrusions. We enjoy physical security due to law enforcement patrols to prevent crime, the justice system to punish, reform and deter criminals, and a good public education system that empowers people to provide for themselves such that they do not need to resort to crime.

No truly effective law enforcement or justice system is possible for online activities, given criminal activity affecting Canadians can originate from anywhere on Earth. The Internet is truly the wild west, and like the wild west, safety for all Canadians can only be achieved by empowering and educating individuals in how to secure their own devices.

On a legislative level, a modicum of regulation on the software industry to use only secure software systems when dealing with personal information, and to require companies to develop software using safer, more secure techniques, such as using safer programming languages, would significantly reduce all Canadians' vulnerabilities to online criminals.

In the physical world, if the police obtain a search warrant from a judge to enter your home to conduct an investigation, they are authorized to access your home. Should investigative agencies operate any differently in the digital world?

Why should they? In each case, our charter guarantees the security of the person against unreasonable search and seizure. Law enforcement is *supposed* to be inconvenienced if they want to violate someone's rights. That's the whole point of the charter, to safeguard against abuse of power.

Currently, investigative agencies have tools in the digital world similar to those in the physical world. As this document shows, there is concern that these tools may not be as effective in the digital world as in the physical world. Should the Government update these tools to better support digital/online investigations?

Of course, authorized use of such tools ought to be as effective as possible, so long as "these tools" do not necessitate compromising citizens' everyday software in order to be effective. For instance, these tools cannot require backdoors into people's software in order to be effective. The tools used by law enforcement *will* leak into the criminal world, and you would place every citizen's bank account and private life in jeopardy if you require backdoors.

Is your expectation of privacy different in the digital world than in the physical world?

Of course. Physical security and digital security are necessarily different because physical objects and digital objects have completely different properties. For example, it's trivial to copy a digital item, but not easy to copy a physical one. It's thus impossible to enforce the same sorts of property rights for digital items as we enjoy for physical items. You can securely share a private physical photo without too much concern that it will be disseminated globally, but as many people have learned, digital photos are easily leaked and spread online.

Since the Spencer decision, police and national security agencies have had difficulty obtaining BSI in a timely and efficient manner. This has limited their ability to carry out their mandates, including law enforcement's investigation of crimes. If the Government developed legislation to respond to this problem, under what circumstances should BSI (such as name, address, telephone number and email address) be available to these agencies? For example, some circumstances may include, but are not limited to: emergency circumstances, to help find a missing person, if there is suspicion of a crime, to further an investigative lead, etc…

In which *specific* cases has the absence of rapid BSI access hampered a time-sensitive investigation? I hear this "fact" repeated constantly, but no specifics are ever provided. If no compelling case exists, or alternate investigative methods can and have yielded the same results, then rapid subscriber access is an unnecessary infringement on our rights and freedoms, and frankly, a waste of legislators' time.

If such specifics are too sensitive to disclose, then they can be analyzed by an impartial, third­party investigation consisting of computer experts to determine their validity. I know of no such cases, nor any such investigation. If such an investigation justifies rapid subscriber access, then rather than infringing our rights and eliminating all oversight, establish a court that's dedicated solely to rapid processing of such applications, and cycle the judges sitting on these courts every year to prevent overly favourable relationships with those requesting warrants. No blank cheques, and any such requests must still go through a court.

Finally, subscriber information does not identify a responsible party, so relying on this will generate many false leads and waste investigative resources in time-sensitive investigations. For instance, today's deplorable software security makes it fairly easy to hijack someone else's wifi signal, so the false belief that an IP address identifies the person who participated in an illegal activity will lead investigators to waste time pursuing the subscriber who was assigned that IP address. Law enforcement needs to be trained to understand these realities.

Do you consider your basic identifying information identified through BSI (such as name, home address, phone number and email address) to be as private as the contents of your emails? your personal diary? your financial records? your medical records? Why or why not?

Of course it's not as private. My kitchen is also less private than my bedroom, but law enforcement still requires a warrant to access it.

Do you see a difference between the police having access to your name, home address and phone number, and the police having access to your Internet address, such as your IP address or email address?

Internet address, home address, phone number and email address are all logical addresses via which I can be personally reached in some way, and behind which I hold significant personal and private information. My name is different from all of the above: it is not unique, and it does not provide a way to reach me personally.

The Government has made previous attempts to enact interception capability legislation. This legislation would have required domestic communications service providers to create and maintain networks that would be technically capable of intercepting communications if a court order authorized the interception. These legislative proposals were controversial with Canadians. Some were concerned about privacy intrusions. As well, the Canadian communications industry was concerned about how such laws might affect it.

As a computer scientist, I can tell you right now that reliable and complete intercept capability is impossible. If you legislate any such requirements, then those wishing to hide their activities have any number of mechanisms available to *undetectably* circumvent them. And not just via encryption; see steganography, for instance.

Given this impossibility, and the fact that IP addresses cannot identify the responsible party, but merely a means by which a criminal activity took place, interception capabilities are likely an unnecessary waste of resources that would yield too many false positives, and resources are probably better spent on more reliable and effective investigative techniques.

Furthermore, the existence of such networks makes them ripe targets for foreign governments and criminals. The potential for abuse, and the vulnerabilities created, are simply too great, even if well-intentioned.

Should Canada's laws help to ensure that consistent interception capabilities are available through domestic communications service provider networks when a court order authorizing interception is granted by the courts?

Yes, but I still question the effectiveness of such measures, and the cost-benefit ratio of creating this infrastructure.

If the Government were to consider options to address the challenges encryption poses in law enforcement and national security investigations, in what circumstances, if any, should investigators have the ability to compel individuals or companies to assist with decryption?

Never. The most significant danger to Canadians is online criminal activity. Software that compromises people's devices to leak personal information, for identity theft among other criminal purposes, is already being traded online. Any attempt to weaken or circumvent encryption puts every Canadian in danger for very little gain considering messages exchanged between criminals can be hidden via other undetectable means (see discussion of Intercept Capability above).

Canadians as a whole are better served by *requiring* devices to implement proper encryption that cannot be circumvented. Any security researcher will tell you exactly the same thing: securing an encrypted device with any backdoor is impossible. You cannot compromise every Canadian's entire life just to assist a small number of investigations.

How can law enforcement and national security agencies reduce the effectiveness of encryption for individuals and organizations involved in crime or threats to the security of Canada, yet not limit the beneficial uses of encryption by those not involved in illegal activities?

They literally cannot. It's impossible. Any computer scientist who has studied security will tell you the same. Even the best design I have seen to date, which holds decryption keys in hardware not accessible to software running on the device, and which looks pretty safe at first glance, is rife with vulnerabilities when you consider that all of our hardware is built overseas under the purview of foreign governments and in manufacturing plants with completely unknown on-premise security measures. Hardware-level security breaches have already occurred even though there's plenty of other lower-hanging fruit.

Should the law require Canadian service providers to keep telecommunications data for a certain period to ensure that it is available if law enforcement and national security agencies need it for their investigations and a court authorizes access?

Retaining information is fine, so long as that data is required to be appropriately secured from access without a warrant. In particular, telecommunication providers should not be permitted to sell this information, and should be required to conduct third-party audits of the security practices used to archive this data.

Even so, the utility of data retention will probably decrease in the coming years as more services move to the internet, and more information exchanged online becomes encrypted. Already, people can switch most of their telephony services to online voice chat services, thus reducing the effectiveness of telephony data retention.

If the Government of Canada were to enact a general data retention requirement, what type of data should be included or excluded? How long should this information be kept?

I can't imagine a use for data more than 5 years old. Retaining all Internet traffic is impossible, but this is what would be needed for truly effective analysis.

However, full retention is an unacceptable security risk. Even with a public key encryption infrastructure, whereby all such data were stored in encrypted form and the decryption keys were held by the courts to be used only when authorized, securing those keys against any possible disclosure is probably impossibly difficult.

To balance the security risks against realizable data capture abilities, retaining only metadata for 5 years and encrypting it all using the above public key infrastructure would be acceptable.

Editorial Note: I have little expectation that such public key protections would be employed though, and without them, I'm not in favour of any data retention at all.