Tuesday, June 23, 2009

Mobile Code in C# via Finally Tagless Interpreters

Awhile back I described an idea to transparently execute code server-side or client-side given the same program. I've finally gotten around to implementing this using my encoding for type constructor polymorphism in C#.

Here is an example you can run from the original paper showcasing exponentiation which can execute transparently either server-side or client-side. The server-side code is intentionally limited to bases less than 100 and exponents less than 20.

Here is some simplified code that defines an exponentiation function:
void Build<B, R>(B _)
where B : ISymantics<B>, new()
{
var e = _.Lambda<int, Func<int, int>>(
x => _.Fix<int, int>(self => _.Lambda<int, int>(n =>
_.If(_.Lte(n, _.Int(0)), () => _.Int(1),
() => _.Mul(x, _.Apply(self, _.Add(n, _.Int(-1))))))));
}

The parameter "_" is the tagless interpeter and it's type is ISymantics<B>, which defines the semantics of the language being interpreted:
/// <summary>
/// The abstract base class of a typed expression.
/// </summary>
/// <typeparam name="T">The type of the encapsulated expression.</typeparam>
/// <typeparam name="B">The brand of the interpreter that can interpret this expression.</typeparam>
public abstract class E<T, B> where B : ISymantics<B>, new() { }

/// <summary>
/// The interface for interpreters.
/// </summary>
/// <typeparam name="B">The brand of the interpreter. This should be unique for each interpreter.</typeparam>
public interface ISymantics<B>
where B : ISymantics<B>, new()
{
/// <summary>
/// Construct an integer.
/// </summary>
/// <param name="i">The integer value.</param>
/// <returns>An expression encapsulating the value.</returns>
E<int, B> Int(int i);

/// <summary>
/// Add two expressions.
/// </summary>
/// <param name="left">The left expression.</param>
/// <param name="right">The right expression.</param>
/// <returns>The sum of the two expressions</returns>
E<int, B> Add(E<int, B> left, E<int, B> right);

...
}

The code-behind for the .aspx page interprets this code twice, once with a LINQ-based compiler for server-side execution, and once with a JavaScript backend to generate the client-side program.

The client-side program depends only on a utility function to compute the fixed point of a given function:
function fix(f) {
return function self(x) { return f(self)(x); };
}

Clearly any programs written this way are fairly verbose, but it's fairly straightforward to define wrapper structs around E<T, B>, such as Eint, Ebool, etc., to provide overloaded operators such that embedded programs look almost like ordinary C#. The current ISymantics interface is also limited to single-arg functions, but it's possible to add more overloads to support multi-arg functions.

Of course, it's already possible to translate C# expressions to JavaScript if we write a LINQ expression visitor. However, LINQ expressions are in a sense more limited than operating on this lifted data type. For instance, it's not possible to translate a static function to JavaScript using LINQ expressions, but it is possible using the final tagless representation.

It's an open question whether such final tagless interpreters can be made usable enough for real programming. However, the techniques employed here, in particular the structure which enables semi-safe type constructor polymorphism, are definitely useful for other projects that require retargetable abstractions.

The full source code is available, and includes a simple C# evaluator, LINQ-based compiler, a JavaScript backend, and an interpreter that computes the depth of the expression tree, which was also an example in the tagless paper. Also included is a partially working partial evaluator. This proved the trickiest to implement, and the only remaining problem appears to be recursive partial evaluation.

As future work, the partial evaluator can be parameterized by the type of compiler, so it can partially evaluate managed code and JavaScript programs; it is currently needlessly tied to the managed code interpreter. I also plan to implement the aforementioned wrapper structs and other enhancements to improve the clarity of embedded programs, hopefully to the point where they are usable for real programming.

Friday, May 22, 2009

Sasa v0.9 Released

Sasa v0.9 has been released. See the changelog for a detailed description of the changes. Here is the original release description for Sasa v0.8. This post will describe only changes from v0.8.

Backwards-incompatible changes:
  • Renamed Sasa.Collections.List to Sasa.Collections.Seq to avoid clashes with System.Collections.Generic.List

  • Restructured the list operators to better support chaining

Useful additions include:
  • Sasa.Weak<T> which wraps WeakReference

  • Additional string processing functions, like StringExt.SliceEquals

  • Array-processing combinators under Sasa.Collections.Arrays (Slice, Dup, etc.)

  • Stream functions under Sasa.IO (CopyTo, ToArray)

  • Support for MIME decoding and encoding for MailMessage parsing

Bugfixes:
  • Better conformance to RFCs for Pop3Client and MailMessage parsing

  • Concurrency bugfix in Sasa.Lazy.

MIME MailMessage parsing and the Pop3Client are already in use in production code, and conformance appears adequate after hundreds of processed messages.

Experimental Additions


There are a few components of this release that I would deem "experimental".

Sasa.Dynamics is intended as a "safe reflection" facility. Basically, this is a "type-case" construct as found in the "intensional type analysis" literature. Any reflective algorithm should be implementable via ITypeFunc<R>, and you cannot forget to handle a particular case. This interface basically factors out object traversal from the actual reflection algorithm.

Under Sasa.Web and Sasa.Web.Asp, I've included a URL-safe Base64 encoder/decoder, Sasa.Web.Url64, and a generic ASP.NET Page base class that is immune to clickjacking and CSRF.

I first proposed this idea on the capability security mailing list. I'm not completely satisfied with the current incarnation, but it's an interesting idea I'm still toying with.

Sunday, March 22, 2009

Garbage Collection Representations Continued II

The tagged pointer representation described in Xavier Leroy's ZINC paper is compelling in its simplicity. Most data manipulated in typical programs is integer or pointer-based in some way, so using a 1-bit tagged representation allows unboxed integers, resulting in an efficient representation of the most common data types.

I've never been satisfied with the size header in heap-allocated types though, and the two bits reserved for GC pretty much forces you to use some mark-sweep variant. Integrating something like age-oriented collection with reference counting for the old generation would require an entirely new word for every block of memory. This is a prohibitive cost in functional programs.

Removing the Size Header


Reading about BIBOP techniques used in the Streamflow allocator gave me the idea of using page masking to determine the block size.

As in Streamflow, consider an allocator where each page obtained from the OS was partitioned into a list of equally-sized blocks. Each of these blocks does not need its size stored in an object header, since the size of the block is a property of the page it was allocated from, and we can easily obtain the page address by simply masking the pointer to the block. So the first block of every page is dedicated to storing the size, and all subsequent blocks are placed in the free list.

This works for structured data allocated in fixed-sized blocks, but unstructured data is dynamic in extent and can span pages. Unstructured data must also carry the GC header in the first word, even when spanning pages. However, we know that arrays and strings are the only primitive unstructured data described in ZINC, and they must now both carry their size information as part of the structure. We can thus easily identify "small" strings and arrays that could fit into a fixed-sized block.

As a possible space optimization, such "small" arrays and strings don't need to be accompanied by a size field. We can perform a simple test to distinguish "small" arrays: if the array pointer is more than one word off from a page-aligned address, it is structured data, since unstructured data that spans pages always starts on a page address + GC header size.

We now have a full 24 free bits in the block header, which we can reserve for use by the GC. 24 bits is enough to employ reference counting, or a hybrid mark-sweep for the nursery, reference counting for the old generation as in age-oriented collection. The GC to employ is now completely at the discretion of the runtime, and can even be changed while the program is running.

The Problem


There is a downside of course. Streamflow allocates fixed-sized blocks for sizes ranging from 4 bytes up to 2kB. Considering the above approach uses the first block to store the block size, we would waste up to 2kB on the larger block sizes. We could amortize this cost by allocating more pages and spreading the cost across them, but then we lose the simple page masking. We could avoid large fixed-sized blocks, maybe cutting them off at 512 bytes, but then we would still need some way to store the size for these "medium-sized" blocks.

We can't simply use a single word of the page to store the size, as we would then have an odd number of bytes to divide into fixed size blocks. Again, it's only a problem for larger block sizes. You can't split 4092 bytes into two 2kB blocks.

Another Approach


As a general solution to these problems, we can eliminate the size tag on the page entirely by maintaining a sorted array of address ranges and the block size of the range. When we need to discover the size of a block, we perform a binary search to find the address range the address falls in, and extract the size. This operation is O(log n), where n is the number of distinct address ranges, ie. sequences of consecutive pages allocated to the same block size. I would expect the number of such page regions to be under 50, so the logarithmic cost should be very low in practice, though the additional indirections accessing the array can be costly due to cache misses.

Summary


A block header representation that provides for GC independence would be compelling. The page-level size header is promising, but clearly has problems. Hopefully, some clever person will devise a general way to handle the larger fixed sized blocks without wasting too much space.

The solution to the problems with the page-level size header using a binary search is a mixed blessing. The additional runtime cost in discovering block size may be prohibitive, since it must be performed during GC traversal, and on every free operation. The costs of free might possibly be amortized somehow when done in bulk, but the cost incurred by GC traversal seems much more challenging to eliminate.

Tuesday, February 24, 2009

Sasa Reborn!

My .NET Sasa library has fallen by the wayside as I experimented with translating various functional idioms in my FP# library. Reading up on what a few other generic class libraries have been experimenting with, like Mono Rocks, spurred me to putting those experiments to use and updating Sasa. I significantly simplified a lot of the code, documented every class and method, and generalized as much as possible. The license is LGPL v2, and you can download the source via svn:
svn co https://sasa.svn.sourceforge.net/svnroot/sasa/tags/v0.8 sasa

Sasa Core v0.8


A set of useful extensions to core System classes and some useful classes for high assurance development.
  • Named tuple types: Pair, Triple, Quad.

  • Either types, representing one of many possible values. There are Either types for 2, 3, and 4 parameters, mimicking the Pair, Triple, and Quad structure. Tuples are "product" types, while Either is a "sum" type, and products and sums are duals. Since products are useful, I figured variously sized sum types might also find some uses. Time will well.

  • Lazy type, for lazily computed values.

  • An immutable list.

  • Various Ruby-like extensions to core types, like generators for int.UpTo, int.DownTo, string.IsNullOrEmpty, string.Slice, etc.

  • Useful extensions to IEnumerable.

  • "Zip" functions from Haskell for anonymous types and tuple types.

  • A NonNull type which decorates method parameters and ensures those parameters are not null; if NonNull is used pervasively, you can ensure that your program is free of NullReferenceExceptions.

  • An Option type indicating values which may be null. Unlike System.Nullable, this works for class types.

  • Function currying extensions, and extensions to lift multi-parameter functions to single-parameter tupled functions

  • Some convenience extensions to IDictionary.

Sasa.Linq


A stand-alone assembly for Linq development.
  • Default IQueryProvider and IQueryable implementations

  • Generic ExpressionVisitor base class.

  • IdentityVisitor which provides default implementations for all NodeTypes and performs no modifications to the expression, just returning it as-is.

  • ErrorVisitor which which throws NotSupportedException for all NodeTypes.

Sasa.Serialization


A stand-alone assembly with serialization classes.
  • Provides a compact serializer which requires only ReflectionPermission, and not SecurityPermission like the System classes do; this serializer can therefore be used in medium trust environments. The serializer currently requires a little more discipline from the developer to use correctly, but space savings of 100-200% are typical.

  • An experimental unsafe, highly compact binary serializer.

Sasa.Net


A library providing missing functionality under System.Net.
  • A Pop3Client class.

  • MailMessage parsing.

Sasa.CodeContracts


Microsoft Research is developing a design by contract library which they hope to release with .NET 4.0. It's a fairly sophisticated piece of software, that integrates with a static verification tool called Pex. The analysis tools can detect contract violations at compile-time, and even generate test cases for each violation.

Unfortunately, their license forbids commercial application of the pre-release library, even if you just want to utilize runtime contract checking.

Sasa.CodeContracts is a Microsoft API-compatible implementation of the CodeContracts library. This is only a runtime library, and does not provide the Pex integration with static analysis and automated test generation.

Precondition checking is enabled, but postconditions and object invariants require CIL re-writing, so they are not currently supported. I will be looking into using Mono.Cecil to rewrite the IL to support post-conditions and invariants in the future.

TODO for v1.0


There are a few items remaining before v1.0 is released, but the library is usable as-is. Notably missing is MIME parsing for MailMessage, which will be added for v1.0. Also serialization will get improved safety almost on-par with standard framework serialization, and the compaction will be user-customizable for even more space savings in any given program.

Future Work


The Sasa API is fully documented with accompanying XML for code completion. Comments on the clarity of the API and documentation are welcome! Some tutorials on using these features safely are coming as well.

I'm dissatisfied with a few other approaches being pursued on the CLR, including:
  • Current approaches to parallel and concurrent programming, even Microsoft's Parallel Extensions and the Concurrency and Coordination Runtime.

  • CLR security is far too coarse-grained and pretty much unusable.

  • Efficient async I/O is too difficult to reason about (though the CCR does make it easier).

  • In lieu of a Pex static analysis, there is the possibility of QuickCheck-like test suites derived from CodeContract annotations.

Keep an eye on this space for what I come up with.

Tuesday, February 3, 2009

The cost of type tests and casts in C#

Awhile back I ran some tests comparing the dispatching performance of runtime tests+casts against double dispatch. Turned out runtime type tests and casting were noticeably faster than dispatching, probably because they avoid more pipeline stalls.

Unfortunately, there is a "common wisdom" in the .NET world that an "is" test followed by an "as" cast is performing two casts, and in fact one should simple perform the "as" cast then check the result against null:
// prefer this form
string a = o as string;
if (a != null)
{
Console.WriteLine(a);
}

// to this form:
if (o is string)
{
string a = o as string;
Console.WriteLine(a);
}

In fact, that's not the case, as any compiler worth its salt will coalesce the two tests into a single cast and branch operation. I took the tests from the above dispatching and altered them to perform the cast-and-null check, then I ran the tests again with the original is-then-as form. The latter form was about 6% faster on every timing run.

There's obviously some optimization being done here, but the lesson is: don't try to outsmart the compiler. In general, just write code the safe way and let the compiler optimize it for you. It's safer to perform a test then cast within a delimited scope like an if-statement, than to let the possibly null variable float around in the outer scope where you might use it inadvertently later in the method or during refactoring.

If performance turns out to be an issue, profile before trying these sorts of low-level "optimizations", because you might be surprised at the results.

Thursday, November 27, 2008

Compact, Declarative Serialization

A few posts back, I hinted at using parameterization as an alternative to metadata. I've just written a serialization interface using this technique, and a binary serializer/deserializer to demonstrate the inherent tradeoffs.

You can inspect the pair of interfaces required, and I implemented a binary serializer/deserializer pair as an example. ICompactSerializable is implemented for each serializable object. It's essentially a declarative method, which describes the sequential, internal structure of the object. It's simple and fast, since it provides native speed access to an object's fields without reflection, and no need for metadata.

Of course, the obvious downside is that clients must describe the internal structure themselves via ICompactSerializer, and refactoring must be careful about reordering the sequence of calls. The upshot is that serialization and deserialization is insanely fast as compared to ordinary reflection-driven serialization, the binary is far more compact, and clients have full control over versioning and schema upgrade without additional, complicated infrastructure (surrogates, surrogate selectors, binders, deserialization callbacks, etc.).

These serializers are very basic, but the principle is sound. My intention here is only to demonstrate that parameterization can often substitute for the typical approach of relying heavily on metadata and reflection. This is only one possible design of course, and other tradeoffs are possible.

For instance, each method in ICompactSerializer could also accept a string naming the field, which could make the Serialize call invariant to the ordering of fields, and thus more robust against refactoring; this brings the technique much closer to the rich structural information available via reflection, but without the heavy metadata infrastructure of the .NET framework.

The Serialize method of ICompactSerializable can also easily be derived by a compiler, just as the Haskell compiler can automatically derive some type class instances.

This application is so basic, that the user wouldn't even need to specify the field names manually, as the compiler could insert them all automatically. Consider how much metadata, and how much slow, reflection-based code can be replaced by fast compiler-derived methods using such techniques. Projects like NHibernate wouldn't need to generate code to efficiently get and set object properties, since full native speed methods are provided by the compiler.

Saturday, November 15, 2008

The Chicken and the Egg - Redux

There is still considerable skepticism regarding my conclusion that the chicken comes first. Many of the objections are simply due to different interpretations of the question, interpretations which I consider unfaithful to the original purpose of the chicken-egg dilemma.

Fundamentally, this question is supposed to represent a causality dilemma. When there is a causal dependency between any two objects, A and B, we can ask ourselves, "which came first, A or B?" An object C could have caused B, and thus triggered the recursive relation C→B→A→B→A... Of course, it could just as easily have been A that was caused first.

To properly answer this question, we must reduce the abstract objects A and B to concrete objects and apply our scientific knowledge to ground the recursion. Using chickens and eggs as our objects, we have to precisely define what chickens and what eggs to consider.

The question "which came first, chickens or any type of egg?" is not a causal dilemma at all, and further is not faithful to the original purpose of the question. The ancient Greeks that first posed this question had no concept of evolution and no inkling that chickens could have any relationship to other egg types, so they would not have asked, "which came first, the chicken or the fish egg?" To the ancient Greeks, such a question isn't a dilemma, it's complete nonsense. Thus, the paper linked in my last post is invalid.

The more precise and faithful phrasing of the dilemma is, "which came first, the chicken or the chicken egg?", or more generally, "which came first, species Sn or it's reproductive mechanism Rn?"

Sn and Rn are in the appropriate recursive relation to form a causal dilemma, and we can ground it in biology and chemistry. The very first single-celled organism did not form by mitosis, thus the first single-celled organism, S preceded its own reproduction mechanism, R. Thus, the dominoes fall, ie. the first egg-laying species was not hatched from an egg, thus it too preceded its own reproductive mechanism, and so on, ad infinitum.

Eventually, we reach chickens and chicken eggs, and the conclusion is simply, that the chicken necessarily came before its egg.