Thursday, November 27, 2008

Compact, Declarative Serialization

A few posts back, I hinted at using parameterization as an alternative to metadata. I've just written a serialization interface using this technique, and a binary serializer/deserializer to demonstrate the inherent tradeoffs.

You can inspect the pair of interfaces required, and I implemented a binary serializer/deserializer pair as an example. ICompactSerializable is implemented for each serializable object. It's essentially a declarative method, which describes the sequential, internal structure of the object. It's simple and fast, since it provides native speed access to an object's fields without reflection, and no need for metadata.

Of course, the obvious downside is that clients must describe the internal structure themselves via ICompactSerializer, and refactoring must be careful about reordering the sequence of calls. The upshot is that serialization and deserialization is insanely fast as compared to ordinary reflection-driven serialization, the binary is far more compact, and clients have full control over versioning and schema upgrade without additional, complicated infrastructure (surrogates, surrogate selectors, binders, deserialization callbacks, etc.).

These serializers are very basic, but the principle is sound. My intention here is only to demonstrate that parameterization can often substitute for the typical approach of relying heavily on metadata and reflection. This is only one possible design of course, and other tradeoffs are possible.

For instance, each method in ICompactSerializer could also accept a string naming the field, which could make the Serialize call invariant to the ordering of fields, and thus more robust against refactoring; this brings the technique much closer to the rich structural information available via reflection, but without the heavy metadata infrastructure of the .NET framework.

The Serialize method of ICompactSerializable can also easily be derived by a compiler, just as the Haskell compiler can automatically derive some type class instances.

This application is so basic, that the user wouldn't even need to specify the field names manually, as the compiler could insert them all automatically. Consider how much metadata, and how much slow, reflection-based code can be replaced by fast compiler-derived methods using such techniques. Projects like NHibernate wouldn't need to generate code to efficiently get and set object properties, since full native speed methods are provided by the compiler.

Saturday, November 15, 2008

The Chicken and the Egg - Redux

There is still considerable skepticism regarding my conclusion that the chicken comes first. Many of the objections are simply due to different interpretations of the question, interpretations which I consider unfaithful to the original purpose of the chicken-egg dilemma.

Fundamentally, this question is supposed to represent a causality dilemma. When there is a causal dependency between any two objects, A and B, we can ask ourselves, "which came first, A or B?" An object C could have caused B, and thus triggered the recursive relation C→B→A→B→A... Of course, it could just as easily have been A that was caused first.

To properly answer this question, we must reduce the abstract objects A and B to concrete objects and apply our scientific knowledge to ground the recursion. Using chickens and eggs as our objects, we have to precisely define what chickens and what eggs to consider.

The question "which came first, chickens or any type of egg?" is not a causal dilemma at all, and further is not faithful to the original purpose of the question. The ancient Greeks that first posed this question had no concept of evolution and no inkling that chickens could have any relationship to other egg types, so they would not have asked, "which came first, the chicken or the fish egg?" To the ancient Greeks, such a question isn't a dilemma, it's complete nonsense. Thus, the paper linked in my last post is invalid.

The more precise and faithful phrasing of the dilemma is, "which came first, the chicken or the chicken egg?", or more generally, "which came first, species Sn or it's reproductive mechanism Rn?"

Sn and Rn are in the appropriate recursive relation to form a causal dilemma, and we can ground it in biology and chemistry. The very first single-celled organism did not form by mitosis, thus the first single-celled organism, S preceded its own reproduction mechanism, R. Thus, the dominoes fall, ie. the first egg-laying species was not hatched from an egg, thus it too preceded its own reproductive mechanism, and so on, ad infinitum.

Eventually, we reach chickens and chicken eggs, and the conclusion is simply, that the chicken necessarily came before its egg.

Friday, November 7, 2008

Embedded Stack Language for .NET

Parametric polymorphism is a powerful tool for constructing and composing safe programs. To demonstrate this, I've constructed a tiny embedded stack language in C#, and I exploited C# generics, aka parametric polymorphism, to ensure that any embedded programs are type-safe, and thus, "can't go wrong".
Moreover, this language is jitted, since I use ILGenerator of System.Reflection.Emit, and the complexity of doing this was no higher than creating an interpreter. This is mainly because the CLR is itself a stack-based VM, and so jitting a stack-based program to a stack-based IL is fairly natural.

Here is the type-safe representation of the stack:
public struct Stack<R, T> : IDisposable
{
internal ILGenerator gen;
public Stack(ILGenerator gen)
{
this.gen = gen;
}
public void Dispose()
{
this.Return();
}
}

This type encapsulates the IL output stream and uses two phantom type variables to encode the values at the top of the execution stack. The T parameter represents the top of the stack, and the R parameter represents the rest of the stack.

A few more types are needed to safely represent other reflected objects used during code generation:
// Represents the 'void' type used in some functions
public struct Unit
{
}
// Represents a named function with a known type structure T→R.
// This could be a void→void, an int→string, etc.
public struct Function<T, R>
{
internal MethodInfo method;
public Function(Action<T> f)
{
this.method = f.Method;
}
}

Here's a test program demonstrating the use:

class Program
{
static void Main(string[] args)
{
var d = new DynamicMethod("test", typeof(void), null);
using (var s = new Stack<Unit, Unit>(d.GetILGenerator()))
{ // equivalent to: () => Console.WriteLine(1 + 2);
s.Int(1)
.Int(2)
.Add()
.Apply(new Function<int, Unit>(Console.WriteLine));
}
d.Invoke(null, null);
Console.ReadLine();
}
}

Next I define a set of extension methods operating on typed stack objects, all parameterized by the underlying types. You can see how the top elements of the stack are specified, and how each operation consumes these elements or otherwise modifies the stack:
public static class CodeGen
{
// Apply a function to its arguments; note that function args and arity
// is checked at compile-time
public static Stack<R, R0>
Apply<R, T, R0>(this Stack<R, T> stack, Function<T, R0> target)
{
stack.gen.EmitCall(OpCodes.Call, target.method, null);
return new Stack<R, R0>(stack.gen);
}
// Embed an literal integer in the IL stream
public static Stack<Stack<R, T>, int>
Int<R, T>(this Stack<R, T> stack, int i)
{
stack.gen.Emit(OpCodes.Ldc_I4, i);
return new Stack<Stack<R, T>, int>(stack.gen);
}
// Add the two integers at the top of the execution stack
public static Stack<R, T>
Add<R, T>(this Stack<Stack<R, T>, T> stack)
{
stack.gen.Emit(OpCodes.Add);
return new Stack<R, T>(stack.gen);
}
// Return from the current function
public static Stack<R, T>
Return<R, T>(this Stack<R, T> stack)
{
stack.gen.Emit(OpCodes.Ret);
return new Stack<R, T>(stack.gen);
}
}

This is obviously a fairly limited set of functionality, and it would require a great deal more work and thought to properly abstract the full execution of CIL (especially exceptions!), but it just might be possible. A large subset is certainly possible.

It might serve as an interesting alternative to System.Linq.Expressions for some code generation applications, since Linq expressions are largely untyped. Here are a few more opcode examples for reference purposes:

// Duplicate the top element
public static Stack<Stack<R, T>, T>
Dup<R, T>(this Stack<R, T> stack)
{
stack.gen.Emit(OpCodes.Dup);
return new Stack<Stack<R, T>, T>(stack.gen);
} // index into an array
public static Stack<R, T>
LoadArrayIndex<R, T>(this Stack<R, Stack<T[], int>> stack)
{
stack.gen.Emit(OpCodes.Ldelem, typeof(T));
return new Stack<R, T>(stack.gen);
}
// runtime type test
public static Stack<R, T>
IsOfType<R, T, T>(this Stack<R, T> stack)
where T : class
{
stack.gen.Emit(OpCodes.Isinst, typeof(T));
return new Stack<R, T>(stack.gen);
}

Pasting all of the code in this post into a file and compiling it should just work.

Wrapping all CIL opcodes this way could be quite useful, since you're guaranteed to have a working program if your code type checks. If you have a language, embedded or otherwise, using this sort of technique might be a good way to achieve a type-safe JIT for .NET, assuming you can assign a stack-based execution semantics. Since you're compiling to .NET, it will need exactly such a stack semantics anyway!

Tuesday, November 4, 2008

Reflection, Attributes and Parameterization

I used to be a big fan of reflection, and C#'s attributes also looked like a significant enhancement. Attributes provide a declarative way to attach metadata to fields, methods, and classes, and this metadata is often used during reflection.

The more I learned about functional programming, type systems, and so on, the more I came to realize that reflection isn't all it's cracked up to be. Consider .NET serialization. You can annotate fields you don't want serialized with the attribute [field:NonSerialized].

However, metadata is just data, and every usage of attributes can be replaced with a pair of interfaces. Using [field:NonSerialized] as an example, we can translate this class:
class Foo {
[field:NonSerialized]
object bar;
}

Into one like this:
// these two interfaces take the place of a NonSerializableAttribute declaration
interface INonSerialized {
void Field<T>(ref T field);
}
interface IUnserializableMembers {
void Unserializable(INonSerialized s);
}
class Foo : IUnserializableMembers {
object bar;
void Unserializabe(INonSerializable s) {
s.Field(ref bar);
}
}

Essentially, we are replacing reflection and metadata with parameterization, which is always safer. This structure is also much more efficient than reflection and attribute checking.

Consider another example, the pinnacle of reflection, O/R mapping. Mapping declarations are often specified in separate files and even in a whole other language (often XML or attributes), which means they don't benefit from static typing. However, using the translation from above, we can obtain the following strongly type-checked mapping specification:
// specifies the relationships of objects fields to table fields
interface IRelation
{
// 'name' specifies the field name in the table
void Key<T>(ref T id, string name);
void Field<T>(ref T value, string name);
void Foreign<T>(ref T fk, string name);
void ForeignInverse<T>(ref T fk, string foreignName);
void List<T>(ref IList<T> list);
void List<T>(ref IList<T> list, string orderBy);
void Map<K, T>(ref IDictionary<K, T> dict);
}
// declares an object as having a mapping to an underlying table
interface IPersistent
{
void Map(IRelation f);
}
// how to use the above two interfaces
class Bar: IPersistent
{
int id;
string foo;
public void Map(IRelation f)
{
f.Key(ref id, "Id");
f.Field(ref foo, "Foo");
}
public string Foo
{
get { return foo; }
}
}

There are in general two implementors of IRelation: hydration, when the object is loaded from the database and the object's fields are populated, and write-back, when the object is being written back to the database. The IRelation interface is general enough to support both use-cases because IRelation accepts references to the object's fields.

This specification of the mappings is more concise than XML mappings, and is strongly type-checked at compile-time. The disadvantage is obviously that the domain objects are exposed to mapping, but this would be the case with attributes anyway.

Using XML allows one to cleanly separate mappings from domain objects, but I'm coming to believe that this isn't necessarily a good thing. I think it's more important to ensure that the mapping specification is concise and declarative.

Ultimately, any use of attributes in C# for reflection purposes can be replaced by the use of a pair of interfaces without losing the declarative benefits.

Monday, November 3, 2008

Garbage Collection Representations Continued

In a previous post, I discussed representations used for polymorphic parameters. These special data representations are a contract between the language and the runtime which enables the runtime to traverse data structures to collect garbage.

64-Bit Polymorphic Representation


I had neglected to mention this interesting variant of tagged representations that I had come across at one point, but which I believe is not in widespread use. This representation uses full 64-bit floating point representations for all polymorphic values, and encodes non-floating point data in the unused bits of NaN values. IEEE 754 floating point numbers reserve 12 of the 64 bits for flagging NaN values, so the remaining bits are free to use for encoding integers, pointers, and so on.

I haven't found very much empirical data on this approach, but it looks promising. The polymorphic representation is doubled in size, but the structure is unboxed. Thus, the polymorphic function call overhead may be slightly higher, but garbage collector pressure is reduced since less boxing is needed. Numerical code is likely to see the biggest speedups.

Also, this representation can be used to ensure that all values are properly 8-byte aligned, which is often an efficiency gain in modern memory systems. This also has the advantage of permitting unboxed floating point values, which is a significant slowdown for the other tagged representations.

Natural Representations with Bitmasks


A new representation I recently came across is being used in the SML# language. Consider a typical tagged representation where 1-bit is co-opted to distinguish integers from pointers. Each polymorphic parameter must carry this 1 bit.

But there is no reason for this 1 bit to be contained within the parameter itself. Consider lifting this 1 bit to an extra function parameter, so:
val f: 'a → 'b → 'c

becomes:
val f: 'a * Bool → 'b * Bool → 'c * Bool

The Bool is the flag indicating whether the parameter is a pointer or an integer. But using a full function parameter to hold a single bit is a waste, so let's coalesce all the Bool parameters into a single Bitmask:
val f: Bitmask → 'a → 'b → 'c

Bitmask can be an arbitrarily long sequence of bits, and each bit in the mask corresponds to the bit that would normally be in the pointer/integer representation of the corresponding parameter in a typical tagged representation. By lifting the tag bit to an external parameter, we can now use full natural representations for all parameters.

Calling a polymorphic function now consists of masking and shifting to extract the bit for the given parameter, or assigning one if calling from a monomorphic function.

Bitmasks must also accompany polymorphic data structures. The presence of any polymorphic type variable implies the requirement for a bitmask. It's also not clear how expensive this shifting is in relation to inline tag bits. SML# does not yet have a native code compiler, so any comparisons to other SML compilers aren't representative.

However, the bitmask representation does not unbox floating point numbers, but a hybrid scheme with the above 64-bit representation is possible.

Consider the lifted 1-bit to indicate whether the parameter is word-sized (1), or larger (0). Larger parameters require a full 64-bit representation, while word-sized parameters do not. The garbage collector thus skips any parameters of bitmask tag value 1 (since they are unboxed integers), and analyzes any values of bitmask tag value 0 as a 64-bit value. If the 64-bit is a NaN, the value is further analyzed to extract a single distinguished bit indicating whether it is a pointer. If it is not a pointer, it is skipped (since it is an unboxed floating point value). If it is a pointer, the structure pointed to is traversed.

This last scheme provides natural representations for integers and floating point numbers, and expands the representation of pointers in polymorphic functions by one word. Other tradeoffs are certainly possible.

See Also

  1. Garbage Collection Representations
  2. Garbage Collection Representations Continued
  3. Garbage Collection Representations Continued II