Monday, July 28, 2008

Garbage Collection Representations

I have yet to find a comprehensive description of the various data representations used for garbage collection. This isn't an overview of garbage collection algorithms, but an overview of how to distinguish pointers from other values and how to traverse the data structures, which has seemingly gotten only a short riff in the literature. I aim to give an in-depth overview of the techniques I've come across or managed to puzzle out on my own.
  1. Boxing: all values are allocated on the heap, so basically everything is a pointer, thus, there is no ambiguity between pointers and values.

  2. Tagged pointers: pointers are distinguished from ordinary values by some sort of tag. This information is often encoded in the pointer or value itself for a very compact representation.

  3. Type-passing: each parameter of a function is associated with a set of operations, usually referred to as a "dictionary", which is similar to an interface in object-oriented languages. This dictionary defines a collect() operation for each type, which will traverse the structure looking for garbage. This dictionary can be passed into every function, or it can be reconstructed from the stack.

Why Are Special Representations Needed?


One might first ask why special representations are even needed. At compile-time, it's generally perfectly clear what types are being used, as well as the types of the various data structures.

While this is true of simple C-like languages, it isn't true of polymorphic languages. C is what's known as monomorphic, meaning all parameters have a fixed, known type at compile-time.

By contrast, polymorphic languages can have function parameters that can accept any type, and this type is not fixed at any time. A simple example is the identity function:
val id: 'a -> 'a
This function accepts a parameter of any type, and returns a value of the same type. But since you can't know this type ahead of time, you need some way to encode the type in a running program so the garbage collector has enough information to traverse it in its search for garbage.

This problem only really crops up with the combination of polymorphism and separate compilation, meaning you compile a program to a single machine code file which can then be distributed without prior knowledge of how it will be instantiated and used. The .NET CLR is an example of a platform which supports polymorphism, but not separate compilation, since it uses the JIT to specialize methods that use generics. A specialized method is GC'd differently depending on whether it's instantiated with a reference type or a value type. In separately compiled languages, this is not true, hence the need for a special representation.

Boxing


All boxed values are allocated in the heap, including integers and floats. This is the most straightforward representation, and the least efficient of the options both in terms of space and time.

The extra space use comes from the additional pointer needed to indirect to an int, and the additional allocator headers required to track it.

The additional runtime overhead is incurred when repeatedly indirecting the pointers to access the values, and the additional garbage collector pressure from having so many small values.

Allocating many small types like this could also lead to significant memory fragmentation, which balloons memory use, and induces poor locality of reference.

A boxed version of the identity function would look like this:
void * id(void * v) {
return v;
}
Basically, we don't know anything about the value involved, and the void pointer just points to a heap location where the value is stored.

There is a single flag stored in the allocator headers to indicate whether this is a compound structure or a value. An int is a value, and a (int,int) pair is a compound. All compounds consist of pointers, so the collector just recursively traverses all of the pointers in the structure, and halts when it reaches values. For compounds, the size of the structure must also be encoded in the header.

Tagged Pointers


Tagged pointers have a long tradition in functional languages. Typically, all of the tag bits are encoded within pointers and values themselves, so the representation is very compact. This is important for functional languages in particular, since they encourage the allocation of many small objects.

Modern memory allocation algorithms generally segregate free space in variously sized buckets. The smallest bucket is generally the size of a pointer, or perhaps twice the size of a pointer. This is generally 4 bytes on a 32-bit machine, so at the very least, the pointer will always point to an address aligned on a 4-byte boundary.

However, most pointers are byte-addressable, meaning they can point to addresses that are aligned on a single byte boundary. This means that the lowest two bits of a pointer are generally wasted space, since they will always be 0. So let's use that wasted space for something!

Let's say we use the lowest bit to differentiate between pointers and integers on the stack and in data structures. This means an int will no longer be 32-bits, but 31-bits, with 1 bit going to this flag. If the lowest bit is 0, it's an int, and if the low order bit is 1, it's a pointer. This means we can scan a stack frame or a data structure one word at a time and easily distinguish integers from pointers by just checking the lowest bit.

So now we can have fully unboxed integers, which saves a lot of time and space in programs since we no longer need to allocate them on the heap. Anything larger than a word must still be heap-allocated.

Field offset calculations now require us to omit the lowest bit, which involves subtracting 1 from any offset. Since most offsets are statically known, this costs nothing at runtime.

There is one remaining problem with the above representation, namely, we still don't know the size of a structure! When the GC traverses a structure, it will iterate right off of the end since it doesn't know when to stop.

The paper I linked to above places the size in a header word present in every structure. This header word is split into multiple fields. On a 32-bit machine, the lowest 22 bits are used for the size of the structure in words, the next two bits are used by the garbage collector (typically a mark-sweep), and the last 8 bits are used for the constructor tag, ie. Nil or Cons. The constructor tag is typically present in every data structure, but using a whole word for it is somewhat wasteful, so this more compact representation is quite useful.

This representation may also permit some interesting optimizations. Since word-sized values are unboxed, all heap allocated structures will be at least 8-bytes in size, consisting of the header word followed by a number of data words (a commenter pointed out that this doesn't necessarily imply 8-byte alignment however; while true, as mentioned above allocators segregate memory into variously sized buckets, and so we need to simply ensure that these buckets are multiples of 8-bytes). This means the third lowest order bit is also unused, giving us two free bits in every pointer.

Consider the following type declaration:
type 'a Foo = Foo1 | Foo2 of int | Foo3 of 'a | Foo4 of 'a * 'a
Since Foo1 does not contain any embedded values or pointers, we don't need a pointer for it at all and it can just be represented by a simple unboxed integer.

Foo2 contains an integer, so we must allocate a heap structure to contain this case.

Foo3 and Foo4 are the interesting cases. 'a is a polymorphic value, and if instantiated with a pointer type, we don't need to allocate a heap structure for it. Basically, we can use the two free bits to encode up to 3 constructors for such values, and remove a level of indirection. Pattern matching on such structures is broken down into 3 cases:
  1. If it's an int, then it's one of the empty constructors.

  2. If it's a pointer, and the second and third lowest order bits are clear, extract the constructor tag from the header, and the value is just after the header.

  3. Else, extract the constructor tag from the pointer, and the pointer is a pointer to the value.
However, to maintain separate compilation, we cannot assume that 'a is a pointer type, as it could be an int. Therefore only Foo4 can benefit from this inlining optimization in this particular case.

Coming back to the question of identifying structure sizes, an alternative representation which I have not seen described anywhere, is to use the unused third lowest bit as a mark bit to terminate GC scanning, and use the second lowest bit to distinguish structures with nested pointers, and those that are pointer-free.

A '1' in the second lowest order bit position flags a "value structure" which contains no pointers. In other words, it consists entirely of integer-sized data, so the GC does not need to scan this structure. Alternately, a '0' in this position indicates a "compound structure" which contains pointers, and which must be scanned by the GC.

A '1' in the third lowest order bit position flags the last pointer in a data structure. This bit is only set in pointers within data structures, and is clear everywhere else.

Note that only the last pointer is used for the halt condition, since the collector cares only about pointers and ignores unboxed values. The compiler can optimize garbage collection by placing all pointers at the beginning of a structure. Also, when extracting a pointer from a structure or writing one into a structure, it should be masked to clear or set the appropriate bits.

Using this representation, we free the 22-bits used for the structure size in the header. We can thus reserve a full 24 bits for the GC, which expands our choice of GC algorithms considerably. Even ref-counting becomes feasible with 24-bits in the header.

One possible downside of this representation is that the size header was also used for array lengths. The length must now be stored in its own field in the structure, so we've effectively grown the array type by one word. Considering the size of arrays compared to a single word, this seems like an acceptable overhead for this representation's uniformity and flexibility.

Type-Passing


Type-passing, aka tagless GC, is the most sophisticated GC technique. There are two approaches that I'm aware of:
  1. Dictionary-passing: a dictionary consisting of functions specific to the type is passed along with all values of that type. These functions know how to collect values of that type.

  2. Type Reconstruction: basically, the concrete type of a polymorphic function is reconstructed at runtime by inferring the type from previous activation frames. Monomorphic functions have all the type information needed to fully specify any polymorphic parameters, so we just search the stack for the type information stored there.
The benefits are that pointers are just pointers, integers are full-sized integers, and no special meaning is assigned to their values or any bits contained within the word.

To my knowledge, only two languages have used this technique, Id and Mercury, and only Mercury is still available as far as I know. The .NET CLR's technique can also be considered a form of type-passing.

The downsides of dictionary-passing are that every polymorphic parameter is paired with a dictionary parameter, thus increasing the parameter lists of all polymorphic functions. The additional pushing and popping can outweigh any advantages that might be gained from dictionary-passing.

Type-reconstruction can remove the need to pass this parameter at the expense of additional runtime overhead. Basically, we don't need to pass this dictionary to every function call, we just need to get it paired with the parameter in the monomorphic function which calls the first polymorphic function. As we walk the stack we can then extract the dictionary for a given parameter from the last monomorphic function. This becomes more complicated with closures, since the dictionaries for polymorphic parameters must now be saved in the closure environment as well.

This process is essentially a form of type inference performed at runtime. Unfortunately, type inference often has pathological corner cases with exponential running time. These cases are unlikely, but inference in general has substantial overhead (see the Id paper).

Acknowledgments


Many thanks to Andreas Rossberg, of Alice ML fame, for patiently wading through my sometimes confused ramblings. :-)

[Edit: GHC's recent pointer tagging paper actually discusses the constructor inlining optimization I describe above, along with some hard numbers to quantify how many constructors can be eliminated. Turns out that >95% of constructors contain 3 cases or less, so the technique could amount to a substantial savings. Theoretically, it seems that every second nested data type could be unboxed in this fashion.]

See Also

  1. Garbage Collection Representations Continued
  2. Garbage Collection Representations Continued
  3. Garbage Collection Representations Continued II

Friday, July 18, 2008

Coroutines in C Redux

My last post generated a bit of discussion here and on Reddit, where after further testing, I discovered that stack management didn't work properly on Windows. It would seem that embedded frame pointers on the stack are used to ensure that we are on the same stack, at least in DEBUG mode, and so growing or shrinking the stack generated cryptic errors.

These problems motivated me to explore alternate methods of implementing coroutines. Fundamentally, the current state of a computation is encapsulated in the stack. So switching between two concurrent computations means modifying the stack, either by swapping out the old stack for an entirely new one, or by copying the relevant portion of the stack to a save area, and restoring it when switching back. Coroutines and threads typically use stack switching, and continuations have traditionally used some sort of copying.

The initial coroutine implementation used stack switching, which is fairly efficient incurring only a register save, a pointer update, and a register restore. Stack copying involves a register save, a copy to save area, a copy from a save area, and a register restore. The additional copies incur a certain overhead.

The benefits of copying are that it's virtually guaranteed to work across all platforms, and that all stack data are restored to their original locations so the full semantics of C are available. Stack switching restricts certain C operations, such as taking the address of a local variable.

My naive copying implementation was about two orders of magnitude slower than stack switching. It performed about 300,000 context switches/sec on my Core 2 Duo, vs stack switching at ~12,500,000 context switches/sec. I reduced this overhead by an order of magnitude by caching stacks (~1,500,000 context switches/sec), so part of the stack save area is wasted, but at least we're not allocating and freeing on each context switch.

The copying implementation can be optimized further using lazy stack copying. Basically, only a portion of the stack is restored, and an underflow handler is set up as the return address of the last stack frame. The underflow handler then restores the next portion of the stack and resumes the computation. If the underflow handler is never invoked, then we only need to copy that small portion of the stack back to the save area, so this ends up being a huge win.

Unfortunately, while establishing an underflow handler is trivial to do in assembly, I haven't been able to figure out how to accomplish this in C. It would be a neat trick though, and I'd estimate that it would bring the performance of a copying implementation very close to a stack switching implementation.

I've also started playing with an implementation of delimited continuations for C, since this has the potential to reduce the amount of copying needed. It's tricky to do in C though, so I don't have anything solid yet.

For expediency, I have also implemented coroutines using Windows Fibers. Unfortunately, Fibers don't support any sort of copy or clone operation, so I had to retire coro_clone for the time being. This means my coroutine library can no longer be used to implement multishot continuations. Of course, this capability is trivial in a stack copying implementation. Hopefully I'll be able to figure out a way to improve the performance of copying coroutines/continuations in a portable way.

Wednesday, July 16, 2008

Coroutines in C

I've just uploaded a functional coroutine library for C, called libconcurrency. It's available under the LGPL. I think it's the most complete, flexible and simplest coroutine implementation I've seen, so hopefully it will find some use.

Next on the todo list are some more rigourous tests of the corner cases, and then extending libconcurrency to scale across CPUs. This will make it the C equivalent of Manticore for ML.

There is a rich opportunity for scalable concurrency in C. Of course, I only built this library to serve as a core component of a virtual machine I'm building, and that's all I'm going to say about that. ;-)