Monday, July 28, 2008

Garbage Collection Representations

I have yet to find a comprehensive description of the various data representations used for garbage collection. This isn't an overview of garbage collection algorithms, but an overview of how to distinguish pointers from other values and how to traverse the data structures, which has seemingly gotten only a short riff in the literature. I aim to give an in-depth overview of the techniques I've come across or managed to puzzle out on my own.
  1. Boxing: all values are allocated on the heap, so basically everything is a pointer, thus, there is no ambiguity between pointers and values.

  2. Tagged pointers: pointers are distinguished from ordinary values by some sort of tag. This information is often encoded in the pointer or value itself for a very compact representation.

  3. Type-passing: each parameter of a function is associated with a set of operations, usually referred to as a "dictionary", which is similar to an interface in object-oriented languages. This dictionary defines a collect() operation for each type, which will traverse the structure looking for garbage. This dictionary can be passed into every function, or it can be reconstructed from the stack.

Why Are Special Representations Needed?


One might first ask why special representations are even needed. At compile-time, it's generally perfectly clear what types are being used, as well as the types of the various data structures.

While this is true of simple C-like languages, it isn't true of polymorphic languages. C is what's known as monomorphic, meaning all parameters have a fixed, known type at compile-time.

By contrast, polymorphic languages can have function parameters that can accept any type, and this type is not fixed at any time. A simple example is the identity function:
val id: 'a -> 'a
This function accepts a parameter of any type, and returns a value of the same type. But since you can't know this type ahead of time, you need some way to encode the type in a running program so the garbage collector has enough information to traverse it in its search for garbage.

This problem only really crops up with the combination of polymorphism and separate compilation, meaning you compile a program to a single machine code file which can then be distributed without prior knowledge of how it will be instantiated and used. The .NET CLR is an example of a platform which supports polymorphism, but not separate compilation, since it uses the JIT to specialize methods that use generics. A specialized method is GC'd differently depending on whether it's instantiated with a reference type or a value type. In separately compiled languages, this is not true, hence the need for a special representation.

Boxing


All boxed values are allocated in the heap, including integers and floats. This is the most straightforward representation, and the least efficient of the options both in terms of space and time.

The extra space use comes from the additional pointer needed to indirect to an int, and the additional allocator headers required to track it.

The additional runtime overhead is incurred when repeatedly indirecting the pointers to access the values, and the additional garbage collector pressure from having so many small values.

Allocating many small types like this could also lead to significant memory fragmentation, which balloons memory use, and induces poor locality of reference.

A boxed version of the identity function would look like this:
void * id(void * v) {
return v;
}
Basically, we don't know anything about the value involved, and the void pointer just points to a heap location where the value is stored.

There is a single flag stored in the allocator headers to indicate whether this is a compound structure or a value. An int is a value, and a (int,int) pair is a compound. All compounds consist of pointers, so the collector just recursively traverses all of the pointers in the structure, and halts when it reaches values. For compounds, the size of the structure must also be encoded in the header.

Tagged Pointers


Tagged pointers have a long tradition in functional languages. Typically, all of the tag bits are encoded within pointers and values themselves, so the representation is very compact. This is important for functional languages in particular, since they encourage the allocation of many small objects.

Modern memory allocation algorithms generally segregate free space in variously sized buckets. The smallest bucket is generally the size of a pointer, or perhaps twice the size of a pointer. This is generally 4 bytes on a 32-bit machine, so at the very least, the pointer will always point to an address aligned on a 4-byte boundary.

However, most pointers are byte-addressable, meaning they can point to addresses that are aligned on a single byte boundary. This means that the lowest two bits of a pointer are generally wasted space, since they will always be 0. So let's use that wasted space for something!

Let's say we use the lowest bit to differentiate between pointers and integers on the stack and in data structures. This means an int will no longer be 32-bits, but 31-bits, with 1 bit going to this flag. If the lowest bit is 0, it's an int, and if the low order bit is 1, it's a pointer. This means we can scan a stack frame or a data structure one word at a time and easily distinguish integers from pointers by just checking the lowest bit.

So now we can have fully unboxed integers, which saves a lot of time and space in programs since we no longer need to allocate them on the heap. Anything larger than a word must still be heap-allocated.

Field offset calculations now require us to omit the lowest bit, which involves subtracting 1 from any offset. Since most offsets are statically known, this costs nothing at runtime.

There is one remaining problem with the above representation, namely, we still don't know the size of a structure! When the GC traverses a structure, it will iterate right off of the end since it doesn't know when to stop.

The paper I linked to above places the size in a header word present in every structure. This header word is split into multiple fields. On a 32-bit machine, the lowest 22 bits are used for the size of the structure in words, the next two bits are used by the garbage collector (typically a mark-sweep), and the last 8 bits are used for the constructor tag, ie. Nil or Cons. The constructor tag is typically present in every data structure, but using a whole word for it is somewhat wasteful, so this more compact representation is quite useful.

This representation may also permit some interesting optimizations. Since word-sized values are unboxed, all heap allocated structures will be at least 8-bytes in size, consisting of the header word followed by a number of data words (a commenter pointed out that this doesn't necessarily imply 8-byte alignment however; while true, as mentioned above allocators segregate memory into variously sized buckets, and so we need to simply ensure that these buckets are multiples of 8-bytes). This means the third lowest order bit is also unused, giving us two free bits in every pointer.

Consider the following type declaration:
type 'a Foo = Foo1 | Foo2 of int | Foo3 of 'a | Foo4 of 'a * 'a
Since Foo1 does not contain any embedded values or pointers, we don't need a pointer for it at all and it can just be represented by a simple unboxed integer.

Foo2 contains an integer, so we must allocate a heap structure to contain this case.

Foo3 and Foo4 are the interesting cases. 'a is a polymorphic value, and if instantiated with a pointer type, we don't need to allocate a heap structure for it. Basically, we can use the two free bits to encode up to 3 constructors for such values, and remove a level of indirection. Pattern matching on such structures is broken down into 3 cases:
  1. If it's an int, then it's one of the empty constructors.

  2. If it's a pointer, and the second and third lowest order bits are clear, extract the constructor tag from the header, and the value is just after the header.

  3. Else, extract the constructor tag from the pointer, and the pointer is a pointer to the value.
However, to maintain separate compilation, we cannot assume that 'a is a pointer type, as it could be an int. Therefore only Foo4 can benefit from this inlining optimization in this particular case.

Coming back to the question of identifying structure sizes, an alternative representation which I have not seen described anywhere, is to use the unused third lowest bit as a mark bit to terminate GC scanning, and use the second lowest bit to distinguish structures with nested pointers, and those that are pointer-free.

A '1' in the second lowest order bit position flags a "value structure" which contains no pointers. In other words, it consists entirely of integer-sized data, so the GC does not need to scan this structure. Alternately, a '0' in this position indicates a "compound structure" which contains pointers, and which must be scanned by the GC.

A '1' in the third lowest order bit position flags the last pointer in a data structure. This bit is only set in pointers within data structures, and is clear everywhere else.

Note that only the last pointer is used for the halt condition, since the collector cares only about pointers and ignores unboxed values. The compiler can optimize garbage collection by placing all pointers at the beginning of a structure. Also, when extracting a pointer from a structure or writing one into a structure, it should be masked to clear or set the appropriate bits.

Using this representation, we free the 22-bits used for the structure size in the header. We can thus reserve a full 24 bits for the GC, which expands our choice of GC algorithms considerably. Even ref-counting becomes feasible with 24-bits in the header.

One possible downside of this representation is that the size header was also used for array lengths. The length must now be stored in its own field in the structure, so we've effectively grown the array type by one word. Considering the size of arrays compared to a single word, this seems like an acceptable overhead for this representation's uniformity and flexibility.

Type-Passing


Type-passing, aka tagless GC, is the most sophisticated GC technique. There are two approaches that I'm aware of:
  1. Dictionary-passing: a dictionary consisting of functions specific to the type is passed along with all values of that type. These functions know how to collect values of that type.

  2. Type Reconstruction: basically, the concrete type of a polymorphic function is reconstructed at runtime by inferring the type from previous activation frames. Monomorphic functions have all the type information needed to fully specify any polymorphic parameters, so we just search the stack for the type information stored there.
The benefits are that pointers are just pointers, integers are full-sized integers, and no special meaning is assigned to their values or any bits contained within the word.

To my knowledge, only two languages have used this technique, Id and Mercury, and only Mercury is still available as far as I know. The .NET CLR's technique can also be considered a form of type-passing.

The downsides of dictionary-passing are that every polymorphic parameter is paired with a dictionary parameter, thus increasing the parameter lists of all polymorphic functions. The additional pushing and popping can outweigh any advantages that might be gained from dictionary-passing.

Type-reconstruction can remove the need to pass this parameter at the expense of additional runtime overhead. Basically, we don't need to pass this dictionary to every function call, we just need to get it paired with the parameter in the monomorphic function which calls the first polymorphic function. As we walk the stack we can then extract the dictionary for a given parameter from the last monomorphic function. This becomes more complicated with closures, since the dictionaries for polymorphic parameters must now be saved in the closure environment as well.

This process is essentially a form of type inference performed at runtime. Unfortunately, type inference often has pathological corner cases with exponential running time. These cases are unlikely, but inference in general has substantial overhead (see the Id paper).

Acknowledgments


Many thanks to Andreas Rossberg, of Alice ML fame, for patiently wading through my sometimes confused ramblings. :-)

[Edit: GHC's recent pointer tagging paper actually discusses the constructor inlining optimization I describe above, along with some hard numbers to quantify how many constructors can be eliminated. Turns out that >95% of constructors contain 3 cases or less, so the technique could amount to a substantial savings. Theoretically, it seems that every second nested data type could be unboxed in this fashion.]

13 comments:

Matt said...

"Since [...] all heap allocated structures will be at least 8-bytes in size, [...] the third lowest order bit is also unused"

Wrong! :)

Sandro Magi said...

Care to elaborate on that?

Matt said...

Sorry - I just meant that it doesn't logically follow. 'At least 8 bytes' is different than '8-byte aligned' (You could have 12 bytes, e.g.) It makes the extra alignment more palatable though.

Sandro Magi said...

Right, so we'd have to ensure that the structure terminates on an 8-byte boundary too, which is trivial to achieve with an allocator implementation that uses buckets aligned to 8-byte boundaries.

Matt said...

My comment was just a nitpick of the way the quoted passage was written, which seems to imply that the 3rd bit will be free *anyway* and require no extra alignment from the allocator. This is definitely a minor quibble - that's what the smiley meant :). Overall I found the article interesting and useful. Thanks

Sandro Magi said...

I've added a parenthetical to that line to clarify the meaning. Thanks! :-)

John Cowan said...

The Lisp literature discusses a fourth basic method, the Big Bag of Pages (BBOP or BiBOP, as in "Don't Stop the BiBOP"). In this scheme, the heap is segmented into many arenas (classically the same size as the pages used by virtual memory), each of which contains only a single type. In that way, one can infer the type of an object by masking off the top bits of the pointer to it and looking up those bits in a table.

Stanley Shebs's dissertation at the University of Utah (not online as far as I know, but he was collecting data in 1988) was about an expert system that, given certain constraints, was able to figure out how to set various parameters in the design of a Lisp runtime system (he also looked at Prolog and APL briefly), one of which was precisely the tagged objects vs. tagged pointers vs. BBOP choice. He also pointed out that it's quite possible to combine them: the original Smalltalk system used pointer tagging for small integers but object tagging for everything else.

Leo Meyerovich said...

not sure how you'd characterize efforts influenced by "garbage collection in an uncooperative environment" (not sound? :)) in this breakdown, where the focus is simply on address

Sandro Magi said...

Re: BIBOP techniques

Interesting. I remember coming across mentions of them a few years ago, but I didn't have the background at the time to understand them. Were any measurements ever made as to how much space was wasted by sparsely populated pages/relatively unused types? How does it handle array/vector types where one instance may span many pages, and another may occupy less than a page?

Leo, I'm not sure what you mean by efforts influenced by the paper I linked to, but based on address. Do you mean conservative collection? I'm just dealing with accurate GC in this article, since conservative GC does not require any special representations at all.

Morrisett said...

The TIL compiler used a type-passing approach for handling polymorphism. This was detailed in my thesis. You'll find other references to tag-free collectors there.

Sandro Magi said...

I debated whether to mention TIL. The most recent paper I perused said it used a hybrid scheme, not a pure type reconstruction approach, but perhaps I'm wrong.

Thanks for the thesis link, it looks very interesting! I'll definitely have to read up on the nitty gritty details of TIL.

Tom said...

Using this representation, we free the 22-bits used for the structure size in the header. We can thus reserve a full 24 bits for the GC, which expands our choice of GC algorithms considerably. Even ref-counting becomes feasible with 24-bits in the header.

But how does the GC know how much space to de-allocate if it doesn't know the size of the structure? Obviously, if the heap is partitioned in regions of equally-sized data, this is easy. But inevitably, there will be some data that will be particularly large and will probably have to be allocated in the "the rest" region - and the GC would not know its size.

Sandro Magi said...

But how does the GC know how much space to de-allocate if it doesn't know the size of the structure? Obviously, if the heap is partitioned in regions of equally-sized data, this is easy.

There a number of ways to do this. Your suggestion to partition is one way. Consider a more general tree sorted on address to track regions. There are also BIBOP techniques, as used in the Streamflow allocator.