Tuesday, September 27, 2011

Faster Thread-Local Data with ThreadScoped<T>

Awhile ago, Jeroen Frijters blogged about .NET 4's new ThreadLocal<T>, and how it was implemented. It was based on a neat generics trick exploiting the fact that the CLR does not erase types. Indexing a generic type X with a type T, means X<string> has distinct static and thread-local fields from say, X<int>.

I've actually used this trick before to implement a type of safe reflection.

Jeroen's sample implementation exploiting this trick for thread-local instance variables seemed far too complicated however, and it suffered from a number of limitations that he notes in his post. If Microsoft's implementation was even more complicated as was claimed, I wasn't confident it would perform very well at all. I decided to do my own implementation, without conforming the ThreadLocal<T>'s interface since I have very specific applications in mind.


As I suspected, my implementation was considerably simpler, coming in at less than 90 lines of code, and it doesn't suffer from the problems that Jeroen identifies, save for the fact generated types are not reclaimed until the AppDomain exits. However, ThreadScoped instances are aggressively reused, so I don't anticipate this to be a huge problem.

Conceptually, the implementation is quite simple. We have four views to keep in mind here:

  1. Thread-global static data: ordinary static fields that are thus visible by all threads.
  2. Thread-global instance data: ordinary object fields visible to all threads with access to that instance.
  3. Thread-local static data: static fields marked with [ThreadStatic], giving a thread-specific view of that field.
  4. Thread-local instance data: not provided natively by the CLR, but simulated by ThreadLocal<T> and ThreadScoped<T> via thread-local static data + generics.
Our thread-global static data consists of a lock-free linked list of unallocated ThreadScoped instances. Our thread-global instance data is a version number, and a 'next' pointer that points to the next available ThreadScoped<T>. This is used when freeing an instance by pushing it on the global list. Accessing the global and instance data are all done via Interlocked.CompareExchange, so all updates are lock-free.

Now comes the tricky part: the private type ThreadScoped<T>.Ref<TIndex> has two static fields, one for the value T and one for a version number (more on this later):

sealed class Ref<TIndex> : ThreadScoped<T>
  static T scoped;     // the unique thread-local slot
  static uint version; // the version number of the thread-local data

The type parameter TIndex is not actually used in a static or instance field, it's merely used here as a "phantom type". Basically, if we keep substituting new types for TIndex, we'll keep forcing the CLR to generate new thread local static fields for us that we can use to simulate thread-local instance fields!

This is done in the Ref<TIndex>.Allocate() method. The global instance list always contains at least one unallocated instance. Whenever we try to allocate a new ThreadScoped<T>, we check whether we're down to the last instance in the list. If so, this last instance will generate a Ref<Ref<TIndex>> and enqueue that on the end:

internal override ThreadScoped<T> Allocate()
    // if 'next' is null, we are at the end of the list of free refs,
    // so allocate a new one and enqueue it, then return 'this'
    var x = next;
    if (x != null) return this;
    x = Interlocked.CompareExchange(ref next, new Ref<Ref<TIndex>>(), null);
    // atomic swap failure doesn't matter, since the caller of Acquire()
    // accesses whatever instance is at this.next
    return this;

ThreadScoped.Create then pops the front of the list to obtain the next to last instance, and we're back to only having one unallocated instance.

There's an important invariant here: there is always at least one item in the global list, and the last item on the global list is always the most deeply nested generic type that has been generated so far for ThreadScoped<T>. This means when we get to the last remaining unallocated instance, we can always safely generate a new instance without interfering with other threads.

The version numbers also bear some explanation. Basically, even if Thread0 disposes of an ThreadScoped instance, Thread1 may have data sitting in that instance. Suppose that the ThreadScoped instance is then pushed on the free list, and then allocated again. If Thread1 participates in that second computation, it will already find data sitting in its supposedly fresh thread-local instance from the last computation where it was supposed to have been disposed!

Obviously this is not what we want, but while Thread0 is disposing of the instance, it can't access Thread1's thread-local fields to clear them. This is the purpose of the instance and the thread-local version numbers. During dispose, we increment the instance's version number. If the thread-local version number and the instance version number don't match, it means the instance was disposed in the intervening time, so we should reinitialize it before proceeding:

public override T Value
    get { return current == version ? scoped : Update(); }
    set { scoped = value; }
T Update()
    if (next != this) throw new ObjectDisposedException("Transacted<T> has been disposed.");
    version = current;
    return scoped = default(T);

And that's it! There are a few other minor details related to book keeping that aren't really important. I think that's pretty much as simple as you can get, and since this results in effectively a direct pointer to thread-local data, it should be quite fast. There are also no intrinsic allocation limits, as it will just keep allocating or reusing instances as needed.


Here is the source for the benchmarks I used. The main loop runs 10,000,000 iterations of a simple calculation that performs one read and one write to the thread-local instance per iteration. I ran this test 20 times in a loop, logging the number of iterations per second on each run. Then I used Peirce's Criterion to filter out the outliers, and used the resulting data set.

The numbers are iterations/second, benchmarks were run on .NET 4 and Windows 7. The results are fairly compelling:

ThreadLocal<T> ThreadScoped<T> ThreadStatic

5022000 16260000 22396000

5042000 14378000 18484000

4972000 16514000 20790000

5002000 15722000 22470000

5070000 14244000 19860000

5098000 13946000 16570000

5076000 16474000 22246000

5102000 13994000 20532000

5012000 13494000 14482000

5108000 16090000 21598000

5074000 16778000 20000000

5104000 15408000 21620000

5076000 12762000 16312000

5054000 16792000 21008000

5064000 16380000 21856000

5108000 16154000 15910000

5066000 16750000 22988000

5108000 16076000 18778000

5012000 15986000 23094000

MIN 4972000 12762000 14482000
MAX 5108000 16792000 23094000
AVG 5061579 15484316 20052316
% Overhead relative
to ThreadStatic
296% 29% 0%

Microsoft's implementation takes a real beating in this simple benchmark with almost 300% overhead over raw ThreadStatic fields. I'm not sure what could be going on in there to cause such a significant slowdown

By contrast, ThreadScoped has a modest 30% overhead, which is far more reasonable. I suspect this overhead is due to two factors: 1. virtual dispatch overhead because ThreadScoped.Value is a virtual property, and 2. the encapsulated instance for Ref <TIndex> may require a bit of pointer-chasing to resolve the right thread-local static field. I can't think of a way to eliminate this overhead, so this is as good as it's going to get for now.

I will note that I've had a few programs where ThreadLocal<T> did not perform as badly as demonstrated above, so it may be that reads or writes are more penalized. However, no benchmark or program I tested had ThreadLocal outperforming ThreadScoped, so I can say with confidence that ThreadScoped is much faster than ThreadLocal.

Jeroen's post also implied that he had tested a ThreadLocal version that used arrays for the backing store, and that it was faster. I also implemented a version of ThreadScoped using arrays, but haven't tested it enough. It did seem quite a bit faster with some code I had been playing with, but there were many uncontrolled variables, so I can't reach any conclusion with any confidence. The array-backed version is commited to the Sasa repository for the adventurous though. There are serious limitations with using arrays as a backing store however, namely that you're stuck with a fixed number of instances defined at compile-time, and allocating large enough arrays to avoid falling back on slow thread-local data wastes quite a bit of memory.

Still, I will run some benchmarks at some future date, so stay tuned! As for ThreadScoped<T>, it will probably be in the next Sasa release coming in a month or two, or you can just grab it from the Mercurial repo.

No comments: