Faster Thread-Local Data with ThreadScoped<T>

Awhile ago, Jeroen Frijters blogged about .NET 4's new ThreadLocal<T>, and how it was implemented. It was based on a neat generics trick exploiting the fact that the CLR does not erase types. Indexing a generic type X with a type T, means X<string> has distinct static and thread-local fields from say, X<int>.

I've actually used this trick before to implement a type of safe reflection.

Jeroen's sample implementation exploiting this trick for thread-local instance variables seemed far too complicated however, and it suffered from a number of limitations that he notes in his post. If Microsoft's implementation was even more complicated as was claimed, I wasn't confident it would perform very well at all. I decided to do my own implementation, without conforming the ThreadLocal<T>'s interface since I have very specific applications in mind.

ThreadScoped<T>

As I suspected, my implementation was considerably simpler, coming in at less than 90 lines of code, and it doesn't suffer from the problems that Jeroen identifies, save for the fact generated types are not reclaimed until the AppDomain exits. However, ThreadScoped instances are aggressively reused, so I don't anticipate this to be a huge problem.

Conceptually, the implementation is quite simple. We have four views to keep in mind here:

Thread-global static data: ordinary static fields that are thus visible by all threads.
Thread-global instance data: ordinary object fields visible to all threads with access to that instance.
Thread-local static data: static fields marked with [ThreadStatic], giving a thread-specific view of that field.
Thread-local instance data: not provided natively by the CLR, but simulated by ThreadLocal<T> and ThreadScoped<T> via thread-local static data + generics.

Our thread-global static data consists of a lock-free linked list of unallocated ThreadScoped instances. Our thread-global instance data is a version number, and a 'next' pointer that points to the next available ThreadScoped<T>. This is used when freeing an instance by pushing it on the global list. Accessing the global and instance data are all done via Interlocked.CompareExchange, so all updates are lock-free.

Now comes the tricky part: the private type ThreadScoped<T>.Ref<TIndex> has two static fields, one for the value T and one for a version number (more on this later):

sealed class Ref<TIndex> : ThreadScoped<T>
{
  [ThreadStatic]
  static T scoped;     // the unique thread-local slot
  [ThreadStatic]
  static uint version; // the version number of the thread-local data
}

The type parameter TIndex is not actually used in a static or instance field, it's merely used here as a "phantom type". Basically, if we keep substituting new types for TIndex, we'll keep forcing the CLR to generate new thread local static fields for us that we can use to simulate thread-local instance fields!

This is done in the Ref<TIndex>.Allocate() method. The global instance list always contains at least one unallocated instance. Whenever we try to allocate a new ThreadScoped<T>, we check whether we're down to the last instance in the list. If so, this last instance will generate a Ref<Ref<TIndex>> and enqueue that on the end:

internal override ThreadScoped<T> Allocate()
{
    // if 'next' is null, we are at the end of the list of free refs,
    // so allocate a new one and enqueue it, then return 'this'
    var x = next;
    if (x != null) return this;
    x = Interlocked.CompareExchange(ref next, new Ref<Ref<TIndex>>(), null);
    // atomic swap failure doesn't matter, since the caller of Acquire()
    // accesses whatever instance is at this.next
    return this;
}

ThreadScoped.Create then pops the front of the list to obtain the next to last instance, and we're back to only having one unallocated instance.

There's an important invariant here: there is always at least one item in the global list, and the last item on the global list is always the most deeply nested generic type that has been generated so far for ThreadScoped<T>. This means when we get to the last remaining unallocated instance, we can always safely generate a new instance without interfering with other threads.

The version numbers also bear some explanation. Basically, even if Thread0 disposes of an ThreadScoped instance, Thread1 may have data sitting in that instance. Suppose that the ThreadScoped instance is then pushed on the free list, and then allocated again. If Thread1 participates in that second computation, it will already find data sitting in its supposedly fresh thread-local instance from the last computation where it was supposed to have been disposed!

Obviously this is not what we want, but while Thread0 is disposing of the instance, it can't access Thread1's thread-local fields to clear them. This is the purpose of the instance and the thread-local version numbers. During dispose, we increment the instance's version number. If the thread-local version number and the instance version number don't match, it means the instance was disposed in the intervening time, so we should reinitialize it before proceeding:

public override T Value
{
    get { return current == version ? scoped : Update(); }
    set { scoped = value; }
}
T Update()
{
    if (next != this) throw new ObjectDisposedException("Transacted<T> has been disposed.");
    version = current;
    return scoped = default(T);
}

And that's it! There are a few other minor details related to book keeping that aren't really important. I think that's pretty much as simple as you can get, and since this results in effectively a direct pointer to thread-local data, it should be quite fast. There are also no intrinsic allocation limits, as it will just keep allocating or reusing instances as needed.

Benchmarks

Here is the source for the benchmarks I used. The main loop runs 10,000,000 iterations of a simple calculation that performs one read and one write to the thread-local instance per iteration. I ran this test 20 times in a loop, logging the number of iterations per second on each run. Then I used Peirce's Criterion to filter out the outliers, and used the resulting data set.

The numbers are iterations/second, benchmarks were run on .NET 4 and Windows 7. The results are fairly compelling:

	*ThreadLocal<T>*	*ThreadScoped<T>*	*ThreadStatic*
	5022000	16260000	22396000
	5042000	14378000	18484000
	4972000	16514000	20790000
	5002000	15722000	22470000
	5070000	14244000	19860000
	5098000	13946000	16570000
	5076000	16474000	22246000
	5102000	13994000	20532000
	5012000	13494000	14482000
	5108000	16090000	21598000
	5074000	16778000	20000000
	5104000	15408000	21620000
	5076000	12762000	16312000
	5054000	16792000	21008000
	5064000	16380000	21856000
	5108000	16154000	15910000
	5066000	16750000	22988000
	5108000	16076000	18778000
	5012000	15986000	23094000

*MIN*	4972000	12762000	14482000
*MAX*	5108000	16792000	23094000
*AVG*	5061579	15484316	20052316
% Overhead relative to ThreadStatic	296%	29%	0%

Microsoft's implementation takes a real beating in this simple benchmark with almost 300% overhead over raw ThreadStatic fields. I'm not sure what could be going on in there to cause such a significant slowdown

By contrast, ThreadScoped has a modest 30% overhead, which is far more reasonable. I suspect this overhead is due to two factors: 1. virtual dispatch overhead because ThreadScoped.Value is a virtual property, and 2. the encapsulated instance for Ref <TIndex> may require a bit of pointer-chasing to resolve the right thread-local static field. I can't think of a way to eliminate this overhead, so this is as good as it's going to get for now.

I will note that I've had a few programs where ThreadLocal<T> did not perform as badly as demonstrated above, so it may be that reads or writes are more penalized. However, no benchmark or program I tested had ThreadLocal outperforming ThreadScoped, so I can say with confidence that ThreadScoped is much faster than ThreadLocal.

Jeroen's post also implied that he had tested a ThreadLocal version that used arrays for the backing store, and that it was faster. I also implemented a version of ThreadScoped using arrays, but haven't tested it enough. It did seem quite a bit faster with some code I had been playing with, but there were many uncontrolled variables, so I can't reach any conclusion with any confidence. The array-backed version is commited to the Sasa repository for the adventurous though. There are serious limitations with using arrays as a backing store however, namely that you're stuck with a fixed number of instances defined at compile-time, and allocating large enough arrays to avoid falling back on slow thread-local data wastes quite a bit of memory.

Still, I will run some benchmarks at some future date, so stay tuned! As for ThreadScoped<T>, it will probably be in the next Sasa release coming in a month or two, or you can just grab it from the Mercurial repo.

Easy Reverse Mode Automatic Differentiation in C#

Continuing from my last post on implementing forward-mode automatic differentiation (AD) using C# operator overloading , this is just a quick follow-up showing how easy reverse mode is to achieve, and why it's important. Why Reverse Mode Automatic Differentiation? As explained in the last post, the vector representation of forward-mode AD can compute the derivatives of all parameter simultaneously, but it does so with considerable space cost: each operation creates a vector computing the derivative of each parameter. So N parameters with M operations would allocation O(N*M) space. It turns out, this is unnecessary! Reverse mode AD allocates only O(N+M) space to compute the derivatives of N parameters across M operations. In general, forward mode AD is best suited to differentiating functions of type: R → R N That is, functions of 1 parameter that compute multiple outputs. Reverse mode AD is suited to the dual scenario: R N → R That is, functions of many parameters t...

Higher Logics

Search This Blog

Faster Thread-Local Data with ThreadScoped<T>

ThreadScoped<T>

Benchmarks

Labels

Comments

Popular posts from this blog

async.h - asynchronous, stackless subroutines in C

Easy Automatic Differentiation in C#

Easy Reverse Mode Automatic Differentiation in C#