Wednesday, September 28, 2011

ThreadScoped<T> The Next Generation

Jeroen Frijters helpfully pointed out that the CLR implements some hard limits on nesting generics, which is 99 for .NET 4 based on my tests. My previous implementation of ThreadScoped<T> was thus limited to 99 instances. Not very useful!

The solution is actually quite simple, which I briefly outlined on Jeroen's blog: add more type parameters and use a simple base-99 counting scheme to generate new instances. Each additional type parameters thus increases the permutations 99 fold. One type index parameter yields 991 instances, two type index parameters yields 992 instances, three type index parameters yields 993, and so on.

No one in the foreseeable future will require more than 993, which is almost a million thread-local variables, so I've added two more type index parameters to make Ref<T0, T1, T2>. The instance allocation function is now:

internal override ThreadScoped<T> Allocate()
    // If 'next' is null, we are at the end of the list of free refs,
    // so allocate a new one and enqueue it, then return 'this'
    var x = next;
    if (x != null) return this;
    // The CLR has some fundamental limits on generic nesting depths, so we circumvent
    // this by using two generic parameters, and nesting them via counting.
    x = Interlocked.CompareExchange(ref next, CreateNext(), null);
    // atomic swap failure doesn't matter, since the caller of Acquire()
    // accesses whatever instance is at
    return this;
and CreateNext is:
ThreadScoped<T> CreateNext()
    var x = allocCount + 1;
    if (x % (99 * 99) == 0) return new Ref<T, T, Ref<T2>> { allocCount = x };
    if (x % 99 == 0)        return new Ref<T, Ref<T1>, T2> { allocCount = x };
    return new Ref<Ref<T0>, T1, T2> { allocCount = x };
This is simple base-99 arithmetic. Anyone familiar with arithmetic should recognize the pattern here: when we get to certain multiples of 99, we reset the previous digits and carry the 1 to the next slot. Normally, humans deal in base-10, so a carry happens at 101, 102, 103, and so on.

In this case, we are dealing with base-99, so carries happen at 991, 992 and 993, and the "carry" operation consists of nesting a generic type parameter. Simple!

This scheme is also trivially extensible to as many additional parameters as is needed, so if someone somewhere really does need more than a million fast thread-local variables, I have you covered.

These changes don't seem to have impacted the performance of ThreadScoped<T>, so I'm still over 250% faster than the ThreadLocal<T> provided in .NET 4's base class libraries.

Tuesday, September 27, 2011

Faster Thread-Local Data with ThreadScoped<T>

Awhile ago, Jeroen Frijters blogged about .NET 4's new ThreadLocal<T>, and how it was implemented. It was based on a neat generics trick exploiting the fact that the CLR does not erase types. Indexing a generic type X with a type T, means X<string> has distinct static and thread-local fields from say, X<int>.

I've actually used this trick before to implement a type of safe reflection.

Jeroen's sample implementation exploiting this trick for thread-local instance variables seemed far too complicated however, and it suffered from a number of limitations that he notes in his post. If Microsoft's implementation was even more complicated as was claimed, I wasn't confident it would perform very well at all. I decided to do my own implementation, without conforming the ThreadLocal<T>'s interface since I have very specific applications in mind.


As I suspected, my implementation was considerably simpler, coming in at less than 90 lines of code, and it doesn't suffer from the problems that Jeroen identifies, save for the fact generated types are not reclaimed until the AppDomain exits. However, ThreadScoped instances are aggressively reused, so I don't anticipate this to be a huge problem.

Conceptually, the implementation is quite simple. We have four views to keep in mind here:

  1. Thread-global static data: ordinary static fields that are thus visible by all threads.
  2. Thread-global instance data: ordinary object fields visible to all threads with access to that instance.
  3. Thread-local static data: static fields marked with [ThreadStatic], giving a thread-specific view of that field.
  4. Thread-local instance data: not provided natively by the CLR, but simulated by ThreadLocal<T> and ThreadScoped<T> via thread-local static data + generics.
Our thread-global static data consists of a lock-free linked list of unallocated ThreadScoped instances. Our thread-global instance data is a version number, and a 'next' pointer that points to the next available ThreadScoped<T>. This is used when freeing an instance by pushing it on the global list. Accessing the global and instance data are all done via Interlocked.CompareExchange, so all updates are lock-free.

Now comes the tricky part: the private type ThreadScoped<T>.Ref<TIndex> has two static fields, one for the value T and one for a version number (more on this later):

sealed class Ref<TIndex> : ThreadScoped<T>
  static T scoped;     // the unique thread-local slot
  static uint version; // the version number of the thread-local data

The type parameter TIndex is not actually used in a static or instance field, it's merely used here as a "phantom type". Basically, if we keep substituting new types for TIndex, we'll keep forcing the CLR to generate new thread local static fields for us that we can use to simulate thread-local instance fields!

This is done in the Ref<TIndex>.Allocate() method. The global instance list always contains at least one unallocated instance. Whenever we try to allocate a new ThreadScoped<T>, we check whether we're down to the last instance in the list. If so, this last instance will generate a Ref<Ref<TIndex>> and enqueue that on the end:

internal override ThreadScoped<T> Allocate()
    // if 'next' is null, we are at the end of the list of free refs,
    // so allocate a new one and enqueue it, then return 'this'
    var x = next;
    if (x != null) return this;
    x = Interlocked.CompareExchange(ref next, new Ref<Ref<TIndex>>(), null);
    // atomic swap failure doesn't matter, since the caller of Acquire()
    // accesses whatever instance is at
    return this;

ThreadScoped.Create then pops the front of the list to obtain the next to last instance, and we're back to only having one unallocated instance.

There's an important invariant here: there is always at least one item in the global list, and the last item on the global list is always the most deeply nested generic type that has been generated so far for ThreadScoped<T>. This means when we get to the last remaining unallocated instance, we can always safely generate a new instance without interfering with other threads.

The version numbers also bear some explanation. Basically, even if Thread0 disposes of an ThreadScoped instance, Thread1 may have data sitting in that instance. Suppose that the ThreadScoped instance is then pushed on the free list, and then allocated again. If Thread1 participates in that second computation, it will already find data sitting in its supposedly fresh thread-local instance from the last computation where it was supposed to have been disposed!

Obviously this is not what we want, but while Thread0 is disposing of the instance, it can't access Thread1's thread-local fields to clear them. This is the purpose of the instance and the thread-local version numbers. During dispose, we increment the instance's version number. If the thread-local version number and the instance version number don't match, it means the instance was disposed in the intervening time, so we should reinitialize it before proceeding:

public override T Value
    get { return current == version ? scoped : Update(); }
    set { scoped = value; }
T Update()
    if (next != this) throw new ObjectDisposedException("Transacted<T> has been disposed.");
    version = current;
    return scoped = default(T);

And that's it! There are a few other minor details related to book keeping that aren't really important. I think that's pretty much as simple as you can get, and since this results in effectively a direct pointer to thread-local data, it should be quite fast. There are also no intrinsic allocation limits, as it will just keep allocating or reusing instances as needed.


Here is the source for the benchmarks I used. The main loop runs 10,000,000 iterations of a simple calculation that performs one read and one write to the thread-local instance per iteration. I ran this test 20 times in a loop, logging the number of iterations per second on each run. Then I used Peirce's Criterion to filter out the outliers, and used the resulting data set.

The numbers are iterations/second, benchmarks were run on .NET 4 and Windows 7. The results are fairly compelling:

ThreadLocal<T> ThreadScoped<T> ThreadStatic

5022000 16260000 22396000

5042000 14378000 18484000

4972000 16514000 20790000

5002000 15722000 22470000

5070000 14244000 19860000

5098000 13946000 16570000

5076000 16474000 22246000

5102000 13994000 20532000

5012000 13494000 14482000

5108000 16090000 21598000

5074000 16778000 20000000

5104000 15408000 21620000

5076000 12762000 16312000

5054000 16792000 21008000

5064000 16380000 21856000

5108000 16154000 15910000

5066000 16750000 22988000

5108000 16076000 18778000

5012000 15986000 23094000

MIN 4972000 12762000 14482000
MAX 5108000 16792000 23094000
AVG 5061579 15484316 20052316
% Overhead relative
to ThreadStatic
296% 29% 0%

Microsoft's implementation takes a real beating in this simple benchmark with almost 300% overhead over raw ThreadStatic fields. I'm not sure what could be going on in there to cause such a significant slowdown

By contrast, ThreadScoped has a modest 30% overhead, which is far more reasonable. I suspect this overhead is due to two factors: 1. virtual dispatch overhead because ThreadScoped.Value is a virtual property, and 2. the encapsulated instance for Ref <TIndex> may require a bit of pointer-chasing to resolve the right thread-local static field. I can't think of a way to eliminate this overhead, so this is as good as it's going to get for now.

I will note that I've had a few programs where ThreadLocal<T> did not perform as badly as demonstrated above, so it may be that reads or writes are more penalized. However, no benchmark or program I tested had ThreadLocal outperforming ThreadScoped, so I can say with confidence that ThreadScoped is much faster than ThreadLocal.

Jeroen's post also implied that he had tested a ThreadLocal version that used arrays for the backing store, and that it was faster. I also implemented a version of ThreadScoped using arrays, but haven't tested it enough. It did seem quite a bit faster with some code I had been playing with, but there were many uncontrolled variables, so I can't reach any conclusion with any confidence. The array-backed version is commited to the Sasa repository for the adventurous though. There are serious limitations with using arrays as a backing store however, namely that you're stuck with a fixed number of instances defined at compile-time, and allocating large enough arrays to avoid falling back on slow thread-local data wastes quite a bit of memory.

Still, I will run some benchmarks at some future date, so stay tuned! As for ThreadScoped<T>, it will probably be in the next Sasa release coming in a month or two, or you can just grab it from the Mercurial repo.

Sunday, September 25, 2011

Idioms in C# with LINQ

There's a great post on implementing idioms with LINQ, and the example application was to implement formlets, as WebSharper does for F#. Tomas's post is well written, so if you're unclear on the above concepts I recommend reading it first before proceeding with this article.

The claim in that post is that idioms can only be encoded via LINQ's 'join' operators. While strictly true if you stick to all the LINQ rules, because LINQ queries are just naive syntactic transforms you don't have to follow the rules. You can thus exploit this to hijack the signatures for the SelectMany overloads to yield idiom signatures. It's not all sunshine and roses though, as there are consequences.


LINQ is a standard set of methods one can implement that the C# compiler can use to provide "query patterns". This query:

var foo = from x in SomeFoo
          from y in foo.Values
          select y;
is translated by the C# compiler to:
var foo = SomeFoo.SelectMany(x => x.Values, (x, y) => y);
This is a purely syntactic transformation, meaning that the C# compiler simply takes the text from above, and naively translates each 'from', 'where', 'select', etc. into calls to instance or extension methods, SelectMany, Where, Select, etc. Type inference must then be able to infer the types used in your query, and everything must type check.

The fact that we're dealing with a purely syntactic transform means that we can be sneaky and alter the signatures of these LINQ functions and the C# compiler would be none the wiser. The resulting calls to the LINQ methods would still need to compile, but we can ensure that they only compile following the rules we want, in this case, of the rules of idioms.

LINQ Methods

The core LINQ methods are as follows, using Formlet<T> as the LINQ type:

Formlet<R> Select<T, R>(this Formlet<T> f, Func<T, R> selector);
Formlet<R> SelectMany<R>(this Formlet<T> f,
                              Func<T, Formlet<R>> collector);
Formlet<R> SelectMany<U, R>(this Formlet<T> f,
                                 Func<T, Formlet<U>>
                                 Func<T, U, R> selector);
The problematic methods for idioms are the two SelectMany calls, specifically, the parameter I've called 'collector'. You can see that the LINQ type is unwrapped and the value extracted on each SelectMany, and passed to the rest of the query. Accessing the previous values like this is forbidden in idioms.

Fortunately, the signatures for SelectMany don't have to have this exact signature, they must only have a similar structure. You must have two SelectMany overloads with one and two delegate parameters, and the first delegate parameter must return your LINQ type, in this case Formlet<T>, as this allows you to chain query clauses one after another. You can also modify the second delegate parameter in various ways, but I haven't found much use for that myself.

To implement idioms, we will simply alter the first delegate parameter so instead of unwrapping the value encapsulated by the Formlet<T>, we simply pass the Formlet<T> itself:

Formlet<R> Select<T, R>(this Formlet<T> f, Func<T, R> selector);
Formlet<R> SelectMany<R>(this Formlet<T> f,
                              Func<Formlet<T>, Formlet<R>> collector);
Formlet<R> SelectMany<U, R>(this Formlet<T> f,
                                 Func<Formlet<T>, Formlet<U>> collector,
                                 Func<T, U, R> selector);
Our query above:
var foo = from x in SomeFoo
          from y in foo.Values
          select y;
would then no longer compile, because 'x' is now not a Foo, but is in fact a Formlet<Foo>, and the formlet type does not have a "Values" property. Of course, you shouldn't provide a property to extract the encapsulated value, or this is all for naught.

The Downsides

Simple queries work great, but longer queries may run into some problems if you alter the LINQ signatures. In this case, if you try to access previous values by mistake, as in our example query above, you will get a complicated error message:

Error 1       Could not find an implementation of the query pattern for source type
 'Formlet.Formlet<AnonymousType#1>'. '<>h__TransparentIdentifier0' not found.
Basically, your incorrect program was naively translated to use the LINQ methods, but because it does not properly match the type signatures you've hijacked, type inference fails. So you can't break your idioms by hijacking the query pattern this way, but depending on your target audience, perhaps you will render them unusable.

Still, it's a neat trick that should be in every type wizard's toolbox.

Saturday, September 17, 2011

IObservable<T> and Delegate Equality

Equality and identity are often intermingled in the .NET framework. .NET delegates correctly implement a straightforward structural equality, where if two delegates are equal if they designate the same method and the same object, regardless of whether the delegate is a unique instance:
class Bar
public void Foo(int i)
var bar = new Bar();
Action<int> a1 = bar.Foo;
Action<int> a2 = bar.Foo;
Console.WriteLine("Delegate Equality: {0}", a1 == a2);
// prints:
// Delegate Equality: True

However, an IObservable<T> created from delegates does not respect structural equality, and reverts to an identity criterion:
// use a1 and a2 from previous code sample
var o1 = Observer.Create<int>(a1);
var o2 = Observer.Create<int>(a2);
Console.WriteLine("Observable Equality: {0}", o1 == o2 || o1.Equals(o2)
|| EqualityComparer<IObserver<int>>.Default.Equals(o1, o2));
// Observable Equality: False

This is unfortunate, because it means that by default we cannot implement previous event handling patterns using IObservable without significantly restructuring the code to propagate the IDisposable returned by IObservable.Subscribe. By this, I mean code that properly managed event registration and unregistration has no easy transition to using IObservable, it must be completely rewritten.

Like event equality, the equality of IObservers created from delegates should be structural, not based on identity. Thus, manually managing subscriptions would be possible via an explicit "Unsubscribe" operation.

This decision has a real consequence that I just hit: I can implement IObservable given an object implementing INotifyPropertyChanged, but could not do the reverse using the built-in abstractions. You'd either have to define your own IObservers that implement structural equality, or you'd have to store the actual event locally and trigger it manually when a new value is available, as I have done with NamedProperty<T> in Sasa.

On a slightly related note, I've switched Sasa over to use Mercurial on Sourceforge, and have closed the old subversion repo.