Friday, December 23, 2011

Software Transactional Memory in Pure C#

Concurrent programming is a very difficult problem to tackle. The fundamental issue is that manual locking is not composable, which is to say that if you have two concurrent programs P0 and P1 free of deadlocks, livelocks and other concurrency hazards, and you try to compose P0 and P1 to create a program P2, P2 may not be free of concurrency hazards. For instance, if P0 and P1 take two locks in different orders, then P2 will deadlock. Needless to say, this is a serious problem because composition is the cornerstone of all programming.

I've been toying with some ideas for software transactional memory (STM) in C# ever since I started playing with FRP and reactive programming in general. The problem in all of these domains is fundamentally about how to handle concurrent updates to shared state, and how to reconcile multiple, possibly conflicting updates to said state.

Rx.NET handles concurrency essentially by removing the identity inherent to shared state. An IObservable<T> is actually a collection of all values pushed to that observable in some undefined order. If you were to create an IObservable that retains only the "last" pushed value, and thus now retains an identity, you then have the same problems as above, namely that this update must always be consistent with other updates at any given instant in time. For instance:

var plusOne = intObservable.Select(i => i+1);

At every instant in the program's execution, plusOne should always observably equal intObservable + 1, and the ability to observe a violation of this constraint is known in reactive literature as a 'glitch'.

Similarly, in database programming where transactions rule, this is known as a 'dirty read'. Essentially, an update to intObservable is executing in a transaction, but other transactions are able to view those changes before that transaction has committed.

Generally speaking, glitches and dirty reads are undesirable, because they require the developer to manually synchronize state, which defeats the whole purpose of going with FRP or transactions to begin with. From what I've seen so far, Rx.NET gets around this by not providing abstractions that expose identity in this way. The programs you write must work with collections of values, and the program must specify the ordering via Observable.OrderBy.

When I added the Property<T> IObservable to Sasa, I added a limited form of transactions to prevent glitches, because a property has identity. This implementation uses a global 'clock', which is really just a global uint64 counter to properly sequence updates and prevent glitches.
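
To illustrate the idea, here's a minimal sketch of sequencing updates with such a counter. The names and structure here are purely illustrative, not the actual Sasa Property<T> internals:

using System.Threading;

// a global, monotonically increasing counter used to timestamp updates
static class GlobalClock
{
  static long clock;
  public static long Tick() { return Interlocked.Increment(ref clock); }
}

// a cell that drops updates sequenced before the last applied update,
// so observers never see values out of order (no glitches)
class VersionedCell<T>
{
  T value;
  long version;
  public void Update(T newValue)
  {
    var stamp = GlobalClock.Tick();
    lock (this)
    {
      if (stamp < version) return; // stale update, already superseded
      version = stamp;
      value = newValue;
    }
  }
}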

Overview

I'm going to focus here on implementing STM directly, but to keep it simple, I've gone with the simplest STM that is expressible using .NET primitives. The resulting STM is probably not ideal if you're after scalability, but it does a good job of ensuring concurrency safety for arbitrary composition.

The STM I committed to Sasa is a very simple, perhaps even simplistic, STM employing encounter-time locking with deadlock detection on transactional variables. Any read or write acquires the lock on a transactional variable. Whenever two transactions would block to wait for each other, the transaction that is not already blocked is aborted and retried.

This design has advantages and disadvantages. The main disadvantage is limited concurrency, even when reads and writes would not conflict: two transactions that only read a transactional variable Y would still block each other, despite the fact that concurrent reads can't cause problems. Furthermore, encounter-time locking means that locks can be held for a long time. Finally, the naive deadlock detection combined with encounter-time locking means that some programs will have higher abort rates than they would under other STMs.

There are significant advantages to this approach though. For one, a transaction doesn't require elaborate read/write/undo logs. In fact, this STM requires only a single allocation for the transaction object itself at transaction start. By contrast, most other STM designs require at least one allocation for every object that is read or written. These allocation costs are generally amortized, but they still add up.

The STM is also conceptually simple at 450 lines of code, including elaborate comments (127 lines counting only semicolons). This STM consists of only 3 classes and 1 exception, and uses only System.Monitor for locking. This means that the STM isn't really fair, but it's rather simple to replace the standard locks with a fair locking protocol once the core STM algorithm is understood.

There is also preliminary support for integration with System.Transactions.

Transactional Programming

Any sort of transactional programming requires a transaction:

public sealed class MemoryTransaction : IEnlistmentNotification,
                                        IDisposable
{
  public static MemoryTransaction Begin();
  public static void Run(Action body);
  public static MemoryTransaction Current { get; }
  public void Complete();
}

This class is closely modeled on the design of TransactionScope from System.Transactions. Programs will generally concern themselves mostly with transactional variables, which in Sasa.TM is called Transacted<T>:

public class Transacted<T> : Participant, IRef<T>
{
  public T Value { get; set; }
  public void Write(T value, MemoryTransaction transaction);
  public T Read(MemoryTransaction transaction);
}

Any reads and writes to Transacted<T> occur within the lifetime of a MemoryTransaction, and the set of all such reads and writes are committed atomically. A simple program demonstrating the use of these abstractions:

Transacted<int> accountBalance = new Transacted<int>();
MemoryTransaction.Run(() =>
{
  accountBalance.Value += 100;
});

MemoryTransaction.Run will handle all the commits, rollbacks and retries for you. You can do this manually as well if you catch RetryException, and call Complete and Dispose methods on the transaction manually, but for most purposes the Run method suffices. You can nest calls to Run as many times as you like, but only one top-level transaction will ever be created.
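
For illustration, the manual equivalent of the Run call above might look something like this sketch. The loop structure is my own; only the Begin/Complete/Dispose calls and the RetryException handling follow the API described above:

// manual commit/retry handling; for real code, prefer MemoryTransaction.Run
while (true)
{
  using (var tx = MemoryTransaction.Begin())
  {
    try
    {
      accountBalance.Value += 100; // the transactional work
      tx.Complete();               // mark the transaction for commit
      break;                       // Dispose (via using) then commits
    }
    catch (RetryException)
    {
      // this transaction was chosen as the deadlock victim;
      // Dispose rolls back, and the loop retries
    }
  }
}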

No matter how many concurrent threads are executing the above code, it will always be updated atomically, and you can compose the above program with any other transactional program, and the result will also be free of concurrency hazards. The one caveat is that you should not cause non-transactional side-effects from within a transaction.

Please refer to the API docs under Sasa.TM for further details.

Internals

The internals of this STM design are pretty straightforward. Structurally, it looks something like this:

+---------------+   +---------------+    +----------------+
|Tx0            |   |Transacted0    |    |Transacted1     |
|  participants---->|  value = 2    | +->|  value = true  |
|  waitingFor   |   |  undo  = 1    | |  |  undo  = false |
+-----|---------+   |  next-----------+  |  next  = null  |
      |       ^     |  owner        |    |  owner         |
      |       |     +---|-----------+    +---|------------+
      |       |         |                    |
      |       +------------------------------+
      |
      +--------------------+
                           |
                           v
+------------------+     +------------------+
|Tx1               |     |Transacted2       |
|  participants--------->|  value = null    |
|  waitingFor=null |     |  undo  = "Foo"   |
+------------------+<-------owner           |
                         +------------------+

There's quite a bit going on here, so here are some quick highlights:

  • Each Transacted<T> is a member of a linked list rooted in the "participants" field of the MemoryTransaction. The list consists of Transacted<T> which have been read or written during the current transaction.
  • Each Transacted<T> points to the current transaction that owns its lock.
  • Each MemoryTransaction that attempts to acquire a lock on a Transacted<T>, stores the Transacted<T> in a local field called "waitingFor".
  • Transacted<T> stores the original value before any changes are made, so we can rollback if the transaction aborts.

From the above graph, we can see that there are two running transactions, Tx0 and Tx1, and that Tx0 has read or written Transacted0 and Transacted1, and it has tried to read/write Transacted2. However, Tx1 currently owns the lock on Transacted2, so Tx0 is effectively blocked waiting for Tx1 to complete.

This dependency graph is acyclic so there is no deadlock. If Tx1 were to then try to acquire the lock on Transacted0 or Transacted1, we would create a cycle in the waits-for graph, and we would have to abort one of the transactions.

On commit, a transaction's participant list is walked, unlinking elements as it goes, and all the undo fields are cleared and the locks are released. The next transaction blocked on any of the participants acquires the lock it's been waiting for, sets the owner field, and proceeds.

Rollback is much the same, except the Transacted<T>'s value field is first overwritten with the value from the undo field.
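
In sketch form, commit and rollback then boil down to something like the following. The method names here are illustrative, not the exact Sasa.TM internals:

// walk the participant list: restore undo values on rollback, clear the
// undo state, release each lock, and unlink each participant
void Finish(bool rollback)
{
  var p = participants;
  participants = null;
  while (p != null)
  {
    var next = p.Next;
    if (rollback) p.RestoreUndo(); // copy the undo field back into value
    p.ClearUndo();                 // discard the saved undo value
    p.ReleaseLock();               // clear owner and signal the next waiter
    p.Next = null;                 // unlink from the participant list
    p = next;
  }
}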

Future Work

Fair STM

To those that have read my previous posts, note that the structure of the MemoryTransaction is exactly the structure of MetaThread from a previous post. By simply adding a WaitHandle to MemoryTransaction with a FIFO locking protocol, we have a fair STM.

Lock Stealing

STM research so far has shown that most transactions are short enough that they can execute in a single timeslice, and throughput suffers if a thread is descheduled while it's holding locks. This would only be exacerbated in an encounter-time locking design like I've described here, since locks are held for longer.

Instead of blocking on a variable that is already owned, we can instead steal the lock under certain conditions. For instance, if Tx0 and Tx1 are merely reading from a variable, they can repeatedly steal the lock from each other without concern.

A transaction that writes a variable that has only been locked for reading can steal that lock too, but if the original owner then tries to read the variable again, it must abort.

If Tx0 and Tx1 both try to write the same variable, blocking is unavoidable.

Obviously, all of these performance improvements impact the simplicity of the original design, so I'm leaving them for future work if the need arises.

Saturday, November 12, 2011

Type Unification Forbidden - More C#/CLR Irritations

I've written about quite a few irritations of C#/CLR, including asymmetries in CIL, oddly unverifiable CIL instructions, certain type constraints are forbidden for no reason, delegate creation bugs, the lack of higher-kinded types, equality asymmetries between events and IObservable, generics/type parameter problems, lack of usable control over object layouts, and just overall limitations of the CLR VM.

I've just run smack into yet another annoying problem: type parameter unification is forbidden. The Microsoft Connect bug filed in 2004 was closed as By Design.

Here's a simple code fragment demonstrating the problem:

class ObserveTwo<T0, T1> : IObservable<T0>, IObservable<T1>
{
}
This will fail with the error:
'ObserveTwo<T0,T1>' cannot implement both 'System.IObservable<T0>' and 'System.IObservable<T1>' because they may unify for some type parameter substitutions
This is frankly nonsense. What's the problem if T0 and T1 unify? If T0=T1, ObserveTwo just implements IObservable<T0>, and the methods are all unified as well [2]. If T0!=T1, then the usual semantics apply.

The MS Connect bug implies that there's a type safety issue, but I can't see any problem [1].

This is incredibly frustrating, because interfaces are the only way to design certain extension methods (mixin-style). For instance, suppose we have a set of simple interfaces:

// marks a class as containing a parseable parameter of type T
public interface IParseable<T>
{
   HttpRequest Request { get; }
}
// continuation parameters are parseable
public interface IContinuation<T> : IParseable<T> { }
public interface IContinuation<T0, T1> : IParseable<T0>, IParseable<T1> { }
...
This is a perfectly sensible set of type definitions, and there is no ambiguity or type safety problem here, even if a class were to implement IContinuation<int, int>.

If anyone has any suggestions for workarounds, I'm all ears. Options that don't work:

  1. Can't turn IParseable into a struct/class with an implicit conversion, because implicit conversions don't work on interfaces.
  2. Can't define MxN overloads of the extension methods defined on IParseable, where N=number of IContinuation type definitions, and M=number of type parameters (although this actually has a shot of working, the code duplication is just ludicrous).

A Partial Solution

The best solution I have involves N overloads of the extension methods by implementing interfaces that designate type parameter positions:

// type param at index 0
public interface IParam0<T> { }
// type param at index 1
public interface IParam1<T> { }
// type param at index 2
public interface IParam2<T> { }
...
// continuations  have indexed type parameters
public interface IContinuation<T> : IParam0<T> { }
public interface IContinuation<T0, T1> : IParam0<T0>, IParam1<T1> { }
...
Now instead of defining extension methods on IParseable, I define an overload per IParamX interface (N overloads, 1 per type parameter).

This necessarily causes an ambiguity when two type parameters unify, since the compiler won't know whether you wanted to call the extension on IParam0<int> or IParam2<int>, but this ambiguity happens at the call site/at type instantiation, instead of type definition. This is a much better default [1], because you can actually do something about it by disambiguating manually. I've eliminated ambiguity entirely in my library by appending the type parameter index to the extension method name.
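
As a concrete sketch of what those per-index extension methods look like, here is one possibility. Parse0/Parse1 and the parsing logic are purely illustrative, not the actual methods in my library:

using System;

public static class ParseExtensions
{
  // one extension method per type parameter index, with the index
  // appended to the name so calls never become ambiguous, even for
  // instantiations like IContinuation<int, int>
  public static T Parse0<T>(this IParam0<T> self, string raw)
  {
    return (T)Convert.ChangeType(raw, typeof(T));
  }
  public static T Parse1<T>(this IParam1<T> self, string raw)
  {
    return (T)Convert.ChangeType(raw, typeof(T));
  }
}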

Of course, none of this boilerplate would have been necessary if type unification were supported.

[1] If the CLR supported type inequalities, instead of just type equalities, then we could just forbid this at the point ObserveTwo is created.

[2] EDIT 2011-11-13T11:45 AM: there's been some confusion about this statement. Doug McClean rightly points out that there's no guarantee that the two implementations are confluent, so unifying is not an "automatic" thing, so please don't take this to mean that I'm saying the compiler should be able to automatically "merge" methods somehow, or magically know which implementation to dispatch to based on context. This is undecidable in general.

Also, even if the interfaces require no implementation, the error still occurs, despite reduction being Church-Rosser. Finally, plenty of nice properties are violated on the CLR, so I'm not sure why also sacrificing confluence is such a big deal here. A sensible dynamic semantics, like always dispatching to the implementation for T0 in case of type parameter unification, restores most of the desirable properties without sacrificing expressiveness, and without going whole-hog and ensuring confluence via type inequalities [1].

Tuesday, November 1, 2011

Ad-hoc Extensions in .NET

A recent discussion on LtU brought up a common limitation of modern languages and runtimes. Consider some set of abstractions A, B, ..., Z that we wish to extend with some custom operation Foo(). Languages like C# give only two straightforward possibilities: inheritance, and runtime type tests and casts.

Inheritance is straightforward: we simply inherit from each A, B, ..., Z to make A_Foo, B_Foo, ..., Z_Foo, with a custom Foo() method.

Unfortunately, inheritance is sometimes forbidden in C#, and furthermore, it sometimes prevents us from integrating with existing code. For instance, say we want to integrate with callback-style code, where the existing code hands our program an object of type A. We can't call Foo() on this A, because it's not an A_Foo, which is what we really wanted.

This leaves us with the undesirable option of using a large set of runtime type tests. Now type tests and casts are actually pretty fast, faster than virtual dispatch in fact, but the danger is in missing a case and thus triggering a runtime exception, a danger inheritance does not have.

Furthermore, we'd have to repeat the large set of runtime type tests for each operation we want to add to A, B, ... Z, which means every time we want to add an operation we have to do a large copy-paste, and every time we add a new class we have to update every place we test types, and the compiler doesn't warn us when we're missing a case.

A Solution

Fortunately, there's a straightforward way to address at least part of this problem. We can collect all the type tests we have to do in one place, and we can ensure that we have properly handled all cases that we are aware of. To do this, we combine the standard visitor pattern with the CLR's open instance delegates and its unique design for generics.

Here are the types:

// the interface encapsulating the operation to implement, ie. Foo()
public interface IOperation
{
  void A(A obj);
  void B(B obj);
  ...
  void Z(Z obj);
  void Unknown(object obj);
}
// the static dispatcher where runtime tests occur
public static class Dispatcher<T>
{
  static Action<IOperation, T> cached;
  public static Action<IOperation, T> Dispatch
  {
    get { return cached ?? (cached = Load()); }
  }
  static Action<IOperation, T> Load()
  {
    var type = typeof(T);
    // map the type to the name of the corresponding IOperation method
    var mName = type == typeof(A) ? "A":
                type == typeof(B) ? "B":
                        ...
                                    "Unknown"; // fallback when no case matches
    var method = typeof(IOperation).GetMethod(mName);
    return Delegate.CreateDelegate(typeof(Action<IOperation, T>), null, method)
             as Action<IOperation, T>;
  }
}

If you want to extend A, B, ... Z, with any operation, you simply create a class that implements IOperation and the compiler will ensure you handle all known cases. If you want to add a class to the set of handled cases, you simply extend IOperation and add a case to the runtime type tests (or if you use a naming convention, you can just use the class name as the method name).

The dynamic type test runs once, and then a delegate that directly calls into the IOperation is cached, so the type tests are not run again.
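
As a quick illustration of the call site, where ApplyFoo and its arguments are my own placeholder names:

// dispatch an operation on a statically known type T; the first call for
// a given T runs the type tests and caches the delegate, subsequent calls
// just invoke the cached delegate
static void ApplyFoo<T>(IOperation op, T value)
{
  Dispatcher<T>.Dispatch(op, value);
}
// e.g. ApplyFoo(someOperation, someA) ends up calling IOperation.A(someA)
// when T is A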

If you try to dispatch on an unknown type, IOperation.Unknown is invoked. I could have made this a generic method, but the CLR currently has a bug creating open instance delegates to generic interface methods, so that would require some code generation to do properly.

There is a caveat though: if any of the cases are subtypes of each other, and you dispatch on the static base type, the handler for the base type will be invoked, not the handler for the runtime (derived) type. For instance, if A:B, and you call Dispatcher<B>.Dispatch(op, new A()), IOperation.B is called, not IOperation.A. This can be handled in various ways, but it's beyond the scope of this article. I may post a follow-up discussing the various options.

Extensions

Type Constraints on T

If all types to handle are subtypes of type X, then it's simple to constrain the type parameter T on Dispatcher<T>:

 public static class Dispatcher<T>
  where T : X
{
  ...
}

Operation Return Type

You could extend IOperation with a return type, but this requires propagating the return type parameter into Dispatcher and the cached delegate type, which means more static fields and more runtime state. Since the CLR doesn't support GADTs, and following this paper, I prefer to keep the return type encapsulated in the IOperation implementation and expose it as a public property from which the caller extracts the value:

public sealed class Foo : IOperation
{
  // the return value
  public Bar ReturnValue { get; set; }
  public void A(A obj) { ... }
  ...
}

This requires fewer type parameters, and keeps the implementation simple.
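
Calling it then looks something like this sketch, where someA is a placeholder value of type A:

// run the operation, then read the result from the operation object
// rather than threading a return type parameter through Dispatcher<T>
var op = new Foo();
Dispatcher<A>.Dispatch(op, someA);
Bar result = op.ReturnValue;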

Per-Class Overrides

Suppose some of the classes you are handling are your own, or are already aware of your code such that they have already implemented your custom operations. In that case, you can specify a companion interface to indicate this, and modify the dispatch code to invoke the interface instead:

public interface ICompanion<T>
{
  // your custom operations that are implemented via subclassing
  void Foo(IOperation operation, T self);
  ...
}
The dispatch code is then modified like so:
static Action<IOperation, T> Load()
{
  var type = typeof(T);
  // if the companion interface is implemented, then dispatch directly into it;
  // a lambda is used here because an open instance delegate to ICompanion<T>.Foo
  // would need an extra parameter for the receiver
  if (typeof(ICompanion<T>).IsAssignableFrom(type))
  {
    return (op, obj) => ((ICompanion<T>)(object)obj).Foo(op, obj);
  }
  // the usual dispatch code
  ...

Useful Applications

This pattern simulates the usefulness of type classes in Haskell, because these are essentially ad-hoc extensions added after the fact. I use this exact pattern in Sasa.Dynamics to implement safe type case and reflection patterns over the CLR primitive types, ie. ints, strings, delegates, etc., with the per-class override extension described above via the IReflected interface.

Monday, October 24, 2011

The Most General Concurrent API

Concurrent programming is hard, and many abstractions have been developed over the years to properly manage concurrent resources. The .NET base class libraries have mutexes, spinlocks, semaphores, compare-and-set (CAS) instructions, and more.

The Problem

Unfortunately, many of these standard abstractions are rather opaque, so predicting their behaviour is difficult, and enforcing a particular thread schedule is nearly impossible. For example, consider the issue of "fairness". Many abstractions to deal with threads are not fair, which is to say that they are not guaranteed to release threads in FIFO order, so a thread could starve.

Furthermore, these opaque abstractions do not allow the application to easily specify domain-specific scheduling behaviour. Suppose threads T0, T1, and T2 are blocked on a resource X, in that order. X becomes available, and according to FIFO order, T0 should acquire control next. However, a domain-specific requirement may state that T2 should be processed first, but there is no way to check the wait queue for the presence of T2 when using any of .NET's thread abstractions. We'd have to build some auxiliary data structures to log the fact that T2 is waiting for X, and force T0 and T1 to relinquish their control voluntarily, but these data structures already exist as the locking queue, so we're duplicating work. Furthermore, both T0 and T1 must run before T2 to relinquish control, so we're wasting valuable CPU time. It's rather wasteful all around.

What I want to cover here is a pattern consisting of a few simple primitives which forms a general concurrent API. Using this API, we can easily implement all of the above threading abstractions, while still allowing arbitrary scheduling policies. I devised this approach while developing a software transactional memory (STM) library for C#, since STM requires deadlock detection and is rather sensitive to scheduling policies. None of this is achievable using the standard concurrent abstractions on the CLR, without modifying the VM.

The paper entitled Light-weight Locks really crystallized the path I was on, and gelled the final components I was missing into an elegant, general solution. The solution I present here is merely one of the abstractions found in that paper.

The Solution

The fundamental primitives needed for a more flexible concurrent API are few: Interlocked.CompareExchange, AutoResetEvent, and ThreadStaticAttribute.

The approach is simple, and consists of only three fundamental concepts:
  1. A lock-free list is associated with each resource being accessed concurrently. If structured as I describe below, managing this list requires no memory allocation.
  2. An AutoResetEvent is associated with each Thread via thread-static data. Each time the thread must block, it calls WaitOne on its own AutoResetEvent after pushing itself onto the resource's lock-free list.
  3. A thread releasing its control of a resource walks the list of waiting threads, and wakes up the ones that should assume control next by calling Set on that Thread's AutoResetEvent.
AutoResetEvent is fundamentally thread-safe and serves as a combined signal and mutex; the only other point of contention is the lock-free list. The list requires only one CAS to push a waiter on the front, and occasionally a CAS to pop the front element [1].

Using only the above, we can construct a mutex abstraction, like Monitor, that allows exclusive access to a resource, but permits arbitrary scheduling policy, such as FIFO, LIFO, or something custom (like priorities that don't suffer from inversion). Even more, this requires no allocation at runtime. Here's all we need [2]:
// the thread-static data
class MetaThread
{
  // the kernel signal+mutex to wait on
  public AutoResetEvent Signal { get; private set; }
  // the next MetaThread in whatever list this
  // MetaThread happens to be on at the time
  public MetaThread Next { get; private set; }
  // [ThreadStatic] only applies to fields, not auto-properties
  [ThreadStatic]
  public static MetaThread Current;
}
// some resource T that requires mutually
// exclusive access
class Locked<T>
{
  T value;
  MetaThread owner;
  MetaThread blocked;
}
What makes this work is the property that a thread can only be on one blocked list at any given time. This means we need only one piece of thread-static data to track which list it's on, and one AutoResetEvent to block and unblock the thread. Now suppose Locked has two public operations: Enter and Exit, which mimic the behaviour of Monitor.Enter and Monitor.Exit [3]:
class Locked<T>
{
  ...
  public void Enter()
  {
    var thread = MetaThread.Current;
    // push current thread onto wait list
    do thread.Next = blocked;
    while (thread.Next != Interlocked.CompareExchange(ref blocked, thread, thread.Next));
    // if owner is null, then try to acquire lock; if that fails, block
    if (owner != null || null != Interlocked.CompareExchange(ref owner, thread, null))
    {
      thread.Signal.WaitOne(); // block until a releasing thread signals us
    }
    Unblock(thread); // remove from blocked list
  }
  public void Exit()
  {
    // retry if no candidate acquired the lock, but the
    // blocked list has since become not empty
    MetaThread next;
    do {
      next = SelectNext();
      Interlocked.Exchange(ref owner, next);
    } while (next == null && blocked != null);
    if (next != null) next.Signal.Set(); // resume the thread
  }
  protected virtual MetaThread SelectNext()
  {
    ...
  }
}
The SelectNext method can provide a variety of domain-specific selection behaviour, such as LIFO, FIFO, etc. to dequeue the next lock owner. All that's required is that the thread releasing the resource select the candidate from the list and assign the lock to it, and then signal it to continue via AutoResetEvent. Unblock is not shown for brevity, but it's not too complicated.
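
For example, a FIFO policy could fill in SelectNext with something like this sketch, which assumes (as in Enter above) that new waiters are pushed onto the front of the blocked list:

// the oldest waiter is at the tail of 'blocked', so walk to the end
protected virtual MetaThread SelectNext()
{
  MetaThread oldest = null;
  for (var t = blocked; t != null; t = t.Next)
    oldest = t;
  return oldest;
}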

The paper I linked to above uses this pattern to implement all the common concurrency abstractions, using very compact structures (2 bytes per mutex, 4 bytes per rw-lock, 4 bytes per condition variable).

I use it in my STM library to dynamically detect deadlocks and abort transactions, and I will soon extend this to implement reader-writer lock semantics similar to a pessimistic version of TLRW.

Regardless, associating a kernel AutoResetEvent (or a pthread signal+mutex on Linux) with each thread permits far more flexible and compact concurrency primitives. I recommend this "MetaThread" API be exposed in any language's standard threading library.

The Future

This design can also inform the design of concurrency abstractions in kernels. Consider a reader-writer lock where a writer currently has exclusive access to the resource, and a list of N readers is waiting for the release of the lock. The writer must then invoke AutoResetEvent.Set N times, once for each waiting reader. That's N user-kernel transitions, which can be quite expensive for large N [4].

The ideal solution would be to group a number of individual AutoResetEvents so we only need to make one user-kernel transition to signal the whole group, sort of like a multicast for EventWaitHandles. This would provide efficient single thread and multithread suspend/resume semantics.

Footnotes

[1] A CAS may be needed to remove elements further in the list if more than one thread can operate on a resource at a time.
[2] I'm skipping a few steps to simplify the presentation. MetaThread.Current must actually be lazily-initialized. If I were starting a new programming language, this structure would be part of the standard Thread abstraction. Furthermore, when implementing abstractions which allow multiple threads through, you will probably need more synchronization. For instance, a reader-writer lock requires an atomically incremented/decremented read counter. See the paper linked above for more details.
[3] I'm again simplifying here for presentation, so I apologize for any errors. This code is adapted from my STM library, so I haven't tested this version thoroughly, but it's at least derived from correct code. There are many optimizations to avoid redundant CAS operations, but this is the simplest presentation.
[4] N here is the number of threads, which in general are probably less than a hundred, even for heavily threaded workloads. Still, user-kernel transitions can cost upwards of 1,000 cycles, so wasting 100,000 cycles is nothing to scoff at.

Wednesday, September 28, 2011

ThreadScoped<T> The Next Generation

Jeroen Frijters helpfully pointed out that the CLR implements a hard limit on generic nesting depth, which is 99 for .NET 4 based on my tests. My previous implementation of ThreadScoped<T> was thus limited to 99 instances. Not very useful!

The solution is actually quite simple, which I briefly outlined on Jeroen's blog: add more type parameters and use a simple base-99 counting scheme to generate new instances. Each additional type parameter thus increases the permutations 99-fold. One type index parameter yields 99^1 instances, two type index parameters yield 99^2 instances, three type index parameters yield 99^3, and so on.

No one in the foreseeable future will require more than 99^3, which is almost a million thread-local variables, so I've added two more type index parameters to make Ref<T0, T1, T2>. The instance allocation function is now:

internal override ThreadScoped<T> Allocate()
{
    // If 'next' is null, we are at the end of the list of free refs,
    // so allocate a new one and enqueue it, then return 'this'
    var x = next;
    if (x != null) return this;
    // The CLR has some fundamental limits on generic nesting depths, so we circumvent
    // this by using two generic parameters, and nesting them via counting.
    x = Interlocked.CompareExchange(ref next, CreateNext(), null);
    // atomic swap failure doesn't matter, since the caller of Acquire()
    // accesses whatever instance is at this.next
    return this;
}
and CreateNext is:
ThreadScoped<T> CreateNext()
{
    var x = allocCount + 1;
    if (x % (99 * 99) == 0) return new Ref<T, T, Ref<T2>> { allocCount = x };
    if (x % 99 == 0)        return new Ref<T, Ref<T1>, T2> { allocCount = x };
    return new Ref<Ref<T0>, T1, T2> { allocCount = x };
}
This is simple base-99 arithmetic. Anyone familiar with arithmetic should recognize the pattern here: when we get to certain multiples of 99, we reset the previous digits and carry the 1 to the next slot. Normally, humans deal in base-10, so a carry happens at 10^1, 10^2, 10^3, and so on.

In this case, we are dealing with base-99, so carries happen at 99^1, 99^2 and 99^3, and the "carry" operation consists of nesting a generic type parameter. Simple!

This scheme is also trivially extensible to as many additional parameters as is needed, so if someone somewhere really does need more than a million fast thread-local variables, I have you covered.

These changes don't seem to have impacted the performance of ThreadScoped<T>, so I'm still over 250% faster than the ThreadLocal<T> provided in .NET 4's base class libraries.

Tuesday, September 27, 2011

Faster Thread-Local Data with ThreadScoped<T>

A while ago, Jeroen Frijters blogged about .NET 4's new ThreadLocal<T>, and how it was implemented. It was based on a neat generics trick exploiting the fact that the CLR does not erase types: indexing a generic type X with a type T means X<string> has distinct static and thread-local fields from, say, X<int>.

I've actually used this trick before to implement a type of safe reflection.

Jeroen's sample implementation exploiting this trick for thread-local instance variables seemed far too complicated however, and it suffered from a number of limitations that he notes in his post. If Microsoft's implementation was even more complicated, as was claimed, I wasn't confident it would perform very well at all. I decided to do my own implementation, without conforming to ThreadLocal<T>'s interface, since I have very specific applications in mind.

ThreadScoped<T>

As I suspected, my implementation was considerably simpler, coming in at less than 90 lines of code, and it doesn't suffer from the problems that Jeroen identifies, save for the fact that the generated types are not reclaimed until the AppDomain exits. However, ThreadScoped instances are aggressively reused, so I don't anticipate this to be a huge problem.

Conceptually, the implementation is quite simple. We have four views to keep in mind here:

  1. Thread-global static data: ordinary static fields that are thus visible by all threads.
  2. Thread-global instance data: ordinary object fields visible to all threads with access to that instance.
  3. Thread-local static data: static fields marked with [ThreadStatic], giving a thread-specific view of that field.
  4. Thread-local instance data: not provided natively by the CLR, but simulated by ThreadLocal<T> and ThreadScoped<T> via thread-local static data + generics.
Our thread-global static data consists of a lock-free linked list of unallocated ThreadScoped instances. Our thread-global instance data is a version number, and a 'next' pointer that points to the next available ThreadScoped<T>. This is used when freeing an instance by pushing it on the global list. Accessing the global and instance data are all done via Interlocked.CompareExchange, so all updates are lock-free.

Now comes the tricky part: the private type ThreadScoped<T>.Ref<TIndex> has two static fields, one for the value T and one for a version number (more on this later):

sealed class Ref<TIndex> : ThreadScoped<T>
{
  [ThreadStatic]
  static T scoped;     // the unique thread-local slot
  [ThreadStatic]
  static uint version; // the version number of the thread-local data
}

The type parameter TIndex is not actually used in a static or instance field, it's merely used here as a "phantom type". Basically, if we keep substituting new types for TIndex, we'll keep forcing the CLR to generate new thread local static fields for us that we can use to simulate thread-local instance fields!

This is done in the Ref<TIndex>.Allocate() method. The global instance list always contains at least one unallocated instance. Whenever we try to allocate a new ThreadScoped<T>, we check whether we're down to the last instance in the list. If so, this last instance will generate a Ref<Ref<TIndex>> and enqueue that on the end:

internal override ThreadScoped<T> Allocate()
{
    // if 'next' is null, we are at the end of the list of free refs,
    // so allocate a new one and enqueue it, then return 'this'
    var x = next;
    if (x != null) return this;
    x = Interlocked.CompareExchange(ref next, new Ref<Ref<TIndex>>(), null);
    // atomic swap failure doesn't matter, since the caller of Acquire()
    // accesses whatever instance is at this.next
    return this;
}

ThreadScoped.Create then pops the front of the list to obtain the next to last instance, and we're back to only having one unallocated instance.

There's an important invariant here: there is always at least one item in the global list, and the last item on the global list is always the most deeply nested generic type that has been generated so far for ThreadScoped<T>. This means when we get to the last remaining unallocated instance, we can always safely generate a new instance without interfering with other threads.

The version numbers also bear some explanation. Basically, even if Thread0 disposes of a ThreadScoped instance, Thread1 may have data sitting in that instance. Suppose that the ThreadScoped instance is then pushed on the free list, and then allocated again. If Thread1 participates in that second computation, it will find data already sitting in its supposedly fresh thread-local instance, left over from the last computation where the instance was supposed to have been disposed!

Obviously this is not what we want, but while Thread0 is disposing of the instance, it can't access Thread1's thread-local fields to clear them. This is the purpose of the instance and the thread-local version numbers. During dispose, we increment the instance's version number. If the thread-local version number and the instance version number don't match, it means the instance was disposed in the intervening time, so we should reinitialize it before proceeding:

public override T Value
{
    get { return current == version ? scoped : Update(); }
    set { scoped = value; }
}
T Update()
{
    if (next != this) throw new ObjectDisposedException("ThreadScoped<T> has been disposed.");
    version = current;
    return scoped = default(T);
}

And that's it! There are a few other minor details related to book keeping that aren't really important. I think that's pretty much as simple as you can get, and since this results in effectively a direct pointer to thread-local data, it should be quite fast. There are also no intrinsic allocation limits, as it will just keep allocating or reusing instances as needed.

Benchmarks

Here is the source for the benchmarks I used. The main loop runs 10,000,000 iterations of a simple calculation that performs one read and one write to the thread-local instance per iteration. I ran this test 20 times in a loop, logging the number of iterations per second on each run. Then I used Peirce's Criterion to filter out the outliers, and used the resulting data set.

The numbers are iterations/second, benchmarks were run on .NET 4 and Windows 7. The results are fairly compelling:


                       ThreadLocal<T>    ThreadScoped<T>    ThreadStatic
                              5022000           16260000        22396000
                              5042000           14378000        18484000
                              4972000           16514000        20790000
                              5002000           15722000        22470000
                              5070000           14244000        19860000
                              5098000           13946000        16570000
                              5076000           16474000        22246000
                              5102000           13994000        20532000
                              5012000           13494000        14482000
                              5108000           16090000        21598000
                              5074000           16778000        20000000
                              5104000           15408000        21620000
                              5076000           12762000        16312000
                              5054000           16792000        21008000
                              5064000           16380000        21856000
                              5108000           16154000        15910000
                              5066000           16750000        22988000
                              5108000           16076000        18778000
                              5012000           15986000        23094000

MIN                           4972000           12762000        14482000
MAX                           5108000           16792000        23094000
AVG                           5061579           15484316        20052316
% Overhead relative
to ThreadStatic                  296%                29%              0%

Microsoft's implementation takes a real beating in this simple benchmark, with almost 300% overhead over raw ThreadStatic fields. I'm not sure what could be going on in there to cause such a significant slowdown.

By contrast, ThreadScoped has a modest 30% overhead, which is far more reasonable. I suspect this overhead is due to two factors: 1. virtual dispatch overhead, because ThreadScoped.Value is a virtual property, and 2. the encapsulated instance for Ref<TIndex> may require a bit of pointer-chasing to resolve the right thread-local static field. I can't think of a way to eliminate this overhead, so this is as good as it's going to get for now.

I will note that I've had a few programs where ThreadLocal<T> did not perform as badly as demonstrated above, so it may be that reads or writes are more penalized. However, no benchmark or program I tested had ThreadLocal outperforming ThreadScoped, so I can say with confidence that ThreadScoped is much faster than ThreadLocal.

Jeroen's post also implied that he had tested a ThreadLocal version that used arrays for the backing store, and that it was faster. I also implemented a version of ThreadScoped using arrays, but haven't tested it enough. It did seem quite a bit faster with some code I had been playing with, but there were many uncontrolled variables, so I can't reach any conclusion with confidence. The array-backed version is committed to the Sasa repository for the adventurous though. There are serious limitations with using arrays as a backing store however, namely that you're stuck with a fixed number of instances defined at compile-time, and allocating arrays large enough to avoid falling back on slow thread-local data wastes quite a bit of memory.

Still, I will run some benchmarks at some future date, so stay tuned! As for ThreadScoped<T>, it will probably be in the next Sasa release coming in a month or two, or you can just grab it from the Mercurial repo.

Sunday, September 25, 2011

Idioms in C# with LINQ

There's a great post on implementing idioms with LINQ, and the example application was to implement formlets, as WebSharper does for F#. Tomas's post is well written, so if you're unclear on the above concepts I recommend reading it first before proceeding with this article.

The claim in that post is that idioms can only be encoded via LINQ's 'join' operators. That's strictly true if you stick to all the LINQ rules, but because LINQ queries are just naive syntactic transforms, you don't have to follow the rules. You can thus exploit this to hijack the signatures of the SelectMany overloads to yield idiom signatures. It's not all sunshine and roses though, as there are consequences.

Overview

LINQ is a standard set of methods one can implement that the C# compiler can use to provide "query patterns". This query:

var foo = from x in SomeFoo
          from y in x.Values
          select y;
is translated by the C# compiler to:
var foo = SomeFoo.SelectMany(x => x.Values, (x, y) => y);
This is a purely syntactic transformation, meaning that the C# compiler simply takes the text from above, and naively translates each 'from', 'where', 'select', etc. into calls to instance or extension methods, SelectMany, Where, Select, etc. Type inference must then be able to infer the types used in your query, and everything must type check.

The fact that we're dealing with a purely syntactic transform means that we can be sneaky and alter the signatures of these LINQ functions and the C# compiler would be none the wiser. The resulting calls to the LINQ methods would still need to compile, but we can ensure that they only compile following the rules we want, in this case, of the rules of idioms.

LINQ Methods

The core LINQ methods are as follows, using Formlet<T> as the LINQ type:

Formlet<R> Select<T, R>(this Formlet<T> f, Func<T, R> selector);
Formlet<R> SelectMany<T, R>(this Formlet<T> f,
                                 Func<T, Formlet<R>> collector);
Formlet<R> SelectMany<T, U, R>(this Formlet<T> f,
                                    Func<T, Formlet<U>> collector,
                                    Func<T, U, R> selector);
The problematic methods for idioms are the two SelectMany calls, specifically, the parameter I've called 'collector'. You can see that the LINQ type is unwrapped and the value extracted on each SelectMany, and passed to the rest of the query. Accessing the previous values like this is forbidden in idioms.

Fortunately, the signatures for SelectMany don't have to have this exact signature, they must only have a similar structure. You must have two SelectMany overloads with one and two delegate parameters, and the first delegate parameter must return your LINQ type, in this case Formlet<T>, as this allows you to chain query clauses one after another. You can also modify the second delegate parameter in various ways, but I haven't found much use for that myself.

To implement idioms, we will simply alter the first delegate parameter so instead of unwrapping the value encapsulated by the Formlet<T>, we simply pass the Formlet<T> itself:

Formlet<R> Select<T, R>(this Formlet<T> f, Func<T, R> selector);
Formlet<R> SelectMany<T, R>(this Formlet<T> f,
                                 Func<Formlet<T>, Formlet<R>> collector);
Formlet<R> SelectMany<T, U, R>(this Formlet<T> f,
                                    Func<Formlet<T>, Formlet<U>> collector,
                                    Func<T, U, R> selector);
Our query above:
var foo = from x in SomeFoo
          from y in x.Values
          select y;
would then no longer compile, because 'x' is now not a Foo, but is in fact a Formlet<Foo>, and the formlet type does not have a "Values" property. Of course, you shouldn't provide a property to extract the encapsulated value, or this is all for naught.

The Downsides

Simple queries work great, but longer queries may run into some problems if you alter the LINQ signatures. In this case, if you try to access previous values by mistake, as in our example query above, you will get a complicated error message:

Error 1       Could not find an implementation of the query pattern for source type
 'Formlet.Formlet<AnonymousType#1>'. '<>h__TransparentIdentifier0' not found.
Basically, your incorrect program was naively translated to use the LINQ methods, but because it does not properly match the type signatures you've hijacked, type inference fails. So you can't break your idioms by hijacking the query pattern this way, but depending on your target audience, perhaps you will render them unusable.

Still, it's a neat trick that should be in every type wizard's toolbox.

Saturday, September 17, 2011

IObservable<T> and Delegate Equality

Equality and identity are often intermingled in the .NET framework. .NET delegates correctly implement a straightforward structural equality, where two delegates are equal if they designate the same method and the same target object, regardless of whether the delegate instances themselves are distinct:
class Bar
{
  public void Foo(int i)
  {
  }
}
var bar = new Bar();
Action<int> a1 = bar.Foo;
Action<int> a2 = bar.Foo;
Console.WriteLine("Delegate Equality: {0}", a1 == a2);
// prints:
// Delegate Equality: True

However, an IObserver<T> created from delegates does not respect structural equality, and reverts to an identity criterion:
// use a1 and a2 from previous code sample
var o1 = Observer.Create<int>(a1);
var o2 = Observer.Create<int>(a2);
Console.WriteLine("Observable Equality: {0}", o1 == o2 || o1.Equals(o2)
|| EqualityComparer<IObserver<int>>.Default.Equals(o1, o2));
//prints:
// Observable Equality: False

This is unfortunate, because it means that by default we cannot implement previous event handling patterns using IObservable without significantly restructuring the code to propagate the IDisposable returned by IObservable.Subscribe. By this, I mean that code which properly managed event registration and unregistration has no easy transition to using IObservable; it must be completely rewritten.

Like event equality, the equality of IObservers created from delegates should be structural, not based on identity. Thus, manually managing subscriptions would be possible via an explicit "Unsubscribe" operation.

This decision has a real consequence that I just hit: I can implement IObservable given an object implementing INotifyPropertyChanged, but I can't do the reverse using the built-in abstractions. You'd either have to define your own IObservers that implement structural equality, or you'd have to store the actual event locally and trigger it manually when a new value is available, as I have done with NamedProperty<T> in Sasa.
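
A sketch of the first option, an observer whose equality follows the underlying delegate, might look like this (my own illustration, not part of Rx or Sasa):

// an IObserver<T> with structural equality, deferring to the delegate's
// own structural equality
sealed class DelegateObserver<T> : IObserver<T>
{
  readonly Action<T> onNext;
  public DelegateObserver(Action<T> onNext) { this.onNext = onNext; }
  public void OnNext(T value) { onNext(value); }
  public void OnError(Exception error) { }
  public void OnCompleted() { }
  public override bool Equals(object obj)
  {
    var other = obj as DelegateObserver<T>;
    return other != null && onNext.Equals(other.onNext);
  }
  public override int GetHashCode() { return onNext.GetHashCode(); }
}

With such a wrapper, two observers created from bar.Foo would compare equal, so an explicit "Unsubscribe" operation becomes possible.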

On a slightly related note, I've switched Sasa over to use Mercurial on Sourceforge, and have closed the old subversion repo.

Sunday, August 21, 2011

Open Instance Delegate for Generic Interface Methods

Ran into a snag when developing the latest abstraction for Sasa. The problem is distilled to its simplest form in this Stack Overflow question.

In summary, while the CLR supports open instance delegates to interface methods, and supports delegates to generic interface methods, it does not seem to support the composition of the two, ie. an open instance delegate to a generic interface method.

The following trivial example fails when creating the delegate with a NotSupportedException:
interface IFoo
{
  void Bar<T>(T j);
}
class Foo : IFoo
{
  public void Bar<T>(T j)
  {
  }
}
static void Main(string[] args)
{
  var bar = typeof(IFoo).GetMethod("Bar")
                        .MakeGenericMethod(typeof(int));
  var x = Delegate.CreateDelegate(typeof(Action<IFoo, int>),
                                  null, bar);
}

However, the CLR does support open instance delegates to generic class methods. For some reason interfaces are singled out here, and it's not clear why.
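
For comparison, here's a sketch of the class-method case, binding directly to Foo.Bar<T> rather than through IFoo, which does not throw:

var classBar = typeof(Foo).GetMethod("Bar")
                          .MakeGenericMethod(typeof(int));
// open instance delegate: the first parameter becomes the receiver
var ok = (Action<Foo, int>)Delegate.CreateDelegate(
           typeof(Action<Foo, int>), null, classBar);
ok(new Foo(), 42); // invokes Foo.Bar<int> on the given instance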

If I'm doing something wrong, please let me know!

Edit: I submitted a bug to Microsoft Connect, since this now looks like a legit bug.

Edit 2: MS acknowledged the limitation, but said it wouldn't be solved in the current version of .NET. As I explain in that bug, this limitation just doesn't seem to make sense, but it must be due to the internals that don't reuse the existing CLR dispatching logic.