Concurrent programming is hard, and many abstractions have been developed over the years to properly manage concurrent resources. The .NET base class libraries have mutexes, spinlocks, semaphores, compare-and-set (CAS) instructions, and more.
The Problem
Unfortunately, many of these standard abstractions are rather opaque, so predicting their behaviour is difficult, and enforcing a particular thread schedule is nearly impossible. For example, consider the issue of "fairness". Many thread abstractions are not fair, which is to say that they are not guaranteed to release threads in FIFO order, so a thread could starve.

Furthermore, these opaque abstractions do not allow the application to easily specify domain-specific scheduling behaviour. Suppose threads T0, T1, and T2 are blocked on a resource X, in that order. X becomes available, and according to FIFO order, T0 should acquire control next. However, a domain-specific requirement may state that T2 should be processed first, yet there is no way to check the wait queue for the presence of T2 when using any of .NET's thread abstractions. We'd have to build auxiliary data structures to log the fact that T2 is waiting for X, and force T0 and T1 to relinquish control voluntarily; but such a structure already exists as the lock's wait queue, so we'd be duplicating work. Worse, both T0 and T1 must run before T2 just to relinquish control, so we're wasting valuable CPU time. It's rather wasteful all around.
What I want to cover here is a pattern consisting of a few simple primitives that together form a general concurrency API. Using this API, we can easily implement all of the above threading abstractions while still allowing arbitrary scheduling policies. I devised this approach while developing a software transactional memory (STM) library for C#, since STM requires deadlock detection and is rather sensitive to scheduling policies. None of this is achievable using the standard concurrency abstractions on the CLR without modifying the VM.
The paper entitled Light-weight Locks really crystallized the path I was on, and gelled the final missing components into an elegant, general solution. The solution I present here is merely one of the abstractions found in that paper.
The Solution
The fundamental primitives needed for a more flexible concurrent API are few: Interlocked.CompareExchange, AutoResetEvent, and ThreadStaticAttribute.

The approach is simple, and consists of only three fundamental concepts:
- A lock-free list is associated with each resource being accessed concurrently. If structured as I describe below, managing this list requires no memory allocation.
- An AutoResetEvent is associated with each Thread via thread-static data. Each time the thread must block, it calls WaitOne on its own AutoResetEvent after pushing itself onto the resource's lock-free list.
- A thread releasing its control of a resource walks the list of waiting threads, and wakes the ones that should assume control next by calling Set on each such thread's AutoResetEvent.
Using only the above, we can construct a mutex abstraction, like Monitor, that allows exclusive access to a resource, but permits arbitrary scheduling policy, such as FIFO, LIFO, or something custom (like priorities that don't suffer from inversion). Even more, this requires no allocation at runtime. Here's all we need [2]:
    // the thread-static data
    class MetaThread
    {
        // the kernel signal+mutex to wait on
        public AutoResetEvent Signal { get; private set; }

        // the next MetaThread in whatever list this
        // MetaThread happens to be on at the time;
        // settable so lock implementations can link threads
        public MetaThread Next { get; set; }

        // [ThreadStatic] is only valid on fields, and each thread's
        // instance must be created on first use (see footnote [2])
        [ThreadStatic]
        static MetaThread current;

        public static MetaThread Current
        {
            get
            {
                if (current == null)
                    current = new MetaThread { Signal = new AutoResetEvent(false) };
                return current;
            }
        }
    }

    // some resource T that requires mutually
    // exclusive access
    class Locked<T>
    {
        T value;
        // protected so subclasses can implement custom scheduling policies
        protected MetaThread owner;
        protected MetaThread blocked;
    }

What makes this work is the property that a thread can only be on one blocked list at any given time. This means we need only one piece of thread-static data to track which list it's on, and one AutoResetEvent to block and unblock the thread. Now suppose Locked has two public operations: Enter and Exit, which mimic the behaviour of Monitor.Enter and Monitor.Exit [3]:
    class Locked<T>
    {
        ...

        public void Enter()
        {
            var thread = MetaThread.Current;
            // push the current thread onto the wait list
            do thread.Next = blocked;
            while (thread.Next != Interlocked.CompareExchange(ref blocked, thread, thread.Next));
            // if owner is null, then try to acquire the lock; if that fails, block
            if (owner != null || null != Interlocked.CompareExchange(ref owner, thread, null))
            {
                thread.Signal.WaitOne();
            }
            Unblock(thread); // remove ourselves from the blocked list
        }

        public void Exit()
        {
            // retry if no candidate acquired the lock, but the
            // blocked list has since become non-empty
            MetaThread next;
            do
            {
                next = SelectNext();
                Interlocked.Exchange(ref owner, next);
            } while (next == null && blocked != null);
            if (next != null) next.Signal.Set(); // resume the selected thread
        }

        protected virtual MetaThread SelectNext() { ... }
    }

The SelectNext method can provide a variety of domain-specific selection behaviour, such as LIFO, FIFO, etc. to dequeue the next lock owner. All that's required is that the thread releasing the resource select the candidate from the list, assign the lock to it, and then signal it to continue via its AutoResetEvent. Unblock is not shown for brevity, but it's not too complicated: it merely removes the given thread from the blocked list [1].
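To make the scheduling hook concrete, here are two minimal policies expressed as subclasses. These are my own sketches against the simplified code above, not code from the paper or from my STM library. Since Enter pushes threads onto the head of the blocked list, the head is the most recently blocked thread, so returning it yields LIFO, while walking to the tail yields FIFO:

    // Hypothetical policy subclasses, assuming the simplified Locked<T>
    // above with its blocked field visible to subclasses.
    class LifoLocked<T> : Locked<T>
    {
        protected override MetaThread SelectNext()
        {
            // the head of the push-based list is the most recent waiter
            return blocked; // null when nobody is waiting
        }
    }

    class FifoLocked<T> : Locked<T>
    {
        protected override MetaThread SelectNext()
        {
            var candidate = blocked;
            if (candidate == null) return null; // no waiters
            // walk to the tail, i.e. the thread that blocked first
            while (candidate.Next != null)
                candidate = candidate.Next;
            return candidate;
        }
    }

Note that SelectNext only chooses the candidate; the woken thread still removes itself from the list via Unblock, so the policy code never mutates the list. A priority-based policy that sidesteps inversion would walk the list the same way, comparing thread priorities instead of positions.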
The paper I linked to above uses this pattern to implement all the common concurrency abstractions using very compact structures (2 bytes per mutex, 4 bytes per rw-lock, 4 bytes per condition variable).
I use it in my STM library to dynamically detect deadlocks and abort transactions, and I will soon extend this to implement reader-writer lock semantics similar to a pessimistic version of TLRW.
Regardless, associating a kernel AutoResetEvent, or a pthread signal+mutex on Linux, with each thread permits far more flexible and compact concurrency primitives. I recommend this "MetaThread" API be exposed in any language's standard threading library.
The Future
This design can also inform the design of concurrency abstractions in kernels. Consider a reader-writer lock where a writer currently has exclusive access to the resource, and a list of N readers is waiting for the release of the lock. The writer must then invoke AutoResetEvent.Set N times, once for each waiting reader. That's N user-kernel transitions, which can be quite expensive for large N [4].

The ideal solution would be to group a number of individual AutoResetEvents so we only need to make one user-kernel transition to signal the whole group, sort of like a multicast for EventWaitHandles. This would provide efficient single-thread and multi-thread suspend/resume semantics.
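To make the cost concrete, here is roughly what waking the readers looks like today, next to the multicast shape described above. This is a sketch of mine; the group type and SetAll call are hypothetical, as no such API exists in .NET:

    using System.Collections.Generic;

    static class WriterRelease
    {
        // Today: waking N blocked readers costs N user-kernel transitions,
        // one Set call per waiting reader.
        public static void WakeAllReaders(IEnumerable<MetaThread> waitingReaders)
        {
            foreach (var reader in waitingReaders)
                reader.Signal.Set(); // a kernel transition on every iteration
        }

        // The wished-for multicast shape (hypothetical API):
        //   EventWaitHandleGroup group = ...;
        //   group.SetAll(); // one kernel transition wakes the whole group
    }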
Footnotes
[1] A CAS may be needed to remove elements further in the list if more than one thread can operate on a resource at a time.

[2] I'm skipping a few steps to simplify the presentation; for instance, MetaThread.Current must be lazily initialized, as sketched above. If I were starting a new programming language, this structure would be part of the standard Thread abstraction. Furthermore, when implementing abstractions which allow multiple threads through, you will probably need more synchronization. For instance, a reader-writer lock requires an atomically incremented/decremented read counter. See the paper linked above for more details.
[3] I'm again simplifying here for presentation, so I apologize for any errors. This code is adapted from my STM library, so I haven't tested this version thoroughly, but it's at least derived from correct code. There are many optimizations to avoid redundant CAS operations, but this is the simplest presentation.
[4] N here is the number of threads, which in general is probably less than a hundred, even for heavily threaded workloads. Still, user-kernel transitions can cost upwards of 1,000 cycles, so wasting 100,000 cycles is nothing to scoff at.
Comments
In any case, I'm guessing this tends to be faster than lock().
I realize that using a struct can be dangerous since you can't prevent accidental copying of the lock; that's why you should offer both a low-level struct version (nongeneric, no 'value' field) and a high-level generic class version. Then people can use the struct version as an optimization if they need very fine-grained locking (e.g. their program will have over 100,000 locks in it).
In the library I've been developing, there are two collection classes that I offer in 'struct' form, InternalList and InternalDList (a deque). They are intended as primitives to be used inside other collection classes; e.g. I also wrote a B+ tree which uses InternalDList inside each leaf node, and a V-List that uses InternalList. High-level code should of course use classes instead (List and DList, the latter being a wrapper for InternalDList).
It's clear that the WaitHandle should just be part of any thread abstraction, so we only need to create one for the entire thread lifetime.
The Locked<T> class represents the headers needed in an object if I were to do this at a low level. Right now it uses two words, one for each MetaThread pointer, to keep it simple. You can replace these with two Int16s, and use those to indirect into a global thread table. This latter technique is used in the paper above to make lock space overhead very small.
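A rough sketch of that compact layout, assuming a global, index-addressable thread table (the names ThreadTable and CompactLock are mine, purely illustrative):

    // Hypothetical compact lock header: 16-bit indices into a global
    // thread table replace the two MetaThread references. Index 0 is
    // reserved to mean "no thread".
    static class ThreadTable
    {
        public static readonly MetaThread[] Slots = new MetaThread[ushort.MaxValue];
    }

    struct CompactLock
    {
        ushort owner;   // index of the owning thread, 0 = unowned
        ushort blocked; // index of the head of the blocked list, 0 = empty
    }

That's four bytes of lock overhead; the paper packs further still to reach the two bytes per mutex quoted above.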
Re: the more compact struct version, I haven't implemented it because I haven't had a need. I have only limited time, and other projects are more compelling than micro-optimizing this case.
Re: B+ tree, that's cool. I've been meaning to implement a cache-conscious B+ tree, but C# again has some limitations in this regard. I'll get to it one day!
You'll find that most of the purely functional collections in my Sasa library are only available as structs. The internals use classes, but I keep the public interface as a struct so I can properly handle empty collections without the clients triggering NullReferenceExceptions.
Yes, my B+ tree isn't designed for maximum performance but rather maximum flexibility. For instance, it supports an indexer and RemoveAt, O(1) freezing and cloning, and there is an unsorted version called A-List which supports appending and splitting.