Sunday, April 20, 2014

Immutable Sasa.Collections.Tree vs. System.Collections.Dictionary vs. C5 HashDictionary

I've previously posted about Sasa's hash array mapped trie, but I never posted any benchmarks. I recently came across this post on Stack Overflow which provided a decent basic benchmark between .NET's default Dictionary<TKey, TValue>, the C5 collection library's hash dictionary, F#'s immutable map, and .NET's new immutable collections.

I slightly modified the file to remove the benchmarks against the F# map and the new immutable collections, since I'm still using VS 2010, and I added a simple warmup phase to ensure the methods have all been JIT compiled and the GC has run, to avoid introducing noise:

static void Warmup()
{
    // Exercise each collection so every method under test is JIT compiled.
    var x = Tree.Make<string, object>();
    var y = new C5.HashDictionary<string, object>();
    var z = new Dictionary<string, object>();
    z.Add("foo", "bar");
    for (var i = 0; i < 100; ++i)
    {
        x = x.Add("foo" + i, "bar"); // immutable: Add returns a new tree
        y.Add("foo" + i, "bar");
        z.Add("foo" + i, "bar");
        var tmp1 = x["foo" + i];
        var tmp2 = y["foo" + i];
        var tmp3 = z["foo" + i];
    }
    // Drop the references and collect so warmup garbage doesn't skew the timings.
    x = default(Tree<string, object>);
    y = null;
    z = null;
    GC.Collect();
}
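
The timing itself follows the Stack Overflow benchmark's pattern: populate each collection, then measure a full pass of lookups with a Stopwatch. The sketch below is my own reconstruction for illustration, not the code from that post; the LookupBench/Run names, the output formatting, and the Sasa.Collections namespace are assumptions, and the C5 dictionary is measured the same way as the other two.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using Sasa.Collections; // assumed namespace for the immutable Tree

static class LookupBench
{
    static void Run(int count)
    {
        // Populate the collections with the same keys used in the warmup.
        var scgd = new Dictionary<string, object>();
        var imm = Tree.Make<string, object>();
        for (var i = 0; i < count; ++i)
        {
            scgd.Add("foo" + i, "bar");
            imm = imm.Add("foo" + i, "bar"); // immutable: Add returns a new tree
        }

        // Time one lookup per key against the mutable dictionary...
        var sw = Stopwatch.StartNew();
        for (var i = 0; i < count; ++i) { var tmp = scgd["foo" + i]; }
        sw.Stop();
        Console.WriteLine("SCGD - {0,10} MS - {1,10} Ticks", sw.ElapsedMilliseconds, sw.ElapsedTicks);

        // ...and the same pass against the immutable HAMT.
        sw = Stopwatch.StartNew();
        for (var i = 0; i < count; ++i) { var tmp = imm["foo" + i]; }
        sw.Stop();
        Console.WriteLine("Imm  - {0,10} MS - {1,10} Ticks", sw.ElapsedMilliseconds, sw.ElapsedTicks);
    }
}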

The results are still reasonably representative. Here is a sample of a typical run, where "Imm" is Sasa's immutable HAMT:

# - 100
SCGD -          0 MS -         25 Ticks
C5   -          0 MS -        887 Ticks
Imm  -          0 MS -        387 Ticks

# - 1000
SCGD -          0 MS -        257 Ticks
C5   -          0 MS -        294 Ticks
Imm  -          0 MS -        368 Ticks

# - 10000
SCGD -          1 MS -       4084 Ticks
C5   -          1 MS -       5182 Ticks
Imm  -          1 MS -       5436 Ticks

# - 100000
SCGD -         28 MS -      85742 Ticks
C5   -         32 MS -      99280 Ticks
Imm  -         32 MS -      97720 Ticks

Observations:

  1. C5's standard deviation was somewhat wider than both Sasa's HAMT and SCGD, so its performance seems slightly less predictable
  2. Sasa's immutable HAMT appears to perform within 5% of the mutable C5 collection at all collection sizes
  3. Sasa's immutable HAMT appears to perform within 15% of the mutable SCGD for large collections, where the hash table operates at a higher load factor
  4. Small collections, where the hash table runs at a low load factor, clearly favor the mutable SCGD by up to an order of magnitude; C5 doesn't share this advantage for some reason (possibly it maintains a higher load factor)
  5. C5's terrible performance on very small collections of 100 items was consistent on every test run, again possibly because it maintains a high load factor before resizing
  6. Sasa's HAMT takes just as much time to look up 1000 items as it takes to look up 100 items; this was consistent across every test run, and it's not clear why

Finally, while not exactly apples-to-apples, Sasa's HAMT is easily 3-4× faster than F#'s map given the numbers cited in the above Stack Overflow post. F# still has an advantage for very small collections though. Sasa's HAMT also appears to be at least 2× faster than the new immutable collections.

Also keep in mind that this benchmark only tests lookup performance. F#'s map would have an advantage over Sasa's HAMT in load performance because the HAMT does not yet include a "bulk-load" operation, which the F# map does appear to support.

Tuesday, April 1, 2014

A Truly Slim Read/Write Lock in C#

It's pretty well known that the CLR's ReaderWriterLock and ReaderWriterLockSlim have unappealing performance characteristics. Each class also encapsulates significant state, which precludes its use in fine-grained concurrency across large collections of objects.

Enter Sasa.Concurrency.RWLock in the core Sasa assembly. This is the most lightweight R/W lock I could come up with, particularly in terms of resources used. It's a struct encapsulating a single integer, which packs the count of active readers together with a flag indicating whether a writer is active.

The interface is similar to ReaderWriterLockSlim, although there are a few differences which are needed to keep the encapsulated state so small:

public struct RWLock
{
  // this field is the only state needed by RWLock 
  private int flags;

  public void EnterReadLock();
  public void ExitReadLock();
  public bool TryEnterReadLock();

  public void EnterWriteLock(object sync);
  public bool TryEnterWriteLock(object sync);
  public void ExitWriteLock(object sync);
}

Conceptually, EnterWriteLock calls Monitor.Enter(sync), which ensures that only a single writer acquires the write lock. It then sets the write bit in the "flags" state, and loops yielding its time slice until all read locks are released.

EnterReadLock also loops, yielding its time slice until the write flag is cleared, and then uses Interlocked.Increment to acquire the read lock; ExitReadLock releases it with Interlocked.Decrement.

TryEnterReadLock and TryEnterWriteLock provide non-blocking semantics, so there is no looping: TryEnterWriteLock returns false immediately if the lock on 'sync' cannot be acquired, and TryEnterReadLock returns false immediately if the write flag is set. They never block or loop under any circumstances.
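
To make the above concrete, here is a minimal sketch of how those pieces could fit together. This is not the Sasa source: the type name, the bit layout, the recursion check, and the decision to have TryEnterWriteLock fail when readers are active are all my own assumptions for illustration.

using System.Threading;

// A simplified sketch of the scheme described above, not the actual Sasa code.
public struct SimpleRWLock
{
    // Top bit flags an active writer; the remaining bits count active readers.
    const int WRITER = unchecked((int)0x80000000);
    int flags;

    public void EnterReadLock()
    {
        while (true)
        {
            var current = Thread.VolatileRead(ref flags);
            // Register as a reader only while no writer is active.
            if ((current & WRITER) == 0 &&
                Interlocked.CompareExchange(ref flags, current + 1, current) == current)
                return;
            Thread.Yield(); // give up the time slice and retry
        }
    }

    public void ExitReadLock()
    {
        Interlocked.Decrement(ref flags);
    }

    public bool TryEnterReadLock()
    {
        // Single attempt: fail if a writer is active or we lose the race.
        var current = Thread.VolatileRead(ref flags);
        return (current & WRITER) == 0 &&
               Interlocked.CompareExchange(ref flags, current + 1, current) == current;
    }

    public void EnterWriteLock(object sync)
    {
        Monitor.Enter(sync); // serializes writers
        if ((Thread.VolatileRead(ref flags) & WRITER) != 0)
        {
            // Only this thread could have set the bit while holding 'sync'.
            Monitor.Exit(sync);
            throw new LockRecursionException();
        }
        // Publish the writer flag so no new readers enter.
        int current;
        do { current = Thread.VolatileRead(ref flags); }
        while (Interlocked.CompareExchange(ref flags, current | WRITER, current) != current);
        // Yield until the existing readers have all exited.
        while (Thread.VolatileRead(ref flags) != WRITER)
            Thread.Yield();
    }

    public bool TryEnterWriteLock(object sync)
    {
        if (!Monitor.TryEnter(sync)) return false;
        // This sketch only succeeds when no readers are active either, so it
        // never has to wait for them to drain; the real RWLock may differ here.
        if (Interlocked.CompareExchange(ref flags, WRITER, 0) == 0) return true;
        Monitor.Exit(sync);
        return false;
    }

    public void ExitWriteLock(object sync)
    {
        Interlocked.Exchange(ref flags, 0); // clear the writer flag
        Monitor.Exit(sync);
    }
}

The key invariant in this sketch is that the writer flag is only ever set while 'sync' is held, so a single integer suffices for both the reader count and the writer flag.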

The RWLock implementation is about 150 lines of heavily commented code, so it's easily digestible for anyone who's interested in the specifics. There are also some rules to abide by when using RWLock:

  1. The same 'sync' object must be passed to all write lock calls on a given RWLock; if you use different objects, more than one writer can proceed. Different objects can of course be used for different RWLocks (see the usage sketch after this list).
  2. Recursive write locks are forbidden and will throw LockRecursionException. Recursive read locks are permitted.
  3. You cannot acquire a read lock inside a write lock, or a write lock inside a read lock. If you do, your program will immediately deadlock.
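
To illustrate these rules, here's a hypothetical usage sketch; the Counter type and its field names are invented for the example, and I'm assuming a default-initialized RWLock starts out unlocked:

using Sasa.Concurrency;

public class Counter
{
    // One 'sync' object shared by every write lock call on this RWLock (rule 1).
    readonly object writeSync = new object();
    RWLock rwlock; // assumed: the default value is an unlocked lock
    int total;

    public int Read()
    {
        rwlock.EnterReadLock();
        try { return total; }
        finally { rwlock.ExitReadLock(); }
    }

    public void Add(int n)
    {
        // Never called from inside a read lock (rule 3), and never recursively (rule 2).
        rwlock.EnterWriteLock(writeSync);
        try { total += n; }
        finally { rwlock.ExitWriteLock(writeSync); }
    }
}

Note that the RWLock field is deliberately not marked readonly: method calls on a readonly struct field operate on a defensive copy, which would silently break the lock.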

Unlike the base class libraries, none of my concurrency abstractions accept timeout parameters. Timeouts hide concurrency bugs and introduce pervasive non-determinism, which is partly why concurrent programs are traditionally hard to debug. Timeouts should be rare, and specified separately at a higher level than these low-level concurrency primitives.