.NET's delegates are a powerful and convenient abstraction available since .NET 1.0. They encapsulate a method pointer and the object the method belongs to in a single callable "function object". Multicast delegates are an extension of plain delegates, in that they encompass multiple single delegates. Invoking a multicast delegate invokes every encapsulated delegate, in the order they were added. Multicast delegates are key to the event handling patterns that have been a core part of the CLR nearly since its inception.
If you're curious about virtual machines or CLR internals, you've perhaps wondered how multicast delegates actually work. These are the key properties of multicast delegates:
- They encapsulate a list of delegates of the same delegate type.
- Adding a delegate returns a new multicast delegate containing the new addition at the end of the list (the original is unchanged).
- Removing a delegate returns a new multicast delegate without the specified delegate (the original is unchanged).
- Invoking a multicast delegate invokes each encapsulated delegate, in order, with the given parameters.
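These properties are easy to observe with the built-in delegates; a quick sketch in standard C#:

```csharp
using System;
using System.Collections.Generic;

class MulticastProperties
{
    static void Main()
    {
        var log = new List<string>();
        Action first = () => log.Add("first");
        Action second = () => log.Add("second");

        // Combining returns a NEW delegate; 'first' itself is unchanged.
        Action both = first + second;
        both();
        Console.WriteLine(string.Join(",", log)); // first,second

        // Removal also returns a new delegate, leaving 'both' intact.
        Action onlyFirst = both - second;
        log.Clear();
        onlyFirst();
        Console.WriteLine(string.Join(",", log)); // first
    }
}
```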
These properties describe a straightforward immutable queue of delegates, and if we just wanted something that works, we could use an immutable queue like Sasa.Fifo&lt;T&gt;. However, the CLR's multicast delegates are insanely fast to invoke by comparison. Here's a simple timing result:
System.MulticastDelegate: Build = 1,390,819, Run = 53,762
Delegate Immutable Queue: Build = 507,871, Run = 527,292
This builds up a multicast delegate of 20 items about 200,000 times, then executes that delegate 200,000 times. Building multicast delegates is quite slow on the CLR, but invoking them is an order of magnitude faster than naively traversing a queue and invoking each delegate. Presumably, invoking delegates happens more often than adding and removing delegates, so better invocation speed is a good tradeoff. So how do we achieve it?
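That methodology can be sketched as a pair of timed loops around System.Diagnostics.Stopwatch. This is a hypothetical reconstruction of the harness, not the article's actual benchmark code, and the units (ticks) may differ from the numbers quoted above:

```csharp
using System;
using System.Diagnostics;

class Bench
{
    static void Main()
    {
        const int Iterations = 200000;
        Action noop = () => { };

        // Time building a 20-delegate multicast, repeatedly.
        var sw = Stopwatch.StartNew();
        Action multicast = null;
        for (int i = 0; i < Iterations; ++i)
        {
            multicast = null;
            for (int j = 0; j < 20; ++j) multicast += noop;
        }
        var build = sw.ElapsedTicks;

        // Time invoking the resulting 20-delegate multicast, repeatedly.
        sw.Restart();
        for (int i = 0; i < Iterations; ++i) multicast();
        var run = sw.ElapsedTicks;

        Console.WriteLine("Build = {0:N0} Run = {1:N0}", build, run);
    }
}
```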
First, let's consider the fastest possible implementation of invocation: arrays. If we preallocate an appropriately sized array of delegate slots, and invoke each entry in a for loop, we get these timings:
Delegate Array: Build = 71,554, Run = 54,022
The invocation time is equal to multicast delegates, to within reasonable experimental error. So now we know we need some data structure that approximates a queue, but uses contiguous memory for fast traversals during delegate invocation. We can simulate an immutable array queue using Sasa's array combinators (see Append and Remove):
public struct MulticastSimple<T>
    where T : class
{
    internal T[] items;

    public MulticastSimple<T> Add(T item)
    {
        return new MulticastSimple<T>
        {
            items = items == null ? new[] { item } : items.Append(item),
        };
    }

    public MulticastSimple<T> Remove(T item)
    {
        if (items != null)
        {
            var i = Array.IndexOf(items, item);
            if (i >= 0) return new MulticastSimple<T>
            {
                items = items.Length == 1 ? null : items.Remove(i)
            };
        }
        return this;
    }
}
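MulticastSimple leans on Sasa's Append and Remove array combinators. If you don't have Sasa handy, minimal copy-on-write stand-ins are easy to sketch; these are assumptions about the combinators' behavior, not Sasa's actual code:

```csharp
using System;

static class ArraySketch
{
    // Copy-on-write append: never mutates the source array.
    public static T[] Append<T>(this T[] source, T item)
    {
        var copy = new T[source.Length + 1];
        Array.Copy(source, copy, source.Length);
        copy[source.Length] = item;
        return copy;
    }

    // Copy-on-write removal of the element at 'index'.
    public static T[] Remove<T>(this T[] source, int index)
    {
        var copy = new T[source.Length - 1];
        Array.Copy(source, 0, copy, 0, index);
        Array.Copy(source, index + 1, copy, index, source.Length - index - 1);
        return copy;
    }
}

class ArraySketchDemo
{
    static void Main()
    {
        var xs = new[] { 1, 2, 3 }.Append(4);    // [1, 2, 3, 4]
        var ys = xs.Remove(0);                   // [2, 3, 4]
        Console.WriteLine(string.Join(",", ys)); // 2,3,4
    }
}
```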
It's pretty simple, but let's see how it compares with the built-in multicast:
System.MulticastDelegate: Build = 1,334,914, Run = 57,646
MulticastSimple: Build = 860,529, Run = 58,011
Pretty good! Building MulticastSimple is much faster, and invocation is just as fast as the built-in multicast delegates. Remember that these numbers are for 20 delegates though, and given that the cost of adding a delegate is O(n), this won't scale well to large numbers of delegates. Let's see what the numbers look like for 200 delegates iterated 20,000 times:
System.MulticastDelegate: Build = 1,324,381, Run = 45,683
MulticastSimple: Build = 2,545,824, Run = 45,497
MulticastSimple is already more expensive than multicast delegates, so now let's see 2,000 delegates iterated 2,000 times:
System.MulticastDelegate: Build = 1,324,422, Run = 45,293
MulticastSimple: Build = 21,115,099, Run = 51,515
And here we see the limitations of the naive approach. The cost of generating multicast delegates appears constant regardless of the number of delegates, but the cost for MulticastSimple scales linearly with the number of delegates. However, even 20 delegates in a multicast delegate is uncommon, so optimizing for the rare cases probably isn't worth the effort.
A Scalable Multicast up to M
Fortunately, there is a simple extension of the above design which will efficiently support up to M delegates, where M is some configurable number. Basically, efficient delegate addition needs a single small array, and efficient traversal needs an array as well, but there's no need for these two arrays to be the same:
public struct Multicast<T>
    where T : class
{
    internal T[][] items;
    internal T[] end;

    public Multicast<T> Add(T item)
    {
        if (end == null) return new Multicast<T>
        {
            items = items, end = new[] { item }
        };
        if (end.Length < 16) return new Multicast<T>
        {
            items = items, end = end.Append(item)
        };
        return new Multicast<T>
        {
            items = items == null ? new T[][] { end } : items.Append(end),
            end = new[] { item },
        };
    }

    public Multicast<T> Remove(T item)
    {
        if (end != null)
        {
            var i = Array.IndexOf(end, item);
            if (i >= 0) return new Multicast<T>
            {
                items = items, end = end.Length == 1 ? null : end.Remove(i)
            };
        }
        if (items != null)
        {
            for (int i = 0; i < items.Length; ++i)
            {
                var j = Array.IndexOf(items[i], item);
                if (j >= 0) return new Multicast<T>
                {
                    end = end,
                    items = items[i].Length == 1
                        ? items.Remove(i)
                        : items.Set(i, items[i].Remove(j))
                };
            }
        }
        return this;
    }

    public T[][] Items { get { return items; } }
    public T[] End { get { return end; } }

    public static Multicast<T> operator +(Multicast<T> lhs, T rhs)
    {
        return lhs.Add(rhs);
    }

    public static Multicast<T> operator -(Multicast<T> lhs, T rhs)
    {
        return lhs.Remove(rhs);
    }
}
So we keep a small array of up to 16 items for efficient appends (field 'end'), and when a 17th delegate is added this overflows into the nested array (field 'items') and the 'end' array is restarted. So appends have the same low constant-factor overheads as MulticastSimple, but scale linearly at a factor of n/16 instead of n. This is still fundamentally O(n), but the constant-factor overheads of built-in multicast delegate appends are so high that they only beat out this implementation around 10,000 delegates:
System.MulticastDelegate: Build = 1,479,262, Run = 44,318
Multicast&lt;T&gt;: Build = 1,778,939, Run = 58,599
A multicast delegate with 10,000 entries is more than any reasonable program will ever need. At 200 delegates:
System.MulticastDelegate: Build = 1,445,167, Run = 44,934
Multicast&lt;T&gt;: Build = 949,040, Run = 57,367
At 20 delegates:
System.MulticastDelegate: Build = 1,411,269, Run = 55,544
Multicast&lt;T&gt;: Build = 853,145, Run = 57,036
And probably the most likely case, 4 delegates:
System.MulticastDelegate: Build = 979,093, Run = 80,781
Multicast&lt;T&gt;: Build = 632,594, Run = 77,277
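Before moving on to invocation, the two-array layout is worth making concrete. The following is a self-contained sketch mirroring the Multicast&lt;T&gt; design above, with a chunk size of 4 instead of 16 so the growth pattern is easy to trace (the Append logic is inlined to keep it dependency-free):

```csharp
using System;

// Sketch of the items/end layout: a small 'end' array absorbs appends,
// and spills into the nested 'items' array when it fills up.
struct Chunked<T> where T : class
{
    public T[][] items;
    public T[] end;
    const int ChunkSize = 4; // the article's Multicast<T> uses 16

    public Chunked<T> Add(T item)
    {
        // First add, or first add after a spill: start a fresh 'end'.
        if (end == null)
            return new Chunked<T> { items = items, end = new[] { item } };
        // 'end' still has room: copy-on-write append.
        if (end.Length < ChunkSize)
        {
            var grown = new T[end.Length + 1];
            Array.Copy(end, grown, end.Length);
            grown[end.Length] = item;
            return new Chunked<T> { items = items, end = grown };
        }
        // 'end' is full: spill it into 'items' and restart 'end'.
        var chunks = new T[items == null ? 1 : items.Length + 1][];
        if (items != null) Array.Copy(items, chunks, items.Length);
        chunks[chunks.Length - 1] = end;
        return new Chunked<T> { items = chunks, end = new[] { item } };
    }
}

class ChunkedDemo
{
    static void Main()
    {
        var m = new Chunked<string>();
        for (int i = 0; i < 9; ++i) m = m.Add("d" + i);
        // Nine adds with chunk size 4: two full chunks spilled into
        // 'items', one leftover delegate in 'end'.
        Console.WriteLine(m.items.Length); // 2
        Console.WriteLine(m.end.Length);   // 1
    }
}
```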
Multicast Invocation
Normally the C# compiler produces the code needed to invoke any delegate type, but we can achieve something similar with T4 templates by reusing the built-in System.Action overloads. First, let's look at the basic invocation for System.Action:
public static void Invoke(this Multicast<Action> m)
{
    for (var i = 0; m.items != null && i < m.items.Length; ++i)
    {
        var x = m.items[i];
        for (var j = 0; j < x.Length; ++j)
        {
            x[j]();
        }
    }
    for (var i = 0; m.end != null && i < m.end.Length; ++i)
        m.end[i]();
}
Pretty straightforward: just iterate over each nested array and invoke the delegates. The invocation for an action with a parameter consists of just adding the parameter to invoke:
public static void Invoke<T0>(this Multicast<Action<T0>> m, T0 arg0)
{
    for (var i = 0; m.items != null && i < m.items.Length; ++i)
    {
        var x = m.items[i];
        for (var j = 0; j < x.Length; ++j)
        {
            x[j](arg0);
        }
    }
    for (var i = 0; m.end != null && i < m.end.Length; ++i)
        m.end[i](arg0);
}
And so on for all System.Action overloads up to 16 type arguments. We can automate generating these overloads using T4 templates.
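The actual templates ship with the source download linked below; a rough sketch of what such a T4 template might look like (assuming the Multicast&lt;T&gt; type and the Invoke pattern above):

```
<#@ template language="C#" #>
<#@ import namespace="System.Linq" #>
<# for (var n = 1; n <= 16; ++n) {
   var targs = string.Join(", ", Enumerable.Range(0, n).Select(i => "T" + i));
   var parms = string.Join(", ", Enumerable.Range(0, n).Select(i => "T" + i + " arg" + i));
   var args  = string.Join(", ", Enumerable.Range(0, n).Select(i => "arg" + i));
#>
public static void Invoke<<#= targs #>>(this Multicast<Action<<#= targs #>>> m, <#= parms #>)
{
    for (var i = 0; m.items != null && i < m.items.Length; ++i)
    {
        var x = m.items[i];
        for (var j = 0; j < x.Length; ++j)
            x[j](<#= args #>);
    }
    for (var i = 0; m.end != null && i < m.end.Length; ++i)
        m.end[i](<#= args #>);
}
<# } #>
```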
The Semantics of Multicast System.Func
Multicast delegates on the CLR only have proper semantics for delegates that return no value. If you combine multiple delegates that return a value, then only the value from the last delegate is ever returned. But in principle, executing a list of delegates should generate a list of results. And the advantage of pure C# multicast delegates is that implementing this semantics is trivial!
public static IEnumerable<T> Invoke<T>(this Multicast<Func<T>> m)
{
    for (var i = 0; m.items != null && i < m.items.Length; ++i)
    {
        var x = m.items[i];
        for (var j = 0; j < x.Length; ++j)
        {
            yield return x[j]();
        }
    }
    for (var i = 0; m.end != null && i < m.end.Length; ++i)
        yield return m.end[i]();
}
This is basically the same C# code as the Action invocations, but we now exploit C#'s iterator syntax (yield return) to easily return a list of values while executing all the encapsulated delegates. Like System.Action, we can generate all the necessary overloads for System.Func using T4 templates.
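For contrast, the built-in behavior being fixed here is easy to demonstrate with plain delegates: a multicast Func runs every encapsulated delegate, but only the last return value survives:

```csharp
using System;

class FuncSemantics
{
    static void Main()
    {
        int calls = 0;
        Func<int> f = () => { calls++; return 1; };
        Func<int> g = () => { calls++; return 2; };

        var combined = f + g;
        int result = combined();

        Console.WriteLine(calls);  // 2 -- both delegates ran...
        Console.WriteLine(result); // 2 -- ...but only g's result is returned
    }
}
```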
Summary
We haven't reproduced precisely the internal multicast implementation, but we have something incredibly simple that's just as fast for all foreseeable programs, and it's in safe C# to boot. Furthermore, we've extended multicast delegates to include a more coherent and complete semantics for delegates that return values, all in about 150 lines of code.
You can download the full source code for this post here, including MulticastSimple and Multicast with T4 templates for the latter.