
Diff for IEnumerable<T>

I've just added a simple diff algorithm under Sasa.Linq. The signature is as follows:

/// <summary>
/// Compute the set of differences between two sequences.
/// </summary>
/// <typeparam name="T">The type of sequence items.</typeparam>
/// <param name="original">The original sequence.</param>
/// <param name="updated">The updated sequence to compare to.</param>
/// <returns>
/// The smallest sequence of changes to transform
/// <paramref name="original"/> into <paramref name="updated"/>.
/// </returns>
public static IEnumerable<Change<T>> Difference<T>(
    this IEnumerable<T> original,
    IEnumerable<T> updated);
/// <summary>
/// Compute the set of differences between two sequences.
/// </summary>
/// <typeparam name="T">The type of sequence items.</typeparam>
/// <param name="original">The original sequence.</param>
/// <param name="updated">The updated sequence to compare to.</param>
/// <param name="eq">The equality comparer to use.</param>
/// <returns>The smallest sequence of changes to transform
/// <paramref name="original"/> into <paramref name="updated"/>.
/// </returns>
public static IEnumerable<Change<T>> Difference<T>(
    this IEnumerable<T> original,
    IEnumerable<T> updated,
    IEqualityComparer<T> eq);

The extension methods depend only on the following enum and struct:

/// <summary>
/// Describes the type of change that was made.
/// </summary>
public enum ChangeType
{
    /// <summary>
    /// An item was added at the given position.
    /// </summary>
    Add,
    /// <summary>
    /// An item was removed at the given position.
    /// </summary>
    Remove,
}
/// <summary>
/// Describes a change to a collection.
/// </summary>
/// <typeparam name="T">The collection item type.</typeparam>
public struct Change<T>
{
    /// <summary>
    /// The change made at the given position. 
    /// </summary> 
    public ChangeType ChangeType { get; internal set; } 
    /// <summary> 
    /// The set of values added or removed from the given position. 
    /// </summary> 
    public IEnumerable<T> Values { get; internal set; } 
    /// <summary> 
    /// The position in the sequence where the change took place. 
    /// </summary> 
    public int Position { get; internal set; }
} 

This is a simple and general interface with which you can perform all sorts of computations on the differences between two sequences. The code as provided will work out of the box for any type T that implements equality. Some simple examples:

Console.WriteLine( "miller".Difference("myers").Format("\r\n") );
// prints out: 
// +1:y 
// -1:i,l,l 
// +6:s 

var original = new int[] { 2, 5, 99 }; 
var updated = new int[] { 2, 4, 4, 8 }; 
Console.WriteLine( original.Difference(updated).Format("\r\n") ); 
// prints out: 
// +1:4,4,8 
// -1:5,99 

"Format" is a simple extension method, also under Sasa.Linq, which generates a formatted string when given an IEnumerable<T>.

At the moment, I have simply implemented the naive algorithm, which takes O(N*M) space and time for sequences of length N and M. I plan to eventually implement some of the linear-space optimizations described in Eugene Myers' paper, "An O(ND) Difference Algorithm and Its Variations".
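
For readers unfamiliar with the naive approach, here is a minimal sketch of a table-based diff built on the longest common subsequence. It is not the Sasa implementation: the NaiveDiff class, the Diff method, and the (IsAdd, Position, Value) tuple are hypothetical names chosen for illustration, and it emits one edit per item rather than grouping runs of values the way Change<T> does. Positions are assumed to index into the original sequence.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of Sasa.Linq.
public static class NaiveDiff
{
    // Returns one edit per added or removed item.
    // Position is an index into the original sequence (an assumption).
    public static List<(bool IsAdd, int Position, T Value)> Diff<T>(
        IList<T> original, IList<T> updated, IEqualityComparer<T> eq = null)
    {
        eq = eq ?? EqualityComparer<T>.Default;
        int n = original.Count, m = updated.Count;

        // lcs[i, j] = length of the longest common subsequence of
        // original[i..] and updated[j..]. This table is the N*M cost.
        var lcs = new int[n + 1, m + 1];
        for (int i = n - 1; i >= 0; --i)
            for (int j = m - 1; j >= 0; --j)
                lcs[i, j] = eq.Equals(original[i], updated[j])
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk the table from the top-left corner, emitting edits
        // whenever the two sequences disagree.
        var edits = new List<(bool IsAdd, int Position, T Value)>();
        int x = 0, y = 0;
        while (x < n || y < m)
        {
            if (x < n && y < m && eq.Equals(original[x], updated[y]))
            {
                x++; y++;                          // common item, no edit
            }
            else if (y < m && (x == n || lcs[x, y + 1] >= lcs[x + 1, y]))
            {
                edits.Add((true, x, updated[y]));  // item added before x
                y++;
            }
            else
            {
                edits.Add((false, x, original[x])); // item x removed
                x++;
            }
        }
        return edits;
    }
}
```

Calling NaiveDiff.Diff("miller".ToCharArray(), "myers".ToCharArray()) yields five single-item edits: add 'y' at 1, remove 'i', 'l', 'l' at 1, 2 and 3, and add 's' at 6. Grouping adjacent edits of the same kind gives the three changes shown in the example above.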

There are many applications for a general difference algorithm like this. Consider a reactive property of type IEnumerable<T>, such as the list of items backing a drop-down in a user interface. If the UI is remote, as with X11 or a web browser, sending the entire list over and over again is bandwidth-intensive and hurts the latency of the UI. It's much more efficient to send only the changes, which can be computed by taking the diff of the original list and the new list.
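
To make the remote UI scenario concrete, here is a rough sketch of how the receiving side might rebuild the new list from the old list plus a set of changes. Everything in it is an assumption for illustration: the ChangeApplier class, the Apply method, and the (IsAdd, Position, Values) tuple stand in for Change<T>, and positions are assumed to index into the original sequence, which is what the sample output above suggests (the 's' is added at position 6, just past the end of "miller").

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of Sasa.Linq.
public static class ChangeApplier
{
    // Rebuilds the updated list from the original list and its changes.
    // Assumes every Position refers to an index in the original list.
    public static List<T> Apply<T>(
        IList<T> original,
        IEnumerable<(bool IsAdd, int Position, T[] Values)> changes)
    {
        var added = new Dictionary<int, List<T>>();
        var removed = new HashSet<int>();
        foreach (var change in changes)
        {
            if (change.IsAdd)
            {
                // Collect the values inserted before this original index.
                if (!added.TryGetValue(change.Position, out var values))
                    added[change.Position] = values = new List<T>();
                values.AddRange(change.Values);
            }
            else
            {
                // A removal of k values covers k consecutive indices.
                for (int i = 0; i < change.Values.Length; ++i)
                    removed.Add(change.Position + i);
            }
        }

        var result = new List<T>();
        // Iterate one past the end so trailing additions are included.
        for (int i = 0; i <= original.Count; ++i)
        {
            if (added.TryGetValue(i, out var inserted))
                result.AddRange(inserted);
            if (i < original.Count && !removed.Contains(i))
                result.Add(original[i]);
        }
        return result;
    }
}
```

Under these assumptions, applying the changes "+1:4,4,8" and "-1:5,99" to { 2, 5, 99 } reproduces { 2, 4, 4, 8 }, so only the two changes need to cross the wire rather than the whole list.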

Comments

Sean Hederman said…
If you're interested I've already completed an implementation of the Myers algorithm for IEnumerable. I've also implemented some optimisations on top of that algorithm, so in most cases it outperforms it.
Sandro Magi said…
Hey Sean, is this code publicly available somewhere?
Sean Hederman said…
Hey, sorry for the late reply. The code is part of our Signate DM system, so it's not available. However we broke it out into a separate library that I used in my free Reflector Diff plugin. My company, Palantir, is happy to license it for free, just drop us a line at diff@palantir.co.za.
