Skip to main content

Coroutines in C Redux

My last post generated a bit of discussion here and on Reddit, where after further testing, I discovered that stack management didn't work properly on Windows. It would seem that embedded frame pointers on the stack are used to ensure that we are on the same stack, at least in DEBUG mode, and so growing or shrinking the stack generated cryptic errors.

These problems motivated me to explore alternate methods of implementing coroutines. Fundamentally, the current state of a computation is encapsulated in the stack. So switching between two concurrent computations means modifying the stack, either by swapping out the old stack for an entirely new one, or by copying the relevant portion of the stack to a save area, and restoring it when switching back. Coroutines and threads typically use stack switching, and continuations have traditionally used some sort of copying.

The initial coroutine implementation used stack switching, which is fairly efficient incurring only a register save, a pointer update, and a register restore. Stack copying involves a register save, a copy to save area, a copy from a save area, and a register restore. The additional copies incur a certain overhead.

The benefits of copying are that it's virtually guaranteed to work across all platforms, and that all stack data are restored to their original locations so the full semantics of C are available. Stack switching restricts certain C operations, such as taking the address of a local variable.

My naive copying implementation was about two orders of magnitude slower than stack switching. It performed about 300,000 context switches/sec on my Core 2 Duo, vs stack switching at ~12,500,000 context switches/sec. I reduced this overhead by an order of magnitude by caching stacks (~1,500,000 context switches/sec), so part of the stack save area is wasted, but at least we're not allocating and freeing on each context switch.

The copying implementation can be optimized further using lazy stack copying. Basically, only a portion of the stack is restored, and an underflow handler is set up as the return address of the last stack frame. The underflow handler then restores the next portion of the stack and resumes the computation. If the underflow handler is never invoked, then we only need to copy that small portion of the stack back to the save area, so this ends up being a huge win.

Unfortunately, while establishing an underflow handler is trivial to do in assembly, I haven't been able to figure out how to accomplish this in C. It would be a neat trick though, and I'd estimate that it would bring the performance of a copying implementation very close to a stack switching implementation.

I've also started playing with an implementation of delimited continuations for C, since this has the potential to reduce the amount of copying needed. It's tricky to do in C though, so I don't have anything solid yet.

For expediency, I have also implemented coroutines using Windows Fibers. Unfortunately, Fibers don't support any sort of copy or clone operation, so I had to retire coro_clone for the time being. This means my coroutine library can no longer be used to implement multishot continuations. Of course, this capability is trivial in a stack copying implementation. Hopefully I'll be able to figure out a way to improve the performance of copying coroutines/continuations in a portable way.

Comments

Lee said…
The problem almost seems like it would be better solved in the compiler or even in a preprocessor generating a stack struct with structs and unions for subroutines that can freaze.

Is there a solution like this?
Sandro Magi said…
You are correct. It would be ideal for the compiler to emit stack frame information, which can then be used to update all stack frame references. There are some compilers for garbage collected languages that do this in fact. I don't know of any C compilers that do this however.

Still, as long as a language emits PIC code, and uses only relative offsets from the current stack position to reference its locals, stack relocation is trivial.

The tricky parts comes with absolute addressing, like taking the address of a local and passing it to another function. Here you either need to update that address within the stack frame (complicated), use a relative offset from the stack base, or otherwise ensure that the newest address is always used (like adding a level of indirection).
Dobes said…
w.r.t. lazy stack copying - even if you could detect that you need to lazily copy a bit more stack, you'll still have the problem that pointers to the stack might be invalid. For example:

char buf[256];
fill_my_buffer(buf,sizeof buf,stdin);

As a common idiom for I/O in C, this probably has to be supported intelligently.

Another way to reduce copying might be to somehow detect an area of the stack which were not changed since the last time a particular coroutine was invoked and not copy that part.

To do this, you'd have to compare the stack as it is copied away to see what parts of the stack a coroutine actually modified. Then some kind of versioning system might be needed for chunks of stack so that when you are going to copy chunks of stack back in you can check if the version number is the same and skip copying in that case. It's not clear that all this comparing would save any time compared to copying, though.

Popular posts from this blog

async.h - asynchronous, stackless subroutines in C

The async/await idiom is becoming increasingly popular. The first widely used language to include it was C#, and it has now spread into JavaScript and Rust. Now C/C++ programmers don't have to feel left out, because async.h is a header-only library that brings async/await to C! Features: It's 100% portable C. It requires very little state (2 bytes). It's not dependent on an OS. It's a bit simpler to understand than protothreads because the async state is caller-saved rather than callee-saved. #include "async.h" struct async pt; struct timer timer; async example(struct async *pt) { async_begin(pt); while(1) { if(initiate_io()) { timer_start(&timer); await(io_completed() || timer_expired(&timer)); read_data(); } } async_end; } This library is basically a modified version of the idioms found in the Protothreads library by Adam Dunkels, so it's not truly ground bre

Building a Query DSL in C#

I recently built a REST API prototype where one of the endpoints accepted a string representing a filter to apply to a set of results. For instance, for entities with named properties "Foo" and "Bar", a string like "(Foo = 'some string') or (Bar > 99)" would filter out the results where either Bar is less than or equal to 99, or Foo is not "some string". This would translate pretty straightforwardly into a SQL query, but as a masochist I was set on using Google Datastore as the backend, which unfortunately has a limited filtering API : It does not support disjunctions, ie. "OR" clauses. It does not support filtering using inequalities on more than one property. It does not support a not-equal operation. So in this post, I will describe the design which achieves the following goals: A backend-agnostic querying API supporting arbitrary clauses, conjunctions ("AND"), and disjunctions ("OR"). Implemen

Simple, Extensible IoC in C#

I just committed the core of a simple dependency injection container to a standalone assembly, Sasa.IoC . The interface is pretty straightforward: public static class Dependency { // static, type-indexed operations public static T Resolve<T>(); public static void Register<T>(Func<T> create) public static void Register<TInterface, TRegistrant>() where TRegistrant : TInterface, new() // dynamic, runtime type operations public static object Resolve(Type registrant); public static void Register(Type publicInterface, Type registrant, params Type[] dependencies) } If you were ever curious about IoC, the Dependency class is only about 100 lines of code. You can even skip the dynamic operations and it's only ~50 lines of code. The dynamic operations then just use reflection to invoke the typed operations. Dependency uses static generic fields, so resolution is pretty much just a field access + invoking a