My last post generated a bit of discussion here and on Reddit, where after further testing, I discovered that stack management didn't work properly on Windows. It would seem that embedded frame pointers on the stack are used to ensure that we are on the same stack, at least in DEBUG mode, and so growing or shrinking the stack generated cryptic errors.
These problems motivated me to explore alternate methods of implementing coroutines. Fundamentally, the current state of a computation is encapsulated in the stack. So switching between two concurrent computations means modifying the stack, either by swapping out the old stack for an entirely new one, or by copying the relevant portion of the stack to a save area, and restoring it when switching back. Coroutines and threads typically use stack switching, and continuations have traditionally used some sort of copying.
The initial coroutine implementation used stack switching, which is fairly efficient incurring only a register save, a pointer update, and a register restore. Stack copying involves a register save, a copy to save area, a copy from a save area, and a register restore. The additional copies incur a certain overhead.
The benefits of copying are that it's virtually guaranteed to work across all platforms, and that all stack data are restored to their original locations so the full semantics of C are available. Stack switching restricts certain C operations, such as taking the address of a local variable.
My naive copying implementation was about two orders of magnitude slower than stack switching. It performed about 300,000 context switches/sec on my Core 2 Duo, vs stack switching at ~12,500,000 context switches/sec. I reduced this overhead by an order of magnitude by caching stacks (~1,500,000 context switches/sec), so part of the stack save area is wasted, but at least we're not allocating and freeing on each context switch.
The copying implementation can be optimized further using lazy stack copying. Basically, only a portion of the stack is restored, and an underflow handler is set up as the return address of the last stack frame. The underflow handler then restores the next portion of the stack and resumes the computation. If the underflow handler is never invoked, then we only need to copy that small portion of the stack back to the save area, so this ends up being a huge win.
Unfortunately, while establishing an underflow handler is trivial to do in assembly, I haven't been able to figure out how to accomplish this in C. It would be a neat trick though, and I'd estimate that it would bring the performance of a copying implementation very close to a stack switching implementation.
I've also started playing with an implementation of delimited continuations for C, since this has the potential to reduce the amount of copying needed. It's tricky to do in C though, so I don't have anything solid yet.
For expediency, I have also implemented coroutines using Windows Fibers. Unfortunately, Fibers don't support any sort of copy or clone operation, so I had to retire coro_clone for the time being. This means my coroutine library can no longer be used to implement multishot continuations. Of course, this capability is trivial in a stack copying implementation. Hopefully I'll be able to figure out a way to improve the performance of copying coroutines/continuations in a portable way.