My last post generated a bit of discussion here and on Reddit, where after further testing, I discovered that stack management didn't work properly on Windows. It would seem that embedded frame pointers on the stack are used to ensure that we are on the same stack, at least in DEBUG mode, and so growing or shrinking the stack generated cryptic errors.
These problems motivated me to explore alternate methods of implementing coroutines. Fundamentally, the current state of a computation is encapsulated in the stack. So switching between two concurrent computations means modifying the stack, either by swapping out the old stack for an entirely new one, or by copying the relevant portion of the stack to a save area, and restoring it when switching back. Coroutines and threads typically use stack switching, and continuations have traditionally used some sort of copying.
The initial coroutine implementation used stack switching, which is fairly efficient incurring only a register save, a pointer update, and a register restore. Stack copying involves a register save, a copy to save area, a copy from a save area, and a register restore. The additional copies incur a certain overhead.
The benefits of copying are that it's virtually guaranteed to work across all platforms, and that all stack data are restored to their original locations so the full semantics of C are available. Stack switching restricts certain C operations, such as taking the address of a local variable.
My naive copying implementation was about two orders of magnitude slower than stack switching. It performed about 300,000 context switches/sec on my Core 2 Duo, vs stack switching at ~12,500,000 context switches/sec. I reduced this overhead by an order of magnitude by caching stacks (~1,500,000 context switches/sec), so part of the stack save area is wasted, but at least we're not allocating and freeing on each context switch.
The copying implementation can be optimized further using lazy stack copying. Basically, only a portion of the stack is restored, and an underflow handler is set up as the return address of the last stack frame. The underflow handler then restores the next portion of the stack and resumes the computation. If the underflow handler is never invoked, then we only need to copy that small portion of the stack back to the save area, so this ends up being a huge win.
Unfortunately, while establishing an underflow handler is trivial to do in assembly, I haven't been able to figure out how to accomplish this in C. It would be a neat trick though, and I'd estimate that it would bring the performance of a copying implementation very close to a stack switching implementation.
I've also started playing with an implementation of delimited continuations for C, since this has the potential to reduce the amount of copying needed. It's tricky to do in C though, so I don't have anything solid yet.
For expediency, I have also implemented coroutines using Windows Fibers. Unfortunately, Fibers don't support any sort of copy or clone operation, so I had to retire coro_clone for the time being. This means my coroutine library can no longer be used to implement multishot continuations. Of course, this capability is trivial in a stack copying implementation. Hopefully I'll be able to figure out a way to improve the performance of copying coroutines/continuations in a portable way.
These problems motivated me to explore alternate methods of implementing coroutines. Fundamentally, the current state of a computation is encapsulated in the stack. So switching between two concurrent computations means modifying the stack, either by swapping out the old stack for an entirely new one, or by copying the relevant portion of the stack to a save area, and restoring it when switching back. Coroutines and threads typically use stack switching, and continuations have traditionally used some sort of copying.
The initial coroutine implementation used stack switching, which is fairly efficient incurring only a register save, a pointer update, and a register restore. Stack copying involves a register save, a copy to save area, a copy from a save area, and a register restore. The additional copies incur a certain overhead.
The benefits of copying are that it's virtually guaranteed to work across all platforms, and that all stack data are restored to their original locations so the full semantics of C are available. Stack switching restricts certain C operations, such as taking the address of a local variable.
My naive copying implementation was about two orders of magnitude slower than stack switching. It performed about 300,000 context switches/sec on my Core 2 Duo, vs stack switching at ~12,500,000 context switches/sec. I reduced this overhead by an order of magnitude by caching stacks (~1,500,000 context switches/sec), so part of the stack save area is wasted, but at least we're not allocating and freeing on each context switch.
The copying implementation can be optimized further using lazy stack copying. Basically, only a portion of the stack is restored, and an underflow handler is set up as the return address of the last stack frame. The underflow handler then restores the next portion of the stack and resumes the computation. If the underflow handler is never invoked, then we only need to copy that small portion of the stack back to the save area, so this ends up being a huge win.
Unfortunately, while establishing an underflow handler is trivial to do in assembly, I haven't been able to figure out how to accomplish this in C. It would be a neat trick though, and I'd estimate that it would bring the performance of a copying implementation very close to a stack switching implementation.
I've also started playing with an implementation of delimited continuations for C, since this has the potential to reduce the amount of copying needed. It's tricky to do in C though, so I don't have anything solid yet.
For expediency, I have also implemented coroutines using Windows Fibers. Unfortunately, Fibers don't support any sort of copy or clone operation, so I had to retire coro_clone for the time being. This means my coroutine library can no longer be used to implement multishot continuations. Of course, this capability is trivial in a stack copying implementation. Hopefully I'll be able to figure out a way to improve the performance of copying coroutines/continuations in a portable way.
Comments
Is there a solution like this?
Still, as long as a language emits PIC code, and uses only relative offsets from the current stack position to reference its locals, stack relocation is trivial.
The tricky parts comes with absolute addressing, like taking the address of a local and passing it to another function. Here you either need to update that address within the stack frame (complicated), use a relative offset from the stack base, or otherwise ensure that the newest address is always used (like adding a level of indirection).
char buf[256];
fill_my_buffer(buf,sizeof buf,stdin);
As a common idiom for I/O in C, this probably has to be supported intelligently.
Another way to reduce copying might be to somehow detect an area of the stack which were not changed since the last time a particular coroutine was invoked and not copy that part.
To do this, you'd have to compare the stack as it is copied away to see what parts of the stack a coroutine actually modified. Then some kind of versioning system might be needed for chunks of stack so that when you are going to copy chunks of stack back in you can check if the version number is the same and skip copying in that case. It's not clear that all this comparing would save any time compared to copying, though.