Stackless coroutines with Visual Studio 2015

Let’s talk about… coroutines! Once again, for a change! 🙂

In my last posts I talked about a Microsoft proposal to add support for resumable functions and the await keyword to the language, and described how this feature had been implemented in one of the first previews of VS 2015. Almost a year has passed since then, and there are a few noteworthy updates, so it is time for a new post. In fact, there is a new proposal for resumable functions (N4286) that addresses some of the limitations of the past proposal, and that has been prototyped in the latest CTP of Visual Studio 2015. But let’s start from the beginning.

C++17 Concurrency TS

For some time now I had been looking at the problem of implementing coroutines/resumable functions, in order to have in C++ something similar to what the C# await and yield statements provide.

It turns out that – unbeknown to me 🙂 – this is quite a hot topic in the C++ community. The next major version of the standard is expected for 2017 and, in preparation for this, the activity of the standard committee has been organized in 8 groups working on 8 technical specifications, for things like a File System API, a Networking API, Concepts-Lite, Transactional Memory, Parallelism and Concurrency. (See http://isocpp.org/std/status for more details).

In particular in the Concurrency TS people are discussing how to improve std::futures to make them more composable, and how to introduce executors (task schedulers) and resumable functions. And while the work on extended futures and executors should only involve updates to the standard library, resumable functions could require changes to the language itself. To be honest, it seems unlikely that these changes will be approved in time for 2017, because there is still no agreement in the committee, but even so, this seems a very interesting topic (well, it is for me, at least…).

Stackful?

In the old proposal (N3858) Gustafsson presented the advantages of having language support for resumable functions, proposed the introduction of the await and yield keywords, and described two possible implementations, stackful and stackless. The first, based on resumable side stacks (fibers), works by giving each coroutine its own stack, which is associated with a “real” OS thread only when the coroutine is actually running. The second, based on heap-allocated activation frames, requires the compiler to do significantly more work: it must allocate on the heap a data structure that contains the arguments and local variables of the coroutine, and generate the code for a state machine that manages the suspension and resumption points.

The Visual Studio 2013 CTP came with a stackful, fiber-based implementation of the await feature, which I described in a previous post. But stackful coroutines have a big disadvantage: fibers are VERY expensive. By default, a fiber is allocated with the default stack size of a thread, which is 1 MB on Windows. Clearly this makes it impossible to have many coroutines running concurrently; just a few thousand coroutines would completely exhaust the virtual address space of a 32-bit process.

Stackless!

The new proposal (N4286) instead focuses mostly on a stackless implementation, and promises to scale to billions of concurrent coroutines. The goal, again, is to have functions that can be suspended, waiting for some asynchronous operation to complete, and then automatically resumed after the operation has completed. For users, this means being able to explicitly await an asynchronous completion with await statements, as is done in C#.

As an example of an asynchronous operation let’s imagine that we have a function that sleeps asynchronously for a given time interval:

std::future<void> delay(int ms)
{
    return std::async([ms]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    });
}

The proposal would make it possible to wait on the asynchronous operation using await:

std::future<void> testAwait()
{
    await delay(2000);
}

Also, resumable functions make it possible to have lazy generators (similar to C# iterator blocks) with the yield statement:

std::generator<int> fibonacci(int max)
{
    int a = 0;
    int b = 1;
    while (a <= max) {
        yield a;
        int next = a + b; 
        a = b; 
        b = next;
    }
}

The syntax and the semantics of coroutines remain more or less the same as in the old proposal; the changes are mostly in the way these coroutines should be implemented using activation frames. And since I think I have already talked more than enough about resumable functions and generators in the past, in this post I’d like to focus on the details of this new implementation (which are much better explained in the standard document, anyway).

In fact, implementing a resumable function with a fiber, while expensive, is not very complicated: after all, a fiber is already a system-level implementation of a coroutine. But implementing resumable functions with just an activation frame requires much more work from the compiler, which needs to alter the structure of the code, turning it into a sequence of chained continuations.

Visual Studio 2015 preview

To have a look at how stackless coroutines of N4286 are supposed to work we can play with a first prototype which is already available in the CTP version of Visual Studio 2015, freely available for download here.

The only meaningful difference between this prototype and the standard proposal is that in the prototype we’ll have to deal with PPL tasks rather than std::futures. This is because N4286 is based on an improvement of standard std::futures, extended to support the chaining of continuations with std::future::then() like it’s done in boost::future. But, as we know, schedulers and continuation chaining are not yet part of the standard; therefore the VS prototype replaces futures with PPL tasks and utilizes the Concurrency runtime for their scheduling. But this is just an implementation detail: the term “task” in what follows should be read just as an alias for the “std::future” of the future.

Let’s start with await. We can rewrite with the PPL the method delay defined above:

Concurrency::task<void> delay(int ms)
{
    return Concurrency::create_task([ms]() {
        std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    });
}

At this point we can use the new keyword __await (with a double underscore, because it is not yet standard):

Concurrency::task<void> testAwait()
{
    __await delay(2000); // suspension point (suspends testAwait, resumes delay).
}

In order to compile this code with the VS2015 CTP we first need to enable a compiler extension by passing the /await flag to the compiler options for our project. To use this new feature we also need to include the header file <pplawait.h>.

If we run this with the debugger we can see that the execution really suspends at the __await point and the control goes back to the caller of our testAwait function, which can then go on doing other stuff. After two seconds the delay task completes and the execution of testAwait resumes from where it had suspended. Normally it will resume on a different thread, one of the worker threads used by the PPL scheduler.

There are other limitations in the current CTP implementation: only the x64 platform is supported, not Win32, and stack-frame run-time checks (/RTC) need to be disabled because they are not yet compatible with /await. But these are just minor issues; after all, this is just a prototype and is not supposed to be used for production code.

Coroutine handles

But how does all this work? In a previous post, I investigated some of the details of the stackful prototype that came last year with the VS2013 CTP. Let’s do the same with this new implementation in the latest CTP.

Most of the library logic for this feature ships in two header files (resumable and generator) located in the \Microsoft Visual Studio 14.0\VC\include\experimental\ folder.

As we said, in a stackless implementation a coroutine is associated with an activation frame that is allocated somewhere in the heap and that contains the status of the coroutine together with all the data required to manage its suspension and resumption. In N4286 the frame is an instance of the compiler-generated class coroutine_handle.

In the VS2015 CTP an activation frame is encapsulated by an instance of the resumable_handle type. (This was the name used for coroutine_handle in a previous version of the standard document.)
A resumable handle owns the block of heap memory that represents the activation frame for a function. In general, the data in a frame is organized as in the figure below: first comes the space for a special coroutine promise type, followed by the space for the “suspend context” and the local state (parameters and local variables).

[Figure: layout of a resumable_handle activation frame]

Coroutine promises

The promise class is a library class that implements the semantics of a particular type of coroutine (like await or generators). It must implement all the logic to pass return values and exceptions back from asynchronous computations, with methods like:

void set_result(T val) 
T get_return_object(resumable_handle rh)
void set_exception(E e)
AwaitType yield_value(T val)
bool cancellation_requested()
...

We’ll see later how the C++ compiler makes use of these functions in the code that it generates for a suspension/resumption point.
In the VS2015 CTP, the logic of a promise for await statements is in the class resumable_traits<RetType, …Args>::promise_type. This traits class (renamed coroutine_traits<> in the latest version of the proposal) is templated on the return and argument types of the resumable function, and aggregates all the type-dependent types and methods, like the activation-frame allocator.

Awaitable types

Together with resumable handles and promises, the final piece of the picture is awaitable types. An awaitable type is a type for which a library provides support for await statements, by implementing the following three functions:

// Returns true if the result of the asynchronous operation is already available.
bool await_ready(T& t)

// Suspends the execution until the asynchronous operation completes.
void await_suspend(T& t, resumable_handle<> rh)

// Resumes the execution from where it had suspended.
void await_resume(T& t)

For example, in the sample code above, when we call:

    __await delay(2000);

the type on which we want to await is a PPL task. So, with our await statement we are asking the compiler to generate code that suspends on a task, and resumes after the task is completed. The await compiles because the new library provides these three functions for PPL tasks:

bool await_ready(task& t) {
    return t.is_done();
}
void await_suspend(task& t, resumable_handle<> rh) {
    t.then([rh](task&) { rh(); });
}
void await_resume(task& t) {
    t.get();
}

From this, it should be clear that the mechanism is easily extensible to new types. We can await on any type for which we can provide the three await functions.

Behind the scenes…

But what happens exactly when we compile a resumable function like this, with one or more await statements? Let’s take again this function as example:

task<R> testAwait(T1 param1, T2 param2, …)
{
    // *** code before the __await statement

    __await delay(2000); // suspension point (suspends testAwait, resumes delay).

    // *** code after the __await statement
}

Given the presence of suspension points, the compiler transforms the code quite radically and leverages the library code for coroutine handles, promises and awaitable types described before. In practice, the function above is transformed into this (simplified) pseudo-code:

task<R> testAwait(T1 param1, T2 param2, …)
{
    // Allocates coroutine activation frame.
    auto resHandle = _Resumable_helper_traits::_Alloc(frameSize, resumeAddress);
    
    // Creates a task to be returned to the caller.
    // This task is associated to the task completion handle of the promise.
    task<R> taskToReturn = resHandle.promise().get_return_object();

    // *** Here goes all the code before the __await statement

    // Calls the awaitable function and receives the task on which we want
    // to asynchronously block.
    auto taskToAwait = delay(2000);
  
    // If the task has not yet completed, we need to suspend the coroutine.
    // Otherwise, we can skip the asynchronous part entirely.
    if (! Concurrency::await_ready(taskToAwait))
    {
        // Suspends the coroutine and adds ‘resHandle’ as continuation to be called
        // when the task ‘taskToAwait’ completes.
        Concurrency::await_suspend(taskToAwait, resHandle);
        return taskToReturn;
    }

Resume:
    // we get here when the await has completed
    auto result = Concurrency::await_resume(taskToAwait);

    // *** code after the __await statement

    _Resumable_helper_traits::_Free(resHandle);
    return taskToReturn;
}

Let’s look at this in detail. First, a single resumable handle for the whole function is allocated. This is an instance of a template class, specialized on the types of all the function parameters and on the return type. The frame size is calculated at compile time, so that it can also hold the local variables declared in the function. One of the arguments to _Alloc() is resumeAddress, a pointer to the callback function that will be called when the asynchronous operation completes.

Any resumable function must return a task, so the next step is to prepare the promise and the task that will be returned to the caller.

After this preamble, the code of the function executes normally, in a synchronous way, until it gets to an await (or yield) statement. In our case we want to await on the task returned by a call to delay(timeInMs) and, as we saw, there is library support for awaiting on tasks. The compiler adds a call to await_ready to check if the task has already completed. If this is the case, we can skip the asynchronous part and just get the task result. But if the task has not yet completed we need to suspend the execution. This is done by returning our return-value task to the caller, so that the caller is able to carry on doing other work or maybe also awaiting asynchronously on our task to complete.

Suspending a function is relatively simple, but how will the function resume its execution? Before returning the result-task to the caller, a call to await_suspend is made to chain a continuation on the task on which we are awaiting. One of the parameters here is the resumable handle itself, which represents the coroutine state for this resumable function:

template <class _Ty, typename _Handle>
void await_suspend(task<_Ty>& _Task, _Handle _ResumeCb)
{
    _Task.then([_ResumeCb](task<_Ty>&) {
        _ResumeCb();
    });
}

So, now the method testAwait has suspended and returned a task to the caller. The caller is free to do other things, or it can itself suspend waiting for the task we returned to complete. Meanwhile, life goes on, the scheduler keeps scheduling, time goes by, and at some point the two seconds of the delay task elapse and a call to our continuation can be scheduled to run.

The class resumable_handle implements the call operator (), which is indeed the callback that will be called after the awaited task completes. This is where task continuations meet coroutines: the call operator simply forwards the call to the resumeAddress function that the compiler passed as an argument when allocating the resumable frame. In this way, the execution can resume from where it had left off. The details are a little more complicated: since a function can have multiple suspension/resumption points, the VS compiler seems to implement a small state machine to restart the execution at the correct location. But, simplifying, for the pseudocode above it is as if there were a

goto Resume;

in order to jump to the Resume: label and complete the await operation, retrieving the result of the completed task and returning it to the caller as the result of the returned task.

Generators

The other asynchronous constructs made possible by coroutines are lazy generators: functions that produce a (possibly infinite) sequence of values that can be pulled one at a time.

These are the equivalent of C# iterator blocks or Python generators; the function suspends after having “yielded” each value, and resumes execution when the next value is requested, until the sequence completes.

Going back to the usual example, this is a generator for the Fibonacci sequence, written with the syntax of VS2015 (where the keyword __yield_value is used in place of yield).

generator<int> fib(int max)
{
    int a = 0;
    int b = 1;
    while (a <= max) {
        __yield_value a;
        int next = a + b;
        a = b;
        b = next;
    }
}

The function returns an object of type generator, which behaves like any STL sequence whose range is defined by begin() and end() methods. Therefore, the client code to enumerate the values in the sequence can be as simple as a range-based for loop:

for (auto n : fib(1000)) {
    std::cout << n << std::endl;
}

In VS2015, the code for generators can be found in the file \Microsoft Visual Studio 14.0\VC\include\experimental\generator. I could dig into this code and try to describe all the details, but I would risk becoming repetitive and sounding like a broken record. In fact, the implementation of the prototype for generators turns out to be very similar to the idea I sketched in one of my last posts. (This, of course, not because my idea was particularly brilliant, but only because there are not many other ways to write generators using coroutines :-)).

Just to have a quick overview, this is a very simplified version of the implementation:

template<typename T>
class generator
{
public:
    struct iterator
    {
        resumable_handle _coro;

        iterator(resumable_handle rh) : _coro(rh) {}
        iterator(nullptr_t) {} // represents the end of the sequence
        iterator& operator ++ () {
            _coro(); // resumes execution
            return *this;
        }
        bool operator == (const iterator& rhs) const {
            return _coro == rhs._coro;
        }
        const T& operator * () const {
            return _coro.promise()._CurrentValue;
        }
    };

    generator(resumable_handle rh) : _coro(rh) {}

    iterator begin() {
        _coro(); // starts the coroutine and executes it until it terminates or yields.
        return iterator(_coro);
    }
    iterator end() { return iterator(nullptr); }

private:
    resumable_handle _coro;
};

We see that in order to compile a function like the Fibonacci generator above, the compiler must allocate a resumable_handle and use it to construct an object of type generator<T>.

The role of the generator class is just to provide the semantics of a range, with input iterators returned by begin() and end(); all the logic for the lazy generation is in the iterator class.
An iterator owns the resumable_handle that was passed to the generator, and uses it to resume and suspend its execution. Even here, a resumable handle works as a wrapper for the state of a coroutine. There are just small differences with respect to await: here a different type of promise is used (of class generator::promise_type), designed to support yield_value and to store the last yielded value, so that it can later be pulled by dereferencing the iterator.

The result is quite neat, in my opinion: C#-like generators will be very useful, especially if they can be composed with LINQ-like operators.

But again, for more details, I’d refer to this post, or (even better) to the standard document.

Conclusion

To summarize this brief excursus into the new proposal for resumable functions, there are a few points worth noting:

  1. Resumable functions can be used to add support for await and yield in C++.
  2. They can be implemented with stackless coroutines in order to be more scalable (at the price of greater complexity in the compiler).
  3. Stackless coroutines work allocating a single activation frame for each invocation of a resumable function.
  4. The implementation requires the support of improved std::futures (composable with then()), which are also being standardized.

Hopefully, something like this will be part of the standard soon.

(Note: slightly edited on 2015/3/8 to clarify a few points).

5 thoughts on “Stackless coroutines with Visual Studio 2015”

  1. I was as excited as you when I saw the proposal (N4286), but I soon became disappointed by the implementation shipped with VS 2015: it’s based on N4134, not N4286, the header has been broken since CTP5, and the coroutine support is very unstable and buggy, and my bug report doesn’t seem likely to be addressed… Have you ever used it for any practical work?

    BTW, you actually can use std::future as the return type already; future::then is not necessarily required.
    IMHO, their design has a major drawback: you have to manage coroutines very carefully, because coroutine_handle is nothing more than a raw pointer, and you have to ensure that it is fully executed or you’ll leak memory badly. The situation is even worse than a traditional memory leak, as the problem is more implicit than a mismatched new/delete. I suggested on the std-proposals list that coroutine_handle should be ref-counted, but I haven’t heard back from the author.

    I like the idea as much as you do, and I’d like to use it sooner rather than later, so I made a library for that: [co2](https://github.com/jamboree/co2). It requires C++11 and is preprocessor-based; you can look at the black magic to understand how await works behind the scenes 😉

  2. Hi, thanks for your comment!

    Yes, it’s true: the implementation shipped with VS2015 is based on N4134, but that document is practically identical to N4286, with just resumable_handle renamed to coroutine_handle and some minor editing.

    And you are right: coroutines have been broken since CTP5, but it’s easy to fix them; it’s just a matter of changing a couple of lines in the generator header file:
    line 215: resumable_handle _Coro = nullptr; should be resumable_handle _Coro;
    From line 224, the second definition of “struct resumable_traits” is duplicated and should be commented out. I am pretty sure this will be fixed in the next CTP.

    Also, you are right that std::future can be used as the return type, but what I meant was that task::then() is required internally by the current implementation.

    I am interested in your work, I’ll certainly have a look!
