Range comprehensions with C++ lazy generators

In the previous post we had a look at a recent proposal N4286 to add stackless coroutines to the C++ language and described the implementation prototype that ships with Visual Studio 2015 CTP.

We saw that coroutines can be used to implement lazy sequences. Lazy evaluation is a powerful tool and a pillar of functional programming; it gives the ability to construct potentially infinite data structures, and can improve performance by avoiding needless calculations. Many languages provide constructs to compute a series of elements, generating them one at a time, only when they are requested. These are usually named sequences (https://msdn.microsoft.com/en-us/library/dd233209.aspx) in functional languages like Haskell or F#, generators in Python, and iterator blocks in C#.

In the N4286 proposal a generator is a function that returns an object of type generator<T> and that contains at least one yield statement. For example, this is a generator for the Fibonacci sequence (the Hello World of generators) written with the notation of the VS2015 CTP:

generator<int> Fibonacci(int max)
{
    int a = 0;
    int b = 1;
    while (a <= max)
    {
        __yield_value a;
        int next = a + b;
        a = b;
        b = next;
    }
}
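
To consume the sequence, we can simply iterate over the returned generator with a range-based for loop (generator<T> exposes begin() and end() iterators, as we will see in more detail later):

for (auto n : Fibonacci(1000)) {
    std::cout << n << " ";  // prints 0 1 1 2 3 5 8 ... 987
}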

Query comprehensions and LINQ

Functional languages like Haskell have the concept of list comprehensions, syntactic constructs for creating a list by applying operators to other existing lists. Haskell is well suited to dealing with infinite lists, thanks to its laziness, but even non-functional languages like Python and C# provide their own version of query comprehensions.

In C#, of course, we have LINQ, whose design was heavily influenced by Haskell (Erik Meijer was one of the authors). The LINQ syntax is designed specifically to make it easy to apply a sequence of operations on the sequence monad IEnumerable<T>, in the form of queries.

It would be nice to have something similar in an eager language like C++. There are very interesting articles on this topic and I’d heartily suggest reading Bartosz Milewski’s “Getting Lazy with C++” and Eric Niebler’s “Range comprehensions”.

Let’s take an example from these two articles: let’s write a small Haskell program that prints the sequence of Pythagorean triples. A Pythagorean triple is a tuple of three positive integers x, y, and z that satisfy the relation x*x + y*y = z*z.

main = print (take 10 triples)
triples = [(x, y, z) | z <- [1..]
                     , x <- [1..z]
                     , y <- [x..z]
                     , x^2 + y^2 == z^2]

In Haskell this code is very intuitive; we start with (possibly infinite) sequences of integers, and operate on them to generate a sequence of tuples, filtering only the tuples that satisfy the Pythagorean condition. Note that the act of taking only the first elements and the act of printing them are orthogonal to the act of generating the sequence.

The LINQ code for Pythagorean triples is not very different from the Haskell version:

static IEnumerable<Tuple<int,int,int>> PythagoreanTriples(int max)
{
    return (from z in Enumerable.Range(1, max)
            from x in Enumerable.Range(1, z - 1)
            from y in Enumerable.Range(x, z - x)
            where x * x + y * y == z * z
            select new Tuple<int, int, int>(x, y, z));
}

This declarative query syntax is just syntactic sugar for the invocation of standard query operators (https://msdn.microsoft.com/en-us/library/bb397947.aspx) like Where, Select, GroupBy, Range, …, which are defined as extension methods of the Enumerable class.

The same function can be written as a simple iterator block, with a yield statement:

static IEnumerable<Tuple<int, int, int>> PythagoreanTriples(int max)
{
    foreach (var z in Enumerable.Range(1, max))
    {
        foreach (var x in Enumerable.Range(1, z - 1))
        {
            foreach (var y in Enumerable.Range(x, z - x))
            {
                if (x * x + y * y == z * z)
                {
                    yield return new Tuple<int, int, int>(x, y, z);
                }
            }
        }
    }
}

Query comprehensions in C++

Can we have something like LINQ in C++? This has been for some time a little holy grail for me; I tried to implement something with Fibers a long time ago but I didn’t really know what I was doing and the result was very kludgy.

But now the lazy, resumable generators proposed by N4286 seem perfect for this purpose; once coroutines are fully supported by the language, it could be interesting to think of a new library for query comprehensions that works on these generators.

We can use the VS2015 CTP prototype to experiment with this idea. What should a LINQ-like library for C++ look like? How would it work?

Like in LINQ, we want to write a set of operators that transform generator<T> sequences. Let’s start with a first one, a filtering operator Where that takes a generator and a predicate and returns the sequence of elements that satisfy the predicate. This can be done with a simple static function:

template<typename T, typename Pred>
static generator<T> Where(generator<T> gen, Pred pred)
{
    for (auto val : gen) {
        if (pred(val)) {
            __yield_value val;
        }
    }
}

We could very easily write other operators in the same way; for example, the Select operator, analogous to std::transform but working on generators:

template<typename T, typename Fun>
static generator<typename std::result_of<Fun(T)>::type>
Select(generator<T> gen, Fun fun)
{
    for (auto val : gen) {
        __yield_value fun(val);
    }
}

This is also simple to write, thanks to std::result_of, which deduces the return type of Fun at compile time. Note that we cannot use automatic return type deduction here; at least in the first prototypes of the VS2015 CTP we must explicitly declare the return type as generator<T>. This would not compile:

static auto Select(generator<T> gen, Fun fun) {}

Composability

Writing query operators is not very difficult. The problem is: how do we compose these operators? As an example, let’s say we want to take the sequence of all integers, filter the ones that are even and then transform them into a sequence of their string representations, as in the figure:

[Figure: operators pipeline]

We could do this in stages, starting with a generator that yields the sequence of integers:

generator<int> Integers()
{
    for (int i = 0; i <= INT_MAX; i++) {
        __yield_value i;
    }
}

We could either (1) apply the Where and Select operators one after the other:

    auto evenInts = Where(Integers(), [](int n) { return (n % 2) == 0; });
    auto evenIntStrs = Select(evenInts, [](int n) { return std::to_string(n); });

or (2) we could try to put them together in a single expression:

    auto result = Select(
        Where(Integers(), [](int n) { return (n % 2) == 0; }),
        [](int n){ return std::to_string(n); });

Neither choice is great; the second one in particular quickly makes the code difficult to read and maintain. Ideally, we would like these operators to be composable into a pipeline, as they are in LINQ:

var result = Range(1, Max)
    .Where(n => (n % 2) == 0)
    .Select(n => n.ToString());

If we were able to modify the code for generator<T>, we could do this by adding a predetermined set of operators directly in that class, in the standard library:

template <typename T>
class generator
{
    ...
    template <typename Pred>
    generator<T> Where(Pred pred)
    {
        for (auto val : *this) {
            if (pred(val)) {
                __yield_value val;
            }
        }
    }
    ...
};

But even so, the set of operators would not be easy to extend. What we really want is to have the ability to define and add new operators in our code, and have them compose seamlessly.

In C# the query operators are implemented as extension methods of the Enumerable class, which makes it very easy to “plug in” new operators simply by adding new extension methods. In C++ we don’t have extension methods. As an alternative, we can follow the lead of the Boost Range library, which uses the pipe syntax (the bitwise OR operator |) to compose the operations.

Therefore our goal is to be able to write query comprehensions as chains of pipe operators, in this form:

auto evenNumStrs = Range(1, Max)
    | Where([](int n) { return n % 2 == 0; })
    | Select([](int n) { return std::to_string(n); });

And now things get interesting: how to implement this? 🙂

A possible implementation

To be honest, at first I had really no idea. But, as Picasso once said, “good artists copy, great artists steal”: so I stole from a real artist. More precisely, I looked at the code of Eric Niebler’s range library, and it was a very interesting read. Here I am reusing some of his ideas, especially to implement a pipeline of operators. The resulting code is much simpler in my case, of course, because I only have to deal with the narrow scope of resumable generators.

So, let’s start from the beginning: as we said our goal is to be able to create pipelines of operations with the pipe operator |:

generator<int> evenNumbers = Integers() |
                             Where([](int n) { return n%2 == 0; });

We clearly need to overload the bitwise-OR operator |. When its left argument is a generator, the right argument must be something “pipeable” to a generator. Therefore, a call to a function like Where() must return “something pipeable”, whatever that means.

We can say that a type is pipeable if it implements a static method pipe(), with two arguments: a generator<T> and an instance of the pipeable type itself:

template<typename T, typename Pipeable>
auto operator|(generator<T>&& src, Pipeable&& pipe)
{
    return Pipeable::pipe(std::forward<generator<T>>(src),
                          std::forward<Pipeable>(pipe));
}

The next step could be to implement Where as a pipeable type. Rather than just a function, Where now becomes an invokable type, a class with a call operator that implements the logic to filter a sequence according to a predicate. And class Where should also have a static method pipe() that evaluates a pipe by invoking the call operator:

template<typename T>
struct Where
{
    template<typename Pred>
    generator<T> operator()(generator<T> gen, Pred pred)
    {
        for (auto val : gen) {
            if (pred(val)) {
                __yield_value val;
            }
        }
    }

    template<typename Pipe>
    static auto pipe(generator<T>&& src, Pipe&& pipe)
    {
        return operator()(src, pred); // pred ???
    }
};

But now the problem is: how do we pass the predicate to the pipe? Or, generalizing the question, since operators can have any number of arguments: how do we pass all the arguments to the pipe, if we have an expression like this?

auto outputGen = inputGen | Op(arg1, arg2, ...);

Pipeable types

We need to complicate the design a little, leveraging the magic of variadic templates and std::bind. We introduce a type for pipeable objects (objects that can be at the right side of a pipe operator |), implemented as follows:

template<typename Binder>
struct pipeable
{
    // A forwarding call wrapper generated by std::bind.
    Binder _binder;

    pipeable(Binder pipeableBinder)
        : _binder(std::move(pipeableBinder))
    {
    }

    // Executes the call operator of a bound method, passing a generator
    // as placeholder argument.
    template<typename T, typename Pipe>
    static auto pipe(generator<T>&& gen, Pipe&& pipe)
    {
        return pipe._binder(std::forward<generator<T>>(gen));
    } 
};

Let’s look at this in more detail. Type pipeable<> has a template parameter Binder, which is an invokable type, the type of the objects returned when we call std::bind() to bind the call operator (of a type like Where) to a set of arguments.

Now, having an object op of class Where, a generator object and a predicate, we can call std::bind(op, sourceGenerator, predicate) and use the result to construct an instance of a pipeable object, so that pipeable::pipe() will call a method already bound to its arguments.

Class Where can now be simplified to have just a call operator:

template<typename T>
struct Where
{
    template<typename Pred>
    generator<T> operator()(generator<T> gen, Pred pred)
    {
        [...]
    }
};

Not all the arguments are early-bound though. When we construct the pipeable object, while the operator and the predicate are given, we don’t have a sourceGenerator to bind yet. This is the generator to which the operator is applied, i.e. the object that is at the left side of the pipe and that is passed as an argument to pipeable<>::pipe(). So we need to do a little currying and use a std::bind placeholder in its stead:

std::bind(op, std::placeholders::_1, predicate);

The resulting workflow is the following:

// Given an input sequence,
auto gen = Integers();

// a predicate,
auto pred = [](int n){ return (n % 2) == 0; };

// and a query operator:
Where<int> opWhere;

// Binds the predicate to the operator
auto opWhereBoundToPred = std::bind(opWhere, std::placeholders::_1,
    std::move(pred));

// Creates a pipeable object
pipeable<decltype(opWhereBoundToPred)> pipeableWhereBoundToPred(opWhereBoundToPred);

// Now a pipe expression executes the operator replacing the placeholder with
// the input sequence:
gen | Where(pred) =>
gen | pipeableWhereBoundToPred =>
Pipeable::pipe(gen, pipeableWhereBoundToPred) =>
opWhereBoundToPred(gen)

Pipeable factories

The next step is to encapsulate all this logic into a class that acts as a factory for our pipeable types.

template<typename Op>
struct pipeable_factory
{
    // an invokable object of an operator class.
    Op _op;

    // Binds _op to its arguments, using a placeholder for the first argument.
    template<typename...Args>
    auto operator()(Args&&... args) const
    {
        return make_pipeable(std::bind(_op, std::placeholders::_1,
               std::move(args)...));
    }

private:
    template<typename Binder>
    static pipeable<Binder> make_pipeable(Binder pipeableBinder)
    {
        return { std::move(pipeableBinder) };
    }
};

Class pipeable_factory is templated on a query operator type (a type like class Where above) and it is also an invokable type. Its call operator uses the magic of variadic templates to accept a variable number of arguments and forward them to the std::bind function described before, then uses the result to instantiate and return a pipeable object.

So now in order to be able to execute an operator in a pipe expression like:

auto evenIntegers = Integers() | Where( [](int n) { return (n%2)==0; });

we just need to somehow transform Where(pred) into a pipeable_factory for the Where operator. We can do this very simply, by renaming the operator and constructing a single, constant instance of the pipeable_factory for Where:

struct where_op
{
    template<typename T, typename Pred>
    generator<T> operator()(generator<T> gen, Pred pred)
    {
        for (auto n : gen) {
            if (pred(n)) {
                __yield_value n;
            }
        }
    }
};
constexpr pipeable_factory<where_op> Where{};

Now finally “gen | Where(pred)” works as expected: at the right side there is a factory which binds the predicate to where_op::operator() and constructs a pipeable object. Evaluating the pipe means invoking pipeable<>::pipe(), which calls the bound method, replacing the placeholder with the generator object that is at the left side of the pipe. The result is that the pipe expression is evaluated as a call to where_op::operator()(gen, pred).

And since the result of this pipe expression can itself be a generator<>, and the bitwise-OR operator | is left-to-right associative, we also get composability.

In other words, it is possible to chain a sequence of operators, exactly like LINQ does:

auto result = Integers() | Where(isEven) | Select(toString);

and we can iterate over the result with a simple range loop:

for (auto s : result) {
    cout << s << endl;
}

Here we have introduced a second operator, Select, which projects each element of a sequence into a new form by applying a transform function:

struct select_op
{
    template<typename T, typename Fun>
    generator<typename std::result_of<Fun(T)>::type>
    operator()(generator<T> gen, Fun fun)
    {
        for (auto n : gen) {
            __yield_value fun(n);
        }
    }
};
constexpr pipeable_factory<select_op> Select{};

More operators

Operators like Select and Where only take two arguments, a generator and a callable object. But there are cases a little more challenging; for example the operator Zip, which “joins” two sequences together by applying a specified function to the corresponding i-th elements of each sequence, producing a sequence of the results.

We could think of writing Zip in the same way, simply adding an additional argument for the second sequence:

class zip_op
{
public:
    template<typename TFirst, typename TSecond, typename Fun>
    generator<typename std::result_of<Fun(TFirst, TSecond)>::type>
      operator()(generator<TFirst> first,
                 generator<TSecond> second,
                 Fun resultSelector) const
    {
        auto it1 = first.begin();
        auto it2 = second.begin();
        while (it1 != first.end() && it2 != second.end()) {
            __yield_value resultSelector(*it1, *it2);
            ++it1;
            ++it2;
        }
    }
};
constexpr pipeable_factory<zip_op> Zip{};

But this would not compile (with VS2015 CTP) and I am not totally sure if it should. It fails to bind all the arguments to the variadic pipeable_factory::operator() seen before. I suspect that this could work if I could use return type deduction in zip_op::operator() but unfortunately this is not an option, since we need to explicitly return a generator<T> type.

What we can do instead is to write operator() as a variadic template function, which can take any number of arguments in its parameter pack. This function can use return type deduction as long as it does not really *implement* a resumable function but only *calls* a resumable function. To generalize the solution, we can put the variadic call operator in a base class generator_operator from which all our query operators should inherit. And each operator class will now lazily do its work in a resumable function exec() that returns a generator<T>.

Putting this idea into code, we have now:

// A generator operator is an invokable object templatized on
// an operator class. The template type Op represents a LINQ-like
// operator that works on lazy generators.
template <typename Op>
struct generator_operator
{
    template<typename Gen, typename... Rest>
    auto operator()(Gen&& gen, Rest&&... rest) const
    {
        return Op::exec(std::move(gen), std::move(rest)...);
    }
};

class select_op : public generator_operator<select_op>
{
    friend struct generator_operator<select_op>;

    template<typename T, typename Fun>
    static generator<typename std::result_of<Fun(T)>::type>
    exec(generator<T> gen, Fun fun)
    {
        for (auto n : gen) {
            __yield_value fun(n);
        }
    }
};
constexpr pipeable_factory<select_op> Select{};

class where_op : public generator_operator<where_op>
{
    friend struct generator_operator<where_op>;

    template<typename T, typename Fun>
    static generator<T> exec(generator<T> gen, Fun fun)
    {
        for (auto n : gen) {
            if (fun(n)) {
                __yield_value n;
            }
        }
    }
};
constexpr pipeable_factory<where_op> Where{};

Note that the operator classes are stateless: the exec function can be static, and all the state is captured in the closures. The function exec can even be private, since it is always called through the base generator_operator class, which is a friend of the operator class.

The result is quite clean, in my opinion: if we now want to add a new resumable operator all we have to do is to:

  1. write a new generator_operator-derived class with a method exec that returns a generator<T>, and
  2. declare a pipeable_factory<> for this new type.
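
For example, following this recipe, the Zip operator that refused to compile before can be reworked like this. (This is just my own sketch of how it could look in this scheme; the version in the library sources may differ in the details.)

class zip_op : public generator_operator<zip_op>
{
    friend struct generator_operator<zip_op>;

    template<typename TFirst, typename TSecond, typename Fun>
    static generator<typename std::result_of<Fun(TFirst, TSecond)>::type>
    exec(generator<TFirst> first, generator<TSecond> second, Fun resultSelector)
    {
        auto it1 = first.begin();
        auto it2 = second.begin();
        while (it1 != first.end() && it2 != second.end()) {
            __yield_value resultSelector(*it1, *it2);
            ++it1;
            ++it2;
        }
    }
};
constexpr pipeable_factory<zip_op> Zip{};

// Possible usage: sum two integer sequences element by element.
// auto sums = Integers() | Zip(Integers(), [](int a, int b) { return a + b; });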

More about generator<T>

So far nothing seems to prevent us from making these lazy operators work with any range type, even eager ranges, or with any type that defines begin() and end() methods that return input iterators.

However, we are using resumable functions to implement operators for lazy algorithms, and we have seen that a resumable generator function can only return the type generator<T>, not auto. Class generator<T> has some other peculiarities that must be taken into account.

If we look at its code (in the header file C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\experimental\generator in the VS CTP) we see that class generator is moveable but not copyable.

template <typename _Ty, typename _Alloc = allocator<char> >
struct generator
{
    struct promise_type { ... };

    struct iterator { ... };

    iterator begin() {
        ...
        return {_Coro};
    }
    iterator end() {
        return {nullptr};
    }

    generator() = default;
    ~generator() { ... }

    generator(generator const&) = delete;
    generator& operator = (generator const&) = delete;

    generator(generator && _Right) : _Coro(_Right._Coro) {
        _Right._Coro = nullptr;
    }
    generator& operator = (generator && _Right) {
        if (&_Right != this) {
            _Coro = _Right._Coro;
            _Right._Coro = nullptr;
        }
        return *this;
    }

private:
    resumable_handle<promise_type> _Coro;
};

The copy constructor and copy assignment operators are deleted. To understand why, it is useful to look at a class diagram for the generator’s code:

[Figure: generator<T> class diagram]

Generators are based on stackless coroutines, which work (as described here) by allocating an activation frame, encapsulated by the class resumable_handle. In this design a resumable_handle provides the functionality to suspend and resume the execution of a function. Every instance of a generator<T> creates and owns a resumable_handle object, and shares a reference to this object with its iterators. This is why the generator<T> type cannot be copyable: we should never have two instances of a generator that refer to the same coroutine.

But the type can be moveable, so the move constructor and move assignment operator are declared and work by transferring the ownership of the resumable_handle from one object to another.
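
In practice, this means that a generator can only be transferred, never duplicated; something like this compiles only through a move:

auto g1 = Integers();
// auto g2 = g1;           // does not compile: the copy constructor is deleted
auto g2 = std::move(g1);   // OK: ownership of the resumable_handle is transferred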

The non-copyability must be taken into account in our implementation. Every time we pass a generator<T> in a function call, we need to pass an rvalue reference (generator<T>&&). The only, very important exception is in the generator_operator<> call operator. As we have seen, it is written like this:

template <typename Op>
struct generator_operator
{
    template<typename Gen, typename... Rest>
    auto operator()(Gen&& gen, Rest&&... rest) const
    {
        return Op::exec(std::move(gen), std::move(rest)...);
    }
};

We explicitly pass an rvalue reference to the call operator, but we must make sure that in the call to exec() the move constructor is called, to pass the ownership of the resumable handle to a new instance of generator<T> allocated in the scope of the exec() resumable function:

struct where_op
{
    template<typename T, typename Pred>
    static generator<T> exec(generator<T> gen, Pred pred) {...}
};

Copying the rvalue reference would not work:

struct where_op
{
    template<typename T, typename Pred>
    static generator<T> exec(generator<T>&& gen, Pred pred) { ... }
};

This error is quite tricky and insidious to debug. The problem is that with resumable functions one must always make sure that the lifetime of the generator object exceeds the lifetime of the resumable function that uses it. Note that when we evaluate a pipe, like:

auto evenNumbers = Integers() | Where([](int n) { return n % 2 == 0; });

the generator’s destructor is called after the expression is evaluated, well before we can start iterating over the result. So, in the where_op::exec() coroutine we would have a reference to a destroyed object. But if we use the move constructor instead, then there is a new generator<T> object that lives in the scope of the exec() function and that takes ownership of the resumable_handle. And since exec() is actually a resumable function, this new generator<T> object is allocated in the coroutine activation frame and not on the stack. So, the lifetime of this generator<T> object coincides with the lifetime of the resumable function, and everything works.

For all these reasons I designed the lazy operators to work only on generator<T>s, not on any kind of ranges (like STL containers). But it is very easy to write an adapter function that converts an eager range into a lazy generator:

template<typename Src, typename T = typename Src::value_type>
static generator<T> make_generator(const Src& src)
{
    for (const T& t : src) {
        __yield_value t;
    }
}
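
For example, with this adapter an STL container can be turned into a lazy sequence and piped through the operators defined above (as long as the container outlives the iteration, given the lifetime issues discussed earlier). Something like this should work:

std::vector<int> numbers{ 1, 2, 3, 4, 5, 6 };
auto evenStrings = make_generator(numbers)
    | Where([](int n) { return n % 2 == 0; })
    | Select([](int n) { return std::to_string(n); });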

Examples

Do you still remember the problem from which we started? We can now finally write our own solution to the problem of generating the sequence of Pythagorean triples. All we are missing is an implementation of Range, to generate a sequence of integers:

generator<int> Range(int from, int count)
{
    for (int i = 0; i < count; i++)
    {
        __yield_value (from + i);
    }
}

And finally:

generator<tuple<int, int, int>> Triples(int max)
{
    for (auto z : Range(1, max - 1)) {
        for (auto x : Range(1, z - 1)) {
            for (auto y :
                Range(x, z - x) |
                Where([x, z](int y) { return x*x + y*y == z*z; }))
            {
                __yield_value tuple<int, int, int>(x, y, z);
            }
        }
    }
}
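
To mirror the Haskell program at the beginning of the post, we can then iterate over the resulting sequence (here already bounded by max, so no Take operator is needed) and print the triples:

for (auto& t : Triples(20)) {
    std::cout << std::get<0>(t) << ", "
              << std::get<1>(t) << ", "
              << std::get<2>(t) << std::endl;
}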

As another, and final, example we can write a (slow) prime numbers generator. We need an additional operator for this, the operator All that returns true if all the elements in a sequence satisfy a given predicate:

class all_op : public generator_operator<all_op>
{
    friend struct generator_operator<all_op>;

    template<typename T, typename Fun>
    static bool exec(generator<T> gen, Fun fun)
    {
        for (auto n : gen) {
            if (!fun(n)) {
                return false;
            }
        }
        return true;
    }
};
constexpr pipeable_factory<all_op> All{};

All shows that a lazy operator does not necessarily need to return a generator; it can return any type. With All we can finally write:

generator<int> Primes(int max)
{
    return Range(2, max - 1) |
        Where([](int i) {
            return Range(2, (int)sqrt(i)) |
                All([i](int j) { return (i % j) != 0; });
        });
}
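
For example, we can print the resulting sequence with the usual range-based loop:

for (auto p : Primes(100)) {
    std::cout << p << " ";
}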

But these are just a few basic examples. If you are interested, you can find here:

the sources of this small library, with the implementation of a number of LINQ-like operators, like Aggregate, All, Any, Concat, Contains, Count, ElementAt, Empty, Except, First, Last, Max, Min, Range, Repeat, Reverse, Select, Skip, SkipWhile, Sum, Take, TakeWhile, Where, Zip.
The code is definitely not production-quality: error and exception handling are mostly absent, and the unit tests are minimal. It is just an example of what we will be able to do with lazy generators, when they become part of C++.

Stackless coroutines with Visual Studio 2015

Let’s talk about… coroutines! Once again, for a change! 🙂

In my last posts I talked about a Microsoft proposal to add support for resumable functions and the await keyword to the language, and described how this feature had been implemented in one of the first previews of VS 2015. Almost a year has passed since then, and there are a few noteworthy updates, so it is time for a new post. In fact, there is a new proposal for resumable functions (N4286) that addresses some of the limitations of the past proposal, and that has been prototyped in the latest CTP of Visual Studio 2015. But let’s start from the beginning.

C++17 Concurrency TS

I had been looking for some time now at the problem of implementing coroutines/resumable functions in order to have even in C++ something similar to what is provided by C# await and yield statements.

It turns out that – unbeknown to me 🙂 – this is quite a hot topic in the C++ community. The next major version of the standard is expected for 2017 and, in preparation for this, the activity of the standard committee has been organized in 8 groups working on 8 technical specifications, for things like a File System API, a Networking API, Concepts-Lite, Transactional Memory, Parallelism and Concurrency. (See http://isocpp.org/std/status for more details).

In particular in the Concurrency TS people are discussing how to improve std::futures to make them more composable, and how to introduce executors (task schedulers) and resumable functions. And while the work on extended futures and executors should only involve updates to the standard library, resumable functions could require changes to the language itself. To be honest, it seems unlikely that these changes will be approved in time for 2017, because there is still no agreement in the committee, but even so, this seems a very interesting topic (well, it is for me, at least…).

Stackful?

In the old proposal (N3858) Gustafsson presented the advantages of having language support for resumable functions, proposed the introduction of the await and yield keywords, and described two possible implementations, stackful and stackless. The first, based on resumable side stacks (fibers), worked by giving each coroutine its own stack, which was associated with a “real” OS thread only when the coroutine was actually running. The second, based on heap-allocated activation frames, requires the compiler to do significantly more work: it has to allocate on the heap a data structure to contain the arguments and local variables of the coroutine, and to generate the code for a state machine that manages the suspension and resumption points.

The Visual Studio 2013 CTP came with a stackful, fiber-based implementation of the await feature, which I described in a previous post. But stackful coroutines have a big disadvantage: fibers are VERY expensive. By default, a fiber is allocated with the default stack size of a thread, which is 1 MB in Windows. Clearly this makes it impossible to have too many coroutines running concurrently; just a few thousand coroutines would completely exhaust the virtual memory of a 32-bit process.

Stackless!

The new proposal (N4286) instead focuses mostly on a stackless implementation, and promises to be scalable to billions of concurrent coroutines. The goal, again, is to be able to have functions that can be suspended, waiting for some asynchronous operation to complete, and then automatically resumed after the operation has completed. For the users, this means being able to explicitly await an asynchronous completion with await statements, as is done in C#.

As an example of an asynchronous operation let’s imagine that we have a function that sleeps asynchronously for a given time interval:

std::future<void> delay(int ms)
{
    return std::async([ms]() {
        this_thread::sleep_for(std::chrono::milliseconds(ms));
    });
}

The proposal would make it possible to wait on the asynchronous operation using await:

std::future<void> testAwait()
{
    await delay(2000);
}

Also, resumable functions make it possible to have lazy generators (similar to C# iterator blocks) with the yield statement:

std::generator<int> fibonacci(int max)
{
    int a = 0;
    int b = 1;
    while (a <= max) {
        yield a;
        int next = a + b; 
        a = b; 
        b = next;
    }
}

The syntax and the semantics of coroutines remain more or less the same as in the old version of the proposal; the changes are mostly in the way these coroutines should be implemented using activation frames. And since I think I have already talked more than enough about resumable functions and generators in the past, in this post I’d like to focus on the details of this new implementation (which are much better explained in the standard document, anyway).

In fact, making a resumable function using a fiber, while expensive, is not very complicated: after all, a fiber is already a system-level implementation of a coroutine. But making resumable functions with just an activation frame requires much more work from the compiler, which needs to alter the structure of the code, turning it into a sequence of chained continuations.

Visual Studio 2015 preview

To have a look at how stackless coroutines of N4286 are supposed to work we can play with a first prototype which is already available in the CTP version of Visual Studio 2015, freely available for download here.

The only meaningful difference between this prototype and the standard proposal is that in the prototype we’ll have to deal with PPL tasks rather than std::futures. This is because N4286 is based on an improvement of standard std::futures, extended to support the chaining of continuations with std::future::then() like it’s done in boost::future. But, as we know, schedulers and continuation chaining are not yet part of the standard; therefore the VS prototype replaces futures with PPL tasks and utilizes the Concurrency runtime for their scheduling. But this is just an implementation detail: the term “task” in what follows should be read just as an alias for the “std::future” of the future.

Let’s start with await. We can rewrite with the PPL the method delay defined above:

Concurrency::task<void> delay(int ms)
{
    return Concurrency::create_task([ms]() {
        this_thread::sleep_for(std::chrono::milliseconds(ms));
    });
}

At this point we can use the new keyword __await (with a double underscore, because not yet standard):

Concurrency::task<void> testAwait()
{
    __await delay(2000); // suspension point (suspends testAwait, resumes delay).
}

In order to compile this code with the VS2015 CTP we first need to enable a compiler extension by adding the /await flag to the compiler options for our project. To use this new feature we also need to include the header file <pplawait.h>.

If we run this with the debugger we can see that the execution really suspends at the __await point and the control goes back to the caller of our testAwait function, which can then go on doing other stuff. After two seconds the delay task completes and the execution of testAwait resumes from where it had suspended. Normally it will resume on a different thread, one of the worker threads used by the PPL scheduler.

There are other limitations in the current CTP implementation: only the x64 platform is supported, not Win32, and stack-frame run-time checks (/RTC) need to be disabled because they are not yet compatible with /await. But these are just minor issues; after all, this is just a prototype and is not supposed to be used for production code.

Coroutine handles

But how does all this work? In a previous post I investigated some of the details of the stackful prototype that came last year with the VS2013 CTP. Let’s do the same with this new implementation in the latest CTP.

Most of the library logic for this feature ships in two header files (resumable and generator) located in the \Microsoft Visual Studio 14.0\VC\include\experimental\ folder.

As we said, in a stackless implementation a coroutine is associated with an activation frame that is allocated somewhere in the heap and that contains the status of the coroutine together with all the data required to manage its suspension and resumption. In N4286 the frame is an instance of the compiler-generated class coroutine_handle.

In the VS2015 CTP an activation frame is encapsulated by an instance of the resumable_handle type (this was the name used for coroutine_handle in a previous version of the standard document).
A resumable handle owns the block of heap memory that represents the activation frame for a function. In general, the data in a frame is organized as in the figure: first there is the space for a special coroutine promise type, followed by the space for the “suspend context” and the local state (parameters and local variables).

[Figure: layout of a resumable_handle activation frame]

Coroutine promises

The promise class is a library class that implements the semantics of a particular type of coroutine (like await or generators). It must implement all the logic to pass return values and exceptions back from asynchronous computations, with methods like:

void set_result(T val) 
T get_return_object(resumable_handle rh)
void set_exception(E e)
AwaitType yield_value(T val)
bool cancellation_requested()
...

We’ll see later how the C++ compiler makes use of these functions in the code that it generates for a suspension/resumption point.
In the VS2015 CTP, the logic of a promise for await statements is in class resumable_traits<RetType, …Args>::promise_type. This traits class (renamed as coroutine_traits<> in the latest version of the proposal) is templated on the return and argument types of the resumable function and aggregates all the type-dependent types and methods, like the activation frame allocator.

Awaitable types

Together with resumable handles and promises, the final piece of the picture is awaitable types. An awaitable type is a type for which a library provides the support for await statements, by implementing the following three functions:

// Returns true if the result of the asynchronous operation is already available.
bool await_ready(T& t) const

// Suspends the execution until the asynchronous operation completes.
void await_suspend(T& t, resumable_handle<> rh)

// Resumes the execution from where it had suspended.
void await_resume(T& t)

For example, in the sample code above, when we call:

    __await delay(2000);

the type on which we want to await is a PPL task. So, with our await statement we are asking the compiler to generate code that suspends on a task, and resumes after the task is completed. The await compiles because the new library provides these three functions for PPL tasks:

bool await_ready(task& t) const {
    return t.is_done();
}
void await_suspend(task& t, resumable_handle rh) {
    t.then([rh](task&){ rh(); });
}
void await_resume(task& t) {
    t.get();
}

From this, it should be clear that the mechanism is easily extensible to new types. We can await on any type for which we can provide the three await functions.
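
As a purely illustrative sketch (this is not code from the CTP headers; the sleep_awaitable type and the handle template parameter are my own assumptions), we could make a simple timer type awaitable just by providing the same three functions, resuming the coroutine from another thread once the interval has elapsed:

#include <chrono>
#include <thread>

// Hypothetical awaitable type: suspends the coroutine for a given duration.
struct sleep_awaitable
{
    std::chrono::milliseconds duration;
};

bool await_ready(const sleep_awaitable&)
{
    return false; // the "operation" never completes synchronously
}

template <typename ResumableHandle>
void await_suspend(const sleep_awaitable& a, ResumableHandle rh)
{
    // resumable_handle is callable (as we will see below): calling it resumes the coroutine.
    std::thread([a, rh]() {
        std::this_thread::sleep_for(a.duration);
        rh();
    }).detach();
}

void await_resume(const sleep_awaitable&)
{
    // nothing to return
}

// Hypothetical usage: __await sleep_awaitable{ std::chrono::milliseconds(2000) };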

Behind the scenes…

But what happens exactly when we compile a resumable function like this, with one or more await statements? Let’s take again this function as example:

task<R> testAwait(T1 param1, T2 param2, …)
{
    // *** code before the __await statement

    __await delay(2000); // suspension point (suspends testAwait, resumes delay).

    // *** code after the __await statement
}

Given the presence of suspension points, the compiler transforms the code quite radically, leveraging the library code for coroutine handles, promises and awaitable types described before. In practice, the function above is transformed into this (simplified) pseudo-code:

task<R> testAwait(T1 param1, T2 param2, …)
{
    // Allocates coroutine activation frame.
    auto resHandle = _Resumable_helper_traits::_Alloc(frameSize, resumeAddress);
    
    // Creates a task to be returned to the caller.
    // This task is associated to the task completion handle of the promise.
    task<R> taskToReturn = resHandle.promise().get_return_object();

    // *** Here goes all the code before the __await statement

    // Calls the awaitable function and receives the task on which we want
    // to asynchronously block.
    auto taskToAwait = delay(2000);
  
    // If the task has already completed, we have nothing else to do 
    if (! Concurrency::await_ready(taskToAwait))
    {
        // Suspends the coroutine and adds ‘resHandle’ as continuation to be called
        // when the task ‘taskToAwait’ completes.
        Concurrency::await_suspend(taskToAwait, resHandle);
        return taskToReturn;
    }

Resume:
    // we get here when the await has completed
    auto result = Concurrency::await_resume(taskToAwait);

    // *** code after the __await statement

    _Resumable_helper_traits::_Free(resHandle);
    return taskToReturn;
}

Let’s look at this in detail. First, a single resumable handle for the whole function is allocated. This is an instance of a template class, specialized on the types of all the function parameters and on the return type. The frame size is calculated at compile time to also hold the local variables declared in the function. One of the arguments to _Alloc() is resumeAddress, a pointer to the callback function that will be called when the asynchronous operation completes.

Any resumable function must return a task, so the next step is to prepare the promise and the task that will be returned to the caller.

After this preamble, the code of the function executes normally, in a synchronous way, until it gets to an await (or yield) statement. In our case we want to await on the task returned by a call to delay(timeInMs) and, as we saw, there is library support for awaiting on tasks. The compiler adds a call to await_ready to check if the task has already completed. If this is the case, we can skip the asynchronous part and just get the task result. But if the task has not yet completed we need to suspend the execution. This is done by returning our return-value task to the caller, so that the caller is able to carry on doing other work or maybe also awaiting asynchronously on our task to complete.

Suspending a function is relatively simple, but how will the function resume its execution? Before returning the result-task to the caller, a call to await_suspend is made to chain a continuation on the task on which we are awaiting. One of the parameters here is the resumable handle itself, which represents the coroutine state for this resumable function:

template <class _Ty, typename _Handle>
void await_suspend(task<_Ty>& _Task, _Handle _ResumeCb)
{
    _Task.then([_ResumeCb](task<_Ty>&) {
        _ResumeCb();
    });
}

So, now the method testAwait has suspended and returned a task to the caller. The caller is free to do other things, or it can also suspend, waiting for the task we returned to complete. Meanwhile, life goes on, the scheduler keeps scheduling, time goes by, and at some point the two seconds of the delay task elapse and a call to our continuation can be scheduled to run.

The class resumable_handle implements the call operator (), which is indeed the callback that will be called after the awaited task completes. This is where task continuations meet coroutines: the call operator simply forwards the call to the resumeAddress function that the compiler passed as an argument in the allocation of the resumable frame. In this way, the execution can resume from where it had left off. The details are a little more complicated: since a function can have multiple suspension/resumption points, the VS compiler seems to implement a small state machine to restart the execution at the correct location. But, simplifying, for the pseudocode above it is as if there were a

goto Resume;

in order to jump to the Resume: label and complete the await operation, retrieving the result of the completed task and returning it to the caller as a result of the returned task.

Generators

The other asynchronous constructs made possible by coroutines are lazy generators: functions that produce a (possibly infinite) sequence of values that can be pulled one at a time.

These are the equivalent of C# iterator blocks or Python generators; the function suspends after having “yielded” each value, and resumes execution when the next value is requested, until the sequence completes.

Going back to the usual example, this is a generator for the Fibonacci sequence, written with the syntax of VS2015 (where the keyword __yield_value is used in place of yield).

generator<int> fib(int max)
{
    int a = 0;
    int b = 1;
    while (a <= max) {
        __yield_value a;
        int next = a + b;
        a = b;
        b = next;
    }
}

The function returns an object of type generator, which behaves like any STL sequence whose range is defined by begin() and end() methods. Therefore the client code to enumerate the values in the sequence can be as simple as a range-based for loop:

for (auto n : fib(1000)) {
    std::cout << n << std::endl;
}

In the VS2015 CTP, the code for generators can be found in the file \Microsoft Visual Studio 14.0\VC\include\experimental\generator. I could dig into this code and try to describe all the details, but I would risk becoming repetitive and sounding like a broken record. In fact, the implementation of the prototype for generators turns out to be very similar to the idea I sketched in one of my last posts. (This, of course, is not because my idea was particularly brilliant but only because there are not many other ways to write generators using coroutines :-)).

Just to have a quick overview, this is a very simplified version of the implementation:

template<typename T>
class generator
{ 
    class iterator
    {
        resumable_handle _coro;
  
        iterator(resumable_handle rh) : _coro(rh) {}
        iterator(nullptr_t) {} // represents the end of the sequence
        iterator operator ++ () {
            _coro(); // resumes execution
            return *this;
        }
        bool operator == (const iterator& rhs) {
            return _coro == rhs._coro;
        }
        const T& operator * () const {
            return _coro.promise()._CurrentValue;
        }
    };

public:
    generator(resumable_handle rh) : _coro(rh) {}

    iterator begin() {
        _coro(); // starts the coroutine and executes it until it terminates or yields.
        return iterator(_coro);
    }
    iterator end() { return iterator(nullptr); }

private:
    resumable_handle _coro;
};

We see that in order to compile a function like Fibonacci above, the compiler must allocate a resumable_handle and use it to construct an object of type generator<T>.

The role of the generator class is just to provide the semantics of a range, with input iterators returned by begin() and end(); all the logic for the lazy generation is in the iterator class.
An iterator holds the resumable_handle that was passed to the generator, and uses it to resume and suspend its execution. Even here, a resumable handle works as a wrapper for the status of a coroutine. There are just small differences with respect to await: here a different type of promise is used (of class generator::promise_type), designed to support yield_value and to store the last value yielded, so that it can later be pulled by dereferencing the iterator.

The result is quite neat, in my opinion: C#-like generators will be very useful, especially if they can be composed with LINQ-like operators.

But again, for more details, I’d refer to this post, or (even better) to the standard document.

Conclusion

To summarize this brief excursus into the new proposal for resumable functions, there are a few points worth noting:

  1. Resumable functions can be used to add support for await and yield in C++.
  2. They can be implemented with stackless coroutines in order to be more scalable (at the price of greater complexity in the compiler).
  3. Stackless coroutines work allocating a single activation frame for each invocation of a resumable function.
  4. The implementation requires the support of improved std::futures (composable with then()), which are also being standardized.

Hopefully, something like that will be part of the standard soon.

(Note: slightly edited on 2015/3/8 to clarify a few points).