Galaga: an Arcade machine emulator for Windows and HTML5

Retrogaming

Galaga was a very popular arcade game in the ’80s. When I was a kid I must have fed so many coins into that game that, had I invested them better, today I could probably drive a Ferrari. 🙂

pic1

One of the scenes that made the 1983 movie “WarGames” endearing to me is the one where the main character, David (Matthew Broderick), plays Galaga before going to school, and not even the presence of his beautiful classmate Jennifer (Ally Sheedy) can really distract him from the game… 🙂

pic2

A few years ago I must have had really too much free time, so I took on the toy project of writing an emulator for the Galaga arcade machine. By emulation I don’t mean writing a game “just like” Galaga (which would not be a trivial task anyway) but really emulating the hardware of that old arcade machine in all its details: the ISA we want to emulate is that of the old Z80 processor, and the program we want to run is the one stored in the original ROMs. Of course I knew that MAME and MESS already existed, and I snooped through their sources to find out how the machine actually worked.

My initial goal was to have something running on a browser, so not tied to the Windows platform, and to learn something about software emulation and virtual machines in the process. For this reason I chose to use first Silverlight as host, a platform that looked very promising at the time. After that I quickly ported the code from C# to C++ (even to see what the speed-up could be) and made it run as a WinRT app, more recently converted into a UWP app for Windows 10.

Nothing special, so far. I was just doing for a single game what MAME has done better for thousands of games. But more recently I discovered the wonderful world of Emscripten and I wondered: how difficult can it be to make the engine REALLY PORTABLE, compiling it to JavaScript and targeting HTML5? And can JavaScript run fast enough to emulate an arcade machine? That’s what we are going to find out! In this blog post I’ll present a few technical notes on this toy project, a small journey from the initial idea to the final HTML5 implementation.

The machine

But let’s start from the beginning: how did an arcade machine from the 80s work? And how difficult is it to emulate it?

The following picture presents a high-level view of the Galaga hardware:

pic3

Interestingly, Galaga was multi-core. There were three Z80 8-bit processors, which shared the same address space and RAM but each executed a different program, stored in a separate set of ROMs. Other ROMs contained the graphics (sprites and tiles) and the audio waveforms used to generate sound effects.

There was a form of inter-processor communication based on interrupts: the Z80 handles maskable interrupts by executing an interrupt handler located at a fixed address in ROM. In Galaga the interrupt handler managed a vector of commands with a jump table, so the main CPU could invoke a procedure on one of the other processors simply by setting the index of a command in a location in shared memory and signaling an interrupt.
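The command-dispatch pattern described above can be sketched in C++ as follows; the names (SharedRam, kCommandAddr) and the table contents are purely illustrative, not the emulator’s actual API:

```cpp
#include <array>
#include <cstdint>

// Shared work RAM and the (made-up) location of the command index.
static uint8_t SharedRam[0x400];
constexpr uint16_t kCommandAddr = 0x20;

// A few hypothetical commands the second CPU knows how to execute.
static int lastExecuted = -1;
static void CmdNop()        { lastExecuted = 0; }
static void CmdPlaySound()  { lastExecuted = 1; }
static void CmdMoveSprite() { lastExecuted = 2; }

using Command = void (*)();
static const std::array<Command, 3> JumpTable = { CmdNop, CmdPlaySound, CmdMoveSprite };

// Main CPU side: store the command index in shared memory,
// then signal an interrupt (the signaling itself is omitted here).
void RequestCommand(uint8_t index) { SharedRam[kCommandAddr] = index; }

// Interrupt handler on the other CPU: look the index up in the jump table.
void OnInterrupt()
{
    uint8_t index = SharedRam[kCommandAddr];
    if (index < JumpTable.size())
        JumpTable[index]();
}
```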

Besides the main CPUs there were a number of custom chips dedicated to different functions. There was one just to handle the graphic effects of the star field in the background; others were used to render the graphics, to manage the input controls and to produce the sound effects.

To play Galaga we need to emulate the workings of all its components. All the pieces must be connected as they were on their circuit board, and for this we need detailed information about the original schematics. Luckily, besides MAME, there are a number of sources for this information on the Internet. Galaga’s popularity meant that there has been a great deal of low-level hacking over the years. For example, someone disassembled the entire 24 KB of Z80 code from its ROMs (1, 2, 3), to the point of being able to find the root cause of, and fix, the famous bug that made the alien bees stop firing after a while. Another electronics wizard wrote the code to emulate the sound hardware on a PIC.

The Emulator

I wrote the machine emulator engine in C++, making sure that it compiled with the latest versions of Visual Studio and Clang, so that it could be used to drive different hosts, possibly running on different platforms.

Processors

The first thing I needed was a software emulator for the Z80 microprocessor. The emulator, of course, must run in real time, so it is important that the emulation of the Z80 instruction set be performed very efficiently. The original machine had three Z80 processors running at ~3 MHz, so we need to make sure that the time spent emulating these chips is just a fraction of the machine’s nominal execution time. In other words, we want to emulate 10 msec of machine time in less than 10 msec.
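As a back-of-the-envelope check, the per-quantum cycle budget can be computed like this; the 3.072 MHz clock is the commonly quoted figure for Galaga’s Z80s and should be treated as an assumption of this sketch:

```cpp
// How many Z80 cycles must be emulated per quantum of machine time,
// across all three processors.
constexpr double kCpuClockHz = 3.072e6;  // assumed per-CPU clock
constexpr int    kNumCpus    = 3;

constexpr double CyclesPerQuantum(double quantumMsec)
{
    return kNumCpus * kCpuClockHz * quantumMsec / 1000.0;
}
// For a 10 msec quantum this is 92,160 cycles in total; the interpreter
// must get through them in well under 10 msec of host time.
```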

There are two main techniques for emulating an instruction set (see Smith, Nair, “Virtual Machines”). The most efficient one is binary translation, which converts whole blocks of a source program into a customized binary program in the target ISA. A much simpler technique is the “decode-and-dispatch interpreter”, which works by fetching one instruction at a time, analyzing and executing it, and modifying the source state.

In our case more than 20 years have passed since the days of 8-bit processors, so we can afford to choose the simplest technique and write an interpreter for the Z80. Even so, this turns out to be the most computationally expensive part of the emulator, so it’s important to optimize this code carefully.
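A minimal decode-and-dispatch core might look like the following sketch. It implements just two opcodes of the real Z80 encoding and uses made-up structure names, but it shows the fetch/decode/execute loop and how a quantum of time maps onto cycle counting:

```cpp
#include <cstdint>

struct CpuState {
    uint16_t pc = 0;       // program counter
    uint8_t  a  = 0;       // accumulator
    uint64_t cycles = 0;   // cycles executed so far
};

// One fetch/decode/execute step. Opcodes use the Z80 encoding:
// 0x3C = INC A, 0x00 = NOP (both take 4 T-states).
void Step(CpuState& cpu, const uint8_t* memory)
{
    uint8_t opcode = memory[cpu.pc++];           // fetch
    switch (opcode) {                            // decode and dispatch
    case 0x3C: cpu.a++; cpu.cycles += 4; break;  // INC A
    case 0x00:          cpu.cycles += 4; break;  // NOP
    default:            cpu.cycles += 4; break;  // not implemented here
    }
}

// Run until a target cycle count is reached: this is how a quantum of
// emulated time is converted into instruction execution.
void Run(CpuState& cpu, const uint8_t* memory, uint64_t targetCycles)
{
    while (cpu.cycles < targetCycles)
        Step(cpu, memory);
}
```

The real interpreter has to handle the full instruction set (including prefixed opcodes and flags), but the structure is the same.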

Besides the Z80s, a couple of the custom chips in the machine are based on another, very simple 4-bit microprocessor, which also needs to be emulated; here too we can use a simple “decode-and-dispatch interpreter”.

Memory and I/O

We have seen that the memory space is partially shared by the three main processors: when an address is accessed by a CPU, it may refer to a portion of the shared RAM or to one of the ROMs, which are instead separate per processor, with different ROMs mapped into the same range of addresses for different processors. And this is not the whole story: some of the other chips are also memory-mapped, so the CPU instructions used to access memory can also be used to access these devices. The emulator needs to take care of all this, mapping each address to the right device.
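The address decoding can be sketched as a small “bus” object that routes each access to ROM, shared RAM, or a memory-mapped device register. The ranges below are illustrative, not Galaga’s real memory map:

```cpp
#include <cstdint>

constexpr uint16_t kRomEnd    = 0x4000; // per-CPU ROM below this address
constexpr uint16_t kRamEnd    = 0x4800; // shared RAM up to here
constexpr uint16_t kDeviceReg = 0x5000; // one mapped device register

struct Bus {
    const uint8_t* rom       = nullptr; // this processor's ROM image
    uint8_t*       sharedRam = nullptr; // RAM shared by all three CPUs
    uint8_t        deviceLatch = 0;     // stand-in for a mapped device

    uint8_t Read(uint16_t address) const {
        if (address < kRomEnd)     return rom[address];
        if (address < kRamEnd)     return sharedRam[address - kRomEnd];
        if (address == kDeviceReg) return deviceLatch;
        return 0xFF;               // unmapped
    }
    void Write(uint16_t address, uint8_t value) {
        if (address >= kRomEnd && address < kRamEnd)
            sharedRam[address - kRomEnd] = value;  // writes to ROM are ignored
        else if (address == kDeviceReg)
            deviceLatch = value;                   // device access
    }
};
```

Each of the three emulated CPUs would get its own Bus with a different rom pointer but the same sharedRam pointer.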

Other devices instead use port-mapped I/O and are accessed via a dedicated set of microprocessor instructions (in and out on the Z80) on a separate, dedicated address space. Our emulator, of course, must also take care of this port mapping.

Graphics

The graphics system was similar to that of the 8-bit microcomputers of the time (like the Commodore 64). The screen (224 x 288 pixels) was managed as a tilemap of 8×8-pixel characters, and the graphics system also supported up to 64 sprites, rendered above the tilemap.

The generation of a screen frame takes three steps:

  1. First, the star field in the background is drawn. In the original machine a custom chip moved and drew the stars; its behavior needs to be replicated.
  2. Then the matrix of 28 x 36 square tiles (224 x 288 pixels) is drawn over the star field, reading from video memory the index of the tile to draw and the color to use. Tiles are used for all the text that appears in the game, like the score; the content of each tile (a bitmapped font) is read from a ROM.
  3. Finally, up to 64 sprites are drawn over the tiles. Here too, the size, position and colors of each sprite are stored in a mapped area of video memory, while the sprite bitmaps are read from separate ROMs.
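Step 2 can be sketched like this. For clarity the sketch uses a 1-bit-per-pixel font (the real hardware packs 2 bits per pixel and goes through a palette PROM), so names and formats here are simplifications:

```cpp
#include <cstdint>

constexpr int kWidth  = 224;
constexpr int kHeight = 288;

// Draw one 8x8 tile into the frame buffer. tileRom holds 8 bytes per
// tile, one bit per pixel; set bits get the foreground color.
void DrawTile(uint32_t* frame, const uint8_t* tileRom,
              int tileIndex, int tileX, int tileY,
              uint32_t fgColor, uint32_t bgColor)
{
    const uint8_t* rows = tileRom + tileIndex * 8;
    for (int row = 0; row < 8; ++row) {
        for (int col = 0; col < 8; ++col) {
            bool set = (rows[row] >> (7 - col)) & 1;
            int x = tileX * 8 + col;
            int y = tileY * 8 + row;
            frame[y * kWidth + x] = set ? fgColor : bgColor;
        }
    }
}
```

The full renderer would call this 28 x 36 times per frame, driven by the tile indices and colors it reads from the emulated video memory.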

Audio

Audio synthesis turned out to be by far the most complicated part of this project. In Galaga the audio system is composed of two different chips that generate two audio signals.

The first chip is a 3-voice waveform sound generator. It uses a library of waveforms, read from a PROM, to generate digital samples for sound effects and little tunes.

The second chip, a sound/noise generator based on a 4-bit controller, was used only to generate explosions. This is a discrete sound circuit, made with analog components like operational amplifiers and a DAC. Emulating a discrete circuit was the part where I really had to dig into the MAME sources to be able to move on.

A final mixer merges these two audio streams and generates a sequence of samples in the format of a single PCM channel of 16-bit samples at a sample rate of 48 kHz.
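The mixing stage can be sketched as a per-sample sum with clamping to the signed 16-bit range; the float input range of [-1, 1] per chip is an assumption of this sketch, not the emulator’s actual format:

```cpp
#include <cstdint>
#include <cstddef>

// Mix one sample from each chip into a single 16-bit PCM sample.
int16_t MixSample(float waveformVoice, float noiseVoice)
{
    float mixed = waveformVoice + noiseVoice;
    if (mixed > 1.0f)  mixed = 1.0f;   // clamp to avoid wrap-around
    if (mixed < -1.0f) mixed = -1.0f;
    return static_cast<int16_t>(mixed * 32767.0f);
}

// Mix a whole buffer (e.g. the samples produced in one quantum).
void MixFrame(const float* chipA, const float* chipB,
              int16_t* out, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = MixSample(chipA[i], chipB[i]);
}
```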

Interface

A nice thing about the emulation of a game machine is that, once the engine is written and works, it can easily be “hosted” in different apps.

We can see the arcade emulator as a black box which takes a number of inputs (the state of the button controls, coin insertion) and, when run for some time interval (quantum) of N microseconds, produces a set of outputs: the video frame that results from the execution of the machine and the set of audio samples that were generated in that interval.

pic4

The input interface turns out to be very simple: there is a small set of buttons or switches that can be turned on or off at any moment. Three buttons are used to move the ship horizontally and to shoot, other buttons are used to start the game, and a switch is triggered when a coin is inserted:

void SetInsertCoin(bool value);
void SetStart1Player(bool value);
void SetStart2Players(bool value);
void SetMoveLeft(bool value);
void SetMoveRight(bool value);
void SetButton(bool value);

All the rest of the machine’s behavior can be encapsulated in a single function that runs the machine for a specified time interval:

void Run(double microSeconds, VideoFrame& videoFrame, AudioFrame& audioFrame);

To guarantee smooth video rendering and low-latency audio we need to keep the time interval very small. For example, a quantum of 16.7 msec allows a frame rate of 60 frames/second and reasonably low latency for the sound.

Note that the code of the engine is totally platform-independent: once we have defined the format for audio samples and video frames, as long as we just change the state of the input controls and run quanta of emulation time, we are not doing anything that interacts with the host yet.

So we can reuse the same emulator to experiment with different “host applications”, comparing the technologies and measuring their performance. We have seen that the engine produces screen frames, in the form of arrays of pixels, and audio samples, in the form of arrays of PCM samples; the challenge will be to find the most efficient way to render these frames and to play these samples.

Here I’d like to present three different versions of a host application. Since Windows is still my preferred platform, the first choice was to write the game as a classic, Direct2D-based Win32 app. The natural evolution was then to move to a UWP app for Windows 10. (We’ll see that in both cases Windows provides nice APIs that make the hosting code quite simple.) Finally, we’ll move away from native code towards something very different, an HTML5/JavaScript app, and we’ll see how we can reuse the same C++ code that drove the Windows apps.

pic5

Direct2D

Let’s start with a classic Windows app. In this case Direct2D is probably the best choice to quickly redraw screen frames. Personally I am a big fan of Direct2D and of its simple, low-level interfaces. (In one of my projects at work, resorting to pure Direct2D for the most complex part of our user interface unexpectedly turned out to be a much simpler and faster solution than relying on a more declarative, XAML-like framework… but that is another story :-)).

There are very efficient ways to render video frames with Direct2D, like swap chain buffers, but in our case we don’t need anything so sophisticated; it is sufficient to use a single bitmap to represent the game screen.

The code (error handling omitted) is very simple indeed; we just need to create a render target associated with the app window and a bitmap with a pixel format that matches that of the screen frames produced by the emulator:

    CComPtr<ID2D1Factory> _pDirect2dFactory;
    CComPtr<ID2D1HwndRenderTarget> _pRenderTarget;
    CComPtr<ID2D1Bitmap> _pBitmap;
    ...

    D2D1CreateFactory(D2D1_FACTORY_TYPE_SINGLE_THREADED, &_pDirect2dFactory);
    _pDirect2dFactory->CreateHwndRenderTarget(
        D2D1::RenderTargetProperties(),
        D2D1::HwndRenderTargetProperties(m_hwnd, size),
        &_pRenderTarget);
    D2D1_PIXEL_FORMAT pixelFormat = D2D1::PixelFormat(DXGI_FORMAT_B8G8R8A8_UNORM, D2D1_ALPHA_MODE_IGNORE);
    _pRenderTarget->CreateBitmap(
        D2D1::SizeU(Width, Height),
        NULL, 0,
        D2D1::BitmapProperties(pixelFormat),
        &_pBitmap);

ID2D1Bitmap provides a method CopyFromMemory to fill the bitmap from an array of pixels in memory, and this is all we need.

To schedule the engine we can simply use a Win32 timer; with a timeout of 17 msec we get about 60 frames per second. Every time the timer expires we run the emulator for the same time interval; as a result, a new screen frame will be ready to be copied into the bitmap and then blitted onto the render target and drawn to the window:

{
    _galagaMachine.Run(time, videoFrame, audioFrame);

    // render screen frame
    _pBitmap->CopyFromMemory(&rect, videoFrame.Pixels, pitch);
    _pRenderTarget->DrawBitmap(_pBitmap);

    // render audio (later)
}

We’ll see later how to deal with the audio samples.

UWP App

With Windows 10, we can write a UWP app that also works on mobile. In this case we use XAML and the required markup is extremely simple:

<UserControl
    x:Class="rtGalaga.MainPage"  xmlns=...
    KeyDown="LayoutRoot_KeyDown" KeyUp="LayoutRoot_KeyUp" Loaded="UserControl_Loaded">
    <Grid x:Name="LayoutRoot" Background="#FF0C0C0C">
        <Image x:Name="img" Stretch="Uniform" />
    </Grid>
</UserControl>

We just use an Image element to render the game screen. This is very simple, because the UWP framework provides a mechanism to fill the content of the Image from an array of pixels in memory, by using a WriteableBitmap as the image source:

WriteableBitmap^ _bitmap;
_bitmap = ref new WriteableBitmap(Width, Height);
img->Source = _bitmap;

XAML also provides us with a better and more efficient way to schedule our emulator. The CompositionTarget::Rendering event is fired each time the XAML engine decides to render a frame. Under normal conditions this happens about 60 times per second, but the event will be fired less frequently when the processor is busy:

CompositionTarget::Rendering += ref new EventHandler<Object^>(this, &MainPage::OnCompositionTargetRendered);

Each time the frame rendering is scheduled, we run the emulator for the appropriate amount of time and get a VideoFrame, whose pixels we can copy into the WriteableBitmap:

void MainPage::OnCompositionTargetRendered(Platform::Object^ sender, Platform::Object^ args)
{
    // run the emulator
    galaga->Run(time, videoFrame, audioFrame);
    // render the video frame
    ComPtr<Windows::Storage::Streams::IBufferByteAccess> pBufferByteAccess;
    if (SUCCEEDED(bitmap->PixelBuffer.As(&pBufferByteAccess)))
    {
        // get pointer to pixel bytes
        BYTE* pBuff = nullptr;
        if (SUCCEEDED(pBufferByteAccess->Buffer(&pBuff)))
        {
            memcpy(pBuff, videoFrame.Pixels, …);

            // invalidate the bitmap to redraw the image
            bitmap->Invalidate();
        }
    }

    // render audio samples (later)
}

XAudio2

What is missing now is audio rendering. We can use the same technology in both a classic Win32 app and a WinRT/UWP app: XAudio2, the replacement for DirectSound, which provides a very simple API for low-level access to the sound card and is particularly suitable for video games.

XAudio2 works by building an audio processing graph, and in our case the graph will be very simple: there is a source node (voice) that produces the emulator-synthesized samples (in the form of a single 16-bit PCM channel at 48 kHz) and there is a mastering voice that represents the audio device.

pic6

This is the code that creates and starts the graph (error handling omitted):

CComPtr<IXAudio2> _pMusicEngine;
IXAudio2SourceVoice* _pSourceVoice;
IXAudio2MasteringVoice* _pMasteringVoice;

std::vector<uint8_t> _buffers[QUEUE_SIZE];
int _lastPlayed;
int _lastWritten;

...

::XAudio2Create(&_pMusicEngine, 0, XAUDIO2_DEFAULT_PROCESSOR);
_pMusicEngine->CreateMasteringVoice(&_pMasteringVoice, XAUDIO2_DEFAULT_CHANNELS);

WAVEFORMATEX wfx;
ZeroMemory(&wfx, sizeof(WAVEFORMATEX));
wfx.wFormatTag = WAVE_FORMAT_PCM;
wfx.nChannels = 1;
wfx.nSamplesPerSec = 48000;
wfx.wBitsPerSample = 16;
wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.wBitsPerSample / 8;
wfx.nBlockAlign = wfx.wBitsPerSample / 8;

_pMusicEngine->CreateSourceVoice(&_pSourceVoice, (WAVEFORMATEX*)&wfx, 0, 1.0f, &_voiceContext, NULL, NULL);
_pSourceVoice->Start(0, 0);

Here _voiceContext is a pointer to an object that implements the IXAudio2VoiceCallback interface, used to buffer and synchronize the samples to be played.

Our emulator produces a continuous sequence of samples which we need to play in real time, with minimum delay and without glitches; the best way to do this is to use a circular buffer (see Petzold).

pic7

More precisely, we use an array of small buffers arranged in a circular queue so that when one of the buffers is being played, the other buffers can be filled and queued to be processed.

Every time we run the emulator for a quantum of time (GalagaMachine.Run(…)) we get the set of samples that the machine would have generated in that time interval, and we insert them into the queue. For simplicity, we can ask the emulator to provide the samples in small buffers of the same size as the buffers in the queue.

{
    _galagaMachine.Run(time, videoFrame, audioFrame);

    // render screen frame
    ...

    // render audio
    int buffer = (1 + _lastWritten) % QUEUE_SIZE;
    if (buffer == _lastPlayed) {
        return; // overflow
    }
    std::copy(audioFrame.samples().begin(), audioFrame.samples().end(), _buffers[buffer].begin());
    _lastWritten = buffer;
}

Meanwhile, XAudio2 continues playing the queued buffers, and it reports when each buffer has finished playing by calling the OnBufferEnd method of the IXAudio2VoiceCallback interface. There we can refill the buffer with more samples and add it back to the queue, submitting it to the IXAudio2SourceVoice object with code like this:

{
    _lastPlayed = (_lastPlayed + 1) % QUEUE_SIZE;

    XAUDIO2_BUFFER buf = { 0 };
    buf.AudioBytes = _buffers[_lastPlayed].size();
    buf.pAudioData = _buffers[_lastPlayed].data();
    _pSourceVoice->SubmitSourceBuffer(&buf);
}

Having a little buffering is useful to compensate for moments when the computer’s CPU is too busy to keep up with the task of producing audio samples in our emulator. The bigger the buffers, the more resistant the app will be to transient changes in CPU load that could starve or overflow the queue. But we pay for this buffering with latency: a delay between the moment something happens in the game and the moment the sound effect for that event is played, a bit like the delay between lightning and thunder.

On a normal Windows PC we can generally afford to use relatively small buffers; I found that using a queue with three or four 20-msec buffers (each with 48000 samples/sec * 1 channel * .02 sec = 960 samples) works very well in most cases. More sophisticated algorithms could be used to dynamically adapt the latency to the CPU load.
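The buffer-size arithmetic above can be captured in a couple of small helpers (illustrative, not the app’s actual code):

```cpp
constexpr int kSampleRate = 48000;  // samples per second, one channel

constexpr int SamplesPerBuffer(int bufferMsec)
{
    return kSampleRate * bufferMsec / 1000;
}

// With N queued buffers, the worst-case latency is roughly the whole queue.
constexpr int WorstCaseLatencyMsec(int queueSize, int bufferMsec)
{
    return queueSize * bufferMsec;
}
```

For the configuration in the text, SamplesPerBuffer(20) gives the 960 samples per buffer mentioned above, and a four-buffer queue gives roughly 80 msec of worst-case latency.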

From UWP to HTML5

So far, everything was quite easy: writing an arcade emulator for Windows is not a particularly difficult task. With native code we can efficiently emulate another ISA and the system provides us with a simple API to quickly render video frames and to play audio samples. The real problem is that the resulting code is not portable. If we want to support a different platform we can recompile and reuse the C++ code for the emulator, but we need to rewrite all the “host” code that deals with audio and with screen frames.

Nowadays the platform of choice for user interfaces is, of course, the browser, and JavaScript is becoming more and more the lingua franca for UI development. So the natural next step is to ask how difficult it would be to write the same emulator in JavaScript, and whether an interpreted language like JS can be fast enough to emulate an old arcade machine.

JavaScript as the assembly language for the Web

While many people consider JavaScript a very elegant programming language, others, like me, are really put off by its lack of static typing. Whatever your preference, there is no doubt that JavaScript is perfect for writing portable code.

Already in 2011, gurus like Brendan Eich (the inventor of JavaScript) and Douglas Crockford (the inventor of JSON) predicted that JavaScript could be used as a target language to compile other languages into, becoming a sort of assembly language for the Web. And even before that, Erik Meijer (then at Microsoft) had the idea of writing a compiler from .NET IL to JavaScript as part of the Volta project.

There are other frameworks that enable developers to “target” JS using C# or Java as source languages, like Script# and GWT, but the Volta C#-to-JS compiler was the first (that I know of) to really use JavaScript as an assembly-like language: it worked by compiling .NET code normally into MSIL and then converting each MSIL instruction into JavaScript.

Emscripten

Project Volta has long been dead, but the idea of using JavaScript as the assembly for the Web is more interesting than ever, especially because the JavaScript engines in our browsers are today much faster than in the past.
The best way to compile into JavaScript today is to use Emscripten, an open-source tool based on LLVM that compiles from C and C++.

To be more precise, what Emscripten does is take LLVM bitcode (which can be generated from C/C++ using Clang, or from any other LLVM-friendly language) and compile it into highly-optimizable JavaScript.
For example, the following C++ function, taken from the sources of the Galaga emulator:

void Frame::SetPixel(int x, int y, int color)
{
    _pixels[(_width * y) + x] = color;
}

Once compiled (with almost all optimizations disabled), it is transformed into the following JS code:

function __ZN6Galaga5Frame8SetPixelEiii($this,$x,$y,$color) {
   $this = $this|0;
   $x = $x|0;
   $y = $y|0;
   $color = $color|0;
   var $0 = 0, $1 = 0, $2 = 0, $3 = 0, $4 = 0, $5 = 0, label = 0, sp = 0;
   sp = STACKTOP;
   $0 = HEAP32[$this>>2]|0;       // $0: _width
   $1 = Math_imul($0, $y)|0;      // $1: _width * y
   $2 = (($1) + ($x))|0;          // $2: _width * y + x
   $3 = ((($this)) + 8|0);
   $4 = HEAP32[$3>>2]|0;          // $4: &(_pixels)
   $5 = (($4) + ($2<<2)|0);       // $5: &(_pixels[_width * y + x])
   HEAP32[$5>>2] = $color;        // $6: _pixels[_width * y + x] = color
   return;
}

We can see that the resulting code is really the equivalent of assembly code, only written in JS rather than in a native binary. (The optimized version would be less verbose but also less readable).

Asm.js

The generated code is full of expressions like $x|0… this is a way to coerce the variable x to an integer type. So there is a kind of static typing being enforced… what kind of JavaScript is this? It turns out that Emscripten uses a well-specified, strict subset of JavaScript as its compilation target, named Asm.js.

Asm.js was designed (by Mozilla) exactly for the purpose of being used as a target language for source-to-source compilation from statically-typed languages like C/C++. You can find more details in the spec, but in a few words, asm.js is JS without all the potentially slow parts (like dynamic type guards and garbage collection); it only supports the parts that can be executed very efficiently (strictly-typed integers, floats, arithmetic, function calls, and heap accesses).

The static-type rules allow much faster JIT or ahead-of-time compilation: the script engine can perform optimizations that would not otherwise be possible, and the generated code can be surprisingly fast, running at near-native speed!
But Asm.js also has its disadvantages:

  • It is difficult to write Asm.js code by hand (exactly as it is difficult to code in x86 assembly).
  • The whole C++ application gets compiled into a single massive JavaScript file, including the standard libraries; it is not trivial to make this compiled code interact with other JavaScript parts (but some form of interaction is always necessary).
  • It is difficult to port programs that have a lot of interactions with the OS, or that do a lot of I/O, or that use many external native libraries. There is no support for multi-threading.

Furthermore, while the compiled Asm.js code is valid JavaScript that can run on any browser, browsers must explicitly support Asm.js to really take advantage of the optimizations that are possible once the asm.js code is validated. For this reason, the performance of the same Emscripten-generated code may vary widely across different browsers.

If you are curious, you can use the Octane benchmark to measure the performance of your browser. There is also an interesting demo that shows very effectively what difference running on a browser optimized for asm.js can make. At this page, two chess engines are matched against one another. They are both compiled from the same Stockfish source code, but only one has a flag set to tell the browser that the code should be interpreted (and optimized) as Asm.js. Of course, this is the engine that wins all the matches, as long as the game is played on a browser that supports asm.js.

This chart, taken from http://caniuse.com/#feat=asmjs, describes the current status of the browser support matrix. Currently Asm.js is fully supported by Firefox and Edge, while Chrome is still a little behind.

pic8

The arcade emulator in Asm.js

After this long introduction to Emscripten and Asm.js I think you’ll see where I’m going: if we want to port our C++ arcade emulator to JavaScript we don’t need to write (almost) any code: we can just recompile the existing C++ sources!

In fact, while it’s unlikely that Asm.js will ever be used for traditional web development (DOM/CSS), it seems perfect for a niche application like a software emulator (already in 2011 Fabrice Bellard wrote an impressive JavaScript emulator of a Linux PC).

The code of the emulator engine already builds with Clang, it is “self-contained” (with no external dependencies apart from the CRT and the STL), it does not use multithreading, and it exposes a very simple interface to its clients. So it requires almost no adjustment: once it’s compiled with Emscripten, the resulting asm.js code can be used right away in the browser.

The only question left to be settled now is: will asm.js be fast enough to run a (relatively) computationally intensive application like an arcade emulator in real time?

In order to find out, we first need to install the Emscripten SDK (the latest version can be found here).
Then we run the Emscripten compiler from the command-line:

emcc
    --bind                   // uses Embind
    -std=c++14               // supports C++14
    -O3                      // enable all optimizations
    src1.cpp … srcN.cpp      // the list of files to compile
    -o galaga.js             // the result of the compilation

All the source files will be compiled into a single, big JavaScript file (about 800 KB). It really is that simple. I’ve never had any problems with the Emscripten compiler: as long as my code compiled without errors and warnings with the LLVM toolchain for Visual Studio, it also compiled without errors with Emscripten.

The only tricky part now is to “consume” the giant blob of almost-obfuscated code that we have generated from “normal” JavaScript code, and here the --bind option comes to our rescue. When this option is specified we can use special macros to bind C++ functions to JavaScript and, vice versa, to call JavaScript functions from C++.
In our case, we only need to add one additional file to our sources, with the following code:

#include <emscripten/bind.h>
using namespace emscripten;

#include "galagamachine.h"

    // Binding code
    EMSCRIPTEN_BINDINGS(galaga_machine)
    {
        class_<GalagaMachine>("GalagaMachine")
            .constructor<>()
            .function("Run", &GalagaMachine::Run)
            .function("set_InsertCoin", &GalagaMachine::set_InsertCoin)
            .function("set_Start1Player", &GalagaMachine::set_Start1Player)
            .function("set_Start2Player", &GalagaMachine::set_Start2Player)
            .function("set_MoveLeft", &GalagaMachine::set_MoveLeft)
            .function("set_MoveRight", &GalagaMachine::set_MoveRight)
            .function("set_Button1", &GalagaMachine::set_Button1);
    }

This tells the compiler that it must create binding code for our class GalagaMachine (or, more precisely, for the functions of this class listed in the EMSCRIPTEN_BINDINGS macro). As a result, the giant blob of generated JavaScript code will also contain a class Module.GalagaMachine with the specified methods, and we will be able to instantiate the whole emulator simply by writing:

    var galaga = new Module.GalagaMachine();

The HTML5 Host

We are now in the same situation we were when we started the Direct2D and UWP apps: we have encapsulated the arcade emulator with a simple interface and we just need to implement the “host” code to render the game screen frames and to play the audio samples. This time, however, we want to use HTML5 and JavaScript and things get more interesting.

pic9

HTML5: the scheduler

The first problem we need to solve is how to efficiently schedule the execution of the emulator so that it can produce 50 or 60 frames per second. The simplest idea would be to use JavaScript timers (setTimeout(msec)), but a much better solution is to use Window.requestAnimationFrame(callback).

This is an API intended to simplify animations: we call requestAnimationFrame and pass a callback that the browser will call before the next repaint. The callback routine must itself call requestAnimationFrame() to animate the next frame at the next repaint, and so on, forever. The repaint rate is usually 60 times per second, but the browser may reduce it to a lower rate when the page or tab is in the background or in a hidden iframe, in order to improve performance and battery life.

The following code uses requestAnimationFrame to schedule the periodic execution of the function step():

    var prevTimestamp;

    function repeatOften(timestamp) {
        var interval;
        if (!prevTimestamp || !timestamp) {
            interval = 20.0;
        }
        else {
            interval = timestamp - prevTimestamp;
        }
        prevTimestamp = timestamp;
        step(interval);
        requestAnimationFrame(repeatOften);
    }

We’ll start the scheduling in a startup function main() that does all the initialization for the game:

    function main() {
        galaga = new Module.GalagaMachine();
        initVideo();
        initAudio();

        repeatOften();
    }

In step(), the call to galaga.Run() is where we finally jump into Emscripten-compiled code and run the emulator for a quantum of execution time. The duration of this interval varies according to the variable refresh rate; to calculate it we use the timestamp that is passed as an argument to our callback.

    function step(interval) {
        galaga.Run(interval,
            function (videoFrame, audioBuffer) {
                // render videoFrame
                ...

                // play audioBuffer
                ...
            }
        );
    }

Note that the second argument passed to Run() is another callback. In fact, the Emscripten bindings dictate another small change to our C++ code. Originally, we could retrieve the screen frame and the audio samples by passing two objects by reference:

void Run(double us, VideoFrame& videoFrame, AudioFrame& audioFrame);

This would not work with Emscripten. Instead, we can make the interface more JavaScript-friendly by passing a callback function that will be called when the function Run() completes and that takes as arguments the same two objects that were originally returned by reference:

void GalagaMachine::Run(double us, emscripten::val onComplete)
{
    // ... runs the emulator for a timeslice, obtains videoFrame, audioFrame.

    onComplete(
        emscripten::typed_memory_view(Height * Width * 4, videoFrame.Pixels),
        emscripten::typed_memory_view(audioFrame.Size, audioFrame.Samples)
    );
}

emscripten::val is a type provided by Embind to represent any JavaScript object, and it is what makes it possible to call JavaScript code directly from C++.
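For completeness, this is roughly what the Embind registration that exposes the machine to JavaScript as Module.GalagaMachine could look like (a sketch: the class and function names are taken from the snippets above, and the project's actual bindings may differ):

```cpp
#include <emscripten/bind.h>

// Sketch of the Embind registration. This is what makes
// "new Module.GalagaMachine()" work on the JavaScript side.
EMSCRIPTEN_BINDINGS(galaga)
{
    emscripten::class_<GalagaMachine>("GalagaMachine")
        .constructor<>()
        .function("Run", &GalagaMachine::Run);
}
```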

HTML5 frame rendering

In order to render screen frames we need to find an efficient way to “blit” the pixels from the VideoFrame generated by the emulator onto the screen. Luckily, the Canvas 2D API provides the ImageData object that does just this. So, in HTML5 the whole UI for the game can be made by a single Canvas element suitably placed in our page:

    <canvas id="myCanvas"></canvas>

At initialization time, we create an ImageData object the same size as the arcade screen. A small complication is that we need to scale the ImageData to fill the whole canvas, which in turn is resized to fill the whole browser window. The best way to do this is to draw the ImageData into another off-screen canvas and then draw the off-screen canvas onto the original canvas.

    var galaga;
    var canvas;
    var offscreenCanvas;
    var context;
    var offscreenContext;
    var imageData;

    function initVideo() {
        // create offscreen canvas and imageData
        offscreenCanvas = document.createElement('canvas');
        offscreenCanvas.width = 224;
        offscreenCanvas.height = 288;
        offscreenContext = offscreenCanvas.getContext('2d');
        imageData = offscreenContext.createImageData(224, 288);

        canvas = document.getElementById('myCanvas');
        context = canvas.getContext('2d');
        canvas.height = window.innerHeight - 24;
        canvas.width = canvas.height * offscreenCanvas.width / offscreenCanvas.height;
        context.scale(canvas.width / offscreenCanvas.width, canvas.height / offscreenCanvas.height);
    }

Then, periodically, after having run the emulator for a quantum of emulation time, we just copy the generated screen frame onto the canvas ImageData. This copy can be as fast as the copy of a JavaScript typed array: ImageData exposes its data as a Uint8ClampedArray of four bytes per pixel, in the RGBA format. So, if the emulator generates videoFrame pixels in the same format, we can fill the ImageData with a simple call to Uint8ClampedArray.set():

    function step(interval) {
        galaga.Run(interval,
            function (videoFrame, audioBuffer) {
                // videoFrame is a Uint8Array that aliases directly into the Emscripten heap
                imageData.data.set(videoFrame);
                offscreenContext.putImageData(imageData, 0, 0);
                context.drawImage(offscreenCanvas, 0, 0);

                // Audio
                (later)
            }
        );
    }

HTML5 Audio

Finally, the last thing left to do is to play the audio samples synthesized by our emulator. This is probably the most complicated part of the whole project; one could argue that even today, with HTML5 WebAudio, there is no really good way to do it.
WebAudio provides a rich API to capture, manipulate and play audio samples. Audio operations are performed by putting together a graph of audio nodes (sources nodes, filter nodes, output nodes, and so on).

The basic way to play a sound with WebAudio is simply to create an AudioBufferSourceNode, fill its buffer with audio samples and schedule it to be played back at the appropriate time with AudioBufferSourceNode.start(when), where when is expressed in seconds. In our case, we want to play a continuous sequence of buffers, each containing 20 ms of samples. Ideally, we should be able to schedule the buffers one after the other, like bufferSourceNode0.start(t); bufferSourceNode1.start(t + 0.02); bufferSourceNode2.start(t + 0.04); and so on. But this does not work: Web Audio does not guarantee sample-precise stitching of contiguous buffers, and in all browsers this approach produces glitching audio.

There is no “buffer-queueing” functionality in the Web Audio API, which would be very useful to render audio synthesized in real time. I found a long discussion of this problem in the Audio discussion list of w3.org.

However, a (somewhat convoluted) workaround does exist, using ScriptProcessorNodes. A ScriptProcessorNode is a node that can be inserted in the graph to do some processing/filtering with a JavaScript event handler. The node is connected to an input and an output buffer, and the event handler onaudioprocess() is called every time the input buffer contains new data; it must then fill the output buffer with new data.

pic10

In our case, the important feature of a ScriptProcessorNode is that it produces a continuous sequence of samples with no artifacts or glitches. Since we are generating the samples ourselves, we don’t really need any input buffer; we can simply rely on the fact that the system will call onaudioprocess() to fill an output buffer with the audio samples that will be sent to the playback node.

pic11

But a limitation is that we cannot decide the playback rate of the output buffer; it will always be equal to the sample rate of the WebAudio context. To complicate things further, this frequency is platform-dependent. On Windows with Firefox, Chrome and Edge it is usually 44.1 kHz, but I found out that it was instead 48 kHz with any browser on my Surface Pro 3.

The problem is that the emulator synthesizes audio at its own fixed rate, 48 kHz, and when the two sample rates don’t match we need to resample our audio. And even with resampling there is still the possibility of glitches, because the ScriptProcessorNode works in the main thread and can easily cause performance issues. A ScriptProcessorNode is initialized with the size of the buffer that will be processed by the handler function, which must be a power of 2 (between 256 and 16K samples). This buffer size also determines how frequently the onaudioprocess event fires, and even here there is a buffer-size vs. latency trade-off: smaller sizes give lower latency, but the handler is called more frequently and the audio will glitch more easily. For example, a 1024-sample buffer at 44.1 kHz means that onaudioprocess will be called every 1024/44.1 ≈ 23.2 msec.

So, this is evidently not a perfect solution, and in fact the ScriptProcessorNode interface has now been deprecated, to be replaced by Audio Workers, which implement audio processing in the context of a web worker rather than in the main application thread. But as far as I know, no browser supports Audio Workers yet, so ScriptProcessorNodes are still the best option.

The code to play audio with a ScriptProcessorNode is not very different from the one we wrote for XAudio2. Even here we need a circular buffer to be resilient to timer jitter and changes in CPU load. Buffering is particularly useful here since, as we have seen, the ScriptProcessorNode event handler is called in the main JavaScript thread, which is also busy periodically running the emulator.
At initialization time we initialize Web Audio and create a graph with a ScriptProcessorNode:

var BUFFER_COUNT = 8;
var BUFFER_SIZE = 1024; // samples
var _lastPlayed = BUFFER_COUNT - 1;
var _writePos = 4 * BUFFER_SIZE - 960; 

function initAudio() {
    window.AudioContext = window.AudioContext||window.webkitAudioContext;
    if (!window.AudioContext) {
        throw 'Web Audio API is not available!';
    }

    audio.context = new window.AudioContext();
    audio.buffer = new Int16Array(BUFFER_COUNT * BUFFER_SIZE);

    audio.scriptNode = audio.context.createScriptProcessor(BUFFER_SIZE, 0, 1);
    audio.scriptNode.connect(audio.context.destination);
    audio.scriptNode.onaudioprocess = ... // later
}

After running the emulator we insert the generated audio samples into the circular buffer. For brevity, I omitted the code that takes care of resampling the audio when the AudioContext frequency does not match the 48 kHz of our emulator.

galaga.Run(interval,
    function (videoFrame, audioBuffer) {
        // ... Render screen frame, as seen before...

        if (audio.context.sampleRate != 48000) {
            // resampling
        }

        // Queue audio samples. Note that the number of samples in audioBuffer
        // can be different from BUFFER_SIZE.
        var max = BUFFER_SIZE * _lastPlayed;
        var i = 0;
        var j = _writePos;
        var len = BUFFER_SIZE * BUFFER_COUNT;
        while (i < audioBuffer.length) {
            if (j == max) {
                // overflow
                break;
            }
            audio.buffer[j] = audioBuffer[i];
            i++;
            j = (j + 1) % len;
        }
        _writePos = j;
    }
);
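The omitted resampling step could look something like the following sketch: a minimal linear-interpolation resampler (a hypothetical helper, not the project's actual code, which may use a different algorithm):

```javascript
// Converts a buffer of 16-bit samples from sourceRate (the emulator's
// 48 kHz) to targetRate (the AudioContext sample rate) by linear
// interpolation between the two nearest source samples.
function resample(input, sourceRate, targetRate) {
    var outLength = Math.round(input.length * targetRate / sourceRate);
    var ratio = sourceRate / targetRate;
    var output = new Int16Array(outLength);
    for (var i = 0; i < outLength; i++) {
        var pos = i * ratio;          // fractional position in the input
        var i0 = Math.floor(pos);
        var i1 = Math.min(i0 + 1, input.length - 1);
        var frac = pos - i0;
        output[i] = Math.round(input[i0] * (1 - frac) + input[i1] * frac);
    }
    return output;
}
```

For example, 20 ms of audio at 48 kHz (960 samples) resamples to 882 samples at 44.1 kHz.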

Finally, there is the onaudioprocess event handler, which is called periodically by the ScriptProcessorNode. Here we extract the samples from the queue and feed them to the output buffer.

    // This function is called periodically by the audio system and runs in the main thread.
    audio.scriptNode.onaudioprocess = function(audioEvent) {
        try {
            var channelData = audioEvent.outputBuffer.getChannelData(0);

            _lastPlayed = (_lastPlayed + 1) % BUFFER_COUNT;
            var j = _lastPlayed * BUFFER_SIZE;
            var i = 0;
            while (i < BUFFER_SIZE)
            {
                // Convert from [-32768:32767] to [-1.0:1.0]
                channelData[i] = audio.buffer[j] / 0x8000;
                audio.buffer[j] = 0;
                i++; j++;
            }
        }
        catch (e) {
            // ...
        }
    }
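The producer/consumer logic above can be isolated into a small self-contained ring buffer (a simplified sketch with hypothetical names, ignoring the overflow and zero-fill details of the real code):

```javascript
// The emulator pushes samples in; the audio callback pulls
// fixed-size blocks out, advancing one block per call.
function RingBuffer(blockSize, blockCount) {
    this.data = new Int16Array(blockSize * blockCount);
    this.blockSize = blockSize;
    this.blockCount = blockCount;
    this.writePos = 0;   // next sample slot to write
    this.readBlock = 0;  // next block to play
}

RingBuffer.prototype.push = function (samples) {
    for (var i = 0; i < samples.length; i++) {
        this.data[this.writePos] = samples[i];
        this.writePos = (this.writePos + 1) % this.data.length;
    }
};

RingBuffer.prototype.pullBlock = function () {
    var start = this.readBlock * this.blockSize;
    var block = this.data.slice(start, start + this.blockSize);
    this.readBlock = (this.readBlock + 1) % this.blockCount;
    return block;
};
```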

And that’s really all! We can now play our synthesized sounds in a browser, too. 🙂 I must say that the quality of the resulting audio is not always perfect because of the limitations of ScriptProcessorNode, and on slow machines there can be sporadic glitches here and there. It will be interesting to rewrite the buffering algorithm with Audio Workers, when they become available.

Performance

Now we are in a position to answer an interesting question. When I started porting the C++ engine to JavaScript with Emscripten I had no idea what the slow-down would be. With native (or even managed) code, it is obviously quite easy to efficiently emulate 30+ year old hardware and an ancient 8-bit microprocessor. But is this true even with a scripting language?

Looking back at the emulator interface, we saw that we have a single method Run() that executes the emulator for a specified interval, producing a video frame and a number of audio samples. This is the only really CPU-intensive part of our system. Once we generate a display frame, we can blit it onto the screen very quickly; the Canvas 2D API is fast. Dealing with audio samples means moving around very small arrays (for a single channel at 48 kHz, 20 milliseconds of audio only take 960 samples), and the Web Audio API is also fast. So, the only bottleneck is in the engine itself, and everything works very well as long as a call to Run() takes much less than N milliseconds to emulate N milliseconds.

So, I timed how long running the emulator actually takes with JavaScript, on different web browsers, and I took as reference the time spent by the native emulator used in the Direct2D project. The comparison is particularly meaningful because we are comparing execution time for exactly the same code (even though transpiled into another language). The following chart shows the results, running the emulator on an old Sony laptop, i7 processor:

pic12

We can see that some browsers are faster than others with asm.js code. In particular, Chrome turned out to be the slowest and Firefox the fastest, with Microsoft Edge not too far behind. But what was very, very interesting to me was to measure that the slow-down factor of Emscripten-generated JavaScript compared to the native code can be as low as 1.66. This is really not a lot, and it’s much faster than I would have expected.

Looking forward

Today Asm.js seems useful only for particular kinds of applications, like games, emulators and computational workloads in general. So, it seems to occupy the same niche as NaCl/PNaCl, the sandbox for native code in a browser developed by Google (which has not had huge success and is still only supported by Chrome).

Very likely, Asm.js will never be used for “standard” web applications, but I wonder whether it could become in the future the “native” language for a new type of web applications based on HTML5. The browser could become the host for a new kind of portable OS, with the canvas and WebGL APIs used for the user interface, the WebAudio API for the audio, WebRTC for real time media streaming, WebSockets and XMLHttpRequests for networking, local File API and cloud APIs for storage, and so on.

Asm.js could be the target language for a number of Emscripten-compiled libraries and components that could be reused by “normal” web applications. For example, Asm.js should be fast enough to implement new video codecs, or to program whole canvas-based UI frameworks, with controls, animations, databinding (even though the need for this is questionable). Or maybe the future of the web (and of mobile apps) could be in desktop-style apps that run in the browser, written in statically-typed languages and transpiled into JavaScript, with Asm.js as the native language.

And very soon C# and the other .NET languages could also be viable alternatives to C++ for compiling into Asm.js. Microsoft is now working on LLILC, an LLVM-based MSIL compiler. For now the goal of the (open source) project is to create a .NET JIT for LLVM, but the plan is to also build an AOT compiler in the future.

WebAssembly

It is also worth noting that people are already working on even faster alternatives to Asm.js. A small issue with Asm.js is that the assembly-like code is quite verbose, and with large programs, having to download and then parse large codebases can take some time. This problem could be solved by WebAssembly (wasm), a low-level assembly/binary format that directly represents the AST of a program and is designed to be smaller in size, faster to parse and faster to execute than JavaScript. All main browser vendors are working on supporting WebAssembly inside their script engines, and Emscripten should soon support WebAssembly as compilation target.

Insert coin…

This concludes my little exploration of the world of retro-gaming and the emulation of vintage arcade games. If you are interested, you can play with my version of Galaga online HERE.
This is a screenshot of the game in action, running on Microsoft Edge:

pic13

You can find the sources here:

Range comprehensions with C++ lazy generators

In the previous post we had a look at a recent proposal N4286 to add stackless coroutines to the C++ language and described the implementation prototype that ships with Visual Studio 2015 CTP.

We saw that coroutines can be used to implement lazy sequences. Lazy evaluation is a powerful tool and a pillar of functional programming; it gives the ability to construct potentially infinite data structures, and increases performance by avoiding needless calculations. Many languages provide constructs to compute a series of elements one at a time, only when they are requested. These are usually named sequences (https://msdn.microsoft.com/en-us/library/dd233209.aspx) in functional languages like Haskell or F#, generators in Python, and iterator blocks in C#.

In the N4286 proposal a generator is a function that returns an object of type generator<T> and that contains at least one yield statement. For example, this is a generator for the Fibonacci sequence (the Hello World of generators) written with the notation of the VS2015 CTP:

generator<int> Fibonacci(int max)
{
    int a = 0;
    int b = 1;
    while (a <= max)
    {
        __yield_value a;
        int next = a + b;
        a = b;
        b = next;
    }
}

Query comprehensions and LINQ

Functional languages like Haskell have the concept of list comprehensions, syntactic constructs for creating a list by applying operators to other existing lists. Haskell is perfect to deal with infinite lists, thanks to its laziness, but even non-functional languages like Python and C# provide their version of query comprehensions.

In C#, of course, we have LINQ, whose design was heavily influenced by Haskell (Erik Meijer was one of the authors). The LINQ syntax is designed specifically to make it easy to apply a sequence of operations on the sequence monad IEnumerable<T>, in the form of queries.

It would be nice to have something similar in an eager language like C++. There are very interesting articles on this topic and I’d heartily suggest reading Bartosz Milewski’s “Getting Lazy with C++” and Eric Niebler’s “Range comprehensions”.

Let’s take an example from these two papers: let’s write a small Haskell program that prints the sequence of Pythagorean triples. A Pythagorean triple is a tuple of three positive integers, x, y, and z which satisfy the relation x*x + y*y = z*z.

main = print (take 10 triples)
triples = [(x, y, z) | z <- [1..]
                     , x <- [1..z]
                     , y <- [x..z]
                     , x^2 + y^2 == z^2]

In Haskell this code is very intuitive; we start with (possibly infinite) sequences of integers, and operate on them to generate a sequence of tuples, filtering only the tuples that satisfy the Pythagorean condition. Note that the act of taking only the first elements and the act of printing them are orthogonal to the act of generating the sequence.

The LINQ code for Pythagorean triples is not very different from the Haskell version:

static IEnumerable<Tuple<int,int,int>> PythagoreanTriples(int max)
{
    return (from z in Enumerable.Range(1, max)
            from x in Enumerable.Range(1, z - 1)
            from y in Enumerable.Range(x, z - x)
            where x * x + y * y == z * z
            select new Tuple<int, int, int>(x, y, z));
}

This declarative query syntax is just syntactic sugar for the invocation of standard query operators (https://msdn.microsoft.com/en-us/library/bb397947.aspx) like Where, Select, GroupBy, Range, …, which are defined as extension methods of the Enumerable class.

The same function can be written as a simple iterator block, with a yield statement:

static IEnumerable<Tuple<int, int, int>> PythagoreanTriples(int max)
{
    foreach (var z in Enumerable.Range(1, max))
    {
        foreach (var x in Enumerable.Range(1, z - 1))
        {
            foreach (var y in Enumerable.Range(x, z - x))
            {
                if (x * x + y * y == z * z)
                {
                    yield return new Tuple<int, int, int>(x, y, z);
                }
            }
        }
    }
}

Query comprehensions in C++

Can we have something like LINQ in C++? This has been for some time a little holy grail for me; I tried to implement something with Fibers a long time ago but I didn’t really know what I was doing and the result was very kludgy.

But now the lazy, resumable generators proposed by N4286 seem perfect for this purpose; when coroutines will be fully supported by the language it could be interesting to think of a new library for query comprehensions that work on these generators.

We can use the VS2015 CTP prototype to experiment with this idea. What should a LINQ-like library for C++ look like? How would it work?

Like in LINQ, we want to write a set of operators that transform generator<T> sequences. Let’s start with a first one, a filtering operator Where that takes a generator and a predicate and returns the sequence of elements that satisfy the predicate. This can be done with a simple static function:

template<typename T, typename Pred>
static generator<T> Where(generator<T> gen, Pred pred)
{
    for (auto val : gen) {
        if (pred(val)) {
            __yield_value val;
        }
    }
}

We could very easily write other operators in the same way; for example, the Select operator, analogous to std::transform but working on generators:

template<typename T, typename Fun>
static generator<typename std::result_of<Fun(T)>::type>
Select(generator<T> gen, Fun fun)
{
    for (auto val : gen) {
        __yield_value fun(val);
    }
}

This is also simple to write thanks to std::result_of, which deduces the return type of Fun at compile time. Note that we cannot use automatic return type deduction here; at least in the first prototypes of the VS2015 CTP we must explicitly specify the return type generator<T>. This would not compile:

static auto Select(generator<T> gen, Fun fun) {}

Composability

Writing query operators is not very difficult. The problem is: how do we compose these operators? As an example, let’s say we want to take the sequence of all integers, filter the ones that are even and then transform them into a sequence of their string representation, like in figure:

Operators pipeline

We could do this in stages, starting with a generator that yields the sequence of integers:

generator<int> Integers()
{
    for (int i = 0; i <= INT_MAX; i++) {
        __yield_value i;
    }
}

We could either (1) apply the Where and Select operators one after the other:

    auto evenInts = Where(Integers(), [](int n) { return (n % 2) == 0; }); 
    auto evenIntStrs = Select(evenInts, [](int n){ return std::to_string(n); });

or (2) we could try to put them together in a single expression:

    auto result = Select(
        Where(Integers(), [](int n) { return (n % 2) == 0; }),
        [](int n){ return std::to_string(n); });

But neither choice is great; the second one in particular quickly makes the code difficult to read and to maintain. Ideally, we would like these operators to be composable into a pipeline, as they are in LINQ:

var result = Range(1, Max)
    .Where(n => (n % 2) == 0)
    .Select(n => n.ToString());

If we were able to modify the code for generator<T>, we could do this by adding a predetermined set of operators directly in that class, in the standard library:

template <typename T>
class generator
{
    ...
    generator<T> Where(Pred pred)
    {
        for (auto val : *this) {
            if (pred(val)) {
                __yield_value val;
            }
        }
    }
    ...
}

But even so, the set of operators would not be easy to extend. What we really want is to have the ability to define and add new operators in our code, and have them compose seamlessly.

In C# the query operators are implemented as extension methods of the Enumerable class, which makes it very easy to “plug in” new operators simply by adding new extension methods. In C++ we don’t have extension methods. As an alternative, we can follow the lead of the Boost.Range library, which uses the pipe syntax (the bitwise OR operator |) to compose operations.

Therefore our goal will be being able to write query comprehensions as chains of pipe operators, in this form:

auto evenNumStrs = Range(1, Max)
    | Where([](int n) { return n % 2 == 0; })
    | Select([](int n) { return std::to_string(n); });

And now things get interesting: how to implement this? 🙂

A possible implementation

To be honest, at first I had really no idea. But, as Picasso once said, “good artists copy, great artists steal”: so I stole from a real artist. More precisely, I looked at the code of Eric Niebler’s range library, and it was a very interesting read. Here I am reusing some of his ideas, especially to implement a pipeline of operators. The resulting code is much simpler in my case, of course, because I only have to deal with the narrow scope of resumable generators.

So, let’s start from the beginning: as we said our goal is to be able to create pipelines of operations with the pipe operator |:

generator<int> evenNumbers = Integers() |
                             Where([](int n) { return n%2 == 0; });

We clearly need to overload the bitwise OR operator |. When its left argument is a generator, the right argument must be something “pipeable” to a generator. Therefore, a call to a function like Where() must return “something pipeable”, whatever that means.

We can say that a type is pipeable if it implements a static method pipe(), with two arguments: a generator<T> and an instance of the pipeable type itself:

template<typename T, typename Pipeable>
auto operator|(generator<T>&& src, Pipeable&& pipe)
{
    return Pipeable::pipe(std::forward<generator<T>>(src),
                          std::forward<Pipeable>(pipe));
}

The next step could be to implement Where as a pipeable type. Rather than just a function, Where now becomes an invokable type, a class with a call operator that implements the logic to filter a sequence according to a predicate. And class Where should also have a static method pipe() that evaluates a pipe by invoking the call operator:

template<typename T>
struct Where
{
    template<typename Pred>
    generator<T> operator()(generator<T> gen, Pred pred)
    {
        for (auto val : gen) {
            if (pred(val)) {
                __yield_value val;
            }
        }
    }

    template<typename Pipe>
    static auto pipe(generator<T>&& src, Pipe&& pipe)
    {
        return operator()(src, pred); // pred ???
    }
};

But now the problem is: how do we pass the predicate to the pipe? Or, generalizing the question, since operators can have any number of arguments: how do we pass to the pipe all the arguments, if we have an expressions like this?

auto outputGen = inputGen | Op(arg1, arg2, ...);

Pipeable types

We need to complicate the design a little, leveraging the magic of variadic templates and std::bind. We introduce a type for pipeable objects (objects that can be at the right side of a pipe operator |), implemented as follows:

template<typename Binder>
struct pipeable
{
    // A forwarding call wrapper generated by std::bind.
    Binder _binder;

    pipeable(Binder pipeableBinder)
        : _binder(std::move(pipeableBinder))
    {
    }

    // Executes the call operator of a bound method, passing a generator
    // as placeholder argument.
    template<typename T, typename Pipe>
    static auto pipe(generator<T>&& gen, Pipe&& pipe)
    {
        return pipe._binder(std::forward<generator<T>>(gen));
    } 
};

Let’s look at this in more detail. Type pipeable<> has a template parameter Binder, which is an invokable type, the type of the objects returned when we call std::bind() to bind the call operator (of a type like Where) to a set of arguments.

Now, having an object op of class Where, a generator object and a predicate, we can call std::bind(op, sourceGenerator, predicate) and use the result to construct an instance of a pipeable object, so that pipeable::pipe() will call a method already bound to its arguments.

Class Where can now be simplified to have just a call operator:

template<typename T>
struct Where
{
    template<typename Pred>
    generator<T> operator()(generator<T> gen, Pred pred)
    {
        [...]
    }
};

Not all the arguments are early-bound, though. When we construct the pipeable object, the operator and the predicate are given, but we don’t have a sourceGenerator to bind yet. This is the generator to which the operator is applied, i.e. the object that is at the left side of the pipe and that is passed as argument to pipeable<T>::pipe(). So we need to do a little currying and use a std::bind placeholder in its stead:

std::bind(op, std::placeholders::_1, predicate);

The resulting workflow is the following:

// Given an input sequence,
auto gen = Integers();

// a predicate,
auto pred = [](int n){ return (n % 2) == 0; };

// and a query operator:
Where<int> opWhere;

// Binds the predicate to the operator
auto opWhereBoundToPred = std::bind(opWhere, std::placeholders::_1,
    std::move(pred));

// Creates a pipeable object
pipeable<decltype(opWhereBoundToPred)> pipeableWhereBoundToPred(opWhereBoundToPred);

// Now a pipe expression executes the operator replacing the placeholder with
// the input sequence:
gen | Where(pred) =>
gen | pipeableWhereBoundToPred =>
Pipeable::pipe(gen, pipeableWhereBoundToPred) =>
opWhereBoundToPred(gen)

Pipeable factories

The next step is to encapsulate all this logic into a class that acts as a factory for our pipeable types.

template<typename Op>
struct pipeable_factory
{
    // an invokable object of an operator class.
    Op _op;

    // Binds _op to its arguments, using a placeholder for the first argument.
    template<typename...Args>
    auto operator()(Args&&... args) const
    {
        return make_pipeable(std::bind(_op, std::placeholders::_1,
               std::forward<Args>(args)...));
    }

private:
    template<typename Binder>
    static pipeable<Binder> make_pipeable(Binder pipeableBinder)
    {
        return { std::move(pipeableBinder) };
    }
};

Class pipeable_factory is templated on a query operator type (a type like class Where above) and it is also an invokable type. Its call operator uses the magic of variadic templates to accept a variable number of arguments and forward them to the std::bind function described before, then uses the result to instantiate and return a pipeable object.

So now in order to be able to execute an operator in a pipe expression like:

auto evenIntegers = Integers() | Where( [](int n) { return (n%2)==0; });

we just need to transform somehow Where(pred) into a pipeable_factory of the Where operator. We can do this very simply, by renaming the operator and constructing a single, constant instance of the pipeable_factory for Where:

struct where_op
{
    template<typename T, typename Pred>
    generator<T> operator()(generator<T> gen, Pred pred)
    {
        for (auto n : gen) {
            if (pred(n)) {
                __yield_value n;
            }
        }
    }
};
constexpr pipeable_factory<where_op> Where{};

Now finally “gen | Where(pred)” works as expected: at the right side there is a factory which binds the predicate to where_op::operator() and constructs a pipeable object. Evaluating the pipe means invoking pipeable<>::pipe(), which calls the bound method, replacing the placeholder with the generator object at the left side of the pipe. The result is that the pipe expression is evaluated as a call to where_op::operator()(gen, pred).

And since the result of a pipe expression can itself be a generator<>, and the bitwise OR operator | is left-to-right associative, we also get composability.

In other words, it is possible to chain a sequence of operators, exactly like LINQ does:

auto result = Integers() | Where(isEven) | Select(toString);

and we can iterate over the result with a simple range loop:

for (auto s : result) {
    cout << s << endl;
}

Here we have introduced a second operator, Select, which projects each element of a sequence into a new form by applying a transform function:

struct select_op
{
    template<typename T, typename Fun>
    generator<typename std::result_of<Fun(T)>::type>
    operator()(generator<T> gen, Fun fun)
    {
        for (auto n : gen) {
            __yield_value fun(n);
        }
    }
};
constexpr pipeable_factory<select_op> Select{};

More operators

Operators like Select and Where only take two arguments, a generator and a callable object. But some cases are a little more challenging. For example, the operator Zip, which “joins” two sequences together by applying a specified function to the corresponding i-th elements of each sequence, producing a sequence of the results.

We could think of writing Zip in the same way, simply adding an additional argument for the second sequence:

class zip_op
{
public:
    template<typename TFirst, typename TSecond, typename Fun>
    generator<typename std::result_of<Fun(TFirst, TSecond)>::type>
      operator()(generator<TFirst> first,
                 generator<TSecond> second,
                 Fun resultSelector) const
    {
        auto it1 = first.begin();
        auto it2 = second.begin();
        while (it1 != first.end() && it2 != second.end()) {
            __yield_value resultSelector(*it1, *it2);
            ++it1;
            ++it2;
        }
    }
};
constexpr pipeable_factory<zip_op> Zip{};

But this does not compile (with the VS2015 CTP), and I am not totally sure whether it should. It fails to bind all the arguments to the variadic pipeable_factory::operator() seen before. I suspect that this could work if I could use return type deduction in zip_op::operator(), but unfortunately this is not an option, since we need to explicitly return a generator<T> type.

What we can do instead is to write operator() as a variadic template function, which can take any number of arguments in its parameter pack. This function can use return type deduction as long as it does not really *implement* a resumable function but only *calls* one. To generalize the solution, we can put the variadic call operator in a base class generator_operator, from which all our query operators will inherit. Each operator class will now lazily do its work in a resumable function exec() that returns a generator<T>.

Putting this idea into code, we have now:

// A generator operator is an invokable object templatized on
// an operator class. The template type Op represents a LINQ-like
// operator that works on lazy generators.
template <typename Op>
struct generator_operator
{
    template<typename Gen, typename... Rest>
    auto operator()(Gen&& gen, Rest&&... rest) const
    {
        return Op::exec(std::move(gen), std::move(rest)...);
    }
};

class select_op : public generator_operator<select_op>
{
    friend struct generator_operator<select_op>;

    template<typename T, typename Fun>
    static generator<typename std::result_of<Fun(T)>::type>
    exec(generator<T> gen, Fun fun)
    {
        for (auto n : gen) {
            __yield_value fun(n);
        }
    }
};
constexpr pipeable_factory<select_op> Select{};

class where_op : public generator_operator<where_op>
{
    friend struct generator_operator<where_op>;

    template<typename T, typename Fun>
    static generator<T> exec(generator<T> gen, Fun fun)
    {
        for (auto n : gen) {
            if (fun(n)) {
                __yield_value n;
            }
        }
    }
};
constexpr pipeable_factory<where_op> Where{};

Note that the operator classes are stateless: the exec function can be static, and all the state is captured in the closures. The exec function can even be private, since it’s always called through the base generator_operator class, which is a friend of the operator class.

The result is quite clean, in my opinion: if we now want to add a new resumable operator, all we have to do is:

  1. write a new generator_operator-derived class with a method exec that returns a generator<T>, and
  2. declare a pipeable_factory<> for this new type.

More about generator<T>

So far nothing seems to prevent us from making these lazy operators work with any range type, even eager ranges, or with any type that defines begin() and end() methods that return input iterators.

However, we are using resumable functions to implement operators for lazy algorithms and we have seen that a resumable generator function can return only a type generator<T> and not auto. Class generator<T> has some other peculiarities that must be taken into account.

If we look at its code (in the header file C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\include\experimental\generator in the VS CTP) we see that class generator is moveable but not copyable.

template <typename _Ty, typename _Alloc = allocator<char> >
struct generator
{
    struct promise_type { ... };

    struct iterator { ... };

    iterator begin() {
        ...
        return {_Coro};
    }
    iterator end() {
        return {nullptr};
    }

    generator() = default;
    ~generator() { ... }

    generator(generator const&) = delete;
    generator& operator = (generator const&) = delete;

    generator(generator && _Right) : _Coro(_Right._Coro) {
        _Right._Coro = nullptr;
    }
    generator& operator = (generator && _Right) {
        if (&_Right != this) {
            _Coro = _Right._Coro;
            _Right._Coro = nullptr;
        }
        return *this;
    }

private:
    resumable_handle<promise_type> _Coro;
};

The copy constructor and copy assignment operators are deleted. To understand why, it is useful to look at a class diagram for the generator’s code:

generators

Generators are based on stackless coroutines, which work (as described here) by allocating an activation frame, encapsulated by the class resumable_handle. In this design a resumable_handle provides the functionality to suspend and resume the execution of a function. Every instance of generator<T> creates and owns a resumable_handle object, and shares a reference to this object with its iterators. This is why the generator<T> type cannot be copyable: we should never have two instances of a generator that refer to the same coroutine.

But the type can be moveable, so the move constructor and move assignment operator are defined; they work by transferring ownership of the resumable_handle from one object to another.
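The move-only ownership pattern used by generator<T> can be sketched in miniature with an ordinary handle-owning type (handle_owner is a hypothetical stand-in for generator<T>, and a plain heap pointer stands in for the resumable_handle):

```cpp
#include <utility>  // std::move

// A move-only owner of a raw handle: copying is forbidden, moving
// transfers the handle and nulls it out in the source, so exactly one
// object ever refers to the underlying resource.
struct handle_owner {
    int* handle = nullptr;

    handle_owner() = default;
    explicit handle_owner(int* h) : handle(h) {}
    ~handle_owner() { delete handle; }

    handle_owner(const handle_owner&) = delete;
    handle_owner& operator=(const handle_owner&) = delete;

    handle_owner(handle_owner&& rhs) : handle(rhs.handle) {
        rhs.handle = nullptr;
    }
    handle_owner& operator=(handle_owner&& rhs) {
        if (&rhs != this) {
            delete handle;
            handle = rhs.handle;
            rhs.handle = nullptr;
        }
        return *this;
    }
};
```

After a move, the source object is left in a valid but empty state, just as a moved-from generator no longer refers to any coroutine.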

The non-copyability must be taken into account in our implementation. Every time we pass a generator<T> in a function call, we need to pass an rvalue reference (generator<T>&&). The only, very important exception is in the generator_operator<> call operator. As we have seen, it is written like this:

template <typename Op>
struct generator_operator
{
    template<typename Gen, typename... Rest>
    auto operator()(Gen&& gen, Rest&&... rest) const
    {
        return Op::exec(std::move(gen), std::move(rest)...);
    }
};

We explicitly pass an rvalue reference to the call operator, but we must make sure that in the call to exec() the move constructor is invoked, passing ownership of the resumable handle to a new instance of generator<T> allocated in the scope of the exec() resumable function:

struct where_op
{
    template<typename T, typename Pred>
    static generator<T> exec(generator<T> gen, Pred pred) {...}
};

Taking the generator by rvalue reference instead would not work:

struct where_op
{
    template<typename T, typename Pred>
    static generator<T> exec(generator<T>&& gen, Pred pred) { ... }
};

This error is quite tricky and insidious to debug. The problem is that with resumable functions one must always make sure that the lifespan of the generator object exceeds the lifespan of the resumable function that uses it. Note that when we evaluate a pipe, like:

auto evenNumbers = Integers() | Where([](int n) { return n % 2 == 0; });

the generator’s destructor is called after the expression is evaluated, well before we can start iterating over the result. So, inside the where_op::exec() coroutine we would have a reference to a destroyed object. But if we use the move constructor instead, then there is a new generator<T> object that lives in the scope of the exec() function and that takes ownership of the resumable_handle. And since exec() is actually a resumable function, this new generator<T> object is allocated in the coroutine activation frame and not on the stack. So this generator<T> object's lifespan coincides with the lifespan of the resumable function, and everything works.

For all these reasons I designed the lazy operators to work only on generator<T>s, not on any kind of ranges (like STL containers). But it is very easy to write an adapter function that converts an eager range into a lazy generator:

template<typename Src, typename T = typename Src::value_type>
static generator<T> make_generator(const Src& src)
{
    for (const T& t : src) {
        __yield_value t;
    }
}

Examples

Do you still remember the problem we started from? We can now finally write our own solution to the problem of generating the sequence of Pythagorean triples. All we are missing is an implementation of Range, to generate a sequence of integers:

generator<int> Range(int from, int count)
{
    for (int i = 0; i < count; i++)
    {
        __yield_value (from + i);
    }
}

And finally:

generator<tuple<int, int, int>> Triples(int max)
{
    for (auto z : Range(1, max - 1)) {
        for (auto x : Range(1, z - 1)) {
            for (auto y :
                Range(x, z - x) |
                Where([x, z](int y) { return x*x + y*y == z*z; }))
            {
                __yield_value tuple<int, int, int>(x, y, z);
            }
        }
    }
}
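As a sanity check for the lazy version, the same triple-nested logic can be written eagerly in standard C++ (triples_eager is a hypothetical helper; Range(from, count) enumerates from..from+count-1, which the loop bounds below reproduce):

```cpp
#include <tuple>
#include <vector>

// Eagerly collect all Pythagorean triples (x, y, z) with z < max,
// mirroring the loop bounds of the generator-based Triples(max).
std::vector<std::tuple<int, int, int>> triples_eager(int max)
{
    std::vector<std::tuple<int, int, int>> out;
    for (int z = 1; z < max; ++z)            // Range(1, max - 1)
        for (int x = 1; x < z; ++x)          // Range(1, z - 1)
            for (int y = x; y < z; ++y)      // Range(x, z - x)
                if (x * x + y * y == z * z)
                    out.emplace_back(x, y, z);
    return out;
}
```

For max = 6 this produces the single triple (3, 4, 5), the same first element the lazy pipeline yields.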

As another, and final, example we can write a (slow) prime numbers generator. We need an additional operator for this, the operator All that returns true if all the elements in a sequence satisfy a given predicate:

class all_op : public generator_operator<all_op>
{
    friend struct generator_operator<all_op>;

    template<typename T, typename Fun>
    static bool exec(generator<T> gen, Fun fun)
    {
        for (auto n : gen) {
            if (!fun(n)) {
                return false;
            }
        }
        return true;
    }
};
constexpr pipeable_factory<all_op> All{};

All shows that a lazy operator does not necessarily need to return a generator; it can return any type. With All we can finally write:

generator<int> Primes(int max)
{
    return Range(2, max - 1) |
        Where([](int i) {
            return Range(2, (int)sqrt(i)) |
                   All([i](int j) { return (i % j) != 0; });
        });
}
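The same trial-division logic can be checked eagerly in standard C++ (primes_eager is a hypothetical helper): i is prime when no j up to sqrt(i) divides it, which gives the same results as the lazy Range | All pipeline.

```cpp
#include <cmath>
#include <vector>

// Eagerly collect all primes i in [2, max] by trial division
// with divisors j in [2, sqrt(i)].
std::vector<int> primes_eager(int max)
{
    std::vector<int> out;
    for (int i = 2; i <= max; ++i) {
        bool prime = true;
        for (int j = 2; j <= static_cast<int>(std::sqrt(i)); ++j) {
            if (i % j == 0) { prime = false; break; }
        }
        if (prime) out.push_back(i);
    }
    return out;
}
```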

But these are just a few basic examples. If you are interested, you can find here:

the sources of this small library, with the implementation of a number of LINQ-like operators, like Aggregate, All, Any, Concat, Contains, Count, ElementAt, Empty, Except, First, Last, Max, Min, Range, Repeat, Reverse, Select, Skip, SkipWhile, Sum, Take, TakeWhile, Where, Zip.
The code is definitely not production-quality: error and exception handling are mostly absent, and the unit tests are minimal. It is just an example of what we will be able to do with lazy generators, once they become part of C++.

Stackless coroutines with Visual Studio 2015

Let’s talk about… coroutines! Once again, for a change! 🙂

In my last posts I talked about a Microsoft proposal to add support for resumable functions and the await keyword to the language, and described how this feature had been implemented in one of the first previews of VS2015. Almost a year has passed since then, and there are a few noteworthy updates, so it is time for a new post. In fact, there is a new proposal for resumable functions (N4286) that addresses some of the limitations of the past proposal, and that has been prototyped in the latest CTP of Visual Studio 2015. But let’s start from the beginning.

C++17 Concurrency TS

I had been looking for some time at the problem of implementing coroutines/resumable functions, in order to have in C++ something similar to what is provided by the C# await and yield statements.

It turns out that – unbeknown to me 🙂 – this is quite a hot topic in the C++ community. The next major version of the standard is expected for 2017 and, in preparation for this, the activity of the standard committee has been organized in 8 groups working on 8 technical specifications, for things like a File System API, a Networking API, Concepts-Lite, Transactional Memory, Parallelism and Concurrency. (See http://isocpp.org/std/status for more details).

In particular, in the Concurrency TS people are discussing how to improve std::futures to make them more composable, and how to introduce executors (task schedulers) and resumable functions. And while the work on extended futures and executors should only involve updates to the standard library, resumable functions could require changes to the language itself. To be honest, it seems unlikely that these changes will be approved in time for 2017, because there is still no agreement in the committee; but even so, this seems a very interesting topic (well, to me, at least…).

Stackful?

In the old proposal (N3858) Gustafsson presented the advantages of having language support for resumable functions, proposed the introduction of the await and yield keywords, and described two possible implementations: stackful and stackless. The first, based on resumable side stacks (fibers), worked by giving each coroutine its own stack, which was associated with a “real” OS thread only when the coroutine was actually running. The second, based on heap-allocated activation frames, requires the compiler to do significantly more work, in order to allocate on the heap a data structure containing the arguments and local variables of the coroutine, and to generate the code for a state machine that manages the suspension and resumption points.

The Visual Studio 2013 CTP came with a stackful, fiber-based implementation of the await feature, which I described in a previous post. But stackful coroutines have a big disadvantage: fibers are VERY expensive. By default, a fiber is allocated with the default stack size of a thread, which is 1MB on Windows. Clearly this makes it impossible to have too many coroutines running concurrently; just a few thousand coroutines would completely exhaust the virtual address space of a 32-bit process.

Stackless!

The new proposal (N4286) instead focuses mostly on a stackless implementation, and promises to scale to billions of concurrent coroutines. The goal, again, is to have functions that can be suspended while waiting for some asynchronous operation to complete, and then automatically resumed after the operation has completed. For the user, this means being able to explicitly await an asynchronous completion with await statements, as is done in C#.

As an example of an asynchronous operation let’s imagine that we have a function that sleeps asynchronously for a given time interval:

std::future<void> delay(int ms)
{
    return std::async([ms]() {
        this_thread::sleep_for(std::chrono::milliseconds(ms));
    });
}

The proposal would make it possible to wait on the asynchronous operation using await:

std::future<void> testAwait()
{
    await delay(2000);
}

Resumable functions also make it possible to have lazy generators (similar to C# iterator blocks) with the yield statement:

std::generator<int> fibonacci(int max)
{
    int a = 0;
    int b = 1;
    while (a <= max) {
        yield a;
        int next = a + b; 
        a = b; 
        b = next;
    }
}
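The sequence the generator above would yield can be checked with an eager, standard-C++ equivalent (fibonacci_eager is a hypothetical helper that collects into a vector what the coroutine would yield one value at a time):

```cpp
#include <vector>

// Eager equivalent of the fibonacci generator: collects the yielded
// values 0, 1, 1, 2, 3, 5, ... up to max into a vector.
std::vector<int> fibonacci_eager(int max)
{
    std::vector<int> out;
    int a = 0;
    int b = 1;
    while (a <= max) {
        out.push_back(a);   // stands in for "yield a"
        int next = a + b;
        a = b;
        b = next;
    }
    return out;
}
```

The difference, of course, is that the lazy version computes each value only when the consumer pulls it, so it also works for unbounded sequences.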

The syntax and the semantics of coroutines remain more or less the same as in the old proposal; the changes are mostly in the way these coroutines should be implemented using activation frames. And since I think I have already talked more than enough about resumable functions and generators in the past, in this post I’d like to focus on the details of this new implementation (which are much better explained in the standard document, anyway).

In fact, making a resumable function with a fiber, while expensive, is not very complicated: after all, a fiber is already a system-level implementation of a coroutine. But implementing resumable functions with just an activation frame requires much more work from the compiler, which needs to alter the structure of the code, turning it into a sequence of chained continuations.

Visual Studio 2015 preview

To have a look at how stackless coroutines of N4286 are supposed to work we can play with a first prototype which is already available in the CTP version of Visual Studio 2015, freely available for download here.

The only meaningful difference between this prototype and the standard proposal is that in the prototype we’ll have to deal with PPL tasks rather than std::futures. This is because N4286 is based on an improvement of standard std::futures, extended to support the chaining of continuations with std::future::then(), as is done in boost::future. But, as we know, schedulers and continuation chaining are not yet part of the standard; therefore the VS prototype replaces futures with PPL tasks and uses the Concurrency Runtime for their scheduling. But this is just an implementation detail: the term “task” in what follows should be read simply as an alias for the “std::future” of the future.

Let’s start with await. We can rewrite with the PPL the method delay defined above:

Concurrency::task<void> delay(int ms)
{
    return Concurrency::create_task([ms]() {
        this_thread::sleep_for(std::chrono::milliseconds(ms));
    });
}

At this point we can use the new keyword __await (with a double underscore, because it is not yet standard):

Concurrency::task<void> testAwait()
{
    __await delay(2000); // suspension point (suspends testAwait, resumes delay).
}

In order to compile this code with the VS2015 CTP we need first to enable a compiler extension by passing the /await flag to the compiler options for our project. To use this new feature we also need to include the header file <pplawait.h>.

If we run this with the debugger we can see that the execution really suspends at the __await point and the control goes back to the caller of our testAwait function, which can then go on doing other stuff. After two seconds the delay task completes and the execution of testAwait resumes from where it had suspended. Normally it will resume on a different thread, one of the worker threads used by the PPL scheduler.

There are other limitations in the current CTP implementation: only the x64 platform is supported, not Win32, and stack-frame run-time checks (/RTC) need to be disabled because they are not yet compatible with /await. But these are just minor issues; after all, this is just a prototype and not supposed to be used for production code.

Coroutine handles

But how does all this work? In a previous post I investigated some of the details of the stackful prototype that came last year with the VS2013 CTP. Let’s do the same with the new implementation in the latest CTP.

Most of the library logic for this feature ships in two header files (resumable and generator) located in the \Microsoft Visual Studio 14.0\VC\include\experimental\ folder.

As we said, in a stackless implementation a coroutine is associated with an activation frame that is allocated somewhere in the heap and that contains the status of the coroutine together with all the data required to manage its suspension and resumption. In N4286 the frame is an instance of the compiler-generated class coroutine_handle.

In the VS2015 CTP an activation frame is encapsulated by an instance of the resumable_handle type. (This was the old name of coroutine_handle in a previous version of the standard document.)
A resumable handle owns the block of heap memory that represents the activation frame of a function. In general, data in a frame is organized as in the figure: first there is space for a special coroutine promise type, followed by space for the “suspend context” and the local state (parameters and local variables).

resumable_handle

Coroutine promises

The promise class is a library class that implements the semantics of a particular type of coroutine (like await or generators). It must implement all the logic to pass return values and exceptions back from asynchronous computations, with methods like:

void set_result(T val) 
T get_return_object(resumable_handle rh)
void set_exception(E e)
AwaitType yield_value(T val)
bool cancellation_requested()
...

We’ll see later how the C++ compiler makes use of these functions in the code that it generates for a suspension/resumption point.
In the VS2015 CTP, the logic of a promise for await statements is in class resumable_traits<RetType, …Args>::promise_type. This traits class (renamed as coroutine_traits<> in the latest version of the proposal) is templated on the return and argument types of the resumable function and aggregates all the type-dependent types and methods, like the activation frame allocator.

Awaitable types

Together with resumable handles and promises, the final piece of the picture is awaitable types. An awaitable type is a type for which a library provides support for await statements, by implementing the following three functions:

// Returns true if the result of the asynchronous operation is already available.
bool await_ready(T& t) const

// Suspends the execution until the asynchronous operation completes.
void await_suspend(T& t, resumable_handle<> rh)

// Resumes the execution from where it had suspended.
void await_resume(T& t)

For example, in the sample code above, when we call:

    __await delay(2000);

the type we want to await on is a PPL task. So, with our await statement we are asking the compiler to generate code that suspends on a task and resumes after the task is completed. The await compiles because the new library provides these three functions for PPL tasks:

bool await_ready(task& t) const {
    return t.is_done();
}
void await_suspend(task& t, resumable_handle rh) {
    t.then([rh](task&) { rh(); });
}
void await_resume(task& t) {
    t.get();
}

From this, it should be clear that the mechanism is easily extensible to new types. We can await on any type for which we can provide the three await functions.

Behind the scenes…

But what happens exactly when we compile a resumable function like this, with one or more await statements? Let’s take again this function as example:

task<R> testAwait(T1 param1, T2 param2, …)
{
    // *** code before the __await statement

    __await delay(2000); // suspension point (suspends testAwait, resumes delay).

    // *** code after the __await statement
}

Given the presence of suspension points, the compiler transforms the code quite radically, leveraging the library code for coroutine handles, promises and awaitable types described before. In practice, the function above is transformed into this (simplified) pseudo-code:

task<R> testAwait(T1 param1, T2 param2, …)
{
    // Allocates coroutine activation frame.
    auto resHandle = _Resumable_helper_traits::_Alloc(frameSize, resumeAddress);
    
    // Creates a task to be returned to the caller.
    // This task is associated to the task completion handle of the promise.
    task<R> taskToReturn = resHandle.promise().get_return_object();

    // *** Here goes all the code before the __await statement

    // Calls the awaitable function and receives the task on which we want
    // to asynchronously block.
    auto taskToAwait = delay(2000);
  
    // If the task has already completed, we have nothing else to do 
    if (! Concurrency::await_ready(taskToAwait))
    {
        // Suspends the coroutine and adds ‘resHandle’ as continuation to be called
        // when the task ‘taskToAwait’ completes.
        Concurrency::await_suspend(taskToAwait, resHandle);
        return taskToReturn;
    }

Resume:
    // we get here when the await has completed
    auto result = Concurrency::await_resume(taskToAwait);

    // *** code after the __await statement

    _Resumable_helper_traits::_Free(resHandle);
    return taskToReturn;
}

Let’s look at this in detail. First, a single resumable handle for the whole function is allocated. This is an instance of a template class, specialized on the types of all the function parameters and on the return type. The frame size is calculated at compile time to also hold the local variables declared in the function. One of the arguments to _Alloc() is resumeAddress, a pointer to the callback function that will be called when the asynchronous operation completes.

Any resumable function must return a task, so the next step is to prepare the promise and the task that will be returned to the caller.

After this preamble, the code of the function executes normally, in a synchronous way, until it gets to an await (or yield) statement. In our case we want to await on the task returned by a call to delay(timeInMs), and, as we saw, there is library support for awaiting on tasks. The compiler adds a call to await_ready to check whether the task has already completed. If so, we can skip the asynchronous part and just get the task result. But if the task has not yet completed, we need to suspend the execution. This is done by returning our return-value task to the caller, so that the caller can carry on doing other work, or maybe await asynchronously on our task in turn.

Suspending a function is relatively simple, but how will the function resume its execution? Before returning the result task to the caller, a call to await_suspend is made to chain a continuation onto the task we are awaiting on. One of the parameters here is the resumable handle itself, which represents the coroutine state of this resumable function:

template <class _Ty, typename _Handle>
void await_suspend(task<_Ty>& _Task, _Handle _ResumeCb)
{
    _Task.then([_ResumeCb](task<_Ty>&) {
        _ResumeCb();
    });
}
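The continuation mechanism that await_suspend relies on can be sketched without the PPL at all. Here is a toy, single-threaded stand-in (toy_task is entirely hypothetical, not a real PPL type): a “task” stores callbacks registered with then() and invokes them when it completes, which is exactly the hook the resumable handle plugs into.

```cpp
#include <functional>
#include <vector>

// A toy task: completion is signalled explicitly, and continuations
// registered with then() run when (or if already) complete.
struct toy_task {
    bool done = false;
    std::vector<std::function<void()>> continuations;

    void then(std::function<void()> cb) {
        if (done) cb();                  // already completed: run inline
        else continuations.push_back(std::move(cb));
    }

    void complete() {                    // called by the "scheduler"
        done = true;
        for (auto& cb : continuations) cb();
        continuations.clear();
    }
};
```

In the real implementation the callback passed to then() is the resumable handle's call operator, which jumps back into the suspended coroutine.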

So now the method testAwait has suspended and returned a task to the caller. The caller is free to do other things, or it can itself suspend, waiting for the task we returned to complete. Meanwhile, life goes on, the scheduler keeps scheduling, time goes by, and at some point the two seconds of the delay task elapse and a call to our continuation can be scheduled to run.

The class resumable_handle implements the call operator (), which is indeed the callback that is called after the awaited task completes. This is where task continuations meet coroutines: the call operator simply forwards the call to the resumeAddress function that the compiler passed as an argument when allocating the resumable frame. In this way, the execution can resume from where it left off. The details are a little more complicated: since a function can have multiple suspension/resumption points, the VS compiler seems to implement a small state machine to restart the execution at the correct location. But, simplifying, for the pseudocode above it is as if there were a

goto Resume;

in order to jump to the Resume: label and complete the await operation, retrieving the result of the completed task and returning it to the caller as the result of the returned task.

Generators

The other asynchronous construct made possible by coroutines is lazy generators: functions that produce a (possibly infinite) sequence of values that can be pulled one at a time.

These are the equivalent of C# iterator blocks or Python generators; the function suspends after having “yielded” each value, and resumes execution when the next value is requested, until the sequence completes.

Going back to the usual example, this is a generator for the Fibonacci sequence, written with the syntax of VS2015 (where the keyword __yield_value is used in place of yield):

generator<int> fib(int max)
{
    int a = 0;
    int b = 1;
    while (a <= max) {
        __yield_value a;
        int next = a + b;
        a = b;
        b = next;
    }
}

The function returns an object of type generator, which behaves like any STL sequence whose range is defined by begin() and end() methods. Therefore the client code that enumerates the values of the sequence can be as simple as a range-based for loop:

for (auto n : fib(1000)) {
    std::cout << n << std::endl;
}

In VS2015, the code for generators can be found in the file \Microsoft Visual Studio 14.0\VC\include\experimental\generator. I could dig into this code and try to describe all the details, but I would risk becoming repetitive and sounding like a broken record. In fact, the implementation of the prototype for generators turns out to be very similar to the idea I sketched in one of my last posts. (This, of course, not because my idea was particularly brilliant, but only because there are not many other ways to write generators using coroutines :-)).

Just to have a quick overview, this is a very simplified version of the implementation:

template<typename T>
class generator
{ 
    class iterator
    {
        resumable_handle _coro;

    public:
        iterator(resumable_handle rh) : _coro(rh) {}
        iterator(nullptr_t) : _coro(nullptr) {} // represents the end of the sequence
        iterator operator ++ () {
            _coro(); // resumes execution
            return *this;
        }
        bool operator == (const iterator& rhs) {
            return _coro == rhs._coro;
        }
        const T& operator * () const {
            return _coro.promise()._CurrentValue;
        }
    };

public:
    generator(resumable_handle rh) : _coro(rh) {}

    iterator begin() {
        _coro(); // starts the coroutine and executes it until it terminates or yields.
        return iterator(_coro);
    }
    iterator end() { return iterator(nullptr); }

private:
    resumable_handle _coro;
};

We see that in order to compile a function like the Fibonacci generator above, the compiler must allocate a resumable_handle and use it to construct an object of type generator<T>.

The role of the generator class is just to provide the semantics of a range, with input iterators returned by begin() and end(); all the logic for the lazy generation is in the iterator class.
An iterator owns the resumable_handle that was passed to the generator, and uses it to resume and suspend its execution. Even here, a resumable handle works as a wrapper for the state of a coroutine. There are just small differences with respect to await: here a different type of promise is used (of class generator::promise_type), designed to support yield_value and to store the last value yielded, so that it can later be pulled by dereferencing the iterator.

The result is quite neat, in my opinion: C#-like generators will be very useful, especially if they can be composed with LINQ-like operators.

But again, for more details, I’d refer to this post, or (even better) to the standard document.

Conclusion

To summarize this brief excursus into the new proposal for resumable functions, there are a few points worth noting:

  1. Resumable functions can be used to add support for await and yield in C++.
  2. They can be implemented with stackless coroutines in order to be more scalable (at the price of greater complexity in the compiler).
  3. Stackless coroutines work by allocating a single activation frame for each invocation of a resumable function.
  4. The implementation requires support from improved std::futures (composable with then()), which are also being standardized.

Hopefully, something like that will be part of the standard soon.

(Note: slightly edited on 2015/3/8 to clarify a few points).

Generator functions in C++

In the previous post we had a look at the proposal of introducing resumable functions into the C++ standard to support writing asynchronous code modeled on the C# async/await pattern.

We saw that it is already possible to experiment with the future resumable and await keywords in Visual Studio, by installing the latest November 2013 CTP. But the concept of resumable functions is not limited to asynchrony; in this post we’ll see how it can be expanded to support generator functions.

Generator functions and lazy evaluation

In several languages, like C# and Python, generator functions provide the ability to lazily produce the values of a sequence, only when they are needed. In C# a generator (or iterator) is a method that contains at least one yield statement and that returns an IEnumerable<T>.

For example, the following C# code produces the sequence of Fibonacci numbers (1, 1, 2, 3, 5, 8, 13, 21, …):

IEnumerable<int> Fibonacci()
{
    int a = 0;
    int b = 1;
    while (true) {
        yield return b;
        int tmp = a + b;
        a = b;
        b = tmp;
    }
}

A generator acts in two phases. When it is called, it just sets up a resumable function, preparing for its execution, and returns some enumerator (in the case of C#, an IEnumerable<T>). But the actual execution is deferred to the moment when the values are actually enumerated and pulled from the sequence, for example with a foreach statement:

foreach (var num in Fibonacci())
{
    Console.WriteLine("{0}", num);
}

Note that the returned sequence is potentially infinite; its enumeration could go on indefinitely (if we ignore the integer overflows).

Of course there is nothing particularly special about doing the same thing in C++. While STL collections are usually eagerly evaluated (all their values are produced upfront), it is not difficult to write a collection that provides iterators that calculate their current value on the spot, based on some state or heuristic.

What gives a particular expressive power to generators is the ability to pause the execution each time a new value is generated, yielding control back to the caller, and then to resume the execution exactly from the point where it had suspended. A generator is therefore a special form of coroutine, limited in the sense that it may only yield back to its caller.

The yield statement hides all the complexity inherent in the suspension and resumption of the function; the developer can express the logic of the sequence plainly, without having to set up callbacks or continuations.

From resumable functions to generators (and beyond)

It would be nice to bring the expressive power of generators to our good old C++, and naturally there is already some work going on for this. In this proposal Gustafsson et al. explain how generator functions could be supported by the language as an extension of resumable functions, making it possible to write code like:

sequence<int> fibonacci() resumable
{
    int a = 0;
    int b = 1;
    while (true)
    {
        yield b;
        int tmp = a + b;
        a = b;
        b = tmp;
    }
}

Here, the proposal introduces two new concepts, the type sequence<T> and the yield keyword.

– sequence&lt;T&gt; is an (STL-like) collection that only supports iteration and only provides an input iterator.

– The yield statement suspends the execution of the function and returns one item of the sequence to the caller.

In C# terms, sequence<T> and its iterator are respectively the equivalent of an IEnumerable<T> and IEnumerator<T>. But while the C# generators are implemented with a state machine, in C++ the suspension and resumption would be implemented, as we’ll see, with stackful coroutines.

Once we had a lazy-evaluated sequence&lt;T&gt; we could write client code to pull a sequence of values, which would be generated one at a time, and only when requested:

sequence<int> fibs = fibonacci();
for (auto it = fibs.begin(); it != fibs.end(); ++it)
{
    std::cout << *it << std::endl;
}

In C++11 we could also simplify the iteration with a range-based for loop:

sequence<int> fibs = fibonacci();
for (auto n : fibs)
{
    std::cout << n << std::endl;
}

More interestingly, we could define other resumable functions that manipulate the elements of a sequence, lazily producing another sequence. This example, taken from Gustafsson’s proposal, shows a lazy version of std::transform():

template<typename Iter>
sequence<int> lazy_tform(Iter beg, Iter end, std::function<int(int)> func) resumable
{
    for (auto iter = beg; iter != end; ++iter)
    {
        yield func(*iter);
    }
}

Moving further with this idea, we could pull another page out of the C# playbook and enrich the sequence class with a whole set of composable, deferred query operators, a la LINQ:

template <typename T>
class sequence
{
public:
    template <typename Predicate> bool all(Predicate predicate);
    [...]
    static sequence<int> range(int from, int to);
    template <typename TResult> sequence<TResult> select(std::function<TResult(T)> selector);
    sequence<T> take(int count);
    sequence<T> where(std::function<bool(T)> predicate);
};

Lazy sequences

Certainly, resumable generators would be a very interesting addition to the standard. But how would they work? We saw that the Visual Studio CTP comes with a first implementation of resumable functions built over the PPL task library, but in this case the CTP is of little help, since it does not support generator functions yet. Maybe they will be part of a future release… but why wait? We can implement them ourselves! 🙂

In the rest of this post I’ll describe a possible simple implementation of C++ lazy generators.

Let’s begin with the lazy sequence&lt;T&gt; class. This is an STL-like collection which only needs to support input iterators, with a begin() and an end() method.

Every instance of this class must somehow be initialized with a functor that represents the generator function that will generate the values of the sequence. We’ll see later what can be a good prototype for it.

As we said, the evaluation of this function must be deferred to the moment when the values are retrieved, one by one, via the iterator. All the logic for executing, suspending and resuming the generator will actually be implemented by the iterator class, which therefore needs to have a reference to the same functor.

So, our first cut at the sequence class could be something like this:

template<typename T>
class sequence_iterator
{
    // TO DO
};
template<typename T>
class sequence
{
public:
    typedef sequence_iterator<T> iterator;
    typedef ??? functor;

    sequence(functor func) : _func(func) { }
    iterator begin() {
        return iterator(_func);
    }
    iterator end() {
        return iterator();
    }

private:
    functor _func;
};

Step by step

The sequence<T> class should not do much more than create iterators. The interesting code is all in the sequence iterator, which is the object that has the ability to actually generate the values.

Let’s go back to our Fibonacci generator and write some code that iterates through it:

sequence<int> fibonacci() resumable
{
    int a = 0;
    int b = 1;
    while (true)
    {
        yield b;
        int tmp = a + b;
        a = b;
        b = tmp;
    }
}

auto fibs = fibonacci();
for (auto n : fibs)
{
    std::cout << n << std::endl;
}

How should this code really work? Let’s follow its execution step by step.

  1. First, we call the function fibonacci(), which returns an object of type sequence<int>. Note that at this point the execution of the function has not even started yet. We just need to return a sequence object somehow associated to the body of the generator, which will be executed later.
  2. The returned sequence is copied into the variable fibs. We need to define what it means to copy a sequence: should we allow copy operations? Should we enforce move semantics?
  3. Given the sequence fibs, we call the begin() method, which returns an iterator “pointing” to the first element of the sequence. The resumable function should start running the moment the iterator is created and execute until a first value is yielded (or until it completes, in the case of an empty sequence).
  4. When the end() method is called, the sequence returns an iterator that represents the fact that the generator has completed and there are no more values to enumerate.
  5. The operator == () should behave as expected, returning true if both iterators are at the same position of the same sequence, or both pointing at the end of the sequence.
  6. The operator *() will return the value generated by the last yield statement (i.e., the current value of the sequence).
  7. At each step of the iteration, when operator ++() is called, the execution of the generator function will be resumed, and will continue until either the next yield statement updates the current value or until the function returns.

Putting all together, we can begin to write some code for the sequence_iterator class:

template<typename T>
class sequence_iterator
{
public:
    typedef ??? functor;

    sequence_iterator(functor func) {
        // initializes the iterator from the generator functor, and executes
        // the functor until it terminates or yields.
    }
    sequence_iterator() {
        // must represent the end of the sequence
    }
    bool operator == (const sequence_iterator& rhs) {
        // true if the iterators are at the same position.
    }
    bool operator != (const sequence_iterator& rhs) {
        return !(*this==rhs);
    }
    const T& operator * () const {
        return _currentVal;
    }
    sequence_iterator operator ++ () {
        // resume execution
        return *this;
    }

private:
    T _currentVal;
};

The behavior of the iterator is fairly straightforward, but there are a few interesting things to note. The first is that evidently a generator function does not do what it says: looking at the code of the fibonacci() function there is no statement that actually returns a sequence&lt;T&gt;; what the code does is simply to yield the sequence elements, one at a time.

So who creates the sequence<T> object? Clearly, the implementation of generators cannot be purely library-based. We can put in a library the code for the sequence<T> and for its iterators, we can also put in a library the platform-dependent code that manages the suspension and resumptions of generators. But it will be up to the compiler to generate the appropriate code that creates a sequence<T> object for a generator function. More on this later.

Also, we should note that there is no asynchrony or concurrency involved in this process. The function could resume in the same thread where it suspended.

Generators as coroutines

The next step is to implement the logic to seamlessly pause and resume a generator. A generator can be seen as an asymmetric coroutine, where the asymmetry lies in the fact that the control can be only yielded back to the caller, contrary to the case of symmetric coroutines that can yield control to any other coroutine at any time.

Unfortunately coroutines cannot be implemented in a platform-independent way. On Windows we can use Win32 Fibers (as I described in this very old post), while on POSIX systems we can use the makecontext()/swapcontext() API. There is also a very nice Boost library that we could leverage for this purpose.

But let’s ignore the problems of portability, for the moment, and assume that we have a reliable way to implement coroutines. How should we use them in an iterator? We can encapsulate the non-portable code in a class __resumable_func that exposes this interface:

template <typename TRet>
class __resumable_func
{
    typedef std::function<void(__resumable_func&)> TFunc;

public:
    __resumable_func(TFunc func);

    void yieldReturn(const TRet& value);
    void yieldBreak();
    void resume();

    const TRet& getCurrent() const;
    bool isEos() const;
};

The class is templatized on the type of the values produced by the generator and provides methods to yield one value (yieldReturn()), to retrieve the current value (i.e., the latest value yielded) and to resume the execution and move to the next value.

It should also provide methods to terminate the enumeration (yieldBreak()) and to tell if we have arrived at the end of the sequence (isEos()).

The function object passed to the constructor represents the generator function itself that we want to run. More precisely, it is the function that will be executed as a coroutine, and its prototype tells us that this function, in order to be able to suspend execution, needs a reference to the __resumable_func object that is running the coroutine itself.

In fact the compiler should transform the code of a generator into the (almost identical) code of a lambda that uses the __resumable_func object to yield control and emit a new value.

For example, going back again to our fibonacci() generator, we could expect the C++ compiler to transform the code we wrote:

sequence<int> fibonacci() resumable
{
    int a = 0;
    int b = 1;
    while (true)
    {
        yield b;
        int tmp = a + b;
        a = b;
        b = tmp;
    }
}

into this lambda expression:

auto __fibonacci_func([](__resumable_func<int>& resFn) {
    int a = 0;
    int b = 1;
    while (true)
    {
        resFn.yieldReturn(b);
        int tmp = a + b;
        a = b;
        b = tmp;
    }
});

where the yield statement has been transformed into a call to __resumable_func::yieldReturn().

Likewise, client code that invokes this function, like:

sequence<int> fibs = fibonacci();

should be transformed by the compiler into a call to the sequence constructor, passing this lambda as argument:

sequence<int> fibs(__fibonacci_func);

Sequence iterators

We can ignore the details of the implementation of __resumable_func<T> coroutines for the moment and, assuming that we have them working, we can now complete the implementation of the sequence_iterator class:

template <typename T>
class sequence_iterator
{
    // let sequence_impl<T> invoke the private constructors below
    template <typename U> friend class sequence_impl;

    std::unique_ptr<__resumable_func<T>> _resumableFunc;

    sequence_iterator() :
        _resumableFunc(nullptr)
    {
    }

    sequence_iterator(const std::function<void(__resumable_func<T>&)> func) :
        _resumableFunc(new __resumable_func<T>(func))
    {
    }

    sequence_iterator(const sequence_iterator& rhs) = delete;
    sequence_iterator& operator = (const sequence_iterator& rhs) = delete;
    sequence_iterator& operator = (sequence_iterator&& rhs) = delete;

public:
    sequence_iterator(sequence_iterator&& rhs) :
        _resumableFunc(std::move(rhs._resumableFunc))
    {
    }

    sequence_iterator& operator++()
    {
        _ASSERT(_resumableFunc != nullptr);
        _resumableFunc->resume();
        return *this;
    }

    bool operator==(const sequence_iterator& _Right) const
    {
        if (_resumableFunc == _Right._resumableFunc) {
            return true;
        }

        if (_resumableFunc == nullptr) {
            return _Right._resumableFunc->isEos();
        }

        if (_Right._resumableFunc == nullptr) {
            return _resumableFunc->isEos();
        }

        return (_resumableFunc->isEos() == _Right._resumableFunc->isEos());
    }

    bool operator!=(const sequence_iterator& _Right) const
    {
        return (!(*this == _Right));
    }

    const T& operator*() const
    {
        _ASSERT(_resumableFunc != nullptr);
        return (_resumableFunc->getCurrent());
    }
};

The logic here is very simple. Internally, a sequence_iterator contains a __resumable_func object, to run the generator as a coroutine. The default constructor creates an iterator that points at the end of the sequence. Another constructor accepts as its argument the generator function that we want to run, starts executing it in a coroutine, and lets it run until it either yields a value or terminates, at which point control returns to the constructor. In this way we create an iterator that points at the beginning of the sequence.

If a value was yielded, we can call the dereference-operator to retrieve it from the __resumable_func object. If the function terminated, instead, the iterator will already point at the end of the sequence. The equality operator takes care of equating an iterator whose function has terminated to the end()-iterators created with the default constructor. Incrementing the iterator means resuming the execution of the coroutine, from the point it had suspended, giving it the opportunity to produce another value.

Note that, since the class owns the coroutine object, we disable copy constructors and assignment operators and only declare the move constructor, to pass the ownership of the coroutine.

Composable sequence operators

Almost there! We have completed our design, but there are still a few details to work out. The most interesting are related to the lifetime and copyability of sequence objects. What should happen with code like this?

sequence<int> fibs1 = fibonacci();
sequence<int> fibs2 = fibs1;
for (auto it1 : fibs1) {
    for (auto it2 : fibs2) {
        ...
    }
}

If we look at how we defined class sequence<T>, apparently there is no reason why we should prevent the copy of sequence objects. In fact, sequence<T> is an immutable class. Its only data member is the std::function object that wraps the functor we want to run.

However, even though we don’t modify this functor object, we do execute it. This object could have been constructed from a lambda expression that captured some variables, either by value or by reference. Since one of the captured variables could be a reference to the same sequence<T> object that created that iterator, we need to ensure that the sequence object will always outlive its functors, and allowing copy-semantics suddenly becomes complicated.

This brings us to LINQ and to the composability of sequences. Anyone who has worked with C# knows that what makes enumerable types truly powerful and elegant is the ability to apply chains of simple operators that transform the elements of a sequence into another sequence. LINQ to Objects is built on the concept of a data pipeline: we start with a data source which implements IEnumerable<T>, and we can compose together a number of query operators, defined as extension methods to the Enumerable class.

For example, this very, very useless query in C# generates the sequence of all square roots of odd integers between 0 and 10:

var result = Enumerable.Range(0, 10)
    .Where(n => n%2 == 1)
    .Select(n => Math.Sqrt(n));

Similarly, to make the C++ sequence<T> type really powerful we should make it composable and enrich it with a good range of LINQ-like operators to generate, filter, aggregate, group, sort and generally transform sequences.

These are just a few of the operators that we could define in the sequence<T> class:

template <typename T>
class sequence
{
public:
    [...]
    static sequence<int> range(int from, int to);
    template <typename TResult> sequence<TResult> select(std::function<TResult(T)> selector);
    sequence<T> where(std::function<bool(T)> predicate);
};

to finally be able to write the same (useless) query:

sequence<double> result = sequence<int>::range(0, 10)
    .where([](int n) { return n % 2 == 1; })
    .select([](int n) { return sqrt(n); });

Let’s try to implement select(), as an experiment. It is conceptually identical to the lazy_tform() function we saw before, but now defined in the sequence class. A very naïve implementation could be as follows:

// Projects each element of a sequence into a new form. (NOT WORKING!)
template <typename TResult>
sequence<TResult> select(std::function<TResult(T)> selector)
{
    auto func = [this, selector](__resumable_func<TResult>& rf) {
        for (T t : *this)
        {
            auto val = selector(t);
            rf.yieldReturn(val);
        }
    };
    return sequence<TResult>(func);
}

It should now be clear how it works: first we create a generator functor, in this case with a lambda expression, and then we return a new sequence constructed over this functor. The point is that the lambda needs to capture the “parent” sequence object to be able to iterate through the values of its sequence.

Unfortunately this code is very brittle. What happens when we compose more operators, using the result of one as the input of the next one in the chain? When we write:

sequence<double> result = sequence<int>::range(0, 10)
    .where([](int n) { return n % 2 == 1; })
    .select([](int n) { return sqrt(n); });

there are (at least) three temporary objects created here, of type sequence&lt;T&gt;, and their lifetime is tied to that of the expression, so they are destroyed as soon as the whole statement completes.

A chain of sequences

The situation is like in the figure: the functor of each sequence in the chain is a lambda that has captured a pointer to the previous sequence object. The problem is in the deferred execution: nothing really happens until we enumerate the resulting sequence through its iterator, but as soon as we do so each sequence starts pulling values from its predecessor, which has already been deleted.

Temporary objects and deferred execution really do not get along nicely at all. On one hand in order to compose sequences we have to deal with temporaries that can be captured in a closure and then deleted long before being used. On the other hand, the sequence iterators, and their underlying coroutines, should not be copied and can outlive the instance of the sequence that generated them.
We can enforce move semantics on the sequence<T> class, but then what do we capture in a generator like select() that acts on a sequence?

As often happens, a possible solution requires adding another level of indirection. We introduce a new class, sequence_impl<T>, which represents a particular application of a generator function closure:

template <typename T>
class sequence_impl
{
public:
    typedef sequence_iterator<T> iterator;
    typedef std::function<void(__resumable_func<T>&)> functor;

private:
    const functor _func;

    sequence_impl(const sequence_impl& rhs) = delete;
    sequence_impl(sequence_impl&& rhs) = delete;
    sequence_impl& operator = (const sequence_impl& rhs) = delete;
    sequence_impl& operator = (sequence_impl&& rhs) = delete;

public:
    sequence_impl(functor func) : _func(std::move(func)) {}

    sequence_iterator<T> begin() const
    {
        // return iterator for beginning of sequence
        return iterator(_func);
    }
    sequence_iterator<T> end() const
    {
        // return iterator for end of sequence
        return iterator();
    }
};

A sequence_impl&lt;T&gt; is neither copyable nor movable and only provides methods to iterate through it.

The sequence<T> class now keeps only a shared pointer to the unique instance of a sequence_impl<T> that represents that particular application of the generator function. Now we can support chained sequences by allowing move semantics on the sequence<T> class.

template <typename T>
class sequence
{
    std::shared_ptr<sequence_impl<T>> _impl;

    sequence(const sequence& rhs) = delete;
    sequence& operator = (const sequence& rhs) = delete;

public:
    typedef typename sequence_impl<T>::iterator iterator;
    typedef typename sequence_impl<T>::functor functor;

    sequence(functor func) :
        _impl(std::make_shared<sequence_impl<T>>(std::move(func))) {
    }
    sequence(sequence&& rhs) {
        _impl = std::move(rhs._impl);
    }
    sequence& operator = (sequence&& rhs) {
        _impl = std::move(rhs._impl);
        return *this;
    }

    iterator begin() const {
        return _impl->begin();
    }
    iterator end() const {
        return _impl->end();
    }
};

The diagram below illustrates the relationships between the classes involved in the implementation of lazy sequences:

[class diagram: a sequence&lt;T&gt; holds a shared_ptr to its sequence_impl&lt;T&gt;, which creates the sequence_iterator&lt;T&gt;s that drive the __resumable_func&lt;T&gt; coroutine]

LINQ operators

OK, now we are really almost done. The only thing left to do, if we want, is to write a few sequence-manipulation operators, modeled on the example of LINQ-to-Objects. I’ll list just a few, as examples:

// Determines whether all elements of a sequence satisfy a condition.
bool all(std::function<bool(T)> predicate)
{
    if (nullptr == predicate) {
        throw std::exception();
    }

    for (auto t : *_impl)
    {
        if (!predicate(t)) {
            return false;
        }
    }
    return true;
}

// Returns an empty sequence
static sequence<T> empty()
{
    auto fn = [](__resumable_func<T>& rf) {
        rf.yieldBreak();
    };
    return sequence<T>(fn);
}

// Generates a sequence of integral numbers within a specified range [from, to).
static sequence<int> range(int from, int to)
{
    if (to < from) {
        throw std::exception();
    }

    auto fn = [from, to](__resumable_func<int>& rf) {
        for (int i = from; i < to; i++) {
            rf.yieldReturn(i);
        }
    };
    return sequence<int>(fn);
}

// Projects each element of a sequence into a new form.
template <typename TResult>
sequence<TResult> select(std::function<TResult(T)> selector)
{
    if (nullptr == selector) {
        throw std::exception();
    }

    std::shared_ptr<sequence_impl<T>> impl = _impl;
    auto fn = [impl, selector](__resumable_func<TResult>& rf) {
        for (T t : *impl)
        {
            auto val = selector(t);
            rf.yieldReturn(val);
        }
    };
    return sequence<TResult>(fn);
}

// Returns a specified number of contiguous elements from the start of a sequence.
sequence<T> take(int count)
{
    if (count < 0) {
        throw std::exception();
    }

    std::shared_ptr<sequence_impl<T>> impl = _impl;
    auto fn = [impl, count](__resumable_func<T>& rf) {
        auto it = impl->begin();
        for (int i = 0; i < count && it != impl->end(); i++, ++it) {
            rf.yieldReturn(*it);
        }
    };
    return sequence<T>(fn);
}

// Filters a sequence of values based on a predicate.
sequence<T> where(std::function<bool(T)> predicate)
{
    if (nullptr == predicate) {
        throw std::exception();
    }

    std::shared_ptr<sequence_impl<T>> impl = _impl;
    auto fn = [impl, predicate](__resumable_func<T>& rf) {
        for (auto item : *impl)
        {
            if (predicate(item)) {
                rf.yieldReturn(item);
            }
        }
    };
    return sequence<T>(fn);
}

We could write many more, but I think these should convey the idea.

Example: a prime numbers generator

As a final example, the following query lazily produces the sequence of prime numbers smaller than a given maximum, using a very simple, brute-force algorithm. It is definitely not the fastest generator of prime numbers, and it is maybe a little cryptic, but it’s undoubtedly quite compact!

sequence<int> primes(int max)
{
    return sequence<int>::range(2, max)
        .where([](int i) {
            return sequence<int>::range(2, (int)sqrt(i) + 2)
                .all([i](int j) { return j >= i || (i % j) != 0; });
            });
}

Conclusion

In this article I rambled about generators in C++, describing a new sequence&lt;T&gt; type that models lazy enumerators and that could be implemented as an extension of resumable functions, as specified in N3858. I described a possible implementation based on coroutines and introduced the possibility of extending the sequence class with a set of composable operators.

If you are curious and want to play with my sample implementation, you can find a copy of the sources here. Nothing too fancy, just the code that I showed in this post.

Appendix – Coroutines in Win32

Having completed my long ramble on the “platform independent” aspects of C++ generators, it’s time to go back to the point we left open: how to implement, on Windows, the coroutines that we encapsulated in the __resumable_func class?

We saw that the Visual Studio CTP comes with a first implementation of resumable functions, built over the PPL task library and using Win32 fibers as side-stacks. Even though the CTP does not support generator functions yet, my first idea was to just extend the &lt;pplawait.h&gt; library to implement them. However, the code there is specialized for resumable functions that suspend awaiting for some task, and it turns out that we can reuse only part of it because, even if we are still dealing with resumable functions, the logic of await and that of yield are quite different.

In the case of await, functions can be suspended (possibly multiple times) waiting for some other task to complete. This means switching to a fiber associated to the task after having set up a continuation that will be invoked after the task completes, to switch the control back to the current fiber. When the function terminates, the control goes back to the calling fiber, returning the single return value of the async resumable function.

In the case of yield, we never suspend to call external async methods. Instead, we can suspend multiple times going back to the calling fiber, each time by returning one of the values that compose the sequence. So, while the implementation of the await keyword needs to leverage the support of PPL tasks, the concept of generator functions does not imply any concurrency or multithreading and using the PPL is not necessary.

Actually, there are ways to implement yield with await, but I could not find a simple way to use the new __await keyword without spawning new threads (maybe this would be possible with a custom PPL scheduler?).

So I chose to write the code for coroutines myself; the idea here is not very different from the one I described in a very old post (it looks like I keep rewriting the same post :-)) but now I can take advantage of the fiber-based code from the CTP’s <pplawait.h> library.

Win32 Fibers

Let’s delve into the details of the implementation. First of all, let me summarize once again the Win32 Fiber API.

Fibers were added to Windows NT to support cooperative multitasking. They can be thought of as lightweight threads that must be manually scheduled by the application. In other words, fibers are a perfect tool with which to implement coroutines.

When a fiber is created, with CreateFiber, it is passed a fiber-start function. The OS then assigns it a separate stack and sets up execution to begin at this fiber-start function. To schedule a fiber we need to “switch” to it manually with SwitchToFiber; once it is running, a fiber can suspend itself only by explicitly yielding execution to another fiber, again by calling SwitchToFiber.

SwitchToFiber only works from one fiber to another, so the first thing to do is to convert the current thread into a fiber, with ConvertThreadToFiber. Finally, when we are done using fibers, we can convert the main fiber back to a normal thread with ConvertFiberToThread.

The __resumable_func class

We want to put all the logic to handle the suspension and resumption of a function in the __resumable_func<T> class, as described before.

In our case we don’t need symmetric coroutines; we just need the ability of returning control to the calling fiber. So our class will contain a handle to the “caller” fiber and a handle to the fiber we want to run.

#include <functional>
#include <pplawait.h>

template <typename TRet>
class __resumable_func : __resumable_func_base
{
    typedef std::function<void(__resumable_func&)> TFunc;

    TFunc _func;
    TRet _currentValue;
    LPVOID _pFiber;
    LPVOID _pCallerFiber;
    Concurrency::details::__resumable_func_fiber_data* _pFuncData;

public:
    __resumable_func(TFunc func);
    ~__resumable_func();

    void yieldReturn(TRet value);
    void yieldBreak();
    void resume();

    const TRet& getCurrent() const { return _currentValue; }
    bool isEos() const { return _pFiber == nullptr; }

private:
    static void yield();
    static VOID CALLBACK ResumableFuncFiberProc(PVOID lpParameter);
};

The constructor stores a copy of the generator function to run, creates a new fiber object specifying ResumableFuncFiberProc as the function to execute, and immediately switches the execution to this fiber:

    __resumable_func(TFunc func) :
        _currentValue(TRet()),
        _pFiber(nullptr),
        _func(func),
        _pFuncData(nullptr)
    {
        // Convert the current thread to a fiber. This is needed because the thread needs to "be"
        // a fiber in order to be able to switch to another fiber.
        ConvertCurrentThreadToFiber();
        _pCallerFiber = GetCurrentFiber();

        // Create a new fiber (or re-use an existing one from the pool)
        _pFiber = Concurrency::details::POOL CreateFiberEx(Concurrency::details::fiberPool.commitSize,
            Concurrency::details::fiberPool.allocSize, FIBER_FLAG_FLOAT_SWITCH, &ResumableFuncFiberProc, this);
        if (!_pFiber) {
            throw std::bad_alloc();
        }

        // Switch to the newly created fiber. When this "returns" the functor has either returned,
        // or issued an 'yield' statement.
        ::SwitchToFiber(_pFiber);

        _pFuncData->suspending = false;
        _pFuncData->Release();
    }

The fiber will start from the fiber procedure, which has the only task of running the generator function in the context of the fiber:

    // Entry proc for the Resumable Function Fiber.
    static VOID CALLBACK ResumableFuncFiberProc(PVOID lpParameter)
    {
        LPVOID threadFiber;

        // This function does not formally return, due to the SwitchToFiber call at the bottom.
        // This scope block is needed for the destructors of the locals in this block to fire
        // before we do the SwitchToFiber.
        {
            Concurrency::details::__resumable_func_fiber_data funcDataOnFiberStack;
            __resumable_func* pThis = (__resumable_func*)lpParameter;

            // The callee needs to setup some more stuff after we return (which would be either on
            // yield or an ordinary return). Hence the callee needs the pointer to the func_data
            // on our stack. This is not unsafe since the callee has a refcount on this structure
            // which means the fiber will continue to live.
            pThis->_pFuncData = &funcDataOnFiberStack;

            Concurrency::details::fiberPool.SetFiberData(&funcDataOnFiberStack);

            funcDataOnFiberStack.threadFiber = pThis->_pCallerFiber;
            funcDataOnFiberStack.resumableFuncFiber = GetCurrentFiber();

            // Finally calls the function in the context of the fiber. The execution can be
            // suspended by calling yield
            pThis->_func(*pThis);

            // Here the function has completed. We set return to true meaning this is the
            // final 'real' return and not one of the 'yield' returns.
            funcDataOnFiberStack.returned = true;
            pThis->_pFiber = nullptr;
            threadFiber = funcDataOnFiberStack.threadFiber;
        }

        // Return to the calling fiber.
        ::SwitchToFiber(threadFiber);

        // On a normal fiber this function won't exit after this point. However, if the fiber is
        // in a fiber-pool and re-used we can get control back. So just exit this function, which
        // will cause the fiber pool to spin around and re-enter.
    }

There are two ways to suspend the execution of the generator function running in the fiber and to yield control back to the caller. The first is to yield a value, which will be stored in a data member:

    void yieldReturn(TRet value)
    {
        _currentValue = value;
        yield();
    }

The second is to immediately terminate the sequence, for example with a return statement or reaching the end of the function. The compiler should translate a return into a call to the yieldBreak method:

void yieldBreak()
{
    _pFiber = nullptr;
    yield();
}

To yield the control we just need to switch back to the calling fiber:

    static void yield()
    {
        _ASSERT(IsThreadAFiber());
        Concurrency::details::__resumable_func_fiber_data* funcDataOnFiberStack =
            Concurrency::details::__resumable_func_fiber_data::GetCurrentResumableFuncData();

        // Add-ref's the fiber. Even though there can only be one thread active in the fiber
        // context, there can be multiple threads accessing the fiber data.
        funcDataOnFiberStack->AddRef();

        _ASSERT(funcDataOnFiberStack);
        funcDataOnFiberStack->verify();

        // Mark as busy suspending. We cannot run the code in the 'then' statement
        // concurrently with the await doing the setting up of the fiber.
        _ASSERT(!funcDataOnFiberStack->suspending);
        funcDataOnFiberStack->suspending = true;

        // Make note of the thread that we're being called from (Note that we'll always resume
        // on the same thread).
        funcDataOnFiberStack->awaitingThreadId = GetCurrentThreadId();

        _ASSERT(funcDataOnFiberStack->resumableFuncFiber == GetCurrentFiber());

        // Return to the calling fiber.
        ::SwitchToFiber(funcDataOnFiberStack->threadFiber);
    }

Once we have suspended, incrementing the iterator will resume the execution by calling resume, which will switch to this object’s fiber:

    void resume()
    {
        _ASSERT(IsThreadAFiber());
        _ASSERT(_pFiber != nullptr);
        _ASSERT(_pFuncData != nullptr);
        _ASSERT(!_pFuncData->suspending);
        _ASSERT(_pFuncData->awaitingThreadId == GetCurrentThreadId());

        // Switch to the fiber. When this "returns" the functor has either returned, or issued
        // an 'yield' statement.
        ::SwitchToFiber(_pFiber);

        _ASSERT(_pFuncData->returned || _pFuncData->suspending);
        _pFuncData->suspending = false;
        if (_pFuncData->returned) {
            _pFiber = nullptr;
        }
        _pFuncData->Release();
    }

The destructor just needs to convert the current fiber back to a normal thread, but only when there are no more fibers running in the thread. For this reason we need to keep a per-thread fiber count, which is incremented every time we create a __resumable_func and decremented every time we destroy it.

~__resumable_func()
{
    if (_pCallerFiber != nullptr) {
        ConvertFiberBackToThread();
    }
}

class __resumable_func_base
{
    __declspec(thread) static int ts_count;

protected:
    // Convert the thread to a fiber.
    static void ConvertCurrentThreadToFiber()
    {
        if (!IsThreadAFiber())
        {
            // Convert the thread to a fiber. Use FIBER_FLAG_FLOAT_SWITCH on x86.
            LPVOID threadFiber = ConvertThreadToFiberEx(nullptr, FIBER_FLAG_FLOAT_SWITCH);
            if (threadFiber == NULL) {
                throw std::bad_alloc();
            }
            ts_count = 1;
        }
        else
        {
            ts_count++;
        }
    }

    // Convert the fiber back to a thread.
    static void ConvertFiberBackToThread()
    {
        if (--ts_count == 0)
        {
            if (ConvertFiberToThread() == FALSE) {
                throw std::bad_alloc();
            }
        }
    }
};
__declspec(thread) int __resumable_func_base::ts_count = 0;

And this is all we need to have resumable generators in C++, on Windows. The complete source code can be found here.

Async-Await in C++

In the previous post we had a quick look at the problem of writing concurrent, asynchronous code with C++. We saw that the <future> library, introduced with C++11, offers a partial solution to this problem by separating the act of initiating an operation from the act of waiting for its result. But we saw that the act of waiting for a result is still synchronous, and that std::futures are not easily composable.

We talked of the platform-specific improvements offered by Microsoft’s PPL tasks, which have become the model for a few interesting changes proposed to the C++ standard library.

But are these improvements enough?

The problems with Tasks

Having composable futures would certainly be a big improvement for C++. But the experience with C# tells us that it’s not a perfect solution. A purely library-based solution is bound to have limitations. In particular, .NET’s TPL has the defect of making the code overly complex.

It is difficult to write robust, readable and debuggable code with TPL. It is too easy to end up with spaghetti code: a large number of Task<T>s interconnected in inextricable chains of continuations.

I found out the hard way, when I was given, by my last employer, a .NET component to maintain which had been architected as an incredibly convoluted graph of interconnected TPL Tasks, often joined with conditional ContinueWith().

Loops and conditionals

It is particularly difficult to write iterations and branches with tasks.

In my previous post I used as example a copyFile function that asynchronously read all the file content into a single string and copied it into another file. Of course such an implementation would not be practical with very large files, so let’s rewrite it so that it will read and write a file chunk by chunk.

Here I am using the PPL, but very similar code could be written with std::future (especially once it supports task continuations with future::then).
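
For comparison, here is a rough std::future sketch of the same chunk-reading idea, written against generic streams so that the snippet is self-contained (readChunk and copyAll are hypothetical names, not part of any library):

```cpp
#include <cassert>
#include <future>
#include <sstream>
#include <string>
#include <vector>

// Reads the next chunk asynchronously; an empty string signals end of stream.
std::future<std::string> readChunk(std::istream& in, int chunkLength)
{
    return std::async(std::launch::async, [&in, chunkLength] {
        std::vector<char> buffer(chunkLength);
        in.read(buffer.data(), chunkLength);
        return std::string(buffer.data(), static_cast<size_t>(in.gcount()));
    });
}

// Copies the whole stream chunk by chunk, blocking on each future.
void copyAll(std::istream& in, std::ostream& out)
{
    std::string chunk;
    while (chunk = readChunk(in, 4).get(), !chunk.empty())
        out.write(chunk.c_str(), chunk.size());
}
```

Like the PPL version below, it still blocks on get() after every chunk; the rest of this section is precisely about how to avoid that.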

We can easily write an iterative function that uses tasks to read and write asynchronously every chunk of a file, but that still blocks waiting for these tasks to complete:

Concurrency::task<string> readFileChunk(ifstream& file, int chunkLength)
{
    return Concurrency::create_task([&file, chunkLength](){
        vector<char> buffer(chunkLength);
        file.read(&buffer[0], chunkLength);
        return string(&buffer[0], (unsigned int)file.gcount());
    });
}

Concurrency::task<void> writeFileChunk(ofstream& file, const string& chunk)
{
    return Concurrency::create_task([&file, chunk](){
        file.write(chunk.c_str(), chunk.length());
    });
}

void copyFile_get(const string& inFilePath, const string& outFilePath)
{
    ifstream inFile(inFilePath, ios::binary | ios::ate);
    inFile.seekg(0, inFile.beg);
    ofstream outFile(outFilePath, ios::binary);

    string chunk;
    while (chunk = readFileChunk(inFile, 4096).get(), !chunk.empty()) {
        writeFileChunk(outFile, chunk).get();
    }
}

The problem is that this code is not much better than a purely synchronous loop of blocking calls to read and write the file; if our thread schedules asynchronous operations but then immediately blocks waiting for them to complete, we cannot really do much else concurrently.

It would be better to chain a sequence of tasks that read and write the file chunk by chunk, so that they can be scheduled one after the other leaving our thread free to do other work while these tasks are blocked in some I/O operation.

It turns out that writing a “task loop” correctly is not trivial, especially when there are many possible error cases or exceptions to handle. It is actually so tricky that there are MSDN pages that explain what to do and what the possible pitfalls are, and the PPL Asynchronous Sample Pack even provides sample code for this.

The code I put together is neither clean nor concise:

task<shared_ptr<string>> readFileChunk(shared_ptr<ifstream> file, int chunkLength)
{
    return create_task([file, chunkLength](){
        vector<char> buffer(chunkLength);
        file->read(&buffer[0], chunkLength);
        return make_shared<string>(&buffer[0], (unsigned int)file->gcount());
    });
}

task<bool> writeFileChunk(shared_ptr<ofstream> file, shared_ptr<string> chunk)
{
    return create_task([file, chunk](){
        file->write(chunk->c_str(), chunk->length());
        return chunk->length() == 0;
    });
}

task<void> copyFile_repeat(shared_ptr<ifstream> inFile, shared_ptr<ofstream> outFile)
{
    return readFileChunk(inFile, 4096)
        .then([=](shared_ptr<string> chunk){
            return writeFileChunk(outFile, chunk);
        })
        .then([=](bool eof){
            if (!eof) {
                return copyFile_repeat(inFile, outFile);
            }
            else {
                return task_from_result();
            }
        });
}

task<void> copyFile_then(const string& inFilePath, const string& outFilePath)
{
    auto inFile = make_shared<ifstream>(inFilePath, ios::binary | ios::ate);
    inFile->seekg(0, inFile->beg);
    auto outFile = make_shared<ofstream>(outFilePath, ios::binary);

    return copyFile_repeat(inFile, outFile);
}

It is probably possible to do better and write cleaner code for a simple continuation chain executed in a loop like this one, but it is also quite evident that, as the algorithm becomes more complicated, the asynchronous code can become very convoluted.

Debugging

And then there is the problem of debugging: if we set a breakpoint in the file->write(…) line, for example, when the execution stops there the debugger will present us with a rather strange call stack. With continuations we lose the usual “causality chain” of return addresses, typical of synchronous code. Here the lambdas that compose the body of the tasks are scattered through the call stack and distributed across several threads, and sometimes it can be difficult to understand what is going on.

[Image: debugger call stacks showing task continuations scattered across several threads]

The async-await pattern

The complexity of working with Tasks is well known to C# programmers, who have already had a few years to experiment and gain experience with the .NET TPL. This is the reason why the biggest improvement of C# 5.0 was the introduction of the Async/Await pattern, designed on the model of F# asynchronous workflows.

To see how it works, let’s start by writing in C# an iterative but blocking implementation of our CopyFile function. This can be done very easily:

        private void CopyChunk(Stream input, Stream output)
        {
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = input.Read(buffer, 0, buffer.Length)) != 0)
            {
                output.Write(buffer, 0, bytesRead);
            }
        }

        public void CopyFile(string inFile, string outFile)
        {
            using (StreamReader sr = new StreamReader(inFile))
            {
                using (StreamWriter sw = new StreamWriter(outFile))
                {
                    CopyChunk(sr.BaseStream, sw.BaseStream);
                }
            }
        }

Let’s say that we want now to make the function completely asynchronous. Ignoring the fact that the framework actually provides a Stream.CopyToAsync, we can use other asynchronous .NET APIs, like Stream.ReadAsync and Stream.WriteAsync that return a Task, together with async/await, and write this:

        private async Task CopyChunk_Async(Stream input, Stream output)
        {
            byte[] buffer = new byte[4096];
            int bytesRead;
            while ((bytesRead = await input.ReadAsync(buffer, 0, buffer.Length)) != 0)
            {
                await output.WriteAsync(buffer, 0, bytesRead);
            }
        }

        public async Task CopyFile_Async(string inFile, string outFile)
        {
            using (StreamReader sr = new StreamReader(inFile))
            {
                using (StreamWriter sw = new StreamWriter(outFile))
                {
                   await CopyChunk_Async(sr.BaseStream, sw.BaseStream);
                }
            }
        }

Now, this is really beautiful! We made only a few small changes to the structure of the code: we added the async modifier, changed the return type of our methods, and added a strange await keyword in the places where we call asynchronous functions. The logic of the function has not changed at all, and yet, it now executes in a completely different way. Let’s see how.

In C# a method can be declared as asynchronous by adding the async modifier to the method declaration. For example:

public static async Task<int> FooAsync() {…}

An asynchronous method is like a regular C# method, but its return type can only be of type void, Task or Task<T>, and it can also contain await expressions, like:

int result = await FooAsync();

Awaiting on an expression means launching the asynchronous execution of that expression (usually an async method) and, if it has not yet completed, immediately yielding the execution to the caller.

So asynchronous methods are, like iterator methods, a kind of coroutine. They are resumable: their execution can “magically” pause at a few points identified by the await expressions, and then “magically” resume when the asynchronous expression has completed. While the method is paused waiting for the completion, its caller continues to run. This means that if the method was called from a UI message loop, for example, the UI thread can continue processing messages and the application will remain responsive. If the method was called by a worker thread in a web server, the thread will be free to serve more requests.

But just like iterator methods, in C# asynchronous methods are not really coroutines. They are instead implemented with a state machine. The details are quite complicated (a very good description can be found here and here), but what happens is that the compiler transforms the function body by attaching a task continuation to the asynchronous operation so that the execution will resume from the point where it had left.
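
A crude way to visualize that transformation in portable C++ (with a toy callback-based primitive, fooAsync, standing in for the real task machinery, and all names hypothetical): everything after the await becomes a continuation attached to the operation.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Toy "awaitable": here it completes synchronously, but the shape is the point.
// In real code the continuation would fire later, possibly on another thread.
void fooAsync(std::function<void(int)> continuation)
{
    continuation(42);
}

// Roughly what "int result = await FooAsync(); use(result);" desugars into:
std::string awaitDesugared()
{
    std::string log = "before;";
    fooAsync([&log](int result) {
        // everything written after the await lives inside the continuation
        log += "after:" + std::to_string(result);
    });
    return log;
}
```

The compiler-generated state machine is considerably more elaborate (it must handle loops, exceptions and multiple awaits), but the basic move of turning "the rest of the function" into a callback is the same.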

In which thread does the awaitable expression execute, and in which thread will the async method resume? It’s up to the task scheduler to decide; asynchrony does not mean parallelism, and a method does not need to run on another thread to be asynchronous. The scheduler may run the task later in the same thread, or it may run it on a worker thread from the ThreadPool. The execution will then resume on the right thread (for example the UI thread in a GUI app) according to the current synchronization context. From the point of view of the developer, this means that they cannot rely on an async method continuing its execution on the same thread where it started (with all the consequent multithreading headaches).
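
The same caveat applies even to plain std::async: with std::launch::async the standard guarantees the callable runs as if in a new thread, so the task observes a different thread id than its caller (a minimal sketch, unrelated to the PPL scheduler):

```cpp
#include <cassert>
#include <future>
#include <thread>

// Returns true when the asynchronous task ran on a different thread than the
// caller: with std::launch::async this is guaranteed by the standard.
bool ranOnAnotherThread()
{
    std::thread::id caller = std::this_thread::get_id();
    std::future<std::thread::id> worker =
        std::async(std::launch::async, [] { return std::this_thread::get_id(); });
    return worker.get() != caller;
}
```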

C++ resumable functions

So, everything is nice in the managed world of C#, but what about native programming? Is there anything like async/await that we can use with our futures? We can find the answer in N3858, another proposal made by Gustafsson et al. for C++17.

This time the changes proposed are to the language itself and not just to the library. The idea is to introduce the equivalent of C# async methods in the form of resumable functions. They can be thought of as the basis for adding real coroutine support to C++, and are not strictly related to the <future> library, even though they have been defined especially to improve the usability of futures and promises.

Resumable functions would have almost exactly the same semantic of async/await in C#: two new keywords would be added to the language:

– The resumable keyword, added after the argument list, defines a function as being suspendable and resumable, in the way described before. It is the equivalent of C#’s async.

– Like in C#, the await keyword in the body of a resumable function identifies a suspension point, one of the places where the function may suspend execution and then resume it later (possibly in a different thread) when some asynchronous operation completes.

For example:

        future<int> abs_r(future<int> i_op) resumable
        {
            int i = await i_op;
            return (i < 0) ? -i : i;
        }

This would define a resumable function that takes a future<int> as argument, suspends itself waiting for the future result to be available, and produces an integer as result. Of course, a resumable function can only return a future<T> or shared_future<T>. It completes and produces its final value only when a return statement is executed.

Visual Studio 2013 CTP

Will this proposal be accepted? Will resumable functions become part of the standard? And is there a portable way to implement them?

I really cannot answer any of these questions. But, interestingly, it is already possible to play with resumable functions today. Last November Microsoft shipped the Visual Studio November 2013 CTP with a preview of several C++ features that will be released in the next version of Visual Studio and that improve the conformance to the C++ standard. Among them is also the first implementation of resumable functions.

To enable them we must change the “Platform Toolset” of our C++ project to use the updated compiler from the CTP, as shown in the figure. Also, we need to include the new header <pplawait.h>, which extends the PPL’s <ppltasks.h>.

[Image: the Platform Toolset setting in the Visual Studio project properties]

The syntax of the CTP’s resumable functions is very similar to the one proposed by Gustafsson, with a small change in the keywords. In order not to “taint” the language with (still) non-standard keywords, resumable has been renamed __resumable and await has become __await.

So, we can now rewrite one last time our copyFile method, this time using the CTP’s resumable functions:

#include <pplawait.h>

task<void> copyFile_res(const string inFilePath, const string outFilePath) __resumable
{
	auto inFile = make_shared<ifstream>(inFilePath, ios::binary | ios::ate);
	inFile->seekg(0, inFile->beg);
	auto outFile = make_shared<ofstream>(outFilePath, ios::binary);

	while (true)
	{
		shared_ptr<string> s = __await readFileChunk(inFile, 4096);
		if (s->empty()) { break; }
		__await writeFileChunk(outFile, s);
	}
}

int _tmain(int argc, _TCHAR* argv[])
{
    …
    copyFile_res(file1, file2).get();
}

We got there! Finally we have simple, readable, maintainable asynchronous C++ code. The only small complication is that we must pay close attention to the lifetime of the objects passed to the asynchronous/resumable methods. In C# things are easier, but here we don’t have a garbage collector, so it is often better to wrap the objects in shared_ptr<T>s to make sure they don’t get destroyed while they are still referenced by some async task. A small tribute to pay to the gods of asynchrony, anyway.
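
The lifetime problem, and the shared_ptr remedy, can be shown in miniature with portable C++; makeDeferredWrite and runScenario are hypothetical names, and a stored std::function stands in for a scheduled task:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <sstream>
#include <string>

// Simulates handing a "write this chunk later" continuation to a scheduler:
// the shared_ptr captured by value keeps the stream alive until the task runs.
std::function<std::string()> makeDeferredWrite(std::shared_ptr<std::ostringstream> out,
                                               std::string chunk)
{
    return [out, chunk] {
        out->write(chunk.c_str(), chunk.size());
        return out->str();
    };
}

std::string runScenario()
{
    auto out = std::make_shared<std::ostringstream>();
    auto task = makeDeferredWrite(out, "chunk");
    out.reset();   // the caller's reference goes away...
    return task(); // ...but the by-value capture still owns the stream
}
```

Had the lambda captured a raw pointer or a reference instead, task() would have dereferenced a destroyed stream; the by-value shared_ptr capture is exactly what the copyFile_res code above relies on.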

A kind of magic…

But how would resumable functions work, in practice? To learn more about this it is interesting to study the <pplawait.h> header and to watch the Channel 9 talk by Deon Brewis from last year’s GoingNative conference.

Gustafsson’s proposal describes two possible solutions:

  1. With a state machine, like in C#.
  2. With resumable side-stacks (aka Fibers).

The first implementation would be similar to how async/await is implemented in C#. The compiler would transform the code of the function, implementing the logic of a state machine that keeps track of the current state of the function. The implementation would be more complicated than in C# because C++ does not have garbage collection; the function variables could not live on the stack (because the stack frame would be lost when the function is suspended), so they would have to be stored in a heap-allocated activation frame.
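
To make the state-machine option concrete, here is a minimal hand-written sketch in portable C++ (all names hypothetical, no compiler magic involved) of what such a transformation could produce for a tiny resumable counter: the resume point and the “local variables” move from the stack into a heap-allocated frame.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hand-written state machine for a resumable count from 'current' to 'to',
// with a suspension point after each step. The frame lives on the heap, so it
// survives across suspensions.
struct CountFrame {
    int state = 0;   // resume point: which "label" to jump back to
    int current = 0; // a "local variable" hoisted into the frame
    int to = 0;

    // Runs until the next suspension point; returns false when done.
    bool moveNext() {
        switch (state) {
        case 0:
            while (current < to) {
                state = 1;   // remember where to resume
                return true; // "suspend", exposing 'current' to the caller
        case 1:              // resumption jumps back into the middle of the loop
                ++current;
            }
            state = 2;
            return false;    // the final "real" return
        default:
            return false;
        }
    }
};

std::vector<int> runCounter(int from, int to)
{
    auto frame = std::make_unique<CountFrame>(); // heap-allocated activation frame
    frame->current = from;
    frame->to = to;
    std::vector<int> seen;
    while (frame->moveNext())
        seen.push_back(frame->current);
    return seen;
}
```

Each call to moveNext resumes exactly where the previous one left off; this is essentially what the C# compiler emits, and what a state-machine-based C++ implementation would have to generate.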

But the second way of implementing resumable functions seems more elegant: each function would have its own side stack, a stack separate from the thread stack, allocated by the compiler when the resumable function is invoked and destroyed when the resumable function completes. What the proposal describes is actually using something identical to Fibers on Windows.

The very, very surprising thing is, in my opinion, that Microsoft chose the second option, the fibers, to implement resumable functions in the CTP.

Fibers

Now, this is curious: a couple of years ago I wrote a few posts about a possible way to write C#-like iterator blocks and LINQ operators in C++ using Fibers.

Fibers are an almost forgotten, old feature added to Windows NT to support cooperative multitasking, in the form of lightweight threads that must be manually scheduled by the application. If I am not wrong, they were added to Windows to improve the performance of the first versions of SQL Server.

My implementation of iterator blocks was very questionable for many reasons (for example, I just copied the IEnumerator/IEnumerable interfaces from .NET rather than writing something that could work better with C++ STL) but I thought that the weakest point was probably the use of fibers.

In fact it is difficult to write correct code with fibers; the entire program must be designed to support them. Over the years, many things in Windows have grown increasingly incompatible with them. The whole of .NET does not work with fibers, and an attempt to fix this proved so complicated that it was soon abandoned.

Evidently this is not necessarily a problem if fibers are used in the limited scope of supporting native resumable functions, under the direct control of the compiler. But even so, the functions and libraries that we call from a resumable function must be fiber-safe (in addition to being thread-safe); for example, thread-local storage does not work with fibers. The CRT itself was not completely fiber-safe in the past, but I guess this should have been fixed by now. And there is the problem of managing C++ exceptions: in the past they could only be caught by the fiber that threw them and could not pass across fiber boundaries, but even this can be solved by the compiler or the library.

Also, I wonder whether there is a portable way to implement side stacks on non-Windows platforms, and whether some support from the OS will necessarily be required.

Under the hood

But let’s look in more detail at the CTP code, to learn how resumable functions actually work. The implementation is split between the compiler and the library code in <pplawait.h>.

The code of <pplawait.h> is organized in two main classes:

template <typename TTask, typename TFunc, typename TTaskType>
class __resumable_func_setup_data
{
    __resumable_func_fiber_data* _func_data;
    LPVOID _pSystemFiber;
    LPVOID _pvLambda; // The function to execute
    task_completion_event<TTaskType> _tce;
    [...]
};
struct __resumable_func_fiber_data
{
    LPVOID threadFiber;         // The thread fiber to which to return.
    LPVOID resumableFuncFiber;  // The fiber on which this function will run
    volatile LONG refcount;     // Refcount of the fiber
    [...]
};

The first, __resumable_func_setup_data, contains the data and logic to initialize a resumable method call, and it’s templatized on the function type.

The second, __resumable_func_fiber_data, contains the data and methods to manage the life of a resumable method that runs on a fiber.

When a function is declared as resumable, like:

    task<int> foo(future<int> fi) __resumable

the compiler generates the code for a stub function foo_resumable.

When a resumable function is called, the compiler generates a call to this stub function, which takes care of executing foo into a new “resumable context”. The code for this is in a static function:

    TTask __resumable_func_setup_data::RunActionInResumableContext(TFunc& lambda)

which shows how a Fiber can be used to seamlessly suspend and resume a function:

  1. The first thing to do is to convert the caller thread into a fiber, by calling __resumable_func_fiber_data::ConvertCurrentThreadToFiber(). When the function completes it will convert the fiber back into a normal thread with ConvertFiberToThread.
  2. Then, a __resumable_func_setup_data struct is initialized with the data required to invoke the function on a fiber.
  3. After this, a new fiber is created, specifying __resumable_func_setup_data::ResumableFuncFiberProc as its entry point. Since creating and destroying a fiber is a relatively expensive operation, the library also implements a FiberPool class that manages a pool of fibers, which are reused when needed.
  4. Now the current fiber can suspend itself and start the execution of this new fiber by calling SwitchToFiber.

Calling SwitchToFiber means manually doing a context switch (or better, a fiber switch). The instruction pointer of our thread suddenly jumps to a different place, using a different stack, and all the registers are updated. But the old context and stack are not lost: they are part of the previous fiber, which is simply not running because it is not associated with any thread; it can be resumed at any moment.
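
For readers without Windows at hand, the same switch-and-resume dance can be sketched with the old POSIX <ucontext.h> API, the closest portable analogue of fibers (this is only an illustration of the mechanism, not how the CTP is implemented, and it relies on glibc still shipping these deprecated functions):

```cpp
#include <cassert>
#include <ucontext.h>
#include <vector>

// Two contexts play the roles of "Fiber 0" (the converted thread) and
// "Fiber 1" (the side stack running the resumable code).
static ucontext_t callerCtx, fiberCtx;
static std::vector<int>* trace = nullptr;

static void fiberProc()
{
    trace->push_back(1);                 // running on the side stack
    swapcontext(&fiberCtx, &callerCtx);  // "yield" back to the caller
    trace->push_back(3);                 // resumed exactly where we left off
    // falling off the end jumps to uc_link (the caller context)
}

std::vector<int> runFiberDemo()
{
    std::vector<int> t;
    trace = &t;

    static char stack[64 * 1024];        // the side stack, like CreateFiberEx's
    getcontext(&fiberCtx);
    fiberCtx.uc_stack.ss_sp = stack;
    fiberCtx.uc_stack.ss_size = sizeof(stack);
    fiberCtx.uc_link = &callerCtx;       // where to go when fiberProc returns
    makecontext(&fiberCtx, fiberProc, 0);

    t.push_back(0);
    swapcontext(&callerCtx, &fiberCtx);  // like the first SwitchToFiber
    t.push_back(2);                      // the fiber suspended itself: we are back
    swapcontext(&callerCtx, &fiberCtx);  // "resume": jump into the fiber again
    t.push_back(4);                      // the fiber returned; uc_link brought us here
    return t;
}
```

The trace comes out strictly interleaved (0, 1, 2, 3, 4) even though a single thread is involved: exactly the effect of the SwitchToFiber calls described above.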

For example, the following figure shows what happens when the resumable copyFile_res is called in our copyFile program: first the current thread (Thread A) is converted into a fiber (Fiber 0), then a new fiber (Fiber 1) is created to run copyFile_res, and the thread execution is switched to the new Fiber. At that point only Fiber 1 is running. Fiber 0 still exists, with its stack, but it’s suspended.

[Image: Thread A converted into Fiber 0, with execution switched to Fiber 1 running copyFile_res]

When the ResumableFuncFiberProc starts executing in the context of the new fiber, it:

  1. Retrieves a pointer to the __resumable_func_setup_data from its parameters,
  2. Initializes a new instance of __resumable_func_fiber_data, allocated on the thread stack, with all the data required to manage the function (like the handle of the caller-fiber), and
  3. Finally executes the resumable function in the context of the fiber.

Now we are executing our functor in the context of the fiber. We can leave the function in two ways:

– We can complete the function (with a return statement), and return the final result, or

– We can temporarily suspend the function on an __await expression and yield the execution to the caller.

In the first case, all we need to do is to complete the ResumableFuncFiberProc, store the result and call SwitchToFiber(previousFiber) to resume the previous fiber. The “terminated” fiber is now ready to be deleted or recycled.

Things get more interesting when we have an await expression, like:

    shared_ptr<string> s = __await readFileChunk(inFile, 4096);

In this case the compiler converts the __await into a call to the library function

    template <typename T> T await_task(concurrency::task<T> func)

where func is a task that wraps the awaitable expression that is being called.

await_task implements all the logic to suspend the current function/fiber and to yield the control back to the previous (caller) fiber, so resuming it.

But before yielding the execution, await_task attaches a continuation to the task func. When that task completes, we want to resume the current fiber; to do this, the continuation needs to retrieve and update the funcDataOnFiberStack struct and call SwitchToFiber once again.

The actual details are a little complicated, but the whole code in <pplawait.h> is not too difficult to read.

Changes to the Windows Fiber API

The last interesting thing to note is that the new <pplawait.h> library makes use of a new and slightly improved version of the Win32 Fiber API, defined for Windows 8.1 (where _WIN32_WINNT >= 0x0603) but actually available also on Windows 7 (at least with the most recently patched version of kernel32.dll).

The original APIs, available since WinNT 4.0, provided the following methods:

LPVOID WINAPI CreateFiber(
    _In_     SIZE_T dwStackSize,
    _In_     LPFIBER_START_ROUTINE lpStartAddress,
    _In_opt_ LPVOID lpParameter);
LPVOID WINAPI ConvertThreadToFiber(_In_opt_ LPVOID lpParameter);
VOID WINAPI DeleteFiber(_In_ LPVOID lpFiber);
BOOL WINAPI ConvertFiberToThread();
VOID WINAPI SwitchToFiber(_In_ LPVOID lpFiber);

The first two of these methods have now been superseded by an extended version:

LPVOID WINAPI CreateFiberEx(
    _In_     SIZE_T dwStackCommitSize,
    _In_     SIZE_T dwStackReserveSize,
    _In_     DWORD dwFlags,
    _In_     LPFIBER_START_ROUTINE lpStartAddress,
    _In_opt_ LPVOID lpParameter);
LPVOID WINAPI ConvertThreadToFiberEx(
    _In_opt_ LPVOID lpParameter,
    _In_     DWORD dwFlags);

The difference is that a fiber can now be created specifying different values for the CommitSize and the ReserveSize of the fiber stack. Here ReserveSize represents the total stack allocation in virtual memory, while CommitSize is the initially committed memory; it can be useful to fine-tune these values to optimize the memory usage. In fact, programs with a large number of concurrent resumable operations could end up allocating a very large amount of memory for the fiber stacks, since a different fiber must be allocated for each resumable function.

There is also a dwFlags argument whose only possible value is FIBER_FLAG_FLOAT_SWITCH; it is necessary to overcome an old limitation of the Fiber API, which in the past did not save the floating point registers in a fiber switch.

Finally, there is a new function, still undocumented but already used by the CTP library:

PVOID WINAPI CalloutOnFiberStack(
    _In_      PVOID lpFiber,
    _In_      PFIBER_CALLOUT_ROUTINE lpStartAddress,
    _In_opt_  PVOID lpParameter);

It is not difficult to imagine that its purpose is to execute a function on the Fiber specified and return the result to the caller.

Generator functions

We have almost completed our quick review of asynchrony in C++. But there is one last thing to consider before finishing: asynchronous methods are very similar to iterator methods.

They are both a kind of coroutine. We can think of an async method as a strange kind of iterator which returns only one value, at the end, but which is able to suspend its execution as iterators do after yielding each value. Or we could think of iterator blocks as strange kinds of async methods, which can suspend and resume execution but only to return a value to the caller.

In C# the compiler implements both iterator methods and asynchronous methods by transforming the code of the method into a state machine. We have seen that in C++ the proposal is to implement async methods as real coroutines, using side stacks. So, the logical question to ask is: “Can we use resumable functions to implement C#-style iterators?”

Of course we can! The same proposal N3858 explains that the concept of resumable functions can be expanded to include generator functions, which would use the same kind of interruption/resumption logic to produce a sequence of values.

Gustafsson defines a new special container, sequence<T>, to represent a sequence of values lazily generated by a generator. With this, it should be possible to write LINQ-like code, like the following:

template <typename Iter>
sequence<int> lazy_tform(Iter beg, Iter end, std::function<int(int)> func) resumable
{
    for (auto it = beg; it != end; ++it) {
        yield func(*it);
    }
}
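
Until generator functions arrive, their laziness can at least be approximated in today’s C++ with a pull-style functor; sequence<T> and yield remain hypothetical, and here std::optional marks the end of the sequence:

```cpp
#include <cassert>
#include <functional>
#include <optional>  // C++17
#include <vector>

// A pull-based stand-in for the proposed sequence<int>: each call to the
// returned functor produces the next transformed value, or nullopt at the end.
template <typename Iter>
std::function<std::optional<int>()>
lazy_tform(Iter beg, Iter end, std::function<int(int)> func)
{
    return [=]() mutable -> std::optional<int> {
        if (beg == end)
            return std::nullopt;
        return func(*beg++);   // compute one value, only on demand
    };
}
```

Each value is computed only when the consumer asks for the next one, which is the essential property the resumable generator would provide with far nicer syntax.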

But this will be the topic for a future post. Stay tuned! 🙂