Galaga: an Arcade machine emulator for Windows and HTML5


Galaga was a very popular arcade game in the ‘80s. When I was a kid I must have spent so many coins on that game that I could probably be driving a Ferrari today, had I invested them better. 🙂


One of the scenes that made the 1983 movie “WarGames” endearing to me is the one where the main character, David (Matthew Broderick), plays Galaga before going to school, and not even the presence of his beautiful classmate Jennifer (Ally Sheedy) can really distract him from the game… 🙂


A few years ago I must have had really too much free time, so I engaged in the toy project of writing an emulator for the Galaga arcade machine. By emulation I don’t mean writing a game “just like” Galaga (which would not be a trivial task, anyway) but really emulating the hardware of that old arcade machine in all its details: the ISA we want to emulate is that of the old Z80 processor, and the program we want to run is the one stored in the original ROMs. Of course I knew that MAME and MESS already existed, and I snooped through their sources to find out how that machine actually worked.

My initial goal was to have something running in a browser, so not tied to the Windows platform, and to learn something about software emulation and virtual machines in the process. For this reason I first chose Silverlight as the host, a platform that looked very promising at the time. After that I quickly ported the code from C# to C++ (also to see what the speed-up could be) and made it run as a WinRT app, more recently converted into a UWP app for Windows 10.

Nothing special, so far. I was just doing for a single game what MAME has done better for thousands of games. But more recently I discovered the wonderful world of Emscripten and I wondered: how difficult can it be to make the engine REALLY PORTABLE, compiling it to JavaScript and targeting HTML5? And can JavaScript run fast enough to emulate an arcade machine? That’s what we are going to discover! In this blog post I’ll present a few technical notes on this toy project, a small journey from the initial idea to the final HTML5 implementation.

The machine

But let’s start from the beginning: how did an arcade machine from the 80s work? And how difficult is it to emulate it?

The following picture presents a high-level view of the Galaga hardware:


Interestingly, Galaga was multi-core. There were three Z80 8-bit processors, which shared the same address space and RAM but each executed a different program, stored in a set of different ROMs. Other ROMs contained the graphics (sprites and tiles) and the audio waveforms used to generate sound effects.

There was a form of inter-processor communication based on interrupts: the Z80 handles maskable interrupts by executing an interrupt handler located at a fixed address in the ROM. In Galaga the interrupt handler managed a vector of commands with a jump table, so the main CPU could invoke a procedure on one of the other processors simply by setting the index of a command in a location in shared memory and signaling an interrupt.
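As a sketch of this pattern, here is a hypothetical and heavily simplified C++ model of the command dispatch; the names (sharedRam, kCommandAddr, the command routines) are invented for illustration and are not the emulator’s actual code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the interrupt-based command dispatch: the main
// CPU writes a command index into a shared-memory mailbox and raises an
// interrupt; the sub-CPU's interrupt handler jumps through a table.
uint8_t sharedRam[0x400];            // shared RAM (size is illustrative)
const uint16_t kCommandAddr = 0x20;  // mailbox location (illustrative)

int lastCommandRun = -1;             // just to observe the dispatch

// Jump table: one routine per command index (all names invented).
void cmdNop()        { lastCommandRun = 0; }
void cmdPlayEffect() { lastCommandRun = 1; }
void cmdResetScore() { lastCommandRun = 2; }
void (*jumpTable[])() = { cmdNop, cmdPlayEffect, cmdResetScore };

// The sub-CPU's interrupt handler, located at a fixed address in ROM:
// read the command index from shared memory and jump through the table.
void onInterrupt() {
    jumpTable[sharedRam[kCommandAddr]]();
}

// What the main CPU does to invoke a procedure on another processor.
void signalCommand(uint8_t command) {
    sharedRam[kCommandAddr] = command;
    onInterrupt();  // stands in for asserting the sub-CPU's INT line
}
```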

Besides the main CPUs there were a number of custom chips, dedicated to different functions. There was one just to handle the graphics effects of the star field in the background; others were used to render the graphics, to manage the input controls and to produce the sound effects.

To play Galaga we need to emulate the workings of all its components. All the pieces must be connected as they were on their circuit board, and for this we need to find detailed information about the original schematics. Luckily, besides MAME, there are a number of sources for this info on the Internet. Galaga’s popularity meant that there has been a great deal of low-level hacking over the years. For example, someone entirely disassembled the 24K bytes of Z80 code from its ROMs (1, 2, 3), to the point of being able to find the root cause of, and fix, the famous bug that made the alien bees stop firing after a while. Another electronics wizard wrote the code to emulate the sound hardware on a PIC.

The Emulator

I wrote the machine emulator engine in C++, making sure that it compiled with the latest versions of Visual Studio and Clang, so that it could be used to drive different hosts, possibly running on different platforms.


The first thing I needed was a software emulator for the Z80 microprocessor. The emulator, of course, should run in real time, so the emulation of the Z80 instruction set must be performed very efficiently. The original machine had three Z80 processors running at ~3 MHz, so we need to make sure that the time spent emulating these chips is just a fraction of the machine’s nominal execution time. In other words, we want to emulate 10 msec of machine time in less than 10 msec.

There are two main techniques for emulating an instruction set (see Smith, Nair, “Virtual Machines”). The most efficient one is binary translation, which converts whole blocks of a source program into a customized binary program in the target ISA. A much simpler technique is the “decode-and-dispatch interpreter”, which works by fetching one instruction at a time, analyzing and executing it, and updating the emulated machine state.

In our case more than 20 years have passed since the days of 8-bit processors, so we can afford to choose the simpler technique and write an interpreter for the Z80. Even so, this turns out to be the most computationally expensive part of the emulator, so it’s important to optimize this code carefully.
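To make the idea concrete, here is a minimal decode-and-dispatch loop in C++. The opcodes and cycle counts are invented for illustration (they are not real Z80 encodings), but the fetch/decode/execute structure, and the cycle counter used to pace the emulation against real time, mirror what an interpreter core does:

```cpp
#include <cassert>
#include <cstdint>

// Toy decode-and-dispatch interpreter; the instruction set is invented.
struct Cpu {
    uint8_t a = 0;            // accumulator
    uint16_t pc = 0;          // program counter
    uint8_t memory[256] = {};
    int64_t cycles = 0;       // cycle counter, used to pace real time

    void step() {
        uint8_t opcode = memory[pc++];        // fetch
        switch (opcode) {                     // decode and dispatch
            case 0x00: /* NOP */           cycles += 4; break;
            case 0x01: a = memory[pc++];   cycles += 7; break; // LD A, n
            case 0x02: a += memory[pc++];  cycles += 7; break; // ADD A, n
            default:                       cycles += 4; break;
        }
    }

    // Run until at least `target` cycles have executed, the way the
    // emulator runs each CPU for a quantum of machine time.
    void run(int64_t target) {
        while (cycles < target) step();
    }
};
```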

Besides the Z80, a couple of the custom chips in the machine are based on another, very simple 4-bit microprocessor, which also needs to be emulated; here too we can use a simple decode-and-dispatch interpreter.

Memory and I/O

We have seen that the memory space is partially shared by the three main processors. When an address is accessed by a CPU, it may refer to a portion of the shared RAM, or it may refer to one of the ROMs, which are instead separate per processor, with different ROMs mapped into the same address range for different processors. And this is not the whole story: some of the other chips are also memory-mapped, so the CPU instructions used to access memory are also used to access these devices. The emulator needs to take care of all this, mapping each address to the right device.

Other devices instead use port-mapped I/O and are accessed via a dedicated set of microprocessor instructions (in and out on the Z80), on a separate, dedicated address space. Our emulator, of course, must also take care of this port mapping.
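A minimal sketch of such an address decoder, with an invented memory layout (the real Galaga memory map is different): each read is routed to the per-CPU ROM, the shared RAM, or a memory-mapped device register depending on the address range.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative address decoder: per-CPU ROM, shared RAM, and a
// memory-mapped device register. Ranges and sizes are invented.
struct Machine {
    uint8_t rom[3][0x4000] = {};    // one program ROM per CPU
    uint8_t sharedRam[0x2000] = {}; // RAM shared by the three CPUs
    uint8_t deviceReg = 0xAB;       // a memory-mapped device register

    uint8_t read(int cpu, uint16_t address) {
        if (address < 0x4000)
            return rom[cpu][address];           // per-CPU ROM
        if (address < 0x6000)
            return sharedRam[address - 0x4000]; // shared RAM
        return deviceReg;                       // memory-mapped device
    }

    void write(int /*cpu*/, uint16_t address, uint8_t value) {
        if (address < 0x4000)
            return;                             // ROM: writes ignored
        if (address < 0x6000)
            sharedRam[address - 0x4000] = value;
        else
            deviceReg = value;
    }
};
```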


The graphics system was similar to that of the 8-bit microcomputers of the time (like the Commodore 64). The screen (224 x 288 pixels) was managed as a tilemap of 8×8 characters, and the graphics system also supported up to 64 sprites that were rendered above the tilemap.

The generation of a screen frame takes three steps:

  1. First, the star field in the background is drawn. In the original, there is a custom chip to move and draw the stars; its behavior needs to be replicated.
  2. Then the matrix of 28 x 36 square tiles (224 x 288 pixels) is drawn over the star field, by reading from video-memory the index of the tile to draw and the color to use. Tiles are used for all the text that appears in the game, like the score; the content of each tile (a bitmapped font) is read from a ROM.
  3. Finally, up to 64 sprites are drawn over the tiles. Here too, the size, position and colors of each sprite are stored in a mapped area of video memory, while the sprite bitmaps are read from separate ROMs.
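The three steps can be sketched as follows; the frame contents and colors here are invented placeholders, but the compositing order (stars, then tiles, then sprites) is the one described above:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy frame compositor illustrating the three drawing steps.
constexpr int W = 224, H = 288;

struct Frame {
    std::vector<uint32_t> pixels = std::vector<uint32_t>(W * H);
    void set(int x, int y, uint32_t c) { pixels[y * W + x] = c; }
    uint32_t get(int x, int y) const { return pixels[y * W + x]; }
};

void drawStars(Frame& f) {
    for (auto& p : f.pixels) p = 0xFF000000;  // black sky...
    f.set(10, 20, 0xFFFFFFFF);                // ...with one star
}

void drawTiles(Frame& f) {
    // Each 8x8 tile would be read from video RAM; here we stamp one.
    for (int y = 0; y < 8; ++y)
        for (int x = 0; x < 8; ++x)
            f.set(x, y, 0xFF00FF00);
}

void drawSprites(Frame& f) {
    // Sprites are drawn last, above the tilemap.
    f.set(2, 3, 0xFFFF0000);
}

Frame renderFrame() {
    Frame f;
    drawStars(f);    // 1. star field in the background
    drawTiles(f);    // 2. 28x36 character tilemap over it
    drawSprites(f);  // 3. up to 64 sprites on top
    return f;
}
```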


Audio synthesis turned out to be by far the most complicated part of this project. In Galaga the audio system is composed of two different chips that generate two audio signals.

The first chip is a 3-voice waveform sound generator. It uses a library of waveforms, read from a PROM, to generate digital samples for sound effects and little tunes.

The second chip, a sound/noise generator based on a 4-bit controller, was used only to generate explosions. This is a discrete sound circuit, made of analog components like operational amplifiers and a DAC. The emulation of a discrete circuit was the part where I really had to dig into the MAME sources to be able to move on.

A final mixer merges these two audio streams and generates a sequence of samples in the format of a single PCM channel of 16-bit samples at a sample rate of 48 kHz.
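A sketch of that final mixing stage, assuming both chips have already produced their samples as 16-bit values (the function name is illustrative): the two streams are summed into a wider integer and clamped back into the 16-bit range, to avoid wrap-around distortion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative final mixer: sum two 16-bit mono streams and clamp the
// result into a single 16-bit PCM channel.
std::vector<int16_t> mix(const std::vector<int16_t>& waveform,
                         const std::vector<int16_t>& noise) {
    std::vector<int16_t> out(waveform.size());
    for (size_t i = 0; i < out.size(); ++i) {
        int32_t sum = int32_t(waveform[i]) + int32_t(noise[i]);
        // clamp to the 16-bit range instead of letting it wrap around
        out[i] = int16_t(std::clamp(sum, -32768, 32767));
    }
    return out;
}
```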


A nice thing about the emulation of a game machine is that, once the engine is written and works, it can easily be “hosted” in different apps.

We can see the arcade emulator as a black box which takes a number of inputs (the state of the button controls, coin insertion) and, when run for some time interval (quantum) of N microseconds, produces a set of outputs: the video frame that results from the execution of the machine and the set of audio samples that were generated in that interval.


The input interface turns out to be very simple: there is a small set of buttons or switches that can be turned on or off at any moment. Three buttons are used to horizontally move the ship and to shoot, other buttons are used to start the game, and a switch is triggered when a coin is inserted:

void SetInsertCoin(bool value);
void SetStart1Player(bool value);
void SetStart2Players(bool value);
void SetMoveLeft(bool value);
void SetMoveRight(bool value);
void SetButton(bool value);

All the rest of the machine behaviour can be encapsulated in a single function that runs the machine for a specified time interval:

void Run(double microSeconds, VideoFrame& videoFrame, AudioFrame& audioFrame);

To guarantee smooth video rendering and low-latency audio we need to keep the time interval very small. For example, a quantum of 16.7 msec gives a frame rate of 60 frames/second and reasonably low latency for the sound.

Note that the code of the engine is totally platform-independent: once we have defined the format for audio samples and video frames, as long as we just change the state of the input controls and run quanta of emulation time, we are not doing anything that interacts with the host yet.

So, we can reuse the same emulator experimenting with different “host applications”, comparing the technologies and measuring their performances. We saw that the engine produces screen frames, in the form of arrays of pixels, and audio samples, in the form of arrays of PCM samples; the challenge will be to find the most efficient way to render these frames and to play these samples.

Here I’d like to present three different versions of a host application. Since Windows is still my preferred platform, the first choice was to write the game as a classic, Direct2D-based Win32 app. The natural evolution was then to move to a UWP app for Windows 10. (We’ll see that in both cases Windows provides nice APIs that make the hosting code quite simple.) Finally, we’ll move away from native code towards something very different, an HTML5/JavaScript app, and we’ll see how we can reuse the same C++ code that drove the Windows apps.



Let’s start with a classic Windows app. In this case Direct2D is probably the best choice to quickly redraw screen frames. Personally I am a big fan of Direct2D and of its simple, low-level interfaces. (In one of my projects at work, resorting to pure Direct2D for the most complex part of our user interface unexpectedly turned out to be a much simpler and faster solution than relying on a more declarative XAML-like framework… but this is another story :-)).

There are very efficient ways to render video frames with Direct2D, like swap chain buffers, but in our case we don’t need anything so sophisticated; it is sufficient to use a single bitmap to represent the game screen.

The code (error handling omitted) is very simple indeed; we just need to create a render target associated with the app window and to create a bitmap with a pixel format that matches that of the screen frames produced by the emulator:

    CComPtr<ID2D1Factory> _pDirect2dFactory;
    CComPtr<ID2D1HwndRenderTarget> _pRenderTarget;
    CComPtr<ID2D1Bitmap> _pBitmap;

    D2D1CreateFactory(D2D1_FACTORY_TYPE_SINGLE_THREADED, &_pDirect2dFactory);

    _pDirect2dFactory->CreateHwndRenderTarget(
        D2D1::RenderTargetProperties(),
        D2D1::HwndRenderTargetProperties(m_hwnd, size),
        &_pRenderTarget);

    _pRenderTarget->CreateBitmap(
        D2D1::SizeU(Width, Height),
        NULL, 0,
        D2D1::BitmapProperties(
            D2D1::PixelFormat(DXGI_FORMAT_B8G8R8A8_UNORM, D2D1_ALPHA_MODE_IGNORE)),
        &_pBitmap);

ID2D1Bitmap provides a method CopyFromMemory to fill the bitmap from an array of pixels in memory, and this is all we need.

To schedule the engine we can simply use a Win32 timer; with a timeout of 17 msec we get about 60 frames per second. Every time the timer expires we run the emulator for the same time interval and, as a result, a new screen frame will be ready to be copied into the bitmap, blitted onto the render target and drawn to the window:

    _galagaMachine.Run(time, videoFrame, audioFrame);

    // render screen frame
    _pBitmap->CopyFromMemory(&rect, videoFrame.Pixels, pitch);
    _pRenderTarget->BeginDraw();
    _pRenderTarget->DrawBitmap(_pBitmap, destRect); // destRect: scaled destination rectangle
    _pRenderTarget->EndDraw();

    // render audio
    // ...
We’ll see later how to deal with the audio samples.


With Windows 10, we can write a UWP app that also works on mobile. In this case we use XAML and the required markup is extremely simple:

    <Page x:Class="rtGalaga.MainPage" xmlns=...
        KeyDown="LayoutRoot_KeyDown" KeyUp="LayoutRoot_KeyUp" Loaded="UserControl_Loaded">
        <Grid x:Name="LayoutRoot" Background="#FF0C0C0C">
            <Image x:Name="img" Stretch="Uniform" />
        </Grid>
    </Page>

We just use an Image element to render the game screen. This is very simple, because the UWP framework provides a mechanism to fill the content of the Image from an array of pixels in memory, by using a WriteableBitmap as the image source:

WriteableBitmap^ _bitmap;
_bitmap = ref new WriteableBitmap(Width, Height);
img->Source = _bitmap;

XAML also provides us with a better and more efficient way to schedule our emulator. The CompositionTarget::Rendering event is fired once each time the XAML engine decides to render a frame. In normal conditions, this happens about 60 times per second, but the event will be fired less frequently when the processor is busy:

CompositionTarget::Rendering += ref new EventHandler<Object^>(this, &MainPage::OnCompositionTargetRendered);

Each time the frame rendering is scheduled, we run the emulator for the appropriate amount of time and get a VideoFrame, whose pixels we can copy into the WriteableBitmap:

void MainPage::OnCompositionTargetRendered(Platform::Object^ sender, Platform::Object^ args)
{
    // run the emulator
    galaga->Run(time, videoFrame, audioFrame);

    // render the video frame
    ComPtr<Windows::Storage::Streams::IBufferByteAccess> pBufferByteAccess;
    if (SUCCEEDED(_bitmap->PixelBuffer.As(&pBufferByteAccess)))
    {
        // get pointer to pixel bytes
        BYTE* pBuff = nullptr;
        if (SUCCEEDED(pBufferByteAccess->Buffer(&pBuff)))
        {
            memcpy(pBuff, videoFrame.Pixels, …);

            // invalidate the bitmap to redraw the image
            _bitmap->Invalidate();
        }
    }

    // render audio samples
}


What is missing now is audio rendering. We can use the same technology both in a classic Win32 app and in a WinRT/UWP app: XAudio2, the replacement for DirectSound, provides a very simple API for low-level access to the sound card, and it is particularly suitable for videogames.

XAudio2 works by building an audio processing graph, and in our case the graph will be very simple: there is a source node (voice) that produces the emulator-synthesized samples (in the form of a single 16-bit PCM channel at 48 kHz) and there is a mastering voice that represents the audio device.


This is the code that creates and starts the graph (error handling omitted):

CComPtr<IXAudio2> _pMusicEngine;
IXAudio2SourceVoice* _pSourceVoice;
IXAudio2MasteringVoice* _pMasteringVoice;

std::vector<uint8_t> _buffers[QUEUE_SIZE];
int _lastPlayed;
int _lastWritten;


::XAudio2Create(&_pMusicEngine, 0, XAUDIO2_DEFAULT_PROCESSOR);
_pMusicEngine->CreateMasteringVoice(&_pMasteringVoice, XAUDIO2_DEFAULT_CHANNELS);

WAVEFORMATEX wfx;
ZeroMemory(&wfx, sizeof(WAVEFORMATEX));
wfx.wFormatTag = WAVE_FORMAT_PCM;
wfx.nChannels = 1;
wfx.nSamplesPerSec = 48000;
wfx.wBitsPerSample = 16;
wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.wBitsPerSample / 8;
wfx.nBlockAlign = wfx.wBitsPerSample / 8;

_pMusicEngine->CreateSourceVoice(&_pSourceVoice, (WAVEFORMATEX*)&wfx, 0, 1.0f, &_voiceContext, NULL, NULL);
_pSourceVoice->Start(0, 0);

Here _voiceContext is a pointer to an object that implements the IXAudio2VoiceCallback interface, used to buffer and synchronize the samples to be played.

Our emulator produces a continuous sequence of samples that we need to play in real time, with minimum delay and without glitches. The best way to do this is to use a circular buffer (see Petzold).


More precisely, we use an array of small buffers arranged in a circular queue so that when one of the buffers is being played, the other buffers can be filled and queued to be processed.
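The bookkeeping of such a queue can be sketched as a small self-contained unit; the names are illustrative, and the real code splits this logic between the rendering loop (which pushes) and the XAudio2 callback (which pops):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative circular queue of audio buffers: the emulator thread
// pushes freshly generated buffers; the audio callback pops them.
constexpr int kQueueSize = 4;

struct AudioQueue {
    std::vector<int16_t> buffers[kQueueSize];
    int lastPlayed = 0;   // index of the buffer being played
    int lastWritten = 0;  // index of the last buffer filled

    // Returns false when the queue is full (overflow).
    bool push(const std::vector<int16_t>& samples) {
        int next = (lastWritten + 1) % kQueueSize;
        if (next == lastPlayed) return false;  // would clobber unplayed data
        buffers[next] = samples;
        lastWritten = next;
        return true;
    }

    // Called when a buffer finishes playing; returns the next one to
    // submit, or nullptr when the queue is empty (underflow).
    const std::vector<int16_t>* pop() {
        if (lastPlayed == lastWritten) return nullptr;  // nothing queued
        lastPlayed = (lastPlayed + 1) % kQueueSize;
        return &buffers[lastPlayed];
    }
};
```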

Every time we run the emulator for a quantum of time (GalagaMachine.Run(…)) we get the set of samples that the machine would have generated in that time interval, and we insert them into the queue. For simplicity, we can ask the emulator to provide us samples in small buffers of the same size as the buffers in the queue.

    _galagaMachine.Run(time, videoFrame, audioFrame);

    // render screen frame

    // render audio
    int buffer = (1 + _lastWritten) % QUEUE_SIZE;
    if (buffer == _lastPlayed) {
        return; // overflow: the queue is full
    }
    std::copy(audioFrame.samples().begin(), audioFrame.samples().end(), _buffers[buffer].begin());
    _lastWritten = buffer;

Meanwhile, XAudio2 will continue playing the queued buffers, and it will report when each buffer is finished by calling the method OnBufferEnd of the IXAudio2VoiceCallback interface. There, we can refill the buffer with more samples and add it back to the queue to be processed, submitting it to the IXAudio2SourceVoice object with code like this:

    _lastPlayed = (_lastPlayed + 1) % QUEUE_SIZE;

    XAUDIO2_BUFFER buf = { 0 };
    buf.AudioBytes = _buffers[_lastPlayed].size();
    buf.pAudioData = _buffers[_lastPlayed].data();
    _pSourceVoice->SubmitSourceBuffer(&buf);

Having a little buffering is useful to compensate for moments when the computer’s CPU is too busy to keep up with the task of producing audio samples in our emulator. The bigger the buffers, the more resistant the app will be to transient changes in CPU load that could starve or overflow the queue. But we pay for this buffering with latency: a delay between the moment something happens in the game and the moment the sound effect for that event is played, a bit like the delay between lightning and thunder.

On a normal Windows PC we can generally afford to use relatively small buffers; I found that using a queue with three or four 20-msec buffers (each with 48000 samples/sec * 1 channel * .02 sec = 960 samples) works very well in most cases. More sophisticated algorithms could be used to dynamically adapt the latency to the CPU load.
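For reference, the arithmetic behind those numbers (the constants here just restate the figures from the paragraph above):

```cpp
#include <cassert>

// Buffer sizing and worst-case latency for a queue of 20-msec buffers.
constexpr int sampleRate = 48000;  // samples/sec, single PCM channel
constexpr int bufferMsec = 20;     // duration of each queued buffer
constexpr int samplesPerBuffer = sampleRate * bufferMsec / 1000; // 960
constexpr int bytesPerBuffer = samplesPerBuffer * 2;             // 16-bit samples
constexpr int queueSize = 4;
constexpr int worstCaseLatencyMsec = queueSize * bufferMsec;     // 80 msec
```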

From UWP to HTML5

So far, everything was quite easy: writing an arcade emulator for Windows is not a particularly difficult task. With native code we can efficiently emulate another ISA and the system provides us with a simple API to quickly render video frames and to play audio samples. The real problem is that the resulting code is not portable. If we want to support a different platform we can recompile and reuse the C++ code for the emulator, but we need to rewrite all the “host” code that deals with audio and with screen frames.

Nowadays the platform of choice for user interfaces is, of course, the browser, and JavaScript is becoming more and more the lingua franca of UI development. So the natural next step is to ask how difficult it would be to write the same emulator in JavaScript, and whether an interpreted language like JS can be fast enough to emulate an old arcade machine.

JavaScript as the assembly language for the Web

While many people consider JavaScript a very elegant programming language, others, like me, are really put off by its lack of static typing. Whatever the preference, there is no doubt that JavaScript is perfect for writing portable code.

Already in 2011 gurus like Brendan Eich (the inventor of JavaScript) and Douglas Crockford (inventor of JSON) predicted that JavaScript could be used as a target language to compile from other languages, and could become a sort of assembly language for the Web. And even before that, Erik Meijer (then at Microsoft) had the idea of writing a compiler from .NET IL to JavaScript as part of the Volta project.

There are other frameworks that enable developers to “target” JS using C# or Java as the source language, like Script# and GWT, but the Volta C#-to-JS compiler was the first (that I know of) to really use JavaScript as an assembly-like language: it worked by compiling .NET code normally into MSIL and then converting each MSIL instruction into JavaScript.


Project Volta has long been dead, but the idea of using JavaScript as the assembly for the Web is more interesting than ever, especially because the JavaScript engines in our browsers are today much faster than in the past.
The best way to compile into JavaScript today is to use Emscripten, an open source tool, based on LLVM, that compiles C and C++.

To be more precise, what Emscripten does is to take LLVM bytecode (which can be generated from C/C++ using Clang, or any other LLVM-friendly language) and to compile it into highly-optimizable JavaScript.
For example, the following C++ function, taken from the sources of the Galaga emulator:

void Frame::SetPixel(int x, int y, int color)
{
    _pixels[(_width * y) + x] = color;
}

Once compiled (with almost all optimizations disabled) is transformed into the following JS code:

function __ZN6Galaga5Frame8SetPixelEiii($this,$x,$y,$color) {
   $this = $this|0;
   $x = $x|0;
   $y = $y|0;
   $color = $color|0;
   var $0 = 0, $1 = 0, $2 = 0, $3 = 0, $4 = 0, $5 = 0, label = 0, sp = 0;
   sp = STACKTOP;
   $0 = HEAP32[$this>>2]|0;       // $0: _width
   $1 = Math_imul($0, $y)|0;      // $1: _width * y
   $2 = (($1) + ($x))|0;          // $2: _width * y + x
   $3 = ((($this)) + 8|0);
   $4 = HEAP32[$3>>2]|0;          // $4: &(_pixels)
   $5 = (($4) + ($2<<2)|0);       // $5: &(_pixels[_width * y + x])
   HEAP32[$5>>2] = $color;        // $6: _pixels[_width * y + x] = color
   return;
}

We can see that the resulting code is really the equivalent of assembly code, only written in JS rather than in a native binary. (The optimized version would be less verbose but also less readable).


The generated code is full of expressions like $x|0… this is a way to coerce the variable x to an integer type. So there is a kind of static typing being enforced… what kind of JavaScript is this? It turns out that Emscripten uses a well-specified and strict subset of JavaScript as its compilation target, named Asm.js.

Asm.js was designed (by Mozilla) exactly for the purpose of being used as a target language for source-to-source compilation from statically-typed languages like C/C++. You can find more details in the spec, but in a few words, asm.js is JS without all the potentially slow parts (like dynamic type guards and garbage collection); it only supports the parts that can be executed very efficiently (strictly-typed integers, floats, arithmetic, function calls, and heap accesses).

The static-type rules allow much faster JIT compilation or ahead-of-time compilation, the script engine can perform optimizations that would not otherwise be possible, and the generated code can be surprisingly fast, running at near-native speed!
But Asm.js also has its disadvantages:

  • It is difficult to write Asm.js code by hand (exactly as it is to code in x86 assembly).
  • The whole C++ application gets compiled into a single massive JavaScript file, including the standard libraries; it is not trivial to make this compiled code interact with other JavaScript parts (but some form of interaction is always necessary).
  • It is difficult to port programs that have a lot of interactions with the OS, or that do a lot of I/O, or that use many external native libraries. There is no support for multi-threading.

Furthermore, while the compiled Asm.js code is valid JavaScript that can run on every browser, it is very important that browsers explicitly support Asm.js to really take advantage of the optimizations that are possible once the asm.js code is validated. For this reason the performance of the same Emscripten-generated code may vary widely across different browsers.

If you are curious, you can use the Octane benchmark to measure the performance of your browser. There is also an interesting demo that shows very effectively what difference running on a browser optimized for asm.js can make. On that page, two chess engines are matched against one another. They are both compiled from the same Stockfish source code, but only one has a flag set to tell the browser that the code should be interpreted (and optimized) as Asm.js. Of course, this is the engine that wins all the matches, as long as the game is played on a browser that supports asm.js.

As for the browser support matrix: at the time of writing, Asm.js is fully supported by Firefox and Edge, while Chrome is still a little behind.


The arcade emulator in Asm.js

After this long introduction to Emscripten and Asm.js I think you’ll see where I’m going: if we want to port our C++ arcade emulator to JavaScript we don’t need to write (almost) any code: we can just recompile the existing C++ sources!

In fact, while it’s unlikely that Asm.js will ever be used for traditional web development (DOM/CSS), it seems perfect for a particular application like a software emulator (and already in 2011 Fabrice Bellard wrote an impressive emulator of a Linux PC in JavaScript).

The code of the emulator engine already builds with Clang, it is “self-contained” (with no external dependencies apart from the CRT and the STL), it does not use multithreading, and it exposes a very simple interface to its clients. So, it does not require any adjustment: once it’s compiled with Emscripten, the resulting asm.js code can be used right away in the browser.

The only question left to be settled now is: will asm.js be fast enough to run a (relatively) computationally intensive application like an arcade emulator in real time?

In order to find out, we first need to install the Emscripten SDK (the latest version can be found here).
Then we run the Emscripten compiler from the command-line:

    emcc
    --bind                   // uses Embind
    -std=c++14               // supports C++14
    -O3                      // enables all optimizations
    src1.cpp … srcN.cpp      // the list of files to compile
    -o galaga.js             // the output of the compilation

All the source files will be compiled into a single, big JavaScript file (about 800K bytes). It is really so simple. I’ve never had any problems with the Emscripten compiler: as long as my code compiled without errors and warnings with the LLVM toolchain for Visual Studio, it also compiled without errors with Emscripten.

The only tricky part now is to “consume” the giant blob of almost-obfuscated code that we have generated from “normal” JavaScript code, and here the --bind option comes to our rescue. When this option is specified we can use special macros to bind C++ functions to JavaScript and, vice versa, to call JavaScript functions from C++.
In our case, we only need to add an additional file to our sources, with the following code:

#include <emscripten/bind.h>
using namespace emscripten;

#include "galagamachine.h"

    // Binding code
    EMSCRIPTEN_BINDINGS(galaga_machine) {
        class_<GalagaMachine>("GalagaMachine")
            .constructor<>()
            .function("Run", &GalagaMachine::Run)
            .function("set_InsertCoin", &GalagaMachine::set_InsertCoin)
            .function("set_Start1Player", &GalagaMachine::set_Start1Player)
            .function("set_Start2Player", &GalagaMachine::set_Start2Player)
            .function("set_MoveLeft", &GalagaMachine::set_MoveLeft)
            .function("set_MoveRight", &GalagaMachine::set_MoveRight)
            .function("set_Button1", &GalagaMachine::set_Button1);
    }

This tells the compiler that it must create binding code for our class GalagaMachine (or, more precisely, for the functions of this class listed in the EMSCRIPTEN_BINDINGS macro). As a result, the giant blob of generated JavaScript code will also contain a class Module.GalagaMachine with the methods specified, and we will be able to instantiate the whole emulator simply by writing:

    var galaga = new Module.GalagaMachine();

The HTML5 Host

We are now in the same situation we were when we started the Direct2D and UWP apps: we have encapsulated the arcade emulator with a simple interface and we just need to implement the “host” code to render the game screen frames and to play the audio samples. This time, however, we want to use HTML5 and JavaScript and things get more interesting.


HTML5: the scheduler

The first problem we need to solve is how to efficiently schedule the execution of the emulator so that it can produce 50 or 60 frames per second. The simplest idea would be to use JavaScript timers (setTimeout(msec)), but a much better solution is to use Window.requestAnimationFrame(callback).

This is an API intended to simplify animations: we call requestAnimationFrame and pass a callback that the browser will call before the next repaint. The callback routine must itself call requestAnimationFrame() to animate another frame at the next repaint, and so on forever. The repaint rate is usually 60 times per second, but the browser may reduce it to a lower rate when the page or tab is in the background or the content is in a hidden <iframe>, in order to improve performance and battery life.

The following code uses requestAnimationFrame to schedule the periodic execution of the function step():

    var prevTimestamp;

    function repeatOften(timestamp) {
        var interval;
        if (!prevTimestamp || !timestamp) {
            interval = 20.0;
        } else {
            interval = timestamp - prevTimestamp;
        }
        prevTimestamp = timestamp;

        step(interval);
        window.requestAnimationFrame(repeatOften);
    }

We’ll start the scheduling in a startup function main() that does all the initialization for the game:

    function main() {
        galaga = new Module.GalagaMachine();
        initVideo();
        window.requestAnimationFrame(repeatOften);
    }


In step(), the call to galaga.Run() is where we finally go to Emscripten-compiled code and run the emulator for a quantum of execution time. The duration of this interval will vary according to the variable refresh rate, and to calculate it we use the timestamp that is passed as argument to our callback.

    function step(interval) {
        galaga.Run(interval * 1000.0,   // msec -> microseconds
            function (videoFrame, audioBuffer) {
                // render videoFrame
                // ...

                // play audioBuffer
                // ...
            });
    }
Note that the second argument passed to Run() is another callback. In fact, Emscripten bindings dictate another small change to our C++ code. While originally we could retrieve the screen frame and audio samples passing two objects by reference:

void Run(double us, VideoFrame& videoFrame, AudioFrame& audioFrame);

This would not work with Emscripten. Instead, we can make the interface more JavaScript-friendly by passing a callback function that will be called when the function Run() completes and that takes as arguments the same two objects that were originally returned by reference:

void GalagaMachine::Run(double us, emscripten::val onComplete)
{
    // ... runs the emulator for a timeslice, obtains videoFrame, audioFrame.

    onComplete(
        emscripten::typed_memory_view(Height * Width * 4, videoFrame.Pixels),
        emscripten::typed_memory_view(audioFrame.Size, audioFrame.Samples));
}

emscripten::val is a type provided by Embind to represent any JavaScript object, and it is what makes it possible to call JavaScript code directly from C++.

HTML5 frame rendering

In order to render screen frames we need to find an efficient way to “blit” the pixels from the VideoFrame generated by the emulator onto the screen. Luckily, the Canvas 2D API provides the ImageData object that does just this. So, in HTML5 the whole UI for the game can be made by a single Canvas element suitably placed in our page:

    <canvas id="myCanvas"></canvas>

At initialization time, we create an ImageData object with the same size as the arcade screen. A small complication is that we need to scale the ImageData to fill the whole canvas, which in turn is resized to fill the browser window. The best way to do this is to draw the ImageData into an off-screen canvas and then draw the off-screen canvas onto the visible one.

    var galaga;
    var canvas;
    var offscreenCanvas;
    var context;
    var offscreenContext;
    var imageData;

    function initVideo() {
        // create offscreen canvas and imageData
        offscreenCanvas = document.createElement('canvas');
        offscreenCanvas.width = 224;
        offscreenCanvas.height = 288;
        offscreenContext = offscreenCanvas.getContext('2d');
        imageData = offscreenContext.createImageData(224, 288);

        canvas = document.getElementById('myCanvas');
        context = canvas.getContext('2d');
        canvas.height = window.innerHeight - 24;
        canvas.width = canvas.height * offscreenCanvas.width / offscreenCanvas.height;
        context.scale(canvas.width / offscreenCanvas.width, canvas.height / offscreenCanvas.height);
    }

Then, periodically, after running the emulator for a quantum of emulation time, we copy the generated screen frame into the canvas ImageData. This copy is as fast as copying a JavaScript typed array: ImageData exposes its data as a Uint8ClampedArray, four bytes per pixel in RGBA format. So, if the emulator generates videoFrame pixels in the same format, we can fill the ImageData with a single call to Uint8ClampedArray.set():

    function step(interval) {
        galaga.Run(interval, function (videoFrame, audioBuffer) {
            // videoFrame is a Uint8Array that aliases directly into the Emscripten heap
            imageData.data.set(videoFrame);
            offscreenContext.putImageData(imageData, 0, 0);
            context.drawImage(offscreenCanvas, 0, 0);

            // Audio
            // ...
        });
    }

HTML5 Audio

Finally, the last thing left to do is to play the audio samples synthesized by our emulator. This is probably the most complicated part of the whole project; one could argue that even today, with HTML5 WebAudio, there is no really good way to do this.
WebAudio provides a rich API to capture, manipulate and play audio samples. Audio operations are performed by putting together a graph of audio nodes (sources nodes, filter nodes, output nodes, and so on).

The basic way to play a sound with WebAudio is simply to create an AudioBufferSourceNode, fill its buffer with audio samples and schedule it to be played back at the appropriate time with AudioBufferSourceNode.start(when). In our case, we want to play a continuous sequence of buffers, each containing 20 ms of samples. Ideally, we should be able to schedule the buffers one after the other, like with bufferSourceNode0.start(t + 0); bufferSourceNode1.start(t + 20); bufferSourceNode2.start(t + 40); and so on. But this does not work, and in all browsers it produces glitching audio: Web Audio does not guarantee sample-precise stitching of contiguous buffers.

There is no “buffer-queueing” functionality in the Web Audio API, which would be very useful for rendering audio synthesized in real time; this problem has been discussed at length in the W3C Audio mailing list.

However, a (somewhat convoluted) workaround does exist, using ScriptProcessorNodes. These are nodes that can be inserted into the graph to do some processing or filtering with a JavaScript event handler. The node is connected to an input and an output buffer, and the event handler onaudioprocess() is called every time the input buffer contains new data, and must fill the output buffer with new data.


In our case, the important feature of a ScriptProcessorNode is that it produces a continuous sequence of samples with no artifacts or glitches. Since we are generating the samples ourselves, we don’t really need any input buffer; we can simply rely on the fact that the system will call onaudioprocess() to fill an output buffer with audio samples that will be sent to the playback node.


But a limitation is that we cannot choose the playback rate of the output buffer; it is always equal to the sample rate of the WebAudio context. To complicate things further, this frequency is platform-dependent. On Windows with Firefox, Chrome and Edge it is usually 44.1 kHz, but I found that it was 48 kHz with every browser on my Surface Pro 3.

The problem is that the emulator also synthesizes audio at a fixed rate, 48 kHz, and when the two sample rates don’t match we need to resample our audio. And even with resampling there is still the possibility of glitches, because the ScriptProcessorNode works in the main thread and can easily cause performance issues. A ScriptProcessorNode is initialized with the size of the buffer that will be processed by the handler function, which must be a power of 2 (between 256 and 16384 samples). This buffer size also determines how frequently the onaudioprocess event fires, and here too there is a buffer-size vs. latency trade-off: smaller sizes mean lower latency, but the handler is called more frequently, making glitching audio more likely. For example, a 1024-sample buffer at 44.1 kHz means that onaudioprocess will be called every 1024/44100 ≈ 23.2 msec.

So, this is evidently not a perfect solution, and in fact the ScriptProcessorNode interface has now been deprecated, to be replaced by Audio Workers, which run audio processing in the context of a web worker rather than the main application thread. But as far as I know no browser supports Audio Workers yet, so ScriptProcessorNodes are still the best option.

The code to play audio with a ScriptProcessorNode is not very different from the one we wrote for XAudio2. Even here we need a circular buffer to be resilient to timer jitter and changes in the CPU load. Buffering is particularly useful here since, as we have seen, the ScriptProcessorNode event handler is called in the main JavaScript thread, which is also busy periodically running the emulator.
At initialization time we initialize Web Audio and create a graph with a ScriptProcessorNode:

var BUFFER_SIZE = 1024;  // samples
var BUFFER_COUNT = 8;    // blocks in the circular buffer (value assumed here)
var audio = {};
var _lastPlayed = BUFFER_COUNT - 1;
var _writePos = 4 * BUFFER_SIZE - 960;

function initAudio() {
    window.AudioContext = window.AudioContext || window.webkitAudioContext;
    if (!window.AudioContext) {
        throw 'Web Audio API is not available!';
    }

    audio.context = new window.AudioContext();
    audio.buffer = new Int16Array(BUFFER_COUNT * BUFFER_SIZE);

    audio.scriptNode = audio.context.createScriptProcessor(BUFFER_SIZE, 0, 1);
    audio.scriptNode.onaudioprocess = ... // later
    audio.scriptNode.connect(audio.context.destination);
}

After running the emulator we insert the generated audio samples into the circular buffer. For brevity, I have omitted the code that resamples the audio when the AudioContext frequency does not match the 48 kHz of our emulator.

    function (videoFrame, audioBuffer) {
        // ... Render screen frame, as seen before...

        if (audio.context.sampleRate != 48000) {
            // resampling (omitted)
        }

        // Queue audio samples. Note that the number of samples in audioBuffer
        // can be different from BUFFER_SIZE.
        var max = BUFFER_SIZE * _lastPlayed;
        var i = 0;
        var j = _writePos;
        var len = BUFFER_SIZE * BUFFER_COUNT;
        while (i < audioBuffer.length) {
            if (j == max) {
                break; // overflow: the circular buffer is full
            }
            audio.buffer[j] = audioBuffer[i];
            i++;
            j = (j + 1) % len;
        }
        _writePos = j;
    }

Finally, there is the onaudioprocess event handler, which is called periodically by the ScriptProcessorNode. Here we extract the samples from the queue and feed them to the output buffer.

    // This function is called periodically by the audio system and runs in the main thread.
    audio.scriptNode.onaudioprocess = function (audioEvent) {
        try {
            var channelData = audioEvent.outputBuffer.getChannelData(0);

            _lastPlayed = (_lastPlayed + 1) % BUFFER_COUNT;
            var j = _lastPlayed * BUFFER_SIZE;
            var i = 0;
            while (i < BUFFER_SIZE) {
                // Convert from [-32768:32767] to [-1.0:1.0]
                channelData[i] = audio.buffer[j] / 0x8000;
                audio.buffer[j] = 0;
                i++; j++;
            }
        } catch (e) {
            // ...
        }
    };
And that’s really all! We can now play our synthesized sounds in a browser too. 🙂 I must say that the quality of the resulting audio is not always perfect, because of the limitations of ScriptProcessorNode, and on slow machines there can be sporadic glitches here and there. It will be interesting to rewrite the buffering algorithm with Audio Workers when they become available.


Now we are in a position to answer an interesting question. When I started porting the C++ engine to JavaScript with Emscripten I had no idea what the slow-down would be. With native (or even managed) code it is obviously easy to efficiently emulate 30-year-old hardware and an ancient 8-bit microprocessor. But is this also true with a scripting language?

Looking back at the emulator interface, we saw that we have a single method Run() that executes the emulator for a specified interval, producing a video frame and a number of audio samples. This is the only really CPU-intensive part of our system. Once we generate a display frame, we can blit it onto the screen very quickly; the Canvas 2D API is fast. Dealing with audio samples means moving around very small arrays (for a single channel at 48 kHz, 20 milliseconds of audio are only 960 samples), and the Web Audio API is also fast. So the only bottleneck is the engine itself, and everything works well as long as a call to Run() takes much less than N milliseconds to emulate N milliseconds.

So, I timed how long running the emulator actually takes in JavaScript on different web browsers, taking as reference the time spent by the native emulator used in the Direct2D project. The comparison is particularly meaningful because we are comparing execution times for exactly the same code (even though transpiled into another language). The following chart shows the results, running the emulator on an old Sony laptop with an i7 processor:


We can see that some browsers are faster than others with asm.js code. In particular, Chrome turned out to be the slowest and Firefox the fastest, with Microsoft Edge not too far behind. But what was very, very interesting to me was to measure that the slow-down factor of Emscripten-generated JavaScript compared to native code can be as low as 1.66. This is really not a lot, and it’s much faster than I would have expected.

Looking forward

Today Asm.js seems useful only for particular kinds of applications: games, emulators and, in general, computational workloads. So it seems to occupy the same niche as NaCl/PNaCl, the sandbox for native code in a browser developed by Google (which has not had huge success and is still supported only by Chrome).

Very likely, Asm.js will never be used for “standard” web applications, but I wonder whether it could become in the future the “native” language for a new type of web applications based on HTML5. The browser could become the host for a new kind of portable OS, with the canvas and WebGL APIs used for the user interface, the WebAudio API for the audio, WebRTC for real time media streaming, WebSockets and XMLHttpRequests for networking, local File API and cloud APIs for storage, and so on.

Asm.js could be the target language for a number of Emscripten-compiled libraries and components that could be reused by “normal” web applications. For example, Asm.js should be fast enough to implement new video codecs, or to build whole canvas-based UI frameworks, with controls, animations and data binding (even though the need for this is questionable). Or maybe the future of the web (and of mobile apps) lies in desktop-style apps that run in the browser, written in statically-typed languages and transpiled into JavaScript, with Asm.js as the native language.

And very soon C# and the other .NET languages could also be viable alternatives to C++ for compiling into Asm.js. Microsoft is now working on LLILC, an LLVM-based MSIL compiler. For now the goal of this open-source project is to create a .NET JIT for LLVM, but the plan is to add an AOT compiler in the future.


It is also worth noting that people are already working on even faster alternatives to Asm.js. One small issue with Asm.js is that the assembly-like code is quite verbose, and for large programs having to download and then parse large codebases can take some time. This problem should be solved by WebAssembly (wasm), a low-level binary format that directly represents the AST of a program and is designed to be smaller, faster to parse and faster to execute than JavaScript. All the main browser vendors are working on supporting WebAssembly in their script engines, and Emscripten should soon support WebAssembly as a compilation target.

Insert coin…

This concludes my little exploration of the world of retro-gaming and the emulation of vintage arcade games. If you are interested, you can play with my version of Galaga online HERE.
This is a screenshot of the game in action, running on Microsoft Edge:


You can find the sources here:

Implementing iterator blocks in C++ (part 3: LINQ)

Now that we know how to implement coroutines with fibers, we can use them to port a few familiar C# constructs, like iterator blocks and LINQ operators, to the unmanaged world of C++. The result will be in the form of a small template class library. A first code drop of the cppLinq library is available in this source repository on Google Code.

The cppLinq library compiles with the Developer Preview of Visual Studio 2011; I have used a few features from the latest C++11 standard that were not well implemented in VS2010 (for example, VS2010 had a few problems with nested lambda expressions). I am also using the new C++ unit-test framework to implement the unit tests.

Small disclaimer: this is still a work in progress; I have not yet implemented all the LINQ operators, and the code could certainly use some refactoring and cleaning up. Also, using fibers in a Windows application is a bit dangerous and should be done carefully. In my opinion, this code is mostly useful as a reference for how C# iterators and LINQ actually work. I doubt I will ever use this code in a real program, but I certainly know better how to use these constructs in C# now!

Iterator blocks

We have seen that in C# an iterator is a function that returns an IEnumerable<T> or an IEnumerator<T>. When the C# compiler compiles such a function, it generates a hidden iterator class that implements the IEnumerator<T> and IEnumerable<T> interfaces, and generates code for a state machine that produces the sequence of values “yielded” by the iterator. (See part 1 or, even better, this article for a detailed description.)

We can base the C++ implementation of iterator blocks on the equivalent autogenerated C# code, using fibers in place of the state machine. In C++, we can define IEnumerator and IEnumerable as follows:


template <typename T>
class IEnumerator
{
public:
    virtual void Reset() = 0;
    virtual bool MoveNext() = 0;
    virtual T& get_Current() = 0;
};

template <typename T>
class IEnumerable
{
public:
    virtual std::shared_ptr<IEnumerator<T>> GetEnumerator() = 0;
};

Collecting our garbage

Note that GetEnumerator returns a std::shared_ptr. Managing the lifespan of iterator blocks turns out to be an interesting problem: iterators can be generated inside other iterators and can be composed in a data pipeline, so it is not easy to manually dispose of them. The code of iterator blocks is naturally more elegant when written in languages that support garbage collection. We don’t have GC in C++, but we can use shared pointers; as long as we guarantee that we always use shared pointers to manage the lifetime of our IEnumerables and IEnumerators (and that there are no circular references between them) we can be sure that they will be automatically deleted when the reference count associated to the shared pointer goes to zero.

By the way, I think the standard implementation of shared_ptr is very cool. It also provides ways to obtain a shared_ptr from this and, in the case of multiple inheritance, to convert a shared_ptr<Base1> into a shared_ptr<Base2>.

The IteratorBlock class

Going on with the implementation, we can write a template class IteratorBlock<T> that implements an iterator that produces a sequence of T objects through the IEnumerator and IEnumerable interfaces:

template <typename TSource>
class IteratorBlock : public IEnumerable<TSource>,
                      public IEnumerator<TSource>,
                      public Fiber
{
public:
    // IEnumerable
    virtual std::shared_ptr<IEnumerator<TSource>> GetEnumerator() {
        if (::GetCurrentThreadId() == _threadId && ! _enumeratorCreated) {
            _enumeratorCreated = true;
            return std::dynamic_pointer_cast<IEnumerator<TSource>>(shared_from_this());
        }

        std::shared_ptr<IteratorBlock<TSource>> cloned = clone();
        return cloned->GetEnumerator();
    }

    // IEnumerator
    virtual void Reset() {
        throw InvalidOperationException();
    }

    virtual bool MoveNext() {
        return resume();
    }

    virtual TSource& get_Current() {
        return _current;
    }

    void yieldReturn(TSource returnValue) {
        _current = returnValue;
        yield(true);
    }

    void yieldBreak() {
        yield(false);
    }

protected:
    IteratorBlock() :
        _enumeratorCreated(false),
        _threadId(::GetCurrentThreadId()) {
    }
    virtual ~IteratorBlock() {}

    virtual std::shared_ptr<IteratorBlock<TSource>> clone() const = 0;

private:
    TSource _current;
    bool _enumeratorCreated;
    DWORD _threadId;
};

Class IteratorBlock is designed to be used as the base class of an iterator class. Its implementation works as a coroutine, based on the Fiber class from which it inherits.

In order to implement an iterator block, we need to inherit from IteratorBlock<T> and provide an implementation for the abstract methods clone() and run().

As in C# iterators, the method Reset should never be called and throws an exception.

The method MoveNext simply resumes the coroutine, calling Fiber::resume. From there, execution goes to the overridden method run. Inside run we can suspend the execution of the coroutine by calling yieldReturn or yieldBreak, which behave, respectively, like the C# yield return and yield break keywords. A call to yieldReturn also specifies the current value to be returned by the iterator, stored in the data member _current.

The implementation of GetEnumerator deserves a few more words. An iterator block is an IEnumerable, from which we can ask for one or more instances of an IEnumerator. The first time we call GetEnumerator the function can return a pointer to the object itself, since it implements both interfaces, but subsequent calls to GetEnumerator return a pointer to a copy of the iterator block object. The abstract method clone needs to be implemented to create this copy by cloning an instance of the iterator class.

Example: a Fibonacci generator

As an example, we can implement a generator for the Fibonacci sequence by writing a class like the following, putting all the logic in the overridden run function.

class FibonacciIterator : public IteratorBlock<long>
{
    virtual void run() {
        long a = 0;
        long b = 1;
        yieldReturn(a);
        yieldReturn(b);
        for (int i = 2; i < 10; i++) {
            long tmp = a + b;
            yieldReturn(tmp);
            a = b;
            b = tmp;
        }
    }

    virtual std::shared_ptr<IteratorBlock<long>> clone() const {
        return std::shared_ptr<IteratorBlock<long>>(new FibonacciIterator());
    }
};

The foreach loop

Client code can interact with an iterator with code like this:

auto source = std::shared_ptr<IEnumerable<long>>(new FibonacciIterator());
auto e = source->GetEnumerator();
while (e->MoveNext()) {
    long value = e->get_Current();
    // use 'value'
}

This is the equivalent of a foreach statement, and can be encapsulated in a foreach function, templatized on the type of a function that will be applied to each item in the sequence:

template <typename T, typename Func>
static void foreach(std::shared_ptr<IEnumerable<T>> enumerable, Func f)
{
    std::shared_ptr<IEnumerator<T>> e = enumerable->GetEnumerator();
    while (e->MoveNext()) {
        f(e->get_Current());
    }
}

This allows us to write code like the following:

std::vector<std::string> greek;
greek.push_back("alpha"); // initializer lists not supported in VS11
greek.push_back("beta");

auto it = std::shared_ptr<IEnumerable<std::string>>(
    new StlEnumerable<std::vector<std::string>, std::string>(greek));

foreach<std::string>(it, [](std::string& s) {
    std::cout << s << std::endl;
});

Here we assume we have an iterator class StlEnumerable<Container, T>, which generates a sequence by iterating over the items of an STL container of T’s. (You can find this class in the sources.)

The code simply creates an iterator from an STL vector and then applies a lambda expression to each item in the vector with a foreach loop. Nothing very useful yet, but we are now almost ready to extend this with composition and LINQ-like operators.


A class like FibonacciIterator is a closure: it encapsulates some behavior (the function run) and also the context (the data members) on which the function operates.

Writing code like this (that is, a class that derives from IteratorBlock and overrides the method run) works well, but can be a little “verbose”, since it forces us to define a different class for each iterator. In C++11 we can instead implement our coroutine with a lambda expression, which automatically implements a closure for us.

The following class, _IteratorBlock, inherits from the previous IteratorBlock class and adds the capability of passing a lambda expression to the constructor, which will be executed by the method run.

template <typename TSource>
class _IteratorBlock : public IteratorBlock<TSource>
{
    struct IF {
        virtual void run(IteratorBlock<TSource>* pThis) = 0;
    };

    template <typename Func>
    struct F : public IF
    {
        F(Func func) : _func(func) {
        }

        virtual void run(IteratorBlock<TSource>* pThis) {
            _func(pThis);
        }

        Func _func;
    };

public:
    template <typename _F>
    _IteratorBlock(_F f) :
        _f(std::shared_ptr<IF>(new F<_F>(f))) {
    }

    _IteratorBlock(const _IteratorBlock& rhs) :
        _f(rhs._f) {
    }

    virtual void run() {
        _f->run(this);
    }

    virtual std::shared_ptr<IteratorBlock<TSource>> clone() const {
        return std::shared_ptr<IteratorBlock<TSource>>(new _IteratorBlock<TSource>(*this));
    }

    std::shared_ptr<IF> _f;
};

Using this specialized _IteratorBlock class we can create an iterator simply by defining a lambda. (Interestingly, it is the compiler that creates a closure class for us, storing captured variables in data members, but we don’t need to worry about this implementation detail.)

So, we can define our Fibonacci iterator more simply, as follows:

auto fn = [](IteratorBlock<long>* it) {
    long a = 0;
    long b = 1;
    it->yieldReturn(a);
    it->yieldReturn(b);
    for (int i = 2; i < 10; i++) {
        long tmp = a + b;
        it->yieldReturn(tmp);
        a = b;
        b = tmp;
    }
};
return std::shared_ptr<IEnumerable<long>>(new _IteratorBlock<long>(fn));

Reimplementing LINQ to Objects

Now we have finally all the pieces ready to reimplement LINQ to objects.

How to proceed? I could have tried to reimplement all the LINQ clauses/methods myself, or used Reflector to find out how they are actually implemented in System.Linq.dll. But since I am very lazy, I decided to just “reuse” the work of Jon Skeet, who some time ago wrote a long and interesting series of articles about reimplementing the whole of LINQ to Objects in C#.

Better still, he also wrote an exhaustive test suite for his reimplementation. All I had to do was convert his code and tests from C# to C++… 🙂

Of course, C++ does not have extension methods, so I just added new methods to the IEnumerable<T> interface. For example, this is the code for the Where LINQ clause:

template <typename T>
class IEnumerable : public std::enable_shared_from_this<IEnumerable<T>>
{
public:
    virtual std::shared_ptr<IEnumerator<T>> GetEnumerator() = 0;

    // LINQ operators
    template <typename Predicate>
    std::shared_ptr<IEnumerable<T>> Where(Predicate predicate) {
        if (nullptr == predicate) {
            throw ArgumentNullException();
        }

        // deferred
        std::shared_ptr<IEnumerable<T>> source = shared_from_this();
        auto fn = [source, predicate](IteratorBlock<T>* it) {
            foreach<T>(source, [it, predicate](T& item) {
                if (predicate(item)) {
                    it->yieldReturn(item);
                }
            });
        };
        return std::shared_ptr<IEnumerable<T>>(new _IteratorBlock<T>(fn));
    }
};

LINQ methods use deferred execution: until a client starts trying to fetch items from the output sequence, they won’t start fetching items from the input sequence. Consequently, a method like Where is divided into two parts: argument validation, which executes immediately, and the actual iterative code, which executes later, “on demand”.

The Where method is templatized on the type of the predicate function passed as argument. This allows us to pass as predicate either a lambda expression, a std::function, or any “functor” class that implements operator(). The implementation of Where is simply a lambda expression that is passed to an instance of the _IteratorBlock class. The lambda keeps fetching items from its source enumerator until the sequence is over or an item is found that satisfies the predicate; in the latter case, the item is yielded to the caller.

Again, LINQ operators are lazily evaluated: expressions are evaluated only when their value is effectively required.

Data pipelines

What makes LINQ operators particularly useful is the ability of compose them into data pipelines, using the output of an iterator as the source of the next iterator in a pipe.

For example, the code in the following (quite contrived) example prints the squares of all even integers smaller than 10:

auto source = IEnumerable<int>::Range(0, 10);

auto it = source->Where([](int val) { return ((val % 2) == 0); })
                ->Select<double>([](int val) -> double { return (val * val); });

foreach<double>(it, [](double& val) {
    printf("%.2f\n", val);
});

Unit Tests

Unit tests are written using the new “CppUnitTest” framework that ships with the preview of VS11. As said, the tests are based on (and by “based on” I mean copied from 🙂) the unit tests that Jon Skeet wrote for his “EduLinq” blog series.

For example, this is one of the unit tests for the Where method:

        int v[] = { 1, 3, 4, 2, 8, 1 };
        std::shared_ptr<IEnumerable<int>> source(new Vector<int>(v, ARRAYSIZE(v)));
        std::function<bool(const int&)> predicate = [](int x) { return x < 4; };

        auto result = source->Where(predicate);

        int expected[] = { 1, 3, 2, 1 };
        Assert::IsTrue(result->SequenceEqual(expected, ARRAYSIZE(expected)));


This ends my quick excursus into the world of unmanaged iterator blocks. The sources of the small cppLinq library can be found here. A few final comments:

  • As I said, this is still a work in progress. There are a few LINQ methods left to implement (mainly OrderBy, which is the most laborious to write), but I hope to complete them soon.
  • Basing the implementation on Win32 fibers has many disadvantages, but also one advantage: in C# iterators are implemented as state machines, and it is not possible to yield from more than one stack frame deep. Here we don’t have that limitation.
  • It is very interesting to play with some of the features of the new C++ standard. Used together, templates and lambdas are a very powerful tool and allow us to make the C++ code very similar to its C# equivalent. But I found that this kind of C++ code is a bit too complicated to write correctly, and when something is wrong the resulting error messages are not always easy to “decrypt”.
  • I learned a lot about how LINQ actually works by writing this.

Implementing iterator blocks in C++ (part 2: Win32 Fibers)

Fibers were added to Windows NT to support cooperative multitasking. They can be thought of as lightweight threads that must be manually scheduled by the application.

When a fiber is created, it is passed a fiber-start function. The OS then assigns it a separate stack and sets up execution to begin at this fiber-start function. To schedule a fiber, you switch to it manually; when running, a fiber can suspend itself by yielding execution to another fiber, or back to the “calling” fiber. In other words, fibers are a perfect tool for implementing coroutine sequencing.

These two articles, from Dr. Dobb’s and MSDN Magazine, explain how to implement coroutines using the Fiber API. The MSDN Magazine article also shows how to do this in .NET (it was written before the release of .NET 2.0, so before iterators were available in C#; but fibers don’t really get along well with the CLR anyway). Together with an old series of Raymond Chen’s blog posts, these articles have been the main source of inspiration for this small project. (In his article Duffy proposes another possible implementation of coroutines based on threads, which is however quite inefficient.)

Interestingly, a Fiber-based implementation does not have the limitations of (state machine based) C# iterators; with fibers it is possible to yield the control from any function in a stack frame. There are other problems, though.

Fibers are like dynamite

The Win32 Fiber API is quite simple. The first thing to do is to convert the thread on which the fibers will run into a fiber, by calling ConvertThreadToFiber. After this, additional fibers are created using CreateFiber, passing as parameter the address of the function to execute, just like the threadProc of real threads. A fiber can then suspend itself and start the execution of another fiber by calling SwitchToFiber. Finally, when the application is done using fibers, it can convert the “main” fiber back to a normal thread with ConvertFiberToThread.

There are a few important caveats to consider:

  • It is difficult to write a library or framework that uses fibers, because the entire program must be designed to support them.  For example, the function ConvertThreadToFiber must be called only once in a thread.
  • Since each fiber has its own stack, it also has its own SEH exception chain. This means that if a fiber throws an exception, only that fiber can catch it. The same is true for C++ exceptions. Exceptions cannot pass across fibers’ “boundaries”.
  • The default stack size is 1MB, so using many fibers can consume a lot of memory.
  • The code must be “fiber-safe”, but most code is designed to be just “thread-safe”. For example, using thread-local storage does not work with fibers, and in fact Windows Vista introduced an API for Fiber local storage. More importantly, the CRT was not completely fiber-safe in the past, and I am not sure it is now. There are also compiler options to set in Visual Studio, like /GT, which enables only Fiber-safe code optimizations.

In other words, as Raymond Chen put it, “Fibers are like dynamite. Mishandle them and your process explodes.” Therefore, the framework of iterators and LINQ-like operators we’ll define here should be used VERY CAREFULLY in real programs.
(And if you think things are bad in C++, consider that fibers are practically unusable with .NET managed code!)

Implementation details

Putting all these caveats aside, let’s see how coroutines can be implemented with fibers. This is the declaration of a Fiber class I created as a thin wrapper over the Win32 Fiber API.

class Fiber
{
public:
    Fiber();
    virtual ~Fiber();

    static void enableFibersInCurrentThread();
    static bool disableFibersInCurrentThread();

protected:
    void* main();
    void* resume();

    virtual void run() = 0;
    void yield(bool goOn);

private:
    static void WINAPI fiberProc(void* lpFiberParameter);

    PFIBER_START_ROUTINE _fiber;
    PFIBER_START_ROUTINE _previousFiber;

    // cross-fiber exception propagation
    bool _exceptionCaught;
    std::exception _exception;

    enum FiberState {
        FiberCreated, FiberRunning, FiberStopPending, FiberStopped
    };
    FiberState _state;
};

The Fiber class exposes the static functions enableFibersInCurrentThread and disableFibersInCurrentThread that wrap the initialization/termination functions.
The constructor creates a new fiber object, specifying Fiber::fiberProc as the function to execute.

Fiber::Fiber() : _state(FiberCreated), _exceptionCaught(false)
{
    _fiber = (PFIBER_START_ROUTINE)::CreateFiber(256 * 1024, fiberProc, this);
}

As said, fibers use cooperative multitasking: the execution of a fiber must be explicitly scheduled by the application by calling SwitchToFiber. This is encapsulated by the method resume below:

bool Fiber::resume()
{
    if (nullptr == _fiber || _state == FiberStopped) {
        return false;
    }

    _previousFiber = (PFIBER_START_ROUTINE)::GetCurrentFiber();
    assert(_previousFiber != _fiber);

    ::SwitchToFiber(_fiber);

    if (_exceptionCaught) {
        throw _exception;
    }

    return (FiberRunning == _state);
}

When the fiber is started with SwitchToFiber, it begins executing from the fiberProc method, which calls main:

void CALLBACK Fiber::fiberProc(void* pObj)
{
    Fiber* pFiber = (Fiber*)pObj;
    void* previousFiber = pFiber->main();
    ::SwitchToFiber(previousFiber);
}

What main does is simply to call the abstract function run, which any class derived from Fiber needs to implement.
Inside run we can at some point yield the execution to a different Fiber object, calling its resume method, so effectively creating a “stack” of fibers nested into each other. Or, we can yield the execution to the previously running fiber (the one that launched the current one), calling the method yield (equivalent to yield return and yield break in C#):

void Fiber::yield(bool goOn)
{
    if (! goOn)
        _state = FiberStopped; // yield break
    ::SwitchToFiber(_previousFiber);
    if (FiberStopPending == _state)
        throw StopFiberException(); // the owner is being destroyed
}
Logically this works like the sequence of nested function calls in a thread stack, but of course there is no single call stack here: we can return to the caller only by storing a pointer to the previous fiber and explicitly yielding control to it.

Since exceptions cannot travel outside a fiber, the call to run is wrapped in a try-catch block that catches any kind of exception (even Win32 exceptions). If an exception is caught, its data is stored, the fiber is stopped, and fiberProc ends by resuming the previous fiber, inside the resume function. There the exception can be re-created and re-thrown in the context of the previous fiber, effectively forwarding it back up the “fiber stack”. I am not sure this is a very elegant workaround, but I could not find a better solution.

void* Fiber::main()
{
    _state = FiberRunning;
    _exceptionCaught = false;

    try {
        run();
    }
    catch (StopFiberException&) {
        _state = FiberStopped;
    }
    catch (std::exception& e) {
        _exception = e;
        _exceptionCaught = true;
        _state = FiberStopped;
    }
    catch (...) {
        _exception = std::exception("win32 exception");
        _exceptionCaught = true;
        _state = FiberStopped;
    }

    _state = FiberStopped;
    return _previousFiber;
}

Finally, when we are done with a fiber object, we can delete it, releasing all the resources it had allocated and the memory of its stack.

Fiber::~Fiber()
{
    if (_state == FiberRunning) {
        _previousFiber = (PFIBER_START_ROUTINE)::GetCurrentFiber();
        _state = FiberStopPending;
        ::SwitchToFiber(_fiber); // let the fiber unwind
    }
    ::DeleteFiber(_fiber);
}

The Fiber class is the main building block in the implementation of coroutines. What is left to do is just to write classes that inherit from Fiber and implement in the virtual function run the actual coroutines code to execute.
In the next posts we’ll see how to use the Fiber class to implement iterator blocks that work like in C#, implementing the IEnumerable and IEnumerator interfaces, and how to use this to reimplement many Linq operators.

If you are curious, you can look at the sources of this small project. A first code drop (still quite a preliminary version) is available in this source repository on Google code.
Let me know your comments and suggestions, if you are so kind to have a look.

Implementing iterator blocks in C++ (part 1)

Recently I had to port some C# code to C++. The C# code was making limited use of the .NET framework, so most of the porting turned out to be fairly simple. (I only had to pay attention to the difference between value and reference types and to fix small issues with uninitialized data members, where the code relied on the default initialization in C#.)

But part of the code was also using some of the features of C# 3.0: implicit types, lambda expressions, iterator blocks and LINQ. Some of this code is not difficult to port to C++, if we target the new C++11 standard. For example C# ‘var’ maps to C++ ‘auto’; C# lambda expressions can be translated into C++ lambdas.

But there is nothing in C++ that matches the elegance of IEnumerable<T> sequences and LINQ, so for the problem at hand I just ended up rewriting the LINQ-based code using STL containers and algorithms. That worked fine, but made me wonder: is there a way to implement C# iterators in C++, and, consequently, to have something similar to LINQ-to-objects in unmanaged code?

The answer is… kind of. In this short series of posts I’ll try to present a possible solution, based on co-routines implemented with Win32 Fibers.


A few caveats: this is really just an experiment, a proof of concept, a toy project. The best way to deal with algorithms and containers in C++ is certainly to use the STL. I found that using Fibers has many drawbacks in Windows and makes it very difficult to write correct code. Also, the absence of a garbage collector makes things much trickier, as we will see.

But still, I think that emulating the behavior of C# iterators in C++ is a very interesting exercise, and a great way to learn how LINQ actually works. In many ways it also made me appreciate even more the simplicity and elegance of C#. It’s not the destination, but the journey.

C# iterator blocks

Iterators are objects that allow you to traverse a sequence of elements. Iterators in .NET are the basis of the foreach statement and of the LINQ-to-Objects framework. A very good introduction to iterators, iterator blocks and data pipelines can be found here, or in Jon Skeet’s “C# In Depth”.

To summarize, the iterator pattern in .NET is encapsulated in the IEnumerable/IEnumerator interfaces, which are defined as follows:

public interface IEnumerable<T> : IEnumerable {
     IEnumerator<T> GetEnumerator();
}

public interface IEnumerator<T> : IEnumerator, IDisposable {
     void Reset();
     bool MoveNext();
     T Current { get; }

     void Dispose();
}
An object that wants to enumerate a sequence of data implements the IEnumerable interface. Data consumers ask the IEnumerable for an IEnumerator by calling GetEnumerator(), and then iterate over the sequence calling MoveNext and reading the Current property. MoveNext returns false when there is no more data to return.

A foreach statement is transformed by the C# compiler into code like this:

IEnumerator<int> it = source.GetEnumerator();

while (it.MoveNext()) {
     int value = it.Current;
     // use ‘value’
}
Note that the Reset() method is never called.

C# 2.0 added support for iterator blocks, with the yield keyword. An iterator block is a function that contains the yield keyword and returns an IEnumerable interface. For example, this function returns the sequence of the first 10 Fibonacci numbers:

class Test
{
     public static IEnumerable<long> Fibonacci()
     {
         long a = 0;
         long b = 1;
         yield return a;
         yield return b;
         for (int i = 2; i < 10; i++)
         {
             long tmp = a + b;
             a = b;
             b = tmp;
             yield return b;
         }
     }
}

It is interesting to understand how the function returns the sequence of numbers. A function like this is completely different from normal methods of a class. When normal methods are invoked, execution begins at the beginning and once the method exits, it is finished; an instance of a method only returns once. In the case of iterators, instead, the execution is paused each time a value is yielded. Therefore, C# iterators behave like coroutines.

More precisely, coroutines can be defined as generalized subroutines with the following three properties:

  • Local data of a coroutine persists between successive calls.
  • Execution is suspended at the point where control leaves a coroutine (perhaps to allow another coroutine to produce something for it). When control reenters the coroutine, execution is resumed at the point where it previously left off.
  • Control can be passed from one coroutine to another, causing the current coroutine to suspend its execution and transfer control to the new coroutine.

C# iterators can be considered like a basic form of coroutines. In this article Joe Duffy explains how they can be used to implement a kind of co-routine scheduler, which can be used to schedule lightweight tasks for simulated concurrent execution.

State machine

However, C# iterators are not really coroutines. They are implemented by the C# compiler by generating the code of a state machine. More precisely, for a function like the one above, the compiler generates a class that represents the closure for the block of code in the function, and that exposes both the IEnumerable and the IEnumerator interfaces. The effect of this compiler magic is that an instance of this iterator function will suspend (or “yield”) itself and can be resumed at well-defined points.

The core of the implementation is in the MoveNext method: this is where the state machine is carefully crafted to produce the sequence of values according to the sequence of yield statements. When a yield statement is met, MoveNext exits, returning the corresponding value, and the state machine records the exact state of the iterator at that point. When MoveNext is called again, the state machine resumes the execution from the instruction that follows the last issued yield.

This implementation has one drawback: it is possible to yield from only one stack frame deep. The C# compiler’s state machine can only transform a single function; it obviously cannot do that for the entire call stack. Real coroutines can yield from an arbitrarily nested call stack, and have that entire stack restored when they are resumed.

For example, C# iterators cannot be used to efficiently implement in-order traversal of a binary tree. The following code implements a recursive iterator. It works, but with poor performance: if the tree has N nodes, it builds O(N) temporary iterator objects, and each yielded value must bubble up through the whole chain of nested MoveNext calls.

class TreeNode<T>
{
    public TreeNode<T> left;
    public TreeNode<T> right;
    public T value;

    public static IEnumerable<T> InOrder(TreeNode<T> t)
    {
        if (t != null)
        {
            foreach (var i in InOrder(t.left)) { yield return i; }
            yield return t.value;
            foreach (var i in InOrder(t.right)) { yield return i; }
        }
    }
}

// build tree
TreeNode<int> root = new TreeNode<int> { … };

// in-order traversal
var query = TreeNode<int>.InOrder(root);
foreach (int i in query) {
     Console.WriteLine("{0}", i);
}
We’ll see how this limitation can be overcome with an implementation of coroutines based on Fibers.

I will not delve into more implementation details, since a very detailed explanation can be found here and here. It can also be very interesting to look at the generated code using the Reflector tool.


In the next posts we’ll see a possible way to reproduce in C++ the behavior of C# iterators (deferred execution, lazy evaluation, “yield” semantic and so on) using Win32 Fibers. And we’ll use iterators as the base for reimplementing LINQ-style queries in C++.

Of course, I can’t think of a way to reproduce the declarative query syntax in C++, with keywords like from, where, select. But we should be able to write code like the following, with queries defined as a pipeline of operators:

auto source = IEnumerable::Range(0, 10);

auto it = source->Where([](int val) { return ((val % 2) == 0); })
                ->Select([](int val) -> double { return (val * val); });

foreach(it, [](double& val) {
     printf("%.2f\n", val);
});

WinRT: reference count strikes back

At first glance WinRT objects look similar to .NET objects. In fact, WinRT supports many constructs we can find in the .NET framework: classes, methods, properties, delegates, events. But, under the hood, the model is completely different. In WinRT there is no managed runtime.

One of the differences that will be the most painful, in my opinion, is that WinRT objects are not garbage collected. They are COM objects. They are reference counted; their lifespan is managed by calling the dear old AddRef, Release and QueryInterface IUnknown functions.

Old-time COM developers remember that manually managing the calls to AddRef, Release and QueryInterface was no fun; COM rules could get very complicated, and smart pointers (like ATL CComPtr<>) were very handy, in most cases, to simplify the developer’s life.

The good news is that the WinRT language projections take care of managing the reference count for us (as nicely as the old VB used to do). Normally, there should not be many reasons to call AddRef and Release explicitly. But still, WinRT objects are not garbage collected. This makes a big difference: it is only sometimes possible to have (a kind of) memory leak in .NET, but it is very easy to have memory leaks in COM. All it takes to leak two objects is a cross-reference between them.

A small memory leak example

Look at the small class in Listing 1 as an example. The class WinRTComponent exposes a single property, Ref, which can be set to a reference to another instance of WinRTComponent. Listing 2 shows the code of a client JavaScript app which references the WinRTComponent project. When the app is running, every time the page is clicked the function addAndLinkObjects instantiates two instances of WinRTComponent and sets cross references between them.

We can first run the application with the cross reference disabled (commenting out the line that sets the second reference: objB.ref = objA;). Once the JavaScript objects that “wrap” the WinRT objects go out of scope, they can be deleted by the JS garbage collector. We don’t know when this deletion will happen; JavaScript uses its own garbage collection algorithm to decide when to destroy unreferenced objects. But the collection will certainly happen at the latest when the application is closed.

We can force the application to terminate with Alt+F4. Normally this causes the JS wrapper to release the last reference to the WinRT object, which is then destroyed too. This can easily be verified by setting a breakpoint in the class destructor.

But what happens if we uncomment the last line and set a cross-reference between the two WinRT objects? In this case the two WinRTComponent objects, cross-referencing each other, will still have a reference count of one and will stay alive forever. The destructor breakpoint will never be hit. In other words, the objects are leaked. (Not really forever, of course: their memory is deallocated when the process terminates.)

This memory leak pattern is quite bad, for two reasons:

  1. In a mobile/tablet environment memory is a scarce resource, so any memory leak should be avoided.
  2. If we use a RAII pattern and rely on the destructor to do some clean up (like closing a DB connection), the destructor could never be called.

In his //build/ talk on the internal details of C++ Metro apps, Deon Brewis mentioned using weak references as a possible way to avoid this kind of problem. Before the next post I’ll try to find out how we can use weak references to fix our WinRTComponent class.

Listing 1 – The WinRT object

#pragma once
#include <Windows.h>

using namespace Windows::Foundation;

namespace WinRTTest1
{
    public ref class WinRTComponent sealed
    {
        WinRTComponent^ _ref;

    public:
        WinRTComponent() {}
        virtual ~WinRTComponent() {}

        property WinRTComponent^ Ref
        {
            WinRTComponent^ get() { return _ref; }
            void set(WinRTComponent^ propertyAValue) { _ref = propertyAValue; }
        }
    };
}
Listing 2 – The JavaScript client
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8" />
    <!-- WinJS references -->
    <link rel="stylesheet" href="/winjs/css/ui-dark.css" />
    <script src="/winjs/js/base.js"></script>
    <script src="/winjs/js/wwaapp.js"></script>
    <!-- WinWebApp1 references -->
    <link rel="stylesheet" href="/css/default.css" />
    <script src="/js/default.js"></script>
    <script>
        function addAndLinkObjects() {
            var objA = new WinRTTest1.WinRTComponent();
            var objB = new WinRTTest1.WinRTComponent();
            objA.ref = objB;
            //objB.ref = objA;
        }
    </script>
</head>
<body id="b" onclick="addAndLinkObjects();">
</body>
</html>

Windows 8 and WinRT

At the recent //build conference Microsoft presented Windows 8 and offered its first bits, publishing a “Developer Preview” version available for download at the MSDN site.

The big change in Win8 is the shift to “Metro-style” apps, targeted especially to mobile and tablet devices. Steven Sinofsky explained the design goal for Windows 8 as “no compromises”, in the sense that Windows 8 will support the old desktop interface together with the new, immersive, Metro-style interface.

Desktop apps will continue to be written with the technologies used today (Win32, .NET, WPF, and so on). Metro apps are written either using HTML5/JavaScript or using an XAML-based, totally unmanaged GUI framework, codenamed Jupiter. While unmanaged, Jupiter offers an API very similar to WPF and Silverlight.

In both cases, at the foundation of Metro apps lies WinRT (the Windows Runtime). This is a new runtime, also totally unmanaged, based on COM and written in C++, which targets C++, .NET and JavaScript clients.

From a purely technical point of view, WinRT is the most fascinating aspect of Windows 8. Heavily inspired by .NET, it brings some of the strengths of the managed world to unmanaged C++.

Having downloaded and installed the first prerelease I feel like a kid with a new toy to break and put back together. What I hope to do in these pages is to present a few notes of my little exploration of Win8 and its new runtime.