Galaga: an Arcade machine emulator for Windows and HTML5

Retrogaming

Galaga was a very popular arcade in the ‘80s. When I was a kid I must have spent so many coins to play that game that I could probably drive a Ferrari, today, had I better invested them. 🙂

pic1

One of the scenes that made the 1983 movie “Wargames” endearing to me is the one where the main character, David (Matthew Broderick) plays Galaga before going to school, and not even the presence of the beautiful classmate Jennifer (Ally Sheedy) can really distract him from his game… 🙂

pic2

A few years ago I must have had really too much free time, so I engaged in the toy project of writing an emulator for the Galaga arcade machine. By emulation I don’t mean writing a game “just like” Galaga (which would not be a trivial task, anyway) but really emulating the hardware of that old arcade machine in all its details: the original ISA we want to emulate is those of the old Z80 processor and the program we want to run is the one stored in the original ROMs. Of course I knew that MAME and MESS already existed, and I snooped through its sources to find out how that machine actually worked.

My initial goal was to have something running on a browser, so not tied to the Windows platform, and to learn something about software emulation and virtual machines in the process. For this reason I chose to use first Silverlight as host, a platform that looked very promising at the time. After that I quickly ported the code from C# to C++ (even to see what the speed-up could be) and made it run as a WinRT app, more recently converted into a UWP app for Windows 10.

Nothing special, so far. I was just doing for a single game what MAME has done better for thousands of games. But more recently I discovered the wonderful world of Emscripten and I wondered: how difficult can be to make the engine REALLY PORTABLE, compiling it to JavaScript and targeting HTML5? And can JavaScript run fast enough to emulate an arcade machine? It’s what we are going to discover! In this blog post I’ll present a few technical notes on this toy project, a small journey from the initial idea to the final HTML5 implementation.

The machine

But let’s start from the beginning: how did an arcade machine from the 80s work? And how difficult is it to emulate it?

The following picture presents a high-level view of the Galaga hardware:

pic3

Interestingly, Galaga was multi-core. There were three Z80 8-bit processors, which shared the same address space and RAM, but executed each a different program, stored in a set of different ROMs. Other ROMs contained the graphics (sprites and a tiles) and the audio waveforms used to generate sound effects.

There was a form of inter-processor communication based on interrupts: Z80 handles maskable interrupts by executing an interrupt handler located at a fixed address in the ROM. In Galaga the interrupt handler managed a vector of commands with a jump table, so the main CPU could invoke a procedure on one of the other processors simply by setting the index of a command in a location in shared memory and signaling an interrupt.

Beside the main CPUs there were a number of custom chips, dedicated to different functions. There was one just to handle the graphics effects of the star-field in the background, others were used to to render the graphics, to manage the input controls and to produce the sound effects.

To play Galaga we need to emulate the workings of all its components. All the pieces must be connected as they were in their circuit board; and for this we need to find detailed information about the original schematics. Luckily, besides MAME, there are a numbers of sources for this info in the Internet. Galaga’s popularity meant that there has been a great deal of low-level hacking over the years. For example, someone entirely disassembled the 24K bytes of Z80 code from its ROMs (1, 2, 3) to the point of being able to find the root cause and fix the famous bug that made the alien bees stop firing after a while. Another electronics wizard wrote the code to emulate the sound hardware on a PIC.

The Emulator

I wrote the machine emulator engine in C++, making sure that it compiled with the latest versions of Visual Studio and Clang, so that it could be used to drive different hosts, possibly running on different platforms.

Processors

The first thing I needed was a software emulator for the Z80 microprocessor. The emulator, of course, should run in real time, so it is important that the emulation of the Z80 instruction set should be performed very efficiently. The original machine had three Z80 processors running at ~3MHz, so we need to make sure that the time spent to emulate these chips is just a fraction of the machine nominal execution time. In other words, we want to emulate 10 msec of machine time in less than 10 msec).

There are two main techniques for emulating an instruction set (see Smith, Nair “Virtual Machines”). The most efficient one is binary translation, which converts whole blocks of a source program into a customized binary program in the target ISA. A much simpler technique is the “decode-and-dispatch interpreter”, which works by fetching one instruction at the time, analyzing and executing it and modifying the source state.

In our case more than 20 years have passed since the days of 8-bit processors, so we can afford to choose the simplest technique and write an interpreter for Z80. Even so, this turns out to be the most computationally expensive part of the emulator, so it’s important to optimize carefully this code.

Besides the Z80, a couple of the custom chips in the machine are also based on another very simple 4-bit microprocessor, which also needs to be emulated, and even here we can use a simple “decode-and-dispatch interpreter”.

Memory and I/O

We have seen that the memory space is partially shared by the three main processors, so when an address is accessed by the CPU, it may refer to a portion of the same RAM or it may refer to one of the ROMs, which are instead separated per processor, with different ROMs mapped into the same range of addresses for different processors. This is not the whole story: some of the other chips are also memory-mapped, so the CPU instructions used to access the memory can also be used for accessing these devices; the emulator needs to take care of all this, managing the mapping of addresses into the right device.

Other devices use instead port-mapped I/O and are accessed via a dedicated set of microprocessor instructions, (in and out in Z80) on a separate, dedicated address space. Our emulator, of course, must also take care of this port-mapping.

Graphics

The graphics system was similar to those of the 8-bit microcomputers of the time (like the Commodore 64). The screen (of 224 x 288 pixels) was managed as a tilemap of 8×8 characters, and the graphics system also supported up to 64 sprites that were rendered above the tilemap.

The generation of a screen frame takes three steps:

  1. First, the star field in the background is drawn. In the original, there is a custom chip to move and draw the stars; its behavior needs to be replicated.
  2. Then the matrix of 28 x 36 square tiles (224 x 288 pixels) is drawn over the star field, by reading from video-memory the index of the tile to draw and the color to use. Tiles are used for all the text that appears in the game, like the score; the content of each tile (a bitmapped font) is read from a ROM.
  3. Finally it draws up to 64 sprites over the tiles. Even here, size and position and colors of each sprite is stored in a mapped area of video memory, while the sprite bitmaps are read from separate ROMs.

Audio

Audio synthesis turned out to be by far the most complicated part of this project. In Galaga the audio system is composed by two different chips that generated two audio signals.

The first chip is a 3-voice waveform sounds generator. It uses a library of waveforms, read from a PROM, to generate digital samples for sound effects and little tunes.

The second chip, a sound/noise generator based on a 4-bits controller, was only used to generate explosions. This is a discrete sound circuit, made with analog components, like operational amplifiers and DAC. The emulation of a discrete circuit was the part where I really had to dig into the MAME sources to be able to move on.

A final mixer merges these two audio streams and generates a sequence of samples in the format of a single PCM channel of 16 bit samples at a sample rate of 48 KHz.

Interface

A nice thing about the emulation of a game machine is that, once the engine is written and works, it can easily be “hosted” in different apps.

We can see the arcade emulator as a black box which takes a number of inputs (the state of the button controls, coin insertion) and, when run for some time interval (quantum) of N microseconds, produces a set of outputs: the video frame that results from the execution of the machine and the set of audio samples that were generated in that interval.

pic4

The input interface turns out to be very simple: there is a small set of buttons or switches that can be turned on or off at any moment. Three buttons are used to horizontally move the ship and to shoot, other buttons are used to start the game, and a switch is triggered when a coin is inserted:
was a very popular arcade in the ‘80s

void SetInsertCoin(bool value);
void SetStart1Player(bool value);
void SetStart2Players(bool value);
void SetMoveLeft(bool value);
void SetMoveRight(bool value);
void SetButton(bool value);

All the rest of the machine behaviour can be encapsulated in a single function, that runs the machine for a time interval specified:

void Run(double microSeconds, VideoFrame& videoFrame, AudioFrame& audioFrame);

To guarantee smooth video rendering and low-latency audio we need to keep the time interval very small. For example, a quantum of 16.7 msec would allow a frame rate of 60 frames/second and a reasonable low latency for the sound.

Note that the code of the engine is totally platform-independent; once we have defined the format for audio samples and video frames, until we just change the state of the input controls and run quanta of emulation time, we are not really doing anything that interacts with the host, yet.

So, we can reuse the same emulator experimenting with different “host applications”, comparing the technologies and measuring their performances. We saw that the engine produces screen frames, in the form of arrays of pixels, and audio samples, in the form of arrays of PCM samples; the challenge will be to find the most efficient way to render these frames and to play these samples.

Here I’d like to present three different versions of a host application. Being Windows still my preferred platform, the first choice was to write the game as a classic, Direct2D-based Win32 app. The natural evolution was to move then to a UWP app for Windows 10. (We’ll see that in both cases Windows provides nice APIs that make the hosting code quite simple). Finally, we’ll move away from native code towards something very different, a HTML5/JavaScript app, and we’ll see how we can reuse the same C++ code that drove the Windows’ apps.

pic5

Direct2D

Let’s start with a classic Windows app. In this case Direct2D is probably the best choice to quickly redraw screen frames. Personally I am a big fan of Direct2D and of its simple, low-level interfaces. (In one of my projects at work recurring to pure Direct2D for the most complex part of our user interface unexpectedly turned out to be a much simpler and faster solution than relying on a more declarative XAML-like framework… but this is another story :-)).

There are very efficient ways to render video frames with Direct2D, like swap chain buffers, but in our case we don’t need anything so sophisticated; it is sufficient to use a single bitmap to represent the game screen.

The code (error handling omitted) is very simple indeed; we just need to create a render target associated to the app window and to create a bitmap with a pixel format that matches those of the screen frames produced by the emulator:

    CComPtr<ID2D1Factory> _pDirect2dFactory;
    CComPtr<ID2D1HwndRenderTarget> _pRenderTarget;
    CComPtr<ID2D1Bitmap> _pBitmap;
    ...

    D2D1CreateFactory(D2D1_FACTORY_TYPE_SINGLE_THREADED, &_pDirect2dFactory);
    _pDirect2dFactory->CreateHwndRenderTarget(
        D2D1::RenderTargetProperties(),
        D2D1::HwndRenderTargetProperties(m_hwnd, size),
        &_pRenderTarget);
    D2D1_PIXEL_FORMAT pixelFormat = D2D1::PixelFormat(DXGI_FORMAT_B8G8R8A8_UNORM, D2D1_ALPHA_MODE_IGNORE);
    _pRenderTarget->CreateBitmap(
        D2D1::SizeU(Width, Height),
        NULL, 0,
        D2D1::BitmapProperties(pixelFormat),
        &_pBitmap);	

ID2D1Bitmap provides a method CopyFromMemory to fill the bitmap from an array of pixels in memory, and this is all we need.

To schedule the engine we can simply use a Win32 timer; with a timeout of 17msec we get about 60 frames per second. Every time the timer expires we run the emulator for the same time interval and as result a new screen frame will be ready to be copied into the bitmap and then blitted onto the render target and drawn to the window:

{
    _galagaMachine.Run(time, videoFrame, audioFrame);

    // render screen frame
    _pBitmap->CopyFromMemory(&rect, videoFrame.Pixels, pitch);
    m_pRenderTarget->DrawBitmap(_pBitmap);

    // render audio
    (later)
}

We’ll see later how to deal with the audio samples.

UWP App

With Windows 10, we can write a UWP app that also works on mobile. In this case we use XAML and the required markup is extremely simple:

<UserControl
    x:Class="rtGalaga.MainPage"  xmlns=...
    KeyDown="LayoutRoot_KeyDown" KeyUp="LayoutRoot_KeyUp" Loaded="UserControl_Loaded">
    <Grid x:Name="LayoutRoot" Background="#FF0C0C0C">
        <Image x:Name="img" Stretch="Uniform" />
    </Grid>
</UserControl>

We just use an element to render the game screen. This is very simple, because the UWP framework provides a mechanism to fill the content of the from an array of pixels in memory, by using a WriteableBitmap as the image source:

WriteableBitmap^ _bitmap;
_bitmap = ref new WriteableBitmap(Height, Width);
img->Source = _bitmap;

XAML also provides us with a better and more efficient way to schedule our emulator. The CompositionTarget::Rendering event is fired once each time the XAML engine decides to render a frame. In normal conditions, this happens about 60 times per second, but the event will be fired less frequently when the processor is busy:

CompositionTarget::Rendering += ref new EventHandler<Object^>(this, &MainPage::OnCompositionTargetRendered);

Each time the frame rendering is scheduled, we run the emulator for the appropriate amount of time and get a VideoFrame, whose pixels we can copy into the WriteableBitmap:

void MainPage::OnCompositionTargetRendered(Platform::Object^ sender, Platform::Object^ args)
{
    // run the emulator
    galaga->Run(time, videoFrame, audioFrame);
    // render the video frame
    ComPtr<Windows::Storage::Streams::IBufferByteAccess> pBufferByteAccess;
    if (SUCCEEDED(bitmap->PixelBuffer.As(&pBufferByteAccess)))
    {
        // get pointer to pixel bytes
        BYTE* pBuff = nullptr;
        if (SUCCEEDED(pBufferByteAccess->Buffer(&pBuff)))
        {
            memcpy(pBuff, videoFrame.Pixels, …);

            // invalidate the bitmap to redraw the image
            bitmap->Invalidate();
        }
    }

    // render audio samples
    (later)
}

XAudio2

What is missing now is audio rendering. We can use the same technology both in a classic Win32 app and in WinRT/UWP. XAudio2, the replacement for DirectSound, which provides a very simple API for low-level access to the soundcard and it is particularly suitable for videogames.

XAudio2 works by building an audio processing graph and in our case the graph will be very simple: there is a source node (voice) that produces the emulator- synthesized samples (in the form of a single 16 bit PCM channel at 48 KHz) and there is a mastering voice that represents the audio device.

pic6

This is the code that creates and starts the graph (error handling omitted):

CComPtr<IXAudio2> _pMusicEngine;
IXAudio2SourceVoice* _pSourceVoice;
IXAudio2MasteringVoice* _pMasteringVoice;

std::vector<uint8_t> _buffers[QUEUE_SIZE];
int _lastPlayed;
int _lastWritten;

...

::XAudio2Create(&_pMusicEngine, 0, XAUDIO2_DEFAULT_PROCESSOR);
_pMusicEngine->CreateMasteringVoice(&_pMasteringVoice, XAUDIO2_DEFAULT_CHANNELS);

WAVEFORMATEX wfx;
ZeroMemory(&wfx, sizeof(WAVEFORMATEX));
wfx.wFormatTag = WAVE_FORMAT_PCM;
wfx.nChannels = 1;
wfx.nSamplesPerSec = 48000;
wfx.wBitsPerSample = 16;
wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.wBitsPerSample / 8;
wfx.nBlockAlign = wfx.wBitsPerSample / 8;

_pMusicEngine->CreateSourceVoice(&_pSourceVoice, (WAVEFORMATEX*)&wfx, 0, 1.0f, &_voiceContext, NULL, NULL);
_pSourceVoice->Start(0, 0);

Here _voiceContext is a pointer to an object that implements the IXAudio2VoiceCallback interface, used to buffer and synchronize the samples to be played.

Our emulator produces a continuous sequence of samples which we need to play in real time, with minimum delay and without glitches and the best way to do this is to use a circular buffer (see Petzold).

pic7

More precisely, we use an array of small buffers arranged in a circular queue so that when one of the buffers is being played, the other buffers can be filled and queued to be processed.

Every time we run the emulator for a quantum of time (GalagaMachine.Run(…)) we get the set of samples that the machine would have generated in that time interval and we insert them into the queue. For simplicity, we can ask the emulator to provide us samples in small buffers with the same size of the buffers in the queue.

{
    _galagaMachine.Run(time, videoFrame, audioFrame);

    // render screen frame
    ...

    // render audio
    int buffer = (1 + _lastWritten) % QUEUE_SIZE;
    if (buffer == lastPlayed) {
        return; // overflow
    }
    std::copy(audioFrame.samples().begin(), audioFrame.samples().end(), _buffers[buffer].begin());
    _lastWritten = buffer;
}

Meanwhile, XAudio will continue playing the queued buffer, and it will report when each buffer is finished by calling the method OnBufferEnd in the IXAudio2VoiceCallback interface. There, we can refill the buffer with more samples and add it back to the queue to be processed, submitting it to the IXAudio2SourceVoice object with code like this:

{
    _lastPlayed = (_lastPlayed + 1) % QUEUE_SIZE;

    XAUDIO2_BUFFER buf = { 0 };
    buf.AudioBytes = _buffers[_lastPlayed].size();
    buf.pAudioData = _buffers[_lastPlayed].data();
    _pSourceVoice->SubmitSourceBuffer(&buf);
}

Having a little buffering is useful to compensate for moments when the computer’s CPU is too busy to keep up with the task of producing audio samples in our emulator. The bigger the size of the buffers, the more resistant the app will be to transient changes in CPU load that could starve or overflow the queue. But we pay this buffering with latency, a delay between the moment something happens in the game and the moment the sound effect for that event is played; a bit like the delay between lightning and thunder.

On a normal Windows PC we can generally afford to use relatively small buffers; I found that using a queue with three or four 20-msec buffers (each with 48000 samples/sec * 1 channel * .02 sec = 960 samples) works very well in most cases. More sophisticated algorithms could be used to dynamically adapt the latency to the CPU load.

From UWP to HTML5

So far, everything was quite easy: writing an arcade emulator for Windows is not a particularly difficult task. With native code we can efficiently emulate another ISA and the system provides us with a simple API to quickly render video frames and to play audio samples. The real problem is that the resulting code is not portable. If we want to support a different platform we can recompile and reuse the C++ code for the emulator, but we need to rewrite all the “host” code that deals with audio and with screen frames.

Nowadays the platform of choice for user interface is, of course, the browser and JavaScript is becoming more and more the lingua franca for UI development. So the natural next step is to ask how difficult would be to write the same emulator in JavaScript and whether an interpreted language like JS can be fast enough to emulate an old arcade.

JavaScript as the assembly language for the Web

While many people consider JavaScript as a very elegant programming language, others, like me, are really put off by its lack of static typing. Whatever the preference, there is no doubt that JavaScript is perfect to write portable code.

Already in 2011 gurus like Brendan Eich (the inventor of JavaScript) and Douglas Crockford (inventor of JSON) predicted that JavaScript could be used as a target language to compile from other languages, and could become a sort of Assembly language for the Web. And even before, Erik Meijer (then in Microsoft) had idea of writing a compiler from .NET IL to JavaScript as part of the Volta project.

There are other frameworks which enable developers to “target” JS using C# or Java as source languages, like Script# and GWT but the Volta C#-to-JS compiler was the first (that I know) to really use JavaScript as an assembly-like language: it worked compiling normally .NET code into MSIL and then converting each MSIL instruction into JavaScript.

Emscripten

Project Volta has long been dead, but the idea of using JavaScript as the assembly for the Web is more interesting than ever, especially because the JavaScript engines in our browsers are today much faster than in the past.
The best way to compile into JavaScript today is to use Emscripten, an open source tool based on LLVM that compiles from C and C+.

To be more precise, what Emscripten does is to take LLVM bytecode (which can be generated from C/C++ using Clang, or any other LLVM-friendly language) and to compile it into highly-optimizable JavaScript.
For example, the following C++ function, taken from the sources of the Galaga emulator:

void Frame::SetPixel(int x, int y, int color)
{
    _pixels[(_width * y) + x] = color;
}

Once compiled (with almost all optimizations disabled) is transformed into the following JS code:

function __ZN6Galaga5Frame8SetPixelEiii($this,$x,$y,$color) {
   $this = $this|0;
   $x = $x|0;
   $y = $y|0;
   $color = $color|0;
   var $0 = 0, $1 = 0, $2 = 0, $3 = 0, $4 = 0, $5 = 0, label = 0, sp = 0;
   sp = STACKTOP;
   $0 = HEAP32[$this>>2]|0;       // $0: _width
   $1 = Math_imul($0, $y)|0;      // $1: _width * y
   $2 = (($1) + ($x))|0;          // $2: _width * y + x
   $3 = ((($this)) + 8|0);
   $4 = HEAP32[$3>>2]|0;          // $4: &(_pixels)
   $5 = (($4) + ($2<<2)|0);       // $5: &(_pixels[_width * y + x])
   HEAP32[$5>>2] = $color;        // $6: _pixels[_width * y + x] = color
   return;
}

We can see that the resulting code is really the equivalent of assembly code, only written in JS rather than in a native binary. (The optimized version would be less verbose but also less readable).

Asm.js

The generated code is full of expressions like $x|0… this is a way to coerce the variable x to be of an integer type. So there are kind of static typing rules being enforced… what kind of JavaScript is this? It turns out that Emscripten uses a well specified and strict subset of JavaScript as compilation target, named Asm.js.

Asm.js was designed (by Mozilla) exactly for the purpose of being used as a target language for the source-to-source compilation from a statically-typed languages like C/C++. You can find more details in the spec, but in a few words, asm.js is JS without all the potentially slow parts (like dynamic type guards and garbage collection); it only supports the parts that can be executed very efficiently (strictly-typed integers, floats, arithmetic, function calls, and heap accesses).

The static-type rules allow much faster JIT compilation or ahead-of-time compilation, the script engine can perform optimizations that would not otherwise be possible, and the generated code can be surprisingly fast, running at near-native speed!
But Asm.js also has its disadvantages:

  • It is difficult to write Asm.js code by hand (exactly as it is to code in x86 assembly).
  • The whole C++ application gets compiled into a single massive JavaScript file, including the standard libraries; it is not trivial to make this compiled code interact with other JavaScript parts (but some form of interaction is always necessary).
  • It is difficult to port programs that have a lot of interactions with the OS, or that do a lot of I/O, or that use many external native libraries. There is no support for multi-threading.

Furthermore, while the compiled Asm.js code is valid JavaScript that can run on every browser, it is very important that browsers explicitly support Asm.js to really take advantage of the optimizations that are possible once the asm.js code is validated. For this reason the performance of the same Emscripten-generated code may vary widely across different browsers.

If you are curious, you can use the Octane benchmark to measure the performance of your browser. There is also an interesting demo that shows very effectively what difference running on a browser optimized for asm.js can make. At this page, two chess engines are matched against one another. They are both compiled from the same Stockfish source code, but in only one has a flag set to tell the browser that the code should be interpreted (and optimized) as Asm.js. Of course this is the engine that wins all the matches as long as the game is played on a browser that supports asm.js.

This chart, taken from http://caniuse.com/#feat=asmjs, describes the current status of the browser support matrix. Currently Asm.js is fully supported by Firefox and Edge, while Chrome is still a little behind.

pic8

The arcade emulator in Asm.js

After this long introduction to Emscripten and Asm.js I think you’ll see where I’m going: if we want to port our C++ arcade emulator to JavaScript we don’t need to write (almost) any code: we can just recompile the existing C++ sources!

In fact, while it’s unlikely that Asm.js will ever be used for traditional web-development (DOM/CSS), it seems perfect for a particular application like a software emulator (and already in 2011 Fabrice Bellard used Emscripten to write an impressive emulator of a Linux PC).

The code of the emulator engine already builds with Clang, it is “self-contained” (with no external dependencies a part from the CRT and the STL), it does not use multithreading and exposes a very simple interface to its clients. So, it does not require any adjustment: once it’s compiled with Emscripten the resulting asm.js code can be used right away in the browser.

The only question left to be settled now is: will asm.js be fast enough to run a (relatively) computationally intensive application like an arcade emulator in real time?

In order to find out, we first need to install the Emscripten SDK (the latest version can be found here).
Then we run the Emscripten compiler from the command-line:

emcc
    --bind                   // uses Embind
    -std=c++14               // supports C++14
    -O3                      // enable all optimizations
    src1.cpp … srcN.cpp      // the list of files to compile
    –o galaga.js             // the result of the compilation

All the source files will be compiled into a single, big JavaScript file (about 800K bytes). It is really so simple. I’ve never had any problems with the Emscripten compiler: as long as my code compiled without errors and warnings with the LLVM toolchain for Visual Studio, it also compiled without errors with Emscripten.

The only tricky part is now to “consume” the giant blob of almost-obfuscated code that we have generated from “normal” JavaScript code, and here the –bind option comes to our rescue. When this option is specified we can use special macros to bind C++ functions to JavaScript, and vice-versa, to call JavaScript functions from C++.
In our case, we only need to add an additional file to our sources, with the following code:

#include <emscripten/bind.h>
using namespace emscripten;

#include "galagamachine.h"

    // Binding code
    EMSCRIPTEN_BINDINGS(galaga_machine)
    {
        class_<GalagaMachine>("GalagaMachine")
            .constructor<>()
            .function("Run", &GalagaMachine::Run)
            .function("set_InsertCoin", &GalagaMachine::set_InsertCoin)
            .function("set_Start1Player", &GalagaMachine::set_Start1Player)
            .function("set_Start2Player", &GalagaMachine::set_Start2Player)
            .function("set_MoveLeft", &GalagaMachine::set_MoveLeft)
            .function("set_MoveRight", &GalagaMachine::set_MoveRight)
            .function("set_Button1", &GalagaMachine::set_Button1);
    }

This will tell the compiler that it must create binding code for our class GalagaMachine (or more precisely, for the function of this class listed in the EMSCRIPTEN_BINDINGS macro). As result, the giant blob of generated JavaScript code will also contain a class Module.GalagaMachine with the methods specified, and we will be able to instantiate the whole emulator simply writing:

    var galaga = new Module.GalagaMachine();

The HTML5 Host

We are now in the same situation we were when we started the Direct2D and UWP apps: we have encapsulated the arcade emulator with a simple interface and we just need to implement the “host” code to render the game screen frames and to play the audio samples. This time, however, we want to use HTML5 and JavaScript and things get more interesting.

pic9

HTML5: the scheduler

The first problem we need to solve is how to efficiently schedule the execution of the emulator so that it can produce 50 or 60 frames per second. The more simplistic idea would be to use JavaScript timers (setTimeout(msec)), but a much better solution is to use Window.requestAnimationFrame(callback).

This is an API intended to simplify animations: we call requestAnimationFrame and pass a callback that the browser will call before the next repaint. The callback routine must itself call requestAnimationFrame() to animate another frame at the next repaint, and so on forever. The repaint rate is usually 60 times per second, but the browser may reduce it to a lower rate when the page or tab is in background or when we are in a hidden , in order to improve performance and battery life.

The following code uses requestAnimationFrame to schedule the periodic execution of the function step():

    var prevTimestamp;

    function repeatOften(timestamp) {
        var interval;
        if (!prevTimestamp || !timestamp) {
            interval = 20.0;
        }
        else {
            interval = timestamp - prevTimestamp;
        }
        prevTimestamp = timestamp;
        step(interval);
        requestAnimationFrame(repeatOften);
    }

We’ll start the scheduling in a startup function main() that does all the initialization for the game:

    function main() {
        galaga = new Module.GalagaMachine();
        initVideo();
        initAudio();

        repeatOften();
    }

In step(), the call to galaga.Run() is where we finally go to Emscripten-compiled code and run the emulator for a quantum of execution time. The duration of this interval will vary according to the variable refresh rate, and to calculate it we use the timestamp that is passed as argument to our callback.

    function step(interval) {
        galaga.Run(interval,
            function (videoFrame, audioBuffer) {
                // render videoFrame
                ...

                // play audioBuffer
                ...
            }
        );
    }

Note that the second argument passed to Run() is another callback. In fact, Emscripten bindings dictate another small change to our C++ code. While originally we could retrieve the screen frame and audio samples passing two objects by reference:

void Run(double us, VideoFrame& videoFrame, AudioFrame& audioFrame);

This would not work with Emscripten. Instead, we can make the interface more JavaScript-friendly by passing a callback function that will be called when the function Run() completes and that takes as arguments the same two objects that were originally returned by reference:

void GalagaMachine::Run(double us, emscripten::val onComplete)
{
    // ... runs the emulator for a timeslice, obtains videoFrame, audioFrame.

    onComplete(
        emscripten::typed_memory_view(Height * Width * 4, videoFrame.Pixels),
        emscripten::typed_memory_view(_audioFrame.Size, audioFrame.Samples)
    );
}

emscripten::val is a type provided by Enbind to represent any JavaScript object, and it is what makes it possible to call JavaScript code directly from C++.

HTML5 frame rendering

In order to render screen frames we need to find an efficient way to “blit” the pixels from the VideoFrame generated by the emulator onto the screen. Luckily, the Canvas 2D API provides the ImageData object that does just this. So, in HTML5 the whole UI for the game can be made by a single Canvas element suitably placed in our page:

    <canvas id="myCanvas"/>

At initialization time, we create an ImageData object with the same size of the arcade screen. A small complication is that we need to scale the ImageData to fill the whole canvas, which is resized to fill the whole browser window. The best way to do this is to draw the imageData into another off-screen canvas and then draw the off-screen canvas to the original canvas.

    var galaga;
    var canvas;
    var offscreenCanvas;
    var context;
    var offscreenContext;
    var imageData;

    function initVideo() {
        // create offscreen canvas and imageData
        offscreenCanvas = document.createElement('canvas');
        offscreenCanvas.width = 224;
        offscreenCanvas.height = 288;
        offscreenContext = offscreenCanvas.getContext('2d');
        imageData = offscreenContext.createImageData(224, 288);

        canvas = document.getElementById('myCanvas');
        context = canvas.getContext('2d');
        canvas.height = window.innerHeight - 24;
        canvas.width = canvas.height * offscreenCanvas.width / offscreenCanvas.height;
        context.scale(canvas.width / offscreenCanvas.width, canvas.height / offscreenCanvas.height);
    }

Then, periodically, after having run the emulator for a quantum of emulation time we just copy the generated screen frame onto the canvas ImageData. This copy can be done as fast as the copy of a JavaScript typed array. ImageData exposes its data as Uint8ClampedArray of four bytes per pixels, in the RGBA format. So, if the emulator generates videoFrame pixels in the same format, we can fill the ImageData with a simple call to Uint8ClampedArray.set():

    function step(interval) {
        galaga.Run(interval,
            function (videoFrame, audioBuffer) {
                // videoFrame is a Uint8Array that aliases directly into the Emscripten heap
                imageData.data.set(videoFrame);
                offscreenContext.putImageData(imageData, 0, 0);
                context.drawImage(offscreenCanvas, 0, 0);

                // Audio
                (later)
            }
        );
    }

HTML5 Audio

Finally, the last thing left to do is to play the audio sample synthesized by our emulator. This is probably the most complicated part of the whole project; one could argue that today with HTML5 WebAudio there is no really good way to do this.
WebAudio provides a rich API to capture, manipulate and play audio samples. Audio operations are performed by putting together a graph of audio nodes (sources nodes, filter nodes, output nodes, and so on).

The basic way to play a sound with WebAudio is simply to create an AudioBufferSourceNode, fill this buffer with audio samples and schedule it to be played back at the appropriate time with AudioBufferSourceNode.start(when). In our case, we want to play a continuous sequence of buffers, each containing 20ms of samples. Ideally, we should be able to schedule the buffers one after the other, like with bufferSourceNode0.start(t + 0); bufferSourceNode1.start(t + 20); bufferSourceNode2.start(t + 40); and so on. But this does not work and in all browsers produces glitching audio. Web Audio does not guarantee sample-precise stitching of contiguous buffers

There is no “buffer-queueing” functionality in the Web Audio API, which would be very useful to render audio synthesized in real time. I found a long discussion of this problem in the Audio discussion list of w3.org.

However, a (somehow convoluted) workaround does exist, using ScriptProcessorNodes. ScriptProcessorNodes are nodes that can be inserted in the graph to do some elaboration/filtering with a JavaScript event handler. The node is connected to an input and an output buffer and the event handler onaudioprocess() is called every time the input buffer contains new data and must fill the output buffer with new data.

pic10

In our case, the important feature of a ScriptProcessorNode is that it produces a continuous sequence of samples with no artifacts of glitches. Since we are generating the samples ourselves, we don’t really need any input buffer but we can rely on the fact that the system will call onaudioprocess() to fill an output buffer with audio samples that will be sent to the playback node.

pic11

But a limitation is that we cannot decide the playback rate of the output buffer; it will always be equal to the sample rate of the WebAudio context. To complicate things further, this frequency is platform-dependent. On Windows with Firefox, Chrome and Edge, it is usually 44.1 Khz, but I found out that it was instead 48KHz with any browser on my Surface Pro 3.

The problem is that the emulator also synthesizes audio at a fixed rate, 48 KHz, and when these two sample rates don’t match we need to resample our audio. And even with resampling there can still be the possibility of glitches because the ScriptProcessorNode works in the (main) thread and can easily cause performance issues. Initializing a ScriptProcessorNode takes as argument the size of the buffer that will be processed by the handler function, and which must be a power of 2 (between 256 and 16K samples). This buffer size then also determines how frequently the onaudioprocess event needs to be fired and even here there is a buffer vs latency trade-off. With smaller sizes we also have smaller latency, but the handler will be called more frequently and it will be easier to have glitching audio. For example having a 1024 samples buffer at 44.1 KHz means that onaudioprocess will be called every 1024/44.1 = 23.2 msec.

So, this is evidently not a perfect solution, and in fact the ScriptProcessorNode interface has now been deprecated, to be replaced by Audio Workers, which implement audio processing in the context of a web worker and not in the main application thread. But as far as I know, no browser supports Audio Workers yet, so ScriptProcessorNodes still are the best option.

The code to play audio with ScriptProcessorNode is not very different from the one we wrote for XAudio2. Even here we need a circular buffer to be resistant against timer jitters and changes in the CPU load. Buffering is particularly useful here, since as we have seen, the ScriptProcessorNode event handler is called in the main JavaScript thread, which is also busy periodically running the emulator.
At initialization time we initialize Web Audio and create a graph with a ScriptProcessorNode:

var BUFFER_COUNT = 8;
var BUFFER_SIZE = 1024; // samples
var _lastPlayed = BUFFER_COUNT - 1;
var _writePos = 4 * BUFFER_SIZE - 960; 

function initAudio() {
    window.AudioContext = window.AudioContext||window.webkitAudioContext;
    if (!window.AudioContext) {
        throw 'Web Audio API is not available!';
    }

    audio.context = new window.AudioContext;
    audio.buffer = new Int16Array(BUFFER_COUNT * BUFFER_SIZE);

    audio.scriptNode = audio.context.createScriptProcessor(BUFFER_SIZE, 0, 1);
    audio.scriptNode.connect(audio.context.destination);
    audio.scriptNode.onaudioprocess = ... // later
}

After running the emulator we insert the generated audio samples into the circular buffer. For brevity, I omitted the code that takes care of resampling the audio when the Audio context frequency does not match the 48KHz of our emulator.

galaga.Run(interval,
    function (videoFrame, audioBuffer) {
        // ... Render screen frame, as seen before...

        if (audio.context.sampleRate != 48000) {
            // resampling
        }

        // Queue audio samples. Note that the number of samples in audioBuffer
        // can be different from BUFFER_SIZE.
        var max = BUFFER_SIZE * _lastPlayed;
        var i = 0;
        var j = _writePos;
        var len = BUFFER_SIZE * BUFFER_COUNT;
        while (audioBuffer.length >= i) {
            if (j == max) {
                // overflow
                break;
            }
            audio.buffer[j] = audioBuffer[i];
            i++;
            j = (j + 1) % len;
        }
        _writePos = j;
    }
);

Finally, there is the onaudioprocess event handler, which is called periodically by the ScriptProcessorNode. Here we extract the samples from the queue and feed them to the output buffer.

    // This function is called periodically by the audio system and runs in the main thread.
    audio.scriptNode.onaudioprocess = function(audioEvent) {
        try {
            var channelData = audioEvent.outputBuffer.getChannelData(0);

            _lastPlayed = (_lastPlayed + 1) % BUFFER_COUNT;
            var j = _lastPlayed * BUFFER_SIZE;
            var i = 0;
            while (BUFFER_SIZE >= i)
            {
                // Convert from [-32768:32767] to [-1.0:1.0]
                channelData[i] = audio.buffer[j] / 0x8000;
                audio.buffer[j] = 0;
                i++; j++;
            }
        }
        catch (e) {
            // ...
        }
    }

And that’s really all! We can now play our synthesized sounds also on a browser. 🙂 I must say that the quality of the resulting audio is not always perfect, because of the limitations of ScriptProcessorNode, and there can be sporadic glitches here and there, on slow machines. It will be interesting to rewrite the buffering algorithm with Audio Workers, when they will become available.

Performance

Now we are in the condition to answer an interesting question. When I started porting the C++ engine to JavaScript with Enscripten I had no idea what the slow-down could be. With native (or also managed) code, it is obviously very easy to efficiently emulate 30+ year old hardware and ancient 8-bit microprocessor. But is this true even with a scripted language?

Looking back at the emulator interface, we saw that we have a single method Run(double ms) that executes the emulator for a specified interval, producing a video frame and a number of audio samples. This is the only really CPU-intensive part of our system. Once we generate a display frame, we can blit it onto the screen very quickly; the Canvas 2D API is fast. Dealing with audio samples means moving around very small arrays of bytes (for a single channel at 48 KHz, 20 milliseconds of audio only take 960 bytes) and the Web Audio API is also fast. So, the only bottleneck is in the engine itself, and everything works very well as long as a call to Run() takes much less than N milliseconds to emulate N milliseconds.

So, I timed how long running the emulator actually takes with JavaScript, on different web browsers, and I took as reference the time spent by the native emulator used in the Direct2D project. The comparison is particularly meaningful because we are comparing execution time for exactly the same code (even though transpiled into another language). The following chart shows the results, running the emulator on an old Sony laptop, i7 processor:

pic12

We can see that some browsers are faster than others with asm.js code. In particular, Chrome turned out to be the slowest, and Firefox the fastest, with Microsoft Edge not too far away. But was very, very interesting to me was to measure that the slow-down factor for Enscripten-generated JavaScript compared to the native code can be as low as 1.66. This is really not a lot, and it’s much much faster than I would have expected.

Looking forward

Today Asm.js seems only useful for a particular kind of applications, like games, emulators and in general for computational workloads. So, it seems to occupy the same niche which is also occupied by NaCl/PNaCl, the sandbox for native code in a browser developed by Google (which has not had huge success and it is still only supported by Chrome).

Very likely, Asm.js will never be used for “standard” web applications, but I wonder whether it could become in the future the “native” language for a new type of web applications based on HTML5. The browser could become the host for a new kind of portable OS, with the canvas and WebGL APIs used for the user interface, the WebAudio API for the audio, WebRTC for real time media streaming, WebSockets and XMLHttpRequests for networking, local File API and cloud APIs for storage, and so on.

Asm.js could be the target language for a number of Enscripten-compiled libraries and components that could be reused by “normal” web applications. For example, Asm.js should be fast enough to implement new video codecs, or to program whole -based UI frameworks, with controls, animations, databinding (even though the need for this is questionable). Or maybe the future of the web (and of mobile apps) could be in desktop-style apps that run in the browser, written in statically-typed languages and transpiled into JavaScript, with Asm.js as native language.

And very soon also C# and the other .NET languages could be viable alternatives to C++ to compile into Asm.js. Microsoft in now working on LLILC, an LLVM-based MSIL compiler. For now the goal of the project (open source) is to create a .NET JIT for LLVM, but the plan is to have also an AOT compiler in the future.

WebAssembly

It is also worth noting that people are already working on even faster alternatives to Asm.js. A small issue with Asm.js is that the assembly-like code is quite verbose, and with large programs, having to download and then parse large codebases can take some time. This problem could be solved by WebAssembly (wasm), a low-level assembly/binary format that directly represents the AST of a program and is designed to be smaller in size, faster to parse and faster to execute than JavaScript. All main browser vendors are working on supporting WebAssembly inside their script engines, and Emscripten should soon support WebAssembly as compilation target.

Insert coin…

This concludes my little exploration of the world of retro-gaming and the emulation of vintage arcade games. If you are interested, you can play with my version of Galaga online HERE.
This is a screenshot of the game in action, running on Microsoft Edge:

pic13

You can find the sources here:

15 thoughts on “Galaga: an Arcade machine emulator for Windows and HTML5

  1. Galaga is my *favorite* game — gonna have to figure out how to change the keys, however — I use ctrl+left/right to move between spaces on my mac, so firing and moving doesn’t do what I want 😦

    1. Hello, and thanks for your interest! 🙂
      It’s very easy to change the keys… just modify the onKeyDown and onKeyUp event handlers in file galaga.html.
      You can just save the files galaga.js and galaga.html and load the html page locally.

      function onKeyDown(event) {
      var key = event.keyCode || event.which;
      switch (key) {
      case 48: //”0″
      galaga.set_InsertCoin(true);
      break;
      case 49: //”1″
      galaga.set_Start1Player(true);
      break;
      case 50: //”2″
      galaga.set_Start2Player(true);
      break;
      case 37: //”Left”
      galaga.set_MoveLeft(true);
      break;
      case 39: //”Right”
      galaga.set_MoveRight(true);
      break;
      case 17: //”Control”
      galaga.set_Button1(true);
      break;
      default:
      break;
      }
      }

      function onKeyUp(event) {

      }

  2. Hi Paolo,

    Nice project, respect and certainly no waste of time and bandwidth 🙂

    Can we catch the Highscore through vars in php / html ?

    This would be a great feature.

    Cheers

    1. Hi James,

      Unfortunately getting highscore data would be very difficult to do. This is because the emulator works by running the original ROMs, so the original code of the game, and the logic to manage the highscore is hidden between thousands of Z80 assembly instructions.
      Yes, theoretically it could be possible to disassemble the ROMs, see where the highscore data is stored in memory, add hooks to the emulator to detect the high-score changes and notify the webpage, but it would be a lot of work…

      Thanks for your interest!
      Cheers,
      Paolo

  3. Whenever there is a new high score ,
    Galaga will copy it from screen memory at location 83ED – 83F2 to ram location 8A20 – 8A25
    (6 digits on reset RAM will contain ” 20000″ but stored as 00 00 00 00 02 24)

    It then initialises the initials stored at 8A3E – 8A41 (3 chars) to “AAA”
    and waits about 30 secs for the player to enter their initials.

    If the player enters their initials, then these are copied to 8A3E – 8A41 ,
    if nothing is entered Galaga again writes “AAA” to RAM locations 8A3E – 8A41 .

    Cheers 🙂

  4. Hi James,

    Wow! 🙂 Thanks for digging this info!!!

    This is a busy time at work for me, but as soon as I can I will add the high score to the machine scriptable interface and implement the high score feature!

    Thanks a lot,
    Paolo

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s