Skip to content
July 9, 2008 / Abe Pralle

Adventures in Cross Compilation

Twelve days ago I got Plasmacore running on the Wii.  The problem was that while our current game project (MadStone) looked great in general, there was a slightly noticeable – but noticeable – frame rate drop when a level first starts and dozens of blocks fall from the sky.

I did some profiling.  Turns out that:

  • We were getting 30 frames per second during regular gameplay, dipping to 15 during level start.
  • Frames were taking around 1.5 * 1/60th of a second to generate.  Since games often wait to switch in their newly drawn frame until during the one of the television’s 60 vertical blanks (pauses between screen redraws) per second, if you can draw frames in 0.99 1/60ths of a second then you get 60fps but if it takes you 1.01 1/60ths of a second then you get 30fps.
  • Of the “1.5” update interval, Slag code was taking about 2/3 of the time and actual Wii graphic drawing was taking up the other third.

I plugged some numbers in and realized that if I could make Slag go 50% faster, that should bring the frame time down to 1.0 or less – a worthy goal.  I then started a project to make an ETC-to-CPP cross compiler that reads in compiled Slag ETC bytecode and spits out C++ source code.

I estimated it would take me 3-4 days.  It ended up taking me 11, which just proves that the old maxim is correct and programmers should always multiply their best guess time estimates by 3 – or by pi, as I understand some do.

It’s been an epic battle.  My epic battle, let me tell you it.

Background (SlagVM design)

A bit of technical background first.

Let’s say your source code has a statement like this:

    Z = X + (Y*2)

The compiler ends up creating an expression tree to represent this line of code. The operators are the parent nodes and the operands are the child nodes, something like this:

    code = OpAssign( OpPlus( OpVar("x"), OpTimes(OpVar("y"),2) ) )

The simplest kind of instruction set architecture to run a line of code like that on is a stack-based system.  On such a computer, the expression above would require the following stack manipulation pseudocode:

    PUSH X
    PUSH Y
    PUSH 2
    MUL
    ADD
    POP Z

The Slag VM uses an idea called threaded code to run programs – in this case threaded means “woven”, not “asynchronous”.  With threaded code you make a function or object to carry out each of the basic operations your instruction set supports. Here’s what some example OOP pseudo-code for a stack-based, threaded code VM (note: I actually used C functions and function pointers, but this is simpler to show):

    class FnPush : Fn
      method init( source_var ):
      method run: vm.PUSH( source_var )

    class FnPop : Fn
      method init( dest_var ):
      method run: dest_var = vm.POP()

    class FnAdd : Fn
      method init:
      method run: vm.PUSH( vm.POP() + vm.POP() )

    class FnMul : Fn
      method init:
      method run: vm.PUSH( vm.POP() * vm.POP() )

Loading a program into the VM would programmatically produce a data structure similar in structure to the following literal code:

  class SomeMethod : MethodInfo
    PROPERTIES
      cmds = { FnPush(var_x), FnPush(var_y), FnPush(2), FnMul(),
               FnAdd(), FnPop(var_z) } : Fn[]
    METHODS
      method call:
        # To run this method you just have each function object do its thing:
        forEach (cmd in cmds) cmd.run

  endClass

Note: the Slag VM actually has three stacks.  There’s a data stack to store regular primitive values, a reference stack to store object references (so that they can all easily be found during garbage collection), and a call stack that tracks which methods were called.

The plan (cross-compiler design)

I decided to keep the design as easy as possible and integrate on top of the existing SlagVM architecture for the cross-compiled code. So it would still have a data stack and a reference stack that are “manually” manipulated – it’s just that C++ commands would be generated to carry out the operations originally specified in each method. With our running example the cross-compiler would produce this:

  void SomeMethod()
  {
    vm.PUSH( var_x )
    vm.PUSH( var_y )
    vm.PUSH( 2 )
    vm.PUSH( vm.POP() * vm.POP() )
    vm.PUSH( vm.POP() + vm.POP() )
    var_z = vm.POP()
  }

While the compiled code uses the code stack and the reference stack, it doesn’t use the call stack.  Calls to other methods are just actual C++ calls to other functions.  In addition, Slag try/throw/catch blocks are translated into direct C++ equivalents.

It’s not truly compiled code as we normally think of it – the issue of garbage collection is a BIG stumbling block on that road – but this hybrid code should be faster than the interpreted/threaded bytecode, right?  At least 50% faster, I hope?  Well…

Unexpected Solutions

There’s the school of thought that says “programming is design”, and I completely agree with that.  The idea is that you haven’t truly designed a program until you you’ve programmed it.  There’s been many a time that I’ve had a great idea formulated and then the moment I sit down to program it in I realize the fatal flow in my thinking.

That kind of disconnect between plan and implementation definitely happened with this project, though happily most of the reality checks were in the form of simpler solutions to problems I’d thought were more complex.

First off there’s the issue of the stack frame.  At a low level, functions are called by pushing the arguments onto a stack and then jumping to the code at the beginning of the function body.  Consequently the parameters to each function are on the top of the stack when a function is first called.  So each function notes the top of the stack when it’s first called and uses that as it’s base point of operations.  That’s all easy enough.  The harder part is that functions need to clean up after themselves when they’re finished by removing their parameters from the stack and then (in my case) pushing their return value on to the stack.

I was a little worried that a) I was going to be putting a lot of redundant framework code at the beginning of each generated function, and that b) in using C++’s own try/catch mechanism, it was going to be difficult to clean up the stack when throws happened.  However, I quickly realized that a simple local object was the answer to both problems.  When an function is abruptly terminated in C++, the destructors of any local objects are automatically called.  What I did was make a simple set of five StackFrame classes (or structs, actually) for the five broad categories of return value a Slag method can have: no return value, an object reference, a 64-bit primitive, a 32-bit primitive, or an N-bit (“other”) primitive. For example:

    struct StackFrameReturn32
    {
      stack frame vars;

      StackFrameReturn32( num_parameters, num_local_vars )
      {
        stack frame vars = ....;
      }

      ~StackFrameReturn32()
      {
        return value = vm.POP();
        clean up stack frame;
        vm.PUSH(return value);
      }
    };

At the beginning of each generated function I just declared a StackFrame object of the appropriate type:

    void SomeMethod()
    {
      StackFrameReturn32 frame(2,1);
      ...
    }

With that, no matter how the method ended (return,throw), the stack frame’s destructor still cleans up!

Secondly: the cross-compiled code has to be compiled in with the regular VM code since it still uses the VM’s type information and stacks, etc.  I was thinking for a while that I was gonna have to have some global #define that was used to say that either “set up the VM for general use” or “let’s call this external setup function since we’re including cross-compiled code”.  This time it was global objects to the rescue!  First off I set a global function pointer to zero in the general VM code:

  FnPtr auxiliary_setup_fn = 0;

Then, in the cross-compiled code I just throw in a little struct definition that resets the function pointer and make a global object of that struct:

    extern FnPtr auxiliary_setup_fn;
    void cross_compiled_setup_fn() { ... }
    struct HookInAuxSetup
    {
      HookInAuxSetup() { auxiliary_setup_fn = cross_compiled_setup_fn; }
    };
    HookInAuxSetup hybrid_code_enabler;

Via this mechanism the compiled code just hooks itself in to the system! If you don’t want the compiled code no more, just delete the contents of that .cpp file and it goes back to being a regular Slag VM!

Third: in my original code, hooking in native methods was a bit of a pain.  If the program loader saw that a method call was flagged as native, it would search through a lookup table of native implementations.  With the cross compiler, every method was going to be native and I needed a better alternative.  The answer is blindingly obvious in hindsight and simplifies everything: I just gave each MethodInfo data structure a function pointer variable.  When a program is first loading, the loader goes through each native function listed in the lookup table and sets the function pointer in the corresponding method.  Then, on every method call for regular interpreted code, the loader just checks the associated function pointer.  If there is one, it calls that instead of the normal routine.  Piece of cake.

Unexpected Problems

Half of the issues I experienced during implementation were complete surprises for me.  Most of them came from the fact that the generated C++ code is big – about 15 times bigger than the original Slag code.  My cross-compiler itself cross-compiles to 88,000 lines; MadStone to 250k lines.  This isn’t itself a problem since most every line is a single instruction whereas every line of a typical high-level program represents 6 to 12 assembly instructions.  In other words: it’s only huge in high-level language source code terms.  When it’s compiled it’s completely typical.

The first time I got a test program cross-compiling and then subsequently compiling I was thrilled.  Of course it didn’t work right off the bat – there was some kind of problem.  I started tracing through the code on the debugger.  Step… step… step… and all of a sudden, on one particular function call, the cursor jumped into the middle of another function instead of starting at the beginning.  WTF?!

For a while I figured that I had some kind of memory corruption that was messing up my jump pointers.  But then, as I was doing a binary search between “problem” areas and “no problem” areas, I noticed that a function on line 60,000 was reading correctly but a function on line 70,000 had problems.  Aha!  Turns out that Visual Studio won’t correctly step through source code that’s more than 65,535 lines long (the largest value of an unsigned 16-bit integer).  It’s actually running the commands correctly, but the displayed location is (line# mod 65536)!  Who knew?  So I added file-splitting functionality to my cross compiler.

The next shock came when I tried to compile the same code in Radix, which is the de facto IDE used for Wii development.  It compiled 4 of the files fine, but on the 5th it froze for a long time and then came back with an “Out of Memory” error.  WTF?!

The way the cross compiler works is it reads in a compiled ETC file (Slag’s bytecode) and then spits out C++ code.  The C++ code also contains a stripped-down version of the same ETC file – the method bodies have all been translated, but the VM still needs the meta-info about the types and the method signatures.

The 5th file contained the embedded ETC file as a literal char array of 186,000 bytes.  That was the only thing I could spot; I figured that array was just too big for Radix.  In fact, I was so confident I’d found the problem that instead of deleting most of that byte array and recompiling – which would have taken 10 seconds – I spent an hour or so banging my head against alternative embedding solutions.

Turns out: that wasn’t the problem at all.  It was actually a single big 4000-line generated function that hooked in all the native calls into the VM.  Nothing too weird about it, it was just too big!  So I split that function up too.

Lessons learned

The other half of the issues were mainly focused on two simple VM functions that I didn’t spend time properly testing.

In general my testing philosophy was: cross-compile some large Slag/Plasmacore programs and let’r rip.  If it’s working, they’ll work.  If it doesn’t, they won’t.

In the back of my mind I knew there was one tricky area with this: compounds.  Compounds in Slag are like structs in C; they’re just a packed collection of primitives that are passed around by value instead of by reference.  The problem is that most of the time, most of the compounds are composed of Int32s and Real64s – values that don’t suffer from tricky alignment problems and such.  So even if there’s some problems with compound-handling in my cross-compiler, it won’t necessarily show up right away.  Some examples:

Problem: MadStone won’t run at all on the PC.

Solution: oh that’s right – a compound consisting of an Int32, a Char, and an Int32 only strictly requires 10 bytes, but the VM aligns 2 byte values to 2 byte boundaries and 4 byte values to 4 byte values (since some processors require this) and it ends up being 12 bytes.  I just implement this same fix in my two compound-handling functions.

Problem: MadStone loads and runs 99% fine on the PC, but every time there’s an in-game earthquake there’s an error – and the font that displays the error message is all garbled.

Solution: compound creation was messed up – it was being left at the wrong place on the stack.  Most of the time that wrong place happened to contain the correct data that the program had just been manipulated.  I implement the fix in one of my two compound-handling functions.

Problem: MadStone runs 100% perfect on the PC.  It loads the title screen on the Wii and seems to be running just fine except the joystick controls no longer respond.

Solution: if the arguments for a compound are 12 bytes on the stack and a compound itself is also 12 bytes, when can you not just leave the data along and have it turn out correctly?  Answer: when there’s extra bytes of padding inside and you need code that runs on CPUs with low-high byte ordering as well as CPUs with high-low byte ordering.  That middle “Char” of the compound containing Int32, Char, Int64?  On one kind of CPU there’s the two data bytes followed by two padding bytes.  On the other kind of CPU there’s two padding bytes followed by two data bytes.  I implement the fix in one of my two compound-handling functions.

    The moral of the story is that I should have sat down and constructed some good, robust tests for commands I knew weren’t being given a good workout in my test apps.  It would have taken about 20 minutes and I’m quite sure it would have shaved off 1-2 days from my total time spent!

    Results

    After I got my “Hello World” program cross-compiling, I then compiled the cross compiler itself, which was supposed to print out how many seconds it took to process the file.  At that point there was a 32-bit/64 bit stack manipulation error that was causing Char array values to be interpreted all crazy.  This was the output:

    It wasn’t completely working, but I was still elated.  Wooot!

    I then did some time tests – both on a simple counting loop and on the cross compiler itself as a typical app.  Both tests came back with a speed improvement of… 30%.

    I was fairly disappointed.  I needed 50% or better to break the barrier into 60fps.

    I took a break.  I thought about how I needed to take the cross-compiler to the next level.  And then I realized there’s one more thing I should check.  I hadn’t turned on C++ optimizations!

    I set Visual Studio to /O2 (optimize for speed) and recompiled.  It took a lot longer to compile.  And… now all programs are getting a speed boost of 75-90%!!

    MadStone is working all nice now and running at a rock-steady 60 fps at the beginning of the level – the critical section.  It’ll slow down to 30 fps once in a while when there’s an earthquake happening, but you don’t even notice if you’re not looking at the debug FPS counter I added in!!

    Advertisements

    2 Comments

    Leave a Comment
    1. Jacob / Jul 9 2008 8:06 pm

      Holy Cow! Epic! Your strategy for cross compilation is totally different than I though it would be. I assumed the code would be like the original Slag statements, but in C. Just like you, I wondered if GC would be a nightmare, so I’m glad to see there is another way.

      I’m going to have to read this again in the morning when I’m fresh. My 8pm brain isn’t quite up to the challenge of this technical genius!

    2. Jacob / Jul 9 2008 8:08 pm

      PS: Have you ever thought about teaching a class covering this sort of heavy-duty material? Talk about learning from experience!

    Leave a Reply

    Fill in your details below or click an icon to log in:

    WordPress.com Logo

    You are commenting using your WordPress.com account. Log Out / Change )

    Twitter picture

    You are commenting using your Twitter account. Log Out / Change )

    Facebook photo

    You are commenting using your Facebook account. Log Out / Change )

    Google+ photo

    You are commenting using your Google+ account. Log Out / Change )

    Connecting to %s

    %d bloggers like this: