June 21, 2008 / Abe Pralle

Wii don’t like exceptions

You may have heard me say that threads are the Devil himself.  Well, I stand by it – and present more evidence.

Yesterday I added the last missing feature to Wii Plasmacore: MIDI music playing.  Everything seemed to go smoothly – music was playing and everything – until the Wii randomly froze.

We looked at each other: “Shit!” we said at the same time (this whole phrase is rhetorical and meant to impart the mood and dramatic flair of a cop-buddy action-drama-mystery movie).

Collecting evidence

The evidence: I had started the program to hear the music volume.  I had not given any input and in fact the Wiimote was entirely switched off.  It just froze after a few minutes.

This was good in a way, because any input exponentially increases the possible sources of error.  Almost all bugs will recur predictably if you can repeat the exact same input – and timing issues aside, the biggest problem there is we usually don’t remember exactly what input we’ve just been giving when we first see a bug.  Having no input at all made it cake.

I started the program up again and let it run… and run… and run.  With no errors and no freezing.

We looked at each other again: “Damn!”

This meant the error was one of two nasty kinds: either garbage memory was being accessed or threads were somehow involved.  As I had just added a required music system callback function that tells the MIDI sequencer to update itself, the latter seemed likely.  I just didn’t see how – the callback function didn’t interact with any of my own code at all; it just made system calls.

I ran the program many more times (starting it several times in an hour, or sometimes letting it sit for a couple of hours while I did errands).  I finally got it to crash in debug mode.  This ended up not telling me much, but I did find out that the problem was happening just after the Slag virtual machine returned from executing Slag code.

I thought I’d better rule out garbage memory, just in case.

Testing for garbage memory

Garbage memory problems happen when you forget to initialize variables in C/C++ and just start using them.  A lot of times a variable will happen to hold an acceptable value and your program works fine.  But then you run some other program which changes that memory location, you run your program again, and it crashes.  It’s still deterministic; it’s just that the variables now include all inputs to all programs you’ve run on your computer since you booted.

Ah, how well I remember my first serious garbage memory encounter.  My program would sometimes crash on startup, sometimes not.  I wasn’t as good at debugging in those days, so it took me a while to find.  Turns out I was loading a custom font and expecting a zero byte at the end to tell me there were no more letters to define.  My data file didn’t actually have a zero byte at the end, so after I loaded the file into memory it was a crap-shoot as to whether that terminating byte would be zero or non-zero.
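As a rough illustration of that font bug, here is a minimal sketch of a loader loop that stops at the zero terminator or at the end of the loaded data, whichever comes first. The one-byte-per-glyph format and the name `count_glyphs` are assumptions made up for this example, not the original font code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical font format: one byte per glyph code, ended by a zero byte.
// Counts glyph entries without ever reading past the end of the loaded data,
// so a missing terminator can no longer pull in whatever follows the buffer.
std::size_t count_glyphs(const std::vector<std::uint8_t>& file_bytes)
{
  std::size_t count = 0;
  // Stop at the zero terminator OR at the end of the buffer.
  while (count < file_bytes.size() && file_bytes[count] != 0)
  {
    ++count;
  }
  return count;
}
```

With the bounds check in place, a data file that lacks the final zero byte gives the same answer on every run instead of depending on leftover memory.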

Since Wii Plasmacore had crashed just after the VM finished executing byte code – and the VM is full of instances where function pointers are fetched from memory and then those functions are called – I thought the problem might be that the VM was popping one too many return values off its stack or some such.  Normally the VM stack data areas aren’t initialized to any particular values because the first use puts new data in there.  But perhaps I was accidentally reading in the wrong spot and picking up a garbage value.

In debugging it’s really important to be able to reliably reproduce the error so that you can tell right away when you fix the problem and be certain about it.  I certainly couldn’t just touch up some code that looked suspicious, run it once, and say that I was done because no errors had happened.  So before I could fix the bug, I needed to make it crash every time.

I modified the VM to allocate 8 more bytes than necessary for its stack space.  I went through and specifically filled all the stack locations with the repeating pattern 0x01234567.  This nonsensical but identifiable value would be easy to spot in the debugger if it reliably caused a crash. Then I set the stack pointers to point to +4 bytes inside the allocated memory so that erroneous accesses on either side would pick up the “guard value”.
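The guard-value setup can be sketched roughly as follows. This is an illustration under assumptions, not the real Slag VM code: `GuardedStack`, `kGuardValue`, and `guards_intact` are made-up names, and the stack is modeled as an array of 32-bit slots.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The identifiable fill pattern described above.
constexpr std::uint32_t kGuardValue = 0x01234567u;

// Hypothetical stand-in for the VM's stack storage.
struct GuardedStack
{
  std::vector<std::uint32_t> storage;  // requested slots plus one guard word per side
  std::uint32_t* base;                 // first usable slot, 4 bytes into the allocation

  explicit GuardedStack(std::size_t slots)
    : storage(slots + 2, kGuardValue), // 8 extra bytes; every word starts as the pattern
      base(storage.data() + 1)
  {
  }

  // Copying would leave 'base' pointing into the other object's storage.
  GuardedStack(const GuardedStack&) = delete;
  GuardedStack& operator=(const GuardedStack&) = delete;

  // True while neither guard word has been overwritten.
  bool guards_intact() const
  {
    return storage.front() == kGuardValue && storage.back() == kGuardValue;
  }
};
```

A read one slot below `base` returns 0x01234567, which is easy to recognize in a debugger, and a write one slot past the end flips `guards_intact()` to false.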

I let the program run… no crashy.  It ran fine most times – but then oh! There’s a lockup.  My garbage memory test had turned up nothing; the program was working no worse than before.

We looked at each other a final time: “Threads!”

The Devil himself

Well, if a threading issue had crashed the program just after the VM finished executing code, maybe I could force a crash with a concentrated dose of the same circumstances.  Normally the two Slag functions that get called in a Plasmacore program are “update” and “redraw”, both nominally 60 times per second.  That’s not very fast in computer land.

I made an empty method in my Slag program called “dummy”.  It accepted no parameters and returned no values.  In my native code I then called dummy() a thousand times in a row for every time I called update(). So that’s 60,000 calls per second.  Something should happen if it’s gonna happen.
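In outline, the stress harness looks something like the sketch below. The `Vm` struct and its `call_dummy`/`call_update` members are hypothetical stand-ins that only count calls; the real versions would enter and leave the Slag VM.

```cpp
#include <cstddef>

// Hypothetical stand-in for the native-to-VM interface; it only counts calls.
struct Vm
{
  std::size_t dummy_calls = 0;
  std::size_t update_calls = 0;

  void call_dummy()  { ++dummy_calls; }   // the real one would run the empty Slag method
  void call_update() { ++update_calls; }  // the real one would run the Slag update()
};

constexpr int kDummyCallsPerUpdate = 1000;

// One frame of the stress test: a thousand trips in and out of the VM,
// followed by the usual update().
void stress_update(Vm& vm)
{
  for (int i = 0; i < kDummyCallsPerUpdate; ++i)
  {
    vm.call_dummy();
  }
  vm.call_update();
}
```

Calling `stress_update` 60 times, one second's worth of frames, adds up to the 60,000 dummy calls mentioned above.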

It worked!  My program crashed near-instantaneously!

I had now worked my way into debugger heaven: I had a short and simple path of code that was guaranteed to crash the program!

After looking it over, there was nothing obviously wrong… but I picked out a black sheep among the commands: a try…catch exception handler.  I tried removing it in favor of a flag-based solution and… problem solved!  I’d fixed it!

Aftermath

So what was really going on?  Right now I’m not inclined to dig deeper than “shun try/catch on the Wii”, but I’ll speculate.

When you say “try” in C++, the implementation (at least in some compilers) sets a hook of some sort so that any exceptions that are unwinding the call stack know where to stop (this is different from Java and Slag, incidentally).  After control leaves a try/catch block, the exception handler hook is removed.  When an actual exception is thrown, some kind of helper routine travels back up the call stack, calling destructors on any local objects and looking for a “catch” to terminate at (this is the unwinding).
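The unwinding order is standard C++ behavior and easy to observe on a desktop compiler. The sketch below uses made-up names (`Local`, `unwind_log`, `run`) and says nothing about how the Wii toolchain implements it internally; it only records what happens between the throw and the catch.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Records the order in which things happen during unwinding.
std::vector<std::string> unwind_log;

// A local object whose destructor logs its own name.
struct Local
{
  std::string name;
  explicit Local(std::string n) : name(std::move(n)) {}
  ~Local() { unwind_log.push_back("~" + name); }
};

void inner()
{
  Local guard("inner");
  throw std::runtime_error("halt");  // unwinding starts here
}

void outer()
{
  Local guard("outer");
  inner();
}

void run()
{
  unwind_log.clear();
  try
  {
    outer();
  }
  catch (const std::runtime_error&)
  {
    unwind_log.push_back("caught");  // runs only after both destructors
  }
}
```

After `run()`, the log holds `~inner`, `~outer`, `caught`, in that order: the locals in each frame are destroyed innermost first before the handler body executes.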

It seems like there must be a bug with the Wii implementation of try/catch – something that corrupts either the try/catch hook or the actual process of unwinding the stack.

My original Plasmacore code threw regular exceptions to halt the VM and return control to native code.  More or less:

  try
  {
    // A Halt exception thrown by an instruction is the only way out of the loop
    for( ; ; ) execute_next_instruction();
  }
  catch (Halt& err) { }

My new code uses a flag instead.  Effectively:

  while (keep_running) execute_next_instruction();

The downside to the second approach is that it adds an overhead of at least one more assembly instruction per VM instruction carried out.  The upside, of course, is that my Wii Plasmacore games don’t lock up!
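To make the comparison concrete, here is a toy interpreter that supports both ways of stopping. It is a sketch under assumptions: `MiniVm`, the two-opcode instruction set, and every member name are invented for illustration and have nothing to do with the real Slag VM.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

enum class Op { Increment, Stop };

struct Halt {};  // exception type used by the throwing variant

// Toy interpreter, invented for illustration only.
struct MiniVm
{
  std::vector<Op> code;
  std::size_t pc = 0;
  int counter = 0;
  bool keep_running = true;

  explicit MiniVm(std::vector<Op> program) : code(std::move(program)) {}

  // Variant 1: the Stop instruction throws to get back to native code.
  void step_throwing()
  {
    switch (code[pc++])
    {
      case Op::Increment: ++counter; break;
      case Op::Stop:      throw Halt{};
    }
  }

  void run_with_exception()
  {
    try
    {
      for (;;) step_throwing();
    }
    catch (Halt&) { }
  }

  // Variant 2: the Stop instruction clears a flag that the loop tests
  // once per instruction; that test is the extra per-instruction cost.
  void step_flagged()
  {
    switch (code[pc++])
    {
      case Op::Increment: ++counter; break;
      case Op::Stop:      keep_running = false; break;
    }
  }

  void run_with_flag()
  {
    keep_running = true;
    while (keep_running) step_flagged();
  }
};
```

Both variants leave the VM in the same state for the same program, provided the program ends with a `Stop`; the only difference is how control gets back to native code.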

Addendum ( June 30, 2008 )

After finally running some further tests, I can confirm that disabling system interrupts for the duration of a throw..catch avoids the problem.  In other words, you disable interrupts just before a throw and you re-enable interrupts as the first line of a catch.
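The shape of that workaround is sketched below. `disable_interrupts` and `enable_interrupts` are hypothetical stubs that only flip a flag so the example runs anywhere; on the console they would be whatever calls the platform SDK provides, which this sketch does not attempt to name.

```cpp
// Hypothetical stubs standing in for the platform's real interrupt control.
bool interrupts_enabled = true;
int halts_caught = 0;

void disable_interrupts() { interrupts_enabled = false; }
void enable_interrupts()  { interrupts_enabled = true; }

struct Halt {};

// Stops the VM: interrupts go off just before the throw.
[[noreturn]] void halt_vm()
{
  disable_interrupts();
  throw Halt{};
}

// Pretends to execute instructions, then halts after the given count.
void run_until_halt(unsigned instructions_before_halt)
{
  try
  {
    for (unsigned executed = 0; ; ++executed)
    {
      if (executed == instructions_before_halt) halt_vm();
    }
  }
  catch (Halt&)
  {
    enable_interrupts();  // first line of the catch
    ++halts_caught;
  }
}
```

The effect is that no interrupt handler can run while the exception is in flight between the throw and the catch.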
