Asm | 0dmg

Debug Mode Performance

A while ago, I added some code to colour the terminal output red if the emulator was running slower than real-time. I found that this was always triggered in debug mode. However, I thought it was a bug in how I was sleeping to syncronize time, because the emulator time was still within ~20% of real time, which seemed pretty close for a coincidence.

It wasn’t a coincidence. The emulator was running a bit slower than real-time, when compiled in debug mode.

The source was obvious: I’m formatting string representations of every instruction as I run them, which is always more expensive than the instruction itself. (These are stripped out in release mode, so it is able to run much faster than real-time.) This is wasteful, because I’m now only actually logging ~100 instructions out of every million that we execute.

These formatted strings are currently returned by the instruction implementation functions themselves.

    |_13, gb| {
        let de0 = gb.de();
        let de1 = de0.wrapping_add(1);
        gb.set_de(de1);
        op_execution!{
            cycles: 2;
            asm: "INC DE";
            trace: "DE₀ = ${:04x}, DE₁ = ${:04x}", de0, de1;
        }
    },

The simplest solution to the performance problem would be for these functions to only produce these strings when they know they’ll be used, perhaps by checking a gb.instructionLoggingActive() flag. This would require some changes. The logs are currently written to a circular buffer, which is read retroactively when we decide we want to display a log message. It would be easy to accomidate our periodic log samples – saying “log the next 100 instructions” works as well as “log the last 100 instructions” for that purpose. However, we would also like to be able to log recent instructions when we crash, and that requires the circular buffer of every instruction.

Assembling

This has me mulling over the way our CPU currently works.

It would probably be more efficient if it were working on a structured representation of the instruction stream, and returning some type of structured log data, instead of decoding machine code and formatting strings at every step.

Maybe an incremental/lazy disassembler that runs on insturctions before the CPU sees them would be worthwhile.

This could get much more complicated if we have to deal with RAM and swappable ROM, instead of just fixed ROM, but the only place we’re currently hitting that is unmapping the boot ROM, which we may be able to treat as a small special case.

It might be worth trying this as an independent experiment in a branch.

Come up with some toy examples.

/// An operation and its operands
enum Operation {
  LOAD(OneByteRegister, OneByteRegister),
}

// possible exits for an operation, for control flow analysis
enum Exits {
  NextInstruction,
  RelativeJump(i8),
  ConditionalRelativeJump(i8),
  AbsoluteJump(u16),
  ConditionalAbsoluteJump(u16),
  Operation(Box<Operation>),
}

We could define known entry points (source and destination) for each chunk of non-branching code…

let program: Vec<Operation> = vec![
  LOAD_IMMEDIATE(A, 0xF0),
  LOAD_IMMEDIATE(B, 0xF0),
  LOAD_IMMEDIATE(C, 0xF0),
  LOAD_IMMEDIATE(D, 0xF0),
  DEC(A),
  JUMP_REL_IF_A_NOT_ZERO(-1),
];

These could tell us where to put labels in the disassembled code.

0x0000_BOOT:
  LD A, 0xF0;
  LD B, 0xF0;
  LD C, 0xF0;
  LD D, 0xF0;
0x0004_FROM_0x0003:
0x0004_FROM_0x0005:
  DEC(A);
  JUMP NZ, 0x0004_FROM_0x0005;