The future is history

Introduction

In this part we’re going to discuss some potential alternative approaches for handling load delay-slots.

The discussion is going to be fairly light and not include any code changes to our experimental recompiler.

We’re going to explore how we might manipulate the instructions before recompiling, to make the recompiler’s job easier.

Recap

In the last part we discussed how to implement the LW instruction and how to handle load delay-slots inherent in the early MIPS architecture.

void EmitLw(RecompilerState &state, EmitterX64 &emitter, R3051 &processor, uint32_t opcode) {
    // Load word
    Label resume = emitter.NewLabel();
    const uint32_t rt = InstructionRt(opcode);
    if (state.GetLoadDelaySlot()) {
        const uint32_t dr = state.GetLoadDelayRegister();
        if (rt != dr) {
            WriteGuestRegisterFromStack(emitter, processor, dr);
        }
        state.SetLoadDelaySlot(false);
    }
    // Restore the program counter
    CallInterpreterFunction(emitter, AddressOf(WritePC), processor, 0xBADC0FFE);
    // Call load word
    CallLoadWord(emitter, processor, opcode);
    // Return from recompiled function in event of an exception
    emitter.TestALImm8(1u);
    emitter.Jne(resume);
    CallSetLoadDelayValue(emitter, processor);
    CallInterpreterFunction(emitter, AddressOf(SetLoadDelayRegister), processor, 0u);
    CallInterpreterFunction(emitter, AddressOf(SetLoadDelaySlotNext), processor, false);
    CallInterpreterFunction(emitter, AddressOf(SetLoadDelaySlot), processor, false);
    emitter.MovR64R64(RSP, RBP);
    emitter.PopR64(RBP);
    emitter.Ret();
    emitter.Bind(resume);
    // Update recompiler state
    state.SetLoadDelaySlotNext(true);
    state.SetLoadDelayRegister(rt);
}

We extended the Emitter to handle the LEA instruction and used that to interact with the local variables on the stack.

No Delay-Slot, No Problems!

The motivation here is to ask ourselves the question Do we really need that load delay-slot mess at all?

Consider the following case, which is a pretty common given that a lot of compilers didn’t attempt to put useful work into delay-slots,

LW R2, (R3)     ; load instruction
NOP             ; load delay-slot

Given that nothing interacts with R2 or R3 in the delay-slot our recompiler may as well just go ahead and complete the load immediately. It makes no difference if the delayed load is completed before or after the NOP instruction.

What about when the compiler did decide to put work into a load-delay slot? When the delay-slot instruction doesn’t use R2 or R3 then it doesn’t make any difference if the load is completed before or after the delay-slot instruction

LW R2, (R3)     ; load instruction
ADD R5, R6, R7  ; load delay-slot, don't care about the value of R2!

so we may as well go ahead and complete the load immediately prior to the ADD instruction.

I believe you have my stapler!

What about when R2 is used in the delay-slot instruction?

When R2 is used as an input

LW R2, (R3)     ; load instruction
ADD R5, R2, R7  ; load delay-slot, we want the old value of R2!

then we could achieve the same result by reordering the instructions and executing them out of order

ADD R5, R2, R7  ; Do the addition first, since we want the old value of R2
LW R2, (R3)     ; load R2 immediately, don't have a delay-slot.

This will work in most cases, but it’s possible that in the original order the LW raised an exception, and that the ADD was never executed, which is not the case in this reordered version.

Another way to solve this problem is to imagine that the MIPS architecture has an additional register R32. We can then keep the original order

MOV R32, R2      ; Store the current value of R2 in R32
LW R2, (R3)      ; load instruction, complete immediately
ADD R5, R32, R7  ; load delay-slot, we want the old value of use old value of R2 in R32

and achieve the same result.

What about when R2 is used as an output?

LW R2, (R3)     ; load instruction
ADD R2, R6, R7  ; load delay-slot, R2 is going to get immediately overwritten

when R2 is used as an output of the load delay-slot instruction it’s going to immediately get overwritten, so there’s no point performing the ADD instruction, so we may as well remove it from our recompiled code (see caveats below).

I asked for a Mai Tai, and they brought me a Piña Colada

What about R3?

When R3 is used as an input

LW R2, (R3)      ; load instruction
ADD R5, R3, R7   ; load delay-slot

then we can go ahead and complete the load immediately prior to the ADD.

But when R3 is the output then it depends on R2 being used as an input.

When R2 isn’t being used

LW R2, (R3)      ; load instruction
ADD R3, R6, R7   ; load delay-slot, overwriting load instruction input.

Then we can go ahead and complete the load immediately prior to the ADD instruction.

When R2 is being used as an input,

LW R2, (R3)      ; load instruction
ADD R3, R2, R7   ; load delay-slot, overwriting load instruction input and using old value of R2

we appeal once more to our imaginary extra register R32

MOV R32, R2       ; make a note of R2 in R32
LW R2, (R3)       ; load instruction, complete immediately
ADD R3, R32, R7   ; load delay slot, wants the old value of R2 

we can then convert our load instructions into ones that completes immediately.

Note: in this case even if reordering was desirable it’s not possible since doing the ADD first will change the value of R3 and change the load address.

It’s not a place you can get to by boat or train

Maybe it’s time to bring to bear a degree of realism to the point of the imaginary register R32.

The idea was in part motivated by Static single-assignment form1 or SSA for short. SSA is used by more advanced recompilers as their Intermediate Representation.

The way that SSA works is transform every assignment into a unique variable. Consider the following contrived example

x = x * x + 1
x = x + y + 2

then SSA, given x_0 and y_0 as initial values for x and y, will rewrite the example as follows

x_1 = x_0 * x_0 + 1
x_2 = x_1 + y_0 + 2

This representation makes certain code analysis operations easier, and it can be used in Register Allocation algorithms. It might be also able to make keeping track of delay-slots at code emission time unnecessary.

Let’s recast

LW R2, (R3)      ; load instruction
ADD R3, R2, R7   ; load delay-slot, using old value of R2

in SSA form (given R2_0, R3_0 and R7_0 for the initial values) this might look like

R2_1 = (R3_0)
R3_1 = R2_0 + R7_0

thereby giving us a delay-slot free intermediate representation of MIPS instructions.

It should be stressed this discussion is mostly theoretical and there are many details to worked out in an actual implementation. Hopefully it’s clear that if we were to turn our MIPS instructions into an intermediate form that can be manipulated prior to recompiling we can do much better than we can with our current approach.

Caveat emptor

In the above discussion we discussed the case of instructions in load delay-slots doing useless work and then removing or nullifying them

LW R2, (R3)     ; load instruction
ADD R2, R6, R7  ; load delay-slot, R2 is going to get immediately overwritten

There are cases however, where although the immediate result is discarded, there are side effects at work.

Consider

LW R2, (R4) ; initial load, discarded because the next instruction targets the same register
LW R2, (R5) ; cancels the previous load

In this example the first load instruction, from the R3051’s point of view is redundant. However, if the address pointed to R4 is mapped to the read buffer of a CD-ROM drive, then we might have needed the discarded load to service the read buffer. By nullifying the instruction entirely we might cause problems when interacting with the CD-ROM drive later.

The devil is in the detail

Finally, it’s probably worth discussing the Toxic Twins of the MIPS instruction set, LWL and LWR.

These two instructions can can access the value about to be loaded in the delay-slot. This is known as forwarding.

The typical usage might be

LWL R1, 0(R2)  ; load word left at R2
LWR R1, 3(R2)  ; delay-slot, load word right at R2 + 3
ADD R8, R1, R1 ; delay-slot
ADD R3, R1, R4

so that a non-aligned 32-bit word can be loaded into R1. The LWR in the load delay slot has access to the value loaded by LWL so that it can combine the values bitwise.

In our synthetic SSA form it might look something like

R1_1 = 0(R2_0)
R1_2 = (3(R2_0) & 0xFF) + (R1_1 & ~0xFF)
R8_1 = R1_0 + R1_0
R3_1 = R1_2 + R4_0

and would hopefully allow us to treat the delay-slots relatively cleanly.

Note: The bitmasks in this example are only illustrative. The actual values used depend on the address in question and the endianness of the system amongst other things.

Conclusion

In this part we considered an alternative approach to handling load-delay slots.

We considered instruction reordering and Static single-assignment form.

We also discussed some potential problems in nullifying instructions with side effects and talked briefly about LWL and LWR.

In the next and final part, we’ll deal with the last class of instruction, branches.

References