Dynamic Recompilation Part 7
We scare because we care
Introduction
In this part we’re going to tackle one of the trickier instructions, Load Word or LW.
We’re going to discuss one of the peculiarities of the early MIPS (and other RISC designs), the load delay-slot.
First we’ll outline how an interpreter might go about handling such subtleties by introducing additional state.
Then we’ll describe making similar changes to our recompiler, by creating additional state to help us to keep track of load delay-slots during recompilation.
We’ll discuss using the stack to store variables in order to deal with the load delay-slot and introduce the LEA x86–64 instruction.
Finally, we’ll discuss the pitfalls of our simple approach.
Recap
Last time we implemented the StoreWord or SW instruction.
void EmitSw(EmitterX64& emitter, R3051& processor, uint32_t opcode) {
// Store word
Label resume = emitter.NewLabel();
// Restore the program counter
CallInterpreterFunction(emitter, AddressOf(WritePC), processor, 0xBADC0FFE);
// Call store word
CallStoreWord(emitter, processor, opcode);
// Return from recompiled function in event of an exception
emitter.TestALImm8(1u);
emitter.Jne(resume);
emitter.MovR64R64(RSP, RBP);
emitter.PopR64(RBP);
emitter.Ret();
emitter.Bind(resume)
}
We introduced the concept of interpreter functions signalling success or failure by returning a boolean value and having the recompiler testing the return value in order to decide if we should return from the recompiled code.
Forever delayed
To begin our discussion we present the format of LW instruction, which like SW, is an I-Type instruction.
The challenge with implementing the LW instruction is that it introduces one of the most problematic elements of emulating the MIPS R3000 family, the load delay-slot.
As a consequence of the early MIPS pipeline design the effects of a load instruction are not immediately observable. There is a one instruction delay and attempting to read the affected register will return the old value. In effect, loads on the R3000 take two instructions to complete. The second instruction is said to occupy the load delay-slot.
As an example, in,
LW R2, (R3) ; R2 is 6, (R3) is 10
NOP ; Load delay-slot! R2 is still 6
ADDU R5, R2, R4 ; R2 is now 10
the loaded value isn’t visible to the programmer until the ADDU instruction.
The design intention was that clever compilers could utilise load delay-slots efficiently but, anecdotally at least, most inserted No-Operations (NOPs) into load delay-slots and chose to waste cycles instead.
Things aren’t as straightforward as they might seem because there are a number of complications we have to keep in mind while handling load delay-slots.
- Exceptions can occur in load delay-slots, and the delayed load has to be completed before the exception can be processed.
- Load instructions can occur in load delay-slots. In most cases the delayed load completes, but when two load instructions in succession target the same destination register, the original load never completes.
- Certain instructions, LWL and LWR for example, can access the loaded value early.
It is worth noting that later revisions, starting with the R4000, changed things so that loads were immediately visible to the programmer and removed this particular cognitive burden.
What a load
Let’s provide a function to allow the processor to retrieve values from the outside world similar StoreWord discussed in the previous part.
bool LoadWord(R3051* r3051, uint32_t virtualAddress, uint32_t* value) {
return true;
}
The function returns true if the operation completes without exception and places the result in the value argument, which is passed by pointer.
Extending the interpreter to support load delay-slots will require,
class R3051 {
private:
//...
bool loadDelaySlot;
bool loadDelaySlotNext;
uint32_t loadDelayRegister;
uint32_t loadDelayValue;
};
The loadDelaySlot variable tells the interpreter that the current instruction is in a load delay-slot. The loadDelaySlotNext variable tells the interpreter that the next instruction is going to be in a load delay-slot.
The interpreter will also need to remember which register was the target of LW instruction and the value that was retrieved from memory. These are stored in loadDelayRegister and loadDelayValue respectively.
A matter of interpretation
We can now modify the interpreter to manage load delay-slots.
void Interpret(R3051* r3051) {
uint32_t instruction = ReadNextInstruction(r3051);
ExecuteInstruction(r3051, instruction);
if (GetLoadDelaySlot(r3051)) {
const uint32_t r = GetLoadDelayRegister(r3051);
const uint32_t v = GetLoadDelayValue(r3051);
WriteRegister(r3051, r, v);
}
bool next = GetLoadDelaySlotNext(r3051);
SetLoadDelaySlot(r3051, next);
SetLoadDelaySlotNext(r3051, false);
}
Now, after executing an instruction, it checks to see if we completed a delay-slot instruction and if so completes the load operation. It then moves the current value of loadDelaySlotNext into loadDelaySlot and assumes the instruction after next isn’t going to be in a load delay slot.
An interpreted implementation of InterpretLW then might look something like
void InterpretLw(R3051* r3051, uint32_t opcode) {
const uint32_t rt = InstructionRt(opcode);
if (GetLoadDelaySlot(r3051)) {
const uint32_t dr = GetLoadDelayRegister(r3051);
const uint32_t dv = GetLoadDelayValue(r3051);
if (rt != dr) {
WriteRegister(r3051, dr, dv);
}
SetLoadDelaySlot(r3051, false);
}
const uint32_t base = ReadRegisterRs(r3051, opcode);
const uint32_t offset = InstructionImmediateExtended(opcode);
uint32_t value;
if (LoadWord(r3051, base + offset, &value)) {
SetLoadDelaySlotNext(r3051, true);
SetLoadDelayRegister(r3051, rt);
SetLoadDelayValue(r3051, value);
}
}
There are two points of interest here.
Firstly, after decoding the target register, the method checks to see if we’re currently in a delay-slot. If we are, then it retrieves the delayed target register and value, and if the current load instruction isn’t targeting the same register, it completes the delayed load (remember, when two consecutive load instructions target the same register, the first load never completes). As the delayed load is now complete, the method then clears the load delay-slot flag, thus preventing the interpreter loop from mistakenly completing the load we’re about to set up an instruction early.
Secondly, if the load completes without raising an exception as we would hope, rather than writing to the register immediately, the method stores the register number and loaded value into the interpreter state and tells the interpreter that the next instruction is going be in a load delay-slot.
No memory, no problems
With the necessary interpreter changes described we can now turn our attention to the recompiler. Our goal is to provide the EmitLw function, and it looks something like this,
void EmitLw(EmitterX64 &emitter, R3051 &processor, uint32_t opcode) {
// Load word
Label resume = emitter.NewLabel();
if (delay slot) {
??? Complete delayed load ???
}
// Restore the program counter
CallInterpreterFunction(emitter, AddressOf(WritePC), processor, 0xBADC0FFE);
// Call load word
??? CallLoadWord(emitter, processor, opcode); ???
// Return from recompiled function in event of an exception
emitter.TestALImm8(1u);
emitter.Jne(resume);
??? Synchronise up any load delay-slot specific state ???
emitter.MovR64R64(RSP, RBP);
emitter.PopR64(RBP);
emitter.Ret();
emitter.Bind(resume);
??? set next instruction as delayed ???
??? remember target register ???
??? remember loaded value ???
}
It’s similar to the EmitSW function implemented in the previous part because a memory access can cause an exception there’s the necessary code to exit the recompiled routine early.
However, there’ a bit more to work to be done. At the top of the function the recompiler needs to know if the current instruction is in a load delay-slot, and if so it needs to emit the necessary code to complete the delayed load. This is similar to the logic at the top of the InterpretLw function.
If an exception occurs as a result of calling LoadWord then whatever the recompiler knows about such things as load-delay slots needs to be written back to the interpreter, so that if required, the interpreter can continue normal execution.
Finally, after emitting all the necessary instructions, the recompiler needs to inform itself that the next instruction is going to be in a load delay-slot and which register was involved in the delayed load, much like the interpreter in InterpretLW.
This then strongly implies that our recompiler needs to be stateful.
The Memory Remains
In order to help the recompiler manage load delay-slots we introduce our RecompilerState object
class RecompilerState {
private:
uint32_t loadDelayRegister;
bool loadDelaySlot;
bool loadDelaySlotNext;
};
to help track if the instruction currently being recompiled is in a delay-slot, its target register, and if the next instruction is going to be a delay-slot instruction.
The load delayed value is not part of the RecompilerState since generally we do not know the value to be loaded at recompile time. The recompiled code may itself modify the value in memory during execution and the value we used during recompilation will then be wrong.
Note: There is of course an exception to this, and that is when the value is being loaded from ROM (Read Only Memory) but by making the assumption that the loaded value is not known at recompile time, we can use the same code for both cases.
The recompiler loop will also need changing to look something like
RecompilerState state;
for (const uint32_t opcode : opcodes) {
Emit(state, emitter, processor, opcode);
if (state.GetLoadDelaySlot()) {
??? complete delayed load ???
}
bool loadDelaySlotNext = state.GetLoadDelaySlotNext();
state.SetLoadDelaySlot(loadDelaySlotNext);
state.SetLoadDelaySlotNext(false);
}
This hopefully looks similar to the modified interpreter loop.
After emitting an instruction we check to see if we emitted an instruction that was in a delay-slot. If so, then the recompiler needs to emit code to complete the delayed load.
We also then update the recompiler’s delay-slot flags in a similar fashion that of the interpreter loop.
The code for EmitLW will now look something like
void EmitLw(RecompilerState &state, EmitterX64 &emitter, R3051 &processor, uint32_t opcode) {
// Load word
Label resume = emitter.NewLabel();
const uint32_t rt = InstructionRt(opcode);
if (state.GetLoadDelaySlot()) {
const uint32_t dr = state.GetLoadDelayRegister();
if (rt != dr) {
??? complete delayed load ???
}
state.SetLoadDelaySlot(false);
}
// Restore the program counter
CallInterpreterFunction(emitter, AddressOf(WritePC), processor, 0xBADC0FFE);
// Call load word
??? CallLoadWord(emitter, processor, opcode); ???
// Return from recompiled function in event of an exception
emitter.TestALImm8(1u);
emitter.Jne(resume);
??? Synchronise up any load delay-slot specific state ???
emitter.MovR64R64(RSP, RBP);
emitter.PopR64(RBP);
emitter.Ret();
emitter.Bind(resume);
// Update recompiler state
state.SetLoadDelaySlotNext(true);
state.SetLoadDelayRegister(rt);
}
with a check to see if the recompiler is emitting a load delay-slot instruction prior to calling LoadWord and the recompiler state being updated to indicate the next recompiled instruction is going to be in a load delay-slot.
In the next section we’ll discuss how to implement CallLoadWord by using the stack.
The House That Stack Built
Before any of this can be made to work, the recompiled code is going to have to retrieve the value to be loaded and store it somewhere for the duration of one instruction.
That somewhere is going to have to be in memory because if you recall signature of the LoadWord function
bool LoadWord(R3051* r3051, uint32_t virtualAddress, uint32_t* value);
the value read is returned by pointer.
This means that the recompiled code needs a 32-bit location in memory it can pass to this function as a pointer. There a number of places we could use, including but not limited to, the address of a R3051 member variable, an address in dedicated space allocated to the recompiled code or, our choice for this part, the stack.
Note: The choice to use the stack here is somewhat arbitrary. It was motivated by wanting to learn a bit more about the x64 ISA and since we have the base pointer RBP available it might be possible to generate shorter code.
Since the stack was discussed a long time ago in Part 2.5 let’s briefly recap what the function prologues and epilogues do.
The prologue pushes the caller’s RBP onto the stack and then moves the current RSP into RBP. Thereby establishing our method’s stack frame. At this point RSP and RBP have the same value.
The epilogue then reverses this operation by restoring the original value of RBP by popping it back off the stack.
As discussed in Part 2.5, if we want to store variables in the stack, then we have to reserve space by subtracting from RSP. The situation can be seen in the diagram below.
On the left-hand side we have the state of RSP and RBP after initially setting up the stack frame and on the right-hand side we have the state of RSP and RBP after subtracting 16 from RSP.
By subtracting 16 from RSP (remember the stack has to be 16-byte aligned) we’re reserving space for four 32-bit variables. We’re going to use the upper four bytes to store our load delayed values. The address of these four bytes is RBP - 4 and this is the address we want to pass to LoadWord.
In order to do this, our prologues and epilogues need to be as follows,
CodeBuffer buffer(1024);
// Prologue
EmitterX64 emitter(buffer);
emitter.PushR64(RBP);
emitter.MovR64R64(RBP, RSP);
emitter.SubR64Imm8(RSP, 16);
// Epilogue
emitter.AddR64Imm8(RSP, 16);
emitter.MovR64R64(RSP, RBP);
emitter.PopR64(RBP);
emitter.Ret();
subtracting and adding 16 to reserve and free space respectively.
The implementation of CallLoadWord (recalling CallStoreWord from the previous part) is as follows,
void CallLoadWord(
EmitterX64 &emitter,
R3051 &processor,
uint32_t opcode) {
const uint32_t rs = InstructionRs(opcode);
const uint32_t offset = InstructionImmediateExtended(opcode);
const size_t size = sizeof(uint32_t);
emitter.MovR64Imm64(RDI, AddressOf(processor));
emitter.MovR64Imm64(RSI, processor.RegisterAddress(0));
emitter.MovR32Disp8(RSI, RSI, static_cast<uint8_t>(rs * size));
emitter.AddR32Imm32(RSI, offset);
??? MOVE RBP minus 4 into RDX ???
emitter.Call(AddressOf(LoadWord));
}
with the outstanding task of putting RBP - 4 into RDX to pass as the 3rd parameter.
How do we do this?
Well we could do it by moving RBP into RDX and subtracting 4 from RDX, but there’s another way.
There’s an instruction called LEA which stands for Load Effective Address. What this instruction does is compute the effective address of a location in memory and instead of loading the value in memory at that address into the register, it simply loads the address, which is precisely what we want.
Consulting 1 gives us an implementation of LeaR64Disp8 that looks like
void EmitterX64::LeaR64Disp8(uint32_t reg, uint32_t rm, uint8_t disp8) {
const uint8_t rex = Rex(1u, reg >> 3u, 0, rm >> 3u);
const uint8_t mod = ModRM(1u, reg, rm);
buffer.Bytes({rex, 0x8D, mod, disp8});
}
and CallLoadWord looks like
void CallLoadWord(
EmitterX64 &emitter,
R3051 &processor,
uint32_t opcode) {
const uint32_t rs = InstructionRs(opcode);
const uint32_t offset = InstructionImmediateExtended(opcode);
const size_t size = sizeof(uint32_t);
emitter.MovR64Imm64(RDI, AddressOf(processor));
emitter.MovR64Imm64(RSI, processor.RegisterAddress(0));
emitter.MovR32Disp8(RSI, RSI, static_cast<uint8_t>(rs * size));
emitter.AddR32Imm32(RSI, offset);
emitter.LeaR64Disp8(RDX, RBP, static_cast<uint8_t>(-4));
emitter.Call(AddressOf(LoadWord));
}
placing RBP - 4 into RDX.
Our implementation of EmitLw, with the addition of the call to LoadWord, now looks like
void EmitLw(RecompilerState &state, EmitterX64 &emitter, R3051 &processor, uint32_t opcode) {
// Load word
Label resume = emitter.NewLabel();
const uint32_t rt = InstructionRt(opcode);
if (state.GetLoadDelaySlot()) {
const uint32_t dr = state.GetLoadDelayRegister();
if (rt != dr) {
??? complete delayed load ???
}
state.SetLoadDelaySlot(false);
}
// Restore the program counter
CallInterpreterFunction(emitter, AddressOf(WritePC), processor, 0xBADC0FFE);
// Call load word
CallLoadWord(emitter, processor, opcode);
// Return from recompiled function in event of an exception
emitter.TestALImm8(1u);
emitter.Jne(resume);
??? Synchronise up any load delay-slot specific state ???
emitter.MovR64R64(RSP, RBP);
emitter.PopR64(RBP);
emitter.Ret();
emitter.Bind(resume);
// Update recompiler state
state.SetLoadDelaySlotNext(true);
state.SetLoadDelayRegister(rt);
}
In the next section we’ll discuss how to complete delayed loads since this is required by both EmitLW and the recompiler loop.
The Persistence of Memory
In order to complete a delayed load we’re going to have to move the load delayed value from the stack back into the R3051 register array. These locations are both in memory but as 2 states, memory to memory moves aren’t possible with x64, so we’re going to have to go via a Host register.
Therefore, given a Guest register rt we want to write to we can achieve this in three steps
- Loading the load delayed value from the stack into RAX
- Loading the address of the register array into RCX
- We can then move RAX to an address pointed to by RCX and the 8-bit displacement of rt
We have all the methods in the emitter available to us, so we can propose a method of the form
void WriteGuestRegisterFromStack(EmitterX64 &emitter, R3051 &processor, uint32_t rt) {
const size_t size = sizeof(uint32_t);
emitter.MovR32Disp8(RAX, RBP, static_cast<uint8_t>(-4));
emitter.MovR64Imm64(RCX, processor.RegisterAddress(0));
emitter.MovDisp8R32(RCX, static_cast<uint8_t>(rt * size), RAX);
}
and with this the EmitLW method is nearly complete, with just the state synchronisation missing,
void EmitLw(RecompilerState &state, EmitterX64 &emitter, R3051 &processor, uint32_t opcode) {
// Load word
Label resume = emitter.NewLabel();
const uint32_t rt = InstructionRt(opcode);
if (state.GetLoadDelaySlot()) {
const uint32_t dr = state.GetLoadDelayRegister();
if (rt != dr) {
WriteGuestRegisterFromStack(emitter, processor, dr);
}
state.SetLoadDelaySlot(false);
}
// Restore the program counter
CallInterpreterFunction(emitter, AddressOf(WritePC), processor, 0xBADC0FFE);
// Call load word
CallLoadWord(emitter, processor, opcode);
// Return from recompiled function in event of an exception
emitter.TestALImm8(1u);
emitter.Jne(resume);
??? Synchronise up any load delay-slot specific state ???
emitter.MovR64R64(RSP, RBP);
emitter.PopR64(RBP);
emitter.Ret();
emitter.Bind(resume);
// Update recompiler state
state.SetLoadDelaySlotNext(true);
state.SetLoadDelayRegister(rt);
}
and the recompiler loop is done.
for (const uint32_t opcode : opcodes) {
Emit(state, emitter, processor, opcode);
if (state.GetLoadDelaySlot()) {
const uint32_t dr = state.GetLoadDelayRegister();
WriteGuestRegisterFromStack(emitter, processor, dr);
}
bool loadDelaySlotNext = state.GetLoadDelaySlotNext();
state.SetLoadDelaySlot(loadDelaySlotNext);
state.SetLoadDelaySlotNext(false);
}
All that remains is to discuss exception handling.
Synchronicity I
Up until now we’ve been diligently avoiding what to do in the event of an exception.
The reason is that it’s complicated.
In order to make our recompiled code smaller we haven’t been synchronising the load delay related state back to the R3051 after every instruction. The less the recompiled code does the faster it is.
However, whenever we exit from our recompiled code, we have to restore the related state. Not only do we have to do this during a normal exit, we have to do it for every single exception.
In our exploratory recompiler this means we have to write back four fields during an exit.
Three of the fields, the delay-slot target register, the delay-slot and the next delay-slot state can be written back using the CallInterpreterFunction developed in Part 3. There are potentially more efficient mechanisms, but those are left as an exercise for the reader.
In order to call the SetLoadDelayValue
SetLoadDelayValue(R3051*, uint32_t);
function, we’re going so need to be able to call an interpreter function with the second argument retrieved from the stack.
This can be achieved with,
void CallSetLoadDelayValue(EmitterX64 &emitter, R3051 &processor) {
emitter.MovR64Imm64(RDI, AddressOf(processor));
emitter.MovR32Disp8(RSI, RBP, static_cast<uint8_t>(-4));
emitter.Call(AddressOf(SetLoadDelayValue));
}
which moves the contents of RBP - 4 into RSI to pass it as the second argument to the underlying interpreter function.
The recompiled epilogue can then take the form
// Epilogue
CallSetLoadDelayValue(emitter, processor);
CallInterpreterFunction(emitter, AddressOf(SetLoadDelayRegister), processor, state.GetLoadDelayRegister());
CallInterpreterFunction(emitter, AddressOf(SetLoadDelaySlotNext), processor, state.GetLoadDelaySlotNext());
CallInterpreterFunction(emitter, AddressOf(SetLoadDelaySlot), processor, state.GetLoadDelaySlot());
emitter.AddR64Imm8(RSP, 16);
emitter.MovR64R64(RSP, RBP);
emitter.PopR64(RBP);
emitter.Ret();
which takes the values from the recompiler state and writes them back to the guest processor as one would hopefully expect.
The EmitLw function is a little different,
void EmitLw(RecompilerState &state, EmitterX64 &emitter, R3051 &processor, uint32_t opcode) {
// Load word
Label resume = emitter.NewLabel();
const uint32_t rt = InstructionRt(opcode);
if (state.GetLoadDelaySlot()) {
const uint32_t dr = state.GetLoadDelayRegister();
if (rt != dr) {
WriteGuestRegisterFromStack(emitter, processor, dr);
}
state.SetLoadDelaySlot(false);
}
// Restore the program counter
CallInterpreterFunction(emitter, AddressOf(WritePC), processor, 0xBADC0FFE);
// Call load word
CallLoadWord(emitter, processor, opcode);
// Return from recompiled function in event of an exception
emitter.TestALImm8(1u);
emitter.Jne(resume);
CallSetLoadDelayValue(emitter, processor);
CallInterpreterFunction(emitter, AddressOf(SetLoadDelayRegister), processor, 0u);
CallInterpreterFunction(emitter, AddressOf(SetLoadDelaySlotNext), processor, false);
CallInterpreterFunction(emitter, AddressOf(SetLoadDelaySlot), processor, false);
emitter.MovR64R64(RSP, RBP);
emitter.PopR64(RBP);
emitter.Ret();
emitter.Bind(resume);
// Update recompiler state
state.SetLoadDelaySlotNext(true);
state.SetLoadDelayRegister(rt);
}
because in the event of an exception we tell the interpreter that we’re no longer in a delay-slot since prior to calling LoadWord we completed the delayed load. Technically calling SetLoadDelayValue here is redundant since the recompiler’s value is now stale, so this could be removed.
Synchronicity II
The exception handling in EmitLW differs from the normal flow exception handling flow since when a LW instruction is issued in a load delay-slot it has to complete the delayed load before performing any interactions that may raise an exception.
For other instructions like ADDU that raise an exception the exception handling code should complete the pending load. In effect, something like
// Return from recompiled function in event of an exception
emitter.TestALImm8(1u);
emitter.Jne(resume);
if (state.GetLoadDelaySlot()) {
const uint32_t dr = state.GetLoadDelayRegister();
WriteGuestRegisterFromStack(emitter, processor, dr);
}
CallSetLoadDelayValue(emitter, processor);
CallInterpreterFunction(emitter, AddressOf(SetLoadDelayRegister), processor, 0u);
CallInterpreterFunction(emitter, AddressOf(SetLoadDelaySlotNext), processor, false);
CallInterpreterFunction(emitter, AddressOf(SetLoadDelaySlot), processor, false);
emitter.MovR64R64(RSP, RBP);
emitter.PopR64(RBP);
emitter.Ret();
emitter.Bind(resume);
should suffice.
This then means in the event of an exception, the recompiled leaves the interpreter in a state where it’s able to resume execution at the exception handler without a delayed load pending.
Now you have two problems
It’s worth pointing out some details here that are worth keeping in mind.
- This first exploratory approach into handling delay slots rather closely mimics what the interpreter was doing, and I think it’s worth discussing alternative approaches before proceeding to branch instructions.
- There is an unstated assumption here that the processor is not in a load delay-slot when being passed to the recompiler. As it stands this might not always be the case and this requires further thought.
- The current code assumes that the recompiler can exit after any kind of instruction and that normally isn’t the case. Normally recompilers exit after a branch, so we’re almost (but not completely, thanks branch delay-slots!) guaranteed not to exit the recompilation routine in a load-delay slot.
Conclusion
If you’ve survived all that then I offer my congratulations.
In this part we implemented the Load Word (LW) instruction and discussed one of the peculiarities of the early MIPS architecture, the load delay-slot.
We outlined how an interpreter could handle load delay slots and interactions with exceptions.
We adapted our recompiler to handle load delay-slots and used the host stack to store the value to be loaded across instructions.
We added the LEA instruction to the emitter so that our recompiler can easily interact with interpreter functions taking pointer arguments.
Finally, we covered the topic of synchronising recompiler and interpreter state only when needed to avoid performing extraneous work.
In the next part, we’ll briefly explore other approaches to handling load delay-slots.