Part II: Tomasulo and Branch Predictor

Assigned: September 28th, Due Date: Midnight (11:59 pm), November 19th


In this part, we are going to implement the Tomasulo algorithm and Branch Predictor for the pipelined processor which executes a reduced set of MIPS32 instructions as defined in Part I of the project. For simplicity, we ignore JR instruction for this part. Chapter 3 and Appendix A in the book are good resource for learning Tomasulo algorithm and Branch Predictor.

Description of pipeline


The pipeline has five stages.


(1) Instruction Fetch (IF): IF fetches one instruction per cycle and writes into instruction queue(IQ). IF uses Branch Predictor to predict what address to fetch by the end of IF stage of an instruction, no matter it’s a branch instruction or not.


(2) Issue: Get an instruction from the instruction queue per cycle and decode the instruction. Issue the instruction if there is an empty reservation station and an empty slot in the ROB; send the operands to the reservation station(RS) if they are available in either the register file or the ROB. Update the control entries to indicate the buffers are in use. The number of the ROB allocated for the result is also sent to the RS, so that the number can be used to tag the result when it’s placed on the CDB. If either all reservations are full or the ROB is full, then instruction issue is stalled until both have available entries. Note, for simplicity, for J instruction, compute the target address from instruction code, set the PC to the target address, and cancel the following instructions from instruction queue(IQ). Instruction Fetch fetches the instruction from the target address of J at the next cycle.


(3) Execute: If one or more of the operands is not yet available, monitor the CDB while waiting for the register to be computed. This step checks for RAW hazards. When both operands are available at RS, execute the operation. Instructions may take multiple clock cycles in this stage.


(4) Write Result: When the result is available, write it into the ROB through CDB, as well as to any RS waiting for this result. Mark the RS as available. For store instruction, if the value to be stored is available, it’s written into the value field of the ROB entry for the store. If the value to be stored is not available yet, the CDB must be monitored until that value is broadcast, at which time the value field of the ROB entry of the store is updated. We assume CDB has unlimited bandwidth, and there is no hardware hazard at this stage. Instruction spends one cycle at this stage.


(5) Commit: There are three different sequences of actions at commit depending on whether the committing instruction is a normal commit, a store, or a miss-predicted branch. The normal commit occurs when an instruction reaches the head of the ROB and its result is present in the buffer; at this point, the processor updates the register with the result and removes the instruction from the ROB. Committing a store is similar except that memory is updated rather than a result register. When commits a miss-predicted branch, the instruction queue, ROB, RS, load buffer are flushed and execution is restarted at the correct successor of the branch. If the branch was correctly predicted, the branch is finished. Upon an instruction commits, its entry in the ROB is reclaimed and the register or memory destination is updated. 


The pipeline has following function units.


(1) Instruction Queue(IQ): An unlimited buffer to hold instructions fetched by instruction fetch stage. The instruction stays in IQ until issue cycle.


(2) Reservation Station(RS):  Instructions enters RS after being issued and waits in the corresponding RS until the functional unit and all the source operands (via register file or CDB) are available. There are 7 entries in RS. You can refer to Figure 3.30 in the book for the detail of implementation.


Entry Type


Entry Count

Exec Cycles





(one for address calculation,

one for memory accessing)












Table 1 Reservation Station.


(3) Load Buffers(LB): Load instruction enters the load buffer, which has unlimited entries, after being issued and waits until the Write result cycle.


(4) Reorder Buffers(ROB): It contains 6 entries. All the issued instructions have to stay in ROB until Commit stage.


(5) Address Unit: It calculates one effective address per cycle for Load/Store instructions at the Execution stage.


(6) Adder: There are three Adders. All the instructions in RSs of Add type need one cycle at an Adder.


(7) Multiplier: There are two pipelined multipliers. All the instruction in RSs of Multiply type need five cycles at a Multiplier.


(8) Register Files: There are two types of register files, integer and floating point, each has 32 registers. Floating point instructions use floating point register file, while integer instructions use integer register file. We assume both register files have unlimited read/write ports, there will be no hardware hazard for register read/write.


For non-mentioned unit, we assume they have unlimited function units/entries/ports.


The following table clarifies the relationship between pipeline stages and pipeline function units. For example, an instruction enters Instruction Queue after Instruction Fetch cycle and leaves after finishing Issue cycle; a load instruction enters RS after Issue stage and leaves after finishing WR cycle; a load instruction enters ROB after Issue stage and leaves after Commit stage.



Instruction Fetch



Write Result




1 Address Unit

(1 inst/cycle)

MEM Unit

(1 inst/cycle)



3 Adders


2 Multipliers


Instruction Queue







RSs (Load entries)







RSs(Add entries)







RSs(Multiplier entries)







ROBs (non-load instruction)







ROBs (load instruction)








Description of Branch Prediction

You are required to implement a 1-bit branch predictor using an 8-entry Branch-Target Buffer (BTB). For details of the 1-bit branch predictor and BTB, please refer to page 198 and page 209 on the textbook. Please use "predict not taken 0" to initialize the branch predictors. Note that for simplicity, both predicted taken and not-taken branches are recorded in the BTB.

The BTB is accessed during the IF stage using the instruction address of the fetched instruction to index the buffer. Upon a hit, the predicted instruction address is known at the end of the IF stage. So if a matching entry is found in BTB, fetching begins immediately at the predicted PC. Note: since both predicted taken and not-taken are recorded in BTB, the target address of the predicted not-taken branch is meaningless (always equal to PC+4). A miss of the BTB is treated as a predicted not-taken branch.

Structure of 1-bit predictor:

We use the 3rd to the last to the 5th to the last bits of PC to index an 8-entry BTB. Note that the last two bits are always '0' with 4-byte instructions.


BTBs are all initialized with 0s.


For non-branch instructions, it always results a lookup miss for Branch Predictor, and PC is increased by 4. For branch instructions (BEQ, BNE, BGEZ, BGTZ, BLEZ, BLTZ), PC will be updated with the predicted target address if the address of the branch instruction hits in Branch Predictor, otherwise PC is increased by 4.


Lookup BTB at IF stage as following pseudo-code shows:

if (PC hit in BTB)


            if (Predict bit == 1)

                        PC = Predicted PC;

            else      PC = PC + 4;


else PC = PC+ 4;


A miss-predicted branch instruction (lookup miss/predict incorrectly) updates BTB at Commit stage as following pseudo-code shows:

if(PC found in BTB)

            Updates Predicted PC and Predictor Bit as outcomes;



            insert PC into BTB; (since it’s direct mapping, you don’t need FIFO replacement scheme)

            initialize the new entry as outcomes;




Command Line

Your simulator (MIPSsim) should provide the following options to users:

MIPSsim inputfilename outputfilename [-Tm:n]  

Output Format

The simulation output format is available here.

Sample Data


Correct handling of the sample code (with possible different data values) will be used to determine 60% of the credit awarded. The remaining 40% will be determined from other test cases that you will not have access prior to grading. A late submission with 48 hours (by the midnight, 11:59pm, of Nov. 21st.) will be deducted 20% no matter how many hours. Even it’s a group project, each of you still have to submit a copy individually through WebCT.


Please test your code in CISE Solaris machines before submission. Since we will be using “diff -wB” to check your output versus ours, follow the output formatting. TA's may or may not have time to debug in case of any mismatch. In other words, mismatches will be treated as wrong output.