Homework 2 for CS 222
Due April 14, 2009

1. Consider the following MIPS assembly code.

           LD    R1, 45(R2)
           DADD  R7, R1, R5
           DSUB  R8, R1, R6
           DADD  R9, R5, R1
           BNEZ  R7, target
           DADD  R10, R8, R5
           DSUB  R2, R3, R4

   a) Identify each dependence by type; list the two instructions involved; identify which instruction is dependent; and, if there is one, name the storage location involved.
   b) Use information about the MIPS five-stage pipeline from Appendix A and assume a register file that writes in the first half of the clock cycle and reads in the second half-cycle ("forwarding" through the register file). Which of the dependences that you found in part (a) become hazards and which do not? Why?

2. Construct a version of the table that we had in class for the 1/1 predictor, assuming the 1-bit predictors are initialized to NT, the correlation bit is initialized to T, and the value of d (leftmost column of the table) alternates 0, 1, 2, 0, 1, 2. Also, note and count the number of instances of misprediction.

3. Increasing the size of a branch-prediction buffer means that it is less likely that two branches in a program will share the same predictor. A single predictor predicting a single branch instruction is generally more accurate than is the same predictor serving more than one branch instruction.

   a) List a sequence of branch taken and not taken actions to show a simple example of 1-bit predictor sharing that reduces the misprediction rate.
   b) List a sequence of branch taken and not taken actions to show a simple example of 1-bit predictor sharing that increases the misprediction rate.
   c) Discuss why the sharing of branch predictors can be expected to increase mispredictions for the long instruction execution sequences of actual programs.

4. Consider the following loop.

   bar:    L.D    F2, 0(R1)
           MUL.D  F4, F2, F0
           L.D    F6, 0(R2)
           ADD.D  F6, F4, F6
           S.D    F6, 0(R2)
           ADDI   R1, R1, #8
           ADDI   R2, R2, #8
           ADDI   R3, R3, #-8
           BNEZ   R3, bar

   a) Assume a single-issue pipeline.
   Show how the loop would look both unscheduled by the compiler and after compiler scheduling for both floating-point operation and branch delays, including any stall or idle clock cycles. What is the execution time per iteration of the result, unscheduled and scheduled? How much faster must the clock be for processor hardware alone to match the performance improvement achieved by the scheduling compiler? (Neglect the effects of a higher processor clock speed on memory system performance, i.e., the possible increase in the number of cycles necessary for memory system access.)
   b) Assume a single-issue pipeline. Unroll the loop as many times as necessary to schedule it without any stalls, collapsing the loop overhead instructions. How many times must the loop be unrolled? Show the instruction schedule. What is the execution time per element of the result? What is the major contribution to the reduction in time per iteration? (Use the latencies on page 75 for FP ALU ops.)
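
Note on problems 2 and 3: hand-built taken/not-taken sequences can be checked with a small simulator. The following is only a minimal Python sketch and is not part of the assignment. The function name `count_mispredictions`, the trace format, and the sample trace are assumptions made for illustration. It models a plain 1-bit predictor only (no correlation bit), with every predictor initialized to not taken.

```python
def count_mispredictions(trace, shared, initial_taken=False):
    """Count mispredictions of 1-bit predictors over a branch trace.

    trace         : list of (branch_id, taken) pairs in execution order
    shared        : if True, all branches use one predictor entry;
                    if False, each branch_id gets its own entry
    initial_taken : initial prediction of every entry (False = NT)
    """
    table = {}          # predictor entry -> last outcome seen
    mispredictions = 0
    for branch_id, taken in trace:
        entry = 0 if shared else branch_id
        prediction = table.get(entry, initial_taken)
        if prediction != taken:
            mispredictions += 1
        # A 1-bit predictor simply remembers the most recent outcome.
        table[entry] = taken
    return mispredictions


# Hypothetical trace: branch "A" is always taken, branch "B" never is,
# and the two alternate in execution order.
trace = [("A", True), ("B", False)] * 2

private_misses = count_mispredictions(trace, shared=False)
shared_misses = count_mispredictions(trace, shared=True)
print("private predictors:", private_misses)
print("shared predictor:  ", shared_misses)
```

Replacing `trace` with your own sequence lets you compare the private and shared counts for any candidate answer. Extending the sketch to the 1/1 predictor of problem 2 would require a second table indexed by the correlation bit, which is left out here.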