Complex Pipelining

MBT
Complex Pipelining: Motivation

Pipelining becomes complex when we want high performance in the presence of

- Long latency or partially pipelined floating-point units
- Multiple function and memory units
- Memory systems with variable access time

Adapted from Arvind and Asanovic’s MIT Course 6.823
Complex Pipeline Control Issues

- Structural conflicts at the write-back stage due to variable latencies of different function units
- Structural conflicts at the execution stage if some FPU or memory unit is not pipelined and takes more than one cycle
- Out-of-order write hazards due to variable latencies of different function units
- How to handle exceptions?
Complex In-Order Pipeline

- Delay writeback so all operations have same latency to W stage
  - Write ports never oversubscribed (one inst. in & one inst. out every cycle)

How to prevent increased writeback latency from slowing down single cycle integer operations?

Bypassing
Complex In-Order Pipeline

How should we handle data hazards for very long latency operations?

- Stall pipeline on long latency operations, e.g., divides, cache misses
- Exceptions handled in program order at commit point

Adapted from Arvind and Asanovic’s MIT Course 6.823
Superscalar In-Order Pipeline

- Fetch two instructions per cycle; issue both simultaneously if one is integer/memory and other is floating-point
- Inexpensive way of increasing throughput, examples include Alpha 21064 (1992) & MIPS R5000 series (1996)
- Same idea can be extended to wider issue by duplicating functional units (e.g. 4-issue UltraSPARC) but register file ports and bypassing costs grow quickly

Adapted from Arvind and Asanovic's MIT Course 6.823
Modern High-Frequency In-Order Pipeline
- with deferred commit & precise exceptions

- non-stalling pipeline (ala UltraSparc 3)
  → no stall signal propagates across chip, only stall/bubbles at dispatch stage
  → if there is a stall/exception condition, we just kill the instruction and all following instructions when they arrive at the commit point and start re-fetching
  → need to defer stores to dcache to commit point
  → much cleaner approach than shown in book
Modern In-Order Pipeline – Long Occupancy Instructions

- today, mostly only matters for divides/sqrt
- for MIPS-style HI/LO registers [which do not cause exceptions (ala MIPS MUL/DIV)]
  → a signal tells the unit to squash its result if it followed killed instructions
  → unit runs in parallel
- for register-to-register operations [ala FDIV]; or non-blocking loads:
  - a) do not issue successive instructions until
    P cycles before we find out about whether instruction causes an exception, where P
    is distance to commit → we can often detect exceptions earlier than the operand is available
  - b) Once we are sure they will not cause exceptions, we can perform out-of-order completion:
    issue any instructions that do not read/write to the dest register (prevent RAW / WAW) and
defer the write-back of the output operand (steal a cycle from the register file when it is free)
until it is ready. If there is an interrupt, the exception handler’s read to the register will stall
until the operand is ready.
Dependence Analysis
Types of Data Hazards

Consider executing a sequence of

\[ r_k \leftarrow (r_i) \text{ op } (r_j) \]

type of instructions

Data-dependence

\[ r_3 \leftarrow (r_1) \text{ op } (r_2) \]
\[ r_5 \leftarrow (r_3) \text{ op } (r_4) \]
Read-after-Write (RAW) hazard

Anti-dependence

\[ r_3 \leftarrow (r_1) \text{ op } (r_2) \]
\[ r_1 \leftarrow (r_4) \text{ op } (r_5) \]
Write-after-Read (WAR) hazard

Output-dependence

\[ r_3 \leftarrow (r_1) \text{ op } (r_2) \]
\[ r_3 \leftarrow (r_6) \text{ op } (r_7) \]
Write-after-Write (WAW) hazard

Adapted from Arvind and Asanovic’s MIT Course 6.823
Detecting Data Hazards

**Range and Domain of instruction i**

- $R(i) =$ Registers (or other storage) modified by instruction $i$
- $D(i) =$ Registers (or other storage) read by instruction $i$

Suppose instruction $j$ follows instruction $i$ in the program order. Executing instruction $j$ before the effect of instruction $i$ has taken place can cause a

- **RAW hazard if** $R(i) \cap D(j) \neq \emptyset$
- **WAR hazard if** $D(i) \cap R(j) \neq \emptyset$
- **WAW hazard if** $R(i) \cap R(j) \neq \emptyset$

Adapted from Arvind and Asanovic’s MIT Course 6.823
Register vs. Memory
Data Dependence

Data hazards due to register operands can be determined at the decode stage but data hazards due to memory operands can be determined only after computing the effective address.

\[
\text{store} \quad \quad M[(r_1) + \text{disp1}] \leftarrow (r_2) \\
\text{load} \quad \quad r_3 \leftarrow M[(r_4) + \text{disp2}]
\]

Does \((r_1 + \text{disp1}) = (r_4 + \text{disp2})\)?
Data Hazards: An Example

\[
\begin{align*}
I_1 & \quad \text{DIVD} & \quad f6, & \quad f6, & \quad f4 \\
I_2 & \quad \text{LD} & \quad f2, & \quad 45(r3) \\
I_3 & \quad \text{MULTD} & \quad f0, & \quad f2, & \quad f4 \\
I_4 & \quad \text{DIVD} & \quad f8, & \quad f6, & \quad f2 \\
I_5 & \quad \text{SUBD} & \quad f10, & \quad f0, & \quad f6 \\
I_6 & \quad \text{ADDD} & \quad f6, & \quad f8, & \quad f2
\end{align*}
\]

RAW Hazards
WAR Hazards
WAW Hazards

Adapted from Arvind and Asanovic’s MIT Course 6.823
## Instruction Scheduling

**Valid orderings:**

<table>
<thead>
<tr>
<th>in-order</th>
<th>out-of-order</th>
<th>out-of-order</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>I₁</strong> DIVD</td>
<td><strong>I₂</strong> LD</td>
<td><strong>I₁</strong> ADDD</td>
</tr>
<tr>
<td><strong>I₂</strong> LD</td>
<td><strong>I₃</strong> MULTD</td>
<td><strong>I₅</strong> SUBD</td>
</tr>
<tr>
<td><strong>I₃</strong> MULTD</td>
<td><strong>I₄</strong> DIVD</td>
<td><strong>I₆</strong> ADDD</td>
</tr>
<tr>
<td><strong>I₄</strong> DIVD</td>
<td><strong>I₅</strong> SUBD</td>
<td><strong>I₁</strong> ADDD</td>
</tr>
<tr>
<td><strong>I₅</strong> SUBD</td>
<td><strong>I₆</strong> ADDD</td>
<td><strong>I₂</strong> LD</td>
</tr>
<tr>
<td><strong>I₆</strong> ADDD</td>
<td><strong>I₁</strong> ADDD</td>
<td><strong>I₃</strong> MULTD</td>
</tr>
</tbody>
</table>

Adapted from Arvind and Asanovic’s MIT Course 6.823
# Out-of-order Completion

## In-order Issue

<table>
<thead>
<tr>
<th>$I_1$</th>
<th>DIVD</th>
<th>f6, f6, f4</th>
<th>Latency</th>
<th>4</th>
</tr>
</thead>
<tbody>
<tr>
<td>$I_2$</td>
<td>LD</td>
<td>f2, 45(r3)</td>
<td></td>
<td>1</td>
</tr>
<tr>
<td>$I_3$</td>
<td>MULTD</td>
<td>f0, f2, f4</td>
<td></td>
<td>3</td>
</tr>
<tr>
<td>$I_4$</td>
<td>DIVD</td>
<td>f8, f6, f2</td>
<td></td>
<td>4</td>
</tr>
<tr>
<td>$I_5$</td>
<td>SUBD</td>
<td>f10, f0, f6</td>
<td></td>
<td>1</td>
</tr>
<tr>
<td>$I_6$</td>
<td>ADDD</td>
<td>f6, f8, f2</td>
<td></td>
<td>1</td>
</tr>
</tbody>
</table>

### in-order comp

| 1 | 2 | 1 | 2 | 3 | 4 | 3 | 5 | 4 | 6 | 5 | 6 |

### out-of-order comp

| 1 | 2 | 2 | 3 | 1 | 4 | 3 | 5 | 5 | 4 | 6 | 6 |
Scoreboard: A Hardware Data Structure to Detect Hazards Dynamically
CDC 6600  *Seymour Cray, 1963*

- A fast pipelined machine with 60-bit words
  - 128 Kword main memory capacity, 32 banks
- Ten functional units (parallel, unpipelined)
  - Floating Point: adder, 2 multipliers, divider
  - Integer: adder, 2 incrementers, ...
- Hardwired control (no microcoding)
- Dynamic scheduling of instructions using a scoreboard
- Ten Peripheral Processors for Input/Output
  - a fast multi-threaded 12-bit integer ALU
- Very fast clock, 10 MHz (FP add in 4 clocks)
- >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
- Fastest machine in world for 5 years (until 7600)
  - over 100 sold ($7-10M each)
IBM Memo on CDC6600

Thomas Watson Jr., IBM CEO, August 1963:

“Last week, Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer.”

To which Cray replied: “It seems like Mr. Watson has answered his own question.”
Can we solve write hazards without equalizing all pipeline depths and without bypassing?
When is it Safe to Issue an Instruction?

Suppose a data structure keeps track of all the instructions in all the functional units.

The following checks need to be made before the Issue stage can dispatch an instruction:

- Is the required function unit available?
- Is the input data available? ⇒ RAW?
- Is it safe to write the destination? ⇒ WAR? WAW?
- Is there a structural conflict at the WB stage?
Non-blocking loads / Long latency instrs

No WAR hazard
⇒ (in-order scoreboard) no need to keep src1 and src2

The Issue stage does not dispatch an instruction in case of a WAW hazard
⇒ register names are reserved by uncommitted instrs

WP[reg#] : a bit-vector to record the registers for which writes are pending

*These bits are set to true by the Issue stage and set to false by the WB stage*
⇒ Each pipeline stage in the FU's must carry the dest field and a flag to indicate if it is valid
  “the (we, ws) pair”
Ultrasparc I paper
(Global Code Motion Example)

- FOR
  - LD 1 <- 8 3 (3 cycles) 1
  - ADD X <- 1 3
  - BNE 4 5 OUT 3
  - LD 2 <- 6 7 3
  - B TOP 4

- LD 1 <- 8 3 (3 cycles) 1
- LDS 2 <- 6 7 2 (exception?)
- ADD X <- 1 3
- BNE 4 5 OUT 3
- B TOP 4

Adapted from Arvind and Asanovic’s MIT Course 6.823