Computer architecture


Preliminaries

Combinational vs sequential circuits

Combinational Sequential
has no internal state has internal state
output depends only on current input output depends on current input and state
aka state elements
has state elements as input and output

Clocking mechanism

MIPS

RISC

RISC-V

Design principles

Instructions

A list of instructions

RISC-V fields

funct7 rs2 rs1 funct3 rd opcode
7b 5b 5b 3b 5b 7b

Register addressing

Term Number of bits
byte 8
half word 16
word 32
double word 64

Registers

x0 x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15
0 R SP GP TP T T T S S P P P P P P
x16 x17 x18 x19 x20 x21 x22 x23 x24 x25 x26 x27 x28 x29 x30 x31
P P S S S S S S S S S S T T T T

RISC-V calling conventions for registers

Reference: https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf

Register ABI Name Description Saver
x0 zero Hard-wired zero -
x1 ra Return address Caller
x2 sp Stack pointer Callee
x3 gp Global pointer -
x4 tp Thread pointer -
x5-7 t0-t2 Temporaries Caller
x8 s0/fp Saved register/frame pointer Callee
x9 s1 Saved register Callee
x10-11 a0-a1 Function arguments/return values Caller
x12-17 a2-a7 Function arguments Caller
x18-27 s2-s11 Saved registers Callee
x28-31 t3-t6 Temporaries Caller

Instruction formats

      31                25 24 20 19    15 14    12 11                7 6      0 
┌────┬────────────────────┬─────┬────────┬────────┬───────────────────┬────────┐
│ R  │       funct7       │ rs2 │  rs1   │ funct3 │        rd         │ opcode │
├────┼────────────────────┴─────┼────────┼────────┼───────────────────┼────────┤
│ I  │     immediate[11:0]      │  rs1   │ funct3 │        rd         │ opcode │
├────┼────────────────────┬─────┼────────┼────────┼───────────────────┼────────┤
│ S  │  immediate[11:5]   │ rs2 │  rs1   │ funct3 │  immediate[4:0]   │ opcode │
├────┼────────────────────┼─────┼────────┼────────┼───────────────────┼────────┤
│ SB │ immediate[12,10:5] │ rs2 │  rs1   │ funct3 │ immediate[4:1,11] │ opcode │
├────┼────────────────────┴─────┴────────┴────────┼───────────────────┼────────┤
│ U  │              immediate[31:12]              │        rd         │ opcode │
├────┼────────────────────────────────────────────┼───────────────────┼────────┤
│ UJ │        immediate[20,10:1,11,19:12]         │        rd         │ opcode │
└────┴────────────────────────────────────────────┴───────────────────┴────────┘

I-type instruction format

immediate rs1 funct3 rd opcode
12b 5b 3b 5b 7b

R-type instruction format

funct7 rs2 rs1 funct3 rd opcode
7b 5b 5b 3b 5b 7b

S-type instruction format

immediate[11:5] rs2 rs1 funct3 immediate[4:0] opcode
7b 5b 5b 3b 5b 7b

Instructions

RV32I instructions (v2.2): https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf https://mark.theis.site/riscv/

Notes

Goal of this language: Make it easy to build hardware and compiler while maximizing performance and minimising cost and energy

Usual memory layout

As per commonly followed conventions, stack grows 'downwards' and heap grows 'upwards'.

     |             |
     +-------------+
SP → |             | 0000 003f ffff fff0
     |    Stack    |
     |-------------|
     |      ↓      |
     |      ↓      |
     |      .      |
     |      ↑      |
     |      ↑      |
     |-------------|
     |             | 0000 0000 1000 0000
     |    Heap     |
     +-------------+
     | Static data |
     +-------------+
     |             |
PC → |    Text     | 0000 0000 0040 0000
     +-------------+
     |   Reserved  |
     +-------------+ 0000 0000 0000 0000

Immediate values

Immediate value for instructions are limited to 12 bit values. If we need 32 bit value, we can use a combination of lui and addi.

But this works properly only for positive numbers as sign extension comes to play.

Data paths

Computational efficiency

Execution time (ET)

ET = IC * CPI * T

where:

IC, CPI and T have an orthogonal relationship. We can't decrease one while keeping the other two same.

Decrease in one might lead to increase in the other two.

Ideally we would like all three to be minimized. But can't do it.

Performance

Performance = (1 / ET)

Speedup

speedup = (performance with enhancement) / (performance without enhancement)

which is same as

speedup = (ET with enhancement) / (ET without enhancement)

Amdahl's law

Speedup via parallelism is limited by that component of the application which cannot be enhanced (ie, the sequential component).

If a fraction (1-E) of the execution time E of an application cannot be enhanced for parallel implementation, then speedup is limited by a factor of 1 / (1-E).

Power consumption

Datapath

Single cycle datapath

TODO Multi cycle datapath

Register file

Program Counter (PC)

PC = PC + 4 (ie, 4 bytes = 32b)

Types of memories

Cache memories

Types of cache

Handling writes

Write through

Flash memory

Virtual memory

Advantages

Virtual address

Translation Lookaside Buffer (TLB)

Handling TLB miss

Two possibilities if a TLB miss occurs:

RISC-V has two registers for handling exceptions:

SEPC
Supervisor Exception Program Counter
SCAUSE
Supervisor exception cause

Pipelining

RISC-V pipeline stages

RISC-V instruction set was designed to be pipelined.

IF → ID → EX → MEM → WB

 - IF  = Instruction fetch
 - ID  = Instruction decode
 - EX  = Execute operation
 - MEM = Data memory access
 - WB  = Write back results

RISC-V instructions classically take 5 steps:

  1. Fetch instruction from memory.
  2. Read registers and decode instruction.
  3. Execute operation or calculate address.
  4. Access operand in data memory if needed.
  5. Write results back into a register if needed.

ie, the datapath is separated into 5 pieces, with each corresponding to a pipeline stage.

RISC-V

RISC-V instruction set was designed to be pipelined.

A RISC-V instruction takes 5 steps:

ie, RISC-V pipeline consists of 5 stages.

Thankfully, in RISC-V:

Pipeline hazards

Situations where the next instruction in the pipeline cannot execute in the next clock cycle.

Hazards may be classified as:

Structural hazards

Data hazards

-- a RAW dependency
-- because add instruction won't produce output till stage 5 (WB)
-- whereas sub needs the input at stage 3?? (EX)
-- 3 clock cycles got to be spent waiting.
add x5, x6, x7
sub x8, x5, x8

-- I guess it would look something like
--
-- IF ID EX MEM WB
-- b  b  b  b   IF ID EX MEM WB
--
-- where b denotes a bubble/stall

Forwarding (aka bypassing)

add x1, x2, x3
sub x4, x1, x5

IF ID EX MEM WB
       ↓
       -→-
         ↓ 
   IF ID EX MEM WB

Data depenedencies

RAW (Read After Write)

add x5, x6, x7   -- new value written to x5
sub x1, x5, x2   -- new value of x5 is being used

WAW (Write After Write)

WAR (Write After Read)

Control hazards

Branch prediction

Static branch prediction Dynamic branch prediction
Rigid Flexible
Prediction doesn't change Prediction can change
Don't account for individuality Guesses based on past behaviour
of specific instructions.
Can be done by compiler Done by hardware

Delayed branch

Multiple issue

Types of multiple issue processors

Steps by compiler Remaining by hardware
code generation
Superscalar
instr grouping
EPIC
FU assignment
Dynamic VLIW
initiation timing
VLIW

Static multi issue processors

Dynamic multi issue processors

Dynamic pipeline scheduling

Factors to consider in multiple-issue designs

Speculative execution

Out-of-order execution

In-order issue → Out-of-order execution → In-order commit

Ways to improve pipeline efficiency

Loop unrolling

// Following loop:
for(int i=0; i<16; ++i) {
  c[i] = a[i] + b[i];
}

// could be unrolled once to make it
for(int i=0; i<16; i+=2) {
  c[i] = a[i] + b[i];
  c[i+1] = a[i+1] + b[i+1]
}

Register renaming

-- WAR data dependency here
add x4, x2, x1
and x1, x2, x0

-- circumvented by renaming reg x1 → x3 in the `and` instruction
add x4, x2, x1
and x3, x2, x0
-- WAW data dependency here
add x9, x2, x1
and x9, x3, x1

-- circumvented by renaming reg x9 → x5 in the `and` instruction
add x9, x2, x1
and x5, x3, x1

Instruction reordering

ld x1, 0(x0)
ld x2, 4(x0)
add x3, x1, x2
sd x3, 12(x0)
ld x4, 8(x0)
add x5, x1, x4
sd x5, 16(x0)

could be made

ld x1, 0(x0)
ld x2, 4(x0)
ld x4, 8(x0)     -- reordered this instr
add x3, x1, x2
sd x3, 12(x0)
add x5, x1, x4
sd x5, 16(x0)

-- ie, here we just grouped the loads together and
-- the (data) hazard disappeared.

New terms

Latency of an instruction

Essentially time taken to finish executing that instruction. ie, its execution time.

nops

PC relative addressing

EPIC

Explicitly Parallel Instruction Computing

VLIW

Examples

Execution time PERFORMANCE

Problem statement

Consider a program comprising of 100 instructions which runs in two different machines 𝑀1 and 𝑀2. Each instruction takes 2 clock cycles in 𝑀1 and 1 clock cycle in 𝑀2. The clock period of 𝑀1 and 𝑀2 are 20 nanosecond (ns) and 30ns, respectively. Find the execution times of the program in both machines. Which machine has better performance?

Solution

ET = IC * CPI * T

IC CPI T ET
M1 100 2 20ns 4000ns
M2 100 1 30ns 3000ns

∴ M2 is faster.

Non-leaf procedures: Max element of array ASSEMBLY

Problem

Write a non-leaf RISC-V procedure, "Maximum" to find the maximum number from an array of integers. The address of the first element of the array is stored in register X5 and the size of the array, ‘ASize’ is stored in register X6. "Maximum" calls a leaf procedure ‘max-2’. The function ‘max-2’ takes two values from the registers X10 and X11 and returns the greater value through register X10. Eventually, the maximum value in the array is stored in the memory location 0x3345A204. Kindly state your assumptions clearly, for instance specify register that will hold index or temporary value.

RISC-V assembly

# max-2
max2:
  bge x10, x11, max2end
  add x10, x11, x0
max2end:
  jalr x0, 0(x1)


# Maximum
maximum:
  # Initialize max and i
  ld x30, x5       # max = arr[0]
  addi x7, x0, 1   # i=1

loop:
  # Find offset
  slli x29, x7, 3  # offset=i*8

  # Find address of arr[i]
  add x29, x29, x5 # offset = offset + &arr[0]

  # Load arr[i]
  ld x10, 0(x29)   # cur = arr[i]

  # push return address to stack
  addi sp, sp, -8
  sd x1, 0(sp)

  # Call max2
  jal x1, max2

  # pop original return address from stack
  ld x1, 0(sp)
  addi sp, sp, 8

  # Copy return value
  add x30, x10, x0 # max = max2(max, cur)

  # Increment i
  addi x7, x7, 1   # i=i+1

  # Loop exit condition
  bne x7, x6, loop

  # Load storage address
  lui x31, 3345a
  addi x31, 204

  # Store result
  sd x30, 0(x31)

Non-leaf procedures: Factorial ASSEMBLY

C program

long long int fact (long long int n) {
    if (n < 1) {
        return f;
    } else {
        return n * fact(n - 1);
    }
}

RISC-V assembly

Own attempt. May be wrong.

fact:
    bgt  x10, x0, recurse    # non-base case: n > 0
    addi x10, x0, 1          # return 1
recurse:
    addi  x2, x2, -16        # expand stack (x2 is SP)
    sw x10, 8(x2)            # save n to stack
    sw x1, 0(x2)             # save return address to stack
    addi x5, x10, -1         # calculate (n - 1)
    add x10, x5, x0          # set up (n-1) as argument to next function call
    jal x1, fact             # recursive function call
    lw x1, 0(x2)             # retrieve original return address
    lw x5, 8(x2)             # retrieve original n
    mul x10, x5, x10         # set up return value as (n * fact(n-1))
                             # mul available only in RV64M mulitply extension
over:
    jalr x0, 0(x1)           # return from fact() to caller function

Procedure calls PERFORMANCE

Problem statement

After graduating, you are asked to become the lead computer designer at Intel.

Your study of usage of high-level language constructs suggests that procedure calls are one of the most expensive operations.

You have invented a scheme that reduces the loads and stores normally associated with procedure calls and returns.

The first thing you do is run some experiments with and without this optimization.

Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:

Which is faster? Justify your decision quantitatively.

Solution

Optimized Un-optimized
T m m(1 + 5/100) = (21/20)m
CPI
IC
Count L-S% 30%
Ex L-S% (2/3)*k k

References

Doubts

Unaddressed

Datapaths

WAW

Why does it matter? Write after write means that the first write can simply be ignored, right?

-- d'accord matters here
add x1, x2, x3
add x1, x1, x4
add x1, x2, x3
add x1, x4, x4

beq

I found this sentence under the section 'Reducing the delay of branches' in Chapter 4 ('The processor') in the Hennessy Patterson RISC-V book (page 309): 'Equality can be tested by XORing individual bit positions of two registers and ORing the XORed result.'

Why is an OR operation needed as well? Isn't XOR result enough?

Structural hazards not a problem in multiple issue designs?

In the textbook it is mentioned (page 322) that one of the 'two primary and distinct responsibilities must be dealt with in a multiple-issue pipeline' is 'dealing with data and control hazards'. Strucutural hazard is not mentioned there. Does that mean structural hazards are not much of a problem in multiple issue designs?

The other factor to be dealt with was 'packaging instructions into issue slots'. Does that part take care of structural hazards?

Branch delay

In the static-prediction - delayed branch slide (slide number 44 in pdf), how does the processor know that an instruction is a branch instruction before decoding it? Doesn't it need to wait till instruction decode stage to know what kind of an instruction it is? In the diagram, the independent instruction (that runs in the branch delay slot) starts right after the branch instruction's Instruction Fetch stage (together with the branch instruction's ID stage).

RISC-V extensions

Memory hierarchy


  1. Page 102   
    
    ↩︎