Preliminaries
Combinational vs sequential circuits
Combinational | Sequential |
---|---|
has no internal state | has internal state (aka state elements) |
output depends only on current input | output depends on current input and state |
 | has state elements as inputs and outputs |
Clocking mechanism
- Defines when signals can be read and when they can be written.
- Avoids situations like a value being read and written at the same time.
- Needed to make the hardware behaviour predictable (and thus reliable).
- Edge-triggered clocking: state changes happen only on clock edges.
- ie, during the transition from low → high or high → low.
MIPS
- A RISC architecture with many similarities with RISC-V.
- Microprocessor without Interlocked Pipeline Stages (MIPS).
- Assembly instructions are written using a different notation (registers use the $ sign).
- Like RISC-V, has optional extensions.
- Development ceased by 2021, with the MIPS company announcing (March 2021) a transition to RISC-V.
RISC
- All instructions are of same length.
- But that means different instruction formats (distinguished implicitly by opcode ranges)
- R-type (register type)
- I-type (immediate type)
- S-type (store type)
RISC-V
- Reduced Instruction Set Computer; the 'V' marks it as the fifth RISC ISA from UC Berkeley.
- UC Berkeley from 2010 onwards
- word: 32 bits
- doubleword: 64 bits
- Number of general purpose registers: 32
- Size of registers: 64 bits
- Register names: starts with 'x' (eg: x0, x1, x26, x31, etc)
- Little endian (ie, the least significant byte comes first; 'indexing' yields the least significant part).
- Memory addresses refer to byte addresses.
- Sequential doubleword addresses differ by 8 instead of 1.
- Stack pointer always points to the last location that was used (not the next one to be used).
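For instance, pushing and popping one doubleword follows directly from this convention (a sketch; register x5 is arbitrary):
addi sp, sp, -8 # grow the stack downwards by one doubleword
sd x5, 0(sp) # sp now points to the last used location
ld x5, 0(sp) # pop: read the value back...
addi sp, sp, 8 # ...and shrink the stack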
Design principles
- Simplicity favours regularity.
- Good design demands good compromises.
Instructions
- Arithmetic instructions operate on registers (eg, two source registers plus a destination).
- Data transfer (load/store) instructions read/write memory using a register but cannot operate on the data.
- Corresponding formats: R-type, I-type.
A list of instructions
- `ld` instruction: load doubleword
- `sd` instruction: store doubleword
- `srai`: shift right arithmetic immediate
- `beq rs1, rs2, L1`: branch if equal
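A minimal usage sketch of these instructions (registers and the label are arbitrary):
ld x5, 0(x10) # load doubleword from the address in x10
srai x5, x5, 2 # shift right arithmetic by 2 (sign bit preserved)
sd x5, 8(x10) # store doubleword at address x10 + 8
beq x5, x0, L1 # branch to L1 if x5 == 0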
RISC-V fields
funct7 | rs2 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7b | 5b | 5b | 3b | 5b | 7b |
- opcode: indicates the instruction
- rd: destination register. Gets result of operation
- funct3: additional opcode field
- funct7: additional opcode field
- rs1: first source register
- rs2: second source register
Register addressing
- Immediate
- Indexed
Term | Number of bits |
---|---|
byte | 8 |
half word | 16 |
word | 32 |
double word | 64 |
Registers
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R | SP | GP | TP | T | T | T | S | S | P | P | P | P | P | P |
x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 |
P | P | S | S | S | S | S | S | S | S | S | S | T | T | T | T |
(R = return address, SP = stack pointer, GP = global pointer, TP = thread pointer, T = temporary, S = saved, P = parameter/argument)
Each register is a 64-bit register.
There are 32 general purpose registers:
x0 to x31.
x0 :: hardwired to be always zero.
- Useful for initializing variables. Like:
  - add x5 x0 x0 : initialize x5 to 0
  - addi x5 x0 2 : initialize x5 to 2
- Useful for copying the value in one register to another.
  - add x5 x7 x0 : copy value in x7 to x5
- Any attempt to change the value in x0 is simply ignored.
RISC-V calling conventions for registers
Reference: https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf
- x10 to x17 : used for parameter passing
- x1 : return address for procedures (textbook page 102)
- x5 to x7 and x28 to x31 : temporary registers (not preserved by the callee)
- x8 to x9 and x18 to x27 : saved registers (the callee, ie the called fn, maintains their original values when returning)
Register | ABI Name | Description | Saver |
---|---|---|---|
x0 | zero | Hard-wired zero | - |
x1 | ra | Return address | Caller |
x2 | sp | Stack pointer | Callee |
x3 | gp | Global pointer | - |
x4 | tp | Thread pointer | - |
x5-7 | t0-t2 | Temporaries | Caller |
x8 | s0/fp | Saved register/frame pointer | Callee |
x9 | s1 | Saved register | Callee |
x10-11 | a0-a1 | Function arguments/return values | Caller |
x12-17 | a2-a7 | Function arguments | Caller |
x18-27 | s2-s11 | Saved registers | Callee |
x28-31 | t3-t6 | Temporaries | Caller |
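A minimal sketch of the convention in action (the callee g is hypothetical): a callee that uses an s-register must save and restore it, and must also save ra if it makes a call of its own:
f:
addi sp, sp, -16 # make room for two doublewords
sd ra, 8(sp) # save return address (f calls g below)
sd s1, 0(sp) # callee-saved register: save before use
add s1, a0, x0 # keep the argument safe across the call
jal ra, g # call g; g may clobber t- and a-registers
add a0, s1, a0 # return value = original argument + g's result
ld s1, 0(sp) # restore callee-saved register
ld ra, 8(sp) # restore return address
addi sp, sp, 16 # pop the frame
jalr x0, 0(ra) # return to caller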
Instruction formats
- R-type: Arithmetic and logic operations
- I-type: Immediate
- S-type: Store
- SB-type: Branch
- U-type: upper immediates (lui, auipc)
- UJ-type: jal is the only instruction with this format
31 25 24 20 19 15 14 12 11 7 6 0
┌────┬────────────────────┬─────┬────────┬────────┬───────────────────┬────────┐
│ R │ funct7 │ rs2 │ rs1 │ funct3 │ rd │ opcode │
├────┼────────────────────┴─────┼────────┼────────┼───────────────────┼────────┤
│ I │ immediate[11:0] │ rs1 │ funct3 │ rd │ opcode │
├────┼────────────────────┬─────┼────────┼────────┼───────────────────┼────────┤
│ S │ immediate[11:5] │ rs2 │ rs1 │ funct3 │ immediate[4:0] │ opcode │
├────┼────────────────────┼─────┼────────┼────────┼───────────────────┼────────┤
│ SB │ immediate[12,10:5] │ rs2 │ rs1 │ funct3 │ immediate[4:1,11] │ opcode │
├────┼────────────────────┴─────┴────────┴────────┼───────────────────┼────────┤
│ U │ immediate[31:12] │ rd │ opcode │
├────┼────────────────────────────────────────────┼───────────────────┼────────┤
│ UJ │ immediate[20,10:1,11,19:12] │ rd │ opcode │
└────┴────────────────────────────────────────────┴───────────────────┴────────┘
immediate[0] is implicitly 0 in SB-type instructions, so branching is only possible to even addresses.
Note :: Memory operands appear only in load and store instructions in RISC-V
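As a worked example (standard RV32I encoding: add has opcode 0110011, funct3 000, funct7 0000000), add x5, x6, x7 assembles to:
funct7 | rs2 (x7) | rs1 (x6) | funct3 | rd (x5) | opcode |
---|---|---|---|---|---|
0000000 | 00111 | 00110 | 000 | 00101 | 0110011 |
ie, the 32-bit word 0x007302B3.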
I-type instruction format
immediate | rs1 | funct3 | rd | opcode |
---|---|---|---|---|
12b | 5b | 3b | 5b | 7b |
R-type instruction format
funct7 | rs2 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7b | 5b | 5b | 3b | 5b | 7b |
S-type instruction format
immediate[11:5] | rs2 | rs1 | funct3 | immediate[4:0] | opcode |
---|---|---|---|---|---|
7b | 5b | 5b | 3b | 5b | 7b |
Instructions
RV32I instructions (v2.2): https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf https://mark.theis.site/riscv/
Notes
- Instructions: Words in the computer's language
- ISA (Instruction Set Architecture): Vocabulary of the computer's language
Goal of this language: Make it easy to build hardware and compiler while maximizing performance and minimising cost and energy
- Spilling registers: process of putting less frequently used data from registers into memory.
- Leaf procedure: A procedure that doesn't call other procedures.
- Attention must be paid when using new registers for non-leaf procedures. Conflict over the use of register may arise between the procedures called by the non-leaf procedure.
- Recursive functions are non-leaf procedures.
- Memory addresses in RISC-V refer to byte addresses.
- Pseudo instructions: instructions that need not be available in the hardware but are still offered by the assembler for convenience.
  - Eg: li instruction (load immediate), which the assembler expands (eg, into lui/addi).
- Unconditional jump can be done with: jal x0, Label
- Calling a function can be done with: jal x1, fn_name, where x1 gets the return address.
- Returning from a function can be done with: jalr x0, 0(x1).
Usual memory layout
As per commonly followed conventions, stack grows 'downwards' and heap grows 'upwards'.
| |
+-------------+
SP → | | 0000 003f ffff fff0
| Stack |
|-------------|
| ↓ |
| ↓ |
| . |
| ↑ |
| ↑ |
|-------------|
| | 0000 0000 1000 0000
| Heap |
+-------------+
| Static data |
+-------------+
| |
PC → | Text | 0000 0000 0040 0000
+-------------+
| Reserved |
+-------------+ 0000 0000 0000 0000
Immediate values
Immediate values for instructions are limited to 12-bit values. If we need a 32-bit value, we can use a combination of lui and addi.
- Use lui to load the upper 20 bits (bit 31 to bit 12).
- Then add the remaining 12 bits to it with the usual addi.
But a plain split works only when sign extension doesn't interfere: addi sign-extends its 12-bit immediate, so if bit 11 of the constant is 1, the value given to lui must be incremented by 1 to compensate.
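A sketch of both cases (register x5 arbitrary):
# loading 0x12345678: lower 12 bits (0x678) < 0x800, no correction needed
lui x5, 0x12345 # x5 = 0x12345000
addi x5, x5, 0x678 # x5 = 0x12345678
# loading 0x12345FFF: the lower 12 bits (0xFFF) sign-extend to -1,
# so the upper immediate is incremented by 1 to compensate
lui x5, 0x12346 # x5 = 0x12346000
addi x5, x5, -1 # x5 = 0x12345FFF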
Data paths
Computational efficiency
Execution time (ET)
- aka response time.
- Time between the start and completion of an event.
ET = IC * CPI * T
where:
- IC: Instruction count
- CPI: Clock cycles per instruction
- T: Clock period (ie, 1/ν)
IC, CPI and T are not independent: we usually can't decrease one while keeping the other two the same, and a decrease in one might lead to an increase in the other two.
Ideally we would like all three to be minimized, but they trade off against each other.
Performance
Performance = (1 / ET)
Speedup
speedup = (performance with enhancement) / (performance without enhancement)
which, since performance = 1/ET, is the same as
speedup = (ET without enhancement) / (ET with enhancement)
Amdahl's law
- Expresses the law of diminishing returns.
Speedup via parallelism is limited by that component of the application which cannot be enhanced (ie, the sequential component).
If a fraction (1-E) of the execution time of an application cannot be enhanced for parallel implementation (E being the fraction that can be), then speedup is limited by a factor of 1 / (1-E).
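For example, if 90% of an application's execution time can be parallelized (E = 0.9), the speedup can never exceed 1 / (1 - 0.9) = 10, no matter how many processors are added.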
Power consumption
Datapath
- memory
- PC: address of next instr to execute
Single cycle datapath
- Everything an instruction needs to do is done within one clock cycle.
- Clock frequency will be low, as the clock period must be long enough for the slowest instruction. (bad)
- But CPI is 1. (good)
TODO Multi cycle datapath
Register file
- All 32 general purpose registers of the processor are stored in a structure called a register file.
- Any register in it can be accessed by specifying its number (register number).
- Register number input here is just 5 bits (2⁵ = 32) and is not 64 bits wide.
Program Counter (PC)
PC = PC + 4 (ie, 4 bytes = 32b)
Types of memories
Cache memories
Types of cache
- Direct mapped cache
  - Effectively a 1-way set-associative cache.
  - Least 'liberal': each block can go to exactly one location.
- Fully associative cache
  - Effectively a set-associative cache with a single set covering all blocks.
  - Blocks can go anywhere.
  - Most 'liberal'.
- Set-associative cache
- n in 'n-way cache' refers to number of blocks per set.
- Each block mapped to a set
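For example (sizes assumed): with 8 cache blocks, memory block address 29 must go to index 29 mod 8 = 5 in a direct-mapped cache; in a 2-way set-associative cache (4 sets) it maps to set 29 mod 4 = 1 and may occupy either block of that set; in a fully associative cache it may go into any of the 8 blocks.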
Handling writes
- Write-through :: always write to both the cache and main memory.
- Write-buffer :: queue of pending writes to main memory, so the processor needn't wait for them.
- Write-back :: write only to the cache; update main memory when the block is replaced.
Write through
- Keep the main memory and cache always up to date (consistent).
- Write all changes to the cache through to main memory as soon as they happen.
- ie, always write the data into both the main memory and the cache.
- Frequent main memory accesses => slower.
Flash memory
- An Electrically Erasable Programmable ROM (EEPROM).
- Non volatile.
Virtual memory
Simply put, main memory acting like a cache for secondary memory.
- manages caching between main memory and secondary memory.
main memory aka physical memory.
page :: a virtual memory block
page fault :: a virtual memory miss
Virtual memory => virtual address. Translated to a physical address.
- This translation = address mapping or address translation.
Address translation is managed by the MMU (Memory Management Unit), not the OS.
- The OS kicks in to handle page faults.
Mapping between virtual addresses and physical addresses is fully associative. ie, virtual pages can be placed anywhere in the main memory.
thrashing :: if the program is constantly swapping pages between main and secondary memories.
Advantages
- Allows a single program to expand its address space beyond the limits of the main memory.
- Allows sharing of main memory between different processes in a secure manner.
Virtual address
- Virtual page number
- Page offset
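For example (sizes assumed): with 48-bit virtual addresses and 4 KiB pages, the page offset is the lower 12 bits (2¹² = 4096) and the virtual page number is the upper 36 bits. Translation replaces the virtual page number with a physical page number; the page offset passes through unchanged.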
Translation Lookaside Buffer (TLB)
- Helps in translating virtual addresses to physical addresses.
- A special cache. Keeps track of recently used translations.
- Simply loads physical address and protection tags from the (last level) page table.
- Effectively a 'translation cache'.
- When a TLB is present, it is consulted before the page table.
- So TLB needs status bits like valid bit, dirty bit, etc.
- Needs status bits for:
- valid bit
- dirty bit
- reference bit :: useful when LRU page replacement algorithm is used. Helps to find entries which were recently referenced/used.
Handling TLB miss
Two possibilities if a TLB miss occurs:
- Page is in memory => Just add the 'translation' entry to TLB
- Page not in memory => page fault. Transfer control to OS.
- A TLB miss or page fault exception should be asserted by the end of the clock cycle in which the memory access occurs, so that the next cycle can handle it instead of fruitlessly continuing execution.
RISC-V has two registers for handling exceptions:
- SEPC
- Supervisor Exception Program Counter
- SCAUSE
- Supervisor exception cause
Pipelining
- Exploits parallelism between instructions in a sequential instruction stream.
- Improves instruction throughput.
- Doesn't decrease execution time of individual instructions.
- Pipelined datapath has multiple stages, each of which operate concurrently.
- If each stage takes the same time (ideal conditions), speedup = number of stages.
- Duration of a clock cycle must be at least the duration of the longest pipeline stage.
- Fundamentally invisible to the programmer (unlike in the case of multiprocessor systems).
- Needn't adjust the program to make use of a pipelined architecture.
- A pipelined processor that issues one instruction per cycle is still a scalar processor.
RISC-V pipeline stages
RISC-V instruction set was designed to be pipelined.
IF → ID → EX → MEM → WB
- IF = Instruction fetch
- ID = Instruction decode
- EX = Execute operation
- MEM = Data memory access
- WB = Write back results
RISC-V instructions classically take 5 steps:
- Fetch instruction from memory.
- Read registers and decode instruction.
- Execute operation or calculate address.
- Access operand in data memory if needed.
- Write results back into a register if needed.
ie, the datapath is separated into 5 pieces, with each corresponding to a pipeline stage.
Thankfully, in RISC-V:
- instructions are all of the same length (32b), which makes pipelining much easier.
- in fact, x86 processors internally translate their instructions into regular micro-operations similar to RISC-V instructions.
- only a few instruction formats.
- source and destination register fields are located in the same place in every instruction.
- only load/store instructions have operands in memory => the execute stage (stage 3) can calculate the memory address and the MEM stage (stage 4) can access it; operating on memory operands directly would demand extra stages.
Pipeline hazards
Situations where the next instruction in the pipeline cannot execute in the next clock cycle.
Hazards may be classified as:
- Structural hazards
- Data hazards
- Control hazards
Structural hazards
- A planned instruction cannot execute in the proper clock cycle because the hardware doesn't support the combination of instructions that are set to execute in that cycle.
- eg: A functional unit that is needed is not available.
Data hazards
- Pipeline is stalled because one step of an instruction cannot execute until a step of a previous instruction finishes.
- due to dependence of an instruction on an earlier instruction that is still in the pipeline.
-- a RAW dependency:
-- add won't produce its output till stage 5 (WB),
-- whereas sub needs x5 when it reads registers in stage 2 (ID).
-- Without forwarding, 3 clock cycles must be spent waiting.
add x5, x6, x7
sub x8, x5, x8
-- I guess it would look something like
--
-- IF ID EX MEM WB
--          b  b  b IF ID EX MEM WB
--
-- where b denotes a bubble/stall
Forwarding (aka bypassing)
- A way to get around data hazards.
- A bypass by which result can be given to next instruction as soon as the result is formed.
- Extra hardware involved.
add x1, x2, x3
sub x4, x1, x5
add: IF ID EX MEM WB
            ↓ (result forwarded straight out of EX)
sub:    IF ID EX MEM WB
Data dependencies
RAW (Read After Write)
- True dependence aka flow dependence
- next instruction cannot execute till previous instruction has produced results.
- Register value is being read to get the value written to it by the previous instruction.
add x5, x6, x7 -- new value written to x5
sub x1, x5, x2 -- new value of x5 is being used
WAW (Write After Write)
- Output dependence
- next instruction cannot write its output till a previous instruction has written its output.
WAR (Write After Read)
- Anti-dependence.
- Next instruction cannot write its results till a previous instruction has read its input.
Control hazards
- aka branch hazard.
- arises from the need to make a decision based on the results of one instruction while others are executing.
Branch prediction
- A way to get around control hazards.
- May be classified into:
- Static branch prediction
- Dynamic branch prediction
- Disadvantage: a wrong prediction means the work done along the wrong path must be rolled back.
- This problem is particularly relevant in the case of long pipelines.
Static branch prediction | Dynamic branch prediction |
---|---|
Rigid | Flexible |
Prediction doesn't change | Prediction can change |
Doesn't account for the individuality of specific instructions | Guesses based on past behaviour of each instruction |
Can be done by the compiler | Done by hardware |
Delayed branch
- Another way to handle control hazards.
- Delayed decision.
- Execution of the part after a branch is delayed till the result of branching instruction is known.
- Meanwhile, an independent instruction will be executed (in delay slot).
- ie, next sequential execution is always executed. During this time, result of the branching instruction may be calculated.
- Effectively, the branching occurs one instruction after the branch instruction.
- The assembler/compiler arranges it this way; the programmer doesn't have to worry about it. ie, this is static.
- MIPS does this.
Multiple issue
- Replicate internal components of the computer so that multiple instructions can be launched/issued.
- Another way (apart from simple pipelining) to increase parallelism.
- Launch multiple instructions in every pipeline-stage.
- There will be multiple sets of the pipeline. A replication of hardware, so that multiple 'pipelines' are being run simultaneously.
- Using this, it is possible to make CPI < 1
- ie, instruction execution rate can exceed the clock rate.
- Another metric, IPC (Instruction per clock cycle) instead of CPI, can also be used.
- Even moderate designs aim for a peak IPC of 2 (see the example after this list).
- Classification of multiple issue designs commonly made based on the division of work between:
- compiler (ie, static)
- hardware (ie, dynamic)
- Eg: Static multiple issue and dynamic multiple issue
- Dynamic multiple issue processors are known as Super-scalar processors aka superscalars.
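For example, a 2 GHz processor sustaining an IPC of 2 executes 4 billion instructions per second, ie twice its clock rate.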
Types of multiple issue processors
- Many types of multiple issue processors are there.
- Differs in terms of the division of work between the compiler and processor
- Decisions made statically, by the compiler itself: static multiple issue.
- Decisions made dynamically, at run-time: dynamic multiple issue.
- Usually practical multiple issue processors are a blend of static and dynamic.
Steps done by compiler | Remaining done by hardware (→ design type) |
---|---|
code generation | Superscalar |
+ instruction grouping | EPIC |
+ FU assignment | Dynamic VLIW |
+ initiation timing | VLIW |
Each row gives the compiler one more step; whatever remains is done by hardware.
Static multi issue processors
- Compiler usually does something to handle control and data hazards as well.
- Like code rescheduling to reduce hazards, and static branch prediction.
Dynamic multi issue processors
- Dynamic multi issue processors are often termed superscalars.
Dynamic pipeline scheduling
- In the context of dynamic multi issue processors.
- Commit unit :: decides when it is safe to write results to the register file or memory, committing instructions in program order.
- Reorder buffer :: Buffer in the commit unit that stores the result until it is safe to write to memory or register file.
- Also provides operands needed by other instructions (data) dependent on already executed but uncommitted instructions.
- Reservation stations :: buffer associated with a FU.
- Holds operands and operator.
- As soon as all operands and operator are available and FU is ready to execute, result is calculated.
Factors to consider in multiple-issue designs
- Packaging instructions into issue slots.
- Dealing with data and control hazards.
Speculative execution
- Compiler or processor 'guesses' about the properties of an instruction.
Out-of-order execution
- Instructions can be executed in a different order than the one in which they were fetched.
- Processor executes instructions in an order that preserves the data flow order of the program.
In-order issue → Out-of-order execution → In-order commit
Ways to improve pipeline efficiency
Loop unrolling
- A technique employable by the compiler to enhance possibility of parallelism.
- Body of loop is 'unrolled' to have more ILP (instruction level parallelism) available by overlapping instructions from different iterations.
- Disadvantage :: More register resources needed.
- More register pressure.
- Increase in concurrency demands more resources.
// Following loop:
for(int i=0; i<16; ++i) {
c[i] = a[i] + b[i];
}
// could be unrolled once to make it
for(int i=0; i<16; i+=2) {
  c[i] = a[i] + b[i];
  c[i+1] = a[i+1] + b[i+1];
}
Register renaming
- Eliminates dependencies that aren't true (RAW) dependencies, ie WAR and WAW name dependencies.
- For instructions that can be made independent if different register was used.
- Because they just look like they are dependent but aren't actually so.
- This is because of name dependence or antidependence.
- No data values flow between the antidependent instructions.
- Because the 'dependence' is forced purely by the reuse of a name.
- Register renaming eliminates name dependencies while preserving true dependencies.
- Disadvantage :: More register resources needed.
-- WAR data dependency here
add x4, x2, x1
and x1, x2, x0
-- circumvented by renaming reg x1 → x3 in the `and` instruction
add x4, x2, x1
and x3, x2, x0
-- WAW data dependency here
add x9, x2, x1
and x9, x3, x1
-- circumvented by renaming reg x9 → x5 in the `and` instruction
add x9, x2, x1
and x5, x3, x1
Instruction reordering
ld x1, 0(x0)
ld x2, 4(x0)
add x3, x1, x2
sd x3, 12(x0)
ld x4, 8(x0)
add x5, x1, x4
sd x5, 16(x0)
could be made
ld x1, 0(x0)
ld x2, 4(x0)
ld x4, 8(x0) -- reordered this instr
add x3, x1, x2
sd x3, 12(x0)
add x5, x1, x4
sd x5, 16(x0)
-- ie, here we just grouped the loads together and
-- the (data) hazard disappeared.
New terms
Latency of an instruction
Essentially time taken to finish executing that instruction. ie, its execution time.
nops
- An instruction that does nothing. Just filler.
- No operation.
- Does not change state.
- Acts essentially like a stall/bubble in the pipeline.
- Not counted while calculating CPI or IPC as it plays no role in enhancing performance.
PC relative addressing
- Addressing scheme where: address = PC + constant
- RISC-V uses PC-relative addressing for both conditional branches and jumps as the destination instruction tends to be located close by.
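A minimal sketch: the encoded offset is relative to the branch's own address, so a nearby target needs only a small immediate and the code works wherever it is loaded:
loop:
addi x5, x5, -1 # decrement counter
bne x5, x0, loop # target = PC + encoded offset (negative here)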
EPIC
Explicitly Parallel Instruction Computing
VLIW
- Very Long Instruction Word
- An ISA that launches many operations (that are defined to be independent) in a single 'wide' instruction.
- Multiple (independent) instructions are bundled together into a long instruction.
Examples
Execution time PERFORMANCE
Problem statement
Consider a program comprising 100 instructions which runs on two different machines 𝑀1 and 𝑀2. Each instruction takes 2 clock cycles on 𝑀1 and 1 clock cycle on 𝑀2. The clock periods of 𝑀1 and 𝑀2 are 20 nanoseconds (ns) and 30ns, respectively. Find the execution times of the program on both machines. Which machine has better performance?
Solution
ET = IC * CPI * T
IC | CPI | T | ET | |
---|---|---|---|---|
M1 | 100 | 2 | 20ns | 4000ns |
M2 | 100 | 1 | 30ns | 3000ns |
∴ M2 is faster.
Non-leaf procedures: Max element of array ASSEMBLY
Problem
Write a non-leaf RISC-V procedure, "Maximum" to find the maximum number from an array of integers. The address of the first element of the array is stored in register X5 and the size of the array, ‘ASize’ is stored in register X6. "Maximum" calls a leaf procedure ‘max-2’. The function ‘max-2’ takes two values from the registers X10 and X11 and returns the greater value through register X10. Eventually, the maximum value in the array is stored in the memory location 0x3345A204. Kindly state your assumptions clearly, for instance specify register that will hold index or temporary value.
RISC-V assembly
# max-2
max2:
bge x10, x11, max2end
add x10, x11, x0
max2end:
jalr x0, 0(x1)
# Maximum
maximum:
# Initialize max and i
ld x30, 0(x5) # max = arr[0]
addi x7, x0, 1 # i=1
loop:
# Find offset
slli x29, x7, 3 # offset=i*8
# Find address of arr[i]
add x29, x29, x5 # offset = offset + &arr[0]
# Load arr[i]
ld x10, 0(x29) # cur = arr[i]
# push return address to stack
addi sp, sp, -8
sd x1, 0(sp)
# Set up second argument and call max2
add x11, x30, x0 # pass current max via x11
jal x1, max2
# pop original return address from stack
ld x1, 0(sp)
addi sp, sp, 8
# Copy return value
add x30, x10, x0 # max = max2(max, cur)
# Increment i
addi x7, x7, 1 # i=i+1
# Loop exit condition
bne x7, x6, loop
# Load storage address 0x3345A204
lui x31, 0x3345A # upper 20 bits
addi x31, x31, 0x204 # lower 12 bits (0x204 < 0x800: no sign-extension correction needed)
# Store result
sd x30, 0(x31)
Non-leaf procedures: Factorial ASSEMBLY
C program
long long int fact (long long int n) {
  if (n < 1) {
    return 1;
  } else {
    return n * fact(n - 1);
  }
}
RISC-V assembly
Own attempt. May be wrong.
fact:
bgt x10, x0, recurse # non-base case: n > 0
addi x10, x0, 1 # base case: return 1
jal x0, over # skip the recursive part
recurse:
addi x2, x2, -16 # expand stack (x2 is SP)
sd x10, 8(x2) # save n to stack
sd x1, 0(x2) # save return address to stack
addi x5, x10, -1 # calculate (n - 1)
add x10, x5, x0 # set up (n-1) as argument to next function call
jal x1, fact # recursive function call
ld x1, 0(x2) # retrieve original return address
ld x5, 8(x2) # retrieve original n
addi x2, x2, 16 # shrink stack back
mul x10, x5, x10 # set up return value as (n * fact(n-1))
# mul available only in the M (multiply) extension
over:
jalr x0, 0(x1) # return from fact() to caller function
Procedure calls PERFORMANCE
Problem statement
After graduating, you are asked to become the lead computer designer at Intel.
Your study of usage of high-level language constructs suggests that procedure calls are one of the most expensive operations.
You have invented a scheme that reduces the loads and stores normally associated with procedure calls and returns.
The first thing you do is run some experiments with and without this optimization.
Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:
- The clock rate of the unoptimized version is 5% higher.
- Thirty percent of the instructions in the unoptimized version are loads or stores.
- The optimized version executes two-thirds as many loads and stores as the unoptimized version. For all other instructions the dynamic execution counts are unchanged.
- All instructions (including load and store) take one clock cycle.
Which is faster? Justify your decision quantitatively.
Solution
(The un-optimized clock rate is 5% higher, so taking its clock period as m, the optimized period is (21/20)m. Let the un-optimized instruction count be k.)
 | Optimized | Un-optimized |
---|---|---|
T | (21/20)m | m |
CPI | 1 | 1 |
IC | 0.7k + (2/3)(0.3k) = 0.9k | k |
L-S instructions executed | (2/3)(0.3k) = 0.2k | 0.3k |
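Then:
ET(un-optimized) = k × 1 × m = km
ET(optimized) = 0.9k × 1 × (21/20)m = 0.945 km
∴ The optimized version is faster; speedup = km / 0.945km ≈ 1.06.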
References
- Computer organization and design: The hardware-software interface (RISC-V Edition) by David A. Patterson, John L. Hennessy
- Computer Architecture: A Quantitative Approach, 2nd edition (or later), by John L. Hennessy, David A. Patterson
Doubts
Unaddressed
2.4: First commercial computer was a decimal computer?
Is li (load immediate) a RISC-V instruction? Mentioned in page 136 but not in the reference card.
Which RISC-V assembler is being talked about in section 2.12 (page 124)?
Any difference between li x5 100 and addi x5 x0 100 performance-wise? (though li is a pseudo-instruction)
Why is the 0(x23) needed even when the offset is zero? Is it mandatory for instructions like ld x6, 0(x5)?
Class: ALU does 64b, but then why are registers only 32b?
Datapaths
WAW
Why does it matter? Write after write means that the first write can simply be ignored, right?
-- agreed, it matters here (the second add also reads x1: RAW + WAW)
add x1, x2, x3
add x1, x1, x4
-- pure WAW here: the first write could seemingly be ignored
add x1, x2, x3
add x1, x4, x4
beq
I found this sentence under the section 'Reducing the delay of branches' in Chapter 4 ('The processor') in the Hennessy Patterson RISC-V book (page 309): 'Equality can be tested by XORing individual bit positions of two registers and ORing the XORed result.'
Why is an OR operation needed as well? Isn't XOR result enough?
Structural hazards not a problem in multiple issue designs?
In the textbook it is mentioned (page 322) that one of the 'two primary and distinct responsibilities must be dealt with in a multiple-issue pipeline' is 'dealing with data and control hazards'. Structural hazards are not mentioned there. Does that mean structural hazards are not much of a problem in multiple issue designs?
The other factor to be dealt with was 'packaging instructions into issue slots'. Does that part take care of structural hazards?
Branch delay
In the static-prediction - delayed branch slide (slide number 44 in pdf), how does the processor know that an instruction is a branch instruction before decoding it? Doesn't it need to wait till instruction decode stage to know what kind of an instruction it is? In the diagram, the independent instruction (that runs in the branch delay slot) starts right after the branch instruction's Instruction Fetch stage (together with the branch instruction's ID stage).
RISC-V extensions
- M: Multiply instructions
- F: Floating point
Memory hierarchy
p371: "SRAMs typically use 6-8 transistors per bit to prevent the information from being disturbed when read". Why this disturbance?
[INFORMAL] p373: harddisk: 'entire drive is permanently sealed to control the environment inside the drive': does that mean that if we open it ourselves, it will no longer function even if we put it back together?
p374: all disk heads will be on the same track. Why this restriction? Any advantage?
p375: does the speed at which the disks rotate have any bearing on its efficiency?
Difference between L1/L2/L3 caches. Is the technology different as well?
- What are caches usually made of anyway? SRAM?
Cache miss => instructing main memory for a read: is that the same as a complete load instruction (maybe without the instruction decode phase)?
p400: Set associativity for caches with content-addressable memories goes up to how much? Book says 8 and above.
Increase in block size of cache means — advantage: more cache hits; disadvantage: increase in miss penalty. Any other disadvantages, like more complex hardware maybe?
What's the reason for using $ sign to denote cache? Is it really because it sounds like 'cash' or is there another reason?
Regarding the section on VMs in textbook (p416): if we are running a single VM in our laptop, how does the hypervisor act? Does it manage both the host OS (our computer's os) and the guest OS (os used in the VM)? Or does the hypervisor play a greater role only when multiple guest VMs are running? Is hypervisor something that's part of the hardware or something that's software? What does enabling virtualization in the BIOS mean? Is it same as enabling hypervisor or something? Hypervisor sort of translates the instructions issued by the guest systems to that of host and maps the hardware to be used by the guests to actual hardware of the host, right?
Are the secondary storage in smart phones flash memory? Is it so even in the old Nokia phones (with a physical keypad and all)? Or did they start being used in phones only after smart phones were made? Is this the kind of memory used even in small devices which can store just a few numbers like a glucose meter or a similar device being able to remember the last few measurements?
p421: Virtual memory: RISC-V has 64 bit address but upper 16 bits are not used. Virtual address is 48bits. Why upper bits not used? (It later says that RISC-V also supports 39b and 57b virtual address spaces as well)
Virtual memory: Which component does the virtual to physical address translation? How does it manage if we expand the physical memory by replacing the RAM with one of a higher capacity? Range of physical addresses would change right? So the addressing mapping part would also need to change.
Virtual memory: 'A page fault to disk would result in millions of clock cycles to process'. Why would it need that many cycles? Does a typical harddisk access really take that long? And is an SSD faster than an HDD because the number of cycles needed for an SSD is smaller?
DONE p385: how many clock cycles (or what part of a clock cycle) would it take to service a cache miss? So the pipeline must stall? It's not stalling for just a few pipeline stages, but for entire clock cycles?
DONE p385: Cache miss okay. But what's a write-miss? What could be missed when writing?