Preliminaries
Combinational vs sequential circuits
Combinational | Sequential |
---|---|
has no internal state | has internal state (aka state elements) |
output depends only on current input | output depends on current input and state |
 | has state elements as inputs and outputs |
Clocking mechanism
- Defines when signals can be read and when they can be written.
- Avoids situations like a value being read and written at the same time.
- Needed to make the hardware behaviour predictable (and thus reliable).
- Edge-triggered clocking: state changes happen only on clock edges.
- ie, during the transition from low → high or high → low.
MIPS
- A RISC architecture with many similarities with RISC-V.
- Microprocessor without Interlocked Pipeline Stages (MIPS).
- Assembly instructions are written using a different notation (registers use the $ sign).
- Like RISC-V, has optional extensions.
- Development ceased by 2021, with the MIPS company announcing (March 2021) a transition to RISC-V.
RISC
- All instructions are of same length.
- But that means different instruction formats (distinguished implicitly by opcode ranges)
- R-type (register type)
- I-type (immediate type)
- S-type (store type)
RISC-V
- Reduced Instruction Set Computer; the 'V' marks it as the fifth RISC ISA from UC Berkeley.
- UC Berkeley from 2010 onwards
- word: 32 bits
- doubleword: 64 bits
- Number of general purpose registers: 32
- Size of registers: 64 bits
- Register names: starts with 'x' (eg: x0, x1, x26, x31, etc)
- Little endian (ie, the least significant byte comes first; 'indexing' yields the least significant part).
- Memory addresses refer to byte addresses.
- Sequential doubleword addresses differ by 8 instead of 1.
- Stack pointer always points to the last location that was used (not the next one to be used).
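For instance, pushing and popping one doubleword follows directly from this convention (a sketch; register x5 is arbitrary):
addi sp, sp, -8 # grow the stack downwards by one doubleword
sd x5, 0(sp) # sp now points to the last used location
ld x5, 0(sp) # pop: read the value back...
addi sp, sp, 8 # ...and shrink the stack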
Design principles
- Simplicity favours regularity.
- Good design demands good compromises.
Instructions
- Arithmetic instructions operate on registers (eg, two source registers plus a destination).
- Data transfer (load/store) instructions read/write memory using a register but cannot operate on the data.
- Corresponding formats: R-type, I-type.
A list of instructions
- `ld` instruction: load doubleword
- `sd` instruction: store doubleword
- `srai`: shift right arithmetic immediate
- `beq rs1, rs2, L1`: branch if equal
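A minimal usage sketch of these instructions (registers and the label are arbitrary):
ld x5, 0(x10) # load doubleword from the address in x10
srai x5, x5, 2 # shift right arithmetic by 2 (sign bit preserved)
sd x5, 8(x10) # store doubleword at address x10 + 8
beq x5, x0, L1 # branch to L1 if x5 == 0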
RISC-V fields
funct7 | rs2 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7b | 5b | 5b | 3b | 5b | 7b |
- opcode: indicates the instruction
- rd: destination register. Gets result of operation
- funct3: additional opcode field
- funct7: additional opcode field
- rs1: first source register
- rs2: second source register
Register addressing
- Immediate
- Indexed
Term | Number of bits |
---|---|
byte | 8 |
half word | 16 |
word | 32 |
double word | 64 |
Registers
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R | SP | GP | TP | T | T | T | S | S | P | P | P | P | P | P |
x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 |
P | P | S | S | S | S | S | S | S | S | S | S | T | T | T | T |
(R = return address, SP = stack pointer, GP = global pointer, TP = thread pointer, T = temporary, S = saved, P = parameter/argument)
Each register is a 64-bit register.
There are 32 general purpose registers:
x0 to x31.
x0 :: hardwired to be always zero.
- Useful for initializing variables. Like:
  - add x5 x0 x0 : initialize x5 to 0
  - addi x5 x0 2 : initialize x5 to 2
- Useful for copying the value in one register to another.
  - add x5 x7 x0 : copy value in x7 to x5
- Any attempt to change the value in x0 is simply ignored.
RISC-V calling conventions for registers
Reference: https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf
- x10 to x17 : used for parameter passing
- x1 : return address for procedures (textbook page 102)
- x5 to x7 and x28 to x31 : temporary registers (not preserved by the callee)
- x8 to x9 and x18 to x27 : saved registers (the callee, ie the called fn, maintains their original values when returning)
Register | ABI Name | Description | Saver |
---|---|---|---|
x0 | zero | Hard-wired zero | - |
x1 | ra | Return address | Caller |
x2 | sp | Stack pointer | Callee |
x3 | gp | Global pointer | - |
x4 | tp | Thread pointer | - |
x5-7 | t0-t2 | Temporaries | Caller |
x8 | s0/fp | Saved register/frame pointer | Callee |
x9 | s1 | Saved register | Callee |
x10-11 | a0-a1 | Function arguments/return values | Caller |
x12-17 | a2-a7 | Function arguments | Caller |
x18-27 | s2-s11 | Saved registers | Callee |
x28-31 | t3-t6 | Temporaries | Caller |
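A minimal sketch of the convention in action (the callee g is hypothetical): a callee that uses an s-register must save and restore it, and must also save ra if it makes a call of its own:
f:
addi sp, sp, -16 # make room for two doublewords
sd ra, 8(sp) # save return address (f calls g below)
sd s1, 0(sp) # callee-saved register: save before use
add s1, a0, x0 # keep the argument safe across the call
jal ra, g # call g; g may clobber t- and a-registers
add a0, s1, a0 # return value = original argument + g's result
ld s1, 0(sp) # restore callee-saved register
ld ra, 8(sp) # restore return address
addi sp, sp, 16 # pop the frame
jalr x0, 0(ra) # return to caller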
Instruction formats
- R-type: Arithmetic and logic operations
- I-type: Immediate
- S-type: Store
- SB-type: Branch
- U-type: upper immediates (lui, auipc)
- UJ-type: jal is the only instruction with this format
31 25 24 20 19 15 14 12 11 7 6 0
┌────┬────────────────────┬─────┬────────┬────────┬───────────────────┬────────┐
│ R │ funct7 │ rs2 │ rs1 │ funct3 │ rd │ opcode │
├────┼────────────────────┴─────┼────────┼────────┼───────────────────┼────────┤
│ I │ immediate[11:0] │ rs1 │ funct3 │ rd │ opcode │
├────┼────────────────────┬─────┼────────┼────────┼───────────────────┼────────┤
│ S │ immediate[11:5] │ rs2 │ rs1 │ funct3 │ immediate[4:0] │ opcode │
├────┼────────────────────┼─────┼────────┼────────┼───────────────────┼────────┤
│ SB │ immediate[12,10:5] │ rs2 │ rs1 │ funct3 │ immediate[4:1,11] │ opcode │
├────┼────────────────────┴─────┴────────┴────────┼───────────────────┼────────┤
│ U │ immediate[31:12] │ rd │ opcode │
├────┼────────────────────────────────────────────┼───────────────────┼────────┤
│ UJ │ immediate[20,10:1,11,19:12] │ rd │ opcode │
└────┴────────────────────────────────────────────┴───────────────────┴────────┘
immediate[0] is implicitly 0 in SB-type instructions, so branching is only possible to even addresses.
Note :: Memory operands appear only in load and store instructions in RISC-V
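As a worked example (standard RV32I encoding: add has opcode 0110011, funct3 000, funct7 0000000), add x5, x6, x7 assembles to:
funct7 | rs2 (x7) | rs1 (x6) | funct3 | rd (x5) | opcode |
---|---|---|---|---|---|
0000000 | 00111 | 00110 | 000 | 00101 | 0110011 |
ie, the 32-bit word 0x007302B3.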
I-type instruction format
immediate | rs1 | funct3 | rd | opcode |
---|---|---|---|---|
12b | 5b | 3b | 5b | 7b |
R-type instruction format
funct7 | rs2 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7b | 5b | 5b | 3b | 5b | 7b |
S-type instruction format
immediate[11:5] | rs2 | rs1 | funct3 | immediate[4:0] | opcode |
---|---|---|---|---|---|
7b | 5b | 5b | 3b | 5b | 7b |
Instructions
RV32I instructions (v2.2): https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf https://mark.theis.site/riscv/
Notes
- Instructions: Words in the computer's language
- ISA (Instruction Set Architecture): Vocabulary of the computer's language
Goal of this language: Make it easy to build hardware and compiler while maximizing performance and minimising cost and energy
- Spilling registers: process of putting less frequently used data from registers into memory.
- Leaf procedure: A procedure that doesn't call other procedures.
- Attention must be paid when using new registers for non-leaf procedures. Conflict over the use of register may arise between the procedures called by the non-leaf procedure.
- Recursive functions are non-leaf procedures.
- Memory addresses in RISC-V refer to byte addresses.
- Pseudo instructions: instructions that need not be available in the hardware but are still offered by the assembler for convenience.
  - Eg: li instruction (load immediate), which the assembler expands (eg, into lui/addi).
- Unconditional jump can be done with: jal x0, Label
- Calling a function can be done with: jal x1, fn_name, where x1 gets the return address.
- Returning from a function can be done with: jalr x0, 0(x1).
Usual memory layout
As per commonly followed conventions, stack grows 'downwards' and heap grows 'upwards'.
| |
+-------------+
SP → | | 0000 003f ffff fff0
| Stack |
|-------------|
| ↓ |
| ↓ |
| . |
| ↑ |
| ↑ |
|-------------|
| | 0000 0000 1000 0000
| Heap |
+-------------+
| Static data |
+-------------+
| |
PC → | Text | 0000 0000 0040 0000
+-------------+
| Reserved |
+-------------+ 0000 0000 0000 0000
Immediate values
Immediate values for instructions are limited to 12-bit values. If we need a 32-bit value, we can use a combination of lui and addi.
- Use lui to load the upper 20 bits (bit 31 to bit 12).
- Then add the remaining 12 bits to it with the usual addi.
But a plain split works only when sign extension doesn't interfere: addi sign-extends its 12-bit immediate, so if bit 11 of the constant is 1, the value given to lui must be incremented by 1 to compensate.
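A sketch of both cases (register x5 arbitrary):
# loading 0x12345678: lower 12 bits (0x678) < 0x800, no correction needed
lui x5, 0x12345 # x5 = 0x12345000
addi x5, x5, 0x678 # x5 = 0x12345678
# loading 0x12345FFF: the lower 12 bits (0xFFF) sign-extend to -1,
# so the upper immediate is incremented by 1 to compensate
lui x5, 0x12346 # x5 = 0x12346000
addi x5, x5, -1 # x5 = 0x12345FFF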
Data paths
Computational efficiency
Execution time (ET)
- aka response time.
- Time between the start and completion of an event.
ET = IC * CPI * T
where:
- IC: Instruction count
- CPI: Clock cycles per instruction
- T: Clock period (ie, 1/ν)
IC, CPI and T are not independent: we usually can't decrease one while keeping the other two the same, and a decrease in one might lead to an increase in the other two.
Ideally we would like all three to be minimized, but they trade off against each other.
Performance
Performance = (1 / ET)
Speedup
speedup = (performance with enhancement) / (performance without enhancement)
which, since performance = 1/ET, is the same as
speedup = (ET without enhancement) / (ET with enhancement)
Amdahl's law
- Expresses the law of diminishing returns.
Speedup via parallelism is limited by that component of the application which cannot be enhanced (ie, the sequential component).
If a fraction (1-E) of the execution time of an application cannot be enhanced for parallel implementation (E being the fraction that can be), then speedup is limited by a factor of 1 / (1-E).
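For example, if 90% of an application's execution time can be parallelized (E = 0.9), the speedup can never exceed 1 / (1 - 0.9) = 10, no matter how many processors are added.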
Power consumption
Datapath
- memory
- PC: address of next instr to execute
Single cycle datapath
- Everything an instruction needs to do is done within one clock cycle.
- Clock frequency will be low, as the clock period must be long enough for the slowest instruction. (bad)
- But CPI is 1. (good)
TODO Multi cycle datapath
Register file
- All 32 general purpose registers of the processor are stored in a structure called a register file.
- Any register in it can be accessed by specifying its number (register number).
- Register number input here is just 5 bits (2⁵ = 32) and is not 64 bits wide.
Program Counter (PC)
PC = PC + 4 (ie, 4 bytes = 32b)
Types of memories
Cache memories
Types of cache
- Direct mapped cache
  - Effectively a 1-way set-associative cache.
  - Least 'liberal': each block can go to exactly one location.
- Fully associative cache
  - Effectively a set-associative cache with a single set covering all blocks.
  - Blocks can go anywhere.
  - Most 'liberal'.
- Set-associative cache
- n in 'n-way cache' refers to number of blocks per set.
- Each block mapped to a set
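For example (sizes assumed): with 8 cache blocks, memory block address 29 must go to index 29 mod 8 = 5 in a direct-mapped cache; in a 2-way set-associative cache (4 sets) it maps to set 29 mod 4 = 1 and may occupy either block of that set; in a fully associative cache it may go into any of the 8 blocks.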
Handling writes
- Write-through :: always write to both the cache and main memory.
- Write-buffer :: queue of pending writes to main memory, so the processor needn't wait for them.
- Write-back :: write only to the cache; update main memory when the block is replaced.
Write through
- Keep the main memory and cache always up to date (consistent).
- Write all changes to the cache through to main memory as soon as they happen.
- ie, always write the data into both the main memory and the cache.
- Frequent main memory accesses => slower.
Flash memory
- An Electrically Erasable Programmable ROM (EEPROM).
- Non volatile.
Virtual memory
Simply put, main memory acting like a cache for secondary memory.
- manages caching between main memory and secondary memory.
main memory aka physical memory.
page :: a virtual memory block
page fault :: a virtual memory miss
Virtual memory => virtual address. Translated to a physical address.
- This translation = address mapping or address translation.
Address translation is managed by the MMU (Memory Management Unit), not the OS.
- The OS kicks in to handle page faults.
Mapping between virtual addresses and physical addresses is fully associative. ie, virtual pages can be placed anywhere in the main memory.
thrashing :: if the program is constantly swapping pages between main and secondary memories.
Advantages
- Allows a single program to expand its address space beyond the limits of the main memory.
- Allows sharing of main memory between different processes in a secure manner.
Virtual address
- Virtual page number
- Page offset
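For example (sizes assumed): with 48-bit virtual addresses and 4 KiB pages, the page offset is the lower 12 bits (2¹² = 4096) and the virtual page number is the upper 36 bits. Translation replaces the virtual page number with a physical page number; the page offset passes through unchanged.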
Translation Lookaside Buffer (TLB)
- Helps in translating virtual addresses to physical addresses.
- A special cache. Keeps track of recently used translations.
- Simply loads physical address and protection tags from the (last level) page table.
- Effectively a 'translation cache'.
- When a TLB is present, it is consulted before the page table.
- So TLB needs status bits like valid bit, dirty bit, etc.
- Needs status bits for:
- valid bit
- dirty bit
- reference bit :: useful when LRU page replacement algorithm is used. Helps to find entries which were recently referenced/used.
Handling TLB miss
Two possibilities if a TLB miss occurs:
- Page is in memory => Just add the 'translation' entry to TLB
- Page not in memory => page fault. Transfer control to OS.
- A TLB miss or page fault exception should be asserted by the end of the clock cycle in which the memory access occurs, so that the next cycle can handle it instead of fruitlessly continuing execution.
RISC-V has two registers for handling exceptions:
- SEPC
- Supervisor Exception Program Counter
- SCAUSE
- Supervisor exception cause
Pipelining
- Exploits parallelism between instructions in a sequential instruction stream.
- Improves instruction throughput.
- Doesn't decrease execution time of individual instructions.
- Pipelined datapath has multiple stages, each of which operate concurrently.
- If each stage takes the same time (ideal conditions), speedup = number of stages.
- Duration of a clock cycle must be at least the duration of the longest pipeline stage.
- Fundamentally invisible to the programmer (unlike in the case of multiprocessor systems).
- Needn't adjust the program to make use of a pipelined architecture.
- A pipelined processor that issues one instruction per cycle is still a scalar processor.
RISC-V pipeline stages
RISC-V instruction set was designed to be pipelined.
IF → ID → EX → MEM → WB
- IF = Instruction fetch
- ID = Instruction decode
- EX = Execute operation
- MEM = Data memory access
- WB = Write back results
RISC-V instructions classically take 5 steps:
- Fetch instruction from memory.
- Read registers and decode instruction.
- Execute operation or calculate address.
- Access operand in data memory if needed.
- Write results back into a register if needed.
ie, the datapath is separated into 5 pieces, with each corresponding to a pipeline stage.
Thankfully, in RISC-V:
- instructions are all of the same length (32b), which makes pipelining much easier.
- in fact, x86 processors internally translate their instructions into regular micro-operations similar to RISC-V instructions.
- only a few instruction formats.
- source and destination register fields are located in the same place in every instruction.
- only load/store instructions have operands in memory => the execute stage (stage 3) can calculate the memory address and the MEM stage (stage 4) can access it; operating on memory operands directly would demand extra stages.
Pipeline hazards
Situations where the next instruction in the pipeline cannot execute in the next clock cycle.
Hazards may be classified as:
- Structural hazards
- Data hazards
- Control hazards
Structural hazards
- A planned instruction cannot execute in the proper clock cycle because the hardware doesn't support the combination of instructions that are set to execute in that cycle.
- eg: A functional unit that is needed is not available.
Data hazards
- Pipeline is stalled because one step of an instruction cannot execute until a step of a previous instruction finishes.
- due to dependence of an instruction on an earlier instruction that is still in the pipeline.
-- a RAW dependency:
-- add won't produce its output till stage 5 (WB),
-- whereas sub needs x5 when it reads registers in stage 2 (ID).
-- Without forwarding, 3 clock cycles must be spent waiting.
add x5, x6, x7
sub x8, x5, x8
-- I guess it would look something like
--
-- IF ID EX MEM WB
--          b  b  b IF ID EX MEM WB
--
-- where b denotes a bubble/stall
Forwarding (aka bypassing)
- A way to get around data hazards.
- A bypass by which result can be given to next instruction as soon as the result is formed.
- Extra hardware involved.
add x1, x2, x3
sub x4, x1, x5
add: IF ID EX MEM WB
            ↓ (result forwarded straight out of EX)
sub:    IF ID EX MEM WB
Data dependencies
RAW (Read After Write)
- True dependence aka flow dependence
- next instruction cannot execute till previous instruction has produced results.
- Register value is being read to get the value written to it by the previous instruction.
add x5, x6, x7 -- new value written to x5
sub x1, x5, x2 -- new value of x5 is being used
WAW (Write After Write)
- Output dependence
- next instruction cannot write its output till a previous instruction has written its output.
WAR (Write After Read)
- Anti-dependence.
- Next instruction cannot write its results till a previous instruction has read its input.
Control hazards
- aka branch hazard.
- arises from the need to make a decision based on the results of one instruction while others are executing.
Branch prediction
- A way to get around control hazards.
- May be classified into:
- Static branch prediction
- Dynamic branch prediction
- Disadvantage: a wrong prediction means the work done along the wrong path must be rolled back.
- This problem is particularly relevant in the case of long pipelines.
Static branch prediction | Dynamic branch prediction |
---|---|
Rigid | Flexible |
Prediction doesn't change | Prediction can change |
Doesn't account for the individuality of specific instructions | Guesses based on past behaviour of each instruction |
Can be done by the compiler | Done by hardware |
Delayed branch
- Another way to handle control hazards.
- Delayed decision.
- Execution of the part after a branch is delayed till the result of branching instruction is known.
- Meanwhile, an independent instruction will be executed (in delay slot).
- ie, next sequential execution is always executed. During this time, result of the branching instruction may be calculated.
- Effectively, the branching occurs one instruction after the branch instruction.
- The assembler/compiler arranges it this way; the programmer doesn't have to worry about it. ie, this is static.
- MIPS does this.
Multiple issue
- Replicate internal components of the computer so that multiple instructions can be launched/issued.
- Another way (apart from simple pipelining) to increase parallelism.
- Launch multiple instructions in every pipeline-stage.
- There will be multiple sets of the pipeline. A replication of hardware, so that multiple 'pipelines' are being run simultaneously.
- Using this, it is possible to make CPI < 1
- ie, instruction execution rate can exceed the clock rate.
- Another metric, IPC (Instruction per clock cycle) instead of CPI, can also be used.
- Even moderate designs aim for a peak IPC of 2 (see the example after this list).
- Classification of multiple issue designs commonly made based on the division of work between:
- compiler (ie, static)
- hardware (ie, dynamic)
- Eg: Static multiple issue and dynamic multiple issue
- Dynamic multiple issue processors are known as Super-scalar processors aka superscalars.
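For example, a 2 GHz processor sustaining an IPC of 2 executes 4 billion instructions per second, ie twice its clock rate.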
Types of multiple issue processors
- Many types of multiple issue processors are there.
- Differs in terms of the division of work between the compiler and processor
- Decisions made statically, by the compiler itself: static multiple issue.
- Decisions made dynamically, at run-time: dynamic multiple issue.
- Usually practical multiple issue processors are a blend of static and dynamic.
Steps done by compiler | Remaining done by hardware (→ design type) |
---|---|
code generation | Superscalar |
+ instruction grouping | EPIC |
+ FU assignment | Dynamic VLIW |
+ initiation timing | VLIW |
Each row gives the compiler one more step; whatever remains is done by hardware.
Static multi issue processors
- Compiler usually does something to handle control and data hazards as well.
- Like code rescheduling to reduce hazards, and static branch prediction.
Dynamic multi issue processors
- Dynamic multi issue processors are often termed superscalars.
Dynamic pipeline scheduling
- In the context of dynamic multi issue processors.
- Commit unit :: decides when it is safe to write results to the register file or memory, committing instructions in program order.
- Reorder buffer :: Buffer in the commit unit that stores the result until it is safe to write to memory or register file.
- Also provides operands needed by other instructions (data) dependent on already executed but uncommitted instructions.
- Reservation stations :: buffer associated with a FU.
- Holds operands and operator.
- As soon as all operands and operator are available and FU is ready to execute, result is calculated.
Factors to consider in multiple-issue designs
- Packaging instructions into issue slots.
- Dealing with data and control hazards.
Speculative execution
- Compiler or processor 'guesses' about the properties of an instruction.
Out-of-order execution
- Instructions can be executed in a different order than the one in which they were fetched.
- Processor executes instructions in an order that preserves the data flow order of the program.
In-order issue → Out-of-order execution → In-order commit
Ways to improve pipeline efficiency
Loop unrolling
- A technique employable by the compiler to enhance possibility of parallelism.
- Body of loop is 'unrolled' to have more ILP (instruction level parallelism) available by overlapping instructions from different iterations.
- Disadvantage :: More register resources needed.
- More register pressure.
- Increase in concurrency demands more resources.
// Following loop:
for(int i=0; i<16; ++i) {
c[i] = a[i] + b[i];
}
// could be unrolled once to make it
for(int i=0; i<16; i+=2) {
  c[i] = a[i] + b[i];
  c[i+1] = a[i+1] + b[i+1];
}
Register renaming
- Eliminates dependencies that aren't true (RAW) dependencies, ie WAR and WAW name dependencies.
- For instructions that can be made independent if different register was used.
- Because they just look like they are dependent but aren't actually so.
- This is because of name dependence or antidependence.
- No data values flow between the antidependent instructions.
- Because the 'dependence' is forced purely by the reuse of a name.
- Register renaming eliminates name dependencies while preserving true dependencies.
- Disadvantage :: More register resources needed.
-- WAR data dependency here
add x4, x2, x1
and x1, x2, x0
-- circumvented by renaming reg x1 → x3 in the `and` instruction
add x4, x2, x1
and x3, x2, x0
-- WAW data dependency here
add x9, x2, x1
and x9, x3, x1
-- circumvented by renaming reg x9 → x5 in the `and` instruction
add x9, x2, x1
and x5, x3, x1
Instruction reordering
ld x1, 0(x0)
ld x2, 4(x0)
add x3, x1, x2
sd x3, 12(x0)
ld x4, 8(x0)
add x5, x1, x4
sd x5, 16(x0)
could be made
ld x1, 0(x0)
ld x2, 4(x0)
ld x4, 8(x0) -- reordered this instr
add x3, x1, x2
sd x3, 12(x0)
add x5, x1, x4
sd x5, 16(x0)
-- ie, here we just grouped the loads together and
-- the (data) hazard disappeared.
New terms
Latency of an instruction
Essentially time taken to finish executing that instruction. ie, its execution time.
nops
- An instruction that does nothing. Just filler.
- No operation.
- Does not change state.
- Acts essentially like a stall/bubble in the pipeline.
- Not counted while calculating CPI or IPC as it plays no role in enhancing performance.
PC relative addressing
- Addressing scheme where: address = PC + constant
- RISC-V uses PC-relative addressing for both conditional branches and jumps as the destination instruction tends to be located close by.
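A minimal sketch: the encoded offset is relative to the branch's own address, so a nearby target needs only a small immediate and the code works wherever it is loaded:
loop:
addi x5, x5, -1 # decrement counter
bne x5, x0, loop # target = PC + encoded offset (negative here)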
EPIC
Explicitly Parallel Instruction Computing
VLIW
- Very Long Instruction Word
- An ISA that launches many operations (that are defined to be independent) in a single 'wide' instruction.
- Multiple (independent) instructions are bundled together into a long instruction.
Examples
Execution time PERFORMANCE
Problem statement
Consider a program comprising 100 instructions which runs on two different machines 𝑀1 and 𝑀2. Each instruction takes 2 clock cycles on 𝑀1 and 1 clock cycle on 𝑀2. The clock periods of 𝑀1 and 𝑀2 are 20 nanoseconds (ns) and 30ns, respectively. Find the execution times of the program on both machines. Which machine has better performance?
Solution
ET = IC * CPI * T
IC | CPI | T | ET | |
---|---|---|---|---|
M1 | 100 | 2 | 20ns | 4000ns |
M2 | 100 | 1 | 30ns | 3000ns |
∴ M2 is faster.
Non-leaf procedures: Max element of array ASSEMBLY
Problem
Write a non-leaf RISC-V procedure, "Maximum" to find the maximum number from an array of integers. The address of the first element of the array is stored in register X5 and the size of the array, ‘ASize’ is stored in register X6. "Maximum" calls a leaf procedure ‘max-2’. The function ‘max-2’ takes two values from the registers X10 and X11 and returns the greater value through register X10. Eventually, the maximum value in the array is stored in the memory location 0x3345A204. Kindly state your assumptions clearly, for instance specify register that will hold index or temporary value.
RISC-V assembly
# max-2
max2:
bge x10, x11, max2end
add x10, x11, x0
max2end:
jalr x0, 0(x1)
# Maximum
maximum:
# Initialize max and i
ld x30, 0(x5) # max = arr[0]
addi x7, x0, 1 # i=1
loop:
# Find offset
slli x29, x7, 3 # offset=i*8
# Find address of arr[i]
add x29, x29, x5 # offset = offset + &arr[0]
# Load arr[i]
ld x10, 0(x29) # cur = arr[i]
# push return address to stack
addi sp, sp, -8
sd x1, 0(sp)
# Set up second argument and call max2
add x11, x30, x0 # pass current max via x11
jal x1, max2
# pop original return address from stack
ld x1, 0(sp)
addi sp, sp, 8
# Copy return value
add x30, x10, x0 # max = max2(max, cur)
# Increment i
addi x7, x7, 1 # i=i+1
# Loop exit condition
bne x7, x6, loop
# Load storage address 0x3345A204
lui x31, 0x3345A # upper 20 bits
addi x31, x31, 0x204 # lower 12 bits (0x204 < 0x800: no sign-extension correction needed)
# Store result
sd x30, 0(x31)
Non-leaf procedures: Factorial ASSEMBLY
C program
long long int fact (long long int n) {
  if (n < 1) {
    return 1;
  } else {
    return n * fact(n - 1);
  }
}
RISC-V assembly
Own attempt. May be wrong.
fact:
bgt x10, x0, recurse # non-base case: n > 0
addi x10, x0, 1 # base case: return 1
jal x0, over # skip the recursive part
recurse:
addi x2, x2, -16 # expand stack (x2 is SP)
sd x10, 8(x2) # save n to stack
sd x1, 0(x2) # save return address to stack
addi x5, x10, -1 # calculate (n - 1)
add x10, x5, x0 # set up (n-1) as argument to next function call
jal x1, fact # recursive function call
ld x1, 0(x2) # retrieve original return address
ld x5, 8(x2) # retrieve original n
addi x2, x2, 16 # shrink stack back
mul x10, x5, x10 # set up return value as (n * fact(n-1))
# mul available only in the M (multiply) extension
over:
jalr x0, 0(x1) # return from fact() to caller function
Procedure calls PERFORMANCE
Problem statement
After graduating, you are asked to become the lead computer designer at Intel.
Your study of usage of high-level language constructs suggests that procedure calls are one of the most expensive operations.
You have invented a scheme that reduces the loads and stores normally associated with procedure calls and returns.
The first thing you do is run some experiments with and without this optimization.
Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:
- The clock rate of the unoptimized version is 5% higher.
- Thirty percent of the instructions in the unoptimized version are loads or stores.
- The optimized version executes two-thirds as many loads and stores as the unoptimized version. For all other instructions the dynamic execution counts are unchanged.
- All instructions (including load and store) take one clock cycle.
Which is faster? Justify your decision quantitatively.
Solution
(The un-optimized clock rate is 5% higher, so taking its clock period as m, the optimized period is (21/20)m. Let the un-optimized instruction count be k.)
 | Optimized | Un-optimized |
---|---|---|
T | (21/20)m | m |
CPI | 1 | 1 |
IC | 0.7k + (2/3)(0.3k) = 0.9k | k |
L-S instructions executed | (2/3)(0.3k) = 0.2k | 0.3k |
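Then:
ET(un-optimized) = k × 1 × m = km
ET(optimized) = 0.9k × 1 × (21/20)m = 0.945 km
∴ The optimized version is faster; speedup = km / 0.945km ≈ 1.06.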
References
- Computer organization and design: The hardware-software interface (RISC-V Edition) by David A. Patterson, John L. Hennessy
- Computer Architecture: A Quantitative Approach, 2nd edition (or later), by John L. Hennessy, David A. Patterson
Doubts
Unaddressed
2.4: First commercial computer was a decimal computer?
Is li (load immediate) a RISC-V instruction? Mentioned in page 136 but not in the reference card.
Which RISC-V assembler is being talked about in section 2.12 (page 124)?
Any difference between li x5 100 and addi x5 x0 100 performance-wise? (though li is a pseudo-instruction)
Why is the 0(x23) needed even when the offset is zero? Is it mandatory for instructions like ld x6, 0(x5)?
Class: ALU does 64b, but then why are registers only 32b?
Datapaths
WAW
Why does it matter? Write after write means that the first write can simply be ignored, right?
-- agreed, it matters here (the second add also reads x1: RAW + WAW)
add x1, x2, x3
add x1, x1, x4
-- pure WAW here: the first write could seemingly be ignored
add x1, x2, x3
add x1, x4, x4
beq
I found this sentence under the section 'Reducing the delay of branches' in Chapter 4 ('The processor') in the Hennessy Patterson RISC-V book (page 309): 'Equality can be tested by XORing individual bit positions of two registers and ORing the XORed result.'
Why is an OR operation needed as well? Isn't XOR result enough?
Structural hazards not a problem in multiple issue designs?
In the textbook it is mentioned (page 322) that one of the 'two primary and distinct responsibilities must be dealt with in a multiple-issue pipeline' is 'dealing with data and control hazards'. Structural hazards are not mentioned there. Does that mean structural hazards are not much of a problem in multiple issue designs?
The other factor to be dealt with was 'packaging instructions into issue slots'. Does that part take care of structural hazards?
Branch delay
In the static-prediction - delayed branch slide (slide number 44 in pdf), how does the processor know that an instruction is a branch instruction before decoding it? Doesn't it need to wait till instruction decode stage to know what kind of an instruction it is? In the diagram, the independent instruction (that runs in the branch delay slot) starts right after the branch instruction's Instruction Fetch stage (together with the branch instruction's ID stage).
RISC-V extensions
- M: Multiply instructions
- F: Floating point
Memory hierarchy
p371: "SRAMs typically use 6-8 transistors per bit to prevent the information from being disturbed when read". Why this disturbance?
[INFORMAL] p373: harddisk: 'entire drive is permanently sealed to control the environment inside the drive': does that mean that if we open it ourselves, it will no longer function even if we put it back together?
p374: all disk heads will be on the same track. Why this restriction? Any advantage?
p375: does the speed at which the disks rotate have any bearing on its efficiency?
Difference between L1/L2/L3 caches. Is the technology different as well?
- What are caches usually made of anyway? SRAM?
Cache miss => instructing main memory for a read: is that the same as a complete load instruction (maybe without the instruction decode phase)?
p400: Set associativity for caches with content-addressable memories goes up to how much? Book says 8 and above.
Increase in block size of cache means — advantage: more cache hits; disadvantage: increase in miss penalty. Any other disadvantages, like more complex hardware maybe?
What's the reason for using $ sign to denote cache? Is it really because it sounds like 'cash' or is there another reason?
Regarding the section on VMs in textbook (p416): if we are running a single VM in our laptop, how does the hypervisor act? Does it manage both the host OS (our computer's os) and the guest OS (os used in the VM)? Or does the hypervisor play a greater role only when multiple guest VMs are running? Is hypervisor something that's part of the hardware or something that's software? What does enabling virtualization in the BIOS mean? Is it same as enabling hypervisor or something? Hypervisor sort of translates the instructions issued by the guest systems to that of host and maps the hardware to be used by the guests to actual hardware of the host, right?
Are the secondary storage in smart phones flash memory? Is it so even in the old Nokia phones (with a physical keypad and all)? Or did they start being used in phones only after smart phones were made? Is this the kind of memory used even in small devices which can store just a few numbers like a glucose meter or a similar device being able to remember the last few measurements?
p421: Virtual memory: RISC-V has 64 bit address but upper 16 bits are not used. Virtual address is 48bits. Why upper bits not used? (It later says that RISC-V also supports 39b and 57b virtual address spaces as well)
Virtual memory: Which component does the virtual to physical address translation? How does it manage if we expand the physical memory by replacing the RAM with one of a higher capacity? Range of physical addresses would change right? So the addressing mapping part would also need to change.
Virtual memory: 'A page fault to disk would result in millions of clock cycles to process'. Why would it need that many cycles? Does a typical harddisk access really take that long? And is an SSD faster than an HDD because the number of cycles needed for an SSD is smaller?
DONE p385: how many clock cycles (or what part of a clock cycle) would it take to service a cache miss? So the pipeline must stall? It's not stalling for just a few pipeline stages, but for entire clock cycles?
DONE p385: Cache miss okay. But what's a write-miss? What could be missed when writing?