Combinational | Sequential |
---|---|
has no internal state | has internal state |
output depends only on current input | output depends on current input and state |
has state elements as input and output | aka state elements |
funct7 | rs2 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7b | 5b | 5b | 3b | 5b | 7b |
Term | Number of bits |
---|---|
byte | 8 |
half word | 16 |
word | 32 |
double word | 64 |
x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | R | SP | GP | TP | T | T | T | S | S | P | P | P | P | P | P |
x16 | x17 | x18 | x19 | x20 | x21 | x22 | x23 | x24 | x25 | x26 | x27 | x28 | x29 | x30 | x31 |
P | P | S | S | S | S | S | S | S | S | S | S | T | T | T | T |
Each register is 64 bits wide. There are 32 general-purpose registers: x0 to x31.

x0 :: hardwired to always be zero.
- add x5, x0, x0 :: initialize x5 to 0
- addi x5, x0, 2 :: initialize x5 to 2
- add x5, x7, x0 :: copy value in x7 to x5

Writes to x0 would simply be ignored.

Reference: https://riscv.org/wp-content/uploads/2015/01/riscv-calling.pdf
- x10 to x17 :: used for parameter passing
- x1 :: return address for procedures [1]
- x5 to x7 and x28 to x31 :: temporary registers (not preserved by the callee)
- x8 to x9 and x18 to x27 :: saved registers (the callee, ie the called fn, maintains the original value when returning)

Register | ABI Name | Description | Saver |
---|---|---|---|
x0 | zero | Hard-wired zero | - |
x1 | ra | Return address | Caller |
x2 | sp | Stack pointer | Callee |
x3 | gp | Global pointer | - |
x4 | tp | Thread pointer | - |
x5-7 | t0-t2 | Temporaries | Caller |
x8 | s0/fp | Saved register/frame pointer | Callee |
x9 | s1 | Saved register | Callee |
x10-11 | a0-a1 | Function arguments/return values | Caller |
x12-17 | a2-a7 | Function arguments | Caller |
x18-27 | s2-s11 | Saved registers | Callee |
x28-31 | t3-t6 | Temporaries | Caller |
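A minimal sketch of what "Saver = Callee" means in practice (my own illustration, not from the reference card): a function that wants to use a callee-saved register like x9 (s1) must preserve the caller's value around its own use of it.

fn:
addi sp, sp, -8 # make room on the stack
sd x9, 0(sp) # callee saves s1 before clobbering it
addi x9, x0, 42 # fn is now free to use s1
ld x9, 0(sp) # restore the caller's s1 before returning
addi sp, sp, 8 # release the stack slot
jalr x0, 0(x1) # return (x1/ra holds the return address)

Caller-saved registers (temporaries and arguments) are the mirror image: if the caller still needs their values after a call, the caller itself must save them.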
jal is the only instruction with this (UJ) format.

      31                25 24 20 19    15 14    12 11                7 6      0
┌────┬────────────────────┬─────┬────────┬────────┬───────────────────┬────────┐
│ R │ funct7 │ rs2 │ rs1 │ funct3 │ rd │ opcode │
├────┼────────────────────┴─────┼────────┼────────┼───────────────────┼────────┤
│ I │ immediate[11:0] │ rs1 │ funct3 │ rd │ opcode │
├────┼────────────────────┬─────┼────────┼────────┼───────────────────┼────────┤
│ S │ immediate[11:5] │ rs2 │ rs1 │ funct3 │ immediate[4:0] │ opcode │
├────┼────────────────────┼─────┼────────┼────────┼───────────────────┼────────┤
│ SB │ immediate[12,10:5] │ rs2 │ rs1 │ funct3 │ immediate[4:1,11] │ opcode │
├────┼────────────────────┴─────┴────────┴────────┼───────────────────┼────────┤
│ U │ immediate[31:12] │ rd │ opcode │
├────┼────────────────────────────────────────────┼───────────────────┼────────┤
│ UJ │ immediate[20,10:1,11,19:12] │ rd │ opcode │
└────┴────────────────────────────────────────────┴───────────────────┴────────┘
immediate[0] is implicitly 0 in B-type (SB) instructions, so branching is only possible to even addresses.
Note :: Memory operands appear only in load and store instructions in RISC-V
I-format:

immediate[11:0] | rs1 | funct3 | rd | opcode |
---|---|---|---|---|
12b | 5b | 3b | 5b | 7b |

R-format:

funct7 | rs2 | rs1 | funct3 | rd | opcode |
---|---|---|---|---|---|
7b | 5b | 5b | 3b | 5b | 7b |

S-format:

immediate[11:5] | rs2 | rs1 | funct3 | immediate[4:0] | opcode |
---|---|---|---|---|---|
7b | 5b | 5b | 3b | 5b | 7b |
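As a worked example (my own, not from the spec pages linked below), the R-format fields for add x5, x6, x7:

# funct7  rs2   rs1   funct3 rd    opcode
# 0000000 00111 00110 000    00101 0110011
#         (x7)  (x6)         (x5)
# => 0000 0000 0111 0011 0000 0010 1011 0011 = 0x007302B3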
RV32I instructions (v2.2): https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf https://mark.theis.site/riscv/
Goal of the RISC-V ISA: make it easy to build the hardware and the compiler while maximizing performance and minimizing cost and energy.
- li instruction (load immediate) (in lieu of lui?)
- jal x0, Label :: unconditional jump
- jal x1, fn_name :: procedure call, where x1 gets the return address
- return from a procedure with jalr (see the sketch below)
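A minimal call/return sketch (my own illustration; my_fn is a hypothetical label):

addi x10, x0, 5 # argument: x10 (a0) = 5
jal x1, my_fn # call: x1 (ra) <- PC+4, then jump to my_fn
# execution resumes here after my_fn returns, with the result in x10

my_fn:
addi x10, x10, 1 # hypothetical body: return x10 + 1
jalr x0, 0(x1) # return: jump to the address held in x1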
As per commonly followed conventions, the stack grows 'downwards' and the heap grows 'upwards':
| |
+-------------+
SP → | | 0000 003f ffff fff0
| Stack |
|-------------|
| ↓ |
| ↓ |
| . |
| ↑ |
| ↑ |
|-------------|
| | 0000 0000 1000 0000
| Heap |
+-------------+
| Static data |
+-------------+
| |
PC → | Text | 0000 0000 0040 0000
+-------------+
| Reserved |
+-------------+ 0000 0000 0000 0000
Immediate values in instructions are limited to 12 bits. If we need a 32-bit value, we can use a combination of lui and addi:

- lui to load the upper 20 bits (bit 31 to bit 12)
- addi to add the lower 12 bits

But this works straightforwardly only when the lower 12 bits represent a positive number, as addi sign-extends its immediate.
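A worked example (the constants are my own, not from the book):

# Build 0x12345678 in x5 (lower 12 bits are 0x678 < 0x800, so no sign issue):
lui x5, 0x12345 # x5 = 0x12345000
addi x5, x5, 0x678 # x5 = 0x12345678

# Build 0x12345FFF in x6 (lower 12 bits are 0xFFF >= 0x800):
# addi sign-extends 0xFFF to -1, so compensate by adding 1 to the upper part.
lui x6, 0x12346 # upper 20 bits + 1, ie x6 = 0x12346000
addi x6, x6, -1 # x6 = 0x12346000 - 1 = 0x12345FFF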
ET = IC * CPI * T

where IC = instruction count, CPI = (average) clock cycles per instruction, and T = clock period.

IC, CPI and T are not independent of each other. We can't always decrease one while keeping the other two the same: a decrease in one might lead to an increase in the other two. Ideally we would like all three to be minimized, but that can't be done.
Performance = (1 / ET)

speedup = (performance with enhancement) / (performance without enhancement)

which is the same as

speedup = (ET without enhancement) / (ET with enhancement)
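For instance, using the M1/M2 machines worked out later in these notes: treating M2 as the "enhanced" machine, speedup = ET(M1) / ET(M2) = 4000ns / 3000ns ≈ 1.33.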
Speedup via parallelism is limited by that component of the application which cannot be enhanced (ie, the sequential component).
If a fraction (1-E) of the execution time of an application cannot be enhanced for parallel implementation, then speedup is limited to a factor of 1 / (1-E), no matter how much the remaining fraction E is sped up.
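A quick worked check (the numbers are my own): if E = 0.9 of the execution time can be parallelized, then even with infinitely many processors the speedup is at most 1 / (1 - 0.9) = 10. With 9 processors on the parallel part, speedup = 1 / ((1 - 0.9) + 0.9/9) = 1 / 0.2 = 5.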
PC = PC + 4 (ie, 4 bytes = 32b)
Simply put, virtual memory is main memory acting like a cache for secondary memory.

main memory aka physical memory.
page :: a virtual memory block
page fault :: a virtual memory miss
Virtual memory means virtual addresses, which get translated to physical addresses.

The translation is handled by the MMU (Memory Management Unit), not by the OS.
Mapping between virtual addresses and physical addresses is fully associative. ie, virtual pages can be placed anywhere in the main memory.
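A worked example of the address split (my own numbers, assuming 4 KiB pages and the 48-bit virtual addresses mentioned later): the page offset takes log2(4096) = 12 bits, leaving 48 - 12 = 36 bits for the virtual page number. Translation replaces the virtual page number with a physical page number; the 12-bit offset passes through unchanged.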
thrashing :: if the program is constantly swapping pages between main and secondary memories.
Two possibilities if a TLB miss occurs:

- the page is already present in main memory: only the missing translation needs to be brought into the TLB
- the page is not in main memory: that's a page fault, and the OS takes over
RISC-V has two registers for handling exceptions:

- SEPC :: holds the address of the instruction that caused the exception
- SCAUSE :: records the cause of the exception
RISC-V instruction set was designed to be pipelined.

A RISC-V instruction classically takes 5 steps:

IF → ID → EX → MEM → WB

- IF = Instruction fetch
- ID = Instruction decode
- EX = Execute operation
- MEM = Data memory access
- WB = Write back results

ie, the datapath is separated into 5 pieces, each corresponding to a pipeline stage.
Thankfully, in RISC-V:

- all instructions are the same length (32 bits)
- there are only a few instruction formats, and the source register fields sit in the same position in each
- memory operands appear only in load and store instructions

Hazards :: situations where the next instruction in the pipeline cannot execute in the next clock cycle.

Hazards may be classified as:

- structural hazards
- data hazards
- control (branch) hazards
-- a RAW dependency
-- because the add instruction won't produce its output till stage 5 (WB)
-- whereas sub needs that input at stage 3 (EX)
-- so 3 clock cycles have to be spent waiting.
add x5, x6, x7
sub x8, x5, x8
-- I guess it would look something like
--
-- IF ID EX MEM WB
-- b b b b IF ID EX MEM WB
--
-- where b denotes a bubble/stall
add x1, x2, x3
sub x4, x1, x5
add:  IF  ID  EX  MEM WB
              ↓
              └──→┐
                  ↓
sub:      IF  ID  EX  MEM WB
add x5, x6, x7 -- new value written to x5
sub x1, x5, x2 -- new value of x5 is being used
Static branch prediction | Dynamic branch prediction |
---|---|
Rigid | Flexible |
Prediction doesn't change | Prediction can change |
Doesn't account for individuality of specific instructions | Guesses based on past behaviour |
Can be done by the compiler | Done by hardware |
Steps done by the compiler | Remaining done by hardware |
---|---|
code generation | Superscalar |
↓ instr grouping | EPIC |
↓ FU assignment | Dynamic VLIW |
↓ initiation timing | VLIW |
In-order issue → Out-of-order execution → In-order commit
// The following loop:
for(int i=0; i<16; ++i) {
    c[i] = a[i] + b[i];
}
// could be unrolled once to make it
for(int i=0; i<16; i+=2) {
    c[i] = a[i] + b[i];
    c[i+1] = a[i+1] + b[i+1];
}
-- WAR data dependency here
add x4, x2, x1
and x1, x2, x0
-- circumvented by renaming reg x1 → x3 in the `and` instruction
add x4, x2, x1
and x3, x2, x0
-- WAW data dependency here
add x9, x2, x1
and x9, x3, x1
-- circumvented by renaming reg x9 → x5 in the `and` instruction
add x9, x2, x1
and x5, x3, x1
-- (offsets are multiples of 8 since ld/sd move doublewords)
ld x1, 0(x0)
ld x2, 8(x0)
add x3, x1, x2
sd x3, 24(x0)
ld x4, 16(x0)
add x5, x1, x4
sd x5, 32(x0)

could be made

ld x1, 0(x0)
ld x2, 8(x0)
ld x4, 16(x0) -- reordered this instr
add x3, x1, x2
sd x3, 24(x0)
add x5, x1, x4
sd x5, 32(x0)

-- ie, here we just grouped the loads together and
-- the (data) hazard disappeared.
latency (of an instruction) :: essentially the time taken to finish executing that instruction, ie its execution time.
EPIC :: Explicitly Parallel Instruction Computing
Consider a program comprising 100 instructions which runs on two different machines M1 and M2. Each instruction takes 2 clock cycles on M1 and 1 clock cycle on M2. The clock periods of M1 and M2 are 20 nanoseconds (ns) and 30 ns, respectively. Find the execution times of the program on both machines. Which machine has better performance?

ET = IC * CPI * T

 | IC | CPI | T | ET |
---|---|---|---|---|
M1 | 100 | 2 | 20 ns | 4000 ns |
M2 | 100 | 1 | 30 ns | 3000 ns |
∴ M2 is faster.
Write a non-leaf RISC-V procedure, "Maximum" to find the maximum number from an array of integers. The address of the first element of the array is stored in register X5 and the size of the array, ‘ASize’ is stored in register X6. "Maximum" calls a leaf procedure ‘max-2’. The function ‘max-2’ takes two values from the registers X10 and X11 and returns the greater value through register X10. Eventually, the maximum value in the array is stored in the memory location 0x3345A204. Kindly state your assumptions clearly, for instance specify register that will hold index or temporary value.
# max-2
max2:
bge x10, x11, max2end
add x10, x11, x0
max2end:
jalr x0, 0(x1)

# Maximum
maximum:
# Initialize max and i
ld x30, 0(x5) # max = arr[0]
addi x7, x0, 1 # i=1
loop:
# Find offset
slli x29, x7, 3 # offset=i*8
# Find address of arr[i]
add x29, x29, x5 # offset = offset + &arr[0]
# Load arr[i]
ld x10, 0(x29) # cur = arr[i]
# Set up max as the second argument
add x11, x30, x0
# push return address to stack
addi sp, sp, -8
sd x1, 0(sp)
# Call max2
jal x1, max2
# pop original return address from stack
ld x1, 0(sp)
addi sp, sp, 8
# Copy return value
add x30, x10, x0 # max = max2(cur, max)
# Increment i
addi x7, x7, 1 # i=i+1
# Loop exit condition
bne x7, x6, loop
# Load storage address 0x3345A204
lui x31, 0x3345A # x31 = 0x3345A000
addi x31, x31, 0x204 # x31 = 0x3345A204
# Store result
sd x30, 0(x31)
# Return to caller
jalr x0, 0(x1)
long long int fact (long long int n) {
    if (n < 1) {
        return 1;
    } else {
        return n * fact(n - 1);
    }
}
Own attempt. May be wrong.
fact:
bgt x10, x0, recurse # non-base case: n > 0
addi x10, x0, 1 # base case: return 1
jal x0, over # skip the recursive case
recurse:
addi x2, x2, -16 # expand stack (x2 is SP)
sd x10, 8(x2) # save n to stack (sd, not sw: n is 64 bits)
sd x1, 0(x2) # save return address to stack
addi x5, x10, -1 # calculate (n - 1)
add x10, x5, x0 # set up (n-1) as argument to next function call
jal x1, fact # recursive function call
ld x1, 0(x2) # retrieve original return address
ld x5, 8(x2) # retrieve original n
addi x2, x2, 16 # shrink stack back
mul x10, x5, x10 # set up return value as (n * fact(n-1))
# mul available only in RV64M multiply extension
over:
jalr x0, 0(x1) # return from fact() to caller function
After graduating, you are asked to become the lead computer designer at Intel.
Your study of usage of high-level language constructs suggests that procedure calls are one of the most expensive operations.
You have invented a scheme that reduces the loads and stores normally associated with procedure calls and returns.
The first thing you do is run some experiments with and without this optimization.
Your experiments use the same state-of-the-art optimizing compiler that will be used with either version of the computer. These experiments reveal the following information:
Which is faster? Justify your decision quantitatively.
 | Optimized | Un-optimized |
---|---|---|
T | m | m(1 + 5/100) = (21/20)m |
CPI | | |
IC | | |
Count L-S% | 30% | |
Ex L-S% | (2/3)*k | k |
2.4: First commercial computer was a decimal computer?
Is li (load immediate) a RISC-V instruction? Mentioned in page 136 but not in the reference card.

Which RISC-V assembler is being talked about in section 2.12 (page 124)?

Any difference between li x5, 100 and addi x5, x0, 100 performance-wise? (though li is a pseudo-instruction)

Why is the 0(x23) needed even when the offset is zero? Is it mandatory for some instructions, like ld x6, 0(x5)?
Class: ALU does 64b but then why registers only 32b?
Why does it matter? Write after write means that the first write can simply be ignored, right?

-- agreed, it matters here (x1 is read before being overwritten)
add x1, x2, x3
add x1, x1, x4

-- but here the first write could indeed be ignored
add x1, x2, x3
add x1, x4, x4
I found this sentence under the section 'Reducing the delay of branches' in Chapter 4 ('The processor') in the Hennessy Patterson RISC-V book (page 309): 'Equality can be tested by XORing individual bit positions of two registers and ORing the XORed result.'
Why is an OR operation needed as well? Isn't the XOR result enough?
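My reading (an interpretation, not from the book): the OR is the bitwise reduction that collapses the 64 per-bit XOR results into a single "any difference?" bit. At the instruction level the comparison against zero performs that reduction implicitly (equal is a hypothetical label):

xor x5, x6, x7 # x5 has a 1 in every bit position where x6 and x7 differ
beq x5, x0, equal # x6 == x7 iff all the XORed bits are 0 (the implicit OR-reduction)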
In the textbook it is mentioned (page 322) that one of the 'two primary and distinct responsibilities must be dealt with in a multiple-issue pipeline' is 'dealing with data and control hazards'. Structural hazards are not mentioned there. Does that mean structural hazards are not much of a problem in multiple-issue designs?
The other factor to be dealt with was 'packaging instructions into issue slots'. Does that part take care of structural hazards?
In the static-prediction - delayed branch slide (slide number 44 in pdf), how does the processor know that an instruction is a branch instruction before decoding it? Doesn't it need to wait till instruction decode stage to know what kind of an instruction it is? In the diagram, the independent instruction (that runs in the branch delay slot) starts right after the branch instruction's Instruction Fetch stage (together with the branch instruction's ID stage).
p371: "SRAMs typically use 6-8 transistors per bit to prevent the information from being disturbed when read". Why this disturbance?
[INFORMAL] p373: hard disk: 'entire drive is permanently sealed to control the environment inside the drive': does that mean that if we open it ourselves, it will no longer function when we put it back together?
p374: all disk heads will be on the same track. Why this restriction? Any advantage?
p375: does the speed at which the disks rotate have any bearing on its efficiency?
Difference between L1/L2/L3 caches. Is the technology different as well?
Cache miss => instructing main memory to do a read: is that the same as a complete load instruction (maybe without the instruction decode phase)?
p400: Set associativity for caches built with content-addressable memories goes up to how much? Book says 8 and above.
Increase in block size of cache means: advantage: more cache hits; disadvantage: increase in miss penalty. Any other disadvantages, like more complex hardware maybe?
What's the reason for using $ sign to denote cache? Is it really because it sounds like 'cash' or is there another reason?
Regarding the section on VMs in textbook (p416): if we are running a single VM on our laptop, how does the hypervisor act? Does it manage both the host OS (our computer's OS) and the guest OS (the OS used in the VM)? Or does the hypervisor play a greater role only when multiple guest VMs are running? Is the hypervisor something that's part of the hardware or something that's software? What does enabling virtualization in the BIOS mean? Is it the same as enabling the hypervisor or something? The hypervisor sort of translates the instructions issued by the guest systems to those of the host and maps the hardware to be used by the guests to the actual hardware of the host, right?
Is the secondary storage in smartphones flash memory? Was that so even in the old Nokia phones (with a physical keypad and all)? Or did flash start being used in phones only after smartphones were made? Is this the kind of memory used even in small devices which can store just a few numbers, like a glucose meter (or a similar device) being able to remember the last few measurements?
p421: Virtual memory: RISC-V has 64-bit addresses but the upper 16 bits are not used; the virtual address is 48 bits. Why are the upper bits not used? (It later says that RISC-V supports 39b and 57b virtual address spaces as well.)
Virtual memory: Which component does the virtual-to-physical address translation? How does it manage if we expand the physical memory by replacing the RAM with one of a higher capacity? The range of physical addresses would change, right? So the address mapping part would also need to change.
Virtual memory: 'A page fault to disk would result in millions of clock cycles to process.' Why would it need that many cycles? Does a typical hard disk access really take that long? And is an SSD faster than an HDD because the number of cycles needed for an SSD is smaller?
DONE p385: How many clock cycles, or what part of a clock cycle, would it take to service a cache miss? So the pipeline must stall? It's not stalling for just a few pipeline stages, but for entire clock cycles?
DONE p385: Cache miss, okay. But what's a write miss? What could be missed when writing?
[1] Page 102