Most of this is remnants of notes from 2021 some of which were found in 2024…

High level synthesis:

Input:

Behavioural HDL processes. All of which needs to be run concurrently
- Only functionality is modeled here.
Clock period
Component library (the components that are availble to be used, I guess)
- Components of different area, delay, power dissipation, number of ports, etc.
- ALUs, memory, adders, multipliers, comparators, etc.
Constraints
- Resource, delays, power, etc (which may also depend on the budget allocated to the design)

Output:

FSM (controller)
datapath

Example: The behavioural model

y <= (A+B) * C - (D+E)/F;

which is implictly

y <= [(A+B) * C] - [(D+E)/F];

and becomes

Clock0:

t1 <= A+B
t2 <= D+E

Clock1:

t3 <= t1*C
t4 <= t2/F

Clock2:

y <= t3-t4

Scheduling

Leads to the FSM part.
Maps operations to specific clock cycles.

|   │   |   │     ┆      |   │   |   │            
|   │   |   │     ┆     ┄|┄┄┄│┄┄┄│┄┄┄│┄┄ Clk0     
└─┬─┘   └─┬─┘     ┆      └─┬─┘   └─┬─┘                 
  +       +       ┆        +       +                 
  |       │       ┆     ┄┄┄|┄┄┄┄┄┄┄│┄┄┄┄ Clk1        
  └───┬───┘       ┆        └───┬───┘            
      +           ┆            +                 
      │                 ┄┄┄┄┄┄┄|┄┄┄┄┄┄┄┄ Clk2    

where the '+' denotes operations.

Function unit binding:

Operations are mapped to appropriate FUs from the resource library.
The kind of FUs chosen determines the interconnections and hence the data path.

Assigning values (temporary variables) to registers.
Values that would be used in a future clock cycle.
- 'variables persisting across clock boundaries implies register'

FSM encoding

aka 'State encoding' aka 'State assignment'.
State assignment. ie, making states.
- Assign unique codes to symbolic states of FSM.
Encoding is needed to have an implementation.
Number of states = minimum number of bits (ie, D-FFs) in state register of FSM.
State table is constructed.
State table is converted to truth table.
Truth table => logic for output and next state .
Logic minimization can be done using truth table.

 Current state
+--------------+
|              |        
|              ^        
|              |        
|   +------------------------+
|   |    State register      |
|   +------------------------+
|              |
v              ^   Next state
|              |
|   +------------------------+
|   | Next state and output  |
|   |    output logic        |------> Output 
|   +------------------------+
|           |        |
|           ^        ^
|           |        |
+-----------+        |
                    Input

Mealy machine notation: condition/action

Synthesis

Converting an abstract, high-level specification to detailed implementation.
Those who write the high-level specification needn't worry about the implementation details. They can focus on their problem to be solved.
The synthesis tool takes the HLL as input and figures out which all components and circuits would be needed to implement it.
HLL describes the functionality of the circuit.
HLL captures only the behaviour. Not how it is implemented.
Similar to a compiler. But in synthesis,
- Language: HDL, Bluespec
Same HLL can lead to different equivalent implementations.

Detailed implementation

Target specific: Gates, modules, etc.
Different for different systems.

Example:

A system may have an ALU => that can be used.
Some other system may not have an ALU, so we would have to build an equivalent component.
Or maybe there is a chip available in the system, which makes things easier.
Based on the target, implementation changes.

Netlist

Interconnection between the gates and modules.
The components alone are not enough. We need to figure out how they are to be connected together.
Netlist can be used by a chip manufacturer to create the actual chip.

Multiple abstraction levels

(like in the TCP/IP :-) maybe )
Area and power considerations

Comparison with compilers

In compilers, there may be a strict order in which the instructions are to be completed.
But in digital circuits, parallelism is there. Parallel activity of gates and components.
Instructions in compilers.
Gates and modules in sods.

Both need to be clever enough to choose the 'best' path.

Metrics to measure 'goodness'.

Measuring performance

What's the largest chip that can be manufactured?
What's the most energy efficient chip that can be manufactured?
- there are limits. Chip may get hot.
- hot chip => won't last long.
- lower power dissipation => frequency of switching decreases
Frequency of the transistors in the chip
Runaway effect
Spiraling effect
Power dissipation
Power density
Temperature
Temperature density: Temp generated per unit area

—

Constraints in chip on silicon wafer (die)
Processor is just one kind of a chip.
What is the fastest chip that can be made? Keeping in mind all limitations?
Speed with which wires transmit signals?
Crosstalk btw wires (capacitance?)
There are physical limits on how fast a chip can operate.

Metrics to measure efficiancy

Area
Performance
Energy/Power

Perf => Energy
Energy => Power
Power => Heating up

These are orthogonal metrics. We need to make a choice. Trade-offs for realistic designs.

Tools should be able to deal with large designs

geometric layout
component instantiations and interconnections

Abstract models

State transition diagrams
DFG (Data Flow Graph)
CDFG (Control/Data Flow Graphs)
Sequence graphs
Model call: Calling a module in a sequence graph
Links and operations

Target independent optimizations

Data-flow transformations
Control-flow transformations

Data-Flow transformations

Tree height reduction

With respect to the parse tree. Applies arithmetic properties to the arithmetic expression tree. Assume no hardware restriction now. Expression split into two-operand expressions. => Exploits parallelism better.

Eg:

x = a + b + c + d
x = a + b
x = x + c
x = x + d

But what if later when we take hardware restrictions into account? What if there was only one adder? Still this optimizations is good because it is better in most hardware configurations??

Constant and variable propagation (ie, folding)

Pre-compute values statically. Detect copies of the variable/constant. The input to this form of optimization itself may have been the result of some other optimizations.

Eg:

a = 0
b = a
c = 10 * b
d = a * b + c + 1

becomes

a = 0
~b = a~
c = 10 * a
d = a * a + c + 1

becomes

a = 0
~b = 0~
c = 0
d = 1

Common subexpression elimination

Common subexpression patterns show up as isomorphic (ie, similar in form) patterns in parse trees (especially when reduced to 3 address code) Operator commutativity can be exploited?? No need to compute the subexpression over and over again.

Eg:

a = x + y
b = a + 2
c = (x + y) * (a + 1)

becomes

a = x + y
b = a + 2
c = a * (a + 1)

Operator strength reduction

Replace time consuming operations with faster operations.

Eg:

b = a * 2

could be made

b = a << 1

because shift register cheaper and faster than building up a multiplier.

This is particularly useful when the only kind of multiplication is by a power of 2. Then we don't need a multiplier and can be okay with a shift register instead.

But what if there's an overflow on shifting??

Code motion

'Move' code to make it more efficient. Applies to loop invariants.

for(int i=0; i<a*b; ++i) {
    ...
}

Provided `a` and `b` doesn't change within the loop, this could become

t = a * b;
for(int i=0; i<t; ++i) {
    ...
}

Control flow based transformations

Model expansion

'Flatten' models (submodules) that are called only once. ie, remove the hierarchy. This would expose optimization opportunities across the model's boundaries. ie, larger scope for optimizations.

Conditionals expansion

Control flow is converted to a data flow. Hardware is inherently parallel (unlike in the case of software). This could in turn expose more opportunities for optimization.

y = ab
if (a) {
    x = b+d
} else {
    x = bd
}

could be made into

y = ab

which could in turn become

y = ab
x = ab + ad + a'bd

to become

y = ab
x = y + ad + a'bd  // -> x = y + d(a + a'b) -> x = y + d(a' + d)

Loop unrolling (Loop expansion)

Used in software world for Software pipelining.

Dbts

a-synthesis, l-synthesis
architectural synthesis
logic synthesis
logic binding

Abbreviations

FU: Functional unit

Misc

CHERI

Capability Enhanced Enhanced RISC Instructions (CHERI)

Joint work by SRI and University of Cambridge
https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/
https://tratt.net/laurie/blog/2023/two_stories_for_what_is_cheri.html
CHERI-aware OS needed to take advantage of the extra features
'Capabilities'

MUX vs serializers

Not the same.
Likewise with DEMUX and deserializers