Most of this is remnants of notes from 2021 some of which were found in 2024…
High level synthesis:
Input:
- Behavioural HDL processes. All of which needs to be run concurrently
- Only functionality is modeled here.
- Clock period
- Component library (the components that are availble to be used, I guess)
- Components of different area, delay, power dissipation, number of ports, etc.
- ALUs, memory, adders, multipliers, comparators, etc.
- Constraints
- Resource, delays, power, etc (which may also depend on the budget allocated to the design)
Output:
- FSM (controller)
- datapath
Example: The behavioural model
y <= (A+B) * C - (D+E)/F;
which is implictly
y <= [(A+B) * C] - [(D+E)/F];
and becomes
Clock0:
- t1 <= A+B
- t2 <= D+E
Clock1:
- t3 <= t1*C
- t4 <= t2/F
Clock2:
- y <= t3-t4
Scheduling
- Leads to the FSM part.
- Maps operations to specific clock cycles.
| │ | │ ┆ | │ | │
| │ | │ ┆ ┄|┄┄┄│┄┄┄│┄┄┄│┄┄ Clk0
└─┬─┘ └─┬─┘ ┆ └─┬─┘ └─┬─┘
+ + ┆ + +
| │ ┆ ┄┄┄|┄┄┄┄┄┄┄│┄┄┄┄ Clk1
└───┬───┘ ┆ └───┬───┘
+ ┆ +
│ ┄┄┄┄┄┄┄|┄┄┄┄┄┄┄┄ Clk2
where the '+' denotes operations.
Function unit binding:
- Operations are mapped to appropriate FUs from the resource library.
- The kind of FUs chosen determines the interconnections and hence the data path.
Register allocation:
- Assigning values (temporary variables) to registers.
- Values that would be used in a future clock cycle.
- 'variables persisting across clock boundaries implies register'
FSM encoding
- aka 'State encoding' aka 'State assignment'.
- State assignment. ie, making states.
- Assign unique codes to symbolic states of FSM.
- Encoding is needed to have an implementation.
- Number of states = minimum number of bits (ie, D-FFs) in state register of FSM.
- State table is constructed.
- State table is converted to truth table.
- Truth table => logic for output and next state .
- Logic minimization can be done using truth table.
Current state
+--------------+
| |
| ^
| |
| +------------------------+
| | State register |
| +------------------------+
| |
v ^ Next state
| |
| +------------------------+
| | Next state and output |
| | output logic |------> Output
| +------------------------+
| | |
| ^ ^
| | |
+-----------+ |
Input
Mealy machine notation: condition/action
Synthesis
- Converting an abstract, high-level specification to detailed implementation.
- Those who write the high-level specification needn't worry about the implementation details. They can focus on their problem to be solved.
- The synthesis tool takes the HLL as input and figures out which all components and circuits would be needed to implement it.
- HLL describes the functionality of the circuit.
- HLL captures only the behaviour. Not how it is implemented.
- Similar to a compiler. But in synthesis,
- Language: HDL, Bluespec
- Same HLL can lead to different equivalent implementations.
Detailed implementation
- Target specific: Gates, modules, etc.
- Different for different systems.
Example:
- A system may have an ALU => that can be used.
- Some other system may not have an ALU, so we would have to build an equivalent component.
- Or maybe there is a chip available in the system, which makes things easier.
- Based on the target, implementation changes.
Netlist
- Interconnection between the gates and modules.
- The components alone are not enough. We need to figure out how they are to be connected together.
- Netlist can be used by a chip manufacturer to create the actual chip.
Multiple abstraction levels
- (like in the TCP/IP :-) maybe )
- Area and power considerations
Comparison with compilers
In compilers, there may be a strict order in which the instructions are to be completed.
But in digital circuits, parallelism is there. Parallel activity of gates and components.
Instructions in compilers.
Gates and modules in sods.
Both need to be clever enough to choose the 'best' path.
Metrics to measure 'goodness'.
Measuring performance
- What's the largest chip that can be manufactured?
- What's the most energy efficient chip that can be manufactured?
- there are limits. Chip may get hot.
- hot chip => won't last long.
- lower power dissipation => frequency of switching decreases
- Frequency of the transistors in the chip
- Runaway effect
- Spiraling effect
- Power dissipation
- Power density
- Temperature
- Temperature density: Temp generated per unit area
—
Constraints in chip on silicon wafer (die)
Processor is just one kind of a chip.
What is the fastest chip that can be made? Keeping in mind all limitations?
Speed with which wires transmit signals?
Crosstalk btw wires (capacitance?)
There are physical limits on how fast a chip can operate.
Metrics to measure efficiancy
- Area
- Performance
- Energy/Power
Perf => Energy
Energy => Power
Power => Heating up
These are orthogonal metrics. We need to make a choice. Trade-offs for realistic designs.
Tools should be able to deal with large designs
- geometric layout
- component instantiations and interconnections
Abstract models
State transition diagrams
DFG (Data Flow Graph)
CDFG (Control/Data Flow Graphs)
Sequence graphs
Model call: Calling a module in a sequence graph
Links and operations
Target independent optimizations
- Data-flow transformations
- Control-flow transformations
Data-Flow transformations
Tree height reduction
With respect to the parse tree. Applies arithmetic properties to the arithmetic expression tree. Assume no hardware restriction now. Expression split into two-operand expressions. => Exploits parallelism better.
Eg:
x = a + b + c + d
x = a + b
x = x + c
x = x + d
But what if later when we take hardware restrictions into account? What if there was only one adder? Still this optimizations is good because it is better in most hardware configurations??
Constant and variable propagation (ie, folding)
Pre-compute values statically. Detect copies of the variable/constant. The input to this form of optimization itself may have been the result of some other optimizations.
Eg:
a = 0
b = a
c = 10 * b
d = a * b + c + 1
becomes
a = 0
~b = a~
c = 10 * a
d = a * a + c + 1
becomes
a = 0
~b = 0~
c = 0
d = 1
Common subexpression elimination
Common subexpression patterns show up as isomorphic (ie, similar in form) patterns in parse trees (especially when reduced to 3 address code) Operator commutativity can be exploited?? No need to compute the subexpression over and over again.
Eg:
a = x + y
b = a + 2
c = (x + y) * (a + 1)
becomes
a = x + y
b = a + 2
c = a * (a + 1)
Operator strength reduction
Replace time consuming operations with faster operations.
Eg:
b = a * 2
could be made
b = a << 1
because shift register cheaper and faster than building up a multiplier.
This is particularly useful when the only kind of multiplication is by a power of 2. Then we don't need a multiplier and can be okay with a shift register instead.
But what if there's an overflow on shifting??
Code motion
'Move' code to make it more efficient. Applies to loop invariants.
for(int i=0; i<a*b; ++i) {
...
}
Provided `a` and `b` doesn't change within the loop, this could become
t = a * b;
for(int i=0; i<t; ++i) {
...
}
Control flow based transformations
Model expansion
'Flatten' models (submodules) that are called only once. ie, remove the hierarchy. This would expose optimization opportunities across the model's boundaries. ie, larger scope for optimizations.
Conditionals expansion
Control flow is converted to a data flow. Hardware is inherently parallel (unlike in the case of software). This could in turn expose more opportunities for optimization.
y = ab
if (a) {
x = b+d
} else {
x = bd
}
could be made into
y = ab
which could in turn become
y = ab
x = ab + ad + a'bd
to become
y = ab
x = y + ad + a'bd // -> x = y + d(a + a'b) -> x = y + d(a' + d)
Loop unrolling (Loop expansion)
Used in software world for Software pipelining.
Dbts
- a-synthesis, l-synthesis
- architectural synthesis
- logic synthesis
- logic binding
Abbreviations
- FU: Functional unit
Misc
CHERI
Capability Enhanced Enhanced RISC Instructions (CHERI)
- Joint work by SRI and University of Cambridge
- https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/
- https://tratt.net/laurie/blog/2023/two_stories_for_what_is_cheri.html
- CHERI-aware OS needed to take advantage of the extra features
- 'Capabilities'
MUX vs serializers
- Not the same.
- Likewise with DEMUX and deserializers