As part of my master's course project, I generate Verilog designs that are then synthesized using Vivado before being loaded onto FPGAs.
I once tried to see if open source tools can be used to achieve the same.
Spent some time looking up stuff like Skywater-PDK (being a hardware novice, I didn't realize for a while that PDK is for ASIC designs).
Then I came across this article which describes how a verilog design can be synthesised and loaded onto an FPGA. The board used was a Sipeed Tang Nano 9K.
FOSS tools involved in this included:
Tool / Project | Purpose |
---|---|
Yosys | Synthesis |
nextpnr | Placement, routing, STA |
Project apicula | Bitstream generation |
openFPGALoader | Bitstream loading |
I found a Sipeed Tang Nano 9K and tried doing the stuff mentioned in the article, and it worked perfectly!
This blog post is a description of that experience.
First thing to do was to get an FPGA that is open-source friendly (or became so by community effort).
After some time on the internet, I figured there were two options: a Lattice FPGA or a Gowin FPGA. I couldn't find a Lattice FPGA anywhere but could get my hands on a Gowin FPGA. What's more, the board was a Sipeed Tang Nano 9K itself. It helped that Gowin FPGAs are relatively cheap.
Sipeed Tang Nano 9K is a board made by Sipeed, a company from China, and uses a Gowin FPGA. As the 'nano' in its name indicates, Tang Nano 9K is one of the smaller boards offered by Sipeed. It is one of the most cost effective boards with an FPGA available in the market that is suitable for beginners.
This board is powered by a GW1NR-9 FPGA (GW1NR-LV9QN88PC6/I5) made by
Gowin, again a Chinese company.
(If it weren't for Chinese companies offering cheap boards, a lot of us
wouldn't even see an FPGA.)
Some of the specs of this board:
LUTs | 8640 (LUT4) |
Registers / FFs | 6480 |
PLLs | 2 |
Buttons | 2 |
LEDs | 6 |
Crystal oscillator frequency | 27MHz |
Hard core processor | -NA- |
Debugging | Onboard USB-JTAG, USB-UART |
A schematic of Sipeed Tang Nano 9K is available here.
In the example design that we use, we would be using all 6 LEDs for output.
Gowin offers an IDE of its own which is free but still needs a license, like Vivado. Since we are focusing on open source tools, we don't use it.
We are using a simple 6-bit ring counter as an example. I got the verilog code for that as follows:
module counter
(
    input clk,
    output [5:0] led
);
    // 27M cycles for 1 second
    localparam WAIT_TIME = 27000000;

    // Set initial value of cycle counter
    // (25 bits wide: 27,000,000 does not fit in 24 bits, 2^24 = 16,777,216)
    reg [24:0] clockCounter = 0;

    // Indicates currently active LED
    // Initially no LED is selected; the first one is selected after 1s
    reg [5:0] ledCounter = 0;

    always @(posedge clk) begin
        // Step up cycle counter
        clockCounter <= clockCounter + 1;
        if (clockCounter == WAIT_TIME) begin
            // Reset cycle counter once 1s is up
            clockCounter <= 0;
            // Register change in active LED
            if (ledCounter == 0)
                ledCounter <= 1;
            else
                ledCounter <= ledCounter << 1;
        end
    end

    // Update change in active LED
    assign led = ~ledCounter;
endmodule
There is no input other than clock. 6 output signals are activated one by one, one at a time, with a delay of 1 second when operated at 27MHz frequency. These output signals are meant to be mapped to LEDs.
(iverilog can be used for simulation with a test bench or to just play around with the Verilog file.)
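To get a feel for the LED pattern without a simulator, here is a small Python model of the two registers. It is only a sketch of the update rule in the Verilog above (the function names are mine), and it assumes the LEDs are active-low, which the inversion in `assign led = ~ledCounter` suggests.

```python
WAIT_TIME = 27_000_000  # cycles between LED changes, as in the Verilog


def next_led_counter(led_counter: int) -> int:
    """One update of the 6-bit ledCounter register."""
    if led_counter == 0:
        return 1
    # A 6-bit register drops the bit shifted out at the top
    return (led_counter << 1) & 0b111111


def led_outputs(steps: int) -> list[str]:
    """Value of `led` (~ledCounter) after each update, as 6-bit strings."""
    led_counter = 0
    outputs = []
    for _ in range(steps):
        led_counter = next_led_counter(led_counter)
        outputs.append(format(~led_counter & 0b111111, "06b"))
    return outputs


for pattern in led_outputs(8):
    print(pattern)

# Number of bits the cycle counter needs in order to reach WAIT_TIME
print(WAIT_TIME.bit_length())
```

The single 0 bit walks from the lowest position to the highest. After that there is one step where all six outputs are 1 (ledCounter has shifted to zero) before the pattern starts over.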
yosys
yosys is used to convert the Verilog design into the corresponding netlist, which contains the data needed for placement and routing.
yosys -p "read_verilog counter.v; synth_gowin -top counter -json counter.json"
This tells yosys to read the Verilog file, run synthesis targeting a Gowin FPGA with counter as the top module, and write the results into a file named counter.json. yosys will perform optimizations, techmap, etc.
The output of synthesis is a netlist. The contents of counter.json are a form of netlist.
We can use the synth_gowin command here thanks to yosys providing support for Gowin FPGAs out of the box.
synth_gowin is actually short for a bunch of yosys commands. I've included these commands in the addendum of this post.
The counter.json file will contain information like the mapping of blocks to FPGA components.
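Since counter.json is plain JSON, it can be poked at with a few lines of Python. The sketch below counts cells by type. It assumes the usual layout of the JSON that yosys writes (a top-level "modules" object, each module with a "cells" object, each cell with a "type"); the sample data is a tiny hand-written stand-in, not the real netlist.

```python
import json
from collections import Counter


def count_cell_types(netlist: dict, module: str) -> Counter:
    """Count the cells of one module in a yosys JSON netlist by cell type."""
    cells = netlist["modules"][module]["cells"]
    return Counter(cell["type"] for cell in cells.values())


# Hand-written stand-in for counter.json (cell names are made up)
SAMPLE = json.loads("""
{
  "modules": {
    "counter": {
      "cells": {
        "clk_ibuf":   {"type": "IBUF"},
        "led_obuf_0": {"type": "OBUF"},
        "led_obuf_1": {"type": "OBUF"}
      }
    }
  }
}
""")

print(count_cell_types(SAMPLE, "counter"))
```

For the real file, something like `count_cell_types(json.load(open("counter.json")), "counter")` should give per-type counts comparable to what yosys itself reports.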
The synth_gowin command includes the stat command of yosys, which prints resource utilization data like the number of registers needed.
A sample report looks like this:
=== counter ===
Number of wires: 78
Number of wire bits: 127
Number of public wires: 78
Number of public wire bits: 127
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 87
ALU 30
DFFE 6
DFFR 24
GND 1
IBUF 1
LUT1 9
LUT4 7
MUX2_LUT5 2
OBUF 6
VCC 1
I didn't know what many of these abbreviations meant, so I looked up some of them in a Gowin user guide.
nextpnr
nextpnr is used to perform placement and routing after synthesis, based on the information generated by yosys. It offers a Gowin-specific command: nextpnr-gowin.
# --json   : info from yosys
# --freq   : desired frequency in MHz
# --write  : output file
# --device : target FPGA info
# --family : target FPGA family
# --cst    : physical constraints file
nextpnr-gowin \
    --json counter.json \
    --freq 27 \
    --write counter_pnr.json \
    --device GW1NR-LV9QN88PC6/I5 \
    --family GW1N-9C \
    --cst tangnano9k.cst
The constraints file says which ports are to be mapped to which pins. In our case, its contents are as follows:
IO_LOC "clk" 52;
IO_PORT "clk" PULL_MODE=UP;
IO_LOC "led[0]" 10;
IO_LOC "led[1]" 11;
IO_LOC "led[2]" 13;
IO_LOC "led[3]" 14;
IO_LOC "led[4]" 15;
IO_LOC "led[5]" 16;
which specifies the pins for the clock and the LEDs.
The pin numbers are available from the schematics of the Sipeed Tang Nano 9K board. I could find such a schematic here.
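A small sanity check I find handy for constraints files: pull out the IO_LOC lines and make sure no pin is assigned twice. This is just a regex-based sketch that covers the simple statements shown above, not a full parser for the .cst format.

```python
import re

CST_TEXT = '''
IO_LOC "clk" 52;
IO_PORT "clk" PULL_MODE=UP;
IO_LOC "led[0]" 10;
IO_LOC "led[1]" 11;
IO_LOC "led[2]" 13;
IO_LOC "led[3]" 14;
IO_LOC "led[4]" 15;
IO_LOC "led[5]" 16;
'''

# Matches lines of the form: IO_LOC "<port>" <pin>;
IO_LOC_RE = re.compile(r'^\s*IO_LOC\s+"([^"]+)"\s+(\w+)\s*;', re.MULTILINE)


def parse_io_loc(text: str) -> dict[str, str]:
    """Map each port name to the pin it is placed on."""
    return dict(IO_LOC_RE.findall(text))


pins = parse_io_loc(CST_TEXT)
print(pins)

# Every port should sit on its own pin
assert len(set(pins.values())) == len(pins), "a pin is used more than once"
```

Pins are kept as strings rather than integers, on the assumption that other packages may use non-numeric pin names.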
The above nextpnr command would write the information about the placed and routed design to a file named counter_pnr.json. The contents of this file are a netlist with placement and routing info.
First, nextpnr produces an FPGA-specific utilization report:
Info: Device utilisation:
Info: VCC: 1/ 1 100%
Info: SLICE: 79/ 8640 0%
Info: IOB: 7/ 274 2%
Info: OSER16: 0/ 38 0%
Info: IDES16: 0/ 38 0%
Info: IOLOGIC: 0/ 296 0%
Info: MUX2_LUT5: 2/ 4320 0%
Info: MUX2_LUT6: 0/ 2160 0%
Info: MUX2_LUT7: 0/ 1080 0%
Info: MUX2_LUT8: 0/ 1056 0%
Info: GND: 1/ 1 100%
Info: RAMW: 0/ 270 0%
Info: OSC: 0/ 1 0%
Info: rPLL: 0/ 2 0%
Looks like it needed only two 5-input LUTs. I guess, the 7 IOBs are the 6 LEDs and the input clock. But I'm not sure how the number of slices is 79.
Abbreviations:
(Couldn't find what RAMW means..)
nextpnr is 'timing driven', i.e., it does some form of static timing analysis by itself. Every path in the netlist that starts from a FF and ends at another FF is analysed.
(nextpnr can be asked to output a report in JSON format with --report.)
The nextpnr architecture is something like this (this is an ascii-art version of an image from here):
+----------+ +--------+
| JSON |-->--| Packer |
|front end | +--------+
+----------+ | +--------+
v | Timing |
| | model |
| +--------+
+--------+ |
+--->---| Placer |--<--+ v
| +--------+ | |
+------+ | +----------+
| Chip | v | Timing |
| data | | | analysis |
+------+ | +----------+
| +--------+ |
+--->---| Router |--<--+
+--------+
|
v
where the timing model, chip data and packer vary with the target board.
I used to think that timing analysis cannot be done before routing. But it turns out that isn't the case. My wild guess is that paths can be inferred from the netlist that yosys generated since it specifies the paths in some form anyway. The separate routing process can create more efficient paths specific to the target board.
This was the slack histogram that nextpnr gave post-placement but pre-routing:
Info: Max frequency for clock 'clk_IBUF_I_O': 316.86 MHz (PASS at 27.00 MHz)
Info: Max delay posedge clk_IBUF_I_O -> <async>: 6.69 ns
Info: Slack histogram:
Info: legend: * represents 1 endpoint(s)
Info: + represents [1,1) endpoint(s)
Info: [ 33881, 33996) |***************************
Info: [ 33996, 34111) |
Info: [ 34111, 34226) |
Info: [ 34226, 34341) |
Info: [ 34341, 34456) |
Info: [ 34456, 34571) |
Info: [ 34571, 34686) |*
Info: [ 34686, 34801) |*
Info: [ 34801, 34916) |
Info: [ 34916, 35031) |
Info: [ 35031, 35146) |
Info: [ 35146, 35261) |
Info: [ 35261, 35376) |
Info: [ 35376, 35491) |
Info: [ 35491, 35606) |
Info: [ 35606, 35721) |
Info: [ 35721, 35836) |
Info: [ 35836, 35951) |
Info: [ 35951, 36066) |
Info: [ 36066, 36181) |*
Info: Checksum: 0xb9798815
The slack histogram groups slacks (in ps) into ranges and shows how many path endpoints fall in each of the ranges.
In the above histogram, there are 27 asterisks next to [ 33881, 33996), which means that there are 27 endpoints whose slack is greater than or equal to 33.881ns but less than 33.996ns.
The histogram would show negative values if there was negative slack. No negative slack. Yay!
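The bin edges in the log look like 20 equal-width, half-open ranges starting at the smallest slack. Here is a sketch of that kind of binning in Python. The bin-width formula is my assumption (it reproduces the edges printed above), and the individual slack values are made up so that they land in the same bins as the log; nextpnr does not print them.

```python
import math


def slack_histogram(slacks_ps: list[int], bins: int = 20) -> list[tuple[int, int, int]]:
    """Group slacks into `bins` half-open ranges [lo, hi) and count each range."""
    lowest, highest = min(slacks_ps), max(slacks_ps)
    # Assumed rule: just wide enough for the largest slack to fit in the last bin
    width = max(1, math.ceil((highest - lowest + 1) / bins))
    counts = [0] * bins
    for slack in slacks_ps:
        counts[(slack - lowest) // width] += 1
    return [(lowest + i * width, lowest + (i + 1) * width, counts[i])
            for i in range(bins)]


# Made-up endpoint slacks (ps), chosen to match the bins in the log above
slacks = [33881] * 27 + [34600, 34700, 36180]

for lo, hi, count in slack_histogram(slacks):
    print(f"[{lo:6d}, {hi:6d}) |{'*' * count}")
```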
After doing placement and associated timing analysis, nextpnr will do routing.
As it tries various routes, it will show what it's up to by printing stuff like this:
Info: Routing..
Info: Setting up routing queue.
Info: Routing 225 arcs.
Info: | (re-)routed arcs | delta | remaining| time spent |
Info: IterCnt | w/ripup wo/ripup | w/r wo/r | arcs| batch(sec) total(sec)|
Info: 426 | 200 226 | 200 226 | 0| 1.82 1.82|
Info: Routing complete.
Info: Router1 time 1.82s
Info: Checksum: 0x756ec238
('ripup and reroute' is the name of a class of routing algorithms.)
An arc is a 'source-sink pair on a net' or a 'directed connection between two nodes'. Looks like there were 225 of them in the netlist and nextpnr spent 1.82 seconds performing 426 iterations.
Once routing is done, nextpnr will perform timing analysis once again.
Info: Critical path report for cross-domain path 'posedge clk_IBUF_I_O' -> '<async>':
Info: curr total
Info: 0.5 0.5 Source ledCounter_DFFE_Q_DFFLC.Q
Info: 1.9 2.4 Net ledCounter[5] (3,17) -> (1,22)
Info: Sink ledCounter_LUT1_I0_3_LC.A
Info: Defined in:
Info: counter.v:8.11-8.21
Info: 1.0 3.4 Source ledCounter_LUT1_I0_3_LC.F
Info: 1.4 4.8 Net led_OBUF_O_I[5] (1,22) -> (0,25)
Info: Sink led_OBUF_O$iob.I
Info: 1.5 ns logic, 3.4 ns routing
Info: Max frequency for clock 'clk_IBUF_I_O': 315.86 MHz (PASS at 27.00 MHz)
Info: Max delay posedge clk_IBUF_I_O -> <async>: 4.84 ns
Info: Slack histogram:
Info: legend: * represents 1 endpoint(s)
Info: + represents [1,1) endpoint(s)
Info: [ 33871, 33991) |*
Info: [ 33991, 34111) |
Info: [ 34111, 34231) |*
Info: [ 34231, 34351) |******
Info: [ 34351, 34471) |
Info: [ 34471, 34591) |*********
Info: [ 34591, 34711) |**********
Info: [ 34711, 34831) |*
Info: [ 34831, 34951) |*
Info: [ 34951, 35071) |
Info: [ 35071, 35191) |
Info: [ 35191, 35311) |
Info: [ 35311, 35431) |
Info: [ 35431, 35551) |
Info: [ 35551, 35671) |
Info: [ 35671, 35791) |
Info: [ 35791, 35911) |
Info: [ 35911, 36031) |
Info: [ 36031, 36151) |
Info: [ 36151, 36271) |*
Hmm.. Not sure if I'm reading this correctly, but from the histograms alone, it looks as if total slack actually went up after routing…
Still, the value shown as 'Max delay posedge clk_IBUF_I_O -> <async>' has gone down to 4.84ns from 6.69ns.
Yet the max frequency of the clock went down a bit, to 315.86MHz from 316.86MHz.
We had requested the design to be run at 27MHz. nextpnr figures it can be run even at 315.86MHz (Fmax), so we are good in that aspect.
I have not yet figured out how the histogram is meant to be read. This histogram is not that interesting since it is a tiny design. A histogram for a less trivial design is shown here.
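One reading that fits the numbers (my own arithmetic, not checked against the nextpnr source): slack is the clock period minus the path delay, so the smallest slack in the histogram belongs to the slowest clocked path, and Fmax is 1 over that path's delay. The smallest slack dropped from 33881ps to 33871ps after routing, which would explain the slightly lower Fmax even though most other endpoints gained slack. The '<async>' max delay is for the path from a register to an output pin, which is a different path.

```python
def fmax_mhz(target_mhz: float, worst_slack_ps: float) -> float:
    """Estimate Fmax from the target clock and the smallest slack."""
    period_ps = 1e6 / target_mhz                   # 27 MHz -> about 37037 ps
    critical_delay_ps = period_ps - worst_slack_ps  # delay of the slowest path
    return 1e6 / critical_delay_ps                  # back to MHz


# Lower edge of the first histogram bin, before and after routing
pre_route = fmax_mhz(27.0, 33881)
post_route = fmax_mhz(27.0, 33871)
print(f"pre-route: {pre_route:.2f} MHz, post-route: {post_route:.2f} MHz")
```

Both results come out within a few hundredths of a MHz of the 316.86MHz and 315.86MHz that nextpnr reported; the small difference is presumably rounding of the slack values in the log.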
nextpnr also has a gui that is not part of default installation which can provide visualizations of placed and routed nets. I have not used it yet, but it looks nice.
Now that we have all the information needed for a loadable FPGA design, we convert the data into a bitstream that can then actually be loaded onto the FPGA.
For this we need to know the bitstream format used by the FPGA. Usually FPGA vendors are not enthusiastic about revealing this information to the public.
(Found a hackernews discussion on the topic which was in response to this post suggesting that there's little point in keeping bitstream format secret.)
There have been many attempts to document the bitstream formats of different architectures over the years.
Examples include:
Project apicula is an effort that successfully managed to figure out the bitstream format of a class of gowin FPGAs that includes the one used by Sipeed Tang Nano 9K, which is what we are using.
The tools for apicula can be obtained by installing a python package called apycula.
pip3 install apycula
apycula offers the gowin_pack command to generate the bitstream. In our case, it can be used like:
# -d : target FPGA family
# -o : output file with bitstream
# last argument : input file with info from nextpnr
gowin_pack \
    -d GW1N-9C \
    -o counter.fs \
    counter_pnr.json
The counter.fs file is the generated bitstream.
(I tried opening this file in a text editor. Was expecting gibberish-like strings, but it was literally showing as a bunch of 0s and 1s.)
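Since the file is text made of 0s and 1s, turning it into raw bytes is straightforward. The sketch below does that for a made-up two-line sample. It assumes that each line holds a whole number of bytes and that lines starting with '//' are comments to be skipped; I have not checked either assumption against a format description.

```python
def fs_to_bytes(fs_text: str) -> bytes:
    """Convert a text bitstream (lines of '0'/'1' characters) to raw bytes."""
    out = bytearray()
    for line in fs_text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '//' lines carry no bitstream data
        if not line or line.startswith("//"):
            continue
        if set(line) - {"0", "1"}:
            raise ValueError(f"unexpected characters in line: {line!r}")
        if len(line) % 8 != 0:
            raise ValueError("line length is not a multiple of 8 bits")
        out += int(line, 2).to_bytes(len(line) // 8, "big")
    return bytes(out)


# Made-up sample, not taken from a real counter.fs
SAMPLE_FS = "//hypothetical comment line\n1111111111111111\n1010010111000011\n"

print(fs_to_bytes(SAMPLE_FS).hex())
```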
openFPGALoader
Once we have the bitstream file, we can load it onto our FPGA with openFPGALoader.
# -b : board name
# -f : write the given bitstream to flash
openFPGALoader -b tangnano9k -f counter.fs
But before this command can work, something needs to be done in order for openFPGALoader to be able to detect the FPGA connected to the computer.
Otherwise we can get errors like this:
$ openFPGALoader -b tangnano9k -f counter.fs
empty
write to flash
unable to open ftdi device: -3 (device not found)
JTAG init failed with: unable to open ftdi device
$ ls /dev/ttyUSB*
/dev/ttyUSB0 /dev/ttyUSB1
The developer of openFPGALoader has a blog post on getting around this error.
What we could do is to use udev rules:
$ sudo cp 99-openfpgaloader.rules /etc/udev/rules.d/
$ ls /etc/udev/rules.d/
52-xilinx-digilent-usb.rules 52-xilinx-ftdi-usb.rules 52-xilinx-pcusb.rules 99-openfpgaloader.rules
$ sudo udevadm control --reload-rules
$ sudo udevadm trigger
$ sudo usermod -a $USER -G plugdev
The rule file (99-openfpgaloader.rules) can be obtained from here.
Once that's done, it should be smooth sailing. We can flash the FPGA with:
$ openFPGALoader -b tangnano9k -f counter.fs
empty
write to flash
Jtag frequency : requested 6.00MHz -> real 6.00MHz
Parse file Parse counter.fs:
Done
DONE
Jtag frequency : requested 2.50MHz -> real 2.00MHz
Erase SRAM DONE
Erase FLASH DONE
Erasing FLASH: [==================================================] 100.00%
Done
write Flash: [==================================================] 100.00%
Done
CRC check : FAIL
Read: 0x0000431b checksum: 0xb4bb
I still have not figured out why the 'CRC check FAIL' is showing up, but it does not seem to prevent the bitstream being loaded onto the FPGA.
Once the FPGA is powered on, we can see the LEDs lighting up one by one, one at a time.
:-)
The 6-bit ring counter design we used is tiny and simple. One has to try larger designs to get a better feel for how it works.
It is helpful to have a Makefile to run the commands. I made one like:
NAME=counter
BOARD=tangnano9k
FAMILY=GW1N-9C
DEVICE=GW1NR-LV9QN88PC6/I5

.PHONY: all synth pnr bits load

all: bits

# Short names for the generated files
synth: $(NAME).json
pnr: $(NAME)_pnr.json
bits: $(NAME).fs

$(NAME).json: $(NAME).v
	yosys -p "read_verilog $(NAME).v; synth_gowin -top $(NAME) -json $(NAME).json"

$(NAME)_pnr.json: $(NAME).json $(BOARD).cst
	nextpnr-gowin --json $(NAME).json --freq 27 --write $(NAME)_pnr.json \
		--device $(DEVICE) --family $(FAMILY) --cst $(BOARD).cst

$(NAME).fs: $(NAME)_pnr.json
	gowin_pack -d $(FAMILY) -o $(NAME).fs $(NAME)_pnr.json

load: $(NAME).fs
	openFPGALoader -b $(BOARD) -f $(NAME).fs
Versions of software used:
Also see: Yosys+nextpnr: an Open Source Framework from Verilog to Bitstream for Commercial FPGAs - David Shah, Eddie Hung, Clifford Wolf, Serge Bazanski, Dan Gisselquist, Miodrag Milanović (2019)
synth_gowin of yosys
As mentioned in the manual, synth_gowin stands for:
# yosys> help synth_gowin
begin:
read_verilog -specify -lib +/gowin/cells_sim.v
read_verilog -specify -lib +/gowin/cells_xtra.v
hierarchy -check -top <top>
flatten: (unless -noflatten)
proc
flatten
tribuf -logic
deminout
coarse:
synth -run coarse [-no-rw-check]
map_ram:
memory_libmap -lib +/gowin/lutrams.txt -lib +/gowin/brams.txt [-no-auto-block] [-no-auto-distributed] (-no-auto-block if -nobram, -no-auto-distributed if -nolutram)
techmap -map +/gowin/lutrams_map.v -map +/gowin/brams_map.v
map_ffram:
opt -fast -mux_undef -undriven -fine
memory_map
opt -undriven -fine
map_gates:
techmap -map +/techmap.v -map +/gowin/arith_map.v
opt -fast
abc -dff -D 1 (only if -retime)
iopadmap -bits -inpad IBUF O:I -outpad OBUF I:O -toutpad TBUF ~OEN:I:O -tinoutpad IOBUF ~OEN:O:I:IO (unless -noiopads)
map_ffs:
opt_clean
dfflegalize -cell $_DFF_?_ 0 -cell $_DFFE_?P_ 0 -cell $_SDFF_?P?_ r -cell $_SDFFE_?P?P_ r -cell $_DFF_?P?_ r -cell $_DFFE_?P?P_ r
techmap -map +/gowin/cells_map.v
opt_expr -mux_undef
simplemap
map_luts:
read_verilog -icells -lib -specify +/abc9_model.v
abc9 -maxlut 8 -W 500
clean
map_cells:
techmap -map +/gowin/cells_map.v
opt_lut_ins -tech gowin
setundef -undriven -params -zero
hilomap -singleton -hicell VCC V -locell GND G
splitnets -ports (only if -vout used)
clean
autoname
check:
hierarchy -check
stat
check -noinit
blackbox =A:whitebox
vout:
write_verilog -simple-lhs -decimal -attr2comment -defparam -renameprefix gen <file-name>
write_json <file-name>
I guess files like the one at /usr/share/yosys/gowin/cells_map.v are used for techmap.
Its contents were like:
// DFFR D Flip-Flop with Synchronous Reset
module \$_SDFF_PP0_ (input D, C, R, output Q);
    DFFR _TECHMAP_REPLACE_ (.D(D), .Q(Q), .CLK(C), .RESET(R));
    wire _TECHMAP_REMOVEINIT_Q_ = 1;
endmodule
// ..
// ..
module \$lut (A, Y);
    parameter WIDTH = 0;
    parameter LUT = 0;
    (* force_downto *)
    input [WIDTH-1:0] A;
    output Y;
    generate
        if (WIDTH == 1) begin
            LUT1 #(.INIT(LUT)) _TECHMAP_REPLACE_ (.F(Y),
                .I0(A[0]));
        end else
        if (WIDTH == 2) begin
            LUT2 #(.INIT(LUT)) _TECHMAP_REPLACE_ (.F(Y),
                .I0(A[0]), .I1(A[1]));
        end else
        if (WIDTH == 3) begin
            LUT3 #(.INIT(LUT)) _TECHMAP_REPLACE_ (.F(Y),
                .I0(A[0]), .I1(A[1]), .I2(A[2]));
        end else
        if (WIDTH == 4) begin
            LUT4 #(.INIT(LUT)) _TECHMAP_REPLACE_ (.F(Y),
                .I0(A[0]), .I1(A[1]), .I2(A[2]), .I3(A[3]));
        end else
        if (WIDTH == 5) begin
            wire f0, f1;
            \$lut #(.LUT(LUT[15: 0]), .WIDTH(4)) lut0 (.A(A[3:0]), .Y(f0));
            \$lut #(.LUT(LUT[31:16]), .WIDTH(4)) lut1 (.A(A[3:0]), .Y(f1));
            MUX2_LUT5 mux5(.I0(f0), .I1(f1), .S0(A[4]), .O(Y));
        end else
        if (WIDTH == 6) begin
            wire f0, f1;
            \$lut #(.LUT(LUT[31: 0]), .WIDTH(5)) lut0 (.A(A[4:0]), .Y(f0));
            \$lut #(.LUT(LUT[63:32]), .WIDTH(5)) lut1 (.A(A[4:0]), .Y(f1));
            MUX2_LUT6 mux6(.I0(f0), .I1(f1), .S0(A[5]), .O(Y));
        end else
        if (WIDTH == 7) begin
            wire f0, f1;
            \$lut #(.LUT(LUT[63: 0]), .WIDTH(6)) lut0 (.A(A[5:0]), .Y(f0));
            \$lut #(.LUT(LUT[127:64]), .WIDTH(6)) lut1 (.A(A[5:0]), .Y(f1));
            MUX2_LUT7 mux7(.I0(f0), .I1(f1), .S0(A[6]), .O(Y));
        end else
        if (WIDTH == 8) begin
            wire f0, f1;
            \$lut #(.LUT(LUT[127: 0]), .WIDTH(7)) lut0 (.A(A[6:0]), .Y(f0));
            \$lut #(.LUT(LUT[255:128]), .WIDTH(7)) lut1 (.A(A[6:0]), .Y(f1));
            MUX2_LUT8 mux8(.I0(f0), .I1(f1), .S0(A[7]), .O(Y));
        end else begin
            wire _TECHMAP_FAIL_ = 1;
        end
    endgenerate
endmodule
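The WIDTH == 5 case above builds a 5-input LUT out of two 4-input LUTs and a mux selected by the top input bit. A quick Python model of that idea, as a sketch: it treats a LUT as "output = bit number A of the init value", takes the mux to pick I1 when S0 is 1, and checks the decomposition against a plain 5-input lookup for random init values. The function names are mine.

```python
import random


def lut(init: int, width: int, a: int) -> int:
    """A `width`-input LUT: the output is bit number `a` of `init`."""
    a &= (1 << width) - 1
    return (init >> a) & 1


def lut5_via_lut4s(init: int, a: int) -> int:
    """5-input LUT built as in the WIDTH == 5 branch of the techmap rule."""
    f0 = lut(init & 0xFFFF, 4, a & 0xF)          # lut0 uses LUT[15:0]
    f1 = lut((init >> 16) & 0xFFFF, 4, a & 0xF)  # lut1 uses LUT[31:16]
    select = (a >> 4) & 1                        # S0 is driven by A[4]
    return f1 if select else f0                  # model of MUX2_LUT5


# Compare both versions on every input, for a batch of random init values
rng = random.Random(0)
for _ in range(100):
    init = rng.getrandbits(32)
    assert all(lut(init, 5, a) == lut5_via_lut4s(init, a) for a in range(32))
print("decomposition matches the direct 5-input lookup")
```

The wider cases (6 to 8 inputs) repeat the same trick one level up, which is why each of them instantiates two `\$lut` cells of the next smaller width.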
yosys> stat -top NAME -json NAME_synth_stat.json
https://yosyshq.readthedocs.io/projects/yosys/en/latest/cmd/stat.html
yosys has sta but how to use??