Hardware design with open source tools

EDA with open source tools: yosys + nextpnr.

As part of my master's course project, I generate Verilog designs that are then synthesized using Vivado before being loaded onto FPGAs.

I once tried to see if open source tools can be used to achieve the same.

Spent some time looking up stuff like Skywater-PDK (being a hardware novice, I didn't realize for a while that PDK is for ASIC designs).

Then I came across this article which describes how a verilog design can be synthesised and loaded onto an FPGA. The board used was a Sipeed Tang Nano 9K.

FOSS tools involved in this included:

Tool / Project     Purpose
Yosys              Synthesis
nextpnr            Placement, routing, STA
Project Apicula    Bitstream generation
openFPGALoader     Bitstream loading

I found a Sipeed Tang Nano 9K and tried doing the stuff mentioned in the article, and it worked perfectly!

This blog post is a description of that experience.

The board

First thing to do was to get an FPGA that is open-source friendly (or became so by community effort).

After some time on the internet, I figured there were two options: a Lattice FPGA or a Gowin FPGA. I couldn't find a Lattice FPGA anywhere but could get my hands on a Gowin FPGA. What's more, the board was a Sipeed Tang Nano 9K itself. It helped that Gowin FPGAs are relatively cheap.

Sipeed Tang Nano 9K is a board made by Sipeed, a company from China, and uses a Gowin FPGA. As the 'nano' in its name indicates, Tang Nano 9K is one of the smaller boards offered by Sipeed. It is one of the most cost effective boards with an FPGA available in the market that is suitable for beginners.

This board is powered by a GW1NR-9 FPGA (GW1NR-LV9QN88PC6/I5) made by Gowin, again a Chinese company.
(If it weren't for Chinese companies offering cheap boards, a lot of us wouldn't even see an FPGA.)

Some of the specs of this board:

LUTs                           8640 (LUT4)
Registers / FFs                6480
PLLs                           2
Buttons                        2
LEDs                           6
Crystal oscillator frequency   27MHz
Hard core processor            -NA-
Debugging                      Onboard USB-JTAG, USB-UART

A schematic of Sipeed Tang Nano 9K is available here.

In the example design, we use all 6 LEDs for output.

Gowin offers an IDE of its own, which is free but, like Vivado, still needs a license. Since we are focusing on open source tools, we don't use it.

Flow

Verilog design

We are using a simple 6-bit ring counter as an example. The Verilog code for it is as follows:

module counter
(
    input clk,
    output [5:0] led
);

// 27M cycles for 1 second
localparam WAIT_TIME = 27000000;

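// Cycle counter, initially 0
// 25 bits are needed to be able to count up to 27,000,000
reg [24:0] clockCounter = 0;

// Indicates currently active LED (one bit per LED)
// Starts at 0, so no LED is selected until the first second is up
reg [5:0] ledCounter = 0;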

always @(posedge clk) begin
    // Step up cycle counter
    clockCounter <= clockCounter + 1;

    if (clockCounter == WAIT_TIME) begin
        // Reset cycle counter once 1s is up
        clockCounter <= 0;

        // Register change in active LED
        if (ledCounter == 0)
          ledCounter <= 1;
        else
          ledCounter <= ledCounter << 1;
    end
end

// Update change in active LED
assign led = ~ledCounter;
endmodule

There is no input other than clock. 6 output signals are activated one by one, one at a time, with a delay of 1 second when operated at 27MHz frequency. These output signals are meant to be mapped to LEDs.

(iverilog can be used for simulation with a test bench or to just play around with the verilog file.)
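Another quick way to play around with the logic, without any hardware or simulator, is a small software model. The following Python sketch is only an illustration: the function names are made up here, and it models one register update per elapsed second instead of every clock cycle. It also works out how many bits a counter needs in order to reach WAIT_TIME.

```python
WAIT_TIME = 27_000_000  # clock cycles per second at 27 MHz


def bits_needed(value: int) -> int:
    """Number of bits needed to hold `value` as an unsigned integer."""
    return max(1, value.bit_length())


def next_led_counter(led_counter: int) -> int:
    """One update of the 6-bit ledCounter register from the Verilog above."""
    if led_counter == 0:
        return 1
    # Shift left and keep only 6 bits, like a reg [5:0] would
    return (led_counter << 1) & 0b111111


def led_outputs(led_counter: int) -> int:
    """Value driven on the led pins (assign led = ~ledCounter)."""
    return ~led_counter & 0b111111


if __name__ == "__main__":
    print("bits needed for WAIT_TIME:", bits_needed(WAIT_TIME))
    state = 0
    for second in range(8):
        print(second, format(state, "06b"), format(led_outputs(state), "06b"))
        state = next_led_counter(state)
```

In this model, the register passes through 0 for one step after the sixth LED before starting over from the first one.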

Synthesis (yosys)

yosys is used to convert the Verilog design into a corresponding netlist containing the data needed for placement and routing.

yosys -p "read_verilog counter.v; synth_gowin -top counter -json counter.json"

This tells yosys to read the Verilog file, run synthesis targeting a Gowin FPGA with the top module named counter, and write the results into a file named counter.json. yosys will perform optimizations, techmap, etc.

The output of synthesis is a netlist. The contents of counter.json are a form of netlist.

We can use the synth_gowin command here thanks to yosys providing support for gowin FPGAs out of the box. synth_gowin is actually short for a bunch of yosys commands. I've included these commands in the addendum of this post.

The counter.json will contain information like mapping of blocks to FPGA components.
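To get a feel for what is inside such a file, it can be loaded with a few lines of Python. This is a minimal sketch, assuming the usual layout of the JSON written by yosys: a top-level "modules" object, with each module having a "cells" object whose entries carry a "type". The helper names and the tiny sample netlist are made up for illustration.

```python
import json
from collections import Counter


def count_cell_types(netlist: dict, module: str) -> Counter:
    """Count how many cells of each type one module of the netlist has."""
    cells = netlist["modules"][module]["cells"]
    return Counter(cell["type"] for cell in cells.values())


def load_netlist(path: str) -> dict:
    """Read a JSON netlist (for example counter.json) from disk."""
    with open(path) as f:
        return json.load(f)


# Tiny made-up fragment, only to show the expected shape of the data
SAMPLE = {
    "modules": {
        "counter": {
            "cells": {
                "ff0": {"type": "DFFR"},
                "ff1": {"type": "DFFR"},
                "lut0": {"type": "LUT4"},
            }
        }
    }
}

if __name__ == "__main__":
    for cell_type, count in sorted(count_cell_types(SAMPLE, "counter").items()):
        print(cell_type, count)
```

With the real file, `count_cell_types(load_netlist("counter.json"), "counter")` should give per-type counts comparable to the cell section of the report printed by yosys.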

The synth_gowin command includes the stat command of yosys, which prints resource utilization data like the number of registers needed.

A sample report looks like this:

=== counter ===

   Number of wires:                 78
   Number of wire bits:            127
   Number of public wires:          78
   Number of public wire bits:     127
   Number of memories:               0
   Number of memory bits:            0
   Number of processes:              0
   Number of cells:                 87
     ALU                            30
     DFFE                            6
     DFFR                           24
     GND                             1
     IBUF                            1
     LUT1                            9
     LUT4                            7
     MUX2_LUT5                       2
     OBUF                            6
     VCC                             1

I didn't know what many of these abbreviations meant, so I looked some of them up in a Gowin user guide.

Placement and routing (nextpnr)

nextpnr is used to perform placement and routing after synthesis based on the information generated by yosys. It offers a gowin-specific command: nextpnr-gowin.

# --json    info from yosys
# --freq    desired frequency in MHz
# --write   output file
# --device  target FPGA info
# --family  target FPGA family
# --cst     physical constraints file
nextpnr-gowin \
  --json counter.json \
  --freq 27 \
  --write counter_pnr.json \
  --device GW1NR-LV9QN88PC6/I5 \
  --family GW1N-9C \
  --cst tangnano9k.cst

The constraints file says which ports of the design are mapped to which pins. In our case, its contents are as follows:

IO_LOC "clk" 52;
IO_PORT "clk" PULL_MODE=UP;
IO_LOC "led[0]" 10;
IO_LOC "led[1]" 11;
IO_LOC "led[2]" 13;
IO_LOC "led[3]" 14;
IO_LOC "led[4]" 15;
IO_LOC "led[5]" 16;

which specifies the ports for clock and the LEDs.

The pin numbers are available from the schematic of the Sipeed Tang Nano 9K board. I could find such a schematic here.
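For designs with more pins, writing the constraints by hand gets tedious. The following Python sketch generates the same kind of file from a pin list. The function name is made up for illustration, and it only covers the two statement forms used above (IO_LOC, plus IO_PORT with PULL_MODE=UP for the clock).

```python
from typing import Sequence


def make_cst(clk_pin: int, led_pins: Sequence[int]) -> str:
    """Build constraints text for one clock input and a list of LED pins."""
    lines = [
        f'IO_LOC "clk" {clk_pin};',
        'IO_PORT "clk" PULL_MODE=UP;',
    ]
    # One IO_LOC line per LED, indexed like the led[...] output port
    for index, pin in enumerate(led_pins):
        lines.append(f'IO_LOC "led[{index}]" {pin};')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Pin numbers taken from the constraints file shown above
    print(make_cst(52, [10, 11, 13, 14, 15, 16]), end="")
```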

The above nextpnr command writes the information about the placed and routed design to a file named counter_pnr.json. The contents of this file are a netlist with placement and routing info.

First, nextpnr produces an FPGA-specific utilization report:

Info: Device utilisation:
Info:                    VCC:     1/    1   100%
Info:                  SLICE:    79/ 8640     0%
Info:                    IOB:     7/  274     2%
Info:                 OSER16:     0/   38     0%
Info:                 IDES16:     0/   38     0%
Info:                IOLOGIC:     0/  296     0%
Info:              MUX2_LUT5:     2/ 4320     0%
Info:              MUX2_LUT6:     0/ 2160     0%
Info:              MUX2_LUT7:     0/ 1080     0%
Info:              MUX2_LUT8:     0/ 1056     0%
Info:                    GND:     1/    1   100%
Info:                   RAMW:     0/  270     0%
Info:                    OSC:     0/    1     0%
Info:                   rPLL:     0/    2     0%

Looks like it needed only two 5-input LUTs. I guess the 7 IOBs are the 6 LEDs and the input clock. But I'm not sure how the number of slices comes to 79.

Abbreviations:

(Couldn't find what RAMW means..)

nextpnr is 'timing driven', i.e., it does some form of static timing analysis by itself. Every path in the netlist that starts from a FF and ends at another FF is analysed.

(nextpnr can be asked to output a report in JSON format with --report.)

The nextpnr architecture is something like this (this is an ascii-art version of an image from here):

+----------+     +--------+
|  JSON    |-->--| Packer |
|front end |     +--------+
+----------+         |          +--------+
                     v          | Timing |
                     |          |  model |
                     |          +--------+
                 +--------+         |  
         +--->---| Placer |--<--+   v
         |       +--------+     |   |
     +------+        |       +----------+
     | Chip |        v       |  Timing  |
     | data |        |       | analysis |
     +------+        |       +----------+
         |       +--------+     |        
         +--->---| Router |--<--+  
                 +--------+
                     |
                     v

where the timing model, chip data and packer vary with the target board.

I used to think that timing analysis cannot be done before routing, but it turns out that isn't the case. My wild guess is that paths can be inferred from the netlist that yosys generated, since it specifies the connections in some form anyway, with the wire delays estimated from where the cells were placed. The separate routing process can then create more efficient paths specific to the target board.

This was the slack histogram that nextpnr gave post-placement but pre-routing:

Info: Max frequency for clock 'clk_IBUF_I_O': 316.86 MHz (PASS at 27.00 MHz)

Info: Max delay posedge clk_IBUF_I_O -> <async>: 6.69 ns

Info: Slack histogram:
Info:  legend: * represents 1 endpoint(s)
Info:          + represents [1,1) endpoint(s)
Info: [ 33881,  33996) |***************************
Info: [ 33996,  34111) |
Info: [ 34111,  34226) |
Info: [ 34226,  34341) |
Info: [ 34341,  34456) |
Info: [ 34456,  34571) |
Info: [ 34571,  34686) |*
Info: [ 34686,  34801) |*
Info: [ 34801,  34916) |
Info: [ 34916,  35031) |
Info: [ 35031,  35146) |
Info: [ 35146,  35261) |
Info: [ 35261,  35376) |
Info: [ 35376,  35491) |
Info: [ 35491,  35606) |
Info: [ 35606,  35721) |
Info: [ 35721,  35836) |
Info: [ 35836,  35951) |
Info: [ 35951,  36066) |
Info: [ 36066,  36181) |*
Info: Checksum: 0xb9798815

The slack histogram groups slacks (in ps) into ranges and shows how many endpoints fall in each range.

In the above histogram, there are 27 asterisks next to [ 33881, 33996), which means that there are 27 endpoints whose slack is greater than or equal to 33.881ns but less than 33.996ns.

The histogram would show negative values if there was negative slack. No negative slack. Yay!
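Reading the histogram can also be done mechanically. Here is a small Python sketch that pulls the bins out of the text printed by nextpnr. It assumes each '*' stands for exactly one endpoint, as the legend above says for this run; the function names are made up, and the sample holds only a few of the lines shown above.

```python
import re

# Matches lines like "Info: [ 33881,  33996) |***"
BIN_RE = re.compile(r"\[\s*(-?\d+),\s*(-?\d+)\)\s*\|([*+]*)")


def parse_histogram(text: str):
    """Return a list of (low_ps, high_ps, endpoint_count) tuples."""
    bins = []
    for line in text.splitlines():
        match = BIN_RE.search(line)
        if match:
            low, high, bar = match.groups()
            bins.append((int(low), int(high), len(bar)))
    return bins


def total_endpoints(bins) -> int:
    return sum(count for _, _, count in bins)


def worst_bin(bins):
    """First non-empty bin, i.e. the one holding the smallest slack."""
    return min((b for b in bins if b[2] > 0), key=lambda b: b[0])


SAMPLE = "\n".join([
    "Info:  legend: * represents 1 endpoint(s)",
    "Info:          + represents [1,1) endpoint(s)",
    "Info: [ 33881,  33996) |" + "*" * 27,
    "Info: [ 33996,  34111) |",
    "Info: [ 34571,  34686) |*",
    "Info: [ 34686,  34801) |*",
    "Info: [ 36066,  36181) |*",
])

if __name__ == "__main__":
    bins = parse_histogram(SAMPLE)
    print("endpoints:", total_endpoints(bins))
    print("worst bin:", worst_bin(bins))
```

A negative lower bound in the worst bin would mean that at least one endpoint misses the requested frequency.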

After doing placement and associated timing analysis, nextpnr will do routing.

As it tries various routes, it will show what it's up to by printing stuff like this:

Info: Routing..
Info: Setting up routing queue.
Info: Routing 225 arcs.
Info:            |   (re-)routed arcs  |   delta    | remaining|       time spent     |
Info:    IterCnt |  w/ripup   wo/ripup |  w/r  wo/r |      arcs| batch(sec) total(sec)|
Info:        426 |      200        226 |  200   226 |         0|       1.82       1.82|
Info: Routing complete.
Info: Router1 time 1.82s
Info: Checksum: 0x756ec238

('ripup and reroute' is the name of a class of routing algorithms.)

An arc is a 'source-sink pair on a net' or a 'directed connection between two nodes'. Looks like there were 225 of them in the netlist and nextpnr spent 1.82 seconds performing 426 iterations.

Once routing is done, nextpnr will perform timing analysis once again.

Info: Critical path report for cross-domain path 'posedge clk_IBUF_I_O' -> '<async>':
Info: curr total
Info:  0.5  0.5  Source ledCounter_DFFE_Q_DFFLC.Q
Info:  1.9  2.4    Net ledCounter[5] (3,17) -> (1,22)
Info:                Sink ledCounter_LUT1_I0_3_LC.A
Info:                Defined in:
Info:                  counter.v:8.11-8.21
Info:  1.0  3.4  Source ledCounter_LUT1_I0_3_LC.F
Info:  1.4  4.8    Net led_OBUF_O_I[5] (1,22) -> (0,25)
Info:                Sink led_OBUF_O$iob.I
Info: 1.5 ns logic, 3.4 ns routing

Info: Max frequency for clock 'clk_IBUF_I_O': 315.86 MHz (PASS at 27.00 MHz)

Info: Max delay posedge clk_IBUF_I_O -> <async>: 4.84 ns

Info: Slack histogram:
Info:  legend: * represents 1 endpoint(s)
Info:          + represents [1,1) endpoint(s)
Info: [ 33871,  33991) |*
Info: [ 33991,  34111) |
Info: [ 34111,  34231) |*
Info: [ 34231,  34351) |******
Info: [ 34351,  34471) |
Info: [ 34471,  34591) |*********
Info: [ 34591,  34711) |**********
Info: [ 34711,  34831) |*
Info: [ 34831,  34951) |*
Info: [ 34951,  35071) |
Info: [ 35071,  35191) |
Info: [ 35191,  35311) |
Info: [ 35311,  35431) |
Info: [ 35431,  35551) |
Info: [ 35551,  35671) |
Info: [ 35671,  35791) |
Info: [ 35791,  35911) |
Info: [ 35911,  36031) |
Info: [ 36031,  36151) |
Info: [ 36151,  36271) |*
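The critical path report at the top of this output can be added up by hand. The Python sketch below does that with the four delay rows copied from the report. Treating 'Source' rows as logic delay and 'Net' rows as routing delay is my assumption about how the report is organised, and the names used are made up.

```python
# (kind, delay in ns, where) copied from the critical path report above
ROWS = [
    ("logic", 0.5, "Source ledCounter_DFFE_Q_DFFLC.Q"),
    ("routing", 1.9, "Net ledCounter[5]"),
    ("logic", 1.0, "Source ledCounter_LUT1_I0_3_LC.F"),
    ("routing", 1.4, "Net led_OBUF_O_I[5]"),
]


def summarize(rows):
    """Return (logic_ns, routing_ns, total_ns) for a list of path rows."""
    logic = sum(delay for kind, delay, _ in rows if kind == "logic")
    routing = sum(delay for kind, delay, _ in rows if kind == "routing")
    return logic, routing, logic + routing


if __name__ == "__main__":
    logic, routing, total = summarize(ROWS)
    print(f"logic {logic:.1f} ns, routing {routing:.1f} ns, total {total:.1f} ns")
```

The per-row values are printed with one decimal place, so the sums come out as 1.5 ns logic, 3.3 ns routing and 4.8 ns in total. The report's own summary says 3.4 ns routing and a max delay of 4.84 ns, presumably because it adds up the unrounded values.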

Hmm.. Not sure if I'm reading this correctly, but from the histograms alone, it looks as if total slack actually went up after routing…

Still the value shown as 'Max delay posedge clk_IBUF_I_O' has gone down to 4.84ns from 6.69ns.

Yet the max frequency of the clock went down a bit to 315.86MHz from 316.86MHz.

We had requested the design to be run at 27MHz. nextpnr figures it can be run even at 315.86MHz (Fmax), so we are good in that aspect.
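The Fmax values and the histograms can be tied together with a bit of arithmetic. The sketch below assumes that the worst slack is simply the requested clock period minus the period at Fmax; that is my reading of the numbers and not something taken from the nextpnr documentation. The function names are made up.

```python
def period_ps(freq_mhz: float) -> float:
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1_000_000.0 / freq_mhz


def worst_slack_ps(target_mhz: float, fmax_mhz: float) -> float:
    """Slack of the slowest path: requested period minus period at Fmax."""
    return period_ps(target_mhz) - period_ps(fmax_mhz)


if __name__ == "__main__":
    for label, fmax in (("pre-routing", 316.86), ("post-routing", 315.86)):
        print(label, round(worst_slack_ps(27.0, fmax)), "ps")
```

Under that assumption, the worst slack works out to about 33881 ps before routing and 33871 ps after routing, which are the lower bounds of the first bins of the two histograms. So the worst slack got about 10 ps smaller after routing, which is consistent with Fmax dropping slightly, even though most of the other endpoints moved to bins with more slack.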

I have not yet figured out how the histogram is meant to be read. This histogram is not that interesting since it is a tiny design. A histogram for a less trivial design is shown here.

nextpnr also has a GUI, which is not part of the default installation, that can provide visualizations of placed and routed nets. I have not used it yet, but it looks nice.

Bitstream generation (Project Apicula)

Now that we have all the information needed for a loadable FPGA design, we convert the data into a bitstream that can then actually be loaded onto the FPGA.

For this we need to know the bitstream format used by the FPGA. Usually FPGA vendors are not enthusiastic about revealing this information to the public.

(Found a hackernews discussion on the topic which was in response to this post suggesting that there's little point in keeping bitstream format secret.)

There have been many attempts to document the bitstream formats of different architectures over the years.

Examples include:

Project apicula is an effort that successfully managed to figure out the bitstream format of a class of gowin FPGAs that includes the one used by Sipeed Tang Nano 9K, which is what we are using.

The tools for apicula can be obtained by installing a python package called apycula.

pip3 install apycula

apycula offers the gowin_pack command to generate the bitstream. In our case, it can be used like:

# -d  target FPGA family
# -o  output file with bitstream
# last argument: input file with info from nextpnr
gowin_pack \
  -d GW1N-9C \
  -o counter.fs \
  counter_pnr.json

The counter.fs file is the generated bitstream.

(I tried opening this file in a text editor. I was expecting gibberish-like strings, but it literally showed up as a bunch of 0s and 1s.)
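Since the file is plain text, it is easy to poke at from a script. The following Python sketch summarizes such a file. It assumes that every non-empty line is either a row made up of '0' and '1' characters or a comment starting with '//'; that matches what I saw, but I have not checked it against a format description. The function name and the sample content are made up.

```python
def summarize_fs(text: str) -> dict:
    """Count comment lines, bit rows and bits in a textual bitstream."""
    comment_lines = 0
    bit_lines = 0
    total_bits = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("//"):
            comment_lines += 1
        elif set(line) <= {"0", "1"}:
            bit_lines += 1
            total_bits += len(line)
        else:
            raise ValueError(f"unexpected line: {line[:20]!r}")
    return {
        "comment_lines": comment_lines,
        "bit_lines": bit_lines,
        "total_bits": total_bits,
    }


# Made-up content, only to show the kind of input that is expected
SAMPLE = "// made-up header\n10101010\n11110000\n"

if __name__ == "__main__":
    print(summarize_fs(SAMPLE))
```

For the real file, the text can be read with `open("counter.fs").read()` and passed to the same function.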

Loading bitstream to FPGA (openFPGALoader)

Once we have the bitstream file, we can load it onto our FPGA with openFPGALoader.

# -b  board name
# -f  bitstream to be written to flash
openFPGALoader \
  -b tangnano9k \
  -f counter.fs

But before this command can work, something needs to be done in order for openFPGALoader to be able to detect the FPGA connected to the computer.

Otherwise we can get errors like this:

$ openFPGALoader -b tangnano9k -f counter.fs
empty
write to flash
unable to open ftdi device: -3 (device not found)
JTAG init failed with: unable to open ftdi device


$ ls /dev/ttyUSB*
/dev/ttyUSB0  /dev/ttyUSB1

The developer of openFPGALoader has a blog post on getting around this error.

What we could do is to use udev rules:

$ sudo cp 99-openfpgaloader.rules /etc/udev/rules.d/

$ ls /etc/udev/rules.d/
52-xilinx-digilent-usb.rules  52-xilinx-ftdi-usb.rules  52-xilinx-pcusb.rules  99-openfpgaloader.rules

$ sudo udevadm control --reload-rules
$ sudo udevadm trigger
$ sudo usermod -a $USER -G plugdev

The rules file (99-openfpgaloader.rules) can be obtained from here.

Once that's done, it should be smooth sailing. We can flash the FPGA with:

$ openFPGALoader -b tangnano9k -f counter.fs
empty
write to flash
Jtag frequency : requested 6.00MHz   -> real 6.00MHz
Parse file Parse counter.fs:
Done
DONE
Jtag frequency : requested 2.50MHz   -> real 2.00MHz
Erase SRAM DONE
Erase FLASH DONE
Erasing FLASH: [==================================================] 100.00%
Done
write Flash: [==================================================] 100.00%
Done
CRC check : FAIL
Read: 0x0000431b checksum: 0xb4bb

I still have not figured out why the 'CRC check : FAIL' is showing up, but it does not seem to prevent the bitstream from being loaded onto the FPGA.

Once the FPGA is powered on, we can see the LEDs light up one by one, one at a time.

:-)

Conclusion

The 6-bit ring counter design we used is tiny and simple. One has to try larger designs to get a better feel for how it works.

It is helpful to have a Makefile to run the commands. I made one like:

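NAME=counter
BOARD=tangnano9k
FAMILY=GW1N-9C
DEVICE=GW1NR-LV9QN88PC6/I5
FREQ=27

.PHONY: all synth pnr bits load

all: $(NAME).fs

# Short names for the individual steps
synth: $(NAME).json
pnr: $(NAME)_pnr.json
bits: $(NAME).fs

# Synthesis
$(NAME).json: $(NAME).v
	yosys -p "read_verilog $(NAME).v; synth_gowin -top $(NAME) -json $(NAME).json"

# Placement and routing
$(NAME)_pnr.json: $(NAME).json $(BOARD).cst
	nextpnr-gowin --json $(NAME).json --freq $(FREQ) --write $(NAME)_pnr.json \
		--device $(DEVICE) --family $(FAMILY) --cst $(BOARD).cst

# Bitstream generation
$(NAME).fs: $(NAME)_pnr.json
	gowin_pack -d $(FAMILY) -o $(NAME).fs $(NAME)_pnr.json

# Bitstream loading
load: $(NAME).fs
	openFPGALoader -b $(BOARD) -f $(NAME).fs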

Versions of software used:

Also see: Yosys+nextpnr: an Open Source Framework from Verilog to Bitstream for Commercial FPGAs - David Shah, Eddie Hung, Clifford Wolf, Serge Bazanski, Dan Gisselquist, Miodrag Milanović (2019)

Addendum

Commands corresponding to synth_gowin of yosys

As mentioned in the manual, synth_gowin stands for:

# yosys> help synth_gowin

    begin:
        read_verilog -specify -lib +/gowin/cells_sim.v
        read_verilog -specify -lib +/gowin/cells_xtra.v
        hierarchy -check -top <top>

    flatten:    (unless -noflatten)
        proc
        flatten
        tribuf -logic
        deminout

    coarse:
        synth -run coarse [-no-rw-check]

    map_ram:
        memory_libmap -lib +/gowin/lutrams.txt -lib +/gowin/brams.txt [-no-auto-block] [-no-auto-distributed]    (-no-auto-block if -nobram, -no-auto-distributed if -nolutram)
        techmap -map +/gowin/lutrams_map.v -map +/gowin/brams_map.v

    map_ffram:
        opt -fast -mux_undef -undriven -fine
        memory_map
        opt -undriven -fine

    map_gates:
        techmap -map +/techmap.v -map +/gowin/arith_map.v
        opt -fast
        abc -dff -D 1    (only if -retime)
        iopadmap -bits -inpad IBUF O:I -outpad OBUF I:O -toutpad TBUF ~OEN:I:O -tinoutpad IOBUF ~OEN:O:I:IO    (unless -noiopads)

    map_ffs:
        opt_clean
        dfflegalize -cell $_DFF_?_ 0 -cell $_DFFE_?P_ 0 -cell $_SDFF_?P?_ r -cell $_SDFFE_?P?P_ r -cell $_DFF_?P?_ r -cell $_DFFE_?P?P_ r
        techmap -map +/gowin/cells_map.v
        opt_expr -mux_undef
        simplemap

    map_luts:
        read_verilog -icells -lib -specify +/abc9_model.v
        abc9 -maxlut 8 -W 500
        clean

    map_cells:
        techmap -map +/gowin/cells_map.v
        opt_lut_ins -tech gowin
        setundef -undriven -params -zero
        hilomap -singleton -hicell VCC V -locell GND G
        splitnets -ports    (only if -vout used)
        clean
        autoname

    check:
        hierarchy -check
        stat
        check -noinit
        blackbox =A:whitebox

    vout:
        write_verilog -simple-lhs -decimal -attr2comment -defparam -renameprefix gen <file-name>
        write_json <file-name>

Techmap stuff

I guess files like the one at /usr/share/yosys/gowin/cells_map.v are used for techmap.

Its contents were like:

// DFFR          D Flip-Flop with Synchronous Reset
module  \$_SDFF_PP0_ (input D, C, R, output Q);
    DFFR _TECHMAP_REPLACE_ (.D(D), .Q(Q), .CLK(C), .RESET(R));
    wire _TECHMAP_REMOVEINIT_Q_ = 1;
endmodule

// ..
// ..

module \$lut (A, Y);
    parameter WIDTH = 0;
    parameter LUT = 0;

    (* force_downto *)
    input [WIDTH-1:0] A;
    output Y;

    generate
        if (WIDTH == 1) begin
            LUT1 #(.INIT(LUT)) _TECHMAP_REPLACE_ (.F(Y),
                .I0(A[0]));
        end else
        if (WIDTH == 2) begin
            LUT2 #(.INIT(LUT)) _TECHMAP_REPLACE_ (.F(Y),
                .I0(A[0]), .I1(A[1]));
        end else
        if (WIDTH == 3) begin
            LUT3 #(.INIT(LUT)) _TECHMAP_REPLACE_ (.F(Y),
                .I0(A[0]), .I1(A[1]), .I2(A[2]));
        end else
        if (WIDTH == 4) begin
            LUT4 #(.INIT(LUT)) _TECHMAP_REPLACE_ (.F(Y),
                .I0(A[0]), .I1(A[1]), .I2(A[2]), .I3(A[3]));
        end else
        if (WIDTH == 5) begin
            wire f0, f1;
            \$lut #(.LUT(LUT[15: 0]), .WIDTH(4)) lut0 (.A(A[3:0]), .Y(f0));
            \$lut #(.LUT(LUT[31:16]), .WIDTH(4)) lut1 (.A(A[3:0]), .Y(f1));
            MUX2_LUT5 mux5(.I0(f0), .I1(f1), .S0(A[4]), .O(Y));
        end else
        if (WIDTH == 6) begin
            wire f0, f1;
            \$lut #(.LUT(LUT[31: 0]), .WIDTH(5)) lut0 (.A(A[4:0]), .Y(f0));
            \$lut #(.LUT(LUT[63:32]), .WIDTH(5)) lut1 (.A(A[4:0]), .Y(f1));
            MUX2_LUT6 mux6(.I0(f0), .I1(f1), .S0(A[5]), .O(Y));
        end else
        if (WIDTH == 7) begin
            wire f0, f1;
            \$lut #(.LUT(LUT[63: 0]), .WIDTH(6)) lut0 (.A(A[5:0]), .Y(f0));
            \$lut #(.LUT(LUT[127:64]), .WIDTH(6)) lut1 (.A(A[5:0]), .Y(f1));
            MUX2_LUT7 mux7(.I0(f0), .I1(f1), .S0(A[6]), .O(Y));
        end else
        if (WIDTH == 8) begin
            wire f0, f1;
            \$lut #(.LUT(LUT[127: 0]), .WIDTH(7)) lut0 (.A(A[6:0]), .Y(f0));
            \$lut #(.LUT(LUT[255:128]), .WIDTH(7)) lut1 (.A(A[6:0]), .Y(f1));
            MUX2_LUT8 mux8(.I0(f0), .I1(f1), .S0(A[7]), .O(Y));
        end else begin
            wire _TECHMAP_FAIL_ = 1;
        end
    endgenerate
endmodule
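The WIDTH == 5 case above splits a 5-input LUT into two 4-input LUTs and a mux. To convince myself that this gives the same function, here is a small Python model of that step. It assumes that a LUT outputs the bit of its INIT value selected by its inputs, and that MUX2_LUT5 passes I1 when S0 is 1 and I0 otherwise. The function names are made up for illustration.

```python
def lut_eval(init: int, width: int, a: int) -> int:
    """Output of a `width`-input LUT: bit number `a` of its INIT value."""
    if not 0 <= a < (1 << width):
        raise ValueError("input value out of range")
    return (init >> a) & 1


def mux2(i0: int, i1: int, s0: int) -> int:
    """2-to-1 mux: i1 when s0 is 1, otherwise i0."""
    return i1 if s0 else i0


def lut5_via_lut4(init: int, a: int) -> int:
    """5-input LUT built from two LUT4s and a mux, as in the techmap rule."""
    low_half = init & 0xFFFF            # LUT[15:0]
    high_half = (init >> 16) & 0xFFFF   # LUT[31:16]
    f0 = lut_eval(low_half, 4, a & 0xF)
    f1 = lut_eval(high_half, 4, a & 0xF)
    return mux2(f0, f1, (a >> 4) & 1)   # A[4] selects between the halves


if __name__ == "__main__":
    init = 0xDEADBEEF  # arbitrary 32-bit truth table
    same = all(lut5_via_lut4(init, a) == lut_eval(init, 5, a) for a in range(32))
    print("decomposition matches direct lookup:", same)
```

The wider cases (WIDTH 6 to 8) repeat the same idea, each time splitting the truth table in half and letting the top input bit pick a half.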

Writing synth stat to file

yosys> tee -q -o NAME_synth_stat.json stat -top NAME -json

https://yosyshq.readthedocs.io/projects/yosys/en/latest/cmd/stat.html

Static timing analysis

yosys has an sta command, but I have not yet figured out how to use it.