Skip to content
This repository has been archived by the owner on Aug 20, 2024. It is now read-only.

Add -fpga flag to enable FPGA-oriented compilation strategies (currently for memories) #2111

Merged
merged 10 commits into from
Apr 5, 2021

Conversation

albert-magyar
Copy link
Contributor

@albert-magyar albert-magyar commented Mar 11, 2021

Type of change: enhancement
API impact: API addition + make an internal method private (can keep public in backport).
Backend code-generation impact: More readwrite inference. Also, significant changes to emitted Verilog iff new flags set.
Desired merge strategy: merge

Release notes (WIP):
FirrtlStage now supports new flags to incorporate compilation strategies intended to better integrate with downstream FPGA tools. When the --target:fpga flag is used, the Verilog emitted for synchronous-read memories will change significantly. However, the default flow will produce Verilog that is identical to head-of-tree before this PR, aside from the purely syntactic change of using more net-declaration assignments.

The existing, optional InferReadWrite pass may now infer readwrite ports from read/write port pairs that share the same clock and address for undefined read-under-write synchronous-read memories. Previously recognized cases with mutually exclusive read and write enables will continue to be combined as before.

Commentary:

(This is a draft; it still needs documentation and more tests.)

After the discussion on #2092 and #2094, I am opening this to subsume both of those PRs and a few related changes. This PR introduces a number of optional compiler strategies that are all enabled by the --target:fpga flag:

  1. (From RFC: Emit sync-read mems intact, with readwrite ports if applicable #2092) Allowing some synchronous-read memories and readwrite ports to pass through VerilogMemDelays without introducing explicit pipeline registers or splitting ports.
  2. (From Add SimplifyMems transform to lower memories without splitting #1111) Lower aggregate-typed memories with always-high masks to packed memories without splitting.
  3. (From Improve QoR when threading models with SyncReadMems firesim/firesim#639) Specify that memories with undefined read-under-write behavior should map to microarchitectures characteristic of "read-first" ports by default. This could be further modified by adding heuristics of when a memory is worth changing, but read-first ports provide a good path to Xilinx BRAMs.
  4. Enable InferReadWrite (which also receives some enhancements; see below).

Most of these are effectively vendor-neutral, as they simply avoid the elaborate and inference-unfriendly permutation of FIRRTL memories and work around the behavioral memory API in Chisel. While the strong preference for read-first ports in (3) is based largely on experience with Xilinx FPGAs, the extremely strict semantics of write-first FIRRTL memories are also not the simplest or most generic pattern for inference to any underlying macro.

Furthermore, these are combined with some other enhancements. The VerilogEmitter is now capable of emitting simple synchronous-read memories intact; however, this will never happen with the default flow, as VerilogMemDelays retains its previous, conservative behavior without the --target:fpga flag. Finally, when InferReadWrite is enabled, some extra matching-address port pairs (from #2094) will be combined. One open question is whether this enhancement should be tied to the --target:fpga flag or made available by default as part of --infer-rw.

Milestone:
This will backport cleanly to 1.4.x with a couple of minor changes. Notably, this sets up a deprecation for a future change of VerilogMemDelays from a Pass to a Transform so that it can look at annotations. For now, it uses an override of execute and a deprecation warning. Since this flow is generally made up of optional features, the consensus in the meeting was that this would make sense as a 1.4.x milestone if there is general support to add these features as options.

Contributor Checklist

  • Did you add Scaladoc to every public function/method?
  • Did you update the FIRRTL spec to include every new feature/behavior?
  • Did you add at least one test demonstrating the PR?
  • Did you delete any extraneous printlns/debugging code?
  • Did you specify the type of improvement?
  • Did you state the API impact?
  • Did you specify the code generation impact?
  • Did you request a desired merge strategy?
  • Did you add text to be included in the Release Notes for this change?

Reviewer Checklist (only modified by reviewer)

  • Did you add the appropriate labels?
  • Did you mark the proper milestone (1.2.x, 1.3.0, 1.4.0) ?
  • Did you review?
  • Did you check whether all relevant Contributor checkboxes have been checked?
  • Did you mark as Please Merge?

Copy link
Contributor

@jackkoenig jackkoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started reviewing before I realized this was a draft, looks good though!

@ekiwi
Copy link
Contributor

ekiwi commented Mar 11, 2021

@albert-magyar, could you add some HighFirrtl circuits which would synthesize differently with your new fpga target?
If you can deliver those examples, I would try to write a test that runs yosys for synthesis and checks the results.
While that does not guarantee that things would work with Vivado, we already have yosys on our CI image, so using yosys would be the most straight forward way to make an end-to-end check.

@albert-magyar
Copy link
Contributor Author

@albert-magyar, could you add some HighFirrtl circuits which would synthesize differently with your new fpga target?
If you can deliver those examples, I would try to write a test that runs yosys for synthesis and checks the results.
While that does not guarantee that things would work with Vivado, we already have yosys on our CI image, so using yosys would be the most straight forward way to make an end-to-end check.

Here is an example that stresses all of the improvements. With the FPGA flag enabled, this will emit a 1024x32b, 2-RW, synchronous-read memory. You will need to target an FPGA architecture that provides underlying memories that can be configured to support two RW ports, so I would recommend starting with the 7-series architecture if you're going to try with Yosys.

circuit syncreadmem:
  module syncreadmem:
    input clk: Clock
    input addr_a: UInt<10>
    input we_a: UInt<1>
    input wdata_a: UInt<8>[4]
    output rdata_a: UInt<8>[4]
    input addr_b: UInt<10>
    input we_b: UInt<1>
    input wdata_b: UInt<8>[4]
    output rdata_b: UInt<8>[4]

    mem m:
      data-type => UInt<8>[4]
      depth => 1024
      reader => r_a
      writer => w_a
      reader => r_b
      writer => w_b
      read-latency => 1
      write-latency => 1
      read-under-write => undefined

    m.r_a.clk <= clk
    m.r_a.addr <= addr_a
    m.r_a.en <= UInt(1)
    rdata_a <= m.r_a.data

    m.w_a.clk <= clk
    m.w_a.addr <= addr_a
    m.w_a.en <= we_a
    m.w_a.mask[0] <= UInt(1)
    m.w_a.mask[1] <= UInt(1)
    m.w_a.mask[2] <= UInt(1)
    m.w_a.mask[3] <= UInt(1)
    m.w_a.data <= wdata_a

    m.r_b.clk <= clk
    m.r_b.addr <= addr_b
    m.r_b.en <= UInt(1)
    rdata_b <= m.r_b.data

    m.w_b.clk <= clk
    m.w_b.addr <= addr_b
    m.w_b.en <= we_b
    m.w_b.mask[0] <= UInt(1)
    m.w_b.mask[1] <= UInt(1)
    m.w_b.mask[2] <= UInt(1)
    m.w_b.mask[3] <= UInt(1)
    m.w_b.data <= wdata_b

@albert-magyar albert-magyar force-pushed the fpga-backend branch 3 times, most recently from 47544e4 to df5c997 Compare March 11, 2021 20:37
@ekiwi
Copy link
Contributor

ekiwi commented Mar 11, 2021

Thanks for the example @albert-magyar.
Turns out, yosys might not actually be able to infer combined R/W ports atm: chipsalliance/chisel#1788 (comment)

@albert-magyar
Copy link
Contributor Author

Thanks for the example @albert-magyar.
Turns out, yosys might not actually be able to infer combined R/W ports atm: chipsalliance/chisel3#1788 (comment)

Thanks for looking into this regardless. It is always nice to try to make this open-source "CI-able."

@carlosedp
Copy link
Contributor

@albert-magyar we found another case where FPGA and ASIC generation differs. Reading memory files in SYNTHESYS block.
Take a look at #2114

@albert-magyar
Copy link
Contributor Author

@albert-magyar we found another case where FPGA and ASIC generation differs. Reading memory files in SYNTHESYS block.
Take a look at #2114

I agree that readmemX bitstream initialization is an important feature to support. Making that more robust might be a good feature to add to --target:fpga mode in the future should this PR get accepted -- it's already getting quite large as it is.

@ekiwi
Copy link
Contributor

ekiwi commented Mar 12, 2021

I agree that readmemX bitstream initialization is an important feature to support. Making that more robust might be a good feature to add to --target:fpga mode in the future should this PR get accepted -- it's already getting quite large as it is.

I put the whole memory initialization question on the dev meeting agenda for Monday. We will put what ever solution we come up with in a follow up PR.

Copy link
Contributor

@ekiwi ekiwi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be happy to merge this as is. But we should wait for @seldridge or @jackkoenig to also give their OK.

@albert-magyar
Copy link
Contributor Author

One note for discussion/review is that the enhancements to InferReadWrite (merging same-address ports on undefined read-under-write memories) take effect whenever --infer-rw is active and are not tied to the --target:fpga flag. Ultimately, this enhanced RW inference can also be useful for a flow involving --repl-seq-mem and can help work around the vagaries of behavioral Chisel memories.

Also, I am leaving this as a draft for now, since there is still testing of various cases that I would like to do with Vivado. I am not sure when I will have time to do that.

@albert-magyar albert-magyar added the Merge Commit Please This branch has useful history, please merge with a merge commit label Mar 12, 2021
@albert-magyar albert-magyar force-pushed the fpga-backend branch 3 times, most recently from d048bb9 to 19d76d2 Compare March 16, 2021 21:10
@albert-magyar
Copy link
Contributor Author

I updated this with some Scaladoc.

@albert-magyar albert-magyar marked this pull request as ready for review March 18, 2021 04:54
* This is enabled by adding a PassthroughSimpleSyncReadMemsAnnotation
* Can be emitted directly with new changes to the Verilog emitter
* Add some new deprecations to VerilogMemDelays
* Run scalafmt on VerilogMemDelays
* Optionally defines read-under-write behavior for all 'undefined' memories
* Use DefaultReadFirstAnnotation to choose read-first default
* Use DefaultWriteFirstAnnotation to choose write-first default
* Seal DefaultReadUnderWriteAnnotation based on Jack's feedback
* Update test to include both 'old' and 'new' read-under-write values
* Address @ekiwi comments from review
* Change match cases to scalafmt-mandated lined-up style
* Update name of FPGA flag based on Jack's comment
* Add Scaladoc to describe what each constituent transform does
* Add SeparateWriteClocks to --target:fpga
@albert-magyar albert-magyar force-pushed the fpga-backend branch 2 times, most recently from d3bbfca to a95df00 Compare April 5, 2021 19:23
@schoeberl
Copy link

I tried to test it with Intel Quartus, using the following Chisel code:

  val mem = SyncReadMem(1024, UInt(8.W))

  io.rdDataA := mem.read(io.addrA)
  when(io.wrEnaA) {
    mem.write(io.addrA, io.wrDataA)
  }
  io.rdDataB := mem.read(io.addrB)
  when(io.wrEnaB) {
    mem.write(io.addrA, io.wrDataB)
  }

I am using x.5-SNAPSHOT

Without the "--target:fpga" it generates registers, which we expect. With the option enabled it generates invalid Verilog code:

  always @(posedge clock) begin
    if (mem_MPORT_2_en) begin
      mem_MPORT_2_data <= mem[mem_MPORT_2_addr]; // @[Memory.scala 64:24]
    end
  end
  always @(posedge mem_rw_clk) begin
    if (mem_rw_en) begin
      mem_rw_rdata <= mem[mem_rw_addr]; // @[Memory.scala 64:24]
    end
    if (mem_rw_en & (mem_rw_wmode & mem_rw_wmask)) begin
      mem[mem_rw_addr] <= mem_rw_wdata; // @[Memory.scala 64:24]
    end
  end
  always @(posedge mem_rw_0_clk) begin
    if (mem_rw_0_en) begin
      mem_rw_0_rdata <= mem[mem_rw_0_addr]; // @[Memory.scala 64:24]
    end
    if (mem_rw_0_en & (mem_rw_0_wmode & mem_rw_0_wmask)) begin
      mem[mem_rw_0_addr] <= mem_rw_0_wdata; // @[Memory.scala 64:24]
    end
  end

Quartus complains:

Info (12127): Elaborating entity "TrueDualPortMemory" for the top level hierarchy
Warning (10036): Verilog HDL or VHDL warning at TrueDualPortMemory.v(30): object "mem_rw_rdata" assigned a value but never read
Error (10028): Can't resolve multiple constant drivers for net "mem[0][7]" at TrueDualPortMemory.v(68)
Error (10029): Constant driver at TrueDualPortMemory.v(60)
Error (10028): Can't resolve multiple constant drivers for net "mem[0][6]" at TrueDualPortMemory.v(68)
Error (10028): Can't resolve multiple constant drivers for net "mem[0][5]" at TrueDualPortMemory.v(68)
Error (10028): Can't resolve multiple constant drivers for net "mem[0][4]" at TrueDualPortMemory.v(68)

@ekiwi
Copy link
Contributor

ekiwi commented May 10, 2021

Thanks for the bug report @schoeberl !

Could you try to fix the generated Verilog manually so that the memory is correctly inferred by Quartus?

The diff of changes that you had to apply might be helpful in fixing this issue.

@schoeberl
Copy link

OK, I will try. BTW, I am wondering why the generated Verilog code has actually 3 read ports and from 3 possible different clocks. One of them, mem_rw_data, is then not used, as we can see in the warning.

@schoeberl
Copy link

Update: the Verilog error resulted from an error in my Chisel code, as followed:

  io.rdDataB := mem.read(io.addrB)
  when(io.wrEnaB) {
    mem.write(io.addrA, io.wrDataB)
  }

having the wrong address for the second write port. After fixing it, Quartus compiles but does not infer an on-chip memory, but registers. Will try to propose working Verilog code.

@schoeberl
Copy link

Quartus needs Verilog that requests newly written data to be returned on the reading. So following code works:

  always @(posedge mem_rw_clk) begin
    if (mem_rw_en) begin
      mem_rw_rdata <= mem[mem_rw_addr]; // @[Memory.scala 65:24]
    end
    if (mem_rw_en & (mem_rw_wmode & mem_rw_wmask)) begin
      mem[mem_rw_addr] <= mem_rw_wdata; // @[Memory.scala 65:24]
      mem_rw_rdata <= mem_rw_wdata; // @[Memory.scala 65:24]
    end
  end
  always @(posedge mem_rw_0_clk) begin
    if (mem_rw_0_en) begin
      mem_rw_0_rdata <= mem[mem_rw_0_addr]; // @[Memory.scala 65:24]
    end
    if (mem_rw_0_en & (mem_rw_0_wmode & mem_rw_0_wmask)) begin
      mem[mem_rw_0_addr] <= mem_rw_0_wdata; // @[Memory.scala 65:24]
      mem_rw_0_rdata <= mem_rw_0_wdata; // @[Memory.scala 65:24]
    end
  end

The only differences are the following lines added:

      mem_rw_rdata <= mem_rw_wdata; // @[Memory.scala 65:24]

and

      mem_rw_0_rdata <= mem_rw_0_wdata; // @[Memory.scala 65:24]

However, a better style would be to have the read on an else branch:

	always @ (posedge clk)
	begin // Port a
		if (we_a)
		begin
			ram[addr_a] <= data_a;
			q_a <= data_a;
		end
		else
			q_a <= ram[addr_a];
	end

BTW, I am wondering what signal mem_rw_en and mem_rw_0_en should be good for. They are tied to a constant, so no problem. However, if we would use an enable, what would be the result of a read? Just for fun, I connected one of them to an input and Quartus complaint with a simple:

Error (276001): Cannot synthesize dual-port RAM logic "mem"

@ekiwi
Copy link
Contributor

ekiwi commented May 12, 2021

BTW, I am wondering what signal mem_rw_en and mem_rw_0_en should be good for. They are tied to a constant, so no problem. However, if we would use an enable, what would be the result of a read

According to firrtl semantics, when the en on a read port is low, then the data on that port in the current cycle is arbitrary. The idea here is to support some kinds of ASIC SRAMs that allow you to save power by using a read enable pin. For FPGAs we probably just want to ignore the read enable (which is a valid implementation of the firrtl semantics).

@kammoh
Copy link
Contributor

kammoh commented May 12, 2021

At least on Xilinx FPGAs the read enable (as currently supported by firrtl) can be inferred for clock gating BRAMs. I'm also using it as an implementation trick to keep the read data value unchanged (which probably doesn't quite match the firrtl semantics but nonetheless the generated Verilog works fine with Vivado).
I really hope we can keep the current behavior with -fpga!

@ekiwi
Copy link
Contributor

ekiwi commented May 12, 2021

I wish we had a good way to automatically test for regressions in how Quartus/Vivado synthesize our memories.

@schoeberl
Copy link

Some behavior of hardware can simply not be expressed in Verilog or VHDL. One example is undefined behavior on read during write being undefined for the same address. Bad, but that is what HW and HW languages are. Keeping read data unchanged when disabling read enable is, I guess, not a strong property of the language and not so easily achieved in HW. What is the implementation in an ASIC RAM? Using read enable to enable a register update for the read address?

I think we would need to reread the Verilog and VHDL spec for this corner case. For other logic when an update depends on an enable signal this would generate latches. Something we would like to avoid.

@ekiwi
Copy link
Contributor

ekiwi commented May 12, 2021

What is the implementation in an ASIC RAM? Using read enable to enable a register update for the read address?

I have no experience with ASIC tapeouts. I did find this manual for a suite of memory compilers: https://users.ece.cmu.edu/~koopman/ece548/hw/hw5/meml80.pdf On page 17 it describes the interface for a Single-Port Synchronous RAM Generator. The generated cell features a OEN pin:

OEN is used to enable/disable the data output driver.

So if we use this cell from a Chisel design, we could wire the .en pin of the read port to the OEN pin.

@ekiwi
Copy link
Contributor

ekiwi commented May 12, 2021

We could also turn off the memory in cycles where both, the read and the write enable are not active:

Screenshot from 2021-05-12 13-49-38

@ekiwi
Copy link
Contributor

ekiwi commented May 12, 2021

. I'm also using it as an implementation trick to keep the read data value unchanged (which probably doesn't quite match the firrtl semantics but nonetheless the generated Verilog works fine with Vivado).

The behavior of retaining the last read value is definitely not guaranteed by the Chisel semantics.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Backported This PR has been backported to marked stable branch Merge Commit Please This branch has useful history, please merge with a merge commit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants