
Post on 12-Mar-2020


ECE 5745 ASIC Tutorial (new)

The five tutorials on the ECE 5745 website are really for the "old" ECE 5745 ASIC flow. You should still complete those tutorials before starting this one because they go into much more depth on all of the tools and how they fit together. In the lab assignments and project, we will be using the "new" ECE 5745 ASIC flow, which is very similar but has some important differences. For example, we no longer use Synopsys VCS for RTL simulation, and we use a cleaner Makefile setup. We also no longer use gate-level simulation for energy estimation. Instead, we combine the net activity information from RTL simulation with the post-place-and-route gate-level model to estimate energy consumption more rapidly.

PyMTL-Based ECE 5745 ASIC Flow

The following diagram illustrates the PyMTL-based ECE 5745 ASIC toolflow. There are four main steps.

1. We use the PyMTL framework to test, verify, and evaluate the execution time (in cycles) of our design. This part of the flow is exactly the same as ECE 4750. Note that we can write our RTL models in either PyMTL or Verilog. Once we are sure our design is working correctly, we can then start to push the design through the flow. The ASIC flow requires Verilog RTL as an input, so we can use PyMTL's automatic translation tool to translate PyMTL RTL models into Verilog RTL.

2. We use Synopsys Design Compiler (DC) to synthesize our design, which means to transform the Verilog RTL model into a Verilog gate-level netlist where all of the gates are selected from a standard cell library. We need to provide Synopsys DC with higher-level characterization information about our standard cell library. The primary file containing this characterization is a .lib file, which contains information about the logical functionality, timing, and power of each cell.

3. We use Synopsys IC Compiler (ICC) to place-and-route our design, which means to place all of the gates in the gate-level netlist into rows on the chip and then to generate the metal wires that connect all of the gates together. We need to provide Synopsys ICC with lower-level characterization information about our standard cell library. The primary file containing this characterization is a .lef file, which contains information about the dimensions, pin placement, and metal blockages of each cell. Synopsys ICC generates a Milkyway Database which contains the actual layout as well as additional characterization information. Synopsys ICC also generates reports that can be used to more accurately characterize area and timing.

4. We use Synopsys PrimeTime (PT) to perform power analysis of our design. This requires switching activity information for every net in the design (which comes from our Verilog RTL VCD file) and capacitance information for every net in the design (which comes from the Milkyway Database generated by Synopsys ICC). Synopsys PT puts the switching activity, capacitance, clock frequency, and voltage together to estimate the power consumption of every net and thus every module in the design.
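
To make step 4 more concrete, here is a minimal Python sketch of how switching activity, capacitance, voltage, and clock frequency combine in the textbook first-order dynamic power model, P = alpha * C * V^2 * f, summed over nets. This is only an illustration of that formula and not a description of PrimeTime's actual algorithm; the net names, capacitances, and activity factors are made-up values.

```python
# First-order dynamic power model: P_net = alpha * C * V^2 * f
# alpha is the average number of power-consuming (0 -> 1) transitions per cycle.
# All net names and numbers below are hypothetical, for illustration only.

def net_dynamic_power(alpha, cap_farads, vdd_volts, freq_hz):
    """Dynamic switching power (in watts) of a single net."""
    return alpha * cap_farads * vdd_volts ** 2 * freq_hz


def total_dynamic_power(nets, vdd_volts, freq_hz):
    """Sum per-net power over a dict {net_name: (alpha, cap_farads)}."""
    return sum(
        net_dynamic_power(alpha, cap, vdd_volts, freq_hz)
        for alpha, cap in nets.values()
    )


example_nets = {
    "busy_net": (0.50, 2e-15),  # switches often, 2 fF of capacitance
    "idle_net": (0.10, 5e-15),  # switches rarely, 5 fF of capacitance
}

if __name__ == "__main__":
    watts = total_dynamic_power(example_nets, vdd_volts=1.0, freq_hz=1e9)
    print(f"total dynamic power: {watts * 1e6:.2f} uW")
```

The point of the sketch is that a net with small capacitance can still dominate if its activity is high, which is why the flow needs both the VCD file and the extracted capacitances.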

Extensive documentation is provided by Synopsys for Design Compiler, IC Compiler, and PrimeTime. We have organized this documentation and made it available to you on the public course webpage. The username/password was distributed during lecture.

PyMTL-Based Testing, Simulation, Translation

The first step is to source the setup script and clone the tutorial repository from GitHub. We create a bash variable to keep track of the tutorial directory.

 % source setup-ece5745.sh
 % mkdir $HOME/ece5745
 % cd $HOME/ece5745
 % git clone git@github.com:cornell-ece5745/ece5745-tut-asic-new
 % cd ece5745-tut-asic-new
 % TOPDIR=$PWD

We will be pushing the sort unit from the PyMTL tutorial through the ASIC flow. As a reminder, the sort unit takes as input four integers and a valid bit and outputs those same four integers in increasing order with the valid bit. The sort unit is implemented using a three-stage pipelined, bitonic sorting network and the datapath is shown below.
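
As a rough functional reference, the following Python sketch models a three-stage, four-input sorting network built from five min/max units. The exact wiring between stages is an assumption (the datapath figure is not reproduced here), the function names are hypothetical, and the valid bit and pipeline registers are left out; it is meant only to show how min/max units compose into a sorter.

```python
from itertools import permutations


def min_max(a, b):
    """Behavioral model of one min/max unit: returns (min, max)."""
    return (a, b) if a <= b else (b, a)


def sort4(in0, in1, in2, in3):
    """Sort four values in increasing order using five min/max units."""
    # Stage 1: order each pair of inputs.
    a_min, a_max = min_max(in0, in1)
    b_min, b_max = min_max(in2, in3)
    # Stage 2: the smaller of the two minimums is the overall minimum,
    # and the larger of the two maximums is the overall maximum.
    lo, mid0 = min_max(a_min, b_min)
    mid1, hi = min_max(a_max, b_max)
    # Stage 3: order the two remaining middle values.
    mid_lo, mid_hi = min_max(mid0, mid1)
    return [lo, mid_lo, mid_hi, hi]


if __name__ == "__main__":
    # Exhaustively check every ordering of four distinct values.
    for perm in permutations([3, 7, 1, 9]):
        assert sort4(*perm) == [1, 3, 7, 9]
    print(sort4(4, 2, 3, 1))
```

A model like this can serve as a golden reference when checking simulation output by hand.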

Run the tests for the sort unit and note that the tests for the SortUnitStructRTL will fail. You can just copy over your implementation of the MinMaxUnit from when you completed the PyMTL tutorial. If you have not completed the PyMTL tutorial then go back and do that now. After running the tests we use the sort unit simulator to translate the PyMTL RTL model into Verilog and to dump the VCD file that we want to use for power analysis.

 % mkdir $TOPDIR/pymtl/build
 % cd $TOPDIR/pymtl/build
 % py.test ../tut3_pymtl/sort
 % ../tut3_pymtl/sort/sort-sim --impl rtl-struct --translate --dump-vcd

Take a moment to open up the translated Verilog, which should be in a file named SortUnitStructRTL_0x4b8e51bd8055176a.v. The complicated hash suffix is used by PyMTL to make this filename unique even for parameterized modules which are instantiated for a specific set of parameters. Try to see how both the structural composition and the behavioral modeling translate into Verilog. Here is an example of the translation for the MinMaxUnit. Notice how PyMTL will output the source Python embedded as a comment in the corresponding translated Verilog.

module MinMaxUnit_0x4b8e51bd8055176a
(
  input  wire [   0:0] clk,
  input  wire [   7:0] in0,
  input  wire [   7:0] in1,
  output reg  [   7:0] out_max,
  output reg  [   7:0] out_min,
  input  wire [   0:0] reset
);

  // PYMTL SOURCE:
  //
  // @s.combinational
  // def block():
  //
  //       if s.in0 >= s.in1:
  //         s.out_max.value = s.in0
  //         s.out_min.value = s.in1
  //       else:
  //         s.out_max.value = s.in1
  //         s.out_min.value = s.in0

  // logic for block()
  always @ (*) begin
    if ((in0 >= in1)) begin
      out_max = in0;
      out_min = in1;
    end
    else begin
      out_max = in1;
      out_min = in0;
    end
  end

endmodule // MinMaxUnit_0x4b8e51bd8055176a

Although we hope students will not need to actually open up this translated Verilog, it is occasionally necessary. For example, PyMTL is not perfect and can translate incorrectly, which might require looking at the Verilog to see where it went wrong. Other steps in the ASIC flow might refer to an error in the translated Verilog, which will also require looking at the Verilog to figure out why the other steps are going wrong. While we try to make things as automated as possible, students will eventually need to dig in and debug some of these steps themselves.

Using Synopsys Design Compiler Manually

We use Synopsys Design Compiler (DC) to synthesize Verilog RTL models into a gate-level netlist where all of the gates are from the standard cell library. For example, Synopsys DC will synthesize the Verilog + operator into a specific arithmetic block at the gate level. Based on various constraints it may synthesize a ripple-carry adder, a carry-look-ahead adder, or even more advanced parallel-prefix adders.
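
To give a feel for what "a specific arithmetic block at the gate level" means, here is a small bit-level Python model of the simplest option, a ripple-carry adder built from one-bit full adders. It is only a sketch of the general structure; it does not reflect the netlist Synopsys DC would actually produce, and the 8-bit default width is simply chosen to match the 8-bit ports of the MinMaxUnit.

```python
def full_adder(a, b, cin):
    """One-bit full adder built from XOR/AND/OR: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_carry_add(x, y, nbits=8):
    """Add two unsigned nbits-wide integers one bit position at a time.

    The carry out of each bit feeds the next bit, so the delay grows
    linearly with nbits. Returns (nbits-wide sum, final carry out).
    """
    carry = 0
    result = 0
    for i in range(nbits):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


if __name__ == "__main__":
    print(ripple_carry_add(100, 27))   # no overflow
    print(ripple_carry_add(200, 100))  # wraps around with a carry out
```

The serial carry chain is what makes this adder small but slow, which is why a tight clock constraint may push the tool toward carry-look-ahead or parallel-prefix structures instead.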

We will start by manually entering a sequence of commands into Synopsys DC, and in the next section we will see how to automate this process. Create a directory to work in and launch Synopsys DC.

 % mkdir $TOPDIR/asic/dc-syn/manual-dc
 % cd $TOPDIR/asic/dc-syn/manual-dc
 % dc_shell-xg-t

To make it easier to copy-and-paste commands from this document, we tell Synopsys DC to ignore the prefix dc_shell> using the following:

 dc_shell> alias "dc_shell>" ""

Before we can really start synthesizing the design we need to set up a bunch of variables and options. We need to point Synopsys DC to where the standard cells are installed, where the Verilog we want to synthesize is located, where the standard cell characterization files are located, and what the names for logic 0 and logic 1 are in the standard cell library.

 dc_shell> set stdcells_home /research/brg/install/bare-pkgs/noarch/synopsys-90nm/toolflow
 dc_shell> set_app_var search_path "$stdcells_home ../../../pymtl/build"
 dc_shell> set_app_var target_library "cells.db"
 dc_shell> set_app_var link_library "* $target_library"
 dc_shell> set_app_var alib_library_analysis_path \
       "/research/brg/install/bare-pkgs/noarch/synopsys-90nm/toolflow/alib"
 dc_shell> set_app_var mw_logic1_net "VDD"
 dc_shell> set_app_var mw_logic0_net "VSS"

Now we create a new Milkyway Database. Milkyway is Synopsys' proprietary database format which is used to hold all kinds of design data (RTL models, gate-level models, standard-cell models, timing information, layout, etc.). We open the new database and also create a directory for Synopsys DC to work in.

 dc_shell> create_mw_lib -technology $stdcells_home/techfile.tf \
       -mw_reference_library $stdcells_home/milkyway.fr "LIB"
 dc_shell> open_mw_lib "LIB"
 dc_shell> define_design_lib WORK -path "./work"

We are now ready to synthesize the design. We first read in the Verilog file which contains the top-level design and all referenced modules.

 dc_shell> analyze -format verilog "SortUnitStructRTL_0x4b8e51bd8055176a.v"

We use the elaborate command to convert the Verilog models into a unified in-memory model format that Synopsys can analyze. This is also when Synopsys starts to do some analysis on the design, and the command output can sometimes display useful information about inferred latches and such. Notice that you need to give the elaborate command the name of the Verilog module which is the top of the design.

 dc_shell> elaborate "SortUnitStructRTL_0x4b8e51bd8055176a"

We use the link command to resolve all module references and then we use the check_design command to check for any warnings or errors. Always be sure to explicitly look for errors; they can get buried in the tons of output that the Synopsys tools produce. Synopsys DC does not usually stop if there is an error but instead just keeps going.

 dc_shell> link
 dc_shell> check_design

We need to create a clock constraint to tell Synopsys DC what our target cycle time is. Synopsys DC will not synthesize a design to run "as fast as possible". Instead, the designer gives Synopsys DC a target cycle time and the tool will try to meet this constraint while minimizing area and power. The create_clock command takes the name of the clock signal in the Verilog (which in this course will always be clk), the label to give this clock (i.e., ideal_clock1), and the target clock period in nanoseconds. So in this example, we are asking Synopsys DC to see if it can synthesize the design to run at 1GHz (i.e., a cycle time of 1ns).

 dc_shell> create_clock clk -name ideal_clock1 -period 1

Finally, the compile_ultra command will do the synthesis. Without any options, the compile_ultra command will sometimes flatten parts of the design. Flattening means removing module hierarchy boundaries; so instead of having module A and module B within module C, Synopsys DC will take all of the logic in module A and module B and put it directly in module C. Without these extra hierarchy boundaries, Synopsys DC is able to perform more optimizations and potentially achieve better area, energy, and timing. The -no_autoungroup option prevents Synopsys DC from flattening any part of the design and thus preserves the module hierarchy. This makes it much easier to interpret the reports, since if there is a module A in your RTL design that same module will always be in the synthesized gate-level netlist.

 dc_shell> compile_ultra -no_autoungroup

As compile_ultra runs it will display how it is trying to optimize your design. Synopsys DC will use sophisticated CAD algorithms to try to meet the clock cycle constraint, then to reduce the area/power overhead, and then to again improve the timing. It will iterate many times as it works hard to optimize the design.

Now that we have synthesized the design, we output the resulting gate-level netlist in two different file formats: Verilog and DDC (which we will use with DesignVision).

 dc_shell> write -f verilog -hierarchy -output SortUnitStructRTL_0x4b8e51bd8055176a.mapped.v
 dc_shell> write -format ddc -hierarchy -output SortUnitStructRTL_0x4b8e51bd8055176a.mapped.ddc

We can use various commands to generate reports about area, energy, and timing. The report_timing command will show the critical path through the design. Part of the report is displayed below.

 dc_shell> report_timing -transition_time -nets -attributes -nosplit
 ...
  Point                                                   Fanout  Trans   Incr   Path
  -----------------------------------------------------------------------------------
  clock ideal_clock1 (rise edge)                                                 0.00
  clock network delay (ideal)                                             0.00   0.00
  elm_S1S2$000/out_reg[0]/CLK (DFFX1)                              0.00   0.00   0.00 r
  elm_S1S2$000/out_reg[0]/Q (DFFX1)                                0.05   0.17   0.17 r
  elm_S1S2$000/out[0] (net)                                    4          0.00   0.17 r
  elm_S1S2$000/out[0] (Reg_0x7a355c5a216e72a4_11)                         0.00   0.17 r
  elm_S1S2$000$out[0] (net)                                               0.00   0.17 r
  minmax0_S2/in0[0] (MinMaxUnit_0x4b8e51bd8055176a_3)                     0.00   0.17 r
  minmax0_S2/in0[0] (net)                                                 0.00   0.17 r
  minmax0_S2/U28/QN (NOR2X0)                                       0.05   0.12   0.29 f
  minmax0_S2/n10 (net)                                         1          0.00   0.29 f
  minmax0_S2/U29/Q (OA22X1)                                        0.03   0.11   0.40 f
  minmax0_S2/n13 (net)                                         1          0.00   0.40 f
  minmax0_S2/U30/Q (OA22X1)                                        0.03   0.11   0.51 f
  minmax0_S2/n16 (net)                                         1          0.00   0.51 f
  minmax0_S2/U31/Q (OA22X1)                                        0.03   0.11   0.62 f
  minmax0_S2/n19 (net)                                         1          0.00   0.62 f
  minmax0_S2/U32/Q (OA22X1)                                        0.03   0.11   0.72 f
  minmax0_S2/n22 (net)                                         1          0.00   0.72 f
  minmax0_S2/U33/Q (OA22X1)                                        0.03   0.11   0.83 f
  minmax0_S2/n25 (net)                                         1          0.00   0.83 f
  minmax0_S2/U34/Q (OA22X1)                                        0.05   0.12   0.95 f
  minmax0_S2/n29 (net)                                         5          0.00   0.95 f
  minmax0_S2/U18/Q (OA22X1)                                        0.06   0.15   1.10 f
  minmax0_S2/n3 (net)                                          3          0.00   1.10 f
  minmax0_S2/U44/Q (MUX21X1)                                       0.03   0.18   1.28 r
  minmax0_S2/out_min[4] (net)                                  1          0.00   1.28 r
  minmax0_S2/out_min[4] (MinMaxUnit_0x4b8e51bd8055176a_3)                 0.00   1.28 r
  minmax0_S2$out_min[4] (net)                                             0.00   1.28 r
  elm_S2S3$000/in_[4] (Reg_0x7a355c5a216e72a4_3)                          0.00   1.28 r
  elm_S2S3$000/in_[4] (net)                                               0.00   1.28 r
  elm_S2S3$000/out_reg[4]/D (DFFX2)                                0.03   0.04   1.32 r
  data arrival time                                                              1.32

  clock ideal_clock1 (rise edge)                                          1.00   1.00
  clock network delay (ideal)                                             0.00   1.00
  elm_S2S3$000/out_reg[4]/CLK (DFFX2)                                     0.00   1.00 r
  library setup time                                                     -0.08   0.92
  data required time                                                             0.92
  -----------------------------------------------------------------------------------
  data required time                                                             0.92
  data arrival time                                                             -1.32
  -----------------------------------------------------------------------------------
  slack (VIOLATED)                                                              -0.40

This timing report uses static timing analysis to find the critical path. Static timing analysis involves checking the timing across all paths in the design (regardless of whether these paths can actually be used in practice) and finds the longest path. You can learn more about static timing analysis in Chapter 1 of the Synopsys Timing Constraints and Optimization User Guide. The report clearly shows that the critical path starts at the first pipeline register in between the S1 and S2 stages, goes into the first input of the top MinMaxUnit, comes out the out_min port of the MinMaxUnit, and ends at a pipeline register in between the S2 and S3 stages. The report shows the delay through each logic gate (e.g., the clk-to-q delay of the initial DFF is 170ps, the propagation delay of a NOR2 gate is 120ps) and the total delay for the critical path, which in this case is 1.32ns. We set the clock constraint to be 1ns, but also notice that the report factors in the setup time required at the final register. The setup time is 80ps, so in order to operate the sort unit at 1ns and meet the setup time we would need the critical path to arrive in 0.92ns.

The difference between the required arrival time and the actual arrival time is called the slack. Positive slack means the path arrived before it needed to, while negative slack means the path arrived after it needed to. If you end up with positive slack it means you probably want to decrease your clock period constraint to push the tools harder and produce a faster design. Even if you have no slack you still probably want to decrease your clock period constraint. This is because the tools rarely leave positive slack, preferring instead to take an overly fast design and resynthesize smaller logic to save area and power. In the above example, we have 400ps of negative slack. Note that this does not mean the sort unit will not work. It just means the cycle time would have to be 1.40ns in order for the sort unit to operate correctly. Because in this course we are primarily interested in design-space exploration (as opposed to meeting some kind of arbitrary timing constraint), we suggest adjusting the clock constraint until you end up with about 5-10% negative slack. This will result in a well-optimized design and help identify the "fundamental" performance of the design.
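
The arithmetic behind these numbers can be written down in a few lines of Python. The sketch below uses the values from the timing report (in picoseconds, to keep the arithmetic exact); the function names are hypothetical, and expressing the slack as a percentage of the clock period is an assumption about how the 5-10% guideline is measured.

```python
def timing_summary(period_ps, setup_ps, arrival_ps):
    """Recompute the key quantities at the end of a setup timing report."""
    required_ps = period_ps - setup_ps      # data required time
    slack_ps = required_ps - arrival_ps     # negative means a violation
    min_period_ps = arrival_ps + setup_ps   # shortest period with zero slack
    return {
        "required_ps": required_ps,
        "slack_ps": slack_ps,
        "min_period_ps": min_period_ps,
        "slack_pct": 100.0 * slack_ps / period_ps,
    }


def in_target_range(slack_pct, lo=-10.0, hi=-5.0):
    """True if the slack is about 5-10% negative, as suggested in the text."""
    return lo <= slack_pct <= hi


if __name__ == "__main__":
    # 1ns clock period, 80ps setup time, 1.32ns data arrival time
    summary = timing_summary(period_ps=1000, setup_ps=80, arrival_ps=1320)
    print(summary)
    print("within 5-10% negative slack:", in_target_range(summary["slack_pct"]))
```

With the numbers from the report the slack is 40% of the clock period, far outside the suggested range, so the 1ns constraint is too aggressive for this design. Note that relaxing the constraint changes what the tool synthesizes, so the next run will not simply report the same 1.32ns path.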

The  report_area   command can show how much area each module uses and can enable detailed area breakdown analysis.

 dc_shell> report_area -nosplit -hierarchy
 ...
                      Global        Local
                      Cell Area     Cell Area
                    --------------  ---------------------
  Hierarchical cell      Abs                   Non Black-
                       Total     %    Comb    Comb  boxes  Design
  -----------------  ------- ----- ------- ------- ------  -------------------------------
  SortUnitStructRTL
  elm_S0S1$000         199.0   4.2     0.0   199.0    0.0  Reg_0x7a355c5a216e72a4_7
  elm_S0S1$001         199.0   4.2     0.0   199.0    0.0  Reg_0x7a355c5a216e72a4_6
  elm_S0S1$002         199.0   4.2     0.0   199.0    0.0  Reg_0x7a355c5a216e72a4_5
  elm_S0S1$003         199.0   4.2     0.0   199.0    0.0  Reg_0x7a355c5a216e72a4_4
  elm_S1S2$000         199.0   4.2     0.0   199.0    0.0  Reg_0x7a355c5a216e72a4_11
  elm_S1S2$001         199.0   4.2     0.0   199.0    0.0  Reg_0x7a355c5a216e72a4_10
  elm_S1S2$002         218.4   4.6     0.0   218.4    0.0  Reg_0x7a355c5a216e72a4_9
  elm_S1S2$003         237.7   5.0     0.0   237.7    0.0  Reg_0x7a355c5a216e72a4_8
  elm_S2S3$000         244.2   5.2     0.0   244.2    0.0  Reg_0x7a355c5a216e72a4_3
  elm_S2S3$001         244.2   5.2     0.0   244.2    0.0  Reg_0x7a355c5a216e72a4_2
  elm_S2S3$002         218.4   4.6     0.0   218.4    0.0  Reg_0x7a355c5a216e72a4_1
  elm_S2S3$003         245.1   5.2     0.0   245.1    0.0  Reg_0x7a355c5a216e72a4_0
  minmax0_S1           427.6   9.1   427.6     0.0    0.0  MinMaxUnit_0x4b8e51bd8055176a_2
  minmax0_S2           426.7   9.0   426.7     0.0    0.0  MinMaxUnit_0x4b8e51bd8055176a_3
  minmax1_S1           414.7   8.8   414.7     0.0    0.0  MinMaxUnit_0x4b8e51bd8055176a_4
  minmax1_S2           434.0   9.2   434.0     0.0    0.0  MinMaxUnit_0x4b8e51bd8055176a_1
  minmax_S3            313.3   6.6   313.3     0.0    0.0  MinMaxUnit_0x4b8e51bd8055176a_0
  val_S0S1              35.9   0.8    11.0    24.8    0.0  RegRst_0x61d677aadab8bc25_0
  val_S1S2              30.4   0.6     5.5    24.8    0.0  RegRst_0x61d677aadab8bc25_1
  val_S2S3              30.4   0.6     5.5    24.8    0.0  RegRst_0x61d677aadab8bc25_2
  -----------------  ------- ----- ------- ------- ------  -------------------------------
  Total                             2038.5  2677.2    0.0

The units are square microns. From the above report, we can see that each pipeline register consumes about 4-5% of the area, while the MinMaxUnits consume a total of 43% of the area. This is one reason we do not flatten our designs, since the module hierarchy helps us understand the area breakdowns. If we completely flattened the design there would only be one line in the above table.
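
This kind of breakdown is easy to reproduce with a short script. The Python sketch below takes the per-instance absolute areas from the report above and groups them by instance-name prefix; the grouping rule and function names are assumptions made for illustration, not part of the course flow.

```python
from collections import defaultdict

# (instance name, absolute total area in square microns), from the report above
INSTANCE_AREAS = [
    ("elm_S0S1$000", 199.0), ("elm_S0S1$001", 199.0),
    ("elm_S0S1$002", 199.0), ("elm_S0S1$003", 199.0),
    ("elm_S1S2$000", 199.0), ("elm_S1S2$001", 199.0),
    ("elm_S1S2$002", 218.4), ("elm_S1S2$003", 237.7),
    ("elm_S2S3$000", 244.2), ("elm_S2S3$001", 244.2),
    ("elm_S2S3$002", 218.4), ("elm_S2S3$003", 245.1),
    ("minmax0_S1", 427.6), ("minmax0_S2", 426.7),
    ("minmax1_S1", 414.7), ("minmax1_S2", 434.0),
    ("minmax_S3", 313.3),
    ("val_S0S1", 35.9), ("val_S1S2", 30.4), ("val_S2S3", 30.4),
]


def area_by_group(instances):
    """Sum instance areas into coarse groups based on the name prefix."""
    totals = defaultdict(float)
    for name, area in instances:
        if name.startswith("minmax"):
            group = "min/max units"
        elif name.startswith("elm_"):
            group = "element pipeline registers"
        else:
            group = "valid-bit registers"
        totals[group] += area
    return dict(totals)


def percentages(totals):
    """Convert absolute group areas into percentages of the grand total."""
    grand_total = sum(totals.values())
    return {group: 100.0 * area / grand_total for group, area in totals.items()}


if __name__ == "__main__":
    for group, pct in percentages(area_by_group(INSTANCE_AREAS)).items():
        print(f"{group:28s} {pct:5.1f}%")
```

Because the grouping relies on instance names, it only works when the module hierarchy has been preserved during synthesis.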

The report_power command can show how much power each module consumes. Note that this power analysis is actually not that useful yet, since at this stage of the flow the power analysis is based purely on statistical activity factor estimation. Basically, Synopsys DC assumes every net toggles 10% of the time. This is a pretty poor estimate.

 dc_shell> report_power -nosplit -hier
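
To see why a single assumed activity factor is a poor estimate, consider the following Python sketch, which measures how often a one-bit net actually changes value from a list of per-cycle samples. The sample waveforms are made up for illustration, and the helper is hypothetical; in the real flow this information comes from the VCD file.

```python
DEFAULT_TOGGLE_RATE = 0.10  # the blanket assumption described above


def toggle_rate(samples):
    """Fraction of cycle boundaries at which a one-bit signal changes value."""
    if len(samples) < 2:
        return 0.0
    toggles = sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)
    return toggles / (len(samples) - 1)


# Hypothetical per-cycle samples of two one-bit nets
busy_net = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
idle_net = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

if __name__ == "__main__":
    for name, samples in (("busy_net", busy_net), ("idle_net", idle_net)):
        measured = toggle_rate(samples)
        print(f"{name}: measured {measured:.2f}, assumed {DEFAULT_TOGGLE_RATE:.2f}")
```

The two nets differ widely in measured activity, yet a blanket 10% assumption would treat them identically.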

Finally, we go ahead and exit Synopsys DC.

 dc_shell>  exit  

Take a few minutes to examine the resulting Verilog gate-level netlist. Notice that the module hierarchy is preserved and also notice that the MinMaxUnit synthesizes into a large number of basic logic gates.

 % cd $TOPDIR/asic/dc-syn/manual-dc
 % more SortUnitStructRTL_0x4b8e51bd8055176a.mapped.v

We can use the Synopsys Design Vision tool for browsing the resulting gate-level netlist, plotting critical path histograms, and generally analyzing our design. Start Synopsys Design Vision and set up the various variables and options as follows:

 % design_vision-xg
 design_vision> alias "design_vision>" ""
 design_vision> set stdcells_home /research/brg/install/bare-pkgs/noarch/synopsys-90nm/toolflow
 design_vision> set_app_var search_path "$stdcells_home ../../../pymtl/build"
 design_vision> set_app_var target_library "cells.db"
 design_vision> set_app_var link_library "* $target_library"
 design_vision> set_app_var alib_library_analysis_path \
       "/research/brg/install/bare-pkgs/noarch/synopsys-90nm/toolflow/alib"

To view a schematic of the gate-level netlist, right click on the module in the module hierarchy browser and choose Schematic View. To see a histogram of path slack choose Timing > Paths Slack from the menu. To see a schematic of the critical path, right click one of the bars in the path slack histogram, and choose Path Inspector.

Using Synopsys Design Compiler with Makefile

Obviously entering all of the above commands is tedious and error prone. We could also potentially directly drive synthesis using the Design Vision GUI, but that is just as tedious and error prone. To enable an agile hardware design methodology, we must script as much of the ASIC flow as possible. Luckily, Synopsys tools can be easily scripted using TCL, and even better, the ECE 5745 staff have already created these TCL scripts. The ECE 5745 TCL scripts were based on the Synopsys reference methodology which is copyrighted by Synopsys. This means you cannot take this repo and/or the scripts and make them public. Please keep this in mind.

We use make to drive the ASIC flow. A special Makefrag describes the details of the specific design you want to push through the flow. Go into the asic subdirectory and take a look at the Makefrag.

 % cd $TOPDIR/asic
 % more Makefrag

The  Makefrag   has one entry for each design. Each entry looks like this:

 ifeq ($(design),pymtl-sort)
   flow          = pymtl
   clock_period  = 1.0
   sim_build_dir = pymtl/build
   vsrc          = SortUnitStructRTL_0x4b8e51bd8055176a.v
   vmname        = SortUnitStructRTL_0x4b8e51bd8055176a
   viname        = TOP/v
   vcd           = sort-rtl-struct-random.verilator1.vcd
 endif

Every design has a name and in this case the design name is pymtl-sort. For now the flow variable will always be pymtl, the sim_build_dir variable will always be pymtl/build, and the viname variable will always be TOP/v. The clock_period variable is where you set the target clock period constraint for this design. The vsrc variable is the name of the Verilog file you want to push through the flow. The vmname variable is the name of the Verilog module which is the top of the design. For now it will always be the Verilog file name without the .v suffix. Finally, the vcd variable is the name of the VCD file you want to use for power analysis.
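
Since a wrong file or module name in the Makefrag is an easy mistake to make, it can help to check an entry against the conventions just described before running the flow. The following Python sketch is a hypothetical helper (it is not part of the course scripts) that parses the variable assignments of one entry and reports any violations.

```python
MAKEFRAG_ENTRY = """\
ifeq ($(design),pymtl-sort)
  flow          = pymtl
  clock_period  = 1.0
  sim_build_dir = pymtl/build
  vsrc          = SortUnitStructRTL_0x4b8e51bd8055176a.v
  vmname        = SortUnitStructRTL_0x4b8e51bd8055176a
  viname        = TOP/v
  vcd           = sort-rtl-struct-random.verilator1.vcd
endif
"""


def parse_entry(text):
    """Collect the 'name = value' assignments of one Makefrag entry."""
    variables = {}
    for line in text.splitlines():
        stripped = line.strip()
        if "=" in stripped and not stripped.startswith(("ifeq", "endif")):
            name, value = stripped.split("=", 1)
            variables[name.strip()] = value.strip()
    return variables


def check_entry(variables):
    """Return a list of problems with respect to the conventions above."""
    problems = []
    expected = {"flow": "pymtl", "sim_build_dir": "pymtl/build", "viname": "TOP/v"}
    for name, value in expected.items():
        if variables.get(name) != value:
            problems.append(f"{name} should be {value}")
    if variables.get("vsrc") != variables.get("vmname", "") + ".v":
        problems.append("vmname should be vsrc without the .v suffix")
    try:
        if float(variables.get("clock_period", "")) <= 0:
            problems.append("clock_period should be positive")
    except ValueError:
        problems.append("clock_period should be a number")
    return problems


if __name__ == "__main__":
    print(check_entry(parse_entry(MAKEFRAG_ENTRY)) or "entry looks consistent")
```

An empty list means the entry follows the stated conventions; it does not guarantee that the named files actually exist in the build directory.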

We set the following line in the Makefrag to choose which design we want to push through the flow:

 design = pymtl-sort

Since this is already set to push our sort unit through the flow, we are all set. Now all we need to do is use make like this:

 % cd $TOPDIR/asic/dc-syn
 % make

You will see make run some commands, start Synopsys DC, run some TCL scripts, and then finish up. Essentially, the automated system is doing something very similar to what we did in the previous section manually.

If Synopsys DC exits with a non-zero status code then something went wrong. You will need to carefully look through the log to search for errors or warnings that might hint at what went wrong. You may have used the incorrect file/module names in the Makefrag or there might be code in your Verilog RTL that is not synthesizable. This is not easy and there is no simple way to figure out these issues. You just need to poke through the log file:

 % cd $TOPDIR/asic/dc-syn/current-dc/log
 % more dc.log
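
Paging through a long log by hand is slow, so a small filter can help find the interesting lines first. The Python sketch below is a hypothetical helper that assumes messages of interest start with "Error:" or "Warning:"; the sample log lines are made up and are not actual Synopsys DC output.

```python
import re

# Assumption: lines of interest begin with "Error:" or "Warning:".
MESSAGE_RE = re.compile(r"^\s*(Error|Warning):\s*(.*)$")


def scan_log(lines):
    """Return (line_number, severity, message) for each matching line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = MESSAGE_RE.match(line.rstrip("\n"))
        if match:
            hits.append((lineno, match.group(1), match.group(2)))
    return hits


# Made-up log excerpt, for illustration only
sample_log = [
    "Loading design ...",
    "Warning: example warning text",
    "Compiling ...",
    "Error: example error text",
]

if __name__ == "__main__":
    for lineno, severity, message in scan_log(sample_log):
        print(f"line {lineno}: {severity}: {message}")
```

To use it on a real log, pass an open file instead of the sample list, for example `scan_log(open("dc.log"))`. Treat the result as a starting point only, since the context around each message often matters.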

When the synthesis is completed you can take a look at the resulting Verilog gate-level netlist here:

 %  cd  $TOPDIR/asic/dc-­‐syn/current-­‐dc/results  %  more  SortUnitStructRTL_0x4b8e51bd8055176a.mapped.v  

The automated system is also setup to output a bunch of reports. Here are the key ones:

 %  cd  $TOPDIR/asic/dc-­‐syn/current-­‐dc/reports  %  more  SortUnitStructRTL_0x4b8e51bd8055176a.mapped.qor.rpt    %  more  SortUnitStructRTL_0x4b8e51bd8055176a.mapped.timing.rpt    %  more  SortUnitStructRTL_0x4b8e51bd8055176a.mapped.area.rpt    %  more  SortUnitStructRTL_0x4b8e51bd8055176a.mapped.power.rpt  

The quality-of-results (QOR) report is a particularly useful summary. If you take a look at that report you will see something like this:

   Timing Path Group 'REGIN'
   -----------------------------------
   Levels of Logic:               2.00
   Critical Path Length:          0.06
   Critical Path Slack:           0.87
   Critical Path Clk Period:      1.00
   Total Negative Slack:          0.00
   No. of Violating Paths:        0.00
   Worst Hold Violation:          0.00
   Total Hold Violation:          0.00
   No. of Hold Violations:        0.00
   -----------------------------------

   Timing Path Group 'REGOUT'
   -----------------------------------
   Levels of Logic:               9.00
   Critical Path Length:          0.85
   Critical Path Slack:           0.15
   Critical Path Clk Period:      1.00
   Total Negative Slack:          0.00
   No. of Violating Paths:        0.00
   Worst Hold Violation:          0.00
   Total Hold Violation:          0.00
   No. of Hold Violations:        0.00
   -----------------------------------

   Timing Path Group 'ideal_clock1'
   -----------------------------------
   Levels of Logic:               9.00
   Critical Path Length:          0.88
   Critical Path Slack:           0.05
   Critical Path Clk Period:      1.00
   Total Negative Slack:          0.00
   No. of Violating Paths:        0.00
   Worst Hold Violation:          0.00
   Total Hold Violation:          0.00
   No. of Hold Violations:        0.00
   -----------------------------------

Paths are organized into four groups: REGIN, REGOUT, INOUT, and CLK path groups. REGIN paths start at an input port and end at a register; REGOUT paths start at a register and end at an output port; INOUT paths start at an input port and end at an output port; and CLK paths start at a register and end at a register. In the report above, the CLK group appears under the name of the clock, ideal_clock1. The following diagram is from Chapter 1 of the Synopsys Timing Constraints and Optimization User Guide.

We have set up the flow so that the tools have to fit all four of these paths in a single cycle. The QOR report shows the worst path within each path group. The overall critical path for your design will be the worst critical path across all four groups, and the actual cycle time is calculated as the "Critical Path Clk Period" (this is the target clock constraint) minus the "Critical Path Slack". So in this example the cycle time would be 1.00ns - 0.05ns = 0.95ns, set by the ideal_clock1 group. Recall that when we manually entered the commands for synthesis the critical path was 1.40ns. What changed? The automated flow takes advantage of what is known as "topographical mode"; this is an advanced feature in Synopsys DC which involves more complex algorithms that do synthesis, preliminary placement, more synthesis, and more preliminary placement. By incorporating some preliminary placement algorithms into the synthesis part of the flow, Synopsys DC is able to achieve much higher QOR.
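The cycle-time calculation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the script used by the flow: the function names parse_path_groups and cycle_time are hypothetical, and the report excerpt is trimmed to the two fields the calculation needs.

```python
import re

# Excerpt of the QOR report shown above (only the fields we need).
QOR_REPORT = """
  Timing Path Group 'REGIN'
  Critical Path Slack:           0.87
  Critical Path Clk Period:      1.00

  Timing Path Group 'REGOUT'
  Critical Path Slack:           0.15
  Critical Path Clk Period:      1.00

  Timing Path Group 'ideal_clock1'
  Critical Path Slack:           0.05
  Critical Path Clk Period:      1.00
"""

def parse_path_groups(report):
    """Return {group_name: {"slack": float, "period": float}} (hypothetical helper)."""
    groups = {}
    current = None
    for line in report.splitlines():
        m = re.search(r"Timing Path Group '([^']+)'", line)
        if m:
            current = groups.setdefault(m.group(1), {})
            continue
        m = re.search(r"Critical Path (Slack|Clk Period):\s*(-?\d+(?:\.\d+)?)", line)
        if m and current is not None:
            key = "slack" if m.group(1) == "Slack" else "period"
            current[key] = float(m.group(2))
    return groups

def cycle_time(groups):
    """Cycle time = clock period minus slack, taking the worst path group."""
    return max(g["period"] - g["slack"] for g in groups.values())

groups = parse_path_groups(QOR_REPORT)
print(round(cycle_time(groups), 2))  # cycle time in ns
```

Taking the maximum of period minus slack across groups is the same as subtracting the smallest slack when every group shares one clock constraint, and it also handles a negative slack, where the cycle time ends up longer than the constraint.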

Keep in mind that the area, energy, and timing results post-synthesis will not be as accurate as the post-place-and-route results.

While it is fine to iterate quickly just using synthesis, you will eventually need to use Synopsys IC Compiler for more

accurate area and timing analysis, and use Synopsys PrimeTime for more accurate power analysis.

Using IC Compiler with Makefile

We use Synopsys IC Compiler (ICC) for placing and routing standard cells, but also for power routing and clock tree synthesis. The Verilog gate-level netlist generated by Synopsys DC has no physical information: it is just a netlist, so Synopsys ICC will first try to do a rough placement of all of the gates into rows on the chip. Synopsys ICC will then do some preliminary routing, and iterate between more and more detailed placement and routing until it reaches the target cycle time (or gives up). Synopsys ICC will also route all of the power and ground rails in a grid and connect this grid to the power and ground pins of each standard cell, and Synopsys ICC will automatically generate a clock tree to distribute the clock to all sequential state elements with hopefully low skew.

We can use make to run Synopsys ICC like this:

 % cd $TOPDIR/asic/icc-par
 % make

Place-and-route can take significantly longer than synthesis, so be prepared to wait a while with larger designs. If you look at the output scrolling by you will see some of the optimization passes as Synopsys ICC attempts to iteratively improve the design. The automated system is also set up to output a bunch of reports. Here are the key ones:

 % cd $TOPDIR/asic/icc-par/current-icc/reports
 % more chip_finish_icc.qor.rpt
 % more chip_finish_icc.timing.rpt
 % more chip_finish_icc.area.rpt
 % more chip_finish_icc.power.rpt
 % more summary.txt

   vsrc       = SortUnitStructRTL_0x4b8e51bd8055176a.v
   area       = 5067 # um^2
   constraint = 1.0 # ns
   slack      = 0.01 # ns
   cycle_time = 0.99 # ns

If Synopsys ICC exits with an error or the reports look very odd, you will need to carefully look through the log to search for errors or warnings that might hint at what went wrong. Usually we catch errors in Synopsys DC and after that we are all set, so you might want to go back and see if there were any errors in Synopsys DC. The Synopsys ICC log files are here:

 % cd $TOPDIR/asic/icc-par/current-iccdp/log
 % cd $TOPDIR/asic/icc-par/current-icc/log

We have written a little script to parse the reports and generate a summary.txt file. This script takes care of looking across all four path groups to find the true cycle time that you should use in your analysis. The general format of the area, energy, and timing reports is similar in spirit to what we saw earlier when working with Synopsys DC.

From the summary.txt file, we can see that the cycle time is now estimated to be 0.99ns, but recall that our post-synthesis estimate was 0.95ns. The key difference, of course, is that these results are based on post-place-and-route analysis so they factor in routing congestion and interconnect overheads.
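As a rough illustration of how the fields in summary.txt relate to each other, the following Python sketch parses the key = value # unit lines shown earlier and compares the post-place-and-route cycle time against the post-synthesis estimate. The parse_summary helper is hypothetical and is not the script the flow actually uses; it simply assumes the file format matches the excerpt in this tutorial.

```python
# Contents of summary.txt as shown earlier in this section.
SUMMARY_TXT = """
  vsrc       = SortUnitStructRTL_0x4b8e51bd8055176a.v
  area       = 5067 # um^2
  constraint = 1.0 # ns
  slack      = 0.01 # ns
  cycle_time = 0.99 # ns
"""

def parse_summary(text):
    """Parse 'key = value # unit' lines into {key: (value, unit)}."""
    fields = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, rest = line.partition("=")
        value, _, unit = rest.partition("#")
        value = value.strip()
        try:
            value = float(value)
        except ValueError:
            pass  # keep non-numeric values (e.g., file names) as strings
        fields[key.strip()] = (value, unit.strip() or None)
    return fields

summary = parse_summary(SUMMARY_TXT)
constraint, _ = summary["constraint"]
slack, _ = summary["slack"]
post_par_cycle_time, unit = summary["cycle_time"]

# The reported cycle time should equal the constraint minus the slack.
assert abs((constraint - slack) - post_par_cycle_time) < 1e-9

post_synth_cycle_time = 0.95  # ns, from the post-synthesis QOR report
overhead = post_par_cycle_time / post_synth_cycle_time - 1.0
print(f"post-P&R cycle time is {overhead:.1%} longer than post-synthesis")
```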

While we do not use GUIs to drive our flow, we often use GUIs to analyze the results. You can start the Synopsys ICC GUI to visualize the final layout like this:

 % cd $TOPDIR/asic/icc-par/current-icc
 % icc_shell -gui

Once the GUI has finished opening, use the following steps to actually open up the most recently placed-and-routed design:

Enter source icc_setup.tcl at the icc_shell> prompt

Choose File > Open Design... from the menu

Click the folder button to the right of Library Name

Select the orange folder with L in the file browser

Select chip_finish_icc in the list

Click Okay

We call the resulting plot an "amoeba plot" because the tool often generates blocks that look like amoebas. You can now

zoom in to see how the standard cells were placed and how the routing was done. You can turn on and off the visibility of

metal layers using the panel on the left. One very useful feature is to view the hierarchy and area breakdown. This will be

critical for producing high-quality amoeba plots. You can use the following steps to highlight various modules on the

amoeba plot:

Choose  Placement  >  Color  By  Hierarchy   from the menu

In the sidebar menu on right, select  Reload  

In the pop-up window, select  Color  hierarchical  cells  at  level  

Click  OK   in the pop up

Click checkmark and apply to show just one component

Another very useful feature is to highlight the critical path on the amoeba plot using the following steps:

Choose  Timing  >  New  Timing  Analysis  Window   from the menu

Focus on  Select  Paths   window, click  OK  

List of paths should appear

Click on path to see it highlighted in layout view

You can see an example amoeba plot below. Note that you will need to use some kind of "screen-capture" software to

capture the plot and by default it will have a black background. We strongly recommend inverting the colors so that the

amoeba plot you include in your reports is dark on white (instead of white on dark). This makes the chip plot easier to read.

You will also need to play with the colors to enable easily seeing the various parts of your design. In this example, we have

chosen to highlight the five MinMaxUnits (brown, blue, green, red, gray) and one of the critical paths which goes through the red MinMaxUnit. Note how the tool has actually spread the red MinMaxUnit apart a bit. Keep in mind that these tools use

incredibly sophisticated heuristics and so it can sometimes be difficult to understand every detail about why it places cells

in specific places.

Using PrimeTime with Makefile

We use Synopsys PrimeTime (PT) for power analysis. There are many ways to perform power analysis. The post-synthesis and post-place-and-route power reports use statistical power analysis where we simply assume some toggle probability on each net. For more accurate power analysis we need to find out the actual activity for every net for a given experiment. One way to do this is to perform post-place-and-route gate-level simulation; in other words, we can do a simulation of the gate-level netlist generated by synthesis and place-and-route. These kinds of gate-level simulations can be very, very slow and are tedious to set up correctly. So in this course we will use a slightly less accurate yet much simpler approach. We will use the VCD from an RTL simulation instead of the VCD from a gate-level simulation. The challenge is that not all of the nets in the gate-level model are actually in the RTL, so we will only have activity information for the subset of nets that are in both the RTL and gate-level models (e.g., module ports, state elements). This is not as bad as it seems, because Synopsys PT will use sophisticated algorithms, including many tiny little gate-level simulations of just a few gates, in order to estimate the activity factor of all nets downstream from those nets we already know.

We can use make to run Synopsys PT like this:

 % cd $TOPDIR/asic/pt-pwr
 % make

   vsrc       = SortUnitStructRTL_0x4b8e51bd8055176a.v
   input      = sort-rtl-struct-random
   area       = 5067 # um^2
   constraint = 1.0 # ns
   slack      = 0.01 # ns
   cycle_time = 0.99 # ns
   exec_time  = 104 # cycles
   power      = 6.309 # mW
   energy     = 0.64957464 # nJ

We have set up the flow to display the final summary information after this step. You can see the total area, cycle time, power, and energy for your design when running the given input (i.e., when using the VCD file specified in the Makefrag). You can see a more detailed power breakdown by module here:

 % cd $TOPDIR/asic/pt-pwr/current-pt/reports
 % more pt-pwr.power.avg.max.report

                               Int       Switch    Leak      Total
   Hierarchy                   Power     Power     Power     Power         %
   -------------------------------------------------------------------------
   SortUnitStructRTL           4.94e-03  1.34e-03  2.76e-05  6.31e-03  100.0
     elm_S2S3_000 (Reg_3)      2.23e-04  1.08e-05  1.04e-06  2.35e-04    3.7
     elm_S2S3_001 (Reg_2)      2.60e-04  3.24e-05  1.07e-06  2.93e-04    4.7
     elm_S2S3_002 (Reg_1)      2.64e-04  3.50e-05  1.08e-06  3.00e-04    4.8
     elm_S2S3_003 (Reg_0)      2.39e-04  9.61e-06  1.08e-06  2.50e-04    4.0
     elm_S1S2_000 (Reg_11)     2.52e-04  3.55e-05  1.06e-06  2.89e-04    4.6
     elm_S1S2_001 (Reg_10)     2.66e-04  3.49e-05  1.08e-06  3.02e-04    4.8
     elm_S1S2_002 (Reg_9)      2.54e-04  3.25e-05  1.07e-06  2.88e-04    4.6
     elm_S1S2_003 (Reg_8)      2.65e-04  3.30e-05  1.08e-06  3.00e-04    4.7
     elm_S0S1_000 (Reg_7)      2.61e-04  3.67e-05  1.07e-06  2.99e-04    4.7
     elm_S0S1_001 (Reg_6)      2.63e-04  3.66e-05  1.07e-06  3.01e-04    4.8
     elm_S0S1_002 (Reg_5)      2.68e-04  3.89e-05  1.08e-06  3.08e-04    4.9
     val_S2S3 (RegRst_2)       5.55e-05  1.21e-07  1.72e-07  5.58e-05    0.9
     elm_S0S1_003 (Reg_4)      2.64e-04  3.43e-05  1.08e-06  2.99e-04    4.7
     minmax_S3 (MinMaxUnit_0)  1.90e-04  1.20e-04  2.11e-06  3.13e-04    5.0
     minmax0_S1 (MinMaxUnit_2) 2.06e-04  8.97e-05  2.09e-06  2.98e-04    4.7
     minmax0_S2 (MinMaxUnit_3) 2.01e-04  9.09e-05  2.11e-06  2.94e-04    4.7
     minmax1_S1 (MinMaxUnit_4) 2.03e-04  8.68e-05  2.08e-06  2.92e-04    4.6
     minmax1_S2 (MinMaxUnit_1) 2.00e-04  8.87e-05  2.07e-06  2.91e-04    4.6
     val_S0S1 (RegRst_0)       3.30e-05  2.58e-07  2.12e-07  3.35e-05    0.5
     val_S1S2 (RegRst_1)       2.19e-05  1.01e-07  1.48e-07  2.22e-05    0.4

These estimates are in Watts. The power of each module is broken down into internal power, switching power, and leakage power. Internal power and switching power are both forms of dynamic power. Internal power is the "dynamic power dissipated within the boundary of a cell". According to Synopsys documentation it includes power due to charging/discharging internal nodes within the cell but also short circuit power. Switching power is the dynamic "power dissipated by the charging and discharging of the load capacitance at the output of the cell". To learn more about how Synopsys PT does power analysis see the PrimeTime PX User Guide. From the breakdown you can see a relatively even distribution of the power across the modules, and that the dynamic power is much more significant than the leakage power.
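To make the dynamic versus leakage comparison concrete, here is a small Python sketch that uses only the rounded top-level SortUnitStructRTL row from the report above. The variable names are illustrative, and because the report values are rounded to three significant digits the components only add up to the total approximately.

```python
# Top-level SortUnitStructRTL row of the report above, in Watts.
internal, switching, leakage, total = 4.94e-03, 1.34e-03, 2.76e-05, 6.31e-03

# Internal and switching power are both forms of dynamic power.
dynamic = internal + switching

# The three components should add up to the reported total (up to rounding).
assert abs((dynamic + leakage) - total) < 1e-5

print(f"dynamic: {dynamic / total:.1%}, leakage: {leakage / total:.1%}")
```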

Let's do a quick experiment to compare the energy for sorting a stream of all zeros to the energy for sorting a stream of random values (which we just found to be 650pJ). We do not need to re-synthesize and re-place-and-route the design. We just need to generate a new VCD file and re-run Synopsys PT. So first we re-run the sort unit simulator with a different input:

 % cd $TOPDIR/pymtl/build
 % ../tut3_pymtl/sort/sort-sim --impl rtl-struct --input zeros --translate --dump-vcd

Now we need to change the entry in the Makefrag to point to the new VCD file. The entry in the Makefrag should look like this:

 ifeq ($(design),pymtl-sort)
   flow          = pymtl
   clock_period  = 1.0
   sim_build_dir = pymtl/build
   vsrc          = SortUnitStructRTL_0x4b8e51bd8055176a.v
   vmname        = SortUnitStructRTL_0x4b8e51bd8055176a
   viname        = TOP/v
   vcd           = sort-rtl-struct-zeros.verilator1.vcd
 endif

Now we re-run Synopsys PT:

 % cd $TOPDIR/asic/pt-pwr && make

   vsrc       = SortUnitStructRTL_0x4b8e51bd8055176a.v
   input      = sort-rtl-struct-zeros
   area       = 5067 # um^2
   constraint = 1.0 # ns
   slack      = 0.01 # ns
   cycle_time = 0.99 # ns
   exec_time  = 104 # cycles
   power      = 2.667 # mW
   energy     = 0.27459432 # nJ
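The summaries do not spell out how the energy figure is derived, but the numbers in both runs are consistent with multiplying the average power by the execution time (the cycle count times the cycle time). The following Python sketch reproduces that arithmetic under this assumption; energy_nj is a hypothetical helper, not part of the flow.

```python
def energy_nj(power_mw, exec_time_cycles, cycle_time_ns):
    """Energy in nJ, assuming energy = average power x execution time.

    mW * ns = pJ, so we divide by 1000 to convert to nJ.
    """
    exec_time_ns = exec_time_cycles * cycle_time_ns
    return power_mw * exec_time_ns / 1000.0

# Values taken from the two summaries (random input, then all-zeros input).
random_energy = energy_nj(power_mw=6.309, exec_time_cycles=104, cycle_time_ns=0.99)
zeros_energy = energy_nj(power_mw=2.667, exec_time_cycles=104, cycle_time_ns=0.99)

print(f"random: {random_energy:.4f} nJ, zeros: {zeros_energy:.4f} nJ")
print(f"zeros / random = {zeros_energy / random_energy:.2f}")
```

Since the execution time and cycle time are identical in the two runs, the energy ratio is simply the ratio of the two average power numbers.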

Not surprisingly, sorting a stream of zeros consumes significantly less energy compared to sorting a stream of random

values: 275pJ vs 650pJ. One might ask why the sort unit consumes any energy if it is just sorting a stream of zeros. We can

dig into the report to find the answer:

 % cd $TOPDIR/asic/pt-pwr/current-pt/reports
 % more pt-pwr.power.avg.max.report

                               Int       Switch    Leak      Total
   Hierarchy                   Power     Power     Power     Power         %
   -------------------------------------------------------------------------
   SortUnitStructRTL           2.19e-03  4.53e-04  2.75e-05  2.67e-03  100.0
     elm_S2S3_000 (Reg_3)      1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S2S3_001 (Reg_2)      1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S2S3_002 (Reg_1)      1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S2S3_003 (Reg_0)      1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S1S2_000 (Reg_11)     1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S1S2_001 (Reg_10)     1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S1S2_002 (Reg_9)      1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S1S2_003 (Reg_8)      1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S0S1_000 (Reg_7)      1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S0S1_001 (Reg_6)      1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     elm_S0S1_002 (Reg_5)      1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     val_S2S3 (RegRst_2)       5.54e-05  1.23e-07  1.72e-07  5.57e-05    2.1
     elm_S0S1_003 (Reg_)       1.63e-04  0.000     8.88e-07  1.64e-04    6.1
     minmax_S3 (MinMaxUnit_0)  0.000     0.000     2.28e-06  2.28e-06    0.1
     minmax0_S1 (MinMaxUnit_2) 0.000     0.000     2.24e-06  2.24e-06    0.1
     minmax0_S2 (MinMaxUnit_3) 0.000     0.000     2.24e-06  2.24e-06    0.1
     minmax1_S1 (MinMaxUnit_4) 0.000     0.000     2.24e-06  2.24e-06    0.1
     minmax1_S2 (MinMaxUnit_1) 0.000     0.000     2.24e-06  2.24e-06    0.1
     val_S0S1 (RegRst_0)       3.30e-05  2.57e-07  2.12e-07  3.35e-05    1.3
     val_S1S2 (RegRst_1)       2.18e-05  9.97e-08  1.48e-07  2.21e-05    0.8

Notice that the switching power is indeed zero for the pipeline registers, but not the valid bit. This is probably because the

valid bit does toggle at the beginning and end of the simulation; the absolute switching power of valid bit is very, very small.

Notice that there is still leakage, but none of this accounts for the majority of the 275pJ. The key is the internal power of the

pipeline registers. Internal power also includes the clock power for sequential state elements, so effectively while sorting a

stream of zeros results in very little energy on the data bits we still require energy to toggle the clock across all of the

pipeline registers. In this design there are twelve 16-bit pipeline registers, which is quite a bit of state. So the key point here is that you should always try small experiments to verify that things are working as expected, and that you will almost certainly need to dig into the detailed reports to understand what is going on.
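As a quick sanity check of this explanation, the following Python sketch adds up the internal power of the twelve data pipeline registers from the all-zeros report and compares it to the total power of the design. It uses only the rounded values printed in the report, so the percentages are approximate; the constant names are illustrative.

```python
# Values from the all-zeros power report above, in Watts.
NUM_DATA_REGS = 12          # elm_S0S1_*, elm_S1S2_*, elm_S2S3_*
REG_INTERNAL_W = 1.63e-04   # internal power of each data pipeline register
TOTAL_W = 2.67e-03          # total power of SortUnitStructRTL
# Total power of the five MinMaxUnits, which is leakage only in this run.
MINMAX_LEAK_W = [2.28e-06, 2.24e-06, 2.24e-06, 2.24e-06, 2.24e-06]

reg_internal = NUM_DATA_REGS * REG_INTERNAL_W
minmax_total = sum(MINMAX_LEAK_W)

print(f"data-register internal power: {reg_internal / TOTAL_W:.0%} of total")
print(f"MinMaxUnit power (leakage only): {minmax_total / TOTAL_W:.1%} of total")
```

The registers' internal power dominates even though no data bits toggle, which is consistent with the clock power explanation above.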

Using Verilog RTL Models

Students are welcome to use Verilog instead of PyMTL to design their RTL models. Having said this, we will still exclusively use PyMTL for all test harnesses, FL/CL models, and simulation drivers. This really simplifies managing the course, and PyMTL is actually a very productive way to test/evaluate your Verilog RTL designs. We use PyMTL's Verilog import feature described in the Verilog tutorial to make all of this work. The following commands will run all of the tests on the Verilog implementation of the sort unit.

 % cd $TOPDIR/pymtl/build
 % rm -rf *
 % py.test ../tut4_verilog/sort

As before, the tests for the SortUnitStructRTL will fail. You can just copy over your implementation of the MinMaxUnit from when you completed the Verilog tutorial. If you have not completed the Verilog tutorial then go back and do that now. After running the tests we use the sort unit simulator to translate the PyMTL RTL model into Verilog and to dump the VCD file that we want to use for power analysis.

 % cd $TOPDIR/pymtl/build
 % ../tut4_verilog/sort/sort-sim --impl rtl-struct --translate --dump-vcd

Take a moment to open up the translated Verilog which should be in a file named SortUnitStructRTL_0x4b8e51bd8055176a.v. You might ask, "Why do we need to use PyMTL to translate the Verilog if we already have the Verilog?" PyMTL will take care of preprocessing all of your Verilog RTL code to ensure it is in a single Verilog file. This greatly simplifies getting your design into the ASIC flow. This also ensures a one-to-one match between the Verilog that was used to generate the VCD file and the Verilog that is used in the ASIC flow.

One small but important note. If you use Verilog as your RTL modeling language you may need to update the Makefrag to point to a slightly different VCD file. So in this example, you will need to update the Makefrag entry for this design like this:

 ifeq ($(design),pymtl-sort)
   flow          = pymtl
   clock_period  = 1.0
   sim_build_dir = pymtl/build
   vsrc          = SortUnitStructRTL_0x4b8e51bd8055176a.v
   vmname        = SortUnitStructRTL_0x4b8e51bd8055176a
   viname        = TOP/v
   vcd           = sort-rtl-struct-random.verilator2.vcd
 endif

Notice that instead of sort-rtl-struct-random.verilator1.vcd we use sort-rtl-struct-random.verilator2.vcd. This is due to a technical detail in how PyMTL manages VCD filenames. Once you have tested your design and generated the single Verilog file and the VCD file, you can push the design through the ASIC flow using the exact same steps we used above.

 % cd $TOPDIR/asic/dc-syn  && make
 % cd $TOPDIR/asic/icc-par && make
 % cd $TOPDIR/asic/pt-pwr  && make

   vsrc       = SortUnitStructRTL_0x4b8e51bd8055176a.v
   input      = sort-rtl-struct-random
   area       = 5686 # um^2
   constraint = 1.0 # ns
   slack      = -0.01 # ns
   cycle_time = 1.01 # ns
   exec_time  = 104 # cycles
   power      = 6.261 # mW
   energy     = 0.65765544 # nJ

On Your Own

Now that you have gone through the entire ECE 5745 ASIC flow for both the PyMTL and Verilog implementation of the sort

unit, you should try the same approach for the GCD unit which is included in the tutorial. Explore the area, energy, and

timing of the GCD unit. Where is the critical path? How is the area allocated across the various submodules? How does the

energy of the GCD unit vary based on the input pattern?