Post on 13-Jan-2016
Winning with HDL
AGENDA
• Introduction
• HDL coding techniques
• Virtex hardware
• Summary
Coding for Performance
• Gate arrays are relatively tolerant of poor coding styles and design practices
• 66 MHz is easy for a gate array
• Designs coded for a gate array tend to run 3x slower when converted to an FPGA
• It is not uncommon to see FPGA designs with up to 30 levels of logic running at only 10-20 MHz
• 6-8 FPGA logic levels = 50 MHz
• FPGAs require different coding styles and more effective design methodologies to reach gate array system speeds.
Coding for Performance
A common mistake is to ignore the hardware and start coding as if programming. To achieve the best performance, the designer must think about the hardware.
Improve performance by:
• avoiding unnecessary priority structures in logic
• optimizing logic for late-arriving signals
• structuring arithmetic for performance
• avoiding area-inefficient code
• buffering high fanout signals
• pipelining for high performance
• exploiting high performance cores from CoreGen
Effective Coding Style: Case vs. If-Then-Else
[Figure: a 4:1 multiplexer (in0, in1, in2, in3, sel, mux_out) beside a priority-encoder chain that tests sel=00, sel=01, sel=10 in turn (p_encoder_out)]
module mux (in0, in1, in2, in3, sel, mux_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output mux_out;
  reg mux_out;
  always @(in0 or in1 or in2 or in3 or sel) begin
    case (sel)
      2'b00: mux_out = in0;
      2'b01: mux_out = in1;
      2'b10: mux_out = in2;
      default: mux_out = in3;
    endcase
  end
endmodule

module p_encoder (in0, in1, in2, in3, sel, p_encoder_out);
  input in0, in1, in2, in3;
  input [1:0] sel;
  output p_encoder_out;
  reg p_encoder_out;
  always @(in0 or in1 or in2 or in3 or sel) begin
    if (sel == 2'b00) p_encoder_out = in0;
    else if (sel == 2'b01) p_encoder_out = in1;
    else if (sel == 2'b10) p_encoder_out = in2;
    else p_encoder_out = in3;
  end
endmodule
Generally, If-Else is slower unless you intend to build a priority encoder!
Priority Encoder “if-then-else”: When to use?
• Assign highest priority to a late arriving critical signal
• Nested “if-then-else” can increase area and delay
• Use “case” statement if possible to describe the same function
always @(sel or in)
begin
  if (sel == 3'h0)
    out = in[0];
  else if (sel == 3'h1)
    out = in[1];
  else if (sel == 3'h2)
    out = in[2];
  else if (sel == 3'h3)
    out = in[3];
  else if (sel == 3'h4)
    out = in[4];
  else
    out = in[5];
end
[Figure: chain of cascaded selectors implementing the priority structure over in[0] through in[4]]
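The advice to give a late-arriving signal the highest priority can be sketched as follows. This is a minimal illustration, not taken from the slides: the module name late_priority and the signals late_sel and late_in are hypothetical, and late_sel is simply assumed to be the signal that arrives last. The early inputs go through a parallel case mux, and the late signal controls only the final 2:1 selection in front of the output.

```verilog
// Hypothetical example: late_sel and late_in are assumed to arrive late.
module late_priority (in0, in1, in2, in3, sel, late_sel, late_in, out);
  input       in0, in1, in2, in3;
  input [1:0] sel;
  input       late_sel, late_in;
  output      out;
  reg         out;
  reg         early;   // result of the early, non-critical selection

  always @(in0 or in1 or in2 or in3 or sel or late_sel or late_in) begin
    // Early signals need no priority, so describe them with a case (mux).
    case (sel)
      2'b00:   early = in0;
      2'b01:   early = in1;
      2'b10:   early = in2;
      default: early = in3;
    endcase

    // The late signal is tested first, so it passes through a single
    // 2:1 selection on its way to the output.
    if (late_sel) out = late_in;
    else          out = early;
  end
endmodule
```

How many logic levels the late signal actually sees depends on the synthesis tool and its optimization settings.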
Benefits of “case” statement
always @(C or D or E or F or G or H or I or J or S)
begin
  case (S)
    3'b000 : Z = C;
    3'b001 : Z = D;
    3'b010 : Z = E;
    3'b011 : Z = F;
    3'b100 : Z = G;
    3'b101 : Z = H;
    3'b110 : Z = I;
    default : Z = J;
  endcase
end
[Figure: 8:1 mux with inputs C, D, E, F, G, H, I, J, select S, and output Z]
• Compact and delay optimized implementation
• Implemented in a single CLB
• Synthesis maps to MUXF5 and MUXF6 functions
• 4:1 MUX is implemented in a single CLB slice
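Effective Coding Style: Optimize for the Critical Path
[Figure: two gate-level schematics with inputs in0, in1, in2, in3, and critical driving out; in the second, critical enters at the last gate before out]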
module critical_bad (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;
  assign out = (((in0&in1) & ~critical) | ~in2) & ~in3;
endmodule

module critical_good (in0, in1, in2, in3, critical, out);
  input in0, in1, in2, in3, critical;
  output out;
  assign out = ((in0&in1) | ~in2) & ~in3 & ~critical;
endmodule
Minimize the critical path where possible
Structuring Arithmetic for Performance
• Know your tools: use synthesis directives and options (vendor specific)
  - Area, speed, ungrouping and flattening, resource sharing, "DesignWare" libraries
  - Attributes: ripple, look-ahead, fastest, smallest, e.g. // exemplar attribute out1 modgen_sel fastest
  - LogiBlox or CORE Generator if the vendor hasn't fully tuned its arithmetic yet
• Use parentheses to control logical structure
-- No parentheses
OUT1 <= I1 + I2 + I3 + I4;
-- With parentheses
OUT1 <= (I1 + I2) + (I3 + I4);
[Figure: without parentheses, the adders for I1 through I4 form a chain to OUT1; with parentheses, they form a balanced two-level tree]
How to use the Carry-In in FPGA Express
In FPGA Express, concatenate the carry-in to get a single adder with carry (ADDER_C). Without concatenation (ADDER_B), you end up with 2 adders.
In other tools, like Leonardo, ADDER_B generates a single adder with carry-in; no concatenation is necessary.
// ADDER_A
// No carry-in
AOUT = AIN1 + AIN2;
// ADDER_B
// Carry-in used but 2 adders
BOUT = BIN1 + BIN2 + BCARRYIN;
// ADDER_C
// Carry-in used with only 1 adder required
// Concatenate
{COUT, CCARRYOUT} = {CIN1, CCARRYIN} + {CIN2, CCARRYIN};
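The ADDER_C concatenation can be written out with explicit widths to show why it works. The sketch below assumes 8-bit operands and a hypothetical module name adder_cin; neither comes from the slides. Appending the carry-in as an extra least-significant bit of both operands makes that bit position produce a carry exactly when the carry-in is 1, so a single addition yields both the sum and the carry-out, and the lowest result bit is discarded.

```verilog
// Hypothetical 8-bit version of the ADDER_C style; names and widths are assumptions.
module adder_cin (a, b, cin, sum, cout);
  input  [7:0] a, b;
  input        cin;
  output [7:0] sum;
  output       cout;

  wire [9:0] result;   // {carry-out, 8-bit sum, discarded bit}

  // The two appended cin bits add up to 0 or 2, so bit 0 of the result is
  // always 0 and a carry equal to cin ripples into the real sum bits.
  assign result = {1'b0, a, cin} + {1'b0, b, cin};
  assign sum    = result[8:1];
  assign cout   = result[9];
endmodule
```

Whether this maps to one carry chain still depends on the synthesis tool, as the notes above on FPGA Express and Leonardo indicate.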
Verilog Notes
For CASE statements, be sure to use your synthesis vendor’s syntax to ensure optimum performance.
• full_case syntax allows you to avoid unwanted latches
• parallel_case syntax allows you to ensure a parallel (as opposed to priority encoded) hardware implementation in case statements where all cases are mutually exclusive.
Use “Don’t-Cares” to speed up your design and reduce area
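As a rough illustration of these notes, the sketch below combines a parallel_case directive with a don't-care default. The module name onehot_mux and its ports are hypothetical, sel is assumed to be one-hot, and the directive is shown in the Synopsys comment style; other vendors spell it differently, so check your tool's documentation.

```verilog
// Hypothetical one-hot selector; names are assumptions for illustration.
module onehot_mux (sel, a, b, c, y);
  input  [2:0] sel;   // assumed one-hot: exactly one bit is set
  input        a, b, c;
  output       y;
  reg          y;

  always @(sel or a or b or c) begin
    // Vendor-specific directive: tells synthesis the branches are mutually
    // exclusive, so no priority logic is needed.
    case (1'b1) // synopsys parallel_case
      sel[0]:  y = a;
      sel[1]:  y = b;
      sel[2]:  y = c;
      // Don't-care default: every path assigns y, so no latch is inferred,
      // and synthesis is free to pick whatever value is cheapest.
      default: y = 1'bx;
    endcase
  end
endmodule
```

Because the directive is only a comment to the simulator, simulation and synthesis can disagree if sel is ever not one-hot, so the assumption needs to be guaranteed elsewhere in the design.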
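Avoid inefficient code
[Figure: two adders (a0 + b0 and a1 + b1) feeding a mux selected by sel, versus two muxes selected by sel feeding a single adder that produces sum]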
module poor_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;
  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel)
      sum = a1 + b1;
    else
      sum = a0 + b0;
  end
endmodule

module good_resource_sharing (a0, a1, b0, b1, sel, sum);
  input a0, a1, b0, b1, sel;
  output sum;
  reg sum;
  reg a_temp, b_temp;
  always @(a0 or a1 or b0 or b1 or sel) begin
    if (sel) begin
      a_temp = a1;
      b_temp = b1;
    end
    else begin
      a_temp = a0;
      b_temp = b0;
    end
    sum = a_temp + b_temp;
  end
endmodule
Use 2 muxes rather than 2 adders to reduce resource usage
Duplicate Registers to Reduce Fan-Out
module low_fanout (in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en1, tri_en2;
  // Two copies of the enable register, each driving half of the bus
  always @(posedge clk) begin
    tri_en1 <= en;
    tri_en2 <= en;
  end
  always @(tri_en1 or in) begin
    if (tri_en1) out[23:12] = in[23:12];
    else out[23:12] = 12'bZ;
  end
  always @(tri_en2 or in) begin
    if (tri_en2) out[11:0] = in[11:0];
    else out[11:0] = 12'bZ;
  end
endmodule

module high_fanout (in, en, clk, out);
  input [23:0] in;
  input en, clk;
  output [23:0] out;
  reg [23:0] out;
  reg tri_en;
  // One enable register drives all 24 tri-state controls
  always @(posedge clk) tri_en <= en;
  always @(tri_en or in) begin
    if (tri_en) out = in;
    else out = 24'bZ;
  end
endmodule
[Figure: with a single enable register, tri_en drives 24 loads between [23:0]in and [23:0]out; with duplicated registers, tri_en1 and tri_en2 drive 12 loads each]
Design Partition - Reg at Boundary
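[Figure: the adder producing sum from a0 and a1, shown with registers clocked by clk placed at the module boundary in two different ways]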
module reg_at_boundary (a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;
  always @(posedge clk) begin
    sum <= a0 + a1;
  end
endmodule

module reg_in_module (a0, a1, clk, sum);
  input a0, a1, clk;
  output sum;
  reg sum;
  reg a0_temp, a1_temp;
  always @(posedge clk) begin
    a0_temp <= a0;
    a1_temp <= a1;
  end
  always @(a0_temp or a1_temp) begin
    sum = a0_temp + a1_temp;
  end
endmodule
Pipeline for Performance
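module no_pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp;
  // Multiply and add share one clock cycle
  always @(posedge clk) begin
    out <= (a_temp * b_temp) + c_temp;
    a_temp <= a;
    b_temp <= b;
    c_temp <= c;
  end
endmodule

module pipeline (a, b, c, clk, out);
  input a, b, c, clk;
  output out;
  reg out;
  reg a_temp, b_temp, c_temp1, c_temp2, mult_temp;
  // Stage 1: multiply
  always @(posedge clk) begin
    mult_temp <= a_temp * b_temp;
    a_temp <= a;
    b_temp <= b;
  end
  // Stage 2: add; c is delayed one extra cycle to stay aligned with the product
  always @(posedge clk) begin
    out <= mult_temp + c_temp2;
    c_temp2 <= c_temp1;
    c_temp1 <= c;
  end
endmodule
[Figure: 1 cycle version, with the multiplier (*) and adder (+) between a, b, c and out in a single stage; 2 cycle version, with a register between the multiplier and the adder]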
Pipeline to increase performance
Take Advantage of Virtex Hardware
Use flip-flops and pipeline! FPGAs contain hordes of flip-flops.
Virtex gives you 4 DLLs that can be used to synchronize clocks for superior system timing
Use the optimized cores from CoreGen to get high performance, pipelined arithmetic and sophisticated functional blocks.
RTL Flexibility for Register Configurations
Register mapping for:
• Registers with sync/async set and reset
• Clocks, inverted clocks, and clock enable
Positive edge triggered flip-flop with clock enable, synchronous clear, and asynchronous preset
always @(posedge clk or posedge preset)
begin
  if (preset)
    q <= 1;
  else if (reset)
    q <= 0;
  else if (CE)
    q <= data;
end
[Figure: flip-flop with inputs data, clk, ce, reset, preset and output q]
Timing Driven Register IOB Mapping
• Technology mapping will not duplicate registers
• Critical signal will not be absorbed in the IOB register
-- Data_out is OUT [23:0] in the figure; "out" itself is a reserved word in VHDL
process (Clk)
begin
  if (Clk'event and Clk = '1') then
    Tri_R <= Tri;
  end if;
end process;

process (Tri_R, Data_in)
begin
  if (Tri_R = '1') then
    Data_out <= Data_in;
  else
    Data_out <= (others => 'Z');
  end if;
end process;
[Figure: TRI registered on CLK as TRI_R, which enables the tri-state drivers from DATA [23:0] to OUT [23:0]; fanout = 24]
Timing Driven Register IOB Mapping
• Duplicate register on critical path for fanout of 1
• Mapping will absorb register in IOB
-- Data_out is OUT in the figure; "out" itself is a reserved word in VHDL
process (Clk)
begin
  if (Clk'event and Clk = '1') then
    Tri_R1 <= Tri;
    Tri_R2 <= Tri;
  end if;
end process;

process (Tri_R1, Data_in)
begin
  if (Tri_R1 = '1') then
    Data_out(23) <= Data_in(23);
  else
    Data_out(23) <= 'Z';
  end if;
end process;

process (Tri_R2, Data_in)
begin
  if (Tri_R2 = '1') then
    Data_out(22 downto 0) <= Data_in(22 downto 0);
  else
    Data_out(22 downto 0) <= (others => 'Z');
  end if;
end process;
[Figure: TRI registered on CLK as TRI_R1, enabling DATA [23] to OUT [23] with fanout = 1; TRI registered on CLK as TRI_R2, enabling DATA [22:0] to OUT [22:0] with fanout = 23]
Area Efficient Muxes using TBUFs
• Improve area efficiency by using tri-states
• Each CLB has 2 TBUFs
// Tri-state (TBUF) implementation
assign Q[7:0] = E0 ? A[7:0] : 8'bz;
assign Q[7:0] = E1 ? B[7:0] : 8'bz;
assign Q[7:0] = E2 ? C[7:0] : 8'bz;
assign Q[7:0] = E3 ? D[7:0] : 8'bz;
// Mux implementation
case (E)
  4'b0001 : Q[7:0] = A[7:0];
  4'b0010 : Q[7:0] = B[7:0];
  4'b0100 : Q[7:0] = C[7:0];
  4'b1000 : Q[7:0] = D[7:0];
endcase
[Figure: a mux selecting A[7:0], B[7:0], C[7:0], D[7:0] with E[3:0] to drive Z[7:0], beside four tri-state buffers enabled by E0, E1, E2, E3 driving a shared Z[7:0] bus]
TBUFs as Muxes Performance Summary
• Improve area efficiency by using tri-states
• But often slower than equivalent muxes under most circumstances
• Too much delay getting onto TBUF
• Each CLB has 2 TBUFs
• PAR can connect tri-states on multiple horizontal long lines to build wide muxes
Distributed RAM Inferencing: System Memory
module ramtest (q, addr, d, we, clk);
  output [3:0] q;
  input [3:0] d;
  input [2:0] addr;
  input we;
  input clk;
  reg [3:0] mem [7:0];
  // Asynchronous read, synchronous write
  assign q = mem[addr];
  always @(posedge clk) begin
    if (we) mem[addr] <= d;
  end
endmodule
[Figure: Synplicity result (RAM 8x4): Addr [2:0], D [3:0], we, and clk feed a bank of RAM16x1S primitives (pins A0-A3, D, WCLK, WE) that produce q [3:0]]
• Synplify and LeonardoSpectrum can infer distributed RAM
• FPGA Express will support RAM inferencing in future
Registered IO Mapping: System Interfaces
• System timing: chip-to-chip performance limits system speeds
• No need to instantiate IOB register cells
• Implementation tools will pack registers in the IO: map -pr b
  - b (both input and output)
  - i (input only)
  - o (output only)
• IOB = TRUE attribute
• Mapping for data and enable ports
[Figure: IOB with an output register (D, CE, CLK, S/R, Q) driving OBUF and an input register (D, CE, CLK, S/R, Q) fed by IBUF]
Instantiating Technology Specific Features
• Block RAM: system memory
• CLKDLL: minimizes clock skew
• Special IOs: interfacing with standard buses
• LUTs for datapath pipelining: add latency with minimal area impact
LUTs for Datapath Pipelining
• A LUT can be used in place of registers to balance pipeline stages
• Area efficient implementation
• SRL16E can delay an input value up to 16 clock cycles
• Sync up operands before the next operation
[Figure: operands A[31:0], B[31:0], C[31:0] pass through blocks F, G, H (8 cycles, 5 cycles, 1 cycle) to produce Z; SRL16E delay lines (pins D, CE, CLK, A3-A0, Q) tapped at Q7 and Q12 balance the paths. 32 LUTs replace 256 registers; 32 LUTs replace 416 registers]
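A delay line of this kind can also be described in RTL rather than instantiated. The following is a minimal sketch, assuming a synthesis tool that recognizes shift registers without reset and maps them onto SRL16E; the module name delay_line and its parameters are hypothetical and not from the slides. With WIDTH = 32 and DEPTH = 8 it describes the 256 flip-flops that the slide says 32 LUTs can replace.

```verilog
// Hypothetical delay line; names and parameter values are assumptions.
module delay_line (clk, ce, d, q);
  parameter WIDTH = 32;
  parameter DEPTH = 8;    // cycles of delay; 1 to 16 fits one SRL16E per bit

  input              clk, ce;
  input  [WIDTH-1:0] d;
  output [WIDTH-1:0] q;

  reg [WIDTH-1:0] stage [0:DEPTH-1];
  integer i;

  // No reset on purpose: SRL16E has no set/reset input, so adding one
  // would typically force the tool back to ordinary flip-flops.
  always @(posedge clk) begin
    if (ce) begin
      stage[0] <= d;
      for (i = 1; i < DEPTH; i = i + 1)
        stage[i] <= stage[i-1];
    end
  end

  // The value written DEPTH enabled clock edges ago
  assign q = stage[DEPTH-1];
endmodule
```

If the tool does not infer the shift-register LUT, instantiating SRL16E directly remains the fallback.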
Block RAM: System Memory
// Verilog instantiation
RAMB4_S1 U1 (.WE(WE), .EN(EN), .RST(RST), .CLK(CLK), .ADDR(ADDR), .DI(DI), .DO(DO));
-- VHDL component declaration and instantiation
component RAMB4_S1
  port (WE, EN, RST, CLK : in STD_LOGIC;
        ADDR : in STD_LOGIC_VECTOR(11 downto 0);
        DO : out STD_LOGIC;
        DI : in STD_LOGIC_VECTOR(0 downto 0));
end component;
begin
  U1: RAMB4_S1 port map (WE=>WE, EN=>EN, RST=>RST, CLK=>CLK, DI=>DI, ADDR=>ADDR, DO=>DO);
[Figure: RAMB4_S1 block with ports WE, EN, RST, CLK, ADDR, DI, and DO]
• Instantiate single and dual port RAM
• Use CoreGen to build RAM and FIFO (Q1 ’99)
Virtex CLKDLL
• Minimize clock to out pad delay
• Removes all delay from external GCLKPAD pin to the registers and RAM
• BUFGDLL is available for instantiation
• Other configurations can be built by instantiating the CLKDLL macro
• UCF is the only way to configure CLKDLL or BUFGDLL
• In future would like to use generics (VHDL) and parameters (Verilog), but synthesizers don't pass them on yet
wire clk_fb;
BUFGDLL U4 (.I(clkin), .O(clk_fb));
[Figure: BUFGDLL U4 expanded: clkin enters through IBUFG to CLKDLL (inputs CLKIN, CLKFB, RST; outputs CLK0, CLK90, CLK180, CLK270, CLK2X, CLKDV, LOCKED), and the clock returns through BUFG as clk_fb]
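Building a configuration from the CLKDLL macro might look like the sketch below, which wires up the same IBUFG, CLKDLL, and BUFG arrangement by hand and additionally brings out the CLK2X and LOCKED outputs. The wrapper name dll_example and all net names are hypothetical; the primitive and port names are those of the Xilinx library, and the block needs that library to simulate or synthesize.

```verilog
// Hypothetical wrapper; module and net names are assumptions.
module dll_example (clkin, rst, clk, clk2x, locked);
  input  clkin, rst;
  output clk, clk2x, locked;

  wire clkin_buf;   // clock after the dedicated input buffer
  wire clk0_dll;    // deskewed clock straight out of the DLL

  IBUFG U1 (.I(clkin), .O(clkin_buf));

  CLKDLL U2 (
    .CLKIN  (clkin_buf),
    .CLKFB  (clk),        // feedback taken after the global buffer
    .RST    (rst),
    .CLK0   (clk0_dll),
    .CLK90  (),           // unused outputs left unconnected
    .CLK180 (),
    .CLK270 (),
    .CLK2X  (clk2x),      // unbuffered here; add a BUFG before clocking logic with it
    .CLKDV  (),
    .LOCKED (locked)
  );

  BUFG U3 (.I(clk0_dll), .O(clk));
endmodule
```

Taking the feedback from the BUFG output is what lets the DLL cancel the clock distribution delay.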
Special IO Buffers: System Interfaces
• Default IO buffer is LVTTL (12 mA), available via inference
• Process technology leads to mixed voltage systems
• High performance, low power signal standards emerging
• Instantiate IO buffers for:
  - non default current drive
  - non default voltage standard
  - non default slew
// Accelerated Graphics Port bus interface (Pentium II graphics)
OBUF_AGP U0 (.I(awire), .O(oport));
// Fast slew rate and 24 mA drive strength
OBUF_F_24 U1 (.I(awire), .O(oport));
Summary
• Efficient HDL coding allows designers to build high performance designs
• Designers should consider the underlying hardware as they code, to achieve best results
• Exploit the hardware’s features for best performance