Post on 18-Jan-2021


Lecture 19 Performance Optimization

Xuan ‘Silvia’ Zhang Washington University in St. Louis

http://classes.engineering.wustl.edu/ese461/

Project FAQ

•  Correction
   –  typo in optical flow: Iy(i, j) = I1(i, j+1) - I1(i, j-1)
   –  I1(i, j+1) might not exist (e.g., at the image boundary)

•  Mid-project report
   –  behavioral Verilog code and testbench
   –  show proof of working functional simulation
   –  ensure the code is synthesizable

•  Use of external memory
   –  instantiate in the test bench
   –  used for large data arrays or buffers


Arrays, Vectors, and Memories


Useful Verilog Features

•  Display tasks
   –  $display; $displayb, $displayh, $displayo print in binary, hex, and octal
   –  $write, $strobe, $monitor

•  File I/O tasks
   –  $fopen, $fclose
   –  $fdisplay, $fwrite, $fstrobe, $fmonitor
   –  $readmemb, $readmemh: read a text file into memory
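The tasks listed above can be combined in a small testbench. The following is a minimal sketch, not taken from the lecture: the module name `tb_file_io` and the file names `image.hex` and `dump.txt` are hypothetical, and the memory size is chosen only for illustration.

```verilog
// Testbench sketch: load a memory image, then log its contents.
// "image.hex" and "dump.txt" are hypothetical file names.
`timescale 1ns/1ps
module tb_file_io;
  reg [7:0] mem [0:15];   // 16-entry, 8-bit memory filled by $readmemh
  integer   fd;           // file descriptor returned by $fopen
  integer   i;

  initial begin
    // Each whitespace-separated hex value in the file fills one memory word.
    $readmemh("image.hex", mem);

    fd = $fopen("dump.txt", "w");
    if (fd == 0) begin
      $display("could not open dump.txt");
      $finish;
    end

    for (i = 0; i < 16; i = i + 1) begin
      $display("mem[%0d] = %h", i, mem[i]);   // to the simulator console
      $fdisplay(fd, "%0d %b", i, mem[i]);     // same data, in binary, to the file
    end

    $fclose(fd);
    $finish;
  end
endmodule
```

Because the memory lives in the testbench, this pattern also fits the project note about instantiating external memory outside the synthesizable design.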


Module Partitioning

•  Where possible, register module outputs and keep the critical path in one block

•  Design registering
   –  pipelining
   –  restructure a long data path with several levels of logic and break it up over multiple cycles


Pipelining
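The figures for this slide are not in the transcript. As a stand-in, here is a minimal sketch of pipelining a multiply-add, assuming a hypothetical module `mac_pipelined` with width parameter `W`. The long path (multiplier followed by adder) is split by a register so each clock cycle only has to cover one of the two operations.

```verilog
// Two-stage pipelined multiply-add: y = a * b + c.
// Module and signal names are illustrative assumptions.
module mac_pipelined #(parameter W = 16) (
  input  wire           clk,
  input  wire [W-1:0]   a,
  input  wire [W-1:0]   b,
  input  wire [2*W-1:0] c,
  output reg  [2*W:0]   y
);
  reg [2*W-1:0] prod_q;  // stage-1 register holding the product
  reg [2*W-1:0] c_q;     // c delayed one cycle so it lines up with prod_q

  always @(posedge clk) begin
    // Stage 1: multiply only.
    prod_q <= a * b;
    c_q    <= c;
    // Stage 2: add only. The result is registered, so y reflects the
    // inputs sampled two clock edges earlier.
    y      <= prod_q + c_q;
  end
endmodule
```

The cost is one extra cycle of latency and the extra registers; throughput stays at one result per cycle.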


Adding Structure

•  Control the structure by using separate assignments and parentheses

•  Example
   –  32-bit arithmetic shift right
   –  design 1
   –  design 2


32-Bit Arithmetic Shift Right

•  Design 3


32-Bit Arithmetic Shift Right

•  Optimal structured design


32-Bit Arithmetic Shift Right

•  Without specifying the mux instantiations
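The design code from these slides did not survive in the transcript. The sketch below shows one common structured form, assuming a hypothetical module `asr32`: five 2-to-1 mux stages written with the conditional operator rather than instantiated mux cells, each stage shifting by a power of two and filling with the sign bit.

```verilog
// 32-bit arithmetic shift right as a 5-stage logarithmic shifter.
// Names are illustrative; this is not the lecture's own code.
module asr32 (
  input  wire [31:0] a,
  input  wire [4:0]  sh,
  output wire [31:0] y
);
  wire        s  = a[31];                              // sign bit to replicate
  wire [31:0] t0 = sh[0] ? {s,       a[31:1]}   : a;   // shift by 1
  wire [31:0] t1 = sh[1] ? {{2{s}},  t0[31:2]}  : t0;  // shift by 2
  wire [31:0] t2 = sh[2] ? {{4{s}},  t1[31:4]}  : t1;  // shift by 4
  wire [31:0] t3 = sh[3] ? {{8{s}},  t2[31:8]}  : t2;  // shift by 8
  assign      y  = sh[4] ? {{16{s}}, t3[31:16]} : t3;  // shift by 16
endmodule
```

The result should match `$signed(a) >>> sh`; writing the stages out makes the five-level mux structure explicit instead of leaving it to the synthesis tool.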


Horizontal Partitioning

•  Break the circuit into horizontal slices to minimize maximum fan-in

•  Example
   –  carry lookahead adder: 32-bit adder broken into eight 4-bit blocks
   –  32-bit priority encoder


32-Bit Priority Encoder

•  Restructured with four 8-bit blocks
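A sketch of the four-block restructuring follows. The module names `penc8` and `penc32`, the `valid` outputs, and the choice that the most significant set bit has highest priority are all assumptions, since the slide's code is not in the transcript.

```verilog
// 8-bit priority encoder slice: the highest set bit wins (assumed convention).
module penc8 (
  input  wire [7:0] d,
  output reg  [2:0] idx,
  output wire       valid
);
  assign valid = |d;
  always @* begin
    casez (d)
      8'b1???????: idx = 3'd7;
      8'b01??????: idx = 3'd6;
      8'b001?????: idx = 3'd5;
      8'b0001????: idx = 3'd4;
      8'b00001???: idx = 3'd3;
      8'b000001??: idx = 3'd2;
      8'b0000001?: idx = 3'd1;
      default:     idx = 3'd0;
    endcase
  end
endmodule

// 32-bit encoder built from four slices plus a small second-level select.
module penc32 (
  input  wire [31:0] d,
  output reg  [4:0]  idx,
  output wire        valid
);
  wire [2:0] i0, i1, i2, i3;
  wire [3:0] v;

  penc8 u0 (.d(d[ 7: 0]), .idx(i0), .valid(v[0]));
  penc8 u1 (.d(d[15: 8]), .idx(i1), .valid(v[1]));
  penc8 u2 (.d(d[23:16]), .idx(i2), .valid(v[2]));
  penc8 u3 (.d(d[31:24]), .idx(i3), .valid(v[3]));

  assign valid = |v;

  always @* begin
    casez (v)
      4'b1???: idx = {2'd3, i3};   // highest non-empty block supplies the low bits
      4'b01??: idx = {2'd2, i2};
      4'b001?: idx = {2'd1, i1};
      default: idx = {2'd0, i0};
    endcase
  end
endmodule
```

The four slices evaluate in parallel, so no single gate needs to see all 32 inputs.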


Priority-Encoded Logic vs Balanced Logic

•  If-Then-Else vs Case Statement
   –  redundant priority
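To illustrate the contrast, here is a sketch with two hypothetical modules, `sel_priority` and `sel_balanced`, that compute the same function. The conditions on `sel` are mutually exclusive, so the priority implied by the if-else chain adds nothing. A synthesis tool may well optimize both to the same netlist, so treat the difference as one of stated intent rather than a guaranteed result.

```verilog
// Priority-encoded form: each branch is tested only if earlier ones failed.
module sel_priority (
  input  wire [1:0] sel,
  input  wire [7:0] a, b, c, d,
  output reg  [7:0] y
);
  always @* begin
    if      (sel == 2'd0) y = a;
    else if (sel == 2'd1) y = b;
    else if (sel == 2'd2) y = c;
    else                  y = d;
  end
endmodule

// Balanced form: a case statement describes a single 4-to-1 mux.
module sel_balanced (
  input  wire [1:0] sel,
  input  wire [7:0] a, b, c, d,
  output reg  [7:0] y
);
  always @* begin
    case (sel)
      2'd0:    y = a;
      2'd1:    y = b;
      2'd2:    y = c;
      default: y = d;
    endcase
  end
endmodule
```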


Hierarchy

•  Collapse hierarchy (flattening)
   –  more efficient synthesis

•  Add hierarchy
   –  benefit results from structure preservation
   –  example: 32-bit decoder
   –  least-efficient implementation


32-Bit Decoder

•  More concise representation

•  A balanced tree decoder is even better


32-Bit Balanced-Tree Decoder
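The slide's diagram is not in the transcript. The sketch below assumes a 5-to-32 decoder and shows a concise form next to a two-level tree form. The module names and the 2-bit/3-bit split of the address are illustrative choices, not necessarily the lecture's.

```verilog
// Concise form: a one-hot output from a single shift.
module dec5to32_concise (
  input  wire [4:0]  a,
  output wire [31:0] y
);
  assign y = 32'd1 << a;
endmodule

// Tree form: predecode the address in two groups, then AND one line
// from each group per output, so every output gate has only two inputs.
module dec5to32_tree (
  input  wire [4:0]  a,
  output wire [31:0] y
);
  wire [3:0] hi = 4'b0001      << a[4:3];  // 2-to-4 predecode
  wire [7:0] lo = 8'b0000_0001 << a[2:0];  // 3-to-8 predecode

  genvar i;
  generate
    for (i = 0; i < 32; i = i + 1) begin : g_and
      assign y[i] = hi[i / 8] & lo[i % 8];
    end
  endgenerate
endmodule
```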


Performing Operations in Parallel

•  Example –  linear search


Performing Operations in Parallel

•  Example –  binary search


Performing Operations in Parallel

•  Example –  parallel search
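A minimal sketch of the parallel version, assuming a hypothetical module `parallel_search8` that looks for an 8-bit key among eight entries packed into one bus. All eight comparisons happen in the same cycle, and a small priority stage turns the hit vector into an index. A linear search would instead visit one entry per cycle.

```verilog
// Parallel search over eight 8-bit entries; names and sizes are assumptions.
module parallel_search8 (
  input  wire [63:0] entries,  // entry k occupies bits [8*k +: 8]
  input  wire [7:0]  key,
  output wire        found,
  output reg  [2:0]  index
);
  wire [7:0] hit;

  genvar k;
  generate
    for (k = 0; k < 8; k = k + 1) begin : g_cmp
      assign hit[k] = (entries[8*k +: 8] == key);  // one comparator per entry
    end
  endgenerate

  assign found = |hit;

  integer i;
  always @* begin
    index = 3'd0;
    // Iterate downward so the last assignment comes from the lowest
    // matching entry, which therefore wins.
    for (i = 7; i >= 0; i = i - 1)
      if (hit[i]) index = i[2:0];
  end
endmodule
```

The trade-off is area: eight comparators instead of one, in exchange for a single-cycle answer.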


MUX for Conditional Assignment

•  Example: counter
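The counter code from this slide is not in the transcript, so the following is only one plausible reading, with a hypothetical module `updown_counter`: the conditional assignment selects the step value, and a single adder serves both directions, instead of computing `count + 1` and `count - 1` separately and selecting between the results.

```verilog
// Up/down counter with a mux on the adder operand (illustrative sketch).
module updown_counter (
  input  wire       clk,
  input  wire       rst,   // synchronous reset, active high (assumed)
  input  wire       up,
  output reg  [7:0] count
);
  // +1 when counting up; 8'hFF is -1 in 8-bit two's complement.
  wire [7:0] step = up ? 8'd1 : 8'hFF;

  always @(posedge clk) begin
    if (rst) count <= 8'd0;
    else     count <= count + step;
  end
endmodule
```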


Replication

•  Large fanout
   –  manual register duplication to reduce congestion
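A sketch of manual duplication, assuming a hypothetical module `enable_replicated`: the same enable is registered twice, and each copy drives half of the load. Synthesis tools tend to merge identical registers, so some tool-specific directive is usually needed to keep both. The `keep` attribute used here is an assumption, and the actual attribute name depends on the tool.

```verilog
// Register duplication to halve the fanout of an enable signal.
module enable_replicated (
  input  wire        clk,
  input  wire        en_in,
  input  wire [31:0] d,
  output reg  [31:0] q
);
  // Two copies of the same register; attribute name is tool-specific (assumed).
  (* keep = "true" *) reg en_a;
  (* keep = "true" *) reg en_b;

  always @(posedge clk) begin
    en_a <= en_in;
    en_b <= en_in;
  end

  always @(posedge clk) begin
    if (en_a) q[15:0]  <= d[15:0];   // lower half driven by copy A
    if (en_b) q[31:16] <= d[31:16];  // upper half driven by copy B
  end
endmodule
```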


Resource Sharing

•  Optimize area but hurt speed
   –  with resource sharing


Resource Sharing

•  Optimize area but hurt speed
   –  without resource sharing
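The two cases can be sketched side by side. The module names `add_unshared` and `add_shared` and the 8-bit widths are assumptions. Without sharing, both sums are computed and a mux picks one. With sharing, muxes pick the operands and a single adder does the work, which saves an adder but places the operand muxes in front of it on the data path.

```verilog
// Without resource sharing: two adders, then a mux on the results.
module add_unshared (
  input  wire       sel,
  input  wire [7:0] a, b, c, d,
  output wire [8:0] y
);
  wire [8:0] s0 = a + b;
  wire [8:0] s1 = c + d;
  assign y = sel ? s1 : s0;
endmodule

// With resource sharing: muxes on the operands, then one adder.
module add_shared (
  input  wire       sel,
  input  wire [7:0] a, b, c, d,
  output wire [8:0] y
);
  wire [7:0] x = sel ? c : a;
  wire [7:0] z = sel ? d : b;
  assign y = x + z;
endmodule
```

Which version is faster in practice depends on when `sel` arrives relative to the data inputs, so timing should be checked for the specific design.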


Questions?

Comments?

Discussion?
