Continuing Challenges in Static Timing Analysis

Tom Spyrou

TAU 2013

3/2013

Goal of this talk

Take a higher-level view than the latest trends

Remind ourselves of the trade-offs we have made as an industry to have a workable solution for STA
- Signoff
- Embedded in design synthesis and optimization
- There is plenty of discussion of new effects; let's discuss core STA

Explain the basis of industrial algorithms to the academic community

Challenge ourselves to look at the issues again

Technology trends
- Design
- Compute


Why Static Timing Analysis

Dynamic simulation is impossible for even a small chip
- Assume combinational logic only
- 100 inputs implies 2^100 vectors are needed to verify timing, which is about 10^30 vectors
- If a simulator could process 10^6 vectors per second, this works out to a simulation time of about 10^19 days, or roughly 4 x 10^16 years
- Talk about a verification bottleneck!
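The estimate above can be checked with a few lines of arithmetic. This is only a sanity check of the slide's numbers; the variable names are illustrative, and the year length of 365.25 days is an assumption.

```python
# Back-of-the-envelope check of the exhaustive-simulation estimate.
NUM_INPUTS = 100              # combinational inputs, from the slide
VECTORS_PER_SECOND = 10**6    # simulator throughput, from the slide

vectors = 2**NUM_INPUTS                   # every input combination
seconds = vectors / VECTORS_PER_SECOND
days = seconds / 86_400                   # seconds per day
years = days / 365.25                     # assumed year length

print(f"vectors: {vectors:.2e}")
print(f"days:    {days:.2e}")
print(f"years:   {years:.2e}")
```

The vector count is on the order of 10^30 and the day count on the order of 10^19; the year count comes out in the 10^16 range.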

Now add in state elements and the problem of making sure the critical path is actually in the vector set

STA can analyze such a design in 1 minute
- There are some issues, but they can be mitigated

STA’s quality of result does not depend on the quality of the vector set

What is the trade-off / core issues?

These have been unchanged for a long time

A different kind of setup
- Result depends on the quality of constraints and exceptions
- If all storage elements are clocked and I/Os are constrained, the analysis is generally safe

Less accurate delay analysis
- The exact path is not really known, as it is with event-driven simulation
- When STA was first introduced this was less of an issue; PBA (path-based analysis) is now essential

Introduction of false paths due to topological rather than functional analysis
- Users have to specify these manually

Multiple circuit modes take extra effort
- Not just more vectors

Loops and level-sensitive latches add complexity

Analysis

Every circuit looks the same to STA, since it ignores the function of the logic.

Topological analysis

Simplifies the problem, at the cost of possibly reporting false paths

What do recent trends mean

Design
- Hyper-optimization means accuracy is critical
  - When a chip is designed in a bleeding-edge technology it will be pushed on all dimensions of power, performance and area
  - Simulation-based delay calculation
  - Path-based analysis
- Design size means memory use is the #1 problem
  - Largest chips are approaching 1 TB of RAM needed for flat runs
  - Hierarchical / parallel solutions must prioritize memory use on compute nodes
  - Runtime also needs to be faster, but the first step is to run on machines with reasonable cost
  - A recent design uses 750+ GB of RAM for single mode/corner STA

Compute
- CPU is cheap, data movement is expensive
  - Whenever you hear it's an expensive calculation, don't avoid it
- Parallel computing must not only improve performance but also accuracy and features
  - Don't just make the same problem go faster or just divide the data

If you ask a designer what doesn’t work well

Hierarchical timing in the final verification loop

SI calculations are very conservative

SDCs are large and hard to verify

Worst-case timing is done and process variation is modeled very pessimistically

Block-based analysis loses too much accuracy

True-delay reporting (looking at the combinational logic to prove a path true) is slow and can't run during optimization

Libraries limit flexibility of analysis

STA Industry and Academia

STA technology has been innovated inside Industry much more than in Academia

The key approaches are not documented

There is no open source reference to build from

Industry protects the core concepts as trade secrets

Academia rarely publishes on STA beyond single-clock designs or delay calculation

We need a book on the core search algorithm

Example, Veritime from the 90’s

STA Engine that required vectors for the clock

Dynamic simulation of the clock
- Period, multicycle paths, and clock-to-clock false paths automatically determined

STA for data portion

Absorbed by Cadence and forgotten since at the time SDCs were a lot easier to hand inspect

Requirements of an STA Engine

I would like to begin by documenting the basics that everyone in Industry knows. There are no company-specific trade secrets

Must run in linear memory and runtime with circuit size, number of clocks, exceptions, and number of storage elements
- Touch each vertex only once, maybe twice to simplify pre-processing, not once per clock or exception

Must support SDC timing constraints
- Clocks, clock tree assumptions, multi-cycle paths, false paths, path delays, cases and modes

Must be nearly SPICE accurate in delays and support path-based analysis

Must be incremental enough
- Netlist changes / full retrace on one extreme
- Query-based incremental with limited tracing on the other

The Basic Search

The Graph
- Startpoints are inputs to the circuit and clock inputs to storage elements
- Endpoints are outputs of the circuit and data inputs of storage elements

Propagate the Clocks
- For each clock input, BFS to all clock data pins
- Offset startpoint arrival times and endpoint required times with information from the clock propagation and cycle accounting

Propagate the Data
- Use a BFS from startpoints to endpoints
- Use multiple timing totals at every pin to take into account multiple clocks and exceptions
- Can optionally store back pointers to record K critical paths, but this time/memory is wasted on optimization programs and should be left to a reporting phase
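The data propagation step above can be sketched as a levelized breadth-first pass. This is a minimal illustration, not any tool's actual engine: it assumes an acyclic graph, a single timing total per pin, and startpoint arrivals that have already been offset by clock propagation. The function name `propagate_arrivals`, the edge-dictionary layout, and the pin names are all hypothetical.

```python
from collections import deque

def propagate_arrivals(edges, startpoints):
    """Worst (latest) arrival per pin on an acyclic timing graph.

    edges:       {(src_pin, dst_pin): delay}
    startpoints: {pin: arrival time, already offset by the clock}
    """
    fanout, pending = {}, {}
    for (src, dst), delay in edges.items():
        fanout.setdefault(src, []).append((dst, delay))
        pending[dst] = pending.get(dst, 0) + 1   # count fanin arcs
        pending.setdefault(src, 0)

    arrival = dict(startpoints)
    queue = deque(pin for pin, count in pending.items() if count == 0)
    while queue:
        pin = queue.popleft()            # each pin is dequeued exactly once
        for dst, delay in fanout.get(pin, []):
            if pin in arrival:           # skip sources with no arrival
                candidate = arrival[pin] + delay
                arrival[dst] = max(arrival.get(dst, candidate), candidate)
            pending[dst] -= 1
            if pending[dst] == 0:        # all fanin seen, dst is final
                queue.append(dst)
    return arrival

# Hypothetical circuit: a primary input and a flop output feed two gates.
edges = {
    ("in", "u1"): 1.0,
    ("ff1/Q", "u1"): 0.25,
    ("u1", "u2"): 0.5,
    ("ff1/Q", "u2"): 0.25,
    ("u2", "ff2/D"): 0.25,
}
arrival = propagate_arrivals(edges, {"in": 0.0, "ff1/Q": 0.5})
required = {"ff2/D": 2.0}                # endpoint required time
slack = {pin: required[pin] - arrival[pin] for pin in required}
print(slack)
```

No back pointers are stored here, in line with the point that path recording belongs in a reporting phase.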

Multiple Timing Totals with Partial Path

A simplistic implementation is that each clock and each exception gets its own total
- Simultaneously or via separate traces

Memory and/or runtime increase quickly
- Occurrence pins are the most common netlist object
- There can be thousands of exceptions

At timing endpoints, like totals can be combined and evaluated

At timing endpoints, point-to-point exceptions can be evaluated
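One way to picture multiple timing totals is a small dictionary per pin, keyed by a tag. The sketch below assumes the tag is simply the launch clock name and that one required time applies to every total at the endpoint; both are simplifications. The helper names `shift`, `merge`, and `endpoint_slacks` are hypothetical.

```python
def shift(totals, delay):
    """Advance every tagged total across a timing arc."""
    return {tag: arrival + delay for tag, arrival in totals.items()}

def merge(*total_sets):
    """Combine totals at a pin: like tags keep the latest arrival,
    unlike tags stay separate."""
    merged = {}
    for totals in total_sets:
        for tag, arrival in totals.items():
            merged[tag] = max(merged.get(tag, arrival), arrival)
    return merged

def endpoint_slacks(totals, capture_clock, required, false_clock_pairs=()):
    """Evaluate each total at an endpoint, dropping launch/capture clock
    pairs that are declared false."""
    return {
        tag: required - arrival
        for tag, arrival in totals.items()
        if (tag, capture_clock) not in false_clock_pairs
    }

# Two launch flops, one per clock, reach the same gate over three arcs.
from_ff1 = {"CLK1": 0.0}
from_ff2 = {"CLK2": 0.0}
gate_out = merge(shift(from_ff1, 1.0), shift(from_ff2, 1.5), shift(from_ff1, 1.25))
slacks = endpoint_slacks(gate_out, "CLK1", required=2.0,
                         false_clock_pairs={("CLK2", "CLK1")})
print(gate_out, slacks)
```

The pin holds one total per distinct tag rather than one per path, so the two CLK1 paths collapse into a single entry while the CLK2 total is kept apart until the endpoint decides what to do with it.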

Multiple Timing Totals

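[Figure: flip-flops clocked by CLK1 and CLK2 on either side of a block of combinational logic. Pins carry a table of timing totals with one entry per clock (CLK1, CLK2); the tables shown hold N/A and 0 in one case and 0 and N/A in the other.]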

Multiple Timing Totals with path completion data

A BFS has no information about paths

However, timing exceptions are specified in terms of from, through, and to paths with a boolean expression of pins
- Mcp -from a -through {b c} -to d
- From a, through b or c, and then to d

Each total can have a small state machine recording which exception points it has seen

At timing endpoints, like totals with like exception point data can be combined or, if false, not combined
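The state machine idea can be sketched for the single exception pattern above (from a, through b or c, to d). This is an illustrative reading of the slide, not a documented algorithm: the class `ThroughException`, the integer states, and the `propagate` helper are hypothetical, and the pin order is assumed to be a valid topological order supplied by the caller.

```python
DEAD = -1   # the total did not start at a -from pin

class ThroughException:
    """Tracks progress of one from/through/to exception."""
    def __init__(self, from_pins, through_pins, to_pins):
        self.from_pins = frozenset(from_pins)
        self.through_pins = frozenset(through_pins)
        self.to_pins = frozenset(to_pins)

    def start(self, pin):
        return 1 if pin in self.from_pins else DEAD

    def step(self, state, pin):
        # Advance only when a through point is seen after the from point.
        return 2 if state == 1 and pin in self.through_pins else state

    def applies(self, state, endpoint):
        return state == 2 and endpoint in self.to_pins

def propagate(exc, edges, order, start_pin):
    """Carry {state: latest arrival} totals through the graph."""
    totals = {start_pin: {exc.start(start_pin): 0.0}}
    for pin in order:
        for (src, dst), delay in edges.items():
            if src != pin or src not in totals:
                continue
            for state, arrival in totals[src].items():
                new_state = exc.step(state, dst)
                slot = totals.setdefault(dst, {})
                candidate = arrival + delay
                slot[new_state] = max(slot.get(new_state, candidate), candidate)
    return totals

exc = ThroughException({"a"}, {"b", "c"}, {"d"})
edges = {("a", "b"): 1.0, ("a", "x"): 2.0, ("b", "d"): 2.0, ("x", "d"): 2.0}
totals = propagate(exc, edges, ["a", "b", "x", "d"], "a")
for state, arrival in totals["d"].items():
    print(state, arrival, exc.applies(state, "d"))
```

At pin d the total that passed through b and the total that passed through x are kept apart because their states differ, so the exception can be applied to one and not the other without enumerating paths.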

Through exceptions

[Figure: the CLK1/CLK2 flip-flop circuit with through points B, C, and X marked. The per-pin tables gain exception-point entries (B, C) alongside the clock entries (CLK1, CLK2), holding values such as N/A, 0, and val.]

Framework can be used for Clock Pessimism

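[Figure: clock network with delays d1, d2, and d3. One total is tagged d1,d2 with arrival Arr 1 and another is tagged d1,d2,d3 with arrival Arr 2; where they meet, both tagged totals are shown.]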

Delay Calculation, Multiple Timing Totals

Worst-case slew merging is pessimistic but allows delay calculation to be a pre-process step

If delay calculation is done in the BFS, critical slew merging can be done

It is also possible for each timing total to carry its own slew to improve accuracy

Loops can be auto-detected and dynamically broken, avoiding accidental critical path breaks
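The three slew-handling choices above can be compared on a toy gate. Everything numeric here is made up: the linear `gate_delay` model and its coefficients are hypothetical, and "critical slew merging" is read as taking the slew of the latest-arriving input, which is an interpretation of the slide.

```python
def gate_delay(input_slew, intrinsic=0.25, slew_factor=0.5):
    """Hypothetical linear delay model, for illustration only."""
    return intrinsic + slew_factor * input_slew

def worst_case_merge(inputs):
    """Latest arrival paired with the worst slew, even from another input."""
    arrival = max(a for a, _ in inputs)
    slew = max(s for _, s in inputs)
    return arrival + gate_delay(slew)

def critical_merge(inputs):
    """Latest arrival paired with its own slew."""
    arrival, slew = max(inputs)      # tuples compare on arrival first
    return arrival + gate_delay(slew)

def per_total(inputs):
    """Every (arrival, slew) total is evaluated on its own."""
    return max(a + gate_delay(s) for a, s in inputs)

# (arrival, slew) at two inputs of one gate: early but slow, late but sharp.
inputs = [(1.875, 1.0), (2.0, 0.125)]
results = {
    "worst_case": worst_case_merge(inputs),
    "critical": critical_merge(inputs),
    "per_total": per_total(inputs),
}
print(results)
```

In this made-up case the worst-case result is the largest and the critical-slew result is the smallest, with the per-total result in between, which is the accuracy argument for letting each total carry its own slew.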

Incremental Timing

Netlist edits, full retime

Netlist edits, fanout cone retime

Netlist edits, query-based retime

The choice of how incremental to go depends on the optimization approach

More global cost functions require less incrementality

More locally greedy approaches require more
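The difference between fanout cone retime and query-based retime can be sketched with two reachability passes. This is an assumption-laden illustration: the helper names `reach` and `invert`, the netlist, and the idea of intersecting the dirty cone with the fanin cone of the queried endpoint are one plausible reading, not a description of any particular tool.

```python
from collections import deque

def reach(adjacency, seeds):
    """All pins reachable from the seeds, seeds included."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        pin = queue.popleft()
        for nxt in adjacency.get(pin, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def invert(fanout):
    """Build the fanin map from the fanout map."""
    fanin = {}
    for src, dsts in fanout.items():
        for dst in dsts:
            fanin.setdefault(dst, []).append(src)
    return fanin

# Hypothetical netlist connectivity.
fanout = {
    "a": ["u1"], "b": ["u2"],
    "u1": ["u3"], "u2": ["u3", "out2"],
    "u3": ["out"],
}

# Fanout cone retime: everything downstream of the edit is retimed.
dirty = reach(fanout, ["u2"])

# Query-based retime: only the dirty pins that can affect the queried pin.
needed_for_query = dirty & reach(invert(fanout), ["out"])

print(sorted(dirty), sorted(needed_for_query))
```

Here a slack query on out leaves out2 dirty until something asks for it, which suits a locally greedy optimizer that probes a few pins after each edit.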

STA needs innovation

Increased sharing with Academia

Increased research on the problems that are still problems

Redirect solutions in light of the Design and Compute trends

There is a lot of interesting work to do!

Some ideas

New constraint language that is more functional

Try to propagate the function with the delays
- Some combination with cycle-based simulation
- Constraint language enhancements

Library-less delay models

New data model which is stage based
- Focus on data locality

Hierarchical timing model which is truly context independent within acceptable limitations
- Constraint improvements to help constrain blocks more accurately
