Relevance Heuristics for Program Analysis

Ken McMillan

Cadence Research Labs

Introduction

• Program Analysis

– Based on abstract interpretation

– Useful tool for optimization and verification

– Strong tension between precision and cost

• Relevance heuristics

– Tailor abstract domain to property

– Key to scaling while maintaining enough information to prove useful properties

• This talk

– General principles underlying relevance heuristics

– Applying these ideas to program analysis using Craig interpolation

– Some recent research on analysis of heap manipulating programs

Static Analysis

• Compute the least fixed-point of an abstract transformer

– This is the strongest inductive invariant the analysis can provide

• Inexpensive analyses:

– interval analysis

– affine equalities, etc.

• These analyses lose information at a merge:

x = y      x = z

⊤ (after the merge)

This analysis is inexpensive, but insufficient if the disjunction is needed to prove the desired property.
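The loss at a merge can be sketched in a few lines. This is a toy stand-in, not a real affine-equalities domain: an abstract state is assumed to be a set of known variable equalities, the empty set plays the role of ⊤, and the names `join`, `then_branch` and `else_branch` are hypothetical.

```python
# Toy domain (an assumption for illustration): an abstract state is the set of
# variable equalities known to hold; the empty set is top.

def join(state_a, state_b):
    """Join at a control-flow merge: keep only facts that hold on both branches."""
    return state_a & state_b

then_branch = frozenset({("x", "y")})  # x = y
else_branch = frozenset({("x", "z")})  # x = z

merged = join(then_branch, else_branch)
print(merged)  # no equality survives, so the disjunction "x = y or x = z" is lost
```

A real equality domain would compute an affine hull rather than a set intersection, but the effect on this example is the same.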

Predicate abstraction

• Abstract transformer:

– strongest Boolean postcondition over given predicates

• Advantage: does not lose information at a merge

– join is disjunction

x = y      x = z

x = y ∨ x = z

• Disadvantage:

– Abstract state is exponential size in number of predicates

– Abstract domain has exponential height

• Result:

– Must use only predicates relevant to proving property
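A minimal sketch of why the join is precise but the state is large, assuming an abstract state is represented explicitly as a set of truth assignments ("cubes") over the predicates. The predicate list and the helper `cubes_where` are hypothetical names chosen for this illustration.

```python
from itertools import product

PREDICATES = ("x == y", "x == z")  # assumed predicate set for the merge example

def cubes_where(index, value, n=len(PREDICATES)):
    """All truth assignments over n predicates in which predicate `index` has `value`."""
    return frozenset(c for c in product((False, True), repeat=n) if c[index] == value)

then_branch = cubes_where(0, True)   # states where x == y holds
else_branch = cubes_where(1, True)   # states where x == z holds

merged = then_branch | else_branch   # join is disjunction, i.e. set union
num_cubes = 2 ** len(PREDICATES)     # size of one abstract state grows as 2^n
```

The merged state excludes only the assignment where both predicates are false, so the disjunction is kept exactly. The price is that a state may hold up to 2^n cubes.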

Relevance Heuristics

• Iterative refinement approach

– Analyze failure of abstraction to prove property

• Typically use failed program traces (CEGAR)

– Add relevant information to abstraction

• Must be sufficient to rule out failure

• Key questions

– How do we decide what program state information is “relevant”?

– Is relevance even a well-defined notion?

These questions have been well studied in the context of the Boolean satisfiability problem, and we can actually give some fairly concrete answers.

Principles

• Relevance:

– A relevant predicate is one that is used in a parsimonious proof of the desired property

• Generalization principle:

– Facts used in the proof of special cases tend to be relevant to the overall proof.

Relevance principles and SAT

• The Boolean Satisfiability Problem (SAT)

– Input: A Boolean formula in CNF

– Output: A satisfying assignment or UNSAT

• The DPLL approach:

– Branch. (assign values to variables)

– Propagate. (make deductions by unit resolution, or BCP)

– Learn. (deduce new clauses in response to conflicts)

Resolution rule:

p ∨ Γ      ¬p ∨ Δ
-----------------
Γ ∨ Δ

DPLL approach

Clauses: (¬a ∨ b) ∧ (¬b ∨ c ∨ d) ∧ (¬b ∨ ¬d)

Decisions: ¬c, a

Propagation: b, d

Conflict!

Resolve (¬b ∨ c ∨ d) with (¬b ∨ ¬d)

Learned clause: (¬b ∨ c)

• BCP guides clause learning by resolution

• Learning generalizes failures

• Learning guides decisions (VSIDS)
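The propagate-then-learn step on this slide can be sketched as follows. The literal polarities are an assumption (the negation signs did not survive in the transcript); they are chosen so that resolution yields the learned clause over b and c. Literals are encoded as integers (a=1, b=2, c=3, d=4, negative means negated), and `propagate` and `resolve` are hypothetical helper names, not a real solver API.

```python
def propagate(clauses, assignment):
    """Unit propagation (BCP). Extends `assignment`; returns a falsified clause or None."""
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assignment for lit in clause):
                continue  # clause already satisfied
            open_lits = [lit for lit in clause if -lit not in assignment]
            if not open_lits:
                return clause  # every literal is false: conflict
            if len(open_lits) == 1:
                assignment.add(open_lits[0])  # unit clause forces this literal
                changed = True
    return None

def resolve(c1, c2, var):
    """Resolution rule: from (var or G) and (not var or D), derive (G or D)."""
    assert var in c1 and -var in c2
    return (c1 - {var}) | (c2 - {-var})

# Assumed clause set: (not a or b), (not b or c or d), (not b or not d)
clauses = [frozenset({-1, 2}), frozenset({-2, 3, 4}), frozenset({-2, -4})]
assignment = {-3, 1}                      # decisions: not c, then a
conflict = propagate(clauses, assignment) # BCP derives b and d, then hits a conflict
learned = resolve(clauses[1], conflict, 4)  # resolve on d
```

The learned clause mentions neither a nor d: it records only what was needed for the conflict, which is the sense in which learning generalizes the failure.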

Two kinds of deduction

• Closing this loop focuses solver on relevant deductions

– Allows SAT solvers to handle millions of clauses

– Generates parsimonious proofs in case of unsatisfiability

• What lessons can we learn from this architecture for program analysis?

[Figure: loop connecting Case Splits, Propagation, and Generalization]

Propagation

• case-based

• lightweight

• exhaustive

Generalization

• general

• guided

Invariants from unwindings

• Consider this very simple approach:

– Partially unwind a program into a loop-free, in-line program

– Construct a Floyd/Hoare proof for the in-line program

– See if this proof contains an inductive invariant proving the property

• Example program:

x = y = 0;
while(*) { x++; y++; }
while(x != 0) { x--; y--; }
assert (y == 0);

invariant: {x == y}
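As a sanity check, the Hoare-style obligations for the candidate invariant x == y can be tested by brute force over a small range of integers. This is only a bounded test, not a proof, and the range `R` and the names below are assumptions made for the sketch.

```python
R = range(-20, 21)  # assumed finite test range; a real proof covers all integers

def inv(x, y):
    return x == y

# Initiation: x = y = 0 establishes the invariant.
init_ok = inv(0, 0)
# First loop body (x++; y++) preserves it.
loop1_ok = all(inv(x + 1, y + 1) for x in R for y in R if inv(x, y))
# Second loop body (x--; y--), entered only when x != 0, preserves it.
loop2_ok = all(inv(x - 1, y - 1) for x in R for y in R if inv(x, y) and x != 0)
# On exit of the second loop (x == 0), the invariant implies the assertion y == 0.
exit_ok = all(y == 0 for x in R for y in R if inv(x, y) and x == 0)
```

All four checks hold, which is what makes x == y usable as the invariant for both loops.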

Unwind the loops

Each statement of the unwound program is followed by its assertion in two alternative proofs. Both proofs start from {True}.

x = y = 0;            {x = 0 ∧ y = 0}     {y = 0}
x++; y++;             {x = y}             {y = 1}
x++; y++;             {x = y}             {y = 2}
[x != 0]; x--; y--;   {x = y}             {y = 1}
[x != 0]; x--; y--;   {x = 0 ⇒ y = 0}     {y = 0}
[x == 0] [y != 0]     {False}             {False}

• Proof of inline program contains invariants for both loops

• Assertions may diverge as we unwind

• A practical method must somehow prevent this kind of divergence!

Interpolation Lemma [Craig, 57]

• Notation: L(φ) is the set of FO formulas using

– the uninterpreted symbols of φ (predicates and functions)

– the logical symbols ∧, ∨, ¬, ∃, ∀, (), ...

• If A ∧ B = false, there exists an interpolant A' for (A,B) such that:

A ⇒ A'

A' ∧ B = false

A' ∈ L(A) ∩ L(B)

• Example:

– A = p ∧ q, B = ¬q ∧ r, A' = q
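The example can be checked with a truth table. The sketch below assumes the reading A = p and q, B = (not q) and r, A' = q, and tests the first two conditions of the lemma exhaustively; the third condition holds by inspection, since q is the only symbol shared by A and B.

```python
from itertools import product

def A(p, q, r):
    return p and q

def B(p, q, r):
    return (not q) and r

def A_prime(p, q, r):
    return q  # mentions only q, the one symbol common to A and B

rows = list(product((False, True), repeat=3))

a_and_b_unsat = not any(A(*v) and B(*v) for v in rows)        # A and B = false
a_implies_itp = all(A_prime(*v) for v in rows if A(*v))        # A implies A'
itp_refutes_b = not any(A_prime(*v) and B(*v) for v in rows)   # A' and B = false
```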

Interpolants for sequences

• Let A1...An be a sequence of formulas

• A sequence A'0...A'n is an interpolant for A1...An when

– A'0 = True

– A'i-1 ∧ Ai ⇒ A'i, for i = 1..n

– A'n = False

– and finally, A'i ∈ L(A1...Ai) ∩ L(Ai+1...An)

        A1       A2       A3      ...     An
True  ⇒  A'1  ⇒  A'2  ⇒  A'3  ...  A'n-1  ⇒  False

In other words, the interpolant is a structured refutation of A1...An

Interpolants as Floyd-Hoare proofs

Program      SSA sequence      Interpolant      Hoare assertion
                               True             {True}
x = y;       x1 = y0
                               x1 = y0          {x = y}
y++;         y1 = y0 + 1
                               y1 > x1          {y > x}
[x == y]     x1 = y1
                               False            {False}

1. Each formula implies the next

2. Each is over common symbols of prefix and suffix

3. Begins with true, ends with false

Proving in-line programs:

SSA sequence → Prover → proof → Interpolation → Hoare proof
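The implication condition for this sequence can be tested by brute force over a small range of integers. This is a bounded check under assumed names (`A`, `ITP`, `R`), reading the SSA sequence as x1 = y0, y1 = y0 + 1, x1 = y1 and the interpolant as True, x1 = y0, y1 > x1, False.

```python
from itertools import product

R = range(-5, 6)                       # assumed finite test range
states = list(product(R, repeat=3))    # valuations of (x1, y0, y1)

# SSA sequence A1..A3
A = [
    lambda x1, y0, y1: x1 == y0,
    lambda x1, y0, y1: y1 == y0 + 1,
    lambda x1, y0, y1: x1 == y1,
]
# Candidate interpolant A'0..A'3
ITP = [
    lambda x1, y0, y1: True,
    lambda x1, y0, y1: x1 == y0,
    lambda x1, y0, y1: y1 > x1,
    lambda x1, y0, y1: False,
]

# Check A'(i-1) and A(i) implies A'(i) for each step.
steps_ok = [
    all(ITP[i + 1](*s) for s in states if ITP[i](*s) and A[i](*s))
    for i in range(len(A))
]
```

The last step holds vacuously: no valuation satisfies both y1 > x1 and x1 = y1, which is exactly why the final assertion is False.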

FOCI: An Interpolating Prover

• Proof-generating decision procedure for quantifier-free FOL

– Equality with uninterpreted function symbols

– Theory of arrays

– Linear rational arithmetic, integer difference bounds

• SAT Modulo Theories approach

– Boolean reasoning performed by SAT solver

– Exploits SAT relevance heuristics

• Quantifier-free interpolants from proofs

– Linear-time construction [TACAS 04]

– From Q-F interpolants, we can derive atomic predicates for Predicate Abstraction [Henzinger, et al, POPL 04]

• Allows counterexample-based refinement

– Integrated with software verification tools

• Berkeley BLAST, Cadence IMPACT

But won’t we diverge?

• Programs are infinite state, so convergence to a fixed point is not guaranteed.

• What would prevent us from computing an infinite sequence of interpolants, say, x=0, x=1, x=2,... as we unwind the loops further?

• Limited completeness result

– Stratify the logical language L into a hierarchy of finite languages

– Compute minimal interpolants in this hierarchy

– If an inductive invariant proving the property exists in L, you must eventually converge to one

Interpolation provides a means of static analysis in abstract domains of infinite height. Though we cannot compute a least fixed point, we can compute a fixed point implying a given property if one exists.

Experiments

| Driver | LOC* | Previous time | Time with FOCI | Predicates (total) | Predicates (average) | Pure interpolation |
|---|---|---|---|---|---|---|
| kbfiltr | 12k | 1m12s | 3m48s | 72 | 6.5 | 0.35s |
| floppy | 17k | 7m10s | 25m20s | 240 | 7.7 | 2.37s |
| diskperf | 14k | 5m36s | 13m32s | 140 | 10 | 1.51s |
| cdaudio | 18k | 20m18s | 23m51s | 256 | 7.8 | 4.09s |
| parport | 61k | DNF | 74m58s | 753 | 8.1 | 3.84s |
| parclass | 138k | DNF | 77m40s | 382 | 7.2 | 6.47s |

Drivers are from the Windows DDK. * Pre-processed. Results from POPL 04 and CAV 06.

Relevance heuristics

• Relevance heuristics are key to managing the precision/cost tradeoff

– In general, less information is better

– Effective relevance heuristics improve scaling behavior

– Based on principle of generalization from special cases

• Interpolation approach

– Yields Floyd-Hoare proofs for loop-free program fragments

– Provides an effective relevance heuristic

• if we can solve the divergence problem

– Exploits prover’s ability to focus on a small set of relevant facts

Expressiveness hierarchy

| Parameterized Abstract Domain | Interpolant Language |
|---|---|
| Canonical Heap Abstractions | ∀FO(TC) |
| Indexed Predicate Abstraction | ∀FO |
| Predicate Abstraction | QF |

Expressiveness increases from the bottom row to the top row.

Need for quantified interpolants

• Existing interpolating provers cannot produce quantified interpolants

• Problem: how to prevent the number of quantifiers from diverging in the same way that constants diverge when we unwind the loops?

for(i = 0; i < N; i++) a[i] = i;

for(j = 0; j < N; j++) assert(a[j] == j);

invariant: {∀ x. 0 ≤ x ∧ x < i ⇒ a[x] = x}

Need for Reachability

...
node *a = create_list();
while(a){
  assert(alloc(a));
  a = a->next;
}
...

invariant: ∀ x (rea(next,a,x) ∧ x ≠ nil → alloc(x))

• This condition is needed to prove memory safety (no use after free).

• Cannot be expressed in FO

– We need some predicate identifying a closed set of nodes that is allocated

• We require a theory of reachability (in effect, transitive closure)

Can we build an interpolating prover for full FOL that handles reachability, and avoids divergence?

Clausal provers

• A clausal refutation prover takes a set of clauses and returns a proof of unsatisfiability (i.e., a refutation) if possible.

• A prover is based on inference rules of this form:

P1 ... Pn
---------
C

• where P1 ... Pn are the premises and C the conclusion.

• A typical inference rule is resolution, of which this is an instance:

p(a)      p(U) → q(U)
---------------------
q(a)

• This was accomplished by unifying p(a) and p(U), then dropping the complementary literals.

Superposition calculus

Modern FOL provers are based on the superposition calculus

– example superposition inference:

Q(a)      P → (a = c)
---------------------
P → Q(c)

– this is just substitution of equals for equals

– in practice this approach generates a lot of substitutions!

– use reduction order to reduce number of inferences

Reduction orders

• A reduction order ≻ is:

– a total, well founded order on ground terms

– subterm property: f(a) ≻ a

– monotonicity: a ≻ b implies f(a) ≻ f(b)

• Example: Recursive Path Ordering (with Status) (RPOS)

– start with a precedence on symbols: a ≻ b ≻ c ≻ f

– induces a reduction ordering on ground terms:

f(f(a)) ≻ f(a) ≻ a ≻ f(b) ≻ b ≻ c ≻ f
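To make the induced ordering concrete, here is a minimal sketch of a lexicographic path ordering on ground terms. It is a close relative of RPOS, swapped in because it is short to state; it is not claimed to be the ordering any particular prover implements. Terms are assumed to be tuples of the form `(symbol, arg, ...)`, and `greater`, `RANK` and `PRECEDENCE` are hypothetical names.

```python
PRECEDENCE = ["a", "b", "c", "f"]  # a > b > c > f, as on the slide
RANK = {sym: len(PRECEDENCE) - i for i, sym in enumerate(PRECEDENCE)}

def greater(s, t):
    """Lexicographic path ordering on ground terms: is s strictly above t?"""
    if s == t:
        return False
    f, s_args = s[0], s[1:]
    g, t_args = t[0], t[1:]
    # Case 1: some argument of s is already equal to or above t.
    if any(si == t or greater(si, t) for si in s_args):
        return True
    # Otherwise s must dominate every argument of t.
    if not all(greater(s, tj) for tj in t_args):
        return False
    # Case 2: the head symbol of s has higher precedence.
    if RANK[f] > RANK[g]:
        return True
    # Case 3: same head symbol, compare arguments left to right.
    if f == g:
        for si, tj in zip(s_args, t_args):
            if si != tj:
                return greater(si, tj)
    return False

a, b, c = ("a",), ("b",), ("c",)
chain = [("f", ("f", a)), ("f", a), a, ("f", b), b, c]
chain_ok = all(greater(chain[i], chain[j])
               for i in range(len(chain)) for j in range(i + 1, len(chain)))
```

On these terms the sketch reproduces the chain from f(f(a)) down to c, including the perhaps surprising step a ≻ f(b), which follows because a outranks both f and b.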

Ordering Constraint

• Constrains rewrites to be “downward” in the reduction order:

Q(a)      P → (a = c)
---------------------
P → Q(c)

These terms must be maximal in their clauses

example: this inference only possible if a ≻ c

Thm: Superposition with OC is complete for refutation in FOL with equality.

So how do we get interpolants from these proofs?

Local Proofs

• A proof is local for a pair of clause sets (A,B) when every inference step uses only symbols from A or only symbols from B.

• From a local refutation of (A,B), we can derive an interpolant for (A,B) in linear time.

• This interpolant is a Boolean combination of formulas in the proof

Reduction orders and locality

• A reduction order is oriented for (A,B) when:

– s ≻ t for every s ∉ L(B) and t ∈ L(B)

• Intuition: rewriting eliminates first A variables, then B variables.

oriented: x ≻ y ≻ c ≻ d ≻ f

A: x = y, f(x) = c

B: f(y) = d, c ≠ d

x = y, f(x) = c ⊢ f(y) = c

f(y) = c, f(y) = d ⊢ c = d

c = d, c ≠ d ⊢ ⊥

Local!!

Orientation is not enough

A: Q(a), ¬Q(b)

B: a = c, b = c

order: Q ≻ a ≻ b ≻ c

• Local superposition gives only c = c.

• Solution: replace non-local superposition with two inferences:

Q(a)      a = c
---------------
Q(c)

is replaced by

Q(a)
------------
a = U → Q(U)

followed by

a = U → Q(U)      a = c
-----------------------
Q(c)

This “procrastination” step is an example of a reduction rule, and preserves completeness.

Second inference can be postponed until after resolving with ¬Q(b)

Completeness of local inference

• Thm: Local superposition with procrastination is complete for refutation of pairs (A,B) such that:

– (A,B) has a universally quantified interpolant

– The reduction order is oriented for (A,B)

• This gives us a complete method for generation of universally quantified interpolants for arbitrary first-order formulas!

• This is easily extensible to interpolants for sequences of formulas, hence we can use the method to generate Floyd/Hoare proofs for inline programs.

Avoiding Divergence

• As argued earlier, we still need to prevent interpolants from diverging as we unwind the program further.

• Idea: stratify the clause language

Example: Let Lk be the set of clauses with at most k variables and nesting depth at most k.

Note that each Lk is a finite language.

• Stratified saturation prover:

– Initially let k = 1

– Restrict prover to generate only clauses in Lk

– When prover saturates, increase k by one and continue

The stratified prover is complete, since every proof is contained in some Lk.

Completeness for universal invariants

• Lemma: For every safety program M with a ∀ safety invariant, and every stratified saturation prover P, there exists an integer k such that P refutes every unwinding of M in Lk, provided:

– The reduction ordering is oriented properly

• This means that as we unwind further, eventually all the interpolants are contained in Lk, for some k.

• Theorem: Under the above conditions, there is some unwinding of M for which the interpolants generated by P contain a safety invariant for M.

This means we have a complete procedure for finding universally quantified safety invariants whenever these exist!

In practice

• We have proved theoretical convergence. But does the procedure converge in practice in a reasonable time?

• Modify SPASS, an efficient superposition-based saturation prover:

– Generate oriented precedence orders

– Add procrastination rule to SPASS’s reduction rules

– Drop all non-local inferences

– Add stratification (SPASS already has something similar)

• Add axiomatizations of the necessary theories

– An advantage of a full FOL prover is we can add axioms!

– As argued earlier, we need a theory of arrays and reachability (TC)

Partially Axiomatizing FO(TC)

• Axioms of the theory of arrays (with select and update)

∀ (A,I,V) (select(update(A,I,V), I) = V)

∀ (A,I,J,V) (I ≠ J → select(update(A,I,V), J) = select(A,J))

• Axioms for reachability (rea)

∀ (L,E) rea(L,E,E)

∀ (L,E,X) (rea(L,select(L,E),X) → rea(L,E,X))

[ if e->link reaches x then e reaches x]

∀ (L,E,X) (rea(L,E,X) → E = X ∨ rea(L,select(L,E),X))

[ if e reaches x then e = x or e->link reaches x]

etc...

Since FO(TC) is incomplete, these axioms must be incomplete
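The three reachability axioms can be checked against a concrete heap. The sketch below assumes a link field modeled as a Python dict (so `select(L,E)` becomes `link[e]`), with a hypothetical `"nil"` node that points to itself; `rea` is an assumed direct implementation of reachability, not the prover's axiomatic one.

```python
def rea(link, e, x):
    """True if x is reachable from e by following `link` zero or more times."""
    seen = set()
    while e not in seen:
        if e == x:
            return True
        seen.add(e)
        e = link[e]
    return False

# Assumed example heap: a three-cell list ending in nil, plus a self-loop m.
link = {"n1": "n2", "n2": "n3", "n3": "nil", "nil": "nil", "m": "m"}
nodes = list(link)

# rea(L,E,E)
reflexive = all(rea(link, e, e) for e in nodes)
# rea(L,select(L,E),X) -> rea(L,E,X)
step = all(rea(link, e, x) for e in nodes for x in nodes if rea(link, link[e], x))
# rea(L,E,X) -> E = X or rea(L,select(L,E),X)
unfold = all(e == x or rea(link, link[e], x)
             for e in nodes for x in nodes if rea(link, e, x))
```

All three hold on this heap, as they do on any finite one. The converse direction is the problem: the axioms do not pin down reachability uniquely, which is the incompleteness the slide mentions.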

Simple example

for(i = 0; i < N; i++) a[i] = i;

for(j = 0; j < N; j++) assert(a[j] == j);

invariant: {∀ x. 0 ≤ x ∧ x < i ⇒ a[x] = x}
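A small runtime check of this invariant, as a sketch: the first loop asserts the quantified invariant at every loop head, and the second loop asserts the property itself. The function name `check` and the use of a Python list for the array are assumptions for illustration; this tests particular runs and proves nothing in general.

```python
def check(n):
    """Run both loops for array size n, asserting the invariant along the way."""
    a = [None] * n
    i = 0
    while i < n:
        # Invariant at loop head: for all x, 0 <= x < i implies a[x] == x.
        assert all(a[x] == x for x in range(i))
        a[i] = i
        i += 1
    # The invariant still holds on exit, now with i == n.
    assert all(a[x] == x for x in range(i))
    for j in range(n):
        assert a[j] == j
    return a

result = check(5)
```

The invariant needs a quantifier because the number of array cells it talks about grows with i, which is why a quantifier-free interpolant language cannot express it for unbounded N.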

Unwinding simple example

• Unwind the loops twice

Unwound program and its SSA constraints:

i = 0;                      i0 = 0
[i < N]; a[i] = i; i++;     i0 < N ∧ a1 = update(a0,i0,i0) ∧ i1 = i0 + 1
[i < N]; a[i] = i; i++;     i1 < N ∧ a2 = update(a1,i1,i1) ∧ i2 = i1 + 1
[i >= N]; j = 0;            i2 ≥ N ∧ j0 = 0
[j < N]; j++;               j0 < N ∧ j1 = j0 + 1
[j < N]; a[j] != j;         j1 < N ∧ select(a2,j1) ≠ j1

Interpolants, in order:

{i0 = 0}

invariant (first loop):

{0 ≤ U ∧ U < i1 ⇒ select(a1,U) = U}

{0 ≤ U ∧ U < i2 ⇒ select(a2,U) = U}

invariant (second loop):

{j0 ≤ U ∧ U < N ⇒ select(a2,U) = U}

{j1 ≤ U ∧ U < N ⇒ select(a2,U) = U}

note: stratification prevents constants diverging as 0, succ(0), succ(succ(0)), ...

List deletion example

a = create_list();
while(a){
  tmp = a->next;
  free(a);
  a = tmp;
}

• Invariant synthesized with 3 unwindings (after some simplification):

{rea(next,a,nil) ∧ ∀ x (rea(next,a,x) → x = nil ∨ alloc(x))}

• That is, a is acyclic, and every cell is allocated

• Note that interpolation can synthesize Boolean structure.

More small examples

| name | description | assertion | unwindings | bound | time (s) |
|---|---|---|---|---|---|
| array set | set all array elements to 0 | all elements zero | 3 | L1 | 0.01 |
| array test | set all array elements to 0 then test all elements | all tests OK | 3 | L1 | 0.01 |
| ll safe | create a linked list then traverse it | memory safety | 3 | L1 | 0.04 |
| ll acyc | create a linked list | list acyclic | 3 | L1 | 0.02 |
| ll delete | delete an acyclic list | memory safety | 2 | L1 | 0.01 |
| ll delmid | delete any element of acyclic list | result acyclic | 2 | L1 | 0.02 |
| ll rev | reverse an acyclic list | result acyclic | 3 | L1 | 0.02 |

This shows that divergence can be controlled. But can we scale to large programs?...

Conclusion

• Relevance heuristics are essential for scaling richer program analysis domains to large programs

• Relevance heuristics are based on a generalization principle:

– Relevant facts are those used in parsimonious proofs

– Facts relevant to special cases are likely to be useful in the general case

• Relevance heuristics for program analysis

– Special cases can be program paths or loop-free unwindings

– Interpolation can extract relevant facts from proofs of special cases

– Must avoid divergence

• Quantified invariants

– Needed for programs that manipulate arrays or heaps

– FO equality prover modified to produce local proofs (hence interpolants)

• Complete for universal invariants

– May be used as a relevance heuristic for shape analysis, IPA