Introduction to Abstract Interpretation
Neil Kettle, Andy King and Axel Simon
[email protected]
http://www.cs.kent.ac.uk/~amk
Acknowledgments: much of this material has been adapted from surveys by Patrick and Radhia Cousot
Applications of abstract interpretation
Verification: can a concurrent program deadlock? Is termination assured?
Parallelisation: are two or more tasks independent? What is the worst/best-case running time of a function?
Transformation: can a definition be unfolded? Will unfolding terminate?
Implementation: can an operation be specialised with knowledge of its (global) calling context?
Applications and “players” are incredibly diverse
House-keeping
Thursday
11.00-12.00  Methodology (Andy)
12.00-1.00   Lunch (Free)
1.00-2.00    Numerical and Widening (Andy)
2.00-3.30    SAT solving (Neil)
3.30-3.45    Break (Free)
4.00-5.00    Galois connections (Andy)
5.00-7.00    Shopping (Free)
7.00-9.00    Tasty Spice (Not free)
Friday
9.30-10.30   Two Variable Domain (Axel)
10.30-10.45  Break (Free)
10.45-12.00  SAT solving (Neil)
12.00-2.00   Xmas Party (Origins)
Computing Lab Xmas Party
Located in Origins – the “restaurant” in Darwin
A buffet lunch will be served – courtesy of the department
Department will supply some wine (which last year lasted 10 minutes)
Bar will be open afterwards if some wine is not enough wine
Send an e-mail to Deborah Sowrey [[email protected]] if you want to attend
Come along and meet other post-grads
Casting out nines algorithm
Which of the following multiplications is correct: 2173 × 38 = 81574 or 2173 × 38 = 82574?
Casting out nines is a checking technique that is really a form of abstract interpretation:
Sum the digits in the multiplicand n1, the multiplier n2 and the product n to obtain s1, s2 and s
Divide s1, s2 and s by 9 to compute the remainders, that is, r1 = s1 mod 9, r2 = s2 mod 9 and r = s mod 9
If (r1 × r2) mod 9 ≠ r then the multiplication is incorrect
The algorithm returns “incorrect” or “don’t know”
Running the numbers for 2173 × 38 = 81574
Compute r1 = (2+1+7+3) mod 9 = …
Compute r2 = (3+8) mod 9 = …
Calculate (r1 × r2) mod 9 = …
Calculate r = (8+1+5+7+4) mod 9 = …
Check ((r1 × r2) mod 9 = r) = …
Deduce that 2173 × 38 = 81574 is …
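The check above can be sketched in a few lines of Python (the function names are illustrative, not from the slides):

```python
def digit_sum_mod9(n):
    """Sum the digits of n, then take the remainder modulo 9."""
    return sum(int(d) for d in str(n)) % 9

def check_product(n1, n2, n):
    """Casting-out-nines check: returns "incorrect" or "don't know"."""
    r1, r2, r = digit_sum_mod9(n1), digit_sum_mod9(n2), digit_sum_mod9(n)
    return "incorrect" if (r1 * r2) % 9 != r else "don't know"

print(check_product(2173, 38, 81574))  # incorrect
print(check_product(2173, 38, 82574))  # don't know
```

Note that the check can never answer "correct": a wrong product that happens to have the right remainder slips through.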
Abstract interpretation is a theory of relationships
The computational domain for multiplication (concrete domain): N – the set of non-negative integers
The computational domain of remainders used in the checking algorithm (abstract domain): R = {0, 1, …, 8}
The key question is: what is the relationship between an element n ∈ N which is used in the real algorithm and its analogue r ∈ R in the check?
What is the relationship?
When the multiplicand is n1 = 456, say, then the check uses r1 = (4+5+6) mod 9 = 6
Observe that 456 mod 9 = (4*100 + 56) mod 9 = (4*90 + 4*10 + 56) mod 9 = (4*10 + 56) mod 9 = ((4+5)*10 + 6) mod 9 = ((4+5)*9 + (4+5) + 6) mod 9 = (4+5+6) mod 9
More generally, induction can show r1= n1 mod 9 and r2 = n2 mod 9
Correctness is the preservation of relationships
The check simulates the concrete multiplication and, in effect, is an abstract multiplication
Concrete multiplication is n = n1 × n2
Abstract multiplication is r = (r1 × r2) mod 9, where r1 describes n1 and r2 describes n2
For brevity, write r ∝ n iff r = n mod 9
Then abstract multiplication preserves ∝ iff whenever r1 ∝ n1 and r2 ∝ n2 it follows that r ∝ n
Correctness argument
Suppose r1 ∝ n1 and r2 ∝ n2. If
n = n1 × n2 then n mod 9 = (n1 × n2) mod 9, hence n mod 9 = ((n1 mod 9) × (n2 mod 9)) mod 9,
whence n mod 9 = (r1 × r2) mod 9 = r, therefore r ∝ n
Consequently if ¬(r ∝ n) then n ≠ n1 × n2
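The preservation property can be spot-checked exhaustively on a small range:

```python
# r describes n iff r = n mod 9; check that abstract multiplication
# (r1 * r2) mod 9 preserves the relationship on a sampled range.
for n1 in range(200):
    for n2 in range(200):
        r1, r2 = n1 % 9, n2 % 9
        assert (n1 * n2) % 9 == (r1 * r2) % 9
print("preservation holds for all n1, n2 < 200")
```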
Summary
Formalise the relationship between the data
Check that the relationship is preserved by the abstract analogues of the concrete operations
The relational framework [Acta Informatica, 30(2):103-129, 1993] not only emphasises the theory of relations but is very general
Numeric approximation and widening
Abstract interpretation does not require a domain to be finite
Interval approximation
Consider the following Pascal-like program:
begin
  i := 0;              {1: i ∈ [0,0]}
  while (i < 16) do    {2: i ∈ [0,15]}
    i := i + 1         {3: i ∈ [1,16]}
end                    {4: i ∈ [16,16]}
SYNTOX [PLDI'90] inferred the invariants scoped within {…}
Invariants occur between consecutive lines in the program
i ∈ [0,15] asserts 0 ≤ i ≤ 15 whereas i ∈ [0,0] means i = 0
Compilation versus (classic) interpretation
Abstract compilation – compile the concrete program into an abstract program (equation system) and execute the abstract program: good separation of concerns that aids debugging; the particulars of the domain can be exploited to reorder operations, specialise operations, etc.
Abstract interpretation – run the concrete program but on-the-fly interpret its concrete operations as abstract operations: ideal for a generic framework (toolkit) which is parameterised by abstract domain plugins
Abstract domain that is used in interval analysis
The domain of intervals includes:
[l,u] where l ≤ u and l,u ∈ Z for bounded sets, e.g. [0,5] describes {0,1,4} since {0,1,4} ⊆ [0,5]
⊥ to represent the empty set of numbers
[l,∞] for sets which are bounded below such as {l, l+2, l+4, …}
[-∞,u] for sets which are bounded above such as {…, u-5, u-3, u}
Weakening intervals
Join (path merge) is defined: put d1 ⊔ d2 = d2 if d1 = ⊥, d1 if d2 = ⊥, and [min(l1,l2), max(u1,u2)] otherwise, whenever d1 = [l1,u1] and d2 = [l2,u2]
if … then
  …        {1: i ∈ [0,2]}
else
  …        {2: i ∈ [3,5]}
endif      {3: i ∈ [0,5]}
Strengthening intervals
Meet is defined: put d1 ⊓ d2 = ⊥ if d1 = ⊥ or d2 = ⊥ or max(l1,l2) > min(u1,u2), and [max(l1,l2), min(u1,u2)] otherwise, whenever d1 = [l1,u1] and d2 = [l2,u2]
{3: i ∈ [0,5]}
if (2 < i) then
  {4: i ∈ [3,5]} …
else
  {5: i ∈ [0,2]} …
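The two operators can be sketched in Python; encoding ⊥ as None and unbounded ends as float infinities is a choice made here, not in the slides:

```python
BOT = None  # stands for the empty interval ⊥

def join(d1, d2):
    """Least upper bound (path merge) of two intervals."""
    if d1 is BOT: return d2
    if d2 is BOT: return d1
    (l1, u1), (l2, u2) = d1, d2
    return (min(l1, l2), max(u1, u2))

def meet(d1, d2):
    """Greatest lower bound (intersection) of two intervals."""
    if d1 is BOT or d2 is BOT: return BOT
    (l1, u1), (l2, u2) = d1, d2
    l, u = max(l1, l2), min(u1, u2)
    return (l, u) if l <= u else BOT

print(join((0, 2), (3, 5)))  # (0, 5) -- the if/else merge above
print(meet((0, 5), (3, 5)))  # (3, 5) -- the then-branch refinement
```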
Meet and join are the basic primitives for compilation
I1 = [0,0] since program point (1) immediately follows i := 0
I2 = (I1 ⊔ I3) ⊓ [-∞,15] since: control from program points (1) and (3) flows into (2); point (2) is reached only if i < 16 holds
I3 = {n+1 | n ∈ I2} since (3) is only reachable from (2) via the increment
I4 = (I1 ⊔ I3) ⊓ [16,∞] since: control from (1) and (3) flows into (4); point (4) is reached only if ¬(i < 16) holds
Interval iteration
Early iterates:
I1: [0,0]  [0,0]  [0,0]  [0,0]  [0,0]  [0,0]  [0,0]
I2:   -    [0,0]  [0,0]  [0,1]  [0,1]  [0,2]  [0,2]
I3:   -      -    [1,1]  [1,1]  [1,2]  [1,2]  [1,3]
I4:   -      -      -      -      -      -      -
Later iterates:
I1: … [0,0]   [0,0]   [0,0]   [0,0]
I2: … [0,15]  [0,15]  [0,15]  [0,15]
I3: … [1,15]  [1,16]  [1,16]  [1,16]
I4: …   -       -    [16,16] [16,16]
Jacobi versus Gauss-Seidel iteration
I1 [0,0] [0,0] [0,0] … [0,0] [0,0] [0,0]
I2 [0,0] [0,1] [0,2] … [0,14] [0,15] [0,15]
I3 [1,1] [1,2] [1,3] … [1,15] [1,16] [1,16]
I4 … [16,16] [16,16]
With Jacobi, the new vector I1’,I2’,I3’,I4’ of intervals is calculated from the old I1,I2,I3,I4
With Gauss-Seidel iteration:
I1' is calculated from I1,I2,I3,I4
I2' is calculated from I1',I2,I3,I4
I3' is calculated from I1',I2',I3,I4
I4' is calculated from I1',I2',I3',I4
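The equation system can be run to its fixpoint with Jacobi-style iteration; a minimal Python sketch (⊥ encoded as None, unbounded ends as float infinities, choices made here):

```python
NEG_INF, INF = float('-inf'), float('inf')
BOT = None  # the empty interval

def join(d1, d2):
    if d1 is BOT: return d2
    if d2 is BOT: return d1
    return (min(d1[0], d2[0]), max(d1[1], d2[1]))

def meet(d1, d2):
    if d1 is BOT or d2 is BOT: return BOT
    l, u = max(d1[0], d2[0]), min(d1[1], d2[1])
    return (l, u) if l <= u else BOT

def incr(d):
    """{n+1 | n in d}"""
    return BOT if d is BOT else (d[0] + 1, d[1] + 1)

def step(I1, I2, I3, I4):
    # The four equations compiled from the program.
    return ((0, 0),
            meet(join(I1, I3), (NEG_INF, 15)),
            incr(I2),
            meet(join(I1, I3), (16, INF)))

# Jacobi iteration: recompute the whole vector from the old one.
state = (BOT, BOT, BOT, BOT)
while True:
    new = step(*state)
    if new == state:
        break
    state = new
print(state)  # ((0, 0), (0, 15), (1, 16), (16, 16))
```

The loop terminates here because the iterates climb towards a fixpoint that the bound 15 keeps finite; the decrement variant below shows what happens without such a bound.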
Gauss-Seidel versus chaotic iteration
Observe that I4 might change if either I1 or I3 change, hence evaluate I4 after I1 and I3 stabilise
This suggests waiting until stability is achieved at one level before starting on the next
[Dependency graph: I1 feeds I2 and I4; I2 and I3 feed each other; I3 feeds I4. The strongly connected components are evaluated in the order {I1}, then {I2, I3}, then {I4}]
Gauss-Seidel versus chaotic iteration
Chaotic iteration can postpone evaluating Ii for a bounded number of iterations: I1' is calculated from I1,-,-,-; I2' and I3' are calculated Gauss-Seidel style from I1',I2,I3,-; I4' is calculated from I1',I2',I3',I4
Fast and (incremental) fixpoint solvers [TOPLAS 22(2):187-223,2000] apply chaotic iteration
I1: [0,0]  [0,0]  [0,0]  … [0,0]   [0,0]   [0,0]
I2:   -    [0,0]  [0,1]  … [0,15]  [0,15]  [0,15]
I3:   -    [1,1]  [1,2]  … [1,16]  [1,16]  [1,16]
I4:   -      -      -    …   -       -    [16,16]
Research challenge
Compiling to equations and iteration is well-understood (albeit not well-known)
The implicit assumption is that source is available
With the advent of component and multi-linguistic programming, the problem is how to generate the equations from: a specification of the algorithm or the API; the types of the algorithm or component
In the interim, environments with support for modularity either equip the programmer with an equation language, or make worst-case assumptions about behaviour
Suppose i was decremented rather than incremented
begin
  i := 0;              {1: i ∈ [0,0]}
  while (i < 16) do    {2: i ∈ [-∞,0]}
    i := i - 1         {3: i ∈ [-∞,-1]}
end                    {4: i ∈ ⊥}
I1 = [0,0]
I2 = (I1 ⊔ I3) ⊓ [-∞,15]
I3 = {n-1 | n ∈ I2}
I4 = (I1 ⊔ I3) ⊓ [16,∞]
I1: [0,0]  [0,0]  [0,0]   [0,0]   [0,0]   [0,0]
I2:   -      -    [0,0]  [-1,0]  [-2,0]    …
I3:   -      -   [-1,0]  [-2,0]  [-3,0]    …
I4:   -      -      -      -       -       -
Ascending chain condition
A domain D is ACC iff it does not contain an infinite strictly increasing chain d1 < d2 < d3 < … where d < d' iff d ⊑ d' and d ≠ d' (see below)
The interval domain D is ordered by: ⊥ ⊑ d for all d ∈ D and [l1,u1] ⊑ [l2,u2] iff l2 ≤ l1 ≤ u1 ≤ u2
and it is not ACC since [0,0] < [-1,0] < [-2,0] < …
[Figure: the interval domain over …, -4, -3, -2, -1, 0, 1, 2, 3, 4, … drawn as a lattice with ⊤ at the top]
Some very expressive relational domains are ACC
Common sub-expression elimination relies on detecting duplicated expression evaluation
Karr [Acta Informatica, 6, 133-151] noticed that detecting an invariance such as y = x/2 – 7 was key to this optimisation
begin
  x := sin(a) * 2;
  y := sin(a) - 7;
end
The affine domain
The domain of affine equations over n variables is:
D = {⟨A,B⟩ | A is an m×n matrix and B is an m-dimensional column vector}
D is ordered by: ⟨A1,B1⟩ ⊑ ⟨A2,B2⟩ iff (if A1x = B1 then A2x = B2)
Pre-orders versus posets
A pre-order ⟨D,⊑⟩ is a set D ordered by a binary relation ⊑ such that:
d ⊑ d for all d ∈ D
if d1 ⊑ d2 and d2 ⊑ d3 then d1 ⊑ d3
A poset is a pre-order ⟨D,⊑⟩ such that additionally:
if d1 ⊑ d2 and d2 ⊑ d1 then d1 = d2
The affine domain is a pre-order (so it is not ACC)
Observe ⟨A1,B1⟩ ⊑ ⟨A2,B2⟩ and ⟨A2,B2⟩ ⊑ ⟨A1,B1⟩, yet ⟨A1,B1⟩ ≠ ⟨A2,B2⟩, where A1, B1, A2, B2 are given below: both systems encode x=1, y=0, z=0, so antisymmetry fails
To build a poset from a pre-order:
define d ≡ d' iff d ⊑ d' and d' ⊑ d
define [d] = {d' ∈ D | d ≡ d'} and D* = {[d] | d ∈ D}
define [d] ⊑* [d'] iff d ⊑ d'
The poset ⟨D*,⊑*⟩ is ACC since chain length is bounded by the number of variables n
A1 = (1 0 0)    B1 = (1)    A2 = (2 0 0)    B2 = (2)
     (0 1 0)         (0)         (0 1 0)         (0)
     (0 0 1)         (0)         (0 0 1)         (0)
Inducing termination for non-ACC (and huge ACC) domains
Enforce convergence for intervals with a widening operator ∇ : D×D → D
⊥ ∇ d = d
d ∇ ⊥ = d
[l1,u1] ∇ [l2,u2] = [if l2 < l1 then -∞ else l1, if u1 < u2 then ∞ else u1]
Examples:
[1,2] ∇ [1,2] = [1,2]
[1,2] ∇ [1,3] = [1,∞] but [1,3] ∇ [1,2] = [1,3]
Safe since [li,ui] ⊑ ([l1,u1] ∇ [l2,u2]) for i ∈ {1,2}
Chaotic iteration with widening
To terminate it is necessary to traverse each loop a finite number of times
It is sufficient to pass through I2 or I3 a finite number of times [Bourdoncle, 1990]
Thus widen at I3 since it is simpler
[Dependency graph as before: I1 feeds I2 and I4; I2 and I3 feed each other; I3 feeds I4]
Termination for the decrement:
I1 = [0,0]
I2 = (I1 ⊔ I3) ⊓ [-∞,15]
I3 = I3 ∇ {n-1 | n ∈ I2}   (note the fix)
I4 = (I1 ⊔ I3) ⊓ [16,∞]
When I2 = [-1,0] and I3 = [-1,0], then
I3 ∇ {n-1 | n ∈ I2} = [-1,0] ∇ [-2,-1] = [-∞,0]
I1: [0,0]  [0,0]  [0,0]   [0,0]    [0,0]   [0,0]   [0,0]
I2:   -      -    [0,0]  [-1,0]   [-∞,0]  [-∞,0]  [-∞,0]
I3:   -      -   [-1,0]  [-∞,0]*  [-∞,0]  [-∞,0]  [-∞,0]
I4:   -      -      -       -        -       -       -
(* the widening fires here)
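The widening operator itself is a one-liner; a Python sketch (None encodes ⊥, a choice made here):

```python
NEG_INF, INF = float('-inf'), float('inf')

def widen(d1, d2):
    """[l1,u1] nabla [l2,u2]: any bound that grew jumps straight to infinity."""
    if d1 is None: return d2
    if d2 is None: return d1
    (l1, u1), (l2, u2) = d1, d2
    return (NEG_INF if l2 < l1 else l1, INF if u1 < u2 else u1)

print(widen((1, 2), (1, 2)))     # (1, 2)
print(widen((1, 2), (1, 3)))     # (1, inf)
print(widen((1, 3), (1, 2)))     # (1, 3)
print(widen((-1, 0), (-2, -1)))  # (-inf, 0) -- the decrement loop stabilises
```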
Widening dynamic data-structures
[Figure: term graphs built from cons, or, nil and numeric leaves, approximating the lists constructed by the loop at increasing depths]
begin
  i := 0; p := nil;
  while (i < 16) do
    i := i + 1;
    p := new cons(i, p)   {1: p ∈ cons(i, … cons(0, nil))}
end
Depth-2 versus type-graph widening
[Figure: a depth-2 term graph versus a type graph for the same lists; the type graph truncates deep subterms to any, giving a more compact description]
Type-graph widening is more compact
Type-graph widening becomes difficult when a list contains lists as its elements
In constraint-based analysis, widening is dispensed with altogether
(Malicious) research challenge
Read a survey paper to find an abstract domain that is ACC but has a maximal chain length of O(2^n)
Construct a program with O(n) symbols that iterates through all O(2^n) abstractions
Publish the program in IPL
Not all numeric domains are convex
A set S ⊆ R^n is convex iff for all x,y ∈ S it follows that {λx + (1-λ)y | 0 ≤ λ ≤ 1} ⊆ S
The 2 leftmost sets in R2 are convex but the 2 rightmost sets are not.
Are intervals or affine equations convex?
Suppose the values of n variables are represented by n intervals [l1,u1],…,[ln,un]
Suppose x = ⟨x1,…,xn⟩, y = ⟨y1,…,yn⟩ ∈ R^n are described by the intervals
Then each li ≤ xi ≤ ui and each li ≤ yi ≤ ui
Let 0 ≤ λ ≤ 1 and observe z = λx + (1-λ)y = ⟨λx1 + (1-λ)y1, …, λxn + (1-λ)yn⟩
Therefore li ≤ min(xi, yi) ≤ λxi + (1-λ)yi ≤ max(xi, yi) ≤ ui and convexity follows
Arithmetic congruences are not convex
Elements of the arithmetic congruence (AC) domain take the form x - 2y = 1 (mod 3), which describes integral values of x and y
More exactly, the AC domain consists of conjunctions of equations of the form c1x1 + … + cmxm = c (mod n) where ci,c ∈ Z and n ∈ N
Incredibly, AC is ACC [IJCM, 30, 165-190, 1989]
[Plot: the integral solutions of x - 2y = 1 (mod 3) form isolated points in the plane, hence the set is not convex]
Research challenge
Søndergaard [FSTTCS, 95] introduced the concept of an immediate fixpoint
Consider the following (groundness) dependency equations over the domain of Boolean functions Bool:
f1 = x ∧ (y ∨ z)
f2 = ∃t.(x(z(u (t∨x) v (t∨z) f4)))
f3 = ∃u.(∃v.(x u z v f2))
f4 = f1 ∨ f3
where ∃x(f) = f[x ↦ true] ∨ f[x ↦ false]; thus ∃x(x∨y) = true and ∃x(x∧y) = y
The alternative tactic
The standard tactic is to apply iteration:
f1: false  x∧(y∨z)  x∧(y∨z)  x∧(y∨z)  …  x∧(y∨z)
f2: false  false    false    v∧(y∨u)  …  (u∨y)∧v
f3: false  false    false    false    …  (x∧y)∨z
f4: false  false    x∧(y∨z)  x∧(y∨z)  …  (x∧y)∨z
Søndergaard found that the system can be solved symbolically (like a quadratic)
This would be very useful for infinite domains for improved precision and predictability
Combining analyses
Verifiers and optimisers are often multi-pass, built from several separate analyses
Should the analyses be performed in parallel or in sequence?
Analyses can interact to improve one another (problem is in the complexity of the interaction [Pratt])
Pruning combined domains
Suppose that ∝1 ⊆ D1×C and ∝2 ⊆ D2×C; then how is D = D1×D2 interpreted?
Then ⟨d1,d2⟩ ∝ c iff d1 ∝1 c and d2 ∝2 c
Ideally, many ⟨d1,d2⟩ ∈ D will be redundant, that is, ¬∃c ∈ C . d1 ∝1 c ∧ d2 ∝2 c
Time versus precision (from TOPLAS 17(1):28-44, 1993)

            Time                         Precision
            Share   ASub  Share×ASub    Share  ASub  Share×ASub
serialise    9290    839     1870        235    35      35
init-subst    569   1250      829          5    72       5
map-color    4600   1040     5760         76    74      73
grammar       170    140      269         11    11      11
browse      51860   1609    49580        196   104     104
bid          1129   1000     1429         11     0       0
deriv        2819   2630     3550          0     0       0
rdtok        5670   4450     6389        185    48      48
read         8790   8380    11069         11     1       1
boyer       11040   3949     7709        242    93      93
peephole    20760   7990    23029        386   310     310
ann         93509  16789    53269       1935  1690    1690
The Galois framework
Abstract interpretation is often presented in terms of Galois connections
Lattices – a prelude to Galois connections
Suppose ⟨S,⊑⟩ is a poset
A mapping ⊔ : S×S → S is a join (least upper bound) iff:
a⊔b is an upper bound of a and b, that is, a ⊑ a⊔b and b ⊑ a⊔b for all a,b ∈ S
a⊔b is the least upper bound, that is, if c ∈ S is an upper bound of a and b, then a⊔b ⊑ c
The definition of the meet ⊓ : S×S → S (the greatest lower bound) is analogous
Complete lattices
A lattice ⟨S,⊑,⊔,⊓⟩ is a poset ⟨S,⊑⟩ equipped with a join ⊔ and a meet ⊓
The join can often be lifted to sets by defining ⊔ : ℘(S) → S where, for all T ⊆ S: t ⊑ ⊔(T) for all t ∈ T, and if t ⊑ s for all t ∈ T then ⊔(T) ⊑ s
If the meet can be lifted analogously, then the lattice is complete
A lattice that contains a finite number of elements is always complete
A lattice that is not complete
A hyperplane in 2-d space is a line and in 3-d space is a plane
A hyperplane in R^n is any space that can be defined by {x ∈ R^n | c1x1+…+cnxn = c} where c1,…,cn,c ∈ R
A halfspace in R^n is any space that can be defined by {x ∈ R^n | c1x1+…+cnxn ≤ c}
A polyhedron is the intersection of a finite number of half-spaces
Examples and non-examples in planar space
Join for polyhedra
The join of polyhedra P1 and P2 in R^n coincides with the topological closure of the convex hull of P1 ∪ P2
The “join” of an infinite set of polyhedra
Consider the following infinite chain of regular polyhedra:
The only space that contains all these polyhedra is a circle yet this is not polyhedral
⟨A, α, γ, C⟩ is a Galois connection whenever:
⟨A,⊑A⟩ and ⟨C,⊑C⟩ are complete lattices
The mappings α : C → A and γ : A → C are monotonic, that is:
if c1 ⊑C c2 then α(c1) ⊑A α(c2)
if a1 ⊑A a2 then γ(a1) ⊑C γ(a2)
The compositions γ∘α : C → C and α∘γ : A → A are extensive and reductive respectively, that is:
c ⊑C (γ∘α)(c) for all c ∈ C
(α∘γ)(a) ⊑A a for all a ∈ A
A classic Galois connection example
The concrete domain ⟨C,⊑C,⊔C,⊓C⟩ is ⟨℘(Z),⊆,∪,∩⟩
The abstract domain is ⟨A,⊑A,⊔A,⊓A⟩ where:
A = {⊥,+,-,⊤}
⊥ ⊑A a ⊑A ⊤ for all a ∈ A
The join ⊔A and meet ⊓A are defined by:

⊔A | ⊥  +  -  ⊤        ⊓A | ⊥  +  -  ⊤
⊥  | ⊥  +  -  ⊤        ⊥  | ⊥  ⊥  ⊥  ⊥
+  | +  +  ⊤  ⊤        +  | ⊥  +  ⊥  +
-  | -  ⊤  -  ⊤        -  | ⊥  ⊥  -  -
⊤  | ⊤  ⊤  ⊤  ⊤        ⊤  | ⊥  +  -  ⊤
The relationship between A and C
The concretisation mapping γ : A → C is defined:
γ(⊥) = Ø
γ(+) = {n ∈ Z | n > 0}
γ(-) = {n ∈ Z | n < 0}
γ(⊤) = Z
The abstraction mapping α : C → A is defined:
α(S) = ⊥ if S = Ø
α(S) = + else if n > 0 for all n ∈ S
α(S) = - else if n < 0 for all n ∈ S
α(S) = ⊤ otherwise
Avoiding repetition
Can define α with γ and vice versa: α(S) = ⊓A{a ∈ A | S ⊆ γ(a)}
And dually γ(a) = ∪{S ⊆ Z | α(S) ⊑A a}
As an example consider α({1,2}):
{1,2} ⊆ γ(⊤) and {1,2} ⊆ γ(+), but {1,2} ⊄ γ(-) and {1,2} ⊄ γ(⊥)
Therefore α({1,2}) = ⊓A{+, ⊤} = +
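The sign abstraction can be sketched in Python; since γ(⊤) = Z is infinite, membership is tested rather than the set materialised (the encoding is a choice made here):

```python
# BOT/POS/NEG/TOP model the four abstract values.
BOT, POS, NEG, TOP = '⊥', '+', '-', '⊤'

def gamma_contains(a, n):
    """Is the integer n a member of gamma(a)?"""
    return {BOT: False, POS: n > 0, NEG: n < 0, TOP: True}[a]

def alpha(S):
    """Best abstraction of a finite set of integers."""
    if not S: return BOT
    if all(n > 0 for n in S): return POS
    if all(n < 0 for n in S): return NEG
    return TOP

print(alpha({1, 2}))   # +
print(alpha({-1, 2}))  # ⊤
print(alpha(set()))    # ⊥
```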
Collecting domains and semantics
Observe that C is not that concrete – programs include operations such as * : Z×Z → Z
C = ℘(Z) is the collecting domain, which is easier to abstract than Z since it is already a lattice
To abstract * : Z×Z → Z, say, we synthesise a collecting version *C : ℘(Z)×℘(Z) → ℘(Z) and then abstract that
Put S1 *C S2 = {n1*n2 | n1 ∈ S1 and n2 ∈ S2}
Safety and optimality requirements
Safety requires α(γ(a1) *C γ(a2)) ⊑A a1 *A a2 for all a1,a2 ∈ A
Optimality [POPL, 269-282, 1979] also requires a1 *A a2 ⊑A α(γ(a1) *C γ(a2))
Arguing optimality is harder than safety since rare-case approximation can simplify a tricky argument [JLP]
Abstract multiplication
*A | +  -  ⊤
+  | +  -  ⊤
-  | -  +  ⊤
⊤  | ⊤  ⊤  ⊤
(and ⊥ *A a = a *A ⊥ = ⊥, since γ(⊥) = Ø)
Consider safety for α(γ(+) *C γ(+)) ⊑A + *A +:
Recall γ(+) = {n ∈ Z | n > 0}
Thus γ(+) *C γ(+) = {n1*n2 | n1 > 0 and n2 > 0}
Hence α(γ(+) *C γ(+)) = + ⊑A + = + *A +
Need + *A + ⊑A α(γ(+) *C γ(+)) for optimality:
Recall α(γ(+) *C γ(+)) ⊑A + *A + = +
Hence α(γ(+) *C γ(+)) ∈ {⊥, +}
But γ(+) ≠ Ø, thus γ(+) *C γ(+) ≠ Ø
Therefore α(γ(+) *C γ(+)) = + and optimality follows
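Safety of the whole table can be spot-checked by sampling γ on a finite window of Z; this is a sketch of the safety condition, not a proof (names and the window size are choices made here):

```python
BOT, POS, NEG, TOP = '⊥', '+', '-', '⊤'

def leq(a, b):                      # the order on the four-point lattice
    return a == b or a == BOT or b == TOP

def gamma(a, window=range(-5, 6)):  # gamma(a) sampled on a finite window
    return {n for n in window
            if {BOT: False, POS: n > 0, NEG: n < 0, TOP: True}[a]}

def alpha(S):                       # best abstraction of a finite set
    if not S: return BOT
    if all(n > 0 for n in S): return POS
    if all(n < 0 for n in S): return NEG
    return TOP

def amul(a, b):                     # the abstract multiplication table
    if BOT in (a, b): return BOT
    if TOP in (a, b): return TOP
    return {(POS, POS): POS, (POS, NEG): NEG,
            (NEG, POS): NEG, (NEG, NEG): POS}[(a, b)]

for a in (BOT, POS, NEG, TOP):
    for b in (BOT, POS, NEG, TOP):
        collected = {n1 * n2 for n1 in gamma(a) for n2 in gamma(b)}
        assert leq(alpha(collected), amul(a, b))  # the safety condition
print("safety holds on the sampled window")
```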
Exotic applications of abstract interpretation
Recovering programmer intentions for understanding undocumented or third-party code
Verifying that a buffer-overrun cannot occur, or pin-pointing where one might occur in a C program
Inferring the environment in which a system of synchronising agents will not deadlock
Lower-bound time-complexity analysis for granularity throttling
Binding-time analysis for inferring off-line unfolding decisions which avoid code-bloat
Pointers to the literature
SAS, POPL, ESOP, ICLP, ICFP, …
Useful review articles and books:
Patrick and Radhia Cousot, Comparing the Galois connection and Widening/Narrowing approaches to Abstract Interpretation, PLILP, LNCS 631, 269-295, 1992. Available from LIX library.
Patrick and Radhia Cousot, Abstract interpretation and Application to Logic Programs, JLP, 13(2-3):103-179, 1992
Flemming Nielson, Hanne Riis Nielson and Chris Hankin, Principles of Program Analysis, Springer, 1999.
Patrick has a database of abstract interpretation researchers and regularly writes tutorials, see, CC’02.
Appendix: SAT solving
SAT is not a form of abstract interpretation, but abstraction and abstract interpretation are often used to reduce a verification problem to a satisfiability checking problem
Acknowledgments: much of this material is adapted from the review article, “The Quest for Efficient Boolean Satisfiability Solvers” by Zhang and Malik, 2002.
The SAT problem
Given an arbitrary propositional formula, f say, does there exist a variable assignment (a model) under which f evaluates to true?
One model for f = (x ∨ y) is θ = {x ↦ true, y ↦ true}
SAT is the stereotypical NP-complete problem, but this does not preclude the existence of efficient SAT algorithms for certain SAT instances
Stålmarck [US Patent N527689,1995] and applications in AI planning, software verification, circuit testing have promoted a resurgence of interest in SAT
The other type of completeness
A SAT algorithm is said to be complete iff (given enough resource) it will either: compute a satisfying variable assignment or verify that no such assignment exists
A SAT algorithm is incomplete (stochastic) iff unsatisfiability cannot always be detected
Trade incompleteness for speed when a solution is very likely to exist (planning applications).
In program verification (partial) correctness often follows by proving unsatisfiability
The Davis-Logemann-Loveland (DPLL) approach
1st generation solvers such as POSIT, 2cl, CSAT, etc. are based on DPLL, as are the 2nd generation solvers such as SATO and zChaff, which tune DPLL
Davis and Putnam [JACM, 7:201-215, 1960] proposed resolution for Boolean SAT; DLL [CACM, 5:394-397, 1962] replaced resolution with search to improve memory usage (special case)
CNF used to simplify unsatisfiability checking; conversion is polynomial [JSC,2,293—304, 1986]
CNF is a conjunction of clauses, for example, (x ↔ y) = (x → y) ∧ (y → x) = (¬x ∨ y) ∧ (x ∨ ¬y)
The Davis-Logemann-Loveland (DPLL) algorithm
bool function DPLL(f, θ)
begin
  ⟨fail, θ'⟩ := unit(f, θ);
  if (fail) return false;
  if (satisfied(f, θ')) return true;
  else if (unsatisfied(f, θ')) return false;
  else begin
    let x ∈ var(f) - var(θ');
    if (DPLL(f, θ' ∪ {x ↦ true})) return true;
    else return DPLL(f, θ' ∪ {x ↦ false});
  end
end
unit applies unit propagation, possibly detecting unsatisfiability
satisfied returns true if one literal in each clause is true
unsatisfied returns true if there exists one clause with every literal false
the non-determinacy is in the choice of variable x
the recursion stack records the backtracking search
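The algorithm can be rendered as a compact Python sketch; the clause encoding (signed integers, DIMACS style) and helper names are choices made here, and real solvers replace the naive scan with the counter schemes described next:

```python
def unit_propagate(clauses, assignment):
    """Repeatedly apply the unit clause rule; None signals a conflict."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                val = assignment.get(abs(lit))
                if val is None:
                    unassigned.append(lit)
                elif (lit > 0) == val:
                    satisfied = True
            if satisfied:
                continue
            if not unassigned:
                return None            # every literal false: conflict
            if len(unassigned) == 1:   # unit clause: force the literal
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment

def dpll(clauses, assignment=None):
    """Return a satisfying assignment as {var: bool}, or None."""
    assignment = unit_propagate(clauses, assignment or {})
    if assignment is None:
        return None
    variables = {abs(l) for c in clauses for l in c}
    free = variables - set(assignment)
    if not free:
        return assignment
    x = min(free)                      # the non-deterministic choice point
    for guess in (True, False):
        result = dpll(clauses, {**assignment, x: guess})
        if result is not None:
            return result
    return None

print(dpll([[1, 2], [-1, 2], [-2, 1]]))  # e.g. {1: True, 2: True}
print(dpll([[1], [-1]]))                 # None (unsatisfiable)
```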
Unit propagation
Unit clause rule: if all the literals but one are false, then the remaining literal is set to true
Many SAT solvers use a counter scheme [Crawford, AAAI, 1993] with one counter per clause tracking the number of false literals in that clause:
if a count reaches the total number of literals, then unsatisfiability has been detected
otherwise, if it is one less, then the remaining literal is set
Each assignment updates many counts, so pointer-based schemes are used within SATO and zChaff [Gu et al, DIMACS series DM&TCS, 1997]
Choices, choices…
If variables remain uninstantiated after propagation, then resort to random binding
Better to rank variables by the number of times they occur in clauses which are not (yet) true
But a variable in 128 clauses each with 2 uninstantiated variables is a better candidate than another in 128 clauses each with 32 uninstantiated variables…
But what about the overhead of ranking, especially with "learnt" clauses…
But what about trailing for backtracking…
But what about intelligent back-jumping…