Introduction to Abstract Interpretation
Neil Kettle, Andy King and Axel Simon
[email protected]
http://www.cs.kent.ac.uk/~amk
Acknowledgments: much of this material has been adapted from surveys by Patrick and Radhia Cousot
Applications of abstract interpretation
Verification: can a concurrent program deadlock? Is termination assured?
Parallelisation: are two or more tasks independent? What is the worst/best-case running time of a function?
Transformation: can a definition be unfolded? Will unfolding terminate?
Implementation: can an operation be specialised with knowledge of its (global) calling context?
Applications and “players” are incredibly diverse
House-keeping
Thursday
11.00-12.00  Methodology (Andy)
12.00-1.00   Lunch (Free)
1.00-2.00    Numerical and Widening (Andy)
2.00-3.30    SAT solving (Neil)
3.30-3.45    Break (Free)
4.00-5.00    Galois connections (Andy)
5.00-7.00    Shopping (Free)
7.00-9.00    Tasty Spice (Not free)
Friday
9.30-10.30   Two Variable Domain (Axel)
10.30-10.45  Break (Free)
10.45-12.00  SAT solving (Neil)
12.00-2.00   Xmas Party (Origins)
Computing Lab Xmas Party
Located in Origins – the “restaurant” in Darwin
A buffet lunch will be served – courtesy of the department
Department will supply some wine (which last year lasted 10 minutes)
Bar will be open afterwards if some wine is not enough wine
Send an e-mail to Deborah Sowrey [[email protected]] if you want to attend
Come along and meet other post-grads
Casting out nines algorithm
Which of the following multiplications is correct: 2173 × 38 = 81574 or 2173 × 38 = 82574?
Casting out nines is a checking technique that is really a form of abstract interpretation:
Sum the digits in the multiplicand n1, the multiplier n2 and the product n to obtain s1, s2 and s
Divide s1, s2 and s by 9 to compute the remainders, that is, r1 = s1 mod 9, r2 = s2 mod 9 and r = s mod 9
If (r1 × r2) mod 9 ≠ r then the multiplication is incorrect
The algorithm returns “incorrect” or “don’t know”
Running the numbers for 2173 × 38 = 81574
Compute r1 = (2+1+7+3) mod 9 = …
Compute r2 = (3+8) mod 9 = …
Calculate (r1 × r2) mod 9 = …
Calculate r = (8+1+5+7+4) mod 9 = …
Check ((r1 × r2) mod 9 = r) = …
Deduce that 2173 × 38 = 81574 is …
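The check above can be sketched in a few lines of Python (the function names are illustrative, not from the slides):

```python
def digit_sum_mod9(n):
    """Sum the digits of n, then take the remainder modulo 9."""
    return sum(int(d) for d in str(n)) % 9

def check_product(n1, n2, n):
    """Casting-out-nines check: returns "incorrect" or "don't know"."""
    r1, r2, r = digit_sum_mod9(n1), digit_sum_mod9(n2), digit_sum_mod9(n)
    return "incorrect" if (r1 * r2) % 9 != r else "don't know"

print(check_product(2173, 38, 81574))  # incorrect
print(check_product(2173, 38, 82574))  # don't know
```

Note that the check can never answer "correct": a wrong product that happens to have the right remainder slips through.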
Abstract interpretation is a theory of relationships
The computational domain for multiplication (concrete domain): N – the set of non-negative integers
The computational domain of remainders used in the checking algorithm (abstract domain): R = {0, 1, …, 8}
The key question is: what is the relationship between an element n ∈ N which is used in the real algorithm and its analogue r ∈ R in the check?
What is the relationship?
When the multiplicand is n1 = 456, say, then the check uses r1 = (4+5+6) mod 9 = 6
Observe that 456 mod 9 = (4*100 + 56) mod 9 = (4*90 + 4*10 + 56) mod 9 = (4*10 + 56) mod 9 = ((4+5)*10 + 6) mod 9 = ((4+5)*9 + (4+5) + 6) mod 9 = (4+5+6) mod 9
More generally, induction can show r1= n1 mod 9 and r2 = n2 mod 9
Correctness is the preservation of relationships
The check simulates the concrete multiplication and, in effect, is an abstract multiplication
Concrete multiplication is n = n1 × n2
Abstract multiplication is r = (r1 × r2) mod 9, where r1 describes n1 and r2 describes n2
For brevity, write r ∝ n iff r = n mod 9
Then abstract multiplication preserves ∝ iff whenever r1 ∝ n1 and r2 ∝ n2 it follows that r ∝ n
Correctness argument
Suppose r1 ∝ n1 and r2 ∝ n2. If
n = n1 × n2 then n mod 9 = (n1 × n2) mod 9, hence n mod 9 = ((n1 mod 9) × (n2 mod 9)) mod 9,
whence n mod 9 = (r1 × r2) mod 9 = r, therefore r ∝ n
Consequently if ¬(r ∝ n) then n ≠ n1 × n2
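The preservation property can be spot-checked exhaustively on a small range:

```python
# r describes n iff r = n mod 9; check that abstract multiplication
# (r1 * r2) mod 9 preserves the relationship on a sampled range.
for n1 in range(200):
    for n2 in range(200):
        r1, r2 = n1 % 9, n2 % 9
        assert (n1 * n2) % 9 == (r1 * r2) % 9
print("preservation holds for all n1, n2 < 200")
```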
Summary
Formalise the relationship between the data
Check that the relationship is preserved by the abstract analogues of the concrete operations
The relational framework [Acta Informatica, 30(2):103-129, 1993] not only emphasises the theory of relations but is very general
Numeric approximation and widening
Abstract interpretation does not require a domain to be finite
Interval approximation
Consider the following Pascal-like program:
begin
  i := 0;              {1: i ∈ [0,0]}
  while (i < 16) do    {2: i ∈ [0,15]}
    i := i + 1         {3: i ∈ [1,16]}
end                    {4: i ∈ [16,16]}
SYNTOX [PLDI'90] inferred the invariants scoped within {…}
Invariants occur between consecutive lines in the program
i ∈ [0,15] asserts 0 ≤ i ≤ 15 whereas i ∈ [0,0] means i = 0
Compilation versus (classic) interpretation
Abstract compilation – compile the concrete program into an abstract program (equation system) and execute the abstract program: good separation of concerns that aids debugging; the particulars of the domain can be exploited to reorder operations, specialise operations, etc.
Abstract interpretation – run the concrete program but on-the-fly interpret its concrete operations as abstract operations: ideal for a generic framework (toolkit) which is parameterised by abstract domain plugins
Abstract domain that is used in interval analysis
The domain of intervals includes:
[l,u] where l ≤ u and l,u ∈ Z for bounded sets, e.g. [0,5] describes {0,1,4} since {0,1,4} ⊆ [0,5]
⊥ to represent the empty set of numbers
[l,∞] for sets which are bounded below such as {l, l+2, l+4, …}
[-∞,u] for sets which are bounded above such as {…, u-5, u-3, u}
Weakening intervals
Join (path merge) is defined: put d1 ⊔ d2 = d2 if d1 = ⊥, d1 if d2 = ⊥, and [min(l1,l2), max(u1,u2)] otherwise, whenever d1 = [l1,u1] and d2 = [l2,u2]
if … then
  …        {1: i ∈ [0,2]}
else
  …        {2: i ∈ [3,5]}
endif      {3: i ∈ [0,5]}
Strengthening intervals
Meet is defined: put d1 ⊓ d2 = ⊥ if d1 = ⊥ or d2 = ⊥ or max(l1,l2) > min(u1,u2), and [max(l1,l2), min(u1,u2)] otherwise, whenever d1 = [l1,u1] and d2 = [l2,u2]
{3: i ∈ [0,5]}
if (2 < i) then
  {4: i ∈ [3,5]} …
else
  {5: i ∈ [0,2]} …
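The two operators can be sketched in Python; encoding ⊥ as None and unbounded ends as float infinities is a choice made here, not in the slides:

```python
BOT = None  # stands for the empty interval ⊥

def join(d1, d2):
    """Least upper bound (path merge) of two intervals."""
    if d1 is BOT: return d2
    if d2 is BOT: return d1
    (l1, u1), (l2, u2) = d1, d2
    return (min(l1, l2), max(u1, u2))

def meet(d1, d2):
    """Greatest lower bound (intersection) of two intervals."""
    if d1 is BOT or d2 is BOT: return BOT
    (l1, u1), (l2, u2) = d1, d2
    l, u = max(l1, l2), min(u1, u2)
    return (l, u) if l <= u else BOT

print(join((0, 2), (3, 5)))  # (0, 5) -- the if/else merge above
print(meet((0, 5), (3, 5)))  # (3, 5) -- the then-branch refinement
```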
Meet and join are the basic primitives for compilation
I1 = [0,0] since program point (1) immediately follows i := 0
I2 = (I1 ⊔ I3) ⊓ [-∞,15] since: control from program points (1) and (3) flows into (2); point (2) is reached only if i < 16 holds
I3 = {n+1 | n ∈ I2} since (3) is only reachable from (2) via the increment
I4 = (I1 ⊔ I3) ⊓ [16,∞] since: control from (1) and (3) flows into (4); point (4) is reached only if ¬(i < 16) holds
Interval iteration
Early iterates:
I1: [0,0]  [0,0]  [0,0]  [0,0]  [0,0]  [0,0]  [0,0]
I2:   -    [0,0]  [0,0]  [0,1]  [0,1]  [0,2]  [0,2]
I3:   -      -    [1,1]  [1,1]  [1,2]  [1,2]  [1,3]
I4:   -      -      -      -      -      -      -
Later iterates:
I1: … [0,0]   [0,0]   [0,0]   [0,0]
I2: … [0,15]  [0,15]  [0,15]  [0,15]
I3: … [1,15]  [1,16]  [1,16]  [1,16]
I4: …   -       -    [16,16] [16,16]
Jacobi versus Gauss-Seidel iteration
I1 [0,0] [0,0] [0,0] … [0,0] [0,0] [0,0]
I2 [0,0] [0,1] [0,2] … [0,14] [0,15] [0,15]
I3 [1,1] [1,2] [1,3] … [1,15] [1,16] [1,16]
I4 … [16,16] [16,16]
With Jacobi, the new vector I1’,I2’,I3’,I4’ of intervals is calculated from the old I1,I2,I3,I4
With Gauss-Seidel iteration:
I1' is calculated from I1,I2,I3,I4
I2' is calculated from I1',I2,I3,I4
I3' is calculated from I1',I2',I3,I4
I4' is calculated from I1',I2',I3',I4
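The equation system can be run to its fixpoint with Jacobi-style iteration; a minimal Python sketch (⊥ encoded as None, unbounded ends as float infinities, choices made here):

```python
NEG_INF, INF = float('-inf'), float('inf')
BOT = None  # the empty interval

def join(d1, d2):
    if d1 is BOT: return d2
    if d2 is BOT: return d1
    return (min(d1[0], d2[0]), max(d1[1], d2[1]))

def meet(d1, d2):
    if d1 is BOT or d2 is BOT: return BOT
    l, u = max(d1[0], d2[0]), min(d1[1], d2[1])
    return (l, u) if l <= u else BOT

def incr(d):
    """{n+1 | n in d}"""
    return BOT if d is BOT else (d[0] + 1, d[1] + 1)

def step(I1, I2, I3, I4):
    # The four equations compiled from the program.
    return ((0, 0),
            meet(join(I1, I3), (NEG_INF, 15)),
            incr(I2),
            meet(join(I1, I3), (16, INF)))

# Jacobi iteration: recompute the whole vector from the old one.
state = (BOT, BOT, BOT, BOT)
while True:
    new = step(*state)
    if new == state:
        break
    state = new
print(state)  # ((0, 0), (0, 15), (1, 16), (16, 16))
```

The loop terminates here because the iterates climb towards a fixpoint that the bound 15 keeps finite; the decrement variant below shows what happens without such a bound.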
Gauss-Seidel versus chaotic iteration
Observe that I4 might change if either I1 or I3 change, hence evaluate I4 after I1 and I3 stabilise
This suggests waiting until stability is achieved at one level before starting on the next
[Dependency graph: I1 feeds I2 and I4; I2 and I3 feed each other; I3 feeds I4. The strongly connected components are evaluated in the order {I1}, then {I2, I3}, then {I4}]
Gauss-Seidel versus chaotic iteration
Chaotic iteration can postpone evaluating Ii for a bounded number of iterations: I1' is calculated from I1,-,-,-; I2' and I3' are calculated Gauss-Seidel style from I1',I2,I3,-; I4' is calculated from I1',I2',I3',I4
Fast and (incremental) fixpoint solvers [TOPLAS 22(2):187-223,2000] apply chaotic iteration
I1: [0,0]  [0,0]  [0,0]  … [0,0]   [0,0]   [0,0]
I2:   -    [0,0]  [0,1]  … [0,15]  [0,15]  [0,15]
I3:   -    [1,1]  [1,2]  … [1,16]  [1,16]  [1,16]
I4:   -      -      -    …   -       -    [16,16]
Research challenge
Compiling to equations and iteration is well-understood (albeit not well-known)
The implicit assumption is that source is available
With the advent of component and multi-linguistic programming, the problem is how to generate the equations from: a specification of the algorithm or the API; the types of the algorithm or component
In the interim, environments with support for modularity either equip the programmer with an equation language, or make worst-case assumptions about behaviour
Suppose i was decremented rather than incremented
begin
  i := 0;              {1: i ∈ [0,0]}
  while (i < 16) do    {2: i ∈ [-∞,0]}
    i := i - 1         {3: i ∈ [-∞,-1]}
end                    {4: i ∈ ⊥}
I1 = [0,0]
I2 = (I1 ⊔ I3) ⊓ [-∞,15]
I3 = {n-1 | n ∈ I2}
I4 = (I1 ⊔ I3) ⊓ [16,∞]
I1: [0,0]  [0,0]  [0,0]   [0,0]   [0,0]   [0,0]
I2:   -      -    [0,0]  [-1,0]  [-2,0]    …
I3:   -      -   [-1,0]  [-2,0]  [-3,0]    …
I4:   -      -      -      -       -       -
Ascending chain condition
A domain D is ACC iff it does not contain an infinite strictly increasing chain d1 < d2 < d3 < … where d < d' iff d ⊑ d' and d ≠ d' (see below)
The interval domain D is ordered by: ⊥ ⊑ d for all d ∈ D and [l1,u1] ⊑ [l2,u2] iff l2 ≤ l1 ≤ u1 ≤ u2
and it is not ACC since [0,0] < [-1,0] < [-2,0] < …
[Figure: the interval domain over …, -4, -3, -2, -1, 0, 1, 2, 3, 4, … drawn as a lattice with ⊤ at the top]
Some very expressive relational domains are ACC
Common sub-expression elimination relies on detecting duplicated expression evaluation
Karr [Acta Informatica, 6, 133-151] noticed that detecting an invariance such as y = x/2 – 7 was key to this optimisation
begin
  x := sin(a) * 2;
  y := sin(a) - 7;
end
The affine domain
The domain of affine equations over n variables is:
D = {⟨A,B⟩ | A is an m×n matrix and B is an m-dimensional column vector}
D is ordered by: ⟨A1,B1⟩ ⊑ ⟨A2,B2⟩ iff (if A1x = B1 then A2x = B2)
Pre-orders versus posets
A pre-order ⟨D,⊑⟩ is a set D ordered by a binary relation ⊑ such that:
d ⊑ d for all d ∈ D
if d1 ⊑ d2 and d2 ⊑ d3 then d1 ⊑ d3
A poset is a pre-order ⟨D,⊑⟩ such that additionally:
if d1 ⊑ d2 and d2 ⊑ d1 then d1 = d2
The affine domain is a pre-order (so it is not ACC)
Observe ⟨A1,B1⟩ ⊑ ⟨A2,B2⟩ and ⟨A2,B2⟩ ⊑ ⟨A1,B1⟩, yet ⟨A1,B1⟩ ≠ ⟨A2,B2⟩, where A1, B1, A2, B2 are given below: both systems encode x=1, y=0, z=0, so antisymmetry fails
To build a poset from a pre-order:
define d ≡ d' iff d ⊑ d' and d' ⊑ d
define [d] = {d' ∈ D | d ≡ d'} and D* = {[d] | d ∈ D}
define [d] ⊑* [d'] iff d ⊑ d'
The poset ⟨D*,⊑*⟩ is ACC since chain length is bounded by the number of variables n
A1 = (1 0 0)    B1 = (1)    A2 = (2 0 0)    B2 = (2)
     (0 1 0)         (0)         (0 1 0)         (0)
     (0 0 1)         (0)         (0 0 1)         (0)
Inducing termination for non-ACC (and huge ACC) domains
Enforce convergence for intervals with a widening operator ∇ : D×D → D
⊥ ∇ d = d
d ∇ ⊥ = d
[l1,u1] ∇ [l2,u2] = [if l2 < l1 then -∞ else l1, if u1 < u2 then ∞ else u1]
Examples:
[1,2] ∇ [1,2] = [1,2]
[1,2] ∇ [1,3] = [1,∞] but [1,3] ∇ [1,2] = [1,3]
Safe since [li,ui] ⊑ ([l1,u1] ∇ [l2,u2]) for i ∈ {1,2}
Chaotic iteration with widening
To terminate it is necessary to traverse each loop a finite number of times
It is sufficient to pass through I2 or I3 a finite number of times [Bourdoncle, 1990]
Thus widen at I3 since it is simpler
[Dependency graph as before: I1 feeds I2 and I4; I2 and I3 feed each other; I3 feeds I4]
Termination for the decrement:
I1 = [0,0]
I2 = (I1 ⊔ I3) ⊓ [-∞,15]
I3 = I3 ∇ {n-1 | n ∈ I2}   (note the fix)
I4 = (I1 ⊔ I3) ⊓ [16,∞]
When I2 = [-1,0] and I3 = [-1,0], then
I3 ∇ {n-1 | n ∈ I2} = [-1,0] ∇ [-2,-1] = [-∞,0]
I1: [0,0]  [0,0]  [0,0]   [0,0]    [0,0]   [0,0]   [0,0]
I2:   -      -    [0,0]  [-1,0]   [-∞,0]  [-∞,0]  [-∞,0]
I3:   -      -   [-1,0]  [-∞,0]*  [-∞,0]  [-∞,0]  [-∞,0]
I4:   -      -      -       -        -       -       -
(* the widening fires here)
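The widening operator itself is a one-liner; a Python sketch (None encodes ⊥, a choice made here):

```python
NEG_INF, INF = float('-inf'), float('inf')

def widen(d1, d2):
    """[l1,u1] nabla [l2,u2]: any bound that grew jumps straight to infinity."""
    if d1 is None: return d2
    if d2 is None: return d1
    (l1, u1), (l2, u2) = d1, d2
    return (NEG_INF if l2 < l1 else l1, INF if u1 < u2 else u1)

print(widen((1, 2), (1, 2)))     # (1, 2)
print(widen((1, 2), (1, 3)))     # (1, inf)
print(widen((1, 3), (1, 2)))     # (1, 3)
print(widen((-1, 0), (-2, -1)))  # (-inf, 0) -- the decrement loop stabilises
```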
Widening dynamic data-structures
[Figure: term graphs built from cons, or, nil and numeric leaves, approximating the lists constructed by the loop at increasing depths]
begin
  i := 0; p := nil;
  while (i < 16) do
    i := i + 1;
    p := new cons(i, p)   {1: p ∈ cons(i, … cons(0, nil))}
end
Depth-2 versus type-graph widening
[Figure: a depth-2 term graph versus a type graph for the same lists; the type graph truncates deep subterms to any, giving a more compact description]
Type-graph widening is more compact
Type-graph widening becomes difficult when a list contains lists as its elements
In constraint-based analysis, widening is dispensed with altogether
(Malicious) research challenge
Read a survey paper to find an abstract domain that is ACC but has a maximal chain length of O(2^n)
Construct a program with O(n) symbols that iterates through all O(2^n) abstractions
Publish the program in IPL
Not all numeric domains are convex
A set S ⊆ R^n is convex iff for all x,y ∈ S it follows that {λx + (1-λ)y | 0 ≤ λ ≤ 1} ⊆ S
The 2 leftmost sets in R2 are convex but the 2 rightmost sets are not.
Are intervals or affine equations convex?
Suppose the values of n variables are represented by n intervals [l1,u1],…,[ln,un]
Suppose x = ⟨x1,…,xn⟩, y = ⟨y1,…,yn⟩ ∈ R^n are described by the intervals
Then each li ≤ xi ≤ ui and each li ≤ yi ≤ ui
Let 0 ≤ λ ≤ 1 and observe z = λx + (1-λ)y = ⟨λx1 + (1-λ)y1, …, λxn + (1-λ)yn⟩
Therefore li ≤ min(xi, yi) ≤ λxi + (1-λ)yi ≤ max(xi, yi) ≤ ui and convexity follows
Arithmetic congruences are not convex
Elements of the arithmetic congruence (AC) domain take the form x - 2y = 1 (mod 3), which describes integral values of x and y
More exactly, the AC domain consists of conjunctions of equations of the form c1x1 + … + cmxm = c (mod n) where ci,c ∈ Z and n ∈ N
Incredibly, AC is ACC [IJCM, 30, 165-190, 1989]
[Plot: the integral solutions of x - 2y = 1 (mod 3) form isolated points in the plane, hence the set is not convex]
Research challenge
Søndergaard [FSTTCS, 95] introduced the concept of an immediate fixpoint
Consider the following (groundness) dependency equations over the domain of Boolean functions Bool:
f1 = x ∧ (y ∨ z)
f2 = ∃t.(x(z(u (t∨x) v (t∨z) f4)))
f3 = ∃u.(∃v.(x u z v f2))
f4 = f1 ∨ f3
where ∃x(f) = f[x ↦ true] ∨ f[x ↦ false]; thus ∃x(x∨y) = true and ∃x(x∧y) = y
The alternative tactic
The standard tactic is to apply iteration:
f1: false  x∧(y∨z)  x∧(y∨z)  x∧(y∨z)  …  x∧(y∨z)
f2: false  false    false    v∧(y∨u)  …  (u∨y)∧v
f3: false  false    false    false    …  (x∧y)∨z
f4: false  false    x∧(y∨z)  x∧(y∨z)  …  (x∧y)∨z
Søndergaard found that the system can be solved symbolically (like a quadratic)
This would be very useful for infinite domains for improved precision and predictability
Combining analyses
Verifiers and optimisers are often multi-pass, built from several separate analyses
Should the analyses be performed in parallel or in sequence?
Analyses can interact to improve one another (problem is in the complexity of the interaction [Pratt])
Pruning combined domains
Suppose that ∝1 ⊆ D1×C and ∝2 ⊆ D2×C; then how is D = D1×D2 interpreted?
Then ⟨d1,d2⟩ ∝ c iff d1 ∝1 c and d2 ∝2 c
Ideally, many ⟨d1,d2⟩ ∈ D will be redundant, that is, ¬∃c ∈ C . d1 ∝1 c ∧ d2 ∝2 c
Time versus precision (from TOPLAS 17(1):28-44, 1993)

            Time                         Precision
            Share   ASub  Share×ASub    Share  ASub  Share×ASub
serialise    9290    839     1870        235    35      35
init-subst    569   1250      829          5    72       5
map-color    4600   1040     5760         76    74      73
grammar       170    140      269         11    11      11
browse      51860   1609    49580        196   104     104
bid          1129   1000     1429         11     0       0
deriv        2819   2630     3550          0     0       0
rdtok        5670   4450     6389        185    48      48
read         8790   8380    11069         11     1       1
boyer       11040   3949     7709        242    93      93
peephole    20760   7990    23029        386   310     310
ann         93509  16789    53269       1935  1690    1690
The Galois framework
Abstract interpretation is often presented in terms of Galois connections
Lattices – a prelude to Galois connections
Suppose ⟨S,⊑⟩ is a poset
A mapping ⊔ : S×S → S is a join (least upper bound) iff:
a⊔b is an upper bound of a and b, that is, a ⊑ a⊔b and b ⊑ a⊔b for all a,b ∈ S
a⊔b is the least upper bound, that is, if c ∈ S is an upper bound of a and b, then a⊔b ⊑ c
The definition of the meet ⊓ : S×S → S (the greatest lower bound) is analogous
Complete lattices
A lattice ⟨S,⊑,⊔,⊓⟩ is a poset ⟨S,⊑⟩ equipped with a join ⊔ and a meet ⊓
The join can often be lifted to sets by defining ⊔ : ℘(S) → S where, for all T ⊆ S: t ⊑ ⊔(T) for all t ∈ T, and if t ⊑ s for all t ∈ T then ⊔(T) ⊑ s
If the meet can be lifted analogously, then the lattice is complete
A lattice that contains a finite number of elements is always complete
A lattice that is not complete
A hyperplane in 2-d space is a line and in 3-d space is a plane
A hyperplane in R^n is any space that can be defined by {x ∈ R^n | c1x1+…+cnxn = c} where c1,…,cn,c ∈ R
A halfspace in R^n is any space that can be defined by {x ∈ R^n | c1x1+…+cnxn ≤ c}
A polyhedron is the intersection of a finite number of half-spaces
Examples and non-examples in planar space
Join for polyhedra
The join of polyhedra P1 and P2 in R^n coincides with the topological closure of the convex hull of P1 ∪ P2
The “join” of an infinite set of polyhedra
Consider the following infinite chain of regular polyhedra:
The only space that contains all these polyhedra is a circle yet this is not polyhedral
⟨A, α, γ, C⟩ is a Galois connection whenever:
⟨A,⊑A⟩ and ⟨C,⊑C⟩ are complete lattices
The mappings α : C → A and γ : A → C are monotonic, that is:
if c1 ⊑C c2 then α(c1) ⊑A α(c2)
if a1 ⊑A a2 then γ(a1) ⊑C γ(a2)
The compositions γ∘α : C → C and α∘γ : A → A are extensive and reductive respectively, that is:
c ⊑C (γ∘α)(c) for all c ∈ C
(α∘γ)(a) ⊑A a for all a ∈ A
A classic Galois connection example
The concrete domain ⟨C,⊑C,⊔C,⊓C⟩ is ⟨℘(Z),⊆,∪,∩⟩
The abstract domain is ⟨A,⊑A,⊔A,⊓A⟩ where:
A = {⊥,+,-,⊤}
⊥ ⊑A a ⊑A ⊤ for all a ∈ A
The join ⊔A and meet ⊓A are defined by:

⊔A | ⊥  +  -  ⊤        ⊓A | ⊥  +  -  ⊤
⊥  | ⊥  +  -  ⊤        ⊥  | ⊥  ⊥  ⊥  ⊥
+  | +  +  ⊤  ⊤        +  | ⊥  +  ⊥  +
-  | -  ⊤  -  ⊤        -  | ⊥  ⊥  -  -
⊤  | ⊤  ⊤  ⊤  ⊤        ⊤  | ⊥  +  -  ⊤
The relationship between A and C
The concretisation mapping γ : A → C is defined:
γ(⊥) = Ø
γ(+) = {n ∈ Z | n > 0}
γ(-) = {n ∈ Z | n < 0}
γ(⊤) = Z
The abstraction mapping α : C → A is defined:
α(S) = ⊥ if S = Ø
α(S) = + else if n > 0 for all n ∈ S
α(S) = - else if n < 0 for all n ∈ S
α(S) = ⊤ otherwise
Avoiding repetition
Can define α with γ and vice versa: α(S) = ⊓A{a ∈ A | S ⊆ γ(a)}
And dually γ(a) = ∪{S ⊆ Z | α(S) ⊑A a}
As an example consider α({1,2}):
{1,2} ⊆ γ(⊤) and {1,2} ⊆ γ(+), but {1,2} ⊄ γ(-) and {1,2} ⊄ γ(⊥)
Therefore α({1,2}) = ⊓A{+, ⊤} = +
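The sign abstraction can be sketched in Python; since γ(⊤) = Z is infinite, membership is tested rather than the set materialised (the encoding is a choice made here):

```python
# BOT/POS/NEG/TOP model the four abstract values.
BOT, POS, NEG, TOP = '⊥', '+', '-', '⊤'

def gamma_contains(a, n):
    """Is the integer n a member of gamma(a)?"""
    return {BOT: False, POS: n > 0, NEG: n < 0, TOP: True}[a]

def alpha(S):
    """Best abstraction of a finite set of integers."""
    if not S: return BOT
    if all(n > 0 for n in S): return POS
    if all(n < 0 for n in S): return NEG
    return TOP

print(alpha({1, 2}))   # +
print(alpha({-1, 2}))  # ⊤
print(alpha(set()))    # ⊥
```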
Collecting domains and semantics
Observe that C is not that concrete – programs include operations such as * : Z×Z → Z
C = ℘(Z) is the collecting domain, which is easier to abstract than Z since it is already a lattice
To abstract * : Z×Z → Z, say, we synthesise a collecting version *C : ℘(Z)×℘(Z) → ℘(Z) and then abstract that
Put S1 *C S2 = {n1*n2 | n1 ∈ S1 and n2 ∈ S2}
Safety and optimality requirements
Safety requires α(γ(a1) *C γ(a2)) ⊑A a1 *A a2 for all a1,a2 ∈ A
Optimality [POPL, 269-282, 1979] also requires a1 *A a2 ⊑A α(γ(a1) *C γ(a2))
Arguing optimality is harder than safety since rare-case approximation can simplify a tricky argument [JLP]
Abstract multiplication
*A | +  -  ⊤
+  | +  -  ⊤
-  | -  +  ⊤
⊤  | ⊤  ⊤  ⊤
(and ⊥ *A a = a *A ⊥ = ⊥, since γ(⊥) = Ø)
Consider safety for α(γ(+) *C γ(+)) ⊑A + *A +:
Recall γ(+) = {n ∈ Z | n > 0}
Thus γ(+) *C γ(+) = {n1*n2 | n1 > 0 and n2 > 0}
Hence α(γ(+) *C γ(+)) = + ⊑A + = + *A +
Need + *A + ⊑A α(γ(+) *C γ(+)) for optimality:
Recall α(γ(+) *C γ(+)) ⊑A + *A + = +
Hence α(γ(+) *C γ(+)) ∈ {⊥, +}
But γ(+) ≠ Ø, thus γ(+) *C γ(+) ≠ Ø
Therefore α(γ(+) *C γ(+)) = + and optimality follows
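Safety of the whole table can be spot-checked by sampling γ on a finite window of Z; this is a sketch of the safety condition, not a proof (names and the window size are choices made here):

```python
BOT, POS, NEG, TOP = '⊥', '+', '-', '⊤'

def leq(a, b):                      # the order on the four-point lattice
    return a == b or a == BOT or b == TOP

def gamma(a, window=range(-5, 6)):  # gamma(a) sampled on a finite window
    return {n for n in window
            if {BOT: False, POS: n > 0, NEG: n < 0, TOP: True}[a]}

def alpha(S):                       # best abstraction of a finite set
    if not S: return BOT
    if all(n > 0 for n in S): return POS
    if all(n < 0 for n in S): return NEG
    return TOP

def amul(a, b):                     # the abstract multiplication table
    if BOT in (a, b): return BOT
    if TOP in (a, b): return TOP
    return {(POS, POS): POS, (POS, NEG): NEG,
            (NEG, POS): NEG, (NEG, NEG): POS}[(a, b)]

for a in (BOT, POS, NEG, TOP):
    for b in (BOT, POS, NEG, TOP):
        collected = {n1 * n2 for n1 in gamma(a) for n2 in gamma(b)}
        assert leq(alpha(collected), amul(a, b))  # the safety condition
print("safety holds on the sampled window")
```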
Exotic applications of abstract interpretation
Recovering programmer intentions for understanding undocumented or third-party code
Verifying that a buffer-overrun cannot occur, or pin-pointing where one might occur in a C program
Inferring the environment in which a system of synchronising agents will not deadlock
Lower-bound time-complexity analysis for granularity throttling
Binding-time analysis for inferring off-line unfolding decisions which avoid code-bloat
Pointers to the literature
SAS, POPL, ESOP, ICLP, ICFP, …
Useful review articles and books:
Patrick and Radhia Cousot, Comparing the Galois connection and Widening/Narrowing approaches to Abstract Interpretation, PLILP, LNCS 631, 269-295, 1992. Available from LIX library.
Patrick and Radhia Cousot, Abstract interpretation and Application to Logic Programs, JLP, 13(2-3):103-179, 1992
Flemming Nielson, Hanne Riis Nielson and Chris Hankin, Principles of Program Analysis, Springer, 1999.
Patrick has a database of abstract interpretation researchers and regularly writes tutorials, see, CC’02.
Appendix: SAT solving
SAT is not a form of abstract interpretation, but abstraction and abstract interpretation are often used to reduce a verification problem to a satisfiability checking problem
Acknowledgments: much of this material is adapted from the review article, “The Quest for Efficient Boolean Satisfiability Solvers” by Zhang and Malik, 2002.
The SAT problem
Given an arbitrary propositional formula, f say, does there exist a variable assignment (a model) under which f evaluates to true?
One model for f = (x ∨ y) is θ = {x ↦ true, y ↦ true}
SAT is the stereotypical NP-complete problem, but this does not preclude the existence of efficient SAT algorithms for certain SAT instances
Stålmarck [US Patent N527689,1995] and applications in AI planning, software verification, circuit testing have promoted a resurgence of interest in SAT
The other type of completeness
A SAT algorithm is said to be complete iff (given enough resource) it will either: compute a satisfying variable assignment or verify that no such assignment exists
A SAT algorithm is incomplete (stochastic) iff unsatisfiability cannot always be detected
Trade incompleteness for speed when a solution is very likely to exist (planning applications).
In program verification (partial) correctness often follows by proving unsatisfiability
The Davis-Logemann-Loveland (DPLL) approach
1st generation solvers such as POSIT, 2cl, CSAT, etc. are based on DPLL, as are the 2nd generation solvers such as SATO and zChaff, which tune DPLL
Davis and Putnam [JACM, 7:201-215, 1960] proposed resolution for Boolean SAT; DLL [CACM, 5:394-397, 1962] replaced resolution with search to improve memory usage (special case)
CNF used to simplify unsatisfiability checking; conversion is polynomial [JSC,2,293—304, 1986]
CNF is a conjunction of clauses, for example, (x ↔ y) = (x → y) ∧ (y → x) = (¬x ∨ y) ∧ (x ∨ ¬y)
The Davis-Logemann-Loveland (DPLL) algorithm
bool function DPLL(f, θ)
begin
  ⟨fail, θ'⟩ := unit(f, θ);
  if (fail) return false;
  if (satisfied(f, θ')) return true;
  else if (unsatisfied(f, θ')) return false;
  else begin
    let x ∈ var(f) - var(θ');
    if (DPLL(f, θ' ∪ {x ↦ true})) return true;
    else return DPLL(f, θ' ∪ {x ↦ false});
  end
end
unit applies unit propagation, possibly detecting unsatisfiability
satisfied returns true if one literal in each clause is true
unsatisfied returns true if there exists one clause with every literal false
the non-determinacy is in the choice of variable x
the recursion stack records the backtracking search
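The algorithm can be rendered as a compact Python sketch; the clause encoding (signed integers, DIMACS style) and helper names are choices made here, and real solvers replace the naive scan with the counter schemes described next:

```python
def unit_propagate(clauses, assignment):
    """Repeatedly apply the unit clause rule; None signals a conflict."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                val = assignment.get(abs(lit))
                if val is None:
                    unassigned.append(lit)
                elif (lit > 0) == val:
                    satisfied = True
            if satisfied:
                continue
            if not unassigned:
                return None            # every literal false: conflict
            if len(unassigned) == 1:   # unit clause: force the literal
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return assignment

def dpll(clauses, assignment=None):
    """Return a satisfying assignment as {var: bool}, or None."""
    assignment = unit_propagate(clauses, assignment or {})
    if assignment is None:
        return None
    variables = {abs(l) for c in clauses for l in c}
    free = variables - set(assignment)
    if not free:
        return assignment
    x = min(free)                      # the non-deterministic choice point
    for guess in (True, False):
        result = dpll(clauses, {**assignment, x: guess})
        if result is not None:
            return result
    return None

print(dpll([[1, 2], [-1, 2], [-2, 1]]))  # e.g. {1: True, 2: True}
print(dpll([[1], [-1]]))                 # None (unsatisfiable)
```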
Unit propagation
Unit clause rule: if all the literals but one are false, then the remaining literal is set to true
Many SAT solvers use a counter scheme [Crawford, AAAI, 1993] with one counter per clause tracking the number of false literals in that clause:
if a count reaches the total number of literals, then unsatisfiability has been detected
otherwise, if it is one less, then the remaining literal is set
Each assignment updates many counts, so pointer-based schemes are used within SATO and zChaff [Gu et al, DIMACS series DM&TCS, 1997]
Choices, choices…
If variables remain uninstantiated after propagation, then resort to random binding
Better to rank variables by the number of times they occur in clauses which are not (yet) true
But a variable in 128 clauses each with 2 uninstantiated variables is a better candidate than another in 128 clauses each with 32 uninstantiated variables…
But what about the overhead of ranking, especially with "learnt" clauses…
But what about trailing for backtracking…
But what about intelligent back-jumping…