regular expressions as types: bit-coded regular expression ... · regular expressions as types:...

37
Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science University of Copenhagen Email: [email protected] WG 2.8 Meeting, Marble Falls, 2011-03-07 Joint work with Lasse Nielsen, DIKU

Upload: others

Post on 31-May-2020

50 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Regular expressions as types:Bit-coded regular expression parsing

Fritz Henglein

Department of Computer ScienceUniversity of CopenhagenEmail: [email protected]

WG 2.8 Meeting, Marble Falls, 2011-03-07

Joint work with Lasse Nielsen, DIKU

Page 2: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Regular expression

Definition (Regular expression)

A regular expression (RE) over finite alphabet A is an expression ofthe form

E ,F ::= 0 | 1 | a | E |F | EF | E∗

where a ∈ A

Used in bioinformatics, compilers (lexical analysis, control flowanalysis), logic, natural language processing, program verification,protocol specification, query processing, security, XML accesspaths and document types, operating systems, scripting ofsearching, matching and substitution in texts or semi-structureddata (Perl) . . .

2

Page 3: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Language interpretation of regular expressions

Definition (Language interpretation)

The language interpretation of a regular expression E is the set ofstrings L[[E ]] defined by

L[[0]] = ∅L[[1]] = {ε}L[[a]] = {a}

L[[E |F ]] = L[[E ]] ∪ L[[F ]]L[[EF ]] = L[[E ]]� L[[F ]]L[[E∗]] =

⋃i≥0(L[[E ]])i

where S � T = {s t | s ∈ S ∧ t ∈ T}, E 0 = {ε},E i+1 = E E i .

3

Page 4: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Kleene’s Theorem

Theorem (Kleene 1956)

A language is regular if and only it is denoted by a regularexpression under its language interpretation.

4

Page 5: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

What is regular expression “matching”?

Given regular expression and input string, return . . . what?

1 yes or no (membership testing)

2 zero or one substring matches for each regular subexpression(PCRE)

3 any finite number of substring matches for each regularsubexpression (regular expression types)

4 a parse tree

5

Page 6: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

What is regular expression “matching”?

1 Membership testing = language interpretation.

2 PCRE: Only one match under a Kleene star (typically the last)

3 RET: Matches under two Kleene stars not grouped

4 Parsing: Each Kleene star yields a list of matches (thus parsetree).

Note:

Increasing structure: Lower level matching outputconstructible from higher level matching output, in particularfrom parsing.

Classical automata theory (e.g. minimal DFA construction)only sound for membership testing.

6

Page 7: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Practice

PCRE-style programming1:

Group matching: Does the RE match and where do (some of)its sub-REs match in the string?

Substitution: Replace matched substrings by specified otherstrings

Extensions: Backreferences, look-ahead, look-behind,...

Lazy vs. greedy matching, possessive quantifiers, atomicgrouping

Optimization

Observe: Language interpretation (yes/no) inappropriate, needmore refined interpretation

1in Perl and such7

Page 8: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Example

((ab)(c|d)|(abc))*.

Match against abdabc .For each parenthesized group a substring is returned.a

PCRE POSIX

$1 = abc abc$2 = ab ε$3 = c ε$4 = ε abc

aOr special null-value

8

Page 9: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Intermezzo: Optimization??Optimizing regular expressions = rewriting them to equivalentform that is more efficient for matching.2

Cox (2007)

Perl-compliant regular expressions (what you get in Perl,Python, Ruby, Java) use backtracking parsing.Does not handle “problematic” regular expressions: E ∗ whereE contains ε – may crash at run-time (stack overflow).

2Friedl, Mastering Regular Expressions, chapter 6: Crafting an efficientexpression9

Page 10: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Why discrepancy between theory and practice?

Theory is extensional: About regular languages.

Does this string match the regular expression? Yes or no?

Practice is intensional: About regular expressions asgrammars.

Does this string match the regular expression and if sohow—which parts of the string match which parts of the RE?

Ideally: Regular expression matching = parsing +“catamorphic” processing of syntax tree

Reality:

Naive backtracking matching, orfinite automaton + opportunistic instrumentation to get someparsing information (TCL (?), Laurikari 2000, Cox 2010).

10

Page 11: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Regular expression parsing

Regular expression parsing: Construct parse tree for givenstring.

Representation of parse tree: Regular expression as type

Example

Parse abdabc according to ((ab)(c|d)|(abc))*.

p1 = [inl ((a, b), inr d), inr (a, (b, c))]

p2 = [inl ((a, b), inr d), inl ((a, b), inl c)]

p1, p2 have type ((a× b)× (c + d) + a× (b × c)) list .

Compare with regular expression ((ab)(c|d)|(abc))* .

The elements of type E correspond to the syntax trees forstrings parsed according to regular expression E !

11

Page 12: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Type interpretation

Definition (Type interpretation)

The type interpretation T [[.]] compositionally maps a regularexpression E to the corresponding simple type:

T [[0]] = ∅ empty typeT [[1]] = {()} unit typeT [[a]] = {a} singleton type

T [[E | F ]] = T [[E ]] + T [[F ]] sum typeL[[E F ]] = T [[E ]]× T [[F ]] product typeT [[E ∗]] = {[v1, . . . , vn] | vi ∈ T [[E ]]} list type

12

Page 13: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Flattening

Definition

The flattening function flat(.) : Val(A)→ Seq(A) is defined asfollows:

flat(()) = ε flat(a) = aflat(inl v) = flat(v) flat(inr w) = flat(w)

flat((v ,w)) = flat(v) flat(w)

flat([v1, . . . , vn]) = flat(v1) . . . flat(vn)

Example

flat([inl ((a, b), inr d), inr (a, (b, c))]) = abdabc

flat([inl ((a, b), inr d), inl ((a, b), inl c)]) = abdabc

13

Page 14: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Regular expressions as types

Informally:

string s with syntax tree p according to regular expression E∼=

string flat(v) of value v element of simple type E

Theorem

L[[E ]] = {flat(v) | v ∈ T [[E ]]}

14

Page 15: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Membership testing versus parsing

Example

E = ((ab)(c|d)|(abc))* Ed = (ab(c|d))*

Ed is unambiguous: If v ,w ∈ T [[Ed ]] and flat(v) = flat(w)then v = w . (Each string in Ed has exactly one syntax tree.)

E is ambiguous. (Recall p1 and p2.)

E and Ed are equivalent: L[[E ]] = L[[Ed ]]

Ed “represents” the minimal deterministic finite automatonfor E .

Matching (membership testing): Easy—use Ed .

But: How to parse according to E using Ed?

15

Page 16: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Bit coding

General idea:

Have nondeterministic machine/algorithm M with no input,generating all elements of a set

Use sequence of choices as representation of output (moduloM)

For regular languages:

Record binary choices for expanding a regular expression Einto a particular string s.

The sequence of choices (as bits) to drive machine toparticular output s as the bit coding of s under E .

16

Page 17: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Bit coding: Example

Example

Recall syntax trees p1, p2 for abdabc underE = ((a× b)× (c + d) + a× (b × c))∗.

p1 = [inl ((a, b), inr d), inr (a, (b, c))]

p2 = [inl ((a, b), inr d), inl ((a, b), inl c)]

We can code them by storing only their inl , inr occurrences:

code(p1) = 011

code(p2) = 0100

17

Page 18: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Bit decoding

There is a linear-time polytypic function decode that canreconstitute the syntax trees.

Theorem

decodeE (codeE (v)) = v for all v ∈ T [[E ]].

Example

decodeE (011) = [inl ((a, b), inr d), inr (a, (b, c))]

decodeE (0100) = [inl ((a, b), inr d), inl ((a, b), inl c)]

18

Page 19: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Why bit coding?

Bit coding of string s under E

represents a syntax tree of s

takes at most as much space as |s| and often a lot less(depending on E )

can be combined with statistical compression for textcompression

19

Page 20: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Bit coded regular expression parsing

Problem:

Input: string s and regular expression E .Output: (some) parse tree p such that flat(p) = s.

Goal: Output bit coding codeE (p) instead.

Dual advantage:

Less space used for output.Output faster to compute.

How to do that? Mark the “turns” in Thompson NFA (theyyield the bit coding)

20

Page 21: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

DFASIM algorithm: Outline

1 RE to NFA: Build Thompson-style NFA with suitable outputbits

2 NFA to DFA: Perform extended DFA construction (only forstates required by input string), with (multiple) bit sequenceannotations on edges

3 Traverse accepting path from right to left to construct bitcoding by concatenating bit sequences.

21

Page 22: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Thompson-style NFA generation with output bits

E NFA Extended NFA

00

1

0

1

10 0

a0 1

a0 1

a/

E F0 1

E2

F0 1

E2

F

E | F

0

2

1

4F

3E

5 0

2/1

1

/0

4F

3E

5/

/

E ∗0

3

1

2

E0

3

/1

1/0

2

E

/

22

Page 23: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark examples

1: \w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*

([,;]\s*\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)*

2: $?(\d{1,3},?(\d{3},?)*\d{3}(\.\d{0,2})?|\d{1,3}(\.\d{0,2})?|\.\d{1,2}?)

4: [A-Za-z0-9](([ \.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)

(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z][A-Za-z]+)

5: (\w|-)+@((\w|-)+\.)+(\w|-)+

6: [+-]?([0-9]*\.?[0-9]+|[0-9]+\.?[0-9]*)([eE][+-]?[0-9]+)?

7: ((\w|\d|\-|\.)+)@{1}(((\w|\d|\-){1,67})|((\w|\d|\-)+\.(\w|\d|\-){1,67}))

\.((([a-z]|[A-Z]|\d){2,4})(\.([a-z]|[A-Z]|\d){2})?)

8: (([A-Za-z0-9]+ +)|([A-Za-z0-9]+\-+)|([A-Za-z0-9]+\.+)|([A-Za-z0-9]+\++))*

[A-Za-z0-9]+@((\w+\-+)|(\w+\.))*\w{1,63}\.[a-zA-Z]{2,6}

9: (([a-zA-Z0-9 \-\.]+)@([a-zA-Z0-9 \-\.]+)\.([a-zA-Z]{2,5}){1,25})+

([;.](([a-zA-Z0-9 \-\.]+)@([a-zA-Z0-9 \-\.]+)\.([a-zA-Z]{2,5}){1,25})+)*

10: ((\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)\s*[,]{0,1}\s*)+

From Veanes, de Halleaux, Tillman (2010)

23

Page 24: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark experiments (without #3)

0

10

20

30

40

50

60

70

80

90

0 2000 4000 6000 8000 10000

tim

e

n

Example #1

FrCa (s)DFA (s)

Precompiled DFA (ms)DFASIM (ms)

0

100

200

300

400

500

600

700

0 2000 4000 6000 8000 10000

tim

e

n

Example #2

Backtracking (ms)FrCa (ms)DFA (ms)

Precompiled DFA (ms)DFASIM (ms)

0

200

400

600

800

1000

1200

1400

1600

1800

2000

0 2000 4000 6000 8000 10000

tim

e

n

Example #4

FrCa (ms)DFA (s)

Precompiled DFA (ms)DFASIM (ms)

0

1000

2000

3000

4000

5000

6000

7000

8000

0 2000 4000 6000 8000 10000

tim

e

n

Example #5

FrCa (ms)DFA (ms)

Precompiled DFA (ms)DFASIM (ms)

0

20

40

60

80

100

120

140

160

180

0 2000 4000 6000 8000 10000

tim

e

n

Example #6

Backtracking (s)FrCa (ms)DFA (ms)

Precompiled DFA (ms)DFASIM (ms)

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

0 2000 4000 6000 8000 10000

tim

e

n

Example #7

FrCa (ms)DFASIM (ms)

0

500

1000

1500

2000

2500

3000

3500

0 2000 4000 6000 8000 10000

tim

e

n

Example #8

FrCa (ms)DFASIM (ms)

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

0 2000 4000 6000 8000 10000

tim

e

n

Example #9

FrCa (ms)DFASIM (ms)

0

2000

4000

6000

8000

10000

12000

0 2000 4000 6000 8000 10000

tim

e

n

Example #10

FrCa (ms)DFASIM (ms)

24

Page 25: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Regular expression algorithms compared

FrCa: Based on Frisch, Cardelli (2004), right-to-left firstphase, left-to-right second phase.

DFASIM: As above.

DFA: As DFASIM, but staged. Extended DFA for completeextended Thomson-NFA generated, before application toinput.

Precompiled DFA: As DFA, but extended DFA specialized (inC++) and compiled.

Backtracking: PCRE-style backtracking parser.

All algorithms:

generate bit codes;

coded in C++

25

Page 26: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark experiment #1

0

10

20

30

40

50

60

70

80

90

0 2000 4000 6000 8000 10000

tim

e

n

Example #1

FrCa (s)DFA (s)

Precompiled DFA (ms)DFASIM (ms)

26

Page 27: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark experiment #2

0

100

200

300

400

500

600

700

0 2000 4000 6000 8000 10000

tim

e

n

Example #2

Backtracking (ms)FrCa (ms)DFA (ms)

Precompiled DFA (ms)DFASIM (ms)

27

Page 28: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark experiment #4

0

200

400

600

800

1000

1200

1400

1600

1800

2000

0 2000 4000 6000 8000 10000

tim

e

n

Example #4

FrCa (ms)DFA (s)

Precompiled DFA (ms)DFASIM (ms)

28

Page 29: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark experiment #5

0

1000

2000

3000

4000

5000

6000

7000

8000

0 2000 4000 6000 8000 10000

tim

e

n

Example #5

FrCa (ms)DFA (ms)

Precompiled DFA (ms)DFASIM (ms)

29

Page 30: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark experiment #6

0

20

40

60

80

100

120

140

160

180

0 2000 4000 6000 8000 10000

tim

e

n

Example #6

Backtracking (s)FrCa (ms)DFA (ms)

Precompiled DFA (ms)DFASIM (ms)

30

Page 31: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark experiment #7

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

0 2000 4000 6000 8000 10000

tim

e

n

Example #7

FrCa (ms)DFASIM (ms)

31

Page 32: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark experiment #8

0

500

1000

1500

2000

2500

3000

3500

0 2000 4000 6000 8000 10000

tim

e

n

Example #8

FrCa (ms)DFASIM (ms)

32

Page 33: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Benchmark experiment #9

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

0 2000 4000 6000 8000 10000

tim

e

n

Example #9

FrCa (ms)DFASIM (ms)

33

Page 34: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

References

Henglein, Nielsen, “Regular Expression Containment:Coinductive Axiomatization and ComputationalInterpretation”, POPL 2011

Nielsen, Henglein, “Bit-coded Regular Expression Parsing”,LATA 2011

34

Page 35: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Related workFrisch, Cardelli (2004): Regular types corresponding to regularexpressions, linear-time parsing for REs;Hosoya et al. (2000-): Regular expression types, properextension of regular types (!), axiomatization of treecontainmentAanderaa (1965), Salomaa (1966), Krob (1990), Pratt(1990), Kozen (1994, 2008), Grabmeyer (2005), Rutten et al.(2008): RE axiomatizations (extensional)Rutten et al. (1998-): Coalgebraic approach to systems,including finite automata, extensionalBrandt/Henglein (1998): Coinduction rule and computationalinterpretation for recursive typesCameron (1988), Jansson, Jeuring (1999): Bit coding forCFGs and algebraic typesCox (2010): RE2 regular expression library, TCL RE library(appear to be state of the Perl/POSIX-style “regex” libraries)

35

Page 36: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Questions?

36

Page 37: Regular expressions as types: Bit-coded regular expression ... · Regular expressions as types: Bit-coded regular expression parsing Fritz Henglein Department of Computer Science

Future work

Construction of minimal extended NFAs: All

Regular expression parsing with projection (throwing subtreesaway)

Regular expression parsing with catamorphic postprocessing(substituting subtrees)

Regular expression library as practical alternatives to PCRE,RE2 and Tcl, etc., with improved expressiveness, semanticsand performance.

37