
Post on 14-Jan-2016


Fast Submatch Extraction using OBDDs

Liu Yang¹, Pratyusa Manadhata², William Horne², Prasad Rao², Vinod Ganapathy¹

¹ Rutgers University

² HP Laboratories

Applications of Regular Expressions

[Figure: a NIDS with signatures and network traffic as inputs and alerts as output]

Network intrusion detection systems (NIDS) employ regular expressions to represent attack signatures.

Applications of Regular Expressions (cont.)

[Figure: connectors (rule set) and a SIEM; labels: web security compliance, email security compliance]

Security information and event management (SIEM) systems employ regular expressions to normalize event logs generated by hardware connectors and software systems.

Submatch Extraction

Rule set: …username=(.*), hostname=(.*) …

Input: username=Bob, hostname=Foo

Submatch extraction: $1 = Bob, $2 = Foo
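The extraction step above can be sketched with Python's standard `re` module. This is only an illustration of what submatch extraction returns: Python's engine uses backtracking, not the OBDD technique presented in this talk, and the function name `extract_submatches` is a hypothetical helper. The elided context around the rule ("…") is left out.

```python
import re

# Rule taken from the slide; the elided "…" context around it is omitted.
RULE = re.compile(r"username=(.*), hostname=(.*)")

def extract_submatches(line):
    """Return the captured groups ($1, $2, ...) or None if the rule does not match."""
    m = RULE.search(line)
    return m.groups() if m else None

print(extract_submatches("username=Bob, hostname=Foo"))  # ('Bob', 'Foo')
```

Each capturing group in the rule corresponds to one numbered submatch, so `$1` is the first tuple element and `$2` the second.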

Signature Matching

• Non-deterministic finite automata (NFAs)
  – Space efficient, time inefficient

• Deterministic finite automata (DFAs)
  – Time efficient, state blow-up

• Recursive backtracking
  – Fast in general
  – Vulnerable to algorithmic complexity attacks

Motivation: Time/Space Tradeoff

[Figure: time vs. space plot with the points Ideal, DFA (deterministic finite automaton), NFA (non-deterministic finite automaton), Backtracking, and Our approach]

Our Contributions

• A novel way of annotating capturing groups: tagged NFAs

• Design of a novel technique for submatch extraction (called Submatch-OBDD)
  – Extending Thompson’s algorithm
  – Using Boolean functions to represent tagged NFAs
  – Using ordered binary decision diagrams (OBDDs) to improve time efficiency

• Evaluation and comparison with RE2 and PCRE

Note: RE2 is a hybrid approach, using a mix of DFA/NFA, while PCRE uses recursive backtracking.

Solution Overview

RegExps with capturing groups → Tagged NFAs → Boolean representations → OBDD representations

NFA Representation of RegExps

E = a*aa

NFA of regexp “a*aa”

Transition table T(x,i,y):

Current state (x) | Input symbol (i) | Next state (y)
1                 | a                | 1
1                 | a                | 2
2                 | a                | 3

Submatch Tagging: tagged NFAs

E = (a*)aa

Tag(E) = (a*)_{t1} aa

Tagged NFA of “(a*)aa” with submatch tagging t1 (the self-loop on state 1 is labeled a / t1)

Extended transition table T(x,i,y,t) of the tagged NFA:

Current state (x) | Input symbol (i) | Next state (y) | Output tags (t)
1                 | a                | 1              | {t1}
1                 | a                | 2              | {}
2                 | a                | 3              | {}

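Match Test

RegExp = (a*)aa; Input: aaaa

States: 1, 2, 3 (state 3 is the accept state)

Input symbols:         a      a        a        a
Frontier:         {1}    {1,2}  {1,2,3}  {1,2,3}  {1,2,3}
Tags on 1 → 1:         {t1}   {t1}     {t1}     {t1}

The final frontier contains state 3: accept.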
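The frontier computation in this match test can be sketched as a plain set-based simulation of the tagged NFA. The table below is the extended transition table T(x,i,y,t) from the slides; the function names and the choice of {1} as start set and {3} as accept set follow the example, and everything else is an illustrative assumption, not the authors' implementation.

```python
# Extended transition table T(x, i, y, t) of the tagged NFA for (a*)aa.
TRANSITIONS = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]
START, ACCEPT = {1}, {3}

def frontiers(text):
    """Return the frontier before the first symbol and after every symbol."""
    frontier = set(START)
    history = [frontier]
    for symbol in text:
        # Follow every transition that leaves the frontier on this symbol.
        frontier = {y for (x, i, y, _t) in TRANSITIONS
                    if x in frontier and i == symbol}
        history.append(frontier)
    return history

def matches(text):
    """The input matches if the final frontier contains an accept state."""
    return bool(frontiers(text)[-1] & ACCEPT)

print(frontiers("aaaa"))
print(matches("aaaa"))
```

Each frontier update scans the transition table once per input symbol, which is the per-symbol cost that the Boolean-function formulation later replaces with a single formula.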

Submatch Extraction

States: 1, 2, 3 (state 3 is the accept state)

Input symbols:         a      a        a        a
Frontier:         {1}    {1,2}  {1,2,3}  {1,2,3}  {1,2,3}
Tags on 1 → 1:         {t1}   {t1}     {t1}     {t1}

Any path from an accept state to a start state generates a valid assignment of submatches.

$1 = aa
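A minimal sketch of this backward traversal, assuming the frontiers of the forward pass are kept in memory: it walks from an accept state back to a start state, records which input positions were consumed by a transition that outputs the tag, and returns the last consecutive run of tagged characters. The helper name `extract_submatch` and the choice of the first suitable predecessor in table order are assumptions made for illustration.

```python
TRANSITIONS = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]
START, ACCEPT = {1}, {3}

def extract_submatch(text, tag="t1"):
    """Return the submatch for `tag`, or None if the input does not match."""
    # Forward pass: remember the frontier before each symbol.
    frontier = set(START)
    history = [frontier]
    for symbol in text:
        frontier = {y for (x, i, y, _t) in TRANSITIONS
                    if x in frontier and i == symbol}
        history.append(frontier)
    accepting = history[-1] & ACCEPT
    if not accepting:
        return None
    # Backward pass: pick one predecessor per symbol, starting at an accept state.
    current = min(accepting)
    tagged = [False] * len(text)
    for pos in range(len(text) - 1, -1, -1):
        prev, tags = next((x, t) for (x, i, y, t) in TRANSITIONS
                          if y == current and i == text[pos] and x in history[pos])
        tagged[pos] = tag in tags
        current = prev  # rename previous state as current state and continue
    # The submatch is the last consecutive run of tagged characters.
    end = len(text)
    while end > 0 and not tagged[end - 1]:
        end -= 1
    begin = end
    while begin > 0 and tagged[begin - 1]:
        begin -= 1
    return text[begin:end]

print(extract_submatch("aaaa"))  # aa
```

A predecessor always exists at each step because every state in a frontier was reached from some state in the frontier before it.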

Complexity of Tagged NFAs

Match test: O(nl)

Submatch extraction: O(nl)

n – size of tagged NFA

l – length of input string

Can we make the operations faster?

Submatch-OBDD

• Representing tagged NFAs using Boolean functions
  – Updating frontiers in one step using a single Boolean formula

• Using OBDDs to manipulate Boolean functions

Transitions as Boolean Functions

RegExp: (a*)aa

Current state (x) | Input symbol (i) | Next state (y) | Output tag (t)
1                 | a                | 1              | {t1}
1                 | a                | 2              | {}
2                 | a                | 3              | {}

T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})

Match Test using Boolean Functions

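Input: aaaa. Each step conjoins the current states, the input symbol, and the transition table T(x,i,y,t). The result is the set of intermediate transitions, and their next states become the current states of the following step.

Step 1, start states {1}, input symbol a:
  {1} ∧ a ∧ T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {})
  Next states: {1,2}

Step 2, current states {1,2}, input symbol a:
  {1,2} ∧ a ∧ T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
  Next states: {1,2,3}

Step 3, current states {1,2,3}, input symbol a (the fourth symbol repeats this step):
  {1,2,3} ∧ a ∧ T(x,i,y,t) = (1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})
  Next states: {1,2,3}

Accept: the final set of states contains state 3.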

Submatch Extraction using Boolean Functions

Start from the last symbol, going backwards. Input: aaaa.

Intermediate transitions [4], with the last input symbol a and accept state 3:
  a ∧ 3 ∧ [(1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})] = 2 ∧ a ∧ 3 ∧ {}
  Previous state of 3: 2. No output submatch tag.

Rename previous state as current state and continue.

Intermediate transitions [3], with input symbol a and current state 2:
  a ∧ 2 ∧ [(1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})] = 1 ∧ a ∧ 2 ∧ {}
  Previous state of 2: 1. No output submatch tag.

Submatch Extraction using Boolean Functions

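Intermediate transitions [2], with input symbol a and current state 1:
  a ∧ 1 ∧ [(1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {}) ∨ (2 ∧ a ∧ 3 ∧ {})] = 1 ∧ a ∧ 1 ∧ t1
  Previous state of 1: 1. Output submatch tag t1.

Intermediate transitions [1], with input symbol a and current state 1:
  a ∧ 1 ∧ [(1 ∧ a ∧ 1 ∧ t1) ∨ (1 ∧ a ∧ 2 ∧ {})] = 1 ∧ a ∧ 1 ∧ t1
  Previous state of 1: 1. Output submatch tag t1.

Tags over the input aaaa: t1 t1 on the first two symbols, so $1 = aa.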

More Formal: Match Test

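Finding new frontiers after processing an input symbol:

$$\text{Next frontiers} = \mathrm{Map}_{y \to x}\bigl(\exists\, x, i, t.\; \mathrm{InputSymbol}(i) \wedge \mathrm{Frontier}(x) \wedge \mathrm{TransFunction}(x, i, y, t)\bigr)$$

Checking acceptance:

$$\mathrm{SAT}\bigl(\mathrm{AcceptStates}(x) \wedge \mathrm{Frontier}(x)\bigr)$$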

More Formal: Submatch Extraction

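$$\mathrm{OneReverseTransition} = \mathrm{PickOne}\bigl(\mathrm{CurrentState}(y) \wedge \mathrm{InputSymbol}(i) \wedge \mathrm{IntermTransitions}(x, i, y, t)\bigr)$$

$$\mathrm{previousState} = \mathrm{Map}_{x \to y}\bigl(\exists\, i, y, t.\; \mathrm{OneReverseTransition}\bigr)$$

$$\mathrm{OutputTag} = \exists\, x, i, y.\; \mathrm{OneReverseTransition}$$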

Submatch extraction: the submatch for tag t_i is the last consecutive sequence of input characters that are tagged with t_i.

A backward traversal approach: start from the last input symbol.

Submatch-OBDD

• Representation of tagged NFAs, match test, and submatch extraction using OBDDs

• OBDD representations for
  – Transitions with submatch tags
  – Intermediate transitions
  – Submatch tags
  – Set of start states
  – Set of accept states
  – Set of frontiers
  – Input symbols
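To make the OBDD representation concrete, here is a toy reduced OBDD written from scratch, used to update the frontier for (a*)aa with a single Boolean formula. It is a sketch under stated assumptions, not the authors' CUDD-based code: the bit encoding (two bits per state, interleaved x and y variables, one bit for the symbol, one for tag t1) and all function names are hypothetical.

```python
from functools import lru_cache, reduce

# A node is a terminal (False/True) or a tuple (var, low, high).
# Variables are integers; a smaller integer is tested closer to the root.

def mk(var, low, high):
    """Reduced node: drop the test when both branches are the same function."""
    return low if low == high else (var, low, high)

def literal(var, positive=True):
    """OBDD of the literal var (positive=True) or its negation."""
    return mk(var, not positive, positive)

def top(u):
    return u[0] if isinstance(u, tuple) else float("inf")

def cofactors(u, var):
    """Restrict u to var = 0 and var = 1 (var is at or above the root of u)."""
    return (u[1], u[2]) if isinstance(u, tuple) and u[0] == var else (u, u)

@lru_cache(maxsize=None)
def apply(op, u, w):
    """Combine two OBDDs with the binary operator "and" or "or"."""
    if not isinstance(u, tuple) and not isinstance(w, tuple):
        return (u and w) if op == "and" else (u or w)
    var = min(top(u), top(w))
    (u0, u1), (w0, w1) = cofactors(u, var), cofactors(w, var)
    return mk(var, apply(op, u0, w0), apply(op, u1, w1))

def conj(*nodes):
    return reduce(lambda a, b: apply("and", a, b), nodes, True)

def disj(*nodes):
    return reduce(lambda a, b: apply("or", a, b), nodes, False)

def exists(u, variables):
    """Existentially quantify the given variables out of u."""
    if not isinstance(u, tuple):
        return u
    low, high = exists(u[1], variables), exists(u[2], variables)
    return disj(low, high) if u[0] in variables else mk(u[0], low, high)

def rename(u, mapping):
    """Rename variables; the mapping must preserve the variable order."""
    if not isinstance(u, tuple):
        return u
    return mk(mapping[u[0]], rename(u[1], mapping), rename(u[2], mapping))

# Assumed encoding: state bits interleaved as x1 < y1 < x0 < y0,
# then one bit for the input symbol (1 = 'a') and one bit for tag t1.
X, Y, I, TAG = (0, 2), (1, 3), 4, 5

def encode(bits, value):
    """OBDD asserting that the variables in `bits` spell `value` (high bit first)."""
    width = len(bits)
    return conj(*(literal(v, bool(value >> (width - 1 - k) & 1))
                  for k, v in enumerate(bits)))

# Transition relation T(x, i, y, t) of the tagged NFA for (a*)aa.
T = disj(*(conj(encode(X, x), literal(I), encode(Y, y), literal(TAG, tagged))
           for x, y, tagged in [(1, 1, True), (1, 2, False), (2, 3, False)]))
ACCEPT = encode(X, 3)

def step(frontier):
    """Next frontier: Map y->x of (exists x,i,t. InputSymbol(i) and Frontier(x) and T)."""
    interm = conj(literal(I), frontier, T)  # intermediate transitions
    return rename(exists(interm, {*X, I, TAG}), dict(zip(Y, X)))

def accepts(frontier):
    """SAT check: a reduced OBDD is satisfiable unless it is the False terminal."""
    return conj(ACCEPT, frontier) is not False

frontier = encode(X, 1)  # start states {1}
for _ in "aaaa":
    frontier = step(frontier)
print(accepts(frontier))
```

Because reduced OBDDs over a fixed variable order are canonical, two frontiers can be compared with plain equality, and the satisfiability check is a comparison against the False terminal. A production library such as CUDD adds shared node tables, complement edges, and dynamic reordering that this sketch omits.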

Implementation

RegExps → RE2TNFA → Tagged NFAs → TNFA2OBDD → OBDDs → PATTERNMATCH

Input strings / network traffic → PATTERNMATCH → "Matched at reg#; Submatches $1 = …, $2 = …" or "No match"

Toolchain in C++, interfacing with CUDD*

*CUDD is a package for manipulation of Binary Decision Diagrams

Feasibility Study

• Data sets
  – Snort-2009
    • RegExps: 115 regexps with capturing groups from HTTP rules
    • Traces
      – 1.2GB department network traffic (average packet size 126 bytes)
      – 1.3GB Twitter traffic (average packet size 1202 bytes)
      – 1MB synthetic trace (average string length 311 bytes)
  – Snort-2012
    • RegExps: 403 regexps with capturing groups from HTTP rules
    • Traces
      – 1.2GB department network traffic (average packet size 126 bytes)
      – 1.3GB Twitter traffic (average packet size 1202 bytes)
      – 1MB synthetic trace (average string length 689 bytes)
  – Firewall-504
    • RegExps: 504 patterns from a commercial firewall F
    • Trace: 87MB of firewall logs (average line size 87 bytes)

Experimental Setup

• Platform: Intel Core2 Duo E7500, Linux-2.6.3, 2GB RAM

• Two configurations on pattern matching
  – Conf. S
    • Patterns compiled individually
    • Compiled patterns matched sequentially against input traces
  – Conf. C
    • Patterns combined with UNION and compiled
    • Combined pattern matched against input traces
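The difference between the two configurations can be illustrated with Python's `re` module. This is a sketch only: the two patterns are hypothetical stand-ins for the Snort and firewall rule sets, Python's engine is a backtracking matcher and not one of the systems evaluated here, and the helper names are made up for this example.

```python
import re

# Hypothetical stand-in patterns; the real rule sets are not reproduced here.
PATTERNS = [r"username=(\w+)", r"hostname=(\w+)"]

# Conf. S: compile each pattern on its own and try them one after another.
compiled_individually = [re.compile(p) for p in PATTERNS]

def match_sequential(line):
    """Return (pattern index, submatches) for the first pattern that matches."""
    for index, regex in enumerate(compiled_individually):
        m = regex.search(line)
        if m:
            return index, m.groups()
    return None

# Conf. C: combine all patterns with UNION (alternation) and compile once.
# Each alternative gets a named group so the matching pattern can be identified.
combined = re.compile("|".join(f"(?P<p{n}>{p})" for n, p in enumerate(PATTERNS)))

def match_combined(line):
    """Return (pattern index, matched text) for the alternative that matched."""
    m = combined.search(line)
    if not m:
        return None
    index = next(n for n in range(len(PATTERNS)) if m.group(f"p{n}") is not None)
    return index, m.group(f"p{index}")

print(match_sequential("username=Bob"))
print(match_combined("hostname=Foo"))
```

In Conf. S the input is scanned once per pattern, while in Conf. C it is scanned once against the single combined pattern, which is why the cost of the combined automaton matters so much in the measurements. Note that the two helpers can disagree on inputs where several patterns match, since the combined pattern reports the leftmost match in the text.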

Performance

Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2009 data set

Performance

Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Snort-2012 data set

Performance

Execution time (cycles/byte) and memory consumption (MB) of RE2, PCRE, and Submatch-OBDD for the Firewall-504 data set

Related Work

• NFA-OBDD [Yang et al., RAID’10; Chasaki and Wolf, ANCS’10]

• RE2 [Cox, code.google.com/p/re2]

• PCRE [www.pcre.org]

• TNFA [Laurikari et al., SPIRE’00]

• MDFA [Yu et al., ANCS’06]

• Hybrid FA [Becchi and Crowley, CoNEXT’07]

• XFA [Smith et al., Oakland’08]

• More – see paper for details

Conclusion

• A novel way of annotating capturing groups

• Submatch-OBDD: a novel technique for submatch extraction using OBDDs

• Feasibility study
  – Submatch-OBDD achieves ideal performance when patterns are combined
  – Faster than RE2 and PCRE when patterns are combined
