This paper is included in the Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI '19), February 26–28, 2019, Boston, MA, USA. ISBN 978-1-931971-49-2. Open access to the Proceedings of NSDI '19 is sponsored by

Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs
Xiang Wang, Yang Hong, and Harry Chang, Intel; KyoungSoo Park, KAIST; Geoff Langdale, branchfree.org; Jiayu Hu and Heqing Zhu, Intel
https://www.usenix.org/conference/nsdi19/presentation/wang-xiang



Hyperscan: A Fast Multi-pattern Regex Matcher for Modern CPUs

Xiang Wang, Yang Hong, Harry Chang, KyoungSoo Park∗, Geoff Langdale†, Jiayu Hu, and Heqing Zhu

Intel Corporation, ∗KAIST, †branchfree.org

Abstract

Regular expression matching serves as a key functionality of modern network security applications. Unfortunately, it often becomes the performance bottleneck, as it involves a compute-intensive scan of every byte of packet payload. With trends towards increasing network bandwidth and large rulesets of complex patterns, the performance requirement gets ever more demanding.

In this paper, we present Hyperscan, a high-performance regular expression matcher for commodity server machines. Hyperscan employs two core techniques for efficient pattern matching. First, it exploits graph decomposition that translates regular expression matching into a series of string and finite automata matching. Unlike existing solutions, string matching becomes a part of regular expression matching, eliminating duplicate operations. Decomposed regular expression components also increase the chance of fast DFA matching, as they tend to be smaller than the original pattern. Second, Hyperscan accelerates both string and finite automata matching using SIMD operations, which brings substantial throughput improvement. Our evaluation shows that Hyperscan improves the performance of Snort by a factor of 8.7 for a real traffic trace.

1 Introduction

Deep packet inspection (DPI) provides the fundamental functionality for many middlebox applications that deal with L7 protocols, such as intrusion detection systems (IDS) [9, 10, 28], application identification systems [4], and web application firewalls (WAFs) [3]. Today's DPI employs regular expressions (regexes) as a standard tool for pattern description, as they flexibly represent various attack signatures in a concise form. Not surprisingly, numerous research works [16, 18, 32, 38, 39, 41, 42] have proposed efficient regex matching, as its performance often dominates that of an entire DPI application.

Despite continued efforts, the performance of regex matching on a commodity server still remains impractical for directly serving today's large network bandwidth. Instead, the de-facto best practice of high-performance DPI generally employs multi-string pattern matching as a pre-condition for expensive regex matching. This hybrid approach (or prefiltering) is attractive, as multi-string matching is known to outperform multi-regex matching by two orders of magnitude [21], and most input traffic is innocent, making it more efficient to defer a rigorous check. For example, popular IDSes like Snort [9] and Suricata [10] specify a string pattern per regex for prefiltering and launch the corresponding regex matching only if the string is found in the input stream.

However, the current prefilter-based matching has a number of limitations. First, string keywords are often defined manually by humans.¹ Manual choice does not scale as the ruleset expands over time, and improper keywords would waste CPU cycles on redundant regex matching. Second, string matching and regex matching are executed as two separate tasks, with the former leveraged only as a trigger for the latter. This results in duplicate matching of the string keywords when the corresponding regex matching is executed. Third, current regex matching typically translates an entire regex into a single finite automaton (FA). If the number of deterministic finite automaton (DFA) states becomes too large, one must resort to a slower non-deterministic finite automaton (NFA) for matching of the whole regex.

In this paper, we present Hyperscan, a high-performance regex matching system that exploits regex decomposition as the first principle. Regex decomposition splits a regex pattern into a series of disjoint string and FA components.² This translates regex matching into a sequence

¹The content option in Snort and Suricata is determined by humans with domain knowledge.

²We refer to a subregex that contains regex meta-characters or quantifiers, which has to be translated into either a DFA or an NFA for matching, as an FA component.

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 631

of decomposed subregex matching, whose execution and matching order is controlled by fast string matching. This design brings a number of benefits. First, our regex decomposition identifies string components automatically by performing rigorous structural analyses on the NFA graph of a regex. Our algorithm ensures that the extracted strings are a pre-requisite for the rest of regex matching. Second, string matching is run as a part of regex matching rather than being employed only as a trigger. Unlike the prefilter-based design, Hyperscan keeps track of the state of string matching throughout regex matching and avoids any redundant operations. Third, FA component matching is executed only when all relevant string and FA components are matched. This eliminates unnecessary FA component matching, which allows efficient CPU utilization. Finally, most decomposed FA components tend to be small, so they are more likely to be convertible to a DFA and benefit from fast DFA matching.

Beyond the benefits of regex decomposition, Hyperscan also brings a significant performance boost with single-instruction-multiple-data (SIMD)-accelerated pattern matching algorithms. For string matching, we extend the shift-or algorithm [13] to support efficient multi-string matching with bit-level SIMD operations. For FA matching, we represent a state with a bit position, while we implement state transitions and successor state-set calculation with SIMD instructions on a large bitmap. We find that our SIMD-accelerated string matching outperforms state-of-the-art multi-string matching by a factor of 1.3 to 2.5. We also find that our SIMD-accelerated regex matching achieves a 2.48x to 4.01x performance improvement over PCRE [6], widely adopted by DPI middleboxes such as Snort and Suricata.

In summary, we make the following contributions:

• We present a novel regex matching strategy that exploits regex decomposition. Regex decomposition performs rigorous graph analysis algorithms that extract key strings of a regex for efficient matching, and drives the order of pattern matching by fast string matching. This drastically improves the performance.

• We develop SIMD-accelerated pattern matching algorithms for both string matching and FA matching to leverage the CPU's compute capability on data parallelism.

• Our evaluation shows that Hyperscan greatly helps improve the performance of real-world DPI applications. It improves the performance of Snort by 8.7x for a real traffic trace.

• We share our experience with developing Hyperscan and present lessons learned through commercialization.



Figure 1: Glushkov NFA for (abc|def).*ghi

2 Background and Motivation

DPI is a common functionality in many security middleboxes, and its performance has been mainly driven by that of regex matching [19, 41]. There has been a large body of research that improves the performance of regex matching. Due to space constraints, we briefly review only a few works, categorizing them by their approach.

String matching is a subset of regex matching which requires specialized algorithms [12, 24, 29] to achieve high performance. The most popular one is the Aho-Corasick (AC) algorithm [12], which uses a variant of DFA for fast multi-string matching. It runs in O(n) time complexity, where n is the number of input bytes. Unfortunately, AC suffers from frequent cache misses due to its large memory footprint and random memory access pattern, which significantly impairs the performance. In addition, the model of processing one byte at a time creates a sequential data dependency that stalls the instruction pipelines of modern processors. DFC [21] employs a set of small bitmap filters that quickly pass innocent traffic by checking the first few bytes of string patterns against the input stream. Each matched input moves onto a verification stage for full pattern comparison. DFC substantially reduces memory accesses and cache misses by using small and cache-friendly data structures, which outperforms AC by 2 to 3.6 times. The string matcher of Hyperscan takes a two-stage matching approach similar to DFC, but its bucket-based shift-or algorithm benefits from SIMD instructions, which further improves the performance beyond that of DFC.

An NFA implements a space-efficient state machine even for complex regexes. Despite its small memory footprint, the execution is typically slow, as each input character triggers O(m) memory lookups (m = number of current states). For this reason, a DFA is preferred to an NFA whenever a regex can be translated into the former. One place where an NFA might be preferred is a logic-based design that maps automata to hardware accelerators such as FPGAs [14, 22, 23, 34, 35, 40]. An FPGA-based design can exploit parallelism by running multiple finite automata simultaneously, and does not suffer from sequential state transition table lookups. On the down side, it is limited to a small ruleset due to its hardware constraints. Also, it


suffers from the DMA overhead of moving data between the CPU and the FPGA device. This overhead can impose prohibitive latency, especially when input data is not large (as would be the case for scanning of small packets).

We use the Glushkov NFA [27] for Hyperscan, which is widely used due to its two useful properties. First, it does not have epsilon transitions, which simplifies the state transition implementation. Second, all transitions into a given state are triggered by the same input symbol. Figure 1 shows an example of a Glushkov NFA. Each circle represents a state whose id is shown as a number, and each character represents the input symbol by which any previous state transitions into this state. For example, one can transition into state 8 from state 3, 6, or 7 only for an input symbol 'g'. The second property implies that the total number of states is bounded by the number of input symbols plus a few special states (e.g., the start state).

A DFA achieves high performance, as it runs in O(1) per input character. Its main disadvantage, however, is a large memory footprint and a potential of state explosion when transforming an NFA into a DFA. Thus, most works on DFAs focus on memory footprint reduction [15, 17, 18, 20, 26, 31, 32, 33]. D2FA [32] compresses the memory space by sharing multiple transitions of states with a similar transition table and by establishing a default transition between them. A-DFA [18] presents useful features such as alphabet reduction that classifies alphabets into smaller groups, indirect addressing that reduces memory bandwidth by translating unique state identifiers to memory addresses, and a multi-stride structure that processes multiple characters at a time.

An extended FA is a proposal that restructures the state-of-the-art FA to address state explosion. XFA [38, 39] associates update functions with states and transitions by having a scratch memory that compresses the space. HFA [16] presents a hybrid-FA that achieves comparable space to that of an NFA by making head DFAs and trailing NFAs or DFAs. The theory behind it is to discover boundary states so that one can conduct a partial conversion of an NFA to a DFA, avoiding the exponential state explosion of a full conversion.

Prefilter-based approaches are the most popular way to scale the performance of regex matching in practice. Both Snort and Suricata extract string keywords from regex rules and perform unified multi-string matching with the Aho-Corasick algorithm. Expensive regex matching is only needed if AC detects literal strings in the input. SplitScreen [36] applies a similar approach to ClamAV [30], a widely-used anti-malware application, and achieves a 2x speedup compared to the original ClamAV.
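The two Glushkov properties can be seen in a direct set-of-states simulation of the NFA in Figure 1. The sketch below hand-codes that NFA (state ids and symbols are transcribed from the figure; the function and table names are our own illustrative choices, not Hyperscan internals):

```python
# Glushkov NFA for (abc|def).*ghi, as in Figure 1. State 0 is the start;
# every other state carries exactly one symbol ('.' matches any byte).
SYM = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f',
       7: '.', 8: 'g', 9: 'h', 10: 'i'}
SUCC = {0: {1, 4}, 1: {2}, 2: {3}, 3: {7, 8}, 4: {5}, 5: {6},
        6: {7, 8}, 7: {7, 8}, 8: {9}, 9: {10}, 10: set()}
ACCEPT = {10}

def glushkov_match(text):
    """Set-of-states simulation; no epsilon closure is ever needed."""
    cur = {0}
    for ch in text:
        # Key Glushkov property: every transition into state q is labeled
        # SYM[q], so only the *target* state's symbol has to be tested.
        cur = {q for s in cur for q in SUCC[s]
               if SYM[q] == ch or SYM[q] == '.'}
        if not cur:
            return False
    return bool(cur & ACCEPT)
```

For instance, `glushkov_match("defXXghi")` succeeds through states 4-5-6 and the self-looping dot state 7, while `glushkov_match("abghi")` dies as soon as no successor of state 2 is labeled 'g'.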


Figure 2: Typical two-operand SIMD operation

A SIMD instruction executes the same operation on multiple data in parallel. As shown in Figure 2, a SIMD operation is performed on multiple lanes of two SIMD registers independently, and the results are stored in a third register. Modern CPUs support a number of SIMD instructions that work on specialized vector registers (SSE, AVX, etc.). The latest AVX512 instructions support up to 512-bit operations simultaneously.

Despite their great potential, few research works have exploited SIMD instructions for regex matching. Sitaridi et al. propose a SIMD-based regex matching design [37] for databases, which uses a gather instruction to traverse a DFA for multiple inputs simultaneously. However, it cannot be applied to our case, as regex matching for DPI is typically performed on a single input stream.

Summary and our approach: Most prior works on regex matching attempt to build a special FA that performs as well as a DFA while keeping a memory footprint as small as an NFA's. However, one common problem in all these works is that FA restructuring inevitably imposes extra performance overhead compared to the original DFA. For example, XFA takes multiple tens to hundreds of CPU cycles per input byte, which is slower than a normal DFA by one or two orders of magnitude. In contrast, the prefilter-based approach looks attractive, as it benefits from multi-string matching most of the time, which is faster than multi-regex matching by a few orders of magnitude. However, it is still suboptimal, as it must perform duplicate string matching during regex matching, and a wrong choice of string patterns would trigger redundant regex matching (as shown in Section 6.2). To avoid this inefficiency, we take a fresh approach that divides a regex pattern into multiple components and leverages fast string matching to coordinate the order of component matching. This minimizes the waste of CPU cycles on redundant matching and thus improves the performance. In addition, we develop our own multi-string matching and FA matching algorithms, carefully tailored to exploit SIMD operations.

3 Regular Expression Decomposition

In this section, we present the concept of regex decomposition and explain how Hyperscan matches a sequence of regex components against the input. Then we introduce


graph-based decomposition, whose graph analysis techniques reliably identify the strings in regex patterns most desirable for string matching.

3.1 Matching with Regex Decomposition

The key idea of Hyperscan is to decompose each regex pattern into a disjoint set of string and subregex (or FA) components and to match each component until it finds a complete match. A string component consists of a stream of literals (or input symbols). Subregex components are the remaining parts of a regex after all string components are removed. They may include one or more meta-characters or quantifiers in regex syntax (like '^', '$', '*', '?', etc.) that need to be translated into an FA for matching. Thus, we refer to them as FA components.

Linear regex: We start with a simple regex where each component is concatenated without alternation. We call it a linear regex. Formally, a linear regex pattern that contains at least one string can be represented with the following production rules:

1. regex → left str FA
2. left → left str FA | FA

where str and FA are both indivisible components, and FA can be empty. A linear regex without any string is implemented as a single DFA or NFA. In practice, however, we find that 87% to 94% of the regex rules in IDSes have at least one extractable string, so a majority of real-world regexes would benefit from decomposition. The production rules imply that if we find the rightmost string in a linear regex pattern, we can recursively apply the same algorithm to decompose the rest of the pattern. One complication lies in a subregex with a repetition operator, such as (R)?, (R)*, (R)+, and (R){m,n}, where R is arbitrarily complex. Hyperscan treats (R)? and (R)* as a single FA since R is optional, while it converts (R)+ = (R)(R)* and (R){m,n} = (R)···(R)(R){0,n−m} (where (R) appears m times). Then it decomposes their prefixes and treats the suffix as an FA.
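The repetition rewrites above can be sanity-checked with an off-the-shelf regex engine. The short sketch below uses Python's re module to confirm, on sample inputs, that (R)+ and (R){m,n} match the same strings as their rewritten forms (pattern strings and names are ours; this checks the identity itself, not Hyperscan's converter):

```python
import re

# (R)+     == (R)(R)*
# (R){m,n} == m copies of (R) followed by (R){0,n-m}; here m=2, n=4.
R = "ab"
orig_plus = f"({R})+"
rewr_plus = f"({R})({R})*"
orig_rep = f"({R}){{2,4}}"              # (R){2,4}
rewr_rep = f"({R})({R})({R}){{0,2}}"    # (R)(R)(R){0,2}

samples = ["", "ab", "abab", "ababab", "abababab", "ababababab", "abx"]
for s in samples:
    assert bool(re.fullmatch(orig_plus, s)) == bool(re.fullmatch(rewr_plus, s))
    assert bool(re.fullmatch(orig_rep, s)) == bool(re.fullmatch(rewr_rep, s))
```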

In general, a decomposable linear regex can be expressed as FAn strn FAn−1 strn−1 ··· str2 FA1 str1 FA0. For any successful match of the original regex, all strings must be matched in the same order as they appear. Based on this observation, Hyperscan applies the following three rules for regex matching:

1. String matching is the first step. It scans the entire input to find all strs. Each successful match of a str may trigger the execution of its neighboring FA matching.

2. Each FA has its own switch for execution. It is off by default, except for the leftmost FA component.

3. For a generalized form like left FA str right, where left or right is an arbitrary set of decomposed components (possibly empty): only if all components of left are matched successfully is the switch of FA turned on. Only if str is matched successfully and the FA switch is on is FA matching executed. Finally, only if FA is matched successfully is the leftmost FA of right turned on.

Let's take one example regex, .*foo[^X]barY+,

and consider two input cases. The regex pattern is decomposed into FA2 str2 FA1 str1 FA0, where FA2=.*, str2=foo, FA1=[^X], str1=bar, and FA0=Y+.

• Input=XfooZbarY: This is overall a successful match. First, the string matcher finds str2 (foo) and triggers matching of FA2 (.*) against X, since the leftmost FA switch is always on. Then the switch of FA1 ([^X]) is turned on. After that, the matcher finds str1 (bar), which triggers matching of FA1 against Z, and its success turns on the switch of FA0 (Y+). Since FA0 is the rightmost FA, it is executed against the remaining input Y.

• Input=XfoZbarY: This is overall a failed match. First, the string matcher finds str1 (bar) and sees if it can trigger matching of FA1 ([^X]). It figures out that the switch of FA1 is off, since str2 (foo) was not found and thus neither FA2 nor str2 was a successful match. So the matching of this regex terminates, ensuring no waste of CPU resources.

Our implementation keeps track of the input offsets of matched strings and the state of matching of individual components, which allows invoking the appropriate FA matching with a correct byte range of the input.

Regex with alternation: If a regex includes an alternation like (A|B), we expand the alternation into two regexes only if both A and B can be decomposed into str and FA components (decomposable). If not, (A|B) is treated as a single FA. In case A or B itself has an alternation, we recursively apply the same rule until there is no alternation in their subregexes. Each expanded regex then becomes a linear regex, which benefits from the same rules for decomposition and matching as before.
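The switch-driven control flow of the three rules can be illustrated for this specific example. The toy function below is our own sketch: it handles only the first occurrence of each string and uses Python's re for the FA components, unlike Hyperscan's real engines, but it mirrors the rule that an FA runs only once everything to its left has matched:

```python
import re

# Walk-through of the decomposition of .*foo[^X]barY+ into
#   FA2=".*", str2="foo", FA1="[^X]", str1="bar", FA0="Y+"
def match_decomposed(inp):
    switches = {"FA2": True, "FA1": False, "FA0": False}  # leftmost FA on
    pos = 0
    # Rule 1: string matching drives everything; look for str2, then str1.
    m2 = inp.find("foo", pos)
    if m2 >= 0 and re.fullmatch(".*", inp[:m2]):      # FA2 before foo
        switches["FA1"] = True                        # rule 3: turn on FA1
        pos = m2 + 3
    m1 = inp.find("bar", pos)
    if m1 >= 0:
        # Rules 2/3: FA1 runs only if its switch was turned on by str2/FA2.
        if switches["FA1"] and re.fullmatch("[^X]", inp[pos:m1]):
            switches["FA0"] = True
            pos = m1 + 3
    # The rightmost FA runs against the remaining input when switched on.
    return bool(switches["FA0"] and re.fullmatch("Y+", inp[pos:]))
```

On "XfooZbarY" every switch is turned on in order and the function returns True; on "XfoZbarY" the str1 match of bar finds FA1's switch still off, so the expensive FA work is skipped entirely, as in the paper's second case.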

Pattern matching with regex decomposition presents its main benefit: it minimizes the waste of CPU cycles on unnecessary FA matching, because FA matching is executed only when it is needed. Also, it increases the chance of fast DFA matching, as each decomposed FA is smaller and thus more likely to be converted into a DFA. In contrast, the prefiltering approach has to execute matching of the entire regex even when it is not needed (e.g., matching bar in the example above would trigger regex matching even if foo is not found), and regex matching must re-match the string already found in string


matching. Furthermore, conversion of a whole regex into a single FA is not only more complex but often ends up with a slower NFA to avoid state explosion. In terms of correctness, pattern matching with regex decomposition produces the same result as the original regex, but we leave its formal proof as future work.

3.2 Rationale and Guidelines

In practice, performing regex decomposition on its textual form is often tricky, as some string segments might be hidden behind special regex syntax. We provide several such examples below.

• Character class (or character-set): b[il1]l includes a character class that can be expanded to three strings (bil, bll, and b1l), while naïve textual extraction might find only 'b' and 'l'.

• Alternation: The alternation sequence in (.*\x2d(h|H)(t|T)(t|T)(p|P)) makes it harder to discover http sequences via textual extraction.

• Bounded repeats: From the perspective of text, the strings with a minimum length of 32 are hidden in the bounded repeats of [\x40-\x90]{32,}.

To reliably find these strings, we perform regex decomposition on the Glushkov NFA graph [27], which benefits from structural graph analyses. We describe useful guidelines for finding the strings effective for regex matching.

1. Find a string that would divide the graph into two subgraphs, with the start state in one subgraph and the set of accept states in the other. Matching such a string is a necessary condition for any successful match of the entire regex. If the start and accept states happen to be in the same subgraph, the corresponding FA will always have to run regardless of a string match.

2. Avoid short strings. Short strings are prone to match more frequently and are likely to trigger expensive FA matching that fails.³

3. Expand small character-sets⁴ to multiple strings to facilitate decomposition. This not only increases the chance of successful decomposition but also leads to a longer string if a character-set intercepts a string sequence (i.e., document[\x22\x27]object).

4. Avoid having too many strings. Too many strings would overload the matcher and degrade the overall performance. So it is important to find a small set of good strings effective for regex matching.

³Our current limit is 2 to 3 characters.
⁴Our current implementation treats a character-set that expands to 11 or fewer strings as a small character-set.


Figure 3: Dominant path analysis

3.3 Graph-based String Extraction

We develop three graph analysis techniques that discover strings on the critical path for matching. We describe the key idea of each algorithm below and provide more detailed algorithms in an appendix.

Dominant path analysis: A vertex u is called a dominator of vertex v if every path from the start state to vertex v must go through vertex u. A dominant path of vertex v is defined as a set of vertices W in a graph where each vertex in W is a dominator of v and the vertices form a trace of a single path. Dominant path analysis finds the longest common string that exists in all dominant paths of any accept state. For example, Figure 3 shows the string on the dominant path of the accept state (vertex 11).
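A minimal way to compute dominators, and thus a dominant path, is the classic iterative data-flow algorithm. The sketch below uses a toy graph and labeling of our own (not the graph of Figure 3) to extract the string spelled by the labeled dominators of the accept state:

```python
# Iterative dominator computation on a small Glushkov-style graph.
def dominators(succ, start):
    """Fixed-point data-flow computation of dominator sets."""
    nodes = set(succ)
    preds = {v: set() for v in nodes}
    for u, vs in succ.items():
        for v in vs:
            preds[v].add(u)
    dom = {v: set(nodes) for v in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for v in nodes - {start}:
            new = {v} | (set.intersection(*(dom[p] for p in preds[v]))
                         if preds[v] else set())
            if new != dom[v]:
                dom[v], changed = new, True
    return dom

# Toy graph: 0->1->2->3->4 with a side branch 0->5->2, so vertices 2, 3,
# and 4 dominate the accept state 4 while 1 and 5 do not. The labeled
# dominators spell the extractable string. (State ids happen to follow
# path order here; a real implementation orders them by dominance.)
SUCC = {0: {1, 5}, 1: {2}, 5: {2}, 2: {3}, 3: {4}, 4: set()}
SYM = {2: 'f', 3: 'o', 4: 'o'}
dom4 = dominators(SUCC, 0)[4]
extracted = "".join(SYM[v] for v in sorted(dom4) if v in SYM)
```

Here `dom4` comes out as {0, 2, 3, 4} and `extracted` as "foo": the branch vertices 1 and 5 are excluded, exactly the behavior the first guideline asks for.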

The string selected by the analysis is highly desirable for matching, as it clearly divides the start and accept states into two separate subgraphs, satisfying the first guideline. The algorithm calculates the dominant path per accept state and finds the longest common prefix of all dominant paths. Then it extracts the string on the chosen path. If a vertex on the path is a small character-set, we expand it and obtain multiple strings.

Dominant region analysis: If the dominant path analysis fails to extract a string, we perform dominant region analysis. It finds a region of vertices that partitions the start state into one subgraph and all accept states into the other. More formally, a dominant region is defined as a subset of vertices in a graph such that (a) the set of all edges that enter and exit the region constitutes a cut-set of the graph, (b) for every in-edge (u, v) to the region, there exist edges (u, w) for all w in {w : w is in the region and w has an in-edge}, and (c) for every out-edge (u, v) from the region, there exist edges (w, v) for all w in {w : w is in the region and w has an out-edge}, where (u, v) refers to an edge from vertex u to vertex v in the graph.

If a discovered region consists of only string or small character-set vertices, we transform the region into a set of strings. Since these strings connect two disjoint subgraphs of the original graph, any match of the whole regex must match one of these strings. Figure 4 shows one example of a dominant region with 9 vertices. Vertices 5, 6, and 7 are the entry points with the same predecessors, and vertices

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 635


Figure 4: Dominant region analysis


Figure 5: Network flow analysis

11, 12, and 13 are the exits with the same successor. We can extract the strings foo, bar, and abc as a result of dominant region analysis.

The algorithm for dominant region analysis first creates a directed acyclic graph (DAG) from the original graph to avoid any interference from back edges. Then it performs a topological sort on the DAG and iterates over each vertex to see if it can be added to the current candidate region, i.e., whether its boundary edges form a valid cut-set. We repeat this to discover all regions in the graph. Since we only analyze the DAG, the back edges of the original graph might affect correctness. Thus, for each back edge, if its source and target vertices are in different regions, we merge them (and all intervening regions) into a single region. Finally, we extract the strings from the dominant region.

Network flow analysis: Since dominant path and dominant region analyses depend on a special graph structure, they may not always be successful. Thus, we run network flow analysis for generic graphs. For each edge, the analysis finds a string (or multiple strings) that ends at the edge. Then the edge is assigned a score inversely proportional to the length of the string(s) ending there: the longer the string, the smaller the score. With a score per edge, the analysis runs a "max-flow min-cut" algorithm [25] to find a minimum cut-set that splits the graph into two parts, separating the start state from all accept states. This cut-set consists of the edges that generate the longest strings in the graph.
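The min-cut step can be sketched with a plain Edmonds-Karp max-flow followed by residual-graph reachability. The toy below uses a graph and capacities of our own choosing (the real analysis derives per-edge scores from the strings ending at each edge) and returns the source side of a minimum cut:

```python
from collections import deque

# Edmonds-Karp max flow; the min cut is then the set of nodes still
# reachable from the source in the residual graph. Low capacity = long
# string, so the cut prefers slicing through long-string edges.
def min_cut(cap, s, t):
    flow = {u: dict(vs) for u, vs in cap.items()}
    def bfs():
        par = {s: None}
        q = deque([s])
        while q:
            u = q.popleft()
            for v, c in flow.get(u, {}).items():
                if c > 0 and v not in par:
                    par[v] = u
                    if v == t:
                        return par
                    q.append(v)
        return None
    while (par := bfs()):
        path, v = [], t                       # recover the s->t path
        while par[v] is not None:
            path.append((par[v], v)); v = par[v]
        b = min(flow[u][v] for u, v in path)  # bottleneck capacity
        for u, v in path:                     # push flow, add residuals
            flow[u][v] -= b
            flow.setdefault(v, {})[u] = flow.get(v, {}).get(u, 0) + b
    seen, q = {s}, deque([s])                 # residual reachability
    while q:
        u = q.popleft()
        for v, c in flow.get(u, {}).items():
            if c > 0 and v not in seen:
                seen.add(v); q.append(v)
    return seen

# Toy: the cheap edge a->b (a long string ends there) forms the cut.
cut_side = min_cut({"s": {"a": 10}, "a": {"b": 3}, "b": {"t": 10}}, "s", "t")
```

The edges leaving `cut_side` (here just a->b) are the cut-set; in the analysis those are exactly the edges whose strings get extracted.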

Figure 5 shows a result of network flow analysis, extracting the string set {foo, efgh, abc} that divides the whole graph into two parts.

Effectiveness of graph analysis: Our graph analysis effectively produces good strings for most real-world rules. Table 1 shows that 97.2% to 99.2% of decomposable real-world regex rules benefit from dominant path

Ruleset  Total  Decomp  D-Path  D-Reg  N-flow
S-V       1663    1563    1551     32      16
S-E       7564    6756    6575    100     203
Suri      7430    6501    6318     94     201

Table 1: Effectiveness of graph analysis on real-world rulesets. S-V: Snort Talos (May 2015); S-E: Snort ET-Open 2.9.0; Suri: Suricata ruleset 4.0.4. D-Path, D-Reg, and N-flow refer to dominant path, dominant region, and network flow analysis, respectively. Decomp is the total number of decomposable rules. Note that one regex could benefit from multiple graph analyses, so the sum of the graph analysis columns is larger than the Decomp field.


Figure 6: Two-stage matching with FDR

analysis, while the remaining patterns exploit dominant region and network flow analysis. These strings prove to be highly beneficial for reducing the number of regex matching invocations, as shown in Section 6.2.

4 SIMD-accelerated Pattern Matching

In this section, we present the design of multi-string and FA matching algorithms that leverage the SIMD operations of modern CPUs.

4.1 Multi-string Pattern Matching

We introduce an efficient multi-string matcher called FDR.⁵ The key idea of FDR is to quickly filter out innocent traffic with fast input scanning. As shown in Figure 6, FDR performs extended shift-or matching [13] to find candidate input strings that are likely to match some string pattern. Then it verifies them to confirm an exact match.

Shift-or matching: We first provide a brief background on shift-or matching, which serves as the base algorithm of FDR. The shift-or algorithm finds all occurrences of a string pattern in the input bytestream by performing bitwise shift and or operations, as shown in Figure 7. It uses two data structures: a shift-or mask for each character c in the symbol set (sh-mask('c')) and a state mask (st-mask) for the matching operation. sh-mask('c') zeroes all bits whose bit position corresponds to the byte position of c in the string pattern, while all other bits are set to 1. The bit position in a sh-mask is counted from the rightmost bit, while the byte position in a pattern is counted from the leftmost byte. For example, for a string pattern aphp,

⁵It is named after the 32nd President of the US.



Figure 7: Classical shift-or matching


Figure 8: Example shift-or masks with two patterns in bucket 0. No other buckets contain 'a', 'b', 'c', or 'd' in their patterns.

sh-mask('p') = 11110101, as 'p' appears at the second and the fourth positions in the pattern. If a character is unused in the pattern, all bits of its sh-mask are set to 1. The algorithm keeps a st-mask whose size is equal to the length of a sh-mask. Initially, all bits of the st-mask are set to 1. The algorithm updates the st-mask for each input character 'x' as st-mask = ((st-mask ≪ 1) | sh-mask('x')). For each matching input character, a 0 is propagated to the left by one bit. If the zero bit position reaches the length of the pattern, it indicates that the pattern string is found.
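The classical algorithm described above is short enough to state directly. A minimal sketch (we size the masks to the pattern length rather than the fixed 8 bits shown in Figure 7; a 0 bit still means "matched so far"):

```python
# Classic single-pattern shift-or, as in Figure 7. Bit i of the state
# mask is 0 iff the last i+1 input bytes matched the first i+1 pattern
# bytes, so a 0 at bit m-1 signals a full match.
def shift_or(pattern, text):
    """Yield the 0-based end offset of every occurrence of pattern."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # sh-mask('c') zeroes bit i wherever pattern[i] == c.
    sh = {}
    for i, c in enumerate(pattern):
        sh[c] = sh.get(c, all_ones) & ~(1 << i)
    st = all_ones                        # st-mask starts as all 1s
    for pos, c in enumerate(text):
        st = ((st << 1) | sh.get(c, all_ones)) & all_ones
        if st & (1 << (m - 1)) == 0:     # zero reached pattern length
            yield pos

assert list(shift_or("aphp", "xaphpaphp")) == [4, 8]
```

Each input byte costs one shift and one or, independent of the pattern content, which is what makes the algorithm attractive as a filter.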

The original shift-or algorithm runs fast with high space efficiency, but it has two limitations. First, it supports only a single string pattern. Second, although it consists of bit-level operations, the implementation cannot benefit from SIMD instructions except for very long patterns. We tackle these problems as follows.

Multi-string shift-or matching: To support multi-string patterns, we update the data structures. First, we divide the set of string patterns into n distinct buckets, where each bucket has an id from 0 to n−1. For now, assume that each string pattern belongs to one of the n buckets; we discuss how we divide the patterns later ('pattern grouping'). Second, we increase the size of the sh-masks and st-mask by n times, so that a group of n bits in sh-mask('x') records all the buckets that have 'x' in some of their patterns. More precisely, the k-th n bits of sh-mask('x') encode the ids of all buckets that hold at least one pattern which has 'x' at the k-th byte position. One difference from the original algorithm is that the byte position in a pattern is counted from the rightmost byte. This enables parallel execution of multiple CPU instructions per cycle, as explained later ('SIMD acceleration'). For efficient implementation, we

[Figure 9: FDR's extended shift-or matching. The figure shows the input "...aphp" indexed against sh-mask('a'), sh-mask('p') ≪ 8, sh-mask('h') ≪ 16, and sh-mask('p') ≪ 24, which are ORed into the st-mask, yielding matches at bucket 0 and bucket 4 at input byte position 3.]

set n to 8 so that the byte position in a sh-mask matches the same position in a pattern. This implies that the length of a sh-mask should be no smaller than the longest pattern.

Figure 8 shows an example. Bucket 0 has two string patterns, ab and cd. Since 'a' appears at the second byte of only ab in bucket 0, sh-mask('a') zeros only the first bit (= bucket 0) of the second byte. The i-th bit within each byte of a sh-mask indicates bucket id i-1. If the byte position of a sh-mask exceeds the longest pattern of a certain bucket (called a 'padding byte'), we encode the bucket id in the padding byte. This ensures matching correctness by carrying a match at a lower input byte along in the shift process. Examples of this are shown in the padding bytes in Figure 8 and in the sh-masks in Figure 9 that zero the first bit in the third and fourth bytes.
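The bucketed construction and matching pass can be sketched as follows. This is a deliberately simplified Python model, not FDR itself: it counts positions left-to-right as in the classic algorithm rather than FDR's right-to-left SIMD layout, gives every pattern in a bucket the same length to sidestep the padding-byte handling, and verifies candidates by direct comparison instead of hashing.

```python
# Simplified bucketed shift-or in the spirit of FDR: one byte of bucket
# bits per pattern position (bit b of a byte = bucket b).
BUCKETS = [[b"ab", b"cd"], [b"xyz"]]   # bucket id = list index (max 8)

def build_masks(buckets):
    max_len = max(len(p) for ps in buckets for p in ps)
    width = max_len * 8                # 8 bucket bits per byte position
    all_ones = (1 << width) - 1
    sh = {}
    for b, patterns in enumerate(buckets):
        for p in patterns:
            for i, c in enumerate(p):
                # zero the bit for bucket b at byte position i
                sh[c] = sh.get(c, all_ones) & ~(1 << (i * 8 + b))
    return sh, all_ones

def fdr_match(buckets, text):
    sh, all_ones = build_masks(buckets)
    st = all_ones
    hits = []
    for pos, c in enumerate(text):
        st = ((st << 8) | sh.get(c, all_ones)) & all_ones
        for b, patterns in enumerate(buckets):
            L = len(patterns[0])       # all patterns in a bucket share a length
            if st & (1 << ((L - 1) * 8 + b)) == 0:        # zero bit: candidate
                if text[pos - L + 1 : pos + 1] in patterns:  # verification
                    hits.append((b, pos))
    return hits
```

Running `fdr_match(BUCKETS, b"ad")` produces a candidate in bucket 0 (the cross-pattern false positive described below) that the verification step then rejects.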

The pattern matching process is similar to the original algorithm, except that sh-masks are shifted left instead of the st-mask. The st-mask is initially 0 except for the byte positions smaller than the shortest pattern; this avoids a false-positive match at a position smaller than the shortest pattern. Now we proceed with the input characters. The matcher keeps k, the number of characters processed so far modulo n. For an input character 'x', st-mask |= (sh-mask('x') ≪ (k bytes)). The matcher repeats this for n input characters and checks whether the st-mask has any zero bits. Zero bits represent a possible match at the corresponding bucket. For example, Figure 9 shows that buckets 0 and 4 have a potential match at input byte position 3. The verification stage, illustrated later, checks whether they are a real match or a false positive.

Pattern grouping. The strategy for grouping patterns into each bucket affects matching performance. A good

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 637

strategy would distribute the patterns well such that most innocent traffic would pass with a low false positive rate. Toward this goal, we design our algorithm based on two guidelines. First, we group patterns of a similar length into the same bucket. This minimizes the information loss of longer patterns, as the input characters match only up to the length of the shortest pattern in a bucket for matching correctness. Second, we avoid grouping too many short patterns into one bucket; in general, shorter patterns are more likely to increase false positives. To meet these requirements, we sort the patterns in ascending order of length and assign ids 0 to (s-1) in the sorted order. Then we run the following algorithm, which calculates the minimum cost of grouping the patterns into n buckets using dynamic programming. The algorithm is summarized by the two equations below.

1. t[i][j] = min over k = i .. s-1 of (cost_{i,k} + t[k+1][j-1]), where s is the number of patterns and t[i][j] stores the minimum cost of grouping the patterns i to (s-1) into (j+1) buckets.

2. cost_{i,k} = (k - i + 1)^α / length_i^β, where cost_{i,k} is the cost of grouping patterns i to k into one bucket, length_i is the length of pattern i, and α and β are constant parameters.
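The two equations can be implemented directly with memoization. A sketch, with α = 1.05 and β = 3 as in our implementation (the function and variable names are ours):

```python
from functools import lru_cache

ALPHA, BETA = 1.05, 3.0

def group_patterns(lengths, n_buckets=8):
    """lengths: pattern lengths sorted ascending; returns (cost, ranges)."""
    s = len(lengths)

    def cost(i, k):
        # cost of putting patterns i..k into one bucket: cheaper when the
        # shortest pattern in the bucket (pattern i) is long
        return (k - i + 1) ** ALPHA / lengths[i] ** BETA

    @lru_cache(maxsize=None)
    def t(i, j):
        # minimum cost of grouping patterns i..s-1 into j+1 buckets
        if j == 0:
            return cost(i, s - 1), [(i, s - 1)]
        best = None
        for k in range(i, s - j):      # leave >= j patterns for j buckets
            sub_cost, sub_ranges = t(k + 1, j - 1)
            c = cost(i, k) + sub_cost
            if best is None or c < best[0]:
                best = (c, [(i, k)] + sub_ranges)
        return best

    return t(0, n_buckets - 1)
```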

t[i][j] is calculated as the minimum of the sum of the cost of grouping patterns i to k into one bucket (cost_{i,k}) and the minimum cost of grouping the remaining patterns (k+1) to (s-1) into j buckets (t[k+1][j-1]). cost_{i,k} gets smaller as the bucket has a longer pattern, which allows more patterns in the bucket; it gets larger as the bucket has a shorter pattern, limiting the number of such patterns. Our implementation currently uses α = 1.05 and β = 3 toward this goal, computes t[0][7] to divide all string patterns into 8 buckets, and records the bucket id per pattern in the process. In practice, we find that the algorithm works well, automatically reaching the sweet spot that minimizes the total cost.

Super characters. One problem with bucket-based matching is that it produces false positives with patterns in the same bucket. For example, if a bucket has ab and cd, the algorithm not only matches the correct patterns but also matches the false positives ad and cb. To suppress them, we use an m-bit (m > 8) super character (instead of an 8-bit ASCII character) to build and index the sh-masks. An m-bit super character consists of a normal (8-bit) character in the lower 8 bits and the low-order (m-8) bits of the next character in the upper bits. If it is the last character of a pattern (or of the input), we use a null character (0) as the next character. The key idea is to reflect some information about the next character of a pattern into the sh-mask built for the current character.

Only if the same two characters appear in the input⁶ do we declare a match at that input byte position. This significantly reduces false positives at the cost of slightly larger memory for the sh-masks.
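A sketch of the construction, using m = 12 (the helper name is ours):

```python
# Constructing an m-bit super character: the current byte in the low 8
# bits and the low-order (m-8) bits of the next byte above it, with a
# null byte past the end.
M = 12

def super_char(data: bytes, i: int) -> int:
    nxt = data[i + 1] if i + 1 < len(data) else 0
    return ((nxt & ((1 << (M - 8)) - 1)) << 8) | data[i]
```

For the pattern ab, `super_char(b"ab", 0)` folds the low 4 bits of 'b' above 'a'; the input ad yields a different index at the same offset, so it no longer selects the sh-mask built for ab.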

In practice, m should be between 9 and 15. Let's say m = 12 bits. For a pattern ab, we see two 12-bit super characters: α = ((low-order 4 bits of 'b' ≪ 8) | 'a') and β = 'b'. Then we build sh-masks for α and β, respectively. When the input arrives, we construct a 12-bit super character based on the current input byte offset and use it as an index to fetch its sh-mask. We advance the input one byte at a time, as before. For example, if the input is 'ad', it first constructs γ = ((low-order 4 bits of 'd' ≪ 8) | 'a'), fetches sh-mask(γ), and performs shift and or operations as before. Then it advances to the next byte and constructs δ = 'd'. So the input 'ad' will not match even if a bucket contains ab and cd.

SIMD acceleration. Our implementation of FDR heavily exploits SIMD operations and instruction-level parallelism. First, it uses 128-bit sh-masks so that it can employ 128-bit SIMD instructions (e.g., pslldq for left shift and por for or in the Intel x86-64 instruction set) to update the masks. As shift and or are the most frequent operations in FDR, it enjoys a substantial performance boost from the SIMD instructions. Second, it exploits parallel execution with multiple CPU execution ports. In the original shift-or matching, the execution of shift and or operations is completely serialized, as each depends on the previous result. This under-utilizes a modern CPU even if it can issue multiple instructions per cycle. In contrast, FDR exploits instruction-level parallelism by pre-shifting the sh-masks with multiple input characters in parallel. Note that this is made possible because we count the byte position differently from the original version. The parallel execution effectively increases instructions per cycle (IPC) and significantly improves performance. To accommodate the parallel shifting, we limit the length of a pattern to 8 bytes and extract the lower 8 bytes from any pattern longer than 8 bytes. Because this requires at minimum 8-byte masks and up to 7 bytes of shifting, a 128-bit mask does not lose high-bit information during the left shift. In actual matching, FDR handles 8 bytes of input at a time. To guarantee contiguous matching across 8-byte input boundaries, the st-mask of the previous iteration is shifted right by 8 bytes for the next iteration.

Verification. As our shift-or matching can still generate false positives, we need to verify whether a candidate match is

⁶Of course, there is still a small chance of a false positive as we use partial bits of the next character, but the probability becomes fairly small as the pattern length grows.


Algorithm 1 FDR Multi-string Matcher
1: function MATCH
2:   n = number of bits of a super character
3:   R = startMask
4:   for each 8-byte V ∈ input do
5:     for i ∈ 0..7 do
6:       index = V[i*8 .. i*8+n-1]
7:       M[i] = shiftOrMask[index]
8:       S[i] = LSHIFT(M[i], i)
9:     end for
10:    for i ∈ 0..7 do
11:      R = OR128(R, S[i])
12:    end for
13:    for zero bit b in low 8 bytes of R do
14:      j = the byte position containing bit b
15:      l = length of string in bucket b
16:      h = HASH(V[j-l+1 .. j])
17:      if h is valid then
18:        Perform exact matching for each
19:          string in hash bucket[h]
20:      end if
21:    end for
22:    R = RSHIFT(R, 8)
23:  end for
24: end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. Then we leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find that hashing filters out a large portion of false positives.

4.2 Finite Automata Pattern Matching
Successful string matching often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold,⁷ we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, there can be multiple current states (the current set) that are active at any time. A state transition with an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match is successful if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation as it outperforms traditional NFA representations, which use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. Also, the bit-based NFA leverages SIMD instructions to perform vectorized bit

⁷We use 16384 states as the threshold.

operations to further accelerate performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n-1 in topological order, and maintains the current set as a bit mask (the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are assigned sequentially in topological order, the span of a state transition is typically small. We exploit this fact to compactly represent the transitions below.

The bit-based NFA implements a state transition with an input character 'c' in three steps. First, it calculates the successor set that can be transitioned to from any state in the current set with any input character. Second, it computes the set of all states that can be transitioned to from any state with 'c' (called the reachable states by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, as a Glushkov NFA graph guarantees that one can enter a specific state only with the same character or with the same character set.

The challenge is to efficiently calculate the successor set. One could pre-calculate a successor set for every combination of current states and look up the successor set for the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable except for a small n. An alternative is to keep a successor-set mask per individual state and combine the successor set of every state in the current set. This is more space-efficient, but it costs up to n memory lookups and (n-1) or operations. We implement the latter but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some examples of shift-k masks. The shift-1 mask sets every bit to 1 except for bit 7, since every state except state 7 has a forward transition of span 1.

We divide each state transition into two types: typical or exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set from the current set following typical transitions. If the current-set mask is S, then ((S & shift-k mask) ≪ k) represents all possible successor states reached by transitions of span k from S. If we combine the successor sets for all k, we obtain the successor set reached by all typical transitions.


[Figure 10: NFA representation for (AB|CD)*AFF*. The figure shows the NFA graph with typical transitions (span ≤ 2), exceptional transitions annotated with their spans (-3, -1, +3, +5), the shift-0, shift-1, and shift-2 masks, the exception mask covering states 0, 2, and 4, and the successor masks for states 0, 2, and 4.]

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and any backward transitions.⁸ Any state that has at least one exceptional transition keeps its own successor mask. The successor mask records all states reached by the exceptional transitions of its owner state. All exceptional states are maintained in an exception mask.

The choice of the shift limit affects performance. If it is too large, we would have too many shift-k masks representing rare transitions; if it is too small, we would have to handle many exceptional states. Our current implementation uses 7 after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of ids. States 0, 2, and 4 are highlighted as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.
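Putting the pieces together for this example, the per-character transition can be sketched in Python. The mask values are transcribed from Figure 10 (bit i = state i); the accept set (states 6 and 7) is derived from the regex rather than shown in the figure, and the function names are ours:

```python
# Bit-based transition for the Glushkov NFA of (AB|CD)*AFF*,
# state ids 0-7 assigned in topological order, shift limit = 2.
SHIFT_MSKS = {0: 0b10000000,   # span-0: state 7's F self-loop
              1: 0b01111111,   # states 0-6 each have a span-1 edge
              2: 0b00000000}   # no span-2 edges in this NFA
EX_MSK = 0b00010101            # exceptional states: 0, 2, 4
SUCC_MSKS = {0: 0b00101000,    # 0 -> {3, 5}
             2: 0b00100010,    # 2 -> {1, 5}
             4: 0b00001010}    # 4 -> {1, 3}
REACH = {'A': 0b00100010, 'B': 0b00000100, 'C': 0b00001000,
         'D': 0b00010000, 'F': 0b11000000}
ACCEPT = 0b11000000            # states 6 and 7 accept

def step(S, c):
    succ = 0
    for k in range(3):                       # typical transitions
        succ |= (S & SHIFT_MSKS[k]) << k
    ex = S & EX_MSK                          # exceptional transitions
    while ex:
        low = ex & -ex                       # lowest set bit
        succ |= SUCC_MSKS[low.bit_length() - 1]
        ex &= ex - 1
    return succ & REACH.get(c, 0) & 0xFF     # intersect with reach[c]

def run_nfa(text, start=0b00000001):
    S, matched = start, False
    for c in text:
        S = step(S, c)
        matched |= bool(S & ACCEPT)
    return matched
```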

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). Then it fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise and operation with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state; otherwise, it proceeds with the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exception states.

⁸In our implementation, forward transitions that cross the 64-bit boundary of the state id space (e.g., from an id smaller than 64 to an id larger than 64) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2 Bit-based NFA Matching
1: SH_MSKS[i]: shift-i masks for typical transitions
2: SUCC_MSKS[i]: successor mask for state i
3: EX_MSK: exception mask
4: reach[k]: reachable state set for character k
5: function RUNNFA(S: current active state set)
6:   SUCC_TYP = 0, SUCC_EX = 0
7:   for c ∈ input do
8:     if any state is active in S then
9:       for i = 0 to shiftlimit do
10:        R0 = AND(S, SH_MSKS[i])
11:        R1 = LSHIFT(R0, i)
12:        SUCC_TYP = OR(SUCC_TYP, R1)
13:      end for
14:      S_EX = AND(S, EX_MSK)
15:      for active state s in S_EX do
16:        SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
17:      end for
18:      SUCC = OR(SUCC_TYP, SUCC_EX)
19:      S = AND(SUCC, reach[c])
20:      Report accept states in S
21:    end if
22:  end for
23: end function

Total Size              1 GByte
Number of Packets       818,682
Number of TCP Packets   818,520
Percent of TCP Bytes    97.2%
Percent of HTTP Bytes   92.9%
Average Packet Size     1,265 Bytes

Table 2: HTTP traffic trace from a cloud service provider

Our implementation uses a 128-bit mask (extending it up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough to represent the NFA graph of most regexes.

5 Implementation
Hyperscan consists of compile-time and run-time functions. Compile-time functions include regex-to-NFA graph conversion, graph decomposition, and matching component generation. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:

• Small string-set (<80) matching. This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction, applied over a small number (2-8) of 4-bit regions in the suffix of each string.

• NFA and DFA cyclic state acceleration. Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where


# of regexes   Prefilter     Hyperscan   Reduction
500            2,971,652     645,326     4.6x
1000           93,595,304    714,582     131.0x
1500           110,122,972   791,017     139.2x
2000           139,804,519   780,665     179.1x
2500           156,332,187   857,100     182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration to search for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states or switches off one of the current states.

• Small-size DFA matching. We design a SIMD-based algorithm for small DFAs (< 16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

• Anchored pattern matching. When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the corresponding automata are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching. We design a fast lookaround approach that peeks at inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s{4,5}foo, where R is a complex regex, we can first detect with SIMD checks whether the digit and space character classes have matched and, if not, avoid or defer running a potentially expensive FA associated with R.
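As a scalar illustration of this check (the regex fragment and the offset arithmetic here are our reconstruction; real Hyperscan performs these class tests with SIMD over many positions at once):

```python
# Lookaround sketch for the fragment R\d\s{4,5}foo: before running the
# expensive FA for R, cheaply verify the digit/space classes that must
# immediately precede the trigger string "foo". `end` is the offset just
# past the "foo" reported by the string matcher; names are ours.
def worth_running_fa(buf: bytes, end: int) -> bool:
    for nspace in (4, 5):
        d = end - 3 - nspace - 1          # expected position of the digit
        if d >= 0 and buf[d:d + 1].isdigit() and buf[d + 1:end - 3].isspace():
            return True
    return False
```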

6 Evaluation
In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than manual choice? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup
We use a server machine with an Intel Xeon Platinum 8180 CPU @ 2.50GHz and 48 GB of memory, and compile the

# of regexes   Prefilter    Hyperscan   Reduction
700            699,622      18,164      38.5x
850            7,516,464    19,214      391.2x
1000           17,063,344   32,533      524.5x
1150           17,737,814   34,075      520.6x
1300           25,143,574   36,040      697.7x

Table 4: Regex invocations for Snort's Talos ruleset

code with GCC 5.4. To separate out the impact of network I/O, we evaluate the performance on a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

6.2 Effectiveness of Regex Decomposition
The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace and confirm correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when it is necessary.

6.3 Microbenchmarks
We evaluate the performance of FDR, our multi-string matcher, as well as that of the regex matching of Hyperscan.
Multi-string pattern matching. We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns


Ruleset    PCRE     PCRE2   RE2-s   Hyperscan-s   RE2-m   Hyperscan-m
Talos      694.2    39.4    177.7   17.3          29.0    2.15
ET-Open    1280.0   91.3    469.6   51.6          111.6   13.3

Table 5: Performance comparison with PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1300 regexes) and Suricata ET-Open (2800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure 11: String matching performance with random packets: (a) ET-Open ruleset, (b) Talos ruleset. Each panel plots the throughput (Gbps) of FDR, DFC, and AC across 1k-26k string patterns, with FDR's improvement over DFC (1.1x-3.2x) annotated.]

[Figure 12: String matching performance with a real traffic trace: (a) ET-Open ruleset, (b) Talos ruleset. Each panel plots the throughput (Gbps) of FDR, DFC, and AC across 1k-26k string patterns, with FDR's improvement over DFC (1.3x-2.5x) annotated.]

extracted from the Snort ET-Open and Talos rulesets. Figures 11 and 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate with the Suricata ruleset, but we omit the results here since they exhibit a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, the performance becomes comparable to DFC due to increased cache usage, but it is still much better than AC.
Regex matching. We now evaluate the performance of the regex matching of Hyperscan. We compare the performance with PCRE (v8.41) [6], as it is the most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (i.e.,

matching against one regex at a time), which requires passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1300 regexes from the Snort Talos ruleset and 2800 regexes from the Suricata ruleset.

Table 5 shows the results. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 13.5x faster than RE2-m, and it shows an 18.33x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.

6.4 Real-world DPI Application
We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both on a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort but replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect an even higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons
Hyperscan has been developed since 2008 and was first open-sourced in 2013. It has been successfully adopted by over 40 commercial projects globally, and it is in production use by tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed in C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience of developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan
Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but was also simple for applications to employ.

Version 1.0. The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). It also suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosions if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0. The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax

tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data and with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string-matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings were found. This approach avoids the high risk of the potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.

Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns foo, abc[xyz], and abc[xy]*def|abcz*de[fg] might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern abc[xyz], then provide matches for the two parts

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 643

of the alternation, abc[xy]*def and abcz*de[fg], without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0. (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel to the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.
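The duplicate-match problem described above is easy to reproduce by running the two alternation branches as separate engines, the way a connected-component split would. The sketch below is my own minimal illustration (Python's re module stands in for the component engines, and the input abcdef is my choice because it happens to satisfy both branches at the same end offset); it also shows the seen-set deduplication that ordered matching later made possible.

```python
import re

# Run each branch of abc[xy]*def|abcz*de[fg] as its own "engine",
# as if a large NFA had been split into connected components.
branches = [re.compile(r"abc[xy]*def"), re.compile(r"abcz*de[fg]")]
PATTERN_ID = 7  # both components report the same user-visible pattern id

def scan(text):
    raw = []
    for b in branches:                       # engines run independently
        for m in b.finditer(text):
            raw.append((PATTERN_ID, m.end()))  # (id, end offset)
    return raw

def scan_dedup(text):
    # Ordered matching makes suppressing identical (id, offset) pairs trivial.
    seen = set()
    return [r for r in scan(text) if r not in seen and not seen.add(r)]

print(scan("abcdef"))        # → [(7, 6), (7, 6)]  duplicate match
print(scan_dedup("abcdef"))  # → [(7, 6)]
```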

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0. Version 4.0 was released in October 2013 under an open-source BSD license, to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identified strings separated by various common regex constructs, such as .* or X+ for some character class X. For example, it is frequently the case that strings in regexes are separated by the construct .* (i.e., foo.*bars). This is usually implementable by requiring only that foo is seen before bars (usually, but not always: consider the expression foo.*oars). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

Version 5.0. Version 5.0, the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology for malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns, beyond matching a single pattern. To

support this, the system now allows user-defined AND, OR, and NOT operations among patterns. For example, an AND operation between patterns foobar and teakettle requires that both patterns match in the input before a match is reported. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE, which brings the benefits of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. Hyperscan's lack of support for full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopted solutions. Chimera employs Hyperscan as a fast filter for input and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
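The filter-then-confirm flow can be sketched in a few lines. This is my own illustration, not Chimera's API: Python's re module stands in for the PCRE confirmation step, and the patterns are hypothetical (the back-reference pattern (\w+)-\1 is over-approximated by the automaton-friendly prefilter \w+-\w+).

```python
import re

# The full pattern uses a back-reference, which a pure-automata matcher
# cannot support; the prefilter is a supportable over-approximation of it.
full = re.compile(r"(\w+)-\1")      # confirmation engine (PCRE's role)
prefilter = re.compile(r"\w+-\w+")  # fast filter (Hyperscan's role)

def chimera_scan(text):
    # Run the cheap filter first; invoke the slow engine only on filter hits.
    if not prefilter.search(text):
        return None                  # common case: no expensive work at all
    m = full.search(text)            # rare case: confirm the real match
    return m.group(0) if m else None

print(chimera_scan("xx ping-pong"))  # filter hits, confirmation fails → None
print(chimera_scan("xx ping-ping"))  # → "ping-ping"
```

The design only pays for PCRE-style backtracking on the (hopefully rare) inputs that survive the filter, which is the same bet the original prefilter architecture made, but applied per-pattern and only for constructs Hyperscan cannot express.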

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes) and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimal Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with the additional power of support for large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or at some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series over time were not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited

set of expressions) Hyperscan could be extended to haveenriched semantics to support capturing which would al-low portions of the regexes to lsquocapturersquo parts of the inputthat matched particular parts of the regular expression

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved over time to address this limitation with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher, adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback by the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.

[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333–340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74–82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors and Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248–264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1–53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317–328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient, distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
1:  function DOMINANTPATHANALYSIS(G)
2:    dpath = ∅
3:    for v ∈ accepts do
4:      calculate dominant path p[v] for v
5:      if dpath = ∅ then
6:        dpath = p[v]
7:      else
8:        dpath = common_prefix(dpath, p[v])
9:        if dpath = ∅ then
10:         return null_string
11:       end if
12:     end if
13:     strings = expand_and_extract(dpath)
14:   end for
15:   return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets in the path and extracts the string on the path.
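To make the listing concrete, here is a toy rendition of the idea (my own brute-force sketch; it enumerates all paths rather than using the dominator computations a real implementation would) on a small Glushkov-style graph for the hypothetical regex ab(c|d):

```python
# Toy Glushkov-style graph for the regex ab(c|d): one vertex per literal.
EDGES = {"start": ["a"], "a": ["b"], "b": ["c", "d"], "c": [], "d": []}
LABEL = {"a": "a", "b": "b", "c": "c", "d": "d"}
ACCEPTS = ["c", "d"]

def all_paths(src, dst, path=()):
    path = path + (src,)
    if src == dst:
        yield path
    for nxt in EDGES[src]:
        yield from all_paths(nxt, dst, path)

def dominant_path(v):
    # Vertices present on *every* start->v path, in path order
    # (brute force; production code would use a dominator tree).
    paths = list(all_paths("start", v))
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    return [s for s in paths[0] if s in common]

def common_prefix(p, q):
    out = []
    for x, y in zip(p, q):
        if x != y:
            break
        out.append(x)
    return out

dpath = None
for v in ACCEPTS:
    d = dominant_path(v)
    dpath = d if dpath is None else common_prefix(dpath, d)
string = "".join(LABEL[s] for s in dpath if s != "start")
print(string)  # → "ab": every match of ab(c|d) must contain "ab"
```

Because "ab" lies on every path to every accept state, matching it is a prerequisite for any match of the whole regex, which is exactly the property the extracted string must have.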

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
1:  function DOMINANTREGIONANALYSIS(G)
2:    acyclic_g = build_acyclic(G)
3:    Gt = build_topology_order(acyclic_g)
4:    candidate = {q0}
5:    it = begin(Gt)
6:    while it ≠ end(Gt) do
7:      if isValidCut(candidate) then
8:        setRegion(candidate)
9:        initializeCandidate(candidate)
10:     else
11:       addToCandidate(it)
12:       it = it + 1
13:     end if
14:   end while
15:   setRegion(candidate)
16:   Merge regions connected with back edges
17:   strings = expand_and_extract(regions)
18:   return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
1: function NETWORKFLOWANALYSIS(G)
2:   for edge ∈ E do
3:     strings = find_strings(edge)
4:     scoreEdge(edge, strings)
5:   end for
6:   cuts = MinCut(G)
7:   strings = extract and expand strings from cuts
8:   return strings
9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of the string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
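A small self-contained sketch of this idea follows. It is my own illustration, not Hyperscan's code: an Edmonds-Karp max flow on a hypothetical two-edge graph, with each edge's capacity set to 1000 // len(string) so that the minimum cut prefers to sever the edge carrying the longer string.

```python
from collections import deque

# Each edge carries the string that ends at it; its capacity (score) is
# inversely proportional to the string length. All names are illustrative.
edges = [("s", "a", "x"), ("a", "t", "evilstring")]
cap, strings = {}, {}
for u, v, s in edges:
    cap[(u, v)] = 1000 // len(s)   # score ∝ 1/len(s)
    cap[(v, u)] = 0                # residual edge
    strings[(u, v)] = s

def bfs_path(src, dst):
    # Shortest augmenting path in the residual graph.
    prev, q = {src: None}, deque([src])
    while q:
        u = q.popleft()
        for (a, b), c in cap.items():
            if a == u and c > 0 and b not in prev:
                prev[b] = u
                q.append(b)
    if dst not in prev:
        return None
    path, v = [], dst
    while prev[v] is not None:
        path.append((prev[v], v))
        v = prev[v]
    return path

# Edmonds-Karp max flow; afterwards the min cut separates the vertices
# still reachable from "s" in the residual graph from the rest.
while (p := bfs_path("s", "t")) is not None:
    f = min(cap[e] for e in p)
    for u, v in p:
        cap[(u, v)] -= f
        cap[(v, u)] += f

reachable, changed = {"s"}, True
while changed:
    changed = False
    for (u, v), c in cap.items():
        if u in reachable and c > 0 and v not in reachable:
            reachable.add(v)
            changed = True

cut_strings = {s for (u, v), s in strings.items()
               if u in reachable and v not in reachable}
print(cut_strings)  # the cut severs the long string, not the short one
```

Here the edge carrying "x" costs 1000 to cut while "evilstring" costs only 100, so the minimum cut selects the long, highly selective string, which is the desirable choice for a string matcher.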



Hyperscan A Fast Multi-pattern Regex Matcher for Modern CPUs

Xiang Wang Yang Hong Harry Chang KyoungSoo Parklowast

Geoff Langdale+ Jiayu Hu and Heqing Zhu

Intel Corporation lowastKAIST +branchfreeorg

Abstract

Regular expression matching serves as a key functionalityof modern network security applications Unfortunatelyit often becomes the performance bottleneck as it involvescompute-intensive scan of every byte of packet payloadWith trends towards increasing network bandwidth anda large ruleset of complex patterns the performance re-quirement gets ever more demanding

In this paper we present Hyperscan a high perfor-mance regular expression matcher for commodity servermachines Hyperscan employs two core techniques forefficient pattern matching First it exploits graph de-composition that translates regular expression matchinginto a series of string and finite automata matching Un-like existing solutions string matching becomes a part ofregular expression matching eliminating duplicate opera-tions Decomposed regular expression components alsoincrease the chance of fast DFA matching as they tendto be smaller than the original pattern Second Hyper-scan accelerates both string and finite automata matchingusing SIMD operations which brings substantial through-put improvement Our evaluation shows that Hyperscanimproves the performance of Snort by a factor of 87 fora real traffic trace

1 IntroductionDeep packet inspection (DPI) provides the fundamentalfunctionality for many middlebox applications that dealwith L7 protocols such as intrusion detection systems(IDS) [9 10 28] application identification systems [4]and web application firewalls (WAFs) [3] Todayrsquos DPIemploys regular expression (regex) as a standard toolfor pattern description as it flexibly represents variousattack signatures in a concise form Not surprisinglynumerous research works [16 18 32 38 39 41 42]have proposed efficient regex matching as its performanceoften dominates that of an entire DPI application

Despite continued efforts the performance of regexmatching on a commodity server still remains imprac-tical to directly serve todayrsquos large network bandwidthInstead the de-facto best practice of high-performanceDPI generally employs multi-string pattern matching asa pre-condition for expensive regex matching This hy-brid approach (or prefiltering) is attractive as multi-stringmatching is known to outperform multi-regex matchingby two orders of magnitude [21] and most input trafficis innocent making it more efficient to defer a rigorouscheck For example popular IDSes like Snort [9] andSuricata [10] specify a string pattern per each regex forprefiltering and launch the corresponding regex matchingonly if the string is found in the input stream

However the current prefilter-based matching has anumber of limitations First string keywords are oftendefined manually by humans 1 Manual choice does notscale as the ruleset expands over time and improper key-words would waste CPU cycles on redundant regex match-ing Second string matching and regex matching are ex-ecuted as two separate tasks with the former leveragedonly as a trigger for the latter This results in duplicatematching of the string keywords when the correspondingregex matching is executed Third current regex match-ing typically translates an entire regex into a single finiteautomaton (FA) If the number of deterministic finite au-tomaton (DFA) states becomes too large one must resortto a slower non-deterministic finite automaton (NFA) formatching of the whole regex

In this paper we present Hyperscan a high perfor-mance regex matching system that exploits regex decom-position as the first principle Regex decomposition splitsa regex pattern into a series of disjoint string and FA com-ponents 2 This translates regex matching into a sequence

1The content option in Snort and Suricata are determined by humanswith domain knowledge

2We refer to a subregex that contains regex meta-characters or quan-tifiers which has to be translated into either a DFA or an NFA for

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 631

of decomposed subregex matching whose execution andmatching order is controlled by fast string matching Thisdesign brings a number of benefits First our regex de-composition identifies string components automaticallyby performing rigorous structural analyses on the NFAgraph of a regex Our algorithm ensures that the extractedstrings are pre-requisite for the rest of regex matchingSecond string matching is run as a part of regex match-ing rather than being employed only as a trigger Unlikethe prefilter-based design Hyperscan keeps track of thestate of string matching throughout regex matching andavoids any redundant operations Third FA componentmatching is executed only when all relevant string and FAcomponents are matched This eliminates unnecessary FAcomponent matching which allows efficient CPU utiliza-tion Finally most decomposed FA components tend to besmall so they are more likely to be able to be convertedto a DFA and benefit from fast DFA matching

Beyond the benefits of regex decomposition Hyper-scan also brings a significant performance boost withsingle-instruction-multiple-data (SIMD)-accelerated pat-tern matching algorithms For string matching we extendthe shift-or algorithm [13] to support efficient multi-stringmatching with bit-level SIMD operations For FA match-ing we represent a state with a bit position while weimplement state transitions and successor state-set calcu-lation with SIMD instructions on a large bitmap We findthat our SIMD-accelerated string matching outperformsstate-of-the-art multi-string matching by a factor of 13 to25 We also find that our SIMD-accelerated regex match-ing achieves 248x to 401x performance improvementover PCRE [6] widely adopted by DPI middleboxes suchas Snort and Suricata

In summary we make the following contributions

bull We present a novel regex matching strategy that ex-ploits regex decomposition Regex decomposition per-forms rigorous graph analysis algorithms that extractkey strings of a regex for efficient matching and drivesthe order of pattern matching by fast string matchingThis drastically improves the performance

bull We develop SIMD-accelerated pattern matching algo-rithms for both string matching and FA matching toleverage CPUrsquos compute capability on data parallelism

bull Our evaluation shows Hyperscan greatly helps improvethe performance of real-world DPI applications Itimproves the performance of Snort by 87x for a realtraffic trace

bull We share our experience with developing Hyperscanand present lessons learned through commercialization

matching as an FA component

1a

2b

3c 7 8

g9h

5e

6f

04d

10i

Figure 1 Glushkov NFA for (abc|def)ghi

2 Background and Motivation

DPI is a common functionality in many security middle-boxes and its performance has been mainly driven bythat of regex matching [19 41] There has been a largebody of research that improves the performance of regexmatching Due to space constraint we briefly review onlya few categorizing them by their approachString matching is a subset of regex matching which re-quires specialized algorithms [12 24 29] to achieve highperformance The most popular one is the Aho-Corasick(AC) algorithm [12] that uses a variant of DFA for fastmulti-string matching It runs in O(n) time complexitywhere n is the number of input bytes Unfortunately ACsuffers from frequent cache misses due to large memoryfootprint and random memory access pattern which sig-nificantly impairs the performance In addition the modelof processing one byte at a time creates a sequential datadependency that stalls instruction pipelines of modernprocessors DFC [21] employs a set of small bitmap fil-ters that quickly pass out innocent traffic by checking thefirst few bytes of string patterns against the input streamEach matched input moves onto the verification stagefor full pattern comparison DFC substantially reducesmemory accesses and cache misses by using small andcache-friendly data structures which outperforms AC by2 to 36 times The string matcher of Hyperscan takes thetwo-stage matching similar to DFC but its bucket-basedshift-or algorithm benefits from SIMD instructions whichfurther improves the performance beyond that of DFCAn NFA implements a space-efficient state machine evenfor complex regexes Despite its small memory footprintthe execution is typically slow as each input charactertriggers O(m) memory lookups (m = of current states)For this reason a DFA is preferred to an NFA whenevera regex can be translated into the former One placewhere NFA might be preferred is a logic-based design thatmaps automata to hardware accelerators such as FPGA[14 22 23 34 35 40] An FPGA-based design canexploit parallelism by 
running multiple finite automatasimultaneously and does not suffer from sequential statetransition table lookups On the down side it is limitedto a small ruleset due to its hardware constraints Also it

632 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

suffers from the DMA overhead of moving data betweenthe CPU and the FPGA device This overhead can imposeprohibitive latency especially when input data is not large(as would be the case for scanning of small packets)

We use Glushkov NFA [27] for Hyperscan which iswidely used due to its two useful properties First itdoes not have epsilon transitions which simplifies statetransition implementation Second all transitions intoa given state are triggered by the same input symbolFigure 1 shows an example of a Glushkov NFA Eachcircle represents a state whose id is shown as a numberand each character represents the input symbol by whichany previous state transitions into this state For exampleone can transition into state 8 from state 3 6 or 7 onlyfor an input symbol lsquogrsquo The second property impliesthat the total number of states is bounded by the numberof input symbols and a few special states ndash a start statestates with lsquorsquo etcA DFA achieves high performance as it runs in O(1)per each input character Its main disadvantage how-ever is a large memory footprint and a potential ofstate explosion at transforming an NFA to a DFA Thusmost works on DFA focus on memory footprint reduc-tion [15 17 18 20 26 31 32 33] D2FA [32] com-presses the memory space by sharing multiple transitionsof states with a similar transition table and by establishinga default transition between them A-DFA [18] presentsuseful features such as alphabet reduction that classifiesalphabets into smaller groups indirect addressing thatreduces memory bandwidth by translating unique stateidentifiers to memory address and multi-stride structurethat processes multiple characters at a timeAn extended FA is a proposal that restructures the state-of-the-art FA to address state explosion XFA [38 39]associates update functions with states and transitionsby having a scratch memory that compresses the spaceHFA [16] presents a hybrid-FA that achieves compara-ble space to that of an NFA by making head DFAs andtrailing NFA or DFAs The theory behind it is to discoverboundary states so that one can conduct partial conversionof an NFA to a DFA to avoid exponential state explosionfrom a full conversionPrefilter-based 
approaches are the most popular way toscale performance of regex matching in practice BothSnort and Suricata extract string keywords from regexrules and perform unified multi-string matching withthe Aho-Corasick algorithm Expensive regex match-ing is only needed if AC detects literal strings in theinput SplitScreen [36] applies a similar approach to Cla-mAV [30] a widely-used anti-malware application andachieves a 2x speedup compared to original ClamAV

SIMD Register XX2 X1 X0X3

Y2 Y1 Y0Y3 SIMD Register Y

X2 OP Y2 X1 OP Y1 X0 OP Y0X3 OP Y3

OP

SIMD Register Z

OP OP OP

Figure 2 Typical two-operand SIMD operation

A SIMD instruction executes the same operation on mul-tiple data in parallel As shown in Figure 2 a SIMDoperation is performed on multiple lanes of two SIMDregisters independently and the results are stored in thethird register Modern CPU supports a number of SIMDinstructions that can work on specialized vector registers(SSE AVX etc) The latest AVX512 instructions supportup to 512-bit operations simultaneously

Despite its great potential few research works haveexploited SIMD instructions for regex matching Sitaridiet al propose a SIMD-based regex matching design [37]for database which uses a gather instruction to traverseDFA for multiple inputs simultaneously However itcannot be applied to our case as regex matching for DPIis typically performed on a single input streamSummary and our approach Most prior works onregex matching attempt to build a special FA that per-forms as well as a DFA while its memory footprint is assmall as an NFA However one common problem in allthese works is that FA restructuring inevitably imposes ex-tra performance overhead compared to the original DFAFor example XFA takes multiple tens to hundreds ofCPU cycles per input byte which is slower than a nor-mal DFA by one or two orders of magnitude In contrastthe prefilter-based approach looks attractive as it benefitsfrom multi-string matching most time which is fasterthan multi-regex matching by a few orders of magnitudeHowever it is still suboptimal as it must perform duplicatestring matching during regex matching and wrong choiceof string patterns would trigger redundant regex matching(as shown in Section 62) To avoid the inefficiency wetake a fresh approach that divides a regex pattern into mul-tiple components and leverages fast string matching tocoordinate the order of component matching This wouldminimize the waste of CPU cycles on redundant match-ing and thus improves the performance In addition wedevelop our own multi-string matching and FA matchingalgorithms carefully tailored to exploit SIMD operations

3 Regular Expression Decomposition

In this section, we present the concept of regex decomposition and explain how Hyperscan matches a sequence of regex components against the input. Then we introduce

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 633

graph-based decomposition, whose graph analysis techniques reliably identify the strings in regex patterns most desirable for string matching.

3.1 Matching with Regex Decomposition

The key idea of Hyperscan is to decompose each regex pattern into a disjoint set of string and subregex (or FA) components and to match each component until it finds a complete match. A string component consists of a stream of literals (or input symbols). Subregex components are the remaining parts of a regex after all string components are removed. They may include one or more meta-characters or quantifiers in regex syntax (like '^', '$', etc.) that need to be translated into an FA for matching; thus, we refer to them as FA components.

Linear regex: We start with a simple regex where each component is concatenated without alternation. We call it a linear regex. Formally, a linear regex pattern that contains at least one string can be represented by the following production rules:

1. regex → left str FA
2. left → left str FA | FA

where str and FA are both indivisible components and FA can be empty. A linear regex without any string is implemented as a single DFA or NFA. In practice, however, we find that 87% to 94% of the regex rules in IDSes have at least one extractable string, so a majority of real-world regexes would benefit from decomposition. The production rules imply that if we find the rightmost string in a linear regex pattern, we can recursively apply the same algorithm to decompose the rest of the pattern. One complication lies in a subregex with a repetition operator, such as (R)?, (R)*, (R)+, and (R){m,n}, where R is arbitrarily complex. Hyperscan treats (R)? and (R)* as a single FA since R is optional, while it converts (R)+ = (R)(R)* and (R){m,n} = (R)···(R)(R){0,n−m} ((R) appears m times). Then it decomposes their prefixes and treats the suffix as an FA.

In general, a decomposable linear regex can be expressed as FAn strn FAn−1 strn−1 ··· str2 FA1 str1 FA0. For any successful match of the original regex, all strings must be matched in the same order as they appear. Based on this observation, Hyperscan applies the following three rules for regex matching:

1. String matching is the first step. It scans the entire input to find all strs. Each successful match of a str may trigger the execution of its neighboring FA matching.

2. Each FA has its own switch for execution. It is off by default, except for the leftmost FA component.

3. Consider a generalized form, left FA str right, where left or right is an arbitrary sequence of decomposed components, possibly empty. Only if all components of left are matched successfully is the switch of FA turned on. Only if str is matched successfully and the FA switch is on is FA matching executed. Finally, only if FA is matched successfully is the leftmost FA of right turned on.

Let's take one example regex, .*foo[^X]barY+, and consider two input cases. The regex pattern is decomposed into FA2 str2 FA1 str1 FA0, where FA2=.*, str2=foo, FA1=[^X], str1=bar, FA0=Y+.

• Input=XfooZbarY: This is overall a successful match. First, the string matcher finds str2 (foo) and triggers matching of FA2 (.*) against X, since the leftmost FA switch is always on. Then the switch of FA1 ([^X]) is turned on. After that, the matcher finds str1 (bar), which triggers matching of FA1 against Z, and its success turns on the switch of FA0 (Y+). Since FA0 is the rightmost FA, it is executed against the remaining input, Y.

• Input=XfoZbarY: This is overall a failed match. First, the string matcher finds str1 (bar) and sees if it can trigger matching of FA1 ([^X]). Then it figures out that the switch of FA1 is off, since str2 (foo) was not found and thus neither FA2 nor str2 was a successful match. So the matching of regex FA1 terminates, ensuring no waste of CPU resources.

Our implementation keeps track of the input offsets of matched strings and the state of matching individual components, which allows invoking appropriate FA matching with a correct byte range of the input.

Regex with alternation: If a regex includes an alternation like (A|B), we expand the alternation into two regexes only if both A and B can be decomposed into str and FA components. If not, (A|B) is treated as a single FA. In case A or B itself has an alternation, we recursively apply the same rule until there is no alternation in their subregexes. Then each expanded regex becomes a linear regex, which benefits from the same rules for decomposition and matching as before.
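To make the switch-driven control flow concrete, here is a minimal Python sketch of the example above (FA2 is assumed to be ".*" as in the walk-through). It is a simplification that considers only the first occurrence of each string, not Hyperscan's actual runtime:

```python
import re

# Sketch of decomposed matching for ".*foo[^X]barY+", split as
# FA2=".*", str2="foo", FA1="[^X]", str1="bar", FA0="Y+".
def match_decomposed(data: str) -> bool:
    i2 = data.find("foo")              # string matching runs first
    if i2 < 0:
        return False                   # str2 missing: FA1/FA0 switches stay off
    # FA2 (".*") matches the prefix before "foo" and always succeeds,
    # so the switch of FA1 is turned on.
    i1 = data.find("bar", i2 + 3)
    if i1 < 0:
        return False                   # str1 missing: FA1 and FA0 never run
    # FA1 ("[^X]") must match exactly the bytes between the two strings.
    if not re.fullmatch(r"[^X]", data[i2 + 3:i1]):
        return False
    # FA0 ("Y+") is the rightmost FA: run it on the remaining input.
    return re.match(r"Y+", data[i1 + 3:]) is not None

print(match_decomposed("XfooZbarY"))   # successful match
print(match_decomposed("XfoZbarY"))    # "foo" not found, so no FA is executed
```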

Pattern matching with regex decomposition brings a key benefit: it minimizes the waste of CPU cycles on unnecessary FA matching, because FA matching is executed only when it is needed. It also increases the chance of fast DFA matching, as each decomposed FA is smaller and thus more likely to be converted into a DFA. In contrast, the prefiltering approach has to execute matching of the entire regex even when it is not needed (e.g., matching bar in the example above would trigger regex matching even if foo is not found), and regex matching must re-match the string already found in string matching. Furthermore, conversion of a whole regex into a single FA is not only more complex but often ends up with a slower NFA to avoid state explosion. In terms of correctness, pattern matching with regex decomposition produces the same result as the original regex, but we leave its formal proof as future work.

3.2 Rationale and Guidelines

In practice, performing regex decomposition on a regex's textual form is often tricky, as some string segments may be hidden behind special regex syntax. We provide several such examples below.

• Character class (or character-set): b[il1]l\s{0,10} includes a character class that can be expanded to three strings (bil, bll, and b1l), while naïve textual extraction might find only 'b' and 'l'.

• Alternation: The alternation sequence in (.*\x2d(h|H)(t|T)(t|T)(p|P)) makes it harder to discover http sequences by textual extraction.

• Bounded repeats: From the perspective of text, the strings with a minimum length of 32 are hidden in the bounded repeat [\x40-\x90]{32,}.

To reliably find these strings, we perform regex decomposition on the Glushkov NFA graph [27], which benefits from structural graph analyses. We describe useful guidelines for finding the strings effective for regex matching:

1. Find a string that divides the graph into two subgraphs, with the start state in one subgraph and the set of accept states in the other. Matching such a string is a necessary condition for any successful match of the entire regex. If the start and accept states happen to be in the same subgraph, the corresponding FA will always have to run regardless of a string match.

2. Avoid short strings. Short strings are prone to match more frequently and are likely to trigger expensive FA matching that fails.³

3. Expand small character-sets⁴ to multiple strings to facilitate decomposition. This not only increases the chance of successful decomposition but also leads to a longer string if a character-set intercepts a string sequence (e.g., document[\x22\x27]object).

4. Avoid having too many strings. Too many strings for matching would overload the matcher and degrade the overall performance, so it is important to find a small set of good strings effective for regex matching.

³ Our current limit is 2 to 3 characters.
⁴ Our current implementation treats a character-set that expands to 11 or fewer strings as a small character-set.

[Figure 3: Dominant path analysis]

3.3 Graph-based String Extraction

We develop three graph analysis techniques that discover strings on the critical path of matching. We describe the key idea of each algorithm below and provide more detailed algorithms in the appendix.

Dominant path analysis: A vertex u is called a dominator of vertex v if every path from the start state to vertex v must go through vertex u. A dominant path of vertex v is defined as a set of vertices W in a graph where each vertex in W is a dominator of v and the vertices form a trace of a single path. Dominant path analysis finds the longest common string that exists in all dominant paths of any accept state. For example, Figure 3 shows the string on the dominant path of the accept state (vertex 11).
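The dominator sets themselves can be computed with the textbook iterative data-flow algorithm. The sketch below is a toy illustration on a small graph, not Hyperscan's implementation:

```python
def dominators(graph, start):
    """graph: {vertex: set of successors}. Returns {v: set of dominators of v}."""
    verts = set(graph) | {w for succs in graph.values() for w in succs}
    preds = {v: {u for u in graph if v in graph.get(u, ())} for v in verts}
    dom = {v: set(verts) for v in verts}      # start pessimistically with all
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for v in verts - {start}:
            # v is dominated by itself plus whatever dominates
            # every one of its predecessors
            new = {v} | (set.intersection(*(dom[p] for p in preds[v]))
                         if preds[v] else set())
            if new != dom[v]:
                dom[v], changed = new, True
    return dom

# Toy graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {3}. Vertex 2 dominates the accept
# state 3, so a string ending at vertex 2 is a necessary match condition.
print(dominators({0: {1, 2}, 1: {2}, 2: {3}, 3: set()}, 0)[3])
```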

The string selected by the analysis is highly desirable for matching, as it clearly divides the start and accept states into two separate subgraphs, satisfying the first guideline. The algorithm calculates the dominant path for each accept state and finds the longest common prefix of all dominant paths. Then it extracts the string on the chosen path. If a vertex on the path is a small character-set, we expand it and obtain multiple strings.

Dominant region analysis: If the dominant path analysis fails to extract a string, we perform dominant region analysis. It finds a region of vertices that partitions the start state into one subgraph and all accept states into the other. More formally, a dominant region is a subset of vertices in a graph such that (a) the set of all edges that enter and exit the region constitutes a cut-set of the graph, (b) for every in-edge (u, v) to the region, there exist edges (u, w) for all w in {w : w is in the region and w has an in-edge}, and (c) for every out-edge (u, v) from the region, there exist edges (w, v) for all w in {w : w is in the region and w has an out-edge}, where (u, v) refers to an edge from vertex u to vertex v in the graph.

If a discovered region consists of only string or small character-set vertices, we transform the region into a set of strings. Since these strings connect two disjoint subgraphs of the original graph, any match of the whole regex must match one of these strings. Figure 4 shows an example of a dominant region with 9 vertices. Vertices 5, 6, and 7 are the entry points with the same predecessors, and vertices


[Figure 4: Dominant region analysis]

[Figure 5: Network flow analysis]

11, 12, and 13 are the exits with the same successor. We can extract the strings foo, bar, and abc as a result of dominant region analysis.

The algorithm for dominant region analysis first creates a directed acyclic graph (DAG) from the original graph to avoid any interference from back edges. Then it performs a topological sort on the DAG and iterates over each vertex to see if it can be added to the current candidate region, i.e., whether the boundary edges still form a valid cut-set. We repeat this to discover all regions in the graph. Since we only analyze the DAG, the back edges of the original graph might affect correctness. Thus, for each back edge, if its source and target vertices are in different regions, we merge them (and all intervening regions) into a single region. Finally, we extract the strings from the dominant region.

Network flow analysis: Since dominant path and dominant region analyses depend on a special graph structure, they may not always be successful. Thus, we run network flow analysis for generic graphs. For each edge, the analysis finds a string (or multiple strings) that ends at the edge. The edge is then assigned a score inversely proportional to the length of the string(s) ending there: the longer the string, the smaller the score. With a score per edge, the analysis runs the "max-flow min-cut" algorithm [25] to find a minimum cut-set that splits the graph into two parts, separating the start state from all accept states. The discovered cut-set thus consists of the edges that generate the longest strings in the graph.

Figure 5 shows a result of network flow analysis, extracting a string set of foo, efgh, and abc that divides the whole graph into two parts.

Effectiveness of graph analysis: Our graph analysis effectively produces good strings for most real-world rules. Table 1 shows that 97.2% to 99.2% of decomposable real-world regex rules benefit from dominant path

Ruleset   Total   Decomp   D-Path   D-Reg   N-flow
S-V        1663     1563     1551      32       16
S-E        7564     6756     6575     100      203
Suri       7430     6501     6318      94      201

Table 1: Effectiveness of graph analysis on real-world rulesets. S-V: Snort Talos (May 2015), S-E: Snort ET-Open 2.9.0, Suri: Suricata ruleset 4.0.4. D-Path, D-Reg, and N-flow refer to dominant path, dominant region, and network flow analysis, respectively. Decomp is the total number of decomposable rules. Note that one regex could benefit from multiple graph analyses, so the sum of the graph analysis columns is larger than the Decomp field.

[Figure 6: Two-stage matching with FDR. Extended shift-or matching over the input stream produces candidate matches, which are verified via hashing and exact matching against the string patterns.]

analysis, while the remaining patterns exploit dominant region and network flow analyses. These strings prove to be highly beneficial for reducing the number of regex matching invocations, as shown in Section 6.2.

4 SIMD-accelerated Pattern Matching

In this section, we present the design of multi-string and FA matching algorithms that leverage the SIMD operations of modern CPUs.

4.1 Multi-string Pattern Matching

We introduce an efficient multi-string matcher called FDR.⁵ The key idea of FDR is to quickly filter out innocent traffic by fast input scanning. As shown in Figure 6, FDR performs extended shift-or matching [13] to find candidate input strings that are likely to match some string pattern. Then it verifies them to confirm an exact match.

Shift-or matching: We first provide a brief background on shift-or matching, which serves as the base algorithm of FDR. The shift-or algorithm finds all occurrences of a string pattern in the input bytestream by performing bitwise shift and or operations, as shown in Figure 7. It uses two data structures: a shift-or mask for each character c in the symbol set (sh-mask('c')) and a state mask (st-mask) for the matching operation. sh-mask('c') zeros all bits whose bit position corresponds to a byte position of c in the string pattern, while all other bits are set to 1. The bit position in a sh-mask is counted from the rightmost bit, while the byte position in a pattern is counted from the leftmost byte. For example, for a string pattern aphp,

⁵ It is named after the 32nd President of the US.


[Figure 7: Classical shift-or matching. Shift-or masks for the pattern aphp (e.g., sh-mask('p') = 11110101) and the st-mask update sequence m1..m4 while matching the input.]

[Figure 8: Example shift-or masks with two patterns (ab and cd) at bucket 0. No other buckets contain 'a', 'b', 'c', or 'd' in their patterns.]

sh-mask('p') = 11110101, as 'p' appears at the second and the fourth positions in the pattern. If a character is unused in the pattern, all bits of its sh-mask are set to 1. The algorithm keeps a st-mask whose size is equal to the length of a sh-mask. Initially, all bits of the st-mask are set to 1. The algorithm updates the st-mask for each input character 'x' as st-mask = ((st-mask ≪ 1) | sh-mask('x')). For each matching input character, a 0 is propagated to the left by one bit. If the zero bit position reaches the length of the pattern, the pattern string has been found.
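For concreteness, here is a direct Python transcription of this classical single-pattern shift-or loop (a sketch, not Hyperscan code; Python's unbounded integers stand in for fixed-width masks):

```python
def shift_or(pattern: bytes, text: bytes):
    """Classical shift-or matching: returns the start offsets of all matches."""
    m = len(pattern)
    # sh-mask(c): bit i is 0 iff pattern[i] == c; all other bits are 1.
    masks = {c: ~sum(1 << i for i, p in enumerate(pattern) if p == c)
             for c in set(pattern)}
    default = -1          # all bits set: character unused in the pattern
    st = -1               # state mask, initially all 1s
    hits = []
    for pos, c in enumerate(text):
        st = (st << 1) | masks.get(c, default)
        if st & (1 << (m - 1)) == 0:      # a 0 reached bit (m-1): full match
            hits.append(pos - m + 1)      # start offset of the match
    return hits

print(shift_or(b"aphp", b"xaphpaphp"))    # → [1, 5]
```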

The original shift-or algorithm runs fast with high space efficiency, but it has two limitations. First, it supports only a single string pattern. Second, although it consists of bit-level operations, the implementation cannot benefit from SIMD instructions except for very long patterns. We tackle these problems as follows.

Multi-string shift-or matching: To support multi-string patterns, we update the data structures. First, we divide the set of string patterns into n distinct buckets, where each bucket has an id from 0 to n−1. For now, assume that each string pattern belongs to one of the n buckets; we discuss how we divide the patterns later ('pattern grouping'). Second, we increase the size of the sh-masks and the st-mask by n times, so that a group of n bits in sh-mask('x') records all the buckets that have 'x' in some of their patterns. More precisely, the k-th n bits of sh-mask('x') encode the ids of all buckets that hold at least one pattern with 'x' at the k-th byte position. One difference from the original algorithm is that the byte position in a pattern is counted from the rightmost byte. This enables parallel execution of multiple CPU instructions per cycle, as explained later ('SIMD acceleration'). For efficient implementation, we

[Figure 9: FDR's extended shift-or matching. Bucketed sh-masks are pre-shifted and ORed into the st-mask, yielding potential matches for buckets 0 and 4 at input byte position 3.]

set n to 8 so that a byte position in a sh-mask matches the same position in a pattern. This implies that the length of a sh-mask should be no smaller than that of the longest pattern.

Figure 8 shows an example. Bucket 0 has two string patterns, ab and cd. Since 'a' appears at the second byte of only ab in bucket 0, sh-mask('a') zeros only the first bit (= bucket 0) of the second byte. The i-th bit within each byte of a sh-mask indicates bucket id i−1. If a byte position of a sh-mask exceeds the longest pattern of a certain bucket (called a 'padding byte'), we encode the bucket id in the padding byte. This ensures matching correctness by carrying a match at a lower input byte along in the shift process. Examples of this are shown in the padding bytes in Figure 8 and in the sh-masks in Figure 9 that zero the first bit in the third and fourth bytes.

The pattern matching process is similar to the original algorithm, except that the sh-masks are shifted left instead of the st-mask. The st-mask is initially 0 except for the byte positions smaller than the shortest pattern, which avoids a false-positive match at a position smaller than the shortest pattern length. We then proceed with the input characters. The matcher keeps k, the number of characters processed so far modulo n. For an input character 'x', st-mask |= (sh-mask('x') ≪ k bytes). The matcher repeats this for n input characters and checks whether the st-mask has any zero bits. Zero bits represent a possible match at the corresponding bucket. For example, Figure 9 shows that buckets 0 and 4 have a potential match at input byte position 3. The verification stage, illustrated later, checks whether these are real matches or false positives.

Pattern grouping: The strategy for grouping patterns into buckets affects matching performance. A good


strategy would distribute the patterns such that most innocent traffic passes with a low false positive rate. Toward this goal, we design our algorithm based on two guidelines. First, we group patterns of similar lengths into the same bucket. This minimizes the information loss of longer patterns, as the input characters match only up to the length of the shortest pattern in a bucket for matching correctness. Second, we avoid grouping too many short patterns into one bucket; in general, shorter patterns are more likely to increase false positives. To meet these requirements, we sort the patterns in ascending order of length and assign ids 0 to (s−1) to the patterns in sorted order. Then we run an algorithm that calculates the minimum cost of grouping the patterns into n buckets using dynamic programming. The algorithm is summarized by the two equations below:

1. t[i][j] = min over k ∈ {i+1, ..., s−1} of (cost(i,k) + t[k+1][j−1]), where s is the number of patterns and t[i][j] stores the minimum cost of grouping the patterns i to (s−1) into (j+1) buckets.

2. cost(i,k) = (k − i + 1)^α / length(i)^β, where cost(i,k) is the cost of grouping patterns i to k into one bucket, length(i) is the length of pattern i, and α and β are constant parameters.
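A direct Python rendering of these two recurrences (a sketch; the exact range of k for the first bucket and the α, β values of 1.05 and 3 are assumptions taken from the surrounding text):

```python
from functools import lru_cache

ALPHA, BETA = 1.05, 3.0    # assumed constants

def min_grouping_cost(lengths, buckets=8):
    """lengths: pattern lengths sorted ascending, len(lengths) >= buckets.
    Returns t[0][buckets-1], the minimum total cost of the grouping."""
    s = len(lengths)

    def cost(i, k):
        # cost of putting patterns i..k into one bucket; the shortest
        # pattern (pattern i) dominates the false-positive risk
        return (k - i + 1) ** ALPHA / lengths[i] ** BETA

    @lru_cache(maxsize=None)
    def t(i, j):
        if j == 0:                 # last bucket takes all remaining patterns
            return cost(i, s - 1)
        # first bucket takes i..k; j more buckets take k+1..s-1
        return min(cost(i, k) + t(k + 1, j - 1) for k in range(i, s - j))

    return t(0, buckets - 1)
```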

t[i][j] is calculated as the minimum, over k, of the sum of the cost of grouping patterns i to k into one bucket (cost(i,k)) and the minimum cost of grouping the remaining patterns k+1 to (s−1) into j buckets (t[k+1][j−1]). cost(i,k) gets smaller as the bucket has a longer pattern, which allows more patterns in the bucket; it gets larger as the bucket has a shorter pattern, limiting the number of such patterns. Our implementation currently uses α = 1.05 and β = 3 toward this goal, computes t[0][7] to divide all string patterns into 8 buckets, and records the bucket id for each pattern in the process. In practice, we find that the algorithm works well, automatically reaching a sweet spot that minimizes the total cost.

Super characters: One problem with bucket-based matching is that it produces false positives among patterns in the same bucket. For example, if a bucket has ab and cd, the algorithm not only matches the correct patterns but also matches the false positives ad and cb. To suppress them, we use an m-bit (m > 8) super character (instead of an 8-bit ASCII character) to build and index the sh-masks. An m-bit super character consists of a normal (8-bit) character in the lower 8 bits and the low-order (m−8) bits of the next character in the upper bits. If it is the last character of a pattern (or of the input), we use a null character (0) as the next character. The key idea is to reflect some information about the next character in a pattern when building the sh-mask for the current character.

Only if the same two characters appear in the input⁶ do we declare a match at that input byte position. This significantly reduces false positives at the cost of slightly larger memory for the sh-masks.
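As a sketch (assuming m = 12), the super-character stream for an input can be computed as:

```python
def super_chars(data: bytes, m: int = 12):
    """Build the m-bit super character for every input byte offset (sketch)."""
    out = []
    low_bits = m - 8                       # bits borrowed from the next byte
    for i, c in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else 0   # null after the end
        out.append(((nxt & ((1 << low_bits) - 1)) << 8) | c)
    return out

# 'ad' yields super characters different from those built for "ab" and "cd",
# so a bucket holding "ab" and "cd" no longer falsely matches "ad".
print(super_chars(b"ad"))   # → [1121, 100]
```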

In practice, m should be between 9 and 15. Let's say m = 12 bits. For a pattern ab, we see two 12-bit super characters: α = ((low-order 4 bits of 'b' ≪ 8) | 'a') and β = 'b'. We then build sh-masks for α and β, respectively. When the input arrives, we construct a 12-bit super character based on the current input byte offset and use it as an index to fetch its sh-mask. We advance the input one byte at a time as before. For example, if the input is 'ad', the matcher first constructs γ = ((low-order 4 bits of 'd' ≪ 8) | 'a'), fetches sh-mask('γ'), and performs shift and or operations as before. Then it advances to the next byte and constructs δ = 'd'. So the input 'ad' will not match even if a bucket contains ab and cd.

SIMD acceleration: Our implementation of FDR heavily exploits SIMD operations and instruction-level parallelism. First, it uses 128-bit sh-masks so that it can employ 128-bit SIMD instructions (e.g., pslldq for left shift and por for or in the Intel x86-64 instruction set) to update the masks. As shift and or are the most frequent operations in FDR, it enjoys a substantial performance boost from the SIMD instructions. Second, it exploits parallel execution with multiple CPU execution ports. In the original shift-or matching, the execution of shift and or operations is completely serialized, as each operation depends on the previous result. This under-utilizes a modern CPU even though it can issue multiple instructions per cycle. In contrast, FDR exploits instruction-level parallelism by pre-shifting the sh-masks with multiple input characters in parallel. Note that this is made possible because we count the byte position differently from the original version. The parallel execution effectively increases instructions per cycle (IPC) and significantly improves performance. To accommodate the parallel shifting, we limit the length of a pattern to 8 bytes and extract the lower 8 bytes from any pattern longer than 8 bytes. Because this requires a minimum of 8-byte masks and up to 7 bytes of shifting, a 128-bit mask does not lose high-bit information during left shifts. In actual matching, FDR handles 8 bytes of input at a time. To guarantee contiguous matching across 8-byte input boundaries, the st-mask of the previous iteration is shifted right by 8 bytes for the next iteration.

Verification: As our shift-or matching can still generate false positives, we need to verify whether a candidate match is

⁶ Of course, there is still a small chance of a false positive, as we use partial bits of the next character, but the probability becomes fairly small as the pattern length grows.


Algorithm 1: FDR Multi-string Matcher
 1: function MATCH
 2:   n = number of bits of a super character
 3:   R = startMask
 4:   for each 8-byte V ∈ input do
 5:     for i ∈ 0..7 do
 6:       index = V[i*8 .. i*8+n−1]
 7:       M[i] = shiftOrMask[index]
 8:       S[i] = LSHIFT(M[i], i)
 9:     end for
10:     for i ∈ 0..7 do
11:       R = OR128(R, S[i])
12:     end for
13:     for each zero bit b in the low 8 bytes of R do
14:       j = the byte position containing bit b
15:       l = length of the string in bucket b
16:       h = HASH(V[j−l+1 .. j])
17:       if h is valid then
18:         perform exact matching for each string in hash bucket[h]
19:       end if
20:     end for
21:     R = RSHIFT(R, 8)
22:   end for
23: end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. We then leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find that hashing filters out a large portion of the false positives.

4.2 Finite Automata Pattern Matching

A successful string match often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold,⁷ we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, there can be multiple current states (the current set) that are active at any time. A state transition with an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match is successful if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation as it outperforms traditional NFA representations that use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. A bit-based NFA can also leverage SIMD instructions to perform vectorized bit

⁷ We use 16,384 states as the threshold.

operations to further accelerate performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n−1 in topological order, and maintains the current set as a bit mask (the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are assigned sequentially in topological order, the span of a state transition is typically small. We exploit this fact to compactly represent the transitions, as described below.

The bit-based NFA implements a state transition with an input character 'c' in three steps. First, it calculates the successor set that can be transitioned to from any state in the current set with any input character. Second, it computes the set of all states that can be transitioned to from any state with 'c' (the reachable states by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, as a Glushkov NFA graph guarantees that a specific state can be entered only with the same character or the same character-set.

The challenge is to efficiently calculate the successor set. One could pre-calculate a successor set for every combination of current states and look up the successor set for the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable except for a small n. An alternative is to keep a successor-set mask per individual state and to combine the successor sets of every state in the current set. This is more space-efficient, but it costs up to n memory lookups and (n−1) or operations. We implement the latter but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some example shift-k masks. The shift-1 mask sets every bit to 1 except for bit 7, since every state except state 7 has a forward transition of span 1.

We divide state transitions into two types: typical and exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set from the current set following typical transitions. If the current-set mask is S, then ((S & shift-k mask) ≪ k) represents all possible successor states reached with transitions of span k from S. If we combine the successor sets for all k, we obtain the successor set reached by all typical transitions.


[Figure 10: NFA representation for (AB|CD)*AFF*. Shift-0/1/2 masks encode the typical transitions, while an exception mask and per-state successor masks (for states 0, 2, and 4) encode the exceptional transitions.]

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and all backward transitions.⁸ Any state that has at least one exceptional transition keeps its own successor mask, which records all states reached by the exceptional transitions of its owner state. All exceptional states are maintained in an exception mask.

The choice of the shift limit affects performance. If it is too large, we would have too many shift-k masks representing rare transitions; if it is too small, we would have to handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.
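One transition step combining both transition types can be sketched in Python as follows. Bit i of each integer mask represents state i; the toy NFA at the bottom (for the pattern /ab/) is an assumed example, not the Figure 10 graph:

```python
def nfa_step(S, shift_masks, ex_mask, succ_masks, reach_c):
    """One bit-NFA transition from current-set mask S (sketch)."""
    succ = 0
    for k, mask in enumerate(shift_masks):        # typical transitions
        succ |= (S & mask) << k                   # span-k forward moves
    ex = S & ex_mask                              # states with exceptions
    s = 0
    while ex:                                     # walk the set bits
        if ex & 1:
            succ |= succ_masks[s]                 # per-state successor mask
        ex >>= 1
        s += 1
    return succ & reach_c    # keep only states reachable on character c

# Toy Glushkov NFA for /ab/: state 0 (start), 1 ('a'), 2 ('b').
# Transitions 0->1 and 1->2 both have span 1, so the shift-1 mask is 0b011.
shift_masks = [0b000, 0b011]
reach = {'a': 0b010, 'b': 0b100}
S = nfa_step(0b001, shift_masks, 0, {}, reach['a'])   # consume 'a'
S = nfa_step(S, shift_masks, 0, {}, reach['b'])       # consume 'b'
print(bin(S))   # state 2 (the accept state) is now active
```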

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of the ids. States 0, 2, and 4 are highlighted, as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). Then it fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise AND with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state. Otherwise, it proceeds with the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exception states. Our implementation uses a 128-bit mask (and extends it up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough to represent the NFA graph of most regexes.

8 In our implementation, forward transitions that cross a 64-bit boundary of the state id space (e.g., from an id smaller than 64 to an id larger than 64) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2: Bit-based NFA Matching
1: SH_MSKS[i]: shift-i masks for typical transitions
2: SUCC_MSKS[i]: successor mask for state i
3: EX_MSK: exception mask
4: reach[k]: reachable state set for character k
5: function RUNNFA(S: current active state set)
6:   SUCC_TYP = 0, SUCC_EX = 0
7:   for c ∈ input do
8:     if any state is active in S then
9:       for i = 0 to shift_limit do
10:        R0 = AND(S, SH_MSKS[i])
11:        R1 = LSHIFT(R0, i)
12:        SUCC_TYP = OR(SUCC_TYP, R1)
13:      end for
14:      S_EX = AND(S, EX_MSK)
15:      for active state s in S_EX do
16:        SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
17:      end for
18:      SUCC = OR(SUCC_TYP, SUCC_EX)
19:      S = AND(SUCC, reach[c])
20:      Report accept states in S
21:    end if
22:  end for
23: end function

Total Size:              1 GBytes
Number of Packets:       818,682
Number of TCP Packets:   818,520
Percent of TCP Bytes:    97.2%
Percent of HTTP Bytes:   92.9%
Average Packet Size:     1,265 Bytes

Table 2: HTTP traffic trace from a cloud service provider
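Algorithm 2 and the Figure 10 masks can be combined into a small runnable sketch. This is illustrative, not Hyperscan's implementation: a Python integer stands in for the 128-bit SIMD mask, the masks are hand-derived from the graph, and scanning is anchored at offset 0 for simplicity:

```python
# Runnable sketch of Algorithm 2 on the Figure 10 NFA for (AB|CD)*AFF*.
# A Python int replaces the 128-bit SIMD mask; masks are hand-derived.
SHIFT_LIMIT = 2

def bits(*states):
    """Mask with one bit set per listed state."""
    mask = 0
    for s in states:
        mask |= 1 << s
    return mask

SH_MSKS = [bits(7), bits(0, 1, 2, 3, 4, 5, 6), 0]       # shift-0/1/2 masks
SUCC_MSKS = {0: bits(3, 5), 2: bits(1, 5), 4: bits(1, 3)}  # exceptional succs
EX_MSK = bits(0, 2, 4)                                   # exceptional states
ACCEPT = bits(6, 7)                                      # F and trailing F*
reach = {'A': bits(1, 5), 'B': bits(2), 'C': bits(3), 'D': bits(4),
         'F': bits(6, 7)}

def run_nfa(data, S=bits(0)):
    """Anchored scan; returns end offsets at which an accept state is active."""
    matches = []
    for pos, c in enumerate(data):
        if not S:
            break
        succ = 0
        for k in range(SHIFT_LIMIT + 1):                 # typical transitions
            succ |= (S & SH_MSKS[k]) << k
        ex = S & EX_MSK                                  # exceptional states
        for s, mask in SUCC_MSKS.items():
            if ex & (1 << s):
                succ |= mask
        S = succ & reach.get(c, 0)                       # filter by input char
        if S & ACCEPT:
            matches.append(pos)
    return matches

print(run_nfa("ABAFF"))   # prints [3, 4]
```

Each scan step costs one shifted-AND per shift mask plus one OR per active exceptional state, matching the O(l + e) bound above.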

5 Implementation

Hyperscan consists of compile-time and run-time functions. Compile-time functions include regex-to-NFA graph conversion, graph decomposition, and matching component generation. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:

• Small string-set (<80) matching. This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction, applied over a small number (2-8) of 4-bit regions in the suffix of each string.

• NFA and DFA cyclic state acceleration. Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where


the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration for searching for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states, or switches off one of the current states.

# of regexes   Prefilter     Hyperscan   Reduction
500            2,971,652     645,326     4.6x
1000           93,595,304    714,582     131.0x
1500           110,122,972   791,017     139.2x
2000           139,804,519   780,665     179.1x
2500           156,332,187   857,100     182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

• Small-size DFA matching. We design a SIMD-based algorithm for small DFAs (<16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

• Anchored pattern matching. When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the corresponding automata are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching. We design a fast lookaround approach that peeks at inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all, or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s{4,5}foo, where R is a complex regex, we can first detect with SIMD checks whether the digit and space character classes have matched and, if not, avoid or defer running a potentially expensive FA associated with R.
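The shift-or technique named in the small string-set bullet above is the classic bitap family [13]; a scalar single-pattern sketch (the SIMD/PSHUFB bucketing over 4-bit regions is omitted, so this shows only the core bit-parallel recurrence):

```python
# Classic single-pattern shift-or (bitap) matching: bit i of `state` is 0
# iff the last i+1 input characters match the first i+1 pattern characters.
def shift_or_search(pattern, text):
    """Yield end offsets of occurrences of pattern in text."""
    masks = {}                       # per-character mask: 0-bit = position ok
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, ~0) & ~(1 << i)
    state = ~0                       # all ones: no partial match yet
    done = 1 << (len(pattern) - 1)
    for pos, ch in enumerate(text):
        state = (state << 1) | masks.get(ch, ~0)
        if (state & done) == 0:      # full pattern recognized at this offset
            yield pos

print(list(shift_or_search("abc", "xabcabz")))   # prints [3]
```

The inner loop is branch-light and uses only a shift, an OR, and an AND per input byte, which is why the algorithm maps so well onto wide SIMD registers.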

6 Evaluation

In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than manual choice? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup

We use a server machine with an Intel Xeon Platinum 8180 CPU @ 2.50GHz and 48 GB of memory, and compile the code with GCC 5.4. To separate out the impact of network IO, we evaluate the performance with a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, as shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

# of regexes   Prefilter    Hyperscan   Reduction
700            699,622      18,164      38.5x
850            7,516,464    19,214      391.2x
1000           17,063,344   32,533      524.5x
1150           17,737,814   34,075      520.6x
1300           25,143,574   36,040      697.7x

Table 4: Regex invocations for Snort's Talos ruleset

6.2 Effectiveness of Regex Decomposition

The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting "good" strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace and confirm correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when it is necessary.

6.3 Microbenchmarks

We evaluate the performance of FDR, our multi-string matcher, as well as that of the regex matching of Hyperscan.

Multi-string pattern matching. We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns extracted from the Snort ET-Open and Talos rulesets. Figures 11 and 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate it with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, the performance becomes comparable with DFC due to increased cache usage, but it is still much better than AC.

[Figure: throughput (Gbps) of FDR vs. DFC, and improvement over DFC, for 1k-26k (ET-Open) and 1k-20k (Talos) string patterns; (a) ET-Open ruleset, (b) Talos ruleset]
Figure 11: String matching performance with random packets

[Figure: throughput (Gbps) of FDR vs. DFC, and improvement over DFC, for 1k-26k (ET-Open) and 1k-20k (Talos) string patterns; (a) ET-Open ruleset, (b) Talos ruleset]
Figure 12: String matching performance with a real traffic trace

Regex matching. We now evaluate the performance of the regex matching of Hyperscan. We compare its performance with PCRE (v8.41) [6], as it is most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (i.e., matching against one regex at a time), which requires passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1300 regexes from the Snort Talos ruleset and 2800 regexes from the Suricata ruleset.

Ruleset    PCRE     PCRE2   RE2-s   Hyperscan-s   RE2-m   Hyperscan-m
Talos      694.2    39.4    177.7   17.3          29.0    2.15
ET-Open    1280.0   91.3    469.6   51.6          111.6   13.3

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1300 regexes) and Suricata (2800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

Table 5 shows the results. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 13.5x faster than RE2-m, while it shows an 18.33x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs, and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both with a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort, but it replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect a much higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons

Hyperscan has been developed since 2008 and was first open-sourced in 2013. It has been successfully adopted by over 40 commercial projects globally, and it is in production use on tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed in C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience with developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan

Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance, but also was simple for applications to employ.

Version 1.0. The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). It also suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosion if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0. The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA Graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax

tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data, with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string-matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings were found. This approach avoids the high risk of potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.

Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines, even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns foo, abc[xyz], and abc[xy]*def|abcz*de[fg] might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern abc[xyz], then provide matches for the two parts of the alternation, abc[xy]*def and abcz*de[fg], without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0. (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel with the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0. Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identify the strings separated by various common regex constructs such as .* or X+ for some character class X. For example, it is frequently the case that strings in regexes are separated by the construct .* (i.e., foo.*bars). This is usually implementable by requiring only that foo is seen before bars (usually, but not always; consider the expression foo*oars). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

support this the system now allows user-defined ANDOR NOT along their patterns For example an AND op-eration between patterns foobar and teakettle requiredthat both patterns are matched for input before reportinga match Version 50 added Chimera a hybrid matcher ofHyperscan and PCRE brings the benefit of both worlds ndashsupport for full PCRE syntax while enjoying the high per-formance of Hyperscan Lack of support of Hyperscan forfull PCRE syntax (such as capturing and back-references)made it difficult to completely replace PCRE in adoptedsolutions Chimera employs Hyperscan as a fast filter forinput and triggers the PCRE library functions to confirma real match only when Hyperscan reports a match on apattern that has features not supported by Hyperscan

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of the development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on delivering some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes), and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimal Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with additional powers of support for large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often forced special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series were, over time, not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream state to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts, and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback from the anonymous reviewers of USENIX NSDI '19, as well as from our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.

[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333-340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74-82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors and Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248-264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1-53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317-328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.

[30] T Kojm ClamAV httpwwwclamavnet[31] Smith-R Kong S and C Estan Efficient signature

matching with multiple alphabet compression tablesIn Proceedings of the International ICST Confer-ence on Security and Privacy in CommunicationNetworks (SecureComm) 2008

[32] S Kumar S Dharmapurikar F Yu P Crowley and

646 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

J Turner Algorithms to accelerate multiple regularexpressions matching for deep packet inspectionIn Proceedings of the ACM SIGCOMM on Datacommunication (SIGCOMM) 2006

[33] Turner-J Kumar S and J Williams Advanced algo-rithms for fast and scalable deep packet inspectionIn Proceedings of the ACMIEEE Symposium onArchitecture for Networking and CommunicationsSystems (ANCS) 2006

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (E, V)
 1: function DOMINANTPATHANALYSIS(G)
 2:     d_path = ∅
 3:     for v ∈ accepts do
 4:         calculate dominant path p[v] for v
 5:         if d_path = ∅ then
 6:             d_path = p[v]
 7:         else
 8:             d_path = common_prefix(d_path, p[v])
 9:             if d_path = ∅ then
10:                 return null_string
11:             end if
12:         end if
13:         strings = expand_and_extract(d_path)
14:     end for
15:     return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
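As a runnable illustration, the per-accept dominator computation and literal extraction can be sketched in Python (the helper names and the toy graph are ours, not Hyperscan's; following the Glushkov convention, each vertex is labeled with the symbol on which it is entered):

```python
def dominators(preds, topo, start):
    # dom[v] = {v} | (intersection of dom[p] over all predecessors p);
    # one pass suffices since topo is a topological order of a DAG
    dom = {start: {start}}
    for v in topo:
        if v != start:
            dom[v] = {v} | set.intersection(*(dom[p] for p in preds[v]))
    return dom

def dominant_path_string(preds, topo, labels, start, accepts):
    dom = dominators(preds, topo, start)
    # common dominators of all accept states, kept in path order
    common = set.intersection(*(dom[v] for v in accepts))
    path = [v for v in topo if v in common]
    # extract the longest run of single-literal labels on the path
    best = cur = ''
    for v in path:
        ch = labels.get(v)
        if ch is not None and len(ch) == 1:
            cur += ch
            best = max(best, cur, key=len)
        else:
            cur = ''
    return best

# Toy graph: 0 is the start state; two character-set branches (1, 2)
# rejoin at 3, followed by literals; 6 is the accept state.
preds  = {1: [0], 2: [0], 3: [1, 2], 4: [3], 5: [4], 6: [5]}
labels = {1: '[^a]', 2: '[^b]', 3: 'x', 4: 'a', 5: 'b', 6: 'c'}
topo   = [0, 1, 2, 3, 4, 5, 6]
print(dominant_path_string(preds, topo, labels, 0, [6]))  # -> xabc
```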

Algorithm 4 Dominant Region Analysis

Require: Graph G = (E, V)
 1: function DOMINANTREGIONANALYSIS(G)
 2:     acyclic_g = build_acyclic(G)
 3:     Gt = build_topology_order(acyclic_g)
 4:     candidate = {q0}
 5:     it = begin(Gt)
 6:     while it ≠ end(Gt) do
 7:         if isValidCut(candidate) then
 8:             setRegion(candidate)
 9:             initializeCandidate(candidate)
10:         else
11:             addToCandidate(it)
12:             it = it + 1
13:         end if
14:     end while
15:     setRegion(candidate)
16:     Merge regions connected with back edges
17:     strings = expand_and_extract(regions)
18:     return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G = (E, V)
 1: function NETWORKFLOWANALYSIS(G)
 2:     for edge ∈ E do
 3:         strings = find_strings(edge)
 4:         scoreEdge(edge, strings)
 5:     end for
 6:     cuts = MinCut(G)
 7:     strings = extract and expand strings from cuts
 8:     return strings
 9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of a string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
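The scoring and cut step can be sketched with a compact Edmonds-Karp max-flow [25] (the function names and the toy edge set are ours; capacities are the inverse string lengths described above, so the minimum cut prefers edges carrying long strings):

```python
from collections import deque

def bfs_augment(cap, flow, s, t):
    # find a shortest augmenting path in the residual graph
    parent = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in cap.get(u, {}):
            if v not in parent and cap[u][v] - flow[u][v] > 0:
                parent[v] = u
                if v == t:
                    return parent
                q.append(v)
    return None

def min_cut(edges, s, t):
    # edges: {(u, v): capacity}; returns the set of cut edges
    cap, flow = {}, {}
    for (u, v), c in edges.items():
        cap.setdefault(u, {})[v] = c
        cap.setdefault(v, {}).setdefault(u, 0)   # residual back-edge
        flow.setdefault(u, {})[v] = 0
        flow.setdefault(v, {})[u] = 0
    while True:
        parent = bfs_augment(cap, flow, s, t)
        if parent is None:
            break
        path, v = [], t                          # reconstruct s->t path
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(cap[u][v] - flow[u][v] for u, v in path)  # bottleneck
        for u, v in path:
            flow[u][v] += b
            flow[v][u] -= b
    # cut = edges from the residual-reachable side to the rest
    reach, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in cap.get(u, {}):
            if v not in reach and cap[u][v] - flow[u][v] > 0:
                reach.add(v)
                q.append(v)
    return {(u, v) for (u, v) in edges if u in reach and v not in reach}

# Each edge carries a string; its score (capacity) is 1/len(string).
strings = {('s', 'a'): 'foo', ('s', 'b'): 'x', ('a', 't'): 'y', ('b', 't'): 'abcd'}
edges = {e: 1.0 / len(w) for e, w in strings.items()}
cut = min_cut(edges, 's', 't')
print(sorted(strings[e] for e in cut))  # -> ['abcd', 'foo']
```

The min cut avoids the cheap short-string edges ('x', 'y') and selects the edges whose strings are long, which is exactly the bias the scoring is designed to create.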



of decomposed subregex matching, whose execution and matching order is controlled by fast string matching. This design brings a number of benefits. First, our regex decomposition identifies string components automatically by performing rigorous structural analyses on the NFA graph of a regex. Our algorithm ensures that the extracted strings are a pre-requisite for the rest of regex matching. Second, string matching is run as a part of regex matching rather than being employed only as a trigger. Unlike the prefilter-based design, Hyperscan keeps track of the state of string matching throughout regex matching and avoids any redundant operations. Third, FA component matching is executed only when all relevant string and FA components are matched. This eliminates unnecessary FA component matching, which allows efficient CPU utilization. Finally, most decomposed FA components tend to be small, so they are more likely to be able to be converted to a DFA and benefit from fast DFA matching.

Beyond the benefits of regex decomposition, Hyperscan also brings a significant performance boost with single-instruction-multiple-data (SIMD)-accelerated pattern matching algorithms. For string matching, we extend the shift-or algorithm [13] to support efficient multi-string matching with bit-level SIMD operations. For FA matching, we represent a state with a bit position, while we implement state transitions and successor state-set calculation with SIMD instructions on a large bitmap. We find that our SIMD-accelerated string matching outperforms state-of-the-art multi-string matching by a factor of 1.3 to 2.5. We also find that our SIMD-accelerated regex matching achieves a 24.8x to 40.1x performance improvement over PCRE [6], widely adopted by DPI middleboxes such as Snort and Suricata.

In summary, we make the following contributions:

• We present a novel regex matching strategy that exploits regex decomposition. Regex decomposition performs rigorous graph analysis algorithms that extract key strings of a regex for efficient matching, and drives the order of pattern matching by fast string matching. This drastically improves the performance.

• We develop SIMD-accelerated pattern matching algorithms for both string matching and FA matching to leverage the CPU's compute capability on data parallelism.

• Our evaluation shows Hyperscan greatly helps improve the performance of real-world DPI applications. It improves the performance of Snort by 8.7x for a real traffic trace.

• We share our experience with developing Hyperscan and present lessons learned through commercialization.


[Figure 1: Glushkov NFA for (abc|def).*ghi. States 0–10; each state is entered on the character shown next to it. For example, states 3, 6, and 7 all transition into state 8 on input symbol 'g'.]

2 Background and Motivation

DPI is a common functionality in many security middleboxes, and its performance has been mainly driven by that of regex matching [19, 41]. There has been a large body of research that improves the performance of regex matching. Due to space constraints, we briefly review only a few, categorizing them by their approach.

String matching is a subset of regex matching, which requires specialized algorithms [12, 24, 29] to achieve high performance. The most popular one is the Aho-Corasick (AC) algorithm [12], which uses a variant of DFA for fast multi-string matching. It runs in O(n) time complexity, where n is the number of input bytes. Unfortunately, AC suffers from frequent cache misses due to a large memory footprint and random memory access patterns, which significantly impairs the performance. In addition, the model of processing one byte at a time creates a sequential data dependency that stalls the instruction pipelines of modern processors. DFC [21] employs a set of small bitmap filters that quickly pass out innocent traffic by checking the first few bytes of string patterns against the input stream. Each matched input moves onto the verification stage for full pattern comparison. DFC substantially reduces memory accesses and cache misses by using small and cache-friendly data structures, which outperforms AC by 2 to 3.6 times. The string matcher of Hyperscan takes two-stage matching similar to DFC, but its bucket-based shift-or algorithm benefits from SIMD instructions, which further improves the performance beyond that of DFC.

An NFA implements a space-efficient state machine even for complex regexes. Despite its small memory footprint, the execution is typically slow, as each input character triggers O(m) memory lookups (m = number of current states). For this reason, a DFA is preferred to an NFA whenever a regex can be translated into the former. One place where an NFA might be preferred is a logic-based design that maps automata to hardware accelerators such as FPGA [14, 22, 23, 34, 35, 40]. An FPGA-based design can exploit parallelism by running multiple finite automata simultaneously and does not suffer from sequential state transition table lookups. On the down side, it is limited to a small ruleset due to its hardware constraints. Also, it


suffers from the DMA overhead of moving data between the CPU and the FPGA device. This overhead can impose prohibitive latency, especially when input data is not large (as would be the case for scanning of small packets).

We use the Glushkov NFA [27] for Hyperscan, which is widely used due to its two useful properties. First, it does not have epsilon transitions, which simplifies the state transition implementation. Second, all transitions into a given state are triggered by the same input symbol. Figure 1 shows an example of a Glushkov NFA. Each circle represents a state whose id is shown as a number, and each character represents the input symbol by which any previous state transitions into this state. For example, one can transition into state 8 from state 3, 6, or 7 only for an input symbol 'g'. The second property implies that the total number of states is bounded by the number of input symbols and a few special states – a start state, states with '.', etc.

A DFA achieves high performance, as it runs in O(1) per each input character. Its main disadvantage, however, is a large memory footprint and a potential of state explosion when transforming an NFA into a DFA. Thus, most works on DFA focus on memory footprint reduction [15, 17, 18, 20, 26, 31, 32, 33]. D2FA [32] compresses the memory space by sharing multiple transitions of states with a similar transition table and by establishing a default transition between them. A-DFA [18] presents useful features such as alphabet reduction, which classifies alphabets into smaller groups; indirect addressing, which reduces memory bandwidth by translating unique state identifiers to memory addresses; and a multi-stride structure that processes multiple characters at a time.

An extended FA is a proposal that restructures the state-of-the-art FA to address state explosion. XFA [38, 39] associates update functions with states and transitions by having a scratch memory that compresses the space. HFA [16] presents a hybrid-FA that achieves comparable space to that of an NFA by making head DFAs and trailing NFAs or DFAs. The theory behind it is to discover boundary states so that one can conduct a partial conversion of an NFA to a DFA, to avoid the exponential state explosion from a full conversion.

Prefilter-based approaches are the most popular way to scale the performance of regex matching in practice. Both Snort and Suricata extract string keywords from regex rules and perform unified multi-string matching with the Aho-Corasick algorithm. Expensive regex matching is only needed if AC detects literal strings in the input. SplitScreen [36] applies a similar approach to ClamAV [30], a widely-used anti-malware application, and achieves a 2x speedup compared to original ClamAV.
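The second Glushkov property above (all transitions into a state carry the same symbol) enables a bit-parallel simulation in which every state is one bit, successor sets are unioned, and a per-symbol "reach" mask filters the result; this is the representation Hyperscan accelerates with SIMD. Below is a scalar Python sketch of ours, with plain integers standing in for the wide bitmaps (the toy NFA is an assumption for illustration):

```python
def run_glushkov(succ, reach, accept_mask, data):
    # One bit per state (bit 0 = start). succ[b]: bitmask of successors
    # of the state at bit b. reach[c]: bitmask of states entered on
    # symbol c -- well-defined because, in a Glushkov NFA, every
    # transition into a given state carries the same symbol.
    states = 1                                   # start state active
    for ch in data:
        nxt, s = 0, states
        while s:                                 # union successor sets
            b = s & -s                           # lowest active bit
            nxt |= succ[b]
            s ^= b
        states = (nxt & reach.get(ch, 0)) | 1    # unanchored: keep start
        if states & accept_mask:
            return True
    return False

# Toy NFA for the pattern "ab": bit 0 = start, bit 1 = 'a', bit 2 = 'b'
succ = {1: 0b010, 2: 0b100, 4: 0b000}
reach = {'a': 0b010, 'b': 0b100}
assert run_glushkov(succ, reach, 0b100, "xxab")
assert not run_glushkov(succ, reach, 0b100, "axb")
```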

[Figure 2: Typical two-operand SIMD operation. An operation OP is applied to each pair of lanes Xi and Yi of SIMD registers X and Y independently, producing Xi OP Yi in the corresponding lane of a third register Z.]

A SIMD instruction executes the same operation on multiple data in parallel. As shown in Figure 2, a SIMD operation is performed on multiple lanes of two SIMD registers independently, and the results are stored in a third register. Modern CPUs support a number of SIMD instructions that can work on specialized vector registers (SSE, AVX, etc.). The latest AVX-512 instructions support up to 512-bit operations simultaneously.
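The lane-wise semantics of Figure 2 can be mimicked with plain integers. This sketch of ours models four 32-bit lanes packed into a 128-bit value; real code would use an intrinsic such as _mm_add_epi32:

```python
LANES, WIDTH = 4, 32
MASK = (1 << WIDTH) - 1

def pack(vals):
    # pack four 32-bit lane values into one 128-bit integer (lane 0 = low bits)
    acc = 0
    for i, v in enumerate(vals):
        acc |= (v & MASK) << (i * WIDTH)
    return acc

def lanewise_add(x, y):
    # behaves like a lane-wise SIMD add: each 32-bit lane is added
    # independently, wrapping around with no carry across lane boundaries
    return [(((x >> (i * WIDTH)) & MASK) + ((y >> (i * WIDTH)) & MASK)) & MASK
            for i in range(LANES)]

print(lanewise_add(pack([1, 2, 3, 0xFFFFFFFF]), pack([10, 20, 30, 1])))
# -> [11, 22, 33, 0]
```

Note how lane 3 wraps to 0 without disturbing lane 2, exactly the independence that makes SIMD lanes useful for parallel bit-mask updates.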

Despite its great potential, few research works have exploited SIMD instructions for regex matching. Sitaridi et al. propose a SIMD-based regex matching design [37] for databases, which uses a gather instruction to traverse a DFA for multiple inputs simultaneously. However, it cannot be applied to our case, as regex matching for DPI is typically performed on a single input stream.

Summary and our approach: Most prior works on regex matching attempt to build a special FA that performs as well as a DFA while its memory footprint is as small as an NFA. However, one common problem in all these works is that FA restructuring inevitably imposes extra performance overhead compared to the original DFA. For example, XFA takes multiple tens to hundreds of CPU cycles per input byte, which is slower than a normal DFA by one or two orders of magnitude. In contrast, the prefilter-based approach looks attractive, as it benefits from multi-string matching most of the time, which is faster than multi-regex matching by a few orders of magnitude. However, it is still suboptimal, as it must perform duplicate string matching during regex matching, and a wrong choice of string patterns would trigger redundant regex matching (as shown in Section 6.2). To avoid the inefficiency, we take a fresh approach that divides a regex pattern into multiple components and leverages fast string matching to coordinate the order of component matching. This minimizes the waste of CPU cycles on redundant matching and thus improves the performance. In addition, we develop our own multi-string matching and FA matching algorithms carefully tailored to exploit SIMD operations.

3 Regular Expression Decomposition

In this section, we present the concept of regex decomposition and explain how Hyperscan matches a sequence of regex components against the input. Then we introduce


graph-based decomposition, whose graph analysis techniques reliably identify the strings in regex patterns most desirable for string matching.

3.1 Matching with Regex Decomposition

The key idea of Hyperscan is to decompose each regex pattern into a disjoint set of string and subregex (or FA) components, and to match each component until it finds a complete match. A string component consists of a stream of literals (or input symbols). Subregex components are the remaining parts of a regex after all string components are removed. They may include one or more meta-characters or quantifiers in regex syntax (like '^', '$', '.', '*', etc.) that need to be translated into an FA for matching. Thus, we refer to them as FA components.

Linear regex: We start with a simple regex where each component is concatenated without alternation. We call it a linear regex. Formally, a linear regex pattern that contains at least one string can be represented by the following production rules:

1. regex → left str FA
2. left → left str FA | FA

where str and FA are both indivisible components, and FA can be empty. A linear regex without any string is implemented as a single DFA or NFA. In practice, however, we find that 87% to 94% of the regex rules in IDSes have at least one extractable string, so a majority of real-world regexes would benefit from decomposition. The production rules imply that if we find the rightmost string in a linear regex pattern, we can recursively apply the same algorithm to decompose the rest of the pattern. One complication lies in a subregex with a repetition operator such as (R)?, (R)*, (R)+, and (R){m,n}, where R is arbitrarily complex. Hyperscan treats (R)? and (R)* as a single FA, since R is optional, while it converts (R)+ = (R)(R)* and (R){m,n} = (R)(R)···(R)(R){0,n−m} ((R) appears m times). Then it decomposes their prefixes and treats the suffix as an FA.
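The repetition rewriting can be sketched at the pattern-string level (a sketch with hypothetical helper names; Hyperscan actually performs this on the NFA graph):

```python
def rewrite_plus(R):
    # (R)+ -> (R)(R)*: one mandatory copy that can be decomposed,
    # followed by an optional tail kept as a single FA
    return '(%s)(%s)*' % (R, R)

def rewrite_bounded_repeat(R, m, n):
    # (R){m,n} -> m mandatory copies followed by (R){0,n-m}
    return '(%s)' % R * m + '(%s){0,%d}' % (R, n - m)

assert rewrite_plus('abc') == '(abc)(abc)*'
assert rewrite_bounded_repeat('abc', 2, 5) == '(abc)(abc)(abc){0,3}'
```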

In general, a decomposable linear regex can be expressed as FAn strn FAn-1 strn-1 ··· str2 FA1 str1 FA0. For any successful match of the original regex, all strings must be matched in the same order as they appear. Based on this observation, Hyperscan applies the following three rules for regex matching:

1. String matching is the first step. It scans the entire input to find all strs. Each successful match of a str may trigger the execution of its neighboring FA matching.

2. Each FA has its own switch for execution. It is off by default, except for the leftmost FA component.

3. For a generalized form like left FA str right, where left or right is an arbitrary set of decomposed components (possibly empty): only if all components of left are matched successfully is the switch of FA turned on. Only if str is matched successfully and the FA switch is on is FA matching executed. Finally, only if FA is matched successfully is the leftmost FA of right turned on.

Let us take one example regex, foo[^X]barY+, and consider two input cases. The regex pattern is decomposed into FA2 str2 FA1 str1 FA0, where FA2 = ε, str2 = foo, FA1 = [^X], str1 = bar, FA0 = Y+.

• Input = XfooZbarY: This is overall a successful match. First, the string matcher finds str2 (foo) and triggers matching of FA2 (ε), since the leftmost FA switch is always on. Then the switch of FA1 ([^X]) is turned on. After that, the matcher finds str1 (bar), which triggers matching of FA1 against Z, and its success turns on the switch of FA0 (Y+). Since FA0 is the rightmost FA, it is executed against the remaining input Y.

• Input = XfoZbarY: This is overall a failed match. First, the string matcher finds str1 (bar) and sees if it can trigger matching of FA1 ([^X]). It then figures out that the switch of FA1 is off, since str2 (foo) was not found and thus none of FA2 and str2 was a successful match. So the matching of regex FA1 terminates, ensuring no waste of CPU resources.

Our implementation keeps track of the input offsets of matched strings and the state of matching of individual components, which allows invoking the appropriate FA matching with a correct byte range of the input.

Regex with alternation: If a regex includes an alternation like (A|B), we expand the alternation into two regexes only if both A and B are decomposed into str and FA components (i.e., both are decomposable). If not, (A|B) is treated as a single FA. In case A or B itself has an alternation, we recursively apply the same rule until there is no alternation in their subregexes. Then each expanded regex becomes a linear regex, which benefits from the same rules for decomposition and matching as before.
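The switch-driven control flow of rules 1–3, applied to the example above, can be sketched in Python, with the re module standing in for the decomposed FA components (this is a simplification of ours, not Hyperscan's engine):

```python
import re

# Decomposition of foo[^X]barY+ (left to right):
# FA2 = empty, str2 = "foo", FA1 = [^X], str1 = "bar", FA0 = Y+
def match_decomposed(data):
    fa1_on = False            # switch of FA1: off until foo is matched
    foo_end = bar_end = -1
    # rule 1: string matching scans the whole input first and drives
    # the execution of the neighboring FA components
    for m in re.finditer(r'foo|bar', data):
        if m.group() == 'foo':
            fa1_on = True     # leftmost FA (empty) trivially matches
            foo_end = m.end()
        elif m.group() == 'bar' and fa1_on:
            # rules 2-3: FA1 ([^X]) runs only now, and only on the
            # bytes between the foo and bar matches
            if re.fullmatch(r'[^X]', data[foo_end:m.start()]):
                bar_end = m.end()
    if bar_end < 0:           # bar never seen, or FA1 switch was off
        return False
    # the rightmost FA (Y+) runs against the remaining input
    return re.match(r'Y+', data[bar_end:]) is not None

assert match_decomposed("XfooZbarY")
assert not match_decomposed("XfoZbarY")
```

For the failing input, bar is found but the FA1 switch is off, so no FA matching runs at all, mirroring the "no waste of CPU resources" property described above.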

Pattern matching with regex decomposition presents its main benefit: it minimizes the waste of CPU cycles on unnecessary FA matching, because FA matching is executed only when it is needed. Also, it increases the chance of fast DFA matching, as each decomposed FA is smaller and thus more likely to be converted into a DFA. In contrast, the prefiltering approach has to execute matching of the entire regex even when it is not needed (e.g., matching bar in the example above would trigger regex matching even if foo is not found), and regex matching must re-match the string already found in string matching. Furthermore, conversion of a whole regex into a single FA is not only more complex but often ends up with a slower NFA to avoid state explosion. In terms of correctness, pattern matching with regex decomposition produces the same result as the original regex, but we leave its formal proof as future work.

3.2 Rationale and Guidelines

In practice, performing regex decomposition on its textual form is often tricky, as some string segments might be hidden behind special regex syntax. We provide several such examples below.

• Character class (or character-set): b[il1]l\s{0,10} includes a character class that can be expanded to three strings (bil, bll, and b1l), while naïve textual extraction might find only 'b' and 'l'.

• Alternation: The alternation sequence in (.*\x2d(h|H)(t|T)(t|T)(p|P)) makes it harder to discover "http" sequences from textual extraction.

• Bounded repeats: From the perspective of text, the strings with a minimum length of 32 are hidden inside the bounded repeat [\x40-\x90]{32,}.

To reliably find these strings, we perform regex decomposition on the Glushkov NFA graph [27], which benefits from structural graph analyses. We describe useful guidelines for finding the strings effective for regex matching.

1. Find the string that would divide the graph into two subgraphs, with the start state in one subgraph and the set of accept states in the other. Matching such a string is a necessary condition for any successful match of the entire regex. If the start and accept states happen to be in the same subgraph, the corresponding FA will always have to run regardless of a string match.

2. Avoid short strings. Short strings are prone to match more frequently and are likely to trigger expensive FA matching that fails.[3]

3. Expand small character-sets[4] to multiple strings to facilitate decomposition. This would not only increase the chance of successful decomposition but also lead to a longer string if a character-set intercepts a string sequence (e.g., document[\x22\x27]object).

4. Avoid having too many strings. Having too many strings for matching would overload the matcher and degrade the entire performance. So it is important to find a small set of good strings effective for regex matching.

[3] Our current limit is 2 to 3 characters.
[4] Our current implementation treats a character-set that expands to 11 or fewer strings as a small character-set.

[Figure 3: Dominant path analysis. The string on the dominant path of the accept state (vertex 11) is extracted.]

3.3 Graph-based String Extraction

We develop three graph analysis techniques that discover strings on the critical path for matching. We describe the key idea of each algorithm below and provide more detailed algorithms in the appendix.

Dominant path analysis: A vertex u is called a dominator of vertex v if every path from the start state to vertex v must go through vertex u. A dominant path of vertex v is defined as a set of vertices W in a graph where each vertex in W is a dominator of v and the vertices form a trace of a single path. Dominant path analysis finds the longest common string that exists in all dominant paths of any accept state. For example, Figure 3 shows the string on the dominant path of the accept state (vertex 11).

The string selected by the analysis is highly desirable for matching, as it clearly divides start and accept states into two separate subgraphs, satisfying the first guideline. The algorithm calculates the dominant path per each accept state and finds the longest common prefix of all dominant paths. Then it extracts the string on the chosen path. If a vertex on the path is a small character-set, we expand it and obtain multiple strings.

Dominant region analysis: If the dominant path analysis fails to extract a string, we perform dominant region analysis. It finds a region of vertices that partitions the start state into one subgraph and all accept states into the other. More formally, a dominant region is defined as a subset of vertices in a graph such that (a) the set of all edges that enter and exit the region constitutes a cut-set of the graph, (b) for every in-edge (u, v) to the region, there exist edges (u, w) for all w in {w : w is in the region and w has an in-edge}, and (c) for every out-edge (u, v) from the region, there exist edges (w, v) for all w in {w : w is in the region and w has an out-edge}, where (u, v) refers to an edge from vertex u to v in the graph.

If a discovered region consists of only string or small character-set vertices, we transform the region into a set of strings. Since these strings connect two disjoint subgraphs of the original graph, any match of the whole regex must match one of these strings. Figure 4 shows one example of a dominant region with 9 vertices. Vertices 5, 6, and 7 are the entry points with the same predecessors.


[Figure 4: Dominant region analysis. A dominant region of 9 vertices partitions the graph; the strings foo, bar, and abc are extracted from it.]

[Figure 5: Network flow analysis. A min-cut of the graph yields the string set {foo, efgh, abc}.]

Vertices 11, 12, and 13 are the exits with the same successor. We can extract the strings foo, bar, and abc as a result of dominant region analysis.

The algorithm for dominant region analysis first creates a directed acyclic graph (DAG) from the original graph to avoid any interference from back edges. Then it performs a topological sort on the DAG and iterates over each vertex to see if it can be added to the current candidate region, i.e., whether the candidate's boundary edges form a valid cut-set. We repeat this to discover all regions in the graph. Since we only analyze the DAG, the back edges of the original graph might affect the correctness. Thus, for each back edge, if its source and target vertices are in different regions, we merge them (and all intervening regions) into a single region. Finally, we extract the strings from the dominant region.

Network flow analysis: Since dominant path and dominant region analyses depend on a special graph structure, they may not always be successful. Thus, we run network flow analysis for generic graphs. For each edge, the analysis finds a string (or multiple strings) that ends at the edge. The edge is then assigned a score inversely proportional to the length of the string(s) ending there: the longer the string, the smaller the score. With a score per edge, the analysis runs the "max-flow min-cut" algorithm [25] to find a minimum cut-set that splits the graph into two parts, separating the start state from all accept states. The "max-flow min-cut" algorithm thus discovers a cut-set of edges that generate the longest strings from this graph.

Figure 5 shows a result of network flow analysis, extracting a string set of foo, efgh, and abc that would divide the whole graph into two parts.

Effectiveness of graph analysis: Our graph analysis effectively produces good strings for most real-world rules. Table 1 shows that 97.2% to 99.2% of decomposable real-world regex rules benefit from dominant path analysis.

Ruleset  Total  Decomp  D-Path  D-Reg  N-flow
S-V      1663   1563    1551    32     16
S-E      7564   6756    6575    100    203
Suri     7430   6501    6318    94     201

Table 1: Effectiveness of graph analysis on real-world rulesets. S-V: Snort Talos (May 2015), S-E: Snort ET-Open 2.9.0, Suri: Suricata ruleset 4.0.4. D-Path, D-Reg, and N-flow refer to dominant path, dominant region, and network flow analysis, respectively. Decomp is the total number of decomposable rules. Note that one regex could benefit from multiple graph analyses, so the sum of the graph analysis columns is larger than the Decomp field.

[Figure 6: Two-stage matching with FDR. Extended shift-or matching scans the input stream for candidate matches, which are then verified with hashing-based exact matching against the string patterns.]

The remaining patterns exploit dominant region and network flow analysis. These strings prove to be highly beneficial for reducing the number of regex matching invocations, as shown in Section 6.2.

4 SIMD-accelerated Pattern Matching

In this section, we present the design of multi-string and FA matching algorithms that leverage the SIMD operations of modern CPUs.

4.1 Multi-string Pattern Matching

We introduce an efficient multi-string matcher called FDR.[5] The key idea of FDR is to quickly filter out innocent traffic with fast input scanning. As shown in Figure 6, FDR performs extended shift-or matching [13] to find candidate input strings that are likely to match some string pattern. Then it verifies them to confirm an exact match.

Shift-or matching: We first provide a brief background on shift-or matching, which serves as the base algorithm of FDR. The shift-or algorithm finds all occurrences of a string pattern in the input bytestream by performing bitwise shift and or operations, as shown in Figure 7. It uses two data structures: a shift-or mask for each character c in the symbol set (sh-mask('c')) and a state mask (st-mask) for the matching operation. sh-mask('c') zeros all bits whose bit position corresponds to the byte position of c in the string pattern, while all other bits are set to 1. The bit position in a sh-mask is counted from the rightmost bit, while the byte position in a pattern is counted from the leftmost byte.

[5] It is named after the 32nd President of the US.


[Figure 7: Classical shift-or matching. For the pattern aphp: sh-mask('a') = 11111110, sh-mask('p') = 11110101, sh-mask('h') = 11111011. Starting from st-mask = 11111111, each input character updates the state as mi = (mi-1 << 1) | sh-mask(input char), until the zero bit reaches the pattern-length position, signaling a match.]

[Figure 8: Example shift-or masks with two patterns (ab and cd) at bucket 0; no other buckets contain 'a', 'b', 'c', or 'd' in their patterns. Byte positions beyond the patterns are padding bytes that keep the bucket-0 bit at zero.]

For example, for the string pattern aphp, sh-mask('p') = 11110101, as 'p' appears at the second and the fourth positions in the pattern. If a character is unused in the pattern, all bits of its sh-mask are set to 1. The algorithm keeps an st-mask whose size is equal to the length of a sh-mask. Initially, all bits of the st-mask are set to 1. The algorithm updates the st-mask for each input character 'x' as st-mask = ((st-mask << 1) | sh-mask('x')). For each matching input character, a 0 is propagated to the left by one bit. If the zero reaches the bit position equal to the length of the pattern, it indicates that the pattern string is found.
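The classical algorithm is a few lines of Python over the 8-bit masks of Figure 7 (the function name is ours):

```python
def shift_or(pattern, text):
    # sh-mask('c'): bit i is 0 iff pattern[i] == c (8-bit masks as in Figure 7)
    all_ones = (1 << 8) - 1
    sh = {}
    for i, c in enumerate(pattern):
        sh[c] = sh.get(c, all_ones) & ~(1 << i) & all_ones
    done = 1 << (len(pattern) - 1)     # a 0 at this bit means a full match
    st = all_ones                      # state mask starts as all ones
    hits = []
    for pos, c in enumerate(text):
        st = ((st << 1) | sh.get(c, all_ones)) & all_ones
        if (st & done) == 0:
            hits.append(pos - len(pattern) + 1)   # match start offset
    return hits

assert shift_or("aphp", "xaphpz") == [1]
```

Each matching character propagates the zero one bit to the left, exactly as in the m1 through m4 sequence of Figure 7.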

The original shift-or algorithm runs fast with high space efficiency, but it has two limitations. First, it supports only a single string pattern. Second, although it consists of bit-level operations, the implementation cannot benefit from SIMD instructions except for very long patterns. We tackle these problems as follows.

Multi-string shift-or matching: To support multi-string patterns, we update our data structures. First, we divide the set of string patterns into n distinct buckets, where each bucket has an id from 0 to n−1. For now, assume that each string pattern belongs to one of the n buckets, as we will discuss how we divide the patterns later ('pattern grouping'). Second, we increase the size of the sh-mask and st-mask by n times, so that a group of n bits in sh-mask('x') records all the buckets that have 'x' in some of their patterns. More precisely, the k-th n bits of sh-mask('x') encode the ids of all buckets that hold at least one pattern which has 'x' at the k-th byte position. One difference from the original algorithm is that the byte position in a pattern is counted from the rightmost byte. This enables parallel execution of multiple CPU instructions per cycle, as explained later ('SIMD acceleration'). For efficient implementation, we

[Figure 9 graphic omitted: it depicts the input "aphp..." matched by OR-ing sh-mask('a'), sh-mask('p') ≪ 8, sh-mask('h') ≪ 16, and sh-mask('p') ≪ 24 into the st-mask, whose zero bits report matches for bucket 0 and bucket 4 at input byte position 3.]

Figure 9: FDR's extended shift-or matching.

set n to 8 so that the byte position in a sh-mask matches the same position in a pattern. This implies that the length of a sh-mask should be no smaller than that of the longest pattern.

Figure 8 shows an example. Bucket 0 has two string patterns, "ab" and "cd". Since 'a' appears at the second byte of only "ab" in bucket 0, sh-mask('a') zeros only the first bit (= bucket 0) of the second byte. The i-th bit within each byte of a sh-mask indicates bucket id i-1. If the byte position of a sh-mask exceeds the longest pattern of a certain bucket (a 'padding byte'), we encode the bucket id in the padding byte. This ensures matching correctness by carrying a match at a lower input byte along in the shift process. Examples of this are shown in the padding bytes in Figure 8 and in the sh-masks in Figure 9, which zero the first bit in the third and fourth bytes.
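A toy version of this bucketed mask construction, together with the matching step described next, might look as follows (illustrative Python; real FDR uses 128-bit masks and SIMD, and all names here are ours):

```python
NBUCKETS = 8
MASK_BYTES = 4  # toy mask width; FDR uses 128-bit masks

def build_sh_masks(buckets):
    """buckets[b] is the list of patterns in bucket b. In byte p of
    sh[c], bit b == 0 means bucket b may have a pattern whose character
    at distance p from the pattern's end is c (byte positions are
    counted from the rightmost byte)."""
    full = (1 << (8 * MASK_BYTES)) - 1
    sh = [full] * 256
    for b, patterns in enumerate(buckets):
        if not patterns:
            continue
        for pat in patterns:
            for d, c in enumerate(reversed(pat)):
                sh[ord(c)] &= ~(1 << (8 * d + b))
        # padding bytes: clear bucket b for every character at byte
        # positions past the bucket's longest pattern, so a match at a
        # lower input byte is carried along in the shift process
        longest = max(len(p) for p in patterns)
        for p in range(longest, MASK_BYTES):
            for c in range(256):
                sh[c] &= ~(1 << (8 * p + b))
    return sh

def candidates(sh, text, shortest):
    """OR the sh-mask of the k-th input character, shifted left by k
    bytes; zero bits of the result are (bucket, position) candidates."""
    full = (1 << (8 * MASK_BYTES)) - 1
    # byte positions smaller than the shortest pattern start as all 1s,
    # suppressing matches that cannot fit the shortest pattern
    st = (1 << (8 * (shortest - 1))) - 1
    for k, c in enumerate(text[:MASK_BYTES]):
        st |= (sh[ord(c)] << (8 * k)) & full
    return [(b, pos) for pos in range(MASK_BYTES)
            for b in range(NBUCKETS) if not st & (1 << (8 * pos + b))]

buckets = [["ab", "cd"]] + [[] for _ in range(7)]
sh = build_sh_masks(buckets)
print(candidates(sh, "xxab", shortest=2))  # -> [(0, 3)]
```

Each candidate still has to be confirmed by the verification stage, since "ad" or "cb" would also produce a zero bit for this bucket.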

The pattern matching process is similar to the original algorithm, except that the sh-masks are shifted left instead of the st-mask. The st-mask is initially 0 except for the byte positions smaller than the shortest pattern; this avoids a false-positive match at a position smaller than the shortest pattern. Now we proceed with input characters. The matcher keeps k, the number of characters processed so far modulo n. For an input character 'x', st-mask |= (sh-mask('x') ≪ k bytes). The matcher repeats this for n input characters and then checks whether the st-mask has any zero bits. Zero bits represent a possible match at the corresponding bucket. For example, Figure 9 shows that buckets 0 and 4 have a potential match at input byte position 3. The verification stage, illustrated later, checks whether they are a real match or a false positive.

Pattern grouping: The strategy for grouping patterns into each bucket affects matching performance. A good

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 637

strategy would distribute the patterns well such that most innocent traffic passes with a low false-positive rate. Toward this goal, we design our algorithm based on two guidelines. First, we group patterns of similar length into the same bucket. This minimizes the information loss of longer patterns, as the input characters match only up to the length of the shortest pattern in a bucket for matching correctness. Second, we avoid grouping too many short patterns into one bucket; in general, shorter patterns are more likely to increase false positives. To meet these requirements, we sort the patterns in ascending order of length and assign each pattern an id from 0 to (s-1) in the sorted order. Then we run the following algorithm, which calculates the minimum cost of grouping the patterns into n buckets using dynamic programming. The algorithm is summarized by the two equations below:

1. t[i][j] = min over k (i ≤ k ≤ s-1) of (cost(i,k) + t[k+1][j-1]), where s is the number of patterns and t[i][j] stores the minimum cost of grouping patterns i to (s-1) into (j+1) buckets.

2. cost(i,k) = (k - i + 1)^α / (length_i)^β, where cost(i,k) is the cost of grouping patterns i to k into one bucket, length_i is the length of pattern i, and α and β are constant parameters.
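These two equations translate directly into a short dynamic program. The sketch below (illustrative Python; only the parameter values and the 8-bucket target come from the text) computes t and replays the optimal splits to assign bucket ids:

```python
from functools import lru_cache

ALPHA, BETA = 1.05, 3  # constants used by the implementation

def group_patterns(patterns, nbuckets=8):
    """Assign patterns to nbuckets buckets, minimizing the sum of
    cost(i,k) = (k-i+1)**ALPHA / len(shortest pattern in group)**BETA
    over groups of consecutive (length-sorted) patterns."""
    pats = sorted(patterns, key=len)
    s = len(pats)

    def cost(i, k):  # cost of putting patterns i..k into one bucket
        return (k - i + 1) ** ALPHA / len(pats[i]) ** BETA

    @lru_cache(maxsize=None)
    def t(i, j):  # min cost of grouping patterns i..s-1 into j+1 buckets
        if i >= s:
            return 0.0
        if j == 0:
            return cost(i, s - 1)  # everything left goes into one bucket
        return min(cost(i, k) + t(k + 1, j - 1) for k in range(i, s))

    # recover the assignment by replaying the optimal choices
    assignment, i, j = {}, 0, nbuckets - 1
    while i < s:
        if j == 0:
            k = s - 1
        else:
            k = min(range(i, s), key=lambda k: cost(i, k) + t(k + 1, j - 1))
        for p in range(i, k + 1):
            assignment[pats[p]] = nbuckets - 1 - j
        i, j = k + 1, max(j - 1, 0)
    return assignment
```

Because the patterns are length-sorted, pattern i is the shortest in its group, so a group headed by a long pattern is cheap (encouraging more patterns per bucket) while a group headed by a short pattern is expensive (limiting how many short patterns share a bucket).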

t[i][j] is calculated as the minimum of the sum of the cost of grouping patterns i to k into one bucket (cost(i,k)) and the minimum cost of grouping the remaining patterns (k+1) to (s-1) into j buckets (t[k+1][j-1]). cost(i,k) gets smaller as the bucket has a longer pattern, which allows more patterns in the bucket; it gets larger as the bucket has a shorter pattern, limiting the number of such patterns. Our implementation currently uses α = 1.05 and β = 3 toward this goal, computes t[0][7] to divide all string patterns into 8 buckets, and records the bucket id of each pattern in the process. In practice, we find that the algorithm works well, automatically reaching the sweet spot that minimizes the total cost.

Super characters: One problem with bucket-based matching is that it produces false positives with patterns in the same bucket. For example, if a bucket has "ab" and "cd", the algorithm not only matches the correct patterns but also matches the false positives "ad" and "cb". To suppress them, we use an m-bit (m > 8) super character (instead of an 8-bit ASCII character) to build and index the sh-masks. An m-bit super character consists of a normal (8-bit) character in the lower 8 bits and the low-order (m-8) bits of the next character in the upper bits. If it is the last character of a pattern (or of the input), we use a null character (0) as the next character. The key idea is to reflect some information about the next character of a pattern into the sh-mask built for the current character.

Only if the same two characters appear in the input6 do we declare a match at that input byte position. This significantly reduces false positives at the cost of slightly larger memory for the sh-masks.
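A sketch of the super-character construction (illustrative Python with m = 12; the function name is ours):

```python
M = 12  # super-character width in bits; practical values are 9-15

def super_char(data, pos):
    """Combine the byte at pos (low 8 bits) with the low-order (M-8)
    bits of the following byte (upper bits). The byte after the last
    input character is treated as a null (0) character."""
    cur = data[pos]
    nxt = data[pos + 1] if pos + 1 < len(data) else 0
    return ((nxt & ((1 << (M - 8)) - 1)) << 8) | cur

# 'ab' and 'ad' now index different sh-masks at position 0:
print(hex(super_char(b"ab", 0)), hex(super_char(b"ad", 0)))  # 0x261 0x461
```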

In practice, m should be between 9 and 15. Let's say m = 12 bits. For a pattern "ab", we see two 12-bit super characters: α = ((low-order 4 bits of 'b' ≪ 8) | 'a') and β = 'b'. Then we build sh-masks for α and β, respectively. When input arrives, we construct a 12-bit super character based on the current input byte offset and use it as an index to fetch its sh-mask. We advance the input one byte at a time as before. For example, if the input is 'ad', the matcher first constructs γ = ((low-order 4 bits of 'd' ≪ 8) | 'a'), fetches sh-mask(γ), and performs shift and or operations as before. Then it advances to the next byte and constructs δ = 'd'. So the input 'ad' will not match even if a bucket contains "ab" and "cd".

SIMD acceleration: Our implementation of FDR heavily exploits SIMD operations and instruction-level parallelism. First, it uses 128-bit sh-masks so that it can employ 128-bit SIMD instructions (e.g., pslldq for left shift and por for or in the Intel x86-64 instruction set) to update the masks. As shift and or are the most frequent operations in FDR, it enjoys a substantial performance boost from the SIMD instructions. Second, it exploits parallel execution with multiple CPU execution ports. In the original shift-or matching, the execution of the shift and or operations is completely serialized, as each depends on the previous result. This under-utilizes a modern CPU, even one that can issue multiple instructions per cycle. In contrast, FDR exploits instruction-level parallelism by pre-shifting the sh-masks of multiple input characters in parallel. Note that this is made possible because we count the byte position differently from the original version. The parallel execution effectively increases instructions per cycle (IPC) and significantly improves performance. To accommodate the parallel shifting, we limit the length of a pattern to 8 bytes and extract the lower 8 bytes from any pattern longer than 8 bytes. Because this requires at minimum 8-byte masks and up to 7 bytes of shifting, a 128-bit mask does not lose high-bit information during a left shift. In actual matching, FDR handles 8 bytes of input at a time. To guarantee contiguous matching across 8-byte input boundaries, the st-mask of the previous iteration is shifted right by 8 bytes for the next iteration.

Verification: As our shift-or matching can still generate false positives, we need to verify whether a candidate match is

6 Of course, there is still a small chance of a false positive, as we use partial bits of the next character, but the probability becomes fairly small as the pattern length grows.


Algorithm 1: FDR Multi-string Matcher
1: function MATCH
2:   n = number of bits of a super character
3:   R = startMask
4:   for each 8-byte V ∈ input do
5:     for i ∈ 0..7 do
6:       index = V[i*8 .. i*8+n-1]
7:       M[i] = shiftOrMask[index]
8:       S[i] = LSHIFT(M[i], i)
9:     end for
10:     for i ∈ 0..7 do
11:       R = OR128(R, S[i])
12:     end for
13:     for zero bit b in low 8 bytes of R do
14:       j = the byte position containing bit b
15:       l = length of string in bucket b
16:       h = HASH(V[j-l+1 .. j])
17:       if h is valid then
18:         Perform exact matching for each
19:         string in hash_bucket[h]
20:       end if
21:     end for
22:     R = RSHIFT(R, 8)
23:   end for
24: end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. Then we leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find that hashing filters out a large portion of the false positives.
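The verification stage can be sketched as follows (illustrative Python; the per-bucket tables mirror the description above, and all names are ours):

```python
def build_tables(buckets):
    """One hash table per bucket to minimize collisions: pattern -> id."""
    return [{pat: pid for pid, pat in enumerate(pats)} for pats in buckets]

def verify(tables, text, bucket, end_pos, lengths):
    """Confirm a candidate (bucket, end position) reported by shift-or
    matching, by exact comparison against the strings in that bucket."""
    for l in lengths[bucket]:                 # pattern lengths in bucket
        cand = text[end_pos - l + 1 : end_pos + 1]
        if cand in tables[bucket]:            # hash lookup + exact compare
            return cand
    return None

buckets = [["ab", "cd"], ["xyz"]]
tables = build_tables(buckets)
lengths = [sorted({len(p) for p in pats}) for pats in buckets]
print(verify(tables, "xxab", 0, 3, lengths))  # -> 'ab'
```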

4.2 Finite Automata Pattern Matching

Successful string matching often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold,7 we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, there can be multiple current states (the current set) that are active at any time. A state transition with an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match is successful if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation as it outperforms traditional NFA representations, which use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. The bit-based NFA also leverages SIMD instructions to perform vectorized bit

7 We use 16,384 states as the threshold.

operations to further accelerate performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n-1 in topological order, and maintains the current set as a bit mask (the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are assigned sequentially in topological order, the span of a state transition is typically small. We exploit this fact to represent the transitions compactly, as described below.

The bit-based NFA implements a state transition on an input character 'c' in three steps. First, it calculates the successor set that can be transitioned to from any state in the current set with any input character. Second, it computes the set of all states that can be transitioned to from any state on 'c' (the states reachable by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, as a Glushkov NFA graph guarantees that one can enter a specific state only with the same character or the same character set.

The challenge is to calculate the successor set efficiently. One could pre-calculate a successor set for every combination of current states and look up the successor set for the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable except for small n. An alternative is to keep a successor-set mask per individual state and to combine the successor sets of all states in the current set. This is more space-efficient, but it costs up to n memory lookups and (n-1) or operations. We implement the latter but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some examples of shift-k masks. The shift-1 mask sets every bit to 1 except for bit 7, since every state except state 7 has a forward transition of span 1.

We divide each state transition into two types: typical or exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set from the current set following typical transitions. If the current-set mask is S, then ((S & shift-k mask) ≪ k) represents all possible successor states reached from S by transitions of span k. If we combine the successor sets for all k, we obtain the successor set reached by all typical transitions.
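This successor computation can be sketched over plain Python integers (Hyperscan uses 128-bit SIMD registers; the exception-mask handling here anticipates the exceptional transitions discussed below, and all names are illustrative):

```python
def typical_successors(current, shift_masks):
    """shift_masks[k] records states with a forward transition of span
    k. OR together ((current & shift_masks[k]) << k) for every k up to
    the shift limit to get all successors via typical transitions."""
    succ = 0
    for k, mask in enumerate(shift_masks):
        succ |= (current & mask) << k
    return succ

def nfa_step(current, shift_masks, succ_masks, ex_mask, reach_c):
    """One transition of the bit-based NFA on character c: successors
    (typical + exceptional) intersected with the states reachable by c.
    succ_masks maps an exceptional state id to its successor mask."""
    succ = typical_successors(current, shift_masks)
    ex = current & ex_mask
    while ex:                         # iterate over active exceptional states
        bit = ex & -ex                # lowest set bit
        succ |= succ_masks[bit.bit_length() - 1]
        ex ^= bit
    return succ & reach_c
```

For example, with shift_masks = [0, 0b011] (states 0 and 1 have span-1 forward transitions), stepping from state 0 on a character whose reach mask is 0b010 activates state 1.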


[Figure 10 graphic omitted: it shows the 8-state Glushkov NFA with edges labeled by id spans, the shift-0/shift-1/shift-2 masks for typical transitions, the exception mask, and the successor masks for the exceptional states 0, 2, and 4.]

Figure 10: NFA representation for (AB|CD)*AFF*.

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and any backward transitions.8 Any state that has at least one exceptional transition keeps its own successor mask, which records all states reached by the exceptional transitions of its owner state. All exceptional states are maintained in an exception mask.

The choice of the shift limit affects performance. If it is too large, we have too many shift-k masks representing rare transitions; if it is too small, we have to handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of their state ids. States 0, 2, and 4 are highlighted, as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each of them points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). Then it fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise and operation with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state. Otherwise, it proceeds with the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exceptional states. Our implemen-

8 In our implementation, forward transitions that cross the 64-bit boundary of the state id space (e.g., from an id smaller than 64 to an id larger than 64) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2: Bit-based NFA Matching
1: SH_MSKS[i]: shift-i masks for typical transitions
2: SUCC_MSKS[i]: successor mask for state i
3: EX_MSK: exception mask
4: reach[k]: reachable state set for character k
5: function RUNNFA(S: current active state set)
6:   SUCC_TYP = 0; SUCC_EX = 0
7:   for c ∈ input do
8:     if any state is active in S then
9:       for i = 0 to shiftlimit do
10:        R0 = AND(S, SH_MSKS[i])
11:        R1 = LSHIFT(R0, i)
12:        SUCC_TYP = OR(SUCC_TYP, R1)
13:      end for
14:      S_EX = AND(S, EX_MSK)
15:      for active state s in S_EX do
16:        SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
17:      end for
18:      SUCC = OR(SUCC_TYP, SUCC_EX)
19:      S = AND(SUCC, reach[c])
20:      Report accept states in S
21:    end if
22:  end for
23: end function

Total Size:             1 GByte
Number of Packets:      818,682
Number of TCP Packets:  818,520
Percent of TCP Bytes:   97.2%
Percent of HTTP Bytes:  92.9%
Average Packet Size:    1,265 bytes

Table 2: HTTP traffic trace from a cloud service provider.

tation uses a 128-bit mask (and extends it up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough to represent the NFA graphs of most regexes.

5 Implementation

Hyperscan consists of compile-time and run-time functions. Compile-time functions include regex-to-NFA graph conversion, graph decomposition, and matching-component generation. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:

• Small string-set (<80) matching: This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction, applied over a small number (2-8) of 4-bit regions in the suffix of each string.

• NFA and DFA cyclic state acceleration: Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where


# of regexes | Prefilter   | Hyperscan | Reduction
500          |   2,971,652 |   645,326 |    4.6x
1000         |  93,595,304 |   714,582 |  131.0x
1500         | 110,122,972 |   791,017 |  139.2x
2000         | 139,804,519 |   780,665 |  179.1x
2500         | 156,332,187 |   857,100 |  182.4x

Table 3: Regex matching invocations for Snort's ET-Open ruleset.

the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration for searching for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states or switches off one of the current states.

• Small-size DFA matching: We design a SIMD-based algorithm for small DFAs (<16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

• Anchored pattern matching: When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the corresponding automata are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching: We design a fast lookaround approach that peeks at inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s{4,5}foo, where R is a complex regex, we can first detect with SIMD checks whether the digit and space character classes have matched, and if not, avoid or defer running a potentially expensive FA associated with R.

6 Evaluation

In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than manual choice? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup

We use a server machine with an Intel Xeon Platinum 8180 CPU at 2.50 GHz and 48 GB of memory and compile the

# of regexes | Prefilter  | Hyperscan | Reduction
700          |    699,622 |    18,164 |   38.5x
850          |  7,516,464 |    19,214 |  391.2x
1000         | 17,063,344 |    32,533 |  524.5x
1150         | 17,737,814 |    34,075 |  520.6x
1300         | 25,143,574 |    36,040 |  697.7x

Table 4: Regex matching invocations for Snort's Talos ruleset.

code with GCC 5.4. To separate out the impact of network IO, we evaluate the performance on a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

6.2 Effectiveness of Regex Decomposition

The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace and confirm correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when it is necessary.

6.3 Microbenchmarks

We evaluate the performance of FDR, our multi-string matcher, as well as that of regex matching in Hyperscan.

Multi-string pattern matching: We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns ex-


Ruleset | PCRE   | PCRE2 | RE2-s | Hyperscan-s | RE2-m | Hyperscan-m
Talos   |  694.2 |  39.4 | 177.7 |        17.3 |   2.9 |        2.15
ET-Open | 1280.0 |  91.3 | 469.6 |        51.6 | 111.6 |        13.3

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1,300 regexes) and Suricata (2,800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure 11 graphics omitted: bar charts of FDR and DFC throughput (Gbps) and of the improvement over DFC and AC, for 1k-26k string patterns. (a) ET-Open ruleset; (b) Talos ruleset.]

Figure 11: String matching performance with random packets.

[Figure 12 graphics omitted: bar charts of FDR and DFC throughput (Gbps) and of the improvement over DFC and AC, for 1k-26k string patterns. (a) ET-Open ruleset; (b) Talos ruleset.]

Figure 12: String matching performance with a real traffic trace.

tracted from the Snort ET-Open and Talos rulesets. Figures 11 and 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate it with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, the performance becomes comparable to DFC due to increased cache usage, but is still much better than AC.

Regex matching: We now evaluate the performance of regex matching in Hyperscan. We compare against PCRE (v8.41) [6], as it is the most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (e.g.,

match against one regex at a time), which requires passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1,300 regexes from the Snort Talos ruleset and 2,800 regexes from the Suricata ruleset.

Table 5 shows the results. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 1.35x faster than RE2-m, while it shows an 18.33x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both on a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort but replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we would expect an even higher performance improvement if Snort were restructured to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons

Hyperscan has been developed since 2008 and was first open-sourced in 2013. It has been successfully adopted by over 40 commercial projects globally, and it is in production use on tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed for C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience in developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan

Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but was also simple for applications to adopt.

Version 1.0: The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). It also suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support for a broad range of regexes that would suffer from combinatorial explosions if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0: The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved toward a Glushkov NFA-based internal representation (the NFA graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax

tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data, with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned for the one or more strings used by each pattern in a string-matching pre-pass, followed by a series of NFA or DFA engines that ran over the input only when the required strings were found. This approach avoids the high risk of the potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.

Unfortunately, version 2.0 still had a number of limitations. First, we observed an adverse performance impact from prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines, even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of large regexes that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naively running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns foo, abc[xyz], and abc[xy]*def|abcz*de[fg] might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern abc[xyz], then provide matches for the two parts

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 643

of the alternation, /abc[xy]*def/ and /abcz*de[fg]/, without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0. (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel to the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0. Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identify the strings separated by various common regex constructs such as .* or X+ for some character class X. It is frequently the case that strings in regexes are separated by the construct .* (i.e., /foo.*bars/). This is usually implementable by requiring only that foo is seen before bars (usually, but not always: consider the expression /foo.*oars/). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

Version 5.0. Version 5.0, which is the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology of malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns beyond matching a single pattern. To

support this, the system now allows user-defined AND, OR, and NOT combinations over their patterns. For example, an AND operation between patterns foobar and teakettle requires that both patterns are matched for an input before reporting a match. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE that brings the benefit of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. Hyperscan's lack of support for full PCRE syntax (such as capturing and back-references) had made it difficult to completely replace PCRE in adopted solutions. Chimera employs Hyperscan as a fast filter for input and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
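Chimera's filter-then-confirm flow can be sketched as follows. This is a toy illustration, not the Chimera API: a plain substring scan stands in for Hyperscan's role as the fast filter, Python's re module stands in for PCRE, and the function name and patterns are our own.

```python
import re

def chimera_scan(patterns, data):
    """Two-stage scan: cheap literal prefilter, then exact regex confirmation.

    patterns: list of (literal, regex) pairs, where the literal is a string
    that must appear in any match of the full regex.
    """
    matches = []
    for pat_id, (literal, regex) in enumerate(patterns):
        if literal in data:                       # stage 1: fast filter
            for m in re.finditer(regex, data):    # stage 2: exact confirm
                matches.append((pat_id, m.start(), m.end()))
    return matches

# A back-reference pattern: unsupported by Hyperscan alone, so in Chimera
# it would be confirmed by PCRE after the fast filter fires.
pats = [("abc", r"(abc)x+\1")]
print(chimera_scan(pats, "zzabcxxabczz"))  # [(0, 2, 10)]
print(chimera_scan(pats, "zzabxxabczz"))   # [] (filter fires, confirm fails)
```

Note how the second input contains the literal "abc", so the filter passes, but the exact engine rejects it; this is precisely the redundant-confirmation cost that Chimera accepts only for patterns Hyperscan cannot handle natively.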

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes), and that the support of regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimal Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with additional powers of support for large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series were, over time, not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' the parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open sourced for wider use, and it is generally recognized as a state-of-the-art regex matcher adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback by the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.

[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333–340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74–82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors and Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248–264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1–53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317–328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
function DominantPathAnalysis(G)
    dpath = ∅
    for v ∈ accepts do
        calculate dominant path p[v] for v
        if dpath = ∅ then
            dpath = p[v]
        else
            dpath = common_prefix(dpath, p[v])
            if dpath = ∅ then
                return null_string
            end if
        end if
        strings = expand_and_extract(dpath)
    end for
    return strings
end function

The dominant path analysis algorithm finds the dominant path p[v] for every accept state v and computes the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
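A minimal sketch of the dominator computation behind Algorithm 3, on a hand-built graph of our own: an iterative data-flow pass computes each vertex's dominators, the accept state's dominators are ordered into a path, and the longest run of literal labels on that path is extracted. (Real Hyperscan works on the Glushkov NFA and takes the common prefix across all accept states; the graph, labels, and helper name here are ours.)

```python
def dominant_path_string(edges, labels, start, accept):
    """Find the longest literal run on the accept state's dominant path.

    edges: {v: set of successors}; labels[v] is the symbol on which v is
    entered (Glushkov property), or absent for start/wildcard vertices.
    """
    verts = set(edges) | {w for ss in edges.values() for w in ss}
    preds = {v: {u for u in verts if v in edges.get(u, set())} for v in verts}
    dom = {v: set(verts) for v in verts}       # iterative dominator data-flow
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for v in verts - {start}:
            ps = preds[v]
            new = (set.intersection(*(dom[p] for p in ps)) if ps else set()) | {v}
            if new != dom[v]:
                dom[v], changed = new, True
    # Dominators of a vertex form a chain, so sorting by dominator-set size
    # orders them along the dominant path.
    path = sorted(dom[accept], key=lambda v: len(dom[v]))
    best = run = ""
    for v in path:                             # longest run of literal labels
        run = run + labels[v] if labels.get(v) else ""
        best = max(best, run, key=len)
    return best

# Graph shaped like /abc.de/: 0 -> 1(a) -> 2(b) -> 3(c) -> 4(any) -> 5(d) -> 6(e)
edges = {0: {1}, 1: {2}, 2: {3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
labels = {1: "a", 2: "b", 3: "c", 5: "d", 6: "e"}
print(dominant_path_string(edges, labels, 0, 6))  # abc
```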

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
function DominantRegionAnalysis(G)
    acyclic_g = build_acyclic(G)
    Gt = build_topology_order(acyclic_g)
    candidate = {q0}
    it = begin(Gt)
    while it ≠ end(Gt) do
        if isValidCut(candidate) then
            setRegion(candidate)
            initializeCandidate(candidate)
        else
            addToCandidate(it)
            it = it + 1
        end if
    end while
    setRegion(candidate)
    merge regions connected with back edges
    strings = expand_and_extract(regions)
    return strings
end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set


and sees if the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
function NetworkFlowAnalysis(G)
    for edge ∈ E do
        strings = find_strings(edge)
        scoreEdge(edge, strings)
    end for
    cuts = MinCut(G)
    strings = extract and expand strings from cuts
    return strings
end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of the string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm therefore finds a cut whose edges carry the longest strings.
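A minimal sketch of this analysis on a toy graph of our own: each edge carries a candidate string and a score inverse to its length (the exact scoring function is our assumption), and an Edmonds-Karp max-flow [25] yields a min cut that lands on the longest strings.

```python
from collections import deque

def min_cut_strings(edge_strings, s, t):
    """Score edges inversely to string length, run Edmonds-Karp max-flow,
    and return the strings on the resulting min cut.

    edge_strings: {(u, v): string}; s, t: source (start) and sink (accept).
    """
    res, adj = {}, {}
    for (u, v), string in edge_strings.items():
        res[(u, v)] = max(1, 16 // len(string))   # inverse-length score
        res.setdefault((v, u), 0)                 # residual back edge
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    while True:                                   # Edmonds-Karp main loop
        parent, q = {s: None}, deque([s])
        while q:                                  # BFS for a shortest augmenting path
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res[e] for e in path)           # bottleneck capacity
        for u, v in path:
            res[(u, v)] -= aug
            res[(v, u)] += aug
    reach, q = {s}, deque([s])                    # source side of the min cut
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in reach and res[(u, v)] > 0:
                reach.add(v)
                q.append(v)
    return {st for (u, v), st in edge_strings.items()
            if u in reach and v not in reach}

# Toy graph: short strings near the start, long strings near the accept.
edges = {("s", "u"): "x", ("u", "t"): "abcdefgh",
         ("s", "w"): "y", ("w", "t"): "wxyz"}
print(sorted(min_cut_strings(edges, "s", "t")))  # ['abcdefgh', 'wxyz']
```

Because the long strings received the smallest scores, the cut crosses them rather than the cheap single-character edges, matching the guideline of preferring long, selective strings.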



suffers from the DMA overhead of moving data between the CPU and the FPGA device. This overhead can impose prohibitive latency, especially when input data is not large (as would be the case for scanning of small packets).

We use the Glushkov NFA [27] for Hyperscan, which is widely used due to its two useful properties. First, it does not have epsilon transitions, which simplifies the state transition implementation. Second, all transitions into a given state are triggered by the same input symbol. Figure 1 shows an example of a Glushkov NFA. Each circle represents a state whose id is shown as a number, and each character represents the input symbol by which any previous state transitions into this state. For example, one can transition into state 8 from state 3, 6, or 7, only on an input symbol 'g'. The second property implies that the total number of states is bounded by the number of input symbols, plus a few special states (a start state, etc.).

A DFA achieves high performance as it runs in O(1) per input character. Its main disadvantage, however, is a large memory footprint and the potential of state explosion when transforming an NFA into a DFA. Thus, most works on DFAs focus on memory footprint reduction [15, 17, 18, 20, 26, 31, 32, 33]. D2FA [32] compresses the memory space by sharing multiple transitions of states with a similar transition table and by establishing a default transition between them. A-DFA [18] presents useful features such as alphabet reduction, which classifies alphabets into smaller groups; indirect addressing, which reduces memory bandwidth by translating unique state identifiers to memory addresses; and a multi-stride structure that processes multiple characters at a time.

An extended FA is a proposal that restructures the state-of-the-art FA to address state explosion. XFA [38, 39] associates update functions with states and transitions by having a scratch memory that compresses the space. HFA [16] presents a hybrid-FA that achieves comparable space to that of an NFA by making head DFAs and trailing NFAs or DFAs. The theory behind it is to discover boundary states so that one can conduct a partial conversion of an NFA to a DFA, avoiding the exponential state explosion of a full conversion.

Prefilter-based approaches are the most popular way to scale the performance of regex matching in practice. Both Snort and Suricata extract string keywords from regex rules and perform unified multi-string matching with the Aho-Corasick algorithm. Expensive regex matching is only needed if AC detects literal strings in the input. SplitScreen [36] applies a similar approach to ClamAV [30], a widely-used anti-malware application, and achieves a 2x speedup compared to the original ClamAV.
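The two Glushkov properties above make a set-based NFA simulation particularly simple: since every transition into a state is labeled by that state's single symbol, a step only needs the follow sets and per-state labels. A hand-built sketch for /ab(c|d)/ (our own example, not the paper's Figure 1):

```python
def glushkov_match(follow, label, accepts, text):
    """Simulate a Glushkov NFA (anchored, whole-string match).

    follow[s] is the set of states reachable from s; label[t] is the one
    input symbol on which every transition into t fires.
    """
    cur = {0}                      # state 0 is the start state
    for ch in text:
        # Follow the edges, then keep only the states whose label matches.
        cur = {t for s in cur for t in follow.get(s, ()) if label[t] == ch}
        if not cur:
            return False
    return bool(cur & accepts)

# Hand-built Glushkov NFA for /ab(c|d)/ (positions: 1:a, 2:b, 3:c, 4:d).
follow = {0: {1}, 1: {2}, 2: {3, 4}}
label = {1: "a", 2: "b", 3: "c", 4: "d"}
print(glushkov_match(follow, label, {3, 4}, "abd"))  # True
print(glushkov_match(follow, label, {3, 4}, "abx"))  # False
```

Note that the state count equals the number of literal positions plus the start state, which is the boundedness property mentioned above.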

[Figure 2: Typical two-operand SIMD operation. Each lane i of the result register Z holds Xi OP Yi, computed independently across the lanes of the two source registers X and Y.]

A SIMD instruction executes the same operation on multiple data lanes in parallel. As shown in Figure 2, a SIMD operation is performed on multiple lanes of two SIMD registers independently, and the results are stored in a third register. Modern CPUs support a number of SIMD instructions that work on specialized vector registers (SSE, AVX, etc.). The latest AVX512 instructions operate on up to 512 bits simultaneously.
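The lane-wise semantics of Figure 2 can be mimicked in plain Python (a simulation only; a real SIMD instruction such as SSE2's _mm_add_epi32 performs all four lanes in a single instruction):

```python
def simd_op(x, y, op):
    # Apply op to each pair of lanes independently and store the
    # results in a third vector, as in Figure 2.
    assert len(x) == len(y)
    return [op(a, b) for a, b in zip(x, y)]

# Four 32-bit lanes, lane-wise add (what _mm_add_epi32 does in hardware).
print(simd_op([1, 2, 3, 4], [10, 20, 30, 40],
              lambda a, b: (a + b) & 0xFFFFFFFF))  # [11, 22, 33, 44]
```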

Despite their great potential, few research works have exploited SIMD instructions for regex matching. Sitaridi et al. propose a SIMD-based regex matching design [37] for databases, which uses a gather instruction to traverse a DFA for multiple inputs simultaneously. However, it cannot be applied to our case, as regex matching for DPI is typically performed on a single input stream.

Summary and our approach. Most prior works on regex matching attempt to build a special FA that performs as well as a DFA while its memory footprint is as small as an NFA's. However, one common problem in all these works is that FA restructuring inevitably imposes extra performance overhead compared to the original DFA. For example, XFA takes multiple tens to hundreds of CPU cycles per input byte, which is slower than a normal DFA by one or two orders of magnitude. In contrast, the prefilter-based approach looks attractive, as it benefits from multi-string matching most of the time, which is faster than multi-regex matching by a few orders of magnitude. However, it is still suboptimal, as it must perform duplicate string matching during regex matching, and a wrong choice of string patterns triggers redundant regex matching (as shown in Section 6.2). To avoid this inefficiency, we take a fresh approach that divides a regex pattern into multiple components and leverages fast string matching to coordinate the order of component matching. This minimizes the waste of CPU cycles on redundant matching and thus improves performance. In addition, we develop our own multi-string matching and FA matching algorithms, carefully tailored to exploit SIMD operations.

3 Regular Expression Decomposition

In this section, we present the concept of regex decomposition and explain how Hyperscan matches a sequence of regex components against the input. Then we introduce


graph-based decomposition, whose graph analysis techniques reliably identify the strings in regex patterns most desirable for string matching.

3.1 Matching with Regex Decomposition

The key idea of Hyperscan is to decompose each regex pattern into a disjoint set of string and subregex (or FA) components, and to match each component until it finds a complete match. A string component consists of a stream of literals (or input symbols). Subregex components are the remaining parts of a regex after all string components are removed. They may include one or more meta-characters or quantifiers in regex syntax (like '^', '$', etc.) that need to be translated into an FA for matching. Thus, we refer to them as FA components.

Linear regex. We start with a simple regex where each component is concatenated without alternation. We call it a linear regex. Formally, a linear regex pattern that contains at least one string can be represented by the following production rules:

1. regex → left str FA
2. left → left str FA | FA

where str and FA are both indivisible components, and FA can be empty. A linear regex without any string is implemented as a single DFA or NFA. In practice, however, we find that 87% to 94% of the regex rules in IDSes have at least one extractable string, so a majority of real-world regexes would benefit from decomposition. The production rules imply that if we find the rightmost string in a linear regex pattern, we can recursively apply the same algorithm to decompose the rest of the pattern. One complication lies in a subregex with a repetition operator, such as (R)?, (R)*, (R)+, and (R){m,n}, where R is arbitrarily complex. Hyperscan treats (R)? and (R)* as a single FA since R is optional, while it converts (R)+ = (R)(R)* and (R){m,n} = (R)···(R)(R){0,n−m} ((R) appears m times). Then it decomposes their prefixes and treats the suffix as an FA.
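The repetition rewriting above can be sketched as a small helper. This is a toy rewriting on pattern text (real Hyperscan operates on the NFA graph, and the helper names are ours): the m mandatory copies become a decomposable prefix, while the optional tail stays a single FA component.

```python
def rewrite_plus(R):
    # (R)+  ->  (R)(R)* : one mandatory copy, then an optional FA tail.
    return "(%s)(%s)*" % (R, R)

def rewrite_repeat(R, m, n):
    # (R){m,n}  ->  m mandatory copies followed by (R){0,n-m}.
    return "(%s)" % R * m + "(%s){0,%d}" % (R, n - m)

print(rewrite_plus("ab"))        # (ab)(ab)*
print(rewrite_repeat("ab", 2, 5))  # (ab)(ab)(ab){0,3}
```

Both rewrites preserve the matched language, so any string extracted from the mandatory prefix remains a necessary condition for a match of the original repetition.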

In general, a decomposable linear regex can be expressed as FAn strn FAn−1 strn−1 ··· str2 FA1 str1 FA0. For any successful match of the original regex, all strings must be matched in the same order as they appear. Based on this observation, Hyperscan applies the following three rules for regex matching:

1. String matching is the first step. It scans the entire input to find all strs. Each successful match of a str may trigger the execution of its neighboring FA matching.

2. Each FA has its own switch for execution. It is off by default, except for the leftmost FA component.

3. Consider a generalized form, left FA str right, where left or right is an arbitrary sequence of decomposed components, possibly empty. Only if all components of left are matched successfully is the switch of FA turned on. Only if str is matched successfully and the FA switch is on is FA matching executed. Finally, only if FA is matched successfully is the leftmost FA of right turned on.

Let's take one example regex, /.*foo[^X]barY+/,

and consider two input cases. The regex pattern is decomposed into FA2 str2 FA1 str1 FA0, where FA2 = .*, str2 = foo, FA1 = [^X], str1 = bar, and FA0 = Y+.

• Input = XfooZbarY: This is overall a successful match. First, the string matcher finds str2 (foo) and triggers matching of FA2 (.*) against X, since the leftmost FA switch is always on. Then the switch of FA1 ([^X]) is turned on. After that, the matcher finds str1 (bar), which triggers matching of FA1 against Z, and its success turns on the switch of FA0 (Y+). Since FA0 is the rightmost FA, it is executed against the remaining input, Y.

• Input = XfoZbarY: This is overall a failed match. First, the string matcher finds str1 (bar) and sees if it can trigger matching of FA1 ([^X]). It figures out that the switch of FA1 is off, since str2 (foo) was not found and thus neither FA2 nor str2 was successfully matched. So the matching of the regex terminates at FA1, ensuring no waste of CPU resources.

Our implementation keeps track of the input offsets of matched strings and the state of matching individual components, which allows invoking the appropriate FA matching with the correct byte range of the input.

Regex with alternation. If a regex includes an alternation like (A|B), we expand the alternation into two regexes only if both A and B can be decomposed into str and FA components (decomposable). If not, (A|B) is treated as a single FA. In case A or B itself has an alternation, we recursively apply the same rule until there is no alternation in their subregexes. Each expanded regex then becomes a linear regex, which benefits from the same rules for decomposition and matching as before.
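The three matching rules and the worked example above can be simulated in a short sketch. This is our own simplification, not Hyperscan's implementation: it considers only the first occurrence of each string, uses Python's re for the FA components, and folds the per-FA switches into a single left-to-right flag.

```python
import re

def match_decomposed(components, data):
    """Simulate switch-based matching of a decomposition
    [FAn, strn, ..., FA1, str1, FA0] (FAs are regexes, strs are literals).

    `on` holds the switch of the next FA: it starts on for the leftmost FA
    and is set by the success of the components to its left (rule 3).
    """
    pos = 0
    on = True                        # leftmost FA's switch is always on
    for i in range(0, len(components) - 1, 2):
        fa, s = components[i], components[i + 1]
        hit = data.find(s, pos)      # simplified: first occurrence only
        if hit < 0 or not on:        # str missing, or this FA's switch is off
            return False
        # Run the FA against the bytes between the previous match and str.
        on = re.fullmatch(fa, data[pos:hit]) is not None
        pos = hit + len(s)
    # The rightmost FA runs against the remaining input if its switch is on.
    return on and re.fullmatch(components[-1], data[pos:]) is not None

decomp = [r".*", "foo", r"[^X]", "bar", r"Y+"]
print(match_decomposed(decomp, "XfooZbarY"))  # True
print(match_decomposed(decomp, "XfoZbarY"))   # False
```

The second input fails cheaply: once foo is missing, no FA to the right of it is ever executed, which is the source of the CPU savings described above.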

Pattern matching with regex decomposition brings its main benefit: it minimizes the waste of CPU cycles on unnecessary FA matching, because FA matching is executed only when it is needed. It also increases the chance of fast DFA matching, as each decomposed FA is smaller and thus more likely to be converted into a DFA. In contrast, the prefiltering approach has to execute matching of the entire regex even when it is not needed (e.g., matching bar in the example above would trigger regex matching even if foo is not found), and regex matching must re-match the string already found in string


matching. Furthermore, conversion of a whole regex into a single FA is not only more complex but often ends up with a slower NFA to avoid state explosion. In terms of correctness, pattern matching with regex decomposition produces the same result as the original regex, but we leave its formal proof as future work.

3.2 Rationale and Guidelines

In practice, performing regex decomposition on its textual form is often tricky, as some string segments might be hidden behind special regex syntax. We provide several such examples below.

• Character class (or character-set): /b[il1]l\s{0,10}/ includes a character class that can be expanded to three strings (bil, bll, and b1l), while naïve textual extraction might find only 'b' and 'l'.

• Alternation: The alternation sequence in /(.*\x2d(h|H)(t|T)(t|T)(p|P))/ makes it harder to discover 'http' sequences by textual extraction.

• Bounded repeats: From the perspective of text, the strings with a minimum length of 32 are hidden in the bounded repeats of /[\x40-\x90]{32,}/.
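The small character-set expansion from the first example (b[il1]l → bil, bll, b1l) can be sketched with itertools; the helper name and input encoding (one string of alternatives per position) are ours:

```python
from itertools import product

def expand_charsets(parts):
    """Expand small character-sets on a literal path into concrete strings.

    Each element of parts is the set of allowed characters at one position,
    given as a string, e.g. ["b", "il1", "l"] for b[il1]l.
    """
    return ["".join(p) for p in product(*parts)]

print(expand_charsets(["b", "il1", "l"]))  # ['bil', 'bll', 'b1l']
```

The expansion count is the product of the set sizes, which is why (per guideline 3 below and footnote 4) only small character-sets are expanded.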

To reliably find these strings we perform regex decom-position on the Glushkov NFA graph [27] which wouldbenefit from structural graph analyses We describe use-ful guidelines for finding the strings effective for regexmatching1 Find the string that would divide the graph into twosubgraphs with the start state in one subgraph and theset of accept states in the other Matching such a stringis a necessary condition for any successful match on theentire regex If the start and accept states happen to bein the same subgraph the corresponding FA will alwayshave to run regardless of a string match2 Avoid short strings Short strings are prone to matchmore frequently and are likely to trigger expensive FAmatching that fails3

3 Expand small character-sets 4 to multiple strings tofacilitate decomposition This would not only increasethe chance of successful decomposition but also lead to alonger string if a character-set intercepts a string sequence(ie document[x22 x27]ob ject)4 Avoid having too many strings Having too manystrings for matching would overload the matcher anddegrade the entire performance So it is important to finda small set of good strings effective for regex matching

³ Our current limit is 2 to 3 characters.
⁴ Our current implementation treats a character-set that expands to 11 or fewer strings as a small character-set.

[Figure 3: Dominant path analysis. An example Glushkov NFA graph; the string on the dominant path of the accept state (vertex 11) is extracted.]

3.3 Graph-based String Extraction
We develop three graph analysis techniques that discover strings on the critical path of matching. We describe the key idea of each algorithm below and provide more detailed algorithms in an appendix.
Dominant path analysis: A vertex u is called a dominator of vertex v if every path from the start state to v must go through u. The dominant path of vertex v is defined as a set of vertices W in the graph such that each vertex in W is a dominator of v and the vertices form a trace of a single path. Dominant path analysis finds the longest common string that exists on the dominant paths of all accept states. For example, Figure 3 shows the string on the dominant path of the accept state (vertex 11).
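The dominator relation underlying this analysis can be computed with a standard iterative data-flow pass. The following is a minimal Python sketch; the adjacency-dict graph encoding and the id-ordered walk are illustrative assumptions, not Hyperscan's actual implementation:

```python
def dominators(succs, start):
    """Iterative dominator sets: dom(v) = {v} ∪ intersection of dom(p) over preds p."""
    nodes = set(succs)
    preds = {v: set() for v in nodes}
    for u in succs:
        for v in succs[u]:
            preds[v].add(u)
    dom = {v: set(nodes) for v in nodes}
    dom[start] = {start}
    changed = True
    while changed:
        changed = False
        for v in nodes - {start}:
            new = {v}
            if preds[v]:
                new |= set.intersection(*(dom[p] for p in preds[v]))
            if new != dom[v]:
                dom[v], changed = new, True
    return dom

def dominant_path_string(succs, labels, start, accept):
    """Walk the accept state's dominators in ascending id order (ids assumed
    topological) and keep the longest run of literal-labeled vertices;
    None marks a non-literal vertex such as [^a]."""
    path = sorted(dominators(succs, start)[accept] - {start})
    best, cur = "", ""
    for v in path:
        cur = cur + labels[v] if labels[v] else ""
        best = max(best, cur, key=len)
    return best
```

For a chain graph modeling ab[^a]cde, the analysis yields "cde", the longest literal run on the dominant path.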

The string selected by this analysis is highly desirable for matching, as it cleanly divides the start and accept states into two separate subgraphs, satisfying the first guideline. The algorithm calculates the dominant path for each accept state and finds the longest common prefix of all dominant paths; it then extracts the string on the chosen path. If a vertex on the path is a small character-set, we expand it and obtain multiple strings.

Dominant region analysis: If dominant path analysis fails to extract a string, we perform dominant region analysis, which finds a region of vertices that partitions the start state into one subgraph and all accept states into the other. More formally, a dominant region is a subset of vertices in the graph such that (a) the set of all edges that enter and exit the region constitutes a cut-set of the graph, (b) for every in-edge (u, v) into the region, there exist edges (u, w) for all w in {w | w is in the region and w has an in-edge}, where (u, v) denotes an edge from vertex u to vertex v, and (c) for every out-edge (u, v) from the region, there exist edges (w, v) for all w in {w | w is in the region and w has an out-edge}.

If a discovered region consists only of string or small character-set vertices, we transform the region into a set of strings. Since these strings connect two disjoint subgraphs of the original graph, any match of the whole regex must match one of these strings. Figure 4 shows an example of a dominant region with 9 vertices: vertices 5, 6, and 7 are the entry points with the same predecessors, and vertices

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 635

[Figure 4: Dominant region analysis. An example graph whose dominant region yields the strings "foo", "bar", and "abc".]

[Figure 5: Network flow analysis. An example graph whose minimum cut yields the strings "foo", "efgh", and "abc".]

11, 12, and 13 are the exits with the same successor. As a result of dominant region analysis, we can extract the strings "foo", "bar", and "abc".

The algorithm for dominant region analysis first creates a directed acyclic graph (DAG) from the original graph to avoid any interference from back edges. It then performs a topological sort on the DAG and iterates over the vertices, checking whether each can be added to the current candidate region, i.e., whether the boundary edges still form a valid cut-set. We repeat this to discover all regions in the graph. Since we only analyze the DAG, the back edges of the original graph might affect correctness; thus, for each back edge whose source and target vertices are in different regions, we merge those regions (and all intervening regions) into a single region. Finally, we extract the strings from the dominant region.

Network flow analysis: Since dominant path and dominant region analyses depend on special graph structure, they are not always successful, so we run network flow analysis for generic graphs. For each edge, the analysis finds the string (or multiple strings) that ends at that edge. The edge is then assigned a score inversely proportional to the length of the string(s) ending there: the longer the string, the smaller the score. With a score per edge, the analysis runs a "max-flow min-cut" algorithm [25] to find a minimum cut-set that splits the graph into two parts, separating the start state from all accept states. Because longer strings carry smaller scores, the discovered cut-set consists of edges that generate the longest strings in the graph.
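As a sketch of this step, the following runs a textbook Edmonds-Karp max-flow over integer edge scores and returns the crossing edges of the resulting minimum cut. It is a self-contained illustration; the scoring and graph encoding are assumptions, not Hyperscan's internals:

```python
from collections import deque

def min_cut_edges(edges, source, sinks):
    """edges: {(u, v): capacity}. Returns the original edges crossing a minimum
    cut separating `source` from all `sinks` (collapsed into one super-sink)."""
    SINK = "super-sink"
    cap, adj = {}, {}
    def add(u, v, c):
        cap[(u, v)] = cap.get((u, v), 0) + c
        cap.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for (u, v), c in edges.items():
        add(u, v, c)
    for s in sinks:
        add(s, SINK, float("inf"))
    def augmenting_path():
        parent, q = {source: None}, deque([source])
        while q:
            u = q.popleft()
            if u == SINK:                      # rebuild the path back to source
                path, v = [], SINK
                while parent[v] is not None:
                    path.append((parent[v], v))
                    v = parent[v]
                return path
            for v in adj.get(u, ()):
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    q.append(v)
        return None
    while (path := augmenting_path()):         # saturate augmenting paths
        f = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= f
            cap[(v, u)] += f
    reach, q = {source}, deque([source])       # residual-reachable side of the cut
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in reach and cap[(u, v)] > 0:
                reach.add(v)
                q.append(v)
    return {e for e in edges if e[0] in reach and e[1] not in reach}
```

With scores inversely proportional to string length (e.g., score 1 for an edge ending a 3-character string, 3 for a single character), the minimum cut prefers the long-string edges.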

Figure 5 shows a result of network flow analysis extracting the string set {"foo", "efgh", "abc"}, which divides the whole graph into two parts.

Effectiveness of graph analysis: Our graph analysis effectively produces good strings for most real-world rules. Table 1 shows that 97.2% to 99.2% of decomposable real-world regex rules benefit from dominant path

Ruleset   Total   Decomp   D-Path   D-Reg   N-flow
S-V       1663    1563     1551     32      16
S-E       7564    6756     6575     100     203
Suri      7430    6501     6318     94      201

Table 1: Effectiveness of graph analysis on real-world rulesets. S-V: Snort Talos (May 2015); S-E: Snort ET-Open 2.9.0; Suri: Suricata ruleset 4.0.4. D-Path, D-Reg, and N-flow refer to dominant path, dominant region, and network flow analysis, respectively. Decomp is the total number of decomposable rules. Note that one regex can benefit from multiple graph analyses, so the sum of the graph-analysis columns is larger than the Decomp field.

[Figure 6: Two-stage matching with FDR. The input stream first passes extended shift-or matching, which emits candidate matches; candidates are then verified by hashing and exact matching against the string patterns.]

analysis, while the remaining patterns exploit dominant region and network flow analysis. These strings prove highly beneficial in reducing the number of regex matching invocations, as shown in Section 6.2.

4 SIMD-accelerated Pattern Matching
In this section, we present the design of multi-string and FA matching algorithms that leverage the SIMD operations of modern CPUs.

4.1 Multi-string Pattern Matching
We introduce an efficient multi-string matcher called FDR.⁵ The key idea of FDR is to quickly filter out innocent traffic with a fast input scan. As shown in Figure 6, FDR performs extended shift-or matching [13] to find candidate input strings that are likely to match some string pattern, and then verifies them to confirm an exact match.

Shift-or matching: We first provide brief background on shift-or matching, which serves as the base algorithm of FDR. The shift-or algorithm finds all occurrences of a string pattern in the input byte stream by performing bitwise shift and or operations, as shown in Figure 7. It uses two data structures: a shift-or mask for each character c in the symbol set (sh-mask('c')) and a state mask (st-mask) used during matching. sh-mask('c') zeros all bits whose bit position corresponds to a byte position of c in the string pattern, while all other bits are set to 1. The bit position in a sh-mask is counted from the rightmost bit, while the byte position in a pattern is counted from the leftmost byte. For example, for the string pattern "aphp",

⁵ It is named after the 32nd President of the US.


[Figure 7: Classical shift-or matching for the pattern "aphp". The shift-or masks are sh-mask('a') = 11111110, sh-mask('p') = 11110101, and sh-mask('h') = 11111011. Starting from st-mask = 11111111, the state is updated per input character: m1 = (st-mask << 1) | sh-mask('a'), m2 = (m1 << 1) | sh-mask('p'), m3 = (m2 << 1) | sh-mask('h'), m4 = (m3 << 1) | sh-mask('p'), at which point the propagated zero bit signals a match.]
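The classical update illustrated in Figure 7 can be written compactly in scalar form. This is a minimal Python sketch of the textbook algorithm, not Hyperscan code:

```python
def shift_or(pattern, text):
    """Classic shift-or: return the end positions of all pattern occurrences."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    sh = {}                       # sh-mask per character: zero bit i where pattern[i] == c
    for i, c in enumerate(pattern):
        sh[c] = sh.get(c, all_ones) & ~(1 << i)
    st = all_ones                 # state mask, all ones initially
    hits = []
    for pos, c in enumerate(text):
        st = ((st << 1) | sh.get(c, all_ones)) & all_ones
        if not st & (1 << (m - 1)):   # zero reached bit m-1: full pattern matched
            hits.append(pos)
    return hits
```

Running shift_or("aphp", "xaphpz") reports a match ending at input position 4, mirroring the figure.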

[Figure 8: Example shift-or masks with the two patterns "ab" and "cd" in bucket 0; no other bucket contains 'a', 'b', 'c', or 'd' in its patterns. Byte positions beyond a bucket's longest pattern (padding bytes) also encode the bucket id.]

sh-mask('p') = 11110101, as 'p' appears at the second and the fourth byte positions in the pattern. If a character is unused in the pattern, all bits of its sh-mask are set to 1. The algorithm keeps a st-mask whose size is equal to the length of a sh-mask. Initially, all bits of the st-mask are set to 1. The algorithm updates the st-mask for each input character 'x' as st-mask = ((st-mask << 1) | sh-mask('x')). For each matching input character, a 0 is propagated to the left by one bit. If the zero bit reaches the bit position equal to the length of the pattern, the pattern string has been found.

The original shift-or algorithm runs fast with high space efficiency, but it has two limitations. First, it supports only a single string pattern. Second, although it consists of bit-level operations, the implementation cannot benefit from SIMD instructions except for very long patterns. We tackle these problems as follows.

Multi-string shift-or matching: To support multiple string patterns, we update the data structures. First, we divide the set of string patterns into n distinct buckets, where each bucket has an id from 0 to n-1. For now, assume that each string pattern belongs to one of the n buckets; we discuss how the patterns are divided later ('pattern grouping'). Second, we increase the size of each sh-mask and the st-mask by n times, so that a group of n bits in sh-mask('x') records all the buckets that have 'x' somewhere in their patterns. More precisely, the k-th n bits of sh-mask('x') encode the ids of all buckets that hold at least one pattern with 'x' at the k-th byte position. One difference from the original algorithm is that the byte position in a pattern is now counted from the rightmost byte. This enables parallel execution of multiple CPU instructions per cycle, as explained later ('SIMD acceleration'). For efficient implementation, we

The original shift-or algorithm runs fast with high spaceefficiency but it leaves two limitations First it supportsonly a single string pattern Second although it consistsof bit-level operations the implementation cannot benefitfrom SIMD instructions except for very long patterns Wetackle these problems as belowMulti-string shift-or matching To support multi-stringpatterns we update our data structures First we dividethe set of string patterns into n distinct buckets where eachbucket has an id from 0 to n-1 For now assume that eachstring pattern belongs to one of n buckets as we will dis-cuss how we divide the patterns later (lsquopattern groupingrsquo)Second we increase the size of sh-mask and st-mask byn times so that a group of n bits in sh-mask(lsquoxrsquo) record allthe buckets that have lsquoxrsquo in some of their patterns Moreprecisely the k-th n bits of sh-mask(lsquoxrsquo) encode the idsof all buckets that hold at least one pattern which has lsquoxrsquoat the k-th byte position One difference from the originalalgorithm is that the byte position in a pattern is countedfrom the rightmost byte This enables parallel executionof multiple CPU instructions per cycle as explained later(lsquoSIMD accelerationrsquo) For efficient implementation we

[Figure 9: FDR's extended shift-or matching. For the input "aphp", the sh-masks of the input characters are pre-shifted by their byte offsets (sh-mask('a'), sh-mask('p') << 8, sh-mask('h') << 16, sh-mask('p') << 24) and or-ed into the st-mask; the resulting zero bits report potential matches for buckets 0 and 4 at input byte position 3.]

set n to 8, so that a byte position in a sh-mask corresponds to the same byte position in a pattern. This implies that the length of a sh-mask should be no smaller than the longest pattern.
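A simplified scalar model of this bucketed scheme, including the verification stage described later, looks as follows. It keeps one state mask per bucket and omits the padding-byte and SIMD details; all names are illustrative:

```python
def build_bucket_tables(buckets):
    """buckets: list of pattern lists. For each bucket, OR together the
    patterns' sh-masks over the shortest pattern's length, so one scan
    filters the whole bucket (at the cost of false positives)."""
    tables = []
    for pats in buckets:
        m = min(len(p) for p in pats)
        all_ones = (1 << m) - 1
        sh = {}
        for p in pats:
            for i, c in enumerate(p[:m]):
                sh[c] = sh.get(c, all_ones) & ~(1 << i)
        tables.append((m, all_ones, sh))
    return tables

def match_two_stage(buckets, text):
    """Stage 1: per-bucket shift-or produces candidates, which may include
    false positives such as 'ad' for a bucket holding 'ab' and 'cd'.
    Stage 2: verify each candidate by exact comparison."""
    tables = build_bucket_tables(buckets)
    states = [all_ones for (_, all_ones, _) in tables]
    matches = []
    for pos, c in enumerate(text):
        for b, (m, all_ones, sh) in enumerate(tables):
            states[b] = ((states[b] << 1) | sh.get(c, all_ones)) & all_ones
            if states[b] & (1 << (m - 1)) == 0:      # candidate in bucket b
                for p in buckets[b]:                  # verification stage
                    start = pos - len(p) + 1
                    if start >= 0 and text[start:pos + 1] == p:
                        matches.append((pos, p))
    return matches
```

On the input "xadcd" with a bucket holding "ab" and "cd", stage 1 raises a candidate for the false positive "ad", but verification discards it and only "cd" is reported.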

Figure 8 shows an example. Bucket 0 has two string patterns, "ab" and "cd". Since 'a' appears at the second byte of only "ab" in bucket 0, sh-mask('a') zeros only the first bit (= bucket 0) of the second byte. The i-th bit within each byte of a sh-mask indicates bucket id i-1. If a byte position of a sh-mask exceeds the longest pattern of a certain bucket (a 'padding byte'), we encode the bucket id in the padding byte. This ensures matching correctness by carrying a match at a lower input byte along in the shift process; examples are the padding bytes in Figure 8 and the sh-masks in Figure 9 that zero the first bit in the third and fourth bytes.

The pattern matching process is similar to the original algorithm, except that the sh-masks are shifted left instead of the st-mask. The st-mask is initially 0 except for the byte positions smaller than the shortest pattern; this avoids a false-positive match at a position smaller than the shortest pattern. We then proceed with the input characters. The matcher keeps k, the number of characters processed so far modulo n. For an input character 'x', st-mask |= (sh-mask('x') << k bytes). The matcher repeats this for n input characters and then checks whether the st-mask has any zero bits. Zero bits represent a possible match in the corresponding bucket. For example, Figure 9 shows that buckets 0 and 4 have a potential match at input byte position 3. The verification stage, illustrated later, checks whether these are real matches or false positives.

Pattern grouping: The strategy for grouping patterns into buckets affects matching performance. A good strategy would distribute the patterns such that most innocent traffic passes with a low false positive rate. Toward this goal, we design our algorithm based on two guidelines. First, we group patterns of similar length into the same bucket. This minimizes the information loss of longer patterns, as the input characters match only up to the length of the shortest pattern in a bucket for matching correctness. Second, we avoid grouping too many short patterns into one bucket; in general, shorter patterns are more likely to increase false positives. To meet these requirements, we sort the patterns in ascending order of length and assign each an id from 0 to (s-1) in sorted order. Then we run the following algorithm, which calculates the minimum cost of grouping the patterns into n buckets using dynamic programming. The algorithm is summarized by the two equations below:

1. t[i][j] = min over k in {i, ..., s-1} of (cost(i,k) + t[k+1][j-1]), where s is the number of patterns and t[i][j] stores the minimum cost of grouping patterns i to (s-1) into (j+1) buckets.

2. cost(i,k) = (k - i + 1)^α / (length_i)^β, where cost(i,k) is the cost of grouping patterns i to k into one bucket, length_i is the length of pattern i, and α and β are constant parameters.
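A direct implementation of this recurrence is straightforward with memoization. The sketch below is illustrative: the parameter defaults follow the text, while the boundary-reconstruction logic is an assumption about how bucket assignments are recovered:

```python
from functools import lru_cache

def group_patterns(patterns, n_buckets=8, alpha=1.05, beta=3.0):
    """Assign patterns to at most n_buckets buckets via the recurrence
    t[i][j] = min_k cost(i, k) + t[k+1][j-1], with
    cost(i, k) = (k - i + 1)**alpha / len(patterns[i])**beta.
    Patterns are sorted ascending, so pattern i is its bucket's shortest."""
    pats = sorted(patterns, key=len)
    s = len(pats)
    def cost(i, k):
        return (k - i + 1) ** alpha / (len(pats[i]) ** beta)
    @lru_cache(maxsize=None)
    def t(i, j):
        if i >= s:
            return 0.0
        if j == 0:
            return cost(i, s - 1)          # all remaining patterns in one bucket
        return min(cost(i, k) + t(k + 1, j - 1) for k in range(i, s))
    # walk the recurrence again to recover the chosen bucket boundaries
    buckets, i, j = [], 0, n_buckets - 1
    while i < s:
        if j == 0:
            buckets.append(pats[i:])
            break
        k = min(range(i, s), key=lambda k: cost(i, k) + t(k + 1, j - 1))
        buckets.append(pats[i:k + 1])
        i, j = k + 1, j - 1
    return buckets
```

Because cost grows with bucket size and shrinks with the shortest pattern's length, buckets of long patterns absorb more entries while short patterns stay spread out.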

t[i][j] is calculated as the minimum over k of the sum of the cost of grouping patterns i to k into one bucket (cost(i,k)) and the minimum cost of grouping the remaining patterns (k+1) to (s-1) into j buckets (t[k+1][j-1]). cost(i,k) gets smaller as the bucket has a longer pattern, which allows more patterns in the bucket; it gets larger as the bucket has a shorter pattern, limiting the number of such patterns. Our implementation currently uses α = 1.05 and β = 3, computes t[0][7] to divide all string patterns into 8 buckets, and records the bucket id for each pattern in the process. In practice, we find that the algorithm works well, automatically reaching a sweet spot that minimizes the total cost.

Super characters: One problem with bucket-based matching is that it produces false positives among patterns in the same bucket. For example, if a bucket has "ab" and "cd", the algorithm matches not only the correct patterns but also the false positives "ad" and "cb". To suppress them, we use an m-bit (m > 8) super character (instead of an 8-bit ASCII character) to build and index the sh-masks. An m-bit super character consists of a normal (8-bit) character in the lower 8 bits and the low-order (m-8) bits of the next character in the upper bits. If it is the last character of a pattern (or of the input), we use a null character (0) as the next character. The key idea is to reflect some information about the next character of a pattern when building the sh-mask for the current character.

Only if the same two characters appear in the input⁶ do we declare a match at that input byte position. This significantly reduces false positives at the cost of slightly larger memory for the sh-masks.
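The super-character construction itself is simple. A scalar Python sketch (assuming m = 12):

```python
def super_chars(data, m=12):
    """Map each byte of `data` to an m-bit super character: the byte itself in
    the low 8 bits, plus the low-order (m-8) bits of the NEXT byte above it.
    The last byte is paired with a null (0) next byte."""
    hi_bits = m - 8
    out = []
    for i, b in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else 0
        out.append(((nxt & ((1 << hi_bits) - 1)) << 8) | b)
    return out
```

Indexing the sh-masks by these values means the first super character of "ad" differs from that of "ab", so the false positive "ad" no longer hits the mask built for "ab".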

In practice, m should be between 9 and 15. Let us say m = 12 bits. For a pattern "ab", we see two 12-bit super characters: α = ((low-order 4 bits of 'b' << 8) | 'a') and β = 'b'. We then build sh-masks for α and β, respectively. When the input arrives, we construct a 12-bit super character based on the current input byte offset and use it as an index to fetch its sh-mask. We advance the input one byte at a time as before. For example, if the input is 'ad', we first construct γ = ((low-order 4 bits of 'd' << 8) | 'a'), fetch sh-mask(γ), and perform the shift and or operations as before. Then we advance to the next byte and construct δ = 'd'. So the input 'ad' will not match even if a bucket contains "ab" and "cd".

SIMD acceleration: Our implementation of FDR heavily exploits SIMD operations and instruction-level parallelism. First, it uses 128-bit sh-masks, so it can employ 128-bit SIMD instructions (e.g., pslldq for left shift and por for or in the Intel x86-64 instruction set) to update the masks. As shift and or are the most frequent operations in FDR, it enjoys a substantial performance boost from the SIMD instructions. Second, it exploits parallel execution with multiple CPU execution ports. In the original shift-or matching, the execution of shift and or operations is completely serialized, as each depends on the previous result. This under-utilizes a modern CPU, even one that can issue multiple instructions per cycle. In contrast, FDR exploits instruction-level parallelism by pre-shifting the sh-masks of multiple input characters in parallel. Note that this is made possible because we count the byte position differently from the original version. The parallel execution effectively increases instructions per cycle (IPC) and significantly improves performance. To accommodate the parallel shifting, we limit the length of a pattern to 8 bytes and extract the lower 8 bytes from any pattern longer than 8 bytes. Because matching requires at minimum 8-byte masks and up to 7 bytes of shifting, a 128-bit mask does not lose high-bit information during a left shift. In actual matching, FDR handles 8 bytes of input at a time. To guarantee contiguous matching across 8-byte input boundaries, the st-mask of the previous iteration is shifted right by 8 bytes for the next iteration.

Verification: As our shift-or matching can still generate false positives, we need to verify whether a candidate match is

⁶ Of course, there is still a small chance of a false positive, since we use only partial bits of the next character, but the probability becomes fairly small as the pattern length grows.


Algorithm 1: FDR Multi-string Matcher
 1: function MATCH
 2:   n = number of bits of a super character
 3:   R = startMask
 4:   for each 8-byte V in input do
 5:     for i in 0..7 do
 6:       index = V[i*8 .. i*8+n-1]
 7:       M[i] = shiftOrMask[index]
 8:       S[i] = LSHIFT(M[i], i)
 9:     end for
10:     for i in 0..7 do
11:       R = OR128(R, S[i])
12:     end for
13:     for each zero bit b in low 8 bytes of R do
14:       j = the byte position containing bit b
15:       l = length of string in bucket b
16:       h = HASH(V[j-l+1 .. j])
17:       if h is valid then
18:         perform exact matching for each
19:           string in hash_bucket[h]
20:       end if
21:     end for
22:     R = RSHIFT(R, 8)
23:   end for
24: end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. We then leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find that hashing filters out a large portion of the false positives.

4.2 Finite Automata Pattern Matching
Successful string matching often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold,⁷ we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, there can be multiple current states (the current set) that are active at any time. A state transition with an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match is successful if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation as it outperforms traditional NFA representations, which use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. The bit-based NFA also leverages SIMD instructions to perform vectorized bit

⁷ We use 16384 states as the threshold.

operations to further accelerate the performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n-1 in topological order, and maintains the current set as a bit mask (the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are assigned sequentially in topological order, the span of a state transition is typically small. We exploit this fact to represent the transitions compactly, as described below.

The bit-based NFA implements a state transition on an input character 'c' in three steps. First, it calculates the successor set that can be transitioned to from any state in the current set with any input character. Second, it computes the set of all states that can be transitioned to from any state on 'c' (the states reachable by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, as a Glushkov NFA graph guarantees that a given state is always entered on the same character or the same character-set.

The challenge is to efficiently calculate the successor set. One could pre-calculate a successor set for every combination of current states and look it up with the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable for all but small n. An alternative is to keep a successor-set mask for each individual state and to combine the successor sets of every state in the current set. This is more space-efficient, but it costs up to n memory lookups and (n-1) or operations. We implement the latter but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some example shift-k masks: the shift-1 mask sets every bit to 1 except bit 7, since every state except state 7 has a forward transition of span 1.

We divide state transitions into two types: typical and exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set from the current set following typical transitions. If the current-set mask is S, then ((S & shift-k mask) << k) represents all possible successor states reached from S by transitions of span k. Combining the successor sets for every k yields the successor set reached by all typical transitions.
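In scalar form, the typical-transition step is just a few ands, shifts, and ors; Hyperscan performs the same computation on 128-bit SIMD registers:

```python
def typical_successors(S, shift_masks):
    """shift_masks[k] sets bit i iff state i has a forward transition of span k.
    Returns all states reachable from current-set mask S via typical transitions."""
    succ = 0
    for k, mask in enumerate(shift_masks):
        succ |= (S & mask) << k      # states with span-k edges advance by k ids
    return succ
```

For a 4-state chain 0 -> 1 -> 2 -> 3 with a self-loop on state 3, shift_masks = [0b1000, 0b0111]: from state 0 the successor is state 1, and from state 3 the self-loop keeps state 3 active.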


[Figure 10: NFA representation for (AB|CD)*AFF*. Shift-0, shift-1, and shift-2 masks encode the typical transitions; the exception mask marks states 0, 2, and 4, each of which keeps its own successor mask (e.g., the successor mask for state 2 sets bits 1 and 5).]

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and all backward transitions.⁸ Any state with at least one exceptional transition keeps its own successor mask, which records all states reached by the exceptional transitions of its owner state. All exceptional states are tracked in an exception mask.

The choice of the shift limit affects performance. If it is too large, we have too many shift-k masks representing rare transitions; if it is too small, we must handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of their state ids. States 0, 2, and 4 are highlighted, as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). It then fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise and operation with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state; otherwise, we proceed with the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exception states. Our implemen-

⁸ In our implementation, forward transitions that cross the 64-bit boundary of the state id space (e.g., from an id smaller than 64 to one larger than 64) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2: Bit-based NFA Matching
 1: SH_MSKS[i]: shift-i masks for typical transitions
 2: SUCC_MSKS[i]: successor mask for state i
 3: EX_MSK: exception mask
 4: reach[k]: reachable state set for character k
 5: function RUNNFA(S: current active state set)
 6:   SUCC_TYP = 0; SUCC_EX = 0
 7:   for each c in input do
 8:     if any state is active in S then
 9:       for i = 0 to shift_limit do
10:         R0 = AND(S, SH_MSKS[i])
11:         R1 = LSHIFT(R0, i)
12:         SUCC_TYP = OR(SUCC_TYP, R1)
13:       end for
14:       S_EX = AND(S, EX_MSK)
15:       for each active state s in S_EX do
16:         SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
17:       end for
18:       SUCC = OR(SUCC_TYP, SUCC_EX)
19:       S = AND(SUCC, reach[c])
20:       Report accept states in S
21:     end if
22:   end for
23: end function

Total Size               1 GB
Number of Packets        818682
Number of TCP Packets    818520
Percent of TCP Bytes     97.2%
Percent of HTTP Bytes    92.9%
Average Packet Size      1265 Bytes

Table 2: HTTP traffic trace from a cloud service provider

tation uses a 128-bit mask (and extends it up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough to represent the NFA graphs of most regexes.

5 Implementation
Hyperscan consists of compile-time and run-time functions. Compile-time functions include regex-to-NFA graph conversion, graph decomposition, and generation of the matching components. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:
• Small string-set (<80) matching. This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction, applied over a small number (2-8) of 4-bit regions in the suffix of each string.

• NFA and DFA cyclic state acceleration. Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration to search for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states, or that switches off one of the current states.

# of regexes   Prefilter     Hyperscan   Reduction
500            2971652       645326      4.6x
1000           93595304      714582      131.0x
1500           110122972     791017      139.2x
2000           139804519     780665      179.1x
2500           156332187     857100      182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

• Small-size DFA matching. We design a SIMD-based algorithm for small DFAs (<16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

• Anchored pattern matching. When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the corresponding automata are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching. We design a fast lookaround approach that peeks at the input near the triggers for an FA before running it. This often lets us discover that the FA either does not need to run at all, or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s{4,5}foo, where R is a complex regex, we can first detect with SIMD checks whether the digit and space character classes have matched, and if not, avoid or defer running a potentially expensive FA associated with R.
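The gist of such a lookaround check can be sketched in scalar form. The offsets and the R\d\s{4}foo-style fragment below are illustrative; Hyperscan evaluates the class tests with SIMD over many positions at once:

```python
import string

def lookaround_ok(buf, trigger_start, checks):
    """checks: (offset, allowed_chars) pairs, with offsets relative to where the
    literal trigger matched. Run the expensive FA only if every peeked byte
    falls inside its character class."""
    for off, allowed in checks:
        pos = trigger_start + off
        if pos < 0 or pos >= len(buf) or buf[pos] not in allowed:
            return False
    return True

# For a fragment like R\d\s{4}foo: a "foo" trigger at position p is only worth
# pursuing if p-5 holds a digit and p-4..p-1 hold whitespace.
CHECKS = [(-5, string.digits)] + [(k, " \t\r\n") for k in range(-4, 0)]
```

A failing check lets the matcher skip (or defer) the FA for R entirely, which is the common case on innocent traffic.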

6 Evaluation
In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than manual choice? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup
We use a server machine with an Intel Xeon Platinum 8180 CPU @ 2.50GHz and 48 GB of memory, and compile the code with GCC 5.4. To separate out the impact of network IO, we evaluate the performance with a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, as shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

# of regexes   Prefilter    Hyperscan   Reduction
700            699622       18164       38.5x
850            7516464      19214       391.2x
1000           17063344     32533       524.5x
1150           17737814     34075       520.6x
1300           25143574     36040       697.7x

Table 4: Regex invocations for Snort's Talos ruleset

6.2 Effectiveness of Regex Decomposition
The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace and confirm correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when necessary.

6.3 Microbenchmarks
We evaluate the performance of FDR, our multi-string matcher, as well as that of Hyperscan's regex matching.
Multi-string pattern matching: We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns ex-

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 641

Ruleset     PCRE     PCRE2    RE2-s    Hyperscan-s    RE2-m    Hyperscan-m
Talos       694.2     39.4    177.7       17.3          2.9        2.15
ET-Open    1280.0     91.3    469.6       51.6        111.6       13.3

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan on the Snort Talos (1,300 regexes) and Suricata (2,800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure 11: String matching performance with random packets. Two panels plot throughput (Gbps) against the number of string patterns (1k to 26k): (a) ET-Open ruleset, (b) Talos ruleset, comparing FDR, DFC, and AC, with FDR's improvement factor over DFC annotated on each bar.]

[Figure 12: String matching performance with a real traffic trace. Two panels plot throughput (Gbps) against the number of string patterns (1k to 26k): (a) ET-Open ruleset, (b) Talos ruleset, comparing FDR, DFC, and AC, with FDR's improvement factor over DFC annotated on each bar.]

extracted from the Snort ET-Open and Talos rulesets. Figures 11 and 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate it with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, its performance becomes comparable to DFC due to increased cache usage, but it is still much better than AC.

Regex matching. We now evaluate the performance of regex matching in Hyperscan. We compare the performance with PCRE (v8.41) [6], as it is most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (i.e.,

matching against one regex at a time), which requires passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1,300 regexes from the Snort Talos ruleset and 2,800 regexes from the Suricata ruleset.

Table 5 shows the result. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 1.35x faster than RE2-m, while it shows an 18.33x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs, and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both with a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort but replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect a much higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons

Hyperscan has been developed since 2008 and was first open-sourced in 2013. Hyperscan has been successfully adopted by over 40 commercial projects globally, and it is in production use by tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan API was initially developed in C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience with developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan

Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but also made it simple to employ in applications.

Version 1.0. The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (i.e., pattern matching over streamed data). Also, it suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosion if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0. The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax

tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data and with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string-matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings were found. This approach avoids the high risk of potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.
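As context for the bit-based NFA used since version 1.0, the idea of packing NFA states into the bits of a machine word is easiest to see in the classic Shift-And matcher of Baeza-Yates and Gonnet [13], which bit-based Glushkov NFAs generalize. The sketch below is illustrative only and is not Hyperscan's implementation.

```python
def shift_and_match(pattern, text):
    """Bit-parallel (Shift-And) string matching: state i of the pattern's
    NFA is bit i of an integer, so one shift, OR, and AND per input byte
    advances every state at once."""
    masks = {}
    for i, ch in enumerate(pattern):   # bit i is set in masks[ch] iff pattern[i] == ch
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        # shift all states forward, inject the start state, and keep only the
        # states whose expected character matches the current input byte
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - len(pattern) + 1)   # start offset of the match
    return hits

print(shift_and_match("abc", "xabcabc"))   # [1, 4]
```

On real hardware the same update runs over 128-bit or wider SIMD registers, which is how the Glushkov construction supports hundreds of NFA states per step.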

Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns /foo/, /abc[xyz]/, and /abc[xy]*def|abcz*de[fg]/ might first produce matches for the simple literal /foo/, then provide all the matches for the components of the pattern /abc[xyz]/, then provide matches for the two parts


of the alternation, abc[xy]*def and abcz*de[fg], without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0. (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel to the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0. Version 4.0 was released in October 2013 under an open-source BSD license, to further increase the usage of Hyperscan by removing cost barriers and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identified the strings separated by various common regex constructs such as .* or X+ for some character class X. For example, it is frequently the case that strings in regexes are separated by the construct .* (i.e., /foo.*bars/). This is usually implementable by requiring only that foo is seen before bars (usually, but not always: consider the expression /foo.*oars/). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

Version 5.0. Version 5.0, the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns, and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology for malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns, beyond matching a single pattern. To

support this, the system now allows user-defined AND, OR, and NOT combinations over patterns. For example, an AND operation between the patterns foobar and teakettle requires that both patterns match the input before a match is reported. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE that brings the benefit of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. Hyperscan's lack of support for full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopting solutions. Chimera employs Hyperscan as a fast filter for the input, and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
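Chimera's filter-then-confirm structure can be sketched as follows, with Python's re standing in for PCRE and a simple literal containment test standing in for Hyperscan's fast matcher. The function name and the (pattern, literal) layout are illustrative, not Chimera's actual API.

```python
import re

def chimera_like_scan(patterns, data):
    """Filter-then-confirm sketch of the Chimera idea: each entry pairs a
    full pattern with a literal the pattern must contain. The cheap literal
    test plays the role of the fast filter; the full engine runs only on
    filter hits."""
    confirmed = []
    for pid, (full, literal) in enumerate(patterns):
        if literal not in data:           # fast filter: most inputs stop here
            continue
        if re.search(full, data):         # expensive confirmation pass
            confirmed.append(pid)
    return confirmed

# The second pattern uses a back-reference (\1), which PCRE supports but a
# pure automaton engine cannot, hence the confirmation pass in the hybrid.
patterns = [(r"abc\d+def", "abc"), (r"(foo)\1bar", "foobar")]
print(chimera_like_scan(patterns, "xxabc123defyy"))   # [0]
print(chimera_like_scan(patterns, "foofoobarzz"))     # [1]
```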

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes) and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimum Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with the additional power of supporting large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series were, over time, not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream state to be discarded if no NFA engines had started running. These features were complicated, suppressed potential optimizations, and interacted poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts, and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' the parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher, adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback from the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.


[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333–340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74–82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors and Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248–264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1–53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: A highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317–328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, Sunglk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: Fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
 1: function DominantPathAnalysis(G)
 2:   dpath = ∅
 3:   for v ∈ accepts do
 4:     calculate dominant path p[v] for v
 5:     if dpath = ∅ then
 6:       dpath = p[v]
 7:     else
 8:       dpath = common_prefix(dpath, p[v])
 9:       if dpath = ∅ then
10:         return null_string
11:       end if
12:     end if
13:     strings = expand_and_extract(dpath)
14:   end for
15:   return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and then finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
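A minimal sketch of the common_prefix() step used by Algorithm 3, assuming dominant paths are given as lists of vertex ids starting from the start vertex (an illustration, not Hyperscan's graph code):

```python
def common_prefix(paths):
    """Longest shared prefix of the accept states' dominant paths.
    An empty result corresponds to Algorithm 3 returning null_string."""
    if not paths:
        return []
    prefix = []
    for column in zip(*paths):        # walk the paths position by position
        if len(set(column)) != 1:     # the dominant paths diverge here
            break
        prefix.append(column[0])
    return prefix

print(common_prefix([[0, 1, 2, 5], [0, 1, 3], [0, 1, 2]]))   # [0, 1]
```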

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
 1: function DominantRegionAnalysis(G)
 2:   acyclic_g = build_acyclic(G)
 3:   Gt = build_topology_order(acyclic_g)
 4:   candidate = {q0}
 5:   it = begin(Gt)
 6:   while it ≠ end(Gt) do
 7:     if isValidCut(candidate) then
 8:       setRegion(candidate)
 9:       initializeCandidate(candidate)
10:     else
11:       addToCandidate(it)
12:       it = it + 1
13:     end if
14:   end while
15:   setRegion(candidate)
16:   Merge regions connected with back edges
17:   strings = expand_and_extract(regions)
18:   return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set


and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
1: function NetworkFlowAnalysis(G)
2:   for edge ∈ E do
3:     strings = find_strings(edge)
4:     scoreEdge(edge, strings)
5:   end for
6:   cuts = MinCut(G)
7:   strings = extract and expand strings from cuts
8:   return strings
9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of a string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
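The min-cut step can be illustrated with a toy version: score each edge as the reciprocal of its longest string's length, run Edmonds-Karp max-flow [25], and read the cut off the residual graph. This is a self-contained sketch under those assumptions, not Hyperscan's graph code.

```python
from collections import deque

def edge_score(strings):
    # Inverse of the longest string ending at this edge: long strings => cheap edges.
    return 1.0 / max(len(s) for s in strings)

def min_cut(n, edges, s, t):
    """Edmonds-Karp max-flow; the min cut is the set of saturated edges
    crossing from the source side of the residual graph to the sink side."""
    cap = [[0.0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
    flow = [[0.0] * n for _ in range(n)]

    def bfs_path():
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q:
            u = q.popleft()
            for v in range(n):
                if parent[v] < 0 and cap[u][v] - flow[u][v] > 1e-12:
                    parent[v] = u
                    if v == t:
                        return parent
                    q.append(v)
        return None

    while (parent := bfs_path()) is not None:
        path, v = [], t
        while v != s:                      # walk back from sink to source
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] - flow[u][v] for u, v in path)
        for u, v in path:                  # augment along the shortest path
            flow[u][v] += aug
            flow[v][u] -= aug

    reach, q = {s}, deque([s])             # source side of the residual graph
    while q:
        u = q.popleft()
        for v in range(n):
            if v not in reach and cap[u][v] - flow[u][v] > 1e-12:
                reach.add(v)
                q.append(v)
    return [(u, v) for u, v, _ in edges if u in reach and v not in reach]

# Toy chain: edge (0,1) carries the string "a", edge (1,2) carries "attack".
edges = [(0, 1, edge_score(["a"])), (1, 2, edge_score(["attack"]))]
print(min_cut(3, edges, s=0, t=2))   # [(1, 2)]: the cut picks the "attack" edge
```

Because "attack" is six times longer than "a", its edge is six times cheaper, so the min cut selects it and the long string is the one extracted for string matching.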



graph-based decomposition, whose graph analysis techniques reliably identify the strings in regex patterns most desirable for string matching.

3.1 Matching with Regex Decomposition

The key idea of Hyperscan is to decompose each regex pattern into a disjoint set of string and subregex (or FA) components, and to match each component until it finds a complete match. A string component consists of a stream of literals (or input symbols). Subregex components are the remaining parts of a regex after all string components are removed. They may include one or more meta-characters or quantifiers in regex syntax (like '^', '$', '.', '*', etc.) that need to be translated into an FA for matching; thus we refer to them as FA components.

Linear regex. We start with a simple regex where each component is concatenated without alternation; we call it a linear regex. Formally, a linear regex pattern that contains at least one string can be represented by the following production rules:

1. regex → left str FA
2. left → left str FA | FA

where str and FA are both indivisible components and FA can be empty. A linear regex without any string is implemented as a single DFA or NFA. In practice, however, we find that 87% to 94% of the regex rules in IDSes have at least one extractable string, so a majority of real-world regexes would benefit from decomposition. The production rules imply that if we find the rightmost string in a linear regex pattern, we can recursively apply the same algorithm to decompose the rest of the pattern. One complication lies in a subregex with a repetition operator such as (R)?, (R)*, (R)+, and (R){m,n}, where R is arbitrarily complex. Hyperscan treats (R)? and (R)* as a single FA since R is optional, while it converts (R)+ = (R)(R)* and (R){m,n} = (R)(R)···(R)(R){0,n−m} ((R) appears m times). Then it decomposes their prefixes and treats the suffix as an FA.
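A simplified illustration of splitting a linear regex into alternating FA and string components is sketched below. It handles only literal characters, '.', bracket classes, and the quantifiers * + ?; Hyperscan's real extraction operates on the Glushkov NFA graph rather than on the pattern text.

```python
def decompose_linear(pattern):
    """Split a linear regex into alternating FA and string components
    (a sketch under the simplifying assumptions stated above)."""
    units = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "[":                                  # character class, e.g. [^X]
            j = pattern.index("]", i) + 1
            unit, i, literal = pattern[i:j], j, False
        elif c == ".":
            unit, i, literal = c, i + 1, False
        else:
            unit, i, literal = c, i + 1, True
        if i < len(pattern) and pattern[i] in "*+?":  # quantifier binds to the unit
            unit, i, literal = unit + pattern[i], i + 1, False
        units.append((unit, literal))
    components = []                                   # merge adjacent literals into strings
    for unit, literal in units:
        if components and literal and components[-1][1]:
            components[-1] = (components[-1][0] + unit, True)
        else:
            components.append((unit, literal))
    return [("str" if lit else "FA", text) for text, lit in components]

print(decompose_linear(".*foo[^X]barY+"))
# [('FA', '.*'), ('str', 'foo'), ('FA', '[^X]'), ('str', 'bar'), ('FA', 'Y+')]
```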

In general, a decomposable linear regex can be expressed as FAn strn FAn−1 strn−1 ··· str2 FA1 str1 FA0. For any successful match of the original regex, all strings must be matched in the same order as they appear. Based on this observation, Hyperscan applies the following three rules for regex matching:

1. String matching is the first step. It scans the entire input to find all strs. Each successful match of a str may trigger the execution of its neighboring FA matching.

2. Each FA has its own switch for execution. It is off by default, except for the leftmost FA component.

3. Consider a generalized form, left FA str right, where left or right is an arbitrary sequence of decomposed components, possibly empty. Only if all components of left are matched successfully is the switch of FA turned on. Only if str is matched successfully and the FA switch is on is FA matching executed. Finally, only if FA is matched successfully is the leftmost FA of right turned on.

Let's take one example regex, .*foo[^X]*bar.*Y+,

and consider two input cases. The regex pattern is decomposed into FA2 str2 FA1 str1 FA0, where FA2=.*, str2=foo, FA1=[^X]*, str1=bar, and FA0=.*Y+.

• Input=XfooZbarY: This is overall a successful match. First, the string matcher finds str2 (foo) and triggers matching of FA2 (.*) against X, since the leftmost FA switch is always on. Then the switch of FA1 ([^X]*) is turned on. After that, the matcher finds str1 (bar), which triggers matching of FA1 against Z, and its success turns on the switch of FA0 (.*Y+). Since FA0 is the rightmost FA, it is executed against the remaining input, Y.

• Input=XfoZbarY: This is overall a failed match. First, the string matcher finds str1 (bar) and sees if it can trigger matching of FA1 ([^X]*). It then figures out that the switch of FA1 is off, since str2 (foo) was not found and thus neither FA2 nor str2 was a successful match. So the matching of FA1 terminates, ensuring no waste of CPU resources.

Our implementation keeps track of the input offsets of matched strings and the state of matching individual components, which allows invoking the appropriate FA matching with the correct byte range of the input.

Regex with alternation. If a regex includes an alternation like (A|B), we expand the alternation into two regexes only if both A and B can be decomposed into str and FA components (i.e., are decomposable). If not, (A|B) is treated as a single FA. In case A or B itself has an alternation, we recursively apply the same rule until there is no alternation in their subregexes. Each expanded regex then becomes a linear regex, which benefits from the same rules for decomposition and matching as before.
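The three matching rules can be sketched in code. The following Python sketch is our own simplification, not Hyperscan's implementation: it processes the strings sequentially rather than scanning for all of them at once, and uses Python's re module as a stand-in FA matcher for the example decomposition above.

```python
import re

# Decomposition of /.*foo[^X]*bar.*Y+/ into alternating FA and string
# components, left to right: FA2, str2, FA1, str1, FA0.
COMPONENTS = [".*", "foo", "[^X]*", "bar", ".*Y+"]

def match_decomposed(components, data):
    fas = components[0::2]          # FA components
    strs = components[1::2]         # string components
    switch = [True] + [False] * (len(fas) - 1)  # only the leftmost FA is on
    pos = 0
    for i, s in enumerate(strs):
        hit = data.find(s, pos)     # rule 1: string matching drives everything
        if hit < 0 or not switch[i]:
            return False            # string absent, or left parts unmatched
        # Rule 3: run the FA to the left of s on the bytes between matches.
        if not re.fullmatch(fas[i], data[pos:hit]):
            return False
        pos = hit + len(s)
        switch[i + 1] = True        # turn on the FA to the right of s
    # The rightmost FA consumes the remaining input.
    return re.fullmatch(fas[-1], data[pos:]) is not None

print(match_decomposed(COMPONENTS, "XfooZbarY"))  # successful match
print(match_decomposed(COMPONENTS, "XfoZbarY"))   # fails: "foo" absent
```

Note that no FA is ever executed unless its guarding string has matched, which is the source of the CPU savings discussed below.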

Pattern matching with regex decomposition presents its main benefit: it minimizes the waste of CPU cycles on unnecessary FA matching, because FA matching is executed only when it is needed. It also increases the chance of fast DFA matching, as each decomposed FA is smaller and thus more likely to be converted into a DFA. In contrast, the prefiltering approach has to execute matching of the entire regex even when it is not needed (e.g., matching bar in the example above would trigger regex matching even if foo is not found), and regex matching must re-match the string already found in string

634 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

matching. Furthermore, conversion of a whole regex into a single FA is not only more complex, but often ends up with a slower NFA to avoid state explosion. In terms of correctness, pattern matching with regex decomposition produces the same result as the original regex, but we leave its formal proof as future work.

3.2 Rationale and Guidelines

In practice, performing regex decomposition on the textual form is often tricky, as some string segments may be hidden behind special regex syntax. We provide several such examples below.

• Character class (or character-set): b[il1]l\s{0,10} includes a character class that can be expanded to three strings (bil, bll, and b1l), while naïve textual extraction might find only 'b' and 'l'.

• Alternation: The alternation sequence in .*\x2d(h|H)(t|T)(t|T)(p|P) makes it harder to discover http sequences by textual extraction.

• Bounded repeats: From the perspective of text, the strings with a minimum length of 32 are hidden in the bounded repeat [\x40-\x90]{32}.

To reliably find these strings, we perform regex decomposition on the Glushkov NFA graph [27], which benefits from structural graph analyses. We describe useful guidelines for finding the strings effective for regex matching:

1. Find a string that divides the graph into two subgraphs, with the start state in one subgraph and the set of accept states in the other. Matching such a string is a necessary condition for any successful match of the entire regex. If the start and accept states happen to be in the same subgraph, the corresponding FA will always have to run regardless of a string match.

2. Avoid short strings. Short strings are prone to match more frequently and are likely to trigger expensive FA matching that fails.3

3. Expand small character-sets4 to multiple strings to facilitate decomposition. This not only increases the chance of successful decomposition but also leads to a longer string if a character-set intercepts a string sequence (e.g., document[\x22\x27]object).

4. Avoid having too many strings. Too many strings would overload the matcher and degrade overall performance, so it is important to find a small set of good strings effective for regex matching.

3 Our current limit is 2 to 3 characters.
4 Our current implementation treats a character-set that expands to 11 or fewer strings as a small character-set.

Figure 3: Dominant path analysis.

3.3 Graph-based String Extraction

We develop three graph analysis techniques that discover strings on the critical path of matching. We describe the key idea of each algorithm below and provide more detailed algorithms in an appendix.

Dominant path analysis. A vertex u is called a dominator of vertex v if every path from the start state to v must go through u. The dominant path of vertex v is defined as a set of vertices W in the graph, where each vertex in W is a dominator of v and the vertices form a trace of a single path. Dominant path analysis finds the longest common string that exists in the dominant paths of all accept states. For example, Figure 3 shows the string on the dominant path of the accept state (vertex 11).
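The dominator relation underlying this analysis can be sketched with the classic iterative data-flow formulation. The graph below is illustrative (not Figure 3 from the paper): two character-class branches rejoin before a run of literal vertices.

```python
def dominators(graph, start):
    """graph: {vertex: set of successors}; returns {v: set of dominators of v}."""
    verts = set(graph) | {s for ss in graph.values() for s in ss}
    preds = {v: {u for u in verts if v in graph.get(u, set())} for v in verts}
    dom = {v: set(verts) for v in verts}   # start pessimistically with all
    dom[start] = {start}
    changed = True
    while changed:                         # iterate to a fixpoint
        changed = False
        for v in verts - {start}:
            new = ({v} | set.intersection(*(dom[p] for p in preds[v]))
                   if preds[v] else {v})
            if new != dom[v]:
                dom[v], changed = new, True
    return dom

# Start state 0 branches into two character-class vertices (1, 2) that
# rejoin at 3; literal vertices 3 ('a'), 4 ('b'), 5 ('c') lie on every path.
graph = {0: {1, 2}, 1: {3}, 2: {3}, 3: {4}, 4: {5}, 5: set()}
labels = {3: "a", 4: "b", 5: "c"}
dom = dominators(graph, 0)
path = [v for v in sorted(dom[5]) if v in labels]  # dominators of the accept state
print("".join(labels[v] for v in path))            # → abc
```

Every path from the start to the accept state must cross the literal vertices 3, 4, and 5, so "abc" is a necessary string for any match, exactly the property guideline 1 asks for.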

The string selected by this analysis is highly desirable for matching, as it cleanly divides the start and accept states into two separate subgraphs, satisfying the first guideline. The algorithm calculates the dominant path for each accept state and finds the longest common prefix of all dominant paths. It then extracts the string on the chosen path. If a vertex on the path is a small character-set, we expand it and obtain multiple strings.

Dominant region analysis. If dominant path analysis fails to extract a string, we perform dominant region analysis. It finds a region of vertices that partitions the start state into one subgraph and all accept states into the other. More formally, a dominant region is defined as a subset of vertices in a graph such that (a) the set of all edges that enter and exit the region constitutes a cut-set of the graph; (b) for every in-edge (u, v) to the region, there exist edges (u, w) for all w in {w | w is in the region and w has an in-edge}; and (c) for every out-edge (u, v) from the region, there exist edges (w, v) for all w in {w | w is in the region and w has an out-edge}, where (u, v) denotes an edge from vertex u to vertex v in the graph.

If a discovered region consists of only string or small character-set vertices, we transform the region into a set of strings. Since these strings connect two disjoint subgraphs of the original graph, any match of the whole regex must match one of these strings. Figure 4 shows an example of a dominant region with 9 vertices. Vertices 5, 6, and 7 are the entry points with the same predecessors, and vertices


Figure 4: Dominant region analysis.

Figure 5: Network flow analysis.

11, 12, and 13 are the exits with the same successor. We can extract the strings foo, bar, and abc as a result of dominant region analysis.

The algorithm for dominant region analysis first creates a directed acyclic graph (DAG) from the original graph to avoid any interference from back edges. It then performs a topological sort on the DAG and iterates over each vertex to see whether it can be added to the current candidate region, i.e., whether its boundary edges form a valid cut-set. We repeat this to discover all regions in the graph. Since we only analyze the DAG, the back edges of the original graph might affect correctness. Thus, for each back edge whose source and target vertices fall in different regions, we merge the two (and all intervening regions) into a single region. Finally, we extract the strings from the dominant region.

Network flow analysis. Since dominant path and dominant region analyses depend on a special graph structure, they may not always succeed. Thus, we run network flow analysis for generic graphs. For each edge, the analysis finds the string (or multiple strings) that ends at that edge. The edge is then assigned a score inversely proportional to the length of the string(s) ending there: the longer the string, the smaller the score. With a score per edge, the analysis runs the "max-flow min-cut" algorithm [25] to find a minimum cut-set that splits the graph into two parts, separating the start state from all accept states. The discovered cut-set thus consists of the edges that generate the longest strings in the graph.

Figure 5 shows a result of network flow analysis, extracting the string set foo, efgh, and abc, which divides the whole graph into two parts.

Effectiveness of graph analysis. Our graph analysis effectively produces good strings for most real-world rules. Table 1 shows that 97.2% to 99.2% of the decomposable real-world regex rules benefit from dominant path

Ruleset   Total   Decomp   D-Path   D-Reg   N-flow
S-V       1663    1563     1551     32      16
S-E       7564    6756     6575     100     203
Suri      7430    6501     6318     94      201

Table 1: Effectiveness of graph analysis on real-world rulesets. S-V: Snort Talos (May 2015); S-E: Snort ET-Open 2.9.0; Suri: Suricata ruleset 4.0.4. D-Path, D-Reg, and N-flow refer to dominant path, dominant region, and network flow analysis, respectively. Decomp is the total number of decomposable rules. Note that one regex could benefit from multiple graph analyses, so the sum of the graph analysis columns is larger than the Decomp field.

Figure 6: Two-stage matching with FDR. Extended shift-or matching scans the input stream for candidate matches of the string patterns, which are then verified by hashing and exact matching.

analysis, while the remaining patterns exploit dominant region and network flow analysis. These strings prove to be highly beneficial for reducing the number of regex matching invocations, as shown in Section 6.2.

4 SIMD-accelerated Pattern Matching

In this section, we present the design of multi-string and FA matching algorithms that leverage the SIMD operations of modern CPUs.

4.1 Multi-string Pattern Matching

We introduce an efficient multi-string matcher called FDR.5 The key idea of FDR is to quickly filter out innocent traffic with fast input scanning. As shown in Figure 6, FDR performs extended shift-or matching [13] to find candidate input strings that are likely to match some string pattern. It then verifies them to confirm an exact match.

Shift-or matching. We first provide a brief background on shift-or matching, which serves as the base algorithm of FDR. The shift-or algorithm finds all occurrences of a string pattern in the input bytestream by performing bitwise shift and or operations, as shown in Figure 7. It uses two data structures: a shift-or mask for each character c in the symbol set (sh-mask('c')), and a state mask (st-mask) for the matching operation. sh-mask('c') zeros all bits whose bit position corresponds to a byte position of c in the string pattern, while all other bits are set to 1. The bit position in an sh-mask is counted from the rightmost bit, while the byte position in a pattern is counted from the leftmost byte. For example, for the string pattern aphp,

5 It is named after the 32nd President of the US.


Figure 7: Classical shift-or matching. For the pattern aphp, sh-mask('a') = 11111110, sh-mask('h') = 11111011, and sh-mask('p') = 11110101; the state mask is updated per input character as m_i = (m_{i-1} << 1) | sh-mask(input character).

Figure 8: Example shift-or masks with two patterns (ab and cd) at bucket 0. No other bucket contains 'a', 'b', 'c', or 'd' in its patterns.

sh-mask('p') = 11110101, as 'p' appears at the second and the fourth byte positions in the pattern. If a character is unused in the pattern, all bits of its sh-mask are set to 1. The algorithm keeps an st-mask whose size is equal to the length of an sh-mask. Initially, all bits of the st-mask are set to 1. The algorithm updates the st-mask for each input character 'x' as st-mask = ((st-mask << 1) | sh-mask('x')). For each matching input character, a 0 is propagated to the left by one bit. If the zero reaches the bit position equal to the length of the pattern, the pattern string has been found.
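As a concrete reference, here is a minimal Python version of this classical single-pattern scan, with bit 0 corresponding to the leftmost pattern byte as described above (a sketch, not Hyperscan code):

```python
def shift_or(pattern: str, text: str):
    """Return the start offsets of all occurrences of pattern in text."""
    m = len(pattern)
    ones = (1 << m) - 1
    sh = {}                                # sh-mask per character
    for i, c in enumerate(pattern):
        sh[c] = sh.get(c, ones) & ~(1 << i)  # zero the bit for byte position i
    state = ones                           # st-mask: all bits set initially
    hits = []
    for pos, c in enumerate(text):
        # unused characters default to the all-ones mask
        state = ((state << 1) | sh.get(c, ones)) & ones
        if not state & (1 << (m - 1)):     # a zero reached the pattern length
            hits.append(pos - m + 1)       # record the match's start offset
    return hits

print(shift_or("aphp", "xaphpaphp"))  # → [1, 5]
```

The inner loop is branch-free apart from the match test, which is what makes the algorithm amenable to the SIMD extensions described next.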

The original shift-or algorithm runs fast with high space efficiency, but it has two limitations. First, it supports only a single string pattern. Second, although it consists of bit-level operations, the implementation cannot benefit from SIMD instructions except for very long patterns. We tackle these problems as follows.

Multi-string shift-or matching. To support multiple string patterns, we update our data structures. First, we divide the set of string patterns into n distinct buckets, where each bucket has an id from 0 to n−1. For now, assume that each string pattern belongs to one of the n buckets; we discuss how the patterns are divided later ('pattern grouping'). Second, we increase the size of the sh-masks and the st-mask by a factor of n, so that a group of n bits in sh-mask('x') records all the buckets that have 'x' in some of their patterns. More precisely, the k-th group of n bits in sh-mask('x') encodes the ids of all buckets that hold at least one pattern with 'x' at the k-th byte position. One difference from the original algorithm is that the byte position in a pattern is counted from the rightmost byte. This enables parallel execution of multiple CPU instructions per cycle, as explained later ('SIMD acceleration'). For efficient implementation, we

Figure 9: FDR's extended shift-or matching. Pre-shifted sh-masks for the input aphp are ORed into the st-mask, producing candidate matches in buckets 0 and 4 at input byte position 3.

set n to 8, so that each byte position in an sh-mask corresponds to the same byte position in a pattern. This implies that the length of an sh-mask should be no smaller than the longest pattern.

Figure 8 shows an example. Bucket 0 has two string patterns, ab and cd. Since 'a' appears at the second byte of only ab in bucket 0, sh-mask('a') zeros only the first bit (= bucket 0) of the second byte. The i-th bit within each byte of an sh-mask indicates bucket id i−1. If a byte position of an sh-mask exceeds the longest pattern of a certain bucket (a 'padding byte'), we encode the bucket id in the padding byte. This ensures matching correctness by carrying a match at a lower input byte along in the shift process. Examples of this are the padding bytes in Figure 8 and the sh-masks in Figure 9 that zero the first bit in the third and fourth bytes.

The pattern matching process is similar to the original algorithm, except that the sh-masks are shifted left instead of the st-mask. The st-mask is initially 0 except for the byte positions smaller than the shortest pattern. This avoids a false-positive match at a position smaller than the shortest pattern. We then proceed with the input characters. The matcher keeps k, the number of characters processed so far modulo n. For an input character 'x', st-mask |= (sh-mask('x') << k bytes). The matcher repeats this for n input characters and then checks whether the st-mask has any zero bits. A zero bit represents a possible match at the corresponding bucket. For example, Figure 9 shows that buckets 0 and 4 have a potential match at input byte position 3. The verification stage, illustrated later, checks whether each candidate is a real match or a false positive.

Pattern grouping. The strategy for grouping patterns into buckets affects matching performance. A good


strategy would distribute the patterns so that most innocent traffic passes with a low false-positive rate. Toward this goal, we design our algorithm based on two guidelines. First, we group patterns of similar length into the same bucket. This minimizes the information loss of longer patterns, as input characters match only up to the length of the shortest pattern in a bucket for matching correctness. Second, we avoid grouping too many short patterns into one bucket; in general, shorter patterns are more likely to increase false positives. To meet these requirements, we sort the patterns in ascending order of length and assign ids 0 to (s−1) by the sorted order. We then run an algorithm that calculates the minimum cost of grouping the patterns into n buckets using dynamic programming. The algorithm is summarized by the two equations below:

1. t[i][j] = min over i ≤ k ≤ s−1 of (cost_{i,k} + t[k+1][j−1]), where s is the number of patterns and t[i][j] stores the minimum cost of grouping patterns i to (s−1) into (j+1) buckets.

2. cost_{i,k} = (k − i + 1)^α / (length_i)^β, where cost_{i,k} is the cost of grouping patterns i to k into one bucket, length_i is the length of pattern i, and α and β are constant parameters.
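The two recurrences can be implemented directly with memoization. The sketch below is our own reading of the recurrence (the base cases and the sample pattern lengths are illustrative), using the α = 1.05 and β = 3 values the implementation adopts:

```python
from functools import lru_cache

ALPHA, BETA = 1.05, 3.0  # constants used by the implementation

def group(lengths, n_buckets=8):
    """lengths: pattern lengths sorted ascending.
    Returns (minimal cost, bucket boundaries as exclusive end indices)."""
    s = len(lengths)

    def cost(i, k):
        # Cost of grouping patterns i..k into one bucket: many patterns are
        # cheap if the shortest pattern in the bucket (pattern i) is long.
        return (k - i + 1) ** ALPHA / lengths[i] ** BETA

    @lru_cache(maxsize=None)
    def t(i, j):
        if i == s:                       # no patterns left
            return 0.0, ()
        if j == 0:                       # last bucket takes the rest
            return cost(i, s - 1), (s,)
        best = None
        for k in range(i, s):            # bucket holds patterns i..k
            sub_cost, sub_cuts = t(k + 1, j - 1)
            cand = (cost(i, k) + sub_cost, (k + 1,) + sub_cuts)
            if best is None or cand[0] < best[0]:
                best = cand
        return best

    return t(0, n_buckets - 1)

total, cuts = group([2, 2, 3, 4, 8, 8, 9, 16, 16, 32], n_buckets=3)
print(cuts)  # boundaries chosen so short patterns are not piled together
```

Because the exponent α is slightly above 1, splitting a bucket always lowers the cost, so all n buckets are used; the length term in the denominator then steers short patterns into small buckets.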

t[i][j] is calculated as the minimum over k of the sum of the cost of grouping patterns i to k into one bucket (cost_{i,k}) and the minimum cost of grouping the remaining patterns (k+1) to (s−1) into j buckets (t[k+1][j−1]). cost_{i,k} gets smaller as the bucket has a longer shortest pattern, which allows more patterns in the bucket; it gets larger as the bucket has a shorter pattern, limiting the number of such patterns. Our implementation currently uses α = 1.05 and β = 3, computes t[0][7] to divide all string patterns into 8 buckets, and records the bucket id of each pattern in the process. In practice, we find that the algorithm works well, automatically reaching a sweet spot that minimizes the total cost.

Super characters. One problem with bucket-based matching is that it produces false positives for patterns in the same bucket. For example, if a bucket has ab and cd, the algorithm matches not only the correct patterns but also the false positives ad and cb. To suppress them, we use an m-bit (m > 8) super character (instead of an 8-bit ASCII character) to build and index the sh-masks. An m-bit super character consists of a normal (8-bit) character in the lower 8 bits and the low-order (m−8) bits of the next character in the upper bits. For the last character of a pattern (or of the input), we use a null character (0) as the next character. The key idea is to reflect some information about the next character of a pattern when building the sh-mask for the current character.

Only if the same two characters appear in the input6 do we declare a match at that input byte position. This significantly reduces false positives at the cost of slightly larger memory for the sh-masks.
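The super-character construction itself is a few lines of bit manipulation. Here is a sketch for m = 12 (a value the design permits; the function names are ours):

```python
M = 12  # super-character width in bits

def super_char(cur: int, nxt: int) -> int:
    """Combine the current byte with the low (M-8) bits of the next byte."""
    return ((nxt & ((1 << (M - 8)) - 1)) << 8) | cur

def indices(data: bytes):
    """Super-character sh-mask indices for every byte offset of the input;
    the byte after the last one is treated as a null character (0)."""
    return [super_char(data[i], data[i + 1] if i + 1 < len(data) else 0)
            for i in range(len(data))]

# "ab" and "ad" now index different sh-masks even though both start with 'a',
# so the cross-pattern false positive "ad" (from ab/cd in one bucket) is gone.
print(hex(indices(b"ab")[0]), hex(indices(b"ad")[0]))
```

Indexing by 2^M entries instead of 256 is the memory cost mentioned above; with m = 12 the sh-mask table grows 16-fold.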

In practice, m should be between 9 and 15. Let's say m = 12 bits. For a pattern ab, we get two 12-bit super characters: α = ((low-order 4 bits of 'b' << 8) | 'a') and β = 'b'. We then build sh-masks for α and β, respectively. When the input arrives, we construct a 12-bit super character based on the current input byte offset and use it as an index to fetch its sh-mask, advancing the input one byte at a time as before. For example, if the input is ad, the matcher first constructs γ = ((low-order 4 bits of 'd' << 8) | 'a'), fetches sh-mask(γ), and performs the shift and or operations as before. It then advances to the next byte and constructs δ = 'd'. So the input ad will not match even if a bucket contains ab and cd.

SIMD acceleration. Our implementation of FDR heavily exploits SIMD operations and instruction-level parallelism. First, it uses 128-bit sh-masks so that it can employ 128-bit SIMD instructions (e.g., pslldq for left shift and por for or in the Intel x86-64 instruction set) to update the masks. As shift and or are the most frequent operations in FDR, it enjoys a substantial performance boost from the SIMD instructions. Second, it exploits parallel execution with multiple CPU execution ports. In the original shift-or matching, the execution of the shift and or operations is completely serialized, as each depends on the previous result. This under-utilizes a modern CPU even when it can issue multiple instructions per cycle. In contrast, FDR exploits instruction-level parallelism by pre-shifting the sh-masks of multiple input characters in parallel. Note that this is made possible because we count byte positions differently from the original version. The parallel execution effectively increases the instructions per cycle (IPC) and significantly improves performance. To accommodate the parallel shifting, we limit the length of a pattern to 8 bytes and extract the lower 8 bytes from any pattern longer than 8 bytes. Because matching requires at minimum 8-byte masks and up to 7 bytes of shifting, a 128-bit mask does not lose high-bit information during left shifts. In actual matching, FDR handles 8 bytes of input at a time. To guarantee contiguous matching across 8-byte input boundaries, the st-mask of the previous iteration is shifted right by 8 bytes for the next iteration.

Verification. As our shift-or matching can still generate false positives, we need to verify whether a candidate match is

6 Of course, there is still a small chance of a false positive since we use only partial bits of the next character, but the probability becomes fairly small as the pattern length grows.


Algorithm 1: FDR Multi-string Matcher

function MATCH
  n = number of bits of a super character
  R = startMask
  for each 8-byte V ∈ input do
    for i ∈ 0..7 do
      index = V[i*8 .. i*8+n−1]
      M[i] = shiftOrMask[index]
      S[i] = LSHIFT(M[i], i)
    end for
    for i ∈ 0..7 do
      R = OR128(R, S[i])
    end for
    for each zero bit b in the low 8 bytes of R do
      j = the byte position containing bit b
      l = length of the string in bucket b
      h = HASH(V[j−l+1 .. j])
      if h is valid then
        Perform exact matching for each string in hash_bucket[h]
      end if
    end for
    R = RSHIFT(R, 8)
  end for
end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. We then leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find that hashing filters out a large portion of the false positives.
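Putting the two stages together, here is a compact Python model of the pipeline. It is our own simplification of the design: Python integers stand in for the 128-bit SIMD registers, all patterns share length 2 so the padding-byte logic is unnecessary, there are no super characters, and plain string comparison stands in for the per-bucket hash tables.

```python
N = 8    # buckets; bit i within each byte-wide group tracks bucket i
L = 2    # all sketch patterns have length 2 (illustrative pattern set)
BUCKETS = {0: ["ab", "cd"], 1: ["cb"]}

ONES = (1 << (8 * L)) - 1
sh = [ONES] * 256                       # sh-mask per input byte value
for b, pats in BUCKETS.items():
    for pat in pats:
        for pos, ch in enumerate(pat.encode()):
            sh[ch] &= ~(1 << (8 * pos + b)) & ONES  # zero bucket b's bit

def fdr_lite(text: bytes):
    state = ONES
    out = []
    for i, ch in enumerate(text):
        # one byte of shift per input character carries bucket bits forward
        state = ((state << 8) | sh[ch]) & ONES
        for b in range(N):
            if not state & (1 << (8 * (L - 1) + b)):   # candidate in bucket b
                cand = text[i - L + 1:i + 1].decode()
                for pat in BUCKETS.get(b, []):          # verification stage
                    if cand == pat:
                        out.append((b, i, pat))
    return out

print(fdr_lite(b"xcby"))  # → [(1, 2, 'cb')]
```

On input "xcby", the shift-or stage flags "cb" as a candidate in both buckets (it is a cross-pattern false positive of ab/cd in bucket 0), and verification keeps only bucket 1's real match.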

4.2 Finite Automata Pattern Matching

A successful string match often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold,7 we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, there can be multiple current states (the current set) that are active at any time. A state transition on an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match succeeds if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation because it outperforms traditional NFA representations, which use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. The bit-based NFA also leverages SIMD instructions to perform vectorized bit

7 We use 16,384 states as the threshold.

operations to further accelerate performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n−1 in topological order, and maintains the current set as a bit mask (the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are assigned sequentially in topological order, the span of a state transition is typically small. We exploit this fact to compactly represent the transitions below.

The bit-based NFA implements a state transition on an input character 'c' in three steps. First, it calculates the set of states that can be transitioned to from any state in the current set on any input character. Second, it computes the set of all states that can be transitioned to from any state on 'c' (the states reachable by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, because a Glushkov NFA graph guarantees that a given state can be entered only on a single character or character-set.

The challenge is to calculate the successor set efficiently. One option is to pre-calculate a successor set for every combination of current states and look it up with the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable for all but small n. An alternative is to keep a successor-set mask per individual state and combine the successor sets of every state in the current set. This is more space-efficient, but it costs up to n memory lookups and (n−1) or operations. We implement the latter but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some example shift-k masks: the shift-1 mask sets every bit to 1 except bit 7, since every state except state 7 has a forward transition of span 1.

We divide state transitions into two types: typical and exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set of the current set following typical transitions: if the current-set mask is S, then ((S & shift-k mask) << k) represents all possible successor states reached from S by transitions of span k. Combining the successor sets for all k yields the successor set reached by all typical transitions.
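For the Figure 10 automaton, this computation can be sketched at the bit level as follows. The masks are hand-derived by us from the (AB|CD)*AFF* Glushkov graph (states 0 = start, 1 = A, 2 = B, 3 = C, 4 = D, 5 = A, 6 = F, 7 = F); this is an illustration, not Hyperscan code.

```python
SHIFT_LIMIT = 2
SHIFT_MASKS = [0b10000000,   # span 0: state 7's self-loop
               0b01111111,   # span 1: every state except 7
               0b00000000]   # span 2: none in this graph
EX_MASK = 0b00010101         # exceptional states: 0, 2, 4
SUCC_MASKS = {0: 0b00101000,  # 0 -> {3, 5} (spans 3 and 5)
              2: 0b00100010,  # 2 -> {1, 5} (spans -1 and 3)
              4: 0b00001010}  # 4 -> {1, 3} (spans -3 and -1)
# Glushkov property: a state is entered only on its own symbol.
REACH = {"A": 0b00100010, "B": 0b00000100, "C": 0b00001000,
         "D": 0b00010000, "F": 0b11000000}
ACCEPT = 0b11000000          # states 6 and 7: "AF" has been consumed

def run_nfa(text: str) -> bool:
    S = 0b00000001                                # start state 0 is active
    for c in text:
        succ = 0
        for k in range(SHIFT_LIMIT + 1):          # typical transitions
            succ |= (S & SHIFT_MASKS[k]) << k
        ex = S & EX_MASK                          # exceptional transitions
        for s, mask in SUCC_MASKS.items():
            if ex & (1 << s):
                succ |= mask
        S = succ & REACH[c] & 0xFF                # intersect with reach set
    return bool(S & ACCEPT)

print(run_nfa("ABAF"), run_nfa("CDAF"), run_nfa("ABC"))  # → True True False
```

In the real matcher, the Python loop over shift-k masks becomes a handful of SIMD shift/and/or instructions, and only the (rare) exceptional states cost per-state mask lookups.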


Figure 10: NFA representation for (AB|CD)*AFF*. Shift-0/1/2 masks cover the typical transitions; states 0, 2, and 4 have exceptional out-edges and are held in the exception mask, each with its own successor mask.

We call all other transitions exceptional: forward transitions whose span exceeds the limit, and all backward transitions.8 Any state with at least one exceptional transition keeps its own successor mask, which records all states reached by the exceptional transitions of its owner state. All exceptional states are tracked in an exception mask.

The choice of the shift limit affects performance. If it is too large, we have many shift-k masks representing rare transitions; if it is too small, we must handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with their id differences. States 0, 2, and 4 are highlighted, as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks reachable by typical transitions (SUCC_TYP) and by exceptional transitions (SUCC_EX). It then fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise and operation with the combined successor mask (SUCC). The result is the final successor set; we report a match if it includes any accept state, and otherwise proceed with the next input character. For each character, matching runs in O(l + e), where l is the shift limit and e is the number of exceptional states. Our implemen-

8 In our implementation, forward transitions that cross a 64-bit boundary of the state id space (e.g., from an id smaller than 64 to an id of 64 or larger) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2: Bit-based NFA Matching

SH_MSKS[i]: shift-i mask for typical transitions
SUCC_MSKS[i]: successor mask for state i
EX_MSK: exception mask
reach[k]: reachable state set for character k

function RUNNFA(S: current active state set)
  SUCC_TYP = 0; SUCC_EX = 0
  for c ∈ input do
    if any state is active in S then
      for i = 0 to shift_limit do
        R0 = AND(S, SH_MSKS[i])
        R1 = LSHIFT(R0, i)
        SUCC_TYP = OR(SUCC_TYP, R1)
      end for
      S_EX = AND(S, EX_MSK)
      for each active state s in S_EX do
        SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
      end for
      SUCC = OR(SUCC_TYP, SUCC_EX)
      S = AND(SUCC, reach[c])
      Report accept states in S
    end if
  end for
end function

Total size: 1 GByte
Number of packets: 818,682
Number of TCP packets: 818,520
Percent of TCP bytes: 97.2%
Percent of HTTP bytes: 92.9%
Average packet size: 1,265 bytes

Table 2: HTTP traffic trace from a cloud service provider.

tation uses a 128-bit mask (extended to up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough to represent the NFA graphs of most regexes.

5 ImplementationHyperscan consists of compile and run time functionsCompile-time functions include regex-to-NFA graph con-version graph decomposition and matching componentsgeneration The run time executes regex matching on theinput stream While we cover the core functions in Sec-tion 3 and 4 Hyperscan has a number of other subsystemsand optimizationsbull Small string-set (lt80) matching This subsystem imple-

ments a shift-or algorithm using the SSSE3 PSHUFBinstruction applied over a small number (2-8) of 4-bitregions in the suffix of each string

bull NFA and DFA cyclic state acceleration: Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. When the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration to search for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states, or that switches off one of the current states.

640 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

# of regexes   Prefilter      Hyperscan   Reduction
500            2,971,652      645,326     4.6x
1000           93,595,304     714,582     131.0x
1500           110,122,972    791,017     139.2x
2000           139,804,519    780,665     179.1x
2500           156,332,187    857,100     182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

bull Small-size DFA matching: We design a SIMD-based algorithm for small-size DFAs (< 16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

bull Anchored pattern matching: When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the automata corresponding to them are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

bull Suppression of futile FA matching: We design a fast lookaround approach that peeks at inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all, or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment /R\d\s{4,5}foo/, where R is a complex regex, we can first detect with SIMD checks that the digit and space character classes have matched, and if not, avoid or defer running a potentially expensive FA associated with R.
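The shift-or technique behind the small string-set subsystem can be illustrated in scalar form. The sketch below searches for a single pattern using plain 64-bit words; it omits the PSHUFB nibble-table trick and multi-pattern bucketing that the real subsystem uses, and the function name is our own:

```c
#include <stdint.h>
#include <string.h>

/* Scalar shift-or search for a single pattern (length <= 64).
 * mask[c] has bit i CLEAR iff pattern[i] == c; the state word keeps a
 * 0 bit at position i wherever pattern[0..i] matches the current suffix
 * of the scanned text.  Returns the start offset of the first match,
 * or -1 if none. */
static int shift_or_find(const char *text, const char *pat) {
    size_t m = strlen(pat), n = strlen(text);
    if (m == 0 || m > 64) return -1;

    uint64_t mask[256];
    for (int c = 0; c < 256; c++) mask[c] = ~0ULL;
    for (size_t i = 0; i < m; i++)
        mask[(unsigned char)pat[i]] &= ~(1ULL << i);

    uint64_t state = ~0ULL;
    for (size_t j = 0; j < n; j++) {
        /* shift in one "possible prefix" bit, then knock out prefixes
         * inconsistent with the current character */
        state = (state << 1) | mask[(unsigned char)text[j]];
        if ((state & (1ULL << (m - 1))) == 0)
            return (int)(j - m + 1);
    }
    return -1;
}
```

The per-character work is two bitwise operations and a test, independent of how many prefixes are simultaneously alive, which is what makes the approach attractive to vectorize.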

6 Evaluation

In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than manual choice? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup

We use a server machine with an Intel Xeon Platinum 8180 CPU (2.50GHz) and 48 GB of memory, and compile the code with GCC 5.4. To separate out the impact of network I/O, we evaluate the performance with a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

# of regexes   Prefilter     Hyperscan   Reduction
700            699,622       18,164      38.5x
850            7,516,464     19,214      391.2x
1000           17,063,344    32,533      524.5x
1150           17,737,814    34,075      520.6x
1300           25,143,574    36,040      697.7x

Table 4: Regex invocations for Snort's Talos ruleset

6.2 Effectiveness of Regex Decomposition

The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace, and we confirm correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when it is necessary.

6.3 Microbenchmarks

We evaluate the performance of FDR, our multi-string matcher, as well as that of the regex matching of Hyperscan.

Multi-string pattern matching: We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns extracted from the Snort ET-Open and Talos rulesets. Figures 11 and 12 show that FDR outperforms DFC, the state-of-the-art matcher, by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate it with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, the performance becomes comparable with DFC due to increased cache usage, but it remains much better than AC.

[Figure 11: String matching performance with random packets. (a) ET-Open ruleset; (b) Talos ruleset. Throughput (Gbps) and improvement over DFC vs. number of string patterns, for FDR, DFC, and AC.]

[Figure 12: String matching performance with a real traffic trace. (a) ET-Open ruleset; (b) Talos ruleset. Throughput (Gbps) and improvement over DFC vs. number of string patterns, for FDR, DFC, and AC.]

Ruleset    PCRE    PCRE2   RE2-s   Hyperscan-s   RE2-m   Hyperscan-m
Talos      6942    394     1777    173           29      2.15
ET-Open    12800   913     4696    516           1116    133

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1,300 regexes) and Suricata (2,800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

Regex matching: We now evaluate the performance of the regex matching of Hyperscan. We compare the performance with PCRE (v8.41) [6], as it is most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (e.g.,

match against one regex at a time), which would require passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1,300 regexes from the Snort Talos ruleset and 2,800 regexes from the Suricata ruleset.

Table 5 shows the result. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 13.5x faster than RE2-m, while it shows a 183.3x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs, and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both with a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort but replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect a much higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons

Hyperscan has been developed since 2008 and was first open-sourced in 2013. Hyperscan has been successfully adopted by over 40 commercial projects globally, and it is in production use by tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed in C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience with developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan

Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but also made it simple to employ in applications.

Version 1.0: The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). It also suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosions if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0: The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA Graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax

tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data, with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings were found. This approach avoids the high risk of potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.

Unfortunately, version 2.0 still had a number of limitations. First, we observed an adverse performance impact from prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines, even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns /foo/, /abc[xyz]/, and /abc[xy]*def|abcz*de[fg]/ might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern /abc[xyz]/, then provide matches for the two parts


of the alternation, abc[xy]*def and abcz*de[fg], without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0: (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel to the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.
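For context, the hlm scheme that FDR replaced was, as noted above, akin to Rabin-Karp [29] with a hash table per pattern length. The sketch below is our own minimal single-length simplification, not hlm's actual layout:

```c
#include <stdint.h>
#include <string.h>

/* Rabin-Karp style multi-string search for up to 16 patterns of one
 * fixed length (hlm reportedly kept separate tables per pattern length).
 * Rolls a base-B hash across the text and verifies candidate windows
 * with memcmp to rule out hash collisions.  Returns the offset of the
 * first occurrence of any pattern, or -1. */
#define B 257u

static uint32_t hash_win(const char *p, size_t len) {
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) h = h * B + (unsigned char)p[i];
    return h;
}

static int rk_find(const char *text, const char *const *pats,
                   int npat, size_t len) {
    size_t n = strlen(text);
    if (n < len || npat > 16) return -1;

    uint32_t hp[16];
    for (int i = 0; i < npat; i++) hp[i] = hash_win(pats[i], len);

    uint32_t pow = 1;                       /* B^(len-1), for rolling out */
    for (size_t i = 1; i < len; i++) pow *= B;

    uint32_t h = hash_win(text, len);
    for (size_t j = 0; ; j++) {
        for (int i = 0; i < npat; i++)
            if (h == hp[i] && memcmp(text + j, pats[i], len) == 0)
                return (int)j;
        if (j + len >= n) return -1;
        /* roll: drop text[j], shift in text[j + len] (mod 2^32) */
        h = (h - pow * (unsigned char)text[j]) * B
            + (unsigned char)text[j + len];
    }
}
```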

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0: Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identified the strings separated by various common regex constructs, such as .* or X+ for some character class X. For example, it is frequently the case that strings in regexes are separated by the .* construct (i.e., /foo.*bars/). This is usually implementable by requiring only that foo is seen before bars (usually, but not always; consider the expression /foo.*oars/). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

Version 5.0: Version 5.0, which is the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology for malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns beyond matching a single pattern. To

support this, the system now allows user-defined AND, OR, and NOT combinations along with their patterns. For example, an AND operation between patterns foobar and teakettle requires that both patterns are matched in the input before reporting a match. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE, which brings the benefits of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. The lack of support in Hyperscan for full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopted solutions. Chimera employs Hyperscan as a fast filter for input and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
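The logical-combination feature can be pictured as evaluating a small expression tree over per-pattern match flags collected during a scan. The sketch below (types and names are ours, not the Hyperscan API) encodes the foobar AND teakettle example from above:

```c
#include <stdbool.h>

/* Illustrative evaluator for a logical combination of patterns: leaves
 * test whether a given pattern id matched anywhere in the scanned input;
 * inner nodes combine results with AND/OR/NOT. */
typedef enum { LEAF, OP_AND, OP_OR, OP_NOT } op_t;

typedef struct node {
    op_t op;
    int pattern_id;              /* used only when op == LEAF */
    const struct node *l, *r;    /* r unused for LEAF and OP_NOT */
} node_t;

static bool eval(const node_t *n, const bool *matched) {
    switch (n->op) {
    case LEAF:   return matched[n->pattern_id];
    case OP_AND: return eval(n->l, matched) && eval(n->r, matched);
    case OP_OR:  return eval(n->l, matched) || eval(n->r, matched);
    case OP_NOT: return !eval(n->l, matched);
    }
    return false;
}

/* "foobar AND teakettle": report only if both patterns matched. */
static const node_t foobar    = { LEAF, 0, 0, 0 };
static const node_t teakettle = { LEAF, 1, 0, 0 };
static const node_t both      = { OP_AND, -1, &foobar, &teakettle };
```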

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support: The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes) and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimal Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths: Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with the additional power of supporting large patterns via decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases: There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests: One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series were, over time, not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated, suppressed potential optimizations, and interacted poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts, and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental sub-branch of Hyperscan, not widely released, supported capturing functionality for a limited

set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' the parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher, adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback from the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.

[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333-340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74-82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors and Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248-264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1-53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317-328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] S. K. Cha, I. Moraru, J. Jang, J. Truelove, D. Brumley, and D. G. Andersen. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
 1: function DOMINANTPATHANALYSIS(G)
 2:   dpath = {}
 3:   for v in accepts do
 4:     calculate dominant path p[v] for v
 5:     if dpath = {} then
 6:       dpath = p[v]
 7:     else
 8:       dpath = common_prefix(dpath, p[v])
 9:       if dpath = {} then
10:        return null_string
11:      end if
12:    end if
13:    strings = expand_and_extract(dpath)
14:  end for
15:  return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and then finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
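The core primitive behind this analysis, determining which states dominate each vertex, can be sketched with bitmasks for graphs of up to 32 states. The fixpoint iteration below is our own illustration, not Hyperscan's implementation:

```c
#include <stdint.h>

/* Fixpoint dominator computation over a small graph (<= 32 vertices,
 * vertex 0 = start): d dominates v iff every path from 0 to v passes
 * through d.  preds[v] is a bitmask of v's predecessors; dom[v] comes
 * back as the bitmask of v's dominators.  The dominant path of an
 * accept state is then its dominator set read in graph order. */
static void dominators(int n, const uint32_t *preds, uint32_t *dom) {
    uint32_t all = (n == 32) ? 0xffffffffu : (1u << n) - 1;
    dom[0] = 1u;                          /* start dominates itself only */
    for (int v = 1; v < n; v++) dom[v] = all;

    for (int changed = 1; changed; ) {
        changed = 0;
        for (int v = 1; v < n; v++) {
            uint32_t d = all;             /* intersect over predecessors */
            for (int p = 0; p < n; p++)
                if (preds[v] & (1u << p)) d &= dom[p];
            d |= 1u << v;                 /* every vertex dominates itself */
            if (d != dom[v]) { dom[v] = d; changed = 1; }
        }
    }
}
```

In a diamond graph (0 branching to 1 and 2, both rejoining at 3), neither branch vertex dominates 3, so only the vertices on the common prefix and suffix survive, which is exactly what makes the extracted string mandatory for any match.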

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
 1: function DominantRegionAnalysis(G)
 2:     acyclic_g = build_acyclic(G)
 3:     Gt = build_topology_order(acyclic_g)
 4:     candidate = {q0}
 5:     it = begin(Gt)
 6:     while it ≠ end(Gt) do
 7:         if isValidCut(candidate) then
 8:             setRegion(candidate)
 9:             initializeCandidate(candidate)
10:         else
11:             addToCandidate(it)
12:             it = it + 1
13:         end if
14:     end while
15:     setRegion(candidate)
16:     merge regions connected with back edges
17:     strings = expand_and_extract(regions)
18:     return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

USENIX Association    16th USENIX Symposium on Networked Systems Design and Implementation    647

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
1: function NetworkFlowAnalysis(G)
2:     for edge ∈ E do
3:         strings = find_strings(edge)
4:         scoreEdge(edge, strings)
5:     end for
6:     cuts = MinCut(G)
7:     strings = extract and expand strings from cuts
8:     return strings
9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of a string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
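The edge-scoring idea can be sketched as follows (the scale constant and helper name are ours, not Hyperscan's; the point is only that a min-cut then prefers to cut edges where long strings end):

```python
SCORE_SCALE = 1000  # hypothetical constant; any fixed numerator works

def edge_score(strings_ending_here):
    """Lower score = cheaper to cut = better (longer) extracted strings."""
    if not strings_ending_here:
        return SCORE_SCALE  # no string ends here: make this edge expensive to cut
    shortest = min(len(s) for s in strings_ending_here)
    return max(1, SCORE_SCALE // shortest)

print(edge_score(["abc"]))       # short string -> high score
print(edge_score(["abcdefgh"]))  # long string  -> low score
```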



matching. Furthermore, conversion of a whole regex into a single FA is not only more complex but often ends up with a slower NFA to avoid state explosion. In terms of correctness, pattern matching with regex decomposition produces the same result as the original regex, but we leave its formal proof as our future work.

3.2 Rationale and Guidelines

In practice, performing regex decomposition on its textual form is often tricky, as some string segments may be hidden behind special regex syntax. We provide several such examples below:

• Character class (or character-set): b[il1]l\s{0,10} includes a character class that can be expanded to three strings (bil, bll, and b1l), while naïve textual extraction might find only 'b' and 'l'.

• Alternation: The alternation sequence in in(.*\x2d(h|H)(t|T)(t|T)(p|P)) makes it harder to discover "http" sequences by textual extraction.

• Bounded repeats: From the perspective of text, the strings with a minimum length of 32 are hidden in bounded repeats such as [\x40-\x90]{32,}.

To reliably find these strings, we perform regex decomposition on the Glushkov NFA graph [27], which benefits from structural graph analyses. We describe useful guidelines for finding the strings effective for regex matching:

1. Find a string that would divide the graph into two subgraphs, with the start state in one subgraph and the set of accept states in the other. Matching such a string is a necessary condition for any successful match of the entire regex. If the start and accept states happen to be in the same subgraph, the corresponding FA will always have to run regardless of a string match.

2. Avoid short strings. Short strings are prone to match more frequently and are likely to trigger expensive FA matching that fails.³

3. Expand small character-sets⁴ into multiple strings to facilitate decomposition. This not only increases the chance of successful decomposition but also leads to a longer string when a character-set intercepts a string sequence (e.g., document[\x22\x27]object).

4. Avoid having too many strings. Too many strings would overload the matcher and degrade the overall performance. So it is important to find a small set of good strings effective for regex matching.

³ Our current limit is 2 to 3 characters.
⁴ Our current implementation treats a character-set that expands to 11 or fewer strings as a small character-set.
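These guidelines can be illustrated with a toy prefilter in Python (illustrative only; Hyperscan's real string matcher is the SIMD engine of Section 4). A string on every start-to-accept path is a necessary condition for a match, so the expensive regex engine runs only after a cheap substring hit:

```python
import re

# Illustrative decomposition: the regex hides the literal strings
# "bil", "bll", "b1l" inside a character class (guideline 3). One of
# them must appear for the whole regex to match.
regex = re.compile(r"b[il1]l")
strings = ["bil", "bll", "b1l"]  # expansion of b[il1]l

def scan(payload: str):
    if not any(s in payload for s in strings):  # fast string prefilter
        return None                             # FA never runs
    return regex.search(payload)                # confirm with full regex

print(scan("xxbll-yy"))          # match object
print(scan("no trigger here"))   # None: regex engine was never invoked
```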

[Figure 3: Dominant path analysis.]

3.3 Graph-based String Extraction

We develop three graph analysis techniques that discover strings on the critical path for matching. We describe the key idea of each algorithm below and provide more detailed algorithms in the appendix.

Dominant path analysis: A vertex u is called a dominator of vertex v if every path from the start state to vertex v must go through vertex u. A dominant path of vertex v is defined as a set of vertices W in a graph where each vertex in W is a dominator of v and the vertices form a trace of a single path. Dominant path analysis finds the longest common string that exists in all dominant paths of any accept state. For example, Figure 3 shows the string on the dominant path of the accept state (vertex 11).

The string selected by the analysis is highly desirable for matching, as it clearly divides the start and accept states into two separate subgraphs, satisfying the first guideline. The algorithm calculates the dominant path for each accept state and finds the longest common prefix of all dominant paths. Then it extracts the string on the chosen path. If a vertex on the path is a small character-set, we expand it and obtain multiple strings.

Dominant region analysis: If the dominant path analysis fails to extract a string, we perform dominant region analysis. It finds a region of vertices that partitions the graph, with the start state on one side and all accept states on the other.

If a discovered region consists of only string or smallcharacter-set vertices we transform the region into a set ofstrings Since these strings connect two disjoint subgraphsof the original graph any match of the whole regex mustmatch one of these strings Figure 4 shows one example ofa dominant region with 9 vertices Vertices 5 6 and 7 arethe entry points with the same predecessors and vertices


[Figure 4: Dominant region analysis. Vertices 5, 6, and 7 enter the 9-vertex region; vertices 11, 12, and 13 exit it.]

[Figure 5: Network flow analysis.]

11, 12, and 13 are the exits with the same successor. We can extract the strings "foo", "bar", and "abc" as a result of dominant region analysis.

The algorithm for dominant region analysis first creates a directed acyclic graph (DAG) from the original graph to avoid any interference from back edges. Then it performs a topological sort on the DAG and iterates over each vertex to see whether it can be added to the current candidate region, i.e., whether the boundary edges form a valid cut-set. We repeat this to discover all regions in the graph. Since we only analyze the DAG, the back edges of the original graph might affect correctness. Thus, for each back edge, if its source and target vertices are in different regions, we merge them (and all intervening regions) into a single region. Finally, we extract the strings from the dominant region.

Network flow analysis: Since dominant path and dominant region analyses depend on a special graph structure, they may not always succeed. Thus, we run network flow analysis for generic graphs. For each edge, the analysis finds a string (or multiple strings) that ends at the edge. The edge is then assigned a score inversely proportional to the length of the string(s) ending there: the longer the string, the smaller the score. With a score per edge, the analysis runs the "max-flow min-cut" algorithm [25] to find a minimum cut-set that splits the graph into two parts, separating the start state from all accept states. The resulting cut-set thus consists of edges that carry the longest strings in the graph.

Figure 5 shows a result of network flow analysis extracting a string set of "foo", "efgh", and "abc" that divides the whole graph into two parts.

Effectiveness of graph analysis: Our graph analysis effectively produces good strings for most real-world rules. Table 1 shows that 97.2% to 99.2% of decomposable real-world regex rules benefit from dominant path

Ruleset   Total   Decomp   D-Path   D-Reg   N-flow
S-V       1663    1563     1551     32      16
S-E       7564    6756     6575     100     203
Suri      7430    6501     6318     94      201

Table 1: Effectiveness of graph analysis on real-world rulesets. S-V: Snort Talos (May 2015); S-E: Snort ET-Open 2.9.0; Suri: Suricata ruleset 4.0.4. D-Path, D-Reg, and N-flow refer to dominant path, dominant region, and network flow analysis, respectively. Decomp is the total number of decomposable rules. Note that one regex could benefit from multiple graph analyses, so the sum of the graph analysis columns is larger than the Decomp field.

[Figure 6: Two-stage matching with FDR. The input stream first passes extended shift-or matching; candidate matches are then hashed and verified by exact matching against the string patterns.]

analysis, while the remaining patterns exploit dominant region and network flow analyses. These strings prove to be highly beneficial for reducing the number of regex matching invocations, as shown in Section 6.2.

4 SIMD-accelerated Pattern Matching

In this section, we present the design of multi-string and FA matching algorithms that leverage the SIMD operations of modern CPUs.

4.1 Multi-string Pattern Matching

We introduce an efficient multi-string matcher called FDR.⁵ The key idea of FDR is to quickly filter out innocent traffic with fast input scanning. As shown in Figure 6, FDR performs extended shift-or matching [13] to find candidate input strings that are likely to match some string pattern. Then it verifies them to confirm an exact match.

Shift-or matching: We first provide brief background on shift-or matching, which serves as the base algorithm of FDR. The shift-or algorithm finds all occurrences of a string pattern in the input bytestream by performing bitwise shift and or operations, as shown in Figure 7. It uses two data structures: a shift-or mask for each character c in the symbol set (sh-mask('c')) and a state mask (st-mask) for the matching operation. sh-mask('c') zeros all bits whose bit position corresponds to the byte position of c in the string pattern, while all other bits are set to 1. The bit position in a sh-mask is counted from the rightmost bit, while the byte position in a pattern is counted from the leftmost byte. For example, for a string pattern "aphp",

5It is named after the 32nd President of the US


[Figure 7: Classical shift-or matching. For the pattern "aphp": sh-mask('a') = 11111110, sh-mask('p') = 11110101, sh-mask('h') = 11111011; the state mask is updated per input character as m_i = (m_{i−1} << 1) | sh-mask(c_i).]

[Figure 8: Example shift-or masks with two patterns ("ab" and "cd") at bucket 0. No other buckets contain 'a', 'b', 'c', or 'd' in their patterns.]

sh-mask('p') = 11110101, as 'p' appears at the second and the fourth positions in the pattern. If a character is unused in the pattern, all bits of its sh-mask are set to 1. The algorithm keeps a st-mask whose size is equal to the length of a sh-mask. Initially, all bits of the st-mask are set to 1. The algorithm updates the st-mask for each input character 'x' as st-mask = ((st-mask << 1) | sh-mask('x')). For each matching input character, a 0 is propagated to the left by one bit. If the zero bit position reaches the length of the pattern, the pattern string has been found.
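The classical algorithm can be sketched in a few lines of Python (an arbitrary-precision int stands in for the fixed-width mask register):

```python
def shift_or_search(pattern: bytes, text: bytes):
    """Classic shift-or: report the end position of every occurrence."""
    m = len(pattern)
    all_ones = (1 << m) - 1
    # sh_mask[c] zeros the bit at each byte position where c occurs.
    sh_mask = [all_ones] * 256
    for pos, c in enumerate(pattern):
        sh_mask[c] &= ~(1 << pos)
    st_mask = all_ones
    hits = []
    for i, c in enumerate(text):
        st_mask = ((st_mask << 1) | sh_mask[c]) & all_ones
        if st_mask & (1 << (m - 1)) == 0:  # zero propagated to pattern length
            hits.append(i)                 # match ends at byte i
    return hits

print(shift_or_search(b"aphp", b"xxaphpaphp"))  # -> [5, 9]
```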

The original shift-or algorithm runs fast with high space efficiency, but it has two limitations. First, it supports only a single string pattern. Second, although it consists of bit-level operations, the implementation cannot benefit from SIMD instructions except for very long patterns. We tackle these problems as follows.

Multi-string shift-or matching: To support multi-string patterns, we update our data structures. First, we divide the set of string patterns into n distinct buckets, where each bucket has an id from 0 to n−1. For now, assume that each string pattern belongs to one of the n buckets; we discuss how we divide the patterns later ('pattern grouping'). Second, we increase the size of the sh-masks and st-mask by n times, so that a group of n bits in sh-mask('x') records all the buckets that have 'x' in some of their patterns. More precisely, the k-th group of n bits of sh-mask('x') encodes the ids of all buckets that hold at least one pattern with 'x' at the k-th byte position. One difference from the original algorithm is that the byte position in a pattern is counted from the rightmost byte. This enables parallel execution of multiple CPU instructions per cycle, as explained later ('SIMD acceleration'). For efficient implementation, we

[Figure 9: FDR's extended shift-or matching. For the input "aphp", sh-mask('a'), sh-mask('p') << 8, sh-mask('h') << 16, and sh-mask('p') << 24 are OR-ed together with the st-mask; buckets 0 and 4 report candidate matches at input byte position 3.]

set n to 8 so that the byte position in a sh-mask matches the same position in a pattern. This implies that the length of a sh-mask should be no smaller than that of the longest pattern.

Figure 8 shows an example. Bucket 0 has two string patterns, "ab" and "cd". Since 'a' appears at the second byte of only "ab" in bucket 0, sh-mask('a') zeros only the first bit (= bucket 0) of the second byte. The i-th bit within each byte of a sh-mask indicates bucket id i−1. If the byte position of a sh-mask exceeds the longest pattern of a certain bucket (called a 'padding byte'), we encode the bucket id in the padding byte. This ensures matching correctness by carrying a match at a lower input byte along in the shift process. Examples of this are shown in the padding bytes in Figure 8 and in the sh-masks in Figure 9, which zero the first bit in the third and fourth bytes.
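The bucketed scheme can be sketched in scalar Python (simplified: n = 8 buckets, one bit per bucket per byte position, byte positions counted from the right as in FDR; helper names are ours, and the SIMD and hash-verification details are omitted):

```python
def build_sh_masks(buckets, width=8):
    """One bit per bucket in each mask byte; byte position = distance from the
    pattern's rightmost byte. Padding bytes (beyond a bucket's longest pattern)
    are zeroed for every character to carry a match along while shifting."""
    all_ones = (1 << (8 * width)) - 1
    sh = [all_ones] * 256
    for b, patterns in enumerate(buckets):
        if not patterns:
            continue
        for pat in patterns:
            for dist, c in enumerate(reversed(pat)):
                sh[c] &= ~(1 << (8 * dist + b))
        longest = max(len(p) for p in patterns)
        for dist in range(longest, width):
            for c in range(256):
                sh[c] &= ~(1 << (8 * dist + b))
    return sh

def initial_state(buckets, width=8):
    """st-mask: 1-bits where no pattern of the bucket can legally end yet;
    all 1s for unused buckets so they never report candidates."""
    st = 0
    for b in range(8):
        patterns = buckets[b] if b < len(buckets) else []
        limit = width if not patterns else min(len(p) for p in patterns) - 1
        for j in range(limit):
            st |= 1 << (8 * j + b)
    return st

def fdr_scan_block(sh, block, st):
    """One iteration over an 8-byte block: OR the pre-shifted sh-masks into R;
    zero bits in the low 8 bytes flag (bucket, end-position) candidates that
    the verification stage would then hash and confirm."""
    R = st
    for i, c in enumerate(block):
        R |= sh[c] << (8 * i)
    candidates = [(b, j) for j in range(len(block)) for b in range(8)
                  if not (R >> (8 * j + b)) & 1]
    return candidates, R >> (8 * len(block))  # carry high bytes to next block

buckets = [[] for _ in range(8)]
buckets[0] = [b"ab", b"cd"]
buckets[4] = [b"aphp"]
cands, carry = fdr_scan_block(build_sh_masks(buckets), b"xaphpxxx",
                              initial_state(buckets))
print(cands)  # bucket 4 candidate ending at byte position 4
```

Running it reports a single candidate for bucket 4 ending where "aphp" ends, and nothing for bucket 0, since no input byte is 'b' or 'd'.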

The pattern matching process is similar to the original algorithm, except that the sh-masks are shifted left instead of the st-mask. The st-mask is initially 0 except for the byte positions smaller than the shortest pattern; this avoids a false-positive match at a position smaller than the shortest pattern. Now we proceed with the input characters. The matcher keeps k, the number of characters processed so far modulo n. For an input character 'x', st-mask |= (sh-mask('x') << (k bytes)). The matcher repeats this for n input characters and checks whether the st-mask has any zero bits. Zero bits represent a possible match at the corresponding bucket. For example, Figure 9 shows that buckets 0 and 4 have a potential match at input byte position 3. The verification stage, illustrated later, checks whether they are a real match or a false positive.

Pattern grouping: The strategy for grouping patterns into buckets affects matching performance. A good


strategy would distribute the patterns well such that most innocent traffic passes with a low false-positive rate. Toward this goal, we design our algorithm based on two guidelines. First, we group patterns of similar length into the same bucket. This minimizes the information loss of longer patterns, as the input characters match only up to the length of the shortest pattern in a bucket for matching correctness. Second, we avoid grouping too many short patterns into one bucket; in general, shorter patterns are more likely to increase false positives. To meet these requirements, we sort the patterns in ascending order of length and assign ids 0 to (s−1) in sorted order. Then we run an algorithm that calculates the minimum cost of grouping the patterns into n buckets using dynamic programming, summarized by the two equations below:

1. t[i][j] = min_{k=i+1..s−1} ( cost_{i,k} + t[k+1][j−1] ), where s is the number of patterns and t[i][j] stores the minimum cost of grouping patterns i to (s−1) into (j+1) buckets.

2. cost_{i,k} = (k − i + 1)^α / (length_i)^β, where cost_{i,k} is the cost of grouping patterns i to k into one bucket, length_i is the length of pattern i, and α and β are constant parameters.

t[i][j] is calculated as the minimum of the sum of the cost of grouping patterns i to k into one bucket (cost_{i,k}) and the minimum cost of grouping the remaining patterns (k+1) to (s−1) into j buckets (t[k+1][j−1]). cost_{i,k} gets smaller as the bucket has a longer pattern, which allows more patterns in the bucket; it gets larger as the bucket has a shorter pattern, limiting the number of such patterns. Our implementation currently uses α = 1.05 and β = 3 toward this goal, computes t[0][7] to divide all string patterns into 8 buckets, and records the bucket id per pattern in the process. In practice, we find that the algorithm works well, automatically reaching the sweet spot that minimizes the total cost.

Super characters: One problem with bucket-based matching is that it produces false positives among patterns in the same bucket. For example, if a bucket has "ab" and "cd", the algorithm not only matches the correct patterns but also matches the false positives "ad" and "cb". To suppress them, we use an m-bit (m>8) super character (instead of an 8-bit ASCII character) to build and index the sh-masks. An m-bit super character consists of a normal (8-bit) character in the lower 8 bits and the low-order (m−8) bits of the next character in the upper bits. If it is the last character of a pattern (or of the input), we use a null character (0) as the next character. The key idea is to reflect some information about the next character of a pattern in the sh-mask built for the current character.
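Returning to the bucket-grouping dynamic program, a memoized Python transcription looks as follows (helper names are ours; we let k start at i so that a bucket may hold a single pattern, and parameter values follow the text):

```python
from functools import lru_cache

ALPHA, BETA = 1.05, 3  # constants from the text

def group_patterns(lengths, n_buckets=8):
    """lengths: pattern lengths sorted ascending. Returns the (start, end)
    id ranges of the buckets minimizing t[0][n_buckets-1]."""
    s = len(lengths)

    def cost(i, k):
        # Cheaper when the bucket's shortest pattern (index i) is long.
        return (k - i + 1) ** ALPHA / lengths[i] ** BETA

    @lru_cache(maxsize=None)
    def t(i, j):
        if i >= s:
            return 0.0, ()
        if j == 0:                       # last bucket takes everything left
            return cost(i, s - 1), ((i, s - 1),)
        best = None
        for k in range(i, s):            # bucket [i..k], recurse on the rest
            sub_cost, sub_split = t(k + 1, j - 1)
            cand = (cost(i, k) + sub_cost, ((i, k),) + sub_split)
            if best is None or cand[0] < best[0]:
                best = cand
        return best

    return t(0, n_buckets - 1)[1]

lengths = [2, 2, 3, 4, 4, 5, 6, 8, 8, 12, 16, 24]
print(group_patterns(lengths, n_buckets=4))
```

On the sample lengths, the DP yields contiguous ranges that keep similar-length patterns together, as the two guidelines intend.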

Only if the same two characters appear in the input⁶ do we declare a match at that input byte position. This significantly reduces false positives at the cost of slightly larger memory for the sh-masks.
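A sketch of the super-character construction (an illustrative helper, not Hyperscan's code; m = 12 as in the example below):

```python
def super_chars(data: bytes, m: int = 12):
    """m-bit super character at each byte offset: current byte in the low
    8 bits, low-order (m-8) bits of the next byte above (0 at the end)."""
    hi_bits = m - 8
    out = []
    for i, c in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else 0
        out.append(((nxt & ((1 << hi_bits) - 1)) << 8) | c)
    return out

# The pattern "ab" and the input "ad" now index different sh-masks at
# offset 0, so the cross-bucket false positive from the text disappears.
print(super_chars(b"ab")[0] != super_chars(b"ad")[0])  # True
```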

In practice, m should be between 9 and 15. Let's say m = 12 bits. For a pattern "ab", we see two 12-bit super characters: α = ((low-order 4 bits of 'b' << 8) | 'a') and β = 'b'. We then build sh-masks for α and β, respectively. When the input arrives, we construct a 12-bit super character based on the current input byte offset and use it as an index to fetch its sh-mask. We advance the input one byte at a time as before. For example, if the input is 'ad', the matcher first constructs γ = ((low-order 4 bits of 'd' << 8) | 'a'), fetches sh-mask('γ'), and performs the shift and or operations as before. Then it advances to the next byte and constructs δ = 'd'. So the input 'ad' will not match even if a bucket contains "ab" and "cd".

SIMD acceleration: Our implementation of FDR heavily exploits SIMD operations and instruction-level parallelism. First, it uses 128-bit sh-masks so that it can employ 128-bit SIMD instructions (e.g., pslldq for left shift and por for or in the Intel x86-64 instruction set) to update the masks. As shift and or are the most frequent operations in FDR, it enjoys a substantial performance boost from the SIMD instructions. Second, it exploits parallel execution with multiple CPU execution ports. In the original shift-or matching, the execution of shift and or operations is completely serialized, as each depends on the previous result. This under-utilizes a modern CPU even though it can issue multiple instructions per cycle. In contrast, FDR exploits instruction-level parallelism by pre-shifting the sh-masks of multiple input characters in parallel. Note that this is made possible because we count the byte position differently from the original version. The parallel execution effectively increases instructions per cycle (IPC) and significantly improves performance. To accommodate the parallel shifting, we limit the length of a pattern to 8 bytes and extract the lower 8 bytes from any pattern longer than 8 bytes. Because this requires at minimum 8-byte masks and up to 7 bytes of shifting, a 128-bit mask does not lose high-bit information during a left shift. In actual matching, FDR handles 8 bytes of input at a time. To guarantee contiguous matching across 8-byte input boundaries, the st-mask of the previous iteration is shifted right by 8 bytes for the next iteration.

Verification: As our shift-or matching can still generate false positives, we need to verify whether a candidate match is

⁶ Of course, there is still a small chance of a false positive, as we use partial bits of the next character, but the probability becomes fairly small as the pattern length grows.


Algorithm 1 FDR Multi-string Matcher

 1: function Match
 2:     n = number of bits of a super character
 3:     R = startMask
 4:     for each 8-byte V ∈ input do
 5:         for i ∈ 0..7 do
 6:             index = V[i*8 .. i*8+n−1]
 7:             M[i] = shiftOrMask[index]
 8:             S[i] = LSHIFT(M[i], i)
 9:         end for
10:         for i ∈ 0..7 do
11:             R = OR128(R, S[i])
12:         end for
13:         for each zero bit b in low 8 bytes of R do
14:             j = the byte position containing bit b
15:             l = length of string in bucket b
16:             h = HASH(V[j−l+1 .. j])
17:             if h is valid then
18:                 perform exact matching for each string in hash_bucket[h]
19:             end if
20:         end for
21:         R = RSHIFT(R, 8)
22:     end for
23: end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. We then leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find that hashing filters out a large portion of the false positives.

4.2 Finite Automata Pattern Matching

Successful string matching often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold,⁷ we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, there can be multiple current states (the current set) that are active at any time. A state transition with an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match is successful if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation as it outperforms traditional NFA representations that use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. The bit-based NFA also leverages SIMD instructions to perform vectorized bit

7We use 16384 states as the threshold

operations to further accelerate the performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n−1 in topological order and maintains the current set as a bit mask (the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are assigned sequentially in topological order, the span of a state transition is typically small. We exploit this fact to represent the transitions compactly, as described below.

The bit-based NFA implements a state transition with an input character 'c' in three steps. First, it calculates the successor set that can be transitioned to from any state in the current set with any input character. Second, it computes the set of all states that can be transitioned to from any state with 'c' (the reachable states by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, as a Glushkov NFA graph guarantees that one can enter a specific state only with the same character or the same character-set.

The challenge is to efficiently calculate the successor set. One could pre-calculate a successor set for every combination of current states and look it up with the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable except for small n. An alternative is to keep a successor-set mask per individual state and combine the successor sets of every state in the current set. This is more space-efficient, but it costs up to n memory lookups and (n−1) or operations. We implement the latter but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some examples of shift-k masks: the shift-1 mask sets every bit to 1 except bit 7, since every state except state 7 has a forward transition of span 1.

We divide each state transition into two types: typical or exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set from the current set following typical transitions. If the current-set mask is S, then ((S & shift-k mask) << k) represents all possible successor states reached by transitions of span k from S. Combining the successor sets for all k yields the successor set reached by all typical transitions.


[Figure 10: NFA representation for (AB|CD)*AFF*. With a shift limit of 2, the shift-0, shift-1, and shift-2 masks encode the typical transitions; states 0, 2, and 4 have exceptional transitions, keep their own successor masks, and are recorded in the exception mask.]

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and any backward transitions.⁸ Any state that has at least one exceptional transition keeps its own successor mask, which records all states reached by the exceptional transitions of its owner state. All exceptional states are maintained in an exception mask.

The choice of the shift limit affects performance. If it is too large, we would have too many shift-k masks representing rare transitions; if it is too small, we would have to handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of ids. States 0, 2, and 4 are highlighted as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). Then it fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise and operation with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state. Otherwise, it proceeds with the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exception states. Our implemen-

⁸ In our implementation, forward transitions that cross the 64-bit boundary of the state id space (e.g., from an id smaller than 64 to an id larger than 64) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2 Bit-based NFA Matching

 1: SH_MSKS[i]: shift-i masks for typical transitions
 2: SUCC_MSKS[i]: successor mask for state i
 3: EX_MSK: exception mask
 4: reach[k]: reachable state set for character k
 5: function RunNFA(S: current active state set)
 6:     SUCC_TYP = 0, SUCC_EX = 0
 7:     for c ∈ input do
 8:         if any state is active in S then
 9:             for i = 0 to shift_limit do
10:                 R0 = AND(S, SH_MSKS[i])
11:                 R1 = LSHIFT(R0, i)
12:                 SUCC_TYP = OR(SUCC_TYP, R1)
13:             end for
14:             S_EX = AND(S, EX_MSK)
15:             for each active state s in S_EX do
16:                 SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
17:             end for
18:             SUCC = OR(SUCC_TYP, SUCC_EX)
19:             S = AND(SUCC, reach[c])
20:             report accept states in S
21:         end if
22:     end for
23: end function

Total Size: 1 GByte
Number of Packets: 818,682
Number of TCP Packets: 818,520
Percent of TCP Bytes: 97.2%
Percent of HTTP Bytes: 92.9%
Average Packet Size: 1,265 Bytes

Table 2: HTTP traffic trace from a cloud service provider.

tation uses a 128-bit mask (extended up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough to represent the NFA graph of most regexes.

5 Implementation

Hyperscan consists of compile-time and run-time functions. Compile-time functions include regex-to-NFA graph conversion, graph decomposition, and matching-component generation. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:

• Small string-set (<80) matching: This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction applied over a small number (2-8) of 4-bit regions in the suffix of each string.

• NFA and DFA cyclic state acceleration: Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where


# of regexes   Prefilter      Hyperscan   Reduction
500            2,971,652      645,326     4.6x
1000           93,595,304     714,582     131.0x
1500           110,122,972    791,017     139.2x
2000           139,804,519    780,665     179.1x
2500           156,332,187    857,100     182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration for searching for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states, or switches off one of the current states.
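The acceleration idea can be illustrated in scalar form as below; this is a sketch only, and the actual implementation compares 16 or more bytes per step with SIMD instructions rather than one byte at a time.

```python
def skip_to_exception(data, start, exceptional):
    """Advance past bytes on which every current cyclic state just loops
    back to itself; stop at the first byte that can change the state set."""
    i = start
    while i < len(data) and data[i] not in exceptional:
        i += 1
    return i
```

The matcher then resumes normal NFA/DFA stepping at the returned offset, having skipped the intervening bytes entirely.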

• Small-size DFA matching. We design a SIMD-based algorithm for a small-size DFA (< 16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.
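The shuffle idea: with at most 16 states, a state fits in 4 bits, and the 16-entry transition row for each input byte can be applied with a single PSHUFB lookup. The scalar Python sketch below models this with a hypothetical 3-state DFA that accepts inputs containing "ab" (the padded 16-entry rows stand in for the shuffle control vectors).

```python
def build_tables():
    """tables(ch) returns a 16-entry next-state row (PSHUFB control analog).
    States: 0 = no progress, 1 = seen 'a', 2 = accept (sticky)."""
    default = [0, 0, 2] + [0] * 13
    rows = {'a': [1, 1, 2] + [0] * 13,
            'b': [0, 2, 2] + [0] * 13}
    return lambda ch: rows.get(ch, default)

def run_small_dfa(data):
    tables = build_tables()
    state = 0
    for ch in data:
        state = tables(ch)[state]    # one shuffle-style table lookup per byte
    return state == 2
```

In the SIMD version the row lookup happens for 16 byte lanes at once, which is what makes the small-DFA path faster than a conventional table-driven DFA.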

• Anchored pattern matching. When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the corresponding automata are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching. We design a fast lookaround approach that peeks at inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all, or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s*45foo, where R is a complex regex, we can first detect with SIMD checks whether the digit and space character classes have matched, and if not, avoid or defer running a potentially expensive FA associated with R.
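A scalar sketch of such a lookaround check is below; the offsets and character classes are hypothetical, for a fragment like R\d\s45foo once the literal trigger "45foo" has been found at some position.

```python
import string

def lookaround_ok(buf, trigger_pos, checks):
    """checks: (relative offset, allowed character set) pairs around the
    trigger. Returning False means the expensive FA need not run at all."""
    for off, allowed in checks:
        i = trigger_pos + off
        if i < 0 or i >= len(buf) or buf[i] not in allowed:
            return False
    return True

# A digit then a whitespace byte must immediately precede the "45foo" trigger.
CHECKS = [(-2, set(string.digits)), (-1, set(" \t\r\n"))]
```

The SIMD version evaluates several such class tests over a window of input in parallel instead of byte by byte.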

6 Evaluation

In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than manual choice? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup

We use a server machine with an Intel Xeon Platinum 8180 CPU @ 2.50GHz and 48 GB of memory, and compile the

# of regexes   Prefilter     Hyperscan   Reduction
700            699,622       18,164      38.5x
850            7,516,464     19,214      391.2x
1000           17,063,344    32,533      524.5x
1150           17,737,814    34,075      520.6x
1300           25,143,574    36,040      697.7x

Table 4: Regex invocations for Snort's Talos ruleset

code with GCC 5.4. To separate out the impact of network I/O, we evaluate the performance with a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, as shown in Table 2. For all evaluation, we use the latest version of Hyperscan (v5.0) [2].

6.2 Effectiveness of Regex Decomposition

The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace and confirm the correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content option of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when it is necessary.
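The effect of decomposition can be modeled simply: the full regex runs only when every decomposed string factor is found in order. The sketch below uses Python's standard re module as a stand-in for the FA engines, and the invocation counter mirrors what Tables 3 and 4 measure.

```python
import re

def decomposed_search(text, factors, full_regex, stats):
    """Run the full regex only when all string factors occur in order."""
    pos = 0
    for f in factors:
        pos = text.find(f, pos)
        if pos < 0:
            return None                      # no regex invocation at all
        pos += len(f)
    stats["invocations"] += 1                # counted like Tables 3 and 4
    return re.search(full_regex, text)
```

A prefilter-based DPI would instead invoke the regex whenever any single content keyword (often a one-character string) appears, which is what inflates the prefilter columns in the tables.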

6.3 Microbenchmarks

We evaluate the performance of FDR, our multi-string matcher, as well as that of the regex matching of Hyperscan.

Multi-string pattern matching. We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns extracted from the Snort ET-Open and Talos rulesets.


Ruleset    PCRE      PCRE2    RE2-s    Hyperscan-s    RE2-m    Hyperscan-m
Talos      694.2     39.4     177.7    17.3           29.0     2.15
ET-Open    1280.0    91.3     469.6    51.6           111.6    13.3

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1,300 regexes) and Suricata (2,800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure 11: String matching performance with random packets. (a) ET-Open ruleset and (b) Talos ruleset: throughput (Gbps) of FDR vs. DFC over 1k-26k (ET-Open) and 1k-20k (Talos) string patterns, with improvement over DFC (and AC) shown per bar. Plot data omitted.]

[Figure 12: String matching performance with a real traffic trace. (a) ET-Open ruleset and (b) Talos ruleset: throughput (Gbps) of FDR vs. DFC over 1k-26k (ET-Open) and 1k-20k (Talos) string patterns, with improvement over DFC (and AC) shown per bar. Plot data omitted.]

Figure 11 and Figure 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate it with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, the performance becomes comparable to that of DFC due to increased cache usage, but it is still much better than AC.

Regex matching. We now evaluate the performance of regex matching of Hyperscan. We compare the performance with PCRE (v8.41) [6], as it is most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (e.g.,

matching against one regex at a time), which would require passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1,300 regexes from the Snort Talos ruleset and 2,800 regexes from the Suricata ruleset.

Table 5 shows the result. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 13.5x faster than RE2-m, while it shows an 18.3x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs, and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.
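The "-s" versus "-m" measurement modes can be modeled with Python's standard re module: serial matching re-scans the input once per pattern, while multi matching makes one pass over a grouped alternation. This is a methodological sketch only (Python's re is neither PCRE nor Hyperscan, and the patterns are made up); the alternation's leftmost-match semantics also differ from a true multi-pattern engine.

```python
import re

PATTERNS = [r"foo\d+", r"bar[a-z]+", r"baz"]

def match_serial(text):
    """One regex at a time: the whole input is scanned once per pattern."""
    return sorted(i for i, p in enumerate(PATTERNS) if re.search(p, text))

def match_multi(text):
    """All regexes in a single pass over a grouped alternation."""
    combined = re.compile("|".join(f"(?P<p{i}>{p})"
                                   for i, p in enumerate(PATTERNS)))
    return sorted({int(m.lastgroup[1:]) for m in combined.finditer(text)})
```

Serial mode touches every input byte len(PATTERNS) times, which is exactly why the PCRE and "-s" columns in Table 5 grow with the ruleset size.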

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both with a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort but replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect a much higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons

Hyperscan has been developed since 2008 and was first open-sourced in 2013. It has been successfully adopted by over 40 commercial projects globally, and it is in production use by tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed in C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience with developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan

Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but also made it simple to employ in applications.

Version 1.0. The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). It also suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosions if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0. The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA Graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data, and with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings were found. This approach avoids the high risk of potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.
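As an aside, the per-length hash-table structure of the hlm literal matcher mentioned above can be sketched as follows; this is a simplified model (the real matcher used Rabin-Karp-style rolling hashes rather than Python set lookups).

```python
def build_hlm(strings):
    """Group the literal set into one table per distinct string length."""
    tables = {}
    for s in strings:
        tables.setdefault(len(s), set()).add(s)
    return tables

def hlm_scan(text, tables):
    """Probe each per-length table at every offset (the set lookup stands
    in for the hash probe). Returns (offset, literal) pairs."""
    hits = []
    for length, table in tables.items():
        for i in range(len(text) - length + 1):
            if text[i:i + length] in table:
                hits.append((i, text[i:i + length]))
    return sorted(hits)
```

Separating the tables by string length keeps each probe a fixed-width comparison, which is what made the scheme practical for large literal sets.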

Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines, even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns foo, abc[xyz], and abc[xy]*def|abcz*de[fg] might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern abc[xyz], then provide matches for the two parts


of the alternation, abc[xy]*def and abcz*de[fg], without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0. (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel to the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0. Version 4.0 was released in October 2013 under an open-source BSD license, to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identify the strings separated by various common regex constructs, such as .* or X+ for some character class X. For example, it is frequently the case that strings in regexes are separated by the .* construct (i.e., foo.*bars). This is usually implementable by requiring only that foo is seen before bars (usually, but not always: consider the expression foo.*oars). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

Version 5.0. Version 5.0, which is the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns, and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology for malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns, beyond matching a single pattern. To

support this, the system now allows user-defined AND, OR, and NOT combinations of their patterns. For example, an AND operation between the patterns foobar and teakettle requires that both patterns are matched for an input before reporting a match. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE that brings the benefit of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. The lack of support in Hyperscan for full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopted solutions. Chimera employs Hyperscan as a fast filter for input, and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
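The semantics of such combinations can be sketched with a tiny evaluator over per-pattern match flags. This is a hypothetical model (the actual API expresses combinations as special patterns over pattern ids), but it captures the AND/OR/NOT behavior described above.

```python
def eval_combo(expr, matched):
    """expr: ("pat", name) | ("not", e) | ("and", e1, e2, ...) | ("or", e1, ...)
    matched: the set of pattern names that matched the input."""
    op = expr[0]
    if op == "pat":
        return expr[1] in matched
    if op == "not":
        return not eval_combo(expr[1], matched)
    results = [eval_combo(e, matched) for e in expr[1:]]
    return all(results) if op == "and" else any(results)
```

For instance, the AND example from the text becomes ("and", ("pat", "foobar"), ("pat", "teakettle")), which reports a match only when both underlying patterns have matched.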

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes), and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimal Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with additional powers of support for large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option for some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series were, over time, not carried forward to the 4.0 series; we found that frequently such features were used by only a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell, at any given stream write boundary, whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented, so future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental sub-branch of Hyperscan, not widely released, supported capturing functionality for a limited

set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' the parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as a state-of-the-art regex matcher adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback from the anonymous reviewers of USENIX NSDI '19, as well as from our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.


[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333–340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74–82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors & Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248–264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1–53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317–328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and


J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2014.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
 1: function DOMINANTPATHANALYSIS(G)
 2:     dpath = ∅
 3:     for v ∈ accepts do
 4:         calculate dominant path p[v] for v
 5:         if dpath = ∅ then
 6:             dpath = p[v]
 7:         else
 8:             dpath = common_prefix(dpath, p[v])
 9:             if dpath = ∅ then
10:                 return null_string
11:             end if
12:         end if
13:         strings = expand_and_extract(dpath)
14:     end for
15:     return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
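For small graphs, the dominant path of an accept vertex can be computed by brute force: it consists of the vertices that lie on every path from the start to that accept, taken in path order. The sketch below does exactly that (the real analysis would use a proper dominator-tree computation rather than path enumeration).

```python
def dominant_path(graph, start, accept):
    """Vertices on every start->accept path, in the order they are visited."""
    paths = []

    def dfs(v, path):
        path.append(v)
        if v == accept:
            paths.append(list(path))
        for w in graph.get(v, []):
            dfs(w, path)
        path.pop()

    dfs(start, [])
    on_all = (set(paths[0]).intersection(*(set(p) for p in paths[1:]))
              if len(paths) > 1 else set(paths[0]))
    return [v for v in paths[0] if v in on_all]

def common_prefix(p, q):
    """Longest common prefix of two dominant paths."""
    out = []
    for a, b in zip(p, q):
        if a != b:
            break
        out.append(a)
    return out
```

On the Glushkov graph of a pattern like a(b|c)de, the branch vertices b and c drop out, leaving the dominant path a-d-e from which the extractable string is read off.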

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
 1: function DOMINANTREGIONANALYSIS(G)
 2:     acyclic_g = build_acyclic(G)
 3:     Gt = build_topology_order(acyclic_g)
 4:     candidate = {q0}
 5:     it = begin(Gt)
 6:     while it ≠ end(Gt) do
 7:         if isValidCut(candidate) then
 8:             setRegion(candidate)
 9:             initializeCandidate(candidate)
10:         else
11:             addToCandidate(it)
12:             it = it + 1
13:         end if
14:     end while
15:     setRegion(candidate)
16:     Merge regions connected with back edges
17:     strings = expand_and_extract(regions)
18:     return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 647

and sees if the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged region.

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
 1: function NETWORKFLOWANALYSIS(G)
 2:   for edge ∈ E do
 3:     strings = find_strings(edge)
 4:     scoreEdge(edge, strings)
 5:   end for
 6:   cuts = MinCut(G)
 7:   strings = extract and expand strings from cuts
 8:   return strings
 9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of a string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
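The scoring rule can be illustrated with a small sketch. The exact scoring function is an assumption; only its inverse relation to string length is stated in the text:

```python
def edge_score(strings):
    """Score an edge by its shortest ending string: longer strings give a
    smaller score, so the min-cut prefers edges carrying long strings."""
    if not strings:
        return float("inf")  # no string ends here: never a good cut edge
    return 1.0 / min(len(s) for s in strings)

# An edge ending the string "efgh" is cheaper to cut than one ending "ab".
assert edge_score(["efgh"]) < edge_score(["ab"])
```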



[Figure 4: Dominant region analysis — an example graph partitioned into regions from which the strings "foo", "bar", and "abc" are extracted.]

[Figure 5: Network flow analysis — an example graph whose min-cut yields the strings "foo", "efgh", and "abc".]

Vertices 11, 12, and 13 are the exits with the same successor. We can extract the strings "foo", "bar", and "abc" as a result of dominant region analysis.

The algorithm for dominant region analysis first creates a directed acyclic graph (DAG) from the original graph to avoid any interference from back edges. Then it performs a topological sort on the DAG and iterates over each vertex to see whether, when it is added to the current candidate region, the region's boundary edges form a valid cut-set. We repeat this to discover all regions in the graph. Since we only analyze the DAG, the back edges of the original graph might affect correctness. Thus, for each back edge, if its source and target vertices are in different regions, we merge them (and all intervening regions) into a single region. Finally, we extract the strings from the dominant region.
Network flow analysis: Since dominant path and dominant region analyses depend on a special graph structure, they may not always succeed. Thus, we run network flow analysis for generic graphs. For each edge, the analysis finds a string (or multiple strings) that ends at the edge. The edge is then assigned a score inversely proportional to the length of the string(s) ending there: the longer the string, the smaller the score. With a score per edge, the analysis runs the "max-flow min-cut" algorithm [25] to find a minimum cut-set that splits the graph into two parts, separating the start state from all accept states. The cut-set thus consists of the edges that generate the longest strings in the graph.

Figure 5 shows a result of network flow analysis, extracting the string set {"foo", "efgh", "abc"} that divides the whole graph into two parts.
Effectiveness of graph analysis: Our graph analysis effectively produces good strings for most real-world rules. Table 1 shows that 97.2% to 99.2% of decomposable real-world regex rules benefit from dominant path

Ruleset   Total   Decomp   D-Path   D-Reg   N-flow
S-V       1663    1563     1551     32      16
S-E       7564    6756     6575     100     203
Suri      7430    6501     6318     94      201

Table 1: Effectiveness of graph analysis on real-world rulesets. S-V: Snort Talos (May 2015), S-E: Snort ET-Open 2.9.0, Suri: Suricata ruleset 4.0.4. D-Path, D-Reg, and N-flow refer to dominant path, dominant region, and network flow analysis, respectively. Decomp is the total number of decomposable rules. Note that one regex could benefit from multiple graph analyses, so the sum of the graph analyses is larger than the Decomp field.

[Figure 6: Two-stage matching with FDR — extended shift-or matching scans the input stream for candidate matches, which hashing and exact matching then verify.]

analysis, while the remaining patterns exploit dominant region and network flow analysis. These strings prove highly beneficial for reducing the number of regex matching invocations, as shown in Section 6.2.

4 SIMD-accelerated Pattern Matching

In this section, we present the design of the multi-string and FA matching algorithms that leverage the SIMD operations of modern CPUs.

4.1 Multi-string Pattern Matching

We introduce an efficient multi-string matcher called FDR.5 The key idea of FDR is to quickly filter out innocent traffic by fast input scanning. As shown in Figure 6, FDR performs extended shift-or matching [13] to find candidate input strings that are likely to match some string pattern. Then it verifies them to confirm an exact match.
Shift-or matching: We first provide a brief background on shift-or matching, which serves as the base algorithm of FDR. The shift-or algorithm finds all occurrences of a string pattern in the input byte stream by performing bitwise shift and or operations, as shown in Figure 7. It uses two data structures: a shift-or mask for each character c in the symbol set (sh-mask('c')) and a state mask (st-mask) for the matching operation. sh-mask('c') zeros all bits whose bit position corresponds to a byte position of c in the string pattern, while all other bits are set to 1. The bit position in an sh-mask is counted from the rightmost bit, while the byte position in a pattern is counted from the leftmost byte. For example, for a string pattern "aphp",

5 It is named after the 32nd President of the US.


[Figure 7: Classical shift-or matching for the pattern "aphp": sh-mask('a') = 11111110, sh-mask('p') = 11110101, sh-mask('h') = 11111011. Starting from st-mask = 11111111, scanning the input "aphp" gives m1 = (st-mask ≪ 1) | sh-mask('a') = 11111110, m2 = 11111101, m3 = 11111011, and m4 = 11110111, whose zero at bit position 4 signals the match.]

[Figure 8: Example shift-or masks with the two patterns "ab" and "cd" in bucket 0; no other bucket contains 'a', 'b', 'c', or 'd' in its patterns. Each sh-mask zeros bucket 0's bit at the matching byte positions, and the padding bytes beyond the pattern ends also zero the bucket-0 bit.]

sh-mask('p') = 11110101, as 'p' appears at the second and the fourth positions in the pattern. If a character is unused in the pattern, all bits of its sh-mask are set to 1. The algorithm keeps an st-mask whose size is equal to the length of an sh-mask. Initially, all bits of the st-mask are set to 1. The algorithm updates the st-mask for each input character 'x' as st-mask = ((st-mask ≪ 1) | sh-mask('x')). For each matching input character, a 0 is propagated to the left by one bit. If the zero reaches the bit position equal to the length of the pattern, the pattern string has been found.
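A compact Python rendition of the classical algorithm described above (an illustrative sketch, not Hyperscan's implementation):

```python
def shift_or(pattern, text):
    """Classical shift-or matching; returns the end offsets of all matches."""
    # sh-mask('c'): zero the bits at c's byte positions in the pattern
    all_ones = (1 << len(pattern)) - 1
    sh = {}
    for pos, c in enumerate(pattern):
        sh[c] = sh.get(c, all_ones) & ~(1 << pos)
    accept_bit = 1 << (len(pattern) - 1)
    st, matches = all_ones, []
    for i, c in enumerate(text):
        st = ((st << 1) | sh.get(c, all_ones)) & all_ones  # shift, then or
        if st & accept_bit == 0:  # a zero reached the pattern-length position
            matches.append(i)
    return matches

print(shift_or("aphp", "xaphpz"))  # -> [4]: a match ends at offset 4
```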

The original shift-or algorithm runs fast with high space efficiency, but it has two limitations. First, it supports only a single string pattern. Second, although it consists of bit-level operations, the implementation cannot benefit from SIMD instructions except for very long patterns. We tackle these problems as follows.
Multi-string shift-or matching: To support multi-string patterns, we update our data structures. First, we divide the set of string patterns into n distinct buckets, where each bucket has an id from 0 to n-1. For now, assume that each string pattern belongs to one of the n buckets; we discuss how we divide the patterns later ('pattern grouping'). Second, we increase the size of the sh-mask and st-mask by n times, so that a group of n bits in sh-mask('x') records all the buckets that have 'x' in some of their patterns. More precisely, the k-th n bits of sh-mask('x') encode the ids of all buckets that hold at least one pattern which has 'x' at the k-th byte position. One difference from the original algorithm is that the byte position in a pattern is counted from the rightmost byte. This enables parallel execution of multiple CPU instructions per cycle, as explained later ('SIMD acceleration'). For efficient implementation, we

[Figure 9: FDR's extended shift-or matching on the input "aphp": the sh-masks of the four input characters are pre-shifted (sh-mask('p') ≪ 24, sh-mask('h') ≪ 16, sh-mask('p') ≪ 8, sh-mask('a')) and OR-ed into the st-mask, producing candidate matches for buckets 0 and 4 at input byte position 3.]

set n to 8 so that the byte position in an sh-mask matches the same position in a pattern. This implies that the length of an sh-mask should be no smaller than the longest pattern.

Figure 8 shows an example. Bucket 0 has two string patterns, "ab" and "cd". Since 'a' appears at the second byte of only "ab" in bucket 0, sh-mask('a') zeros only the first bit (= bucket 0) of the second byte. The i-th bit within each byte of an sh-mask indicates bucket id i-1. If the byte position of an sh-mask exceeds the longest pattern of a certain bucket (a 'padding byte'), we encode the bucket id in the padding byte. This ensures matching correctness by carrying a match at a lower input byte along in the shift process. Examples of this are shown in the padding bytes in Figure 8 and in the sh-masks in Figure 9 that zero the first bit in the third and fourth bytes.
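A minimal Python sketch of this bucketed scheme (deliberately simplified: classical left-to-right bit orientation, one bit per bucket per byte position, same-length patterns per bucket, and none of FDR's pre-shifting or padding bytes):

```python
def build_masks(buckets):
    """sh-masks with 8 bits per byte position: bit (pos*8 + b) is zeroed
    when some pattern of bucket b has this character at byte position pos."""
    width = 8 * max(len(p) for ps in buckets.values() for p in ps)
    all_ones = (1 << width) - 1
    sh = {}
    for b, patterns in buckets.items():
        for p in patterns:
            for pos, c in enumerate(p):
                sh[c] = sh.get(c, all_ones) & ~(1 << (pos * 8 + b))
    return sh, all_ones

def match_candidates(buckets, text):
    """Report (end_offset, bucket_id) candidates; verification still needed."""
    sh, all_ones = build_masks(buckets)
    lengths = {b: len(ps[0]) for b, ps in buckets.items()}  # same-length buckets
    st, out = all_ones, []
    for i, c in enumerate(text):
        st = ((st << 8) | sh.get(c, all_ones)) & all_ones  # shift 8 bits, then or
        for b, l in lengths.items():
            if not st & (1 << ((l - 1) * 8 + b)):
                out.append((i, b))
    return out

# Bucket 0 holds "ab" and "cd": the true match "ab" and the false
# positive "ad" both surface as candidates for verification.
print(match_candidates({0: ["ab", "cd"]}, "xabyad"))  # -> [(2, 0), (5, 0)]
```

The false positive "ad" at offset 5 is exactly the kind of candidate that the super-character and verification stages below are designed to suppress.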

The pattern matching process is similar to the original algorithm, except that the sh-masks are shifted left instead of the st-mask. The st-mask is initially 0 except for the byte positions smaller than the shortest pattern. This avoids a false-positive match at a position smaller than the shortest pattern. Now we proceed with the input characters. The matcher keeps k, the number of characters processed so far modulo n. For an input character 'x', st-mask |= (sh-mask('x') ≪ (k bytes)). The matcher repeats this for n input characters and checks whether the st-mask has any zero bits. Zero bits represent a possible match at the corresponding bucket. For example, Figure 9 shows that buckets 0 and 4 have a potential match at input byte position 3. The verification stage, illustrated later, checks whether they are a real match or a false positive.
Pattern grouping: The strategy for grouping patterns into each bucket affects matching performance. A good


strategy would distribute the patterns well, such that most innocent traffic passes with a low false-positive rate. Toward this goal, we design our algorithm based on two guidelines. First, we group patterns of similar length into the same bucket. This minimizes the information loss of longer patterns, as input characters match only up to the length of the shortest pattern in a bucket for matching correctness. Second, we avoid grouping too many short patterns into one bucket. In general, shorter patterns are more likely to increase false positives. To meet these requirements, we sort the patterns in ascending order of length and assign ids 0 to (s-1) to the patterns in sorted order. Then we run the following algorithm, which calculates the minimum cost of grouping the patterns into n buckets using dynamic programming. The algorithm is summarized by the two equations below:

1. t[i][j] = min over k = i+1, ..., s-1 of (cost_{i,k} + t[k+1][j-1]), where s is the number of patterns and t[i][j] stores the minimum cost of grouping the patterns i to (s-1) into (j+1) buckets.

2. cost_{i,k} = (k-i+1)^α / length_i^β, where cost_{i,k} is the cost of grouping patterns i to k into one bucket, length_i is the length of pattern i, and α and β are constant parameters.
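The recurrence can be sketched in Python. This is an illustrative reconstruction: we also allow a bucket to hold a single pattern and let trailing buckets stay empty, details the paper leaves unspecified:

```python
from functools import lru_cache

def group_patterns(lengths, n_buckets=8, alpha=1.05, beta=3.0):
    """Split patterns (lengths sorted ascending) into at most n_buckets
    contiguous groups by minimizing t[i][j], with
    cost(i, k) = (k - i + 1)**alpha / lengths[i]**beta."""
    s = len(lengths)

    def cost(i, k):
        return (k - i + 1) ** alpha / lengths[i] ** beta

    @lru_cache(maxsize=None)
    def t(i, j):
        if i >= s:
            return 0.0, ()          # no patterns left: remaining buckets empty
        if j == 0:                  # last bucket takes everything that is left
            return cost(i, s - 1), ((i, s - 1),)
        best = None
        for k in range(i, s):       # first bucket holds patterns i..k
            sub_cost, sub_split = t(k + 1, j - 1)
            cand = (cost(i, k) + sub_cost, ((i, k),) + sub_split)
            if best is None or cand[0] < best[0]:
                best = cand
        return best

    return t(0, n_buckets - 1)[1]   # (first, last) pattern ids per bucket

# Six patterns of ascending length into 3 buckets: similar lengths group.
print(group_patterns([2, 2, 4, 4, 8, 8], n_buckets=3))  # -> ((0, 1), (2, 3), (4, 5))
```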

t[i][j] is calculated as the minimum of the sum of the cost of grouping patterns i to k into one bucket (cost_{i,k}) and the minimum cost of grouping the remaining patterns (k+1) to (s-1) into j buckets (t[k+1][j-1]). cost_{i,k} gets smaller as the bucket has a longer pattern, which allows more patterns in the bucket. It gets larger as the bucket has a shorter pattern, limiting the number of such patterns. Our implementation currently uses α = 1.05 and β = 3 toward this goal, computes t[0][7] to divide all string patterns into 8 buckets, and records the bucket id for each pattern in the process. In practice, we find that the algorithm works well, automatically reaching the sweet spot that minimizes the total cost.
Super characters: One problem with bucket-based matching is that it produces false positives with patterns in the same bucket. For example, if a bucket has "ab" and "cd", the algorithm not only matches the correct patterns but also matches the false positives "ad" and "cb". To suppress them, we use an m-bit (m>8) super character (instead of an 8-bit ASCII character) to build and index the sh-masks. An m-bit super character consists of a normal (8-bit) character in the lower 8 bits and the low-order (m-8) bits of the next character in the upper bits. If it is the last character of a pattern (or of the input), we use a null character (0) as the next character. The key idea is to reflect some information about the next character of a pattern in the sh-mask built for the current character.

Only if the same two characters appear in the input6 do we declare a match at that input byte position. This significantly reduces false positives at the cost of slightly larger memory for the sh-masks.
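The super-character construction can be sketched as follows (an illustrative sketch with m = 12 by default):

```python
def super_chars(data, m=12):
    """Build the m-bit super characters of a byte string: the low 8 bits are
    the current byte, the upper (m-8) bits come from the next byte (0 at the
    end of the data)."""
    lo = (1 << (m - 8)) - 1  # low-order bits taken from the next character
    out = []
    for i, b in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else 0
        out.append(((nxt & lo) << 8) | b)
    return out

# 'a' followed by 'b' and 'a' followed by 'd' now index different sh-masks,
# so a bucket holding "ab" and "cd" no longer yields the false positive "ad".
assert super_chars(b"ab")[0] != super_chars(b"ad")[0]
```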

In practice, m should be between 9 and 15. Let's say m = 12 bits. For a pattern "ab", we see two 12-bit super characters: α = ((low-order 4 bits of 'b' ≪ 8) | 'a') and β = 'b'. Then we build sh-masks for α and β, respectively. When the input arrives, we construct a 12-bit super character based on the current input byte offset and use it as an index to fetch its sh-mask. We advance the input one byte at a time as before. For example, if the input is "ad", we first construct γ = ((low-order 4 bits of 'd' ≪ 8) | 'a'), fetch sh-mask('γ'), and perform shift and or operations as before. Then we advance to the next byte and construct δ = 'd'. So the input "ad" will not match even if a bucket contains "ab" and "cd".
SIMD acceleration: Our implementation of FDR heavily exploits SIMD operations and instruction-level parallelism. First, it uses 128-bit sh-masks so that it can employ 128-bit SIMD instructions (e.g., pslldq for left shift and por for or in the Intel x86-64 instruction set) to update the masks. As shift and or are the most frequent operations in FDR, it enjoys a substantial performance boost from the SIMD instructions. Second, it exploits parallel execution with multiple CPU execution ports. In the original shift-or matching, the execution of shift and or operations is completely serialized, as each depends on the previous result. This under-utilizes a modern CPU even though it can issue multiple instructions per cycle. In contrast, FDR exploits instruction-level parallelism by pre-shifting the sh-masks of multiple input characters in parallel. Note that this is made possible because we count the byte position differently from the original version. The parallel execution effectively increases instructions per cycle (IPC) and significantly improves the performance. To accommodate the parallel shifting, we limit the length of a pattern to 8 bytes and extract the lower 8 bytes from any pattern longer than 8 bytes. Because this requires at minimum 8-byte masks and up to 7 bytes of shifting, a 128-bit mask does not lose high-bit information during the left shift. In actual matching, FDR handles 8 bytes of input at a time. To guarantee contiguous matching across 8-byte input boundaries, the st-mask of the previous iteration is shifted right by 8 bytes for the next iteration.
Verification: As our shift-or matching can still generate false positives, we need to verify whether a candidate match is

6 Of course, there is still a small chance of a false positive, as we use partial bits of the next character, but the probability becomes fairly small as the pattern length grows.


Algorithm 1 FDR Multi-string Matcher

 1: function MATCH
 2:   n = number of bits of a super character
 3:   R = startMask
 4:   for each 8-byte V ∈ input do
 5:     for i ∈ 0..7 do
 6:       index = V[i*8 : i*8+n-1]
 7:       M[i] = shiftOrMask[index]
 8:       S[i] = LSHIFT(M[i], i)
 9:     end for
10:     for i ∈ 0..7 do
11:       R = OR128(R, S[i])
12:     end for
13:     for zero bit b in low 8 bytes of R do
14:       j = the byte position containing bit b
15:       l = length of string in bucket b
16:       h = HASH(V[j-l+1 : j])
17:       if h is valid then
18:         Perform exact matching for each
19:         string in hash_bucket[h]
20:       end if
21:     end for
22:     R = RSHIFT(R, 8)
23:   end for
24: end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. Then we leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find that hashing filters out a large portion of the false positives.
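A sketch of the verification step (a hypothetical table layout; Hyperscan's actual per-bucket hash tables are built at compile time over fixed-width suffixes):

```python
def verify(text, end, bucket_id, tables, lengths):
    """Confirm a shift-or candidate (bucket id, end offset) by hashing the
    candidate substring and comparing it against the bucket's patterns."""
    l = lengths[bucket_id]
    cand = text[end - l + 1:end + 1]
    return cand in tables[bucket_id].get(hash(cand), [])

# One hash table per bucket, keyed by the pattern hash.
tables = {0: {}}
for p in ("ab", "cd"):
    tables[0].setdefault(hash(p), []).append(p)
lengths = {0: 2}

assert verify("xaby", 2, 0, tables, lengths)      # "ab" is a real match
assert not verify("xady", 2, 0, tables, lengths)  # "ad" was a false positive
```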

4.2 Finite Automata Pattern Matching

Successful string matching often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold,7 we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, multiple current states (the current set) can be active at any time. A state transition on an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match is successful if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation as it outperforms traditional NFA representations, which use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. Also, the bit-based NFA leverages SIMD instructions to perform vectorized bit

7 We use 16,384 states as the threshold.

operations to further accelerate the performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n-1 in topological order and maintains the current set as a bit mask (the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are sequentially assigned in topological order, the span of a state transition is typically small. We exploit this fact to compactly represent the transitions below.

The bit-based NFA implements a state transition on an input character 'c' in three steps. First, it calculates the successor set that can be transitioned to from any state in the current set with any input character. Second, it computes the set of all states that can be transitioned to from any state with 'c' (the reachable states by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, as a Glushkov NFA graph guarantees that a specific state can be entered only with the same character or with the same character-set.

The challenge is to efficiently calculate the successor set. One could pre-calculate a successor set for every combination of current states and look up the successor set for the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable except for small n. An alternative is to keep a successor-set mask per individual state and to combine the successor sets of every state in the current set. This is more space-efficient, but it costs up to n memory lookups and (n-1) or operations. We implement the latter but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some examples of shift-k masks. The shift-1 mask sets every bit to 1 except for bit 7, since every state except state 7 has a forward transition of span 1.

We divide each state transition into two types: typical or exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set from the current set following typical transitions. If the current-set mask is S, then ((S & shift-k mask) ≪ k) represents all possible successor states reached by transitions of span k from S. If we combine the successor sets for all k, we obtain the successor set reached by all typical transitions.
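The per-character successor computation can be sketched in Python (plain integers stand in for the 128-bit SIMD registers; the three-state example graph is hypothetical):

```python
def successors(S, shift_masks, ex_mask, succ_masks, reach_c):
    """One bit-based NFA step: typical transitions via shift-k masks,
    exceptional ones via per-state successor masks, intersected with
    the states reachable on the current character."""
    succ = 0
    for k, mask in enumerate(shift_masks):  # typical: span <= shift limit
        succ |= (S & mask) << k
    ex = S & ex_mask                        # exceptional states only
    while ex:
        s = ex & -ex                        # isolate the lowest set bit
        succ |= succ_masks[s.bit_length() - 1]
        ex ^= s
    return succ & reach_c

# Tiny 3-state chain 0 -a-> 1 -a-> 2 with a backward edge 2 -a-> 0:
shift_masks = [0b000, 0b011]  # shift-1 mask: states 0 and 1 move forward by 1
ex_mask = 0b100               # state 2 owns the backward (exceptional) edge
succ_masks = {2: 0b001}       # its successor mask points back to state 0
reach_a = 0b111               # every state is enterable on 'a'
print(bin(successors(0b101, shift_masks, ex_mask, succ_masks, reach_a)))  # -> 0b11
```

With states 0 and 2 active, state 0 advances to state 1 through the shift-1 mask while state 2 wraps back to state 0 through its successor mask.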


[Figure 10: NFA representation for (AB|CD)*AFF* — shift-0, shift-1, and shift-2 masks encode the typical transitions, while states 0, 2, and 4 hold their own successor masks for exceptional transitions and are recorded in the exception mask.]

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and any backward transitions.8 Any state that has at least one exceptional transition keeps its own successor mask. The successor mask records all states reached by the exceptional transitions of its owner state. All exceptional states are maintained in an exception mask.

As one can see, the choice of the shift limit affects the performance. If it is too large, we would have too many shift-k masks representing rare transitions; if it is too small, we would have to handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of their ids. States 0, 2, and 4 are highlighted as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). Then it fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise and operation with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state. Otherwise, it proceeds with the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exception states.

8 In our implementation, forward transitions that cross the 64-bit boundary of the state-id space (e.g., from an id smaller than 64 to an id larger than 64) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2 Bit-based NFA Matching

 1: SH_MSKS[i]: shift-i masks for typical transitions
 2: SUCC_MSKS[i]: successor mask for state i
 3: EX_MSK: exception mask
 4: reach[k]: reachable state set for character k
 5: function RUNNFA(S: current active state set)
 6:   SUCC_TYP = 0, SUCC_EX = 0
 7:   for c ∈ input do
 8:     if any state is active in S then
 9:       for i = 0 to shift_limit do
10:         R0 = AND(S, SH_MSKS[i])
11:         R1 = LSHIFT(R0, i)
12:         SUCC_TYP = OR(SUCC_TYP, R1)
13:       end for
14:       S_EX = AND(S, EX_MSK)
15:       for active state s in S_EX do
16:         SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
17:       end for
18:       SUCC = OR(SUCC_TYP, SUCC_EX)
19:       S = AND(SUCC, reach[c])
20:       Report accept states in S
21:     end if
22:   end for
23: end function

Total Size               1 GByte
Number of Packets        818682
Number of TCP Packets    818520
Percent of TCP Bytes     97.2%
Percent of HTTP Bytes    92.9%
Average Packet Size      1265 Bytes

Table 2: HTTP traffic trace from a cloud service provider

Our implementation uses a 128-bit mask (extended up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough to represent the NFA graphs of most regexes.

5 Implementation

Hyperscan consists of compile-time and run-time functions. The compile-time functions include regex-to-NFA graph conversion, graph decomposition, and matching component generation. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:
• Small string-set (<80) matching. This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction, applied over a small number (2-8) of 4-bit regions in the suffix of each string.

# of regexes   Prefilter     Hyperscan   Reduction
500            2971652       645326      4.6x
1000           93595304      714582      131.0x
1500           110122972     791017      139.2x
2000           139804519     780665      179.1x
2500           156332187     857100      182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

• NFA and DFA cyclic state acceleration. Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration to search for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states or switches off one of the current states.

• Small-size DFA matching. We design a SIMD-based algorithm for small DFAs (<16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

• Anchored pattern matching. When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the corresponding automata are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching. We design a fast lookaround approach that peeks at inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment /R\d\s{4,5}foo/, where R is a complex regex, we can first detect whether the digit and space character classes have matched with SIMD checks, and if not, avoid or defer running the potentially expensive FA associated with R.
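The lookaround idea can be illustrated with a scalar sketch (a plain-Python stand-in for the SIMD class checks; the pattern fragment and offsets are illustrative):

```python
import string

def lookaround_ok(buf, trigger_end, classes):
    """Cheap pre-check: every (offset, allowed-character-set) pair near the
    trigger must hold before the expensive FA for R is run at all."""
    for off, allowed in classes:
        i = trigger_end + off
        if i >= len(buf) or buf[i] not in allowed:
            return False
    return True

# e.g. for a fragment like /R\d\s{4}foo/: a digit right after the trigger,
# then four whitespace characters.
classes = [(0, string.digits)] + [(k, " \t") for k in range(1, 5)]
assert lookaround_ok("...7    foo", 3, classes)      # FA worth running
assert not lookaround_ok("...x    foo", 3, classes)  # FA can be skipped
```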

6 Evaluation

In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than manual choice? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup

We use a server machine with an Intel Xeon Platinum 8180 CPU @ 2.50GHz and 48 GB of memory, and compile the

# of regexes   Prefilter    Hyperscan   Reduction
700            699622       18164       38.5x
850            7516464      19214       391.2x
1000           17063344     32533       524.5x
1150           17737814     34075       520.6x
1300           25143574     36040       697.7x

Table 4: Regex invocations for Snort's Talos ruleset

code with GCC 5.4. To separate out the impact of network I/O, we evaluate the performance with a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, as shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

6.2 Effectiveness of Regex Decomposition

The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content options. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace and confirm correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when necessary.

6.3 Microbenchmarks

We evaluate the performance of FDR, our multi-string matcher, as well as that of Hyperscan's regex matching.
Multi-string pattern matching: We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns ex-


Ruleset   PCRE    PCRE2   RE2-s   Hyperscan-s   RE2-m   Hyperscan-m
Talos     6942    394     1777    173           29      215
ET-Open   12800   913     4696    516           1116    133

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1300 regexes) and Suricata (2800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure 11: String matching performance with random packets — throughput (Gbps) of FDR, DFC, and AC, and FDR's improvement over DFC, for (a) the ET-Open ruleset (1k-26k patterns, improvements of 3.2x down to 1.1x) and (b) the Talos ruleset (1k-20k patterns, improvements of 2.3x down to 1.2x).]

2521

17 15

0

1

2

3

0

2

4

6

8

10

1k 5k 10k 26k

Thro

ugh

pu

t (G

bp

s)

Number of String Patterns

FDR DFC

(a) ET-Open ruleset

22

1614 13

0

1

2

3

0

2

4

6

8

10

1k 5k 10k 20k

Imp

rove

men

t o

ver

DFC

Number of String Patterns

AC Improvement

(b) Talos rulesetFigure 12 String matching performance with a real traffic trace

tracted from Snort ET-Open and Talos rulesets Figure 11and Figure 12 show that FDR outperforms the state-of-the-art matcher DFC by 11x to 32x (and AC by 42x to88x) for packets of random content and by 13x to 25x(and AC by 32x to 82x) for the real traffic trace We alsoevaluate it with the Suricata ruleset but we omit the resulthere since it exhibits the similar performance trend Whenthe number of string patterns is small Hyperscan benefitsfrom small CPU cache footprint and SIMD accelerationbut when the number of patterns grows the performancebecomes compatible with DFC due to increased cacheusage but it is still much better than ACRegex matching We now evaluate the performance ofregex matching of Hyperscan We compare the perfor-mance with PCRE (v841) [6] as it is most widely usedin network DPI applications and PCRE2 (v1032) [7] amore recent fork from PCRE with a new API We en-able the JIT option for both PCRE and PCRE2 Wealso compare with RE2 (v2018-09-01) [1] a fast smallmemory-footprint regex matcher developed by GoogleBoth Hyperscan and RE2 support multi-regex matchingin parallel while PCRE matches one regex at a time Forfair comparison with PCRE and PCRE2 we measure thetotal time for matching all regexes in a serial manner (eg

match against one regex at a time) which would requirepassing the entire input for each regex For Hyperscan andRE2 we measure two numbers ndash one for matching oneregex at a time (RE2-s Hyperscan-s) and the other formatching all regexes in parallel (RE2-m Hyperscan-m)For testing we use 1300 regexes from the Snort Talosruleset and 2800 regexes from the Suricata ruleset

Table 5 shows the result. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 13.5x faster than RE2-m, and it shows a 183.3x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 69x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs, and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.
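To make the decomposition idea concrete, here is a minimal Python sketch (not Hyperscan's implementation; the pattern /foo[0-9]+bar/, its literal factor "foo", and the helper names are illustrative). The FA component runs only at offsets where the extracted literal matched, instead of scanning the whole input with the full regex:

```python
import re

# Hypothetical decomposition of /foo[0-9]+bar/ into the literal "foo"
# and the remaining FA component, run only where the literal matched.
LITERAL = "foo"
SUFFIX_FA = re.compile(r"[0-9]+bar")  # stands in for the decomposed FA

def decomposed_scan(data):
    hits, start = [], 0
    while True:
        i = data.find(LITERAL, start)        # fast string-matching pass
        if i < 0:
            return hits
        m = SUFFIX_FA.match(data, i + len(LITERAL))  # FA only at literal hits
        if m:
            hits.append((i, m.end()))        # (match start, match end)
        start = i + 1
```

With decomposition, the expensive automaton work is proportional to the number of literal hits rather than to the input length, which mirrors the reduction in regex invocations reported above.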

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both with a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort, but it replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect a much higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons

Hyperscan has been developed since 2008 and was first open-sourced in 2013. It has been successfully adopted by over 40 commercial projects globally, and it is in production use by tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed in C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience with developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan

Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance, but also made it simple to employ in applications.

Version 1.0: The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). Also, it suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called "hlm", akin to Rabin-Karp [29], with multiple hash tables for different-length strings), as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosion if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0: The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the "NFA Graph") that all transformations operated over, departing from ad-hoc optimizations on the regex syntax tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data, and with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings were found. This approach avoids the high risk of potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.
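As a toy illustration of streaming with fixed per-stream state (a sketch only: the class and method names are invented, and real Hyperscan carries compiled NFA/DFA state across writes, not raw bytes), a single-literal matcher need only retain the last len(pattern)-1 bytes between blocks:

```python
# Minimal sketch: scan a stream block-by-block while keeping only a
# fixed-size carry-over, so matches spanning block boundaries are found.
class StreamMatcher:
    def __init__(self, pattern):
        self.pattern = pattern
        self.tail = ""      # at most len(pattern)-1 bytes, never the stream
        self.offset = 0     # bytes consumed so far (for global offsets)

    def scan(self, block):
        data = self.tail + block
        base = self.offset - len(self.tail)
        # a full pattern never fits inside the tail alone, so each match
        # is reported exactly once, in the block where it completes
        hits = [base + i for i in range(len(data))
                if data.startswith(self.pattern, i)]
        self.offset += len(block)
        self.tail = data[max(0, len(data) - len(self.pattern) + 1):]
        return hits
```

Here a match of "abcd" split across two writes is still reported at its global stream offset, while per-stream memory stays constant regardless of stream length.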

Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines, even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns /foo/, /abc[xyz]/, and /abc[xy]*def|abcz*de[fg]/ might first produce matches for the simple literal "foo", then provide all the matches for the components of the pattern /abc[xyz]/, then provide matches for the two parts of the alternation, /abc[xy]*def/ and /abcz*de[fg]/, without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input "abcxxxxzdef").

Versions 2.1 and 3.0: (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel to the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0: Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identify the strings separated by various common regex constructs, such as .* or X+ for some character class X. For example, it is frequently the case that strings in regexes are separated by the .* construct (i.e., /foo.*bars/). This is usually implementable by requiring only that "foo" is seen before "bar" (usually, but not always; consider the expression /foo.*oars/). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms that were amenable to analysis and optimization, and that were needed for the general case of regex matching in any case.

Version 5.0: Version 5.0, the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology for malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns, beyond matching a single pattern. To support this, the system now allows user-defined AND, OR, and NOT combinations over patterns. For example, an AND operation between patterns /foobar/ and /teakettle/ requires that both patterns match the input before reporting a match. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE that brings the benefits of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. Hyperscan's lack of support for full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopting solutions. Chimera employs Hyperscan as a fast filter for input, and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
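The idea behind logical combinations can be sketched as follows (illustrative only: Hyperscan expresses combinations over pattern ids at pattern-compile time, not as a post-pass like this):

```python
# Collect the ids of sub-patterns that matched, then evaluate a
# user-defined combination over them (here: AND over ids 1 and 2).
def scan_ids(data, patterns):
    return {pid for pid, s in patterns.items() if s in data}

def and_combination(matched_ids, required_ids):
    # report the combined match only if every sub-pattern matched
    return required_ids <= matched_ids

# Hypothetical sub-patterns, mirroring the /foobar/ AND /teakettle/ example.
patterns = {1: "foobar", 2: "teakettle"}
```

An OR combination would be a non-empty intersection, and NOT an absence check; the point is that only pattern ids, not match payloads, need to flow into the combination logic.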

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support: The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes), and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimal Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths: Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with the additional power of supporting large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases: There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests: One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series were, over time, not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were "precise alive" (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts, and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended with enriched semantics to support capturing, which would allow portions of the regexes to 'capture' the parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent execution of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as a state-of-the-art regex matcher adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback from the anonymous reviewers of USENIX NSDI '19, as well as from our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.

[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333-340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74-82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors and Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248-264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1-53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317-328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, Sunglk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3: Dominant Path Analysis

Require: Graph G = (V, E)
 1: function DominantPathAnalysis(G)
 2:     dpath = ∅
 3:     for v ∈ accepts do
 4:         calculate the dominant path p[v] for v
 5:         if dpath = ∅ then
 6:             dpath = p[v]
 7:         else
 8:             dpath = common_prefix(dpath, p[v])
 9:             if dpath = ∅ then
10:                 return null_string
11:             end if
12:         end if
13:         strings = expand_and_extract(dpath)
14:     end for
15:     return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and then finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
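A toy Python rendering of this idea on a small DAG (the graph and helper names are ours; Hyperscan computes dominator information on the Glushkov NFA graph rather than by enumerating paths, which would not scale):

```python
# Toy graph: s -> {a, b} -> c -> d; every path to c or d passes through c.
GRAPH = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": []}

def all_paths(g, src, dst, path=()):
    path = path + (src,)
    if src == dst:
        yield path
        return
    for nxt in g[src]:
        yield from all_paths(g, nxt, dst, path)

def dominant_path(g, start, v):
    # a vertex dominates v if it lies on every start->v path
    paths = list(all_paths(g, start, v))
    doms = set(paths[0]).intersection(*map(set, paths))
    return [x for x in paths[0] if x in doms]  # ordered along a path

def common_dominant_prefix(g, start, accepts):
    # Algorithm 3's loop: intersect the dominant paths of all accepts
    prefix = dominant_path(g, start, accepts[0])
    for v in accepts[1:]:
        p = dominant_path(g, start, v)
        n = 0
        while n < min(len(prefix), len(p)) and prefix[n] == p[n]:
            n += 1
        prefix = prefix[:n]
    return prefix
```

On this graph the dominant path of d is s, c, d; intersecting with c's dominant path yields the common prefix s, c, from which the string would then be extracted.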

Algorithm 4: Dominant Region Analysis

Require: Graph G = (V, E)
 1: function DominantRegionAnalysis(G)
 2:     acyclic_g = build_acyclic(G)
 3:     Gt = build_topology_order(acyclic_g)
 4:     candidate = {q0}
 5:     it = begin(Gt)
 6:     while it ≠ end(Gt) do
 7:         if isValidCut(candidate) then
 8:             setRegion(candidate)
 9:             initializeCandidate(candidate)
10:         else
11:             addToCandidate(it)
12:             it = it + 1
13:         end if
14:     end while
15:     setRegion(candidate)
16:     merge regions connected with back edges
17:     strings = expand_and_extract(regions)
18:     return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. It then adds each vertex of the graph into a candidate vertex set and checks whether the candidate set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5: Network Flow Analysis

Require: Graph G = (V, E)
 1: function NetworkFlowAnalysis(G)
 2:     for edge ∈ E do
 3:         strings = find_strings(edge)
 4:         scoreEdge(edge, strings)
 5:     end for
 6:     cuts = MinCut(G)
 7:     strings = extract and expand strings from cuts
 8:     return strings
 9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of the string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
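A minimal sketch of the scoring rule alone (the cut candidates, score constant, and function names are illustrative; the real analysis derives the cut via max-flow/min-cut on the NFA graph rather than comparing enumerated cuts):

```python
# Score an edge inversely to the length of the string ending at it, so
# a minimum-weight cut prefers edges that carry long strings.
def edge_score(string, k=1.0):
    return k / len(string)

# Two hypothetical single-edge cuts with their per-edge strings.
candidate_cuts = {
    ("e1",): {"e1": "ab"},       # short string -> high score
    ("e2",): {"e2": "abcdef"},   # long string  -> low score
}

def min_score_cut(cuts):
    # pick the cut with the smallest total edge score
    return min(cuts, key=lambda c: sum(edge_score(s) for s in cuts[c].values()))
```

Under this weighting, the min-cut machinery naturally selects the cut through the long-string edge, which is exactly the string most useful for prefiltering.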



Figure 7: Classical shift-or matching. (The figure shows the shift-or masks for the string pattern "aphp": sh-mask('a') = 11111110, sh-mask('h') = 11111011, sh-mask('p') = 11110101, with st-mask initialized to 11111111. Matching against the input "aphp..." proceeds as m1 = (st-state << 1) | sh-mask('a'), m2 = (m1 << 1) | sh-mask('p'), m3 = (m2 << 1) | sh-mask('h'), m4 = (m3 << 1) | sh-mask('p'), ending in a match.)

Figure 8: Example shift-or masks with two patterns ("ab" and "cd") at bucket 0. No other buckets contain 'a', 'b', 'c', or 'd' in their patterns.

sh-mask('p') = 11110101, as 'p' appears at the second and the fourth positions in the pattern. If a character is unused in the pattern, all bits of its sh-mask are set to 1. The algorithm keeps an st-mask whose size is equal to the length of an sh-mask. Initially, all bits of the st-mask are set to 1. The algorithm updates the st-mask for each input character 'x' as st-mask = ((st-mask << 1) | sh-mask('x')). For each matching input character, a 0 is propagated to the left by one bit. If the zero-bit position reaches the length of the pattern, the pattern string is found.
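The update rule above can be sketched directly in Python (a readable reference sketch rather than the SIMD implementation; the 8-bit mask width follows Figure 7):

```python
def shift_or_search(text, pattern, width=8):
    # sh-mask('x'): bit j is 0 iff pattern[j] == 'x'; unused chars stay all-1s
    all_ones = (1 << width) - 1
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << j) & all_ones
    state, m, hits = all_ones, len(pattern), []
    for pos, ch in enumerate(text):
        # st-mask = (st-mask << 1) | sh-mask('x'); zeros propagate left
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:   # a zero reached pattern length
            hits.append(pos - m + 1)      # report the match start offset
    return hits
```

Each step is one shift and one OR regardless of pattern content, which is what makes the scheme attractive as a building block for wider, SIMD-friendly variants.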

The original shift-or algorithm runs fast with high space efficiency, but it has two limitations. First, it supports only a single string pattern. Second, although it consists of bit-level operations, the implementation cannot benefit from SIMD instructions except for very long patterns. We tackle these problems as follows.

Multi-string shift-or matching: To support multi-string patterns, we update our data structures. First, we divide the set of string patterns into n distinct buckets, where each bucket has an id from 0 to n-1. For now, assume that each string pattern belongs to one of the n buckets; we will discuss how we divide the patterns later ('pattern grouping'). Second, we increase the size of the sh-masks and the st-mask by n times, so that a group of n bits in sh-mask('x') records all the buckets that have 'x' in some of their patterns. More precisely, the k-th n bits of sh-mask('x') encode the ids of all buckets that hold at least one pattern which has 'x' at the k-th byte position. One difference from the original algorithm is that the byte position in a pattern is counted from the rightmost byte. This enables parallel execution of multiple CPU instructions per cycle, as explained later ('SIMD acceleration'). For efficient implementation, we

[Figure content: for the input "aphp", the st-mask is computed as sh-mask('a') | (sh-mask('p') ≪ 8) | (sh-mask('h') ≪ 16) | (sh-mask('p') ≪ 24); zero bits in the result indicate matches for bucket 0 and bucket 4 at input byte position 3.]

Figure 9: FDR's extended shift-or matching

set n to 8, so that the byte position in a sh-mask matches the same position in a pattern. This implies that the length of a sh-mask should be no smaller than the longest pattern.

Figure 8 shows an example. Bucket 0 has two string patterns, 'ab' and 'cd'. Since 'a' appears at the second byte of only 'ab' in bucket 0, sh-mask('a') zeros only the first bit (= bucket 0) of the second byte. The i-th bit within each byte of a sh-mask indicates bucket id i−1. If the byte position of a sh-mask exceeds the longest pattern of a certain bucket (called a 'padding byte'), we encode the bucket id in the padding byte. This ensures matching correctness by carrying a match at a lower input byte along in the shift process. Examples of this are shown in the padding bytes in Figure 8 and in the sh-masks in Figure 9 that zero the first bit in the third and fourth bytes.
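A simplified sketch of the bucketed masks and the per-chunk scan (an illustration under stated assumptions, not the production FDR code: Python integers replace 128-bit registers, and cross-chunk carry — the real matcher shifts the st-mask right by 8 bytes between iterations — is omitted):

```python
def build_sh_masks(buckets, mask_bytes=16):
    """sh[c]: bit (8*pos + b) is 0 iff byte c occurs at byte position pos
    (counted from the right end) of some pattern in bucket b, or pos is a
    padding byte beyond bucket b's longest pattern."""
    full = (1 << (8 * mask_bytes)) - 1
    sh = [full] * 256
    for b, pats in enumerate(buckets):
        if not pats:
            continue
        for p in pats:
            for pos, c in enumerate(reversed(p)):
                sh[c] &= full ^ (1 << (8 * pos + b))
        # padding bytes carry a lower-position match along the shifts
        longest = max(len(p) for p in pats)
        for c in range(256):
            for pos in range(longest, mask_bytes):
                sh[c] &= full ^ (1 << (8 * pos + b))
    return sh

def scan_chunk(sh, chunk, shortest):
    """Return (bucket, end_offset) candidates for one 8-byte chunk."""
    st = (1 << (8 * shortest)) - 1  # ones below the shortest pattern length
    for k, c in enumerate(chunk):
        st |= sh[c] << (8 * k)      # shift the sh-mask, not the st-mask
    return [(b, j) for j in range(len(chunk)) for b in range(8)
            if not st & (1 << (8 * j + b))]
```

With bucket 0 = {'ab', 'cd'} as in Figure 8, scanning "xxabxxxx" flags (bucket 0, offset 3); note that scanning "xxadxxxx" flags the same candidate, the false positive that the verification stage (and super characters) must handle.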

The pattern matching process is similar to the original algorithm, except that the sh-masks are shifted left instead of the st-mask. The st-mask is initially 0 except for the byte positions smaller than the shortest pattern. This avoids a false-positive match at a position smaller than the shortest pattern. Now we proceed with the input characters. The matcher keeps k, the number of characters processed so far modulo n. For an input character 'x', st-mask |= (sh-mask('x') ≪ (k bytes)). The matcher repeats this for n input characters and checks whether the st-mask has any zero bits. Zero bits represent a possible match at the corresponding bucket. For example, Figure 9 shows that buckets 0 and 4 have a potential match at input byte position 3. The verification stage, illustrated later, checks whether they are a real match or a false positive.

Pattern grouping: The strategy for grouping patterns into each bucket affects matching performance. A good

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 637

strategy would distribute the patterns well, such that most innocent traffic would pass with a low false positive rate. Towards this goal, we design our algorithm based on two guidelines. First, we group patterns of a similar length into the same bucket. This minimizes the information loss of longer patterns, as the input characters match only up to the length of the shortest pattern in a bucket for matching correctness. Second, we avoid grouping too many short patterns into one bucket; in general, shorter patterns are more likely to increase false positives. To meet these requirements, we sort the patterns in the ascending order of their length and assign an id of 0 to (s−1) to each pattern by the sorted order. Then we run the following algorithm, which calculates the minimum cost of grouping the patterns into n buckets using dynamic programming. The algorithm is summarized by the two equations below.

1. t[i][j] = min_{k=i..s−1} (cost_{i,k} + t[k+1][j−1]), where s is the number of patterns and t[i][j] stores the minimum cost of grouping the patterns i to (s−1) into (j+1) buckets.

2. cost_{i,k} = (k − i + 1)^α / length_i^β, where cost_{i,k} is the cost of grouping patterns i to k into one bucket, length_i is the length of pattern i, and α and β are constant parameters.
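The two equations above can be transcribed directly as a memoized recursion (a sketch: the recursion shape, α, and β follow the text, while the traceback that recovers the actual buckets is our own addition):

```python
from functools import lru_cache

def group_patterns(lengths, n=8, alpha=1.05, beta=3.0):
    """Split patterns (already sorted by ascending length) into at most n
    contiguous buckets, minimizing the total grouping cost."""
    s = len(lengths)

    def cost(i, k):  # cost of putting patterns i..k into one bucket
        return (k - i + 1) ** alpha / lengths[i] ** beta

    @lru_cache(maxsize=None)
    def t(i, j):  # min cost of grouping patterns i..s-1 into (j+1) buckets
        if i >= s:
            return 0.0
        if j == 0:
            return cost(i, s - 1)
        return min(cost(i, k) + t(k + 1, j - 1) for k in range(i, s))

    # trace back the optimal split points to recover the buckets
    buckets, i, j = [], 0, n - 1
    while i < s:
        if j == 0:
            buckets.append(list(range(i, s)))
            break
        k = min(range(i, s), key=lambda k: cost(i, k) + t(k + 1, j - 1))
        buckets.append(list(range(i, k + 1)))
        i, j = k + 1, j - 1
    return buckets
```

Because the patterns are sorted ascending, length_i is the shortest pattern of its bucket, so buckets anchored on long patterns are cheap to grow while buckets of short patterns stay small.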

t[i][j] is calculated as the minimum of the sum of the cost of grouping patterns i to k into one bucket (cost_{i,k}) and the minimum cost of grouping the remaining patterns (k+1) to (s−1) into j buckets (t[k+1][j−1]). cost_{i,k} gets smaller as the bucket has a longer pattern, which allows more patterns in the bucket; it gets larger as the bucket has a shorter pattern, limiting the number of such patterns. Our implementation currently uses α = 1.05 and β = 3 towards this goal, computes t[0][7] to divide all string patterns into 8 buckets, and records the bucket id per each pattern in the process. In practice, we find that the algorithm works well, automatically reaching the sweet spot that minimizes the total cost.

Super characters: One problem with bucket-based matching is that it produces false positives with patterns in the same bucket. For example, if a bucket has 'ab' and 'cd', the algorithm not only matches the correct patterns but also matches the false positives 'ad' and 'cb'. To suppress them, we use an m-bit (m > 8) super character (instead of an 8-bit ASCII character) to build and index the sh-masks. An m-bit super character consists of a normal (8-bit) character in the lower 8 bits and the low-order (m−8) bits of the next character in the upper bits. If it is the last character of a pattern (or of the input), we use a null character (0) as the next character. The key idea is to reflect some information about the next character in a pattern into building the sh-mask for the current character.

Only if the same two characters appear in the input6 do we declare a match at that input byte position. This significantly reduces false positives, at the cost of slightly larger memory for the sh-masks.

In practice, m should be between 9 and 15. Let's say m = 12 bits. For a pattern 'ab', we see two 12-bit super characters: α = ((low-order 4 bits of 'b' ≪ 8) | 'a') and β = 'b'. Then we build sh-masks for α and β, respectively. When the input arrives, we construct a 12-bit super character based on the current input byte offset and use it as an index to fetch its sh-mask. We advance the input one byte at a time, as before. For example, if the input is 'ad', we first construct γ = ((low-order 4 bits of 'd' ≪ 8) | 'a'), fetch sh-mask(γ), and perform shift and or operations as before. Then we advance to the next byte and construct δ = 'd'. So the input 'ad' will not match even if a bucket contains 'ab' and 'cd'.

SIMD acceleration: Our implementation of FDR heavily exploits SIMD operations and instruction-level parallelism. First, it uses 128-bit sh-masks so that it can employ 128-bit SIMD instructions (e.g., pslldq for left shift and por for or in the Intel x86-64 instruction set) to update the masks. As shift and or are the most frequent operations in FDR, it enjoys a substantial performance boost from the SIMD instructions. Second, it exploits parallel execution with multiple CPU execution ports. In the original shift-or matching, the execution of shift and or operations is completely serialized, as they depend on the previous result. This under-utilizes a modern CPU, even one that can issue multiple instructions per CPU cycle. In contrast, FDR exploits instruction-level parallelism by pre-shifting the sh-masks of multiple input characters in parallel. Note that this is made possible because we count the byte position differently from the original version. The parallel execution effectively increases instructions per cycle (IPC) and significantly improves the performance. To accommodate the parallel shifting, we limit the length of a pattern to 8 bytes and extract the lower 8 bytes from any pattern longer than 8 bytes. Because this requires minimum 8-byte masks and up to 7 bytes of shifting, a 128-bit mask does not lose high-bit information during a left shift. In actual matching, FDR handles 8 bytes of input at a time. To guarantee contiguous matching across 8-byte input boundaries, the st-mask of the previous iteration is shifted right by 8 bytes for the next iteration.

Verification: As our shift-or matching can still generate false positives, we need to verify if a candidate match is

6 Of course, there is still a small chance of a false positive as we use partial bits of the next character, but the probability becomes fairly small as the pattern length grows.


Algorithm 1 FDR Multi-string Matcher
1: function MATCH
2:   n = number of bits of a super character
3:   R = startMask
4:   for each 8-byte V ∈ input do
5:     for i ∈ 0..7 do
6:       index = V[i∗8 .. i∗8+n−1]
7:       M[i] = shiftOrMask[index]
8:       S[i] = LSHIFT(M[i], i)
9:     end for
10:    for i ∈ 0..7 do
11:      R = OR128(R, S[i])
12:    end for
13:    for each zero bit b in low 8 bytes of R do
14:      j = the byte position containing bit b
15:      l = length of string in bucket b
16:      h = HASH(V[j−l+1 .. j])
17:      if h is valid then
18:        Perform exact matching for each string in hash bucket[h]
19:      end if
20:    end for
21:    R = RSHIFT(R, 8)
22:  end for
23: end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. We then leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find that hashing filters out a large portion of the false positives.
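The verification stage can be sketched as follows (an illustration only: the 8-bit `hash(...) & 0xFF` table index is a stand-in for FDR's real hash function, and the candidate format matches the per-chunk (bucket, end_offset) pairs used earlier):

```python
def build_hash_tables(buckets):
    """One small hash table per bucket, to minimize collisions."""
    tables = []
    for pats in buckets:
        tbl = {}
        for p in pats:
            tbl.setdefault(hash(p) & 0xFF, []).append(p)
        tables.append(tbl)
    return tables

def verify(candidates, data, buckets, tables):
    """Confirm (bucket, end_offset) candidates by hashing the input slice
    and exact-comparing it against the patterns in the hash bucket."""
    matches = []
    for b, end in candidates:
        for length in sorted({len(p) for p in buckets[b]}):
            start = end - length + 1
            if start < 0:
                continue
            cand = data[start:end + 1]
            for p in tables[b].get(hash(cand) & 0xFF, []):
                if p == cand:
                    matches.append((p, end))
    return matches
```

On the earlier example, the candidate at offset 3 in "xxabxxxx" is confirmed as 'ab', while the same candidate position in "xxadxxxx" is rejected as a false positive.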

4.2 Finite Automata Pattern Matching

Successful string matching often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold,7 we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, there can be multiple current states (the current set) that are active at any time. A state transition with an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match is successful if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation as it outperforms traditional NFA representations that use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. Also, the bit-based NFA leverages SIMD instructions to perform vectorized bit

7 We use 16,384 states as the threshold.

operations, further accelerating the performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n−1 by topological order, and maintains the current set as a bit mask (called the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are sequentially assigned by topological order, the span of a state transition is typically small. We exploit this fact to compactly represent the transitions below.

The bit-based NFA implements a state transition with an input character 'c' in three steps. First, it calculates the successor set that can be transitioned to from any state in the current set with any input character. Second, it computes the set of all states that can be transitioned to from any state with 'c' (called the reachable states by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, as a Glushkov NFA graph guarantees that one can enter a specific state only with the same character or with the same character-set.
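The three steps can be sketched as follows (a naive per-state version, for illustration; succ_of[s] is an assumed precomputed mask of the successors of state s under any character, and the shift-k optimization described next removes most of the per-state lookups):

```python
def nfa_step(S, c, succ_of, reach):
    """One bit-parallel NFA step: (1) union the successor masks of all
    active states, (2) fetch the states reachable on character c,
    (3) intersect the two sets."""
    succ = 0
    bits = S
    while bits:
        s = (bits & -bits).bit_length() - 1  # lowest active state id
        succ |= succ_of[s]
        bits &= bits - 1                     # clear that bit
    return succ & reach[c]
```

Step (3) is a single and, which is why the Glushkov property (each state is entered on exactly one character or character-set) makes the bit-parallel formulation correct.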

The challenge is to efficiently calculate the successor set. One could pre-calculate a successor set for every combination of current states and look up the successor set for the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable except for a small n. An alternative is to keep a successor-set mask per individual state and to combine the successor set of every state in the current set. This is more space-efficient, but it costs up to n memory lookups and (n−1) or operations. We implement the latter, but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some examples of shift-k masks. The shift-1 mask sets every bit to 1 except for bit 7, since every state except state 7 has a forward transition of span 1.

We divide each state transition into two types – typical or exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set from the current set following typical transitions. If the current-set mask is S, then ((S & shift-k mask) ≪ k) represents all possible successor states reached by transitions of span k from S. If we combine the successor sets for all k, we obtain the successor set reached by all typical transitions.


[Figure content: an 8-state Glushkov NFA with states 0 (start), 1:A, 2:B, 3:C, 4:D, 5:A, 6:F, 7:F. With shift limit 2, the typical transitions are captured by the shift-0 mask 10000000, the shift-1 mask 01111111, and the shift-2 mask 00000000 (bit order: high to low). The exception mask 00010101 marks states 0, 2, and 4; their successor masks are 00101000 (state 0), 00100010 (state 2), and 00001010 (state 4).]

Figure 10: NFA representation for (AB|CD)*AFF*

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and any backward transitions.8 Any state that has at least one exceptional transition keeps its own successor mask, which records all states reached by the exceptional transitions of its owner state. All exceptional states are maintained in an exception mask.

The choice of the shift limit affects performance. If it is too large, we would have too many shift-k masks representing rare transitions; if it is too small, we would have to handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of ids. States 0, 2, and 4 are highlighted, as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.
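Putting the pieces together for the Figure 10 example, the matching loop can be sketched in Python (mask values transcribed from the figure, with bit i representing state i; plain integers stand in for the SIMD registers of the real implementation):

```python
def nfa_run(data, start, accept, reach, shift_masks, ex_mask, succ_masks):
    """Bit-based NFA matching: shift-k masks cover typical transitions,
    per-state successor masks cover exceptional ones."""
    S, hits = start, []
    for pos, c in enumerate(data):
        succ = 0
        for k, mask in enumerate(shift_masks):     # typical transitions
            succ |= (S & mask) << k
        ex = S & ex_mask                           # exceptional states
        while ex:
            s = (ex & -ex).bit_length() - 1
            succ |= succ_masks[s]
            ex &= ex - 1
        S = succ & reach.get(c, 0)                 # Glushkov intersection
        if S & accept:
            hits.append(pos)
    return hits

# (AB|CD)*AFF*: states 0=start, 1=A, 2=B, 3=C, 4=D, 5=A, 6=F, 7=F
shift_masks = [0b10000000, 0b01111111, 0b00000000]   # shift-0/1/2 masks
ex_mask = 0b00010101                                  # states 0, 2, 4
succ_masks = {0: 0b00101000, 2: 0b00100010, 4: 0b00001010}
reach = {ord('A'): 0b00100010, ord('B'): 0b00000100,
         ord('C'): 0b00001000, ord('D'): 0b00010000,
         ord('F'): 0b11000000}
```

Running it with start mask 00000001 and accept mask 11000000 (states 6 and 7), the input "ABAFF" reports matches at offsets 3 and 4, and "CDAF" at offset 3.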

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). It then fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise and operation with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state. Otherwise, it proceeds with the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exception states.

8 In our implementation, forward transitions that cross the 64-bit boundary of the state id space (e.g., from an id smaller than 64 to an id larger than 64) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2 Bit-based NFA Matching
1: SH_MSKS[i]: shift-i masks for typical transitions
2: SUCC_MSKS[i]: successor mask for state i
3: EX_MSK: exception mask
4: reach[k]: reachable state set for character k
5: function RUNNFA(S: current active state set)
6:   SUCC_TYP = 0, SUCC_EX = 0
7:   for c ∈ input do
8:     if any state is active in S then
9:       for i = 0 to shiftlimit do
10:        R0 = AND(S, SH_MSKS[i])
11:        R1 = LSHIFT(R0, i)
12:        SUCC_TYP = OR(SUCC_TYP, R1)
13:      end for
14:      S_EX = AND(S, EX_MSK)
15:      for each active state s in S_EX do
16:        SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
17:      end for
18:      SUCC = OR(SUCC_TYP, SUCC_EX)
19:      S = AND(SUCC, reach[c])
20:      Report accept states in S
21:    end if
22:  end for
23: end function

Total Size              1 GByte
Number of Packets       818,682
Number of TCP Packets   818,520
Percent of TCP Bytes    97.2%
Percent of HTTP Bytes   92.9%
Average Packet Size     1,265 Bytes

Table 2: HTTP traffic trace from a cloud service provider

Our implementation uses a 128-bit mask (and extends it up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough to represent the NFA graph of most regexes.

5 Implementation

Hyperscan consists of compile-time and run-time functions. Compile-time functions include regex-to-NFA graph conversion, graph decomposition, and matching component generation. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:

• Small string-set (<80) matching: This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction, applied over a small number (2-8) of 4-bit regions in the suffix of each string.

• NFA and DFA cyclic state acceleration: Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where


# of regexes   Prefilter     Hyperscan   Reduction
500            2,971,652     645,326     4.6x
1000           93,595,304    714,582     131.0x
1500           110,122,972   791,017     139.2x
2000           139,804,519   780,665     179.1x
2500           156,332,187   857,100     182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration to search for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states or switches off one of the current states.

• Small-size DFA matching: We design a SIMD-based algorithm for small-size DFAs (<16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

• Anchored pattern matching: When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the corresponding automata are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching: We design a fast lookaround approach that peeks at the inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s{4,5}foo, where R is a complex regex, we can first detect with SIMD checks whether the digit and space character classes have matched, and if not, avoid or defer running a potentially expensive FA associated with R.

6 Evaluation

In this section, we evaluate the performance of Hyperscan to answer the following questions. (1) Does regex decomposition extract better strings than those chosen manually? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup

We use a server machine with an Intel Xeon Platinum 8180 CPU @ 2.50GHz and 48 GB of memory, and compile the

# of regexes   Prefilter     Hyperscan   Reduction
700            699,622       18,164      38.5x
850            7,516,464     19,214      391.2x
1000           17,063,344    32,533      524.5x
1150           17,737,814    34,075      520.6x
1300           25,143,574    36,040      697.7x

Table 4: Regex invocations for Snort's Talos ruleset

code with GCC 5.4. To separate out the impact of network IO, we evaluate the performance with a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, as shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

6.2 Effectiveness of Regex Decomposition

The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the ruleset, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace, and confirm correctness – both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when it is necessary.

6.3 Microbenchmarks

We evaluate the performance of FDR, our multi-string matcher, as well as that of the regex matching of Hyperscan.

Multi-string pattern matching: We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns ex-


Ruleset    PCRE     PCRE2   RE2-s   Hyperscan-s   RE2-m   Hyperscan-m
Talos      694.2    39.4    177.7   17.3          2.9     2.15
ET-Open    1280.0   91.3    469.6   51.6          111.6   13.3

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1300 regexes) and Suricata (2800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure: throughput (Gbps) of FDR and DFC, and FDR's improvement over DFC and AC, for 1k-26k string patterns from the ET-Open ruleset (a) and 1k-20k patterns from the Talos ruleset (b).]

Figure 11: String matching performance with random packets

[Figure: throughput (Gbps) of FDR and DFC, and FDR's improvement over DFC and AC, for 1k-26k string patterns from the ET-Open ruleset (a) and 1k-20k patterns from the Talos ruleset (b).]

Figure 12: String matching performance with a real traffic trace

tracted from the Snort ET-Open and Talos rulesets. Figures 11 and 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, its performance becomes comparable to DFC due to increased cache usage, but it is still much better than AC.

Regex matching: We now evaluate the performance of the regex matching of Hyperscan. We compare the performance with PCRE (v8.41) [6], as it is most widely used in network DPI applications, and with PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (e.g.,

matching against one regex at a time), which requires passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers – one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1300 regexes from the Snort Talos ruleset and 2800 regexes from the Suricata ruleset.

Table 5 shows the results. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 1.35x faster than RE2-m, while it shows an 18.33x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs, and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both with a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort, but replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect a much higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons

Hyperscan has been developed since 2008 and was first open-sourced in 2013. It has been successfully adopted by over 40 commercial projects globally, and it is in production use by tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed for C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience with developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan

Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but was also simple for applications to employ.

Version 1.0: The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). It also suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support for a broad range of regexes that would suffer from combinatorial explosions if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0: The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA Graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax

tree. Second, it supported 'streaming' – scanning multiple blocks without retaining old data and with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string-matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings were found. This approach avoids the high risk of potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.

Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines, even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns foo, abc[xyz], and abc[xy]*def|abcz*de[fg] might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern abc[xyz], then provide matches for the two parts

USENIX Association, 16th USENIX Symposium on Networked Systems Design and Implementation, p. 643

of the alternation, abc[xy]*def and abcz*de[fg], without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0 (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel with the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0 Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identify the strings separated by various common regex constructs such as .* or X+ for some character class X. For example, it is frequently the case that strings in regexes might be separated by the construct .* (i.e., foo.*bars). This is usually implementable by requiring only that foo is seen before bars (usually, but not always; consider the expression foo.*oars). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

Version 5.0 Version 5.0, which is the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology of malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns beyond matching a single pattern. To

support this, the system now allows user-defined AND, OR, and NOT operations among patterns. For example, an AND operation between the patterns foobar and teakettle requires that both patterns match the input before a match is reported. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE that brings the benefits of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. Hyperscan's lack of support for full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopted solutions. Chimera employs Hyperscan as a fast filter for input and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
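To illustrate the reporting semantics only (this is not the Hyperscan API; the helper name and the use of Python's re module are ours), an AND combination fires only when every member pattern matches the input:

```python
import re

def and_combination(patterns, data):
    # Report the combined match only if every member pattern
    # appears somewhere in the input (AND semantics).
    return all(re.search(p, data) is not None for p in patterns)
```

In Hyperscan 5.0 itself, such combinations are expressed as logical expressions over the ids of other patterns; the sketch above only mirrors the semantics described in the text.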

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes), and that the support of regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimal Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with additional powers of support for large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or at some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series over time were not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that the existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address the limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher, adopted by many commercial systems around the world.

Acknowledgment

We appreciate valuable feedback by anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.

[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333-340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74-82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248-264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1-53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317-328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2014.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-fa while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G=(V,E)
1: function DominantPathAnalysis(G)
2:   dpath = empty
3:   for v in accepts do
4:     calculate dominant path p[v] for v
5:     if dpath = empty then
6:       dpath = p[v]
7:     else
8:       dpath = common_prefix(dpath, p[v])
9:       if dpath = empty then
10:        return null_string
11:      end if
12:    end if
13:    strings = expand_and_extract(dpath)
14:  end for
15:  return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and computes the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets in the path and extracts the string on the path.
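The common_prefix() step of the algorithm can be sketched as follows, treating a dominant path as a list of state ids (a sketch; only the helper name comes from the pseudocode):

```python
def common_prefix(paths):
    # Longest common prefix of the accept states' dominant paths;
    # an empty result means no common dominant path exists.
    pre = paths[0]
    for p in paths[1:]:
        n = 0
        while n < min(len(pre), len(p)) and pre[n] == p[n]:
            n += 1
        pre = pre[:n]
        if not pre:
            break
    return pre
```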

Algorithm 4 Dominant Region Analysis

Require: Graph G=(V,E)
1: function DominantRegionAnalysis(G)
2:   acyclic_g = build_acyclic(G)
3:   Gt = build_topology_order(acyclic_g)
4:   candidate = {q0}
5:   it = begin(Gt)
6:   while it != end(Gt) do
7:     if isValidCut(candidate) then
8:       setRegion(candidate)
9:       initializeCandidate(candidate)
10:    else
11:      addToCandidate(it)
12:      it = it + 1
13:    end if
14:  end while
15:  setRegion(candidate)
16:  Merge regions connected with back edges
17:  strings = expand_and_extract(regions)
18:  return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G=(V,E)
1: function NetworkFlowAnalysis(G)
2:   for edge in E do
3:     strings = find_strings(edge)
4:     scoreEdge(edge, strings)
5:   end for
6:   cuts = MinCut(G)
7:   strings = extract and expand strings from cuts
8:   return strings
9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of the string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.



strategy would distribute the patterns well such that most innocent traffic would pass with a low false positive rate. Towards this goal, we design our algorithm based on two guidelines. First, we group patterns of a similar length into the same bucket. This is to minimize the information loss of longer patterns, as the input characters match only up to the length of the shortest pattern in a bucket for matching correctness. Second, we avoid grouping too many short patterns into one bucket. In general, shorter patterns are more likely to increase false positives. To meet these requirements, we sort the patterns in ascending order of their length and assign an id of 0 to (s-1) to each pattern in sorted order. Then we run the following algorithm, which calculates the minimum cost of grouping the patterns into n buckets using dynamic programming. The algorithm is summarized by the two equations below:

1. t[i][j] = min over k = i..(s-1) of (cost(i,k) + t[k+1][j-1]), where s is the number of patterns and t[i][j] stores the minimum cost of grouping the patterns i to (s-1) into (j+1) buckets.

2. cost(i,k) = (k-i+1)^α / length_i^β, where cost(i,k) is the cost of grouping patterns i to k into one bucket, length_i is the length of pattern i, and α and β are constant parameters.
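As a concrete (unofficial) sketch of this dynamic program, with α, β, and the bucket count taken from the text and all function and variable names ours:

```python
from functools import lru_cache

def group_into_buckets(lengths, n_buckets=8, alpha=1.05, beta=3):
    # lengths: pattern lengths sorted in ascending order (ids 0..s-1)
    s = len(lengths)

    def cost(i, k):
        # cost of putting patterns i..k into one bucket:
        # (k - i + 1)^alpha / length_i^beta
        return (k - i + 1) ** alpha / lengths[i] ** beta

    @lru_cache(maxsize=None)
    def t(i, j):
        # min cost of grouping patterns i..s-1 into j+1 buckets,
        # returned with the chosen (start, end) bucket boundaries
        if i >= s:
            return 0.0, ()
        if j == 0:
            return cost(i, s - 1), ((i, s - 1),)
        best = None
        for k in range(i, s):          # first bucket holds patterns i..k
            sub_cost, sub_split = t(k + 1, j - 1)
            cand = (cost(i, k) + sub_cost, ((i, k),) + sub_split)
            if best is None or cand[0] < best[0]:
                best = cand
        return best

    return t(0, n_buckets - 1)         # e.g. t[0][7] for 8 buckets
```

The recursion mirrors equation 1 directly; computing t(0, 7) yields both the minimum total cost and the bucket boundaries for each pattern.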

t[i][j] is calculated as the minimum of the sum of the cost of grouping patterns i to k into one bucket (cost(i,k)) and the minimum cost of grouping the remaining patterns (k+1) to (s-1) into j buckets (t[k+1][j-1]). cost(i,k) gets smaller as the bucket has a longer pattern, which allows more patterns in the bucket. It gets larger as the bucket has a shorter pattern, limiting the number of such patterns. Our implementation currently uses α = 1.05 and β = 3 towards this goal, and computes t[0][7] to divide all string patterns into 8 buckets, recording the bucket id of each pattern in the process. In practice, we find that the algorithm works well, automatically reaching the sweet spot that minimizes the total cost.

Super Characters One problem with bucket-based matching is that it produces false positives with patterns in the same bucket. For example, if a bucket has ab and cd, the algorithm not only matches the correct patterns but also matches the false positives ad and cb. To suppress them, we use an m-bit (m > 8) super character (instead of an 8-bit ASCII character) to build and index the sh-masks. An m-bit super character consists of a normal (8-bit) character in the lower 8 bits and the low-order (m-8) bits of the next character in the upper bits. If it is the last character of a pattern (or of the input), we use a null character (0) as the next character. The key idea is to reflect some information about the next character of a pattern when building the sh-mask for the current character.

Only if the same two characters appear in the input^6 do we declare a match at that input byte position. This significantly reduces false positives at the cost of slightly larger memory for the sh-masks.
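The m-bit super-character construction described above can be sketched as follows (the function name is ours; m = 12 matches the worked example in the text):

```python
def super_chars(data, m=12):
    # Each position yields one m-bit super character:
    # lower 8 bits = current byte, upper (m-8) bits = low-order
    # bits of the next byte (0 at the last position).
    extra = m - 8
    mask = (1 << extra) - 1
    out = []
    for i in range(len(data)):
        nxt = data[i + 1] if i + 1 < len(data) else 0
        out.append(((nxt & mask) << 8) | data[i])
    return out
```

For the input 'ad', the first super character embeds bits of 'd' rather than 'b', so it indexes a different sh-mask than the pattern ab would, which is exactly how the ad false positive is suppressed.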

In practice, m should be between 9 and 15. Let's say m = 12 bits. For a pattern ab, we see two 12-bit super characters: α = ((low-order 4 bits of 'b' << 8) | 'a') and β = 'b'. Then we build sh-masks for α and β, respectively. When the input arrives, we construct a 12-bit super character based on the current input byte offset and use it as an index to fetch its sh-mask. We advance the input one byte at a time as before. For example, if the input is 'ad', it first constructs γ = ((low-order 4 bits of 'd' << 8) | 'a'), fetches sh-mask(γ), and performs shift and or operations as before. Then it advances to the next byte and constructs δ = 'd'. So the input 'ad' will not match even if a bucket contains ab and cd.

SIMD acceleration Our implementation of FDR heavily exploits SIMD operations and instruction-level parallelism. First, it uses 128-bit sh-masks so that it can employ 128-bit SIMD instructions (e.g., pslldq for left shift and por for or in the Intel x86-64 instruction set) to update the masks. As shift and or are the most frequent operations in FDR, it enjoys a substantial performance boost with the SIMD instructions. Second, it exploits parallel execution with multiple CPU execution ports. In the original shift-or matching, the execution of shift and or operations is completely serialized, as they depend on the previous result. This under-utilizes a modern CPU even if it can issue multiple instructions per CPU cycle. In contrast, FDR exploits instruction-level parallelism by pre-shifting the sh-masks with multiple input characters in parallel. Note that this is made possible as we count the byte position differently from the original version. The parallel execution effectively increases instructions per cycle (IPC) and significantly improves the performance. To accommodate the parallel shifting, we limit the length of a pattern to 8 bytes and extract the lower 8 bytes from any pattern longer than 8 bytes. Because this requires a minimum of 8-byte masks and up to 7 bytes of shifting, a 128-bit mask does not lose high-bit information during the left shift. In actual matching, FDR handles 8 bytes of input at a time. To guarantee contiguous matching across 8-byte input boundaries, the st-mask of the previous iteration is shifted right by 8 bytes for the next iteration.

Verification As our shift-or matching can still generate false positives, we need to verify if a candidate match is

6 Of course, there is still a small chance of a false positive as we use partial bits of the next character, but the probability becomes fairly small as the pattern length grows.


Algorithm 1 FDR Multi-string Matcher
1: function Match
2:   n = number of bits of a super character
3:   R = startMask
4:   for each 8-byte V in input do
5:     for i in 0..7 do
6:       index = V[i*8 .. i*8+n-1]
7:       M[i] = shiftOrMask[index]
8:       S[i] = LSHIFT(M[i], i)
9:     end for
10:    for i in 0..7 do
11:      R = OR128(R, S[i])
12:    end for
13:    for each zero bit b in low 8 bytes of R do
14:      j = the byte position containing bit b
15:      l = length of string in bucket b
16:      h = HASH(V[j-l+1 .. j])
17:      if h is valid then
18:        Perform exact matching for each
19:        string in hash bucket[h]
20:      end if
21:    end for
22:    R = RSHIFT(R, 8)
23:  end for
24: end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. Then we leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find that hashing filters out a large portion of the false positives.
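For background, FDR's bit-parallel core descends from the classic shift-or algorithm [13]. A minimal single-pattern version (a sketch, not Hyperscan's multi-string FDR; it omits the bucketing, super characters, and SIMD described above) is:

```python
def shift_or(pattern, text):
    # Classic shift-or: bit i of state is 0 iff pattern[0..i]
    # matches the input ending at the current position.
    m = len(pattern)
    masks = [~0] * 256                  # per-byte masks, bit i cleared
    for i, c in enumerate(pattern):     # iff pattern[i] == that byte
        masks[c] &= ~(1 << i)
    state = ~0
    hits = []
    for pos, c in enumerate(text):
        state = (state << 1) | masks[c]
        if state & (1 << (m - 1)) == 0:
            hits.append(pos - m + 1)    # start offset of a full match
    return hits
```

The serialized shift/or dependency chain visible in this loop is precisely what FDR's pre-shifted masks and 8-bytes-at-a-time processing break up.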

4.2 Finite Automata Pattern Matching

Successful string matching often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold,^7 we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, there can be multiple current states (the current set) that are active at any time. A state transition with an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match is successful if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation as it outperforms traditional NFA representations that use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. Also, the bit-based NFA leverages SIMD instructions to perform vectorized bit

7 We use 16,384 states as the threshold.

operations to further accelerate the performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n-1 in topological order, and maintains the current set as a bit mask (called the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are sequentially assigned in topological order, the span of a state transition is typically small. We exploit this fact to compactly represent the transitions, as described below.

The bit-based NFA implements a state transition with an input character 'c' in three steps. First, it calculates the successor set that can be transitioned to from any state in the current set with any input character. Second, it computes the set of all states that can be transitioned to from any state with 'c' (called the reachable states by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, as a Glushkov NFA graph guarantees that one can enter a specific state only with the same character or with the same character-set.

The challenge is to efficiently calculate the successor set. One could pre-calculate a successor set for every combination of current states and look up the successor set for the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable except for a small n. An alternative is to keep a successor-set mask per individual state and to combine the successor set of every state in the current set. This is more space-efficient, but it costs up to n memory lookups and (n-1) or operations. We implement the latter, but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some examples of shift-k masks: the shift-1 mask sets every bit to 1 except for bit 7, since every state except state 7 has a forward transition of span 1.

We divide each state transition into two types: typical or exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set from the current set following typical transitions. If the current-set mask is S, then ((S & shift-k mask) << k) represents all possible successor states with transitions of span k from S. If we combine the successor sets for all k, we obtain the successor set reached by all typical transitions.
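The typical-transition computation just described can be written directly (a scalar sketch with Python integers standing in for 128-bit SIMD masks; the function name is ours):

```python
def typical_successors(current, shift_masks):
    # shift_masks[k] has bit s set iff state s has a forward
    # transition of span k; combine ((S & mask) << k) over all k.
    succ = 0
    for k, mask in enumerate(shift_masks):
        succ |= (current & mask) << k
    return succ
```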


[Figure 10: NFA representation for (AB|CD)*AFF*, showing typical transitions encoded by the shift-0, shift-1, and shift-2 masks, exceptional transitions tracked by the exception mask, and the successor masks for states 0, 2, and 4.]

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and any backward transitions.^8 Any state that has at least one exceptional transition keeps its own successor mask. The successor mask records all states reached by the exceptional transitions of its owner state. All exceptional states are maintained in an exception mask.

As you can see, the choice of the shift limit affects the performance. If it is too large, we would have too many shift-k masks representing rare transitions; if it is too small, we would have to handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of ids. States 0, 2, and 4 are highlighted, as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). Then it fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise and operation with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state. Otherwise, it proceeds with the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exception states.

8In our implementation forward transitions that cross the 64-bitboundary of the state id space (eg from an id smaller than 64 to anid larger than 64) are also treated as exceptional This is related to aspecific SIMD instruction that we use so we omit the detail here

Algorithm 2 Bit-based NFA Matching1 SH_MSKS[i] shift-i masks for typical transitions2 SUCC_MSKS[i] successor mask for state i3 EX_MSK exception mask4 reach[k] reachable state set for character k5 function RUNNFA(S current active state set)6 SUCC_TY P = 0 SUCC_EX = 07 for c isin input do8 if any state is active in S then9 for i = 0 to shi f tlimit do

10 R0 = AND(SSH_MSKS[i])11 R1 = LSHIFT (R0 i)12 SUCC_TY P = OR(SUCC_TY PR1)13 end for14 S_EX = AND(SEX_MSK)15 for active state s in S_EX do16 SUCC_EX =17 OR(SUCC_EX SUCC_MSKS[s])18 end for19 SUCC = OR(SUCC_TY PSUCC_EX)20 S = AND(SUCCreach[c])21 Report accept states in S22 end if23 end for24 end function

Total Size 1 GBytesNumber of Packets 818682Number of TCP Packets 818520Percent of TCP Bytes 972Percent of HTTP Bytes 929Average Packet Size 1265 Bytes

Table 2 HTTP traffic trace from a cloud service provider

tation uses a 128-bit mask (and extends it up to a 512-bitmask with four variables if needed) and employs 128-bitSIMD instructions for fast bitwise operations In practicewe find that 512 states are enough for representing theNFA graph of most regexes
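To make the data flow of Algorithm 2 concrete, the following is a minimal sketch using Python integers as bitmasks in place of 128-bit SIMD registers. The three-state NFA here is hypothetical (not from the paper): bit 0 is a start pseudo-state, bit 1 matches 'a', bit 2 matches 'b', and a backward (exceptional) edge from state 2 to state 1 lets the anchored pattern (ab)+ repeat.

```python
SH_MSKS = {1: 0b011}        # states 0 and 1 have a shift-1 (typical) transition
EX_MSK = 0b100              # state 2 has an exceptional (backward) transition
SUCC_MSKS = {2: 0b010}      # exceptional successor of state 2: state 1
REACH = {'a': 0b010, 'b': 0b100}   # states reachable on each input character
ACCEPT = 0b100

def run_nfa(data, shift_limit=1):
    S = 0b001               # start with only the pseudo-state active
    matches = []
    for pos, c in enumerate(data):
        if S == 0:          # no active states: the automaton is dead
            break
        succ_typ = succ_ex = 0
        for i in range(shift_limit + 1):       # typical transitions: mask + shift
            succ_typ |= (S & SH_MSKS.get(i, 0)) << i
        ex, s = S & EX_MSK, 0
        while ex:                              # active exceptional states
            if ex & 1:
                succ_ex |= SUCC_MSKS[s]
            ex >>= 1
            s += 1
        S = (succ_typ | succ_ex) & REACH.get(c, 0)
        if S & ACCEPT:
            matches.append(pos + 1)            # report match end offset
    return matches
```

Running `run_nfa("abab")` reports matches ending at offsets 2 and 4, while `run_nfa("aab")` dies after the second 'a' and reports nothing.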

5 Implementation

Hyperscan consists of compile-time and run-time functions. Compile-time functions include regex-to-NFA graph conversion, graph decomposition, and matching component generation. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:

• Small string-set (<80) matching. This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction, applied over a small number (2-8) of 4-bit regions in the suffix of each string.

• NFA and DFA cyclic state acceleration. Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We have SIMD acceleration for searching the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states or switches off one of the current states.

640 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

# of regexes   Prefilter      Hyperscan   Reduction
500            2,971,652      645,326     4.6x
1000           93,595,304     714,582     131.0x
1500           110,122,972    791,017     139.2x
2000           139,804,519    780,665     179.1x
2500           156,332,187    857,100     182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

• Small-size DFA matching. We design a SIMD-based algorithm for a small-size DFA (<16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

• Anchored pattern matching. When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the automata corresponding to them are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching. We design a fast lookaround approach that peeks at inputs that are near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all, or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s{4,5}foo, where R is a complex regex, we can first detect that the digit and space character classes have matched with SIMD checks, and if not, avoid or defer running a potentially expensive FA associated with R.
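The small string-set subsystem above vectorizes the classic shift-or (bitap) algorithm. As background, a textbook scalar single-string version can be sketched as follows; this is a hypothetical illustration, not Hyperscan's PSHUFB-based multi-string implementation:

```python
def shift_or_find(pattern, text):
    # Classic shift-or: bit i of `state` is 0 iff the pattern prefix of
    # length i+1 ends at the current text position.
    m = len(pattern)
    B = {}                       # per-character mask: bit i cleared if pattern[i] == c
    for i, c in enumerate(pattern):
        B[c] = B.get(c, -1) & ~(1 << i)
    state = -1                   # all ones (Python ints are arbitrary precision)
    starts = []
    for pos, c in enumerate(text):
        state = (state << 1) | B.get(c, -1)
        if state & (1 << (m - 1)) == 0:      # full pattern ends here
            starts.append(pos - m + 1)
    return starts
```

For example, `shift_or_find("abc", "xabcabc")` returns the start offsets [1, 4]. The SIMD variant described above applies the same idea with PSHUFB table lookups over 4-bit nibbles so that many strings are tracked at once.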
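The cyclic state acceleration bullet amounts to skipping the run of input bytes that cannot change the current state set. A scalar sketch of that search (hypothetical; the production version compares 16 or 32 bytes per SIMD instruction) is:

```python
def first_exceptional(data, stay_set):
    # Return the offset of the first character that causes a transition out
    # of the current cyclic states (or switches one of them off); everything
    # before it can be skipped without running the automaton.
    for i, ch in enumerate(data):
        if ch not in stay_set:
            return i
    return len(data)
```

For instance, with `stay_set={"a"}`, `first_exceptional("aaab", {"a"})` returns 3, so the automaton only needs to execute from that offset.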
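For the small-size DFA bullet: with at most 16 states, the transition row for each input byte fits in one 16-byte register, so a single PSHUFB can perform the state lookup. The sketch below emulates the shuffle with a list index; the two-state DFA (even number of 'a's) is a hypothetical example, not from the paper:

```python
# Each input character maps to a 16-entry transition row indexed by the
# current state; PSHUFB(row, state) performs this lookup in one instruction.
ROWS = {
    'a': [1, 0] + [0] * 14,   # 'a' toggles between states 0 and 1
    'b': [0, 1] + [0] * 14,   # 'b' leaves the state unchanged
}

def run_small_dfa(data, start=0, accept=frozenset({0})):
    s = start
    for ch in data:
        s = ROWS[ch][s]       # one shuffle per input byte in the SIMD version
    return s in accept
```

Here `run_small_dfa("abab")` accepts (two 'a's) and `run_small_dfa("ab")` rejects (one 'a').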
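The lookaround check in the last bullet can be pictured as verifying cheap character-class conditions at fixed offsets around a trigger before launching the expensive FA. A scalar sketch with a hypothetical `checks` format (the real checks run as SIMD class tests over a range of positions):

```python
def lookaround_ok(buf, pos, checks):
    # `checks` is a list of (relative offset, allowed character set) pairs
    # anchored at the trigger position `pos`; run the FA only if all pass.
    for off, allowed in checks:
        i = pos + off
        if i < 0 or i >= len(buf) or buf[i] not in allowed:
            return False
    return True
```

For a fragment like \d\s immediately before a literal trigger at `pos`, the checks would be `[(-2, digits), (-1, spaces)]`; if the digit or space test fails, the surrounding FA is never run.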

6 Evaluation

In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than those by manual choice? (2) How well do multi-string matching and regex matching perform in comparison with the existing state-of-the-art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup

We use a server machine with an Intel Xeon Platinum 8180 CPU @ 2.50GHz and 48 GB of memory, and compile the code with GCC 5.4. To separate the impact of network IO, we evaluate the performance with a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, as shown in Table 2. For all evaluation, we use the latest version of Hyperscan (v5.0) [2].

# of regexes   Prefilter     Hyperscan   Reduction
700            699,622       18,164      38.5x
850            7,516,464     19,214      391.2x
1000           17,063,344    32,533      524.5x
1150           17,737,814    34,075      520.6x
1300           25,143,574    36,040      697.7x

Table 4: Regex invocations for Snort's Talos ruleset

6.2 Effectiveness of Regex Decomposition

The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. Then we measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from regexes rather than using the keywords from the content option. For the ruleset, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace and confirm the correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when it is necessary.

6.3 Microbenchmarks

We evaluate the performance of FDR, our multi-string matcher, as well as that of regex matching of Hyperscan.

Multi-string pattern matching. We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns extracted from Snort ET-Open and Talos rulesets. Figures 11 and 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate it with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration, but when the number of patterns grows, the performance becomes comparable with DFC due to increased cache usage, though it is still much better than AC.

Ruleset    PCRE     PCRE2   RE2-s   Hyperscan-s   RE2-m   Hyperscan-m
Talos      694.2    39.4    177.7   17.3          2.9     2.15
ET-Open    1280.0   91.3    469.6   51.6          111.6   13.3

Table 5: Performance comparison with PCRE, PCRE2, RE2, and Hyperscan for Snort Talos (1300 regexes) and Suricata (2800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure omitted: bar charts of string matching throughput (Gbps) for FDR, DFC, and AC over 1k-26k (ET-Open) and 1k-20k (Talos) string patterns, with FDR's improvement factors over DFC.]

Figure 11: String matching performance with random packets ((a) ET-Open ruleset, (b) Talos ruleset)

[Figure omitted: the same throughput comparison on the real traffic trace.]

Figure 12: String matching performance with a real traffic trace ((a) ET-Open ruleset, (b) Talos ruleset)

Regex matching. We now evaluate the performance of regex matching of Hyperscan. We compare the performance with PCRE (v8.41) [6], as it is most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (e.g.,

match against one regex at a time), which would require passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1300 regexes from the Snort Talos ruleset and 2800 regexes from the Suricata ruleset.

Table 5 shows the result. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 1.35x faster than RE2-m, while it shows an 18.33x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs, and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both with a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort, but it replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect a much higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons

Hyperscan has been developed since 2008 and was first open-sourced in 2013. Hyperscan has been successfully adopted by over 40 commercial projects globally, and it is in production use by tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed in C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience with developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan

Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but also made it simple to be employed by applications.

Version 1.0. The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). Also, it suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosions if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0. The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA Graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax

tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data, and with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings are found. This approach avoids the high risk of potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.

Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines, even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often did not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns /foo/, /abc[xyz]/, and /abc[xy]*def|abcz*de[fg]/ might first produce matches for the simple literal /foo/, then provide all the matches for the components of the pattern /abc[xyz]/, then provide matches for the two parts


of the alternation, /abc[xy]*def/ and /abcz*de[fg]/, without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0. (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel to the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0. Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identify the strings separated by various common regex constructs, such as .* or X+ for some character class X. It is frequently the case that strings in regexes are separated by the .* construct (i.e., /foo.*bars/). This is usually implementable by requiring only that foo is seen before bars (usually, but not always: consider the expression /foo.*oars/). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

Version 5.0. Version 5.0, which is the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns, and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology for malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns beyond matching a single pattern. To

support this, the system now allows user-defined AND, OR, and NOT combinations along with their patterns. For example, an AND operation between patterns foobar and teakettle requires that both patterns are matched in the input before reporting a match. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE that brings the benefit of both worlds: support for the full PCRE syntax while enjoying the high performance of Hyperscan. Hyperscan's lack of support for the full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopting solutions. Chimera employs Hyperscan as a fast filter for the input, and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
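The logical-combination semantics can be pictured as evaluating a boolean expression over the set of pattern ids that have matched so far. The sketch below is a hypothetical illustration of that evaluation, not Hyperscan's actual combination API:

```python
def combination_matched(matched_ids, expr):
    # `expr` is a nested tuple such as ("AND", 1, ("NOT", 2)): report the
    # combination only if pattern 1 has matched and pattern 2 has not.
    if isinstance(expr, int):
        return expr in matched_ids
    op, *args = expr
    if op == "AND":
        return all(combination_matched(matched_ids, a) for a in args)
    if op == "OR":
        return any(combination_matched(matched_ids, a) for a in args)
    if op == "NOT":
        return not combination_matched(matched_ids, args[0])
    raise ValueError(op)
```

With matched ids {1}, the combination ("AND", 1, ("NOT", 2)) reports a match; once pattern 2 also matches, it no longer does.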

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes) and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimum Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with additional powers of support for large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series over time were not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts, and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited

set of expressions) Hyperscan could be extended to haveenriched semantics to support capturing which would al-low portions of the regexes to lsquocapturersquo parts of the inputthat matched particular parts of the regular expression

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that the existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by an efficient multi-string matcher and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher, adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback by the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.

[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333-340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74-82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors & Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248-264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1-53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317-328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, Sunglk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
 1: function DOMINANTPATHANALYSIS(G)
 2:   d_path = ∅
 3:   for v ∈ accepts do
 4:     calculate dominant path p[v] for v
 5:     if d_path = ∅ then
 6:       d_path = p[v]
 7:     else
 8:       d_path = common_prefix(d_path, p[v])
 9:       if d_path = ∅ then
10:         return null_string
11:       end if
12:     end if
13:     strings = expand_and_extract(d_path)
14:   end for
15:   return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and computes the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets in the path and extracts the string on the path.
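As a concrete illustration, the common-prefix intersection at the heart of Algorithm 3 can be sketched in Python. The names are ours, not Hyperscan's; dominant paths are shown as lists of vertex ids, one per accept state.

```python
# A minimal sketch of Algorithm 3's intersection step: the dominant paths
# of all accept states are reduced to the longest prefix they share.
def common_prefix(p, q):
    # Longest common prefix of two dominant paths (lists of vertex ids).
    out = []
    for a, b in zip(p, q):
        if a != b:
            break
        out.append(a)
    return out

def dominant_path_analysis(dominant_paths):
    # Returns the shared prefix of all dominant paths, or None when the
    # accept states share no mandatory path (the null_string case).
    d_path = []
    for p in dominant_paths:
        d_path = list(p) if not d_path else common_prefix(d_path, p)
        if not d_path:
            return None
    return d_path
```

Any string extracted from such a prefix is matched on every path to an accept state, which is what makes it a safe trigger for the decomposed regex.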

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
 1: function DOMINANTREGIONANALYSIS(G)
 2:   acyclic_g = build_acyclic(G)
 3:   Gt = build_topology_order(acyclic_g)
 4:   candidate = {q0}
 5:   it = begin(Gt)
 6:   while it ≠ end(Gt) do
 7:     if isValidCut(candidate) then
 8:       setRegion(candidate)
 9:       initializeCandidate(candidate)
10:     else
11:       addToCandidate(it)
12:       it = it + 1
13:     end if
14:   end while
15:   setRegion(candidate)
16:   Merge regions connected with back edges
17:   strings = expand_and_extract(regions)
18:   return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
 1: function NETWORKFLOWANALYSIS(G)
 2:   for edge ∈ E do
 3:     strings = find_strings(edge)
 4:     scoreEdge(edge, strings)
 5:   end for
 6:   cuts = MinCut(G)
 7:   strings = extract and expand strings from cuts
 8:   return strings
 9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of a string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
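The scoring idea can be sketched as follows; `edge_score` is a hypothetical helper of ours, not Hyperscan's code, using the reciprocal of the longest string ending at an edge as that edge's capacity.

```python
# Hypothetical edge scoring for the min-cut: capacity is inversely
# proportional to the length of the longest string ending at the edge,
# so the min-cut prefers edges carrying long, highly selective strings.
def edge_score(strings):
    if not strings:
        return float("inf")  # never cut an edge with no usable string
    return 1.0 / max(len(s) for s in strings)
```

With these capacities, a standard max-flow min-cut (e.g., Edmonds-Karp [25]) naturally selects a cut through edges with the longest extractable strings.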



Algorithm 1 FDR Multi-string Matcher
 1: function MATCH
 2:   n = number of bits of a super character
 3:   R = startMask
 4:   for each 8-byte V ∈ input do
 5:     for i ∈ 0..7 do
 6:       index = V[i*8 .. i*8+n-1]
 7:       M[i] = shiftOrMask[index]
 8:       S[i] = LSHIFT(M[i], i)
 9:     end for
10:     for i ∈ 0..7 do
11:       R = OR128(R, S[i])
12:     end for
13:     for zero bit b in low 8 bytes of R do
14:       j = the byte position containing bit b
15:       l = length of string in bucket b
16:       h = HASH(V[j-l+1 .. j])
17:       if h is valid then
18:         Perform exact matching for each
19:         string in hash bucket[h]
20:       end if
21:     end for
22:     R = RSHIFT(R, 8)
23:   end for
24: end function

an exact match. This phase consists of hashing and exact string comparison. To minimize hash collisions, we build a separate hash table for each bucket. Then we leverage the byte position of a match and compare the input with each string in the hash bucket to confirm a match. In practice, we find hashing filters out a large portion of false positives.
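A simplified sketch of this verification phase (ours, not the actual FDR implementation): each bucket keeps its own hash table, and a tentative match ending at byte offset `end` is hashed and then confirmed by exact comparison.

```python
# Per-bucket verification sketch: hash the candidate suffix, then confirm
# with an exact comparison against the strings stored in that slot.
def build_bucket_table(strings):
    table = {}
    for s in strings:
        table.setdefault(hash(s), []).append(s)
    return table

def confirm_match(data, end, table, lengths):
    # Try each candidate string length in the bucket, longest first.
    for l in sorted(lengths, reverse=True):
        if l > end + 1:
            continue
        cand = data[end - l + 1 : end + 1]
        for s in table.get(hash(cand), ()):
            if s == cand:  # exact comparison filters hash collisions
                return s
    return None
```

The hash lookup rejects most false positives cheaply; the exact comparison only runs on the rare collision.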

4.2 Finite Automata Pattern Matching

Successful string matching often triggers FA component matching, which is essentially the same as general regex matching. Our strategy is to use a DFA whenever possible, but if the number of DFA states exceeds a threshold (footnote 7), we fall back to NFA-based matching. As the state-of-the-art DFA already delivers high performance, we introduce fast NFA-based matching with SIMD operations here.

In NFA-based matching, there can be multiple current states (the current set) that are active at any time. A state transition with an input character is performed on every active state in the current set in parallel, which produces a set of successor states (the successor set). A match is successful if any state reaches one of the accept states.

We develop a bit-based NFA where each bit represents a state. We choose the bit-based representation as it outperforms traditional NFA representations that use byte arrays to store transitions and look up a transition table for each current state in a serialized manner. Also, the bit-based NFA leverages SIMD instructions to perform vectorized bit operations to further accelerate the performance. Our scheme assigns an id to each state (i.e., each vertex in an n-node NFA graph) from 0 to n-1 in topological order and maintains the current set as a bit mask (called the current-set mask) that sets a bit to 1 if its position matches a current state id. We define the span of a state transition as the id difference between the two states of the transition. Since state ids are sequentially assigned in topological order, the span of a state transition is typically small. We exploit this fact to compactly represent the transitions below.

7 We use 16,384 states as the threshold.

The bit-based NFA implements a state transition with an input character 'c' in three steps. First, it calculates the successor set that can be transitioned to from any state in the current set with any input character. Second, it computes the set of all states that can be transitioned to from any state with 'c' (called the reachable states by 'c'). Third, it computes the intersection of the two sets. This produces a correct successor set, as a Glushkov NFA graph guarantees that one can enter a specific state only with the same character or the same character-set.

The challenge is to efficiently calculate the successor set. One can pre-calculate a successor set for every combination of current states and look up the successor set for the current-set mask. While this is fast, it requires storing 2^n successor sets, which becomes intractable except for a small n. An alternative is to keep a successor-set mask per individual state and to combine the successor set of every state in the current set. This is more space-efficient, but it costs up to n memory lookups and (n-1) OR operations. We implement the latter but optimize it by minimizing the number of successor-set masks, which saves memory lookups. To achieve this, we keep a set of shift-k masks shared by all relevant states. A shift-k mask records all states with a forward transition of span k, where a forward transition moves from a smaller state id to a larger one and a backward transition does the opposite. Figure 10 shows some examples of shift-k masks: the shift-1 mask sets every bit to 1 except for bit 7, since every state except state 7 has a forward transition of span 1.

We divide each state transition into two types: typical or exceptional. A typical transition is one whose span is smaller than or equal to a pre-defined shift limit. Given a shift limit, we build shift-k masks for every k (0 ≤ k ≤ limit) at initialization. These masks allow us to efficiently compute the successor set from the current set following typical transitions. If the current-set mask is S, then ((S & shift-k mask) << k) represents all possible successor states reached by transitions of span k from S. If we combine the successor sets for all k, we obtain the successor set reached by all typical transitions.
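The formula can be demonstrated with arbitrary-precision integers standing in for the SIMD registers (a sketch, not Hyperscan's code):

```python
# Bit-parallel computation of typical successors: for each span k,
# ((S & shift_k_mask) << k) yields the states reached by span-k forward
# transitions, and OR-ing over all k combines them.
def typical_successors(S, shift_masks):
    # shift_masks[k] marks the states that have a forward transition of
    # span k, for k = 0 .. shift limit.
    succ = 0
    for k, mask in enumerate(shift_masks):
        succ |= (S & mask) << k
    return succ
```

For example, for a simple chain NFA 0 -> 1 -> 2 -> 3 where states 0-2 each have a span-1 forward edge, the shift-1 mask is 0b0111 and a current set {0} advances to {1} in one masked shift.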


Figure 10: NFA representation for (AB|CD)*AFF*. The figure shows the shift-0, shift-1, and shift-2 masks for typical transitions, and the exception mask together with the per-state successor masks for states 0, 2, and 4, which have exceptional transitions.

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and any backward transitions (footnote 8). Any state that has at least one exceptional transition keeps its own successor mask. The successor mask records all states reached by the exceptional transitions of its owner state. All exceptional states are maintained in an exception mask.

The choice of the shift limit affects performance. If it is too large, we have too many shift-k masks representing rare transitions; if it is too small, we have to handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of ids. States 0, 2, and 4 are highlighted as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.

8 In our implementation, forward transitions that cross the 64-bit boundary of the state id space (e.g., from an id smaller than 64 to an id larger than 64) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2 Bit-based NFA Matching
 1: SH_MSKS[i]: shift-i masks for typical transitions
 2: SUCC_MSKS[i]: successor mask for state i
 3: EX_MSK: exception mask
 4: reach[k]: reachable state set for character k
 5: function RUNNFA(S: current active state set)
 6:   SUCC_TYP = 0, SUCC_EX = 0
 7:   for c ∈ input do
 8:     if any state is active in S then
 9:       for i = 0 to shift_limit do
10:         R0 = AND(S, SH_MSKS[i])
11:         R1 = LSHIFT(R0, i)
12:         SUCC_TYP = OR(SUCC_TYP, R1)
13:       end for
14:       S_EX = AND(S, EX_MSK)
15:       for active state s in S_EX do
16:         SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
17:       end for
18:       SUCC = OR(SUCC_TYP, SUCC_EX)
19:       S = AND(SUCC, reach[c])
20:       Report accept states in S
21:     end if
22:   end for
23: end function

Total Size              1 GBytes
Number of Packets       818,682
Number of TCP Packets   818,520
Percent of TCP Bytes    97.2%
Percent of HTTP Bytes   92.9%
Average Packet Size     1,265 Bytes

Table 2: HTTP traffic trace from a cloud service provider

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). Then it fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise AND operation with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state. Otherwise, it proceeds with the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exception states. Our implementation uses a 128-bit mask (and extends it up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough for representing the NFA graph of most regexes.
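Algorithm 2's per-character step can be modeled in a few lines of Python, with an arbitrary-precision int standing in for the 128-bit SIMD register; all names here are ours, not Hyperscan's.

```python
# A minimal model of bit-based NFA matching: combine typical (shift-k)
# and exceptional successors, then intersect with the states reachable
# on the current character. Returns the offsets where an accept state
# became active.
def run_nfa(input_str, start, shift_masks, ex_mask, succ_masks, reach, accept):
    S = start
    matches = []
    for pos, c in enumerate(input_str):
        if not S:
            break
        succ = 0
        for k, mask in enumerate(shift_masks):  # typical transitions
            succ |= (S & mask) << k
        ex = S & ex_mask                        # exceptional transitions
        state = 0
        while ex:
            if ex & 1:
                succ |= succ_masks[state]
            ex >>= 1
            state += 1
        S = succ & reach.get(c, 0)              # intersect with reach[c]
        if S & accept:
            matches.append(pos)
    return matches
```

As a usage example, the Glushkov NFA for the (anchored) pattern "ab" has states 0 (start), 1 ('a'), and 2 ('b'), both edges of span 1, no exceptional transitions, and accept bit 2.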

5 Implementation

Hyperscan consists of compile-time and run-time functions. Compile-time functions include regex-to-NFA graph conversion, graph decomposition, and matching component generation. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:

• Small string-set (<80) matching. This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction, applied over a small number (2-8) of 4-bit regions in the suffix of each string.
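For reference, the scalar textbook form of shift-or matching, which this subsystem vectorizes with PSHUFB over 4-bit nibbles, looks like the sketch below (ours, not Hyperscan's implementation).

```python
# Classic bitwise shift-or matching: B[c] has a 0-bit at every position
# where the pattern contains c; the state D accumulates prefix matches,
# and bit m-1 going to 0 signals a full match.
def shift_or(text, pattern):
    m = len(pattern)
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, ~0) & ~(1 << i)
    D = ~0                                 # all-ones: no prefix matched yet
    hits = []
    for j, c in enumerate(text):
        D = (D << 1) | B.get(c, ~0)
        if not D & (1 << (m - 1)):         # bit m-1 clear => full match
            hits.append(j - m + 1)         # start offset of the match
    return hits
```

The appeal of the scheme is that one shift and one OR advance the match state for every pattern position at once, which is exactly what makes it amenable to SIMD.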

• NFA and DFA cyclic state acceleration. Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration for searching for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states or switches off one of the current states.

# of regexes   Prefilter     Hyperscan   Reduction
500            2,971,652     645,326     4.6x
1000           93,595,304    714,582     131.0x
1500           110,122,972   791,017     139.2x
2000           139,804,519   780,665     179.1x
2500           156,332,187   857,100     182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

• Small-size DFA matching. We design a SIMD-based algorithm for small-size DFAs (<16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.
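The idea can be illustrated by emulating the shuffle with a plain table lookup (a sketch with a hypothetical 3-state DFA, not Hyperscan's code): with fewer than 16 states, the whole transition row for one input byte fits in a 16-byte vector, so PSHUFB can compute state = table[state] in a single instruction.

```python
# Small-DFA sketch: each input byte selects a 16-entry transition row;
# the PSHUFB shuffle (emulated here by list indexing) maps the current
# state to its successor in one step.
def run_small_dfa(text, tables, default_row, start, accepts):
    s = start
    for c in text:
        s = tables.get(c, default_row)[s]  # one 'shuffle' per input byte
    return s in accepts

# Hypothetical 3-state DFA recognizing inputs that contain "ab":
# state 0 = no progress, 1 = saw 'a', 2 = matched (absorbing).
tables = {"a": [1, 1, 2], "b": [0, 2, 2]}
default = [0, 0, 2]
```

Because the per-byte work is a single register-resident shuffle with no memory lookups, this beats a conventional table-walking DFA on small automata.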

• Anchored pattern matching. When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the automata corresponding to them are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching. We design a fast lookaround approach that peeks at inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s{4,5}foo, where R is a complex regex, we can first detect with SIMD checks whether the digit and space character classes have matched, and if not, avoid or defer running a potentially expensive FA associated with R.

6 Evaluation

In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than those chosen manually? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup

We use a server machine with an Intel Xeon Platinum 8180 CPU @ 2.50GHz and 48 GB of memory, and compile the code with GCC 5.4. To separate out the impact of network I/O, we evaluate the performance with a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, as shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

# of regexes   Prefilter    Hyperscan   Reduction
700            699,622      18,164      38.5x
850            7,516,464    19,214      391.2x
1000           17,063,344   32,533      524.5x
1150           17,737,814   34,075      520.6x
1300           25,143,574   36,040      697.7x

Table 4: Regex invocations for Snort's Talos ruleset

6.2 Effectiveness of Regex Decomposition

The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting 'good' strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace and confirm correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when it is necessary.

6.3 Microbenchmarks

We evaluate the performance of FDR, our multi-string matcher, as well as that of the regex matching of Hyperscan.

Multi-string pattern matching. We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns extracted from Snort ET-Open and Talos rulesets. Figures 11 and 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate it with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, the performance becomes comparable to DFC due to increased cache usage, but it is still much better than AC.

Ruleset   PCRE     PCRE2   RE2-s   Hyperscan-s   RE2-m   Hyperscan-m
Talos     694.2    39.4    177.7   17.3          2.9     0.215
ET-Open   1280.0   91.3    469.6   51.6          11.16   1.33

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1300 regexes) and Suricata (2800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

Figure 11: String matching performance with random packets ((a) ET-Open ruleset, (b) Talos ruleset; throughput in Gbps for FDR, DFC, and AC, with FDR's improvement over DFC, for 1k-26k string patterns).

Figure 12: String matching performance with a real traffic trace ((a) ET-Open ruleset, (b) Talos ruleset; same metrics as Figure 11).

Regex matching. We now evaluate the performance of the regex matching of Hyperscan. We compare the performance with PCRE (v8.41) [6], as it is most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (i.e., matching against one regex at a time), which requires passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1300 regexes from the Snort Talos ruleset and 2800 regexes from the Suricata ruleset.

Table 5 shows the results. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 13.5x faster than RE2-m, while it shows a 183.3x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both with a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort but replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly-efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect an even higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons

Hyperscan has been developed since 2008 and was first open-sourced in 2013. Hyperscan has been successfully adopted by over 40 commercial projects globally, and it is in production use by tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed in C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience with developing Hyperscan and lay out its future direction.

7.1 Evolution of Hyperscan

Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but also made it simple to employ in applications.

Version 1.0. The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). Also, it suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called 'hlm', akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosions if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0. The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data and with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings were found. This approach avoids the high risk of the potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.

Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naively running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns foo, abc[xyz], and abc[xy]*def|abcz*de[fg] might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern abc[xyz], then provide matches for the two parts of the alternation, abc[xy]*def and abcz*de[fg], without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0. (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel with the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0. Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing cost barriers and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identify the strings separated by various common regex constructs such as .* or X+ for some character class X. It is frequently the case that strings in regexes are separated by the construct .* (e.g., foo.*bars). This is usually implementable by requiring only that foo is seen before bars (usually, but not always: consider the expression foo.*oars). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

support this the system now allows user-defined ANDOR NOT along their patterns For example an AND op-eration between patterns foobar and teakettle requiredthat both patterns are matched for input before reportinga match Version 50 added Chimera a hybrid matcher ofHyperscan and PCRE brings the benefit of both worlds ndashsupport for full PCRE syntax while enjoying the high per-formance of Hyperscan Lack of support of Hyperscan forfull PCRE syntax (such as capturing and back-references)made it difficult to completely replace PCRE in adoptedsolutions Chimera employs Hyperscan as a fast filter forinput and triggers the PCRE library functions to confirma real match only when Hyperscan reports a match on apattern that has features not supported by Hyperscan

7.2 Lessons Learned
We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on delivering some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even within the subset of 'true' regexes), and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimum Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with the additional ability to support large patterns due to decomposition) outweighed the complexity cost. It would have been almost impossible for a small start-up to transition from one system to the other in the span of a single release, or to simply not meaningfully update the project for an extended time while working on a substantial update.

644 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series were, over time, not carried forward to the 4.0 series; we found that such features were frequently used by only a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated, suppressed potential optimizations, and interacted poorly with other parts of the system.

7.3 Future Directions
Hyperscan is performance-oriented; future development will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended with enriched semantics to support capturing, which would allow portions of the regex to 'capture' the parts of the input that matched them.

8 Conclusion
In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent execution of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher, adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback of the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.

[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333-340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A New Approach to Text Searching. Communications of the ACM (CACM), 35(10):74-82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient Regular Expression Search Using State Merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An Improved Algorithm to Accelerate Regular Expression Evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A Scalable Architecture for High-Throughput Regular-Expression Pattern Matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating String Pattern Matching for Network Applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors and Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A String Matching Algorithm Fast on the Average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. Journal of the ACM, 19(2):248-264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An Improved DFA for Fast Regular Expression Matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The Abstract Theory of Automata. Russian Mathematical Surveys, 16(5):1-53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: A Highly-scalable Software-based Intrusion Detection System. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317-328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient Randomized Pattern-Matching Algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient Signature Matching with Multiple Alphabet Compression Tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced Algorithms for Fast and Scalable Deep Packet Inspection. In Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS Using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for Accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling Efficient Distributed Malware Detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated Regular Expression Matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the Big Bang: Fast and Scalable Deep Packet Inspection with Extended Finite Automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact Architecture for High-Throughput Regular Expression Matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting State Blow-up: Automatically Building Augmented-FA While Preserving Functional Equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3: Dominant Path Analysis

Require: Graph G = (V, E)
 1: function DOMINANTPATHANALYSIS(G)
 2:     dpath = ∅
 3:     for v ∈ accepts do
 4:         calculate dominant path p[v] for v
 5:         if dpath = ∅ then
 6:             dpath = p[v]
 7:         else
 8:             dpath = common_prefix(dpath, p[v])
 9:             if dpath = ∅ then
10:                 return null_string
11:             end if
12:         end if
13:         strings = expand_and_extract(dpath)
14:     end for
15:     return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and then finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
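A toy sketch of the common-prefix step of Algorithm 3 (names illustrative, not the Hyperscan internals): each accept state contributes its dominant path, here already flattened to a string of single-character edge labels, and the extracted string is the longest prefix shared by all dominant paths. An empty prefix corresponds to the null_string case, where no useful string can be extracted.

```python
import os

def common_prefix_of_dominant_paths(paths):
    """Longest character-wise common prefix of all dominant paths."""
    if not paths:
        return ""
    # os.path.commonprefix compares strings character by character,
    # which matches the per-edge comparison of the dominant paths.
    return os.path.commonprefix(paths)

assert common_prefix_of_dominant_paths(["foobar", "foobaz"]) == "fooba"
assert common_prefix_of_dominant_paths(["abc", "xyz"]) == ""  # null_string case
```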

Algorithm 4: Dominant Region Analysis

Require: Graph G = (V, E)
 1: function DOMINANTREGIONANALYSIS(G)
 2:     acyclic_g = build_acyclic(G)
 3:     Gt = build_topology_order(acyclic_g)
 4:     candidate = {q0}
 5:     it = begin(Gt)
 6:     while it ≠ end(Gt) do
 7:         if isValidCut(candidate) then
 8:             setRegion(candidate)
 9:             initializeCandidate(candidate)
10:         else
11:             addToCandidate(it)
12:             it = it + 1
13:         end if
14:     end while
15:     setRegion(candidate)
16:     Merge regions connected with back edges
17:     strings = expand_and_extract(regions)
18:     return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. It then adds each vertex of the graph into a candidate vertex set and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5: Network Flow Analysis

Require: Graph G = (V, E)
1: function NETWORKFLOWANALYSIS(G)
2:     for edge ∈ E do
3:         strings = find_strings(edge)
4:         scoreEdge(edge, strings)
5:     end for
6:     cuts = MinCut(G)
7:     strings = extract and expand strings from cuts
8:     return strings
9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of the string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
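The scoring rule can be sketched as follows (the scale constant and function name are illustrative): because the score of an edge is inversely proportional to the length of its string, a min-cut over these scores as capacities prefers to cut through edges that carry long strings.

```python
def edge_score(string_len, scale=1024):
    """Capacity for the min-cut: inversely proportional to string length."""
    # Longer string -> smaller capacity -> more likely to be in the min cut,
    # so the extracted strings tend to be the longest available ones.
    return scale // max(string_len, 1)

# A 16-character string is far cheaper to cut than a 1- or 2-character one:
assert edge_score(16) < edge_score(2) < edge_score(1)
```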



[Figure 10 omitted: it depicts the bit masks for the NFA of (AB|CD)*AFF*, namely the shift-0, shift-1, and shift-2 masks for typical transitions, the exception mask, and the successor masks for the exceptional states 0, 2, and 4.]

Figure 10: NFA representation for (AB|CD)*AFF*

We call all other transitions exceptional. These include forward transitions whose span exceeds the limit and any backward transitions.8 Any state that has at least one exceptional transition keeps its own successor mask. The successor mask records all states reached by the exceptional transitions of its owner state. All exceptional states are maintained in an exception mask.

The choice of the shift limit affects performance. If it is too large, we would have too many shift-k masks representing rare transitions; if it is too small, we would have to handle many exceptional states. Our current implementation uses 7, after performance tuning with real-world regex patterns.

Figure 10 shows an NFA graph for (AB|CD)*AFF*. We set the shift limit to 2 and mark exceptional edges with the difference of their state ids. States 0, 2, and 4 are highlighted, as they have exceptional out-edge transitions. The exception mask holds all exceptional states, and each such state points to its own successor mask. For example, the successor mask for state 2 sets bits 1 and 5, as its exceptional transitions point to states 1 and 5.

Algorithm 2 shows our bit-based NFA matching. It combines the successor masks possibly reached by typical transitions (SUCC_TYP) and exceptional transitions (SUCC_EX). Then, it fetches the reachable state set for the current input character c (reach[c]) and performs a bitwise AND operation with the combined successor mask (SUCC). The result is the final successor set, and we report a match if the successor set includes any accept state. Otherwise, it proceeds to the next input character. For each character, it runs in O(l + e), where l is the shift limit and e is the number of exception states.

8 In our implementation, forward transitions that cross the 64-bit boundary of the state id space (e.g., from an id smaller than 64 to an id larger than 64) are also treated as exceptional. This is related to a specific SIMD instruction that we use, so we omit the details here.

Algorithm 2: Bit-based NFA Matching

 1: SH_MSKS[i]: shift-i masks for typical transitions
 2: SUCC_MSKS[i]: successor mask for state i
 3: EX_MSK: exception mask
 4: reach[k]: reachable state set for character k
 5: function RUNNFA(S: current active state set)
 6:     SUCC_TYP = 0, SUCC_EX = 0
 7:     for c ∈ input do
 8:         if any state is active in S then
 9:             for i = 0 to shift_limit do
10:                 R0 = AND(S, SH_MSKS[i])
11:                 R1 = LSHIFT(R0, i)
12:                 SUCC_TYP = OR(SUCC_TYP, R1)
13:             end for
14:             S_EX = AND(S, EX_MSK)
15:             for active state s in S_EX do
16:                 SUCC_EX = OR(SUCC_EX, SUCC_MSKS[s])
17:             end for
18:             SUCC = OR(SUCC_TYP, SUCC_EX)
19:             S = AND(SUCC, reach[c])
20:             Report accept states in S
21:         end if
22:     end for
23: end function
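As a concrete illustration (a sketch, not the production code), the following Python program runs Algorithm 2 on the example NFA of Figure 10 for (AB|CD)*AFF* with shift limit 2. Plain integers stand in for the SIMD mask registers; the mask values follow the figure (state 0 is the initial state; states 6 and 7 accept).

```python
SHIFT_LIMIT = 2
# Shift-i masks: states whose typical out-transition moves forward by i.
SH_MSKS = [
    0b10000000,  # shift-0: state 7 self-loop (the trailing F*)
    0b01111111,  # shift-1: states 0-6
    0b00000000,  # shift-2: none in this NFA
]
EX_MSK = 0b00010101            # exceptional states: 0, 2, 4
SUCC_MSKS = {0: 0b00101000,    # state 0 -> states 3, 5
             2: 0b00100010,    # state 2 -> states 1, 5
             4: 0b00001010}    # state 4 -> states 1, 3
REACH = {'A': 0b00100010, 'B': 0b00000100, 'C': 0b00001000,
         'D': 0b00010000, 'F': 0b11000000}
ACCEPT = 0b11000000            # accept states: 6, 7

def run_nfa(data, s=0b00000001):
    """Return the offsets at which an accept state becomes active."""
    matches = []
    for pos, c in enumerate(data):
        if not s:
            break
        succ = 0
        for i in range(SHIFT_LIMIT + 1):     # typical transitions
            succ |= (s & SH_MSKS[i]) << i
        ex = s & EX_MSK                      # exceptional transitions
        while ex:
            lowest = ex & -ex                # isolate one active state bit
            succ |= SUCC_MSKS[lowest.bit_length() - 1]
            ex ^= lowest
        s = succ & REACH.get(c, 0)           # filter by current character
        if s & ACCEPT:
            matches.append(pos)
    return matches

assert run_nfa("ABCDAFF") == [5, 6]
assert run_nfa("ABCD") == []
```

Each input character costs one pass over the shift masks plus one successor-mask lookup per active exceptional state, matching the O(l + e) bound above.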

Total Size: 1 GByte
Number of Packets: 818,682
Number of TCP Packets: 818,520
Percent of TCP Bytes: 97.2%
Percent of HTTP Bytes: 92.9%
Average Packet Size: 1,265 Bytes

Table 2: HTTP traffic trace from a cloud service provider

Our implementation uses a 128-bit mask (extended up to a 512-bit mask with four variables if needed) and employs 128-bit SIMD instructions for fast bitwise operations. In practice, we find that 512 states are enough to represent the NFA graphs of most regexes.

5 Implementation
Hyperscan consists of compile-time and run-time functions. The compile-time functions include regex-to-NFA graph conversion, graph decomposition, and the generation of matching components. The run time executes regex matching on the input stream. While we cover the core functions in Sections 3 and 4, Hyperscan has a number of other subsystems and optimizations:

• Small string-set (< 80) matching: This subsystem implements a shift-or algorithm using the SSSE3 PSHUFB instruction, applied over a small number (2-8) of 4-bit regions in the suffix of each string.

• NFA and DFA cyclic state acceleration: Where a state (in the case of the DFA) or a set of states (in the case of the NFA) can be shown to recur until some input is seen, we consider these cyclic states. In cases where the current states in the NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We have SIMD acceleration for searching for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states, or that switches off one of the current states.

# of regexes | Prefilter   | Hyperscan | Reduction
500          | 2,971,652   | 645,326   | 4.6x
1000         | 93,595,304  | 714,582   | 131.0x
1500         | 110,122,972 | 791,017   | 139.2x
2000         | 139,804,519 | 780,665   | 179.1x
2500         | 156,332,187 | 857,100   | 182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

• Small-size DFA matching: We design a SIMD-based algorithm for small DFAs (< 16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

• Anchored pattern matching: When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the corresponding automata are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching: We design a fast lookaround approach that peeks at inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all, or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s{4,5}foo, where R is a complex regex, we can first detect with SIMD checks whether the digit and space character classes have matched, and if not, avoid or defer running the potentially expensive FA associated with R.
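The shift-or (bitap) technique behind the small string-set subsystem above can be illustrated in scalar Python. This toy version handles a single pattern and omits the 4-bit-region decomposition and PSHUFB vectorization that the bullet describes; it only shows the bit-parallel state update at the core of the approach.

```python
def shift_or_search(text, pattern):
    """Classic shift-or: bit i of the state is 0 iff pattern[:i+1]
    matches the text ending at the current position."""
    m = len(pattern)
    # Per-character masks: bit i is 0 where pattern[i] == c, else 1.
    masks = {}
    for c in set(text) | set(pattern):
        v = ~0
        for i, p in enumerate(pattern):
            if p == c:
                v &= ~(1 << i)
        masks[c] = v
    state = ~0
    hits = []
    for pos, c in enumerate(text):
        # One shift and one OR per input character, independent of m.
        state = (state << 1) | masks.get(c, ~0)
        if state & (1 << (m - 1)) == 0:      # full pattern matched
            hits.append(pos - m + 1)         # start offset of the match
    return hits

assert shift_or_search("xxfooyfoo", "foo") == [2, 6]
```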

6 Evaluation
In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than manual choice? (2) How well do multi-string matching and regex matching perform in comparison with the existing state of the art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup
We use a server machine with an Intel Xeon Platinum 8180 CPU at 2.50 GHz and 48 GB of memory, and compile the code with GCC 5.4. To separate the impact of network I/O, we evaluate the performance with a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, as shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

# of regexes | Prefilter  | Hyperscan | Reduction
700          | 699,622    | 18,164    | 38.5x
850          | 7,516,464  | 19,214    | 391.2x
1000         | 17,063,344 | 32,533    | 524.5x
1150         | 17,737,814 | 34,075    | 520.6x
1300         | 25,143,574 | 36,040    | 697.7x

Table 4: Regex invocations for Snort's Talos ruleset

6.2 Effectiveness of Regex Decomposition
The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. We then measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace, and confirm the correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when necessary.

6.3 Microbenchmarks
We evaluate the performance of FDR, our multi-string matcher, as well as that of the regex matching of Hyperscan.

Multi-string pattern matching: We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns extracted from the Snort ET-Open and Talos rulesets.

Ruleset  | PCRE  | PCRE2 | RE2-s | Hyperscan-s | RE2-m | Hyperscan-m
Talos    | 6942  | 394   | 1777  | 173         | 29    | 2.15
ET-Open  | 12800 | 913   | 4696  | 516         | 1116  | 133

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1300 regexes) and Suricata ET-Open (2800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure 11 omitted: throughput (Gbps) of FDR, DFC, and AC over 1k/5k/10k/26k string patterns for (a) the ET-Open ruleset and 1k/5k/10k/20k patterns for (b) the Talos ruleset; FDR's improvement over DFC is 3.2x/1.3x/1.2x/1.1x for ET-Open and 2.3x/1.5x/1.4x/1.2x for Talos.]

Figure 11: String matching performance with random packets

[Figure 12 omitted: throughput (Gbps) of FDR, DFC, and AC over 1k/5k/10k/26k string patterns for (a) the ET-Open ruleset and 1k/5k/10k/20k patterns for (b) the Talos ruleset; FDR's improvement over DFC is 2.5x/2.1x/1.7x/1.5x for ET-Open and 2.2x/1.6x/1.4x/1.3x for Talos.]

Figure 12: String matching performance with a real traffic trace

Figures 11 and 12 show that FDR outperforms the state-of-the-art matcher, DFC, by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluated with the Suricata ruleset, but we omit the results here since they exhibit a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, its performance becomes comparable to DFC due to increased cache usage, but it is still much better than AC.

Regex matching: We now evaluate the performance of the regex matching of Hyperscan. We compare against PCRE (v8.41) [6], as it is the most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (i.e., matching against one regex at a time), which requires passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1300 regexes from the Snort Talos ruleset and 2800 regexes from the Suricata ruleset.

Table 5 shows the results. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 13.5x faster than RE2-m, while it shows a 183.3x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2, and Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 6.9x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs, and benefits from fast string matching as well as from the smaller DFAs of the FA components, which explains the performance boost.

6.4 Real-world DPI Application
We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both with a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort, but it replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect a much higher performance improvement if we restructure Snort to use multi-regex matching in parallel.


7 Evolution, Experience, and Lessons
Hyperscan has been developed since 2008 and was first open-sourced in 2013. It has been successfully adopted by over 40 commercial projects globally, and it is in production use by tens of thousands of cloud servers in data centers. In addition, Hyperscan has been integrated into 37 open-source projects, and it supports various operating systems such as Linux, FreeBSD, and Windows. The Hyperscan APIs were initially developed in C, but there are public projects that provide bindings for other programming languages such as Java, Python, Go, Node.js, Ruby, Lua, and Rust. In this section, we briefly share our experience developing Hyperscan and lay out its future directions.

7.1 Evolution of Hyperscan
Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but also made it simple to employ in applications.

Version 1.0. The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (i.e., pattern matching over streamed data). It also suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosions if the DFA construction algorithm were used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.

Version 2.0. The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA Graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data, and with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned for patterns that used one or more strings detected by a string-matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings were found. This approach avoids the high risk of potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.
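The bit-parallel idea behind the bit-based Glushkov NFA (one bit per NFA state, one precomputed mask per input byte) can be sketched with the classic Shift-And formulation. This is an illustrative reduction to a single literal pattern, with a Python integer standing in for the SIMD register, not Hyperscan's actual engine:

```python
# Minimal sketch of bit-parallel NFA matching (Shift-And): bit i of the
# state word is set iff an occurrence of pattern[0..i] ends at the
# current position. Hyperscan holds such state bits in SIMD registers;
# here a Python integer plays the role of the register.

def build_masks(pattern: bytes):
    """Per-byte bitmasks: bit i is set in masks[c] iff pattern[i] == c."""
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    return masks

def shift_and_search(pattern: bytes, data: bytes):
    """Yield the end offset of every occurrence of pattern in data."""
    masks = build_masks(pattern)
    accept = 1 << (len(pattern) - 1)   # bit of the final NFA state
    state = 0
    for pos, c in enumerate(data):
        # Advance every live state by one symbol, re-inject the start
        # state, then keep only states whose expected byte matches c.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            yield pos + 1              # offset just past the match

matches = list(shift_and_search(b"abc", b"xxabcabc"))
```

Every input byte costs one shift, one OR, and one AND regardless of how many NFA states are live, which is why this mapping to wide registers is attractive.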

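The streaming model described above can be illustrated with a toy matcher whose entire per-stream state is a single fixed-size integer carried across block boundaries, so a pattern spanning two packets is still found without buffering old bytes. This is a sketch of the idea only, not the real Hyperscan streaming API:

```python
# Toy model of 'streaming' matching: between scan_block() calls the only
# retained data is a fixed-size automaton state, never the old bytes.
# (Illustrative sketch; not the real hs_scan_stream() interface.)

class StreamMatcher:
    def __init__(self, pattern: bytes):
        self.masks = {}
        for i, c in enumerate(pattern):
            self.masks[c] = self.masks.get(c, 0) | (1 << i)
        self.accept = 1 << (len(pattern) - 1)
        self.state = 0      # the entire per-stream state: one integer
        self.offset = 0     # absolute offset across all blocks

    def scan_block(self, block: bytes):
        """Scan one block; return absolute end offsets of matches."""
        ends = []
        for c in block:
            self.state = ((self.state << 1) | 1) & self.masks.get(c, 0)
            self.offset += 1
            if self.state & self.accept:
                ends.append(self.offset)
        return ends

m = StreamMatcher(b"attack")
first = m.scan_block(b"...att")   # pattern starts in packet 1 ...
second = m.scan_block(b"ack...")  # ... and ends in packet 2
```

Because the stream state has a fixed, compile-time-known size, a middlebox can keep one such state per flow with predictable memory cost.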
Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines, even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often could not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns /foo/, /abc[xyz]/, and /abc[xy].*def|abcz.*de[fg]/ might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern /abc[xyz]/, then provide matches for the two parts

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 643

of the alternation, /abc[xy].*def/ and /abcz.*de[fg]/, without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0. (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate match avoidance, and pattern decomposition. This subsystem was maintained in parallel with the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0. Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identified strings separated by various common regex constructs, such as .* or X+ for some character class X. It is frequently the case that strings in regexes are separated by the .* construct (i.e., /foo.*bars/). This is usually implementable by requiring only that foo is seen before bars (usually, but not always: consider the expression /foo.*oars/). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

Version 5.0. Version 5.0, which is the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology for malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns beyond matching a single pattern. To

support this, the system now allows user-defined AND, OR, and NOT combinations along with their patterns. For example, an AND operation between patterns foobar and teakettle requires that both patterns match the input before a match is reported. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE that brings the benefit of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. Hyperscan's lack of support for full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopted solutions. Chimera employs Hyperscan as a fast filter for the input and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
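A minimal sketch of the Chimera idea, with Python's re standing in for PCRE and a hand-written simplified pattern standing in for Hyperscan's fast filter (both patterns are invented for illustration):

```python
import re

# Chimera-style hybrid matching, sketched: a back-reference pattern the
# fast engine cannot support is approximated by a prefilter pattern with
# the back-reference dropped; the full engine runs only when the
# prefilter fires. Patterns are illustrative, not from any real ruleset.

FULL = re.compile(r"user=(\w+);pass=\1")   # needs back-references
PREFILTER = re.compile(r"user=\w+;pass=")  # back-reference dropped

def hybrid_match(data: str) -> bool:
    # Cheap filter first; the expensive confirming engine runs rarely.
    if not PREFILTER.search(data):
        return False
    return FULL.search(data) is not None

hits = [hybrid_match(s) for s in
        ["user=bob;pass=bob",    # prefilter and full engine both match
         "user=bob;pass=eve",    # prefilter fires, full engine rejects
         "nothing here"]]        # prefilter rejects immediately
```

The prefilter may over-approximate (second input) but must never under-approximate, so correctness rests entirely on the confirming engine.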

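The ordering heuristic mentioned under version 4.0, requiring only that foo appear before bars for /foo.*bars/, and its /foo.*oars/ counterexample, can be demonstrated concretely. The helper below is an illustrative stand-in for Rose's literal pre-pass, not actual Hyperscan code:

```python
import re

# Why 'saw "foo" somewhere before "oars"' is not the same as matching
# /foo.*oars/: the two literals may overlap. The input "fooars" contains
# both literals in order, but "oars" starts inside "foo", so the regex
# (which needs "oars" to begin at or after the end of "foo") fails.

def naive_ordering_check(data: str) -> bool:
    """Prefilter-style check: both literals present, 'foo' first."""
    i, j = data.find("foo"), data.find("oars")
    return i != -1 and j != -1 and i < j

data = "fooars"
naive = naive_ordering_check(data)                 # passes: foo@0, oars@2
real = re.search(r"foo.*oars", data) is not None   # fails: literals overlap
```

This overlap case is exactly why the ordering shortcut is "usually but not always" valid and eventually had to be backed by general automata.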
7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes) and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimal Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with additional powers of support for large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.

644 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series were, over time, not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell, at any given stream write boundary, whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher, adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback from the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.


[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333-340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74-82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors & Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248-264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1-53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317-328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
 1: function DOMINANTPATHANALYSIS(G)
 2:     d_path = ∅
 3:     for v ∈ accepts do
 4:         calculate dominant path p[v] for v
 5:         if d_path = ∅ then
 6:             d_path = p[v]
 7:         else
 8:             d_path = common_prefix(d_path, p[v])
 9:             if d_path = ∅ then
10:                 return null_string
11:             end if
12:         end if
13:         strings = expand_and_extract(d_path)
14:     end for
15:     return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets in the path and extracts the string on the path.
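A small sketch of the dominant-path idea (not the production implementation): on a toy Glushkov-style graph, each accept's dominant path is the set of vertices shared by every start-to-accept path, in path order, and the common prefix over all accepts yields the extractable string. The graph and labels below are invented:

```python
from itertools import takewhile

# Sketch of Algorithm 3: dominant path per accept state, then the
# common prefix across accepts. Exhaustive path enumeration is fine for
# a toy DAG; a real implementation would use a dominator-tree algorithm.

def all_paths(graph, src, dst, path=None):
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, []):
        yield from all_paths(graph, nxt, dst, path)

def dominant_path(graph, start, accept):
    """Vertices every start->accept path passes through, in path order."""
    paths = list(all_paths(graph, start, accept))
    shared = set.intersection(*(set(p) for p in paths))
    return [v for v in paths[0] if v in shared]

def common_prefix(a, b):
    return [x for x, _ in takewhile(lambda p: p[0] == p[1], zip(a, b))]

# Toy graph for something like /ab(c|d)e/ with two accept vertices.
g = {"start": ["a"], "a": ["b"], "b": ["c", "d"], "c": ["e1"], "d": ["e2"]}
p1 = dominant_path(g, "start", "e1")
p2 = dominant_path(g, "start", "e2")
prefix = common_prefix(p1, p2)   # vertices 'a','b' -> extract string "ab"
```

Expanding the vertices on the common prefix (here the characters a and b) is what expand_and_extract() does in the algorithm above.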

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
 1: function DOMINANTREGIONANALYSIS(G)
 2:     acyclic_g = build_acyclic(G)
 3:     Gt = build_topology_order(acyclic_g)
 4:     candidate = {q0}
 5:     it = begin(Gt)
 6:     while it ≠ end(Gt) do
 7:         if isValidCut(candidate) then
 8:             setRegion(candidate)
 9:             initializeCandidate(candidate)
10:         else
11:             addToCandidate(it)
12:             it = it + 1
13:         end if
14:     end while
15:     setRegion(candidate)
16:     Merge regions connected with back edges
17:     strings = expand_and_extract(regions)
18:     return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then, it adds each vertex of the graph into a candidate vertex set


and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
 1: function NETWORKFLOWANALYSIS(G)
 2:     for edge ∈ E do
 3:         strings = find_strings(edge)
 4:         scoreEdge(edge, strings)
 5:     end for
 6:     cuts = MinCut(G)
 7:     strings = extract and expand strings from cuts
 8:     return strings
 9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of the string that ends at the edge; the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
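The scoring step can be sketched with a tiny Edmonds-Karp [25] max-flow: each edge's capacity is the inverse of its string's length, so the computed min cut passes through the longest strings. The graph and strings below are invented for illustration:

```python
from collections import deque, defaultdict

# Sketch of Algorithm 5's scoring idea: capacity = 1/len(string), so the
# min cut (found via Edmonds-Karp max-flow) prefers long-string edges.

def edmonds_karp_min_cut(edges, s, t):
    cap = defaultdict(float)
    adj = defaultdict(set)
    for u, v, string in edges:
        cap[(u, v)] += 1.0 / len(string)   # longer string -> lower score
        adj[u].add(v)
        adj[v].add(u)                      # residual direction

    def bfs():
        parent = {s: None}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    if v == t:
                        return parent
                    q.append(v)
        return None

    while (parent := bfs()):
        # Bottleneck capacity along the augmenting path, then push flow.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= flow
            cap[(v, u)] += flow

    # Min cut: original edges leaving the residual-reachable set.
    reach, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in reach and cap[(u, v)] > 0:
                reach.add(v)
                q.append(v)
    return {(u, v) for u, v, _ in edges if u in reach and v not in reach}

# Short strings near the source, long strings near the sink: the min
# cut should select the long-string edges.
edges = [("S", "u", "ab"), ("S", "v", "cd"),
         ("u", "T", "abcdef"), ("v", "T", "cdefgh")]
cut = edmonds_karp_min_cut(edges, "S", "T")
```

Here the cut lands on the abcdef/cdefgh edges (capacity 1/6 each) rather than the ab/cd edges (capacity 1/2 each), matching the stated goal of extracting long strings.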



# of regexes   Prefilter     Hyperscan   Reduction
500            2,971,652     645,326     4.6x
1000           93,595,304    714,582     131.0x
1500           110,122,972   791,017     139.2x
2000           139,804,519   780,665     179.1x
2500           156,332,187   857,100     182.4x

Table 3: Regex invocations of Snort's ET-Open ruleset

current states in an NFA or DFA are all cyclic states with a large reachable symbol set, there is a high probability of staying at the current state(s) for many input characters. We use SIMD acceleration for searching for the first exceptional input sequence (1-2 bytes) that leads to one or more transitions out of the current states or switches off one of the current states.

• Small-size DFA matching: We design a SIMD-based algorithm for a small-size DFA (< 16 states) that outperforms the state-of-the-art DFA by utilizing the shuffle instruction for fast state transitions.

• Anchored pattern matching: When an anchored pattern consists of comparatively short acyclic sequences (i.e., no loops), the corresponding automata are both simple and short-lived. They are thus cheap to scan and scale well. We run specialized DFA-based subsystems that attempt to discover when anchored patterns are matched or rejected.

• Suppression of futile FA matching: We design a fast lookaround approach that peeks at inputs near the triggers for an FA before running it. This often allows us to discover that the FA either does not need to run at all or will have reached a dormant state before the triggers arrive. These checks are implemented as comparatively simple SIMD checks and can be done in parallel over a range of input characters and character classes. For example, in the regex fragment R\d\s45foo, where R is a complex regex, we can first detect with SIMD checks whether the digit and space character classes have matched, and if not, avoid or defer running a potentially expensive FA associated with R.
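The shuffle-based small-DFA idea in the second bullet can be sketched by emulating the byte-shuffle lookup: with at most 16 states, one whole transition column fits in a single 128-bit register, and next_state = column[current_state] is a single PSHUFB. The toy DFA below (for the literal "abc", invented for illustration) shows the mechanism:

```python
# Sketch of shuffle-based small-DFA transitions: each input byte class
# has a 16-entry transition column (one 128-bit register on real
# hardware); pshufb() emulates the PSHUFB instruction that selects
# column[current_state] in one step.

def pshufb(table, index):
    """Emulated byte shuffle: select table[index] (both in 0..15)."""
    return table[index & 0x0F]

# States for searching the literal "abc": 0 start, 1 saw 'a',
# 2 saw 'ab', 3 matched (sticky). Unused of the 16 entries stay 0.
COLUMNS = {
    "a":     [1, 1, 1, 3] + [0] * 12,
    "b":     [0, 2, 0, 3] + [0] * 12,
    "c":     [0, 0, 3, 3] + [0] * 12,
    "other": [0, 0, 0, 3] + [0] * 12,
}

def run_dfa(data: str) -> bool:
    state = 0
    for ch in data:
        column = COLUMNS.get(ch, COLUMNS["other"])  # one register load
        state = pshufb(column, state)               # one shuffle
    return state == 3

result = [run_dfa(s) for s in ["xxabcxx", "abac", "abc"]]
```

The per-byte cost is one table select regardless of the input, which is why a sub-16-state DFA can beat a conventional table-walking implementation.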

6 Evaluation

In this section, we evaluate the performance of Hyperscan to answer the following questions: (1) Does regex decomposition extract better strings than those chosen manually? (2) How well do multi-string matching and regex matching perform in comparison with the existing state-of-the-art? (3) How much performance improvement does Hyperscan bring to a real DPI application?

6.1 Experiment Setup

We use a server machine with an Intel Xeon Platinum 8180 CPU @ 2.50GHz and 48 GB of memory, and compile the

# of regexes   Prefilter    Hyperscan   Reduction
700            699,622      18,164      38.5x
850            7,516,464    19,214      391.2x
1000           17,063,344   32,533      524.5x
1150           17,737,814   34,075      520.6x
1300           25,143,574   36,040      697.7x

Table 4: Regex invocations for Snort's Talos ruleset

code with GCC 5.4. To separate out the impact of network I/O, we evaluate the performance on a single CPU core by feeding the packets from memory. We test with packets of random content as well as a real-world Web traffic trace obtained at a cloud service provider, as shown in Table 2. For all evaluations, we use the latest version of Hyperscan (v5.0) [2].

6.2 Effectiveness of Regex Decomposition

The primary goal of regex decomposition is to minimize unnecessary FA matching by extracting good strings from a set of regexes with rigorous graph analyses. To evaluate this point, we compare the number of regex matching invocations triggered by a prefilter-based DPI and by Hyperscan. We extract the content options and their associated regexes from the Snort rulesets and count the number of regex matching invocations for a prefilter-based DPI. Then we measure the same number for Hyperscan, where Hyperscan automatically extracts the strings from the regexes rather than using the keywords from the content option. For the rulesets, we use ET-Open 2.9.0 [8] and Talos 2.9.11.1 [11] against the real traffic trace and confirm correctness: both versions produce identical output for the traffic.

Tables 3 and 4 show that Hyperscan dramatically reduces the number of regex matching invocations, by over two orders of magnitude. As the number of regex rules increases, the reduction by Hyperscan grows, affirming that regex decomposition is the key contributor to efficient pattern matching. Close examination reveals that there are many single-character strings in the content options of the Snort rulesets, which invoke redundant regex matching too frequently. In practice, other rule options in Snort may mitigate the excessive regex invocations, but frequent string matching alone poses a severe overhead. In contrast, Hyperscan completely avoids this problem by triggering regex matching only when it is necessary.

6.3 Microbenchmarks

We evaluate the performance of FDR, our multi-string matcher, as well as that of the regex matching of Hyperscan.

Multi-string pattern matching. We compare the performance of FDR with that of DFC and AC. We measure the performance over different numbers of string patterns ex-


Ruleset   PCRE     PCRE2   RE2-s   Hyperscan-s   RE2-m   Hyperscan-m
Talos     694.2    39.4    177.7   17.3          2.9     0.215
ET-Open   1280.0   91.3    469.6   51.6          11.16   1.33

Table 5: Performance comparison of PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1,300 regexes) and Suricata (2,800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure 11: String matching performance with random packets. (a) ET-Open ruleset. (b) Talos ruleset. Plots (omitted) show FDR and DFC throughput in Gbps and the improvement over DFC, with AC for comparison, across 1k-26k string patterns.]

[Figure 12: String matching performance with a real traffic trace. (a) ET-Open ruleset. (b) Talos ruleset. Plots (omitted) show FDR and DFC throughput in Gbps and the improvement over DFC, with AC for comparison, across 1k-26k string patterns.]

tracted from the Snort ET-Open and Talos rulesets. Figures 11 and 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluated with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, the performance becomes comparable to DFC due to increased cache usage, but it is still much better than AC.

Regex matching. We now evaluate the performance of the regex matching of Hyperscan. We compare the performance with PCRE (v8.41) [6], as it is most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork of PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support matching multiple regexes in parallel, while PCRE matches one regex at a time. For a fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (i.e.,

matching against one regex at a time), which requires passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1,300 regexes from the Snort Talos ruleset and 2,800 regexes from the Suricata ruleset.

Table 5 shows the result. For the Snort Talos ruleset, Hyperscan-s outperforms PCRE, RE2-s, and PCRE2 by 40.1x, 10.3x, and 2.3x, respectively. Hyperscan-m is 13.5x faster than RE2-m, while it shows a 183.3x performance improvement over PCRE2. For the Suricata ruleset, Hyperscan-s shows 24.8x, 9.1x, and 1.8x speedups over PCRE, RE2-s, and PCRE2. Hyperscan-m outperforms RE2-m and PCRE2 by 8.4x and 68.7x, respectively. We do not report DFA-based PCRE performance, as we find it much slower than the default operation with NFA-based matching [5]. RE2-s uses a DFA for regex matching (and fails on a large regex that requires an NFA), but it needs to translate the whole regex into a single DFA. In contrast, Hyperscan splits a regex into strings and FAs, and benefits from fast string matching as well as smaller DFA matching of the FA components, which explains the performance boost.

6.4 Real-world DPI Application

We now evaluate how much performance improvement Hyperscan brings to a popular IDS like Snort. We compare the performance of stock Snort (ST-Snort) and Hyperscan-ported Snort (HS-Snort), which performs pattern matching with Hyperscan, both on a single CPU core. ST-Snort employs AC and PCRE for multi-string matching and regex matching, respectively. HS-Snort keeps the basic design of Snort, but it replaces AC and PCRE with the multi-string and single-regex matchers of Hyperscan. It also replaces the Boyer-Moore algorithm in Snort with the fast single-literal matcher of Hyperscan. With the Snort Talos ruleset, ST-Snort achieves 113 Mbps on our real Web traffic. In contrast, HS-Snort produces 986 Mbps on the same traffic, a factor of 8.73 performance improvement. We find that the main contributor to the performance improvement is the highly efficient multi-string matcher of Hyperscan, as shown in Figure 11. In practice, we expect a much higher performance improvement if we restructure Snort to use multi-regex matching in parallel.

642 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

7 Evolution Experience and LessonsHyperscan has been developed since 2008 and was firstopen-sourced in 2013 Hyperscan has been successfullyadopted by over 40 commercial projects globally and it isin production use by tens of thousands of cloud servers indata centers In addition Hyperscan has been integratedinto 37 open-source projects and it supports various op-erating systems such as Linux FreeBSD and WindowsHyperscan APIs are initially developed in C but thereare public projects that provide bindings for other pro-gramming languages such as Java Python Go NodejsRuby Lua and Rust In this section we briefly shareour experience with developing Hyperscan and lay out itsfuture direction

7.1 Evolution of Hyperscan
Hyperscan was developed at a start-up company, Sensory Networks, after a move away from hardware matching, which was expensive in terms of material costs and development time. We investigated GPU-based regex matching, but it imposed unacceptable latency and system complexity. As CPU technology advanced, we settled on CPU-based regex matching, which not only became cost-effective with high performance but also made it simple to be employed by applications.
Version 1.0: The initial version was released in 2008 with the intent of providing a simple block-mode regex package that could handle large numbers of regexes. Like other popular solutions at that time, it used string-based prefiltering to avoid expensive regex matching. However, the initial version was algorithmically primitive and lacked streaming capability (e.g., pattern matching over streamed data). Also, it suffered from quadratic matching time, as it had to re-scan the input from a matched literal for each match.
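The quadratic behavior can be made concrete with a toy reconstruction (an assumption about the version 1.0 strategy, not its actual code): each literal hit triggers a re-scan of the input prefix, so an input dense in the literal does work proportional to n squared:

```python
import re

# Toy prefilter in the style described for version 1.0 (hypothetical):
# whenever the literal fires, re-run the full regex over the prefix ending
# at that literal site, so pathological inputs trigger quadratic work.
def prefilter_scan(pattern: str, literal: str, data: str) -> int:
    regex = re.compile(pattern)
    bytes_examined = 0
    start = 0
    while (i := data.find(literal, start)) != -1:
        bytes_examined += i + len(literal)       # cost of this re-scan
        regex.search(data, 0, i + len(literal))  # re-scan the whole prefix
        start = i + 1
    return bytes_examined

# 1,000 literal hits, each forcing an O(n) re-scan: ~n^2/2 bytes overall.
data = "ab" * 1000          # 2,000 bytes of input
work = prefilter_scan(r"x.*ab", "ab", data)
print(work)                  # 1001000 bytes examined for 2,000 bytes of input
```

A single-pass design, by contrast, would examine each byte a constant number of times, which is the motivation for the later pre-pass-then-engines structure.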

Version 1.0 did include a large-scale string matcher (a hash-based matcher called hlm, akin to Rabin-Karp [29], with multiple hash tables for different-length strings) as well as a bit-based implementation of Glushkov NFAs. The NFA implementation allowed support of a broad range of regexes that would suffer from combinatorial explosions if the DFA construction algorithm was used. The Glushkov construction mapped well to Intel SIMD instructions, allowing NFA states to be held in one or more SIMD registers.
Version 2.0: The algorithmic issues and the absence of a streaming capability led to major changes to version 1.0, which became Hyperscan version 2.0. First, it moved towards a Glushkov NFA-based internal representation (the NFA Graph) that all transformations operated over, departing from ad-hoc optimizations on the regex syntax tree. Second, it supported 'streaming': scanning multiple blocks without retaining old data, and with a fixed-at-pattern-compile-time amount of stream state. Support for efficient streaming was especially desirable for network traffic monitoring, as patterns may spread over multiple packets. Third, it scanned patterns that used one or more strings detected by a string matching pre-pass, followed by a series of NFA or DFA engines running over the input only when the required strings are found. This approach avoids the high risk of potential quadratic behavior of version 1.0, with the tradeoff of potentially bypassing some comparatively lesser optimizations if a regex could be quickly falsified at each string matching site.
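The two ideas of a bit-vector NFA and fixed-size stream state can be sketched together (a minimal illustration, not Hyperscan's engine): for a plain sequence of character classes, a Glushkov-style NFA reduces to a shift-and step over a bitmask, and streaming just means carrying that single integer of state across block boundaries:

```python
# Minimal bit-parallel scan in the spirit of an NFA held in a bit vector
# (here for a plain sequence of character classes), with the state mask
# carried across block boundaries to emulate streaming mode.
def compile_masks(classes):
    masks = [0] * 256
    for bit, cls in enumerate(classes):
        for ch in cls:
            masks[ord(ch)] |= 1 << bit
    return masks, 1 << (len(classes) - 1)

def scan_block(masks, accept, state, block, base):
    """Advance the NFA state over one block; return (matches, new state)."""
    matches = []
    for off, ch in enumerate(block):
        # shift-and step: every live state advances iff ch is in its class
        state = ((state << 1) | 1) & masks[ord(ch)]
        if state & accept:
            matches.append(base + off)  # end offset of a match
    return matches, state

# pattern equivalent to /ab[cd]/, streamed in two blocks split mid-match
masks, accept = compile_masks(["a", "b", "cd"])
m1, st = scan_block(masks, accept, 0, "xxa", 0)
m2, st = scan_block(masks, accept, st, "bcab d", 3)
print(m1 + m2)   # [4]: "abc" spans the block boundary, ending at offset 4
```

The stream state here is one machine word regardless of input length, which is the "fixed-at-pattern-compile-time" property the text describes; the real implementation holds wider state in SIMD registers.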

Unfortunately, version 2.0 still had a number of limitations. First, we observed the adverse performance impact of prefiltering. Prefiltering did not reduce the size of the NFA or DFA engines, even if a string factor completely separated a pattern into two smaller ones. This exacerbated the problem of a large regex that often needed to be converted into an NFA. As the system had a hard limit of 512 NFA states (dictated by the practicalities of a data-parallel SIMD implementation of the Glushkov NFA; more than 512 states resulted in extremely slow code), it often did not accommodate user-provided regexes when they were too large. Further, if prefiltering failed (i.e., when the string factors were all present), it ended up consuming more CPU cycles than naïvely running the NFA engines over all the input.

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on the order in which the NFAs were run, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of patterns /foo/, /abc[xyz]/, and /abc[xy]*def|abcz*de[fg]/ might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern abc[xyz], then provide matches for the two parts of the alternation, abc[xy]*def and abcz*de[fg], without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).
Versions 2.1 and 3.0: (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate-match avoidance, and pattern decomposition. This subsystem was maintained in parallel to the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, the patterns were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.
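The duplicate-match problem and its fix-up can be sketched as follows (a hypothetical reconstruction: one pattern id whose alternation was split into two components that run as separate engines, followed by a Rose-style de-duplication and ordering pass):

```python
import re

# Hypothetical reconstruction: one pattern id whose alternation was split
# into two components, each run by a separate engine.
COMPONENTS = [re.compile(r"abc[xy]*def"), re.compile(r"abcz*de[fg]")]
PATTERN_ID = 7

def run_components(data):
    # each component reports (id, end offset) independently, in engine order
    raw = []
    for comp in COMPONENTS:
        for m in comp.finditer(data):
            raw.append((PATTERN_ID, m.end()))
    return raw

def dedup_and_order(raw):
    # suppress duplicate (id, end-offset) reports and emit in offset order
    return sorted(set(raw), key=lambda r: r[1])

data = "abcdef"  # matches both branches, ending at the same offset
print(run_components(data))                   # [(7, 6), (7, 6)]
print(dedup_and_order(run_components(data)))  # [(7, 6)]
```

Without the second pass, the caller sees the same (id, offset) twice, exactly the symptom described; ordered, de-duplicated reporting is what the 'Rose' subsystem introduced.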

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.
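For reference, the hlm design it replaced is described earlier as Rabin-Karp-like hashing with one table per pattern length; a simplified sketch of that scheme (assumptions: Python's built-in hash instead of a rolling hash, and exact-match verification on every hash hit):

```python
from collections import defaultdict

# Sketch of an hlm-style matcher: one hash table per pattern length,
# verifying candidates on hash hits (Rabin-Karp [29] style, simplified).
class MultiLengthMatcher:
    def __init__(self, patterns):
        self.tables = defaultdict(dict)          # length -> {hash: [patterns]}
        for p in patterns:
            self.tables[len(p)].setdefault(hash(p), []).append(p)

    def scan(self, data):
        matches = []
        for length, table in self.tables.items():
            for i in range(len(data) - length + 1):
                window = data[i:i + length]
                for p in table.get(hash(window), []):
                    if p == window:              # verify on hash hit
                        matches.append((i, p))
        return sorted(matches)

m = MultiLengthMatcher(["foo", "bars", "ar"])
print(m.scan("foobars"))   # [(0, 'foo'), (3, 'bars'), (4, 'ar')]
```

Scanning once per distinct pattern length is the cost FDR's SIMD bucketed approach avoids, which is one reason for its speedup over hlm.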

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.
Version 4.0: Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing barriers of cost and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identified the strings separated by various common regex constructs such as .* or X+ for some character class X. For example, it is frequently the case that strings in regexes are separated by the .* construct (i.e., /foo.*bars/). This is usually implementable by requiring only that foo is seen before bar (usually, but not always; consider the expression /foo.*oars/). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.
Version 5.0: Version 5.0, which is the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology of malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns beyond matching a single pattern. To support this, the system now allows user-defined AND, OR, and NOT combinations along with their patterns. For example, an AND operation between patterns foobar and teakettle requires that both patterns are matched in the input before reporting a match. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE that brings the benefit of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. Hyperscan's lack of support for full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopted solutions. Chimera employs Hyperscan as a fast filter for input and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
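The AND-combination semantics can be sketched in a few lines (a hypothetical helper, not the Hyperscan API): a combination fires only if every referenced pattern id matched somewhere in the input.

```python
import re

# Hypothetical sketch of a logical combination over per-pattern match flags:
# an AND combination reports only when all referenced patterns matched.
PATTERNS = {1: re.compile("foobar"), 2: re.compile("teakettle")}

def matched_ids(data):
    # which pattern ids matched anywhere in this input
    return {pid for pid, rx in PATTERNS.items() if rx.search(data)}

def combo_and(ids, data):
    return set(ids) <= matched_ids(data)

print(combo_and({1, 2}, "a foobar and a teakettle"))  # True
print(combo_and({1, 2}, "only a foobar here"))        # False
```

OR and NOT combinations follow the same shape (set intersection and complement over the matched-id set), evaluated after the scan rather than during it.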

7.2 Lessons Learned
We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.
Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even from the subset of 'true' regexes), and that the support of regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimal Viable Product early, rather than developing a product with a long list of features that customers may or may not want.
Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with additional powers of support for large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.
Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features) while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series were, over time, not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions
Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' parts of the input that matched particular parts of the regular expression.

8 Conclusion
In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher adopted by many commercial systems around the world.

Acknowledgment

We appreciate valuable feedback by the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.
[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.
[3] ModSecurity. https://www.modsecurity.org.
[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.
[5] PCRE Manual Pages. https://pcre.org/pcre.txt.
[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.
[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.
[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.
[9] Snort Intrusion Detection System. https://snort.org.
[10] Suricata Open Source IDS. https://suricata-ids.org.
[11] Talos Ruleset. https://www.snort.org/talos.
[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333-340, June 1975.
[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74-82, 1992.
[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.
[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.
[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.
[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.
[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.
[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.
[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.
[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.
[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors and Applications (NP3), 2004.
[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.
[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.
[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248-264, 1972.
[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.
[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1-53, 1961.
[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317-328, 2012.
[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249-260, 1987.
[30] T. Kojm. ClamAV. http://www.clamav.net.
[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.
[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.
[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architecture for Networking and Communications Systems (ANCS), 2006.
[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.
[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.
[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.
[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.
[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.
[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.
[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.
[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2014.
[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
1:  function DominantPathAnalysis(G)
2:      dpath = ∅
3:      for v ∈ accepts do
4:          calculate dominant path p[v] for v
5:          if dpath = ∅ then
6:              dpath = p[v]
7:          else
8:              dpath = common_prefix(dpath, p[v])
9:              if dpath = ∅ then
10:                 return null_string
11:             end if
12:         end if
13:         strings = expand_and_extract(dpath)
14:     end for
15:     return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and then computes the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
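The common-prefix and expand-and-extract steps can be sketched concretely (an assumed representation: a dominant path is a list of small character-sets, one per vertex):

```python
from itertools import product

# Hypothetical representation: a dominant path is a list of small character
# sets, one per vertex on the path (e.g. [{'f'}, {'o'}, {'o'}, {'d', 't'}]).
def common_prefix(p, q):
    out = []
    for a, b in zip(p, q):
        if a != b:
            break
        out.append(a)
    return out

def expand_and_extract(path):
    # expand small character-sets into the concrete strings along the path
    return sorted("".join(t) for t in product(*[sorted(s) for s in path]))

p1 = [{"f"}, {"o"}, {"o"}, {"d", "t"}]   # dominant path of accept state 1
p2 = [{"f"}, {"o"}, {"o"}, {"l"}]        # dominant path of accept state 2
prefix = common_prefix(p1, p2)
print(expand_and_extract(prefix))   # ['foo']
```

If the common prefix collapses to the empty path, no string factor dominates all accept states, which corresponds to the null_string return in Algorithm 3.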

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
1:  function DominantRegionAnalysis(G)
2:      acyclic_g = build_acyclic(G)
3:      Gt = build_topology_order(acyclic_g)
4:      candidate = {q0}
5:      it = begin(Gt)
6:      while it ≠ end(Gt) do
7:          if isValidCut(candidate) then
8:              setRegion(candidate)
9:              initializeCandidate(candidate)
10:         else
11:             addToCandidate(it)
12:             it = it + 1
13:         end if
14:     end while
15:     setRegion(candidate)
16:     merge regions connected with back edges
17:     strings = expand_and_extract(regions)
18:     return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
1:  function NetworkFlowAnalysis(G)
2:      for edge ∈ E do
3:          strings = find_strings(edge)
4:          scoreEdge(edge, strings)
5:      end for
6:      cuts = MinCut(G)
7:      strings = extract and expand strings from cuts
8:      return strings
9:  end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of the string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
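A minimal sketch of this step, under stated assumptions: each edge carries a candidate string, its capacity is 1/len(string), and a BFS-based Edmonds-Karp max-flow (cf. [25]) yields the min cut; the graph and strings here are made up for illustration.

```python
from collections import defaultdict, deque

# Edmonds-Karp max-flow/min-cut over edge capacities set to 1/len(string),
# so the min cut is crossed by the longest (cheapest) strings.
def min_cut_edges(cap, s, t):
    flow = defaultdict(float)
    nodes = {u for e in cap for u in e}
    res = lambda u, v: cap.get((u, v), 0.0) - flow[(u, v)] + flow[(v, u)]
    while True:
        parent, q = {s: None}, deque([s])
        while q:                       # BFS for an augmenting path
            u = q.popleft()
            for v in nodes:
                if v not in parent and res(u, v) > 1e-12:
                    parent[v] = u
                    q.append(v)
        if t not in parent:            # no path left: cut = reachable frontier
            return {e for e in cap if e[0] in parent and e[1] not in parent}
        path, v = [], t
        while parent[v] is not None:   # walk back to collect the path
            path.append((parent[v], v))
            v = parent[v]
        aug = min(res(u, v) for u, v in path)
        for u, v in path:
            flow[(u, v)] += aug

strings = {("s", "a"): "ab", ("a", "t"): "longstring",
           ("s", "b"): "x", ("b", "t"): "abcdef"}
cap = {e: 1.0 / len(w) for e, w in strings.items()}
cut = min_cut_edges(cap, "s", "t")
print(sorted(strings[e] for e in cut))   # ['abcdef', 'longstring']
```

The cut avoids the short strings "ab" and "x" (high capacity) and selects the edges holding the long strings, matching the intent of the scoring described above.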



Ruleset    PCRE     PCRE2   RE2-s   Hyperscan-s   RE2-m   Hyperscan-m
Talos      694.2    39.4    177.7   17.3          29.0    2.15
ET-Open    1280.0   91.3    469.6   51.6          111.6   13.3

Table 5: Performance comparison with PCRE, PCRE2, RE2, and Hyperscan for the Snort Talos (1,300 regexes) and Suricata (2,800 regexes) rulesets with the real Web traffic trace. Numbers are in seconds.

[Figure 11: String matching performance with random packets. (a) ET-Open ruleset: throughput (Gbps) of FDR vs. DFC for 1k, 5k, 10k, and 26k string patterns. (b) Talos ruleset: improvement of FDR over DFC and AC for 1k, 5k, 10k, and 20k string patterns.]

[Figure 12: String matching performance with a real traffic trace. (a) ET-Open ruleset: throughput (Gbps) of FDR vs. DFC for 1k, 5k, 10k, and 26k string patterns. (b) Talos ruleset: improvement of FDR over DFC and AC for 1k, 5k, 10k, and 20k string patterns.]

…extracted from Snort ET-Open and Talos rulesets. Figure 11 and Figure 12 show that FDR outperforms the state-of-the-art matcher DFC by 1.1x to 3.2x (and AC by 4.2x to 8.8x) for packets of random content, and by 1.3x to 2.5x (and AC by 3.2x to 8.2x) for the real traffic trace. We also evaluate it with the Suricata ruleset, but we omit the result here since it exhibits a similar performance trend. When the number of string patterns is small, Hyperscan benefits from a small CPU cache footprint and SIMD acceleration; when the number of patterns grows, its performance becomes comparable to DFC due to increased cache usage, but it is still much better than AC.
Regex matching: We now evaluate the performance of regex matching of Hyperscan. We compare the performance with PCRE (v8.41) [6], as it is most widely used in network DPI applications, and PCRE2 (v10.32) [7], a more recent fork from PCRE with a new API. We enable the JIT option for both PCRE and PCRE2. We also compare with RE2 (v2018-09-01) [1], a fast, small-memory-footprint regex matcher developed by Google. Both Hyperscan and RE2 support multi-regex matching in parallel, while PCRE matches one regex at a time. For fair comparison with PCRE and PCRE2, we measure the total time for matching all regexes in a serial manner (e.g., matching against one regex at a time), which would require passing the entire input for each regex. For Hyperscan and RE2, we measure two numbers: one for matching one regex at a time (RE2-s, Hyperscan-s) and the other for matching all regexes in parallel (RE2-m, Hyperscan-m). For testing, we use 1,300 regexes from the Snort Talos ruleset and 2,800 regexes from the Suricata ruleset.

Table 5 shows the result For the Snort Talos rulesetHyperscan-s outperforms PCRE RE2-s and PCRE2 by401x 103x and 23x respectively Hyperscan-m is135x faster than RE2-m while it reveals 1833x perfor-mance improvement over PCRE2 For the Suricata rule-set Hyperscan-s shows 248x 91x and 18x speedupsover PCRE RE2-s and PCRE2 Hyperscan-m outper-forms RE2-m and PCRE2 by 84x and 69x respectivelyWe do not report DFA-based PCRE performance as wefind it much slower than the default operation with NFA-based matching [5] RE2-s uses a DFA for regex matching(and fails on a large regex that requires an NFA) but itneeds to translate the whole regex into a single DFA Incontrast Hyperscan splits a regex into strings and FAsand benefits from fast string matching as well as smallerDFA matching of the FA components which explains theperformance boost

64 Real-world DPI ApplicationWe now evaluate how much performance improvementHyperscan brings to a popular IDS like Snort Wecompare the performance of stock Snort (ST-Snort) andHyperscan-ported Snort (HS-Snort) that performs patternmatching with Hyperscan both with a single CPU coreST-Snort employs AC and PCRE for multi-string match-ing and regex matching respectively HS-Snort keeps thebasic design of Snort but it replaces AC and PCRE withthe multi-string and single-regex matchers of HyperscanIt also replaces the Boyer-Moore algorithm in Snort witha fast single-literal matcher of Hyperscan With the SnortTalos ruleset ST-Snort achieves 113 Mbps on our realWeb traffic In contrast HS-Snort produces 986 Mbps onthe same traffic a factor of 873 performance improve-ment We find that the main contributor for performanceimprovement is the highly-efficient multi-string matcherof Hyperscan as shown in Figure 11 In practice weexpect a much higher performance improvement if werestructure Snort to use multi-regex matching in parallel

642 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

7 Evolution Experience and LessonsHyperscan has been developed since 2008 and was firstopen-sourced in 2013 Hyperscan has been successfullyadopted by over 40 commercial projects globally and it isin production use by tens of thousands of cloud servers indata centers In addition Hyperscan has been integratedinto 37 open-source projects and it supports various op-erating systems such as Linux FreeBSD and WindowsHyperscan APIs are initially developed in C but thereare public projects that provide bindings for other pro-gramming languages such as Java Python Go NodejsRuby Lua and Rust In this section we briefly shareour experience with developing Hyperscan and lay out itsfuture direction

71 Evolution of HyperscanHyperscan was developed at a start-up company Sen-sory Networks after a move away from hardware match-ing which was expensive in terms of material costs anddevelopment time We investigated GPU-based regexmatching but it imposed unacceptable latency and sys-tem complexity As CPU technology advances we settledat CPU-based regex matching which not only becamecost-effective with high performance but also made itsimple to be employed by applicationsVersion 10 The initial version was released in 2008with the intent of providing a simple block-mode regexpackage that could handle large numbers of regexes Likeother popular solutions at that time it used string-basedprefiltering to avoid expensive regex matching How-ever the initial version was algorithmically primitive andlacked streaming capability (eg pattern matching overstreamed data) Also it suffered from quadratic matchingtime as it had to re-scan the input from a matched literalfor each match

Version 10 did include a large-scale string matcher(a hash-based matcher called hlm akin to Rabin-Karp[29] with multiple hash tables for different length strings)as well as a bit-based implementation of Glushkov NFAsThe NFA implementation allowed support of a broadrange of regexes that would suffer from combinatorialexplosions if the DFA construction algorithm was usedThe Glushkov construction mapped well to Intel SIMDinstructions allowing NFA states to be held in one ormore SIMD registersVersion 20 The algorithmic issues and the absence ofa streaming capability led to major changes to version10 which became Hyperscan version 20 First it movedtowards a Glushkov NFA-based internal representation(the NFA Graph) that all transformations operated overdeparting from ad-hoc optimizations on the regex syntax

tree Second it supported lsquostreamingrsquo ndash scanning multipleblocks without retaining old data and with a fixed-at-pattern-compile-time amount of stream state Support forefficient streaming was especially desirable for networktraffic monitoring as patterns may spread over multiplepackets Third it scanned patterns that used one or morestrings detected by a string matching pre-pass followedby a series of NFA or DFA engines running over the inputonly when the required strings are found This approachavoids the high risk of potential quadratic behavior ofversion 10 with the tradeoff of potentially bypassingsome comparatively lesser optimizations if a regex couldbe quickly falsified at each string matching site

Unfortunately version 20 still had a number of limita-tions First we observed the adverse performance impactof prefiltering Prefiltering did not reduce the size of theNFA or DFA engines even if a string factor completelyseparated a pattern into two smaller ones This exac-erbated the problem of a large regex that often neededto be converted into an NFA As the system had a hardlimit of 512 NFA states (dictated by the practicalities of adata-parallel SIMD implementation of the Glushkov NFAmore than 512 states resulted in extremely slow code) itoften did not accommodate user-provided regexes whenthey were too large Further if prefiltering failed (iewhen the string factors were all present) it ended up con-suming more CPU cycles than naiumlvely performing theNFA engines over all the input

Another serious limitation was that matches emerged from the system in an undefined order. Since the NFAs were run after string matching had finished, the matches from these NFAs would emerge based on which NFAs were run first, and no rigorous order was defined for when these matches would appear. Further confusing matters, the string matcher was capable of producing matches of plain strings ahead of the NFA executions. In fact, due to potential optimizations where NFA graphs might be split (for example, splitting into connected components to allow an unimplementably large NFA to be run as smaller components), it was even possible to receive duplicate matches for a given pattern at a given location. After an NFA was split into connected components and run in separate engines, no mechanism existed to detect whether these different components (which would be running at different times) might sometimes produce matches with the same match id and location.

For example, a regex workload consisting of the patterns foo, abc[xyz], and abc[xy]*def|abcz*de[fg] might first produce matches for the simple literal foo, then provide all the matches for the components of the pattern abc[xyz], then provide matches for the two parts


of the alternation, abc[xy]*def and abcz*de[fg], without removing duplicate matches for inputs that happened to match both parts of the pattern at the same offset (e.g., the input abcxxxxzdef).

Versions 2.1 and 3.0. (The version 2.1 release series of Hyperscan saw considerable development and, in retrospect, should have merited a full version-number increment.) The limitation of prefiltering spurred the development of an alternate matching subsystem called 'Rose'. 'Rose' allowed ordered matching, duplicate-match avoidance, and pattern decomposition. This subsystem was maintained in parallel with the prefiltering system inherited from the original 2.0 design. Whenever it was possible to decompose patterns, they were matched with the 'Rose' subsystem, which initially was not capable of handling all regular expressions.
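The duplicate-match problem can be illustrated minimally, with Python's `re` standing in for Hyperscan's engines. The input "abcdef" is my own choice for this sketch: it happens to satisfy both branches of the alternation at the same end offset.

```python
# Minimal illustration of duplicate matches when an alternation is split
# into components run by separate engines.
import re

components = [re.compile(r"abc[xy]*def"), re.compile(r"abcz*de[fg]")]
PATTERN_ID = 7                        # both components report the same id

raw = []
for comp in components:               # engines run at different times
    for m in comp.finditer("abcdef"):
        raw.append((PATTERN_ID, m.end()))

# The caller sees (7, 6) twice; a dedup pass must collapse the reports.
deduped = sorted(set(raw))
```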

FDR was developed during the version 2.1 release series; it replaced the hash-based literal matcher (hlm) with considerable performance improvements and a reduction in memory footprint.

Eventually, by version 3.0, the old prefiltering system was entirely removed, as the Rose path was made fully general. Version 3.0 also marked an organizational change, in that Intel Corporation had acquired Sensory Networks.

Version 4.0. Version 4.0 was released in October 2013 under an open-source BSD license to further increase the usage of Hyperscan by removing cost barriers and allowing customization. Many elements of Hyperscan's design continued to evolve. For example, the initial Rose subsystem had a host of special-purpose matching mechanisms that identified the strings separated by various common regex constructs, such as .* or X+ for some character class X. For example, it is frequently the case that strings in regexes are separated by the .* construct (i.e., foo.*bars). This is usually implementable by requiring only that foo is seen before bars (usually, but not always; consider the expression foo.*oars). The original version of Rose had many special-purpose engines to handle these types of subcases; during the evolution of the system, this special-purpose code was almost entirely replaced with generalized NFA and DFA mechanisms, which were amenable to analysis and optimization and were needed for the general case of regex matching in any case.

Version 5.0. Version 5.0, which is the latest version as of writing this paper, mainly focused on enhancing the usability of Hyperscan. Two key added features are support for logical combinations of patterns and Chimera, a hybrid regex engine of Hyperscan and PCRE [6]. As the detection technology for malicious network traffic matures, it often requires evaluating a logical combination of a group of patterns beyond matching a single pattern. To
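The 'foo seen before bars' shortcut and its failure case can be checked directly. Here `naive_order_check` is an illustrative name, not a Hyperscan function: it only requires that the first string appear before the second, ignoring overlap.

```python
# The ordering shortcut usually agrees with the regex, but overlapping
# string occurrences break it.
import re

def naive_order_check(text, s1, s2):
    i, j = text.find(s1), text.find(s2)
    return i != -1 and j != -1 and i < j

# Usually the shortcut agrees with the regex:
naive_order_check("XfooYbarsZ", "foo", "bars")   # True
bool(re.search(r"foo.*bars", "XfooYbarsZ"))      # True

# But in "fooars", "foo" ends after "oars" begins, so the shortcut
# reports a false positive while the regex correctly fails.
naive_order_check("fooars", "foo", "oars")       # True (false positive)
bool(re.search(r"foo.*oars", "fooars"))          # False
```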

support this, the system now allows user-defined AND, OR, and NOT combinations of patterns. For example, an AND operation between patterns foobar and teakettle requires that both patterns match the input before a match is reported. Version 5.0 also added Chimera, a hybrid matcher of Hyperscan and PCRE that brings the benefits of both worlds: support for full PCRE syntax while enjoying the high performance of Hyperscan. Hyperscan's lack of support for full PCRE syntax (such as capturing and back-references) made it difficult to completely replace PCRE in adopting solutions. Chimera employs Hyperscan as a fast filter for the input, and triggers the PCRE library functions to confirm a real match only when Hyperscan reports a match on a pattern that has features not supported by Hyperscan.
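A minimal sketch of what an AND combination means semantically: report only when both sub-patterns match the input. This is illustrative code, not Hyperscan's actual API, and Python's `re` stands in for the underlying engines.

```python
# Semantic sketch of an AND combination over two patterns.
import re

def matches_and(text, pat_a, pat_b):
    # Both sub-patterns must match somewhere in the input.
    return bool(re.search(pat_a, text)) and bool(re.search(pat_b, text))

matches_and("a teakettle of foobar", "foobar", "teakettle")  # True
matches_and("just foobar here", "foobar", "teakettle")       # False
```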

7.2 Lessons Learned

We summarize a few lessons that we learned in the course of development and commercialization of Hyperscan.

Release quickly and iterate, even with partial feature support. The difficulty of generalized regex matching often led Hyperscan to focus on the delivery of some capability in partial form. From a theoretical standpoint, it is unsatisfactory that Hyperscan still cannot support all regexes (even within the subset of 'true' regexes), and that the support for regexes with the 'start of match' flag turned on is even smaller. However, customers can still find the system practically useful despite these limitations. Despite the limited pattern support of version 2.0 and the problems of ordering and duplicated matches, there was immediate commercial use of the product even in that early form (use which made subsequent development possible). This lends support to the idea of releasing a Minimum Viable Product early, rather than developing a product with a long list of features that customers may or may not want.

Evolve new features over several versions, at the expense of maintaining multiple code paths. Academic systems are usually built for elegance and to illustrate a particular methodology. However, a commercial system must stay viable as a product while a new subsystem is built. For example, Hyperscan maintained both the 'prefilter' arrangement and the new 'Rose-based' decomposition arrangement in the same code base, resulting in considerable extra complexity. However, the benefits of having the new 'ordered' semantics (with the additional power of support for large patterns due to decomposition) outweighed the complexity cost. It was almost impossible that a small start-up could have managed to transition from one system to the other in the span of a single release, or to have simply not meaningfully updated the project for an extended time while working on a substantial update.


Commercial products may need to emphasize less interesting subcases of a task, or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series over time were not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Examples of two such features were 'precise alive' (the ability to tell at any given stream write boundary whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern-matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited

set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' the parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as a state-of-the-art regex matcher, adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback by the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.


[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules/.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] A. V. Aho and M. J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333–340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74–82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors and Applications (NP3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248–264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1–53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: A highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317–328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and


J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing, 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] S. K. Cha, I. Moraru, J. Jang, J. Truelove, D. Brumley, and D. G. Andersen. SplitScreen: Enabling efficient distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
1: function DominantPathAnalysis(G)
2:     dpath = ∅
3:     for v ∈ accepts do
4:         calculate dominant path p[v] for v
5:         if dpath = ∅ then
6:             dpath = p[v]
7:         else
8:             dpath = common_prefix(dpath, p[v])
9:             if dpath = ∅ then
10:                return null_string
11:            end if
12:        end if
13:        strings = expand_and_extract(dpath)
14:    end for
15:    return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and then finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
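The common-prefix step of Algorithm 3 can be sketched in runnable form. Paths are supplied directly here; deriving p[v] from the Glushkov graph is omitted, and `None` stands in for the algorithm's null_string result.

```python
# Runnable sketch of Algorithm 3's core step: intersect the dominant
# paths of all accept states by longest common prefix.
def common_prefix(a, b):
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

def dominant_path(paths):
    dpath = None
    for p in paths:
        dpath = list(p) if dpath is None else common_prefix(dpath, p)
        if not dpath:                 # no common path: no string to extract
            return None
    return dpath

dominant_path([list("abcdef"), list("abcxyz")])   # ['a', 'b', 'c']
```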

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
1: function DominantRegionAnalysis(G)
2:     acyclic_g = build_acyclic(G)
3:     Gt = build_topology_order(acyclic_g)
4:     candidate = {q0}
5:     it = begin(Gt)
6:     while it ≠ end(Gt) do
7:         if isValidCut(candidate) then
8:             setRegion(candidate)
9:             initializeCandidate(candidate)
10:        else
11:            addToCandidate(it)
12:            it = it + 1
13:        end if
14:    end while
15:    setRegion(candidate)
16:    Merge regions connected with back edges
17:    strings = expand_and_extract(regions)
18:    return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set


and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
1: function NetworkFlowAnalysis(G)
2:     for edge ∈ E do
3:         strings = find_strings(edge)
4:         scoreEdge(edge, strings)
5:     end for
6:     cuts = MinCut(G)
7:     strings = extract and expand strings from cuts
8:     return strings
9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of the string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.


7 Evolution Experience and LessonsHyperscan has been developed since 2008 and was firstopen-sourced in 2013 Hyperscan has been successfullyadopted by over 40 commercial projects globally and it isin production use by tens of thousands of cloud servers indata centers In addition Hyperscan has been integratedinto 37 open-source projects and it supports various op-erating systems such as Linux FreeBSD and WindowsHyperscan APIs are initially developed in C but thereare public projects that provide bindings for other pro-gramming languages such as Java Python Go NodejsRuby Lua and Rust In this section we briefly shareour experience with developing Hyperscan and lay out itsfuture direction

71 Evolution of HyperscanHyperscan was developed at a start-up company Sen-sory Networks after a move away from hardware match-ing which was expensive in terms of material costs anddevelopment time We investigated GPU-based regexmatching but it imposed unacceptable latency and sys-tem complexity As CPU technology advances we settledat CPU-based regex matching which not only becamecost-effective with high performance but also made itsimple to be employed by applicationsVersion 10 The initial version was released in 2008with the intent of providing a simple block-mode regexpackage that could handle large numbers of regexes Likeother popular solutions at that time it used string-basedprefiltering to avoid expensive regex matching How-ever the initial version was algorithmically primitive andlacked streaming capability (eg pattern matching overstreamed data) Also it suffered from quadratic matchingtime as it had to re-scan the input from a matched literalfor each match

Version 10 did include a large-scale string matcher(a hash-based matcher called hlm akin to Rabin-Karp[29] with multiple hash tables for different length strings)as well as a bit-based implementation of Glushkov NFAsThe NFA implementation allowed support of a broadrange of regexes that would suffer from combinatorialexplosions if the DFA construction algorithm was usedThe Glushkov construction mapped well to Intel SIMDinstructions allowing NFA states to be held in one ormore SIMD registersVersion 20 The algorithmic issues and the absence ofa streaming capability led to major changes to version10 which became Hyperscan version 20 First it movedtowards a Glushkov NFA-based internal representation(the NFA Graph) that all transformations operated overdeparting from ad-hoc optimizations on the regex syntax

tree Second it supported lsquostreamingrsquo ndash scanning multipleblocks without retaining old data and with a fixed-at-pattern-compile-time amount of stream state Support forefficient streaming was especially desirable for networktraffic monitoring as patterns may spread over multiplepackets Third it scanned patterns that used one or morestrings detected by a string matching pre-pass followedby a series of NFA or DFA engines running over the inputonly when the required strings are found This approachavoids the high risk of potential quadratic behavior ofversion 10 with the tradeoff of potentially bypassingsome comparatively lesser optimizations if a regex couldbe quickly falsified at each string matching site

Unfortunately version 20 still had a number of limita-tions First we observed the adverse performance impactof prefiltering Prefiltering did not reduce the size of theNFA or DFA engines even if a string factor completelyseparated a pattern into two smaller ones This exac-erbated the problem of a large regex that often neededto be converted into an NFA As the system had a hardlimit of 512 NFA states (dictated by the practicalities of adata-parallel SIMD implementation of the Glushkov NFAmore than 512 states resulted in extremely slow code) itoften did not accommodate user-provided regexes whenthey were too large Further if prefiltering failed (iewhen the string factors were all present) it ended up con-suming more CPU cycles than naiumlvely performing theNFA engines over all the input

Another serious limitation was that matches emergedfrom the system in an undefined order Since the NFAswere run after string matching had finished the matchesfrom these NFAs would emerge based on the order ofwhich NFAs were run first and no rigorous order was de-fined for when these matches would appear Further con-fusing matters the string matcher was capable of produc-ing matches of plain strings ahead of the NFA executionsIn fact due to potential optimizations where NFA graphsmight be split (for example splitting into connected com-ponents to allow an unimplementably large NFA to be runas smaller components) it was even possible to receiveduplicate matches for a given pattern at a given locationAfter an NFA is split into connected components andrun in separate engines no mechanism existed to detectwhether these different components (which would be run-ning at different times) might be sometimes producingmatches with the same match id and location

For example a regex workload consisting of patterns f oo abc[xyz] and abc[xy] lowast de f |abcz lowast de[ f g]might first produce matches for the simple literal f oothen provide all the matches for the components of thepattern abc[xyz] then provide matches for the two parts

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 643

of the alternation abc[xy]lowastde f and abczlowastde[ f g] with-out removing duplicate matches for inputs that happenedto match both parts of the pattern on the same offset (egthe input abcxxxxzde f )Versions 21 and 30 (The version 21 release series ofHyperscan saw considerable development and in retro-spect should have merited a full version number incre-ment) The limitation of prefiltering spurred the develop-ment of an alternate matching subsystem called lsquoRosersquolsquoRosersquo allowed both ordered matching duplicate matchavoidance and pattern decomposition This subsystemwas maintained in parallel to the prefiltering system inher-ited from the original 20 design Whenever it was possi-ble to decompose patterns the patterns were matched withthe lsquoRosersquo subsystem which initially was not capable ofhandling all regular expressions

FDR was developed during in the version 21 releaseseries it replaced the hash-based literal matcher (hlm)with considerably performance improvements and reduc-tion in memory footprint

Eventually by version 30 the old prefiltering systemwas entirely removed as the Rose path was made fullygeneral Version 30 also marked an organizational changein that Intel Corporation had acquired Sensory NetworksVersion 40 Version 40 was released in October 2013under an open-source BSD license to further increasethe usage of Hyperscan by removing barriers of costand allowing customization Many elements of Hyper-scanrsquos design continued to evolve For example the initialRose subsystem had a host of special-purpose matchingmechanisms that identify the strings separated by variouscommon regex constructs such as lowast or X+ for some char-acter class X For example it is frequently the case thatstrings in regexes might be separated by the construct(ie f oo lowast bars) This is usually implementable byrequiring only that f oo is seen before bar (usuallybut not always consider the expression f oolowastoars )The original version of Rose had many special-purposeengines to handle these type of subcases during the evo-lution of the system this special-purpose code was almostentirely replaced with generalized NFA and DFA mecha-nisms amenable to analysis and optimization and wereneeded for the general case of regex matching in any caseVersion 50 Version 50 which is the latest version asof writing this paper mainly focused on enhancing theusability of Hyperscan Two key added features are sup-port for logical combinations of patterns and Chimeraa hybrid regex engine of Hyperscan and PCRE [6] Asthe detection technology of malicious network traffic ma-tures it often requires evaluating a logical combination ofa group of patterns beyond matching a single pattern To

support this the system now allows user-defined ANDOR NOT along their patterns For example an AND op-eration between patterns foobar and teakettle requiredthat both patterns are matched for input before reportinga match Version 50 added Chimera a hybrid matcher ofHyperscan and PCRE brings the benefit of both worlds ndashsupport for full PCRE syntax while enjoying the high per-formance of Hyperscan Lack of support of Hyperscan forfull PCRE syntax (such as capturing and back-references)made it difficult to completely replace PCRE in adoptedsolutions Chimera employs Hyperscan as a fast filter forinput and triggers the PCRE library functions to confirma real match only when Hyperscan reports a match on apattern that has features not supported by Hyperscan

72 Lessons LearnedWe summarize a few lessons that we learned in the courseof development and commercialization of HyperscanRelease quickly and iterate even with partial featuresupport The difficulty of generalized regex matchingoften led Hyperscan to focus on the delivery of somecapability in partial form From a theoretical standpointit is unsatisfactory that Hyperscan still cannot supportall regexes (even from the subset of rsquotruersquo regexes) andthat the support of regexes with the rsquostart of matchrsquo flagturned on is even smaller However customers can stillfind the system practically useful despite these limitationsDespite the limited pattern support of version 20 andthe problems of ordering and duplicated matches therewas immediate commercial use of the product even inthat early form (use which made subsequent developmentpossible) This lends support to the idea of releasing aMinimal Viable Product early rather than developing aproduct with a long list of features that customers may ormay not wantEvolve new features over several versions at the ex-pense of maintaining multiple code paths Academicsystems are usually built for elegance and to illustratea particular methodology However a commercial sys-tem must stay viable as a product while a new subsystemis built For example Hyperscan maintained both thersquoprefilterrsquo arrangement and the new rsquoRose-basedrsquo decom-position arrangement in the same code base resulting inconsiderable extra complexity However the benefits ofhaving the new rsquoorderedrsquo semantics (with additional pow-ers of support for large patterns due to decomposition)outweighed the complexity cost It was almost impossiblethat a small start-up could have managed to transitionfrom one system to the other in a span of a single releaseor to have simply not meaningfully updated the project foran extended time while working on a substantial update

644 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

Commercial products may need to emphasize less in-teresting subcases of a task or unusual corner casesThere was also considerable commercial pressure to bethe best option at some comparatively degenerate sub-set of regex matching or some relatively rsquohard casersquoCustomers often wanted Hyperscan to function as astring matcher - sometimes even a single-string-at-a-timematcher Other customers wanted high performance de-spite very high regex match rates (for example more than1 match per character) Such demands often force specialoptimizations that lack deep lsquoalgorithmic interestrsquo but arenecessary for commercial successBe cautious of cross-cutting complexity resultingfrom customer API requests One illuminating expe-rience in the delivery of a commercially viable regexmatcher was that customer feature requests for newrsquomodesrsquo or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerablymore complicated (due to a combinatorial explosion of in-teractions between features) while rarely being reused byother customers Features added in the 20 or 30 releaseseries over time were not carried forward to the 40 serieswe found that frequently such features were only used bya single customer (despite being made available to all)

Examples of two such features were 'precise alive' (the ability to tell, at any given stream write boundary, whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended with enriched semantics to support capturing, which would allow portions of the regexes to 'capture' the parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open-sourced for wider use, and it is generally recognized as the state-of-the-art regex matcher, adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback of the anonymous reviewers of USENIX NSDI '19 as well as our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

References

[1] Google RE2. https://github.com/google/re2.

[2] Hyperscan GitHub Repository. https://github.com/intel/hyperscan.

[3] ModSecurity. https://www.modsecurity.org.

[4] nDPI. https://www.ntop.org/products/deep-packet-inspection/ndpi/.

[5] PCRE Manual Pages. https://pcre.org/pcre.txt.

[6] PCRE: Perl Compatible Regular Expressions. https://www.pcre.org.

[7] Perl-compatible Regular Expressions (revised API: PCRE2). https://www.pcre.org/current/doc/html/index.html.

[8] Snort Emerging Threats Rules 2.9.0. https://rules.emergingthreats.net/open/snort-2.9.0/rules.

[9] Snort Intrusion Detection System. https://snort.org.

[10] Suricata Open Source IDS. https://suricata-ids.org.

[11] Talos Ruleset. https://www.snort.org/talos.

[12] Alfred V. Aho and Margaret J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 18(6):333–340, June 1975.

[13] Ricardo A. Baeza-Yates and Gaston H. Gonnet. A new approach to text searching. Communications of the ACM (CACM), 35(10):74–82, 1992.

[14] Zachary K. Baker and Viktor K. Prasanna. Time and Area Efficient Pattern Matching on FPGAs. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2004.

[15] M. Becchi and S. Cadambi. Memory-efficient regular expression search using state merging. In Proceedings of the IEEE International Conference on Computer Communications (INFOCOM), 2007.

[16] M. Becchi and P. Crowley. A Hybrid Finite Automaton for Practical Deep Packet Inspection. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2007.

[17] M. Becchi and P. Crowley. An improved algorithm to accelerate regular expression evaluation. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[18] M. Becchi and P. Crowley. A-DFA: A Time- and Space-Efficient DFA Compression Algorithm for Fast Regular Expression Evaluation. ACM Transactions on Architecture and Code Optimization (TACO), 10(1), April 2013.

[19] A. Bremler-Barr, Y. Harchol, D. Hay, and Y. Koral. Deep Packet Inspection as a Service. In Proceedings of the ACM International Conference on Emerging Networking Experiments and Technologies (CoNEXT), 2014.

[20] B. Brodie, D. E. Taylor, and R. K. Cytron. A scalable architecture for high-throughput regular expression pattern matching. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2006.

[21] B. Choi, J. Chae, M. Jamshed, K. Park, and D. Han. DFC: Accelerating string pattern matching for network applications. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.

[22] Chris Clark, Wenke Lee, David Schimmel, Didier Contis, Mohamed Koné, and Ashley Thomas. A Hardware Platform for Network Intrusion Detection and Prevention. In Proceedings of the Workshop on Network Processors & Applications (NP-3), 2004.

[23] Christopher R. Clark and David E. Schimmel. Efficient Reconfigurable Logic Circuits for Matching Complex Network Intrusion Detection Patterns. In Proceedings of the International Conference on Field-Programmable Logic and Applications (FPL), 2003.

[24] Beate Commentz-Walter. A string matching algorithm fast on the average. In Proceedings of the Colloquium on Automata, Languages and Programming, 1979.

[25] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248–264, 1972.

[26] D. Ficara, S. Giordano, G. Procissi, F. Vitucci, G. Antichi, and A. Di Pietro. An improved DFA for fast regular expression matching. ACM SIGCOMM Computer Communication Review (CCR), 38(5), 2008.

[27] V. M. Glushkov. The abstract theory of automata. Russian Mathematical Surveys, 16(5):1–53, 1961.

[28] Muhammad Asim Jamshed, Jihyung Lee, Sangwoo Moon, Insu Yun, Deokjin Kim, Sungryoul Lee, Yung Yi, and KyoungSoo Park. Kargus: a highly-scalable software-based intrusion detection system. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), pages 317–328, 2012.

[29] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[30] T. Kojm. ClamAV. http://www.clamav.net.

[31] S. Kong, R. Smith, and C. Estan. Efficient signature matching with multiple alphabet compression tables. In Proceedings of the International ICST Conference on Security and Privacy in Communication Networks (SecureComm), 2008.

[32] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and J. Turner. Algorithms to accelerate multiple regular expressions matching for deep packet inspection. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2006.

[33] S. Kumar, J. Turner, and J. Williams. Advanced algorithms for fast and scalable deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[34] Janghaeng Lee, Sung Ho Hwang, Neungsoo Park, Seong-Won Lee, SungIk Jun, and Young Soo Kim. A High Performance NIDS using FPGA-based Regular Expression Matching. In Proceedings of the ACM Symposium on Applied Computing (SAC), 2007.

[35] Abhishek Mitra, Walid Najjar, and Laxmi Bhuyan. Compiling PCRE to FPGA for accelerating Snort IDS. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2007.

[36] J. Jang, J. Truelove, D. G. Andersen, S. K. Cha, I. Moraru, and D. Brumley. SplitScreen: Enabling efficient, distributed malware detection. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2010.

[37] E. Sitaridi, O. Polychroniou, and K. A. Ross. SIMD-accelerated regular expression matching. In Proceedings of the Workshop on Data Management on New Hardware (DaMoN), 2016.

[38] R. Smith, C. Estan, and S. Jha. XFA: Faster Signature Matching with Extended Automata. In Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[39] R. Smith, C. Estan, S. Jha, and S. Kong. Deflating the big bang: fast and scalable deep packet inspection with extended finite automata. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM), 2008.

[40] Y. E. Yang, W. Jiang, and V. K. Prasanna. Compact architecture for high-throughput regular expression matching on FPGA. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2008.

[41] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2006.

[42] Xiaodong Yu, Bill Lin, and Michela Becchi. Revisiting state blow-up: Automatically building augmented-FA while preserving functional equivalence. IEEE Journal on Selected Areas in Communications, 32(10), October 2014.

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
 1: function DominantPathAnalysis(G)
 2:     dpath = ∅
 3:     for v ∈ accepts do
 4:         calculate dominant path p[v] for v
 5:         if dpath = ∅ then
 6:             dpath = p[v]
 7:         else
 8:             dpath = common_prefix(dpath, p[v])
 9:             if dpath = ∅ then
10:                 return null_string
11:             end if
12:         end if
13:         strings = expand_and_extract(dpath)
14:     end for
15:     return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and computes the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets on the path and extracts the string on the path.
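The core steps of Algorithm 3 can be sketched in Python. This is an illustrative sketch, not Hyperscan's code: each dominant path is assumed to already be computed and is represented as a list of small character-sets (one per vertex), with common_prefix and expand_and_extract named as in the pseudocode:

```python
def common_prefix(p, q):
    # Longest shared leading segment of two vertex paths.
    out = []
    for a, b in zip(p, q):
        if a != b:
            break
        out.append(a)
    return out

def expand_and_extract(path):
    # Expand small character-sets along the path into concrete strings
    # (cross product of the per-vertex character-sets).
    strings = [""]
    for charset in path:
        strings = [s + c for s in strings for c in sorted(charset)]
    return strings

def dominant_path_analysis(dominant_paths):
    # dominant_paths: the dominant path p[v] for each accept state v.
    dpath = None
    for p in dominant_paths:
        dpath = p if dpath is None else common_prefix(dpath, p)
        if not dpath:
            return []  # no common prefix: no string to extract
    return expand_and_extract(dpath)
```

For example, two accept states whose dominant paths share the prefix a·b yield the single extracted string "ab".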

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
 1: function DominantRegionAnalysis(G)
 2:     acyclic_g = build_acyclic(G)
 3:     Gt = build_topology_order(acyclic_g)
 4:     candidate = {q0}
 5:     it = begin(Gt)
 6:     while it ≠ end(Gt) do
 7:         if isValidCut(candidate) then
 8:             setRegion(candidate)
 9:             initializeCandidate(candidate)
10:         else
11:             addToCandidate(it)
12:             it = it + 1
13:         end if
14:     end while
15:     setRegion(candidate)
16:     Merge regions connected with back edges
17:     strings = expand_and_extract(regions)
18:     return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. It then adds each vertex of the graph into a candidate vertex set and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
 1: function NetworkFlowAnalysis(G)
 2:     for edge ∈ E do
 3:         strings = find_strings(edge)
 4:         scoreEdge(edge, strings)
 5:     end for
 6:     cuts = MinCut(G)
 7:     strings = extract and expand strings from cuts
 8:     return strings
 9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of the string that ends at that edge; so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
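The scoring-plus-min-cut idea can be sketched with a small, self-contained Edmonds-Karp max-flow implementation [25]. The graph, strings, and helper names below are illustrative assumptions, not Hyperscan's actual code; each edge is labeled with the string that ends there, and its capacity is the inverse of the string length:

```python
from collections import deque

def edge_score(string):
    # Inversely proportional to string length: the min cut then prefers
    # to pass through edges carrying long strings.
    return 1.0 / len(string)

def min_cut_strings(n, edges, src, snk):
    """edges: list of (u, v, string). Returns the strings on a min cut."""
    cap = [[0.0] * n for _ in range(n)]
    for u, v, s in edges:
        cap[u][v] += edge_score(s)
    while True:  # Edmonds-Karp: BFS augmenting paths on the residual graph
        parent = [-1] * n
        parent[src] = src
        q = deque([src])
        while q and parent[snk] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    q.append(v)
        if parent[snk] == -1:
            break
        f, v = float("inf"), snk   # bottleneck capacity along the path
        while v != src:
            f = min(f, cap[parent[v]][v])
            v = parent[v]
        v = snk                     # push flow, update residuals
        while v != src:
            u = parent[v]
            cap[u][v] -= f
            cap[v][u] += f
            v = u
    seen = {src}                    # source side of the min cut:
    q = deque([src])                # vertices still reachable in residual
    while q:
        u = q.popleft()
        for v in range(n):
            if v not in seen and cap[u][v] > 1e-12:
                seen.add(v)
                q.append(v)
    return {s for u, v, s in edges if u in seen and v not in seen}

# Hypothetical 4-vertex graph: the cut avoids the short strings "ab"/"ef"
# and instead crosses the edge carrying the long string.
cut = min_cut_strings(
    4, [(0, 1, "ab"), (1, 3, "verylongstring"), (0, 2, "cd"), (2, 3, "ef")],
    src=0, snk=3)
```

In this toy example the returned cut contains "verylongstring" rather than the cheaper-looking short strings, matching the intent described above.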




72 Lessons LearnedWe summarize a few lessons that we learned in the courseof development and commercialization of HyperscanRelease quickly and iterate even with partial featuresupport The difficulty of generalized regex matchingoften led Hyperscan to focus on the delivery of somecapability in partial form From a theoretical standpointit is unsatisfactory that Hyperscan still cannot supportall regexes (even from the subset of rsquotruersquo regexes) andthat the support of regexes with the rsquostart of matchrsquo flagturned on is even smaller However customers can stillfind the system practically useful despite these limitationsDespite the limited pattern support of version 20 andthe problems of ordering and duplicated matches therewas immediate commercial use of the product even inthat early form (use which made subsequent developmentpossible) This lends support to the idea of releasing aMinimal Viable Product early rather than developing aproduct with a long list of features that customers may ormay not wantEvolve new features over several versions at the ex-pense of maintaining multiple code paths Academicsystems are usually built for elegance and to illustratea particular methodology However a commercial sys-tem must stay viable as a product while a new subsystemis built For example Hyperscan maintained both thersquoprefilterrsquo arrangement and the new rsquoRose-basedrsquo decom-position arrangement in the same code base resulting inconsiderable extra complexity However the benefits ofhaving the new rsquoorderedrsquo semantics (with additional pow-ers of support for large patterns due to decomposition)outweighed the complexity cost It was almost impossiblethat a small start-up could have managed to transitionfrom one system to the other in a span of a single releaseor to have simply not meaningfully updated the project foran extended time while working on a substantial update

644 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

Commercial products may need to emphasize less in-teresting subcases of a task or unusual corner casesThere was also considerable commercial pressure to bethe best option at some comparatively degenerate sub-set of regex matching or some relatively rsquohard casersquoCustomers often wanted Hyperscan to function as astring matcher - sometimes even a single-string-at-a-timematcher Other customers wanted high performance de-spite very high regex match rates (for example more than1 match per character) Such demands often force specialoptimizations that lack deep lsquoalgorithmic interestrsquo but arenecessary for commercial successBe cautious of cross-cutting complexity resultingfrom customer API requests One illuminating expe-rience in the delivery of a commercially viable regexmatcher was that customer feature requests for newrsquomodesrsquo or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerablymore complicated (due to a combinatorial explosion of in-teractions between features) while rarely being reused byother customers Features added in the 20 or 30 releaseseries over time were not carried forward to the 40 serieswe found that frequently such features were only used bya single customer (despite being made available to all)

Examples of two such features were precise alive(the ability to tell at any given stream write boundarywhether a pattern might still be able to match) and anad-hoc stream state compression scheme that allowedsome stream states to be discarded if no NFA engineshad started running These features were complicated andsuppressed potential optimizations as well as interactingpoorly with other parts of the system

73 Future DirectionsHyperscan is performance-oriented future developmentin Hyperscan will still focus on delivering the best possi-ble performance especially on upcoming Intel Architec-ture cores featuring new instruction set extensions suchas Vector Byte Manipulation Instructions (VBMI) Im-provement of scanning performance as well as reductionof overheads such as the size of bytecodes size of streamstates and time to compile the pattern matching bytecodeare obvious next steps

Beyond this adding richer functionality including sup-port for currently unsupported constructs such as gener-alized lookaround asserts and possible some level ofsupport for back-references would aid some users Thereis a considerable amount of usage of the lsquocapturingrsquo func-tionality of regexes which Hyperscan does not support atall (an experimental subbranch of Hyperscan not widelyreleased supported capturing functionality for a limited

set of expressions) Hyperscan could be extended to haveenriched semantics to support capturing which would al-low portions of the regexes to lsquocapturersquo parts of the inputthat matched particular parts of the regular expression

8 ConclusionIn this paper we have presented Hyperscan a high-performance regex matching system that suggests a newstrategy for efficient pattern matching We have shownthat the existing prefilter-based DPI suffers from fre-quent executions of unnecessary regex matching Eventhough Hyperscan started with the similar approach ithas evolved to address the limitation over time with novelregex decomposition based on rigorous graph analysesIts performance advantage is further boosted by efficientmulti-string matching and bit-based NFA implementationthat effectively harnesses the capacity of modern CPUHyperscan is open sourced for wider use and it is gen-erally recognized as the state-of-the-art regex matcheradopted by many commercial systems around the world

Acknowledgment

We appreciate valuable feedback by anonymous reviewersof USENIX NSDIrsquo19 as well as our shepherd Vyas SekarWe acknowledge the code contributions over the yearsby the Sensory Networks team Matt Barr Alex Coyteand Justin Viiret This work is in part supported by theICT Research and Development Program of MSIPIITPSouth Korea under projects [2018-0-00693 Developmentof an ultra-low latency user-level transfer protocol] and[2016-0-00563 Research on Adaptive Machine Learn-ing Technology Development for Intelligent AutonomousDigital Companion]

References

[1] Google RE2 httpsgithubcomgooglere2

[2] Hyperscan GitHub Repository httpsgithubcomintelhyperscan

[3] ModSecurity httpswwwmodsecurityorg

[4] nDPI httpswwwntoporgproductsdeep-packet-inspectionndpi

[5] PCRE Manual Pages httpspcreorgpcretxt

[6] PCRE Perl Compatible Regular Expressionshttpswwwpcreorg

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 645

[7] Perl-compatible Regular Expressions (revised APIPCRE2) httpswwwpcreorgcurrentdochtmlindexhtml

[8] Snort Emerging Threats Rules 290httpsrulesemergingthreatsnetopensnort-290rules

[9] Snort Intrusion Detection System httpssnortorg

[10] Suricata Open Source IDS httpsuricata-idsorg

[11] Talos Ruleset httpswwwsnortorgtalos[12] Aho Alfred V and Corasick Margaret J Efficient

String Matching An Aid to Bibliographic SearchCommunications of the ACM 18(6)333ndash340 June1975

[13] Ricardo A Baeza-Yates and Gaston H Gonnet Anew approach to text searching Communications ofthe ACM (CACM) 35(10)74ndash82 1992

[14] Zachary K Baker and Viktor K Prasanna Time andArea Efficient Pattern Matching on FPGAs In Pro-ceedings of the ACMSIGDA International Sympo-sium on Field-Programmable Gate Arrays (FPGA)2004

[15] M Becchi and S Cadambi Memory-efficient reg-ular expression search using state merging In Pro-ceedings of the IEEE International Conference onComputer Communications (INFOCOM) 2007

[16] M Becchi and P Crowley A Hybrid Finite Au-tomaton for Practical Deep Packet Inspection InProceedings of the ACM International Conferenceon Emerging Networking Experiments and Technolo-gies (CoNEXT) 2007

[17] M Becchi and P Crowley An improved algorithmto accelerate regular expression evaluation In Pro-ceedings of the ACMIEEE Symposium on Architec-tures for Networking and Communications Systems(ANCS) 2007

[18] M Becchi and P Crowley A-DFA A Time- andSpace-Efficient DFA Compression Algorithm forFast Regular Expression Evaluation ACM Trans-actions on Architecture and Code Optimization(TACO) 10(1) April 2013

[19] A Bremler-Barr Y Harchol D Hay and Y Ko-ral Deep Packet Inspection as a Service In Pro-ceedings of the ACM International Conference onEmerging Networking Experiments and Technolo-gies (CoNEXT) 2014

[20] Taylor-D E Brodie B and R K Cytron A scalablearchitecture for high-throughput regular expressionpattern matching In Proceedings of the Interna-

tional Symposium on Computer Architecture (ISCA)2006

[21] B Choi J Chae M Jamshed K Park and D HanDFC Accelerating string pattern matching for net-work applications In Proceedings of USENIX Sym-posium on Networked Systems Design and Imple-mentation (NSDI) 2016

[22] Chris Clark Wenke Lee David Schimmel DidierContis Mohamed Koneacute and Ashley Thomas AHardware Platform for Network Intrusion Detectionand Prevention In Proceedings of the Workshop onNetwork Processors Applications (NP3) 2004

[23] Christopher R Clark and David E Schimmel Ef-ficient Reconfigurable Logic Circuits for Match-ing Complex Network Intrusion Detection PatternsIn Proceedings of the International Conference onField-Programmable Logic and Applications (FPL)2003

[24] Beate Commentz-Walter A string matching algo-rithm fast on the average In Proceedings of theColloquium on Automata Languages and Program-ming 1979

[25] Jack Edmonds and Richard M Karp Theoreticalimprovements in algorithmic efficiency for networkflow problems Journal of the ACM 19(2)248ndash2641972

[26] Giordano-S Procissi G Vitucci F Antichi G FicaraD and A Di Petro An improved DFA for fast reg-ular expression matching ACM SIGCOMM Com-puter Communication Review (CCR) 38(5) 2008

[27] VM Glushkov The abstract theory of automataRussian Mathematical Surveys 16(5)1ndash53 1961

[28] Muhammad Asim Jamshed Jihyung Lee Sang-woo Moon Insu Yun Deokjin Kim Sungryoul LeeYung Yi and KyoungSoo Park Kargus a highly-scalable software-based intrusion detection systemIn Proceedings of the ACM Conference on Com-puter and Communications Security (CCS) pages317ndash328 2012

[29] Richard M Karp and Michael O Rabin Efficientrandomized pattern-matching algorithms IBM Jour-nal of Research and Development 31(2)249ndash2601987

[30] T Kojm ClamAV httpwwwclamavnet[31] Smith-R Kong S and C Estan Efficient signature

matching with multiple alphabet compression tablesIn Proceedings of the International ICST Confer-ence on Security and Privacy in CommunicationNetworks (SecureComm) 2008

[32] S Kumar S Dharmapurikar F Yu P Crowley and

646 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association

J Turner Algorithms to accelerate multiple regularexpressions matching for deep packet inspectionIn Proceedings of the ACM SIGCOMM on Datacommunication (SIGCOMM) 2006

[33] Turner-J Kumar S and J Williams Advanced algo-rithms for fast and scalable deep packet inspectionIn Proceedings of the ACMIEEE Symposium onArchitecture for Networking and CommunicationsSystems (ANCS) 2006

[34] Janghaeng Lee Sung Ho Hwang Neungsoo ParkSeong-Won Lee Sunglk Jun and Young Soo KimA High Performance NIDS using FPGA-based Reg-ular Expression Matching In Proceedings of theACM symposium on Applied computing 2007

[35] Abhishek Mitra Walid Najjar and Laxmi BhuyanCompiling PCRE to FPGA for accelerating SnortIDS In Proceedings of the ACMIEEE Symposiumon Architectures for Networking and Communica-tions Systems (ANCS) 2007

[36] J Jang J Truelove D G Andersen S K ChaI Moraru and D Brumley SplitScreen Enablingefficient distributed malware detection In Proceed-ings of the USENIX Symposium on Networked Sys-tems Design and Implementation (NSDI) 2010

[37] E Sitaridi O Polychroniou and K A Ross SIMD-accelerated regular expression matching In Pro-ceedings of the Workshop on Data Management onNew Hardware (DaMoN) 2016

[38] R Smith C Estan and S Jha XFA Faster Sig-nature Matching with Extended Automata In Pro-ceedings of the IEEE Symposium on Security andPrivacy 2008

[39] R Smith C Estan S Jha and S Kong Deflatingthe big bang fast and scalable deep packet inspec-tion with extended finite automata In Proceedingsof the ACM SIGCOMM on Data communication(SIGCOMM) 2008

[40] Y E Yang W Jiang and V K Prasanna Compactarchitecture for high-throughput regular expressionmatching on fpga In Proceedings of the ACMIEEESymposium on Architectures for Networking andCommunications Systems (ANCS) 2008

[41] F Yu Z Chen Y Diao TV Lakshman and R HKats Fast and memory-efficient regular expressionmatching for deep packet inspection In Proceedingsof the ACMIEEE Symposium on Architectures forNetworking and Communications Systems (ANCS)2014

[42] Xiaodong Yu Bill Lin and Michela BecchiRevisiting state blow-up Automatically buildingaugmented-fa while preserving functional equiva-

lence IEEE Journal on Selected Areas in Commu-nications 32(10) October 2014

Appendix

Algorithm 3 Dominant Path Analysis

Require Graph G=(EV)1 function DOMINANTPATHANALYSIS(G)2 d path = 3 for v isin accepts do4 calculate dominant path p[v] for v5 if d path = then6 d path = p[v]7 else8 d path = common_pre f ix(d path p[v])9 if d path = then

10 return null_string11 end if12 end if13 strings = expand_and_extract(d path)14 end for15 return strings16 end function

The dominant path analysis algorithm finds the dom-inant path (p[v]) for every accept state (v) and find thecommon path of all dominant paths The function ex-pand_and_extract() expands small character-sets in thepath and extracts the string on the path

Algorithm 4 Dominant Region Analysis

Require Graph G=(EV)1 function DOMINANTREGIONANALYSIS(G)2 acyclic_g = build_acyclic(G)3 Gt = build_topology_order(acyclic_g)4 candidate = q05 it = begin(Gt)6 while it = end(Gt) do7 if isValidCut(candidate) then8 setRegion(candidate)9 initializeCandidate(candidate)

10 else11 addToCandidate(it)12 it = it +113 end if14 end while15 setRegion(candidate)16 Merge regions connected with back edge17 strings = expand_and_extract(regions)18 return strings19 end function

The dominant region analysis builds an acyclic graphand sorts the vertices by the topological order Then itadds each vertex of the graph into a candidate vertex set

USENIX Association 16th USENIX Symposium on Networked Systems Design and Implementation 647

and sees if the candidate vertex set forms a valid cut Ifso it creates a region It continues to create regions byiterating all vertices Finally it merges the regions byback edges and extracts strings from the merged region

Algorithm 5 Network Flow Analysis

Require Graph G=(EV)1 function NETWORKFLOWANALYSIS(G)2 for edge isin E do3 strings = f ind_strings(edge)4 scoreEdge(edgestrings)5 end for6 cuts = MinCut(G)7 strings = extract and expand strings f rom cuts8 return strings9 end function

The network flow analysis assigns a score to every edgeand runs the max-flow min-cut algorithm An edge isassigned a score inverse proportional to the length of astring that ends at the edge So the longer the string isthe smaller the score gets Then the max-flow min-cutalgorithm finds a cut whose edge has the longest strings

648 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association


Commercial products may need to emphasize less interesting subcases of a task or unusual corner cases. There was also considerable commercial pressure to be the best option at some comparatively degenerate subset of regex matching, or at some relatively 'hard case'. Customers often wanted Hyperscan to function as a string matcher, sometimes even a single-string-at-a-time matcher. Other customers wanted high performance despite very high regex match rates (for example, more than 1 match per character). Such demands often force special optimizations that lack deep 'algorithmic interest' but are necessary for commercial success.

Be cautious of cross-cutting complexity resulting from customer API requests. One illuminating experience in the delivery of a commercially viable regex matcher was that customer feature requests for new 'modes' or unusual calls at the API level resulted in cross-cutting complexity that made the code base considerably more complicated (due to a combinatorial explosion of interactions between features), while rarely being reused by other customers. Features added in the 2.0 or 3.0 release series over time were not carried forward to the 4.0 series; we found that frequently such features were only used by a single customer (despite being made available to all).

Two examples of such features were 'precise alive' (the ability to tell, at any given stream write boundary, whether a pattern might still be able to match) and an ad-hoc stream state compression scheme that allowed some stream states to be discarded if no NFA engines had started running. These features were complicated and suppressed potential optimizations, as well as interacting poorly with other parts of the system.

7.3 Future Directions

Hyperscan is performance-oriented; future development in Hyperscan will still focus on delivering the best possible performance, especially on upcoming Intel Architecture cores featuring new instruction set extensions such as Vector Byte Manipulation Instructions (VBMI). Improvement of scanning performance, as well as reduction of overheads such as the size of bytecodes, the size of stream states, and the time to compile the pattern matching bytecode, are obvious next steps.

Beyond this, adding richer functionality, including support for currently unsupported constructs such as generalized lookaround asserts and possibly some level of support for back-references, would aid some users. There is a considerable amount of usage of the 'capturing' functionality of regexes, which Hyperscan does not support at all (an experimental subbranch of Hyperscan, not widely released, supported capturing functionality for a limited set of expressions). Hyperscan could be extended to have enriched semantics to support capturing, which would allow portions of the regexes to 'capture' the parts of the input that matched particular parts of the regular expression.

8 Conclusion

In this paper, we have presented Hyperscan, a high-performance regex matching system that suggests a new strategy for efficient pattern matching. We have shown that existing prefilter-based DPI suffers from frequent executions of unnecessary regex matching. Even though Hyperscan started with a similar approach, it has evolved to address this limitation over time with novel regex decomposition based on rigorous graph analyses. Its performance advantage is further boosted by efficient multi-string matching and a bit-based NFA implementation that effectively harnesses the capacity of modern CPUs. Hyperscan is open sourced for wider use, and it is generally recognized as a state-of-the-art regex matcher adopted by many commercial systems around the world.

Acknowledgment

We appreciate the valuable feedback from the anonymous reviewers of USENIX NSDI '19 as well as from our shepherd, Vyas Sekar. We acknowledge the code contributions over the years by the Sensory Networks team: Matt Barr, Alex Coyte, and Justin Viiret. This work is in part supported by the ICT Research and Development Program of MSIP/IITP, South Korea, under projects [2018-0-00693, Development of an ultra-low latency user-level transfer protocol] and [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

Appendix

Algorithm 3 Dominant Path Analysis

Require: Graph G = (V, E)
 1: function DOMINANTPATHANALYSIS(G)
 2:     dpath = ∅
 3:     for v ∈ accepts do
 4:         calculate dominant path p[v] for v
 5:         if dpath = ∅ then
 6:             dpath = p[v]
 7:         else
 8:             dpath = common_prefix(dpath, p[v])
 9:             if dpath = ∅ then
10:                 return null_string
11:             end if
12:         end if
13:         strings = expand_and_extract(dpath)
14:     end for
15:     return strings
16: end function

The dominant path analysis algorithm finds the dominant path (p[v]) for every accept state (v) and finds the common prefix of all dominant paths. The function expand_and_extract() expands small character-sets in the path and extracts the string on the path.
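As a concrete illustration, the sketch below (plain Python, not Hyperscan's implementation) computes dominator sets over a Glushkov-style NFA in which every state is assumed to carry a single character label, orders each accept state's dominators into its dominant path, and keeps the common prefix across all accept states. Character-set expansion is omitted, and the graph, `labels` mapping, and helper names are all hypothetical.

```python
def dominators(graph, start):
    """dom[v] = set of vertices appearing on every path from start to v."""
    nodes = set(graph)
    preds = {v: [u for u in graph if v in graph[u]] for v in nodes}
    dom = {v: set(nodes) for v in nodes}
    dom[start] = {start}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for v in nodes - {start}:
            if not preds[v]:
                continue
            new = set.intersection(*(dom[p] for p in preds[v])) | {v}
            if new != dom[v]:
                dom[v], changed = new, True
    return dom

def dominant_path(dom, v):
    # Dominators of v form a chain; dominator-set size orders that chain.
    return sorted(dom[v], key=lambda d: len(dom[d]))

def dominant_path_analysis(graph, start, accepts, labels):
    """Return the literal spelled by the common prefix of all dominant paths."""
    dom = dominators(graph, start)
    dpath = None
    for v in accepts:
        p = dominant_path(dom, v)
        if dpath is None:
            dpath = p
        else:  # keep only the shared prefix of the two paths
            common = []
            for x, y in zip(dpath, p):
                if x != y:
                    break
                common.append(x)
            dpath = common
        if not dpath:
            return None  # no common string to extract
    # Extract the literal along the path (the start state is unlabeled).
    return "".join(labels[v] for v in dpath if v != start)
```

For an NFA equivalent to /foo(bar|baz)/ with accept states r and z, the common prefix of the two dominant paths spells the literal "fooba".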

Algorithm 4 Dominant Region Analysis

Require: Graph G = (V, E)
 1: function DOMINANTREGIONANALYSIS(G)
 2:     acyclic_g = build_acyclic(G)
 3:     Gt = build_topology_order(acyclic_g)
 4:     candidate = {q0}
 5:     it = begin(Gt)
 6:     while it ≠ end(Gt) do
 7:         if isValidCut(candidate) then
 8:             setRegion(candidate)
 9:             initializeCandidate(candidate)
10:         else
11:             addToCandidate(it)
12:             it = it + 1
13:         end if
14:     end while
15:     setRegion(candidate)
16:     Merge regions connected with back edges
17:     strings = expand_and_extract(regions)
18:     return strings
19: end function

The dominant region analysis builds an acyclic graph and sorts the vertices in topological order. Then it adds each vertex of the graph into a candidate vertex set and checks whether the candidate vertex set forms a valid cut. If so, it creates a region. It continues to create regions by iterating over all vertices. Finally, it merges the regions connected by back edges and extracts strings from the merged regions.
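The loop structure can be sketched as follows (illustrative Python, assuming the input graph is already acyclic). Hyperscan's actual isValidCut test is more involved; here a candidate set is treated as a valid cut when all of its outgoing boundary edges converge on a single successor, and the back-edge merging step is omitted.

```python
from collections import defaultdict

def topo_order(graph):
    """Kahn's algorithm over an acyclic graph (dict: vertex -> successors)."""
    indeg = defaultdict(int)
    for u in graph:
        for v in graph[u]:
            indeg[v] += 1
    order = []
    frontier = [u for u in graph if indeg[u] == 0]
    while frontier:
        u = frontier.pop()
        order.append(u)
        for v in graph[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    return order

def crossing_targets(graph, cand):
    """Heads of edges that leave the candidate set."""
    return {v for u in cand for v in graph[u] if v not in cand}

def region_analysis(graph):
    order = topo_order(graph)
    regions, cand, i = [], set(), 0
    while i < len(order):
        # Simplified validity test: every boundary edge converges on a
        # single successor, so the graph splits cleanly at this point.
        if cand and len(crossing_targets(graph, cand)) == 1:
            regions.append(cand)   # setRegion(candidate)
            cand = set()           # initializeCandidate(candidate)
        else:
            cand.add(order[i])     # addToCandidate(it)
            i += 1
    if cand:
        regions.append(cand)       # final setRegion after the loop
    return regions
```

On an acyclic NFA graph for /foo(bar|baz)qux/-style patterns, this groups the alternation states into one multi-vertex region while the linear prefix vertices each form their own region.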

Algorithm 5 Network Flow Analysis

Require: Graph G = (V, E)
 1: function NETWORKFLOWANALYSIS(G)
 2:     for edge ∈ E do
 3:         strings = find_strings(edge)
 4:         scoreEdge(edge, strings)
 5:     end for
 6:     cuts = MinCut(G)
 7:     strings = extract and expand strings from cuts
 8:     return strings
 9: end function

The network flow analysis assigns a score to every edge and runs the max-flow min-cut algorithm. An edge is assigned a score inversely proportional to the length of a string that ends at the edge, so the longer the string, the smaller the score. The max-flow min-cut algorithm then finds a cut whose edges carry the longest strings.
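The scoring and cut steps map onto textbook max-flow. The sketch below (illustrative, not Hyperscan's code) gives each edge a capacity of 1/len(string) for a hypothetical set of extracted strings, runs an Edmonds-Karp max-flow, and reads the min cut off the residual graph; the resulting cut crosses the edges carrying the longest strings.

```python
from collections import defaultdict, deque

def min_cut(edges, s, t):
    """Edges crossing the minimum s-t cut; edges = {(u, v): capacity}."""
    res = defaultdict(dict)                 # residual capacities
    for (u, v), c in edges.items():
        res[u][v] = res[u].get(v, 0.0) + c
        res[v].setdefault(u, 0.0)           # reverse residual edge
    while True:                             # Edmonds-Karp: BFS augmenting paths
        parent, q = {s: None}, deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            break                           # no augmenting path left
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        b = min(res[u][v] for u, v in path)
        for u, v in path:                   # push the bottleneck flow
            res[u][v] -= b
            res[v][u] += b
    # The s-side of the cut = vertices still reachable in the residual graph.
    seen, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v, c in res[u].items():
            if c > 1e-12 and v not in seen:
                seen.add(v)
                q.append(v)
    return [(u, v) for (u, v) in edges if u in seen and v not in seen]

# Hypothetical strings ending at each edge; capacity = 1 / string length,
# so the min cut prefers edges that carry long strings.
strings = {("s", "a"): "foo", ("s", "b"): "barb",
           ("a", "t"): "quux2", ("b", "t"): "xy"}
caps = {e: 1.0 / len(w) for e, w in strings.items()}
cut = min_cut(caps, "s", "t")
# cut crosses ("s", "b") and ("a", "t"): the longest string on each path.
```

The cut with the smallest total score is exactly the set of edges whose extracted strings are longest, which is what the decomposition wants to hand to the string matcher.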

648 16th USENIX Symposium on Networked Systems Design and Implementation USENIX Association
