
A Fast Regular Expression Indexing Engine

Junghoo “John” Cho (UCLA)

Sridhar Rajagopalan (IBM)


Problem

How can we match a regular expression fast?
  Large text corpus
  Several days to match a simple regular expression!

Our solution
  Use an index!


Motivation

Advanced search interface
  What is the middle name of Thomas Edison?
  State-of-the-art (keyword-based): Thomas Edison
  Regular expression: Thomas [a-z]+ Edison

Data extraction [Brin 98]


Outline

Index key selection
  Useful gram
  Algorithm for key selection
Other issues
Experiments


Motivating example

All mp3 URLs on the Web:
  <a href=("|')?.*\.mp3("|')?>

Every matching string contains mp3.

Questions: Should we index “mp3”? Should we index “<a href=”?
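The observation that every match contains mp3 can be sketched as a cheap substring filter placed in front of the regular expression. This is only an illustration: the sample pages and the name `find_mp3_links` are assumptions, and straight quotes stand in for the typographic quotes in the slide's pattern.

```python
import re

# mp3-link pattern from the slide, written with straight quotes.
MP3_LINK = re.compile(r"""<a href=("|')?.*\.mp3("|')?>""")

def find_mp3_links(pages):
    """Return the positions of pages that contain an mp3 link."""
    hits = []
    for i, page in enumerate(pages):
        # Every match must contain "mp3", so pages without it are skipped
        # before the (more expensive) regular expression is tried.
        if "mp3" not in page:
            continue
        if MP3_LINK.search(page):
            hits.append(i)
    return hits

# Hypothetical sample pages.
sample_pages = [
    '<a href="song.mp3">song</a>',
    '<a href="index.html">home</a>',
    'I like mp3 players',
]
print(find_mp3_links(sample_pages))
```

The third page passes the substring filter but fails the regular expression, which shows why the filter only narrows the candidates and the final match is still needed.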


What index entries?

Solution 1: Inverted index (English words)
  Cannot handle many regular expressions

Solution 2: k-grams for k = 1, 2, …, 10
  Index too large (10 times as large!)

Our solution: multigram


Main idea

“mp3” is helpful. Not many pages have it.

“<a href=” is not. All pages have it.

We index only “useful” grams.


Gram selectivity

Sel(x): selectivity of gram x
  Sel(x) = M(x)/N
  M(x): number of pages containing gram x
  N: total number of pages

C-useful gram: any gram x with Sel(x) < C
  C: system parameter (random access vs. sequential access time)

We index only “C-useful” grams.
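As a small worked example of the definition, the sketch below computes Sel(x) = M(x)/N over a toy corpus and applies the C-useful test. The pages, the threshold value 0.5, and the function names are illustrative assumptions.

```python
def gram_selectivity(pages, gram):
    """Sel(x) = M(x) / N: the fraction of pages that contain the gram."""
    m = sum(1 for page in pages if gram in page)  # M(x)
    return m / len(pages)                         # divided by N

def is_useful(pages, gram, c):
    """A gram is C-useful when its selectivity is below the threshold C."""
    return gram_selectivity(pages, gram) < c

# Hypothetical four-page corpus.
toy_pages = [
    '<a href="a.mp3">',
    '<a href="b.html">',
    '<a href="c.html">',
    '<a href="d.html">',
]
C = 0.5  # assumed threshold, chosen only for this example

print(gram_selectivity(toy_pages, "mp3"), is_useful(toy_pages, "mp3", C))
print(gram_selectivity(toy_pages, "<a href="), is_useful(toy_pages, "<a href=", C))
```

Here "mp3" appears in one page out of four and is useful, while "<a href=" appears in every page and is not.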


Minimal useful gram

Example: “Unix is great”
  If “Unix” is useful, then “Unix i”, “Unix is”, “Unix is g”, … are all useful.
  “Unix” is the minimal useful gram.

We index only the minimal useful gram.


Advantages

Versatile
  We can look up “Unix” for all grams like “Unix i”, “Unix is g”, etc.

Easy to find
  Reduction to the “A priori” algorithm

Index size guarantee


Algorithm

Main idea:
  If “abcde” is a minimal useful gram, then “abcd” is not useful.
  If “abcd” is not useful, then “a”, “ab”, “abc” are not useful.
  Minimal useful gram identification is equivalent to useless gram identification.


A priori algorithm

Useless gram identification
  Find all sequences of characters that occur in more than k pages

A priori algorithm
  Find all sets of items that occur in more than k baskets

Fewer than 4 scans of the corpus to find all minimal useful grams.
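The level-wise idea can be sketched as follows: only grams whose one-character-shorter prefix is useless are extended, and an extended gram that turns out to be useful is recorded as minimal. This is a simplified illustration, not the authors' implementation: it makes one pass over the pages per gram length (so it does not reproduce the scan count quoted above), and `max_len` and the function name are assumptions.

```python
from collections import defaultdict

def minimal_useful_grams(pages, c, max_len=10):
    """Level-wise search for useful grams whose prefixes are all useless."""
    n = len(pages)
    useless = {""}      # the empty gram occurs in every page, so it is useless
    minimal = set()
    for k in range(1, max_len + 1):
        # Count pages containing each k-gram that extends a useless (k-1)-gram.
        page_count = defaultdict(int)
        for page in pages:
            seen = set()
            for i in range(len(page) - k + 1):
                gram = page[i:i + k]
                if gram[:-1] in useless:
                    seen.add(gram)
            for gram in seen:
                page_count[gram] += 1
        next_useless = set()
        for gram, m in page_count.items():
            if m / n < c:
                minimal.add(gram)       # useful, and its prefix is useless
            else:
                next_useless.add(gram)  # still useless: extend in the next round
        if not next_useless:
            break
        useless = next_useless
    return minimal

# Hypothetical corpus: "a" occurs everywhere, the second letters are rare.
demo_pages = ["ab", "ac", "ad", "ae"]
print(sorted(minimal_useful_grams(demo_pages, c=0.5)))
```

Because a useful gram is never extended, no gram in the result is a prefix of another one.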


Prefix free set

A set of grams X is prefix free if no x ∈ X is a prefix of any other x’ ∈ X.
  e.g.) X = {ab, ac, abc} is not prefix free.

A set of minimal useful grams is a prefix free set.
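The definition can be checked mechanically. The sketch below relies on a property of lexicographic order: if one gram is a prefix of another, it is also a prefix of its immediate successor after sorting. The function name is an assumption.

```python
def is_prefix_free(grams):
    """True if no gram in the set is a prefix of a different gram in the set."""
    ordered = sorted(set(grams))
    # After sorting, a gram that is a prefix of any other gram is also
    # a prefix of the gram right after it, so adjacent pairs suffice.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))

print(is_prefix_free({"ab", "ac", "abc"}))  # the example from the slide
print(is_prefix_free({"ab", "ac", "bc"}))
```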


Size of a prefix free set

Let X be a prefix-free set of grams extracted from corpus D. Then

  |X| ≤ |D|
  |X|: number of grams in X
  |D|: number of characters in D

The size of an index with minimal useful grams does not exceed the size of the corpus!
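One way to see why such a bound holds: two different grams that start at the same character position would have to be prefixes of one another, so in a prefix-free set at most one gram can start at each position of the corpus. The sketch below checks this on a toy string; the corpus, the grams, and the position-level counting are illustrative assumptions rather than the index layout of the original system.

```python
def count_gram_occurrences(corpus, grams):
    """Count occurrences of the grams, checking one-per-position on the way."""
    total = 0
    for start in range(len(corpus)):
        starting_here = [g for g in grams if corpus.startswith(g, start)]
        # For a prefix-free set, two grams cannot start at the same position.
        assert len(starting_here) <= 1
        total += len(starting_here)
    return total

toy_corpus = "abcabd"            # |D| = 6 characters
toy_grams = {"ab", "ca", "bd"}   # a prefix-free set
occurrences = count_gram_occurrences(toy_corpus, toy_grams)
print(occurrences, "<=", len(toy_corpus))
```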


Shortest suffix gram

Example: <a href="k
  If ="k is useful, then <a href="k, a href="k, href="k, etc. are all useful.
  ="k: shortest suffix gram

We index only the shortest suffix gram.

Pre-suf shell
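The suffix reduction can be sketched as a filter that drops every gram having a shorter suffix already in the set. The slides do not spell out how the pre-suf shell is built, so this is only an assumed illustration of the suffix step; the function name and sample grams are hypothetical.

```python
def shortest_suffix_grams(grams):
    """Keep only grams that have no proper suffix in the same set."""
    grams = set(grams)
    kept = set()
    for g in grams:
        # g[i:] for i >= 1 enumerates the proper (shorter) suffixes of g.
        has_shorter_suffix = any(g[i:] in grams for i in range(1, len(g)))
        if not has_shorter_suffix:
            kept.add(g)
    return kept

useful_grams = {'<a href="k', 'a href="k', 'href="k', '="k', 'mp3'}
print(sorted(shortest_suffix_grams(useful_grams)))
```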


Other issues

Given a regular expression, how do we find the index entries to look up?

Optimization?
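A minimal sketch of one possible lookup strategy is shown below. The slides leave the query planner open, so everything here is an assumption: the literal strings that every match must contain are supplied by hand, the index is a plain dictionary from gram to a set of page ids, and the names `candidate_pages` and `match_with_index` are hypothetical.

```python
import re

def candidate_pages(index, required_literals, all_page_ids):
    """Intersect postings of indexed grams found inside required literals."""
    candidates = set(all_page_ids)
    for literal in required_literals:
        for gram, postings in index.items():
            # If the literal must occur in a match, so must every gram inside it.
            if gram in literal:
                candidates &= postings
    return candidates

def match_with_index(pages, index, pattern, required_literals):
    """Run the regular expression only on pages that survive the index lookup."""
    regex = re.compile(pattern)
    ids = candidate_pages(index, required_literals, range(len(pages)))
    return sorted(i for i in ids if regex.search(pages[i]))

# Hypothetical pages and a hand-built index.
lookup_pages = ["<a href=x.mp3>", "<a href=y.html>", "mp3 player"]
lookup_index = {"mp3": {0, 2}, "html": {1}}
print(match_with_index(lookup_pages, lookup_index, r"<a href=.*\.mp3>", [".mp3"]))
```

When no indexed gram occurs in any required literal, every page remains a candidate and the lookup degenerates to scanning the whole corpus.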


Experiments

Half million Web documents

Comparison
  Raw scanning
  Multigram index
  Complete: k-grams for k = 1, 2, …, 10

Benchmark queries
  No standard
  Collected from IBM Almaden researchers


Example queries (simplified)

MP3 URLs: <a href=.*\.mp3>
Invalid HTML: <[^>]*<
Phone numbers: (\d\d\d) \d\d\d-\d\d\d\d
PowerPC chip number: (xpc|mpc)[0-9]+[0-9a-z]+
Middle name of Clinton: William [a-z]+ Clinton


Evaluation metrics

Index construction time
Index size
Matching time
  Overall throughput
  Response time for first 10 matches


Construction time & Index size

                     Complete          Multigram
Construction time    63 hours          6 hours
No. of keys          103,151,302       64,656
No. of postings      18,193,048,399    820,396,717

An order of magnitude reduction in index size
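The reduction can be made concrete by dividing the Complete figures by the Multigram figures reported on this slide. The sketch below does only that arithmetic; the function and dictionary names are assumptions.

```python
# Figures as reported on the slide.
COMPLETE = {"construction_hours": 63, "keys": 103_151_302, "postings": 18_193_048_399}
MULTIGRAM = {"construction_hours": 6, "keys": 64_656, "postings": 820_396_717}

def reduction_factors():
    """Ratio of Complete to Multigram for each reported measure."""
    return {name: COMPLETE[name] / MULTIGRAM[name] for name in COMPLETE}

for name, factor in reduction_factors().items():
    print(f"{name}: {factor:.1f}x")
```

The postings ratio is the one that reflects index size; the key count shrinks by a much larger factor.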


Matching time

On average, Complete is only 33% faster than Multigram.

Query      Scanning    Complete    Multigram
mp3        573 sec     11 sec      15 sec
PowerPC    548 sec     1 sec       2 sec
phone      540 sec     540 sec     540 sec


Result size & Improvement

[Figure: improvement from scanning (100% to 100000%) plotted against result size (1 to 1000000)]


Related work

Suffix tree
  Baeza-Yates et al., JACM, 1998
  Main-memory based

Disk-based string index
  Cooper et al., VLDB, 2001
  Good for exact string matching

Inverted index
  English words


Conclusion

Fast matching of regular expressions

Multigram index
  Small size
  Significant improvement in matching time

Future work
  Optimization?
