Sonoma State University Computer Science Colloquium 03/06/2008 © 2008 IBM Corporation Declarative Information Extraction The Avatar Group IBM Almaden Research Center Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu


Page 1:

Declarative Information Extraction

The Avatar Group IBM Almaden Research Center

Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, and Huaiyu Zhu

Page 2:

Motivation

Where is the party?

Hmmm… I don’t know. Let me check my email.

John and Jane are going to a salsa party tonight! But …

Page 3:

Where is the party?

Hi guys,

We are planning a salsa party tonight starting at 10:00pm for our class at Miami Beach Club,
175 San Pedro Square
San Jose, CA 95109

Whoever who is interested, please let me know so we can organize some car-pooling.

-Juan

PS: you can call me at 408.123.4567 if needed.

Search for “salsa address”: 0 emails found
Search for “salsa”: 100 emails found
Search for “address”: 0 emails found

The address of the party!

But the email itself does not contain the word “address”!

Page 4:

Information Extraction

Distill structured data from unstructured and semi-structured text
– E.g. extracting phone numbers from emails, extracting person names from the web

Hi guys,

We are planning a salsa party tonight starting at 10:00pm for our salsa class at Miami Beach Club,

175 San Pedro Square San Jose, CA 95109

Whoever who is interested, please let me know so we can organize some car-pooling.

-Juan

PS: you can call me at 408.123.4567 if needed.

Event         Address
salsa party   175 San Pedro Square ...
...           ...

SELECT Address FROM EVENTS WHERE event = 'salsa party'

175 San Pedro Square …

Exploit the extracted data in your applications
– E.g. for search, for advertisement
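As a rough illustration of distilling a structured record from the email above, here is a minimal Python sketch built on two regular expressions. The patterns, the `extract` function, and the record layout are assumptions made up for this example and tuned to this one message; they do not come from the slides.

```python
import re

EMAIL = """Hi guys,
We are planning a salsa party tonight starting at 10:00pm for our salsa class at Miami Beach Club,
175 San Pedro Square San Jose, CA 95109
Whoever who is interested, please let me know so we can organize some car-pooling.
-Juan
PS: you can call me at 408.123.4567 if needed."""

# Hypothetical patterns: good enough for this message, not for real mail.
PHONE = re.compile(r"\b\d{3}[.-]\d{3}[.-]\d{4}\b")
STREET = re.compile(r"\b\d+ (?:[A-Z][a-z]+ )+(?:Square|Street|Avenue|Road)\b")


def extract(text):
    """Return one structured record (a dict) distilled from free text."""
    phone = PHONE.search(text)
    street = STREET.search(text)
    return {
        "event": "salsa party" if "salsa party" in text else None,
        "address": street.group(0) if street else None,
        "phone": phone.group(0) if phone else None,
    }


record = extract(EMAIL)
print(record)
```

Once the record exists, a query such as the SELECT statement on the slide can find the address even though the email never contains the word "address".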

Page 5:

Revisit: Where is the Party?

salsa address

San Jose, CA 95109

Lotus Notes 8.01 Live Text

Page 6:

Other Commercial Applications

Page 7:

And many others

Literature Citations / Research Communities
– DBLife
– Google Scholar

Terminology Extraction
Document Summarization
Life Science
– E.g. Gene Sequence Extraction, Protein Interaction Extraction

… …

As the amount of data in text explodes, information extraction is becoming increasingly important!

Page 8:

Basic Terminology

[Diagram: documents from a Data Repository flow into several Annotators; the annotations they produce feed Higher Level Applications.]

Annotators: programs used to extract structured data
Annotations: structured data extracted by annotators

Page 9:

Background: Avatar

Working on information extraction (IE) since 2003

Main goals:
– Extract structured information from text
– Build a system that can scale IE to real enterprise apps
– Build new enterprise applications that leverage IE

Page 10:

Evolution of the Avatar IE System

[Timeline with year marks 2004, 2005, 2006, 2007, 2008]

Systems:
– Custom Code
– RAP (CPSL-style cascading grammar system)
– RAP++ (RAP + extensions outside the scope of grammars)
– System T (algebraic information extraction system)

Evolutionary Triggers:
– Large number of annotators
– Diverse data sets, complex extraction tasks
– Performance, expressivity

Page 11:

The Custom Code Era

Page 12:

Extracting Information with Custom Code

“It’s just pattern matching”
– Use scripts and regular expressions

Then reality sets in…
– Dozens of rules, even for simple concepts
– Many special cases
– Convoluted logic
– Painfully slow code

Page 13:

The Age of Cascading Grammars

Page 14:

Historical Perspective

MUC (Message Understanding Conference) – 1987 to 1997
– Competition-style conferences organized by DARPA
– Shared data sets and performance metrics
  • News articles, Radio transcripts, Military telegraphic messages

Classical IE Tasks
– Entity and Relationship/Link extraction
– Event detection, sentiment mining etc.
– Entity resolution/matching

Several IE systems were built
– FRUMP [DeJong82], CIRCUS/AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]

Page 15:

Cascading Finite-state Grammars

Most IE systems share a common formalism
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over the lexical features of these tokens

Several levels of processing: Cascading Grammars

CPSL
– A standard language for specifying cascading grammars
– Created in 1998

Several known implementations
– TextPro: reference implementation of CPSL by Doug Appelt
– JAPE (Java Annotation Pattern Engine)
  • Part of the GATE NLP framework
  • Under active consideration for commercial use by several companies

Page 16:

Cascading Grammars By Example

Level 0 (Tokenize):
… sed John Smith at 555-1212 hendrerit …

Level 1:
Token[~ “John | Smith| …”]+ → Name
Token[~ “[1-9]\d{2}-\d{4}”] → Phone
… sed <Name> at <Phone> hendrerit …

Level 2:
Name Token[~ “at”] Phone → PersonPhone
… sed <PersonPhone>, hendrerit …
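The three levels of this example can be sketched in Python roughly as follows. This is only an illustration of the cascading idea, not CPSL or JAPE syntax: the whitespace tokenizer, the two-entry name dictionary, and the function names are assumptions, while the phone pattern is the one shown on the slide.

```python
import re

NAME_DICT = {"John", "Smith"}                # stand-in for the "John | Smith | ..." dictionary
PHONE_RE = re.compile(r"[1-9]\d{2}-\d{4}")   # phone pattern from the slide


def tokenize(text):
    """Level 0: split the text into whitespace-delimited tokens."""
    return [("Token", tok) for tok in text.split()]


def level1(tokens):
    """Level 1: label Phone tokens and merge runs of name tokens into one Name."""
    out = []
    for kind, tok in tokens:
        if PHONE_RE.fullmatch(tok):
            out.append(("Phone", tok))
        elif tok in NAME_DICT:
            if out and out[-1][0] == "Name":
                out[-1] = ("Name", out[-1][1] + " " + tok)
            else:
                out.append(("Name", tok))
        else:
            out.append((kind, tok))
    return out


def level2(annots):
    """Level 2: rewrite the sequence Name, Token "at", Phone as PersonPhone."""
    out, i = [], 0
    while i < len(annots):
        window = annots[i:i + 3]
        kinds = [k for k, _ in window]
        if kinds == ["Name", "Token", "Phone"] and window[1][1] == "at":
            out.append(("PersonPhone", " ".join(t for _, t in window)))
            i += 3
        else:
            out.append(annots[i])
            i += 1
    return out


result = level2(level1(tokenize("sed John Smith at 555-1212 hendrerit")))
print(result)
```

Each level consumes the output sequence of the level below it, so every rule sees exactly one annotation per position.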

Page 17:

Experiences with Cascading Grammars

Benefits
– Big step forward from custom code
– Can express many simple concepts

Drawbacks
– Expressiveness
  • Dealing with overlap
  • Building complex structures
– Performance

Page 18:

Sequencing Overlapping Input Annotations

Example rule from the Band Review:

<ProperNoun> <within 30 characters> <Instrument>
ProperNoun: Regular Expression Match, <[A-Z]\w+(\s[A-Z]\w+)?>
Instrument: Dictionary Match, <d1|d2|…dn>

Overlapping annotations in the input:
– “John Pipe plays the guitar”: ProperNoun (“John Pipe”), Instrument (“Pipe”), Instrument (“guitar”)
– “Marco Doe on the Hammond organ”: ProperNoun (“Marco Doe”), ProperNoun (“Hammond”), Instrument (“Hammond organ”)

Page 19:

Sequencing Overlapping Input Annotations

Possible options
– Pre-specified disambiguation rules (e.g., pick earlier annotation)
– Supply tie-breaking rules for every possible overlap scenario
– Let implementation make an internal non-deterministic choice (as in JAPE, RAP, ..)

Which of the two should we pick?

John Pipe plays the guitar
– ProperNoun Token Token Instrument
– Token Instrument Token Token Instrument

Marco Doe on the Hammond organ
– ProperNoun Token Token Instrument
– ProperNoun Token Token ProperNoun Token

Prefer ProperNoun over Instrument

Over 4.5M blog entries, a choice one way or another on a single rule would change the number of annotations by +/- 25%.

There is no magic!
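To make the effect of a tie-breaking policy concrete, the Python sketch below resolves the overlap in “John Pipe plays the guitar” under two different preferences. The character offsets, the `sequence` function, and the "keep the preferred type first" policy are assumptions chosen for illustration; they are not the behavior of JAPE or RAP.

```python
TEXT = "John Pipe plays the guitar"

# Each annotation is (type, begin, end) with character offsets into TEXT.
ANNOTS = [
    ("ProperNoun", 0, 9),    # "John Pipe"
    ("Instrument", 5, 9),    # "Pipe", a dictionary hit inside the name
    ("Instrument", 20, 26),  # "guitar"
]


def overlaps(a, b):
    """True if the two annotations share at least one character."""
    return a[1] < b[2] and b[1] < a[2]


def sequence(annots, preferred):
    """Drop overlapping annotations, keeping the preferred type, then order by position."""
    ranked = sorted(annots, key=lambda a: (a[0] != preferred, a[1]))
    kept = []
    for a in ranked:
        if not any(overlaps(a, k) for k in kept):
            kept.append(a)
    return [a[0] for a in sorted(kept, key=lambda a: a[1])]


print(sequence(ANNOTS, "ProperNoun"))  # the rule ProperNoun ... Instrument can fire
print(sequence(ANNOTS, "Instrument"))  # no ProperNoun survives, so the rule cannot fire
```

Whichever policy is hard-wired, some rule downstream loses matches it should have had, which is the problem the slide points at.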

Page 20:

Complex Structures

Example: Signature Annotator

Laura Haas, PhD
Distinguished Engineer and Director, Computer Science
Almaden Research Center
408-927-1700
http://www.almaden.ibm.com/cs

Annotations: Person, Organization, Phone, URL

Constraints:
– Start with Person
– At least 1 Phone
– At least 2 of {Phone, Organization, URL}
– End with one of these
– Within 50 tokens

Page 21:

Complex Structures: Existing Solutions

Approximate using regular expressions

Example: Signature
– Rule: (Person Token{,25} Phone (Token{,25} Contact)+) | (Person (Token{,25} Contact)+ Token{,25} Phone (Token{,25} Contact)*)
– Problems:
  • Need to enumerate all possible orders of sub-annotations
    – What if you want at least one phone and one email?
  • Does not restrict total token count

Page 22:

Performance

Level 0:
… sed John Smith at 555-1212 hendrerit …

Level 1:
Token[~ “John | Smith| …”]+ → Name
Token[~ “[1-9]\d{2}-\d{4}”] → Phone
… sed <Name> at <Phone> hendrerit …

Level 2:
Name Token[~ “at”] Phone → PersonPhone
… sed <PersonPhone>, hendrerit …

Each level in a cascading grammar looks at each character in each document

Page 23:

Dawn of Declarative Information Extraction

Page 24:

System-T Architecture

AQL Language: specify annotator semantics declaratively
Optimizer: choose an efficient execution plan that implements semantics
Operator Runtime

Annotation Algebra

Page 25:

Declarative Information Extraction: AQL

SQL-like language for defining annotators

Declarative
– Define basic patterns and the relationships between them
– Let the system worry about the order of operations

Page 26:

AQL Example

select CombineSpans(name.match, instrument.match) as annot
from Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name,
     Dictionary("instr.dict", DocScan.text) instrument
where Follows(0, 30, name.match, instrument.match);

<ProperNoun> <within 30 characters> <Instrument>
ProperNoun: Regular Expression Match
Instrument: Dictionary Match
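For readers who do not know AQL, the query can be approximated in plain Python as below. This is a sketch under assumptions, not the System T runtime: the `INSTRUMENTS` list stands in for the contents of instr.dict (which the slides do not show), and the reading of `Follows(0, 30, a, b)` as "b starts 0 to 30 characters after a ends" is an interpretation of the slide.

```python
import re

NAME_RE = re.compile(r"[A-Z]\w+(\s[A-Z]\w+)?")     # regex from the AQL example
INSTRUMENTS = ["guitar", "Hammond organ", "pipe"]  # hypothetical stand-in for instr.dict


def regex_spans(pattern, text):
    """Regex operator: every match as a (begin, end) span."""
    return [m.span() for m in pattern.finditer(text)]


def dictionary_spans(entries, text):
    """Dictionary operator: case-insensitive whole-word matches of each entry."""
    spans = []
    for entry in entries:
        pattern = r"\b" + re.escape(entry) + r"\b"
        spans.extend(m.span() for m in re.finditer(pattern, text, re.IGNORECASE))
    return sorted(spans)


def follows(lo, hi, first, second):
    """Assumed meaning: `second` begins lo..hi characters after `first` ends."""
    return lo <= second[0] - first[1] <= hi


def band_review(text):
    """Join names with instruments and combine each matching pair into one span."""
    names = regex_spans(NAME_RE, text)
    instruments = dictionary_spans(INSTRUMENTS, text)
    return [text[n[0]:i[1]] for n in names for i in instruments
            if follows(0, 30, n, i)]


print(band_review("John Pipe plays the guitar"))
```

Note that the overlapping "Pipe" dictionary match is simply kept alongside the "John Pipe" name match; the join predicate decides which pairs survive, so no tie-breaking rule is needed.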

Page 27:

Annotation Algebra

Each Operator in the algebra…
– …operates on one or more tuples of annotations
– …produces tuples of annotations

“Document at a time” execution model
– Algebra expression is defined over
  • the current document d
  • annotations defined over d

Algebra expression is evaluated over each document in the corpus individually

Page 28:

Basic Single-Argument Operator

[Diagram: an Operator, configured by Parameters, takes an Input Tuple (Document) and produces Output Tuple 1 (Document, Annotation 1) and Output Tuple 2 (Document, Annotation 2).]

Page 29:

Comparison with Cascading Grammars

Grammar:
…John Smith at 555-1212…
Apply Name Rule, Apply Phone Rule
…<Name> at <Phone>…
Apply PersonPhone
…<PersonPhone>…

Algebra:
…John Smith at 555-1212…
Dictionary: “John”, “Smith”
Block: “John Smith”
Regex: “555-1212”
Join: “John Smith at 555-1212”

Fewer passes over the documents

Page 30:

Revisit Problem of Sequencing Annotations

John Pipe plays the guitar
– ProperNoun (“John Pipe”), Instrument (“Pipe”), Instrument (“guitar”)

Marco Benevento on the Hammond organ
– ProperNoun (“Marco Benevento”), ProperNoun (“Hammond”), Instrument (“Hammond organ”)

Page 31:

Algebra expression for the Rule from Band Review
(Reiss, Raghavan, Krishnamurthy, Zhu and Vaithyanathan, ICDE 2008)

<ProperNoun> <within 30 characters> <Instrument>
ProperNoun: Regular Expression Match
Instrument: Dictionary Match

[Operator tree: Join (followed within 30 characters) over Regular expression (ProperNoun) and Dictionary (Instrument)]

Page 32:

John Pipe plays the guitar Marco Benevento on the Hammond organ

Regex output (ProperNoun):
– doc, John Pipe
– doc, Marco Benevento
– doc, Hammond

Dictionary output (Instrument):
– doc, Pipe
– doc, guitar
– doc, Hammond organ

Join output (ProperNoun <0-30 chars> Instrument):
– doc, John Pipe, guitar
– doc, Marco Benevento, Hammond organ

Page 33:

How is aggregation handled

Laura Haas, PhD
Distinguished Engineer and Director, Computer Science
Almaden Research Center
408-927-1700
http://www.almaden.ibm.com/cs

Annotations: Person, Organization, Phone, URL

Constraints:
– Start with Person
– At least 1 Phone
– At least 2 of {Phone, Organization, URL}
– End with one of these
– Within 50 tokens

Page 34:

Back to signature

[Operator tree: Organization, Phone, and URL annotations are combined by Union and grouped by Block; the resulting blocks are combined with Person annotations by Join to produce Signature.]

Cleaner and potentially faster
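The signature constraints (start with a Person, at least two contact annotations, at least one Phone, everything within 50 tokens) can be written directly as a check rather than as an enumeration of orderings. The Python sketch below is an illustration only: the token offsets are invented, and `signatures` is a hypothetical helper rather than a System T operator.

```python
def signatures(persons, contacts, max_tokens=50):
    """Find signature blocks.

    persons:  list of (begin_tok, end_tok) spans
    contacts: list of (kind, begin_tok, end_tok), kind in {"Phone", "Organization", "URL"}
    Returns (begin_tok, end_tok) spans that start with a Person and end with a contact.
    """
    found = []
    for p_begin, p_end in persons:
        # Contacts that follow the person and keep the whole block within max_tokens.
        nearby = [c for c in contacts
                  if c[1] >= p_end and c[2] - p_begin <= max_tokens]
        kinds = [c[0] for c in nearby]
        if len(nearby) >= 2 and "Phone" in kinds:
            found.append((p_begin, max(c[2] for c in nearby)))
    return found


# Invented token offsets for the Laura Haas signature.
persons = [(0, 2)]
contacts = [("Organization", 12, 15), ("Phone", 15, 16), ("URL", 16, 17)]
print(signatures(persons, contacts))
```

Asking for "at least one phone and one email" would be one more condition on `kinds`, with no new orderings to spell out.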

Page 35:

Performance

Performance issues with grammars
– Complete pass through tokens for each rule
– Many of these passes are wasted work

Dominant approach: Make each pass go faster
– Doesn’t solve root problem!

Algebraic approach: Build a query optimizer!

Page 36:

Optimizations

Query optimization is a familiar topic in databases

What’s different in text?
– Operations over sequences and texts
– Document boundaries
– Costs concentrated in extraction operators (dictionary, regular expression)

Can leverage these characteristics
– Text-specific optimizations
– Significant performance improvements

Page 37:

Optimization Example

<ProperNoun> <within 30 characters> <Instrument>

… John Pipe played the guitar. …

Regex match (“John Pipe”), 0-30 characters, Dictionary match (“guitar”)

Page 38:

Classic Query Optimization

<ProperNoun> <Instrument> (Followed within 30 characters)

Plan A: Join
Plan B: start from <ProperNoun>, find <Instrument> within 30 characters (consider text to the right)
Plan C: start from <Instrument>, find <ProperNoun> within 30 characters (consider text to the left)

Page 39:

Example of Text-Specific Optimization: Conditional Evaluation (CE)

– Leverage document-at-a-time processing
– Don’t evaluate the inner operand of a join if the outer has no results
– Costing plans is challenging

…John Smith at 555-1212…

[Operator tree: CEJoin over Dictionary (“John Smith”) and Regex (“555-1212”), producing “John Smith at 555-1212”]

Don’t evaluate this Regex when there are no dictionary matches.
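A minimal Python sketch of conditional evaluation follows, assuming a one-entry name dictionary, a simplified phone pattern, and a 10-character join window (all three are made up for illustration). The `regex_runs` counter exists only to show how often the inner operand is actually evaluated.

```python
import re

PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")  # simplified phone pattern (assumption)
NAMES = ["John Smith"]                     # hypothetical dictionary


def dictionary_matches(text):
    """Outer operand: spans of dictionary entries found in the document."""
    return [m.span() for name in NAMES
            for m in re.finditer(re.escape(name), text)]


def ce_join(docs):
    """Join names with phones, one document at a time.

    The regex (inner operand) is skipped for documents with no name match.
    """
    results, regex_runs = [], 0
    for doc in docs:
        names = dictionary_matches(doc)
        if not names:
            continue                       # conditional evaluation: inner never runs
        regex_runs += 1
        phones = [m.span() for m in PHONE_RE.finditer(doc)]
        for n in names:
            for p in phones:
                if 0 <= p[0] - n[1] <= 10:
                    results.append(doc[n[0]:p[1]])
    return results, regex_runs


docs = ["call John Smith at 555-1212 today",
        "support line 555-0000",
        "no contact info"]
print(ce_join(docs))
```

The shortcut is only valid because execution is document at a time: an empty outer result for one document says nothing about the next one.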

Page 40:

Experimental Results (Band Review Annotator)

[Bar chart: Annotator Running Time. Y-axis: Running Time (sec), 0 to 30000. Bars: GRAMMAR, ALGEBRA (Baseline), ALGEBRA (Optimized). Callouts: Classical query optimization; Text-specific optimizations.]

Page 41:

IOPES: Extracting Relationships and Composite Entities

IOPES = IBM Omnifind Personal Email Search
– Extract entities such as email address, url
– Associations such as name ↔ phone number
– Complex entities like conference schedules, directions, signature blocks

Page 42:

Thank you!

For more information…
– Try out IOPES
  • http://www.alphaworks.ibm.com/tech/emailsearch
– Avatar Project home page
  • http://almaden.ibm.com/cs/projects/avatar/
– Contact me
  • [email protected]

Page 43:

Backup Slides

Page 44:

Block Operator ()

[Diagram: several Input annotations that lie close together in the text are grouped into one Block.]

– Constraint on distance between inputs
– Constraint on number of inputs
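The two constraints on this slide suggest a simple grouping procedure, sketched below in Python. It is one plausible reading, not the System T definition: the function name, the `max_gap` and `min_count` parameters, and the choice to emit only maximal blocks are assumptions, and the inputs are assumed not to overlap one another.

```python
def block(spans, max_gap, min_count):
    """Group (begin, end) spans into blocks.

    Consecutive spans stay in the same block while the gap between them is at
    most max_gap; only blocks with at least min_count members are returned.
    """
    blocks, current = [], []
    for span in sorted(spans):
        if current and span[0] - current[-1][1] > max_gap:
            if len(current) >= min_count:
                blocks.append((current[0][0], current[-1][1]))
            current = []
        current.append(span)
    if len(current) >= min_count:
        blocks.append((current[0][0], current[-1][1]))
    return blocks


# Three nearby inputs form one block; the distant fourth input is left out.
print(block([(0, 5), (8, 12), (15, 20), (100, 105)], max_gap=5, min_count=2))
```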

Page 45:

Conditional Evaluation (CE)

Leverage document-at-a-time processing

Don’t evaluate the inner operand of a join if the outer has no results

Costing plans is challenging

…John Smith at 555-1212…

[Operator tree: CEJoin over Dictionary (“John Smith”) and Regex (“555-1212”), producing “John Smith at 555-1212”]

Don’t evaluate this Regex when there are no dictionary matches.

Page 46:

Restricted Span Evaluation

Leverage the sequential nature of text

Only evaluate the inner on the relevant portions of the document

Limited applicability (compared with CE)
– Only certain operands and predicates

…John Smith at 555-1212…

[Operator tree: RSEJoin over Regex (“555-1212”) and Dictionary (“John Smith”), producing “John Smith at 555-1212”]

Only look for dictionary matches in the vicinity of a phone number.
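A rough Python sketch of the idea: find phone numbers first, then run the dictionary only over the text just before each one. The `WINDOW` size, the one-entry dictionary, and the simplified phone pattern are assumptions for illustration; the real RSE operators are described only at the level of the slides.

```python
import re

PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")  # simplified phone pattern (assumption)
NAMES = ["John Smith"]                     # hypothetical dictionary
WINDOW = 30                                # hypothetical: max characters between name and phone


def rse_join(doc):
    """Pair each phone number with names that end at most WINDOW characters before it."""
    longest = max(len(n) for n in NAMES)
    results = []
    for phone in PHONE_RE.finditer(doc):              # outer operand
        # Restrict the dictionary scan to the region that can hold a qualifying name.
        lo = max(0, phone.start() - WINDOW - longest)
        region = doc[lo:phone.start()]
        for name in NAMES:
            for m in re.finditer(re.escape(name), region):
                gap = phone.start() - (lo + m.end())
                if 0 <= gap <= WINDOW:
                    results.append(doc[lo + m.start():phone.end()])
    return results


doc = "x" * 200 + " John Smith at 555-1212 " + "John Smith " + "y" * 200
print(rse_join(doc))
```

The dictionary reads at most `WINDOW + longest` characters per phone number instead of the whole document, which is where the saving comes from.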

Page 47:

Implementing Restricted Span Evaluation (RSE)

RSE join operator
RSE extraction operator
Pass join bindings down to the inner of a join
Requires special physical operators at edges of plan

[Diagram: an RSE join over R1 (producing s1) and Dict(D, s2) with predicate p(s1, s2). The join passes each s1 binding to an RSE Dictionary Operator (parameters D, p), which returns the s2’s that satisfy p(binding, s2).]

Page 48:

RSE Dictionary Operator

RSE version of an operator must produce the exact same answer
– Ongoing work: RSE Regular Expression operator

[Diagram: To find dictionary matches that end in this range… …need to examine this range, which extends further left by the length of the longest dictionary entry.]
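The range extension can be checked with a small Python sketch that compares a restricted scan against a full scan. Both functions are hypothetical helpers written for this illustration, and the sketch assumes dictionary entries that cannot overlap themselves (for an entry like "aa", non-overlapping regex matching could pick different occurrences in the two scans).

```python
import re


def full_scan(text, entries):
    """Baseline: all dictionary matches in the whole text, as (begin, end) spans."""
    return sorted(m.span() for e in entries
                  for m in re.finditer(re.escape(e), text))


def rse_scan(text, entries, lo, hi):
    """Matches whose end offset lies in [lo, hi], reading only the text that matters."""
    longest = max(len(e) for e in entries)
    start = max(0, lo - longest)       # a match ending at lo can begin this early
    window = text[start:hi]
    return sorted((start + m.start(), start + m.end())
                  for e in entries
                  for m in re.finditer(re.escape(e), window)
                  if lo <= start + m.end() <= hi)


text = "Lorem ipsum dolor sit amet, consectetuer adipiscing elit."
entries = ["ipsum dolor", "sit", "elit"]
expected = [s for s in full_scan(text, entries) if 15 <= s[1] <= 25]
print(rse_scan(text, entries, 15, 25), expected)
```

Without the extension by the longest entry, the match "ipsum dolor" (which ends inside the range but starts before it) would be missed and the two answers would differ.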

Page 49:

Closely related work (Shen, Doan, Naughton, Ramakrishnan, VLDB 2007)

Regular Expressions and Custom Code
Cascading Grammars: CPSL, AFST
Workflows: UIMA, GATE
System T, DBLife

DBLife: In the context of Project Cimple. Search for “cimple wisc”

Page 50:

Delving deeper into System T versus DBLife

[Comparison chart of System T and DBLife over these techniques: Restricted Span Evaluation, Shared Dictionary Matching, Conditional Evaluation, Pushing Down Text Properties, Scoping Extractions, Pattern Matching]

Page 51:

Cascading Grammar Reality

Set of simple grammar rules for person name recognition

Pre-processing step: Tokenization of the document text
Tokenize(Document Text) → Sequence of <Token>

Level 1: Rules that look for patterns in each token to produce corresponding annotations
Token[~ “Mr. | Mrs. | Dr. | …”] → Salutation
Token[~ “Ph.D | MBA | …”] → Qualification
Token[~ “[A-Z][a-z]*”] → CapsWord
Token[~ “Michael | Richard | Smith| …”] → PersonDict

Level 2: Rules that look for patterns involving Level-1 annotations to identify Persons
PersonDict PersonDict → Person (e.g. Richard Smith)
Salutation CapsWord CapsWord → Person (e.g. Dr. Laura Haas)
CapsWord CapsWord Token[~“,”]? Qualification → Person (e.g. Laura Haas, Ph.D)

Page 52:

IOPES: Extracting Relationships and Composite Entities

IOPES = IBM Omnifind Personal Email Search
– Entities like addresses, person names
– Relationships like name ↔ phone number
– Complex entities like conference schedules, directions, signature blocks

Page 53:

Extracting Entities in Notes 8.01 Live Text

Leverages Information Extraction Techniques

Names, addresses, phone numbers…

Ships with Lotus Notes 8.01