Effective Phrase Prediction. VLDB 2007 : Text Databases. Presented by Arnab Nandi, H. V. Jagadish, University of Michigan. 2008-03-07. Summarized by Jaeseok Myung.


Page 1: Effective Phrase Prediction

Effective Phrase Prediction

VLDB 2007 : Text Databases

Presented By Arnab Nandi, H. V. Jagadish

University of Michigan

2008-03-07

Summarized By Jaeseok Myung

Page 2: Effective Phrase Prediction

Copyright 2008 by CEBT

Motivation

Pervasiveness of Autocompletion

Typical autocompletion is still at word level

Phrase Prediction

Words provide much more information to exploit for prediction

– Context, Phrase Structures

Most text is predictable and repetitive in many applications

– Email Composition

P(“Thank you very much” | “Thank”) ≈ 1

Center for E-Business Technology IDS Lab. Seminar – 2/13

Page 3: Effective Phrase Prediction


Challenges

Number of phrases is large

n(vocabulary) >> n(alphabet)

n(phrases) = O(|vocabulary|^(phrase length))

=> FussyTree structure

Length of phrase is unknown

“word” has a well-defined boundary

=> Significance

How to evaluate a suggestion mechanism?

=> Total Profit Metric (TPM)

Page 4: Effective Phrase Prediction

Problem Definition

R = query(p)

Need data structure that can

Store completions efficiently

Support fast querying

Page 5: Effective Phrase Prediction

An n-gram Data Model

R = query(p) : r ∈ R, prob (p, r) is maximized

mth order Markov model

m: the number of previous states used to predict the next state

n-gram model is equivalent to an (n-1)th order Markov model


P(t, i, p) = P(X1 = t) · P(X2 = i | X1 = t) · P(X3 = p | X2 = i)
           = 0.1 × 0.3 × 0.6 = 0.018

[Figure: frequency-by-rank chart for the candidate completions w7,1, w7,2, w7,3 (frequencies 10, 20, and 30), with prefix length p = 5]
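The slide's n-gram scoring can be sketched in Python: a bigram (first-order Markov) model scores a phrase by multiplying a start probability with the conditional probability of each following word. The function name and the toy probability tables are illustrative, with values taken from the slide's example.

```python
def phrase_prob(words, start_prob, trans_prob):
    """Score a phrase under a first-order Markov (bigram) model:
    P(w1..wn) = P(w1) * product over k of P(w_k | w_{k-1})."""
    p = start_prob.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        p *= trans_prob.get((prev, cur), 0.0)  # unseen transition => 0
    return p

# Toy numbers from the slide: P(t, i, p) = 0.1 * 0.3 * 0.6 = 0.018
start = {"t": 0.1}
trans = {("t", "i"): 0.3, ("i", "p"): 0.6}
print(round(phrase_prob(["t", "i", "p"], start, trans), 6))  # 0.018
```

An (n-1)th order model would simply condition each word on the previous n-1 words instead of one.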

Page 6: Effective Phrase Prediction


Fundamental Data Structures

Basic data structures for “completion” problems

TRIE or Suffix Tree

Phrase Version

Every node = word


<TRIE> <Suffix Tree>
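A minimal sketch of the phrase version of a trie (class and method names are hypothetical): every node is a word rather than a character, so a root-to-node path spells a phrase, and completions are found by walking the subtree under the typed prefix.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # word -> TrieNode
        self.count = 0      # how often the phrase ending here was seen

class PhraseTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, words):
        """Insert a phrase, one word per node, incrementing counts."""
        node = self.root
        for w in words:
            node = node.children.setdefault(w, TrieNode())
            node.count += 1

    def completions(self, prefix_words):
        """Return (continuation, count) pairs for a word-level prefix."""
        node = self.root
        for w in prefix_words:
            if w not in node.children:
                return []
            node = node.children[w]
        out = []
        def walk(n, path):
            for w, child in n.children.items():
                out.append((path + [w], child.count))
                walk(child, path + [w])
        walk(node, [])
        return out

t = PhraseTrie()
t.add(["please", "call", "me"])
t.add(["please", "call", "back"])
print(t.completions(["please", "call"]))  # [(['me'], 1), (['back'], 1)]
```

A suffix tree additionally inserts every suffix of each phrase, so completions can be found for mid-phrase prefixes as well.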

Page 7: Effective Phrase Prediction


Pruned Count Suffix Tree (PCST)

Construct a frequency-based phrase tree

Prune all nodes with frequency < threshold τ

Problems

A PCST including infrequent phrases is constructed as an intermediate result => it does not perform well for large data sets


[16] Estimating alphanumeric selectivity in the presence of wildcards
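The PCST idea can be sketched as count-then-prune (a flat dictionary stands in for the tree here; names are illustrative): count every contiguous phrase of the training sentences, then drop entries below τ. The slide's criticism is visible in the code: the full, unpruned count table exists as an intermediate result.

```python
from collections import defaultdict

def build_pcst(sentences, tau):
    """Count all contiguous phrases, then prune those with count < tau."""
    counts = defaultdict(int)  # phrase (tuple of words) -> frequency
    for sent in sentences:
        for i in range(len(sent)):                    # every suffix ...
            for j in range(i + 1, len(sent) + 1):     # ... and each of its prefixes
                counts[tuple(sent[i:j])] += 1
    # pruning step: only now do infrequent phrases disappear
    return {p: c for p, c in counts.items() if c >= tau}

corpus = [["please", "call", "me"], ["please", "call", "back"]]
print(build_pcst(corpus, tau=2))
# {('please',): 2, ('please', 'call'): 2, ('call',): 2}
```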

Page 8: Effective Phrase Prediction


FussyTree Construction

Filter out infrequent phrases even before adding to the tree


[Figure: FussyTree construction example with N = 2 (training sentence size) and τ = 2 (threshold); infrequent phrases are ignored. Tokenizing window size = 4, the size of the largest frequent phrase, yields phrases such as (please, call, me, asap) and (call, me, asap, -end-)]
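The pre-filtering step can be sketched as follows, assuming a two-pass scheme (the exact pass structure is an assumption for illustration): a first pass slides a fixed-size window over each sentence to generate candidate phrases and counts them; only phrases reaching the threshold τ would then be inserted into the tree, so infrequent phrases never materialize as nodes. Window size 4 and τ = 2 follow the slide.

```python
from collections import Counter

def windows(sentence, size):
    """All windowed phrases of a sentence, with an explicit end marker."""
    sent = sentence + ["-end-"]
    return [tuple(sent[i:i + size]) for i in range(len(sent) - size + 1)]

def frequent_phrases(sentences, size=4, tau=2):
    counts = Counter()
    for s in sentences:            # pass 1: count windowed phrases
        counts.update(windows(s, size))
    # pass 2 would insert only these survivors into the FussyTree
    return {p for p, c in counts.items() if c >= tau}

docs = [["please", "call", "me", "asap"], ["please", "call", "me", "asap"]]
print(frequent_phrases(docs))
# both ('please', 'call', 'me', 'asap') and ('call', 'me', 'asap', '-end-') survive
```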

Page 9: Effective Phrase Prediction


Significance

A node in the FussyTree is “significant” if it marks a phrase boundary


Example : “please call”

z and y are tuning parameters. Assume z = 2, y = 3.

“please call”(3) > “please”(0) * “call”(1)

“please call”(3) > ½ * “please”(0)

“please call”(3) > 3 * “please call me”(1)…
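The three tests above can be encoded directly (the function signature and frequency values are illustrative, taken from the slide's example). Note that the slide's own example, 3 vs 3 × 1, only passes if the extension test is non-strict, so >= is used for that comparison here.

```python
def is_significant(phrase_freq, last_word_freq, prefix_freq, ext_freq, z=2, y=3):
    """Significance tests for a phrase node, per the slide's example:
    the phrase must beat the independence estimate, be frequency-comparable
    to its prefix, and clearly dominate its most frequent extension."""
    return (phrase_freq > prefix_freq * last_word_freq   # co-occurrence test
            and phrase_freq > prefix_freq / z            # comparability test
            and phrase_freq >= y * ext_freq)             # uniqueness test (non-strict)

# "please call"(3) vs "please"(0), "call"(1), "please call me"(1)
print(is_significant(3, last_word_freq=1, prefix_freq=0, ext_freq=1))  # True
```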

Page 10: Effective Phrase Prediction


Significance – cont.


All leaves are significant, due to the END node (frequency = 0)

Some internal nodes are significant too

Intuitively, suggestions ending on significant nodes will be better

No need to store counts

Page 11: Effective Phrase Prediction


Online Significance Marking

(Offline) significance marking requires an additional pass over the training data

The online variant is compared against the tree generated by the FussyTree with offline significance marking


[Figure: a tree with the path A–B–C–D–E; adding “ABCXY” creates a new branch X–Y under node C]

The branch point is considered for promotion

The immediate descendant significant nodes are considered for demotion
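A hedged sketch of the promotion/demotion rule described above (the node class and function are hypothetical): when a newly inserted phrase branches off an existing path, the branch point becomes a candidate for promotion to significant, and significant nodes immediately below it become candidates for demotion.

```python
class Node:
    def __init__(self, word, significant=False):
        self.word, self.significant = word, significant
        self.children = {}  # word -> Node

def mark_on_insert(branch_point):
    """Promote the branch point; demote its immediate significant children."""
    branch_point.significant = True
    for child in branch_point.children.values():
        if child.significant:
            child.significant = False

# "ABCDE" stored with D marked significant; inserting "ABCXY" branches at C.
c = Node("C")
c.children["D"] = Node("D", significant=True)
c.children["X"] = Node("X")
mark_on_insert(c)
print(c.significant, c.children["D"].significant)  # True False
```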

Page 12: Effective Phrase Prediction


Evaluation Metrics

Precision & Recall

Refer to the quality of the suggestions themselves

For ranked results :
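The slide's exact formulas were shown as images, so this is only the standard unranked sketch of the two metrics for an autocompletion session: precision is the fraction of shown suggestions that were accepted, recall the fraction of suggestion opportunities where an accepted suggestion was made.

```python
def precision_recall(shown, accepted, opportunities):
    """Session-level precision and recall for autocompletion suggestions."""
    precision = accepted / shown if shown else 0.0
    recall = accepted / opportunities if opportunities else 0.0
    return precision, recall

print(precision_recall(shown=10, accepted=4, opportunities=8))  # (0.4, 0.5)
```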


Page 13: Effective Phrase Prediction


Total Profit Metric (TPM)

TPM measures the effectiveness of a suggestion mechanism by counting the number of keystrokes saved by its suggestions

d is the distraction parameter

TPM(0) corresponds to a user who does not mind distraction at all

TPM(1) is the extreme case where every suggestion (right or wrong) is considered a blocking factor that costs one keystroke

In practice, the distraction value is closer to 0 than to 1
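A minimal TPM sketch, assuming the metric is keystrokes saved by accepted suggestions minus a distraction cost d per suggestion shown, normalized by the keystrokes needed without any assistance (the normalization is an assumption for illustration):

```python
def tpm(saved_keystrokes, suggestions_shown, total_keystrokes, d):
    """TPM(d): net keystroke profit per keystroke of unassisted typing."""
    return (saved_keystrokes - d * suggestions_shown) / total_keystrokes

total, saved, shown = 1000, 120, 60
print(tpm(saved, shown, total, d=0))  # 0.12 -- distraction ignored
print(tpm(saved, shown, total, d=1))  # 0.06 -- every suggestion costs a keystroke
```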

Page 14: Effective Phrase Prediction

Total Profit Metric – An Example

Page 15: Effective Phrase Prediction

Experiments

Multiple Corpora

Enron Small : 1 user’s “sent” (366 emails, 250KB)

Enron Large : multiple users (20,842 emails, 16MB)

Wikipedia (40,000 documents, 53MB)

Data Structures

(1) PCST, (2) FussyTree with Count, (3) FussyTree with Significance

Parameters

Significance : z (comparability) = 2, y (uniqueness) = 2

Training Sentence Size N = 8

Prefix Size P = 2

Page 16: Effective Phrase Prediction

Prediction Quality

Page 17: Effective Phrase Prediction

Tuning Parameters (1)

Page 18: Effective Phrase Prediction

Tuning Parameters (2)

Page 19: Effective Phrase Prediction

Conclusion

Phrase-level autocompletion is challenging, but can provide much greater savings than word-level autocompletion

A technique to accomplish this based on “significance”

New evaluation metrics for ranked autocompletion

Possible Extensions

Part of Speech Reranking

Semantic Reranking

– Using WordNet

Query Completion for structured data

– XML, ..
