Effective Phrase Prediction
VLDB 2007 : Text Databases
Presented By Arnab Nandi, H. V. Jagadish
University of Michigan
2008-03-07
Summarized By Jaeseok Myung
Copyright 2008 by CEBT
Motivation
Pervasiveness of Autocompletion
Typical autocompletion is still at word level
Phrase Prediction
Words provide much more information to exploit for prediction
– Context, Phrase Structures
Most text is predictable and repetitive in many applications
– Email Composition
Prob(“Thank you very much” | “Thank”) ~= 1
Center for E-Business Technology, IDS Lab. Seminar
Challenges
Number of phrases is large
n(vocabulary) >> n(alphabet)
n(phrases) = O(|vocabulary|^(phrase length))
=> FussyTree structure
Length of phrase is unknown
“word” has a well-defined boundary
=> Significance
How to evaluate a suggestion mechanism?
=> Total Profit Metric (TPM)
Problem Definition
R = query(p)
Need data structure that can
Store completions efficiently
Support fast querying
An n-gram Data Model
R = query(p) : r ∈ R such that prob(p, r) is maximized
m-th order Markov model
m: # of previous states that we are using to predict the next state
n-gram model is equivalent to an (n-1)th order Markov model
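The slide's n-gram model can be sketched as a simple frequency table over (n−1)-word contexts. This is an illustrative reimplementation, not the paper's code; the function names and the toy sentence are my own.

```python
from collections import defaultdict

def train_ngrams(tokens, n=3):
    """Count n-grams: map each (n-1)-word context to next-word frequencies."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict(counts, context):
    """Return the completion r that maximizes prob(p, r) for prefix p."""
    candidates = counts.get(tuple(context), {})
    return max(candidates, key=candidates.get) if candidates else None

tokens = "please call me asap please call me tomorrow please call me asap".split()
model = train_ngrams(tokens, n=3)
print(predict(model, ["call", "me"]))  # "asap" (seen twice vs. "tomorrow" once)
```

With n = 3 this is exactly a 2nd-order Markov model: the next word depends only on the two preceding words.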
P(t, i, p) = P(X1 = t) · P(X2 = i | X1 = t) · P(X3 = p | X2 = i) = 0.1 × 0.3 × 0.6 = 0.018
[Figure: candidate words w7,1, w7,2, w7,3 with frequencies per rank (10, 20, 30); prefix length p = 5]
Fundamental Data Structures
Basic data structures for "completion" problems
TRIE or Suffix Tree
Phrase Version
Every node = word
<TRIE> <Suffix Tree>
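The "phrase version" above, where every node is a word rather than a character, can be sketched as a nested dictionary with per-node counts. This is a minimal illustration; the structure and names are assumptions, not the paper's implementation.

```python
def build_phrase_trie(phrases):
    """Word-level trie: each node is a word mapped to its count and children."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            entry = node.setdefault(word, {"count": 0, "children": {}})
            entry["count"] += 1          # frequency of the phrase prefix ending here
            node = entry["children"]
    return root

trie = build_phrase_trie(["please call me", "please call back"])
print(trie["please"]["count"])                       # 2: both phrases share "please"
print(sorted(trie["please"]["children"]["call"]["children"]))  # ['back', 'me']
```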
Pruned Count Suffix Tree (PCST)
Construct a frequency-based phrase tree
Prune all nodes with frequency < threshold τ
Problems
A PCST including infrequent phrases is constructed as an intermediate result, so construction does not scale well to large data sets
[16] Estimating alphanumeric selectivity in the presence of wildcards
FussyTree Construction
Filter out infrequent phrases even before adding to the tree
Example: N = 2 (training sentence size), τ = 2 (threshold)
Tokenizing window size = 4 (the size of the largest frequent phrase)
– (please, call, me, asap)
– (call, me, asap, -end-)
Phrases below the threshold are ignored
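The windowed tokenization that produces the two tuples above can be sketched as follows; the function name and the END marker spelling follow the slide, everything else is an assumption.

```python
def window_phrases(sentence, size):
    """Slide a fixed-size window over the sentence, padding with an END marker."""
    tokens = sentence.split() + ["-end-"]
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

print(window_phrases("please call me asap", 4))
# [('please', 'call', 'me', 'asap'), ('call', 'me', 'asap', '-end-')]
```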
Significance
A node in the FussyTree is “significant” if it marks a phrase boundary
Example : “please call”
z and y are tuning parameters; assume z = 2, y = 3
“please call”(3) > “please”(0) * “call”(1)
“please call”(3) > ½ * “please”(0)
“please call”(3) > 3 * “please call me”(1)…
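The three checks above can be sketched as one predicate. The exact inequalities (strict vs. non-strict) are inferred from the slide's example numbers rather than taken from the paper, and the parameter names are my own.

```python
def is_significant(freq, prefix_freq, word_freq, child_freqs, z=2, y=3):
    """Sketch of the significance test from the slide's three example checks."""
    cooccurs   = freq > prefix_freq * word_freq       # beats the independence estimate
    comparable = z * freq > prefix_freq               # within a factor z of its prefix
    unique     = all(freq >= y * c for c in child_freqs)  # no extension is y-times close
    return cooccurs and comparable and unique

# Slide example: "please call"(3) vs "please"(0), "call"(1), child "please call me"(1)
print(is_significant(freq=3, prefix_freq=0, word_freq=1, child_freqs=[1]))  # True
```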
Significance – cont.
All leaves are significant due to the END node (frequency = 0)
Some internal nodes are significant too
Intuitively, suggestions ending on significant nodes will be better
No need to store counts
Online Significance Marking
(Offline) Significance requires an additional pass
Compare against tree generated by FussyTree with offline significance
[Figure: a tree with path A–B–C–D–E; adding "ABCXY" keeps the shared prefix A–B–C and creates a new branch X–Y under C]
The branch point is considered for promotion
The immediate descendant significant nodes are considered for demotion
Evaluation Metrics
Precision & Recall
Refer to the quality of the suggestions themselves
For ranked results, rank-weighted variants are used
Total Profit Metric (TPM)
TPM measures the effectiveness of a suggestion mechanism
– Counts the number of keystrokes saved by suggestions
d is the distraction parameter
– TPM(0) corresponds to a user who does not mind distraction at all
– TPM(1) is the extreme case where every suggestion (right or wrong) is a blocking factor that costs one keystroke
In practice, the distraction value is likely closer to 0 than to 1
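The metric can be sketched as keystrokes saved minus a per-suggestion distraction cost, normalized by the keystrokes needed without any assistance. The exact normalization and the example numbers are assumptions for illustration, not the paper's formula verbatim.

```python
def total_profit(saved_keystrokes, num_suggestions, total_keystrokes, d):
    """TPM(d) sketch: net keystroke savings after charging d per suggestion shown."""
    return (saved_keystrokes - d * num_suggestions) / total_keystrokes

# Hypothetical session: 100 keystrokes of text, 25 suggestions shown, 40 keystrokes saved.
print(total_profit(saved_keystrokes=40, num_suggestions=25, total_keystrokes=100, d=0))  # 0.4
print(total_profit(saved_keystrokes=40, num_suggestions=25, total_keystrokes=100, d=1))  # 0.15
```

Note how the same session scores 0.4 under TPM(0) but only 0.15 under TPM(1): a mechanism that suggests often is penalized once every suggestion counts as a distraction.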
Total Profit Metric – An Example
Experiments
Multiple Corpora
Enron Small : 1 user’s “sent” (366 emails, 250KB)
Enron Large : multiple users (20,842 emails, 16MB)
Wikipedia (40,000 documents, 53MB)
Data Structures
(1) PCST, (2) FussyTree with Count, (3) FussyTree with Significance
Parameters
Significance : z (comparability) = 2, y (uniqueness) = 2
Training Sentence Size N = 8
Prefix Size P = 2
Prediction Quality
Tuning Parameters (1)
Tuning Parameters (2)
Conclusion
Phrase-level autocompletion is challenging, but can provide much greater savings than word-level autocompletion
A technique to accomplish this based on “significance”
New evaluation metrics for ranked autocompletion
Possible Extensions
Part of Speech Reranking
Semantic Reranking
– Using WordNet
Query Completion for structured data
– XML, ..