Effective Phrase Prediction
VLDB 2007 : Text Databases
Presented By Arnab Nandi, H. V. Jagadish
University of Michigan
2008-03-07
Summarized By Jaeseok Myung
Copyright 2008 by CEBT
Motivation
Pervasiveness of Autocompletion
Typical autocompletion is still at word level
Phrase Prediction
Words provide much more information to exploit for prediction
– Context, Phrase Structures
Most text is predictable and repetitive in many applications
– Email Composition
Prob(“Thank you very much” | “Thank”) ~= 1
Center for E-Business Technology, IDS Lab. Seminar
Challenges
Number of phrases is large
n(vocabulary) >> n(alphabet)
n(phrases) = O(|vocabulary|^(phrase length))
=> FussyTree structure
Length of phrase is unknown
“word” has a well-defined boundary
=> Significance
How to evaluate a suggestion mechanism?
=> Total Profit Metric (TPM)
Problem Definition
R = query(p)
Need data structure that can
Store completions efficiently
Support fast querying
An n-gram Data Model
R = query(p) : r ∈ R such that prob(p, r) is maximized
m-th order Markov model
m: # of previous states that we are using to predict the next state
n-gram model is equivalent to an (n-1)th order Markov model
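The slide's n-gram model can be sketched as a simple frequency table over (n−1)-word contexts. This is an illustrative reimplementation, not the paper's code; the function names and the toy sentence are my own.

```python
from collections import defaultdict

def train_ngrams(tokens, n=3):
    """Count n-grams: map each (n-1)-word context to next-word frequencies."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict(counts, context):
    """Return the completion r that maximizes prob(p, r) for prefix p."""
    candidates = counts.get(tuple(context), {})
    return max(candidates, key=candidates.get) if candidates else None

tokens = "please call me asap please call me tomorrow please call me asap".split()
model = train_ngrams(tokens, n=3)
print(predict(model, ["call", "me"]))  # "asap" (seen twice vs. "tomorrow" once)
```

With n = 3 this is exactly a 2nd-order Markov model: the next word depends only on the two preceding words.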
P(t, i, p) = P(X1 = t) · P(X2 = i | X1 = t) · P(X3 = p | X2 = i) = 0.1 × 0.3 × 0.6 = 0.018
[Figure: candidate words w7,1, w7,2, w7,3 with frequencies per rank (10, 20, 30); prefix length p = 5]
Fundamental Data Structures
Basic data structures for "completion" problems
TRIE or Suffix Tree
Phrase Version
Every node = word
<TRIE> <Suffix Tree>
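The "phrase version" above, where every node is a word rather than a character, can be sketched as a nested dictionary with per-node counts. This is a minimal illustration; the structure and names are assumptions, not the paper's implementation.

```python
def build_phrase_trie(phrases):
    """Word-level trie: each node is a word mapped to its count and children."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            entry = node.setdefault(word, {"count": 0, "children": {}})
            entry["count"] += 1          # frequency of the phrase prefix ending here
            node = entry["children"]
    return root

trie = build_phrase_trie(["please call me", "please call back"])
print(trie["please"]["count"])                       # 2: both phrases share "please"
print(sorted(trie["please"]["children"]["call"]["children"]))  # ['back', 'me']
```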
Pruned Count Suffix Tree (PCST)
Construct a frequency-based phrase tree
Prune all nodes with frequency < threshold τ
Problems
A PCST including infrequent phrases is constructed as an intermediate result, so construction does not scale well to large data sets
[16] Estimating alphanumeric selectivity in the presence of wildcards
FussyTree Construction
Filter out infrequent phrases even before adding to the tree
Example: N = 2 (training sentence size), τ = 2 (threshold)
Tokenizing window size = 4 (the size of the largest frequent phrase)
– (please, call, me, asap)
– (call, me, asap, -end-)
Phrases below the threshold are ignored
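The windowed tokenization that produces the two tuples above can be sketched as follows; the function name and the END marker spelling follow the slide, everything else is an assumption.

```python
def window_phrases(sentence, size):
    """Slide a fixed-size window over the sentence, padding with an END marker."""
    tokens = sentence.split() + ["-end-"]
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

print(window_phrases("please call me asap", 4))
# [('please', 'call', 'me', 'asap'), ('call', 'me', 'asap', '-end-')]
```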
Significance
A node in the FussyTree is “significant” if it marks a phrase boundary
Example : “please call”
z and y are tuning parameters; assume z = 2, y = 3
“please call”(3) > “please”(0) * “call”(1)
“please call”(3) > ½ * “please”(0)
“please call”(3) > 3 * “please call me”(1)…
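The three checks above can be sketched as one predicate. The exact inequalities (strict vs. non-strict) are inferred from the slide's example numbers rather than taken from the paper, and the parameter names are my own.

```python
def is_significant(freq, prefix_freq, word_freq, child_freqs, z=2, y=3):
    """Sketch of the significance test from the slide's three example checks."""
    cooccurs   = freq > prefix_freq * word_freq       # beats the independence estimate
    comparable = z * freq > prefix_freq               # within a factor z of its prefix
    unique     = all(freq >= y * c for c in child_freqs)  # no extension is y-times close
    return cooccurs and comparable and unique

# Slide example: "please call"(3) vs "please"(0), "call"(1), child "please call me"(1)
print(is_significant(freq=3, prefix_freq=0, word_freq=1, child_freqs=[1]))  # True
```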
Significance – cont.
All leaves are significant due to the END node (frequency = 0)
Some internal nodes are significant too
Intuitively, suggestions ending on significant nodes will be better
No need to store counts
Online Significance Marking
(Offline) Significance requires an additional pass
Compare against tree generated by FussyTree with offline significance
[Figure: a tree with path A–B–C–D–E; adding "ABCXY" keeps the shared prefix A–B–C and creates a new branch X–Y under C]
The branch point is considered for promotion
The immediate descendant significant nodes are considered for demotion
Evaluation Metrics
Precision & Recall
Refer to the quality of the suggestions themselves
For ranked results, rank-weighted variants are used
Total Profit Metric (TPM)
TPM measures the effectiveness of a suggestion mechanism
– Counts the number of keystrokes saved by suggestions
d is the distraction parameter
– TPM(0) corresponds to a user who does not mind distraction at all
– TPM(1) is the extreme case where every suggestion (right or wrong) is a blocking factor that costs one keystroke
In practice, the distraction value is likely closer to 0 than to 1
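The metric can be sketched as keystrokes saved minus a per-suggestion distraction cost, normalized by the keystrokes needed without any assistance. The exact normalization and the example numbers are assumptions for illustration, not the paper's formula verbatim.

```python
def total_profit(saved_keystrokes, num_suggestions, total_keystrokes, d):
    """TPM(d) sketch: net keystroke savings after charging d per suggestion shown."""
    return (saved_keystrokes - d * num_suggestions) / total_keystrokes

# Hypothetical session: 100 keystrokes of text, 25 suggestions shown, 40 keystrokes saved.
print(total_profit(saved_keystrokes=40, num_suggestions=25, total_keystrokes=100, d=0))  # 0.4
print(total_profit(saved_keystrokes=40, num_suggestions=25, total_keystrokes=100, d=1))  # 0.15
```

Note how the same session scores 0.4 under TPM(0) but only 0.15 under TPM(1): a mechanism that suggests often is penalized once every suggestion counts as a distraction.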
Total Profit Metric – An Example
Experiments
Multiple Corpora
Enron Small : 1 user’s “sent” (366 emails, 250KB)
Enron Large : multiple users (20,842 emails, 16MB)
Wikipedia (40,000 documents, 53MB)
Data Structures
(1) PCST, (2) FussyTree with Count, (3) FussyTree with Significance
Parameters
Significance : z (comparability) = 2, y (uniqueness) = 2
Training Sentence Size N = 8
Prefix Size P = 2
Prediction Quality
Tuning Parameters (1)
Tuning Parameters (2)
Conclusion
Phrase-level autocompletion is challenging, but can provide much greater savings than word-level autocompletion
A technique to accomplish this based on “significance”
New evaluation metrics for ranked autocompletion
Possible Extensions
Part of Speech Reranking
Semantic Reranking
– Using WordNet
Query Completion for structured data
– XML, ..