UMass Amherst at TDT 2003
James Allan, Alvaro Bolivar, Margie Connell, Steve Cronen-Townsend, Ao Feng, FangFang Feng, Leah Larkey, Giridhar Kumaran, Victor Lavrenko, Ramesh Nallapati, and Hema Raghavan
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts Amherst
What we did
• Tasks:
  – Story Link Detection
  – Topic Tracking
  – New Event Detection
  – Cluster Detection
Outline
• Rule of Interpretation (ROI) classification
• ROI-based vocabulary reduction
• Cross-language techniques
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptive tracking
• Relevance models
ROI motivation
• Analyzed vector space similarity measures
  – Failed to distinguish between similar topics
  – e.g., two "health care" stories from different topics: different locations and individuals, but similarity dominated by "health care" terms (drugs, cost, coverage, plan, prescription)
• Possible solution: first categorize stories
  – different category → different topics (mostly true)
  – use within-category statistics: "health care" may be less confusing
• Rules of Interpretation provide natural categories
ROI intuition
• Each document in the corpus is classified into one of the ROI categories
• Stories in different ROIs are less likely to be in the same topic
• If two stories belong to different ROIs, we should trust their similarities less
• On the ROI-tagged corpus:
  – same ROI:      sim_new(s1, s2) = sim_old(s1, s2)
  – different ROI: sim_new(s1, s2) < sim_old(s1, s2)
ROI classifiers
• Naïve Bayes
• BoosTexter [Schapire and Singer, 2000]
  – Decision-tree-style classifier: generates and combines simple rules
  – Features are terms with tf as weights
• Used the most likely single class
  – Explored using the distribution over all classes, but were unable to do so successfully
Training Data for Classification
• Experiments: train on TDT-2, test on TDT-3
• Submissions: train on TDT-2 plus TDT-3 (training data prepared the same way)
• Stories in each topic tagged with the topic's ROI
  – Remove duplicate stories (in topics with the same ROI)
  – Remove all stories with more than one ROI
• Worst case: a single story relevant to…
  – Chinese Labor Activists, with ROI Legal/Criminal Cases
  – Blair Visits China in October, with ROI Political/Diplomatic Meetings
  – China will not allow Opposition Parties, with ROI Miscellaneous
• Experiments with removing named entities for training
Naïve Bayes vs. BoosTexter
• Similar classification accuracy
  – Overall accuracy is the same, but the errors are substantially different
• Our training results (TDT-3): BoosTexter beat Naïve Bayes for SLD and NED
  – BoosTexter used in most tasks for submission
• Evaluation results: in Link Detection, Naïve Bayes proved more useful
ROI classes in link detection
• Given a story pair and their estimated ROIs:
  – If the estimated ROIs are the same, leave the score alone
  – If they are different, reduce the score to 1/3 of its original value (factor chosen from training runs)
• Used four different ROI classifiers:
  – ROI-BT, ne: BoosTexter with named entities
  – ROI-BT, no-ne: BoosTexter without named entities
  – ROI-NB, ne: Naïve Bayes with named entities
  – ROI-NB, no-ne: Naïve Bayes without named entities
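The mismatch rule above can be sketched in a few lines. This is an illustrative sketch, not the submission code; `classify_roi` stands in for a trained classifier (BoosTexter or Naïve Bayes) and `cosine` for the similarity measure:

```python
def adjusted_link_score(story_a, story_b, classify_roi, cosine):
    """Reduce the similarity score when the two stories' estimated
    Rules-of-Interpretation categories disagree."""
    score = cosine(story_a, story_b)
    if classify_roi(story_a) != classify_roi(story_b):
        # Mismatched ROIs: trust the similarity less.
        # The factor 1/3 was chosen from training runs.
        score /= 3.0
    return score
```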
Training effectiveness (TDT-3)
Story link detection, minimum normalized cost, for various types of databases:

                   1Dcos    4Dcos    UDcos
  original         0.3536   0.2556   0.3254
  ROI-BT, ne       0.2959   0.2360   0.2748
  ROI-BT, no-ne    0.4600   0.3670   0.4246
  ROI-NB, ne       0.3724   0.3047   0.3380
  ROI-NB, no-ne    0.4072   0.3269   0.3718
Evaluation results: story link detection
Minimum normalized cost, for various types of databases:

                   1Dcos    4Dcos    UDcos
  original         0.2472   0.1983   0.2439
  ROI-BT, ne       0.3090   0.2587   0.2938
  ROI-BT, no-ne    0.3220   0.2649   0.3020
  ROI-NB, ne       0.2867   0.2407   0.2697
  ROI-NB, no-ne    0.2937   0.2463   0.2738
ROI for tracking
• Compare story to the centroid of the topic (built from training stories)
• If the ROI does not match, drop the score based on how bad the mismatch is
• Used the ROI-BT,ne classifier only
[Figure: score_new falls from score_old toward 1/3 of score_old as the ROI mismatch between the topic model and the story grows]
Training for tracking
Topic tracking on TDT-3, minimum normalized cost, for various types of databases (ROI = BoosTexter with named entities only):

                      1Dcos    4Dcos    ADcos    UDcos
  Nt=1  orig          0.1890   0.1819   0.1390   0.1819
        ROI-BT, ne    0.1659   0.1489   0.1280   0.1541
  Nt=4  orig          0.1427   0.1294   0.1076   0.1321
        ROI-BT, ne    0.1639   0.1314   0.1078   0.1494
Evaluation results
Topic tracking on TDT-4, minimum normalized cost, for various types of databases (ROI = BoosTexter with named entities only):

                      1Dcos    4Dcos    ADcos    UDcos
  Nt=1  orig          0.1968   0.2149   0.2270   0.2604
        ROI-BT, ne    0.3965   0.3807   0.3572   0.5002
  Nt=4  orig          0.1716   0.1610   0.1463   0.1988
        ROI-BT, ne    0.2996   0.2682   0.2525   0.3677
ROI-based vocabulary pruning
• New Event Detection only
• Create a "stop list" for each ROI
  – the 300 most frequent terms in stories within the ROI (obtained from the TDT-2 corpus)
• When a story is classified into an ROI, remove that ROI's terms from the story's vector
  – ROI determined by the BoosTexter classifier
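The per-ROI stoplist idea can be sketched as follows; function names and the toy corpus format (an iterable of ROI/token-list pairs) are illustrative, not the submission code:

```python
from collections import Counter

def build_roi_stoplists(tagged_stories, n_terms=300):
    """Build a per-ROI "stop list": the n_terms most frequent terms among
    stories classified into that ROI (the talk used 300, from TDT-2)."""
    counts = {}
    for roi, tokens in tagged_stories:   # iterable of (roi, token-list) pairs
        counts.setdefault(roi, Counter()).update(tokens)
    return {roi: {t for t, _ in c.most_common(n_terms)}
            for roi, c in counts.items()}

def prune_vector(tokens, roi, stoplists):
    """Drop the ROI-specific high-frequency terms from a story's term vector."""
    stop = stoplists.get(roi, set())
    return [t for t in tokens if t not in stop]
```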
New Event Detection approach
• Cosine similarity measure
• ROI-based vocabulary pruning
• Score normalization
• Incremental IDF
• Remove short documents
• Preprocessing: train BoosTexter on TDT-2 & TDT-3, including named entities
NED Results (TDT-3 and TDT-4) [results figure]
ROI Conclusions
• Both uses of ROI helped in training
  – Score reduction for ROI mismatch (tracking and link detection)
  – Vocabulary pruning for new event detection
• Score reduction failed in evaluation
  – Named entities are important in the ROI classifier, and TDT-4 has a different set of entities (time gap)
  – Possible overfitting to TDT-3?
• Preliminary work applying ROI to detection: unsuccessful to date
Outline
• Rule of Interpretation (ROI) classification
• ROI-based vocabulary reduction
• Cross-language techniques
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptive tracking
• Relevance models
Comparing multilingual stories
• Baseline: all stories converted to English using the provided machine translations
• New approaches:
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptation in tracking
Dictionary Translation of Arabic
• Probabilistic translation model: each Arabic word has multiple English translations
• Obtain P(e|a) from the UN Arabic-English parallel corpus
• Forms a pseudo-story in English representing the Arabic story
  – Can get large due to multiple translations per word
  – Keep the English words whose summed probabilities are greatest
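A minimal sketch of forming the English pseudo-story; the dictionary format (Arabic word → {English word: P(e|a)}) and the `keep` cutoff are illustrative assumptions:

```python
from collections import defaultdict

def translate_story(arabic_tokens, p_e_given_a, keep=100):
    """Form an English pseudo-story from an Arabic one.  Each English
    candidate accumulates P(e|a) over all Arabic tokens; only the words
    with the greatest summed probabilities are kept, so the pseudo-story
    stays a manageable size."""
    weights = defaultdict(float)
    for a in arabic_tokens:
        for e, p in p_e_given_a.get(a, {}).items():
            weights[e] += p
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:keep]
```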
Language-specific comparisons
• Language representations:
  – Arabic: CP1256 encoding and light stemming
  – English: stopped and stemmed with kstem
  – Chinese: segmented if necessary, plus overlapping bigrams
• Linking task: if both stories are in the same language, compare them in that language
• All other comparisons are done with the stories translated into English
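The linking rule above amounts to a simple dispatch; `native_sim` and `english_sim` are placeholder similarity functions (the talk used cosine over the language-specific representations):

```python
def compare_stories(s1, s2, native_sim, english_sim):
    """Language-specific linking: if both stories share a language,
    compare them in that language; otherwise fall back to comparing
    their English (translated) versions."""
    if s1["lang"] == s2["lang"]:
        return native_sim(s1, s2)
    return english_sim(s1, s2)
```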
Adaptation in tracking
• Stories are added to the topic when their similarity score is high
• A topic representation is established in each language as soon as an added story in that language appears
• The similarity of an Arabic story is computed against the Arabic topic representation, and so on for each language
Cross-Lingual Link Detection Results

  Translation condition   Min. cost (TDT-3)   Min. cost (TDT-4)   Cost (TDT-4)
  1DcosIDF                0.3536              0.2472              0.2523
  UDcosIDF                0.3254 (-8%)        0.2439 (-1%)        0.2597
  4DcosIDF                0.2556 (-28%)       0.1983 (-20%)       0.2000

Translation conditions:
• 1DcosIDF: baseline; all stories in English using the provided translations
• UDcosIDF: all stories in English, but using the dictionary translation of Arabic
• 4DcosIDF: compare a pair of stories in their native language if both are in the same language; otherwise compare them in English using the dictionary translation of Arabic
Cross-Lingual Topic Tracking Results (required condition: Nt=1, bnman)

  Translation condition   Min. cost (TDT-3)   Min. cost (TDT-4)   Cost (TDT-4)
  1DcosIDF                0.1890              0.1968              0.1964
  UDcosIDF                0.1853 (-2%)        0.2024 (+3%)        0.2604
  4DcosIDF                0.1819 (-4%)        0.2036 (+3%)        0.2149
  ADcosIDF                0.1390 (-26%)       0.2007 (+2%)        0.2270

Translation conditions:
• 1DcosIDF: baseline
• UDcosIDF: dictionary translation of Arabic
• 4DcosIDF: compare a pair of stories in their native language
• ADcosIDF: baseline plus adaptation; add a story to the centroid vector if its similarity score exceeds the adapting threshold, limit the vector to its top 100 terms, and add at most 100 stories to the centroid
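The ADcosIDF adaptation loop can be sketched as below; the vector representation (term → weight dicts) and the `similarity` argument are illustrative, while the limits (top 100 terms, at most 100 added stories) come from the condition description:

```python
def track_with_adaptation(centroid, stream, similarity, threshold,
                          top_terms=100, max_adapt=100):
    """Adaptive centroid tracking: when an incoming story scores above the
    adapting threshold, fold it into the centroid vector, keep only the
    highest-weighted terms, and stop adapting after max_adapt stories."""
    added = 0
    decisions = []
    for story in stream:                      # story: {term: weight}
        score = similarity(centroid, story)
        decisions.append(score > threshold)
        if score > threshold and added < max_adapt:
            for term, w in story.items():
                centroid[term] = centroid.get(term, 0.0) + w
            # prune the centroid to its top terms
            top = sorted(centroid, key=centroid.get, reverse=True)[:top_terms]
            centroid = {t: centroid[t] for t in top}
            added += 1
    return decisions, centroid
```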
Cross-Lingual Topic Tracking Results (alternate condition: Nt=4, bnasr)

  Translation condition   Min. cost (TDT-3)   Min. cost (TDT-4)   Cost (TDT-4)
  1DcosIDF                0.1427              0.1676              0.1716
  UDcosIDF                0.1321 (-7%)        0.1594 (-5%)        0.1988
  4DcosIDF                0.1294 (-9%)        0.1501 (-10%)       0.1610
  ADcosIDF                0.1076 (-25%)       0.1443 (-14%)       0.1463

Translation conditions: 1DcosIDF: baseline; UDcosIDF: dictionary translation of Arabic; 4DcosIDF: compare stories in their native language; ADcosIDF: baseline plus adaptation.
Outline
• Rule of Interpretation (ROI) classification
• ROI-based vocabulary reduction
• Cross-language techniques
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptive tracking
• Relevance models
Relevance Models for SLD
• Relevance Model (RM): "model of stories relevant to a query"
• Algorithm, given stories A and B:
  1. compute "queries" QA and QB
  2. estimate relevance models P(w|QA) and P(w|QB)
  3. compute the divergence between the relevance models
• Relevance model estimation:
    P(w|Q) = Σ_M P(w|M) P(M|Q)
• Chance model: P_chance(tf_w | cf_w, n, N), the probability that word w occurs tf_w times by chance in a document of length n, given its collection frequency cf_w in a collection of N terms
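A toy sketch of the estimation P(w|Q) = Σ_M P(w|M) P(M|Q) and the divergence comparison. The posterior P(M|Q) is taken proportional to the query likelihood under each document model, and the linear smoothing against a background model is an illustrative choice, not necessarily the submission's exact settings:

```python
import math

def relevance_model(query_terms, collection, smooth=0.1):
    """Estimate P(w|Q) = sum_M P(w|M) P(M|Q) over a document collection.
    `collection` is a list of term-count dicts."""
    bg = {}
    for d in collection:
        for w, c in d.items():
            bg[w] = bg.get(w, 0) + c
    bg_total = sum(bg.values())

    def p_w_m(w, d):
        # document model linearly smoothed with the background model
        n = sum(d.values())
        return (1 - smooth) * d.get(w, 0) / n + smooth * bg.get(w, 0) / bg_total

    # posterior P(M|Q) proportional to the query likelihood prod_w P(w|M)
    lik = [math.exp(sum(math.log(p_w_m(w, d)) for w in query_terms))
           for d in collection]
    z = sum(lik)
    post = [l / z for l in lik]
    return {w: sum(p * p_w_m(w, d) for p, d in zip(post, collection))
            for w in bg}

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two relevance models (step 3: divergence)."""
    return sum(pw * math.log(pw / max(q.get(w, 0.0), eps))
               for w, pw in p.items() if pw > 0)
```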
Results: Story Link Detection (minimum normalized cost)

                     TDT-3   TDT-4
  Cosine / tf.idf    .2551   .1983
  Relevance Model    .1938   .1881
  Rel. Model + ROI   .1862   .1863
Relevance Models for Tracking
1. Initialize:
   • set P(M|Q) = 1/Nt if M is a training document
   • compute the relevance model as before
2. For each incoming story D:
   • score = divergence between P(w|D) and the RM
   • if score > threshold, add D to the training set and recompute the RM
   • allow no more than k adaptations
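The adaptive loop above can be sketched generically; `build_rm` and `score_fn` are placeholders for the relevance-model estimator and the divergence-based score, so the sketch shows only the adaptation control flow, not the submission's estimation details:

```python
def adaptive_rm_tracking(train_docs, stream, build_rm, score_fn,
                         threshold, k=100):
    """Adaptive relevance-model tracking: start from the Nt training
    documents, and whenever an incoming story scores above the threshold,
    add it to the training set and recompute the relevance model, up to
    k adaptations in total."""
    rm = build_rm(train_docs)
    adaptations = 0
    decisions = []
    for doc in stream:
        score = score_fn(rm, doc)
        on_topic = score > threshold
        decisions.append(on_topic)
        if on_topic and adaptations < k:
            # add D to the training set and recompute the RM
            train_docs = train_docs + [doc]
            rm = build_rm(train_docs)
            adaptations += 1
    return decisions
```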
Results: Topic Tracking (minimum normalized cost)

                     TDT-3   TDT-4
  Cosine / tf.idf    .1888   .1964
  Language Model     .1481   .2122
  Adaptive tf.idf    .1390   .2007
  Relevance Model    .0953   .1784
Conclusions
• Rule of Interpretation (ROI) classification
• ROI-based vocabulary reduction
• Cross-language techniques
  – Dictionary translation of Arabic stories
  – Native language comparisons
  – Adaptive tracking
• Relevance models