alias detection in link data sets master’s thesis paul hsiung
Post on 20-Dec-2015
![Page 1: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/1.jpg)
Alias Detection in Link Data Sets
Master’s Thesis
Paul Hsiung
![Page 2: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/2.jpg)
Alias Definition
Aliases of names
– Dubya = G.W. Bush
– Usama = Osama
– G.W. Bush = the President
– Osama bin Laden = the Emir, the Prince
Misspelled words
– Unintentional (typos)
– Intentional: mortgage = m0rtg@ge (spam)
![Page 3: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/3.jpg)
In What Context Do Aliases Occur?
Newspaper articles
Web pages
Spam emails
Any collection of text
![Page 4: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/4.jpg)
Link Data Set
A way to represent the context
Composed of a set of names and a set of links
– Names are extracted from the text
– Names can refer to the same entity (“Dubya” and “G.W. Bush”)
– Links are collections of names and represent a relationship between names
![Page 5: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/5.jpg)
Example
Wanted al-Qaeda terror network chief Osama bin Laden and his top aide, Ayman al-Zawahri, have moved out of Pakistan and are believed to have crossed the mountainous border back into Afghanistan
(Osama bin Laden, Ayman al-Zawahri, al-Qaeda)
(Pakistan, Osama bin Laden)
(Afghanistan, Osama bin Laden)
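One way to hold the example above in code is sketched here. The choice of `frozenset` for a link and the helper names `names_of` and `build_graph` are assumptions made for illustration; the slides do not prescribe a storage format.

```python
from collections import defaultdict

# The three links from the example, each stored as an unordered set of names.
links = [
    frozenset({"Osama bin Laden", "Ayman al-Zawahri", "al-Qaeda"}),
    frozenset({"Pakistan", "Osama bin Laden"}),
    frozenset({"Afghanistan", "Osama bin Laden"}),
]

def names_of(links):
    """Return every distinct name that appears in at least one link."""
    return set().union(*links)

def build_graph(links):
    """Map each name to the set of names it shares at least one link with."""
    graph = defaultdict(set)
    for link in links:
        for name in link:
            graph[name] |= link - {name}
    return dict(graph)

graph = build_graph(links)
```

Here `graph["Pakistan"]` holds only Osama bin Laden, while Osama bin Laden is connected to the four other names.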
![Page 6: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/6.jpg)
Graph Representation
[Figure: graph whose nodes are Osama, al-Qaeda, Ayman, Pakistan, and Afghanistan]
![Page 7: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/7.jpg)
Advantages
A link data set is easily processed by computers
It mimics the way intelligence communities gather data
![Page 8: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/8.jpg)
Alias Detection
Given two names in a link data set, are they aliases (i.e., do they refer to the same entity)?
How to measure their alias-ness?
Semi-supervised learning
![Page 9: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/9.jpg)
Orthographic Measures
String edit distance
– Minimum number of insertions, deletions, and substitutions required to transform one name into the other
– SED(Osama, Usama) = 2
– SED(Osama, Bush) = 7
– Intuitive measure
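A minimal dynamic-programming sketch of string edit distance follows. The `sub_cost` parameter is an assumption added here: the two values quoted on the slide (2 and 7) appear consistent with counting a substitution as a deletion plus an insertion, whereas unit-cost substitution would give 1 and 5. The slides do not state which costs the thesis uses, so treat this as an inference.

```python
def edit_distance(a, b, sub_cost=1):
    """Edit distance with unit insert/delete cost and configurable substitution cost."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                                   # delete ca
                curr[j - 1] + 1,                               # insert cb
                prev[j - 1] + (0 if ca == cb else sub_cost),   # match or substitute
            ))
        prev = curr
    return prev[-1]

# Substitution counted as two operations (assumption that matches the slide's values).
sed_osama_usama = edit_distance("Osama", "Usama", sub_cost=2)
sed_osama_bush = edit_distance("Osama", "Bush", sub_cost=2)
```

The comparison is case-sensitive as written; lowercasing both names first is a reasonable alternative.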
![Page 10: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/10.jpg)
Some Orthographic Measures
String edit distance
Normalized string edit distance
Discretized string edit distance
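The slides name the normalized and discretized variants without defining them, so the sketch below rests on assumptions: normalization divides by the length of the longer name, and discretization buckets the raw distance using the hypothetical thresholds `(1, 3, 5)`. The thesis may define both differently.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def normalized_sed(a, b):
    """Assumed normalization: divide by the longer name's length, giving [0, 1]."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

def discretized_sed(a, b, thresholds=(1, 3, 5)):
    """Assumed discretization: index of the bucket the raw distance falls into."""
    distance = edit_distance(a, b)
    return sum(distance > t for t in thresholds)
```

Normalizing keeps long names from looking dissimilar merely because they are long; bucketing gives the classifier a coarse categorical attribute.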
![Page 11: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/11.jpg)
Semantic Measures
But what about aliases such as the Prince and Osama?
Define the friends of Osama as the people who occur in the same links as Osama
From the link data set, the number of co-occurrences with each friend can be counted
Intuition: friends of the Prince look like friends of Osama
Treat friends lists as probability vectors
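Counting friends and turning the counts into a probability vector can be sketched as follows. The toy links and the function names `friend_counts` and `to_probabilities` are hypothetical and are not taken from the thesis.

```python
from collections import Counter

def friend_counts(links, name):
    """Count how often each other name shares a link with `name`."""
    counts = Counter()
    for link in links:
        if name in link:
            counts.update(other for other in link if other != name)
    return counts

def to_probabilities(counts):
    """Scale co-occurrence counts so that they sum to 1."""
    total = sum(counts.values())
    return {friend: n / total for friend, n in counts.items()} if total else {}

# Hypothetical toy data, for illustration only.
toy_links = [
    ("Osama", "al-Qaeda"),
    ("Osama", "al-Qaeda", "CNN"),
    ("Osama", "Islam"),
    ("Bush", "CNN"),
]
osama_friends = friend_counts(toy_links, "Osama")
osama_vector = to_probabilities(osama_friends)
```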
![Page 12: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/12.jpg)
Example of Friends
[Figure: Osama's friends with co-occurrence counts: al-Qaeda (10), Islam (5), CNN (2)]
![Page 13: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/13.jpg)
Comparing Two Friends Lists
[Figure: friends of Osama: al-Qaeda (10), Islam (5), CNN (2); friends of The Prince: al-Qaeda (2), Music (50), CNN (8)]
![Page 14: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/14.jpg)
Some Semantic Measures
Dot product: 10 * 2 + 2 * 8
Normalized dot product
Common friends: 2 (CNN, al-Qaeda)
KL distance: [formula image not preserved]
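The measures listed above can be sketched on the counts from the friends-list comparison (Osama: al-Qaeda 10, Islam 5, CNN 2; The Prince: al-Qaeda 2, Music 50, CNN 8). Two points are assumptions: "normalized dot product" is taken to mean cosine similarity, and the KL formula did not survive in the slides, so the version below is the standard KL divergence with a small additive smoothing term `eps` to avoid division by zero for friends only one name has.

```python
import math

osama = {"al-Qaeda": 10, "Islam": 5, "CNN": 2}
prince = {"al-Qaeda": 2, "Music": 50, "CNN": 8}

def dot_product(u, v):
    """Sum of count products over friends present in both lists."""
    return sum(count * v[friend] for friend, count in u.items() if friend in v)

def normalized_dot_product(u, v):
    """Assumed to be cosine similarity: dot product divided by both vector norms."""
    norm = math.sqrt(dot_product(u, u) * dot_product(v, v))
    return dot_product(u, v) / norm if norm else 0.0

def common_friends(u, v):
    """Number of friends that appear in both lists."""
    return len(u.keys() & v.keys())

def kl_distance(u, v, eps=1e-6):
    """Smoothed KL divergence D(p || q) between the two friend distributions."""
    friends = u.keys() | v.keys()
    u_total = sum(u.values()) + eps * len(friends)
    v_total = sum(v.values()) + eps * len(friends)
    total = 0.0
    for friend in friends:
        p = (u.get(friend, 0) + eps) / u_total
        q = (v.get(friend, 0) + eps) / v_total
        total += p * math.log(p / q)
    return total
```

KL divergence is asymmetric, so `kl_distance(osama, prince)` and `kl_distance(prince, osama)` generally differ; whether the thesis symmetrizes it is not stated in the slides.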
![Page 15: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/15.jpg)
Classifier
So we have a link data set
We have some measures of what aliases are
We can easily hand-pick some examples of aliases
Let’s build a classifier!
![Page 16: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/16.jpg)
Classifier Training Set
Positive examples: hand-pick pairs of names in the link data set that are known aliases
Negative examples: randomly pick pairs of names from the same link data set
Calculate the measures for all the pairs and insert them as attributes into the training set
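The construction of the training set can be sketched as below. The function name, the `measures` argument (a list of functions that each map a pair of names to a number), and the rule that randomly drawn pairs must not be known aliases are all assumptions; the slides only say that negatives are picked at random.

```python
import random

def build_training_set(names, alias_pairs, measures, n_negative, seed=0):
    """Return (attributes, label) rows: label 1 for known aliases, 0 for random pairs."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in alias_pairs}
    pool = sorted(set(names))
    max_negative = len(pool) * (len(pool) - 1) // 2 - len(known)
    if n_negative > max_negative:
        raise ValueError("not enough distinct non-alias pairs to sample")

    # Positive examples: the hand-picked alias pairs.
    rows = [([m(a, b) for m in measures], 1) for a, b in alias_pairs]

    # Negative examples: distinct random pairs (assumed to exclude known aliases).
    negatives = set()
    while len(negatives) < n_negative:
        pair = frozenset(rng.sample(pool, 2))
        if pair not in known:
            negatives.add(pair)
    for pair in sorted(negatives, key=sorted):
        a, b = sorted(pair)
        rows.append(([m(a, b) for m in measures], 0))
    return rows

# Hypothetical measure and data, for illustration only.
def length_gap(a, b):
    return abs(len(a) - len(b))

rows = build_training_set(
    names=["Osama", "Usama", "Bush", "Dubya", "Sid", "Bob"],
    alias_pairs=[("Osama", "Usama"), ("Bush", "Dubya")],
    measures=[length_gap],
    n_negative=3,
)
```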
![Page 17: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/17.jpg)
Classifier Example:
![Page 18: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/18.jpg)
Classifier: Cross-Validation
Experimented with Decision Trees, k-Nearest Neighbors, Naïve Bayes, Support Vector Machines, and Logistic Regression
Logistic Regression performed the best
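Since logistic regression came out best, a minimal pure-Python sketch of such a classifier is given here to show how pair attributes become a score between 0 and 1. It uses plain stochastic gradient descent with hypothetical hyperparameters and toy data; it is not the thesis implementation, and a library implementation would normally be preferred.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_regression(rows, epochs=500, lr=0.1):
    """Fit weights and bias on (attributes, label) rows by stochastic gradient descent."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in rows:
            z = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            bias -= lr * error
            weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights, bias

def score(model, features):
    """Estimated probability that a pair with these attributes is an alias pair."""
    weights, bias = model
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

# Toy data: one attribute (say, a normalized edit distance); aliases have small values.
toy_rows = [([0.2], 1), ([0.1], 1), ([0.9], 0), ([0.8], 0)]
model = train_logistic_regression(toy_rows)
```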
![Page 19: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/19.jpg)
Prediction
Given a query name in the link data set with known aliases
Pair the query name with ALL other names
Calculate attributes for all pairs
Run each pair through the classifier and obtain a score (how likely are they to be aliases?)
![Page 20: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/20.jpg)
Example
![Page 21: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/21.jpg)
Prediction
Use the score to sort the pairs from most likely to be an alias to least likely
See where the true aliases lie in the sorted list and produce a ROC curve
Evaluate classifier based on ROC curve
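The prediction steps (pair the query with every other name, score each pair, sort by score) can be sketched as follows. `score_pair` stands for whatever the trained classifier returns for a pair; the `positional_overlap` scorer used here is a hypothetical stand-in so that the example runs on its own.

```python
def rank_candidates(query, names, score_pair):
    """Return (name, score) pairs for all names except the query, best score first."""
    scored = [(name, score_pair(query, name)) for name in names if name != query]
    scored.sort(key=lambda item: item[1], reverse=True)  # stable sort: ties keep input order
    return scored

def positional_overlap(a, b):
    """Hypothetical stand-in scorer: fraction of positions holding the same character."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

ranking = rank_candidates("Osama", ["Osama", "Bush", "Usama", "Sid"], positional_overlap)
```

The resulting order, together with the known true aliases of the query, is what the ROC construction consumes.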
![Page 22: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/22.jpg)
Summary
[Figure: pipeline diagram. Training: true alias pairs (excluding the query name) and random pairs, Calc Attributes, Train Logistic Regression. Prediction: query name, Calc Attributes, Run Classifier, ROC curve.]
![Page 23: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/23.jpg)
ROC Curve
Start from (0,0) on the graph
Go down the sorted list
If the name on the list is a true alias, move y by one unit
If the name on the list is not a true alias, move x by one unit
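The walk described above translates directly into code. The input is the sorted list reduced to booleans (True where the name is a true alias); the function name `roc_points` is chosen here for illustration.

```python
def roc_points(sorted_labels):
    """Trace the ROC path: a true alias moves up one unit, anything else moves right."""
    x = y = 0
    points = [(0, 0)]
    for is_alias in sorted_labels:
        if is_alias:
            y += 1
        else:
            x += 1
        points.append((x, y))
    return points

# All true aliases ranked first.
perfect = roc_points([True, True, True, False, False, False])
# True aliases interleaved with non-aliases.
mixed = roc_points([True, False, True, False, False, True])
```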
![Page 24: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/24.jpg)
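Perfect ROC Example
[Figure: ROC plot with both axes running from 0 to 3]
| name1 | name2 | true alias? | Position |
|---|---|---|---|
| Osama | The Prince | Yes | (0,1) |
| Osama | Usama | Yes | (0,2) |
| Osama | The Emir | Yes | (0,3) |
| Osama | Sid | No | (1,3) |
| Osama | Bob | No | (2,3) |
| Osama | John | No | (3,3) |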
![Page 25: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/25.jpg)
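ROC Example
[Figure: ROC plot with both axes running from 0 to 3]
| name1 | name2 | true alias? | Position |
|---|---|---|---|
| Osama | The Prince | Yes | (0,1) |
| Osama | Bob | No | (1,1) |
| Osama | Usama | Yes | (1,2) |
| Osama | Sid | No | (2,2) |
| Osama | John | No | (3,2) |
| Osama | The Emir | Yes | (3,3) |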
![Page 26: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/26.jpg)
ROC: Normalize
[Figure: ROC plot with both axes normalized to run from 0 to 1]
Balance positive and negative examples
Area under curve (AUC) = 5/9
Able to average multiple curves
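The normalized area can be computed by summing the curve height at every step to the right and dividing by (number of true aliases) × (number of non-aliases). The sketch below uses `Fraction` so the 5/9 value is reproduced exactly; the function name is an assumption.

```python
from fractions import Fraction

def normalized_auc(sorted_labels):
    """Area under the step ROC curve after scaling both axes to [0, 1]."""
    positives = sum(1 for is_alias in sorted_labels if is_alias)
    negatives = len(sorted_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one true alias and one non-alias")
    area = 0
    height = 0
    for is_alias in sorted_labels:
        if is_alias:
            height += 1      # move up: no area added
        else:
            area += height   # move right: add a unit-wide column of the current height
    return Fraction(area, positives * negatives)

# Ranking with true aliases at positions 1, 3, and 6 of six candidates.
auc_mixed = normalized_auc([True, False, True, False, False, True])
auc_perfect = normalized_auc([True, True, True, False, False, False])
```

Because every curve ends at (1, 1) after normalization, curves from queries with different numbers of aliases become comparable and can be averaged.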
![Page 27: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/27.jpg)
Empirical Results
Test on one web page link data set and two spam link data sets
Hand-pick aliases for each set
![Page 28: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/28.jpg)
Empirical Results
Choose an alias from the set of hand-picked aliases as the query name
Build the classifier from the other aliases, excluding those that are aliases of the query name
Do prediction and obtain a ROC curve
Repeat for each alias in the set of hand-picked aliases
Average all ROC curves along the normalized axes
![Page 29: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/29.jpg)
Evaluation
We want to know how significant each group of attributes is
Train one classifier with just orthographic attributes
Train another with just semantic attributes
Train a third with both sets of attributes
Compare the curves and the area under the curve (AUC)
![Page 30: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/30.jpg)
Terrorist Data Set
Manually extracted from public web pages
News and articles related to terrorism
Names mentioned in the articles are subjectively linked
Used 919 alias pairs for training
![Page 31: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/31.jpg)
Web Page Chart
![Page 32: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/32.jpg)
Spam Data Set
Collection of spam emails
Filter out HTML tags
All the words are converted to tokens, with white space as the boundary
Common tokens are filtered (e.g. “the”, “a”)
Each email represents a link
Each link contains the tokens from the corresponding email
![Page 33: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/33.jpg)
Example
Subject: Mortgage rates as low as 2.95%
Ref<suyzvigcffl>ina<swwvvcobadtbo>nce to<shecpgkgffa>day to as low as 2.<sppyjukbywvbqc>95% Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of dol<scarqdscpvibyw>l<sklhxmxbvdr>ars or b<skaavzibaenix>uy the <br>ho<solbbdcqoxpdxcr>me of yo<svesxhobppoy>ur dr<sxjsfyvhhejoldl>eams!<br>
Filtered to:
(mortgage, rates, low, refinance, today, save, thousands, dollars, home, dreams)
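The preprocessing that turns one email into one link can be sketched as below. The stop-word list is hypothetical (the slides only give “the” and “a” as examples), and lowercasing, stripping surrounding punctuation, and dropping repeated tokens are assumptions added here. The sketch is not claimed to reproduce the filtered output above exactly.

```python
import re

TAG = re.compile(r"<[^>]*>")                           # anything between angle brackets
STOP_WORDS = {"the", "a", "as", "to", "of", "or"}      # hypothetical common-token list

def email_to_link(text, stop_words=STOP_WORDS):
    """Strip tags, split on white space, and drop common tokens; returns one link."""
    text = TAG.sub("", text)  # removing tags rejoins words the spammer split apart
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(".,!?:;\"'()")
        if token and token not in stop_words and token not in tokens:
            tokens.append(token)
    return tuple(tokens)

link = email_to_link("Sa<scqzxytdcua>ve thou<sdzkltzcyry>sa<sefaioubryxkpl>nds of "
                     "dol<scarqdscpvibyw>l<sklhxmxbvdr>ars!<br>")
```

Replacing tags with the empty string is what reassembles obfuscated words; the trade-off is that two words separated only by a tag would be merged.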
![Page 34: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/34.jpg)
Spam I Chart
![Page 35: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/35.jpg)
Spam II Chart
![Page 36: Alias Detection in Link Data Sets Master’s Thesis Paul Hsiung](https://reader038.vdocuments.mx/reader038/viewer/2022103022/56649d405503460f94a1ab40/html5/thumbnails/36.jpg)
Conclusion
Orthographic measures work well
Semantic measures are sometimes better, sometimes worse than orthographic ones
Combining them produces the best results
Future work includes adding other measures such as phonetic string edit distance
Larger question: many aliases to many names