
Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 77–84, Copenhagen, Denmark, September 7, 2017. © 2017 Association for Computational Linguistics

An Entity Resolution Approach to Isolate Instances of Human Trafficking Online

Chirag Nagpal, Kyle Miller, Benedikt Boecking and Artur Dubrawski
[email protected], [email protected], [email protected], [email protected]

Carnegie Mellon University

Abstract

Human trafficking is a challenging law enforcement problem, and traces of victims of such activity manifest as 'escort advertisements' on various online forums. Given the large, heterogeneous and noisy structure of this data, building models to predict instances of trafficking is a convoluted task. In this paper we propose an entity resolution pipeline using a notion of proxy labels, in order to extract clusters from this data with prior history of human trafficking activity. We apply this pipeline to 5M records from backpage.com and report on the performance of this approach, challenges in terms of scalability, and some significant domain specific characteristics of our resolved entities.

1 Introduction

Over the years, human trafficking has grown to be a challenging law enforcement issue. The advent of the internet has brought the problem into the public domain, making it an ever greater societal concern. Prior studies (Kennedy, 2012) have leveraged computational techniques to mine online escort advertisements from classifieds websites to detect spatio-temporal patterns, by utilizing certain domain specific features of the ads. Other studies (Dubrawski et al., 2015) have utilized machine learning approaches to identify if ads are likely to be involved in human trafficking activity. Significant work has also been carried out in building large distributed systems to store and process such data, and carry out entity resolution to establish ontological relationships between various entities (Szekely et al., 2015).

In this paper we explore the possibility of leveraging online escort data in an attempt to identify sources of advertisements, i.e. grouping related ads by the persons that generated them. We isolate such clusters of related advertisements originating from the same source and identify whether these potential sources are involved in human trafficking using prior domain knowledge.

In ordinary Entity Resolution schemes, each record is considered to represent a single entity. A popular approach in such scenarios is a 'merge and purge' strategy where records are compared, matched, and then merged into a single more informative record, and the individual records are deleted from the dataset (Benjelloun et al., 2009).

While our problem can be considered a case of Entity Resolution, escort advertisements pose additional challenges as they form a noisy and unstructured dataset. For example, a single advertisement may represent more than one entity and as such can contain features belonging to more than one individual or group. We describe these idiosyncrasies of this domain in detail in the following sections. The advertisements are associated with multiple modalities including text, hyperlinks, images, timestamps, and locations. In order to featurize characteristics from the text, we use a regex based information extractor constructed via the GATE framework (Cunningham, 2002). This allows us to generate certain domain specific features from our dataset such as aliases, cost, location, phone numbers, or specific URLs. We use these hand engineered features, along with other generic features like text similarity and the number of common images, as inputs to our binary match function. We note that many identifying characteristics of entities, like aliases, are common and shared between escorts, which makes it difficult to generate exact matches over individual features.

Figure 1: (a) Search results on backpage.com; (b) a representative escort advertisement. Escort advertisements are a classic source of what can be described as noisy text. Notice the excessive use of emojis, intentional misspelling and relatively benign colloquialisms to obfuscate a more nefarious intent. Domain experts extract meaningful cues from the spatial and temporal indicators, and other linguistic markers, to identify suspected trafficking activity, which further motivates the leveraging of computational approaches to support such decision making.

We proceed to leverage machine learning approaches to learn a classifier that can predict if two advertisements are from the same source, the challenge being the lack of prior knowledge of the source of advertisements. We thus depend upon a strong linking feature, in our case phone numbers, which can be used as proxy evidence for the source of the advertisements and can help us generate labels for the training and test data for a classifier. Using such strong evidence to learn another function that can label the rest of our dataset is a semi-supervised approach described as 'surrogate learning' in (Veeramachaneni and Kondadadi, 2009). Pairwise comparisons result in an extremely high number of comparisons over the entire dataset. In order to reduce this computational burden we introduce a blocking scheme, described later.

The resulting clusters are labeled according to their human trafficking relevance using prior expert knowledge. Rule learning is used to establish differences between such relevant clusters and other extracted clusters. The entire pipeline is represented by Figure 2.

2 Domain and Feature Extraction

Figure 1 is illustrative of the search results for escort advertisements and a page advertising a particular individual. The text is inundated with special characters, emojis, as well as misspelled words that are specific markers and highly informative to domain experts.

Table 1: Performance of TJBatchExtractor

Feature          Precision  Recall  F1 Score
Age              0.980      0.731   0.838
Cost             0.889      0.966   0.926
E-mail           1.000      1.000   1.000
Ethnicity        0.969      0.876   0.920
Eye Color        1.000      0.962   0.981
Hair Color       0.981      0.959   0.970
Name             0.896      0.801   0.846
Phone Number     0.998      0.995   0.997
Restriction(s)   0.949      0.812   0.875
Skin Color       0.971      0.971   0.971
URL              0.854      0.872   0.863
Height           0.978      0.962   0.970
Measurement      0.919      0.883   0.901
Weight           0.976      0.912   0.943

The text consists of information regarding the escort's area of operation, phone number, any particular client preferences, and the advertised cost. We built a regular expression based feature extractor to extract this information and store it in a fixed schema, using the popular JAPE tool, part of the GATE suite of NLP tools. The extractor we build for this domain, TJBatchExtractor, is open source and publicly available at github.com/autoncompute/CMU_memex.
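To make this step concrete, the snippet below is a minimal Python sketch of regex based feature extraction over ad text. The patterns and the extract_features helper are hypothetical stand-ins for illustration; the actual TJBatchExtractor uses far richer JAPE grammars over GATE annotations.

import re

# Illustrative patterns only: hypothetical stand-ins for the JAPE
# grammars used by the actual TJBatchExtractor.
PATTERNS = {
    # North-American style phone numbers, e.g. 412-555-0123
    "phone": re.compile(r"\b\d{3}[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    # Advertised rates such as "$120/hr" or "150 per hour"
    "cost": re.compile(r"\$?\d{2,4}\s*(?:/|per\s+)(?:hr|hour)s?", re.IGNORECASE),
    # Ages such as "21 yrs" or "19 years old"
    "age": re.compile(r"\b(?:1[89]|[2-5]\d)\s*(?:yrs?|years?(?:\s+old)?)\b", re.IGNORECASE),
}

def extract_features(ad_text):
    """Apply every pattern and collect the matches into a fixed schema."""
    return {name: pattern.findall(ad_text) for name, pattern in PATTERNS.items()}

print(extract_features("Sweet Paris 21 yrs, $120/hr, call 412-555-0123"))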

Table 1 lists the performance of our extraction tool on 1,000 randomly sampled escort advertisements, for the various features. Most of the features are self explanatory. (The reader is directed to (Dubrawski et al., 2015) for a complete description of the fields extracted.)

Figure 2: The proposed Entity Resolution pipeline. (Stages: feature extraction from raw data; entity resolution with strong features; sample data and train match function; entity resolution with learnt match function; rule learning.)

The noisy nature of the data, along with intentional obfuscations, especially in the case of features like names, results in lower performance compared to the other extracted features.

Apart from the regular expression based features, we also extract the hash values of the images in the advertisements as their identifier, along with the posting date and time, and location.¹

Figure 3: On applying our match function, weak links are generated for classifier scores above a certain match threshold. The strong links between nodes are represented by solid lines. Dashed lines represent the weak links generated by our classifier.

3 Entity Resolution

3.1 Definition

The trained match function can be used to represent our data as a graph where the nodes represent advertisements and edges represent the similarity between ads. We approach the problem of extracting connected components from our dataset using pairwise entity resolution. The similarity or connection between two nodes is treated as a learning problem, with training data for the problem generated by using 'proxy' labels from existing evidence of connectivity from strong features.

More formally, the problem can be considered to be to sample all connected components $H_i(\mathcal{V}, \mathcal{E})$ from a graph $G(\mathcal{V}, \mathcal{E})$. Here $\mathcal{V}$, the set of vertices $\{v_1, v_2, \ldots, v_n\}$, is the set of advertisements, and $\mathcal{E} = \{(v_i, v_j), (v_j, v_k), \ldots, (v_k, v_l)\}$ is the set of edges between individual records, the presence of which indicates that they represent the same entity.

¹ These features are present as metadata, and do not require the use of hand engineered regular expressions.

We need to learn a function $M(v_i, v_j)$ such that $M(v_i, v_j) = \Pr\big((v_i, v_j) \in \mathcal{E}(H_i), \forall H_i \in \mathcal{H}\big)$.

The set of strong features present in a given record is given by the function $S$. In our problem, $S_v$ represents all the phone numbers associated with $v$. Thus $\mathcal{S} = \bigcup S_{v_i}, \forall v_i \in \mathcal{V}$. Here, $|\mathcal{S}| \ll |\mathcal{V}|$.

Now, let us further consider the graph $G^*(\mathcal{V}, \mathcal{E})$ defined on the set of vertices $\mathcal{V}$, such that $(v_i, v_j) \in \mathcal{E}(G^*)$ if $|S_{v_i} \cap S_{v_j}| > 0$ (more simply, the graph described by strong features).

Let $\mathcal{H}^*$ be the set of all connected components $\{H^*_1(\mathcal{V}, \mathcal{E}), H^*_2(\mathcal{V}, \mathcal{E}), \ldots, H^*_n(\mathcal{V}, \mathcal{E})\}$ defined on the graph $G^*(\mathcal{V}, \mathcal{E})$.

Now, the function $P$ is such that for any $p_i \in \mathcal{S}$, $P(p_i) = \mathcal{V}(H^*_k) \iff p_i \in \bigcup S_{v_i}, \forall v_i \in \mathcal{V}(H^*_k)$.
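Since $|\mathcal{S}| \ll |\mathcal{V}|$, the components $\mathcal{H}^*$ can be recovered cheaply by grouping ads that share a phone number. The following is a minimal Python sketch (assuming each ad carries its set of extracted phone numbers) using a union-find structure; it is illustrative, not the pipeline's actual implementation.

from collections import defaultdict

class UnionFind:
    """Disjoint-set forest used to accumulate connected components."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def strong_components(ads):
    """ads: iterable of (ad_id, set_of_phone_numbers) pairs.
    Two ads share an edge in G* iff their phone sets intersect."""
    uf = UnionFind()
    by_phone = defaultdict(list)
    for ad_id, phones in ads:
        uf.find(ad_id)                 # register isolated nodes too
        for p in phones:
            by_phone[p].append(ad_id)
    for ids in by_phone.values():      # all ads sharing one phone number
        for other in ids[1:]:
            uf.union(ids[0], other)
    components = defaultdict(set)
    for ad_id in list(uf.parent):
        components[uf.find(ad_id)].add(ad_id)
    return list(components.values())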

3.2 Sampling Scheme

For our classifier we need to generate a set of training examples $T$, where $T_{pos}$ and $T_{neg}$ are the subsets of samples labeled positive and negative:

$T_{pos} = \{F_{v_i,v_j} \mid v_i \in P(p_i), v_j \in P(p_i), \forall p_i \in \mathcal{S}\}$
$T_{neg} = \{F_{v_i,v_j} \mid v_i \in P(p_i), v_j \notin P(p_i), \forall p_i \in \mathcal{S}\}$

In order to ensure that the sampling scheme does not end up sampling near duplicate pairs, we introduce a sampling bias such that for every feature vector $F_{v_i,v_j} \in T_{pos}$, $S_{v_i} \cap S_{v_j} = \emptyset$. This reduces the likelihood of sampling near duplicates, as evidenced in Figure 5, which is a histogram of Jaccard similarities between the sets of unigrams of ad pairs:

$sim(v_i, v_j) = \frac{|unigrams(v_i) \cap unigrams(v_j)|}{|unigrams(v_i) \cup unigrams(v_j)|}$

We observe that although we do still end up with some near duplicates ($sim > 0.9$), we have a high number of non-duplicates ($0.1 < sim < 0.3$), which ensures robust training data for our classifier.
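A small Python sketch of this sampling scheme follows, reusing the components and per-ad phone sets from above. The helper and its parameters are illustrative; in practice one would also guard against components that contain no transitively linked (phone-disjoint) pairs.

import random

def sample_pairs(components, phones, n_pairs, seed=0):
    """components: list of vertex sets V(H*_k) from the strong-feature
    graph; phones: dict ad_id -> set of phone numbers (S_v).
    Returns positive and negative training pairs of ad ids."""
    rng = random.Random(seed)
    comps = [sorted(c) for c in components if len(c) > 1]
    pos, neg = [], []
    while len(pos) < n_pairs:
        comp = rng.choice(comps)
        a, b = rng.sample(comp, 2)
        # Sampling bias: positives must share no phone number directly,
        # i.e. S_a and S_b are disjoint (the ads are linked only
        # transitively). Assumes such pairs exist in the data.
        if not (phones[a] & phones[b]):
            pos.append((a, b))
    while len(neg) < n_pairs:
        c1, c2 = rng.sample(comps, 2)  # two distinct components
        neg.append((rng.choice(c1), rng.choice(c2)))
    return pos, neg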

3.3 Training

To train our classifier we experiment with various classifiers like Logistic Regression (LR), Naive Bayes (NB) and Random Forest (RF), using scikit-learn (Pedregosa et al., 2011).
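A minimal sketch of this comparison with scikit-learn is shown below; the train/test split and hyperparameters are illustrative assumptions, as the paper does not specify them.

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def compare_match_functions(X, y):
    """X: pairwise feature vectors F(vi, vj); y: 1 if the pair shares a
    source (by the phone-number proxy), else 0."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)
    models = {
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
        "LR": LogisticRegression(max_iter=1000),
        "NB": GaussianNB(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores = model.predict_proba(X_te)[:, 1]  # match probability
        print(f"{name}: ROC AUC = {roc_auc_score(y_te, scores):.3f}")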

Figure 4: ROC curves for our match function trained on various feature sets (Regx; Regx+Temporal; Regx+Temporal+NLP; Regx+Temporal+NLP+Spatial), comparing RF, LR, NB and random guessing, on both linear and log-scaled false positive axes. The ROC curves show reasonably large true positive rates at extremely low false positive rates, which is a desirable behaviour of the match function.

Figure 5: Text similarity for our sampling scheme. We use Jaccard similarity between the ad unigrams as a measure of text similarity. The histogram shows that the sampling scheme results in both a large number of near duplicates and non-duplicates. Such a behavior is desired to ensure a robust match function.

Table 2 shows the most informative features learnt by the Random Forest classifier. It is interesting to note that the most informative features include spatial (Location), temporal (Time Difference, Posting Date) and also linguistic (Number of Special Characters, Longest Common Substring) features. We also find that the domain specific features, extracted using regexes, prove to be informative.

The receiver operating characteristic (ROC) curves for the classifiers we tested with different feature sets are presented in Figure 4. The classifiers perform well at very low false positive rates. Such behavior is desirable for the classifier to act as a match function, in order to generate sensible results for the downstream tasks.

Table 2: Most Informative Features

Top 10 Features
1. Location (State)
2. Number of Special Characters
3. Longest Common Substring
4. Number of Unique Tokens
5. Time Difference
6. If Posted on Same Day
7. Presence of Ethnicity
8. Presence of Rate
9. Presence of Restrictions
10. Presence of Names

Figure 6: The plots represent the number of connected components and the size of the largest component versus the match threshold, for Logistic Regression and Random Forest.

High false positive rates increase the number of links between our records, leading to a 'snowball effect' which results in a break-down of the downstream Entity Resolution process, as evidenced in Figure 6.

In order to minimize this breakdown, we need to heuristically learn an appropriate confidence threshold for our classifier. This is done by carrying out the Entity Resolution process on 10,000 randomly selected records from our dataset. The size of the largest extracted connected component and the number of connected components isolated are calculated for different decision thresholds of our classifier. This allows us to come up with a sensible heuristic for the confidence value.
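A sketch of this heuristic follows: sweep candidate thresholds over classifier-scored pairs from the 10,000-record sample and record the component statistics plotted in Figure 6. The networkx usage and function shape are illustrative assumptions.

import networkx as nx

def threshold_curve(ad_ids, pair_scores, thresholds):
    """pair_scores: dict (ad_i, ad_j) -> match-function probability on
    the sampled records. Returns, per threshold, the number of connected
    components and the size of the largest one."""
    curve = []
    for t in thresholds:
        g = nx.Graph()
        g.add_nodes_from(ad_ids)
        # Keep only links the classifier is sufficiently confident in.
        g.add_edges_from(pair for pair, s in pair_scores.items() if s >= t)
        comps = list(nx.connected_components(g))
        curve.append((t, len(comps), max(len(c) for c in comps)))
    return curve  # pick a threshold before the largest component snowballs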

Figure 7: Blocking scheme: distribution of advertisement frequency across blocks formed from rare bigrams, unigrams, and images.

3.4 Blocking Scheme

Our dataset consists of over 5 million records. Naive pairwise comparison across the dataset makes this problem computationally intractable. In order to reduce the number of comparisons, we introduce a blocking scheme and perform exhaustive pairwise comparisons only within each block, before resolving the dataset across blocks. We block the dataset on features including rare unigrams, rare bigrams and rare images. Figure 7 represents the distribution of the frequency of advertisements across the different blocking schemes.
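The sketch below illustrates blocking of this kind in Python; the rarity and block-size cutoffs are illustrative assumptions, not the paper's settings.

from collections import defaultdict
from itertools import combinations

def build_blocks(ads, min_count=2, max_block_size=1000):
    """ads: dict ad_id -> set of blocking keys (rare unigrams, rare
    bigrams, image hashes). Returns candidate pairs for comparison."""
    blocks = defaultdict(set)
    for ad_id, keys in ads.items():
        for key in keys:
            blocks[key].add(ad_id)
    candidates = set()
    for key, members in blocks.items():
        # Keys shared by too many ads are not discriminative ("rare"
        # features only), and huge blocks defeat the purpose of blocking.
        if min_count <= len(members) <= max_block_size:
            candidates.update(combinations(sorted(members), 2))
    return candidates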

Figure 9: ROC for the connected component classifier. The black line is the positive set, while the red line is the average ROC for 100 randomly guessed predictors.

(a) This pair of ads has extremely similar textual content, including the use of non-latin and special characters. The ads also advertise the same individual, as strongly evidenced by the common alias, 'Paris'.

(b) The first ad here does not include any specific names of individuals. However, the strong textual similarity with the second ad and the same advertised cost help to match them and discover the individuals being advertised as 'Nick' and 'Victoria'.

(c) While this pair is not extremely similar in terms of language, the existence of the rare alias 'SierraDayna' in both advertisements helps the classifier in matching them. This match can also easily be verified by the similar language structure of the pair.

(d) The first advertisement represents entities 'Black China' and 'Star Quality', while the second advertisement reveals that the pictures used in the first advertisement are not original and belong to the author of the second ad. This example pair shows the robustness of our match function. It also reveals how complicated relationships between various ads can be.

Figure 8: Representative results of advertisement pairs matched by our classifier. In all four cases the advertisement pairs had no phone number information (strong feature) with which to detect connections. Note that sensitive elements have been intentionally obfuscated.

Table 3: Results of Rule Learning

Rule                                                                   Support  Ratio    Lift
Xminchars<=250, 120000<Xmaximgfrq, 3<Xmnweeks<=3.4, 4<Xmnmonths<=6.5   11       90.9%    2.67
Xminchars<=250, 120000<Xmaximgfrq, 4<Xmnmonths<=6.5                    16       81.25%   2.4
Xstatesnorm<=0.03, 3.6<Xuniqimgsnorm<=5.2, 3.2<Xstdmonths              17       100.0%   2.5
Xstatesnorm<=0.03, 1.95<Xstdweeks<=2.2, 3.2<Xstdmonths                 19       94.74%   2.37

Figure 10: PN curves (true positives vs. false positives) for various values of the maximum number of rules extracted by the rule learner (Max Rules: 5, 3, 2).

4 Rule Learning

We extract clusters and identify records that are associated with human trafficking using domain knowledge from experts. We featurize the extracted components using features like the size of the cluster, spatio-temporal characteristics, and the connectivity of the clusters. For our analysis, we consider only components with more than 300 advertisements. We then train a random forest classifier to predict if a cluster contains indicators of human trafficking. In order to establish statistical significance, we compare the ROC results of our classifier, using four fold cross-validation, for 100 random connected components versus the positive set. Figure 9 and Table 4 list the performance of the classifier in terms of false positive and true positive rate, while Table 5 lists the most informative features for this classifier.
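The sketch below illustrates cluster-level featurization of this kind; the field names and the particular statistics are illustrative assumptions rather than the paper's exact feature definitions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def featurize_component(ads):
    """ads: list of per-ad records in one connected component, each a
    dict with 'state', 'names', 'images' and 'timestamp' (epoch seconds).
    These field names and statistics are illustrative assumptions."""
    n = len(ads)
    states = {a["state"] for a in ads}
    names = {nm for a in ads for nm in a["names"]}
    images = {im for a in ads for im in a["images"]}
    ts = np.array([a["timestamp"] for a in ads], dtype=float)
    span_months = (ts.max() - ts.min()) / (30 * 24 * 3600.0)
    return [n, len(states) / n, len(names) / n, len(images) / n, span_months]

# X = np.array([featurize_component(c) for c in large_components])
# clf = RandomForestClassifier(random_state=0).fit(X, expert_labels)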

Additionally, we extract rules from our feature set that help establish differences between the trafficking relevant clusters and the rest. Some of the rules with corresponding ratios and lift are given in Table 3. PN curves for rule learning are analogous to ROC curves in classification (Furnkranz and Flach, 2005), and PN curves corresponding to various rules learnt are presented in Figure 10. It can be observed that the features used by the rule learner to learn rules with maximum support and ratios correspond to the ones labeled as informative by the random forest. This also serves as validation for the use of rule learning.

Figure 11: Representative entity isolated by our pipeline, believed to be involved in human trafficking. The nodes represent advertisements, while the edges represent links between advertisements; node labels show advertised aliases and locations (e.g. 'Jessica', Montgomery, AL; 'Monica', Scranton, PA; 'Kimberly', Montgomery, AL and Wilkes-Barre, PA). This entity has 802 nodes and 39,383 edges. The visualization is generated using Gephi (Bastian et al., 2009). This entity operated in cities across states and advertised multiple different individuals along with multiple phone numbers. This suggests a more complicated and organised activity and serves as an example of how complicated certain entities can be in this trade.

Table 4: Metrics for the Connected Component classifier

AUC      TPR@FPR=1%   FPR@TPR=50%
90.38%   66.6%        0.6%

Table 5: Most Informative Features

Top 5 Features
1. Posting Months
2. Posting Weeks
3. Std-Dev. of Image Frequency
4. Norm. No. of Names
5. Norm. No. of Unique Images

5 Conclusion

In this paper we approached the problem of isolating sources of human trafficking from online escort advertisements with a pairwise Entity Resolution approach. We trained a classifier able to predict if two advertisements are from the same source, using phone numbers as a strong feature which we exploit as proxy ground truth to generate training data. The resulting classifier proved to be robust, as evidenced by extremely low false positive rates. Other approaches like (Szekely et al., 2015) aim at building similar knowledge graphs using a similarity score between each feature. This has some limitations. Firstly, labelled training data is needed in order to train match functions to detect ontological relations. The challenge is aggravated since this approach considers each feature independently, making the generation of enough labelled training data for training multiple match functions an extremely complicated task.

Since we utilise existing features as proxy evidence, our approach can generate a large amount of training data without the need for any human annotation. Our approach requires learning just a single function over the entire feature set. Hence, our classifier can learn multiple complicated relations between features to predict a match, instead of relying on the naive feature independence assumption.

We then proceeded to use this classifier to perform entity resolution using a heuristically learned match threshold. The resultant connected components were again featurised, and a classifier model was fit before subjecting the components to rule learning. In comparison with (Dubrawski et al., 2015), the connected component classifier performs a little better, with higher values of the area under the ROC curve and of TPR@FPR=1%, indicating a steeper ROC curve. We hypothesize that due to the entity resolution process, we are able to generate a larger, more robust amount of training data which is immune to noise in the labelling and results in a stronger classifier. The learnt rules show high ratios and lift for reasonably high support, as shown in Table 3. Rule learning also adds an element of interpretability to the models we built; compared to more complex ensemble methods like Random Forests, hard rules as classification models are preferred by domain experts to build evidence for incrimination.

6 Future Work

While our blocking scheme performs well in reducing the number of comparisons, scalability is still a significant challenge since our approach involves naive pairwise comparisons. One solution to this issue may be to design such a pipeline in a distributed environment. Another approach could be to use a computationally inexpensive technique to de-duplicate the dataset first, which would greatly help with regard to scalability.

In our approach, the ER process depends upon the heuristically learnt match threshold. Lower threshold values can significantly degrade performance, producing extremely large connected components as a result. Treating this attribute as a learning task would help make this approach more generic and less domain specific.

Hash codes of the images associated with the ads were also utilized as a feature for the match function. However, simple features like the number of unique and common images did not prove to be very informative. Further research is required in order to make better use of such visual data.

Acknowledgments

The authors would like to thank all staff, faculty and students who made the Robotics Institute Summer Scholars program 2015 at Carnegie Mellon University possible. This work has been partially supported by the National Institute of Justice (2013-IJ-CX-K007) and the Defense Advanced Research Projects Agency (FA8750-14-2-0244).

References

Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. 2009. Gephi: An open source software for exploring and manipulating networks. http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154.

Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal: The International Journal on Very Large Data Bases 18(1):255–276.

Hamish Cunningham. 2002. GATE, a general architecture for text engineering. Computers and the Humanities 36(2):223–254.

Artur Dubrawski, Kyle Miller, Matthew Barnes, Benedikt Boecking, and Emily Kennedy. 2015. Leveraging publicly available data to discern patterns of human-trafficking activity. Journal of Human Trafficking 1(1):65–85.

Johannes Furnkranz and Peter A Flach. 2005. ROC 'n' rule learning: towards a better understanding of covering algorithms. Machine Learning 58(1):39–77.

Emily Kennedy. 2012. Predictive patterns of sex trafficking online.

Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12:2825–2830. http://dl.acm.org/citation.cfm?id=1953048.2078195.

Pedro Szekely, Craig A. Knoblock, Jason Slepicka, Andrew Philpot, Amandeep Singh, Chengye Yin, Dipsy Kapoor, Prem Natarajan, Daniel Marcu, Kevin Knight, David Stallard, Subessware S. Karunamoorthy, Rajagopal Bojanapalli, Steven Minton, Brian Amanatullah, Todd Hughes, Mike Tamayo, David Flynt, Rachel Artiss, Shih-Fu Chang, Tao Chen, Gerald Hiebel, and Lidia Ferreira. 2015. Building and using a knowledge graph to combat human trafficking. In Proceedings of the 14th International Semantic Web Conference (ISWC 2015).

Sriharsha Veeramachaneni and Ravi Kumar Kondadadi. 2009. Surrogate learning: from feature independence to semi-supervised classification. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. Association for Computational Linguistics, pages 10–18.
