Cleaning the Spurious Links in Data
amounts of data from heterogeneous sources, further increasing the likelihood of introducing errors. When mining these data warehouses for decision-making information, making logical and well-informed decisions becomes problematic if the data's quality is doubtful.
Data cleaning refers to processes for detecting and removing errors and inconsistencies from data. Research in this area has included outlier detection,1 noise handling for classification,2 and duplicate elimination.3,4 Given the "garbage in, garbage out" principle, clean data is crucial for database integration, data warehousing, and data mining, thus leading to data cleaning's strong association with these research areas.
A newly discovered class of erroneous data is spurious links, where a real-world entity has multiple links that might not be properly associated with it. Standard duplicate elimination techniques such as the sorted-neighborhood method3 or the priority-queue algorithm5 can't find this anomaly. The existence of such spurious links often leads to confusion and misrepresentation in the data records representing the entity. Consider the Digital Bibliography & Library Project database (http://dblp.uni-trier.de), a large collection of computer science bibliographical records. Each DBLP record captures a publication's list of authors, title, conference or journal name, and page numbers. Although the data set is well known for its high-quality bibliographic information, collecting and maintaining the data from diverse sources requires enormous effort. Errors, including spurious links, are inevitable.
To solve this problem, we use context information to identify spurious links. First, we identify data records that contain potential spurious links. We then determine the set of attributes that constitute each record's context. For example, the coauthors and conference/journal fields constitute a publication record's context and offer good hints for identifying potential spurious links. Experiments with three real-world databases have demonstrated that our approach can accurately identify spurious links.
A context-based solution

Figure 1 gives an overview of our approach. After finding records that might contain spurious links, this approach uses selected fields that capture the records' context to compute their context similarity. These context fields might have access to additional domain knowledge that provides a more comprehensive picture of the records' context. We developed a method to determine context similarity between two record sets, and we use their overall degree of similarity to subsequently identify any spurious links.
Table 1 shows a pair of records from the DBLP database. At first glance, Record 2 seems to contain a typographical error in the author name "Tshiharu Hasegawa," compared with "Toshiharu Hasegawa" in Record 1. The possibility that "Tshiharu Hasegawa" and "Toshiharu Hasegawa" refer to the same real-world person increases if we have this context information:
• They have some common coauthors, such as "Tetsuya Takine" and "Yutaka Takahashi."
• Their publication titles show similar research areas. For example, both the publication titles in Records
Enhancing Information
Cleaning the Spurious Links in Data

Mong Li Lee, Wynne Hsu, and Vijay Kothari, National University of Singapore
Data quality problems can arise from abbreviations, data entry mistakes, duplicate records, missing fields, and many other sources. These problems proliferate when you integrate multiple data sources in data warehousing, federated databases, and global information systems. Data warehouses load and frequently update large
Comparing context information between data records can help solve the data quality problem of spurious links, that is, multiple links between data entries and real-world entities.
1094-7167/04/$20.00 © 2004 IEEE. IEEE INTELLIGENT SYSTEMS. Published by the IEEE Computer Society.
1 and 2 are related to networking research.
• The conference proceedings or journals in which they publish have overlapping research areas. For example, the scope of "Performance" overlaps with that covered in "Performance Evaluation."
Records with potential spurious links

First, we need to identify the records with potential spurious links. We use existing string similarity matching algorithms to identify record pairs with a high degree of similarity in some attribute. For example, Table 1 shows a pair of records with high similarity for one of the coauthors, "Toshiharu Hasegawa" and "Tshiharu Hasegawa." On the basis of these high similarity values, we retrieve the associated records.
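The article doesn't name the particular string similarity algorithm it uses. As a hypothetical stand-in, Python's standard difflib gives the same flavor of character-based matching; the threshold value here is illustrative:

```python
from difflib import SequenceMatcher

def similar_pairs(names, threshold=0.9):
    """Flag name pairs whose string similarity exceeds the threshold.

    SequenceMatcher.ratio() returns 2*M/T, where M is the number of
    matching characters and T is the total length of both strings.
    """
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = SequenceMatcher(None, a, b).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 3)))
    return pairs

names = ["Toshiharu Hasegawa", "Tshiharu Hasegawa", "Tetsuya Takine"]
print(similar_pairs(names))
```

On this sample, only the Hasegawa pair exceeds the threshold, mirroring the high similarity value the article reports for these two names.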
Tables 2 and 3 show extracts from the sets of records retrieved from the DBLP Computer Science Bibliography using "Toshiharu Hasegawa" and "Tshiharu Hasegawa," respectively.
Context attributes

Next, we identify the attributes that might provide clues about the existence of spurious links. A spurious attribute contains erroneous values that might result in spurious record links. A context attribute contains values strongly correlated with those in the spurious attribute.

To identify the context attributes, we apply association-rules mining to the database to discover all associations among the attribute values. That is, we try to determine if a value in one attribute frequently occurs with some value in another attribute. We're particularly interested in rules whose antecedent contains the spurious attribute. If we obtain many rules that associate values between the spurious attribute and some other set of attributes, we consider the latter to be candidate context attributes. For example, for the DBLP data set, we found many rules with high confidence and minimum support (a user-defined threshold to retain only the rules satisfied by a reasonable number of data tuples) involving authors and their coauthors. An attribute can be a context attribute even if we don't obtain high-confidence association rules. For instance, the values corresponding to the Proceedings attribute don't
MARCH/APRIL 2004  www.computer.org/intelligent
Figure 1. Context-based cleaning of spurious links. (The pipeline: from the data set, retrieve records with possible spurious links; select fields with context information, guided by a hierarchical domain structure; compute context similarity; and identify spurious links.)
Table 3. Extracted records using "Tshiharu Hasegawa."

Record | Author list | Title | Proceedings
1 | Tetsuya Takine, Hideaki Takagi, Yutaka Takahashi, Tshiharu Hasegawa | Analysis of Asymmetric Single-Buffer Polling and Priority of Systems without Switchover Times | SIGMETRICS
2 | Yoshiyuki Shiozawa, Tetsuya Takine, Yutaka Takahashi, Tshiharu Hasegawa | Analysis of a Polling System with Correlated Input | Computer Networks & ISDN Systems
3 | Fumio Ishizaki, Tetsuya Takine, Tshiharu Hasegawa | Analysis of a Discrete-Time Queue with Gated Priority | Performance Evaluation
4 | Shoji Kasahara, Yutaka Takahashi, Tshiharu Hasegawa | Analysis of Waiting Time of M/G/1/K System | Performance Evaluation
Table 2. Extracted records using "Toshiharu Hasegawa."

Record | Author list | Title | Proceedings
1 | Shojiro Muro, Toshihide Ibaraki, Hidehiro Miyajima, Toshiharu Hasegawa | File Redundancy Issues in Distributed Database Systems | VLDB
2 | Kazuhiro Ohtsuki, Yutaka Takahashi, Toshiharu Hasegawa | Analysis for Traverse Time in an Integrated Communication Network | ICC
3 | Shojiro Muro, Toshihide Ibaraki, Hidehiro Miyajima, Toshiharu Hasegawa | Evaluation of File Redundancy in Distributed Database Systems | TSE
4 | Tetsuya Takine, Yutaka Takahashi, Toshiharu Hasegawa | Analysis of Asymmetric Polling System with Single Buffers | Performance
Table 1. A possible pair of duplicates in the Digital Bibliography & Library Project data set.

Record | Author list | Title | Proceedings
1 | Tetsuya Takine, Yutaka Takahashi, Toshiharu Hasegawa | Analysis of an Asymmetric Polling System with Single Buffers | Performance
2 | A. Sugahara, Tetsuya Takine, Yutaka Takahashi, Tshiharu Hasegawa | Analysis of a Nonpreemptive Priority Queue with SPP Arrivals of High Class | Performance Evaluation
appear in any high-confidence association rule satisfying the minimum support. However, if we generalize the Proceedings attribute values to the research area, we discover high-confidence rules involving Author and Research Area, and we can use the latter as a context attribute. We use concept hierarchies to generalize attribute values6 so that we don't miss such context attributes.

Figure 2 shows a concept hierarchy constructed for research articles' conference/journal field.
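A minimal sketch of concept-hierarchy generalization: each value is mapped to its parent concept. The parent links below are assumptions drawn from the shape of Figure 2 for illustration, not the paper's full tree:

```python
# Illustrative slice of Figure 2's hierarchy; the parent links are
# assumptions made for this sketch, not the paper's complete hierarchy.
PARENT = {
    "IAAI": "AI",
    "IEEE Multimedia": "Multimedia",
    "AI": "Computer science",
    "Multimedia": "Computer science",
    "Computer science": "Subject",
}

def generalize(value, levels=1):
    """Climb the concept hierarchy the given number of levels; values
    without a recorded parent are returned unchanged."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value

print(generalize("IAAI"))      # one level up: its research area
print(generalize("IAAI", 2))   # two levels up: its subject
```

Generalizing venue names to research areas in this way is what lets rules such as Author → Research Area surface even when venue-level rules lack support.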
On the basis of this rationale, we apply the following algorithm to determine the potential context attributes:

Algorithm FindContextAttr
1. If there exist attributes with a concept hierarchy, then generalize the attribute values.
2. Generate association rules for the database given user-specified minimum support and minimum confidence.
3. Identify the rules that contain spurious attributes in their antecedents.
4. Attributes in the consequent of these rules constitute the context attributes.
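The steps above can be sketched as follows. This is a simplified, hypothetical implementation: it mines only pairwise value associations rather than running a full association-rule miner such as the DM2-CBA system the authors used, and the record layout and field names are illustrative:

```python
from collections import Counter

def find_context_attrs(records, spurious_attr, min_support=2, min_conf=0.75):
    """Sketch of FindContextAttr. For each other attribute, mine the
    pairwise rules  spurious_value -> attr_value  and keep the attribute
    as a context attribute if any rule meets the minimum support and
    minimum confidence. (A full miner would consider larger itemsets.)"""
    context = set()
    for attr in records[0]:
        if attr == spurious_attr:
            continue
        pair_count = Counter()
        antecedent_count = Counter()
        for rec in records:
            antecedent_count[rec[spurious_attr]] += 1
            pair_count[(rec[spurious_attr], rec[attr])] += 1
        for (sv, _), n in pair_count.items():
            if n >= min_support and n / antecedent_count[sv] >= min_conf:
                context.add(attr)
                break
    return context

# Illustrative records; names and fields are assumptions for this sketch.
records = [
    {"first_author": "T. Hasegawa", "coauthor": "T. Takine", "area": "Performance"},
    {"first_author": "T. Hasegawa", "coauthor": "T. Takine", "area": "Performance"},
    {"first_author": "S. Muro", "coauthor": "T. Ibaraki", "area": "Databases"},
    {"first_author": "S. Muro", "coauthor": "H. Miyajima", "area": "Databases"},
]
print(find_context_attrs(records, "first_author"))
```

With these sample records, both the coauthor and area fields pass the support and confidence thresholds, matching the paper's finding that coauthor and (generalized) proceedings serve as context attributes.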
Record similarity

We now determine the retrieved records' similarity to find spurious links. Given two sets of records and a list of context attributes, we must decide if the two sets refer to the same real-world entity. We do this by determining how similar their context attributes are.

Our study indicates that for identifying spurious links, a simple yet effective column-wise similarity measure achieves the best accuracy. Essentially, the column-wise similarity measure for a context attribute C would take the union of the values of attribute C in each set of records and find the degree of overlap according to the formula

ColSimC(A1, A2) = |A1 ∩ A2| / min(|A1|, |A2|),

where A1 is the set of unique values for the context attribute C in the first table of records, and A2 is the set of unique values for C in the second table of records.
Referring to the author list attribute in Tables 2 and 3, the two sets of attribute values have a similarity value of 4/6 = 0.67.
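The column-wise measure can be written directly from the formula; the author sets in the usage example are illustrative rather than the full lists from Tables 2 and 3:

```python
def col_sim(a1, a2):
    """ColSim_C(A1, A2) = |A1 ∩ A2| / min(|A1|, |A2|), where A1 and A2
    are the sets of unique values of context attribute C in each table."""
    a1, a2 = set(a1), set(a2)
    if not a1 or not a2:
        return 0.0  # no context values to compare
    return len(a1 & a2) / min(len(a1), len(a2))

# Illustrative author sets (abridged, not the full lists from the tables):
authors1 = {"Tetsuya Takine", "Yutaka Takahashi", "Shojiro Muro"}
authors2 = {"Tetsuya Takine", "Yutaka Takahashi", "Fumio Ishizaki", "Shoji Kasahara"}
print(col_sim(authors1, authors2))  # 2 shared / min(3, 4)
```

Normalizing by the smaller set keeps the score meaningful when one author has far more publications than the other.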
Given a set of context attributes C1, C2, …, Cn, we find the overall context similarity of the record sets R1 and R2 using

where wi is the weight of context attribute Ci, and w1 + … + wn = 1. If most context attributes exhibit similarity above some predetermined threshold, we can infer that the records likely refer to the same person in the real world. Therefore, a spurious link exists.
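This weighted combination can be sketched as below; the record sets, field names, and the helper for extracting attribute values are assumptions made for the sketch:

```python
def col_sim(a1, a2):
    """|A1 ∩ A2| / min(|A1|, |A2|) over two sets of attribute values."""
    a1, a2 = set(a1), set(a2)
    return len(a1 & a2) / min(len(a1), len(a2)) if a1 and a2 else 0.0

def context_sim(records1, records2, context_attrs, weights=None):
    """ContextSim(R1, R2) = sum_i w_i * ColSim_Ci, with equal weights
    when none are given (the weights must sum to 1)."""
    if weights is None:
        weights = [1.0 / len(context_attrs)] * len(context_attrs)
    return sum(
        w * col_sim({r[c] for r in records1}, {r[c] for r in records2})
        for w, c in zip(weights, context_attrs)
    )

# Illustrative record sets; field names are assumptions for this sketch.
set_a = [{"coauthor": "T. Takine", "proceedings": "Performance"},
         {"coauthor": "Y. Takahashi", "proceedings": "SIGMETRICS"}]
set_b = [{"coauthor": "T. Takine", "proceedings": "Performance"}]
score = context_sim(set_a, set_b, ["coauthor", "proceedings"])
print(score >= 0.5)  # above the DBLP threshold: a spurious link is likely
```

Comparing the score against a threshold (0.5 for DBLP in the experiments below) yields the spurious-link decision.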
Optimal threshold values

In our experiments (which we describe in the next section), we determined the optimal threshold at which our approach achieves maximum accuracy in identifying spurious links. We defined the accuracy metric as the ratio of correctly identified duplicate pairs to the total number of potential duplicate pairs. Figure 3 shows how the threshold value affects data set accuracy when we assign equal weights to the context attributes.
Testing the approach

We implemented our algorithms in Java and ran the experiments on a Pentium 4 1.6-GHz system with 256 Mbytes of RAM running Windows XP Professional. We used IntelliClean4 on the three data sets to find record sets with high similarity in the spurious attribute, and we used DM2-CBA7 to generate association rules to determine the context attributes.
DBLP

The DBLP database consists of over 200,000 records, each stored in an XML file. We used a subset of the data to create a relational database of 12,258 records. Each record in the relational database contains information about the publication's first author, coauthors, title, and conference or journal (called
ContextSim(R1, R2) = Σi=1..n wi × ColSimCi(R1, R2)
Figure 2. A concept hierarchy for research articles' conference/journal field. (Subject branches into Math, Physics, and Computer science; Computer science branches into areas such as AI, Machine learning, Multimedia, and Theoretical computer science, whose leaves include venues such as IAAI, IEEE Multimedia, Visual Interfaces, and Information Sciences.)
Figure 3. The threshold value's effect on accuracy for the DBLP, Movie, and Hep-ph data sets (thresholds from 0 to 1.0).
"proceedings" in the database). IntelliClean output 305 pairs of records with high similarity in their First Author attribute. Table 4 shows a sample of the records found and the degree of string similarities.

Our algorithm for finding context attributes in the data set outputs the coauthor and proceedings fields. First, using the concept hierarchy in Figure 2, we generalize the proceedings field to the research areas the publications belong to. We generate association rules for the database having a minimum confidence of 75 percent and a minimum support of at least two records. We consider only those rules that contain the First Author attribute values in their antecedent. We generalize these rules to the corresponding attributes and obtain two associations: First Author → Coauthor and First Author → Proceedings. These rules indicate that strong associations exist between the values in the first author field and those in the coauthor and proceedings fields. So we use the Coauthor and Proceedings attributes as context attributes.

For each pair of duplicate records, we use the first authors' names to retrieve records with the same name from the database. If we've identified "name1" and "name2" as possible duplicates, we can then use these values to retrieve two record sets from the database. Set A consists of records whose first author is "name1," and Set B consists of records whose first author is "name2." We then apply a column-wise similarity method to find the similarity between the two sets.

Assigning equal weights to the context attributes and using the optimal threshold value of 0.5, we achieved an accurate detection rate of 89 percent for 50 pairs of duplicates. Table 5 shows a sample of the results for the DBLP data set. A high context similarity value indicates that spurious links exist; a low context similarity value indicates no spurious links.
Movie

We performed similar experiments on a Movie data set, which consists of 11,453 records. Each record in the Movie database (www-db.ics.uci.edu/pages/flamingo/Dataset.htm), created by the University of California, Irvine database group, contains information such as movie ID (a unique numerical identifier), title, year of release, director, producers, studios, category, and awards.

We first extracted a set of 200 potential duplicate pairs from the director field. From the association rules generated, we found Producers, Studios, Category, and Location to be context attributes for director. Assigning equal weights to the context attributes and using the optimal threshold value of 0.6 (see Figure 3), we achieved a maximum accuracy of 96 percent in identifying spurious links. Table 6 shows a sample of the results.
Hep-ph

This experiment sought out spurious links in the KDD Cup 2003 Hep-ph data set (http://arxiv.org/archive/hep-ph), an archive of high-energy physics and particle phenomenology publications. We extracted information from all papers printed over six years and created a data set containing 28,204 records. Each record contains title, author, proceedings, year, and page number information.

We identified 585 record pairs containing potential duplicates of the first author. Again, the context attributes include the Coauthor and Proceedings attributes. Assigning equal weights to the context attributes, we obtained a maximum accuracy of 82 percent with a threshold of 0.5.

This data set's lower accuracy stems from the existence of many potential spurious links that had insufficient context information. For example, we found that many publications didn't include coauthor information. Table 7 shows a sample of the results.
Table 7. Sample results for the Hep-ph (high-energy physics and particle phenomenology publications) data set.
Name 1 Name 2 Context similarity value Actual match?
A. Hoecker A. Hocker 1.000 Yes
M.E. Carrington M. Carrington 0.700 Yes
M. Goeckeler M. Gockeler 0.666 Yes
J. Hashiba J. Hashida 0.400 No
R. Holman R. Hofmann 0.200 No
E. Gabrielli A. Gabrieli 0 No
Table 6. Sample results for the Movie data set.
Name 1 Name 2 Context similarity value Actual match?
DeMille DeMile 0.750 Yes
Hitchcock Hitchcok 0.636 Yes
Conway Convay 0.602 Yes
Mulligan Milligan 0.380 No
Francis Francisci 0.142 No
Table 5. Sample results for the DBLP data set.
Name 1 Name 2 Context similarity value Actual match?
Steven Minton Steve Minton 0.775 Yes
Sangjin Lee Sang-Jin Lee 0.710 Yes
Yin-Feng Xu Yinfeng Xu 0.590 Yes
David Thaler David Hartley 0.250 No
Changjie Tang Chang-Jie Tang 0 No
Honghua Yang Zhonghua Yang 0 No
Table 4. Sample duplicates from the DBLP data set.
Name 1 Name 2 Similarity
Toshiharu Hasegawa Tshiharu Hasegawa 0.972
Patricia A. Jacobson Patricia A. Jacobs 0.979
Kenny Wong Ken Wong 0.950
Sensitivity experiments

The next set of experiments explored how context attributes and choice of similarity methods affected accuracy in detecting spurious links in the three data sets. Figure 4 shows that for the same threshold values, using context attributes yields higher accuracy than using all attributes.

For example, in the Movie data set, the 96 percent accuracy for context attributes was 8 percent higher than the accuracy for all attributes. This is because using irrelevant attributes tends to decrease the records' overall similarity, thereby leading to the wrong conclusions. So, using only context information is more efficient and accurate in identifying the spurious links.
Next, we evaluated the performance of the column-wise and cosine similarity methods. The cosine similarity measure defines the similarity between two vectors as the cosine of the angle between them. As the cosine value approaches 1, the two vectors become coincident, implying that they refer to the same concept. To determine the similarity between two record sets, we first calculated each attribute's cosine similarity measure and took the average to obtain the overall similarity. We computed the cosine similarity between the sets of records for context
Figure 4. How context attributes affect accuracy in the (a) DBLP and Movie data sets and (b) Hep-ph data set (context-only versus all-attributes matching at thresholds from 0.5 to 1.0).
The problem of dirty data emerged during 1980s census work by the US Internal Revenue Service.1 Since then, a steady stream of data-cleaning research has focused on preprocessing of dirty data,2 noise handling for classification,3 and duplicate detection and elimination in databases.4,5 One classification of data-cleaning problems distinguishes between those arising from single and multiple data sources.5

Two systems in particular provide a systematic, comprehensive solution to data cleaning. AJAX proposes a declarative framework that extends SQL to allow the specification of data transformation, duplicate elimination, and matching of multiple tables.6 Potter's Wheel is an interactive framework that provides a graphical user interface for specifying data transformation.7 Both AJAX and Potter's Wheel are domain-independent approaches.

IntelliClean, a knowledge-based system, follows three main stages.2 The first preprocesses data to remove abbreviations and standardize the data types and formats. The second stage uses the knowledge base rules to identify and remove approximate duplicate records. The third (or postprocessing) stage involves human intervention to verify and validate the list of duplicates produced.

Clearly, current research and systems have concentrated on eliminating duplicates. Spurious links, however, might persist in data even after the elimination of duplicates.
References
1. Record Linkage Techniques: Proc. Workshop Exact Matching Methodologies, B. Kilss and W. Alvery, eds., Statistics of Income Division, US Internal Revenue Service, 1985.

2. W.L. Low, M.L. Lee, and T.W. Ling, "A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning," Information Systems, vol. 26, no. 8, Dec. 2001, pp. 585–606.

3. X. Zhu, X. Wu, and Q. Chen, "Eliminating Class Noise in Large Datasets," Proc. 20th Int'l Conf. Machine Learning (ICML 03), AAAI Press, 2003, pp. 920–927.

4. M.A. Hernandez and S.J. Stolfo, "The Merge/Purge Problem for Large Databases," Proc. 1995 ACM SIGMOD Conf. Management of Data (SIGMOD 95), ACM Press, 1995, pp. 127–138.

5. E. Rahm and H.H. Do, "Data Cleaning: Problems and Current Approaches," IEEE Data Eng. Bull., vol. 23, no. 4, Dec. 2000, pp. 3–13.

6. H. Galhardas et al., "AJAX: An Extensible Data Cleaning Tool," Proc. 2000 ACM SIGMOD Conf. Management of Data (SIGMOD 00), ACM Press, 2000, p. 590.

7. V. Raman and J.M. Hellerstein, "Potter's Wheel: An Interactive Data Cleaning System," Proc. 27th Int'l Conf. Very Large Databases (VLDB 01), Morgan Kaufmann, 2001, pp. 381–390.
Related Work
attribute c1 using the formula
Figure 5 shows the accuracies for different threshold values for the three data sets. We consider only the context attributes here. The column-wise similarity method clearly performed better than the cosine similarity method.

Our method could also help solve a variant of the spurious link problem, where data records for different real-world entities are grouped together as belonging to one real-world entity. For example, in a bibliography database, publications retrieved for an author might not belong to a single person if two authors have the same name. Context information such as coauthors and research area could help us solve this problem. Also, although we have focused on bibliographic data, the method could easily be extended to solve spurious link problems in different data types, such as biomedical and genomic data.
References
1. S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets," Proc. 2000 ACM SIGMOD Conf. Management of Data (SIGMOD 00), ACM Press, 2000, pp. 427–438.

2. X. Zhu, X. Wu, and Q. Chen, "Eliminating Class Noise in Large Datasets," Proc. 20th Int'l Conf. Machine Learning (ICML 03), AAAI Press, 2003, pp. 920–927.

3. M.A. Hernandez and S.J. Stolfo, "The Merge/Purge Problem for Large Databases," Proc. 1995 ACM SIGMOD Conf. Management of Data (SIGMOD 95), ACM Press, 1995, pp. 127–138.

4. W.L. Low, M.L. Lee, and T.W. Ling, "A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning," Information Systems, vol. 26, no. 8, Dec. 2001, pp. 585–606.

5. A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining (DMKD 97), 1997; www.informatik.uni-trier.de/~ley/db/conf/dmkd/dmkd97.html#MongeE97.

6. H.J. Hamilton, R.J. Hilderman, and N. Cercone, "Attribute-Oriented Induction Using Domain Generalization Graphs," Proc. 8th Int'l Conf. Tools with Artificial Intelligence (ICTAI 96), IEEE CS Press, 1996, pp. 246–253.

7. B. Liu, W. Hsu, and Y. Ma, "Integrating Classification and Association Rule Mining," Proc. 4th Int'l Conf. Knowledge Discovery and Data Mining (KDD 98), ACM Press, 1998, pp. 80–86.
Simc1(d1, d2) = (d1 • d2) / (|d1| |d2|)
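As an illustration, the cosine measure can be sketched as below. The value-to-vector encoding (one dimension per distinct attribute value, with occurrence counts as coordinates) is an assumption, since the article doesn't specify how the vectors d1 and d2 are built:

```python
from collections import Counter
from math import sqrt

def cosine_sim(values1, values2):
    """Cosine similarity between two bags of attribute values. Each
    distinct value becomes one vector dimension with its occurrence
    count as the coordinate; this vectorization is an assumption made
    for this sketch."""
    v1, v2 = Counter(values1), Counter(values2)
    dot = sum(v1[d] * v2[d] for d in set(v1) | set(v2))
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine_sim(["a", "b", "b"], ["a", "b"]))
```

Averaging this score over the context attributes, as the experiment describes, gives the overall cosine-based similarity that Figure 5 compares against the column-wise method.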
Figure 5. How similarity methods affect accuracy in the (a) DBLP and Movie databases and (b) Hep-ph database (column-wise versus cosine similarity at thresholds from 0.5 to 1.0).
The Authors

Mong Li Lee is an assistant professor at the National University of Singapore's School of Computing. Her research interests include data cleaning, data integration of heterogeneous and semistructured data, and database performance issues in dynamic environments. She received her PhD in computer science from the National University of Singapore. Contact her at the School of Computing, Nat'l Univ. of Singapore, 3 Science Dr. 2, Singapore 117543; [email protected].

Wynne Hsu is an associate professor of computer science at the National University of Singapore's School of Computing. Her research interests include knowledge discovery in databases with an emphasis on data mining algorithms in relational databases, XML databases, image databases, and spatiotemporal databases. She received her PhD in electrical engineering from Purdue University. She is a member of the ACM. Contact her at the School of Computing, Nat'l Univ. of Singapore, 3 Science Dr. 2, Singapore 117543; [email protected].

Vijay Kothari is working in India. His research interests include data cleaning and data mining. He received his MSc in computer science from the National University of Singapore. Contact him at No. 22, Murugappa St., Purasawalkam, Chennai, Tamil Nadu, India 600007; [email protected].