Cleaning the Spurious Links in Data
amounts of data from heterogeneous sources, further increasing the likelihood of introducing errors. When mining these data warehouses for decision-making information, making logical and well-informed decisions becomes problematic if the data's quality is doubtful.
Data cleaning refers to processes for detecting and removing errors and inconsistencies from data. Research in this area has included outlier detection,1 noise handling for classification,2 and duplicate elimination.3,4 Given the "garbage in, garbage out" principle, clean data is crucial for database integration, data warehousing, and data mining, thus leading to data cleaning's strong association with these research areas.
A newly discovered class of erroneous data is spurious links, where a real-world entity has multiple links that might not be properly associated with it. Standard duplicate elimination techniques such as the sorted-neighborhood method3 or the priority-queue algorithm5 can't find this anomaly. The existence of such spurious links often leads to confusion and misrepresentation in the data records representing the entity. Consider the Digital Bibliography & Library Project database (http://dblp.uni-trier.de), a large collection of computer science bibliographical records. Each DBLP record captures a publication's list of authors, title, conference or journal name, and page numbers. Although the data set is well known for its high-quality bibliographic information, collecting and maintaining the data from diverse sources requires enormous effort. Errors, including spurious links, are inevitable.
To solve this problem, we use context information to identify spurious links. First, we identify data records that contain potential spurious links. We then determine the set of attributes that constitute each record's context. For example, the coauthors and conference/journal fields constitute a publication record's context and offer good hints for identifying potential spurious links. Experiments with three real-world databases have demonstrated that our approach can accurately identify spurious links.
A context-based solution

Figure 1 gives an overview of our approach. After finding records that might contain spurious links, this approach uses selected fields that capture the records' context to compute their context similarity. These context fields might have access to additional domain knowledge that provides a more comprehensive picture of the records' context. We developed a method to determine context similarity between two record sets, and we use their overall degree of similarity to subsequently identify any spurious links.
Table 1 shows a pair of records from the DBLP database. At first glance, Record 2 seems to contain a typographical error in the author name "Tshiharu Hasegawa," compared with "Toshiharu Hasegawa" in Record 1. The possibility that "Tshiharu Hasegawa" and "Toshiharu Hasegawa" refer to the same real-world person increases if we have this context information:
• They have some common coauthors, such as "Tetsuya Takine" and "Yutaka Takahashi."
• Their publication titles show similar research areas. For example, both the publication titles in Records
Enhancing Information
Cleaning the Spurious Links in Data

Mong Li Lee, Wynne Hsu, and Vijay Kothari, National University of Singapore
Data quality problems can arise from abbreviations, data entry mistakes, duplicate records, missing fields, and many other sources. These problems proliferate when you integrate multiple data sources in data warehousing, federated databases, and global information systems. Data warehouses load and frequently update large
Comparing context information between data records can help solve the data quality problem of spurious links, that is, multiple links between data entries and real-world entities.
1094-7167/04/$20.00 © 2004 IEEE. IEEE INTELLIGENT SYSTEMS. Published by the IEEE Computer Society.
1 and 2 are related to networking research.
• The conference proceedings or journals in which they publish have overlapping research areas. For example, the scope of "Performance" overlaps with that covered in "Performance Evaluation."
Records with potential spurious links

First, we need to identify the records with potential spurious links. We use existing string similarity matching algorithms to identify record pairs with a high degree of similarity in some attribute. For example, Table 1 shows a pair of records with high similarity for one of the coauthors, "Toshiharu Hasegawa" and "Tshiharu Hasegawa." On the basis of these high similarity values, we retrieve the associated records.
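The article doesn't name the particular string similarity algorithm it uses. As a hypothetical stand-in, Python's standard difflib gives the same flavor of character-based matching; the threshold value here is illustrative:

```python
from difflib import SequenceMatcher

def similar_pairs(names, threshold=0.9):
    """Flag name pairs whose string similarity exceeds the threshold.

    SequenceMatcher.ratio() returns 2*M/T, where M is the number of
    matching characters and T is the total length of both strings.
    """
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            score = SequenceMatcher(None, a, b).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 3)))
    return pairs

names = ["Toshiharu Hasegawa", "Tshiharu Hasegawa", "Tetsuya Takine"]
print(similar_pairs(names))
```

On this sample, only the Hasegawa pair exceeds the threshold, mirroring the high similarity value the article reports for these two names.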
Tables 2 and 3 show extracts from the sets of records retrieved from the DBLP Computer Science Bibliography using "Toshiharu Hasegawa" and "Tshiharu Hasegawa," respectively.
Context attributes

Next, we identify the attributes that might provide clues about the existence of spurious links. A spurious attribute contains erroneous values that might result in spurious record links. A context attribute contains values strongly correlated with those in the spurious attribute.

To identify the context attributes, we apply association-rules mining to the database to discover all associations among the attribute values. That is, we try to determine if a value in one attribute frequently occurs with some value in another attribute. We're particularly interested in rules whose antecedent contains the spurious attribute. If we obtain many rules that associate values between the spurious attribute and some other set of attributes, we consider the latter to be candidate context attributes. For example, for the DBLP data set, we found many rules with high confidence and minimum support (a user-defined threshold to retain only the rules satisfied by a reasonable number of data tuples) involving authors and their coauthors. An attribute can be a context attribute even if we don't obtain high-confidence association rules. For instance, the values corresponding to the Proceedings attribute don't
MARCH/APRIL 2004  www.computer.org/intelligent
Figure 1. Context-based cleaning of spurious links. (The pipeline: from the data set, retrieve records with possible spurious links; select fields with context information, guided by a hierarchical domain structure; compute context similarity; and identify spurious links.)
Table 3. Extracted records using "Tshiharu Hasegawa."

Record | Author list | Title | Proceedings
1 | Tetsuya Takine, Hideaki Takagi, Yutaka Takahashi, Tshiharu Hasegawa | Analysis of Asymmetric Single-Buffer Polling and Priority of Systems without Switchover Times | SIGMETRICS
2 | Yoshiyuki Shiozawa, Tetsuya Takine, Yutaka Takahashi, Tshiharu Hasegawa | Analysis of a Polling System with Correlated Input | Computer Networks & ISDN Systems
3 | Fumio Ishizaki, Tetsuya Takine, Tshiharu Hasegawa | Analysis of a Discrete-Time Queue with Gated Priority | Performance Evaluation
4 | Shoji Kasahara, Yutaka Takahashi, Tshiharu Hasegawa | Analysis of Waiting Time of M/G/1/K System | Performance Evaluation
Table 2. Extracted records using "Toshiharu Hasegawa."

Record | Author list | Title | Proceedings
1 | Shojiro Muro, Toshihide Ibaraki, Hidehiro Miyajima, Toshiharu Hasegawa | File Redundancy Issues in Distributed Database Systems | VLDB
2 | Kazuhiro Ohtsuki, Yutaka Takahashi, Toshiharu Hasegawa | Analysis for Traverse Time in an Integrated Communication Network | ICC
3 | Shojiro Muro, Toshihide Ibaraki, Hidehiro Miyajima, Toshiharu Hasegawa | Evaluation of File Redundancy in Distributed Database Systems | TSE
4 | Tetsuya Takine, Yutaka Takahashi, Toshiharu Hasegawa | Analysis of Asymmetric Polling System with Single Buffers | Performance
Table 1. A possible pair of duplicates in the Digital Bibliography & Library Project data set.

Record | Author list | Title | Proceedings
1 | Tetsuya Takine, Yutaka Takahashi, Toshiharu Hasegawa | Analysis of an Asymmetric Polling System with Single Buffers | Performance
2 | A. Sugahara, Tetsuya Takine, Yutaka Takahashi, Tshiharu Hasegawa | Analysis of a Nonpreemptive Priority Queue with SPP Arrivals of High Class | Performance Evaluation
appear in any high-confidence association rule satisfying the minimum support. However, if we generalize the Proceedings attribute values to the research area, we discover high-confidence rules involving Author and Research Area, and we can use the latter as a context attribute. We use concept hierarchies to generalize attribute values6 so that we don't miss such context attributes.

Figure 2 shows a concept hierarchy constructed for research articles' conference/journal field.
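A minimal sketch of concept-hierarchy generalization: each value is mapped to its parent concept. The parent links below are assumptions drawn from the shape of Figure 2 for illustration, not the paper's full tree:

```python
# Illustrative slice of Figure 2's hierarchy; the parent links are
# assumptions made for this sketch, not the paper's complete hierarchy.
PARENT = {
    "IAAI": "AI",
    "IEEE Multimedia": "Multimedia",
    "AI": "Computer science",
    "Multimedia": "Computer science",
    "Computer science": "Subject",
}

def generalize(value, levels=1):
    """Climb the concept hierarchy the given number of levels; values
    without a recorded parent are returned unchanged."""
    for _ in range(levels):
        value = PARENT.get(value, value)
    return value

print(generalize("IAAI"))      # one level up: its research area
print(generalize("IAAI", 2))   # two levels up: its subject
```

Generalizing venue names to research areas in this way is what lets rules such as Author → Research Area surface even when venue-level rules lack support.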
On the basis of this rationale, we apply the following algorithm to determine the potential context attributes:

Algorithm FindContextAttr
1. If there exist attributes with a concept hierarchy, then generalize the attribute values.
2. Generate association rules for the database given user-specified minimum support and minimum confidence.
3. Identify the rules that contain spurious attributes in their antecedents.
4. Attributes in the consequent of these rules constitute the context attributes.
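The steps above can be sketched as follows. This is a simplified, hypothetical implementation: it mines only pairwise value associations rather than running a full association-rule miner such as the DM2-CBA system the authors used, and the record layout and field names are illustrative:

```python
from collections import Counter

def find_context_attrs(records, spurious_attr, min_support=2, min_conf=0.75):
    """Sketch of FindContextAttr. For each other attribute, mine the
    pairwise rules  spurious_value -> attr_value  and keep the attribute
    as a context attribute if any rule meets the minimum support and
    minimum confidence. (A full miner would consider larger itemsets.)"""
    context = set()
    for attr in records[0]:
        if attr == spurious_attr:
            continue
        pair_count = Counter()
        antecedent_count = Counter()
        for rec in records:
            antecedent_count[rec[spurious_attr]] += 1
            pair_count[(rec[spurious_attr], rec[attr])] += 1
        for (sv, _), n in pair_count.items():
            if n >= min_support and n / antecedent_count[sv] >= min_conf:
                context.add(attr)
                break
    return context

# Illustrative records; names and fields are assumptions for this sketch.
records = [
    {"first_author": "T. Hasegawa", "coauthor": "T. Takine", "area": "Performance"},
    {"first_author": "T. Hasegawa", "coauthor": "T. Takine", "area": "Performance"},
    {"first_author": "S. Muro", "coauthor": "T. Ibaraki", "area": "Databases"},
    {"first_author": "S. Muro", "coauthor": "H. Miyajima", "area": "Databases"},
]
print(find_context_attrs(records, "first_author"))
```

With these sample records, both the coauthor and area fields pass the support and confidence thresholds, matching the paper's finding that coauthor and (generalized) proceedings serve as context attributes.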
Record similarity

We now determine the retrieved records' similarity to find spurious links. Given two sets of records and a list of context attributes, we must decide if the two sets refer to the same real-world entity. We do this by determining how similar their context attributes are.

Our study indicates that for identifying spurious links, a simple yet effective column-wise similarity measure achieves the best accuracy. Essentially, the column-wise similarity measure for a context attribute C would take the union of the values of attribute C in each set of records and find the degree of overlap according to the formula

ColSimC(A1, A2) = |A1 ∩ A2| / min(|A1|, |A2|),

where A1 is the set of unique values for the context attribute C in the first table of records, and A2 is the set of unique values for C in the second table of records.
Referring to the author list attribute in Tables 2 and 3, the two sets of attribute values have a similarity value of 4/6 = 0.67.
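The column-wise measure can be written directly from the formula; the author sets in the usage example are illustrative rather than the full lists from Tables 2 and 3:

```python
def col_sim(a1, a2):
    """ColSim_C(A1, A2) = |A1 ∩ A2| / min(|A1|, |A2|), where A1 and A2
    are the sets of unique values of context attribute C in each table."""
    a1, a2 = set(a1), set(a2)
    if not a1 or not a2:
        return 0.0  # no context values to compare
    return len(a1 & a2) / min(len(a1), len(a2))

# Illustrative author sets (abridged, not the full lists from the tables):
authors1 = {"Tetsuya Takine", "Yutaka Takahashi", "Shojiro Muro"}
authors2 = {"Tetsuya Takine", "Yutaka Takahashi", "Fumio Ishizaki", "Shoji Kasahara"}
print(col_sim(authors1, authors2))  # 2 shared / min(3, 4)
```

Normalizing by the smaller set keeps the score meaningful when one author has far more publications than the other.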
Given a set of context attributes C1, C2, …, Cn, we find the overall context similarity of the record sets R1 and R2 using

where wi is the weight of context attribute Ci, and w1 + … + wn = 1. If most context attributes exhibit similarity above some predetermined threshold, we can infer that the records likely refer to the same person in the real world. Therefore, a spurious link exists.
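This weighted combination can be sketched as below; the record sets, field names, and the helper for extracting attribute values are assumptions made for the sketch:

```python
def col_sim(a1, a2):
    """|A1 ∩ A2| / min(|A1|, |A2|) over two sets of attribute values."""
    a1, a2 = set(a1), set(a2)
    return len(a1 & a2) / min(len(a1), len(a2)) if a1 and a2 else 0.0

def context_sim(records1, records2, context_attrs, weights=None):
    """ContextSim(R1, R2) = sum_i w_i * ColSim_Ci, with equal weights
    when none are given (the weights must sum to 1)."""
    if weights is None:
        weights = [1.0 / len(context_attrs)] * len(context_attrs)
    return sum(
        w * col_sim({r[c] for r in records1}, {r[c] for r in records2})
        for w, c in zip(weights, context_attrs)
    )

# Illustrative record sets; field names are assumptions for this sketch.
set_a = [{"coauthor": "T. Takine", "proceedings": "Performance"},
         {"coauthor": "Y. Takahashi", "proceedings": "SIGMETRICS"}]
set_b = [{"coauthor": "T. Takine", "proceedings": "Performance"}]
score = context_sim(set_a, set_b, ["coauthor", "proceedings"])
print(score >= 0.5)  # above the DBLP threshold: a spurious link is likely
```

Comparing the score against a threshold (0.5 for DBLP in the experiments below) yields the spurious-link decision.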
Optimal threshold values

In our experiments (which we describe in the next section), we determined the optimal threshold at which our approach achieves maximum accuracy in identifying spurious links. We defined the accuracy metric as the ratio of correctly identified duplicate pairs to the total number of potential duplicate pairs. Figure 3 shows how the threshold value affects data set accuracy when we assign equal weights to the context attributes.
Testing the approach

We implemented our algorithms in Java and ran the experiments on a Pentium 4 1.6-GHz system with 256 Mbytes of RAM running Windows XP Professional. We used IntelliClean4 on the three data sets to find record sets with high similarity in the spurious attribute, and we used DM2-CBA7 to generate association rules to determine the context attributes.
DBLP

The DBLP database consists of over 200,000 records, each stored in an XML file. We used a subset of the data to create a relational database of 12,258 records. Each record in the relational database contains information about the publication's first author, coauthors, title, and conference or journal (called
ContextSim(R1, R2) = Σi=1..n wi × ColSimCi(R1, R2)
Figure 2. A concept hierarchy for research articles' conference/journal field. (Subject branches into Math, Physics, and Computer science; Computer science branches into areas such as AI, Machine learning, Multimedia, and Theoretical computer science, whose leaves include venues such as IAAI, IEEE Multimedia, Visual Interfaces, and Information Sciences.)
Figure 3. The threshold value's effect on accuracy for the DBLP, Movie, and Hep-ph data sets (thresholds from 0 to 1.0).
"proceedings" in the database). IntelliClean output 305 pairs of records with high similarity in their First Author attribute. Table 4 shows a sample of the records found and the degree of string similarities.

Our algorithm for finding context attributes in the data set outputs the coauthor and proceedings fields. First, using the concept hierarchy in Figure 2, we generalize the proceedings field to the research areas the publications belong to. We generate association rules for the database having a minimum confidence of 75 percent and a minimum support of at least two records. We consider only those rules that contain the First Author attribute values in their antecedent. We generalize these rules to the corresponding attributes and obtain two associations: First Author → Coauthor and First Author → Proceedings. These rules indicate that strong associations exist between the values in the first author field and those in the coauthor and proceedings fields. So we use the Coauthor and Proceedings attributes as context attributes.

For each pair of duplicate records, we use the first authors' names to retrieve records with the same name from the database. If we've identified "name1" and "name2" as possible duplicates, we can then use these values to retrieve two record sets from the database. Set A consists of records whose first author is "name1," and Set B consists of records whose first author is "name2." We then apply a column-wise similarity method to find the similarity between the two sets.

Assigning equal weights to the context attributes and using the optimal threshold value of 0.5, we achieved an accurate detection rate of 89 percent for 50 pairs of duplicates. Table 5 shows a sample of the results for the DBLP data set. A high context similarity value indicates that spurious links exist; a low context similarity value indicates no spurious links.
Movie

We performed similar experiments on a Movie data set, which consists of 11,453 records. Each record in the Movie database (www-db.ics.uci.edu/pages/flamingo/Dataset.htm), created by the University of California, Irvine database group, contains information such as movie ID (a unique numerical identifier), title, year of release, director, producers, studios, category, and awards.

We first extracted a set of 200 potential duplicate pairs from the director field. From the association rules generated, we found Producers, Studios, Category, and Location to be context attributes for director. Assigning equal weights to the context attributes and using the optimal threshold value of 0.6 (see Figure 3), we achieved a maximum accuracy of 96 percent in identifying spurious links. Table 6 shows a sample of the results.
Hep-ph

This experiment sought out spurious links in the KDD Cup 2003 Hep-ph data set (http://arxiv.org/archive/hep-ph), an archive of high-energy physics and particle phenomenology publications. We extracted information from all papers printed over six years and created a data set containing 28,204 records. Each record contains title, author, proceedings, year, and page number information.

We identified 585 record pairs containing potential duplicates of the first author. Again, the context attributes include the Coauthor and Proceedings attributes. Assigning equal weights to the context attributes, we obtained a maximum accuracy of 82 percent with a threshold of 0.5.

This data set's lower accuracy stems from the existence of many potential spurious links that had insufficient context information. For example, we found that many publications didn't include coauthor information. Table 7 shows a sample of the results.
Table 7. Sample results for the Hep-ph (high-energy physics and particle phenomenology publications) data set.
Name 1 Name 2 Context similarity value Actual match?
A. Hoecker A. Hocker 1.000 Yes
M.E. Carrington M. Carrington 0.700 Yes
M. Goeckeler M. Gockeler 0.666 Yes
J. Hashiba J. Hashida 0.400 No
R. Holman R. Hofmann 0.200 No
E. Gabrielli A. Gabrieli 0 No
Table 6. Sample results for the Movie data set.
Name 1 Name 2 Context similarity value Actual match?
DeMille DeMile 0.750 Yes
Hitchcock Hitchcok 0.636 Yes
Conway Convay 0.602 Yes
Mulligan Milligan 0.380 No
Francis Francisci 0.142 No
Table 5. Sample results for the DBLP data set.
Name 1 Name 2 Context similarity value Actual match?
Steven Minton Steve Minton 0.775 Yes
Sangjin Lee Sang-Jin Lee 0.710 Yes
Yin-Feng Xu Yinfeng Xu 0.590 Yes
David Thaler David Hartley 0.250 No
Changjie Tang Chang-Jie Tang 0 No
Honghua Yang Zhonghua Yang 0 No
Table 4. Sample duplicates from the DBLP data set.
Name 1 Name 2 Similarity
Toshiharu Hasegawa Tshiharu Hasegawa 0.972
Patricia A. Jacobson Patricia A. Jacobs 0.979
Kenny Wong Ken Wong 0.950
Sensitivity experiments

The next set of experiments explored how context attributes and choice of similarity methods affected accuracy in detecting spurious links in the three data sets. Figure 4 shows that for the same threshold values, using context attributes yields higher accuracy than using all attributes.

For example, in the Movie data set, the 96 percent accuracy for context attributes was 8 percent higher than the accuracy for all attributes. This is because using irrelevant attributes tends to decrease the records' overall similarity, thereby leading to the wrong conclusions. So, using only context information is more efficient and accurate in identifying the spurious links.
Next, we evaluated the performance of the column-wise and cosine similarity methods. The cosine similarity measure defines the similarity between two vectors as the cosine of the angle between them. As the cosine value approaches 1, the two vectors become coincident, implying that they refer to the same concept. To determine the similarity between two record sets, we first calculated each attribute's cosine similarity measure and took the average to obtain the overall similarity. We computed the cosine similarity between the sets of records for context
Figure 4. How context attributes affect accuracy in the (a) DBLP and Movie data sets and (b) Hep-ph data set (context-only versus all-attributes matching at thresholds from 0.5 to 1.0).
The problem of dirty data emerged during 1980s census work by the US Internal Revenue Service.1 Since then, a steady stream of data-cleaning research has focused on preprocessing of dirty data,2 noise handling for classification,3 and duplicate detection and elimination in databases.4,5 One classification of data-cleaning problems distinguishes between those arising from single and multiple data sources.5

Two systems in particular provide a systematic, comprehensive solution to data cleaning. AJAX proposes a declarative framework that extends SQL to allow the specification of data transformation, duplicate elimination, and matching of multiple tables.6 Potter's Wheel is an interactive framework that provides a graphical user interface for specifying data transformation.7 Both AJAX and Potter's Wheel are domain-independent approaches.

IntelliClean, a knowledge-based system, follows three main stages.2 The first preprocesses data to remove abbreviations and standardize the data types and formats. The second stage uses the knowledge base rules to identify and remove approximate duplicate records. The third (or postprocessing) stage involves human intervention to verify and validate the list of duplicates produced.

Clearly, current research and systems have concentrated on eliminating duplicates. Spurious links, however, might persist in data even after the elimination of duplicates.
References
1. Record Linkage Techniques: Proc. Workshop Exact Matching Methodologies, B. Kilss and W. Alvery, eds., Statistics of Income Division, US Internal Revenue Service, 1985.

2. W.L. Low, M.L. Lee, and T.W. Ling, "A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning," Information Systems, vol. 26, no. 8, Dec. 2001, pp. 585–606.

3. X. Zhu, X. Wu, and Q. Chen, "Eliminating Class Noise in Large Datasets," Proc. 20th Int'l Conf. Machine Learning (ICML 03), AAAI Press, 2003, pp. 920–927.

4. M.A. Hernandez and S.J. Stolfo, "The Merge/Purge Problem for Large Databases," Proc. 1995 ACM SIGMOD Conf. Management of Data (SIGMOD 95), ACM Press, 1995, pp. 127–138.

5. E. Rahm and H.H. Do, "Data Cleaning: Problems and Current Approaches," IEEE Data Eng. Bull., vol. 23, no. 4, Dec. 2000, pp. 3–13.

6. H. Galhardas et al., "AJAX: An Extensible Data Cleaning Tool," Proc. 2000 ACM SIGMOD Conf. Management of Data (SIGMOD 00), ACM Press, 2000, p. 590.

7. V. Raman and J.M. Hellerstein, "Potter's Wheel: An Interactive Data Cleaning System," Proc. 27th Int'l Conf. Very Large Databases (VLDB 01), Morgan Kaufmann, 2001, pp. 381–390.
Related Work
attribute c1 using the formula
Figure 5 shows the accuracies for different threshold values for the three data sets. We consider only the context attributes here. The column-wise similarity method clearly performed better than the cosine similarity method.

Our method could also help solve a variant of the spurious link problem, where data records for different real-world entities are grouped together as belonging to one real-world entity. For example, in a bibliography database, publications retrieved for an author might not belong to a single person if two authors have the same name. Context information such as coauthors and research area could help us solve this problem. Also, although we have focused on bibliographic data, the method could easily be extended to solve spurious link problems in different data types, such as biomedical and genomic data.
References
1. S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient Algorithms for Mining Outliers from Large Data Sets," Proc. 2000 ACM SIGMOD Conf. Management of Data (SIGMOD 00), ACM Press, 2000, pp. 427–438.

2. X. Zhu, X. Wu, and Q. Chen, "Eliminating Class Noise in Large Datasets," Proc. 20th Int'l Conf. Machine Learning (ICML 03), AAAI Press, 2003, pp. 920–927.

3. M.A. Hernandez and S.J. Stolfo, "The Merge/Purge Problem for Large Databases," Proc. 1995 ACM SIGMOD Conf. Management of Data (SIGMOD 95), ACM Press, 1995, pp. 127–138.

4. W.L. Low, M.L. Lee, and T.W. Ling, "A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning," Information Systems, vol. 26, no. 8, Dec. 2001, pp. 585–606.

5. A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining (DMKD 97), 1997; www.informatik.uni-trier.de/~ley/db/conf/dmkd/dmkd97.html#MongeE97.

6. H.J. Hamilton, R.J. Hilderman, and N. Cercone, "Attribute-Oriented Induction Using Domain Generalization Graphs," Proc. 8th Int'l Conf. Tools with Artificial Intelligence (ICTAI 96), IEEE CS Press, 1996, pp. 246–253.

7. B. Liu, W. Hsu, and Y. Ma, "Integrating Classification and Association Rule Mining," Proc. 4th Int'l Conf. Knowledge Discovery and Data Mining (KDD 98), ACM Press, 1998, pp. 80–86.
Simc1(d1, d2) = (d1 • d2) / (|d1| |d2|)
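As an illustration, the cosine measure can be sketched as below. The value-to-vector encoding (one dimension per distinct attribute value, with occurrence counts as coordinates) is an assumption, since the article doesn't specify how the vectors d1 and d2 are built:

```python
from collections import Counter
from math import sqrt

def cosine_sim(values1, values2):
    """Cosine similarity between two bags of attribute values. Each
    distinct value becomes one vector dimension with its occurrence
    count as the coordinate; this vectorization is an assumption made
    for this sketch."""
    v1, v2 = Counter(values1), Counter(values2)
    dot = sum(v1[d] * v2[d] for d in set(v1) | set(v2))
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine_sim(["a", "b", "b"], ["a", "b"]))
```

Averaging this score over the context attributes, as the experiment describes, gives the overall cosine-based similarity that Figure 5 compares against the column-wise method.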
Figure 5. How similarity methods affect accuracy in the (a) DBLP and Movie databases and (b) Hep-ph database (column-wise versus cosine similarity at thresholds from 0.5 to 1.0).
The Authors

Mong Li Lee is an assistant professor at the National University of Singapore's School of Computing. Her research interests include data cleaning, data integration of heterogeneous and semistructured data, and database performance issues in dynamic environments. She received her PhD in computer science from the National University of Singapore. Contact her at the School of Computing, Nat'l Univ. of Singapore, 3 Science Dr. 2, Singapore 117543; [email protected].

Wynne Hsu is an associate professor of computer science at the National University of Singapore's School of Computing. Her research interests include knowledge discovery in databases with an emphasis on data mining algorithms in relational databases, XML databases, image databases, and spatiotemporal databases. She received her PhD in electrical engineering from Purdue University. She is a member of the ACM. Contact her at the School of Computing, Nat'l Univ. of Singapore, 3 Science Dr. 2, Singapore 117543; [email protected].

Vijay Kothari is working in India. His research interests include data cleaning and data mining. He received his MSc in computer science from the National University of Singapore. Contact him at No. 22, Murugappa St., Purasawalkam, Chennai, Tamil Nadu, India 600007; [email protected].