

Learning from the Past: An Analysis of Person Name Corrections in DBLP Collection and Social Network Properties of Affected Entities

Florian Reitz, Oliver Hoffmann
Department of Databases and Information Systems, University of Trier
Universitätsring, D-54286 Trier, Germany
Email: [email protected], [email protected]

Abstract—Identifying real-world persons by their name is a significant problem, especially for digital libraries like DBLP. Though there are a large number of algorithmic approaches, finding and correcting name-related inconsistencies is time-consuming and expensive. We introduce an extension to the DBLP collection which allows us to mine for modifications to name entities in a period of ten years. We use our findings to analyze how defective entities integrated into different dynamic social networks. Based on first results which showed that name errors are unevenly distributed in these networks, we present and evaluate an approach to identify areas which are prone to name inconsistencies and require more extensive monitoring.

I. INTRODUCTION

The DBLP bibliographic project1 is a large and frequently used meta data collection of publications in computer science and related fields. During the last years the collection has been the subject of several studies, with questions ranging from analyzing the structure of our community (e.g. [3], [4]) to predicting aspects of future development (e.g. [8], [13]). It has also been used as a test case for new techniques and algorithms. All of these applications require a high data quality, i.e., the difference between the data set and the reality should be minimal.

An important dimension of data quality is the correctness of person representation in the data set. For a digital library, for example, it is desirable that an author a is represented by a single entity and that all related publications are linked to this entity. At the same time, it is required that a query for the publications of a returns no publications which have not been authored by a. The problem is to find a reliable way to identify authors. There are a number of approaches to assign unique identifiers to researchers (e.g. [6], [17]) but at the moment none of them is widely accepted. Projects like DBLP must fall back on the person name to identify an author, which is problematic. There are names which refer to more than one person in the collection (a homonym). In February 2010, DBLP listed 905 names which are known to identify several persons. Homonyms make it difficult to map a publication record to the correct person because most sources give little

1 dblp.uni-trier.de

Fig. 1. Example of a mapping between persons and names. N1 and N2 are synonyms of A1, while N2 is also a homonym of A1 and A2. Only the mapping A3 ↦ N3 is correct.

more information than the name. This problem becomes more significant with an increasing number of researchers from China because different Chinese names are transcribed to the same Latin letter sequence. For example, DBLP is aware of 19 computer scientists named Wei Wang.

The opposite problem is that several names refer to a single author (a set of synonyms). Synonyms can be caused by name changes, using different transcription systems, omitting or abbreviating name parts, or simply spelling errors in some records. For example, Martin Gonzalez-Rodriguez is listed in DBLP with four different names, and in 2005 the ACM digital library listed the publications of Jeffrey D. Ullman under ten different names [9]. Figure 1 shows an example of name matching inconsistencies, including a case of a combined homonym and synonym.

Finding name-related inconsistencies is difficult and time consuming. Since this problem is relevant for several research areas, there are a number of approaches to find names which are a homonym or part of a synonym set. However, the actual correction of an error virtually always requires verification by a handmade analysis [12]. Human resources are expensive, so it is crucial to have tools which return reliable candidates for corrections, i.e., tools with a high precision. During the last years, DBLP corrected a number of name-related errors; these corrections have been added to the collection. Apart from algorithmic solutions, these modifications have often been triggered by community input, for example, by authors who wanted their publication records fixed. In this paper, we analyze these

2010 International Conference on Advances in Social Networks Analysis and Mining

978-0-7695-4138-9/10 $26.00 © 2010 IEEE

DOI 10.1109/ASONAM.2010.35


corrections and try to learn from them. More specifically, we mine for changes to author names in the past and evaluate how these defective entities integrated into different social networks. The main contributions of our work are:

• We introduce the historic DBLP collection, which extends the well-known DBLP data set by information on modifications in recent years, and a framework to mine and categorize these changes.

• We show how defective names are integrated into different social networks.

• We present a heuristic which considers past corrections to estimate the correctness of a name.

The remainder of this paper is organized as follows: In Section II, we show the structure and content of the historic collection and discuss technical limitations. Then, we show how to mine name-related changes from the collection (Section III). Based on this information we define two dynamic social networks which describe how DBLP has evolved since the year 1999 (Section IV). In Section V we analyze network centrality measures of modified name entities at the time of correction and compare them with unchanged entities. We conclude our paper by discussing an approach to identify reliable areas in networks, which we compute by considering the distribution of defects in the past (Section VI).

In the following sections, we will use the term person to refer to a real-world person and name to refer to the DBLP entity (or entities) which represents this person.

II. DATA SOURCE

The internal structure of the DBLP collection is a simple directory tree. For each conference and journal there is one folder which contains the corresponding record information. For each record there is one small XML file in the appropriate directory. For example, the directory

/db/conf/sigmod

contains 2589 files for the papers of the SIGMOD conference. Each file has a unique name. The elements of the files are similar to those of BibTeX files. Unlike BibTeX, fields with a quantity of two or more are represented by multiple elements. For example, a paper authored by Adam and Bob contains a list similar to this:

<author>Adam</author>
<author>Bob</author>
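Reading the author list out of such a record takes only a few lines with any XML parser. A minimal Python sketch; the wrapping <inproceedings> element name is illustrative, not taken from the paper:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal record file content; the wrapping element name is
# illustrative. Real records contain further BibTeX-like elements.
RECORD = "<inproceedings><author>Adam</author><author>Bob</author></inproceedings>"

def authors(record_xml):
    """Return the contents of all author elements in document order."""
    return [a.text for a in ET.fromstring(record_xml).findall("author")]
```

Since element order is preserved by the parser, the returned list usually matches the author order of the publication.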

The order of the author elements usually conforms to the order of authors in the publication, but can sometimes differ. There is no differentiation between a person's first and last name and no unique identifier. If a name is identified as a homonym, a numeric suffix is added to the name to differentiate between the different persons. In the case of Wei Wang, one of these persons is referred to as Wei Wang while the others are named Wei Wang 0002 till Wei Wang 0019. Synonyms are handled in two different ways once they are detected. In some cases, the author's name is harmonized in all records. However, if there are many entries to be changed or if both names should be preserved, a special record is added. It does not represent a publication but matches different names of the same person with each other. In February 2010, there were 3765 alternative name records with up to four synonyms. Note that these alternative records are the only files which directly represent persons. There is no way to get a list of a person's publications without scanning all 1.3 million records.
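The fixed-width numeric suffix makes homonym names easy to take apart. A small sketch, assuming the four-digit suffix format seen in the Wei Wang example (the function name is ours):

```python
import re

# DBLP appends a four-digit suffix to homonym names, e.g. "Wei Wang 0002".
_SUFFIX = re.compile(r"^(?P<base>.+) (?P<nr>\d{4})$")

def split_homonym(name):
    """Return (base name, homonym number); the number is None for plain names."""
    m = _SUFFIX.match(name)
    if m:
        return m.group("base"), int(m.group("nr"))
    return name, None
```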

There is no version control system like CVS or Subversion which logs changes to the files in the DBLP tree. Only the last time of modification (including the creation) is stored in the file modification date field managed by the operating system. Each new change to the file overwrites this date. To get all modification dates and all versions of the file content, we consider a set of about 3300 backups of the tree. They were created between October 1995 and September 2009 in a way that preserved the modification date. We call this the historic collection. We call the DBLP collection at date d a version vd. Before 1995, DBLP used a different storing system which is not compatible with the record tree. Some backups were damaged and contained misleading information. We have been able to reliably trace changes since June 1999. At this time the collection contained about 120,000 records, which is less than 10% of the current size, so the lost changes have only a limited weight for our analysis. In the remaining time frame we registered 775,650 changes. 59.6% of all records have been affected, with an average of 0.6 changes and a maximum of 23. Many changes are not relevant for our analysis because no author elements have been modified.
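Because the backups preserve modification dates, changed records can be found by comparing consecutive snapshots of the tree. A sketch of the idea, with a snapshot flattened to a {relative path: mtime} dictionary (function names are ours, not from the paper):

```python
import os

def snapshot(root):
    """Map each record file below root to its modification date."""
    return {os.path.relpath(os.path.join(d, f), root):
                os.path.getmtime(os.path.join(d, f))
            for d, _dirs, files in os.walk(root) for f in files}

def changed_records(old, new):
    """Relative paths that are new, or whose modification date differs,
    between two snapshots as produced by snapshot()."""
    return sorted(p for p, mtime in new.items()
                  if p not in old or old[p] != mtime)
```

Comparing a backup at date d against its predecessor in this way yields the candidate records for version vd.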

III. IDENTIFYING CORRECTIONS

A change is a transformation which modifies an author element in exactly one record file. Each change belongs to one version of the historic dataset. Prior to the modification, the content of an author element is called the source name. The resulting content is called the target name. Usually, it requires more than one change to repair a name-related inconsistency. For example, when H. Schweppe was renamed to Heinz Schweppe, all records of his 12 publications were modified. We group all changes we encounter for a version which have similar source and target names into a correction. A correction is maximal in the sense that there is no other change with the same characteristics at the same date that could be added to it.
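Grouping changes into corrections amounts to bucketing them by date, source name and target name. A sketch, assuming changes arrive as (date, source, target, record) tuples (an illustrative layout, not the paper's data format):

```python
from collections import defaultdict

def group_corrections(changes):
    """Group change tuples into corrections.

    A correction is the maximal set of changes sharing the same date,
    source name and target name; the value lists the affected records."""
    corrections = defaultdict(list)
    for date, source, target, record in changes:
        corrections[(date, source, target)].append(record)
    return dict(corrections)
```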

A. Types of Corrections

Based on the appearance of source and target name in the reference version (the version preceding the correction) and the version which is linked to the correction, we differentiate between four types of corrections (Figure 2). The distinct types are of different significance for our analysis.

Rename: All elements which contained the source name disappear and are replaced by elements with the target name. The target name has not been known before. The renaming of H. Schweppe is a typical example.

Merge: Similar to rename, all source names disappear. However, in a merge, the target name has been known before.

Fig. 2. The four types of corrections with source (S) and target (T) names: (a) Rename, (b) Merge, (c) Split, (d) Distribute.

Publications which have previously been connected to two different names are now mapped to one. This can indicate the resolution of a synonym. A special form of merge occurs when multiple source names are merged into one target name which has not been known before. We consider those as independent corrections.

Split: A split is the opposite operation of a merge. Some of the elements with the source name are changed to the target name while the others remain stable. The target name has not been known before this correction. A split can indicate a homonym resolution. For example, the publications of Craig A. Lee were listed as papers of Craig Lee, who is a different person. When the error was discovered, the corresponding records were corrected.

Distribute: Similar to split, some entries of the source name are changed. In case of a distribute correction, the target already existed. In most cases, distributions fix papers assigned to the wrong author.
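The four types can be told apart by two observations: does the source name survive the correction, and was the target name already known before it? A sketch of this decision, following the definitions above:

```python
def correction_type(source_remains, target_known_before):
    """Classify a correction by the four types defined above.

    source_remains: does the source name still occur after the correction?
    target_known_before: did the target name exist in the reference version?"""
    if source_remains:
        return "distribute" if target_known_before else "split"
    return "merge" if target_known_before else "rename"
```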

Our analysis aims at finding influences of corrections on social networks. A rename correction does not influence any network structure or property but merely assigns a new label to a name entity. For the further analysis, we ignore rename corrections.

B. Mining for Corrections

To compute and categorize the corrections we extract the changes in the historic collection. For each date d between June 1999 and September 2009, we extract all records with a new modification date which have not just recently been added. Only those records can differ from previous versions. Finding the old version of a record is simple because the combination of record name and path in the collection defines an almost unique identifier. In the relevant time frame, only 1954 records have been deleted or renamed and none of them is relevant for our study. For each version vd in the historic collection, we determine if an author name has been changed. The strict structure of XML makes it easy to identify modified elements. Of the 775,650 changes, 147,686 (19%) have altered an author entry. The most commonly altered field is the URL which locates the full version of the publication on the web. We ignore changes which add or remove authors from a record or change their order because they are not directly related to the person-name problem. If only one author element differs from the reference record, defining source and target name is straightforward. If more than one author element changed, i.e., multiple changes affected the record at once, there is no clear mapping. For example, when the author set {Adam, Bob} is changed into {Dave, Mike} there are two possible sets of changes: {Adam ↦ Dave, Bob ↦ Mike} and {Adam ↦ Mike, Bob ↦ Dave}. In most cases source and target names are alike. We compute the Levenshtein distance [11] for all possible name pairs and choose those with the lowest value. For 660 out of 15,039 records with multiple changes we have no significant results. We ignore them in the further analysis. We also ignore 255 entries which have been subject to different changes in a very short time. This is usually caused by a defective change which had to be reverted. In a second step we combine changes from the same date with the same target and source into one correction.

TABLE I
NUMBER OF IDENTIFIED CORRECTIONS AT DIFFERENT DISTANCE TO THE REFERENCE VERSION vd, BY TYPE

Type         vd       vd+1     vd+2     vd+3
Split        3,964    3,871    3,849    3,819
Distribute   14,289   14,177   14,152   14,126
Merge        56,190   56,287   56,307   56,332
- detected   52,674   52,771   52,791   52,816
- record     3,516    3,516    3,516    3,516
Rename       28,056   27,909   27,901   27,877
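For the small author sets of a single record, the minimum-distance mapping can be found by brute force over all pairings. A sketch with a textbook dynamic-programming Levenshtein implementation (function names are ours):

```python
from itertools import permutations

def levenshtein(a, b):
    """Classic edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def best_mapping(sources, targets):
    """Pair source and target names so the summed edit distance is minimal.

    Brute force over all pairings; author lists of one record are short."""
    return min((list(zip(sources, p)) for p in permutations(targets)),
               key=lambda pairs: sum(levenshtein(s, t) for s, t in pairs))
```

For {Adam, Bob} changed into two new names, both possible pairings are compared and the one with the smaller summed distance is chosen.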

Sometimes, resolving a name inconsistency takes several versions until it is completed. This is regularly the case if a large number of records need to be modified. An 'unfinished correction' can cause a false categorization. An incomplete rename, for example, is interpreted as a split because some of the source names remain unchanged for a while. To get a more reliable categorization we also consider the versions which follow vd, if existing. If we find a split correction we consider it as one. Table I lists the number of corrections by type we obtained when we considered different numbers of additional versions. The differences in the distribution of correction types between vd and vd+1 (where vd+1 is the version directly succeeding vd) are larger compared to the differences between the other versions. With a growing distance to the reference version, there is an increasing risk of overlaps with other corrections. For our analysis, we use the corrections found at vd+2. Merge is by far the most common type of correction. It includes the merges defined by the alternative name records and other modifications we extracted as described above. Splits are rare, mainly because they are closely related to the distribution corrections.


IV. SOCIAL NETWORKS IN THE HISTORIC COLLECTION

From the historic collection we extract two social networks. The collaboration network describes the coauthor relation between name entities. Two actors are related if their names appear in the same record. We use the historic collection to compute the collaboration network for every day between 1999-06-02 and 2009-09-30 on which DBLP was subject to changes. We order these 2496 static networks into a sequence C to obtain a dynamic collaboration network. There are several studies based on dynamic collaboration graphs which are defined by the year of publication (e.g. [4], [14]). These graphs differ from those created by our approach because the date of publication and the date of adding the meta data to DBLP do not correspond. An extreme example is the relation between L. Chwistek and W. Hetper. They published a joint paper in 1938, but this relation was not added to the dynamic collaboration network before 2003-10-13.

In the same way, we define the dynamic name-stream relation network S. Stream is the term DBLP uses to refer to conferences and journals. The actors of this two-mode network are names and streams. There is a relation between name n and stream s if s accepted at least one paper written by n.
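Both edge sets can be derived in one pass over the records. A sketch, assuming records are flattened to (stream, author list) pairs, where the directory of the record tree names the stream (an illustrative layout):

```python
from itertools import combinations

def network_edges(records):
    """Edges of the collaboration network C and the two-mode network S.

    records: iterable of (stream, [author names]) pairs."""
    c_edges, s_edges = set(), set()
    for stream, names in records:
        for a, b in combinations(sorted(names), 2):
            c_edges.add((a, b))           # coauthors appear in one record
        for name in names:
            s_edges.add((name, stream))   # stream accepted a paper by name
    return c_edges, s_edges
```

Computing these edge sets per backup date yields the static networks that form the dynamic sequences C and S.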

Figure 3(a) shows the increasing size of C and S. The values are almost identical because the number of streams is small compared to the number of names. In September 2009 there were about 760,000 names but only 3500 streams. Figure 3(b) shows that the average node degree of the static networks also increased, but at a much lower rate. Both dynamic networks have a significant giant component. Increasing over time as well, it contains between 59.6% and 79.9% of all actors in C. The giant component in S is even bigger because it is centralized around the streams, and those are closely related because of common authors. There are no significant smaller connected components in C or S.

V. CHARACTERISTICS OF DEFECTIVE ENTITIES

To learn more about the behavior of defective entities we analyze several measures of how they integrate into C and S. We then compare these values to the properties of non-defective entities. For this purpose we assume that the corrections we identified in Section III-B have removed all name-based inconsistencies from the collection and that all actors affected by these corrections had actually been defective. These are very strong assumptions because a data set of this size is virtually never free of errors, and we have found several corrections executed after September 2009. It is also likely that at least some corrections themselves have been defective and caused inconsistencies instead of removing them. However, DBLP is a high-quality database and we can assume that the majority of the 360,000 names which were never the source of a correction are actually congruent with real persons. We are aware that our approach is biased by the techniques DBLP uses to find defective names. If these tools exploit a property A, we will have a high number of source-target pairs with that property. We rely on the assumption that other properties of defective entities are independent from A so we can get

Fig. 3. Evolving aspects of C and S between June 1999 and September 2009: (a) number of actors in millions (actors in C, actors in S); (b) average node degree (degree in C, degree in S).

Fig. 4. Distribution of the node degree in C by the type of correction (unchanged, merge, split, distribute).

significant results. There is also substantial community feedback, i.e., a person reports an error, which triggers corrections and is independent from algorithmic considerations.

A. Local Properties

Local network properties are used by several inconsistency identification heuristics because they can be computed in a short time even for large networks. At first we considered the degree d(n) of actor n. In C, the degree denotes the number of coauthors, and in S the number of streams which accepted a paper written by n. We compute the degree of all correction sources immediately before this correction was executed. Figure 4 shows an aggregated probability distribution for the degree in C. The semi-logarithmic plot gives the probability p(d(n) ≥ x) for each type of correction. As a comparison, we added the distribution of all unchanged nodes as described


Fig. 5. Distribution of the clustering coefficient in C by the type of correction (unchanged, merge, split, distribute).

above on 2009-09-30. As expected [4], the distribution follows a power law with a long tail. We can see that correction types have different distributions. While the sources of merge operations tend to have a slightly lower degree than unchanged actors, split and distribute candidates have a larger immediate neighborhood. We expected this because the source of a merge correction represents only a part of a real author's work while split and distribute candidates represent multiple persons. It turned out that all nodes with a degree in C of more than 430 (with a maximum of 699) are affected by a split or a distribute after they reached this threshold. Merges and distributes increase the degree of the target entity slightly while splits drastically reduce it.
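The aggregated distributions plotted here are complementary cumulative distributions, p(d(n) ≥ x). A sketch of how such a curve could be computed from a list of node degrees (not taken from the paper's tooling):

```python
def ccdf(degrees):
    """Aggregated probability p(d(n) >= x) for x = 0 .. max degree,
    as plotted in the semi-logarithmic degree plots."""
    n = len(degrees)
    return [sum(1 for d in degrees if d >= x) / n
            for x in range(max(degrees) + 1)]
```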

The degree in S exhibits similar behavior but with a stronger parameter value. We think that these differences are caused by the way DBLP adds new records. Usually a proceedings or a volume is added in one batch instead of completing the publication list of a specific author [16]. This way it is more likely that a person is listed by two names for different streams, which causes a lower degree in S.

Though the distribution of the node degree shows differences between corrected and unchanged names, this information is of little use for inconsistency identification. Especially for common, i.e., low degree values, the differences between defective and non-defective entities are marginal. The local clustering coefficient clust(a) provides a more reliable heuristic for merge and split candidates. The clustering coefficient describes how close the neighborhood of an actor is to being a clique. We compute the number of pairs of coauthors who themselves are directly related. We then divide this value by the total number of pairs. If clust(a) = 1 then a is part of a clique. If clust(a) = 0 then a is the center of a star-shaped structure. Figure 5 shows the distribution for the actors in C. Merge and split candidates have a very low clustering coefficient because they represent several authors who usually have distinct coauthor sets. This property makes the clustering coefficient and similar metrics useful for homonym detection approaches like [16].
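The computation just described can be sketched for a graph stored as adjacency sets:

```python
def clustering_coefficient(graph, a):
    """Local clustering coefficient clust(a) for an undirected graph
    given as {node: set of neighbours}."""
    neigh = graph[a]
    k = len(neigh)
    if k < 2:
        return 0.0
    # count neighbour pairs that are themselves directly related
    linked = sum(1 for u in neigh for v in neigh
                 if u < v and v in graph[u])
    return linked / (k * (k - 1) / 2)   # divide by the number of pairs
```

A clique member yields 1.0; the hub of a star, where no two neighbours know each other, yields 0.0.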

B. Global Properties

If we consider global properties of defective and unmodified name entities, we must differentiate between the giant component of C and S and smaller structures. Defective entities are not evenly distributed. For example, in C the chance of an actor a to be corrected is about twice as high if a is in the giant component. Nevertheless, we are interested in the small structures as well. We define Ad(a) as the set of all actors in the static network Cd which can be reached by a path starting at actor a. a itself is not included in Ad(a). To evaluate how well an actor is integrated in the respective Ad(a), we compute two network centrality measures.

Fig. 6. Distribution of the closeness centrality in C by correction type (unchanged, merge, split, distribute).

The closeness centrality [1] $D^d_C(a)$ of actor a in network Cd is the average distance to all other actors in Ad(a). More formally, we compute

$D^d_C(a) = \frac{1}{n} \sum_{b \in A_d(a)} \operatorname{dist}(a, b)$

where dist(a, b) returns the minimal distance between actors a and b in Cd and n is the size of Ad(a). The node betweenness centrality [7] $D_B(a)$ is based on the assumption that a is central in the network if it is on many shortest paths between all pairs of nodes. We compute

$D^d_B(a) = \frac{1}{n(n-1)} \sum_{\substack{s \neq a \neq t \in A_d(a) \\ s \neq t}} \frac{\sigma_{st}(a)}{\sigma_{st}}$

where σst is the number of different shortest paths between actors s and t and σst(a) denotes the number of those paths which pass actor a. To achieve a fast computation time we used the algorithm of Brandes [2].
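The authors use Brandes' algorithm for the betweenness values; for the closeness value defined above, a single breadth-first search per actor already suffices. A sketch for a graph stored as adjacency sets:

```python
from collections import deque

def closeness(graph, a):
    """Average shortest-path distance from a to all actors reachable
    from a (the D_C value above; graph is {node: set of neighbours})."""
    dist = {a: 0}
    queue = deque([a])
    while queue:               # breadth-first search from a
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    reached = len(dist) - 1    # A_d(a) excludes a itself
    return sum(dist.values()) / reached if reached else 0.0
```

Note that under this definition a lower value means a better integrated actor.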

To our surprise, the distribution results for both centrality measures show that there are smaller differences between the types of corrections than for the local properties. This means that defective entities can reach a centrality equal to the centrality of correct entries. Figure 6 shows that the distribution of the closeness centrality has only minimal differences between the types. For low closeness values, all defect types have a better integration than the unchanged entities. A possible explanation is that names with high centrality are more likely to be checked for inconsistencies than others.

We took a closer look at the actors in C2009-09-30 with the highest betweenness and closeness values. Table II lists the number of corrections which affected the 10^2, 10^3 and 10^4


TABLE II
CORRECTIONS OF NAME ENTITIES WITH HIGH GLOBAL NETWORK CENTRALITY AND SHARE OF THE RESPECTIVE CORRECTION TYPE

           merge            distribute      split           all
DB 10^2    171 (0.3%)       239 (1.7%)      45 (1.1%)       455
DB 10^3    1,703 (3.0%)     1,233 (8.7%)    149 (3.8%)      3,087
DB 10^4    11,285 (20.0%)   5,491 (38.7%)   535 (13.8%)     17,316
DC 10^2    142 (0.25%)      236 (1.7%)      52 (1.3%)       412
DC 10^3    1,431 (2.5%)     1,009 (7.1%)    135 (3.5%)      2,575
DC 10^4    8,369 (14.9%)    4,099 (28.9%)   450 (11.6%)     12,923

actors with the highest closeness and betweenness centrality, respectively. Obviously, the share of these entities in the total number of corrections is much higher than the average. This is especially true for distribute corrections. In C2009-09-30, DB 10^4 makes up 1.4% of the nodes but is affected by 38.7% of all distribute corrections. In general, the effect is less strong for actors with high closeness values but still significant. We assume that there are two reasons for this. Many nodes with a high centrality represent persons who are central in the computer science community. We think that these persons draw more attention from both the DBLP maintainers and the observing community, so it is more likely that an inconsistency is detected. In addition, most central authors have a high publication count. With a growing number of papers, the probability of errors grows as well.

C. Relations between Source and Target

Many approaches for synonym detection make use of similarities between names which represent the same real-world person. Lee [10] and On [15], for example, discuss heuristics based on the number of common coauthors of source-target pairs prior to a merge. For the test set provided by the historic collection, we found that 55.8% of all source-target pairs of a merge operation shared at least one colleague. The average number of common coauthors is 2.16, which corresponds to 30.8% of the colleagues of the source name and 10.9% of the colleagues of the target name. For a random sample of 100 million name pairs in C2009-09-30, we found intersections of coauthor sets for only 3.4% of the pairs. Note that the average degree in C increases over time, so the probability of finding common authors is higher for C2009-09-30 than it is for the other static networks.
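The common-coauthor feature for a source-target pair can be read directly off the adjacency sets of C. A sketch (the function name is ours):

```python
def shared_coauthors(graph, source, target):
    """Coauthors common to a source-target pair, plus their share of each
    side's neighbourhood (graph: {name: set of coauthor names})."""
    common = (graph[source] & graph[target]) - {source, target}
    return (common,
            len(common) / len(graph[source]),   # share of source's colleagues
            len(common) / len(graph[target]))   # share of target's colleagues
```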

In a more general analysis, we computed the distance dist in C between the source-target pairs of all operations before and after a correction, in case they are connected. While the average distance between reachable actors in C2009-09-30 is 6.5, merge pairs are at a much closer average distance of dist_merge = 2.84 with a standard deviation of σ_merge = 1.72. The values we found for source-target pairs of a distribute are similar (dist_distribute = 2.908, σ_distribute = 2.1). The average distance of source name and target name of a split operation is 5.21, which is lower than the average distance. We expected this value to be higher because it requires a certain distinctness of the real-world persons to detect a homonym. However, 28.5% of these pairs are part of different connectivity components and therefore do not account for the average value. For other corrections, the chance of source and target to be in the same component of C is higher (merge: 78.9%, distribute: 88.7%).
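The distances reported above reduce to shortest-path lengths, which a plain breadth-first search provides. A minimal sketch over a toy adjacency dict, not the real network C:

```python
from collections import deque

# Minimal BFS sketch for source-target distances; the adjacency dict is a toy
# example, not the real collaboration network C.
def distance(graph, source, target):
    """Shortest-path length between two actors, or None if disconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in graph.get(node, ()):
            if nb == target:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None  # source and target lie in different connectivity components

graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "x": {"y"}, "y": {"x"}}
print(distance(graph, "a", "c"))  # 2
print(distance(graph, "a", "x"))  # None
```

The None case corresponds to the split pairs above that sit in different connectivity components and are excluded from the average.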

VI. RELIABLE AND ERROR-PRONE AREAS

In the previous section, we saw that corrections are not evenly distributed in C and S. This can be exploited by tools which estimate the correctness of a name. For example, if we find a name with no changes in the past but with properties we know are typical for erroneous entities, we can label this name as suspicious. This label might be helpful to a DBLP user who tries to estimate how reliable a publication list is. However, the properties we found in Section V are of little help because they have limited significance or are too difficult to compute to be useful in an applied approach. In this section, we discuss a reliability measure for names which is based on the neighborhood of an entity in C and S. We assume that there are regions in the graphs which have more defective entities than others. If a name n is closely related to several other names which were changed frequently in the past, we assume that n is part of a problematic area and should be treated with caution. On the other hand, there are areas which are almost free of corrections, which makes them more trustworthy. In this section, we discuss what causes unreliable areas and present preliminary results on how information on these effects can be used to predict future corrections.

A. Reliable Areas in S

For each stream a in S, we count the number of corrections which affected adjacent name entities. We normalize this figure by the size of the stream community, i.e., the number of all adjacent name entities. We obtain a value errS(a) which denotes the density of corrections close to a in the past. Table III lists the streams with the highest and the lowest errS values in S2009-09-30 for all streams with at least 1,000 publications. The mean value is 0.0544 with a high standard deviation of 0.0384. We can see that the correction density of the International Work-Conference on Artificial and Natural Neural Networks (IWANN) is about 33 times as high as the density of the IEICE Transactions.
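The errS density described above is a simple normalized count. A minimal sketch, assuming a mapping from each stream to its adjacent name entities and a set of previously corrected names (both illustrative placeholders):

```python
# Sketch of the err_S density: corrections among a stream's adjacent name
# entities, normalized by the size of the stream community. Inputs are
# illustrative placeholders, not DBLP data.
def err_s(community, corrected, stream):
    names = community[stream]
    return sum(1 for n in names if n in corrected) / len(names)

community = {"SOME-VENUE": {"n1", "n2", "n3", "n4"}}
corrected = {"n2"}  # n2 was changed by a past correction
print(err_s(community, corrected, "SOME-VENUE"))  # 0.25
```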

We considered several reasons for these significant differences, including singular events which caused bursts of corrections in a short time, or unevenly distributed interest in fixing name inconsistencies. We found a weak correlation between errS and the average age of records. If a defective entity is in the collection for a long time, there is a better chance to detect it. The column age lists the average number of days between the adding of a record and 2009-09-30. Though the differences seem strong for the examples in Table III, the age alone does not sufficiently explain the inequality. If we consider the number of corrections by publisher, we see that 17 of the 50 most changed streams were published by the IEEE, followed by the LNCS series (Springer) with 10 streams. The two other big publishers, ACM and Elsevier, are represented by 3 and 4 streams respectively. For IEEE publications we found an average number of 0.076 corrections per related author.


TABLE III
NUMBER OF CORRECTIONS BY STREAM. (J) INDICATES JOURNALS

      Stream             corr.   names    errS     age
  1   IWANN                635   3,140   0.202   1,990
  2   ICIP               2,340  13,707   0.171   1,592
  3   MICCAI               777   4,970   0.156   1,893
  4   DCG (J)              209   1,345   0.155   1,345
  5   QUESTA (J)           162   1,048   0.155   1,048
...
284   EIK (J)                9     757   0.012   2,282
285   BMCBI (J)             84   9,078   0.009     439
286   MVA                   21   2,601   0.008     483
287   CSSE                  27   3,826   0.007     244
288   IEICE Trans. (J)      85  14,019   0.006     515

LNCS follows with a quotient of 0.073, which is mainly caused by two large and often changed proceedings. The quotients for ACM and Elsevier are 0.065 and 0.058 respectively. We assume that the publishers provide meta data of different structure or quality, which causes the differences. However, some streams do not follow the average of their publisher. The least modified stream, for example, is published by the IEEE. In general, journals have a better quality than conferences. To make sure that this is a stable property, we computed errS for all streams at previous dates and found the same constellation.

We assume that the risk of a name entity n to be defective can be estimated by the errS values of the related conferences and journals. If n mainly published in streams with a high correction density, there is a higher chance that a future correction will affect this entity. For each name entity, we compute the average of errS over all adjacent streams to obtain the stream node risk estimation of n: rS(n).
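The stream node risk estimation rS(n) reduces to an average over adjacent streams. A small illustrative sketch; the name-to-stream mapping is hypothetical, while the errS values are taken from Table III:

```python
# Sketch of the stream node risk estimation r_S(n): the mean err_S over all
# streams adjacent to a name entity. The name and its streams are illustrative.
def r_s(streams_of, err_s_values, name):
    adjacent = streams_of[name]
    return sum(err_s_values[s] for s in adjacent) / len(adjacent)

streams_of = {"n1": ["IWANN", "ICIP"]}            # hypothetical name entity
err_s_values = {"IWANN": 0.202, "ICIP": 0.171}    # err_S values from Table III
print(round(r_s(streams_of, err_s_values, "n1"), 4))  # 0.1865
```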

B. Reliable Areas in C

Like errS for streams, we compute errC for persons by counting the corrections which affected the neighborhood (the coauthors) of n. Unlike streams, name entities can be affected by corrections which may displace them to another part of the collaboration network. An author who has witnessed a large number of corrections might end up with a small coauthor set. If we used normalization by the number of these coauthors like we did for errS, we would obtain unrealistically large values. So we refrain from using normalization for coauthors. Like for errS, there are significant differences in the errC values. While 326,865 actors in C2009-09-30 have never witnessed a close correction, the mean value is 5.28 with a standard deviation of 9.8 and a maximum of 648.
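The unnormalized errC count can be sketched as follows; the coauthor map and per-name correction counts are illustrative placeholders:

```python
# Sketch of err_C: the unnormalized number of past corrections that affected
# the coauthors of an actor. All inputs are illustrative placeholders.
def err_c(coauthors, correction_count, name):
    return sum(correction_count.get(c, 0) for c in coauthors[name])

coauthors = {"n": {"a", "b", "c"}}
correction_count = {"a": 2, "b": 0, "c": 3}  # corrections witnessed per coauthor
print(err_c(coauthors, correction_count, "n"))  # 5
```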

We cannot use the errC information to define rC(n) in the same way we used errS to define rS(n). Our evaluation showed that the direct neighborhood of most name entities is too small to return significant results. We consider a larger area which consists of all actors which can be reached by passing up to two edges in the collaboration network. This extended neighborhood has an average size of 118. We define rC(n) as the average of errC(t) over all actors t in this area. When we consider the name entities with the worst rC values, we find two common characteristics. For many entities, only parts of the name are provided, like T. D. Rogers. Obviously, missing

TABLE IV
RELIABILITY MATRIX M COMPUTED FROM THE rS AND rC VALUES OF 2007-09-30

          c1        c2        c3        c4
s1    46,277    35,295    26,961    23,406    25.00%
s2    26,796    30,174    35,585    39,331    25.00%
s3    29,968    32,163    33,920    36,144    25.05%
s4    28,954    34,433    35,234    33,007    24.95%
      25.02%    25.03%    24.96%    25.00%

first names makes it more difficult to find a mapping between person and name. If we assume that the completeness of the name depends on the data provided by the publisher, there is a correlation between rC(n) and rS(n). On the other hand, there are many names of Chinese or Spanish origin that have high rC values. We discussed before that the transcription of Chinese names is likely to cause homonyms. Since Spanish names tend to be long, they are often only provided in part, which provokes synonym inconsistencies.
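The extended-neighborhood estimate rC(n) defined above can be sketched as follows; the adjacency dict and errC values are illustrative, not DBLP data:

```python
# Sketch of r_C(n): the average err_C over the extended (two-hop) neighborhood
# of n in the collaboration network. Graph and err_C values are illustrative.
def two_hop_neighborhood(graph, name):
    area = set(graph[name])          # direct coauthors
    for nb in graph[name]:
        area |= graph[nb]            # coauthors of coauthors
    area.discard(name)
    return area

def r_c(graph, err_c_values, name):
    area = two_hop_neighborhood(graph, name)
    return sum(err_c_values[t] for t in area) / len(area)

graph = {"n": {"a"}, "a": {"n", "b"}, "b": {"a"}}
err_c_values = {"a": 4, "b": 2}
print(r_c(graph, err_c_values, "n"))  # 3.0
```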

C. Heuristic and Evaluation

For the final reliability heuristic, we combine rS and rC. At first, we partition all name entities by their rC value into four sets, c1 . . . c4. c1 contains about 25% of the nodes with the lowest rC values, c2 contains the next best 25%, and so on. We add all name entities with the same rC value to the same set, so the sets are not equally sized. In the same way, we derive s1 . . . s4 from rS. We then define a 4×4 matrix M. For each field Mi,j with i, j = 1 . . . 4, we create a set which contains the intersection of si and cj. In set M1,1 we find those actors with the best rS values and the best rC values. We expect these to be seldom affected by corrections, while the name entities in M4,4 are more likely to be changed in the future.
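The partitioning and the matrix M can be sketched as follows. Ties share a set (so, as noted above, the sets need not be equally sized); all data is illustrative:

```python
# Sketch of the partitioning into s1..s4 / c1..c4 and the 4x4 matrix M.
# Entities with equal scores go to the same set, so sets may differ in size.
def quartile_bins(values):
    """values: name -> score. Returns name -> bin index 0..3 (0 = best)."""
    ranked = sorted(values, key=values.get)
    bins, cut = {}, len(ranked) / 4
    for pos, name in enumerate(ranked):
        b = min(int(pos // cut), 3)
        if pos and values[ranked[pos - 1]] == values[name]:
            b = bins[ranked[pos - 1]]  # equal scores share a set
        bins[name] = b
    return bins

def reliability_matrix(r_s_values, r_c_values):
    s_bin = quartile_bins(r_s_values)
    c_bin = quartile_bins(r_c_values)
    M = [[set() for _ in range(4)] for _ in range(4)]
    for name in r_s_values:
        M[s_bin[name]][c_bin[name]].add(name)  # M[i][j] = s_{i+1} ∩ c_{j+1}
    return M

scores = {f"n{i}": i for i in range(8)}  # illustrative risk values
M = reliability_matrix(scores, scores)
print(sorted(M[0][0]), sorted(M[3][3]))  # ['n0', 'n1'] ['n6', 'n7']
```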

To evaluate our heuristic, we computed rS and rC for the static networks of 2007-09-30, exactly two years before the end of our relevant time frame. Table IV shows the sizes of the different Mi,j we obtained. Then we analyzed the succeeding corrections and counted how many entities of the different Mi,j were affected. Table V shows the density of corrections for each Mi,j, for all corrections and for each type. The percentages show the fraction of the total number of affected entities which can be found in the respective row or column. The density of corrections in M1,1 is only a third of the density in M4,4. While the density of corrections increases with growing rC, the maximum for rS can be found in s2 and s3 rather than in s4. We have found no explanation for this yet. If we look at the different types of corrections, we see that the results for distribute changes are better than the average.
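The evaluation step, counting the fraction of each Mi,j affected by later corrections, reduces to a small set computation (shown here with toy data, not the DBLP figures):

```python
# Sketch of the evaluation: the correction density of a cell M[i][j] is the
# fraction of its names affected by a correction after the cutoff date.
# The matrix and the affected set below are toy data.
def cell_densities(M, affected):
    return [[len(cell & affected) / len(cell) if cell else 0.0
             for cell in row] for row in M]

M = [[{"a", "b"}, {"c"}], [{"d"}, {"e", "f"}]]
affected = {"a", "e", "f"}  # entities corrected after the cutoff date
print(cell_densities(M, affected))  # [[0.5, 0.0], [0.0, 1.0]]
```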

VII. RELATED WORK

The problem of matching real-world persons with names in data sets is relevant for various applications in different fields. It is referred to by different terms, including name authority control, entity resolution, and record linkage. The most frequently used techniques for synonym detection are based on the comparison of the names, which is usually preceded by a blocking step to reduce the number of name pairs which have to be evaluated. We cannot give a comprehensive overview of this theme but refer to an extensive survey by Elmagarmid et al. [5] and a comparative evaluation by Byung-Won On et al. [15].

TABLE V
DISTRIBUTION OF CORRECTIONS BETWEEN 2007-09-30 AND 2009-09-30

(a) All corrections
         c1        c2        c3        c4
s1   0.0261    0.0381    0.0506    0.0575    18.5%
s2   0.0287    0.0492    0.0691    0.0892    28.9%
s3   0.0232    0.0455    0.0700    0.0975    28.4%
s4   0.0268    0.0424    0.0611    0.0757    24.2%
     12.1%     20.2%     29.4%     38.2%

(b) Merge corrections
         c1        c2        c3        c4
s1   0.0199    0.0286    0.0352    0.0376    20.7%
s2   0.0217    0.0335    0.0430    0.0496    27.9%
s3   0.0159    0.0304    0.0424    0.0558    27.0%
s4   0.0207    0.0287    0.0394    0.0450    24.5%
     14.2%     21.9%     29.1%     35.8%

(c) Split corrections
         c1        c2        c3        c4
s1   0.0009    0.0012    0.0014    0.0017    13.2%
s2   0.0008    0.0018    0.0029    0.0058    33.4%
s3   0.0010    0.0015    0.0027    0.0042    26.3%
s4   0.0012    0.0024    0.0033    0.0029    27.0%
     10.6%     18.3%     28.7%     42.4%

(d) Distribute corrections
         c1        c2        c3        c4
s1   0.0052    0.0084    0.0138    0.0181    14.9%
s2   0.0062    0.0138    0.0231    0.0337    30.5%
s3   0.0062    0.0136    0.0249    0.0374    31.5%
s4   0.0049    0.0113    0.0182    0.0273    23.2%
     8.1%      17.2%     29.9%     44.7%

To identify homonyms, several simple social network properties have been used. Mong-Li Lee et al. [10] make use of relations between name entities to detect synonyms. Once a pair of possibly related actors is identified by comparing their name labels, the context in various relations is considered. For a small subset of DBLP, this approach showed an accuracy of 89% if common coauthors and common themes of interest are considered. Instead of a name-stream relation, they consider a more general view of fields of computer science. This correlates with our findings on the low distance between source and target before a merge in C and S. Reuther et al. [16] introduced the connected-triple similarity, which is based on the local clustering coefficient. We also showed that the clustering coefficient serves well to detect homonyms.

Past studies had to deal with the problem of finding a suitable test collection. Based on DBLP, Byung-Won On et al. generate an artificial test collection for their evaluation. They apply false split corrections to the 100 most productive authors, i.e., they assign wrong name labels to half of the associated records. In addition to deciding which entities to manipulate, they also had to choose how to change the names. They critically discuss the usability of this approach.

VIII. CONCLUSION AND FUTURE WORK

We introduced the historic collection and showed how name-related corrections can be extracted from it. For local and global network centrality measures, we showed that there are differences in the integration of defective entities. We discussed the usability for quality assurance systems and found that recall and precision of some heuristics are not sufficient. We also found that corrections cluster in social networks. We used this information to introduce a new heuristic to estimate the correctness of a name entity based on the distribution of past corrections. A simple evaluation showed that the heuristic can be used to measure the reliability of name entities.

So far, we have applied the reliable-area approach to simple regions in C and S. In a next step, we need to find more complex techniques to define the set of nodes that is considered for the computation of rS and rC. Community-based clustering might provide better results than the simple extended-neighborhood concept. So far, we have also neglected the dynamic aspect of the integration of defective entities. Future work will have to analyze how this integration changes until the inconsistency is resolved by a correction.

REFERENCES

[1] A. Bavelas. Communication patterns in task-oriented groups. Journal of the Acoustical Society of America, 22:725–730, 1950.

[2] U. Brandes. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.

[3] H. Deng, I. King, and M. R. Lyu. Formal Models for Expert Finding on DBLP Bibliography Data. In ICDM, pages 163–172. IEEE Computer Society, 2008.

[4] E. Elmacioglu and D. Lee. On six degrees of separation in DBLP-DB and more. SIGMOD Record, 34(2):33–40, 2005.

[5] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.

[6] M. Enserink. Are You Ready to Become a Number? Science, 323(5922):1662–1664, 2009.

[7] L. C. Freeman. A Set of Measures of Centrality Based Upon Betweenness. Sociometry, 40(1):35–41, 1977.

[8] Z. Huang, Y. Yan, Y. Qiu, and S. Qiao. Exploring Emergent Semantic Communities from DBLP Bibliography Database. In ASONAM, pages 219–224. IEEE Computer Society, 2009.

[9] D. Lee, B.-W. On, J. Kang, and S. Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In IQIS, pages 69–76. ACM, 2005.

[10] M.-L. Lee, W. Hsu, and V. Kothari. Cleaning the Spurious Links in Data. IEEE Intelligent Systems, 19(2):28–33, 2004.

[11] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals (translation from Russian). Soviet Physics Doklady, 10(8):707–710, 1966.

[12] M. Ley and P. Reuther. Maintaining an Online Bibliographical Database: The Problem of Data Quality. In EGC, volume RNTI-E-6 of Revue des Nouvelles Technologies de l'Information, pages 5–10. Cepadues-Editions, 2006.

[13] X. Li, C. S. Foo, K. L. Tew, and S.-K. Ng. Searching for Rising Stars in Bibliography Networks. In DASFAA, volume 5463 of Lecture Notes in Computer Science, pages 288–292. Springer, 2009.

[14] M. A. Nascimento, J. Sander, and J. Pound. Analysis of SIGMOD's co-authorship graph. SIGMOD Record, 32(3):8–10, 2003.

[15] B.-W. On, D. Lee, J. Kang, and P. Mitra. Comparative study of name disambiguation problem using a scalable blocking-based framework. In JCDL, pages 344–353. ACM, 2005.

[16] P. Reuther, B. Walter, M. Ley, A. Weber, and S. Klink. Managing the Quality of Person Names in DBLP. In ECDL, volume 4172 of Lecture Notes in Computer Science, pages 508–511. Springer, 2006.

[17] M. M. M. Snyman and M. J. van Rensburg. Revolutionizing name authority control. In DL '00: Proceedings of the fifth ACM conference on Digital libraries, pages 185–194, New York, NY, USA, 2000. ACM.
