
Exceptional Data Quality Using Intelligent Matching and Retrieval

Clint Bidlack and Michael P. Wellman

Copyright © 2010, Association for the Advancement of Artificial Intelligence. All rights reserved. ISSN 0738-4602

Recent advances in enterprise web-based software have created a need for sophisticated yet user-friendly data-quality solutions. A new category of data-quality solutions that fill this need using intelligent matching and retrieval algorithms is discussed. Solutions are focused on customer and sales data and include real-time inexact search, batch processing, and data migration. Users are empowered to maintain higher quality data, resulting in more efficient sales and marketing operations. Sales managers spend more time with customers and less time managing data.

Several business and technology drivers are disrupting the world of enterprise software, and that in turn is driving the need for more effective data-quality solutions. These drivers include business acceptance of the software-as-a-service (SaaS) model, wide adoption of the web as a platform, collapse of enterprise application silos, aggregation of data from disparate internal and external sources, the agile mindset, and the economic conditions driving agility.

SaaS is a software deployment model in which the provider licenses applications for use as a service on demand, most often accessed through a browser. In 2007 SaaS clearly gained momentum, with the sector generating three billion dollars in revenue; by 2013 SaaS revenue could reach 50 percent of all application software revenues (Ernst and Dunham 2006). Customer benefits from SaaS deployments include much quicker and easier implementations, relatively painless upgrades, global access through the browser, lower total cost of ownership, and software vendors sharing more of the risk (Friar et al. 2007). Related to SaaS is another significant shift, the move from proprietary platforms to the web as a platform. Again the user benefits, because of less vendor lock-in and open, well-documented, and broadly accepted standards backed by a long-term commitment from the software industry as a whole. These all contribute to higher-quality solutions across the industry, as well as accelerated innovation and more choice and more flexible solutions for the user.

Over the past couple of decades enterprise software solutions tended to result in data in silos, for example, a deployed accounting system that is unable to share data with a customer-relationship management system. The potential business value, or benefit, from removing data silos has driven companies to pursue such efforts to completion. Collapse of the data silo is related to the larger phenomenon of overall data aggregation on the web. A growing pool of online structured data, better tools, and, again, a large economic driver are all pushing organizations to aggregate and use data from a multitude of sources.

Finally, one of the most significant shifts in the software industry is the explicit transition toward agile software development. Agile development includes iterative software development methodologies in which both requirements and solutions evolve during the development of the software. Agile methodologies are in contrast to waterfall methodologies, which imply that requirements are well known before development starts (Larman and Basili 2003). More effective agile processes are important not only for software development, but for organizations of all types and sizes. The necessity to keep organizations aligned with opportunities and external competitive forces is forcing this reality. Being agile allows an organization to adapt more rapidly to external forces, which in turn increases the chances of survival.

For companies, the above trends are resulting in more effective use of enterprise software as well as more efficient business operations. Effectiveness is driving adoption across the business landscape, across industries, and from very small companies up to the Global 2000. Efficiency is driving application acceptance and usage within the company. This combination of effectiveness and efficiency, driving adoption and usage, is fueling enormous growth of structured business data.

Large volumes of structured business data require significant effort to maintain the quality of the data. For instance, with customer-relationship management (CRM) systems (deployed under SaaS or traditional software installations), ActivePrime and its partners have found that data quality has become the number one issue that limits return on investment. As the volume of data grows, the pain experienced from poor-quality data grows more acute. Data quality has been an ongoing issue in the IT industry for the past 30 years, and it continues to expand as an issue, fueling the growth of the data-quality industry to one billion dollars in 2008. It is also estimated that companies are losing 6 percent of sales because of poor management of customer data (Experian QAS 2006).

Several competing definitions of data quality exist. The pragmatic definition is considered here; specifically, if data effectively and efficiently supports an organization's analysis, planning, and operations, then that data is considered of high quality. In addition, data cleansing is defined as manual or automated processes that are expected to increase the quality of data.

Existing Solutions

Past solutions to data-quality problems were driven in part by the economics of the institutions having the problem. Traditionally, the demand for data-quality solutions was driven by very large organizations, such as Global 2000 corporations. They had the resources to deploy complex software systems for gathering data, and they were the first to notice and suffer from the inevitable data-quality problems resulting from this complexity. Accordingly, the approaches developed by the information technology researchers pioneering the area of data quality (Lee et al. 2006) tended to emphasize statistical data assessment, business process engineering, and comprehensive organizational data-assurance policies. Given their size, the early data-quality customers had the resources to adopt these labor-intensive and therefore expensive solutions. The traditional data-quality solutions tended to rely heavily on manual operations, in two different respects.

First, data was often hand-cleansed by contracting with external staffing agencies. Business analysts would first identify what type of data-quality work needed to be performed and on which data. Large data sets would be broken up into reasonable sizes and put into spreadsheets. This data would be distributed to individuals along with instructions for cleansing. After the manual work on a spreadsheet was finished, the work could often be cross-checked by another worker and any discrepancies investigated. Once the data was finished it was reassembled into the appropriate format for loading back into the source IT system. Such manual effort has clear drawbacks, including the time required to cycle through the entire process, the possibility for manual error, and the need to export and then import the final results. Exporting is usually fairly easy; importing is almost always the bigger issue. Import logic typically needs to identify not only which specific fields and records to update, but also how to deal with deleted or merged data, and how to accomplish all this without introducing new errors. Another issue is that additional work is required to build and thoroughly test the import tools. Finally, the entire manual process has little room for increased return on investment. The customer has to pay for the manual work each time data is cleansed, meaning that the economic benefits of automation are not realized.

Second, earlier data-quality vendors provided technological solutions to data-quality problems, and these required significant manual setup. The reasons for the manual setup included business analysis to understand the data-quality needs of the organization, identification of the final data-quality work flow, and then the actual programming and configuration to put the data-quality solution in place. In other words, these companies were building custom data-quality solutions using data-quality vendor application programming interfaces (APIs). Once the solution was put in place, automation reduced the manual effort, so a longer horizon for return on investment was acceptable. These solutions worked fine for large companies that could afford both the problem (the initial enterprise system that aggregates the data) and the solution. Today, sophisticated business applications are being used by even the smallest organizations; hence, the manual effort associated with data quality must be in alignment with the resources of these smaller organizations. The data-quality solutions must leverage automation and, at the same time, provide the business user with intuitive access to the processing results and the ability to override the results.

Data-quality research has also seen significant progress, with the first issue of the new ACM Journal of Data and Information Quality published in 2009. Frameworks for researching data quality have been introduced (Madnick 2009; Wang 1995), as well as specific mathematical models for addressing the record linkage problem (Fellegi and Sunter 1969). Recent research in record linkage includes the development and deployment of more intelligent linkage algorithms (Moustakides and Verykios 2009; Winkler 2006). Data linkage is a core issue in many data-cleansing operations and is the process of identifying whether two separate records refer to the same entity. Linkage can be used both for identifying duplicate records in a database and for identifying similar records across disparate data sets.

Solutions

ActivePrime's initial products and services focus on increasing the quality of data in CRM systems. CRM systems store people-centered data at the core, with other types of data, such as product and transactional data, considered secondary in nature. Though the secondary data is essential as well, in a CRM system the user interactions predominantly revolve around managing and querying the people-centered data. The main entities involved are organizations and people. Organizations are typically referred to as accounts and people as contacts. As is expected, accounts and contacts have properties such as postal addresses, email addresses, phone numbers, and such. The main focus has been on the effective and efficient management of the quality of account and contact data.

Before we describe the various solutions, it is important to understand why the quality of data in CRM systems is so poor in the first place, and also how data enters the system. One of the main reasons for poor-quality data in CRM systems is that the data is initially entered by hand into some system, sometimes by users without appreciation for the importance of data quality. For instance, the user may initially enter the data into an email client for which the user is the sole consumer of the data. For this dedicated one-party use, the quality of the data may not be so important. If the company name is misspelled, or if the person's title is abbreviated in a unique way, there really is no problem. The issues arise if that data is then synchronized with a large volume of data in a CRM system with many other users, and the organization is depending on the data for operational purposes. Suddenly the misspellings, abbreviations, and missing information have a significant, material impact on the ability of the organization to leverage the data.

There are five points of entry for the majority of data entering CRM systems: manual entry of data by sales, marketing, and support personnel; batch importing of data from external sources of information; data synchronization from other software and devices like email clients, mobile phones, and PDAs; web entry of data by a potential customer or other third party; and migration of data from legacy systems.

Successful CRM data-quality initiatives must account for managing the quality of data at all these points of entry. ActivePrime's three solutions, CleanEnter, CleanCRM, and CleanMove, address each of these areas.

Data Quality upon Manual Entry

Upon manual entry of data, either through the web or using the CRM data-entry user interface (UI), CleanEnter provides search functionality that notifies users if they are about to enter a duplicate record. Figure 1 displays the search UI on contact entities and shows that entering a new contact of "ron raegan" has a potential match with four records in the system. At this point the user can drill down on a record, navigate to a record in the CRM system, or choose to continue entering "Ron Raegan" as a new record.

Data Quality upon Batch Entry

Customer data is often batch-entered after attendance at a trade show or after purchasing a list of customer data from a third party. Upon batch entry of data, or when cleansing large volumes of data that have just been synchronized into the CRM system, sales or marketing support staff will need to identify areas in the data that need to be standardized or find and merge duplicate data. Figure 2 shows the standardization screen in CleanCRM, with each pair of lines representing the record before and after standardization. For example, Rows 97 and 98 illustrate two fixes: the address convention standardized to short form (Dept instead of Department), and the country name normalized from America to USA. The user has control over which standardizations are applied by selecting controls on a UI (not shown) that include turning all standardizations on or off and setting naming conventions for corporate names, business designations, addresses, secondary address formatting, city spelling corrections, and state names as short form or long form.
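As an illustration of the kind of standardization rules just described, the following sketch normalizes long-form address terms and country names. The rule tables and function names are hypothetical examples, not CleanCRM's shipped standardization sets.

```python
# Illustrative standardization rules (hypothetical tables, not CleanCRM's).
ADDRESS_SHORT_FORMS = {"department": "Dept", "street": "St", "suite": "Ste"}
COUNTRY_NAMES = {"america": "USA", "united states": "USA", "u.s.a.": "USA"}


def standardize_address_line(line):
    """Rewrite known long-form address words to their short forms."""
    words = []
    for word in line.split():
        words.append(ADDRESS_SHORT_FORMS.get(word.strip(",.").lower(), word))
    return " ".join(words)


def standardize_country(value):
    """Normalize country names to a single convention."""
    return COUNTRY_NAMES.get(value.strip().lower(), value)


print(standardize_address_line("Department of Sales, 12 Main Street"))
# -> "Dept of Sales, 12 Main St"
print(standardize_country("America"))  # -> "USA"
```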

The second type of cleansing applied by CleanCRM is identification and merging of duplicate data. Duplicates can be identified based on exact or inexact matching on any field. Duplicates can then be merged together according to some predefined merge rule or a custom merge rule. Merge rules specify how records are rolled up, which record is the master record in the system, and which field data are selected in the case of conflicting data. Figure 3 shows the merge screen with two duplicate sets. Note that the first set contains two records, 8.1 and 8.2, with differences in account name and first name. Even given spelling errors and nicknames, the duplication has been found.

A combination of domain knowledge and inexact matching allows for finding these records. The nickname ontology is deployed and in this instance finds that Becky and Rebecca refer to the same person. Note that if either Becky or Rebecca had been misspelled, CleanCRM still would have identified them as duplicates because of inexact matching. The three duplicate records flagged as 7.1, 7.2, and 7.3 show how the robust inexact matching works, including the handling of non-ASCII character sets. The first names include Ývonne and Yvonne. In both duplicate sets a final record is shown that displays to the user what the final record will look like in the CRM system.

The grid in figure 3 also enables the user to edit the final record manually after standardization and merging. Any record that has been standardized shows up in the grid, highlighted accordingly, and the user can toggle the record to see the values before and after standardization. At this point any standardization values can be discarded. Because the software performed most of the difficult work of identifying duplicates and applying appropriate standardizations, the user is left with simply reviewing and overriding as necessary. If more specific standardizations or merging is needed, then plug-ins can be provided for an extra level of automation. Specific business logic can be applied, resulting in a very streamlined and automated data-cleansing process with the user being able to adjust the results accordingly. A balance is provided between full automation and appropriate manual intervention.
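To make the merge rules described above concrete, here is a minimal sketch of one possible predefined rule: the most recently modified record becomes the master, and empty master fields are backfilled from the other records in the duplicate set. The field names and the policy are illustrative assumptions, not CleanCRM's actual merge logic.

```python
def merge_duplicates(records):
    """Merge a duplicate set: the most recently modified record becomes the
    master, and any field the master leaves empty is filled from the others.
    This is a hypothetical rule for illustration only."""
    master = max(records, key=lambda r: r.get("modified", ""))
    merged = dict(master)
    for record in records:
        for field, value in record.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged


duplicates = [
    {"first_name": "Becky", "account": "Acme Inc", "title": "VP Sales",
     "modified": "2009-01-05"},
    {"first_name": "Rebecca", "account": "Acme, Inc.", "phone": "555-0100",
     "modified": "2009-03-20"},
]
print(merge_duplicates(duplicates))
# The 2009-03-20 record is the master; the missing title is backfilled.
```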

Data Quality upon Migration

CleanMove is a data-quality solution for assisting the migration of customer data from a previous, legacy CRM system to a new system. It expedites the data-migration process by applying a series of data-cleansing rules to customer data that verify the quality, integrity, and format of the data. Rules include verifying unique data in database key fields; referential integrity checking; phone, address, and date formatting; finding and merging duplicates; and other processing. Some CRM systems, like Oracle CRM On Demand, put strict controls upon the format of data before it can be imported. If data such as phone numbers, addresses, and picklist values are not formatted properly, the database will reject the data. In other words, if the data is not clean, then the customer cannot even import the data. When millions of records are being brought into such systems, the value provided by CleanMove is very significant. Several months of either manual work or custom coding can be saved by using CleanMove.

Figure 1. Results of Real-Time Inexact Matching.

A Data-Quality Platform

All three solutions are built on a platform for rapidly building data-quality applications: an integrated set of APIs, algorithms, 64-bit Linux servers, ontologies, specialized databases, and third-party data sets. For performance reasons, low-level algorithms are coded in C or Cython; business logic, application logic, and ontologies are encoded in Python. User interfaces are coded in standards-based, cross-browser HTML and JavaScript, with data being transferred as XML, JSON, or YAML.

Figure 2. Results of Standardization of Address Data during Batch Processing.

Figure 3. Results of Duplicate Identification during Batch Processing.

Intelligent Matching and Retrieval

ActivePrime leverages AI-related techniques in three broad categories: lightweight ontologies, search-space reduction (SSR), and query optimization. The use of lightweight ontologies has been described elsewhere (Bidlack 2009). In summary, lightweight ontologies are deployed as modules and classes in the Python programming language, enabling rapid, iterative development of ontologies using a popular scripting language. The ontologies also benefit from the large repository of built-in Python operators. Sophisticated operations on ontologies can be performed with just a few lines of code.
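A minimal sketch of what such a lightweight ontology, expressed as an ordinary Python module, might look like is shown below. The tables, terms, and helper functions are hypothetical examples in the spirit of the description above, not ActivePrime's actual ontologies.

```python
# A lightweight ontology as an ordinary Python module: plain dictionaries
# plus a few helper functions. The entries are hypothetical examples.
NICKNAMES = {
    "rebecca": {"becky", "becca"},
    "ronald": {"ron", "ronnie"},
}
STATE_SYNONYMS = {
    "massachusetts": {"ma", "mass"},
    "michigan": {"mi", "mich"},
}


def canonical_term(token, table):
    """Return the canonical term for token, or token itself if unknown."""
    t = token.strip().lower()
    for canonical, variants in table.items():
        if t == canonical or t in variants:
            return canonical
    return t


def same_entity(a, b, table):
    """True if two tokens map to the same canonical term."""
    return canonical_term(a, table) == canonical_term(b, table)


print(same_entity("Becky", "Rebecca", NICKNAMES))  # True
print(canonical_term("Mass", STATE_SYNONYMS))      # massachusetts
```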

SSR techniques are utilized when performing inexact matching on larger volumes of data, when record counts grow into the many thousands and millions. Query optimization techniques allow for real-time detection of duplicate records when matching one record to a large remote database.

Search Space Reduction

Robust inexact matching algorithms, comparing the closeness of two strings, have been in circulation for at least four decades (Levenshtein 1965) and remain an active area of research (Navarro 2001; Chattaraj and Parida 2005). The challenge today is how to perform such matching on larger volumes of data, very quickly. Inexact matching is important in many data-cleansing operations, including normalization, searching for duplicates, and performing record linkage (Dunn 1946). In such situations the computational cost of comparing each record in the large data set to every other record is prohibitive, especially given that each inexact string comparison operation is itself expensive, being O(nm) to compare strings of length n and m. Thus, a brute-force comparison of all pairs among K strings, each of length n, would take O(K²n²). Clearly this complexity is unacceptable in many data-cleansing operations where there are millions of records, that is, K > 10⁶. In addition, if matching is being performed on z fields on each record, then the complexity is O(K²n²z). Processing time with modern hardware would most likely be measured in years.
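For reference, the per-pair cost cited above comes from the classic dynamic-programming edit-distance computation, sketched below. This is the textbook O(nm) algorithm, not the article's optimized C/Cython implementation.

```python
def edit_distance(s1, s2):
    """Classic O(n*m) Levenshtein distance: the per-pair cost that makes a
    brute-force all-pairs comparison of K strings of length n cost O(K^2 * n^2)."""
    n, m = len(s1), len(s2)
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
        prev = cur
    return prev[m]


print(edit_distance("Raegan", "Reagan"))  # 2 (a transposition costs two edits here)
```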

To provide solutions with acceptable computational complexity, four types of SSR techniques are leveraged. First, many data-quality operations involve matching strings, where matches are defined as those strings within some acceptable threshold of one another. Therefore, candidates outside the threshold can be pruned without computing an exact edit distance. The histogram-based pruning technique does this by quickly computing a bound on the edit distance between two strings. Second, search-based pruning uses the acceptable distance threshold to terminate the edit-distance computation as soon as the threshold is reached. Third, inexact indexing facilitates rapid inexact matching on large volumes of data. Finally, for conjunctive queries over multiple fields, choosing an intelligent query-field ordering can produce dramatic performance improvements.

Histogram-Based Pruning

Histogram-based pruning is a technique to rapidly identify whether two strings are beyond an acceptable edit-distance limit. Pruning of inexact string comparisons can reduce run time, sometimes by well over 70 percent, by decreasing the number of exact edit-distance operations being performed. Exact edit-distance calculations are rather expensive (quadratic time), and histogram-based pruning cheaply identifies many instances where such operations are unnecessary. The histogram computation is itself on average much better than O(n + m).

Histogram-based pruning works by computing a global histogram-based metric on the strings being compared. The metric provides a lower bound on the distance between two strings. If the metric is greater than T, the acceptable difference threshold between the two strings, then there is no need to perform the computationally expensive edit-distance calculation.

An alphabet histogram H is computed by making a one-time scan over the two strings being compared, S1 and S2. H contains N cells, where N is the size of the alphabet B of all possible characters. For instance, N = 36 for an alphabet of case-insensitive, alphanumeric ASCII characters. For each character C1 in S1, increment H[C1] by one, and for each character C2 in S2, decrement H[C2] by one. If a character C appears equally often in both strings, then H[C] = 0. Define H+ as the sum of all positive numbers in H, and H– as the sum of all negative numbers in H. The magnitudes of H+ and H– each provide a lower bound on the edit distance between the two strings. Thus if the absolute value of either H+ or H– is greater than T, then the strings cannot be similar to one another.
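The sketch below implements the histogram bound as described above and uses it as a cheap pre-check before the exact edit distance (reusing the edit_distance function from the earlier sketch). It illustrates the idea; it is not ActivePrime's production code.

```python
def histogram_lower_bound(s1, s2):
    """Lower bound on the edit distance via a character-count histogram,
    as described above: +1 for each character of S1, -1 for each of S2."""
    h = {}
    for c in s1.lower():
        h[c] = h.get(c, 0) + 1
    for c in s2.lower():
        h[c] = h.get(c, 0) - 1
    h_plus = sum(v for v in h.values() if v > 0)    # surplus characters in s1
    h_minus = -sum(v for v in h.values() if v < 0)  # surplus characters in s2
    return max(h_plus, h_minus)


def within_threshold(s1, s2, t):
    """Cheap pre-check: skip the quadratic edit distance whenever the
    histogram bound already exceeds the acceptable threshold T."""
    if histogram_lower_bound(s1, s2) > t:
        return False
    return edit_distance(s1, s2) <= t  # edit_distance as in the earlier sketch
```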

Search-Based Pruning

The edit distance between two strings can be defined as the shortest path in a graph of strings, where edges represent edit operations. Standard edit-distance algorithms calculate the distance by finding the shortest path. Since we do not care about the exact distance if the match is unacceptable, we can often terminate the search early, once we have ruled out all paths of length T or less.

Note that just as T can be used to terminate edit-distance computations, T can also be used to terminate histogramming. In many instances the algorithm can terminate early such that S1, S2, and H are never fully scanned: the computation is already beyond the acceptable distance, hence the search for a solution is terminated. Finally, H is most often a sparse array; hence, with proper accounting, only nonzero elements in H need to be accessed while computing H+ and H–, providing a further practical speedup of the algorithm.
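A common way to realize this kind of early termination is a threshold-bounded edit distance that stops as soon as every remaining path must exceed T. The sketch below is one such variant, offered as an illustration under that assumption rather than the exact algorithm used in the products.

```python
def bounded_edit_distance(s1, s2, t):
    """Edit distance with early termination: returns the exact distance when it
    is <= t, and t + 1 as soon as every remaining path must exceed t."""
    n, m = len(s1), len(s2)
    if abs(n - m) > t:          # length difference alone already exceeds t
        return t + 1
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        if min(cur) > t:        # no path through this row can come back under t
            return t + 1
        prev = cur
    return prev[m]
```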

Inexact Indexing

Indexing can enable speedups in processing of several orders of magnitude. Indexing is viewed as a way to reduce processing by grouping the records into a large number of buckets, and then any specific string comparison searches only over the data in a small number of relevant buckets, effectively pruning the search space. The creation of buckets is possible by mapping strings (keys) such that all potentially similar strings map to the same bucket. Then when performing a search, a string is matched against only the records in the appropriate buckets.

A main engineering challenge with inexact indexing is developing the appropriate mapping. The ActivePrime matchers use several different mappings. Two of the most common are length-based mapping and starts-with mapping.

Length-based mapping simply puts each string into a bucket based on its length. For example, if 400,000 records are to be indexed and the string lengths range from 3 to 43 characters, then on average there will be 10,000 strings per bucket. If T (the acceptable threshold) is on average four edits, then on average the number of buckets to search in will be four, and the computations can be reduced by a factor of 10.

Starts-with mapping can potentially segment data into a very large number of buckets, depending on how many characters are used for mapping. All strings that start with the same N characters are put into the same bucket. Given a case-insensitive, alphanumeric ASCII alphabet, for N = 2 the number of buckets is 1296 (36 squared). However, this mapping will result in missing some inexact matches: those with discrepancies within the first N characters. One way to significantly reduce the incidence of such misses is to include an ends-with index and then search on both starts-with and ends-with. On average, for N = 2, the search space is reduced by two orders of magnitude.
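The two mappings can be sketched as follows. Note that this illustration searches every length bucket within T of the query length, a simplifying assumption rather than the article's exact bucketing policy.

```python
from collections import defaultdict


def build_length_index(strings):
    """Length-based mapping: every string goes into the bucket for its length."""
    index = defaultdict(list)
    for s in strings:
        index[len(s)].append(s)
    return index


def length_candidates(index, query, t):
    """Only buckets whose length is within t of the query length can contain a
    match, so all other buckets are pruned without any string comparisons."""
    candidates = []
    for length in range(len(query) - t, len(query) + t + 1):
        candidates.extend(index.get(length, []))
    return candidates


def build_prefix_index(strings, n=2):
    """Starts-with mapping: bucket on the first n characters (case-insensitive)."""
    index = defaultdict(list)
    for s in strings:
        index[s[:n].lower()].append(s)
    return index
```

A query would then run the inexact matcher only over length_candidates(...) or over the query's own prefix bucket (and, optionally, a matching ends-with bucket) instead of over the full record set.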

Query-Field Ordering

A second way to leverage inexact indexing is to take advantage of matching on multiple fields where the user expects to conjoin (AND) the results together, which is the case for most data-quality applications. Data-quality applications almost always require matching on multiple fields, and search times vary greatly on different fields, depending on the data itself as well as the match threshold being applied to the field. Because of varying search times, the order of querying can be adjusted to take advantage of these differences by dynamically detecting the optimal field search order. This ordering effectively tiers the searching of indexes.

After indexes are created for each field, a small random selection of queries is applied to each field's index to rank the field indexes from fastest to slowest. Then all future queries are applied to the field indexes in the selected order. The edit-distance computation is applied to a progressively smaller number of the total records (progressive reduction). Each field index value stores the unique record identifiers. Identifiers are the row numbers of the records in the original database. For instance, a database with 1000 records will have unique identifiers from 0 to 999. Progressive reduction is possible because matching is an AND operation (intersection of records). For instance, assume a database of 1000 records and two fields being queried, F1 and F2, with the F1 field index being faster and thus always queried ahead of the F2 field index. When comparing strings S1 and S2 to F1 and F2, S1 is queried in F1, and the subset of records returned is U1, which is 1000 records or less, and most often much less than 1000. Then S2 only needs to be compared to the subset of records in U1 on the F2 field index. Therefore, if U1 reduces 1000 to 20, the total number of edit-distance computations is 1020: 1000 on the faster field index F1 and 20 on the slower field index F2. Statistical evidence shows that with three fields and with the large variation in field index performance, the reduction in processing can often be between one and two orders of magnitude.
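One way to realize the ordering and progressive reduction described above is sketched below. The search_fn callback stands in for an inexact search over a single field index that returns a set of matching record identifiers; it, the restrict_to parameter, and the timing heuristic are illustrative assumptions, not the actual implementation.

```python
import random
import time


def rank_field_indexes(field_indexes, sample_queries, search_fn):
    """Time a small random sample of queries against each field index and return
    the field names ordered fastest first; all later queries use this order."""
    timings = {}
    for field, index in field_indexes.items():
        start = time.perf_counter()
        for q in random.sample(sample_queries, min(5, len(sample_queries))):
            search_fn(index, q, restrict_to=None)
        timings[field] = time.perf_counter() - start
    return sorted(field_indexes, key=timings.get)


def progressive_match(field_indexes, query_by_field, field_order, search_fn):
    """Conjunctive (AND) matching with progressive reduction: the fastest field
    is searched over all records, each slower field only over the survivors."""
    surviving = None                       # None means "all records"
    for field in field_order:
        ids = search_fn(field_indexes[field], query_by_field[field],
                        restrict_to=surviving)
        surviving = ids if surviving is None else surviving & ids
        if not surviving:
            break                          # empty intersection: nothing can match
    return surviving or set()
```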

Query Optimization

Query optimization is deployed to enable real-time matching of a record to a remote database. For instance, with CleanEnter, while entering a new record into the CRM system, the user first searches to identify whether the new record already exists. The user is comparing the one new record to the entire CRM database, searching for anything that looks similar.

The user's query is analyzed based on the context of the fields, like company name or state name, and appropriate domain knowledge is referenced for expanding the query. For instance, the state Massachusetts may have MA and Mass as synonyms, and the query is expanded appropriately. Besides expansion through domain knowledge, query expansion occurs using phonetic rules as well as heuristics around transposition and removal of characters. The query optimizer effectively constructs a query that has a high probability of finding potential inexact matches while retrieving only a very small subset of the remote database. The subset of records is then analyzed using SSR techniques to compute actual inexact matches.
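A simplified sketch of this kind of query expansion is shown below, combining a domain synonym table with transposition and single-character-deletion heuristics. The synonym table and the specific heuristics are illustrative assumptions; phonetic expansion is omitted.

```python
def expand_query(term, synonyms):
    """Build a set of query variants from domain synonyms plus simple
    transposition and single-character-deletion heuristics (illustrative only)."""
    t = term.strip().lower()
    variants = {t}
    variants |= synonyms.get(t, set())
    for i in range(len(t) - 1):            # adjacent transpositions
        variants.add(t[:i] + t[i + 1] + t[i] + t[i + 2:])
    for i in range(len(t)):                # single-character deletions
        variants.add(t[:i] + t[i + 1:])
    return variants


STATE_SYNONYMS = {"massachusetts": {"ma", "mass"}}
variants = expand_query("Massachusetts", STATE_SYNONYMS)
print("ma" in variants, "mass" in variants)  # True True
# The expanded set is sent to the remote CRM as one broad query; the small
# result set returned is then rescored locally with the SSR-based matchers.
```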


Results and Impact

Over the past several years, experience shows that the number of duplicate records in corporate CRM systems ranges from 10 percent on the low end to a staggering 70 percent on the high end. In addition, it is common for at least 50 percent of CRM records to have faulty data, for instance, misspelled customer names or address data.

More specifically, figure 4 shows average results for the percentage of errors found in eight operational (in production) data sets. The x-axis represents the percentage of records that are updated during various data-cleansing operations. The operations are de-duplication and normalization of the specified field names. For instance, on average 8 percent of all data in the city fields was normalized in some fashion. The six fields in figure 4 are often candidates for data cleansing in a CRM system. There are, however, many more fields that get cleansed.

The Title/Dept field in figure 4 corresponds to fields in the CRM system that represent the title of the person and the associated department. In many instances standardization of data in these fields is important for customer messaging. There is a direct correlation between the quality of the normalization and the ability to customize the messaging for a specific person, and that is incredibly important and valuable from a marketing perspective. Context-aware messaging always produces better results. How a company communicates value to a chief technology officer, for instance, is much different from how communications are provided to a financial analyst. Such targeted messaging is possible only with high-quality data. The title or department of the person must be well defined (normalized) to perform precise messaging.

Using a combination of approaches, cleansing data upon initial migration and then ongoing cleansing in real time and in batch processes, duplicates can be drastically reduced. Typical deployments reach below 1 percent duplicates and have a fix rate of up to 90 percent of faulty data.

Quantification of the business impact includes customers who have saved their sales representatives up to four hours per week. These savings are due to less time spent looking for the correct data and determining where the data should reside, because there are fewer duplicates, as well as improvement in communication with customers due to higher-quality account and contact data. For large companies, the savings of four hours per week per sales representative results in very large financial savings and, more important, more efficient use of the sales force, which directly results in more revenue. The combination of savings and extra revenue can quickly reach into the hundreds of thousands and sometimes millions of dollars.

Another significant customer benefit is enhanced reporting and analytics. For large public companies, accurate reporting is very important due to the increase in securities compliance requirements such as those of the Sarbanes-Oxley Act. Failing to properly report on metrics such as customer counts and revenues has potential financial penalties that can be rather severe.

Benefits to Data Mining

Data-mining uses have clearly expanded in reach over the past decade, broadening out from corporate business-analytics systems to become a common component of data-intensive, web-based consumer and business systems (Segaran 2007). When performing data mining there is often a requirement to perform analysis on numeric data with a dependency on associated textual data. For instance, when deriving analytics or performing pattern analysis on sales information in the 50 USA states, a significant issue that arises is simply normalizing the state names so that a small set of queries can be generated to obtain and prepare the data for analysis. If the data contains many different and unknown means of representing the state names, then the queries themselves become very complex. Worse than that, the results of the queries, and hence the results of the analysis, become suspect.

Robust data-cleansing operations as described in this article can have a significant benefit in data-mining applications. Not only can the analysis be more automated, and hence operationally efficient, but the results themselves can be more readily accepted as valid.

Figure 4. Statistics showing the average percentage of records and fields that are updated during typical data-cleansing operations. Operations include identifying duplicates (the first entry) and field standardization (the bottom six entries: Account Name, City, First/Last Name, Postal Code, Address 1, and Title/Department); the x-axis runs from 0 to 80 percent.

Conclusions

The described products connect to three of the leading CRM vendor solutions and are in use by more than 150 customers, in more than 40 countries, representing more than 30 different languages. Companies as large as Fidelity and Oracle and as small as two-person consulting firms use the products. Important lessons have been learned over the past few years regarding how to build high-performance data-quality solutions on distributed, SaaS systems.

Integration of domain knowledge can drastically improve the quality of results (Bidlack 2009). A key challenge with this integration is that the search space is expanded, because now more matches need to be considered. The domain knowledge expands single terms out to multiple terms. For instance, Massachusetts may be expanded to include MA and Mass. Now there are three strings to consider when searching instead of just one. This search expansion adds one more reason for focusing on high-performance inexact matching algorithms.

The AI techniques discussed in this article, various search-space reduction techniques and query optimizations, drastically improve the performance of inexact matching of textual data on large volumes. With corporate databases for even midsize companies now growing into the millions of records, and the desire for better results as enabled by domain-knowledge integration, high-performance matching is becoming critical to user adoption of any data-quality solution. SaaS-based systems and other industry dynamics are also changing user expectations. Users are expecting great results with less effort on their part. It is now clear that AI-based systems will continue to play an important role in matching users' expectations on the data-quality front. This article outlined a few ways that intelligent algorithms have been leveraged, and it is expected that future data-quality solutions will continue down this path.

Acknowledgments

Creating a successful software company requires an enormous effort from numerous professionals with a diversity of skills and perspectives. We are grateful not only for the effort of everyone within the company but also for the work of advisors, investors, and business partners. A special appreciation is reserved for our customers, who help to keep us focused on solving relevant problems and providing solutions that make a difference.

References

Bidlack, C. 2009. Enabling Data Quality with Lightweight Ontologies. In Proceedings of the Twenty-First Innovative Applications of Artificial Intelligence Conference, 2–8. Menlo Park, CA: AAAI Press.

Chattaraj, A., and Parida, L. 2005. An Inexact-Suffix-Tree-Based Algorithm for Detecting Extensible Patterns. Theoretical Computer Science 335(1): 3–14.

Dunn, H. L. 1946. Record Linkage. American Journal of Public Health 36(12): 1412–1416.

Ernst, T., and Dunham, G. 2006. Software-as-a-Service—Opening Eyes in '07; Half the Market in '13 (FITT Research). Frankfurt, Germany: Deutsche Bank AG.

Experian QAS. 2006. U.S. Business Losing Revenue through Poorly Managed Customer Data (Experian QAS Briefing 3/28). Boston, MA: Experian QAS.

Fellegi, I., and Sunter, A. 1969. A Theory for Record Linkage. Journal of the American Statistical Association 64(328): 1183–1210.

Friar, S.; Zorovic, S.; Grieb, F.; and Isenstein, M. 2007. Getting SaaS Savvy—Successful Investing in On Demand. New York: Goldman, Sachs & Company.

Larman, C., and Basili, V. 2003. Iterative and Incremental Development: A Brief History. IEEE Computer 36(6): 47–56.

Lee, Y. W.; Pipino, L. L.; Funk, J. D.; and Wang, R. Y. 2006. Journey to Data Quality. Cambridge, MA: MIT Press.

Levenshtein, V. I. 1965. Binary Codes Capable of Correcting Spurious Insertions and Deletions of Ones. Problems of Information Transmission 1(1): 8–17.

Madnick, S. 2009. Overview and Framework for Data and Information Quality Research. ACM Journal of Data and Information Quality 1(1).

Moustakides, G. V., and Verykios, V. S. 2009. Optimal Stopping: A Record-Linkage Approach. ACM Journal of Data and Information Quality 1(2).

Navarro, G. 2001. A Guided Tour to Approximate String Matching. ACM Computing Surveys 33(1): 31–88.

Segaran, T. 2007. Programming Collective Intelligence. Sebastopol, CA: O'Reilly Media, Inc.

Wang, R. Y.; Storey, V. C.; and Firth, C. P. 1995. A Framework for Analysis of Data Quality Research. IEEE Transactions on Knowledge and Data Engineering 7(4): 623–640.

Winkler, W. 2006. Data Quality: Automated Edit/Imputation and Record Linkage (U.S. Census Bureau, Research Report Series, Statistics 2006–7). Washington, DC: U.S. Government Printing Office.

Clint Bidlack is president, chief technical officer, and cofounder of ActivePrime. Prior to ActivePrime, he was the founding chief executive officer of Viveca, a venture-capital-backed company that built innovative electronic catalog software. Viveca was sold in 2001. Bidlack has been building data-intensive software systems for almost 20 years, first in the automotive industry, then in research roles at the University of Tennessee in conjunction with Oak Ridge National Laboratory, and then at the University of Michigan. In 1992 Bidlack was part of the team that won the first AAAI robot competition.

Michael P. Wellman is a professor of computer science and engineering at the University of Michigan and a technical adviser to ActivePrime. He received a Ph.D. from the Massachusetts Institute of Technology in 1988 for his work in qualitative probabilistic reasoning and decision-theoretic planning. From 1988 to 1992, Wellman conducted research in these areas at the U.S. Air Force's Wright Laboratory. At the University of Michigan, his research has focused on computational market mechanisms for distributed decision making and electronic commerce. As chief market technologist for TradingDynamics, Inc. (now part of Ariba), he designed configurable auction technology for dynamic business-to-business commerce. Wellman previously served as chair of the ACM Special Interest Group on Electronic Commerce (SIGecom) and as executive editor of the Journal of Artificial Intelligence Research. He is a Fellow of the Association for the Advancement of Artificial Intelligence and the Association for Computing Machinery.
