CS639: Data Management for Data Science
Lecture 23: Data Cleaning
[based on slides by John Canny]
Theodoros Rekatsinas
Dirty Data
• The Statistics View:
  • There is a process that produces data
  • We want to model ideal samples of that process, but in practice we have non-ideal samples:
    • Distortion: some samples are corrupted by a process
    • Selection bias: likelihood of a sample depends on its value
    • Left and right censorship: users come and go from our scrutiny
    • Dependence: samples are supposed to be independent, but are not (e.g. social networks)
  • You can add new models for each type of imperfection, but you can't model everything.
  • What's the best trade-off between accuracy and simplicity?
Dirty Data
• The Database View:
  • I got my hands on this data set
  • Some of the values are missing, corrupted, wrong, duplicated
  • Results are absolute (relational model)
  • You get a better answer by improving the quality of the values in your dataset
Dirty Data
• The Domain Expert's View:
  • This data doesn't look right
  • This answer doesn't look right
  • What happened?
• Domain experts have an implicit model of the data that they can test against…
Dirty Data
• The Data Scientist's View:
  • Some combination of all of the above
Data Quality Problems
• (Source) Data is dirty on its own.
• Transformations corrupt the data (complexity of software pipelines).
• Data sets are clean but integration (i.e., combining them) screws them up.
• "Rare" errors can become frequent after transformation or integration.
• Data sets are clean but suffer "bit rot"
  • Old data loses its value/accuracy over time
• Any combination of the above
Big Picture: Where can Dirty Data Arise?
[Figure: pipeline stages Extract, Transform, Load, Integrate, Clean]
Numeric Outliers
Adapted from Joe Hellerstein's 2012 CS194 Guest Lecture
Data Cleaning Makes Everything Okay?
The appearance of a hole in the earth's ozone layer over Antarctica, first detected in 1976, was so unexpected that scientists didn't pay attention to what their instruments were telling them; they thought their instruments were malfunctioning.
National Center for Atmospheric Research
In fact, the data were rejected as unreasonable by data quality control algorithms
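One way to read the ozone story is that suspicious readings should be flagged for review rather than silently rejected. The following is a minimal sketch of that idea, assuming a robust z-score built from the median and the median absolute deviation (MAD). The function name `flag_outliers`, the 3.5 threshold, and the sample readings are illustrative assumptions, not taken from the lecture.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return (value, is_suspect) pairs using a robust z-score.

    Median and MAD are used instead of mean and standard deviation so that
    a few extreme readings do not hide themselves. Values are only flagged,
    never dropped.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical: flag anything that differs.
        return [(v, v != med) for v in values]
    # 0.6745 rescales MAD so the score is comparable to a z-score
    # when the data are roughly normal.
    return [(v, abs(0.6745 * (v - med) / mad) > threshold) for v in values]

# Hypothetical instrument readings; the low value may be an error or a real signal.
readings = [301, 298, 305, 299, 303, 180, 300]
flagged = flag_outliers(readings)
suspects = [v for v, is_suspect in flagged if is_suspect]
```

Here `suspects` contains only the reading 180, and the original list is left intact so a domain expert can decide whether the value is a glitch or a discovery.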
Dirty Data Problems
• From Stanford Data Integration Course:
  1) Parsing text into fields (separator issues)
  2) Naming conventions: ER: NYC vs New York
  3) Missing required field (e.g. key field)
  4) Different representations (2 vs Two)
  5) Fields too long (get truncated)
  6) Primary key violation (from unstructured to structured, or during integration)
  7) Redundant records (exact match or other)
  8) Formatting issues, especially dates
  9) Licensing issues / privacy / keep you from using the data as you would like?
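Items 2, 4, and 8 in the list above are often handled by mapping variants to one canonical form. A minimal sketch follows, assuming small hand-written lookup tables and a fixed list of accepted date formats. All names and table contents here are illustrative assumptions.

```python
from datetime import datetime

# Illustrative lookup tables; a real pipeline would need far larger ones.
CITY_ALIASES = {"nyc": "New York", "new york city": "New York", "new york": "New York"}
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3}
# Order matters: it encodes the assumption that slashed dates are month/day/year.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]

def normalize_city(raw):
    """Map known spellings of a city to one canonical name."""
    cleaned = raw.strip()
    return CITY_ALIASES.get(cleaned.lower(), cleaned)

def normalize_count(raw):
    """Accept '2' or 'Two' and return an int; None if it cannot be parsed."""
    text = raw.strip().lower()
    if text.isdigit():
        return int(text)
    return NUMBER_WORDS.get(text)

def normalize_date(raw):
    """Try each known format and return an ISO date string; None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` for unparseable input keeps the failure visible instead of inventing a value, which matters for the "nulls converted to default values" problem discussed later in the deck.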
Conventional Definition of Data Quality
• Accuracy
  • The data was recorded correctly.
• Completeness
  • All relevant data was recorded.
• Uniqueness
  • Entities are recorded once.
• Timeliness
  • The data is kept up to date.
  • Special problems in federated data: time consistency.
• Consistency
  • The data agrees with itself.
Adapted from Ted Johnson's SIGMOD 2003 Tutorial
Problems…
• Unmeasurable
  • Accuracy and completeness are extremely difficult, perhaps impossible to measure.
• Context independent
  • No accounting for what is important. E.g., if you are computing aggregates, you can tolerate a lot of inaccuracy.
• Incomplete
  • What about interpretability, accessibility, metadata, analysis, etc.
• Vague
  • The conventional definitions provide no guidance towards practical improvements of the data.
Finding a modern definition
• We need a definition of data quality which
  • Reflects the use of the data
  • Leads to improvements in processes
  • Is measurable (we can define metrics)
• First, we need a better understanding of how and where data quality problems occur
  • The data quality continuum
Meaning of Data Quality (1)
• There are many types of data, which have different uses and typical quality problems
  • Federated data
  • High dimensional data
  • Descriptive data
  • Longitudinal data
  • Streaming data
  • Web (scraped) data
  • Numeric vs. categorical vs. text data
Meaning of Data Quality (2)
• There are many uses of data
  • Operations
  • Aggregate analysis
  • Customer relations…
• Data Interpretation: the data is useless if we don't know all of the rules behind the data.
• Data Suitability: Can you get the answer from the available data?
  • Use of proxy data
  • Relevant data is missing
The Data Quality Continuum
• Data and information are not static; they flow through a data collection and usage process
  • Data gathering
  • Data delivery
  • Data storage
  • Data integration
  • Data retrieval
  • Data mining/analysis
Data Gathering
• How does the data enter the system?
• Sources of problems:
  • Manual entry
  • No uniform standards for content and formats
  • Parallel data entry (duplicates)
  • Approximations, surrogates: SW/HW constraints
  • Measurement or sensor errors.
Data Gathering - Solutions
• Potential Solutions:
  • Preemptive:
    • Process architecture (build in integrity checks)
    • Process management (reward accurate data entry, data sharing, data stewards)
  • Retrospective:
    • Cleaning focus (duplicate removal, merge/purge, name & address matching, field value standardization)
    • Diagnostic focus (automated detection of glitches).
Data Delivery
• Destroying or mutilating information by inappropriate pre-processing
  • Inappropriate aggregation
  • Nulls converted to default values
• Loss of data:
  • Buffer overflows
  • Transmission problems
  • No checks
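To make "nulls converted to default values" concrete, the sketch below compares an aggregate computed after an upstream step replaced missing readings with 0 against the same aggregate computed with the nulls kept explicit. The readings and the default of 0 are hypothetical, chosen only for illustration.

```python
from statistics import mean

# Hypothetical sensor temperatures; None marks a missing reading.
raw = [21.5, None, 22.0, None, 21.0]

# What an upstream step might do: replace nulls with a default of 0.
defaulted = [0.0 if v is None else v for v in raw]

# Keeping nulls explicit lets the analysis decide how to treat them.
observed = [v for v in raw if v is not None]

mean_defaulted = mean(defaulted)   # pulled toward 0 by the fake values
mean_observed = mean(observed)     # computed from real readings only
missing_rate = raw.count(None) / len(raw)
```

With these numbers the defaulted mean is 12.9 while the mean of the observed readings is 21.5, and once the defaults are written there is no way to recover the 40% missing rate from the delivered data alone.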
Data Delivery - Solutions
• Build reliable transmission protocols
  • Use a relay server
• Verification
  • Checksums, verification parser
  • Do the uploaded files fit an expected pattern?
• Relationships
  • Are there dependencies between data streams and processing steps?
• Interface agreements
  • Data quality commitment from the data stream supplier.
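The verification bullet above (checksums plus a verification parser) can be sketched as follows. This assumes the supplier publishes a SHA-256 digest alongside each file and that every line should look like `id,ISO date,amount`. The line pattern, the function names, and the sample payloads are hypothetical.

```python
import hashlib
import re

def sha256_hex(payload: bytes) -> str:
    """Hex digest used as the delivery checksum."""
    return hashlib.sha256(payload).hexdigest()

# Hypothetical expected line pattern: id,ISO date,amount
LINE_PATTERN = re.compile(r"^\d+,\d{4}-\d{2}-\d{2},\d+(\.\d+)?$")

def verify_delivery(payload: bytes, expected_digest: str):
    """Return a list of problems; an empty list means the delivery passed."""
    problems = []
    if sha256_hex(payload) != expected_digest:
        problems.append("checksum mismatch")
    for lineno, line in enumerate(payload.decode("utf-8").splitlines(), start=1):
        if not LINE_PATTERN.match(line):
            problems.append(f"line {lineno} does not fit expected pattern")
    return problems

sent = b"1,2022-05-06,19.99\n2,2022-05-07,5\n"
digest = sha256_hex(sent)   # published by the supplier alongside the file
received_bad = b"1,2022-05-06,19.99\n2,05/07/2022,5\n"
```

The checksum catches any change in transit, while the pattern check also catches files that arrive intact but were malformed at the source.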
Data Storage
• You get a data set. What do you do with it?
• Problems in physical storage
  • Can be an issue, but terabytes are cheap.
• Problems in logical storage
  • Poor metadata.
    • Data feeds are often derived from application programs or legacy data sources. What does it mean?
  • Inappropriate data models.
    • Missing timestamps, incorrect normalization, etc.
  • Ad-hoc modifications.
    • Structure the data to fit the GUI.
  • Hardware / software constraints.
    • Data transmission via Excel spreadsheets, Y2K
Data Storage - Solutions
• Metadata
  • Document and publish data specifications.
• Planning
  • Assume that everything bad will happen.
  • Can be very difficult.
• Data exploration
  • Use data browsing and data mining tools to examine the data.
    • Does it meet the specifications you assumed?
    • Has something changed?
Data Retrieval
• Exported data sets are often a view of the actual data. Problems occur because:
  • Source data not properly understood.
  • Need for derived data not understood.
  • Just plain mistakes.
    • Inner join vs. outer join
    • Understanding NULL values
• Computational constraints
  • E.g., too expensive to give a full history, we'll supply a snapshot.
• Incompatibility
  • EBCDIC? Unicode?
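The "inner join vs. outer join" and "understanding NULL values" mistakes listed above are easy to reproduce. The sketch below uses Python's built-in `sqlite3` module with two hypothetical tables. The table names, columns, and rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, NULL);
""")

# Inner join: customer 3 (no orders) silently disappears from the export.
inner = conn.execute(
    "SELECT DISTINCT c.name FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY c.name"
).fetchall()

# Left outer join: every customer is kept, with NULLs where no order exists.
outer = conn.execute(
    "SELECT DISTINCT c.name FROM customers c "
    "LEFT JOIN orders o ON o.customer_id = c.id ORDER BY c.name"
).fetchall()

# NULL handling: COUNT(*) counts rows, COUNT(amount) and AVG(amount) skip NULLs.
row_count, amount_count, avg_amount = conn.execute(
    "SELECT COUNT(*), COUNT(amount), AVG(amount) FROM orders"
).fetchone()
conn.close()
```

An exported view built on the inner join reports two customers instead of three, and the average order amount is computed over two rows even though three orders exist.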
Data Mining and Analysis
• What are you doing with all this data anyway?
• Problems in the analysis.
  • Scale and performance
  • Confidence bounds?
  • Black boxes and dart boards
  • Attachment to models
  • Insufficient domain expertise
  • Casual empiricism
Retrieval and Mining - Solutions
• Data exploration
  • Determine which models and techniques are appropriate, find data bugs, develop domain expertise.
• Continuous analysis
  • Are the results stable? How do they change?
• Accountability
  • Make the analysis part of the feedback loop.
Data Quality Constraints
• Many data quality problems can be captured by static constraints based on the schema.
  • Nulls not allowed, field domains, foreign key constraints, etc.
• Many others are due to problems in workflow, and can be captured by dynamic constraints
  • E.g., orders above $200 are processed by Biller 2
• The constraints follow an 80-20 rule
  • A few constraints capture most cases, thousands of constraints to capture the last few cases.
• Constraints are measurable. Data Quality Metrics?
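Because constraints are measurable, they translate directly into checks and a simple metric. The sketch below is one possible encoding, assuming hypothetical order records. The field names and the conformance metric are illustrative, and only the "orders above $200 are processed by Biller 2" rule comes from the slide.

```python
# Hypothetical order records; field names are illustrative.
orders = [
    {"id": 1, "amount": 250.0, "biller": "Biller 2", "customer": "c1"},
    {"id": 2, "amount": 120.0, "biller": "Biller 1", "customer": "c2"},
    {"id": 3, "amount": 300.0, "biller": "Biller 1", "customer": "c3"},  # workflow violation
    {"id": 4, "amount": None, "biller": "Biller 1", "customer": None},   # null violations
]

def static_violations(row):
    """Schema-style checks: required fields present, amount within its domain."""
    problems = []
    if row["customer"] is None:
        problems.append("customer is null")
    if row["amount"] is None:
        problems.append("amount is null")
    elif row["amount"] < 0:
        problems.append("amount out of domain")
    return problems

def dynamic_violations(row):
    """Workflow rule from the slide: orders above $200 are processed by Biller 2."""
    if row["amount"] is not None and row["amount"] > 200 and row["biller"] != "Biller 2":
        return ["order above $200 not processed by Biller 2"]
    return []

report = {r["id"]: static_violations(r) + dynamic_violations(r) for r in orders}
# One possible metric: fraction of rows that satisfy every constraint.
conformance = sum(1 for problems in report.values() if not problems) / len(orders)
```

Tracking `conformance` over time gives a number that moves in the right direction when the process improves, which is the property asked of data quality metrics.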
Data Quality Metrics
• We want a measurable quantity
  • Indicates what is wrong and how to improve
  • Realize that DQ is a messy problem, no set of numbers will be perfect
• Types of metrics
  • Static vs. dynamic constraints
  • Operational vs. diagnostic
• Metrics should be directionally correct with an improvement in use of the data.
• A very large number of metrics are possible
  • Choose the most important ones.
Examples of Data Quality Metrics
• Conformance to schema
  • Evaluate constraints on a snapshot.
• Conformance to business rules
  • Evaluate constraints on changes in the database.
• Accuracy
  • Perform inventory (expensive), or use proxy (track complaints). Audit samples?
• Accessibility
• Interpretability
• Glitches in analysis
• Successful completion of end-to-end process
Technical Approaches
• We need a multi-disciplinary approach to attack data quality problems
  • No one approach solves all problems
• Process management
  • Ensure proper procedures
• Statistics
  • Focus on analysis: find and repair anomalies in data.
• Database
  • Focus on relationships: ensure consistency.
• Metadata / domain expertise
  • What does it mean? Interpretation
Data cleaning for structured data
Detect and repair errors in a structured dataset
Input: University of Chicago, Cicago, IL
1. Detect: University of Chicago, Cicago, IL
2. Repair: University of Chicago, Chicago, IL
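The detect-then-repair steps for the "Cicago" cell can be sketched with the standard library. This is not the method used in the lecture. It is a simple stand-in that flags rare values and suggests the closest frequent value by string similarity. The city column, `min_support`, and `cutoff` are assumptions.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical city column; most values are correct, one is misspelled.
cities = ["Chicago", "Chicago", "Cicago", "Chicago", "Evanston", "Evanston"]
counts = Counter(cities)

def detect_and_repair(value, counts, min_support=2, cutoff=0.8):
    """Flag rare values and suggest the closest frequent value, if any.

    Returns (is_suspect, suggestion). Rare but correct values are flagged
    too, so suggestions should be reviewed rather than applied blindly.
    """
    if counts[value] >= min_support:
        return False, value
    frequent = [v for v, n in counts.items() if n >= min_support]
    match = get_close_matches(value, frequent, n=1, cutoff=cutoff)
    return True, (match[0] if match else None)

suspect, suggestion = detect_and_repair("Cicago", counts)
```

For this column, "Cicago" is flagged and "Chicago" is suggested, while values that occur at least `min_support` times pass through unchanged.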
A simple example
Chicago's food inspection dataset
Detect and repair errors in a structured dataset
Constraints and minimality
Functional dependencies
Bohannon et al., 2005, 2007; Kolahi and Lakshmanan, 2005; Bertossi et al., 2011; Chu et al., 2013, 2015; Fagin et al., 2015
Action: Assume fewer erroneous than correct cells; perform the minimum number of changes to satisfy all constraints
Error in the example: the correct zip code is 60608
Result: does not fix errors and introduces new ones.
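The minimality principle and its failure mode can be illustrated with a small sketch. It assumes a single functional dependency address → zip and hypothetical rows in which the majority zip for an address is wrong. Only the value 60608 as the correct zip comes from the slide. The rows, the second address, and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical rows; assume the true zip for the first address is 60608,
# so the two 60609 cells are the erroneous ones.
rows = [
    {"address": "3465 S Morgan ST", "zip": "60609"},
    {"address": "3465 S Morgan ST", "zip": "60609"},
    {"address": "3465 S Morgan ST", "zip": "60608"},
    {"address": "1208 N Wells ST", "zip": "60610"},
]

def fd_violations(rows, lhs, rhs):
    """Group rows by lhs; return groups whose rhs values disagree (lhs -> rhs broken)."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[lhs]].append(i)
    return {k: idx for k, idx in groups.items() if len({rows[i][rhs] for i in idx}) > 1}

def minimal_repair(rows, lhs, rhs):
    """Change the fewest cells: overwrite each violating group with its majority rhs value."""
    repaired = [dict(r) for r in rows]
    changed = []
    for idx in fd_violations(rows, lhs, rhs).values():
        majority, _ = Counter(rows[i][rhs] for i in idx).most_common(1)[0]
        for i in idx:
            if repaired[i][rhs] != majority:
                repaired[i][rhs] = majority
                changed.append(i)
    return repaired, changed

violations = fd_violations(rows, "address", "zip")
repaired, changed = minimal_repair(rows, "address", "zip")
```

The repaired table satisfies the dependency after a single cell change, but that change overwrites the only correct 60608 with 60609: the existing errors remain and a new one is introduced.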
External information
External list of addresses
Matching dependencies
Fan et al., 2009; Bertossi et al., 2010; Chu et al., 2015
Action: Map external information to the input dataset using matching dependencies and repair disagreements
Limitation: external dictionaries may have limited coverage or not exist altogether
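A heavily simplified sketch of repairing against an external list follows. It assumes a matching rule of "same address after normalizing case and whitespace implies same zip", which is far cruder than the matching dependencies in the cited work. The reference list and rows are hypothetical.

```python
# Hypothetical external reference list (address -> zip); coverage is partial by design.
REFERENCE = {"3465 S MORGAN ST": "60608"}

rows = [
    {"address": "3465 S Morgan St", "zip": "60609"},
    {"address": "1208 N Wells ST", "zip": "60610"},  # not in the reference list
]

def normalize(address):
    """Crude matching key: uppercase and collapse whitespace."""
    return " ".join(address.upper().split())

def repair_with_reference(rows, reference):
    """Overwrite zip where the reference covers the address; report uncovered rows."""
    repaired, uncovered = [], []
    for i, row in enumerate(rows):
        key = normalize(row["address"])
        if key in reference:
            repaired.append({**row, "zip": reference[key]})
        else:
            repaired.append(dict(row))
            uncovered.append(i)
    return repaired, uncovered

repaired, uncovered = repair_with_reference(rows, REFERENCE)
```

The first row is repaired to 60608, while the second row is left untouched and reported in `uncovered`, which shows the coverage limitation: rows the dictionary does not cover can be neither confirmed nor repaired.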
Quantitative statistics
Reason about co-occurrence of values across cells in a tuple
Estimate the distribution governing each attribute
Hellerstein, 2008; Mayfield et al., 2010; Yakout et al., 2013
Example: Chicago co-occurs with IL
Again, fails to repair the wrong zip code
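The co-occurrence idea can be sketched with simple counts. This assumes an empirical estimate of P(state | city) over hypothetical (city, state) pairs and a 0.5 threshold for flagging. The rows, the threshold, and the function names are illustrative and not part of the cited methods.

```python
from collections import Counter

# Hypothetical (city, state) cells taken from the same tuples.
rows = [
    ("Chicago", "IL"), ("Chicago", "IL"), ("Chicago", "IL"), ("Chicago", "IL"),
    ("Chicago", "IN"),   # suspicious: rarely co-occurs
    ("Evanston", "IL"),
]

pair_counts = Counter(rows)
city_counts = Counter(city for city, _ in rows)

def p_state_given_city(state, city):
    """Empirical estimate of P(state | city) from co-occurrence counts."""
    return pair_counts[(city, state)] / city_counts[city]

def suggest_state(city):
    """Most frequent state observed together with this city."""
    candidates = {s: n for (c, s), n in pair_counts.items() if c == city}
    return max(candidates, key=candidates.get)

# Flag pairs whose conditional probability falls below an assumed threshold.
suspects = [(c, s) for (c, s) in pair_counts if p_state_given_city(s, c) < 0.5]
```

The same counts explain the failure noted above: when a wrong value co-occurs with its neighbours more often than the right one, as with a wrong zip that dominates its address, the statistics favour the wrong value.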
Let's combine everything
Quantitative statistics
Constraints and minimality
External data
Different solutions suggest different repairs
A probabilistic model for data repairs
Learning the model