CLEENEX: Iterative Data Cleaning with User Intervention

Cátia Borges Ormonde
catia.ormonde@tecnico.ulisboa.pt

Instituto Superior Técnico, University of Lisbon
Lisbon, Portugal

October 2017

Abstract

Data cleaning is the process of correcting data quality problems in datasets to foster their fitness for use. Most data cleaning solutions tend to automate their tasks. However, data cleaning processes may require human knowledge to achieve optimal results. Hence, to produce results of excellence, data cleaning processes should be iterative and support user involvement. CLEENEX is a data cleaning framework that addresses the need for cleaning data iteratively. It enables the specification of data cleaning processes as graphs of data transformations and supports user intervention by means of Data Cleaning Graphs (DCGs) equipped with Quality Constraints (QCs) and Manual Data Repairs (MDRs). This thesis aims to enable the iterative execution of DCGs, improving user intervention, adding support for MDRs and reducing the overall user effort required. The exhaustive experiments conducted validate the implementations made over this framework, showing that the user effort is successfully reduced and evidencing its effectiveness.
Keywords: Data Cleaning; Data Quality; User Intervention; Quality Constraint; Manual Data Repair; Iterative Execution

1. Introduction
Data is all around us, and with the emergence of innovative techniques and solutions to extract value from it, it is crucial that datasets are fit for use; in particular, they should conform to domain-specific quality standards [11]. In order to resolve the data quality problems found in databases, data cleaning processes are conducted.

There is already a good arsenal of approaches for data cleaning at our disposal [1]. Data cleaning solutions that completely automate the data cleaning tasks may not always be adequate, namely for large quantities of data, because it is difficult to create, at a first attempt, a data cleaning process that addresses all the data quality problems that exist in a dataset. Data cleaning processes usually require refinement to produce data with the idealized quality [12]. Hence, data cleaning should be seen as an iterative task that can be gradually refined by a user with solid domain knowledge.

CLEENEX is a data cleaning framework, whose fundamentals were introduced in [7]. It is based on the Ajax [5][6] framework, and its goal is to solve data quality problems. CLEENEX allows the specification of data cleaning processes as DCGs, which are graphs of data transformations.

In CLEENEX, data cleaning processes are meant to be refined and guided by users. To complement DCGs, and provide support for user intervention, QCs and MDRs can be used.

Upon the start of this thesis, we identified some limitations of the CLEENEX prototype concerning the execution of DCGs and the incorporation of user interaction. To address such limitations, we proposed the following contributions: (i) adding support for MDRs, so that CLEENEX allows the creation and application of all the supported MDR actions (insert, update and delete) chosen by the user; (ii) implementing the iterative execution of data cleaning processes, according to the defined operational semantics, and including the notions of deterministic and non-deterministic attributes in the definition of data cleaning programs; (iii) incorporating a feature for the persistent storage of MDRs and a complementary option that allows the user to pause and reload the data cleaning process; (iv) including the notion of MDR instance conflicts, as well as a functionality for the detection and resolution of such conflicts; (v) including new Graphical User Interface (GUI) components and updating existing ones; and, to conclude, (vi) performing an exhaustive experimental validation, in order to prove that, with the functionalities implemented, the required user effort is reduced and the data cleaning process' effectiveness is as expected.

This document is organized in six sections, as follows. Section 2 presents the CLEENEX framework. Section 3 summarizes relevant related work. Section 4 explains the contributions made in the context of this thesis. Section 5 describes the experimental validation conducted. Section 6 presents our conclusions, as well as relevant guidance for future work.

2. CLEENEX
The CLEENEX framework allows the specification of data cleaning processes through its specification language, based on Structured Query Language (SQL).

This framework's goal is to allow the user to intervene in a data cleaning process, in order to guarantee the output of data with the best quality possible.

2.1. Logical and Physical Levels
Typically, the development of a data cleaning process encloses two phases: (i) designing the graph of data transformations; and (ii) designing and setting the adequate heuristics. Considering this, the CLEENEX framework is divided into two corresponding levels: (i) logical level, where the graph is described using a declarative language based on SQL; and (ii) physical level, where specific algorithms can be selected to perform the logical operations.

2.2. Logical Operators
The five logical operators supported by CLEENEX, for the data transformations, are:

1. Mapping: takes a single relation as input and outputs one or more relations. It is able to standardize data formats or create records with a more suitable format;

2. Matching: considering two input relations, it locates the pairs of records that seem to refer to the same object, through the use of matching criteria for their comparison (a generic sketch of one such criterion is given after this list);

3. Clustering: groups highly-matching record pairs together, according to a similarity value given by a grouping criterion (e.g., by transitive closure);

4. Merging: returns a unique record for each set of records included in a cluster (i.e., it outputs a representative for each cluster);

5. View: corresponds to an arbitrary SQL query, augmented with some integrity checking over its results.
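For illustration only, the Java sketch below shows one possible matching criterion based on a normalized edit-distance similarity. The class and method names are hypothetical; CLEENEX's actual Matching operator is specified declaratively and may rely on externally defined comparison functions instead.

    // Illustrative matching criterion: normalized edit-distance similarity between two strings.
    public final class MatchingCriterionSketch {

        // Classic dynamic-programming Levenshtein distance.
        static int editDistance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            return d[a.length()][b.length()];
        }

        // Two values are considered a candidate match when their similarity exceeds a threshold.
        static boolean matches(String left, String right, double threshold) {
            int maxLen = Math.max(left.length(), right.length());
            if (maxLen == 0) return true; // both empty: trivially equal
            double similarity = 1.0 - (double) editDistance(left, right) / maxLen;
            return similarity >= threshold;
        }

        public static void main(String[] args) {
            System.out.println(matches("Volkswagen", "Volkswagon", 0.8)); // true
            System.out.println(matches("Toyota", "Honda", 0.8));          // false
        }
    }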

2.3. Quality Constraints and Manual Data Repairs
CLEENEX allows the intermediate data (along the graph) to be inspected and manually repaired. For such purpose, and with the intention of guiding the user, a DCG can be complemented by two constructs: QCs, which are quality integrity constraints that the data should satisfy, and MDRs, which correspond to actions (insert, update and delete) that can be manually applied by the user over specifically defined updatable views.

By complementing a DCG with QCs and MDRs, and using the MDRs as templates for user actions over the blamed records, we constrain the choices a user has (i.e., when inspecting the uncleaned data), therefore making it easier to provide feedback. An example of an excerpt of a DCG is presented in Figure 1.

Figure 1: SurveyRes' DCG.

This DCG's objective is to clean a fictional dataset (i.e., the relation SurveyRes, presented in Figure 2). SurveyRes includes basic information about people who answered a car survey (i.e., their name, birth year, country, and email address), and their response to the survey (i.e., their preferred car brand, pBrand). The DCG uses two additional tables, Brand and Manufacturer, that refer to car brands and manufacturers, respectively.

Figure 2: Subset of the SurveyRes relation, which contains car survey results.

In this DCG example, the nodes are represented as T1, R1, etc. (transformations and relations), and the edges are the process' input and output (SurveyRes, Brand, Manufacturer, CleanSurveyRes). Some nodes are associated to a QC and an MDR. For example, the node R1 is associated to QC1, which is used to enforce that the data records' name, bYear and email attributes should be unique; and to MDR1, which states that a possible manual repair to perform over violations of QC1 is to delete them. This DCG's goal is to: (i) find whether there are any duplicate records regarding the unique set of attributes name, bYear, email (QC1) and, if so, enable the user to eliminate them (creating an instance of MDR1); (ii) get each person's preferred car brand manufacturer ID (transformation T2 and relation R2); (iii) make sure that each survey response (i.e., per email address) is associated to only one manufacturer name (T3 is used to get the manufacturer's name); etc.; and then (iv) save the cleaned data, with an updated schema, to its output (the CleanSurveyRes table).
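For illustration, the check expressed by a QC such as QC1 can be thought of as a duplicate-detection query over SurveyRes, and an instance of MDR1 as a delete over the blamed records. The hedged Java sketch below, using plain JDBC and the column names of Figure 2, is only an approximation; the SQL actually generated by CLEENEX may differ.

    import java.sql.*;

    // Hedged sketch: the kind of check a uniqueness QC such as QC1 expresses, and the kind
    // of statement an instance of a delete MDR such as MDR1 produces.
    public class Qc1Sketch {

        // Blamed tuples: (name, bYear, email) combinations that occur more than once.
        static void printBlamed(Connection conn) throws SQLException {
            String sql = "SELECT name, bYear, email FROM SurveyRes " +
                         "GROUP BY name, bYear, email HAVING COUNT(*) > 1";
            try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.printf("Blamed: %s, %d, %s%n",
                            rs.getString("name"), rs.getInt("bYear"), rs.getString("email"));
                }
            }
        }

        // An MDR1 instance: delete the blamed record(s) selected by the user.
        static int applyDeleteMdr(Connection conn, String name, int bYear, String email)
                throws SQLException {
            String sql = "DELETE FROM SurveyRes WHERE name = ? AND bYear = ? AND email = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, name);
                ps.setInt(2, bYear);
                ps.setString(3, email);
                return ps.executeUpdate();
            }
        }
    }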

2.4. Architecture
Figure 3 depicts the CLEENEX framework's main components. The components are described as follows:

• GUI: responsible for displaying information to the user (e.g., by graphically representing the data cleaning process);

• Parser: responsible for parsing the data cleaning processes specified by the programmer (e.g., performing their syntactical validation and interpretation);

• Optimizer: responsible for selecting the optimal execution plans for the processes' execution;

• Scheduler: responsible for scheduling the execution of the tasks that constitute the execution plan (chosen by the Optimizer);

• Catalog Manager: responsible for generating the internal representation of the data cleaning processes;

• Debugger: responsible for triggering an inspection trail mechanism;

• QC Manager: responsible for parsing the QCs defined by the user, and managing their usage (e.g., computing the respective blamed tuples);

• MDR Manager: responsible for parsing the MDRs defined by the user, and managing their usage (e.g., enabling the MDRs' usage, for a certain relation, when the respective conditions apply);

• Database Manager: responsible for communicating with the underlying Relational Database Management System (RDBMS). It conducts, for example, the creation of relations and the execution of SQL statements (as requested by the other components).

Figure 3: Overview of CLEENEX’ UML component diagram.

The CLEENEX framework's execution phases are explained as follows:

1. Library Specification: allows the addition of new, externally defined algorithms and/or functions to CLEENEX' function library, which exists for extensibility purposes;

2. Program Specification: the act of a programmer specifying a data cleaning process in CLEENEX (i.e., defining a DCG and the respective QCs and MDRs);

3. Optimization: the act of CLEENEX compiling and optimizing the specified data cleaning process;

4. Execution: after the compilation, CLEENEX proceeds to the execution of the previously defined data cleaning process. In this phase, during the execution of some operations, it can invoke externally defined algorithms/functions that may exist in the function library. During the execution, the user can check the execution status (i.e., by watching the various elements of the graph's workflow);

5. Testing & Debugging: after execution, the user can use the data derivation mechanism to navigate in the graph, in order to inspect the origins of data and the transformations applied to them;

6. Refinement: as the user tests and debugs the data along the DCG, some opportunities for refining the data cleaning criteria might be found. If such opportunities are identified, the cleaning process can be better tailored, and re-executed.

3. Related Work
Data cleaning tasks can be automated or human guided. The latter implies considering user intervention to repair the data. It is also possible that data may have quality problems because the data cleaning rules defined are outdated; therefore, some approaches also focus on cleaning the corresponding integrity rules.

In the context of user involvement, we analyzed state-of-the-art research and commercial solutions. To classify them, we used a taxonomy (Figure 4), which considers three main questions involved in data cleaning [4] (What to repair? How to repair? When to repair?).

Figure 4: Taxonomy for classifying data cleaning techniques, inspired by [3][8].

With respect to the research solutions, we considered the following works: Guided Data Repairs, Holistic Data Cleaning, NADEEF, LLUNATIC, Potter's Wheel, TAILOR, Unified Repair, and Continuous Data Cleaning.

Regarding the commercial solutions, we considered those by the leading vendors in the field (according to the Gartner Magic Quadrant for Data Quality); namely: Informatica, IBM, SAP, SAS, Talend and Oracle.

The majority of the research solutions analyzed focus on integrity rules, while others consider both integrity rules and data transformations (e.g., LLUNATIC and CLEENEX), since the two criteria are not mutually exclusive. On the other hand, regarding the commercial tools, all of them use data transformations to conduct the data cleaning processes.

There has been significant progress in the design and implementation of data cleaning tools. This has led to an increase in the importance of providing means for: debugging and data validation; the interleaving of analysis and cleaning during data cleaning processes; and reducing the user effort required. The tendency is to enable data cleaning systems to facilitate and automate rapid human-in-the-loop interactivity [9][10][2]. We noted that most tools integrate user feedback and provide a rich GUI, and some of them even have mechanisms for learning the user's behavior (i.e., to automate the decision making).

After the research conducted, we acknowledged that CLEENEX, while not using Machine Learning techniques, nor cleaning the data quality rules, nor following a holistic approach, provides support for the user to execute a data cleaning process and aims to reduce his/her effort. CLEENEX's goal is to enable an iterative execution of data cleaning processes, while integrating user actions and re-applying them accordingly. In terms of the commercial tools, we noted they were all automatic, focused on data transformations, and provided a rich GUI. None of them considered the repair of data quality rules. Additionally, not all the commercial tools provided debugging facilities - something that is available in CLEENEX.

4. Iterative Execution of a Data Cleaning Program
The CLEENEX framework prototype had various limitations, which were preventing the data cleaning processes from being executed properly. Such limitations were targeted by this thesis' contributions. In particular, the prototype did not yet enable users to iteratively clean data effectively: it allowed neither providing feedback (i.e., creating MDR instances) nor having that feedback integrated into the data cleaning process. Another problem identified was that the mutability of data upon the re-execution of a DCG was being ignored. Additionally, there were some faults in terms of the GUI, specifically the lack of means to keep the user informed and to support the creation and application of MDR instances.

To enable the iterative execution of DCGs while incorporating user intervention, we addressed the existing limitations with various contributions, categorized into three topics: (i) the support for MDRs, (ii) the iterative execution of a DCG, and (iii) the GUI.

An illustration of CLEENEX' main components is shown in Figure 5. The various implementations were performed both on the server and the client sides of the CLEENEX framework.

Figure 5: Overview diagram of CLEENEX’ main components.

4.1. Support for MDRs
In the context of the support for MDRs, we started by enabling the creation and application of MDR instances. In order to support the three MDR actions (i.e., insert, delete and update), it was necessary to perform the "bridging" between CLEENEX' Core and the GUI. Furthermore, it was necessary to refine and add components to the graphical interface, in order to allow the user to manually clean the data records intuitively and with minor effort.

A visual representation of the workflow of information in CLEENEX, when a user creates an MDR instance, is depicted in Figure 6. The communication between the GUI and CLEENEX' Core is done through various web services, which have different purposes. For example, in this figure we represent two of them: one to receive MDR instance creation requests, and another to notify the GUI of the MDR instances' creation status (i.e., failure or success).

Figure 6: MDR instance creation workflow.

Currently, whenever a user tries to create a new MDR instance through CLEENEX' GUI, a creation request is sent to CLEENEX' Core (i.e., to the respective web service component). The information sent to the web service ("MDRs Controller") is carried through a Data Transfer Object (DTO), which specifically contains information about the MDR to instantiate, the action to perform, the target tuple's original values (if applicable), and the attributes on which the user wants to execute the action (e.g., the attributes to update). When the request reaches CLEENEX' Core's "MDRs Controller", it is passed on to the Java class that is responsible for managing the MDRs (i.e., the MDR manager). That class then creates an object to represent the requested MDR instance. While doing this, a SQL statement (depending on the MDR action chosen) is constructed, in order to be passed on to the RDBMS for execution.

After creating the MDR instance Java object and sending the SQL statement to the RDBMS, the MDR manager analyzes the statement's execution result and notifies the "MDR Instances Creation Reporter" that the request has been completed. The "MDR Instances Creation Reporter" then sends a message to the GUI, to inform the user about the status of the MDR instance creation request.

In CLEENEX' Core, we distinguish the various types of MDR instances that can be instantiated by the MDR Manager. This is advantageous because it provides some extensibility, for example, if we wish to add support for additional actions in the future. The composition of the SQL statements, for each MDR instance created, is done according to the action chosen. For example: (i) an update MDR instance has a SQL statement that corresponds to an update of the target tuple(s) on the RDBMS; (ii) an insert MDR instance has a SQL statement that corresponds to a tuple insertion on the RDBMS; and (iii) a delete MDR instance has a SQL statement that corresponds to a tuple removal.

The construction of the SQL statements was not being done correctly. To resolve this, we added a component, MdrQueryUtils, that contains utility functions related to the construction of SQL statements. For example, the composition of the where and set SQL clauses is now done in that class. We opted for creating a single Java class in order to reuse code that was similar across the MDR actions that use such SQL clauses.
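A minimal sketch in the spirit of MdrQueryUtils is given below: composing the WHERE and SET clauses shared by the update and delete MDR actions. The method names, the use of parameter placeholders, and the example values are assumptions for illustration, not CLEENEX's actual code.

    import java.util.*;
    import java.util.stream.Collectors;

    // Hedged sketch of MdrQueryUtils-style helpers for composing SQL clauses.
    public final class MdrQueryUtilsSketch {

        // WHERE clause matching the target tuple by its original attribute values.
        static String whereClause(Map<String, Object> originalValues) {
            return originalValues.keySet().stream()
                    .map(attr -> attr + " = ?")
                    .collect(Collectors.joining(" AND ", " WHERE ", ""));
        }

        // SET clause listing only the attributes the user chose to update.
        static String setClause(Collection<String> attributesToUpdate) {
            return attributesToUpdate.stream()
                    .map(attr -> attr + " = ?")
                    .collect(Collectors.joining(", ", " SET ", ""));
        }

        // An update MDR instance becomes: UPDATE <view> SET ... WHERE ...
        static String updateStatement(String relation, Collection<String> attrs,
                                      Map<String, Object> originalValues) {
            return "UPDATE " + relation + setClause(attrs) + whereClause(originalValues);
        }

        // A delete MDR instance becomes: DELETE FROM <view> WHERE ...
        static String deleteStatement(String relation, Map<String, Object> originalValues) {
            return "DELETE FROM " + relation + whereClause(originalValues);
        }

        public static void main(String[] args) {
            Map<String, Object> original = new LinkedHashMap<>();
            original.put("email", "john.doe@example.com"); // hypothetical example values
            original.put("bYear", 1980);
            // e.g., UPDATE SurveyRes SET pBrand = ? WHERE email = ? AND bYear = ?
            System.out.println(updateStatement("SurveyRes", List.of("pBrand"), original));
            System.out.println(deleteStatement("SurveyRes", original));
        }
    }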

4.2. Iterative Execution of a DCG

To implement the iterative execution of a DCG, we had to ensure that the operational semantics defined in [7] was followed. The operational semantics is described in Definition 1.

Definition 1. Let G = 〈G, 〈Q,M〉〉 be a data cleaning graph for a set R1, ..., Rn of input relations. Let r1, ..., rn be instances of these relations and M be a manual data repair state for G, i.e., a function that assigns to every relation R ∈ rels(G) a list of instances of manual data repairs over R. The result of executing G over r1, ..., rn and M is {〈tuples(R), tuplesbl(R)〉 : R ∈ rels(G)}, calculated as follows:

 1: for i = 1 to n do
 2:   for each** ι ∈ M(Ri) do
 3:     vr ← compute_view(view(ι), tuples(Ri))
 4:     apply_mdr(ι, vr)
 5:     tuples(Ri) ← propagate(vr)
 6:   end for
 7: end for
 8: for i = 1 to n do
 9:   tuplesbl(Ri) ← blamed(tuples(Ri))
10: end for
11: for each* T ∈ trans(G) do
12:   let {R′1, ..., R′k} = •T
13:   tuples(T•) ← T(tuples(R′1), ..., tuples(R′k))
14:   for each** ι ∈ M(T•) do
15:     vr ← compute_view(view(ι), tuples(T•))
16:     apply_mdr(ι, vr)
17:     tuples(T•) ← propagate(vr)
18:   end for
19:   tuplesbl(T•) ← blamed(tuples(T•))
20: end for

The application of manual data repair instances to a view, apply_mdr, is defined as:

21: procedure apply_mdr(mdrInstances, vr)
22:   for each** ι ∈ mdrInstances do
23:     if action(mdr(ι)) = delete then
24:       vr ← vr \ {tuple(ι)}
25:     else if action(mdr(ι)) = insert then
26:       vr ← vr ∪ {tuple(ι)}
27:     else if action(mdr(ι)) = update then
28:       newt ← tuple(ι)
29:       newt[attribute(action(mdr(ι)))] ← value(ι)
30:       vr ← (vr \ {tuple(ι)}) ∪ {newt}
31:     end if
32:   end for

* Assuming that the underlying iteration will traverse the set in ascending element order.
** Assuming that the underlying iteration will traverse the list in proper sequence.

According to the operational semantics presented, the execution of a DCG proceeds as follows: considering each of the DCG's input relations, we traverse the list of MDR instances created for it (if there are any). For each MDR instance assigned to that relation node, we compute the relation node's view and apply the MDR instance. The application of the MDR instance results in an updated view (i.e., considering the action performed). If there are various MDR instances associated to the relation node, they are applied in their original order (i.e., order of creation). After applying the MDR instances, the blamed tuples are computed for each node.

To proceed: for each transformation node that exists in the DCG, and considering its input relations, we compute (i.e., transform, according to the operation chosen) the resulting output relation node. Then, each MDR instance associated to the output relation is applied over the computed relation view, and the resulting view is propagated to the transformation node's output (i.e., to the upcoming nodes in the graph). After applying all the node's MDR instances, and having an updated view (i.e., the node's output), the blamed tuples are re-computed.

In CLEENEX, a new data cleaning process begins without any MDR instances. On its first execution, the system starts by computing the list of blamed tuples of a table (considering the QCs that are associated to that relation). If blamed tuples are found, the system then proceeds to compute the resulting view. After such a view is computed, the user is able to consult the data records marked as blamed, as well as that view's results. When analyzing the data (namely the blamed tuples), if the user decides to perform MDRs, he/she creates the desired MDR instances. After the MDR instances are created successfully, the views are updated. When the user requests a re-execution of the DCG, the MDR instances previously created are re-applied to the respective views, in order of creation.
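As a complement, a minimal Java sketch of the apply_mdr step of Definition 1 (lines 21-32), applied to an in-memory view. Tuples are modelled as attribute-value maps; the MdrInstance type and its fields are simplified, hypothetical stand-ins for the framework's actual Java objects.

    import java.util.*;

    // Hedged sketch of apply_mdr(mdrInstances, vr) from Definition 1.
    class MdrInstance {
        enum Action { INSERT, UPDATE, DELETE }
        Action action;
        Map<String, Object> targetTuple;   // tuple(ι): original values identifying the target
        String attribute;                  // attribute(action(mdr(ι))), for updates
        Object newValue;                   // value(ι), for updates
    }

    class ApplyMdrSketch {
        // Applies the MDR instances to the view, in creation order, as in the operational semantics.
        static void applyMdr(List<MdrInstance> mdrInstances, List<Map<String, Object>> view) {
            for (MdrInstance i : mdrInstances) {
                switch (i.action) {
                    case DELETE:
                        view.removeIf(t -> t.equals(i.targetTuple));   // vr <- vr \ {tuple(i)}
                        break;
                    case INSERT:
                        view.add(new HashMap<>(i.targetTuple));        // vr <- vr U {tuple(i)}
                        break;
                    case UPDATE:
                        for (Map<String, Object> t : view) {
                            if (t.equals(i.targetTuple)) {
                                t.put(i.attribute, i.newValue);        // newt[attr] <- value(i)
                            }
                        }
                        break;
                }
            }
        }
    }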

4.2.1 The DCG re-execution problem

An important problem related to the re-execution of MDR instances was found: essentially, not all the DCGs' MDR instances were being correctly re-executed because, eventually, the original values of the targeted tuples would change (i.e., the input dataset had been updated). This way, the existing MDR instances would become deprecated and not be executed successfully, because the targeted data was not found (i.e., it had changed upon the re-execution).

The DCG re-execution problem happened when the tuples' values were generated by certain externally defined functions (chosen by the user) that were, for example, returning variable/"randomly" generated values even if the input was the same (i.e., they were returning non-deterministic values). Due to such tuples' non-deterministic values, when the DCG was re-executed, the corresponding MDR instances' actions would not be executed, because the initially targeted tuples had changed (i.e., upon re-execution). This situation - of an MDR instance's target tuple becoming deprecated/outdated - was named an MDR instance conflict.

The CLEENEX framework allows its users to plug in any user-defined function. Therefore, it is crucial to keep in mind that the situation described above (i.e., an MDR instance conflict) can happen. We consider that it is the responsibility of the users who define the DCGs to identify whether the external functions used are deterministic or non-deterministic.

To solve this issue, we came up with a solution that deals with this kind of situation, in order to prevent the loss of MDR instances and guarantee the correct execution of the MDRs' actions: (i) identifying the attributes that are non-deterministic, in the data cleaning process' specification; and (ii) upon the re-execution of the DCG, detecting MDR instances that target records with non-deterministic values, to verify whether those MDR instances need to be updated (i.e., whether an MDR instance conflict appeared). This verification, to detect the MDR instance conflicts, is done by a new component, the MDR conflict manager. It not only detects the conflicts, but also resolves them automatically when possible, or provides means for resolving them by requesting the user to pick the best option for the resolution (i.e., through a new GUI component). When an MDR instance conflict is found, the user is notified through the GUI: the respective node (to which the MDR instance is associated) has its color changed to orange, and a warning symbol appears (as exemplified in Figure 7).
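The sketch below illustrates one way such detection logic could work, under the assumption that the target tuple is re-located using only its deterministic attributes: a single candidate can be adopted automatically, while zero or several candidates require the user to resolve the conflict through the GUI. The types and names are illustrative, not the MDR conflict manager's actual code.

    import java.util.*;
    import java.util.stream.Collectors;

    // Hedged sketch of MDR instance conflict detection and (partial) automatic resolution.
    class MdrConflictSketch {

        enum Outcome { NO_CONFLICT, RESOLVED_AUTOMATICALLY, NEEDS_USER_RESOLUTION }

        static Outcome checkAndResolve(Map<String, Object> target,
                                       Set<String> deterministicAttributes,
                                       List<Map<String, Object>> newView) {
            // The original target is still present: nothing to do.
            if (newView.stream().anyMatch(t -> t.equals(target))) {
                return Outcome.NO_CONFLICT;
            }
            // Re-locate the record by its deterministic attributes only.
            List<Map<String, Object>> candidates = newView.stream()
                    .filter(t -> deterministicAttributes.stream()
                            .allMatch(a -> Objects.equals(t.get(a), target.get(a))))
                    .collect(Collectors.toList());
            if (candidates.size() == 1) {
                // Adopt the updated tuple as the new target of the MDR instance.
                target.clear();
                target.putAll(candidates.get(0));
                return Outcome.RESOLVED_AUTOMATICALLY;
            }
            // Ambiguous (or vanished) target: flag the node and ask the user to pick the new tuple.
            return Outcome.NEEDS_USER_RESOLUTION;
        }
    }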

When the "Resolve" button is clicked, the user is shown another pop-up window (Figure 8) that allows him/her to select the new (updated) source tuple (i.e., according to the new input data, which resulted from the DCG's re-execution). After selecting the correct data record to which the MDR instance should be applied, the user must click the "Apply" button to conclude the update of the information regarding that MDR instance and apply it successfully.

Figure 7: GUI displayed upon detection of MDR instance conflicts.

Another functionality implemented was the persistent storage of the MDRs and respective MDR instances. This way, we are able to keep track of the data records. Additionally, we added a complementary feature to recover from a paused data cleaning process, by reloading all the MDR instances previously created.
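As a rough illustration of what persisting MDR instances for later reload could look like, the sketch below uses a hypothetical table layout (PostgreSQL-flavoured DDL) and plain JDBC; the actual schema and storage mechanism used by CLEENEX are not described here and may differ.

    import java.sql.*;

    // Hedged sketch: persisting MDR instances so a paused data cleaning process can be reloaded.
    class MdrStoreSketch {
        static final String DDL =
            "CREATE TABLE IF NOT EXISTS mdr_instance (" +
            " id SERIAL PRIMARY KEY, mdr_name VARCHAR(64), action VARCHAR(16)," +
            " target_relation VARCHAR(64), payload TEXT, created_at TIMESTAMP DEFAULT NOW())";

        static void save(Connection conn, String mdrName, String action,
                         String relation, String payloadJson) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO mdr_instance (mdr_name, action, target_relation, payload)" +
                    " VALUES (?, ?, ?, ?)")) {
                ps.setString(1, mdrName);
                ps.setString(2, action);
                ps.setString(3, relation);
                ps.setString(4, payloadJson);
                ps.executeUpdate();
            }
        }

        // On reload, instances are read back in creation order and re-applied to their views.
        static ResultSet loadInCreationOrder(Connection conn) throws SQLException {
            return conn.createStatement().executeQuery(
                    "SELECT * FROM mdr_instance ORDER BY created_at, id");
        }
    }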

4.3. Graphical User Interface
In terms of the GUI, we not only added support for the new features, but also improved and corrected faults that existed in the prototype.

We changed the GUI to include new components for: the resolution of MDR instance conflicts, as shown in Figure 8; the creation of MDR instances (insert and delete actions); and the display of notifications to the user, in order for him/her to be better informed during the application of MDR instances.

Figure 8: GUI displayed, with pop-up window, for MDR instance conflict resolution.

Additionally, we corrected the filtering and sorting functions for the tabular data's columns (i.e., in the data browser component). We also enabled the automatic resizing of some GUI components (e.g., the buttons) and updated others. For example, we updated the JavaScript code related to the update MDR instances.

4.4. Discussion
The various implementations performed over the CLEENEX framework, in the context of this thesis, had the objective of enabling the iterative execution of DCGs, while allowing the user to intervene in the data cleaning process through MDRs.

To fully enable the two-way guidance that CLEENEX can provide, we started by performing the necessary implementations to ensure the correct incorporation of the feedback given by the user. That is, we added support for the creation and application of MDR instances.

In order to achieve the thesis' goals, we came across various impediments. The main limitations found were related to the concept of MDRs; particularly, the fact that an MDR did not consider the mutability of the data targeted by the respective MDR instances. To resolve this issue, we started by providing means for identifying the data that was prone to change (i.e., what we called the non-deterministic attributes). We included support for the specification of non-deterministic attributes in the data cleaning programs, in order to make CLEENEX aware of the possibility of errors in the re-application of the MDR instances. Then, we added the notion of MDR instance conflicts, to flag eventual errors caused by those non-deterministic attributes (i.e., the errors that happen when an MDR instance is not applied because the target tuple(s) have changed). When MDR instance conflicts are found, they are associated to the respective MDRs. By having this notion, and performing the detection and resolution of the MDR instance conflicts that appear (i.e., during the execution of a DCG), we managed to create a functionality that allows the user to recover from errors (i.e., errors in the application of the MDR instances) without losing the work previously performed.

In terms of the re-execution of the MDR instances during the iterative execution of a DCG, we found it important to store all the MDRs and respective instances persistently. We implemented the persistent storage and retrieval of the MDR-related data, and added an additional feature to enable the recovery of the previously created MDR instances, which can be quite helpful in case the user wishes to pause the data cleaning process and recover its status later.

Regarding the GUI, we improved some of the existing components (e.g., correcting some faults), and added others, to support the functionalities that were updated and/or implemented.

Overall, we consider the functionalities implemented to be potential value adders, because they enable what was addressed by this thesis, and may have a positive impact on the user experience (e.g., by providing better guidance to the user and reducing the effort required).

5. Experimental Validation
Although supplying feedback (i.e., through MDRs) to the data cleaning process requires effort, the actions performed by the user may be crucial to obtain data with higher quality at the end of a data cleaning process. The goal is to minimize the user effort required, such that the trade-off between the data quality obtained and the effort required is worthwhile. Accordingly, we measure both the user effort required and the data cleaning processes' effectiveness (i.e., considering the data quality produced).

In order to validate the implemented functionalities, we performed three distinct experiments. Each experiment was done for two datasets, from different business domains and with distinct data quality problems.

5.1. Experimental Scenarios
Each experiment consists of the execution of a set of data cleaning processes. Table 1 contains information about each experiment and the main functionalities that it targets (i.e., as shown in the first column, Targeted Validation), as well as the metrics used for such purpose (i.e., as mentioned in the second column, Metrics of Interest). A description of the validation metrics is presented in Section 5.3.

5.2. Data Cleaning Processes
The conditions of the various experiments' data cleaning processes are explained as follows:

• Experiment (A) From Manual to DCG:

1. Manual: the user manually cleans the dataset. We consider that its output is completely accurate.

2. DCG, with Further Cleaning: the data cleaning process is modeled and executed through a DCG, with neither QCs nor MDRs. The user manually cleans the data (i.e., to fix the remaining data quality problems) after the DCG's execution.

3. DCG, with QCs and MDRs: a data cleaning process is modeled and executed through a DCG with MDRs and/or QCs. During the execution of the process, the user interacts with the DCG, providing feedback.

• Experiment (B) Conflict Detection and Resolution: a data cleaning process is modeled and executed through a DCG with MDRs and/or QCs. During the execution of the process, the user interacts with the DCG, providing feedback.

3. DCG, with QCs and MDRs: the MDR instance conflict detection and resolution functionality is not used.

4. DCG, with QCs, MDRs and MDR Instance Conflicts: the MDR instance conflict detection and resolution functionality is used.

• Experiment (C) Work Loss and Recovery: for this scenario, we consider that the user's machine undergoes an outage when half of the required MDR instances have already been created.

4.(a) With Work Loss: the user does not use the MDR recovery functionality. The MDR instances created are irrecoverably lost.

4.(b) With Work Loss and Recovery: the user takes advantage of the MDR recovery functionality.

5.3. Validation Measures
The metrics considered are as follows:

• Effectiveness: we consider Precision, Recall and F1 Score, computed as follows (a short computation sketch is given after this list):

  Precision = #TP / (#TP + #FP)    (1)

  Recall = #TP / (#TP + #FN)    (2)

  F1 Score = 2 × (Precision × Recall) / (Precision + Recall)    (3)

• User Effort:

  1. Manual Data Inspection:
     – Inspection Effort: average number of characters inspected.
  2. Manual Data Update:
     – Updating Effort: (i) number of characters modified; (ii) number of characters added; (iii) number of characters deleted.
  3. Repeated Manual Data Update: when the user has to repeat work done (e.g., by re-creating the MDR instances lost), he/she is repeating actions. This results in extra user effort, measured in terms of:
     – Extra Updating Effort: (i) number of MDR instances created; (ii) number of MDR instances re-applied (with success).
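For reference, a minimal Java sketch of how the effectiveness metrics above can be computed from raw counts; the class name is hypothetical, and the example values are those of Table 3 (Experiment (A), process 3).

    // Minimal sketch: effectiveness metrics of Section 5.3 computed from raw counts.
    class EffectivenessSketch {
        static double precision(int tp, int fp) { return (double) tp / (tp + fp); }
        static double recall(int tp, int fn)    { return (double) tp / (tp + fn); }
        static double f1(double p, double r)    { return 2 * p * r / (p + r); }

        public static void main(String[] args) {
            // Experiment (A), process 3 (Table 3): TP = 152, FP = 0, FN = 0.
            double p = precision(152, 0), r = recall(152, 0);
            System.out.printf("Precision=%.2f Recall=%.2f F1=%.2f%n", p, r, f1(p, r));
        }
    }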

Table 1: Experiments' targeted functionalities and metrics of interest.

Experiment | Targeted Validation | Metrics of Interest (User Effort) | Metrics of Interest (Effectiveness)
(A) From Manual to DCG | MDR Instances | Inspection Effort; Updating Effort | Precision; Recall; F1 Score
(B) Conflict Detection and Resolution | MDR Instance Conflicts | Updating Effort; Extra Updating Effort | -
(C) Work Loss and Recovery | Persistent Storage and Restoration | Updating Effort; Extra Updating Effort | -

5.4. Results
Although the experiments were performed for two distinct datasets, we only present the results for one of them, the Publications dataset, because the results were consistent across datasets (i.e., we noted similar improvements).

• Experiment (A): the results obtained in terms of user effort are shown in Table 2. The results regarding the effectiveness of the data cleaning processes executed are shown in Table 3. The results obtained for Precision, Recall, and F1 Score are 1, which is the ideal value.

• Experiment (B): the results obtained in terms of user effort are shown in Table 4. We conclude that, by using the MDR instance conflict detection and resolution functionality, the user effort was greatly reduced, by at least 38%.

• Experiment (C): the results obtained in terms of user effort are shown in Table 5. By analyzing its content, we conclude that the MDR instance recovery functionality is useful, as it guarantees less user effort (i.e., both Updating Effort and Repetition Effort) in case the user needs to "reload" the MDR instances.

Considering the various results obtained for the experiments performed, we concluded that, with the new functionalities, the required user effort was proven to be reduced. Furthermore, the data quality obtained always corresponded to what was idealized. Therefore, we consider that the work done in the scope of this thesis is validated.

6. Conclusions
Considering the initial limitations of the CLEENEX prototype, we presented our contributions, which were made to improve the CLEENEX framework at several levels. The implementations done over the CLEENEX framework addressed the following: (i) support for MDRs, (ii) iterative execution of a DCG, and (iii) inclusion and/or refinement of GUI components. In terms of (i) the support for MDRs, we enabled the creation and execution of MDR instances for the three supported actions (insert, update and delete). Regarding (ii) the iterative execution of a DCG, we implemented the necessary changes to ensure its conformity to CLEENEX' operational semantics; we implemented the persistent storage of MDRs; we added the notion of non-deterministic attributes, in order for the framework to be aware of MDR instance conflicts; and we added support for an important non-functional requirement (which was previously being ignored): enabling recovery from errors in the application of MDRs (i.e., by detecting and resolving those errors automatically). We also added a complementary feature to allow the user to pause a data cleaning process. Regarding (iii) the GUI, we addressed some of its limitations and/or faults, and added graphical components to support the comprehension and usage of the newly added functionalities.

Overall, the functionalities implemented over the CLEENEX framework are value adders, because they not only allow a correct execution of the DCGs, but also enable the incorporation of user feedback, which can be essential for the data cleaning processes to be effective, producing data with the idealized quality.

For future work, we believe the following topics would be advantageous: (i) Data Transformations' Incremental Execution: to reduce the computational effort and the execution time, the application of data transformations could be done solely on the new data (i.e., the data that is new to the current iteration of the DCG); (ii) MDR Instances' GUI Management: it could be advantageous to implement an additional GUI component to allow the user to manage the MDR instances created, for example, to discard (i.e., roll back) certain MDR instances; and (iii) Additional Evaluations: it would be interesting to conduct additional experiments, in order to measure the computational overhead required for the execution of some of the new functionalities (e.g., the recovery from MDR instance conflicts).

Table 2: Experiment (A) results, regarding User Effort, for the Publications dataset.

Data Cleaning Process | # Characters Modified | # Characters Added | # Characters Deleted | # Characters Visualized
1. Manual | 217107 | 111078 | 106029 | 171437
2. DCG, with Further Manual Cleaning | 193698 | 127143 | 66555 | 131045
3. DCG, with QCs and MDRs | 132910 | 118322 | 14588 | 78030

(The first three columns measure Updating Effort; the last column measures Visualization Effort.)

Table 3: Experiment (A) results, regarding Effectiveness, for the Publications dataset. NA = Not Applicable.

Data Cleaning Process | True Positive (TP) | True Negative (TN) | False Positive (FP) | False Negative (FN) | Precision | Recall | F1 Score
1. Manual | 152 | 216 | NA | NA | 1 | 1 | 1
2. DCG, with Further Cleaning | 152 | 216 | NA | NA | 1 | 1 | 1
3. DCG, with QCs and MDRs | 152 | 216 | 0 | 0 | 1 | 1 | 1

Table 4: Experiment (B) results, regarding User Effort, for the Publications dataset.

Data Cleaning Process | # Executions | # Characters Modified | # Characters Added | # Characters Deleted | # MDR Instances Created | # MDR Instances Re-applied
3. DCG, with QCs and MDRs | 1 | 132910 | 118322 | 14588 | 608 | 226
3. DCG, with QCs and MDRs | 2 | 162086 | 132910 | 29176 | 990 | 226
3. DCG, with QCs and MDRs | 3 | 191262 | 147498 | 43764 | 1372 | 226
4. DCG, with QCs, MDRs and MDR Conflicts | - | 132910 | 118322 | 14588 | 608 | 608

(The character columns measure Updating Effort; the MDR instance columns measure Repetition Effort.)

Table 5: Experiment (C) results, regarding User Effort, for the Publications dataset.

Data Cleaning Process | # Characters Updated | # Characters Added | # Characters Deleted | # MDR Instances Created | # MDR Instances Re-applied
4.(a) With Work Loss | 199365 | 177483 | 21882 | 912 | 0
4.(b) With Work Loss and Restoration | 132910 | 118322 | 14588 | 608 | 608

(The character columns measure Updating Effort; the MDR instance columns measure Repetition Effort.)

References

[1] Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. Detecting data errors: Where are we and what needs to be done? Proceedings of the VLDB Endowment, 9(12):993–1004, 2016.

[2] K. Adu-Manu Sarpong and J. Kingsley Arthur. Analysis of Data Cleansing Approaches regarding Dirty Data – A Comparative Study. International Journal of Computer Applications, 76(7), 2013.

[3] X. Chu and I. F. Ilyas. Qualitative Data Cleaning. Proceedings of the VLDB Endowment, 9(13):1605–1608, 2016.

[4] X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang. Data Cleaning: Overview and Emerging Challenges. Proceedings of the SIGMOD Conference, pages 2201–2206, 2016.

[5] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. AJAX: An Extensible Data Cleaning Tool. Proceedings of the SIGMOD Conference, page 590, 2000.

[6] H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative Data Cleaning: Language, Model, and Algorithms. Technical Report RR-4149, INRIA, 2001.

[7] H. Galhardas, A. Lopes, and E. Santos. Support for user involvement in data cleaning. Proceedings of the DaWaK Conference, LNCS 6862:136–151, 2011.

[8] I. F. Ilyas and X. Chu. Trends in Cleaning Relational Data: Consistency and Deduplication. Foundations and Trends in Databases, 5(4):281–393, 2015.

[9] S. Krishnan, D. Haas, M. J. Franklin, and E. Wu. Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations. Proceedings of the HILDA Workshop, pages 9:1–9:5, 2016.

[10] H. Müller and J.-C. Freytag. Problems, Methods, and Challenges in Comprehensive Data Cleansing. Technical Report HUB-IB-164, Humboldt University Berlin, Institute for Computer Science, 2003.

[11] P. Neely, S. Lin, J. Gao, and A. Koronios. The deficiencies of current data quality tools in the realm of engineering asset management. Proceedings of the AMCIS Conference, 1:430–438, 2006.

[12] L. L. Pipino, Y. W. Lee, and R. Y. Wang. Data Quality Assessment. Communications of the ACM, 45(4):211–218, 2002.
