

Page 1: Dissertation Proposal Presentation


MODELING AND MAPPING FORMS OVER DATABASES:

EMPOWERING USERS TO DESIGN DATABASES IN INDUSTRIAL DOMAINS

Dissertation Proposal

October 07 2010

Ritu Khare

Page 2: Dissertation Proposal Presentation

MOTIVATION

Database Design by Non-technical Users
Why have existing methods not reached the industrial domains?

Page 3: Dissertation Proposal Presentation

Database Design By Non-technical Users

Our inspiration: applications (Google Forms, FormAssembly, Zoho Creator) that allow users to design databases.
How? Forward engineering of user needs into databases.

A great innovation in database usability: the database closely reflects user needs.

Very popular for online data collection (surveys, event organization, etc.).

Not used in industrial domains (healthcare, automobile, etc.)!

[Figure: a clinician designs a form and collects data; forward engineering (F/W engg) produces a user-designed database with two tables, Patient (ID, Name, Phone, DOB) and VitalSigns (ID, PatientID, Date, Height, Weight).]

Page 4: Dissertation Proposal Presentation

Why are existing methods unfit for industrial domains?

Existing Applications
  No provision to modify or extend an existing database.
  The translation (forward engineering) method is not reported.
  Not tested on non-technical users.

Features of Industrial Domains
  Databases are required to evolve with respect to new user needs.
  Data and database quality is important: quality leads to productivity (Batini and Scannapieco, 2006).
  Users have no background in data modeling and databases.

Page 5: Dissertation Proposal Presentation

THE PROPOSAL

Proposed System and Research Goals
Opportunity: Forms
Example: Form to Database Mapping
Challenges in Mapping

Page 6: Dissertation Proposal Presentation

Proposed System and Research Goals

Proposed System: An application to model and map user needs into an existing database.

Goals:
1. Modeling: a “usable” medium for users to model needs.
   Efficiency, Effectiveness, Adoption
2. Mapping: the resultant database should be high-quality, i.e., it should satisfy (Silberschatz et al. 2001; Batini and Scannapieco, 2006; Batini et al. 1992):
   Normalization, Completeness, Compactness, Correctness

Page 7: Dissertation Proposal Presentation

Opportunity: Forms

MODELING: Data-entry forms provide a good communication medium for users to specify their data collection needs (Choobineh et al. 1988; Embley, 1989).

MAPPING: Important information on databases can be retrieved by analyzing forms (Choobineh and Mannino, 1988).
  Search forms provide a useful way of determining the underlying database (Benslimane, 2007). (Covered in Candidacy Exam)
  Data-entry forms provide key guidelines for designing a prospective database (Mannino and Choobineh, 1984).

Page 8: Dissertation Proposal Presentation

The proposed application: An Example

[Figure: a clinician designs a new form (Form Modeling) that expresses new needs. Form to Database Mapping merges the form into the existing database (Patient: ID, Name, Phone, DOB; VitalSigns: ID, PatientID, Date, Height, Weight) to produce the evolved database, in which VitalSigns gains the columns BP and Smoking Stat.]

NEW PROBLEM!

Page 9: Dissertation Proposal Presentation

Uniqueness of “Form to Database” Mapping

Schema Mapping (Rahm and Bernstein 2001)
  The two structures are similar.
  Mapping involves only schema elements (no values).
  Does not consider schema/database evolution when there are unmapped elements.
  Semiautomatic.

Form to Database Mapping
  Mapping Discovery
    How to reconcile the differences in structures and semantics?
    How to detect the form (or need) components (including values) which already exist in the database?
  Database Evolution
    How to extend the database based on new elements in the form?
    How to automatically determine functional dependencies and cardinalities from a form?

Page 10: Dissertation Proposal Presentation

Proposed Application

Page 11: Dissertation Proposal Presentation

1. Form Design Interface

[Figure: annotated screenshots of a Simple Form and an Advanced Form. Labeled components: Title, Category, Subcategory, Field, Subfield, Format, Supporting Text, Unit, Extended Checkbox option, Condition.]

SIMPLE!
1. Terminology (intuitive)
2. Features (form patterns)

Page 12: Dissertation Proposal Presentation

1. Form Design Interface

Input: User actions (based on data collection needs)
Output: Form

1. Enter the title “Patient Encounter Form”
2. Enter the category “Patient”
3. Enter the field “Name”
4. Pick a format “textbox”
5. Enter the field “Age”
6. …

Page 13: Dissertation Proposal Presentation

Defining High-Quality: Guiding Principles (with respect to a given form)

Completeness: Every form element has a place in the database.

Correctness: For each correspondence, the form element and the database element refer to the same real-world element (they have matching labels and contexts).

Compactness: Every database element occurs just once.

Normalization: The database is in 3NF.

Page 14: Dissertation Proposal Presentation

A Simple Approach

1. Loses grouping information.
2. Loses form values.
3. Heterogeneous attributes are placed in the same relation.

The generated database is incomplete and not in 3NF (low-quality)!
So we propose a tree representation of the form.

Page 15: Dissertation Proposal Presentation

2. Tree Generation
Definition: Form Tree

Input: Form
Output: Form Tree

Previous works have proposed a similar tree representation for search forms (Dragut et al. 09, Wu et al. 09).

Differences in this work:
1) It targets data-entry forms.
2) It adds format nodes to improve DB quality.
3) It uses different representations for checkboxes and radiobuttons.
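The form tree definition itself is in a figure that the transcript does not capture, so the following is only a minimal sketch of one plausible shape. It assumes the node kinds follow the interface terminology (title, category, field, format) and uses the "Patient Encounter Form" steps from the design-interface slide. The names FormNode, add, and walk are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FormNode:
    """One node of a form tree. `kind` is a hypothetical tag such as
    'title', 'category', 'field', or 'format'."""
    kind: str
    label: str
    children: List["FormNode"] = field(default_factory=list)

    def add(self, kind, label):
        """Attach a child node and return it so calls can be chained."""
        child = FormNode(kind, label)
        self.children.append(child)
        return child


def walk(node, depth=0):
    """Yield (depth, node) pairs in pre-order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# Replays the user actions listed on the Form Design Interface slide.
form = FormNode("title", "Patient Encounter Form")
patient = form.add("category", "Patient")
patient.add("field", "Name").add("format", "textbox")
patient.add("field", "Age").add("format", "textbox")

outline = [f"{'  ' * depth}{node.kind}: {node.label}" for depth, node in walk(form)]
print("\n".join(outline))
```

Keeping the format as its own node under each field mirrors difference 2) above: later steps can then choose a database pattern by inspecting the format node.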

Page 16: Dissertation Proposal Presentation

Form to Database Mapping

Form Tree + Existing Database: Map and Merge???

Main challenges:
1. Discovering a mapping between two heterogeneous structures.
2. Merging new elements into the existing database.

[Figure: the Form Tree is turned into a New Database Graph (3. Birthing); the new graph is mapped against the Existing Database Graph (MAP, 4. Classification) and then merged into the Existing Database (MERGE, 5. Extension).]

Page 17: Dissertation Proposal Presentation


Definition: Database Graph

Page 18: Dissertation Proposal Presentation

Definition: Mapping Correspondences

Direct correspondence
Indirect correspondence

(The value collected on the form element is stored in the database element.)

Page 19: Dissertation Proposal Presentation

3. Birthing (term adopted from Jagadish et al. 2007)

Input: Form Tree
Output: New Database Graph

Page 20: Dissertation Proposal Presentation

3. Birthing – Pattern 1 (Textbox)

Induced functional dependencies:
Address.id -> line1
Address.id -> line2
Patient.id -> Name
Patient.id -> Age
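The dependencies listed above can be reproduced with a short sketch. It assumes each category becomes a relation with a surrogate key named id and each textbox field becomes a column determined by that key. The function name textbox_dependencies and the dictionary input format are hypothetical, not the proposal's actual data structures.

```python
def textbox_dependencies(categories):
    """Return the functional dependencies induced by textbox fields.

    `categories` maps a category label to the labels of its textbox fields.
    Assumption: every category gets a surrogate key column called `id`.
    """
    dependencies = []
    for category, fields in categories.items():
        key = f"{category}.id"
        for field_label in fields:
            dependencies.append(f"{key} -> {field_label}")
    return dependencies


# The two categories shown on the slide.
form = {"Address": ["line1", "line2"], "Patient": ["Name", "Age"]}
deps = textbox_dependencies(form)
print("\n".join(deps))
```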

Page 21: Dissertation Proposal Presentation

3. Birthing – Pattern 2: Radiobutton & Pattern 3: Checkbox

Pattern 2 (Radiobutton): Radiobutton values are mapped to database values. They represent an M:1 relationship between Patient and Insurance.

Pattern 3 (Checkbox): Checkbox values are mapped to database columns (yes/no). They represent a 1:1 relationship between Patient and Symptoms.
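One way to read the two patterns is as the SQL they might produce. The sketch below is an illustration under assumptions, not the proposal's birthing algorithm: the function names, the option labels (Private, Medicare, fever, cough), and the key naming are hypothetical, and SQLite stands in for the MySQL back end named later in the deck.

```python
import sqlite3


def birth_radiobutton(owner, group, options):
    """Pattern 2: options become rows (values) of a lookup table, and the
    owner table references one row, so many owners share one option (M:1)."""
    fk = f"{group.lower()}_id"
    statements = [f"CREATE TABLE {group} (id INTEGER PRIMARY KEY, value TEXT UNIQUE)"]
    # String interpolation is acceptable only because this is a fixed illustration.
    statements += [f"INSERT INTO {group} (value) VALUES ('{opt}')" for opt in options]
    statements.append(
        f"ALTER TABLE {owner} ADD COLUMN {fk} INTEGER REFERENCES {group}(id)"
    )
    return statements


def birth_checkbox(owner, group, options):
    """Pattern 3: each option becomes a yes/no column; the UNIQUE foreign key
    keeps the relationship with the owner table 1:1."""
    flags = ", ".join(f"{opt} INTEGER DEFAULT 0" for opt in options)
    return [
        f"CREATE TABLE {group} (id INTEGER PRIMARY KEY, "
        f"{owner.lower()}_id INTEGER UNIQUE REFERENCES {owner}(id), {flags})"
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Patient (id INTEGER PRIMARY KEY, name TEXT)")
for sql in birth_radiobutton("Patient", "Insurance", ["Private", "Medicare"]):
    conn.execute(sql)
for sql in birth_checkbox("Patient", "Symptoms", ["fever", "cough"]):
    conn.execute(sql)

patient_cols = [row[1] for row in conn.execute("PRAGMA table_info(Patient)")]
symptom_cols = [row[1] for row in conn.execute("PRAGMA table_info(Symptoms)")]
insurance_values = [row[0] for row in conn.execute("SELECT value FROM Insurance ORDER BY id")]
print(patient_cols, symptom_cols, insurance_values)
```

The contrast is the point: radiobutton options end up as data (rows), while checkbox options end up as schema (columns).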

Page 22: Dissertation Proposal Presentation

3. Birthing – Pattern 4: Category/Subcategory & Pattern 5: Sibling Categories

[Figure: both patterns are illustrated with M:M relationships.]

Page 23: Dissertation Proposal Presentation


3. Birthing Patterns Summarized

Page 24: Dissertation Proposal Presentation

4. Database Graph Classification

Classify each node to see whether it pre-exists in the existing database, i.e., to find whether it “maps” or not.

[Figure: the New Database Graph is compared with the Existing DB Graph, which is derived from the Existing DB.]

Page 25: Dissertation Proposal Presentation

4. Database Graph Classification: Algorithm

Problem: Finding matching nodes between the new database graph (DGn) and the existing database graph (DGe).

Algorithm
For each table node tn in DGn
  Let te be the label-matching table node in DGe
  If the two table nodes tn and te “match” (TableMatch algo)
    Tag tn, i.e., mark this node as a matching/mapped node
    Tag all matching column and value nodes (ColumnMatch algo)
  Else
    Rename the table
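The loop above can be sketched as follows. Several simplifications are assumptions made for illustration: graphs are plain dictionaries from table label to a set of column labels, columns are compared by label only, TableMatch is passed in as a predicate, tables without a label match are treated as new, and the "_2" renaming scheme is invented. The stand-in predicate shares_non_key_column is not the real TableMatch.

```python
def classify(new_graph, existing_graph, table_match):
    """Tag tables of the new graph that already exist in the existing graph.

    Returns (mapped, renamed): `mapped` holds the tagged tables with their
    matching columns; `renamed` holds label clashes that failed the match.
    """
    mapped, renamed = {}, {}
    for table, cols in new_graph.items():
        if table not in existing_graph:
            continue  # assumption: no label match, so the table is simply new
        old_cols = existing_graph[table]
        if table_match(cols, old_cols):
            mapped[table] = cols & old_cols  # tagged (matching) columns
        else:
            renamed[table] = f"{table}_2"  # hypothetical renaming scheme
    return mapped, renamed


def shares_non_key_column(new_cols, old_cols):
    """Placeholder for TableMatch: true if any non-key column is shared."""
    return bool((new_cols & old_cols) - {"id"})


existing = {
    "VitalSigns": {"id", "patient_id", "height", "weight"},
    "Visit": {"id", "date", "reason"},
}
new = {
    "VitalSigns": {"id", "patient_id", "height", "weight", "bp"},
    "Visit": {"id", "visitor_name", "badge_no"},  # same label, different meaning
    "Allergy": {"id", "substance"},               # no counterpart at all
}
mapped, renamed = classify(new, existing, shares_non_key_column)
print(mapped, renamed)
```

Renaming matters because a label clash without a real match would otherwise merge two unrelated tables in the extension step.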

Page 26: Dissertation Proposal Presentation

4. Database Graph Classification: TableMatch Algorithm

Two table nodes “match” if
  their labels match, and
  the null-value column ratio (NCR) < tolerance-threshold (efficiency consideration: minimize null-value possibilities during data collection).

NCR = (number of unmatched columns, as per ColumnMatch, in either table, whichever is higher) / (size of the union set of columns in both tables)

Page 27: Dissertation Proposal Presentation

Example: Null-Value Column Ratio (NCR) Calculation

NCR = 2/5 = 0.4

If tolerance-threshold = 0.5 (high): 0.4 < 0.5, so the tables match (map).
If tolerance-threshold = 0.3 (low): 0.4 is not below 0.3, so the tables do not match.

When using Form 1, 2 columns will have null values.
When using Form 2, 1 column will have null values.
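The NCR rule and the 2/5 example can be checked with a short sketch. The column names are hypothetical (the slide's tables are in a figure that the transcript does not capture); they are chosen so that one table has 1 unmatched column, the other has 2, and the union has 5. Columns are matched by label only, a simplification of ColumnMatch.

```python
def ncr(cols_a, cols_b):
    """Null-value column ratio: the larger count of unmatched columns in
    either table, divided by the size of the union of both column sets."""
    matched = cols_a & cols_b
    unmatched = max(len(cols_a - matched), len(cols_b - matched))
    return unmatched / len(cols_a | cols_b)


def table_match(label_a, cols_a, label_b, cols_b, tolerance):
    """Two tables match if their labels match and NCR is below the tolerance."""
    return label_a == label_b and ncr(cols_a, cols_b) < tolerance


# Hypothetical columns: Form 1 lacks dob and address (2 nulls when it is used),
# Form 2 lacks phone (1 null when it is used).
form1_table = {"id", "name", "phone"}
form2_table = {"id", "name", "dob", "address"}

ratio = ncr(form1_table, form2_table)
high = table_match("Patient", form1_table, "Patient", form2_table, tolerance=0.5)
low = table_match("Patient", form1_table, "Patient", form2_table, tolerance=0.3)
print(ratio, high, low)
```

A higher tolerance merges more aggressively and accepts more null values; a lower one keeps the tables apart.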

Page 28: Dissertation Proposal Presentation

4. Database Graph Classification: ColumnMatch Algorithm

Two non-key column nodes “match” if their
  labels/names are the same,
  data types are the same, and
  not-null constraints are the same.

Two foreign-key column nodes “match” if
  they both point to the same table nodes, as determined by the TableMatch algorithm.
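A minimal sketch of these two rules follows. The Column record and the matched_tables argument (pairs of table labels already accepted by TableMatch) are hypothetical. Two further assumptions go beyond the slide: labels are compared exactly, and a foreign-key column never matches a non-key column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    not_null: bool = False
    references: Optional[str] = None  # target table label if this is a foreign key


def column_match(new, old, matched_tables):
    """Apply the ColumnMatch rules to one new and one existing column.

    `matched_tables` is a set of (new_table, existing_table) label pairs
    that TableMatch has accepted.
    """
    if new.references or old.references:
        if not (new.references and old.references):
            return False  # assumption: FK and non-key columns never match
        return (new.references, old.references) in matched_tables
    return (new.name, new.dtype, new.not_null) == (old.name, old.dtype, old.not_null)


same = column_match(Column("name", "VARCHAR", True), Column("name", "VARCHAR", True), set())
type_differs = column_match(Column("name", "TEXT", True), Column("name", "VARCHAR", True), set())
fk_new = Column("patient_id", "INT", references="Patient")
fk_old = Column("pid", "INT", references="Patient")
fk_same_target = column_match(fk_new, fk_old, {("Patient", "Patient")})
fk_unmatched_target = column_match(fk_new, fk_old, set())
print(same, type_differs, fk_same_target, fk_unmatched_target)
```

Note that the foreign-key rule ignores the column label: pid and patient_id match because both point at tables that TableMatch accepted as the same.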

Page 29: Dissertation Proposal Presentation

5. Extension of the Existing Database

Add unmapped tables, columns, and values.
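The extension step can be sketched as a merge over the classification result. This is a simplification under stated assumptions: schemas are dictionaries from table label to a set of column labels, value nodes are omitted, unmapped tables are assumed to have unique labels (clashes having been renamed earlier), and the emitted SQL-like strings are illustrative only. The VitalSigns columns follow the example slide; the Allergy table and the name extend are hypothetical.

```python
def extend(existing, new_graph, mapped_tables):
    """Return an evolved copy of `existing` plus the list of changes applied.

    Mapped tables gain only their unmapped columns; unmapped tables are
    added whole.
    """
    evolved = {table: set(cols) for table, cols in existing.items()}
    changes = []
    for table, cols in new_graph.items():
        if table in mapped_tables:
            for col in sorted(cols - evolved[table]):
                evolved[table].add(col)
                changes.append(f"ALTER TABLE {table} ADD COLUMN {col}")
        else:
            evolved[table] = set(cols)
            changes.append(f"CREATE TABLE {table} ({', '.join(sorted(cols))})")
    return evolved, changes


existing = {
    "Patient": {"ID", "Name", "Phone", "DOB"},
    "VitalSigns": {"ID", "PatientID", "Date", "Height", "Weight"},
}
new_graph = {
    "VitalSigns": {"ID", "PatientID", "Date", "Height", "Weight", "BP", "SmokingStat"},
    "Allergy": {"ID", "PatientID", "Substance"},  # hypothetical unmapped table
}
evolved, changes = extend(existing, new_graph, mapped_tables={"VitalSigns"})
print("\n".join(changes))
```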

Page 30: Dissertation Proposal Presentation

Preliminary Evaluation

Usability Experiments
Mapping Experiments
Contributions

Implementation: MySQL, Java, JSP, JavaScript, HTML, CSS, Lucene Indexing Package, yFiles Package

Page 31: Dissertation Proposal Presentation

Usability Evaluation – User Study

Participants and Tasks
  5 nurse professionals
    No knowledge of databases
    Moderate computer users
    Familiar with paper-based forms
  2 tasks
    Build task: Replicate a paper-based form on the system.
    Model and build task: Model and build a given need (in natural language) into a form using the system interface.

Study Settings
  2 rounds (form scale = no. of steps to design a form)
    Round 1: Small-scale needs
      Avg. form scale = 17
      Generated avg. 4.2 relations, 5.8 non-key attributes, 1.8 values, and 3.2 foreign key references
    Round 2: Large-scale needs
      Avg. form scale = 47.4
      Generated avg. 6.2 relations, 13.8 attributes, 10.4 values, and 4.6 foreign key references

Page 32: Dissertation Proposal Presentation

MEASUREMENTS

Duration Ratio = Time (in min) / Form Scale (# of steps to build form)

Assistance Ratio = # of assistances sought / Form Scale (# of steps to build form)

Outliers:
  P3: considered design alternatives (high duration ratio)
  P5: had difficulty with form terminology (needed more assistance)
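The two ratios are simple normalizations by form scale, as the sketch below shows. The inputs are illustrative values, not measurements from the study: only the form scale of 17 echoes the reported Round 1 average, while the 8.5 minutes and 2 assistance requests are invented for the arithmetic.

```python
def duration_ratio(minutes, form_scale):
    """Minutes spent per design step."""
    if form_scale <= 0:
        raise ValueError("form scale must be positive")
    return minutes / form_scale


def assistance_ratio(assistances, form_scale):
    """Assistance requests per design step."""
    if form_scale <= 0:
        raise ValueError("form scale must be positive")
    return assistances / form_scale


form_scale = 17  # steps needed to build the form
d = duration_ratio(8.5, form_scale)   # hypothetical 8.5 minutes
a = assistance_ratio(2, form_scale)   # hypothetical 2 requests for help
print(d, round(a, 3))
```

Dividing by form scale makes small and large forms comparable, which is what allows the round 1 and round 2 figures to be compared.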

Page 33: Dissertation Proposal Presentation

Findings

Effectiveness: In 19/20 cases, participants finished the tasks with 100% effectiveness. The unsuccessful case: a building error committed by a participant who skipped a component while building forms.

Efficiency: Duration ranged from 1 to 9 minutes for simple small-scale needs, and from 7 to 19 minutes for advanced large-scale needs. Exception: a participant who considered several design alternatives.

System Adoption
  Efficiency: Consistently improved from round 1 to round 2.
  Confidence: Very confident in specifying small-scale needs for both tasks. Improved from round 1 to round 2 for the build task. Did not improve from round 1 to round 2 for the model-and-build task.
  Understanding: Improved greatly in round 2. Participants started synthesizing their knowledge of form concepts and domain knowledge to consider different design alternatives.

Comparison with a Related Work
  AppForge (Yang et al. 2008): Users are required to create forms and expressive views and are exposed to the existing schema. In our work, users only create forms and mapping is handled by the system.

Page 34: Dissertation Proposal Presentation

Mapping Experiment Set 1

Experiments on 5 industrial domains.

For each domain: designed certain forms and used the mapping algorithms to evolve a database.

S.No.  Domain      Form  Table  Attr  Val.  FK
D1     DVD Store   8     22     27    6     27
D2     Charity     6     14     17    0     14
D3     Library     7     18     19    2     17
D4     Automobile  7     16     17    0     17
D5     Insurance   4     14     22    8     13

Compared with a gold standard (found on the Web) developed by experts:

S.No.  Tab.  Attr  Val.  FK
D1     +8    0     0     +16, -8
D2     +6    0     NA    +12, -6
D3     +6    0     0     +12, -6
D4     +6    0     NA    +12, -6
D5     +5    0     0     +10, -5

+ indicates an extra element
- indicates a missing element
No sign indicates a perfect match

Page 35: Dissertation Proposal Presentation

Analyzing Inaccuracies and System Enhancement

Added another layer of interaction: to disambiguate cardinality between 2 entities.

[Figure: two M:M relationship examples.]

Result: All the databases were identical to the respective gold standard databases.

Inference: The mapping algorithms have the ability to generate databases in industrial domains.

[Figure: bar chart of % reduction in tables and % reduction in joins for domains D1 to D5; vertical axis from 0 to 50.]

Page 36: Dissertation Proposal Presentation

Mapping Experiment Set 2

For each domain: performed mapping experiments with at least 5 different sequences of forms (representing different merging situations).

Form Sequence     Resultant Database
F1, F2, F3, F4    D1
F2, F4, F1, F3    D2
F1, F4, F3, F2    D3
…                 …

Result: All the databases generated from different sequences are identical to each other and to the gold standard databases.

Inference: The mapping algorithms have the ability to evolve databases in industrial domains in a variety of merging situations.
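The order-independence check in this experiment can be sketched with a toy merge. A per-table union of columns stands in for the actual mapping algorithms, and the four library-style forms are hypothetical; the point is only the shape of the test, which evolves a database under every form ordering and compares the results.

```python
from itertools import permutations


def merge(db, form):
    """Toy stand-in for map-and-merge: union the form's columns into the
    same-named table, creating the table if needed."""
    evolved = {table: set(cols) for table, cols in db.items()}
    for table, cols in form.items():
        evolved.setdefault(table, set()).update(cols)
    return evolved


def evolve(forms):
    """Apply a sequence of forms to an initially empty database."""
    db = {}
    for form in forms:
        db = merge(db, form)
    return db


def freeze(db):
    """Hashable snapshot of a schema so that results can be compared."""
    return frozenset((table, frozenset(cols)) for table, cols in db.items())


# Hypothetical forms F1..F4 for a library-like domain.
forms = [
    {"Member": {"id", "name"}},
    {"Book": {"id", "title"}},
    {"Loan": {"id", "member_id", "book_id", "due_date"}},
    {"Member": {"id", "phone"}, "Book": {"id", "isbn"}},
]
sequences = list(permutations(forms))
results = {freeze(evolve(seq)) for seq in sequences}
print(len(sequences), "sequences,", len(results), "distinct database(s)")
```

With a real merge that renames or splits tables depending on thresholds, this property is not automatic, which is why the experiment tests it.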

Page 37: Dissertation Proposal Presentation

Current and Predicted Contributions

Introducing the form-to-database mapping algorithms, driven by data-quality principles.

Mapping experiments on 5 domains: the system has the potential
  to generate high-quality databases in industrial settings solely based on user-designed forms and user-provided domain knowledge, and
  to evolve existing databases in a variety of merging situations.

Usability study: the system has the potential to be adopted by non-technical users while providing them efficiency and effectiveness in form modeling.

Page 38: Dissertation Proposal Presentation

What Next?

Possible Research Experiments
Other Research Areas/System Refinement
Plan for Thesis Completion

Page 39: Dissertation Proposal Presentation

Possible Research Experiments (in healthcare domain)

Experiment Scenario 1
  Have multiple clinicians evolve a new database using different forms representing different kinds of information.
  Alter form and database complexity.
  Guided vs. unguided.

Experiment Scenario 2
  Have a clinician evolve an existing database based on new needs represented in multiple forms.
  Alter form and database complexity.
  Guided vs. unguided.

Guided: the user is provided with specific needs.

Unguided: the user is only given a context and comes up with her own needs.

Page 40: Dissertation Proposal Presentation

Scope for Other Research Areas and System Refinement

Form Design Interface
  Design recommendation
  Different form patterns used in industrial domains
  Modify existing form

Form Filling Component
  Data recommendation

Tree Generation
  Handle elsewhere-designed forms by combining existing form information extraction techniques (SIGMOD Record Survey, 2010)

Birthing Algorithm
  Derive weak entities, generalization

Merging Algorithm – Label Matching
  Match synonyms, hyponyms, etc.

Page 41: Dissertation Proposal Presentation


Plan for Dissertation Completion

Thank you!