Extracting Schema from Semistructured Data
Nestorov, Abiteboul, and Motwani at Stanford
Perspective
• This paper is new work.
• More than the details, look at the issues:
  – What are their goals?
  – What does this contribute?
  – Do they attain their goals?
  – Why do we need this?
Sample Database
[Figure: sample database drawn as an edge-labeled graph over objects 1 to 11. Edge labels: Name, Entree, Manager, CompanyName, Phone, Hours. Atomic values: “The Keg”, “Steak”, “Jim”, “BurgerKing”, “Fries”, “AA+Management”, 543-7798, 24.]
Schema = Types
Where does semistructured data come from?
• Document collections
• Biological data
• HTML
• Bibtex, etc.
Who needs structure?
• For the user
  – To know what queries are possible
  – Browsing the database
  – Type checking
• Storage
  – Data layout to facilitate querying
    • E.g. place similar objects on same page
  – Indexes
Who Needs Structure?(2)
• Query optimization
  – All the relational query optimization tricks
    • Maintaining statistics per data type
      – Cardinality, # of pages, index cardinality, etc.
    • Estimating the cost/size of result of query plans
  – Efficient processing of path expressions
• Other?
Their Goals
Approximate typing (schema extraction) of semistructured data.
Example (little lie) typing program:
Restaurant(X) :- Link(X,A,B,C) & Name-atom(A) & Entrée-atom(B) & Manager-atom(C)
Outline of the Algorithm
Given a database:
1. Find the perfect typing program.
  – This typing might be too large, so we:
2. Coalesce similar types into k types.
3. Assign a type to objects in database.
4. Deduce meaningful names for the types.
Typing
[Figure: object 7 with Name, Entree, and Manager links to objects 8, 9, and 10, whose atomic values are “The Keg”, “Steak”, and “Jim”.]
The two base relations:
- link(FromObj, ToObj, Label)
- atomic(Obj, Value)
These are the only two EDB’s of the typing program.
Restaurant(X) :- link(X,A,Name) & atomic(A, Ap) &
                 link(X,B,Entrée) & atomic(B, Bp) &
                 link(X,C,Manager) & atomic(C,Cp)
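A minimal Python sketch of how the Restaurant rule could be evaluated against the two base relations. It is not the authors' implementation: the function name `restaurant` and the object ids 9 and 10 are assumptions made for illustration.

```python
# EDB facts, using the slide's argument order: link(FromObj, ToObj, Label).
# Only link(7, 8, Name) and atomic(8, "The Keg") are given in the slides;
# the ids of the other two children are assumed.
link = {(7, 8, "Name"), (7, 9, "Entree"), (7, 10, "Manager")}
atomic = {(8, "The Keg"), (9, "Steak"), (10, "Jim")}


def restaurant(link, atomic):
    """Return every X that satisfies the body of the Restaurant rule."""
    atomic_objs = {obj for obj, _value in atomic}
    required = {"Name", "Entree", "Manager"}
    extension = set()
    for x in {src for src, _dst, _label in link}:
        # Labels of X's outgoing links that end in an atomic object.
        labels = {label for src, dst, label in link
                  if src == x and dst in atomic_objs}
        if required <= labels:
            extension.add(x)
    return extension


print(restaurant(link, atomic))  # {7}
```

Because the rule body is checked as a set of labels, an object with two Entree links qualifies exactly as an object with one does.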
Typing 2
Restaurant(X) :- link(X,A,Name) & atomic(A, Ap) &
                 link(X,B,Entrée) & atomic(B, Bp) &
                 link(X,C,Manager) & atomic(C,Cp)
EDB: link(7, 8, Name), atomic(8, “The Keg”)
IDB: (intensional relations) defined by the typing program
Extension of an IDB: Restaurant(1)
Restriction on Types
Arbitrary type programs are not allowed.
Rules for type_i(X) can only be built from the following:
1. link(Y, X, c) & type_j(Y)
2. link(X, Y, c) & type_j(Y)
3. link(X, Y, c) & atomic(Y, Z)
Types can only express local characteristics.
The collection of typed links is a set (2 entrées = 1 entrée).
[Figure: an object X with typed links labeled c_j, c_j, and c_0.]
Semantics of Type Program
The greatest fixpoint of a datalog program on a database defines the semantics of the typing.
Fixpoint = Extensions of IDB’s + EDB’s
– Least fixpoint
  • start with model of only EDB’s
  • at each step union into the model anything new.
Greatest Fixpoint
1. Start with a model of EDB’s and all possible extensions.
2. At each step, remove any extensions not derived by applying the rules.
Least fixpoint doesn’t work:
person(X) :- link(X, Y, is-manager-of) & firm(Y) & link(X, Yp, name) & atomic(Yp, Z)
firm(X) :- link(X, Y, is-managed-by) & person(Y) & link(X, Yp, name) & atomic(Yp, Z)
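The two rules depend on each other, so a least fixpoint that starts from empty extensions never derives anything: person needs a firm and firm needs a person. The sketch below evaluates both fixpoints on a tiny database to show the difference. The rule encoding (`"out"`, `"in"`, `"atom"` tuples), the function names, and the four-object database are assumptions made for illustration, not taken from the paper.

```python
# A rule body is a list of typed links (encoding is illustrative):
#   ("out", label, T): link(X, Y, label) & T(Y)
#   ("in", label, T):  link(Y, X, label) & T(Y)
#   ("atom", label):   link(X, Y, label) & atomic(Y, Z)
# Link facts follow the slides' order: (FromObj, ToObj, Label).


def satisfied(x, body, ext, link, atomic_objs):
    """Check whether object x satisfies every typed link in the body."""
    for cond in body:
        if cond[0] == "out":
            ok = any(s == x and l == cond[1] and d in ext[cond[2]]
                     for s, d, l in link)
        elif cond[0] == "in":
            ok = any(d == x and l == cond[1] and s in ext[cond[2]]
                     for s, d, l in link)
        else:
            ok = any(s == x and l == cond[1] and d in atomic_objs
                     for s, d, l in link)
        if not ok:
            return False
    return True


def fixpoint(rules, link, atomic_objs, greatest):
    """Iterate the rules from empty (least) or full (greatest) extensions."""
    objects = {s for s, _, _ in link} | {d for _, d, _ in link}
    ext = {t: (set(objects) if greatest else set()) for t in rules}
    while True:
        new = {t: {x for x in objects
                   if satisfied(x, body, ext, link, atomic_objs)}
               for t, body in rules.items()}
        if new == ext:
            return ext
        ext = new


# Hypothetical database: object 1 manages object 2, and both have a name.
link = {(1, 2, "is-manager-of"), (2, 1, "is-managed-by"),
        (1, 3, "name"), (2, 4, "name")}
atomic_objs = {3, 4}
rules = {
    "person": [("out", "is-manager-of", "firm"), ("atom", "name")],
    "firm": [("out", "is-managed-by", "person"), ("atom", "name")],
}

print(fixpoint(rules, link, atomic_objs, greatest=False))
print(fixpoint(rules, link, atomic_objs, greatest=True))
```

On this database the least fixpoint leaves both extensions empty, while the greatest fixpoint keeps object 1 as a person and object 2 as a firm.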
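Imperfect Types
Defect: a measure of how well an object fits a given type.
= Excess + deficit
type1 = name0 + entree0 + manager0
Defect is 2 for assigning 11 to type1.
[Figure: two objects. Object 7 has Name, Entree, and Manager links to objects 6, 5, and 4; object 11 has Name, Entree, and # seats links to objects 10, 9, and 8. Atomic values shown: “McD”, “Steak”, “Jim”, “The Keg”, “biscuit”, 53.]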
Imperfect Types(2)
[Figure: the same two objects, 7 and 11, as on the previous slide.]
• Excess: # of EDB’s not used to validate any object’s type.
• Deficit: Minimum # of ground facts that need to be added to make all type derivations possible.
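A simplified, single-object sketch of the defect computation. It assumes that both an object and a type can be summarized as a set of typed links to atomic values, and that type1 consists of name0, entree0, and manager0; the function name `defect` and the label `seats0` are illustrative. The definitions above are stated over the whole database, so this covers only the local case.

```python
def defect(object_links, type_links):
    """Defect of assigning one object to one type, as excess + deficit."""
    # Excess: links the object has that the type definition never uses.
    excess = object_links - type_links
    # Deficit: links the type requires that would have to be added.
    deficit = type_links - object_links
    return len(excess) + len(deficit)


# Assumed contents of type1 and of object 11 (Name, Entree, # seats).
type1 = {"name0", "entree0", "manager0"}
object_11 = {"name0", "entree0", "seats0"}

print(defect(object_11, type1))  # 2
```

The # seats link is the one excess fact and the missing manager link is the one deficit, which gives the defect of 2 stated for object 11.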
Perfect Typing Program (Stage 1)
Gore.
Multiple Roles
[Figure: three objects O1, O2, and O3 with Name, Country, Team, and Movie links. Values shown: Scholes, England, Man Utd, Cantona, France, Binoche, Bleu, Star Trek, RockyHorror.]
How hard is it to choose types for the cover? How do you quantify atomization?
Clustering (Stage 2)
Define a distance function between two types:
A first approximation is the difference between the bodies of their rule definitions.
t1 :- a0, b2
t2 :- a0, b1
t3 :- b2, b1, b3
d(t1, t2) = 2
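A short sketch of the first-approximation distance, assuming that "difference" means the number of typed links that appear in exactly one of the two rule bodies (the symmetric difference). That reading matches d(t1, t2) = 2; the function name `distance` is illustrative.

```python
def distance(body1, body2):
    """Count the typed links that occur in exactly one of the two bodies."""
    return len(set(body1) ^ set(body2))


# Rule bodies from the slide.
t1 = {"a0", "b2"}
t2 = {"a0", "b1"}
t3 = {"b2", "b1", "b3"}

print(distance(t1, t2))  # 2
print(distance(t1, t3))  # 3
print(distance(t2, t3))  # 3
```

Under this reading t1 and t2 are the closest pair, so they would be the first candidates for coalescing.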
A Better Function
Include some measure of the weight of a type (# of objects of that type):
t2 ~> t1
(w1, w2)
Some desirable properties:
• increasing in d = coalesce similar types
• decreasing in w1 = compensate for ‘expected noise’
• increasing in w2 = maintain types with large extents
Choosing what to coalesce is hard!
Recasting (Stage 3)
Assign each object to types within the k types formed from stage 2.
(optional) Choose a better value of k and rerun step 2.
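The slides do not say how an object's types are chosen during recasting. One plausible sketch, under the assumption that each object goes to whichever of the k types it differs from least, is shown below. The function name `recast`, the type bodies, and the object ids are all hypothetical.

```python
def recast(objects, types):
    """Map each object to the type(s) whose body differs least from its links."""
    assignment = {}
    for obj, links in objects.items():
        # Same set-difference measure as the first-approximation distance.
        dist = {name: len(links ^ body) for name, body in types.items()}
        best = min(dist.values())
        # Ties are kept, so an object may be assigned more than one type.
        assignment[obj] = sorted(n for n, d in dist.items() if d == best)
    return assignment


# Hypothetical result of stage 2 with k = 2.
types = {
    "t1": {"name0", "entree0", "manager0"},
    "t2": {"companyname0", "phone0"},
}
# Hypothetical objects, each summarized by its set of typed links.
objects = {
    7: {"name0", "entree0", "manager0"},
    11: {"name0", "entree0", "seats0"},
    4: {"companyname0", "phone0"},
}

print(recast(objects, types))  # {7: ['t1'], 11: ['t1'], 4: ['t2']}
```

Object 11 lands in t1 even though it does not fit perfectly, which is the approximate typing the stage is meant to produce.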
Results
• Heavy use of synthetic data.
  – Create a type definition and generate instances that are perturbed randomly in some way.
• What do the graphs show?
  – Are the data sets realistic?
Conclusions
• Paper problems:
  – The algorithm isn’t completely explained.
  – Many comments are not elaborated.
• But, it’s an important problem and a good first approach.