data integration hanna zhong [email protected] department of computer science university of...
TRANSCRIPT
![Page 1: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/1.jpg)
Data Integration
Hanna [email protected]
Department of Computer ScienceUniversity of Illinois, Urbana-Champaign
11/12/2009
![Page 2: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/2.jpg)
2
Overview
• Motivation
• Problem Definition
• Data Integration Approaches– Virtual integration
– Data warehouse
• Issues
• Discussion
![Page 3: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/3.jpg)
3
Why Data Integration?
![Page 4: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/4.jpg)
4
Example: Apartment SearchFind an
apartment near Siebel Center
RamshawJSM Bankier
![Page 5: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/5.jpg)
5
mediated schema
RamshawJSM
source schema 2
Bankier
source schema 3source schema 1
wrapper wrapperwrapper
Find an apartment near Siebel Center
![Page 6: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/6.jpg)
6
mediated schema
RamshawJSM
source schema 2
Bankier
source schema 3source schema 1
wrapper wrapperwrapper
Apartment Search
Find an apartment near Siebel Center
![Page 7: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/7.jpg)
7
More Examples
• People Search– Build a yellowpage application on db people
• Many people doing database stuff in the US• How can we find information about a database
person, such as classes taught, publications, collaborators, etc?
– Homepages
– http://dblife.cs.wisc.edu/
![Page 8: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/8.jpg)
8
Example Systems
• Apartment Search
• DB People Search
• Etc…
![Page 9: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/9.jpg)
10
Overview
• Motivation
• Problem Definition
• Data Integration Approaches– Virtual integration
– Data warehouse
• Issues
• Discussion
![Page 10: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/10.jpg)
11
What is Data Integration?
The process of 1. Combining data from different data sources
• Data sources: – Databases, websites, documents, blogs, discussion
forums, emails, etc
2. Presenting a unified view of these data
Source
Source
…
Unified View
![Page 11: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/11.jpg)
12
What is Data Integration? (2)
• The process of 1. Combining data from different data sources
2. Presenting a unified view of these data
Source
Source
…
Results:
User Query Search
as if the data are from the SAME source
![Page 12: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/12.jpg)
13
Problem Definition
How can we access a set of heterogeneous, distributed, autonomous databases as if accessing a single database?
![Page 13: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/13.jpg)
14
Data Integration is Hard • Data sources are heterogeneous, distributed, and autonomous
– Sources Type• Relational database, text , xml, etc
– Query-Language• SQL queries, keyword queries, XQuery
– Schema• Databases have different schemas
– Data type & value• The same data are represented differently in different sources
– Type (e.g. time represented as varchar or timestamp)– Value (e.g. 8pm represented as 8:00pm or 20:00:00)
– Semantic• Words have different meanings at different sources (e.g. title)
– Communication • Some sources are accessed via HTTP, others FTP
![Page 14: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/14.jpg)
15
Overview
• Motivation
• Problem Definition
• Data Integration Approaches– Virtual integration
– Data warehouse
• Issues
• Discussion
![Page 15: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/15.jpg)
16
Event Search
• Provide a comprehensive search on Champaign-Urbana events in one place– Search events by its title, description, location
proximity, dates, venues, and/or data sources
![Page 16: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/16.jpg)
17
Virtual Integration Approach
• Leave the data in the sources
• When a query comes in:– Determine the relevant sources to the query– Break down the query into sub-queries for the
sources– Get the answers from the sources, and
combine them appropriately
• Data is fresh
![Page 17: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/17.jpg)
18
Virtual Integration Architecture
wrapper
Query
Mediated schema
Data sourcecatalog
Optimizer
Execution engine
Source SourceSource
wrapperwrapper
Mediator: Reformulation enginemeta-information about the sources (ie. source contents, what queries
are supported, etc)
Determine relevant sources
Break down the query into sub-queries for the sources
![Page 18: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/18.jpg)
19
Virtual Integration Architecture
wrapper
Query
Mediated schema
Data sourcecatalog
Reformulation engine
Optimizer
Execution engine
Source SourceSource
wrapperwrapper
Mediator:
Communicate with the data source and do format translationsGet the answers and format appropriately
![Page 19: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/19.jpg)
20
Virtual Integration Example
UIUC calendareventful.com zvents.com
Find upcoming events in Champaign-Urbana
schema 1 schema 2 schema 3
Events(title, location, description, cost, start, end)
Events(title, where, description, start, end)
Mediator:
Events(title, location, cost, start, end)
![Page 20: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/20.jpg)
21
Virtual Integration Example
UIUC calendareventful.com zvents.com
wrapper
Find upcoming events in Champaign-Urbana
schema 1 schema 2 schema 3
Events(title, location, description, cost, start, end)
Events(title, where, description, start, end)
Mediator:
title where description start end
Music Canopy Music 9pm 11pm
![Page 21: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/21.jpg)
22
Virtual Integration Example
UIUC calendareventful.com
wrapper
zvents.com
wrapperwrapper
Find upcoming events in Champaign-Urbana
schema 1 schema 2 schema 3
Events(title, location, description, cost, start, end)
Events(title, location, cost, start, end)
Mediator:
title location cost start end
Music Canopy Club $5 8pm 11pm
![Page 22: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/22.jpg)
23
Challenges
title where description start end
Music Canopy Music 9pm 11pm
title location Cost start end
Music Canopy Club $5 8pm 11pm
Events(title, location, description, cost, start, end)
Data Integration: the process of combining data from different data sources and presenting a unified view of these data
title location description cost start end
Schema Matching
![Page 23: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/23.jpg)
24
Challenges
title where description start end
Music Canopy Music 9pm 11pm
title location Cost start end
Music Canopy Club $5 8pm 11pm
Events(title, location, description, cost, start, end)
Data Integration: the process of combining data from different data sources and presenting a unified view of these data
title where description cost start end
![Page 24: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/24.jpg)
25
Virtual Integration Architecture
wrapper
Query
Mediated schema
Data sourcecatalog
Optimizer
Execution engine
Source SourceSource
wrapperwrapper
Mediator: Reformulation engine
Break down the query into sub-queries for the sources
![Page 25: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/25.jpg)
26
Virtual Integration
UIUC calendareventful.com
wrapper
zvents.com
wrapperwrapper
Query: Find upcoming events in Champaign-Urbana
schema 1 schema 2 schema 3
Events(title, location, description, cost, start, end)
Mediator:
subQuery 1 subQuery 2 subQuery 3
![Page 26: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/26.jpg)
27
Mediators
• Global-as-view– express the mediated schema relations as a
set of views over the data source relations
• Local-as-view– express the source relations as views over
the mediated schema
Mediated schema
Query
schema 1
schema 2
schema 3wrapper
wrapper
wrappersubQuery 1
subQuery 2
subQuery 3
![Page 27: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/27.jpg)
28
Global-as-View GAV• Express the mediated schema relations as a set of views over the data source
relations – The mediated schema is modeled as a set of views over the source schemas
• Design the mediated schema around the source schemas• Mediated schema:
Events(title, location, description, cost, start, end)• Source schema:
– S1: Events(title, where, description, start, end)– S2: Events(title, location, description, cost, start, end, performer)– S3: Events(title, location, cost, start, end)
• GAV:Create View Events ASselect title, where AS location, description, NULL, start, end from S1 UNION select title, location, description, cost, start, end from S2 UNION select title, location, NULL, cost, start, end from S3
![Page 28: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/28.jpg)
29
Global-as-View GAV (2)
• Adding sources is hard – The core work is on how to retrieve elements
from the source databases – Need to consider all other sources that are
available
![Page 29: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/29.jpg)
30
Global-as-View GAV (3)• Mediated schema:
Events(title, location, description, cost, start, end)Venues(location, city, state)
• Source schema:– S4: Events(title, description, city, state)
GAV:Create View Events AS select title, NULL, description, NULL, NULL, NULL from S4
Create View Venues AS select NULL, city, state from S4
What if we want to find events that are in Champaign?
![Page 30: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/30.jpg)
31
Local-as-View LAV• Express the source relations as views over the mediated schema• The mediated schema is already designed
– Create views on the source schemas• Mediated schema:
Events(title, location, description, cost, start, end)• Source schema:
– S1: Events(title, where, description, start, end)– S2: Events(title, location, description, cost, start, end, performer)– S3: Events(title, location, cost, start, end)
• LAV:Create View S1 select title, location AS where, description, start, end from Events
Create View S2 select title, location, description, start, end, NULL from Events
Create View S3 select title, location, cost, start, end from Events
![Page 31: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/31.jpg)
32
Local-as-View LAV (2)• Mediated schema:
Events(title, location, description, cost, start, end)Venues(location, city, state)
• Source schema:– S4: Events(title, description, city, state)
What if we want to find events that are in Champaign?LAV:
Create View S4select title, description, city, state from Events e, Venues v where e.location=v.location AND city=“champaign”
![Page 32: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/32.jpg)
33
Local-as-View LAV (3)
• Very flexible. – You have the power of the entire query
language to define the contents of the source.– Hence, can easily distinguish between
contents of closely related sources
• Adding sources is easy– They’re independent of each other
![Page 33: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/33.jpg)
34
Virtual Integration Example
UIUC calendareventful.com
wrapper
zvents.com
wrapperwrapper
Find upcoming events in Champaign-Urbana
schema 1 schema 2 schema 3
Events(title, location, description, cost, start, end)
Mediator:
![Page 34: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/34.jpg)
35
What if Schema is Unknown?
UIUC calendareventful.com
wrapper
zvents.com
wrapperwrapper
Find upcoming events in Champaign-Urbana
? ? ?
Events(title, location, description, cost, start, end)
Mediator:
![Page 35: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/35.jpg)
36
Query Interface
UIUC calendareventful.com
wrapper
zvents.com
wrapperwrapper
Find upcoming events in Champaign-Urbana
query interface 1 query interface 2 query interface 3
Mediator:global query interface
![Page 36: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/36.jpg)
37
Query Interface
![Page 37: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/37.jpg)
38
Query Interface
![Page 38: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/38.jpg)
39
Wrappers
global query interface
UIUC calendareventful.com
wrapper
zvents.com
wrapperwrapper
Find upcoming events in Champaign-Urbana
query interface 1 query interface 2 query interface 3
Communicate with the data source and do format translationsGet the answers and format appropriately
![Page 39: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/39.jpg)
40
Wrappers
![Page 40: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/40.jpg)
41
Wrappers (2)
• Once the query is submitted via the query interface, results are returned– Formatting is specific to each source
![Page 41: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/41.jpg)
42
Wrappers (2)
![Page 42: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/42.jpg)
43
Information Extraction
• What information to extract?
![Page 43: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/43.jpg)
44
Information Extraction (2)
• Complications…
![Page 44: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/44.jpg)
45
Wrappers
• Hard to build and maintain (very little science)• Major approaches
– Machine Learning– Data-intensive, completely-automatic
• Roadrunner http://portal.acm.org/citation.cfm?doid=564691.564778
• Data sources are accessed via query interfaces– Query interface to each data source is different
• Scalability– One wrapper per source vs one wrapper per domain
![Page 45: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/45.jpg)
46
Overview
• Motivation
• Problem Definition
• Data Integration Approaches– Virtual integration
– Data warehouse
• Issues
• Discussion
![Page 46: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/46.jpg)
47
Data Warehousing
• Load all the data periodically into a central database (warehouse)– Performance is good– Data may not be fresh– Need to clean, scrub you data
![Page 47: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/47.jpg)
48
Data Warehousing Architecture
Repository
Query
Data extractionprograms
Data cleaning/scrubbing
Source Source Source
Determine relevant sources
Collect and clean data
![Page 48: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/48.jpg)
49
Data Warehousing Example
…
Data Collection
Information Extraction Index
Repository
Query Parser Search Summarize
eventful.com
zvents.com
![Page 49: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/49.jpg)
50
Data Collection
• RSS feed
• Crawlers– Programs that browse the Web
in a methodical, automated manner
• Link followingExtract
Hyperlinks
URL List
![Page 50: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/50.jpg)
51
Crawlers (2)
![Page 51: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/51.jpg)
52
Data Warehousing Example
Source
Source
…
Data Collection
Information Extraction Index
Repository
What data to store? How to store?What data to index? What kind of
indexes? How to index?
How to search? How to rank?How to map a user query into one the system understands?
How to display results? What’s the view (ranked list, map, calendar… etc)
Query Parser Search Display
What data to extract?
![Page 52: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/52.jpg)
53
Data Integration Approaches
• Virtual integration– No data are collected offline– On a search, data are collected and processed from
various sources at runtime
• Data warehouse– Data are collected offline and stored in a central
repository– Search is performed on the repository
• When should we take the virtual integration approach?• When should we take the warehousing approach?
![Page 53: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/53.jpg)
54
Overview
• Motivation
• Problem Definition
• Data Integration Approaches– Virtual integration
– Data warehouse
• Issues
• Discussion
![Page 54: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/54.jpg)
55
Data Integration Issues
• Data collection– Wrappers, crawlers, RSS– Duplications, spams– Freshness, completeness, etc
• Information extraction– What information to extract?
• Schema matching • Query optimization• Query reformulation• Scalability
– When there are many sources out there, does the solution still work?
![Page 55: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/55.jpg)
56
Discussion
![Page 56: Data Integration Hanna Zhong hzhong@illinois.edu Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009](https://reader031.vdocuments.mx/reader031/viewer/2022013101/5697bf891a28abf838c8a40d/html5/thumbnails/56.jpg)
57
References
• Some slides taken from Professor Anhai Doan, from FALL2005 CS511, UIUC