![Page 1: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/1.jpg)
Information Mediation: Integrating Information from Multiple Information Sources
Naveen Ashish
Amit P. Sheth
Department of Computer Science and
Large Scale Distributed Information Systems Lab
University of Georgia, Athens
![Page 2: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/2.jpg)
What is an Information Agent/Mediator?
A software system that provides integrated and structured query access to multiple distributed information sources
Sources may be databases of various kinds or Web sources
Sources are autonomously created and heterogeneous
Accessible via a network
Mediator provides the illusion of a single information source
![Page 3: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/3.jpg)
Information Agents aka Mediators
Example: Restaurant and Theatre Info on the Web
[Figure: the Ariadne mediator integrating Web sources: map servers, geocoders, movies, Zagat, and health ratings]
![Page 4: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/4.jpg)
Why the Interest in Building Such Systems?
[Figure: a MEDIATOR layered over Oracle, IBM DB2, Sybase, an object-oriented DB, and a legacy system]
![Page 5: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/5.jpg)
Mediators on the Web
[Figure: a MEDIATOR connected to a Wrapper and to databases DB1 and DB2]
![Page 6: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/6.jpg)
Organization of Remainder of Talk
Introduction
– Information Agents, System Architecture
Research Issues
– Information Modeling
– Query Planning
– Semi-automatic Wrapper Generation
– Performance Optimization by Materialization
– Resolving Inconsistencies
Industry Products for Data Extraction and Integration
Start-up Ventures
![Page 7: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/7.jpg)
Representative Systems (Research Projects)
SIMS/Ariadne, University of Southern California/ISI
TSIMMIS, Stanford
Information Manifold, AT&T Research
Garlic, IBM Almaden
Tukwila, University of Washington
InfoSleuth, MCC
DISCO, University of Maryland/INRIA
HERMES, University of Maryland
InfoMaster, Stanford
InfoQuilt, University of Georgia
![Page 8: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/8.jpg)
Information Modeling
Multiple, heterogeneous, autonomously created information sources
Users see an integrated (global) view
– The user queries a “mediated schema”
A uniform model for all sources
– Must be (at least) expressive enough to model the most complex information source
Each source provides a set of relations or classes
– Translation (model) is done by a wrapper at each source
Integration
– Global as view, Local as view
![Page 9: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/9.jpg)
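Global as View
For each relation (class) in the mediated schema, we specify how to obtain its tuples from the sources.
[Figure: the mediated relation RESTAURANT defined over the sources ZAGAT and FODORS. Attribute boxes shown: ZAGAT (Name, Address, Telephone); FODORS (Name, Phone #, Reviews); DOH Ratings (Name, Rating); GEOCODER (Address, Lat, Lon); and a box listing Name and Phonenumber.]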
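A minimal sketch of the global-as-view idea, assuming two in-memory lists stand in for the wrapped ZAGAT and FODORS sources. The function name `restaurant_view`, the join on restaurant name, and the sample review text are illustrative assumptions, not taken from the slides.

```python
# Hypothetical sample tuples standing in for wrapped sources.
ZAGAT = [
    {"name": "Chinois", "address": "2720 Main St", "telephone": "310-777-9876"},
    {"name": "Peking Star", "address": "1 Broad St", "telephone": "213-999-7676"},
]
FODORS = [
    {"name": "Chinois", "phone": "310-777-9876", "review": "Inventive menu"},
]


def restaurant_view(zagat, fodors):
    """Global-as-view definition: RESTAURANT is computed from the sources."""
    # Index FODORS reviews by restaurant name (assumed join key).
    reviews = {row["name"]: row["review"] for row in fodors}
    return [
        {
            "name": z["name"],
            "address": z["address"],
            "telephone": z["telephone"],
            "review": reviews.get(z["name"]),  # None when FODORS has no entry
        }
        for z in zagat
    ]


restaurants = restaurant_view(ZAGAT, FODORS)
```

A query against RESTAURANT is then answered by unfolding this definition, which is why query processing is comparatively direct under global as view.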
![Page 10: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/10.jpg)
Heterogeneity Resolution
Sources may use different models
– OO, relational, legacy, ...
– May be Web sources
– Wrapper “exports” contents in a uniform model
Structural and schematic differences
– (name, address) versus (name, street, city, state, zip)
Semantic differences
– (name, phonenumber) versus (name, telephone)
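The two kinds of difference above can be sketched as a small normalization step of the sort a wrapper might perform. This is only an illustration: the target form (name, address, phonenumber), the function name `normalize`, and the city, state, and zip values are assumptions made for the example.

```python
def normalize(record):
    """Map a source record onto the uniform (name, address, phonenumber) form."""
    out = {"name": record["name"]}
    # Structural difference: some sources split the address into parts.
    if "address" in record:
        out["address"] = record["address"]
    else:
        parts = (record.get(k) for k in ("street", "city", "state", "zip"))
        out["address"] = ", ".join(p for p in parts if p)
    # Semantic difference: two attribute names for the same concept.
    out["phonenumber"] = record.get("phonenumber", record.get("telephone"))
    return out


a = normalize({"name": "Peking Star", "address": "1 Broad St",
               "phonenumber": "213-999-7676"})
# City, state, and zip below are made-up sample values.
b = normalize({"name": "Chinois", "street": "2720 Main St",
               "city": "Santa Monica", "state": "CA", "zip": "90405",
               "telephone": "310-777-9876"})
```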
![Page 11: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/11.jpg)
Global as View: Models
KR-based models (SIMS, Ariadne, ...)
– LOOM, CLASSIC
OO, based on ODMG (DISCO, Garlic, ...)
interface Restaurant {
    attribute string name;
    attribute string address;
    attribute string cuisine;
    attribute string review;
}
extent restaurant0 of Restaurant wrapper w0 repository r0
    map ((zagts0=restaurant0) (name=n) (address=a) (cuisine=c))
![Page 12: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/12.jpg)
Local as View
For every information source S, describe it in terms of relations in the mediated schema.
v1(name, address, cuisine, rating) :- Restaurant(name, address, cuisine, rating) ^ city = “Santa Monica”
v2(name, foodrating) :- Restaurant(name, address, cuisine, rating)
...
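Under local as view, the mediator has to work out which source descriptions can contribute to a query. The sketch below is a deliberately simplified relevance check, not a full algorithm for answering queries using views. The dictionary layout, the function name `relevant_sources`, the use of `rating` for v2's exported rating attribute, and the city "Pasadena" are assumptions for illustration.

```python
# Each source is described as a view over the mediated relation
# Restaurant(name, address, cuisine, rating): the attributes it exports
# and the selection constraints its contents are known to satisfy.
SOURCES = {
    "v1": {"exports": {"name", "address", "cuisine", "rating"},
           "constraints": {"city": "Santa Monica"}},
    "v2": {"exports": {"name", "rating"}, "constraints": {}},
}


def relevant_sources(wanted, conditions):
    """Return sources that export every wanted attribute and whose
    constraints do not contradict the query's equality conditions."""
    hits = []
    for name, desc in SOURCES.items():
        if not wanted <= desc["exports"]:
            continue  # source cannot supply all requested attributes
        contradicts = any(
            attr in conditions and conditions[attr] != value
            for attr, value in desc["constraints"].items()
        )
        if not contradicts:
            hits.append(name)
    return hits


both = relevant_sources({"name", "rating"}, {"city": "Santa Monica"})
none = relevant_sources({"name", "address"}, {"city": "Pasadena"})
```

Adding a new source in this style only requires a new entry in `SOURCES`; the mediated schema itself stays unchanged.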
![Page 13: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/13.jpg)
Query Planning and Optimization
Mediator must generate an information gathering plan
Constraints on execution
– Binding patterns ...
Optimization of query plans
Current areas of work
– Optimization
– Approximate answers (incomplete sources)
– Query planning for other sources such as simulations, computer programs, etc.
– Query execution engines
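Binding patterns constrain the order in which sources can be called: a source that needs an input value can only run after some earlier step has produced it. Here is a minimal sketch of a greedy ordering under that constraint. The function name `order_sources` and the binding patterns given for the two sources are assumptions for illustration, and the greedy choice ignores plan cost.

```python
def order_sources(sources, bound):
    """Greedily order source accesses so each source's required inputs
    are bound before it is called. `sources` maps a source name to
    (required_inputs, produced_outputs)."""
    bound = set(bound)
    remaining = dict(sources)
    plan = []
    while remaining:
        ready = [s for s, (needs, _) in remaining.items() if needs <= bound]
        if not ready:
            raise ValueError(f"no executable source among {sorted(remaining)}")
        chosen = ready[0]
        plan.append(chosen)
        # Outputs of the chosen source become available to later steps.
        bound |= remaining.pop(chosen)[1]
    return plan


# Assumed binding patterns: the geocoder needs an address as input;
# the restaurant source can be queried by cuisine.
SOURCES = {
    "geocoder": ({"address"}, {"lat", "lon"}),
    "zagat": ({"cuisine"}, {"name", "address", "telephone"}),
}
plan = order_sources(SOURCES, bound={"cuisine"})
```

With only `cuisine` bound by the query, the restaurant source must run first so that its addresses can feed the geocoder.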
![Page 14: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/14.jpg)
Query Plans and Plan Quality
[Figure: a low-quality plan contrasted with a high-quality plan]
![Page 15: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/15.jpg)
Accessing Sources via Wrappers
SELECT address, tel
FROM Restaurant
WHERE cuisine = 'chinese'

Chinois, 2720 Main St, 310-777-9876
Peking Star, 1 Broad St, 213-999-7676
.....
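A hand-written wrapper for a query like the one above might look roughly as follows. The page markup, the regular expression, the third restaurant row, and the function name `wrapper` are all hypothetical; a real site's HTML will differ.

```python
import re

# Hypothetical page fragment; a real site's markup will differ.
PAGE = """
<li><b>Chinois</b> (chinese) 2720 Main St, 310-777-9876</li>
<li><b>Peking Star</b> (chinese) 1 Broad St, 213-999-7676</li>
<li><b>Trattoria Roma</b> (italian) 5 Elm St, 213-555-0100</li>
"""

# One pattern per listing row: name, cuisine, address, telephone.
ROW = re.compile(
    r"<li><b>(?P<name>[^<]+)</b> \((?P<cuisine>[^)]+)\) "
    r"(?P<address>[^,]+), (?P<tel>[\d-]+)</li>"
)


def wrapper(page, cuisine):
    """Answer SELECT address, tel FROM Restaurant WHERE cuisine = ?"""
    rows = (m.groupdict() for m in ROW.finditer(page))
    return [(r["address"], r["tel"]) for r in rows if r["cuisine"] == cuisine]


result = wrapper(PAGE, "chinese")
```

The extraction pattern is tied to one page layout, which is the maintenance burden that motivates generating wrappers semi-automatically.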
![Page 16: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/16.jpg)
Semi-Automatic Wrapper Generation
Need wrappers for several sites
– Building wrappers by hand is tedious and time-consuming
Approaches to automating the process
– Exploit format information (structure, HTML, etc.)
– Template-based approaches
– Machine learning techniques
XML
<name> Peking Star </name>
<address> 1 Broad Street, Los Angeles </address>
<phone> 31-822-1511 </phone>
![Page 17: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/17.jpg)
Wrappers ... Work in Progress
Database wrappers
Variety of techniques for Web wrappers
“Upmarking”
– To XML
Building “Web-bases”
Other Artificial Intelligence techniques
– Natural Language Processing
– IR
– Classifiers
![Page 18: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/18.jpg)
Performance Issue
Query processing time is typically very high
Despite the mediator generating efficient query plans
Cost of fetching data and pages from remote sources dominates
– Typically have to fetch a large number of Web pages
– The Web sources are not designed for database-like query access
– The Web sources can be slow
Performance can be improved further by materializing data at the mediator side.
![Page 19: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/19.jpg)
Store and Materialize Data Locally
[Figure: the MEDIATOR answers queries either from the wrapped Web source (SLOW) or from locally materialized data (FAST)]
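The fast path and slow path can be sketched as below. The class layout, the `slow_source` stand-in, and the counter used to show which path was taken are assumptions for illustration only.

```python
class Mediator:
    """Answer from locally materialized data when possible,
    otherwise fall back to the (slow) wrapped source."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote
        self.materialized = {}
        self.remote_calls = 0  # counts queries that took the slow path

    def materialize(self, key):
        # Prefetch and store the data for one class of queries.
        self.materialized[key] = self.fetch_remote(key)

    def query(self, key):
        if key in self.materialized:
            return self.materialized[key]  # fast path
        self.remote_calls += 1
        return self.fetch_remote(key)  # slow path


def slow_source(cuisine):
    # Stand-in for a wrapped Web source.
    data = {"chinese": ["Chinois", "Peking Star"]}
    return data.get(cuisine, [])


m = Mediator(slow_source)
m.materialize("chinese")
fast = m.query("chinese")
slow = m.query("thai")
```

Note that `query` does not store results on its own here; what gets materialized is decided separately.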
![Page 20: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/20.jpg)
Selective Materialization
Why not simply materialize all the data in all the Web sources being integrated and have a really fast mediator?
– Will not scale; the amount of space needed may be too large
– Web sources can get updated; the cost of keeping data consistent can become prohibitive
– We are building a mediator, not a data warehouse!
The approach then is to selectively materialize data
How do we automatically identify the portion of data most useful to materialize?
![Page 21: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/21.jpg)
Selecting Data to Materialize
[Figure: three factors feed the SELECTING CLASSES step, which outputs the classes of data to materialize]
– Distribution of user queries (identify frequently accessed classes)
– Structure of sources (prefetch data to speed up expensive queries)
– Updates (have to consider maintenance cost)
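One way to combine these three factors is a simple greedy heuristic: estimate each candidate class's net benefit (query frequency times remote cost, minus maintenance cost) and pick the best ones per unit of space until a budget is used up. This scoring rule, the field names, and the sample numbers are assumptions for illustration; the slides do not specify the selection algorithm.

```python
def select_classes(candidates, space_budget):
    """Pick classes to materialize. Each candidate is a dict with keys
    name, queries_per_day, remote_cost, update_cost, size."""
    def net_benefit(c):
        # Time saved on queries minus the cost of keeping the copy fresh.
        return c["queries_per_day"] * c["remote_cost"] - c["update_cost"]

    ranked = sorted(
        (c for c in candidates if net_benefit(c) > 0),
        key=lambda c: net_benefit(c) / c["size"],
        reverse=True,
    )
    chosen, used = [], 0
    for c in ranked:
        if used + c["size"] <= space_budget:
            chosen.append(c["name"])
            used += c["size"]
    return chosen


# Made-up workload statistics.
CANDIDATES = [
    {"name": "chinese_restaurants", "queries_per_day": 200,
     "remote_cost": 3.0, "update_cost": 50.0, "size": 10},
    {"name": "movie_showtimes", "queries_per_day": 150,
     "remote_cost": 2.0, "update_cost": 400.0, "size": 5},
    {"name": "health_ratings", "queries_per_day": 20,
     "remote_cost": 8.0, "update_cost": 10.0, "size": 40},
]
chosen = select_classes(CANDIDATES, space_budget=30)
```

In this sample, the frequently updated class is rejected because its maintenance cost outweighs the query savings, and the large class does not fit in the remaining space.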
![Page 22: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/22.jpg)
Inconsistency Resolution
Same object in different formats
– “United States” and “US”
– “Red Lobster” and “The Red Lobster”
– “John Smith”, “Smith, J.”, “J. Smith”, “Dr. John Smith” ...
Has appeared in other database and IR contexts
Solutions
– Mapping tables
  For finite domains (such as cities, countries, companies, ...)
  Simply maintain an enumerated list of possible formats for each object (New York, N.Y., NYC, New York City, Big Apple)
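A mapping table of this kind can be sketched as a dictionary from each canonical object to its known formats, inverted once for lookup. The names `MAPPING_TABLE`, `CANONICAL`, and `canonicalize`, and the choice to ignore case and surrounding whitespace, are assumptions for the example.

```python
# Enumerated formats for each object in a finite domain.
MAPPING_TABLE = {
    "New York": ["New York", "N.Y.", "NYC", "New York City", "Big Apple"],
    "United States": ["United States", "US"],
}

# Invert the table once so every known variant points at its canonical form.
CANONICAL = {
    variant.lower(): canonical
    for canonical, variants in MAPPING_TABLE.items()
    for variant in variants
}


def canonicalize(value):
    """Return the canonical form, or the input unchanged if unknown."""
    return CANONICAL.get(value.strip().lower(), value)
```

For instance, `canonicalize("NYC")` and `canonicalize("Big Apple")` both map to the same object, so tuples from different sources can be joined on it.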
![Page 23: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/23.jpg)
Mapping Functions
Mapping functions
– When the domain is not finite (person names)
– Domain-specific mapping transformations
  Stemming common words (Inc., Corp., The, etc.)
  Matching full word and abbreviation
  Match two formats with a score
Current work
– Learning mapping functions from example matches
– IR-based approaches
– Building “metabases”
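A rough sketch of a mapping function that strips common words and scores two formats. The similarity measure here is Python's standard `difflib.SequenceMatcher`, swapped in as a generic stand-in rather than the technique used by any of the systems discussed; the word list and function names are likewise assumptions.

```python
import re
from difflib import SequenceMatcher

# Common words to drop before comparing (assumed list).
STOP_WORDS = {"inc", "corp", "the"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common words."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


def match_score(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


same = match_score("Red Lobster", "The Red Lobster")
different = match_score("Red Lobster", "Peking Star")
```

Here "Red Lobster" and "The Red Lobster" normalize to the same string and so receive the maximum score, while unrelated names score lower. A mediator would treat two values as the same object when the score exceeds a threshold chosen for the domain.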
![Page 24: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/24.jpg)
Mediator Prototypes and Software
Software and tools from mediator research projects
What may be available:
– Mediator kernels (integration engines)
– Data modeling tools, Description Logic systems
– Wrapper and extractor toolkits and software
– Plenty of papers!
Ariadne, USC/ISI, http://www.isi.edu/ariadne
TSIMMIS, Stanford, http://www-db.stanford.edu/tsimmis/
MIX, UCSD, http://feast.ucsd.edu/Projects/MIX/
InfoSleuth, MCC, http://www.mcc.com/projects/infosleuth/
DISCO, U Maryland, http://www.umiacs.umd.edu/labs/CLIP/im.html
Garlic, IBM Almaden, http://www.almaden.ibm.com/cs/garlic.html
Tukwila, U Washington, http://data.cs.washington.edu/integration/tukwila/
![Page 25: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/25.jpg)
Applications of Mediators
Heterogeneous and Distributed Database Integration
– Legacy systems integration
Web Sources Integration
Data Integration for E-commerce
– Integrating product catalogs, multiple vendors
Data Warehousing
– For populating data warehouses
Bioinformatics
Information Management Environments
Digital Libraries
Healthcare Information Systems
![Page 26: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/26.jpg)
Industry Products (IBM DB2 DataJoiner)
IBM DB2 DataJoiner http://www-4.ibm.com/software/data/datajoiner/
Enterprise data integration middleware
DataJoiner functionality now incorporated in IBM DB2 UDB http://www-4.ibm.com/software/data/db2/udb/about.html
Native support for popular relational data sources
– DB2, Informix, SQL Server, Sybase, Teradata and others
– Supports non-relational data sources
– Support for Web data
– Available on a variety of platforms and operating systems
![Page 27: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/27.jpg)
Start-up ventures: Junglee Corp
Website: www.amazon.com (Acquired)
Researcher Founders: Rajaraman, Gupta, Harinarayan, Mathur
Products and Services:
– Tools for data extraction and integration
– Building warehouses from multiple Web sources
  Integrating apartment listings from multiple sources
  Integrating job postings from multiple online job sources
Market focus: Online shopping
Current Status: Acquired by Amazon
Similar ventures: Netbot Inc. (www.excite.com), acquired by Excite
![Page 28: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/28.jpg)
Cohera
Website: www.cohera.com
Researcher Founders: Stonebraker, Hellerstein
Products and Services:
– Cohera E-Catalog System
– Integrates product data from multiple sellers and product catalogs
– Set of software servers and tools for building and running live “e-catalogs”
Market(s) Targeted: E-Commerce
Customers: E-Commerce communities: ThomasNet, Trapezo, LiveListings, FoodService.Com
Current Status: Founded October 1997, Privately Held
Similar ventures: Enosys Markets Inc. (www.enosysmarkets.com), Mergent Inc. (www.mergent.com)
![Page 29: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/29.jpg)
Nimble Technology
Website: www.nimble.com
Researcher Founders: Levy, Weld
Products and Services:
– Nimble Data Integration Suite
– XML-based integration approach
– Current focus on integration of multiple information sources
– Tools for data extraction and a Data Integration Engine
Market focus: CRM, Business Intelligence, B2B, Portals
Current Status: Founded June 1999, Privately Held
![Page 30: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/30.jpg)
WhizBang! Labs
Website: www.whizbanglabs.com
Researcher Founders: Quass, Geddes, Mitchell
Products and Services:
– Technology for building “Webbases”: databases created by extracting data from Web pages
– Topic specific
– Topic-specific crawler for retrieving pages
– Tools for extracting data from Web pages, cleaning data and loading into database
Market focus: Content-providing portals
Current Status: Founded March 1999, Privately Held
Similar ventures: Fetch Technologies (www.fetch.com)
![Page 31: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/31.jpg)
Bioinformatics: A Data Integration Grand Challenge
Mapping of Human Genetic Code complete
– New, revolutionary, computational approach to drug discovery
Huge amounts of genetic, chemical and biological data being generated at an exponential rate in biotech/pharma R&D
– Complex structures, maps, sequence data, etc.
Drug discovery scientists need integrated access to this data
– Look for patterns across data sources
Need to integrate data from multiple labs
Lab procedures (and thus the data) keep changing
A good amount of genomic data is free text
DiscoveryLink: state-of-the-art life sciences data integration middleware from IBM
http://www-4.ibm.com/software/webservers/lifesciences/discovery.html
![Page 32: Information Mediation: Integrating Information from Multiple Information Sources](https://reader036.vdocuments.mx/reader036/viewer/2022062321/56813d42550346895da6fff0/html5/thumbnails/32.jpg)
Conclusion
Information mediation
Issues in building such systems
Research projects
Industry products
Start-up ventures
Applicable to wide areas such as E-commerce, database and legacy systems integration, Web source extraction, content management, portals, digital libraries, and bioinformatics.