Bethesda, Maryland, April 6, 1999
Amit Sheth
Large Scale Distributed Information Systems Lab
University of Georgia
http://lsdis.cs.uga.edu
Three perspectives to GlobIS
Information Integration Perspective: distribution, autonomy, heterogeneity
Information Brokering Perspective: data, meta-data, semantic (terminological, contextual)
“Vision” Perspective: data, connectivity, computing, information, knowledge
Evolving targets and approaches in integrating data and information (a personal perspective)
Generation I (1980s): Mermaid, DDTS, Multibase, MRDSM, ADDS, IISS, Omnibase, ...
Generation II (1990s): InfoSleuth, KMed, DL-I projects, Infoscopes, HERMES, SIMS, Garlic, TSIMMIS, Harvest, RUFUS, ...; VisualHarness, InfoHarness
Generation III (1997...): DL-II projects, ADEPT, InfoQuilt
a society for ubiquitous exchange of (tradeable) information in all digital forms of representation;
information anywhere, anytime, in any form
Generation I
• Data recognized as corporate resource — leverage it!
• Data predominantly in structured databases, different data models,
transitioning from network and hierarchical to relational DBMSs
• Heterogeneity (system, modeling and schematic) as well as need to
support autonomy posed main challenges;
major issues were data access and connectivity
• Information integration through Federated architecture
• Support for corporate IS applications as the primary objective,
update often required, data integrity important
Generation I (heterogeneity in FDBMSs)
1970s, 1980s
Communication
Hardware/System
• instruction set
• data representation/coding
• configuration
Operating System
• file system
• naming, file types, operation
• transaction support
• IPC
Database System
• Semantic Heterogeneity
• Differences in DBMS
– data models (abstractions, constraints, query languages)
– System level support (concurrency control, commit, recovery)
Generation I (Federated Database Systems: Schema Architecture)
[Figure: schema architecture. For each Component DBS: Local Schema, then (schema translation) Component Schema, then Export Schema(s). The Export Schemas are combined (schema integration) into a Federated Schema, over which External Schemas are defined.]
• Model Heterogeneity: Common/Canonical Data Model, Schema Translation
• Information sharing while preserving autonomy
• Dimensions for interoperability and integration: distribution, autonomy and heterogeneity
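A minimal sketch of the translation and integration steps named above, assuming a toy canonical data model (a dict from entity name to attribute list). The two local schemas, the function names, and the attribute correspondences are hypothetical and serve only as illustration; they are not taken from any particular FDBMS.

```python
# Hypothetical local schemas expressed in two different data models.
relational_local = {"tables": {"EMP": ["emp_no", "name", "salary"]}}
hierarchical_local = {
    "segments": [{"name": "STAFF", "fields": ["id", "name", "pay", "ssn"]}]
}


def translate_relational(local):
    """Schema translation: relational local schema -> component schema."""
    return {table: list(cols) for table, cols in local["tables"].items()}


def translate_hierarchical(local):
    """Schema translation: hierarchical local schema -> component schema."""
    return {seg["name"]: list(seg["fields"]) for seg in local["segments"]}


def export(component, visible):
    """Export schema: the part of a component schema its owner agrees to share."""
    return {
        entity: [a for a in attrs if a in visible[entity]]
        for entity, attrs in component.items()
        if entity in visible
    }


def integrate(exports, correspondences):
    """Schema integration: merge export schemas into one federated schema.

    correspondences maps (entity, attribute) in an export schema to
    (entity, attribute) in the federated schema.
    """
    federated = {}
    for schema in exports:
        for entity, attrs in schema.items():
            for attr in attrs:
                fed_entity, fed_attr = correspondences[(entity, attr)]
                target = federated.setdefault(fed_entity, [])
                if fed_attr not in target:
                    target.append(fed_attr)
    return federated


exp1 = export(translate_relational(relational_local),
              {"EMP": ["emp_no", "name", "salary"]})
# The second component keeps "ssn" private, preserving its autonomy.
exp2 = export(translate_hierarchical(hierarchical_local),
              {"STAFF": ["id", "name", "pay"]})

correspondences = {
    ("EMP", "emp_no"): ("Employee", "id"),
    ("EMP", "name"): ("Employee", "name"),
    ("EMP", "salary"): ("Employee", "salary"),
    ("STAFF", "id"): ("Employee", "id"),
    ("STAFF", "name"): ("Employee", "name"),
    ("STAFF", "pay"): ("Employee", "salary"),
}
federated_schema = integrate([exp1, exp2], correspondences)
```

Each component decides what it exports, so the federated schema only ever sees what the owners chose to share.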
Generation I (characterization of schematic conflicts in multidatabase systems)
Schematic Conflicts (Sheth & Kashyap, Kim & Seo)
• Domain Definition Incompatibility
– Naming Conflicts
– Data Representation Conflicts
– Data Scaling Conflicts
– Data Precision Conflicts
– Default Value Conflicts
– Attribute Integrity Constraint Conflicts
• Entity Definition Incompatibility
– Naming Conflicts
– Database Identifier Conflicts
– Schema Isomorphism Conflicts
– Missing Data Items Conflicts
• Data Value Incompatibility
– Known Inconsistency
– Temporal Inconsistency
– Acceptable Inconsistency
• Abstraction Level Incompatibility
– Generalization Conflicts
– Aggregation Conflicts
• Schematic Discrepancies
– Data Value Attribute Conflict
– Entity Attribute Conflict
– Data Value Entity Conflict
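A small sketch of how two of the conflicts listed above (a naming conflict and a data scaling conflict) might be resolved when records from two databases are mapped onto one federated attribute. The database names, attribute names, and sample values are hypothetical.

```python
# Hypothetical per-source mappings:
# local attribute -> (federated attribute, conversion function).
MAPPINGS = {
    "db_a": {
        "emp_name": ("name", str),
        "salary": ("annual_salary_usd", float),
    },
    "db_b": {
        "name": ("name", str),
        # Naming conflict: "pay" is a synonym of "salary" in db_a.
        # Data scaling conflict: db_b stores thousands of dollars.
        "pay": ("annual_salary_usd", lambda v: float(v) * 1000.0),
    },
}


def to_federated(source, record):
    """Rename attributes and rescale values into the federated form."""
    out = {}
    for attr, value in record.items():
        fed_attr, convert = MAPPINGS[source][attr]
        out[fed_attr] = convert(value)
    return out


row_a = to_federated("db_a", {"emp_name": "Ann", "salary": 52000})
row_b = to_federated("db_b", {"name": "Bob", "pay": 48.5})
```

After mapping, both rows use the same attribute names and the same unit, so they can be compared or combined directly.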
BUT these techniques for dealing with schematic heterogeneity do not directly map to dealing with the much larger variety of heterogeneous media
Generation II
• Significant improvements in computing and connectivity (standardization
of protocol, public network, Internet/Web); remote data access as given;
• Increasing diversity in data formats, with focus on variety of textual data
and semi-structured documents
• Many more data sources, heterogeneous information sources,
but not necessarily better understanding of data
• Use of data beyond traditional business applications:
mining + warehousing, marketing, e-commerce
• Web search engines for keyword based querying against HTML pages;
attribute-based querying available in a few search systems
• Use of metadata for information access; early work on ontology support
distribution applied to metadata in some cases
• Mediator architecture for information management
(limited types of metadata, extractors, mappers, wrappers)
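A minimal sketch of the mediator idea, assuming each source sits behind a wrapper that maps a small common vocabulary onto the source's own field names. The class names, the common fields ("title", "subject"), and the sample records are hypothetical.

```python
class SourceWrapper:
    """Hypothetical wrapper: exposes one source through a common interface."""

    def __init__(self, name, records, field_map):
        self.name = name
        self.records = records
        self.field_map = field_map  # common field -> source-specific field

    def query(self, field, value):
        source_field = self.field_map.get(field)
        if source_field is None:
            return []  # this source holds no metadata for that field
        results = []
        for record in self.records:
            if record.get(source_field) == value:
                # Translate the hit back into the common vocabulary.
                row = {common: record.get(src)
                       for common, src in self.field_map.items()}
                row["source"] = self.name
                results.append(row)
        return results


class Mediator:
    """Sends one request to every wrapper and merges the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(field, value))
        return results


news = SourceWrapper(
    "newswire",
    [{"headline": "Aquifer levels fall", "topic": "hydrography"},
     {"headline": "Road funding", "topic": "transportation"}],
    {"title": "headline", "subject": "topic"},
)
docs = SourceWrapper(
    "documents",
    [{"doc_title": "Dakota Aquifer", "keywords": "hydrography"}],
    {"title": "doc_title", "subject": "keywords"},
)
mediator = Mediator([news, docs])
hits = mediator.query("subject", "hydrography")
```

The caller only uses the common field names; knowledge of each source's own format stays inside its wrapper.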
Generation II
[Figure: Global/Enterprise Web Repositories; METADATA EXTRACTORS; sources: Digital Maps, Nexis, UPI, AP, Documents, Digital Audios, Data Stores, Digital Videos, Digital Images, ...]
Find Marketing Manager positions in a
company that is within 15 miles of San
Francisco and whose stock price has
been growing at a rate of at least 25%
per year over the last three years
Junglee, SIGMOD Record, Dec. 1997
Generation II (a metadata classification: the information pyramid)
Data (Heterogeneous Types/Media)
Content Independent Metadata (creation-date, location, type-of-sensor...)
Content Dependent Metadata (size, max colors, rows, columns...)
Direct Content Based Metadata (inverted lists, document vectors, WAIS, Glimpse, LSI)
Domain Independent (structural) Metadata (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...)
Domain Specific Metadata (area, population (Census), land-cover, relief (GIS), metadata concept descriptions from ontologies)
Ontologies (Classifications, Domain Models)
User
METADATA STANDARDS
General Purpose:
Dublin Core, MCF
Domain/industry specific:
Geographic (FGDC, UDK, …),
Library (MARC,…)
Move in this direction to tackle information overload!!
What’s next (after comprehensive use of metadata)?
Query processing and information requests
NOW
• traditional queries based on keywords
• attribute based queries
• content-based queries
NEXT
• ‘high level’ information requests involving ontology-based, iconic, mixed-media, and media-independent information requests
• user selected ontology, use of profiles
GIS Data Representation – Example
multiple heterogeneous metadata models with different tag names for the same data in the same GIS domain
FGDC Metadata Model
Theme keywords: digital line graph, hydrography, transportation...
Title: Dakota Aquifer
Online linkage: http://gisdasc.kgs.ukans.edu/dasc/
Direct Spatial Reference Method: Vector
Horizontal Coordinate System Definition: Universal Transverse Mercator
…
UDK Metadata Model
Search terms: digital line graph, hydrography, transportation...
Topic: Dakota Aquifer
Adress Id: http://gisdasc.kgs.ukans.edu/dasc/
Measuring Techniques: Vector
Co-ordinate System: Universal Transverse Mercator
…
Kansas State
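The example above can be sketched as a tag mapping. The tag names and values are copied from the slide; the common vocabulary (the keys of COMMON_TAGS) and the function name are hypothetical choices made only for illustration.

```python
# Common tag -> tag name used by each metadata model.
COMMON_TAGS = {
    "keywords": {"FGDC": "Theme keywords", "UDK": "Search terms"},
    "title": {"FGDC": "Title", "UDK": "Topic"},
    "url": {"FGDC": "Online linkage", "UDK": "Adress Id"},
    "spatial_method": {"FGDC": "Direct Spatial Reference Method",
                       "UDK": "Measuring Techniques"},
    "coordinate_system": {"FGDC": "Horizontal Coordinate System Definition",
                          "UDK": "Co-ordinate System"},
}


def normalize(record, model):
    """Rewrite a model-specific record using the common tag names."""
    return {
        common: record[tags[model]]
        for common, tags in COMMON_TAGS.items()
        if tags[model] in record
    }


fgdc_record = {
    "Theme keywords": "digital line graph, hydrography, transportation...",
    "Title": "Dakota Aquifer",
    "Online linkage": "http://gisdasc.kgs.ukans.edu/dasc/",
    "Direct Spatial Reference Method": "Vector",
    "Horizontal Coordinate System Definition": "Universal Transverse Mercator",
}
udk_record = {
    "Search terms": "digital line graph, hydrography, transportation...",
    "Topic": "Dakota Aquifer",
    "Adress Id": "http://gisdasc.kgs.ukans.edu/dasc/",
    "Measuring Techniques": "Vector",
    "Co-ordinate System": "Universal Transverse Mercator",
}

fgdc_common = normalize(fgdc_record, "FGDC")
udk_common = normalize(udk_record, "UDK")
same_data = fgdc_common == udk_common
```

Once both records use the common tags, they can be recognized as describing the same data set even though their original tag names differ.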
Generation III
• Increasing information overload and broader variety of information
content (video content, audio clips etc) with increasing amount of visual
information, scientific/engineering data
• Continued standardization related to Web for representational and metadata
issues (MCF, RDF, XML)
• Changes in Web architecture; distributed computing (CORBA, Java)
• Users demand simplicity, but complexities continue to rise
• Web is no longer just another information source, but decision support through
“data mining and information discovery, information fusion, information
dissemination, knowledge creation and management”, “information management
complemented by cooperation between the information system and humans”
• Information Brokering Architecture proposed for information management
Information Brokering: An Enabler for the Infocosm
INFORMATION/DATA OVERLOAD
INFORMATION PROVIDERS (Information Systems, Data Repositories): Newswires, Universities, Corporations, Research Labs
INFORMATION CONSUMERS (User Queries, Information Requests): Corporations, Universities, People, Government, Programs
INFORMATION BROKERING
• arbitration between information consumers and providers for resolving information impedance
• dynamic reinterpretation of information requests for determination of relevant information services and products
• dynamic creation and composition of information products
Information Brokering: Three Dimensions
[Figure: THREE DIMENSIONS. Levels: SEMANTICS, STRUCTURE, SYNTAX, SYSTEM. Roles: CONSUMERS, BROKERS, PROVIDERS. Content: DATA, METADATA, VOCABULARY.]
Objective: Reduce the problem of knowing structure and semantics of data in the huge number of information sources on a global scale to: understanding and navigating a significantly smaller number of domain ontologies
WWW
a confusing heterogeneity of media, formats (Tower of Babel)
information correlation using physical (HREF) links at the extensional data level
location dependent browsing of information using physical (HREF) links
user has to keep track of information content !!
WWW + Information Brokering
Domain Specific Ontologies as “semantic conceptual views”
Information correlation using concept mappings at the intensional concept level
Browsing of information using terminological relationships across ontologies
Higher level of abstraction, closer to user view of information !!
What else can Information Brokering do?
Concepts, tools and techniques to support semantics
• context
• media-independent information correlations
• semantic proximity
• inter-ontological relations
• ontologies (esp. domain-specific)
• profiles
• domain-specific metadata
Tools to support semantics
• Context, context, context
• Media-independent information correlations
• Multiple ontologies
– Semantic Proximity (relationships between concepts within
and across ontologies) using domain, context,
modeling/abstraction/representation, state
– Characterizing Loss of Information incurred due to
differences in vocabulary
BIG challenge: identifying relationship or similarity between objects of different media, developed and managed by different persons and systems