Bethesda, Maryland, April 6, 1999
Amit Sheth
Large Scale Distributed Information Systems Lab
University of Georgia
http://lsdis.cs.uga.edu



Three perspectives to GlobIS

Information Integration Perspective: distribution, autonomy, heterogeneity

Information Brokering Perspective: data, meta-data, semantic (terminological, contextual)

"Vision" Perspective: data, connectivity, computing, information, knowledge

Evolving targets and approaches in integrating data and information (a personal perspective)

Generation I (1980s): Mermaid, DDTS, Multibase, MRDSM, ADDS, IISS, Omnibase, ...

Generation II (1990s): InfoSleuth, KMed, DL-I projects, Infoscopes, HERMES, SIMS, Garlic, TSIMMIS, Harvest, RUFUS, ...

Generation III (1997...): DL-II projects, ADEPT, InfoQuilt

VisualHarness, InfoHarness

a society for ubiquitous exchange of (tradeable) information in all digital forms of representation;

information anywhere, anytime, in any form

Generation I

• Data recognized as corporate resource: leverage it!

• Data predominantly in structured databases, different data models, transitioning from network and hierarchical to relational DBMSs

• Heterogeneity (system, modeling and schematic) as well as need to support autonomy posed main challenges; major issues were data access and connectivity

• Information integration through Federated architecture

• Support for corporate IS applications as the primary objective, update often required, data integrity important

Generation I (heterogeneity in FDBMSs)

Communication

Hardware/System
• instruction set
• data representation/coding
• configuration

Operating System
• file system
• naming, file types, operation
• transaction support
• IPC

Database System
• Semantic Heterogeneity
• Differences in DBMS
  – data models (abstractions, constraints, query languages)
  – System level support (concurrency control, commit, recovery)

Timeline: 1970s, 1980s

Generation I (Federated Database Systems: Schema Architecture)

Figure: schema architecture. Each Component DBS has a Local Schema and a Component Schema (related by schema translation); Export Schemas feed a Federated Schema (schema integration), over which External Schemas are defined.

• Model Heterogeneity: Common/Canonical Data Model, Schema Translation

• Information sharing while preserving autonomy

• Dimensions for interoperability and integration: distribution, autonomy and heterogeneity
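
The schema levels named on this slide can be illustrated with a small sketch. It is not taken from the talk: the Schema class, the functions translate, export and integrate, and the hr/sales example data are all hypothetical, and merging same-named relations by attribute union is a deliberately naive assumption about what schema integration does.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set, Tuple


@dataclass
class Schema:
    """A schema in an assumed canonical model: relation name -> attribute names."""
    name: str
    relations: Dict[str, List[str]] = field(default_factory=dict)


def translate(local_name: str, local_records: Iterable[Tuple[str, List[str]]]) -> Schema:
    """Schema translation: local schema -> component schema.

    The 'native' local schema is represented here (hypothetically) as
    (record_type, fields) pairs, as a non-relational DBMS might expose them.
    """
    return Schema(local_name + ".component", {rt: list(fs) for rt, fs in local_records})


def export(component: Schema, shared: Set[str]) -> Schema:
    """Export schema: the subset a component DBS agrees to share (autonomy)."""
    kept = {rel: list(attrs) for rel, attrs in component.relations.items() if rel in shared}
    return Schema(component.name.replace(".component", ".export"), kept)


def integrate(name: str, exports: Iterable[Schema]) -> Schema:
    """Schema integration: merge export schemas into one federated schema.

    Naive assumption: relations with the same name describe the same entity,
    so their attribute lists are unioned in first-seen order.
    """
    federated = Schema(name)
    for ex in exports:
        for rel, attrs in ex.relations.items():
            merged = federated.relations.setdefault(rel, [])
            for attr in attrs:
                if attr not in merged:
                    merged.append(attr)
    return federated


# Two hypothetical component databases.
hr = translate("hr", [("EMPLOYEE", ["emp_id", "name", "salary"]),
                      ("PAYROLL", ["emp_id", "bank_account"])])
sales = translate("sales", [("EMPLOYEE", ["emp_id", "name", "region"])])

# hr keeps PAYROLL private by not exporting it.
federated = integrate("federated", [export(hr, {"EMPLOYEE"}), export(sales, {"EMPLOYEE"})])
print(federated.relations)
```

In this sketch autonomy shows up only as the choice of what to export; a real federated DBMS would also have to reconcile the conflicts that a plain attribute union ignores.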

Generation I (characterization of schematic conflicts in multidatabase systems)

Schematic Conflicts (Sheth & Kashyap, Kim & Seo)

• Domain Definition Incompatibility
  – Naming Conflicts
  – Data Representation Conflicts
  – Data Scaling Conflicts
  – Data Precision Conflicts
  – Default Value Conflicts
  – Attribute Integrity Constraint Conflicts

• Entity Definition Incompatibility
  – Naming Conflicts
  – Database Identifier Conflicts
  – Schema Isomorphism Conflicts
  – Missing Data Items Conflicts

• Data Value Incompatibility
  – Known Inconsistency
  – Temporal Inconsistency
  – Acceptable Inconsistency

• Abstraction Level Incompatibility
  – Generalization Conflicts
  – Aggregation Conflicts

• Schematic Discrepancies
  – Data Value Attribute Conflict
  – Entity Attribute Conflict
  – Data Value Entity Conflict

BUT these techniques for dealing with schematic heterogeneity do not directly map to dealing with the much larger variety of heterogeneous media
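
Two of the conflict types in this taxonomy, naming conflicts and data scaling conflicts, are easy to show concretely. The sketch below is illustrative only: the source names db1 and db2, the attribute names, and the assumption that one source stores salary in dollars and the other in thousands of dollars are all made up for the example.

```python
# Hypothetical per-source mappings: local attribute -> (canonical attribute, converter).
# A differing local name is a naming conflict; a differing unit is a data scaling conflict.
MAPPINGS = {
    "db1": {
        "emp_name": ("name", str.strip),
        "salary_usd": ("salary_usd", float),
    },
    "db2": {
        "ename": ("name", str.strip),
        # db2 is assumed to store salary in thousands of dollars.
        "sal_k": ("salary_usd", lambda k: float(k) * 1000.0),
    },
}


def to_canonical(source, row):
    """Rewrite one row from a source database into the canonical attributes."""
    out = {}
    for attr, value in row.items():
        if attr not in MAPPINGS[source]:
            continue  # attribute has no canonical counterpart in this sketch
        canonical, convert = MAPPINGS[source][attr]
        out[canonical] = convert(value)
    return out


r1 = to_canonical("db1", {"emp_name": "Ada ", "salary_usd": "52000"})
r2 = to_canonical("db2", {"ename": "Ada", "sal_k": 52})
print(r1 == r2)  # both rows describe the same employee once conflicts are resolved
```

Mappings of this kind work for structured attributes, which is the slide's point: they do not carry over directly to text, images, audio or video.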

Generation II

• Significant improvements in computing and connectivity (standardization of protocol, public network, Internet/Web); remote data access as given;

• Increasing diversity in data formats, with focus on variety of textual data and semi-structured documents

• Many more data sources, heterogeneous information sources, but not necessarily better understanding of data

• Use of data beyond traditional business applications: mining + warehousing, marketing, e-commerce

• Web search engines for keyword based querying against HTML pages; attribute-based querying available in a few search systems

• Use of metadata for information access; early work on ontology support; distribution applied to metadata in some cases

• Mediator architecture for information management

Generation II (limited types of metadata, extractors, mappers, wrappers)

Figure: Global/Enterprise Web Repositories; METADATA EXTRACTORS over Digital Maps, Nexis, UPI, AP, Documents, Digital Audios, Data Stores, Digital Videos, Digital Images, ...

Example request: Find Marketing Manager positions in a company that is within 15 miles of San Francisco and whose stock price has been growing at a rate of at least 25% per year over the last three years

Junglee, SIGMOD Record, Dec. 1997

Generation II (a metadata classification: the information pyramid)

Data (Heterogeneous Types/Media)

Content Independent Metadata (creation-date, location, type-of-sensor...)

Content Dependent Metadata (size, max colors, rows, columns...)

Direct Content Based Metadata (inverted lists, document vectors, WAIS, Glimpse, LSI)

Domain Independent (structural) Metadata (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure...)

Domain Specific Metadata (area, population (Census), land-cover, relief (GIS), metadata concept descriptions from ontologies)

Ontologies, Classifications, Domain Models

User

METADATA STANDARDS

General Purpose: Dublin Core, MCF

Domain/industry specific: Geographic (FGDC, UDK, …), Library (MARC, …)

Move in this direction to tackle information overload!!

VisualHarness – an example

What’s next (after comprehensive use of metadata)?

Query processing and information requests

NOW

traditional queries based on keywords
attribute based queries
content-based queries

NEXT

‘high level’ information requests involving ontology-based, iconic, mixed-media, and media-independent information requests

user selected ontology, use of profiles

GIS Data Representation – Example

multiple heterogeneous metadata models with different tag names for the same data in the same GIS domain

Kansas State

FGDC Metadata Model

Theme keywords: digital line graph, hydrography, transportation...
Title: Dakota Aquifer
Online linkage: http://gisdasc.kgs.ukans.edu/dasc/
Direct Spatial Reference Method: Vector
Horizontal Coordinate System Definition: Universal Transverse Mercator
...

UDK Metadata Model

Search terms: digital line graph, hydrography, transportation...
Topic: Dakota Aquifer
Address Id: http://gisdasc.kgs.ukans.edu/dasc/
Measuring Techniques: Vector
Co-ordinate System: Universal Transverse Mercator
...
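
One way to read this example is that both records can be rewritten into a shared vocabulary, after which they compare equal. The sketch below uses the tag names and values shown on the slide (with the UDK tag spelled "Address Id"); the shared concept names such as keywords and coordinate_system, and the normalize function, are hypothetical and chosen only for illustration.

```python
# Tag names come from the two metadata models on the slide.
# The shared concept names on the right-hand side are assumptions of this sketch.
TAG_TO_CONCEPT = {
    "FGDC": {
        "Theme keywords": "keywords",
        "Title": "title",
        "Online linkage": "url",
        "Direct Spatial Reference Method": "spatial_representation",
        "Horizontal Coordinate System Definition": "coordinate_system",
    },
    "UDK": {
        "Search terms": "keywords",
        "Topic": "title",
        "Address Id": "url",
        "Measuring Techniques": "spatial_representation",
        "Co-ordinate System": "coordinate_system",
    },
}


def normalize(model, record):
    """Rewrite a metadata record so its keys are shared concept names."""
    mapping = TAG_TO_CONCEPT[model]
    return {mapping[tag]: value for tag, value in record.items() if tag in mapping}


fgdc_record = {
    "Theme keywords": ["digital line graph", "hydrography", "transportation"],
    "Title": "Dakota Aquifer",
    "Online linkage": "http://gisdasc.kgs.ukans.edu/dasc/",
    "Direct Spatial Reference Method": "Vector",
    "Horizontal Coordinate System Definition": "Universal Transverse Mercator",
}
udk_record = {
    "Search terms": ["digital line graph", "hydrography", "transportation"],
    "Topic": "Dakota Aquifer",
    "Address Id": "http://gisdasc.kgs.ukans.edu/dasc/",
    "Measuring Techniques": "Vector",
    "Co-ordinate System": "Universal Transverse Mercator",
}

# Different tag names, same data: the normalized records are identical.
print(normalize("FGDC", fgdc_record) == normalize("UDK", udk_record))
```

A flat tag-to-concept table is the simplest possible stand-in for a domain ontology; it captures synonymy between tags but none of the relationships between concepts.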

Generation III

• Increasing information overload and broader variety of information content (video content, audio clips etc) with increasing amount of visual information, scientific/engineering data

• Continued standardization related to Web for representational and metadata issues (MCF, RDF, XML)

• Changes in Web architecture; distributed computing (CORBA, Java)

• Users demand simplicity, but complexities continue to rise

• Web is no longer just another information source, but decision support through “data mining and information discovery, information fusion, information dissemination, knowledge creation and management”, “information management complemented by cooperation between the information system and humans”

• Information Brokering Architecture proposed for information management

Information Brokering: An Enabler for the Infocosm

INFORMATION/DATA OVERLOAD

INFORMATION PROVIDERS: Newswires, Universities, Corporations, Research Labs (Information Systems, Data Repositories)

INFORMATION CONSUMERS: Corporations, Universities, People, Government, Programs (User Queries, Information Requests)

INFORMATION BROKERING

arbitration between information consumers and providers for resolving information impedance

dynamic reinterpretation of information requests for determination of relevant information services and products

dynamic creation and composition of information products

Information Brokering: Three Dimensions

Figure axes: SEMANTICS, STRUCTURE, SYNTAX, SYSTEM; CONSUMERS, BROKERS, PROVIDERS; DATA, METADATA, VOCABULARY

Objective: Reduce the problem of knowing structure and semantics of data in the huge number of information sources on a global scale to: understanding and navigating a significantly smaller number of domain ontologies

WWW

a confusing heterogeneity of media, formats (Tower of Babel)

information correlation using physical (HREF) links at the extensional data level

location dependent browsing of information using physical (HREF) links

user has to keep track of information content!!

WWW + Information Brokering

Domain Specific Ontologies as “semantic conceptual views”

Information correlation using concept mappings at the intensional concept level

Browsing of information using terminological relationships across ontologies

Higher level of abstraction, closer to user view of information!!

What else can Information Brokering do?

Concepts, tools and techniques to support semantics

context

media-independent information correlations

semantic proximity

inter-ontological relations

ontologies (esp. domain-specific)

profiles

domain-specific metadata

Tools to support semantics

• Context, context, context

• Media-independent information correlations

• Multiple ontologies

– Semantic Proximity (relationships between concepts within and across ontologies) using domain, context, modeling/abstraction/representation, state

– Characterizing Loss of Information incurred due to differences in vocabulary

BIG challenge: identifying relationship or similarity between objects of different media, developed and managed by different persons and systems

Heterogeneity... is a Babel Tower!!

SEMANTIC HETEROGENEITY

SEMANTIC INTEROPERABILITY: metadata, ontologies, contexts