1 May 2007

Turning information into knowledge: the challenges of integrating diverse information sources

Alex Poulovassilis, Birkbeck, U. of London
Co-Director of the London Knowledge Lab

The London Knowledge Lab

Purpose-designed building
Science Research Infrastructure Fund: £6m
Research staff and students: 50
Location: Bloomsbury
Open: June 2004

Institute of Education, University of London

Birkbeck College, University of London

Social scientists: experts in education, sociology, culture and media, semiotics, philosophy, knowledge management ...

Computer scientists: experts in information systems, information management, web technologies, personalisation, ubiquitous technologies …

LKL mission

The starting point for our mission is that digital technologies and new media will change how we learn, work, collaborate and communicate

• to understand how digital technologies and media are transforming people’s relationships to information, learning and culture at home, work and play

• to design, build and evaluate systems, processes and interfaces which enhance learning throughout life

• to examine critically the assumptions about knowledge and learning that underlie the different uses of digital technologies

LKL research themes

Our research is funded by projects from the EU, EPSRC, ESRC, BBSRC, JISC and the Wellcome Trust: currently about 25 projects.

Four broad themes guide our work and inform our research:

• new forms of knowledge

• turning information into knowledge

• the changing cultures of new media

• creating empowering technologies for formal and informal learning

New forms of knowledge

What do children and adults of the twenty-first century need to know?

How can we learn in new and more effective ways?

What kinds of knowledge are emerging in the knowledge economy?

How can this knowledge be made more accessible to more people?

Turning information into knowledge

• The need to cope with ubiquitous, complex, incomplete and inconsistent information is pervasive in our societies

• How can people benefit from this information in their learning, working and social lives?

• What new techniques are necessary for managing, accessing, integrating and personalising such information?

• How can we design and build tools that help people to understand such information and generate new knowledge from it?

The changing cultures of new media

What are the differences and continuities between ‘old’ media (books, film, TV) and ‘new’ media (internet, computer games, mobile phones)?

How do children and adults use these media in different contexts, both as consumers and producers?

How are they learning in, and from, this convergent media environment?

What are the implications of these developments for formal and informal learning?

Creating empowering technologies for learning

How are equity, participation, learner autonomy, and the structuring of learning impacted by digital technologies and new media?

Which media-enhanced approaches can help people to learn and collaborate?

How can the Internet, and ambient and mobile technologies create new learning opportunities?

Turning information into knowledge – information integration

AutoMed (EPSRC)
– developing tools for semi-automatic integration of heterogeneous information sources
– can handle both structured and semi-structured (RDF/S, XML) data
– can handle virtual, materialised and hybrid integration scenarios
– application in biological data integration, e-learning, p2p data integration

ISPIDER (BBSRC e-Science programme)
– developing an integrated platform of proteomic data sources, enabled as Grid and Web services
– collaboration with groups at EBI, Manchester, UCL

The AutoMed Project

Partners: Birkbeck and Imperial Colleges

Data integration based on schema equivalence/subsumption

Low-level metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level data modelling languages are defined – extensible therefore with new modelling languages

Provides a set of primitive equivalence-preserving schema transformations for higher-level modelling languages:
• addT(c,q) deleteT(c,q) renameT(c,n,n’)

Also two more primitive transformations for imprecise integration scenarios:
• extendT(c,Range q q’) contractT(c,Range q q’)

Features of the AutoMed toolkit

Schema transformations are automatically reversible:
• addT/deleteT(c,q) by deleteT/addT(c,q)
• extendT(c,Range q1 q2) by contractT(c,Range q1 q2)
• renameT(c,n,n’) by renameT(c,n’,n)

Hence bi-directional transformation pathways (more generally transformation networks) are defined between schemas

The queries within transformations allow automatic data and query translation

Schemas may be expressed in a variety of modelling languages
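
The reversal rules on this slide can be illustrated with a minimal sketch. It is not the AutoMed toolkit API: the class names, field names, construct names and query strings below are all hypothetical, and queries are kept as plain strings rather than real query expressions. The sketch only shows that a pathway can be reversed mechanically by inverting each step and applying the steps in the opposite order.

```python
from dataclasses import dataclass
from typing import List, Union


# Hypothetical stand-ins for the five primitive transformations.
@dataclass(frozen=True)
class Add:
    construct: str
    query: str


@dataclass(frozen=True)
class Delete:
    construct: str
    query: str


@dataclass(frozen=True)
class Extend:
    construct: str
    lower: str  # lower-bound query of the Range
    upper: str  # upper-bound query of the Range


@dataclass(frozen=True)
class Contract:
    construct: str
    lower: str
    upper: str


@dataclass(frozen=True)
class Rename:
    construct: str
    old: str
    new: str


Step = Union[Add, Delete, Extend, Contract, Rename]


def reverse_step(step: Step) -> Step:
    """Return the step that undoes `step`, following the rules on the slide."""
    if isinstance(step, Add):
        return Delete(step.construct, step.query)
    if isinstance(step, Delete):
        return Add(step.construct, step.query)
    if isinstance(step, Extend):
        return Contract(step.construct, step.lower, step.upper)
    if isinstance(step, Contract):
        return Extend(step.construct, step.lower, step.upper)
    if isinstance(step, Rename):
        return Rename(step.construct, step.new, step.old)
    raise TypeError(f"unknown transformation: {step!r}")


def reverse_pathway(pathway: List[Step]) -> List[Step]:
    """Invert every step and apply them in the opposite order."""
    return [reverse_step(s) for s in reversed(pathway)]


# Illustrative pathway; all names and queries are made up.
pathway = [
    Add("person", "staff"),
    Rename("table", "dept", "department"),
    Extend("phone", "Void", "Any"),
]
reversed_pathway = reverse_pathway(pathway)
```

Reversing the reversed pathway gives back the original pathway, which is the property that makes the pathways bi-directional.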

Schema transformation/integration networks

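[Figure: transformation network in which each local schema LS1, LS2, …, LSi, …, LSn is connected to a union-compatible schema US1, US2, …, USi, …, USn; neighbouring union-compatible schemas are linked by id steps, and GS is the global schema.]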

Schema transformation/integration networks (cont’d)

On the previous slide:
• GS is a global schema
• LS1, …, LSn are local schemas
• US1, …, USn are union-compatible schemas
• the transformation pathway between each pair LSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
• the transformation pathway between USi and GS is similar
• the transformation pathway between each pair of union-compatible schemas consists of id transformation steps

AutoMed architecture

[Figure: AutoMed architecture. Components shown: Global Query Processor, Global Query Optimiser, Schema Evolution Tool, Schema Transformation and Integration Tools, Model Definition Tool, Schema and Transformations Repository (STR), Model Definitions Repository (MDR), Wrapper.]

Other data integration approaches: GAV & LAV

Global-As-View (GAV) approach: specify GS constructs by view definitions over LS constructs

Local-As-View (LAV) approach: specify LS constructs by view definitions over GS constructs

[Figure: a Global Schema related by View Definitions to three Local Schemas, over an RDB, an XML file and an RDF source.]

Evolution problems of GAV and LAV

GAV does not readily support evolution of local schemas e.g. adding a new attribute to a source table may invalidate some of the global view definitions

In LAV, changes to a local schema impact only the derivation rules defined for that schema

But conversely LAV has problems if one wants to evolve the global schema since all the view definitions defining local schema constructs in terms of the global schema would need to be reviewed

These evolution problems are exacerbated in P2P data integration scenarios where there is no distinction between local and global schemas

AutoMed vs GAV/LAV/GLAV

AutoMed schema transformation pathways capture at least the information available from GAV and LAV rules:
• add/extend transformations correspond to GAV rules
• delete/contract transformations correspond to LAV rules

Thus, GAV and LAV view definitions can be derived from a BAV (Both-As-View) network, i.e. a network of AutoMed transformation pathways

GLAV rules e :- e’ are also captured, by BAV transformations of the form add(T,e); …; del(T,e’)

Thus, any reasoning or processing that is possible using GAV, LAV or GLAV is also possible using BAV
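
The correspondence stated above can be sketched as follows. This is a deliberately simplified illustration, not AutoMed's actual derivation algorithm. A pathway from a local schema to a global schema is a list of hypothetical `(kind, construct, query)` triples with queries kept as plain strings, and the view definitions are read straight off the steps, with no composition across several steps.

```python
from typing import Dict, List, Tuple

# (kind, construct, query); kind is one of add, delete, extend, contract.
Step = Tuple[str, str, str]


def gav_views(pathway: List[Step]) -> Dict[str, str]:
    """add/extend steps define a new (global) construct by a query over
    constructs that already exist, which is the shape of a GAV rule."""
    return {c: q for kind, c, q in pathway if kind in ("add", "extend")}


def lav_views(pathway: List[Step]) -> Dict[str, str]:
    """delete/contract steps state how a removed (local) construct can be
    recovered from the remaining constructs, which is the shape of a LAV rule."""
    return {c: q for kind, c, q in pathway if kind in ("delete", "contract")}


# Illustrative pathway from a local schema with a `staff` table to a global
# schema with a `person` table; names and queries are made up.
ls_to_gs: List[Step] = [
    ("add", "person", "staff"),
    ("delete", "staff", "person"),
]

gav = gav_views(ls_to_gs)  # {'person': 'staff'}
lav = lav_views(ls_to_gs)  # {'staff': 'person'}
```

Both sets of definitions come from the same pathway, so neither has to be written separately.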

Schema Evolution in BAV

Unlike GAV/LAV/GLAV, BAV readily supports the evolution of both local and global schemas.

The evolution of a global or local schema is specified by a schema transformation pathway T from the old schema S to the new schema S’

The transformation network and schemas can then be systematically repaired (rather than having to be redefined)

[Figure: a Global Schema S evolving to a New Global Schema S’ via a pathway T, and a Local Schema S evolving to a New Local Schema S’ via a pathway T.]
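
One way to picture the repair is as pathway composition. The sketch below is an illustration under simplifying assumptions, not the toolkit's evolution algorithm. Steps are hypothetical tuples, the function names are made up, and any further tidying of the repaired pathway (for example propagating new local constructs into the global schema) is left out.

```python
from typing import List, Tuple

# ("add"|"delete"|"extend"|"contract", construct, query) or
# ("rename", construct, old_name, new_name)
Step = Tuple[str, ...]

_INVERSE = {"add": "delete", "delete": "add",
            "extend": "contract", "contract": "extend"}


def reverse_step(step: Step) -> Step:
    """Invert one primitive transformation."""
    kind = step[0]
    if kind == "rename":
        _, construct, old, new = step
        return ("rename", construct, new, old)
    return (_INVERSE[kind],) + step[1:]


def reverse_pathway(pathway: List[Step]) -> List[Step]:
    """Invert every step and apply them in the opposite order."""
    return [reverse_step(s) for s in reversed(pathway)]


def repair_after_global_evolution(ls_to_s: List[Step],
                                  t: List[Step]) -> List[Step]:
    """Global schema S evolves to S' by T: LS -> S followed by S -> S'."""
    return ls_to_s + t


def repair_after_local_evolution(s_to_gs: List[Step],
                                 t: List[Step]) -> List[Step]:
    """Local schema S evolves to S' by T: S' -> S (reverse of T), then S -> GS."""
    return reverse_pathway(t) + s_to_gs


# Illustrative data; all names and queries are made up.
ls_to_gs = [("add", "person", "staff"), ("delete", "staff", "person")]
t_local = [("add", "staff_email", "q")]
new_ls_to_gs = repair_after_local_evolution(ls_to_gs, t_local)
```

In both cases the existing pathway is reused unchanged, so nothing has to be redefined from scratch.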

Global Query Processing

We handle query language heterogeneity by translation into/from a functional intermediate query language – IQL

A query Q expressed in a high-level query language on a global schema S is first translated into IQL (this functionality is not yet supported in the AutoMed toolkit)

View definitions are derived from the transformation pathways between S and the requested data source schemas

These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs
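
The substitution step can be sketched as view unfolding over a small expression tree. This is not IQL and not the AutoMed query processor. The tuple encoding, the operator names and the schema constructs are assumptions made for illustration, and the view definitions are assumed to be acyclic.

```python
from typing import Dict, Tuple, Union

# A query is either a construct name (str) or (operator, operand, ...).
Query = Union[str, Tuple]


def substitute(query: Query, views: Dict[str, Query]) -> Query:
    """Replace every global construct in `query` by its view definition,
    unfolding recursively until only source constructs remain."""
    if isinstance(query, str):
        if query in views:
            return substitute(views[query], views)
        return query  # already a source-level construct
    op, *operands = query
    return (op, *[substitute(o, views) for o in operands])


# Hypothetical view definitions derived from the pathways: the global
# construct `person` is the union of two source constructs.
views = {"person": ("union", "src1.staff", "src2.employee")}

global_query = ("count", "person")
source_query = substitute(global_query, views)
# source_query == ("count", ("union", "src1.staff", "src2.employee"))
```

The reformulated query mentions only source schema constructs, so it can be handed on for optimisation and evaluation.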

Global Query Processing (cont’d)

Query optimisation and query evaluation then occur

During query evaluation, the evaluator submits to wrappers sub-queries that they are able to translate into the local query language. Currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data sources

The wrappers translate sub-query results back into the IQL type system

Further query post-processing then occurs in the IQL evaluator
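
The division of labour between evaluator and wrappers can be sketched as below. The wrapper class and the evaluator function are hypothetical and much simpler than the real components. The "source" is an in-memory table standing in for an SQL or XML source, and a list of tuples plays the role of the IQL type system.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple


class InMemoryWrapper:
    """Hypothetical wrapper; a real one would translate the sub-query into
    SQL, XPath, etc. and run it against the actual data source."""

    def __init__(self, tables: Dict[str, List[Sequence]]):
        self._tables = tables

    def evaluate(self, construct: str) -> List[Row]:
        # Normalise source-specific rows into a common representation.
        return [tuple(row) for row in self._tables[construct]]


def evaluate_union(subqueries: List[Tuple[str, str]],
                   wrappers: Dict[str, InMemoryWrapper]) -> List[Row]:
    """Send each (source, construct) sub-query to its wrapper and combine
    the results in the evaluator (bag union, duplicates kept)."""
    result: List[Row] = []
    for source, construct in subqueries:
        result.extend(wrappers[source].evaluate(construct))
    return result


# Illustrative sources and data; all names are made up.
wrappers = {
    "src1": InMemoryWrapper({"staff": [["ann"], ["bob"]]}),
    "src2": InMemoryWrapper({"employee": [("bob",)]}),
}
rows = evaluate_union([("src1", "staff"), ("src2", "employee")], wrappers)
# rows == [("ann",), ("bob",), ("bob",)]
```

Any further post-processing, such as duplicate elimination or aggregation, would then run in the evaluator over the combined rows.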

Other AutoMed research at BBK

As well as virtual integration of data sources, we have investigated using AutoMed for materialised data integration i.e. a data warehousing approach

In particular, Hao Fan has worked on incremental view maintenance, data lineage tracing and schema evolution over AutoMed schema transformation pathways

Lucas Zamboulis has developed semi-automatic techniques for transforming and integrating heterogeneous XML data

In recent work he is investigating the use of correspondences to ontologies to enhance these techniques

Sandeep Mittal is working on update translation and update propagation along AutoMed pathways e.g. in P2P environments

Other AutoMed research at BBK (cont’d)

Dean Williams has been working on extracting structure from unstructured text sources

The aim here is to integrate information extracted from unstructured text with structured information available from other sources

Dean is using existing technology (the GATE tool) for the text annotation and information extraction (IE) part of this work

The information extracted from the text is matched with existing structured information to derive new instance data and perhaps also new schema fragments

AutoMed is being used for the schema and data integration aspects of this project

ISPIDER Project

Partners: Birkbeck, EBI, Manchester, UCL

Aims:
• Vast, heterogeneous biological data
• Need for interoperability
• Need for efficient processing
• Development of Proteomics Grid Infrastructure, use existing proteomics resources and develop new ones, develop new proteomics clients for querying, visualisation, workflow etc.

Project Aims

myGrid / DQP / AutoMed

myGrid: collection of services/components allowing high-level integration of data/applications for in-silico experiments in biology

DQP:
• OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
• Distributed query processing over OGSA-DAI enabled resources

Ongoing research:
• AutoMed / DQP interoperability
• AutoMed / myGrid interoperability

DQP / AutoMed interoperability

Data sources wrapped with OGSA-DAI

AutoMed OGSA-DAI wrappers extract data sources’ metadata

Semantic integration of data sources using AutoMed transformation pathways into an integrated AutoMed schema

IQL queries submitted to this integrated schema are:
• Reformulated to IQL queries on the data sources, using the AutoMed transformation pathways
• Submitted to DQP for evaluation

[Figure: DQP / AutoMed interoperability architecture. Components shown: AutoMed Repository, AutoMed wrappers over OGSA-DAI Activities, per-source AutoMed Schemas, Integrated AutoMed Schema, AutoMed Query Processor, AutoMed DQP wrapper, Distributed Query Processor, OGSA-DAI Services and DBs. Arrows are labelled IQL query, OQL query, OQL result and IQL result.]

Ongoing and future research

Heterogeneous data integration in Grid and P2P environments, with bioinformatics and e-learning as example application domains

Flexible combinations of virtual, materialised or hybrid integration

Flexible query processing in imprecise integration scenarios

P2P query processing over BAV pathways

P2P update processing over BAV pathways

Use of ECA rules and a P2P ECA rule execution engine for flexible update processing and data sharing
