1 May 2007
Turning information into knowledge: the challenges of integrating diverse information sources
Alex Poulovassilis, Birkbeck, U. of London
Co-Director of the London Knowledge Lab
The London Knowledge Lab
Purpose-designed building
Science Research Infrastructure Fund: £6m
Research staff and students: 50
Location: Bloomsbury
Open: June 2004
Institute of Education, University of London
Birkbeck College, University of London
Social scientists: experts in education, sociology, culture and media, semiotics, philosophy, knowledge management ...
Computer scientists: experts in information systems, information management, web technologies, personalisation, ubiquitous technologies …
LKL mission
• to understand how digital technologies and media are transforming people’s relationships to information, learning and culture at home, work and play
• to design, build and evaluate systems, processes and interfaces which enhance learning throughout life
• to examine critically the assumptions about knowledge and learning that underlie the different uses of digital technologies
The starting point for our mission is that digital technologies and new media will change how we learn, work, collaborate and communicate
LKL research themes
Our research is funded by projects from EU, EPSRC, ESRC, BBSRC,
JISC, Wellcome Trust – currently about 25 projects.
Four broad themes guide our work and inform our research:
• new forms of knowledge
• turning information into knowledge
• the changing cultures of new media
• creating empowering technologies for formal and informal learning
New forms of knowledge
What do children and adults of the twenty-first century need to know?
How can we learn in new and more effective ways?
What kinds of knowledge are emerging in the knowledge economy?
How can this knowledge be made more accessible to more people?
Turning information into knowledge
• The need to cope with ubiquitous, complex, incomplete and inconsistent information is pervasive in our societies
• How can people benefit from this information in their learning, working and social lives?
• What new techniques are necessary for managing, accessing, integrating and personalising such information?
• How can we design and build tools that help people to understand such information and generate new knowledge from it?
The changing cultures of new media
What are the differences and continuities between ‘old’ media (books, film, TV) and ‘new’ media (internet, computer games, mobile phones)?
How do children and adults use these media in different contexts, both as consumers and producers?
How are they learning in, and from, this convergent media environment?
What are the implications of these developments for formal and informal learning?
Creating empowering technologies for learning
How are equity, participation, learner autonomy, and the structuring of learning impacted by digital technologies and new media?
Which media-enhanced approaches can help people to learn and collaborate?
How can the Internet, and ambient and mobile technologies create new learning opportunities?
Turning information into knowledge – information integration
AutoMed (EPSRC)
– developing tools for semi-automatic integration of heterogeneous information sources
– can handle both structured and semi-structured (RDF/S, XML) data
– can handle virtual, materialised and hybrid integration scenarios
– application in biological data integration, e-learning, P2P data integration
ISPIDER (BBSRC e-Science programme)
– developing an integrated platform of proteomic data sources, enabled as Grid and Web services
– collaboration with groups at EBI, Manchester, UCL
The AutoMed Project
Partners: Birkbeck and Imperial Colleges
Data integration based on schema equivalence/subsumption
Low-level metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level data modelling languages are defined – therefore extensible with new modelling languages
Provides a set of primitive equivalence-preserving schema transformations for higher-level modelling languages:
• addT(c,q) deleteT(c,q) renameT(c,n,n’)
Also two more primitive transformations for imprecise integration scenarios:
• extendT(c,Range q q’) contractT(c,Range q q’)
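As a rough illustration of the primitives above, the following Python sketch treats a schema as a set of construct names and applies a pathway one step at a time. The Step class, the string form of the queries, the ("Void", "Any") placeholder bounds and the example construct names are all assumptions made for illustration; they are not the AutoMed toolkit's API, and the queries are carried along but never evaluated.

```python
from dataclasses import dataclass
from typing import Any, FrozenSet, Iterable


@dataclass(frozen=True)
class Step:
    kind: str         # "add", "delete", "rename", "extend" or "contract"
    construct: str    # the schema construct c
    arg: Any = None   # query q, a (lower, upper) Range pair, or the new name n'


def apply_step(schema: FrozenSet[str], step: Step) -> FrozenSet[str]:
    """Return the schema obtained by applying one primitive transformation."""
    if step.kind in ("add", "extend"):
        if step.construct in schema:
            raise ValueError(f"{step.construct} is already in the schema")
        return schema | {step.construct}
    if step.kind in ("delete", "contract"):
        if step.construct not in schema:
            raise ValueError(f"{step.construct} is not in the schema")
        return schema - {step.construct}
    if step.kind == "rename":
        if step.construct not in schema:
            raise ValueError(f"{step.construct} is not in the schema")
        return (schema - {step.construct}) | {step.arg}
    raise ValueError(f"unknown transformation kind: {step.kind}")


def apply_pathway(schema: FrozenSet[str], pathway: Iterable[Step]) -> FrozenSet[str]:
    """Apply the steps of a pathway in order."""
    for step in pathway:
        schema = apply_step(schema, step)
    return schema


# Hypothetical example: a local schema is transformed into a global one.
local_schema = frozenset({"staff", "staff.name"})
pathway = [
    Step("add", "person", "staff"),                   # person defined by a query over staff
    Step("add", "person.name", "staff.name"),
    Step("extend", "person.email", ("Void", "Any")),  # only loose bounds are known
    Step("delete", "staff.name", "person.name"),
    Step("delete", "staff", "person"),
]
global_schema = apply_pathway(local_schema, pathway)
```

The extend step is used where the new construct cannot be derived exactly from the existing ones, which is the imprecise integration case mentioned above.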
Features of the AutoMed toolkit
Schema transformations are automatically reversible:
• addT/deleteT(c,q) by deleteT/addT(c,q)
• extendT(c,Range q1 q2) by contractT(c,Range q1 q2)
• renameT(c,n,n’) by renameT(c,n’,n)
Hence bi-directional transformation pathways (more generally transformation networks) are defined between schemas
The queries within transformations allow automatic data and query translation
Schemas may be expressed in a variety of modelling languages
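The reversal rules on this slide can be sketched in a few lines of Python. Steps are represented here as plain (kind, construct, argument) tuples, which is an assumption of this sketch rather than the toolkit's own representation; for rename steps the construct and argument stand for the old and new names.

```python
# Each primitive is undone by its counterpart, keeping the same construct and query.
INVERSE_KIND = {
    "add": "delete",
    "delete": "add",
    "extend": "contract",
    "contract": "extend",
}


def reverse_step(step):
    """Return the step that undoes the given step."""
    kind, construct, arg = step
    if kind == "rename":
        # rename(old, new) is undone by rename(new, old)
        return ("rename", arg, construct)
    return (INVERSE_KIND[kind], construct, arg)


def reverse_pathway(pathway):
    """Reverse a pathway: undo the steps in the opposite order."""
    return [reverse_step(step) for step in reversed(pathway)]


# Hypothetical pathway from one schema to another.
forward = [
    ("add", "person", "staff"),
    ("rename", "person", "employee"),
    ("extend", "employee.email", ("Void", "Any")),
]
backward = reverse_pathway(forward)
```

Reversing a pathway twice gives back the original, which is what makes the pathways bi-directional.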
Schema transformation/integration networks
[Figure: local schemas LS1, LS2, …, LSi, …, LSn; union-compatible schemas US1, US2, …, USi, …, USn linked by id steps; global schema GS]
Schema transformation/integration networks (cont’d)
On the previous slide:
• GS is a global schema
• LS1, …, LSn are local schemas
• US1, …, USn are union-compatible schemas
• the transformation pathway between each pair LSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
• the transformation pathway between USi and GS is similar
• the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
AutoMed architecture
Global Query Processor
Global Query Optimiser
Schema Evolution Tool
Schema Transformation and Integration Tools
Model Definition Tool
Schema and Transformations Repository (STR)
Model Definitions Repository (MDR)
Wrapper
Other data integration approaches: GAV & LAV
Global-As-View (GAV) approach: specify GS constructs by view definitions over LS constructs
Local-As-View (LAV) approach: specify LS constructs by view definitions over GS constructs
[Figure: a Global Schema linked by view definitions to three Local Schemas, over an RDF source, an XML file and an RDB]
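To make the contrast between the two approaches concrete, here is a small Python sketch using made-up in-memory sources. In the GAV style the global construct is computed by a view over local constructs; in the LAV style each local construct is described by a view over the global construct. All names and data are hypothetical.

```python
# Hypothetical local sources.
rdb_staff = [{"name": "Ann", "dept": "CS"}, {"name": "Bob", "dept": "Maths"}]
xml_students = [{"name": "Cy"}]


# GAV: the global construct "person" is a view over the local constructs.
def gav_person():
    staff = [{"name": r["name"], "role": "staff"} for r in rdb_staff]
    students = [{"name": r["name"], "role": "student"} for r in xml_students]
    return staff + students


# LAV: each local construct is described as a view over the global construct.
def lav_rdb_staff_names(person):
    return [p["name"] for p in person if p["role"] == "staff"]


def lav_xml_students(person):
    return [{"name": p["name"]} for p in person if p["role"] == "student"]


person = gav_person()
staff_names = lav_rdb_staff_names(person)
students = lav_xml_students(person)
```

In the GAV function the dependence on every source is written into one global definition, whereas each LAV function mentions only the global construct and its own source.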
Evolution problems of GAV and LAV
GAV does not readily support evolution of local schemas e.g. adding a new attribute to a source table may invalidate some of the global view definitions
In LAV, changes to a local schema impact only the derivation rules defined for that schema
But conversely LAV has problems if one wants to evolve the global schema since all the view definitions defining local schema constructs in terms of the global schema would need to be reviewed
These evolution problems are exacerbated in P2P data integration scenarios where there is no distinction between local and global schemas
AutoMed vs GAV/LAV/GLAV
AutoMed schema transformation pathways capture at least the information available from GAV and LAV rules:
• add/extend transformations correspond to GAV rules
• delete/contract transformations correspond to LAV rules
Thus, GAV and LAV view definitions can be derived from a BAV (both-as-view) network
GLAV rules e :- e’ are also captured, by BAV transformations of the form add(T,e); …;del(T,e’)
Thus, any reasoning or processing that is possible using GAV, LAV or GLAV is also possible using BAV
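The correspondence between pathway steps and GAV/LAV rules can be sketched as a function that walks a pathway from a source schema to a global schema and collects view definitions from it. This is a simplified illustration: the tuple representation, the rule format and the example constructs are assumptions, and rename steps are simply skipped rather than being applied to the collected queries as a fuller treatment would require.

```python
def derive_views(pathway):
    """Split a source-to-global pathway into GAV-style and LAV-style rules.

    Each step is a (kind, construct, query) tuple.
    """
    gav, lav = {}, {}
    for kind, construct, query in pathway:
        if kind in ("add", "extend"):
            # A new construct is defined by a query over constructs already present.
            gav[construct] = query
        elif kind in ("delete", "contract"):
            # A removed construct is described by a query over the remaining constructs.
            lav[construct] = query
    return gav, lav


# Hypothetical pathway from a local schema to a global schema.
pathway = [
    ("add", "person", "staff"),
    ("extend", "person.email", ("Void", "Any")),
    ("delete", "staff", "person"),
]
gav_rules, lav_rules = derive_views(pathway)
```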
Schema Evolution in BAV
Unlike GAV/LAV/GLAV, BAV readily supports the evolution of both local and global schemas.
The evolution of a global or local schema is specified by a schema transformation pathway T from the old schema S to the new schema S’
The transformation network and schemas can then be systematically repaired (rather than having to be redefined)
[Figure: a pathway T from Global Schema S to New Global Schema S’, and a pathway T from Local Schema S to New Local Schema S’]
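One simple reading of this repair step, sketched in Python under the assumption that pathways are lists of (kind, construct, argument) tuples: if the global schema S evolves to S’ by a pathway T, an existing pathway from a local schema to S is extended by appending T; if a local schema S evolves to S’, the reverse of T is placed in front of the old pathway from S. The helper names are hypothetical, and the sketch ignores any further repairs that may be needed when T adds or removes information through extend or contract steps.

```python
INVERSE_KIND = {
    "add": "delete",
    "delete": "add",
    "extend": "contract",
    "contract": "extend",
}


def reverse_pathway(pathway):
    """Undo the steps of a pathway in the opposite order."""
    reversed_steps = []
    for kind, construct, arg in reversed(pathway):
        if kind == "rename":
            reversed_steps.append(("rename", arg, construct))
        else:
            reversed_steps.append((INVERSE_KIND[kind], construct, arg))
    return reversed_steps


def evolve_global(local_to_global, t):
    """Pathway LS -> S becomes LS -> S' by appending T (S -> S')."""
    return local_to_global + t


def evolve_local(local_to_global, t):
    """Pathway S -> GS becomes S' -> GS by prefixing the reverse of T (S -> S')."""
    return reverse_pathway(t) + local_to_global


# Hypothetical existing pathway and evolutions.
old_pathway = [("add", "person", "staff"), ("delete", "staff", "person")]
t_global = [("rename", "person", "employee")]
t_local = [("rename", "staff", "staff_member")]

after_global_change = evolve_global(old_pathway, t_global)
after_local_change = evolve_local(old_pathway, t_local)
```

In both cases the old pathway is reused unchanged, which is the sense in which the network is repaired rather than redefined.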
Global Query Processing
We handle query language heterogeneity by translation into/from a functional intermediate query language – IQL
A query Q expressed in a high-level query language on a global schema S is first translated into IQL (this functionality is not yet supported in the AutoMed toolkit)
View definitions are derived from the transformation pathways between S and the requested data source schemas
These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs
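The substitution step can be illustrated with a toy query representation. In this Python sketch a query is either a construct name or a nested tuple (operator, argument, ...); this is only a stand-in for IQL, and the view definitions, which would really be derived from the transformation pathways, are written by hand here.

```python
def unfold(query, views):
    """Replace every global construct in the query by its view definition."""
    if isinstance(query, str):
        # A construct name: substitute its definition if one exists.
        return unfold(views[query], views) if query in views else query
    op, *args = query
    return (op, *(unfold(arg, views) for arg in args))


# Hypothetical view definition of a global construct over source constructs.
views = {"person": ("union", "rdb.staff", "xml.student")}

# Hypothetical query on the global schema.
global_query = ("count", ("distinct", "person"))
source_query = unfold(global_query, views)
```

After unfolding, the query mentions only source schema constructs, so it can be optimised and split into sub-queries for the individual sources.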
Global Query Processing (cont’d)
Query optimisation and query evaluation then occur
During query evaluation, the evaluator submits to wrappers sub-queries that they are able to translate into the local query language
Currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data sources
The wrappers translate sub-query results back into the IQL type system
Further query post-processing then occurs in the IQL evaluator
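The division of labour between the evaluator and the wrappers might look roughly like the following Python sketch. The wrapper class, the tuple form of sub-queries and the in-memory "sources" are all hypothetical stand-ins; real wrappers would translate sub-queries into SQL, XPath and so on.

```python
class ListWrapper:
    """Hypothetical wrapper over an in-memory list of rows."""

    def __init__(self, rows, supported_ops):
        self.rows = rows
        self.supported_ops = set(supported_ops)

    def can_translate(self, subquery):
        return subquery[0] in self.supported_ops

    def execute(self, subquery):
        if not self.can_translate(subquery):
            raise ValueError(f"cannot translate {subquery[0]}")
        op, *args = subquery
        if op == "scan":
            return [dict(row) for row in self.rows]
        if op == "select_eq":
            field, value = args
            return [dict(row) for row in self.rows if row[field] == value]
        raise ValueError(f"unknown operator {op}")


def evaluate(subquery, wrapper):
    """Push the sub-query to the wrapper if it can translate it; otherwise
    fetch the rows and finish the work in the evaluator."""
    if wrapper.can_translate(subquery):
        return wrapper.execute(subquery)
    op, *args = subquery
    rows = wrapper.execute(("scan",))
    if op == "select_eq":
        field, value = args
        return [row for row in rows if row[field] == value]
    raise ValueError(f"unsupported operator {op}")


rows = [{"name": "Ann", "dept": "CS"}, {"name": "Bob", "dept": "Maths"}]
sql_like = ListWrapper(rows, {"scan", "select_eq"})   # can filter at the source
flat_file = ListWrapper(rows, {"scan"})               # can only return all rows

query = ("select_eq", "dept", "CS")
from_sql_like = evaluate(query, sql_like)
from_flat_file = evaluate(query, flat_file)
```

Both calls return the same rows; the difference is only where the filtering happens, at the source or in the evaluator's post-processing.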
Other AutoMed research at BBK
As well as virtual integration of data sources, we have investigated using AutoMed for materialised data integration i.e. a data warehousing approach
In particular, Hao Fan has worked on incremental view maintenance, data lineage tracing and schema evolution over AutoMed schema transformation pathways
Lucas Zamboulis has developed semi-automatic techniques for transforming and integrating heterogeneous XML data
In recent work he is investigating the use of correspondences to ontologies to enhance these techniques
Sandeep Mittal is working on update translation and update propagation along AutoMed pathways e.g. in P2P environments
Other AutoMed research at BBK (cont’d)
Dean Williams has been working on extracting structure from unstructured text sources
The aim here is to integrate information extracted from unstructured text with structured information available from other sources
Dean is using existing technology (the GATE tool) for the text annotation and IE part of this work
The information extracted from the text is matched with existing structured information to derive new instance data and perhaps also new schema fragments
AutoMed is being used for the schema and data integration aspects of this project
ISPIDER Project
Partners: Birkbeck, EBI, Manchester, UCL
Aims:
• Vast, heterogeneous biological data
• Need for interoperability
• Need for efficient processing
• Development of Proteomics Grid Infrastructure, use existing proteomics resources and develop new ones, develop new proteomics clients for querying, visualisation, workflow etc.
Project Aims
myGrid / DQP / AutoMed
myGrid: collection of services/components allowing high-level integration of data/applications for in-silico experiments in biology
DQP:
• OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
• Distributed query processing over OGSA-DAI enabled resources
Ongoing research:
• AutoMed / DQP interoperability
• AutoMed / myGrid interoperability
DQP / AutoMed interoperability
Data sources wrapped with OGSA-DAI
AutoMed OGSA-DAI wrappers extract data sources’ metadata
Semantic integration of data sources using AutoMed transformation pathways into an integrated AutoMed schema
IQL queries submitted to this integrated schema are:
• Reformulated to IQL queries on the data sources, using the AutoMed transformation pathways
• Submitted to DQP for evaluation
[Figure: architecture diagram. Components: AutoMed Repository, AutoMed wrappers, AutoMed schemas, Integrated AutoMed Schema, AutoMed Query Processor, AutoMed DQP wrapper, Distributed Query Processor, OGSA-DAI services and activities, DBs. Flows: IQL query, OQL query, OQL result, IQL result]
Ongoing and future research
Heterogeneous data integration in Grid and P2P environments, with bioinformatics and e-learning as example application domains
Flexible combinations of virtual, materialised or hybrid integration
Flexible query processing in imprecise integration scenarios
P2P query processing over BAV pathways
P2P update processing over BAV pathways
Use of ECA rules and a P2P ECA rule execution engine for flexible update processing and data sharing