DiGIR: Distributed Generic Information Retrieval




Page 1: DiGIR: Distributed Generic Information Retrieval

Stan Blum, Dave Vieglais, P.J. Schwartz

Page 2: Project Goals

- To define a protocol for retrieving structured data from multiple, heterogeneous databases
- To build a reference implementation of said protocol

Page 3: Design Goals

- To use open protocols and standards, such as HTTP, XML, and UDDI, to leverage existing and emerging technologies
- To de-couple the protocol, software, and semantics
- To automate the establishment of a new data provider as much as possible

Page 4: High-level Architecture

- Protocol
- Provider
- Portal
- Registry

Page 5: Protocol

- Defines request and response message formats for communication between Provider and Portal
- Assumes Providers conform to a known federation schema
- Remains flexible to allow for federation schema pluggability

Page 6: Provider

- Makes structured data available to portals
- Communicates via protocol-compliant messaging only
- Complies with a known federation schema
- Supplies meta-data to describe data classification and availability

Page 7: Portal

- The entry point for a “user”
- Can make requests of any number (N) of providers
- Communicates via protocol-compliant messaging only
- Queries registry for available providers
- Can determine, based on provider meta-data, whether a provider should be queried

Page 8: Project Information

- The DiGIR project is a collaborative effort
- DiGIR is currently established as an open source project on SourceForge (http://sourceforge.net).
- Further documentation is available on the SourceForge site.
- Please join us in collaborating!

Page 9: Protocol Details

Page 10: Protocol Details

- Specified in an XML Schema (.xsd)
- Intended to work in conjunction with federation schemas, also expressed as XML Schemas
- Actual request and response documents are instance documents conforming to both the protocol schema and a federation schema

Page 11: Example request

<request xmlns="http://www.namespaceTBD.org/digir"
         xmlns:darwin="http://www.namespaceTBD.org/darwin"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                             http://www.namespaceTBD.org/darwin darwin.xsd">
  <header>
    <requestType>search</requestType>
  </header>
  <search>
    <dbName>myDiggableBipesDB</dbName>
    <filter>
      <and>
        <in>
          <list xsi:type="darwin:list">
            <darwin:Month>11</darwin:Month>
            <darwin:Month>12</darwin:Month>
          </list>
        </in>
        <equals>
          <darwin:Genus>Bipes</darwin:Genus>
        </equals>
      </and>
    </filter>
    <records start="0" count="50" />
  </search>
</request>

Page 12: Request Explanation

- Composed of elements from the protocol namespace (default) and the schema namespace
- <header> contains information about the payload
- <search> contains dbName, filter, and record specification (will also specify result format)
- <filter> is effectively an XML representation of a SQL WHERE clause
- This search request is for the first 50 specimen records that are genus Bipes and were found in the months of November or December.
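
As a rough illustration of how a receiver might read the request described above, the following Python sketch parses a trimmed copy of it with the standard library and pulls out the fields the slide mentions. The namespaces are the placeholder URIs from the slides; the function name `summarize_request` and the returned dictionary layout are assumptions made for this example, and the `xsi:type` attribute is left out to keep the sample short.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs taken from the slides ("namespaceTBD").
DIGIR = "http://www.namespaceTBD.org/digir"
DARWIN = "http://www.namespaceTBD.org/darwin"

# Trimmed copy of the example request (xsi attributes omitted for brevity).
REQUEST = f"""<request xmlns="{DIGIR}" xmlns:darwin="{DARWIN}">
  <header><requestType>search</requestType></header>
  <search>
    <dbName>myDiggableBipesDB</dbName>
    <filter>
      <and>
        <in>
          <list>
            <darwin:Month>11</darwin:Month>
            <darwin:Month>12</darwin:Month>
          </list>
        </in>
        <equals><darwin:Genus>Bipes</darwin:Genus></equals>
      </and>
    </filter>
    <records start="0" count="50" />
  </search>
</request>"""


def summarize_request(xml_text):
    """Hypothetical helper: extract the main fields of a search request."""
    ns = {"d": DIGIR, "dw": DARWIN}
    root = ET.fromstring(xml_text)
    records = root.find("d:search/d:records", ns)
    return {
        "requestType": root.findtext("d:header/d:requestType", namespaces=ns),
        "dbName": root.findtext("d:search/d:dbName", namespaces=ns),
        "start": int(records.get("start")),
        "count": int(records.get("count")),
        "months": [m.text for m in root.iterfind(".//dw:Month", ns)],
        "genus": root.findtext(".//dw:Genus", namespaces=ns),
    }


summary = summarize_request(REQUEST)
```

Here `summary` holds the database name, the record window (start 0, count 50), and the two filter conditions (months 11 and 12, genus Bipes).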

Page 13: Filter Building

- LOPs (logical operators), which can be nested:
  <and> <or> <andNot> <orNot>
- COPs (comparison operators):
  <equals> <lessThan> <lessThanOrEquals> <notEquals> <greaterThan> <greaterThanOrEquals> <like> <in> (multi-value)
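
The nesting of logical and comparison operators can be sketched with a few small builder functions. This is only an illustration in Python: the helper names `cop`, `lop`, and `build_filter` are made up for this example, and the `darwin:` prefix is written as a literal tag name, so a real request would still need the namespace declarations shown in the example request.

```python
import xml.etree.ElementTree as ET

# Operator names as listed on the slide.
LOPS = {"and", "or", "andNot", "orNot"}
COPS = {"equals", "lessThan", "lessThanOrEquals", "notEquals",
        "greaterThan", "greaterThanOrEquals", "like", "in"}


def cop(op, concept, *values):
    """Build a comparison operator element; <in> wraps its values in <list>."""
    if op not in COPS:
        raise ValueError(f"unknown comparison operator: {op}")
    element = ET.Element(op)
    if op == "in":
        parent = ET.SubElement(element, "list")
    else:
        if len(values) != 1:
            raise ValueError(f"<{op}> takes exactly one value")
        parent = element
    for value in values:
        ET.SubElement(parent, concept).text = str(value)
    return element


def lop(op, *operands):
    """Build a logical operator element around already-built operands."""
    if op not in LOPS:
        raise ValueError(f"unknown logical operator: {op}")
    if len(operands) < 2:
        raise ValueError(f"<{op}> needs at least two operands")
    element = ET.Element(op)
    element.extend(operands)
    return element


def build_filter(condition):
    """Wrap a single top-level condition in <filter>."""
    wrapper = ET.Element("filter")
    wrapper.append(condition)
    return wrapper


# The filter from the example request: Month in (11, 12) AND Genus = Bipes.
example = build_filter(
    lop("and",
        cop("in", "darwin:Month", 11, 12),
        cop("equals", "darwin:Genus", "Bipes")))
filter_xml = ET.tostring(example, encoding="unicode")
```

Because logical operators accept other logical operators as operands, arbitrarily deep filters can be composed the same way.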

Page 14: What “binds” the schemas?

- The protocol schema defines various abstract types and elements:

  <xsd:element name="searchCondition" abstract="true" />
  <xsd:element name="alphaSearchCondition" abstract="true"
               substitutionGroup="searchCondition" />
  <xsd:complexType name="listType" abstract="true" />
  <xsd:complexType name="numericListType" abstract="true" />

- A federation schema must define searchable concepts, or groups of them, as substitutable for these abstract elements or as extensions of the abstract types:

  <xsd:element name="Species" type="xsd:string"
               substitutionGroup="digir:alphaSearchCondition" />

Page 15: Example list type in a federation schema

<xsd:complexType name="list">
  <xsd:complexContent>
    <xsd:extension base="digir:listType">
      <xsd:sequence>
        <xsd:choice>
          <xsd:element ref="ScientificName" maxOccurs="unbounded" />
          <xsd:element ref="Kingdom" maxOccurs="unbounded" />
          <xsd:element ref="Phylum" maxOccurs="unbounded" />
          <xsd:element ref="Class" maxOccurs="unbounded" />
          <xsd:element ref="Order" maxOccurs="unbounded" />
          <xsd:element ref="Family" maxOccurs="unbounded" />
          <xsd:element ref="Genus" maxOccurs="unbounded" />
          <xsd:element ref="Species" maxOccurs="unbounded" />
          <!-- … -->
        </xsd:choice>
      </xsd:sequence>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>

Page 16: Why “bind” like this?

- To provide data-typing (string, numeric, etc.) for various concepts within operators at an abstract level (e.g. LIKE is only valid for string data; IN allows for multiples, but in a controlled fashion)
- To allow federation schemas to simply classify data as types without having to redefine or extend operators
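
The effect of this binding can be mimicked outside of XML Schema to show the idea. The Python sketch below is an analogy, not the schema mechanism itself: a federation schema only classifies each concept as alpha or numeric, and the operator rules are written once against those classes. The classification table (including treating Month as numeric) and the function name `check_condition` are assumptions for this example.

```python
ALPHA, NUMERIC = "alpha", "numeric"

# Hypothetical classification a federation schema might supply.
CONCEPT_TYPES = {"Genus": ALPHA, "Species": ALPHA, "Month": NUMERIC}


def check_condition(op, concept, values):
    """Check one comparison against the concept's data type.

    Returns the concept's type on success and raises ValueError otherwise.
    """
    kind = CONCEPT_TYPES.get(concept)
    if kind is None:
        raise ValueError(f"{concept} is not a searchable concept")
    # LIKE is only valid for string data.
    if op == "like" and kind != ALPHA:
        raise ValueError(f"<like> is not valid for {kind} concept {concept}")
    # IN allows multiple values; every other operator takes exactly one.
    if op == "in":
        if not values:
            raise ValueError("<in> needs at least one value")
    elif len(values) != 1:
        raise ValueError(f"<{op}> takes exactly one value")
    if kind == NUMERIC:
        for value in values:
            float(value)  # raises ValueError for non-numeric text
    return kind
```

Adding a new concept only means adding one entry to the classification table; none of the operator rules change, which is the point the slide makes about federation schemas.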

Page 17: Request Issues

- Do we need another abstract element such as dateSearchCondition?
- What information will be useful in the header?
- How should we specify the format of the results?
- What standard formats should be offered (e.g. brief, full)?
- Will tblName be part of the meta-data required of providers?
- What concepts of Darwin Core 2 are searchable?

Page 18: Response Prototype

<response xmlns="http://www.namespaceTBD.org/digir"
          xmlns:darwin="http://www.namespaceTBD.org/darwin"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                              http://www.namespaceTBD.org/darwin darwin.xsd">
  <header>
    <!-- contents TBD -->
  </header>
  <content>
    <record>
    </record>
  </content>
  <diagnostics>
  </diagnostics>
</response>

Page 19: Response Issues

- How do we format and validate the response content?
- What elements are needed for the <header>, if any?
- Do we always have diagnostics, or only if there is an error?
- Should a finite set of diagnostics be created and maintained in its own XML Schema?
- Will there ever be a diagnostic that is specific to a federation schema?

Page 20: Provider Details

Page 21: Provider Details

- Implemented as a web application that answers questions
- Interface is not specific to a particular information domain
- No state information is recorded
  - Each request is treated as unique and uninfluenced by previous requests
- Must always generate a valid response
- Consists of four key components
  - Request handler
  - Filter handler
  - Result set cache
  - Response generator

Page 22: Request Handler

- Receives XML document
- Validates document
- Generates internal structures for further processing
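
A minimal Python sketch of these three steps follows. It is not the reference implementation: the standard library cannot validate against an XML Schema, so "validation" here is reduced to a well-formedness check plus a few structural checks, and the `SearchRequest` class and nested-tuple filter tree are internal structures invented for this example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

DIGIR = "http://www.namespaceTBD.org/digir"  # placeholder URI from the slides
LOPS = ("and", "or", "andNot", "orNot")


@dataclass
class SearchRequest:
    """Hypothetical internal structure for a parsed search request."""
    db_name: str
    filter_tree: tuple
    start: int
    count: int


def local(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def to_tree(element):
    """Turn a filter element into nested tuples.

    Logical operators become (name, [children]); comparison operators
    become (name, concept, [values]).
    """
    name = local(element.tag)
    if name in LOPS:
        return (name, [to_tree(child) for child in element])
    holder = element[0] if name == "in" else element  # <in> wraps a <list>
    if len(holder) == 0:
        raise ValueError(f"<{name}> has no values")
    return (name, local(holder[0].tag), [child.text for child in holder])


def handle_request(xml_text):
    """Receive, check, and convert a search request."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"request is not well-formed XML: {exc}") from exc
    ns = {"d": DIGIR}
    if root.tag != f"{{{DIGIR}}}request":
        raise ValueError("root element must be <request>")
    if root.findtext("d:header/d:requestType", namespaces=ns) != "search":
        raise ValueError("only search requests are handled in this sketch")
    flt = root.find("d:search/d:filter", ns)
    records = root.find("d:search/d:records", ns)
    if flt is None or len(flt) != 1 or records is None:
        raise ValueError("<search> needs one filter condition and <records>")
    return SearchRequest(
        db_name=root.findtext("d:search/d:dbName", namespaces=ns),
        filter_tree=to_tree(flt[0]),
        start=int(records.get("start", "0")),
        count=int(records.get("count", "0")),
    )
```

A provider that needs true schema validation would have to use a validating XML parser in place of the structural checks above.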

Page 23: Filter Handler

- Internal structural representation of filter (query) structure
- Responsible for generating a native query string for querying the database
- Communicates with UDDI to obtain standard database definition
- Custom configured to work with specific database implementation
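
One way to picture the generation of a native query string is a small translator from a filter tree to a parameterized SQL WHERE clause. The Python sketch below assumes a nested-tuple tree of the form `(operator, [children])` or `(operator, concept, [values])`, a hypothetical data map from concepts to column names, and a hypothetical table name. It handles only `and` and `or`, because the slides do not spell out the semantics of `andNot` and `orNot`, and it does not model the UDDI lookup.

```python
# Hypothetical data map from federation-schema concepts to local columns.
COLUMN_MAP = {"Genus": "genus", "Month": "collect_month"}

SQL_OPS = {
    "equals": "=", "notEquals": "<>",
    "lessThan": "<", "lessThanOrEquals": "<=",
    "greaterThan": ">", "greaterThanOrEquals": ">=",
    "like": "LIKE",
}


def to_where(node, params):
    """Translate a filter tree into SQL text, collecting bind parameters."""
    op = node[0]
    if op in ("and", "or"):
        parts = [to_where(child, params) for child in node[1]]
        return "(" + f" {op.upper()} ".join(parts) + ")"
    if op in ("andNot", "orNot"):
        raise NotImplementedError(f"<{op}> is not covered by this sketch")
    _, concept, values = node
    column = COLUMN_MAP[concept]
    if op == "in":
        params.extend(values)
        placeholders = ", ".join("?" for _ in values)
        return f"{column} IN ({placeholders})"
    params.append(values[0])
    return f"{column} {SQL_OPS[op]} ?"


def build_query(tree, start, count, table="specimens"):
    """Return (sql, params) for one search; the table name is an assumption."""
    params = []
    where = to_where(tree, params)
    sql = f"SELECT * FROM {table} WHERE {where} LIMIT ? OFFSET ?"
    return sql, params + [count, start]


tree = ("and", [("in", "Month", ["11", "12"]),
                ("equals", "Genus", ["Bipes"])])
sql, params = build_query(tree, start=0, count=50)
```

Values are passed as bind parameters instead of being spliced into the string, so user-supplied filter values cannot alter the query. The `LIMIT ... OFFSET` syntax is one dialect choice; a provider configured for a different database would generate its own paging clause.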

Page 24: Result Set Cache

- Contains the results of applying a query
- Responsible for generating the response records in the requested format
- Somewhat directly integrated with the response generator

Page 25: Response Generator

- Generates the response XML document
- Serializes the response header information
- Serializes diagnostic information
- Serializes the requested subset of records
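
The serialization steps can be sketched in Python as follows, following the element layout of the response prototype (header, content with records, diagnostics). Several details are assumptions, since the slides leave them open: records are given as dictionaries mapping concept names to values, each value is written as an element in the placeholder darwin namespace, and each diagnostic is written as a hypothetical `<diagnostic>` child element.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs taken from the slides ("namespaceTBD").
DIGIR = "http://www.namespaceTBD.org/digir"
DARWIN = "http://www.namespaceTBD.org/darwin"

# Ask ElementTree to serialize with a default namespace and a darwin: prefix.
ET.register_namespace("", DIGIR)
ET.register_namespace("darwin", DARWIN)


def qname(namespace, name):
    """Build the '{namespace}name' form ElementTree uses for qualified names."""
    return f"{{{namespace}}}{name}"


def generate_response(records, start, count, diagnostics=()):
    """Serialize the requested window of records into a response document."""
    root = ET.Element(qname(DIGIR, "response"))
    ET.SubElement(root, qname(DIGIR, "header"))  # contents TBD in the slides
    content = ET.SubElement(root, qname(DIGIR, "content"))
    for record in records[start:start + count]:
        record_el = ET.SubElement(content, qname(DIGIR, "record"))
        for concept, value in record.items():
            ET.SubElement(record_el, qname(DARWIN, concept)).text = str(value)
    diagnostics_el = ET.SubElement(root, qname(DIGIR, "diagnostics"))
    for message in diagnostics:
        # <diagnostic> is a made-up element name for this sketch.
        ET.SubElement(diagnostics_el, qname(DIGIR, "diagnostic")).text = message
    return ET.tostring(root, encoding="unicode")


rows = [{"Genus": "Bipes", "Month": 11},
        {"Genus": "Bipes", "Month": 12},
        {"Genus": "Bipes", "Month": 11}]
response_xml = generate_response(rows, start=1, count=2,
                                 diagnostics=["3 records matched"])
```

Since the function always emits the header, content, and diagnostics elements, an empty result set still yields a structurally complete response.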

Page 26: Provider Configuration

[Diagram. Labels: Portal; Profile; Schema; and, shown twice, a Data Provider System containing Data, DiGIR Provider, Data Map, and Schema.]

Page 27: Portal Details

Page 28: Portal Details

- Divided into two distinct components: a presentation layer and PortalServices
- The presentation layer supports the UI and translates requests (HTTP requests from forms or links) into protocol-compliant XML requests
- The presentation layer also handles all display issues involving the responses, such as format, sorting, collating, etc.
- The presentation layer is envisioned to be an application server/web server implementation

Page 29: Portal Details

- PortalServices handles all external network activity (UDDI calls, provider calls, etc.)
- PortalServices limits provider calls to those necessary based on provider meta-data
- PortalServices threads provider calls for increased performance (i.e. response time)
- PortalServices is envisioned to be a webapp and supporting classes running within an application server, such as Tomcat
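
The combination of meta-data based filtering and threaded provider calls might look like the Python sketch below. It is only an illustration of the idea: the slides envision a Java webapp, the meta-data format is still an open question, and here each provider is a dictionary with a hypothetical `metadata["concepts"]` list and a `send` callable standing in for the real network call.

```python
from concurrent.futures import ThreadPoolExecutor


def should_query(metadata, request_concepts):
    """Assumed rule: query a provider only if it supports every concept used."""
    return set(request_concepts) <= set(metadata["concepts"])


def query_providers(providers, request_xml, request_concepts, max_workers=8):
    """Send one request to all relevant providers concurrently.

    Returns a dict mapping provider name to its response, or to an
    'error: ...' string if the call raised an exception.
    """
    selected = {name: provider for name, provider in providers.items()
                if should_query(provider["metadata"], request_concepts)}
    results = {}
    if not selected:
        return results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every call first so they run in parallel, then collect.
        futures = {name: pool.submit(provider["send"], request_xml)
                   for name, provider in selected.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # one failing provider must not sink the rest
                results[name] = f"error: {exc}"
    return results


def unreachable(request_xml):
    """Stand-in for a provider whose network call fails."""
    raise ConnectionError("provider unreachable")


# Fake providers for demonstration; real ones would make HTTP calls.
providers = {
    "museumA": {"metadata": {"concepts": ["Genus", "Month"]},
                "send": lambda xml: "<response>A</response>"},
    "museumB": {"metadata": {"concepts": ["Genus"]},
                "send": lambda xml: "<response>B</response>"},
    "museumC": {"metadata": {"concepts": ["Genus", "Month"]},
                "send": unreachable},
}
results = query_providers(providers, "<request/>", ["Genus", "Month"])
```

With threads, the total wait is roughly that of the slowest provider instead of the sum of all provider response times, and a single failing provider is reported without discarding the other responses.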

Page 30: PortalServices

- RegistryAccess
- ProviderCache
- PortalConfig
- PortalServlet
- PortalRequestHandler
- ProviderFilterer
- Marshallers

Page 31: Portal Issues

- What information will be stored in UDDI about a provider?
- What information will be known for communicating with a Provider (e.g. IP address, port, etc.)?
- What meta-data will be provided and what are the rules for using such data for provider filtering?
- What requirements are there for logging and monitoring?