Using SRB and iRODS with the Cheshire3 Information Framework
Building Data Grids with iRODS
27-30 May, 2008
National e-Science Centre
Edinburgh
Dr Robert Sanderson
Dept. of Computer Science
University of [email protected]
http://www.cheshire3.org/
Building Data Grids with iRODS
iRODS Workshop, May 27th 2008 Slide 1
Cheshire3
  Introduction
  Architecture
SRB Integration
  Architecture
  Grid
  Usage
iRODS Integration
  Possible Architectures
Overview
Cheshire3: Information Analysis Framework
Digital Library/Information Retrieval engine with ...
  Data Mining/Machine Learning
  Text Mining/Natural Language Processing
  Computational Grid
  Data Grid
Standards Based: Unicode, XML/XPath, MPI, Z39.50/SRU, ...
Object Oriented Architecture
Easy to develop and extend in Python,
... but heavy lifting possible in imported C libraries
Developed at University of Liverpool, plus UC Berkeley
Version: 0.9.10
Mostly stable, needs thorough testing/documentation
Introduction
Context
Architecture
[Diagram: Cheshire3 object architecture and ingest process. Labels: Server, ConfigStore, UserStore, User, Object, Database, ProtocolHandler, Query, ResultSet, DocumentFactory, Document, PreParser, Parser, Record, Transformer, RecordStore, DocumentStore, Extractor, Tokenizer, TokenMerger, Normalizer, Index, IndexStore, Terms.]
Architecture 2
[Diagram: indexing a Record. Labels: Record, XPathObject, Extractor, Tokenizer, TokenMerger, Normalizer (chained), Index (several), IndexStore.]
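The chain of components named in the diagram can be illustrated with a minimal Python sketch. This is not the Cheshire3 API: every function name below is a hypothetical stand-in, the record is a plain dict rather than XML selected by XPath, and the order of the steps (extract, tokenize, merge, normalize, index) is an assumption about how the boxes connect.

```python
import re
from collections import defaultdict


def extract(record, field):
    """Extractor stand-in: pull the text of one field from a record (a dict here)."""
    return record.get(field, "")


def tokenize(text):
    """Tokenizer stand-in: split text into word tokens."""
    return re.findall(r"\w+", text)


def merge_tokens(tokens):
    """TokenMerger stand-in: collapse tokens into a term -> occurrence count map."""
    counts = defaultdict(int)
    for token in tokens:
        counts[token] += 1
    return dict(counts)


def normalize(term_counts):
    """Normalizer stand-in: case-fold terms, summing counts that collide."""
    folded = defaultdict(int)
    for term, count in term_counts.items():
        folded[term.lower()] += count
    return dict(folded)


def index_record(index, record_id, record, field):
    """Run one record through the chain and add its terms to an in-memory index."""
    terms = normalize(merge_tokens(tokenize(extract(record, field))))
    for term, count in terms.items():
        index.setdefault(term, {})[record_id] = count
    return index


index = {}
index_record(index, 1, {"title": "Data Grids and data mining"}, "title")
index_record(index, 2, {"title": "Grid storage"}, "title")
```

Each step is a separate function so that, as in the diagram, several normalizers or a different tokenizer could be swapped into the chain without touching the others.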
SRB Integration
RecordStore / DocumentStore
Storage backends: Filesystem, Berkeley DB, SQL RDBMS (PostgreSQL), SRB
[Diagram labels: record, document; data]
SRB Integration
IndexStore
[Diagram: IndexStore in SRB. Labels: terms split by range (a-b, c-d, e-f, g-h, ...) into Index dbs; db with query term.]
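One way to read the diagram is that index terms are partitioned by initial letter into separate dbs, so a search only needs the db that holds the query term. The sketch below illustrates that idea under assumptions: the two-letter range width follows the labels on the slide, while the function names and the "other" bucket for non-alphabetic terms are hypothetical.

```python
import string


def build_ranges(width=2):
    """Map each letter a-z to a range name such as 'a-b', 'c-d', ..."""
    letters = string.ascii_lowercase
    ranges = {}
    for start in range(0, len(letters), width):
        chunk = letters[start:start + width]
        name = f"{chunk[0]}-{chunk[-1]}"
        for letter in chunk:
            ranges[letter] = name
    return ranges


RANGES = build_ranges()


def db_for_term(term):
    """Return the name of the db that would hold this term."""
    first = term[:1].lower()
    # Terms that do not start with a-z go to a catch-all bucket (an assumption).
    return RANGES.get(first, "other")


target = db_for_term("grid")
```

With this layout only one small db has to be fetched from remote storage per query term, rather than the whole index.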
Grid Implementation
Focus on ingest, not discovery (yet)
Instantiate architecture on every node
Assign one node as master, rest as slaves.
Master then divides the processing as appropriate.
Calls between slaves possible
Calls as small and simple as possible:
(objectIdentifier, functionName, *arguments)
Typically:
(workflow_id, 'process', document_id)
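The call format above can be sketched in plain Python. This is an illustration only, not Cheshire3 code: the `Workflow` class, the `OBJECTS` registry and the identifiers are hypothetical, and the transport between nodes is left out entirely.

```python
class Workflow:
    """Hypothetical stand-in for a configured workflow object on a node."""

    def __init__(self, name):
        self.name = name
        self.processed = []

    def process(self, document_id):
        self.processed.append(document_id)
        return f"{self.name} processed {document_id}"


# Every node instantiates the same architecture, so an identifier is enough
# to find the target object locally; no object state travels with the call.
OBJECTS = {"ingest-workflow": Workflow("ingest-workflow")}


def handle_call(message):
    """Unpack (objectIdentifier, functionName, *arguments) and run it locally."""
    object_id, function_name, *arguments = message
    target = OBJECTS[object_id]
    method = getattr(target, function_name)
    return method(*arguments)


result = handle_call(("ingest-workflow", "process", "doc-42"))
```

Because each node holds the full object configuration, the message carries only identifiers, which keeps calls small regardless of document size.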
Grid Architecture
[Diagram: the Master Task sends (workflow, process, document) to Slave Tasks 1..N. Each slave fetches its document from the Data Grid and writes extracted data to GPFS Temporary Storage.]
Grid Architecture 2
[Diagram: the Master Task sends (index, load) to Slave Tasks 1..N. Each slave fetches extracted data from GPFS Temporary Storage and stores its index in the Data Grid.]
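The two phases in the grid diagrams (extract per document, then load per index) can be sketched on a single machine. This is a toy model under stated assumptions: a thread pool stands in for the slave tasks, a dict stands in for the Data Grid, a temporary directory stands in for GPFS temporary storage, and all names are hypothetical.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Stand-in for documents held in the data grid (hypothetical content).
DATA_GRID = {
    "doc-1": {"title": "data grid", "body": "storage"},
    "doc-2": {"title": "grid", "body": "compute storage"},
}


def extract_task(scratch, document_id):
    """Phase 1 worker: fetch one document, extract terms, write them to scratch."""
    document = DATA_GRID[document_id]
    extracted = {field: text.split() for field, text in document.items()}
    (scratch / f"{document_id}.json").write_text(json.dumps(extracted))
    return document_id


def load_task(scratch, index_name):
    """Phase 2 worker: read all extracted data and build one named index."""
    index = {}
    for path in sorted(scratch.glob("*.json")):
        extracted = json.loads(path.read_text())
        for term in extracted.get(index_name, []):
            index.setdefault(term, []).append(path.stem)
    return index_name, index


def run_master(index_names=("title", "body"), n_workers=2):
    """Master: hand out one small call per document, then one per index."""
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp)
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            # Phase 1 must finish before phase 2 starts, hence list().
            list(pool.map(lambda doc_id: extract_task(scratch, doc_id), sorted(DATA_GRID)))
            results = list(pool.map(lambda name: load_task(scratch, name), index_names))
    return dict(results)


indexes = run_master()
```

Splitting the work this way means phase 1 parallelises over documents and phase 2 over indexes, with shared temporary storage as the only hand-off point between them.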
NARA ERA Demonstrator
  20 GB of web-crawled data in SRB, indexes stored in SRB
  Interface generated by easily deployable Python layer
Medline Dataset Experiments
  16.5 million abstracts plus associated metadata
  Parsed data stored in SRB
  Indexes in filesystem
NSDL Grade Level Analysis
  NSDL web crawl data (3 TB+)
  Data already in SRB, analysis stored to SRB
Usage
Simple integration (à la SRB) is possible:
Store data in iRODS for Storage classes
Requires Python interface to iRODS
Doesn't really benefit from rule capabilities
Other (more interesting) Options:
Cheshire3 as External Microservice Platform
Cheshire3 as Internal Microservice Platform
Cheshire3 as Rules Platform(?)
iRODS Integration
External Microservice Platform
[Diagram: iRODS and Cheshire3 as separate systems. A C3 Interface Microservice in iRODS sends data to C3 Microservices in Cheshire3, which return processed data.]
Possible Interfaces: MPI/PVM, RPC, SOAP, XML over HTTP, arbitrary transport protocol, etc.
Loose Coupling via Client Interface
Internal Microservice Platform
[Diagram: Cheshire3 embedded within iRODS. Data passes directly from an iRODS Microservice to C3 Microservices.]
Requires iRODS to have Python interpreter as alternative Microservice platform, rather than a Python client API.
Much tighter integration: Cheshire3 would have access to iRODS internal information rather than just what was passed over interface.
Microservice definition problem becomes Cheshire3 Workflow definition – XML description
No bandwidth problems of transferring large amounts of data back and forth
Tight Coupling via Python Integration
Rules Platform?
[Diagram: within iRODS, data passes to Cheshire3 Rules, which call C3 Microservices and other Microservices.]
Requires a Python interpreter at the Rules execution level, rather than (or as well as) at the Microservice level.
More flexible in terms of rule design
Easier to write rules than current rule language
Event system rather than rules execution?
Integration of Computational Grid for rule/microservice execution?
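To make the "event system rather than rules execution" question concrete, here is a minimal sketch of what rules written as Python callables might look like. Nothing here is iRODS or Cheshire3 API: the `on` decorator, the `fire` function, the "put" event name and the handler signatures are all hypothetical.

```python
from collections import defaultdict

# Registry of event name -> list of handler callables (hypothetical design).
HANDLERS = defaultdict(list)


def on(event_name):
    """Decorator that registers a function as a handler for an event."""
    def register(func):
        HANDLERS[event_name].append(func)
        return func
    return register


def fire(event_name, **context):
    """Call every handler registered for the event, in registration order."""
    return [handler(**context) for handler in HANDLERS[event_name]]


@on("put")
def index_on_ingest(path, size):
    # A 'rule' expressed as ordinary Python: index every new object.
    return f"index {path}"


@on("put")
def replicate_large_objects(path, size):
    # A conditional 'rule': only act on objects above a size threshold.
    return f"replicate {path}" if size > 1_000_000 else None


actions = fire("put", path="/zone/home/doc.xml", size=2_000_000)
```

In a design like this, each handler could hand its work to the computational grid rather than run inline, which touches on the last question on the slide.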
Website: http://www.cheshire3.org/
Acknowledgements:
SHAMAN: EU 7th Framework Programme
Cheshire3: JISC, NSF
Questions?
Thank You!