A Look into the Apache OODT Ecosystem

Chris A. Mattmann

NASA JPL/Univ. Southern California/ASF

[email protected] November 9, 2011

• Apache Member involved in

– OODT (VP, PMC), Tika (VP, PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor), Gora (Champion), MRUnit (Mentor), and Airavata (Mentor)

• Senior Computer Scientist at NASA JPL in Pasadena, CA USA

• Software Architecture/Engineering Prof at Univ. of Southern California

And you are?

Welcome to the Apache in Space! (OODT) Track

Agenda

• Overview of OODT and its history

• How we got it to Apache

• How other projects can follow our model

• Existing successful deployments of OODT

• Pointers to papers, and more information including case studies

• Increasing data volumes (exponential growth)

• Increasing complexity of instruments and algorithms

• Increasing availability of proxy/sim/ancillary data

• Increasing rate of technology refresh

… all of this while NASA Earth Mission funding was decreasing

A data system framework based on a standard architecture and reusable software components for supporting all future missions.

Lessons from 90’s era missions

Object Oriented Data Technology http://oodt.apache.org

Funded initially in 1998 by NASA’s Office of Space Science

Envisaged as a national software framework for sharing data across heterogeneous, distributed data repositories

OODT is both an architecture and a reference implementation providing

Data Production

Data Distribution

Data Discovery

Data Access

OODT is Open Source and available from the Apache Software Foundation

Enter OODT

Apache OODT

• Originally funded by NASA to focus on

– distributed science data system environments

– science data generation

– data capture, end-to-end

– Distributed access to science data repositories by the community

• A set of building blocks/services to exploit common system patterns for reuse

• Supports deployment based on a rich information model

• Selected as a top level Apache Software Foundation project in January 2011

• Runner up for NASA Software of the Year

• Used for a number of science data system activities in planetary, earth, biomedicine, astrophysics

http://oodt.apache.org

Apache OODT Press

Why Apache and OODT?

• OODT is meant to be a set of tools to help build data systems

– It’s not meant to be “turn key”

– It attempts to exploit the boundary between bringing in capability vs. being overly rigid in science

– Each discipline/project extends it

• Apache is the elite open source community for software developers

– Fewer than 100 projects have been promoted to top level (Apache Web Server, Tomcat, Solr, Hadoop)

– Differs from other open source communities; it provides a governance and management structure

Governance Model + NASA = ♥

• NASA and other government agencies have tons of process

– They like that

Publicly accessible and searchable archives

• http://svnsearch.org/svnsearch/repos/ASF/search?path=%2Foodt

• http://mail-archives.apache.org/mod_mbox/oodt-dev/

• http://mail-archives.apache.org/mod_mbox/oodt-user/

• 100+ mailing list subscriptions

Great Metrics and Insight

• http://www.ohloh.net/p/oodt

Movement to the ASF

• Meeting held June 15, 2007 at JPL with ASF President Justin Erenkrantz

– Develop plan moving forward to bring first NASA project to Apache

– Discuss obstacles, sponsorship

– Discuss outlook

2007: original goals

• Come up with incubation proposal

– Chris Mattmann was one of the principal contributors to the proposal for the Tika project, and to other Incubation activities (Apache SIS)

– Send out emails to the Incubator mailing list

• Look for mentors

• Get sponsorship from ranking Apache PMC member or board member

– Justin and others

• Top-level project versus sub project outlook heading out of incubation

OODT Incubator Planning

• Monthly Updates (for first 3 months, then quarterly)

– Status

– Progress

– Community

– Acceptance

• Plan for exiting incubation

– How to have a solid user base

– How to operate as a unit in the Apache way

– Maintenance of user interest and community going forward

OODT’s next steps circa 2007

• JPL to tackle legal issues

– Is OODT releasable as an Apache product?

– http://www.apache.org/licenses/software-grant.txt

• This needs to be signed by the responsible parties at JPL

– Contributor License Agreement

• Do we need a corporate one?

• In parallel to this

– Draft OODT incubation proposal

– Start identifying who would initially be interested

• The more external, non-JPL people who are interested, the better

• Justin to get slides from other incubator people

…2 years later

• Worked it out with JPL legal

– Turns out the ALv2 license is extremely friendly and is something that JPL (note not all of NASA) was amenable to

• Developed OODT incubator proposal

– http://wiki.apache.org/incubator/OODTProposal

• Found willing Apache mentors besides Justin

– Jean-Frederic Clere, Ross Gardler, Ian Holsman

• …Put OODT at Apache!

Apache OODT Community

• Includes PMC members from

– NASA JPL, Univ. of Southern California, Google, Children’s Hospital Los Angeles (CHLA), Vdio, South African SKA Project

• Projects that are deploying it operationally at

– Decadal-survey recommended NASA Earth science missions, NIH, and NCI, CHLA, USC, South African SKA project

• Use in the classroom

– My graduate-level software architecture and search engines courses

OODT Framework

[Figure: Object Oriented Data Technology framework. Labeled components: OODT/Science Web Tools, Archive Client, Profile XML Data, Data System 1, Data System 2, Catalog & Archive Service, Profile Service, Product Service, Query Service, Bridge to External Services, Navigation Service, Other Service 1, Other Service 2.]

You’ll hear about this later today

I’ll tell you about these now

Andrew Hart and Emily Law will talk about these later

Architectural Principles

• Division of Labor

– Don’t make one component the workhorse!

• Technology Independence

– Don’t get bitten in the rear when a software vendor decides to charge you a lot of $$$ for their previously low cost technology

• Metadata as a first-class citizen

– Descriptions of resources come in handy

• Separation of software and data models

– Allow each to evolve independently

OODT Architecture

• Reference Architecture

– Four pairs of component types

• Product Client/Server, Profile Client/Server, Query Client/Server, Catalog and Archive Client/Server

– Two connector types

• Messaging layer discussed in http://sunset.usc.edu/~mattmann/pubs/ICSE.pdf

• Handler connector (discussed in this presentation)

• Instantiated for different domains using these fundamental building blocks

Product Client and Server

[Figure: Product Client and Server. Labeled components: Product Client (A), (B), (C); Product Server (A), (B), (C); underlying stores: Web site, MSSQL, RAID, Disk.]

- Deliver data from underlying data store

- Accept uniform query structure that identifies 0 or more “products” (data items) to retrieve

- Many-to-Many

How about an example of a product?
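A minimal Python sketch of the Product Server contract described above: one uniform query that comes back with zero or more products. The class names, the keyword-style query, and the sample file names are all invented for illustration; this is not the OODT API.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    """A data item plus the metadata a query can match (hypothetical shape)."""
    name: str
    metadata: dict
    payload: bytes = b""


@dataclass
class ProductServer:
    """Fronts one underlying store; here the store is just an in-memory list."""
    store: list = field(default_factory=list)

    def query(self, **criteria) -> list:
        """Return every product whose metadata matches all criteria (maybe none)."""
        return [
            p for p in self.store
            if all(p.metadata.get(k) == v for k, v in criteria.items())
        ]


# Made-up sample data standing in for a real data store.
server = ProductServer(store=[
    Product("sst_day1.nc", {"variable": "sst"}),
    Product("sst_day2.nc", {"variable": "sst"}),
    Product("chl_day2.nc", {"variable": "chlorophyll"}),
])

print([p.name for p in server.query(variable="sst")])
print(server.query(variable="wind"))  # no match: an empty list, not an error
```

The client never sees what kind of store sits behind the server; it only sees the query and the list of products that comes back.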

Profile Client and Server

[Figure: Profile Client and Server. Labeled components: Profile Client (A), (B), (C); Profile Server (A), (B), (C); underlying stores: Web site, MSSQL, Oracle.]

- Deliver metadata from underlying metadata store

- Metadata gives user enough information about where to find actual data

- Housekeeping information

- Resource information

- Domain-specific information

- Many-to-Many

How about an example of a profile?

Attributes

Relationships

Credit: A. Hart
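To make the three kinds of profile metadata concrete, here is a rough Python sketch that groups them as housekeeping, resource, and domain-specific information. The field names, the range-based domain elements, and the example.org values are assumptions made for illustration and do not reproduce the OODT profile schema.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical in-memory stand-in for a profile; not the OODT schema."""
    housekeeping: dict                          # e.g. profile ID, maintainer
    resource: dict                              # what the resource is, where it lives
    domain: dict = field(default_factory=dict)  # element name -> (min, max)

    def locations(self) -> list:
        """Where a client should go next to fetch the actual data."""
        return list(self.resource.get("locations", []))

    def covers(self, element: str, value: float) -> bool:
        """True if a domain-specific element's range includes the value."""
        if element not in self.domain:
            return False
        low, high = self.domain[element]
        return low <= value <= high


sst_profile = Profile(
    housekeeping={"id": "urn:example:profile:sst", "maintainer": "ops@example.org"},
    resource={
        "title": "Sea surface temperature",
        "locations": ["http://example.org/products/sst"],
    },
    domain={"latitude": (-90.0, 90.0), "year": (2002, 2011)},
)

print(sst_profile.locations())
print(sst_profile.covers("year", 2011))
```

The profile carries no data itself, only enough description for a client to decide whether the resource is relevant and where to go to retrieve it.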

Query Client and Server

[Figure: Query Client and Server. Labeled components: Query Client (A), (B), (C); Query Server (A), (B), (C); Profile Server (A), (B); Product Server (A), (B); annotations: “Initial set”, “Discovered”.]

- Query Server seeded with initial set of pointers to Profile Servers

- Profile Servers point to actual resources (Product Servers, even other Profile Servers)

- Interactive (metadata returned) and non-interactive (data returned)

- Many-to-Many
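The discovery behavior described above (start from a seeded set of Profile Servers, then follow their pointers to Product Servers and to further Profile Servers) can be sketched as a breadth-first walk. This is only an illustration in Python: the `discover` function, the `(kind, name)` pointer format, and the server names are hypothetical and say nothing about how OODT implements it.

```python
from collections import deque


def discover(seeds, pointers):
    """Breadth-first walk starting from the seeded profile servers.

    `pointers` maps a profile server name to the resources it points at,
    each a (kind, name) pair where kind is "profile" or "product".
    Returns (profile servers visited, product servers found in order).
    """
    seen_profiles = set(seeds)
    product_servers = []
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for kind, name in pointers.get(current, []):
            if kind == "product":
                if name not in product_servers:
                    product_servers.append(name)
            elif name not in seen_profiles:
                # A newly discovered profile server: visit it later.
                seen_profiles.add(name)
                queue.append(name)
    return seen_profiles, product_servers


# Hypothetical topology; the back-pointer to profile-A shows cycles are safe.
pointers = {
    "profile-A": [("product", "product-A"), ("profile", "profile-B")],
    "profile-B": [("product", "product-B"), ("profile", "profile-A")],
}

profiles, products = discover(["profile-A"], pointers)
print(sorted(profiles), products)
```

Tracking visited Profile Servers matters because profiles may point at each other; without it a cycle would keep the walk going forever.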

Catalog and Archive Client and Server (CAS)

[Figure: Catalog and Archive Client and Server. Labeled components: Archive Client (A), (B), (C); Archive Server (A), (B); Repository; Registry; Profile Server (A); Product Server (A).]

- Ingest data into repository and metadata into registry

- Run processing algorithms on data/metadata upon ingestion

- Workflow support

- Serve back Repository data with Product Server

- Serve back Registry metadata with Profile Server

- Many-to-Many
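A small Python sketch of the ingest path just described: data lands in a repository, metadata lands in a registry, and processing steps run on ingestion. The `ArchiveServer` class, its dictionaries, and the `record_size` step are hypothetical stand-ins and not the CAS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ArchiveServer:
    """Hypothetical archive: a repository for data, a registry for metadata."""
    repository: dict = field(default_factory=dict)  # product id -> bytes
    registry: dict = field(default_factory=dict)    # product id -> metadata
    on_ingest: list = field(default_factory=list)   # steps run after each ingest

    def ingest(self, product_id, data, metadata):
        self.repository[product_id] = data
        self.registry[product_id] = dict(metadata)
        for step in self.on_ingest:
            step(product_id, self.repository, self.registry)


def record_size(product_id, repository, registry):
    """Example processing step: derive extra metadata from the ingested data."""
    registry[product_id]["size_bytes"] = len(repository[product_id])


archive = ArchiveServer(on_ingest=[record_size])
archive.ingest("granule-001", b"abcd", {"mission": "demo"})
print(archive.registry["granule-001"])
```

Keeping the repository and registry separate is what lets the data be served back by a Product Server and the metadata by a Profile Server.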

Some notes about CAS

• All Core components implemented as web services

– XML-RPC used to communicate between components

– Servers implemented in Java

– Clients implemented in Java, scripts, Python, PHP and web-apps

– Service configuration implemented in ASCII and XML files

Credit: D. Woollard
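For readers who have not seen XML-RPC on the wire, the sketch below uses Python's standard `xmlrpc.client` module to build a request and parse it back, with no network involved. The method name `archive.ingest` and the metadata keys are made up for illustration; they are not taken from the CAS interfaces.

```python
import xmlrpc.client

# Marshal a call with one struct argument into an XML-RPC request body.
# "archive.ingest" and the keys below are hypothetical names.
request_xml = xmlrpc.client.dumps(
    ({"ProductName": "granule-001", "ProductType": "GenericFile"},),
    methodname="archive.ingest",
)
print(request_xml)

# A server on the other end would unmarshal the same body like this.
params, method = xmlrpc.client.loads(request_xml)
print(method, params[0]["ProductName"])
```

Because the payload is plain XML over HTTP, clients written in Java, Python, PHP, or shell scripts can all talk to the same server, which fits the mix of client languages listed above.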

Handler Connectors

[Figure: Handler connectors. Labeled components: DBMS Product Handler, Flat File Product Handler, Web Site Product Handler; two Product/Profile Servers; underlying stores: Web site, MSSQL, RAID, Disk, MSSQL.]

- Encapsulate (meta-)data coordination and communication

- Allow for dynamic addition and removal of different classes of back end metadata and data stores
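The handler idea can be sketched as a server that fans each query out to whatever handlers are currently registered, so back ends can be added or removed without touching the server. Everything here is hypothetical Python (the `Server` class, the factory functions, the sample names) and is meant as an illustration of the pattern rather than OODT's handler interface.

```python
class Server:
    """Hypothetical product/profile server with pluggable handlers."""

    def __init__(self):
        self._handlers = {}

    def add_handler(self, name, handler):
        self._handlers[name] = handler

    def remove_handler(self, name):
        self._handlers.pop(name, None)

    def query(self, keyword):
        """Ask every registered handler and tag results with their source."""
        results = []
        for name, handler in self._handlers.items():
            results.extend((name, item) for item in handler(keyword))
        return results


def make_dbms_handler(rows):
    """Stand-in for a database-backed handler; rows is a list of dicts."""
    return lambda keyword: [r["name"] for r in rows if keyword in r["name"]]


def make_flat_file_handler(filenames):
    """Stand-in for a handler that serves files from a directory listing."""
    return lambda keyword: [f for f in filenames if keyword in f]


server = Server()
server.add_handler("dbms", make_dbms_handler([{"name": "ocean_temp"}, {"name": "land_cover"}]))
server.add_handler("files", make_flat_file_handler(["ocean_wind.dat", "notes.txt"]))
print(server.query("ocean"))

server.remove_handler("dbms")  # the back end goes away; the server keeps running
print(server.query("ocean"))
```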

Example handler connectors

• XMLPS

– http://oodt.apache.org/components/maven/xmlps/

– XML config file specifies recipe for extracting records from an RDBMS and turning them into a NoSQL repository

• PS

– XML configurable profile server to unlock OPeNDAP datasets and pass them to OODT
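As a rough illustration of the "XML recipe over an RDBMS" idea behind XMLPS, the Python sketch below reads a small mapping file and turns table rows into schema-free key/value records. The `<mapping>`/`<field>` format, the table, and the column names are invented for this example and are not the XMLPS configuration schema; SQLite stands in for the database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical mapping format, invented for this sketch (not the XMLPS schema).
MAPPING_XML = """
<mapping table="specimens">
  <field column="specimen_id" name="SpecimenId"/>
  <field column="organ" name="Organ"/>
</mapping>
"""


def extract_records(conn, mapping_xml):
    """Run the query the mapping describes and return key/value records."""
    root = ET.fromstring(mapping_xml)
    table = root.get("table")
    fields = [(f.get("column"), f.get("name")) for f in root.findall("field")]
    columns = ", ".join(column for column, _ in fields)
    # The mapping is trusted configuration, so identifiers are interpolated directly.
    rows = conn.execute(f"SELECT {columns} FROM {table} ORDER BY {fields[0][0]}")
    return [
        {name: value for (_, name), value in zip(fields, row)}
        for row in rows
    ]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (specimen_id INTEGER, organ TEXT, notes TEXT)")
conn.executemany(
    "INSERT INTO specimens VALUES (?, ?, ?)",
    [(2, "liver", "b"), (1, "lung", "a")],
)
print(extract_records(conn, MAPPING_XML))
```

Only the columns named in the mapping are exposed, and they come out under the names the mapping assigns, so the database schema can change without changing what clients see.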

So, how do you piece them together: NASA VODC

• NASA’s Virtual Oceanographic Data Center (VODC)

• http://vodc.jpl.nasa.gov

• Information integration using OODT components: Profile, Product, Query; also uses Apache Solr and Plone

So, how do you piece them together: JPL’s CDX

• CDX = Climate Data Exchange

• Provide comparison of remote sensing data and model outputs

• Existing systems remain in place; services expose data and functions over the network; supports the era of the IPCC 5th assessment and distributed petabytes of data

Who’s doing what?

• Children’s Hospital Los Angeles

– Improving upon XMLPS and CAS (Andrew Hart + Ricky Nguyen will talk about this)

– Supporting data analytics

• Google

– Brian Foster working on command line improvements and data protocol push/pull

• SKA South Africa

– Deploying file manager and crawler for use in KAT-7 pipeline ingestion

• NIH/NCI

– Maintaining the XMLPS components, and CAS components

– Helping with user interfaces

• Various JPL and NASA research projects

– OPeNDAPps, XMLPS

• Various NASA missions

– Workflow, PCS, services, OPSui, other web apps

Latest release: 0.3

• First appearance of PCS

– Core, Services (JAX-RS)

• Web Applications

– Balance (PHP), and Wicket (Java)-based apps for file management and workflow monitoring

• First release deployed to Maven Central

– We did backport 0.2 there after this

– Over 60 issues fixed in JIRA

• June 2011: recommended stable release

Working on: 0.4

• Operator Interface (OODT-157)

– Andrew Hart and I will talk about this

• Workflow2 integration (OODT-215) and all of its sub-issues

– Global workflow conditions, dynamic workflows, parallel/sequential model, new workflow engine, etc.

• OODT RADIX for super easy deployment (OODT-120)

– Paul Ramirez and Cameron Goodale will discuss this

• Solr sync with File Manager (OODT-326)

• Improvements to XMLPS (OODT-333) and new crawler actions (OODT-33, OODT-34, OODT-35, OODT-36, OODT-37)

• Over 48 issues currently resolved

• Likely to come before end of Q4 2011

Using Apache OODT as a testbed for software process

• Missions maintain their own local CMs

• Local mission CMs contain forks of existing OSS software

– Forks can be patch based or CM based

• Changes found particularly effective are discussed within the community and eventually brought before a CCB that reviews their generality, etc.

Credit: D. Freeborn

Some Grand Challenges I’m interested in: OODT can help!

• How do we handle 700 TB/sec of data coming off the wire when we actually have to keep it around?

– Required by the Square Kilometre Array

• Joe scientist says I’ve got an IDL or Matlab algorithm that I will not change and I need to run it on 10 years of data from the Colorado River Basin and store and disseminate the output products

– Required by the Western Snow Hydrology project

Some Grand Challenges I’m interested in: OODT can help!

• How do we compare petabytes of climate model output data in a variety of formats (HDF, NetCDF, Grib, etc.) with petabytes of remote sensing data to improve climate models for the next IPCC assessment?

– Required by the 5th IPCC assessment and the Earth System Grid and NASA

• How do we catalog all of NASA’s current planetary science data?

Key Takeaway

OODT is already doing and/or preparing the world to handle all of these diverse use cases!

It’s a constantly evolving and improving framework – join up and help.

It’s free and open source from Apache and helping government demonstrate the public good

OODT Project Contact Info

• Learn more and track our progress at:

– http://oodt.apache.org

– WIKI: https://cwiki.apache.org/OODT/

– JIRA: https://issues.apache.org/jira/browse/OODT

• Join the mailing list:

– [email protected]

• Chat on IRC:

– #oodt on irc.freenode.net

• Acknowledgements

– Key Members of the OODT teams: Chris Mattmann, Daniel J. Crichton, Steve Hughes, Andrew Hart, Sean Kelly, Sean Hardman, Paul Ramirez, David Woollard, Brian Foster, Dana Freeborn, Emily Law, Mike Cayanan, Luca Cinquini, Heather Kincaid

– Projects, Sponsors, Collaborators: Planetary Data System, Early Detection Research Network, Climate Data Exchange, Virtual Pediatric Intensive Care Unit, NASA SMAP Mission, NASA OCO-2 Mission, NASA NPP Sounder Peate, NASA ACOS Mission, Earth System Grid Federation

Alright, I’ll shut up now

• Any questions?

• THANK YOU!

– [email protected]

– @chrismattmann on Twitter
