Accessing and serving scientific datasets with Python

Dr. Rob De Almeida


DESCRIPTION

Presentation given at PyCon 2007 in Dallas, TX, 2007, about pyDAP.

TRANSCRIPT

Page 1: PyCon 2007

Accessing and serving scientific datasets with Python

Dr. Rob De Almeida

Page 2: PyCon 2007

The Data Access Protocol

● De facto standard for distributing science data on the internet, used by oceanography, meteorology and climate communities

● Simple HTTP-based protocol with XDR encoding for data transmission

● Supports complex dataset structures

    ● Model output, satellite images, in-situ data, etc.

Page 3: PyCon 2007

Protocol details

● A dataset has different URLs describing it

    ● http://server/dataset

    ● http://server/dataset.dds (structure)

    ● http://server/dataset.das (attributes)

    ● http://server/dataset.dods (data)

● Client (usually) retrieves metadata from DDS/DAS responses and downloads data from DODS response as necessary
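The URL scheme above can be sketched as a small helper. This is an illustration in modern Python 3 rather than pyDAP code, and the function name `dap_urls` is hypothetical.

```python
def dap_urls(base_url):
    """Return the three standard DAP request URLs for a dataset URL."""
    base = base_url.rstrip("/")
    return {
        "dds": base + ".dds",    # structure
        "das": base + ".das",    # attributes
        "dods": base + ".dods",  # binary data
    }


urls = dap_urls("http://server/dataset")
for kind in ("dds", "das", "dods"):
    print(kind, urls[kind])
```

A client would typically fetch the `dds` and `das` URLs first and request the `dods` URL only when values are actually needed.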

Page 4: PyCon 2007

A simple example

● Dataset with a list “a” of integers from 0 to 9

● Let's also add a few attributes: author, history

● What is the representation of metadata and data?

Page 5: PyCon 2007

Dataset Descriptor Structure

Dataset {

Int32 a[a = 10];

} test;

Page 6: PyCon 2007

Dataset Attribute Structure

Attributes {

a {

String author "Rob De Almeida";

String history "Created for PyCon 2007";

}

}

Page 7: PyCon 2007

DODS response

Dataset {

Int32 a[a = 10];

} test;

Data:

\x00\x00\x00\x0a\x00\x00\x00\x0a

\x00\x00\x00\x00\x00\x00\x00\x01

\x00\x00\x00\x02\x00\x00\x00\x03

\x00\x00\x00\x04\x00\x00\x00\x05

\x00\x00\x00\x06\x00\x00\x00\x07

\x00\x00\x00\x08\x00\x00\x00\x09
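The binary part of this response can be reproduced with the standard `struct` module. The sketch below assumes only what the slide shows: the array length written twice as big-endian 32-bit integers, followed by the ten big-endian values. The function names are hypothetical and this is not pyDAP's own XDR code.

```python
import struct


def encode_int32_array(values):
    """Pack integers as on the slide: length, length again, then the values."""
    values = list(values)
    count = len(values)
    header = struct.pack(">II", count, count)
    return header + struct.pack(">%di" % count, *values)


def decode_int32_array(payload):
    """Unpack a payload produced by encode_int32_array back into a list."""
    count, repeated = struct.unpack(">II", payload[:8])
    if count != repeated:
        raise ValueError("the two length fields disagree")
    return list(struct.unpack(">%di" % count, payload[8:8 + 4 * count]))


payload = encode_int32_array(range(10))
print(len(payload))
print(decode_int32_array(payload))
```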

Page 8: PyCon 2007

Using pyDAP as a client

● The client retrieves and parses the metadata (DAS/DDS), building a dataset object with all the variables that can be introspected

● Data is downloaded on the fly when required

● Uses httplib2 and a custom-made xdrlib based on numpy or array

Page 9: PyCon 2007

Example usage

>>> from dap.client import open

>>> dataset = open('http://test.pydap.org/coads.nc', verbose=True)

http://test.pydap.org/coads.nc.dds

http://test.pydap.org/coads.nc.das

>>> print dataset.keys()

['UWND', 'WSPD', 'SST', 'VWND', 'SLP', 'AIRT', 'SPEH', 'COADSX', 'COADSY', 'TIME']

Page 10: PyCon 2007

Introspecting the dataset

>>> time = dataset['TIME']

>>> print time.type, time.shape, time.dimensions

Float64 (12,) ('TIME',)

>>> print time.units

hour since 0000-01-01 00:00:00

Page 11: PyCon 2007

Retrieving data

>>> print time[:]

http://test.pydap.org/coads.nc.dods?TIME[0:1:11]

[ 366. 1096.485 1826.97 2557.455 3287.94 4018.425 4748.91 5479.395 6209.88 6940.365 7670.85 8401.335]

>>> print time[0]

http://test.pydap.org/coads.nc.dods?TIME[0:1:0]

[ 366.]

>>> print time[-2:]

http://test.pydap.org/coads.nc.dods?TIME[10:1:11]

[ 7670.85 8401.335]
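The requests above suggest how a Python index maps to a DAP hyperslab of the form `[start:step:stop]`, where the stop value is inclusive. The following is a minimal sketch of that mapping in Python 3. The function name `hyperslab` is hypothetical, and pyDAP's real translation may differ in details such as how it handles strides.

```python
def hyperslab(index, length):
    """Translate a Python index or slice into a DAP '[start:step:stop]' string."""
    if isinstance(index, int):
        if index < 0:
            index += length
        index = slice(index, index + 1)
    start, stop, step = index.indices(length)
    if step < 1 or start >= stop:
        raise ValueError("empty or reversed selections are not supported here")
    # Python's stop is exclusive; DAP's is inclusive, so use the last element hit.
    last = start + ((stop - 1 - start) // step) * step
    return "[%d:%d:%d]" % (start, step, last)


# The three requests from the slide, for a variable with 12 values.
print("TIME" + hyperslab(slice(None), 12))
print("TIME" + hyperslab(0, 12))
print("TIME" + hyperslab(slice(-2, None), 12))
```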

Page 12: PyCon 2007

Working with sequential data

Dataset {

Sequence {

Int32 id;

Float64 lat;

Float64 lon;

} test;

} test%2Ecsv;

http://test.pydap.org/test.csv.dds

Page 13: PyCon 2007

Retrieving data

>>> from dap.client import open

>>> dataset = open('http://test.pydap.org/test.csv', verbose=True)

http://test.pydap.org/test.csv.dds

http://test.pydap.org/test.csv.das

>>> seq = dataset['test']

>>> print seq['lat'][:]

http://test.pydap.org/test.csv.dods?test.lat

[10.1, 10.199999999999999, 10.300000000000001, 10.4, 10.5]

Page 14: PyCon 2007

Iterating over sequential data

>>> for struct in seq:

...     print struct['lat'].data, struct['lon'].data

...

http://test.pydap.org/test.csv.dods?test.id

http://test.pydap.org/test.csv.dods?test.lat

http://test.pydap.org/test.csv.dods?test.lon

10.1 103.0

10.2 93.0

10.3 83.0

10.4 73.0

10.5 63.0

Page 15: PyCon 2007

Filtering sequences (sure way)

>>> fseq = seq.filter('%s<100' % seq.lon.id)

>>> for struct in fseq:

...     print struct['lat'].data, struct['lon'].data

...

http://test.pydap.org/test.csv.dods?test.id&test.lon<100

http://test.pydap.org/test.csv.dods?test.lat&test.lon<100

http://test.pydap.org/test.csv.dods?test.lon&test.lon<100

10.2 93.0

10.3 83.0

10.4 73.0

10.5 63.0
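The request URLs logged above combine a projection (which variable to return) with a selection (which records to keep), joined by `&`. A minimal Python 3 sketch of building such a URL follows. The helper name `constrained_url` is hypothetical, the expression is left unescaped as on the slide, and the local list comprehension only mirrors what the server-side filter returns for the sample data.

```python
def constrained_url(base_url, projection, selections=()):
    """Build a .dods URL from projected variables and selection clauses."""
    query = "&".join([",".join(projection)] + list(selections))
    return "%s.dods?%s" % (base_url, query)


url = constrained_url("http://test.pydap.org/test.csv",
                      ["test.lat"], ["test.lon<100"])
print(url)

# Local equivalent of the selection, using the (id, lat, lon) rows from the slides.
rows = [(1, 10.1, 103.0), (2, 10.2, 93.0), (3, 10.3, 83.0),
        (4, 10.4, 73.0), (5, 10.5, 63.0)]
kept = [(lat, lon) for _id, lat, lon in rows if lon < 100]
print(kept)
```

Filtering on the server avoids downloading records that the client would discard anyway.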

Page 16: PyCon 2007

Filtering sequences (fun way!)

>>> fseq = (struct for struct in seq if struct['lon'] < 100)

>>> for struct in fseq:

...     print struct['lat'].data, struct['lon'].data

...

http://test.pydap.org/test.csv.dods?test.id&test.lon<100

http://test.pydap.org/test.csv.dods?test.lat&test.lon<100

http://test.pydap.org/test.csv.dods?test.lon&test.lon<100

10.2 93.0

10.3 83.0

10.4 73.0

10.5 63.0

Page 17: PyCon 2007

Server

● pyDAP comes with a WSGI app that works as a DAP server

● Server is just a thin layer between plugins that handle data formats (netCDF, HDF5, SQL, etc.) and responses (DAS, DDS, DODS, HTML, KML, WMS, etc.)

● Can be deployed with a Paste Script template:

    ● paster create -t dap_server myserver

    ● paster serve myserver/server.ini
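The "thin layer" idea can be illustrated with a toy WSGI application in Python 3 that picks a renderer from the URL extension. Everything here is a hypothetical stand-in (`toy_plugin`, `dds_response`, `RESPONSES`, `application`); it does not reproduce pyDAP's actual plugin or response interfaces.

```python
def toy_plugin(name):
    """Hypothetical plugin: pretend to open a data file and return a dataset."""
    return {"name": name, "variables": {"a": "Int32"}}


def dds_response(dataset):
    """Hypothetical response: render the dataset structure as DDS-like text."""
    lines = ["Dataset {"]
    for var, dtype in dataset["variables"].items():
        lines.append("    %s %s;" % (dtype, var))
    lines.append("} %s;" % dataset["name"])
    return "\n".join(lines) + "\n"


RESPONSES = {"dds": dds_response}


def application(environ, start_response):
    """Map /<dataset>.<response> to a plugin call plus a response renderer."""
    path = environ.get("PATH_INFO", "").lstrip("/")
    name, _, ext = path.rpartition(".")
    renderer = RESPONSES.get(ext)
    if not name or renderer is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown dataset or response\n"]
    body = renderer(toy_plugin(name)).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]
```

For local experiments, an application like this can be served with `wsgiref.simple_server.make_server` from the standard library.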

Page 18: PyCon 2007

Plugins and responses

Page 19: PyCon 2007

Plugins and responses

http://localhost:8080/file.nc.das

Page 20: PyCon 2007

Plugins

● Convert data from different formats to pyDAP types

● Plugins for netCDF, CSV, Matlab 4/5, HDF5, GrADS grib, GDAL, DB API 2, grib2

● EasyInstall (entry point dap.plugin):

    ● easy_install dap.plugins.netcdf

Page 21: PyCon 2007

Responses

● Convert from pyDAP types to something else

● “Official” responses: DAS, DDS, DODS

    ● Generate data and metadata from the dataset created by the plugins

● Extra responses can be installed using EasyInstall (entry point dap.response)

Page 22: PyCon 2007

ASCII response

Dataset {
    Sequence {
        Int32 id;
        Float64 lat;
        Float64 lon;
    } test;
} test%2Ecsv;
---------------------------------------------
test.id, test.lat, test.lon
1, 10.1, 103
2, 10.2, 93
3, 10.3, 83
4, 10.4, 73
5, 10.5, 63

http://test.pydap.org/test.csv.ascii
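The tabular part of this response is straightforward to generate from a sequence's records. The sketch below, in Python 3, is one way to produce the same rows. The function name `ascii_rows` and the use of `%g` formatting are assumptions, not pyDAP's actual implementation.

```python
def ascii_rows(sequence_name, columns, rows):
    """Render sequence records as a header line plus comma-separated rows."""
    header = ", ".join("%s.%s" % (sequence_name, col) for col in columns)
    lines = [header]
    for row in rows:
        lines.append(", ".join("%g" % value for value in row))
    return "\n".join(lines)


rows = [(1, 10.1, 103.0), (2, 10.2, 93.0), (3, 10.3, 83.0),
        (4, 10.4, 73.0), (5, 10.5, 63.0)]
text = ascii_rows("test", ["id", "lat", "lon"], rows)
print(text)
```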

Page 23: PyCon 2007

HTML response

● Generates an HTML form to download data

● Redirects user to ASCII response

● Useful for users without a DAP client

Page 24: PyCon 2007

Example HTML response

Page 25: PyCon 2007

JSON response

{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",

"test": {"attributes": {}, "type": "Sequence", "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}

http://test.pydap.org/test.csv.json

Page 26: PyCon 2007

JSON response with data

{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",

"test": {"attributes": {}, "type": "Sequence", "data": [[1, 10.1, 103.0], [2, 10.2, 93.0], [3, 10.3, 83.0], [4, 10.4, 73.0], [5, 10.5, 63.0]], "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}

http://test.pydap.org/test.csv.json?output_data=1
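The two JSON documents differ only in the optional `data` key. A short Python 3 sketch using the standard `json` module shows how such a structure could be assembled. The function name `json_response` and its parameters are hypothetical, and key order in the output is not meant to match the slides.

```python
import json


def json_response(dataset_name, filename, sequence_name, columns, rows=None):
    """Build the nested metadata dict, adding 'data' only when rows are given."""
    sequence = {"attributes": {}, "type": "Sequence"}
    if rows is not None:
        sequence["data"] = [list(row) for row in rows]
    for name, dtype in columns:
        sequence[name] = {"attributes": {}, "type": dtype}
    dataset = {
        "attributes": {"filename": filename},
        "type": "Dataset",
        sequence_name: sequence,
    }
    return json.dumps({dataset_name: dataset})


columns = [("id", "Int32"), ("lat", "Float64"), ("lon", "Float64")]
rows = [(1, 10.1, 103.0), (2, 10.2, 93.0), (3, 10.3, 83.0),
        (4, 10.4, 73.0), (5, 10.5, 63.0)]
meta_only = json_response("test%2Ecsv", "test.csv", "test", columns)
with_data = json_response("test%2Ecsv", "test.csv", "test", columns, rows)
print(meta_only)
```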

Page 27: PyCon 2007

WMS response

● Returns maps (images) from requested variables and regions

● Works with geo-referenced grids and sequences

● Layers can be composed together

● Data can be constrained:

    ● /coads.nc.wms?SST // annual mean

    ● /coads.nc.wms?SST[0] // January

Page 28: PyCon 2007

WMS example request

http://localhost:8080/netcdf/coads.nc.wms?LAYERS=SST&WIDTH=512

Page 29: PyCon 2007

KML response

● Generates XML file using the Keyhole Markup Language, pointing to the WMS response

● Nice and simple interface for quickly visualizing data

Page 32: PyCon 2007

Future

● pyDAP 2.3 almost ready

    ● Dapper compliance

    ● Faster XDR encoding/decoding

    ● Initial support for DDX response and parser

● Build a rich web interface (AJAX) based on JSON + WMS + KML responses

    ● Not only for pyDAP, but also for other OPeNDAP servers, using pyDAP as a proxy

Page 33: PyCon 2007

Acknowledgments

● OPeNDAP for all the support

● PSF for the financial support to be here

● Everybody who submitted bugs (bonus points for submitting patches!)