PyCon 2007
Presentation given at PyCon 2007 in Dallas, TX, about pyDAP.
Accessing and serving scientific datasets with Python
Dr. Rob De Almeida
The Data Access Protocol
● De facto standard for distributing science data on the internet, used by oceanography, meteorology and climate communities
● Simple HTTP-based protocol with XDR encoding for data transmission
● Supports complex dataset structures
● Model output, satellite images, in-situ data, etc.
Protocol details
● A dataset has different URLs describing it
  ● http://server/dataset
  ● http://server/dataset.dds (structure)
  ● http://server/dataset.das (attributes)
  ● http://server/dataset.dods (data)
● Client (usually) retrieves metadata from DDS/DAS responses and downloads data from DODS response as necessary
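The URL scheme above can be sketched as a small helper. This is only an illustration, written for Python 3; the name `response_urls` is hypothetical and is not part of pyDAP.

```python
def response_urls(base_url):
    """Return the three standard DAP response URLs for a dataset URL.

    Hypothetical helper for illustration only; not a pyDAP function.
    """
    return {
        "dds": base_url + ".dds",    # structure
        "das": base_url + ".das",    # attributes
        "dods": base_url + ".dods",  # data
    }


urls = response_urls("http://server/dataset")
print(urls["dds"])
print(urls["das"])
print(urls["dods"])
```

A client would typically fetch the `dds` and `das` URLs first and request the `dods` URL only when data is actually needed.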
A simple example
● Dataset with a list “a” of integers from 0 to 9
● Let's also add a few attributes: author, history
● What is the representation of metadata and data?
Dataset Descriptor Structure
Dataset {
Int32 a[a = 10];
} test;
Dataset Attribute Structure
Attributes {
a {
String author "Rob De Almeida";
String history "Created for PyCon 2007";
}
}
DODS response
Dataset {
Int32 a[a = 10];
} test;
Data:
\x00\x00\x00\x0a\x00\x00\x00\x0a
\x00\x00\x00\x00\x00\x00\x00\x01
\x00\x00\x00\x02\x00\x00\x00\x03
\x00\x00\x00\x04\x00\x00\x00\x05
\x00\x00\x00\x06\x00\x00\x00\x07
\x00\x00\x00\x08\x00\x00\x00\x09
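The binary part of the response above can be decoded by hand, which makes the XDR layout easier to see. The sketch below uses Python 3 and the standard `struct` module rather than pyDAP's own decoder, and it assumes the payload holds only this single Int32 array. The function name `decode_int32_array` is hypothetical.

```python
import struct

# Bytes that follow the "Data:" marker in the example DODS response.
payload = (
    b"\x00\x00\x00\x0a\x00\x00\x00\x0a"
    b"\x00\x00\x00\x00\x00\x00\x00\x01"
    b"\x00\x00\x00\x02\x00\x00\x00\x03"
    b"\x00\x00\x00\x04\x00\x00\x00\x05"
    b"\x00\x00\x00\x06\x00\x00\x00\x07"
    b"\x00\x00\x00\x08\x00\x00\x00\x09"
)


def decode_int32_array(buf):
    """Decode one Int32 array: the length sent twice, then the values."""
    # Both length fields are big-endian unsigned 32-bit integers.
    length, length_again = struct.unpack(">II", buf[:8])
    if length != length_again:
        raise ValueError("the two length fields disagree")
    # The values follow as big-endian signed 32-bit integers.
    return list(struct.unpack(">%di" % length, buf[8:8 + 4 * length]))


values = decode_int32_array(payload)
print(values)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```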
Using pyDAP as a client
● The client retrieves and parses the metadata (DAS/DDS), building a dataset object with all the variables that can be introspected
● Data is downloaded on the fly when required
● Uses httplib2 and a custom-made xdrlib based on numpy or array
Example usage
>>> from dap.client import open
>>> dataset = open('http://test.pydap.org/coads.nc', verbose=True)
http://test.pydap.org/coads.nc.dds
http://test.pydap.org/coads.nc.das
>>> print dataset.keys()
['UWND', 'WSPD', 'SST', 'VWND', 'SLP', 'AIRT', 'SPEH', 'COADSX', 'COADSY', 'TIME']
Introspecting the dataset
>>> time = dataset['TIME']
>>> print time.type, time.shape, time.dimensions
Float64 (12,) ('TIME',)
>>> print time.units
hour since 0000-01-01 00:00:00
Retrieving data
>>> print time[:]
http://test.pydap.org/coads.nc.dods?TIME[0:1:11]
[ 366. 1096.485 1826.97 2557.455 3287.94 4018.425 4748.91 5479.395 6209.88 6940.365 7670.85 8401.335]
>>> print time[0]
http://test.pydap.org/coads.nc.dods?TIME[0:1:0]
[ 366.]
>>> print time[-2:]
http://test.pydap.org/coads.nc.dods?TIME[10:1:11]
[ 7670.85 8401.335]
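The verbose output shows each Python index being turned into a DAP hyperslab of the form [start:step:stop], where the stop index is inclusive. The mapping can be sketched as follows in Python 3. The function `hyperslab` is a hypothetical illustration, not pyDAP's actual code, and it handles only integers and slices with a positive step.

```python
def hyperslab(index, length):
    """Translate a Python index or slice into a DAP hyperslab string.

    Illustrative sketch only; 'length' is the size of the dimension.
    """
    if isinstance(index, int):
        if index < 0:
            index += length
        index = slice(index, index + 1)
    start, stop, step = index.indices(length)
    if step < 1:
        raise ValueError("only positive steps are handled in this sketch")
    if start >= stop:
        raise ValueError("empty selection")
    # Python's stop is exclusive, the DAP stop is inclusive.
    return "[%d:%d:%d]" % (start, step, stop - 1)


print("TIME" + hyperslab(slice(None), 12))      # TIME[0:1:11]
print("TIME" + hyperslab(0, 12))                # TIME[0:1:0]
print("TIME" + hyperslab(slice(-2, None), 12))  # TIME[10:1:11]
```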
Working with sequential data
Dataset {
Sequence {
Int32 id;
Float64 lat;
Float64 lon;
} test;
} test%2Ecsv;
http://test.pydap.org/test.csv.dds
Retrieving data
>>> from dap.client import open
>>> dataset = open('http://test.pydap.org/test.csv', verbose=True)
http://test.pydap.org/test.csv.dds
http://test.pydap.org/test.csv.das
>>> seq = dataset['test']
>>> print seq['lat'][:]
http://test.pydap.org/test.csv.dods?test.lat
[10.1, 10.199999999999999, 10.300000000000001, 10.4, 10.5]
Iterating over sequential data
>>> for struct in seq:
... print struct['lat'].data, struct['lon'].data
...
http://test.pydap.org/test.csv.dods?test.id
http://test.pydap.org/test.csv.dods?test.lat
http://test.pydap.org/test.csv.dods?test.lon
10.1 103.0
10.2 93.0
10.3 83.0
10.4 73.0
10.5 63.0
Filtering sequences (sure way)
>>> fseq = seq.filter('%s<100' % seq.lon.id)
>>> for struct in fseq:
... print struct['lat'].data, struct['lon'].data
...
http://test.pydap.org/test.csv.dods?test.id&test.lon<100
http://test.pydap.org/test.csv.dods?test.lat&test.lon<100
http://test.pydap.org/test.csv.dods?test.lon&test.lon<100
10.2 93.0
10.3 83.0
10.4 73.0
10.5 63.0
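The request URLs printed above combine a projection (the variable to fetch) with one or more selection clauses, joined by "&". A minimal Python 3 sketch of that composition is shown below. The helper `dods_url` is hypothetical, and it leaves the "<" character unescaped to match the verbose output; a real client may percent-encode it.

```python
def dods_url(base_url, projection, selections=()):
    """Build a .dods URL from a projection and optional selection clauses.

    Hypothetical helper for illustration; not part of pyDAP.
    """
    query = "&".join([projection] + list(selections))
    return "%s.dods?%s" % (base_url, query)


url = dods_url("http://test.pydap.org/test.csv", "test.lat", ["test.lon<100"])
print(url)  # http://test.pydap.org/test.csv.dods?test.lat&test.lon<100
```

Because the selection travels in the URL, the filtering happens on the server and only the matching records are transferred.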
Filtering sequences (fun way!)
>>> fseq = (struct for struct in seq if struct['lon'] < 100)
>>> for struct in fseq:
... print struct['lat'].data, struct['lon'].data
...
http://test.pydap.org/test.csv.dods?test.id&test.lon<100
http://test.pydap.org/test.csv.dods?test.lat&test.lon<100
http://test.pydap.org/test.csv.dods?test.lon&test.lon<100
10.2 93.0
10.3 83.0
10.4 73.0
10.5 63.0
Server
● pyDAP comes with a WSGI app that works as a DAP server
● Server is just a thin layer between plugins that handle data formats (netCDF, HDF5, SQL, etc.) and responses (DAS, DDS, DODS, HTML, KML, WMS, etc.)
● Can be deployed with a Paste Script template:
  ● paster create -t dap_server myserver
  ● paster serve myserver/server.ini
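The "thin layer" idea can be illustrated with a toy WSGI application in Python 3 that looks at the last extension of the request path and hands the rest of the path to a matching response handler. Everything here is an assumption made for illustration: the names `RESPONSES`, `dds_response` and `app` are hypothetical, the handler returns a placeholder body, and none of this is pyDAP's actual server code.

```python
def dds_response(dataset_path):
    """Placeholder handler; a real one would serialise the dataset structure."""
    return b"Dataset {\n} placeholder;\n"


# Hypothetical registry mapping a URL extension to a response handler.
RESPONSES = {"dds": dds_response}


def app(environ, start_response):
    """Toy WSGI app: dispatch on the last extension of the request path."""
    path = environ.get("PATH_INFO", "")
    # "/file.nc.dds" -> dataset path "/file.nc", response "dds"
    dataset_path, _, extension = path.rpartition(".")
    handler = RESPONSES.get(extension)
    if handler is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Unknown response\n"]
    body = handler(dataset_path)
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]
```

For a quick local experiment, such an app could be served with the standard library's `wsgiref.simple_server.make_server`.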
Plugins and responses
http://localhost:8080/file.nc.das
Plugins
● Convert data from different formats to pyDAP types
● Plugins for netCDF, CSV, Matlab 4/5, HDF5, GrADS grib, GDAL, DB API 2, grib2
● EasyInstall (entry point dap.plugin):
  ● easy_install dap.plugins.netcdf
Responses
● Convert from pyDAP types to something else
● “Official” responses: DAS, DDS, DODS
  ● Generate data and metadata from the dataset created by the plugins
● Extra responses can be installed using EasyInstall (entry point dap.response)
ASCII response
Dataset {
Sequence {
Int32 id;
Float64 lat;
Float64 lon;
} test;
} test%2Ecsv;
---------------------------------------------
test.id, test.lat, test.lon
1, 10.1, 103
2, 10.2, 93
3, 10.3, 83
4, 10.4, 73
5, 10.5, 63
http://test.pydap.org/test.csv.ascii
HTML response
● Generates an HTML form to download data
● Redirects user to ASCII response
● Useful for users without a DAP client
Example HTML response
JSON response
{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",
"test": {"attributes": {}, "type": "Sequence", "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}
http://test.pydap.org/test.csv.json
JSON response with data
{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset",
"test": {"attributes": {}, "type": "Sequence", "data": [[1, 10.1, 103.0], [2, 10.2, 93.0], [3, 10.3, 83.0], [4, 10.4, 73.0], [5, 10.5, 63.0]], "id": {"attributes": {}, "type": "Int32"}, "lat": {"attributes": {}, "type": "Float64"}, "lon": {"attributes": {}, "type": "Float64"}}}}
http://test.pydap.org/test.csv.json?output_data=1
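A client without DAP support can consume this response with any JSON parser. The Python 3 sketch below parses the example document above and pairs each data row with the variable names of the sequence. It assumes that the columns in "data" follow the order in which the child variables appear in the document, which holds for this example but is not stated in the slides.

```python
import json

# The example JSON response with data, copied from above.
response = (
    '{"test%2Ecsv": {"attributes": {"filename": "test.csv"}, "type": "Dataset", '
    '"test": {"attributes": {}, "type": "Sequence", '
    '"data": [[1, 10.1, 103.0], [2, 10.2, 93.0], [3, 10.3, 83.0], '
    '[4, 10.4, 73.0], [5, 10.5, 63.0]], '
    '"id": {"attributes": {}, "type": "Int32"}, '
    '"lat": {"attributes": {}, "type": "Float64"}, '
    '"lon": {"attributes": {}, "type": "Float64"}}}}'
)

sequence = json.loads(response)["test%2Ecsv"]["test"]

# Child variables are the entries that are objects carrying a "type" key.
# Assumption: their order in the document matches the column order in "data".
columns = [name for name, value in sequence.items()
           if isinstance(value, dict) and "type" in value]

records = [dict(zip(columns, row)) for row in sequence["data"]]
print(columns)     # ['id', 'lat', 'lon']
print(records[0])  # {'id': 1, 'lat': 10.1, 'lon': 103.0}
```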
WMS response
● Returns maps (images) from requested variables and regions
● Works with geo-referenced grids and sequences
● Layers can be composed together
● Data can be constrained:
  ● /coads.nc.wms?SST // annual mean
  ● /coads.nc.wms?SST[0] // January
WMS example request
http://localhost:8080/netcdf/coads.nc.wms?LAYERS=SST&WIDTH=512
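The request above is an ordinary URL with query parameters, so it can be assembled with the standard library. The Python 3 sketch below uses only the two parameters shown on the slide (LAYERS and WIDTH); any other parameters the WMS response may accept are not covered here.

```python
from urllib.parse import urlencode

base = "http://localhost:8080/netcdf/coads.nc.wms"
# Only the parameters that appear in the example request are used.
params = {"LAYERS": "SST", "WIDTH": 512}

wms_url = base + "?" + urlencode(params)
print(wms_url)  # http://localhost:8080/netcdf/coads.nc.wms?LAYERS=SST&WIDTH=512
```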
KML response
● Generates XML file using the Keyhole Markup Language, pointing to the WMS response
● Nice and simple interface for quickly visualizing data
Future
● pyDAP 2.3 almost ready
  ● Dapper compliance
  ● Faster XDR encoding/decoding
  ● Initial support for DDX response and parser
● Build a rich web interface (AJAX) based on JSON + WMS + KML responses
  ● Not only for pyDAP, but also for other OPeNDAP servers, using pyDAP as a proxy
Acknowledgments
● OPeNDAP for all the support
● PSF for the financial support to be here
● Everybody who submitted bugs (bonus points for submitting patches!)