big data at ecmwf
Slide 1 © ECMWF
Big Data at ECMWF: providing access to multi-petabyte datasets
Past, present and future
Baudouin Raoult, Principal Software Strategist
ECMWF
Slide 2 © ECMWF
ECMWF: an independent intergovernmental organisation established in 1975, with 20 Member States and 14 Co-operating States
Slide 3 © ECMWF
Slide 4 © ECMWF
Slide 5 © ECMWF
Major assimilated datasets
Surface stations
Radiosonde balloons
Polar, infrared
Polar, microwave
Geostationary, IR
Aircraft
Slide 6 © ECMWF
The forecast process
Slide 7 © ECMWF
ERA-20C completed: Climate monitoring of the 20th Century
● Using >5% of ECMWF’s computing power
● Assimilating billions of observations
● Producing 2,400 global forecasts per day
● Generating 1 PB of reanalysis data in 200 days
● Currently serving reanalysis products to 20,000 users
Slide 8 © ECMWF
Surface fluxes: greenhouse gases, fires, emissions
Global atmospheric composition
http://atmosphere.copernicus.eu Online catalogue, quick-looks and data
Radiation and ozone layer
European Air Quality
Slide 9 © ECMWF
ECMWF products
● ECMWF currently receives 300 million observations from 130 sources daily.
● ECMWF operational models produce 13 million fields daily, for a total of around 8 TB.
● 77 million products are disseminated every day, for a total of 6 TB.
Slide 10 © ECMWF
What is Big Data?
● Wikipedia: “Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.”
● Gartner: “Big Data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” (The 3 “V” of Big Data).
● I would like to add two “S”: Scalability and Sustainability.
Slide 11 © ECMWF
V is for Volume (or coping with exponential growth)
Deletion of 1 PB
Slide 12 © ECMWF
V is for Velocity (or coping with exponential growth, part 2)
● ECMWF’s archive grows exponentially:
– At ECMWF, the annual growth rate r is around 0.5, i.e. a 50% increase per year
– The daily amount of data added to the archive grows exponentially at the same rate!
● In 1995, the size of the archive was increasing at a rate of 14 TB/year.
● In 2014, the size of the archive increases at a rate higher than 60 TB/day (with peaks at 100 TB).
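As a rough consistency check on the figures above, the following sketch assumes compound growth of the form S(t) = S0 · (1 + r)^t. That form, and the function names, are assumptions made for illustration. It derives the annual rate implied by the 1995 and 2014 figures and shows why the yearly additions grow at the same rate as the archive.

```python
def implied_annual_rate(rate_start, rate_end, years):
    """Return r such that rate_start * (1 + r) ** years == rate_end."""
    return (rate_end / rate_start) ** (1.0 / years) - 1.0

# Figures quoted on the slide, both expressed in TB/year.
rate_1995 = 14.0         # 14 TB/year in 1995
rate_2014 = 60.0 * 365   # 60 TB/day in 2014 (lower bound quoted on the slide)

r = implied_annual_rate(rate_1995, rate_2014, 2014 - 1995)
print(f"implied annual growth rate: {r:.2f}")  # about 0.47, close to the quoted 0.5

def size(s0, rate, t):
    """Archive size after t years of compound growth (assumed model)."""
    return s0 * (1.0 + rate) ** t

# Under this model the amount added in year t is S(t+1) - S(t) = r * S(t),
# so successive yearly additions grow by the same factor (1 + r).
increments = [size(1.0, 0.5, t + 1) - size(1.0, 0.5, t) for t in range(3)]
ratios = [increments[i + 1] / increments[i] for i in range(2)]
print(ratios)
```

The implied rate uses the 60 TB/day lower bound, so the true figure is somewhat higher.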
Slide 13 © ECMWF
V is for Variety (or coping with complexity)
[Figure: chart with a time axis from 1985 to 2010 and a logarithmic axis from 10M to 10T, annotated with the following labels]
● Supercomputers: X-MP/4, Y-MP/8, C90/12, C90/16, VPP700-48, VPP700-112, VPP5000, IBM-P4, IBM-P5, IBM-P5+, IBM-P6
● Model resolutions: T106L16, T106L19, T213L31, T319L31, T319L50, T319L60, T511L60, T799L91, T1279L91
● Data assimilation: OI, 3DVar, 4DVar, 12 Hour 4DVar, SCDA, DCDA, EDA
● Ensemble systems: EPS, 50 Members EPS, EPS 15 days, Vareps/Monthly, Weekend EPS, 00Z EPS
● Data types added over time: FC pressure levels, FC model levels, Chernobyl, SSTs, TOGA FC, errors in FG, waves, EPS, clusters, waves FG, probabilities, ensemble means & stdev, other centers, sensitivity, NCEP EPS, errors in AN and FG, 4D-Var, tubes, wave EPS, errors in FG (surface), wave proba., SCDA analysis, PT and PV levels, SCDA forecast, wave 4V, SCDA waves, multi-analysis, 4D-Var increments, EFIs, DCDA, DCDA wave, SCDA 4D-Var, EPS PT levels, overlap, CalVal, wave EFIs, Vareps/Monthly, 4D-Var model errors, ensemble data assimilation
● Other events: weekly and monthly runs, extra fields, new gaussian grid, 00Z 10 day FC, 00Z run, end of sensitivity
Slide 14 © ECMWF
ECMWF’s Meteorological Archival and Retrieval System
● 28 years in existence
● A managed archive
● MARS is not a file system
– Users are not aware of the location of the data
– Retrievals are expressed in meteorological terms
● An archive, not a database
– Metadata online
– Data offline (automated tape library)
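To illustrate what "retrievals expressed in meteorological terms" can mean in practice, here is a minimal sketch. It is not the MARS implementation: the catalogue contents, key names and functions are all hypothetical. The user names dates, parameters and levels; an online metadata catalogue resolves each combination to a location the user never sees.

```python
from itertools import product

# Hypothetical online metadata catalogue: (date, param, level) -> (tape, offset).
# The physical locations on the right are invisible to the user.
CATALOGUE = {
    ("2014-01-01", "temperature", 500): ("tape-0042", 0),
    ("2014-01-01", "temperature", 850): ("tape-0042", 4096),
    ("2014-01-02", "temperature", 500): ("tape-0107", 0),
    ("2014-01-02", "temperature", 850): ("tape-0107", 4096),
}

KEY_ORDER = ("date", "param", "level")

def expand(request):
    """Expand a request (key -> list of values) into every individual field key."""
    return list(product(*(request[name] for name in KEY_ORDER)))

def locate(request):
    """Resolve each requested field to its location via the catalogue."""
    return [(key, CATALOGUE[key]) for key in expand(request) if key in CATALOGUE]

# The request only uses meteorological vocabulary.
request = {
    "date": ["2014-01-01", "2014-01-02"],
    "param": ["temperature"],
    "level": [500, 850],
}
fields = locate(request)
print(len(fields), "fields located")
```

Because the request is a product of value lists, a few lines of metadata can address a large block of the archive without mentioning a single file or tape.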
Slide 15 © ECMWF
ECMWF’s Meteorological Archival and Retrieval System
● 10^14 directly addressable objects
– Unique hypercube-based indexing
● Data is kept forever:
– For many studies, a dataset becomes useful once enough data has been accumulated
– Deleting old data in an exponentially growing archive is meaningless
● 200 million objects/65 TB added daily
● 7000 registered users
● 650 active users, 100 TB retrieved per day, in 1.5 million requests
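The remark that deleting old data is meaningless in an exponentially growing archive can be checked with a little arithmetic. The sketch below assumes steady compound growth at r = 0.5 per year and that new additions are unaffected by any deletion. Both are simplifying assumptions, and the function names are illustrative.

```python
import math

def fraction_older_than(years, r=0.5):
    """Share of today's archive that is more than `years` old: (1 + r) ** -years."""
    return (1.0 + r) ** -years

def years_until_refilled(fraction_deleted, r=0.5):
    """Time for continued growth to add back the deleted volume.

    Deleting a fraction f of today's size S is undone once the undisturbed
    archive would have reached (1 + f) * S, i.e. after ln(1 + f) / ln(1 + r) years.
    """
    return math.log(1.0 + fraction_deleted) / math.log(1.0 + r)

old_share = fraction_older_than(5)
refill = years_until_refilled(old_share)
print(f"data older than 5 years: {old_share:.1%} of the archive")
print(f"space regained is used up again after {refill * 12:.1f} months")
```

Under these assumptions, everything older than five years is only about an eighth of the archive, and deleting it buys a few months of growth.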
Slide 16 © ECMWF
Scalability and sustainability
● Distributed service-oriented architecture, to scale out
● Queues and priorities to ensure quality of service and scaling with the demand
● Indirection is the key to scalability:
– Allow services to be modified/redeployed…
– Allow data to be moved to different media/storage…
– …without any impact on users
● It has allowed us to migrate several times during the past decades:
– MVS to AIX, AIX to Linux, single server to clusters
– CFS to TSM to HPSS
– PL/I and Fortran to C/C++
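The role of indirection can be sketched as follows. This is a hypothetical illustration, not ECMWF's design: users address objects by a logical key, a locator records where each object currently lives, and a migration changes only the locator.

```python
class Locator:
    """Hypothetical indirection layer between logical keys and storage backends."""

    def __init__(self):
        self._stores = {}   # backend name -> {key: data}
        self._where = {}    # key -> backend name currently holding it

    def add_store(self, name):
        self._stores[name] = {}

    def put(self, key, data, store):
        self._stores[store][key] = data
        self._where[key] = store

    def get(self, key):
        # Users call this with the logical key only; the backend is looked up here.
        return self._stores[self._where[key]][key]

    def migrate(self, key, new_store):
        """Move an object to another backend without changing its logical key."""
        old_store = self._where[key]
        self._stores[new_store][key] = self._stores[old_store].pop(key)
        self._where[key] = new_store

archive = Locator()
archive.add_store("old-tape-system")
archive.add_store("new-tape-system")
archive.put("2014-01-01/temperature/500", b"field bytes", store="old-tape-system")

before = archive.get("2014-01-01/temperature/500")
archive.migrate("2014-01-01/temperature/500", "new-tape-system")
after = archive.get("2014-01-01/temperature/500")
print(before == after)  # the user-facing call and its result are unchanged
```

The same pattern applies to services: as long as callers depend on the logical name rather than the host or medium, either can be replaced behind it.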
Slide 17 © ECMWF
Continuous evolution
● Change is driven by new needs and expectations
– Users are not domain experts anymore…
– …users expect information at “Google speed”
● ECMWF is continuously evolving its data delivery methods and services to cater for the new user requirements
● Interoperability is key
– Follow standards and governance to enable interoperability
– OGC, INSPIRE, ISO 19xxx series, NetCDF-CF, WMO Information System, GEOSS
● Provide high-level services on the data
– Data portals for data discovery
– Web services, REST APIs for data retrieval and manipulation
– Close-to-the-data processes
Slide 18 © ECMWF
Data exploring tools
Slide 19 © ECMWF
Interoperability
Slide 20 © ECMWF
The next steps…
● The era of pushing the data to the user is coming to an end
● The volumes involved are too large
● We live in a post-PC era…
● Users want to access data from smart devices…
● …anywhere, anytime, any device…
● … and share their results.
● We need to bring the user processing to the data:
– Cloud Computing is now mature enough to implement operational services
– We need to build a “platform as a service” (PaaS), on which to build “software as a service” (SaaS) solutions for environmental data and products
Slide 21 © ECMWF
Thank you