analysis optimized data storage in apache science...
TRANSCRIPT
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
Analysis Optimized Data Storage in Apache Science Data Analytics Platform
Thomas [email protected]
Group Supervisor - Computer Science for Data-Intensive ApplicationsStrategic Lead - Interactive Data Analytics
Jet Propulsion LaboratoryCalifornia Institute of Technology4800 Oak Grove Drive, Pasadena, CA 91109-8099, U.S.A.
© 2019 California Institute of Technology. Government sponsorship acknowledged.Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsements by the United States Government or the Jet Propulsion Laboratory, California Institute of Technology.
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• NASA has historically focused on systematic capture and stewardship of data for observational Systems• With large amount of observational and modeling data, finding and downloading is becoming inefficient• Reality with large amount of observational and modeling data
• Downloading to local machine is becoming inefficient• Search has gotten a lot faster. Too many matches.• Finding the relevant measurement has becoming a very time consuming process ”Which SST dataset I should use?”• Analyze decades of regional measurement is labor-intensive and costly
• Increasing “big data” era is driving needs to• Scale computational and data infrastructures• Support new methods for deriving scientific inferences• Shift towards integrated data analytics• Apply computational and data science across the lifecycle
• Scalable Data Management• Capture well-architected and curated data repositories based on well-defined data/information architectures• Architecting automated pipelines for data capture
• Scalable Data Analytics• Access and integration of highly distributed, heterogeneous data• Novel statistical approaches for data integration and fusion• Computation applied at the data sources• Algorithms for identifying and extracting interesting features and patterns
Background
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 2
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• Mainly focus on archives and distributions• With additional services
• Better searches – faceted, spatial, keyword, ranking, etc.• Data subsetting – home grown, OPeNDAP, etc.• Visualization – visual discovery, PO.DAAC’s SOTO, NASA Worldview, etc.
• Limitations• Little to no interoperability between tools and services: metadata standard, keyword, spatial coverage (0-360 or -180..180),
temporal representation, etc.• Making sure the most relevant measurements return first• Visualization is nice, but it doesn’t provide enough information about the event/phenomenon captured in the image.• With large amount of observational data, data centers need to do more than just storing bits
• “Is the red blob in the middle of Pacific normal this time of the year?” • “Any relevant news and publications relate to what I am looking at?”• ”What other measurements, phenomena, news, publications relate to the period and location I am looking at?”• “I can see the observation from satellite, are there any relevant in situ data I can look at?”
Data Centers
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 3
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
Observation• Traditional methods for data analysis (time-series,
distribution, climatology generation) can’t scale to handle large volume, high-resolution data. They perform poorly
• Performance suffers when involve large files and/or large collection of files
• A high-performance data analysis solution must be free from file I/O bottleneck
Traditional Method for Analyze Satellite Measurements
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 4
Search Download Compute
141 241 341 441
131 231 331 431
121 221 321 421
121 221 321 421
142 242 342 442
132 232 332 432
122 222 322 422
122 222 322 422
143 243 343 443
133 233 333 433
123 223 323 423
123 223 323 423
Y
X
Time 1
Time 2
Time 3
Time
• Depending on the data volume (size and number of files)• It could take many hours of download – (e.g. 10yr of
observational data could yield thousands of files)• It could take many hours of computation• It requires expensive local computing resource (CPU +
RAM + Storage)• After result is produced, purge downloaded files
Temporal spatialArrays
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
Enabling Next Generation of Earth Science Tools and Services
THUANG/JPL © 2019. All rights reserved. 52019 CEOS WGISS-48, Hanoi, Vietnam
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• After more than two years of active development, on October 2017 the NASA ESOT/AIST OceanWorks team established Apache Software Foundation and established the Science Data Analytics Platform (SDAP) in the Apache Incubator
• Technology sharing through Free and Open Source Software (FOSS)• Why? Further technology evolution that is restricted by projects /
missions• It is more than GitHub
• Quarterly reporting• Reports are open for community review by over 6000
committers• SDAP has a group of appointed international mentors
• SDAP and many of its affiliated projects are now being developed in the open
• Support local cluster and cloud computing platform support• Fully containerized using Docker and Kubernetes• Infrastructure orchestration using Amazon CloudFormation• Satellite and model data analysis: time series, correlation map, • In situ data analysis and collocation with satellite
measurements• Fast data subsetting• Upload and execute custom parallel analytic algorithms• Data services integration architecture• OpenSearch and dynamic metadata translation• Mining of user interaction and data to enable discovery and
recommendations
Managing Open Source
THUANG/JPL © 2019. All rights reserved. OceanObs’19, Honolulu, HI 6
http://sdap.apache.org
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• An effort funded by the NASA’s Advanced Information Systems Technology (AIST) program• Integrated Science Data Analytics Platform: an analytic center framework to provide an environment for conducting a science
investigation• Enables the confluence of resources for that investigation• Tailored to the individual study area (physical ocean, sea level, etc.)
• Harmonizes data, tools and computational resources to permit the research community to focus on the investigation• Scale computational and data infrastructures• Shift towards integrated data analytics• Algorithms for identifying and extracting interesting features and patterns
Integrated Science Data Analytics Platform SaaS and PaaS for Science Tools and Services
7
Systematic analysis of massive data
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
Analyze Hurricane Harvey using GPM and SSTAug 17, 2017 – Sept. 2, 2017
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 8
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
Example Interactive Analysis: Visualize and Analyze Sea Level
9THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam
Analyze in situ and satellite observations Analyze Sea Level on mobiles
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• SDAP’s analytics engine (a.k.a NEXUS) is a data-intensive analysis solution using a new approach for handling science data to enable large-scale data analysis
• Streaming architecture for horizontal scale data ingestion• Scales horizontally to handle massive amount of data in parallel• Provides high-performance geospatial and indexed search solution• Provides tiled data storage architecture to eliminate file I/O
overhead• A growing collection of science analysis webservices using
Apache Spark: parallel compute, in-memory map-reduce framework
• Pre-Chunk and Summarize Key Variables• Easy statistics instantly (milliseconds)• Harder statistics on-demand using Spark (in seconds)• Visualize original data (layers) on a map quickly (Cassandra
store)• Algorithms – Time Series | Latitude/Time Hovmöller|
Longitude/Time Hovmöller| Latitude/Longitude Time Average | Area Averaged Time Series | Time Averaged Map | Climatological Map | Correlation Map | Daily Difference Average
Scalable Data Analytic Solution
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 10
30-Year Time Series of archival HDF & netCDF files (daily or per orbit)
Fast &Scalable
Display Variables on Map Latitude-Time Hovmoller Plot Aggregate Statistics
Cassandra DB Cluster & Spark In-MemoryParallel Compute!
Chunk Chunk Chunk
Chunk Chunk Chunk
Chunk Chunk Chunk
…
SMAP MODIS GRHSST JASON
MetaData
MetaData
MetaData
MetaData …
Solr DB Cluster
Metadata (JSON): Dataset and granule metadata, Spatial Bounding Box & Summary Statistics
Subset Variables & Chunk Spatially
Slow File I/O
Each file contains many high-resolution geolocated arrays
CustomAnalytics
Two-Database Architecture
NEXUS: The Deep Data Platform
Workflow Automation Horizontal-Scale Data Analysis Environment
Deep Data Processors
Index and Data
Catalog
Data Access
ETL System
Private Cloud
Analytic Platform
ETL System Deep Data Processors
Spring XD
Index and Data Catalog Analytic Platform
Manager Manager Inventory Security SigEvent Search
ZooKeeper ZooKeeperZooKeeper
Manager
Handler Handler Handler
IngestPool
IngestPool
WorkerPool
WorkerPool
File & ProductServices
Job Tracking Services
Business Logics
Applications
HORIZON Data Management and Workflow Framework
Ingest Ingest Ingest
ProductSubscriber
ProductSubscriber
Staging
NoS
QL
Index
Alg Alg Alg
TaskScheduler
Task threads
Block manager
RDD Objects DAG Scheduler Task Scheduler Executor
DAG TaskSet Task
Alg
Alg
Alg
Alg
WebPortal
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
NEXUS Performance: File-based vs. Custom Spark vs. AWS EMR
THUANG/JPL © 2019. All rights reserved. 11
16-WAY 64-WAYCustom Spark 1.7 1.9
AWS EMR 1.7 1.9
1.71.9
1.71.9
0.00.20.40.60.81.01.21.41.61.82.0
Tim
e (s
ec)
Area Averaged Time Series on AWS - BoulderJuly 4, 2002 - July 3, 2016
NEXUS Performance
Custom Spark vs. AWS EMRRef. Speed - Giovanni: 1140.22 sec
16-WAY 64-WAYCustom Spark 3.3 2.9
AWS EMR 3.8 3.1
3.32.9
3.8
3.1
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0Ti
me
(sec
)
Area Averaged Time Series on AWS - ColoradoJuly 4, 2002 - July 3, 2016
NEXUS Performance
Custom Spark vs. AWS EMRRef. Speed - Giovanni: 1150.6 sec
16-WAY 64-WAYCustom Spark 23.1 19.9
AWS EMR 36.9 26.8
23.119.9
36.9
26.8
0.0
5.0
10.0
15.0
20.0
25.0
30.0
35.0
40.0
Tim
e (s
ec)
Area Averaged Time Series on AWS - GlobalJuly 4, 2002 - July 3, 2016
NEXUS Performance
Custom Spark vs. AWS EMRRef. Speed - Giovanni: 1366.84 sec
Dataset: MODIS AQUA DailyName: Aerosol Optical Depth 550 nm (Dark Target) (MYD08_D3v6)File Count: 5106Volume: 2.6GBTime Coverage: July 4, 2002 – July 3, 2016
File-based: A web-based application for visualize, analyze, and access vast amounts of Earth science remote sensing data without having to download the data.
• Represents current state of data analysis technology, by processing one file at a time
• Backed by the popular NCO library. Highly optimized C/C++ library
AWS EMR: Amazon’s provisioned MapReduce cluster
2019 CEOS WGISS-48, Hanoi, Vietnam
Algorithm execution time
File-based: 20 minNEXUS: 1.7 sec
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• Running Jupyter from Germany and interacts with analytics services hosted on Amazon and at JPL• Simulated hydrology data in preparation for SWOT hydrology• River data: ~3.6 billion data points. 3-hour sample rate. Consists of measurements from ~600,000 rivers• TRMM data: 17 years, .25deg, 1.5 billion data points• Sub-second retrieval of river measurements• On-the-fly computation of time series and generate coordination plot
Analyze Large Collection of Observational Data Directly … across the ocean
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 12
Retrieval of a single river time series Retrieval of time series from 9 rivers Time series coordination between TRMM and river
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
Data Access and the ECCO Ocean and Ice State EstimateNASA ACCESS Program | PI: Patrick Heimbach; Co-Is: Ian Fenty and Thomas Huang
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam
• Estimating the Circulation and Climate of the Ocean (ECCO) is a consortium endeavors to produce the best possible estimates of ocean circulation and its role in climate
• Combining state-of-the-art ocean circulation models with global ocean and sea-ice data in a physically and statistically consistent manner
• ECCO products are being used in studies on ocean variability, biological cycles, coastal physics, water cycle, ocean-cryosphere interactions, and geodesy
• Goals• Expand and accelerate in a sustainable and
scalable manner the integration of NASA Earth system data into ECCO through automated preprocessing and transformation
• Automate generation of ECCO reanalysis products into CF-compliant NetCDF products
• Radically streamline the integration of updated ECCO products into NASA's Earth Observing System Data and Information System (EOSDIS)
Models
9ECCO Summer School 2019 State Estimation 1 (I.Fukumori)
General circulation models provide complete descriptions of the ocean, motivating their use as a “curve” to fit the observations.
Atmospheric Reanalyses: Combines observations with weather forecasting models to yield the most complete description of the global atmosphere. e.g., ERA-5 relative vorticity (FZ Juelich)
“Perpetual Ocean”ECCO2 model simulation of
surface current (drifter tracks)
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
Interactive Analysis of ECCO Products
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• Committee of Earth Observation Satellites (CEOS) Ocean Variables Enabling Research and Applications for GEO (COVERAGE) Initiative
• Seeks to provide improved access to multi-agency ocean remote sensing data that are better integrated with in-situ and biological observations, in support of oceanographic and decision support applications for societal benefit.
• A community-support open specification with common taxonomies, information model, and API (maybe security)
• Putting value-added services next to the data to eliminate unnecessary data movement
• Avoid data replication. Reduce unnecessary data movement and egress charges
• Public accessible RESTful analytic APIs where computation is next to the data
• Analytic engine infused and managed by the data centers perhaps on the Cloud
• Researchers can perform multi-variable analysis using any web-enabled devices without having to download files
Distributed Analytics Center Architecture
15
CWICAnalytics Platform
Horizontal ScaleData StorageTemporal &
SpatialLookup
Job ManagementSubsetting
Data Packaging
AnalyticalRegistry
AnalyticServices
AlgorithmsFeature DetectionMachine Learning
NLPImage Mining
Data Packaging
Data Viewer
Interactive Science
Workbench
ScienceWorkflow
VisualizationServices
Image GenerationWorkflow
OGCAnimations
Web PortalDiscovery Service
AnalyticEngines
SparkHadoopImpalaDask
TensorflowJob Management
StreamingMessagingAutoscaling
Analysis-ready Storage Services and Workflows Tools
Analytics Platform
Horizontal ScaleData StorageTemporal &
SpatialLookup
Job ManagementSubsetting
Data Packaging
AnalyticalRegistry
AnalyticServices
AlgorithmsFeature DetectionMachine Learning
NLPImage Mining
Data Packaging
Data Viewer
Interactive Science
Workbench
ScienceWorkflow
VisualizationServices
Image GenerationWorkflow
OGCAnimations
Web PortalDiscovery Service
AnalyticEngines
SparkHadoopImpalaDask
TensorflowJob Management
StreamingMessagingAutoscaling
Analysis-ready Storage Services and Workflows Tools
Analytics Platform
Horizontal ScaleData StorageTemporal &
SpatialLookup
Job ManagementSubsetting
Data Packaging
AnalyticalRegistry
AnalyticServices
AlgorithmsFeature DetectionMachine Learning
NLPImage Mining
Data Packaging
Data Viewer
Interactive Science
Workbench
ScienceWorkflow
VisualizationServices
Image GenerationWorkflow
OGCAnimations
Web PortalDiscovery Service
AnalyticEngines
SparkHadoopImpalaDask
TensorflowJob Management
StreamingMessagingAutoscaling
Analysis-ready Storage Services and Workflows Tools
Analytics Platform
Horizontal ScaleData StorageTemporal &
SpatialLookup
Job ManagementSubsetting
Data Packaging
AnalyticalRegistry
AnalyticServices
AlgorithmsFeature DetectionMachine Learning
NLPImage Mining
Data Packaging
Data Viewer
Interactive Science
Workbench
ScienceWorkflow
VisualizationServices
Image GenerationWorkflow
OGCAnimations
Web PortalDiscovery Service
AnalyticEngines
SparkHadoopImpalaDask
TensorflowJob Management
StreamingMessagingAutoscaling
Analysis-ready Storage Services and Workflows Tools
Oceanographic Anomaly Detection
Jupyter Notebook - Interactive Workbench
In Situ Data Analysis Model - Observation Comparison
Mobile Analysis
Model SimulationsTime Series Analysis
Local Visualization and Analysis Tool
Hydrology Research
Data and Information Portal
Tools and Workbenches
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• WEkEO• Copernicus Data and Information
Access Services (DIAS)1. Copernicus Data2. Virtual Environment and
Tools3. User Support
• Harmonized Data Access for Satellite data and Services
• Virtualized infrastructure for personal sandboxes
• Pre-configured tools• COVERAGE Phase B
• Establish US Node on Amazon Cloud
• Establish EU Node on WEkEO at EUMETSAT
• Establish COVERAGE data portal and analysis tool powered by the COVERAGE Nodes at US and EU
COVERAGE Phase B
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 16
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• NASA’s Advanced Information System Technology (AIST) effort for their big ocean science analytics solution using Apache SDAP (a.k.a OceanWorks)
• Enable scientists to use OceanWorks data and analytics within ArcGIS.• Many scientists already use ArcGIS for their day-to-day work. Enabling them to use OceanWorks data and cloud analytics
from within their familiar ArcGIS tools will enable them to perform analyses that cut across several NASA Ocean science datasets, and will expand the reach and impact of the OceanWorks data analytic system within the Ocean Science community
Integration with Esri ArcGIS
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 17
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
Instances of Apache SDAP already in production or will be in production within the next 12 months
Connecting to a Federation of SDAPs
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 18
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• Deliver solutions to establish coherent platform solutions
• Embrace open source software • Community validation• Evolve the technology through
community contributions• Share recipes and lessons learned• Technology demonstrations• Host webinars, hands-on cloud
analytics workshops and hackathons
Building Community-Driven Open Data and Open Source Solutions
19
Big Data Analytics and Cloud Computing Workshop, 2017 ESIP Summer Meeting, Bloomington, IN
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam
2019 JPL Data Science Showcase
2019 EGU – NASA Hyperwall
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• The gap between visionary to pragmatists is significant. – Geoffrey Moore
• Become an expert in the production environment and devote resources in automations
• Give project engineering team early access to the PaaS
• Deliver all technical documents and work with project system engineering
• Provide project-focused trainings
Partner with NASA and non-NASA Projects - Deliver to Production
20NASA’s Physical Oceanography Distributed Active Archive Center (PO.DAAC)
NASA’s Sea Level Change Team
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam
CEOS SIT Technical Workshop
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
Building Bridges through Open Sourcing
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 21
Presentation | Evaluation | Collaboration
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
• You’ve got to think about big things while you’re doing small things, so that all the small things go in the right direction – Alvin Toffler
• Climate research requires Autonomously Sustainable Solutions
• Open source and a web of analytics centers should be the architecture for climate science• Focus on delivering professional quality open source solutions that enables end-to-end data and computation
architecture, and the total cost of ownership• Open source should not be a destination, it should be in place from the beginning• How a technology is being managed will determine how far it can go
In Summary
THUANG/JPL © 2019. All rights reserved. 2019 CEOS WGISS-48, Hanoi, Vietnam 22
National Aeronautics and Space Administration
Jet Propulsion LaboratoryCalifornia Institute of TechnologyPasadena, California
THUANG/JPL © 2019. All rights reserved. OceanObs’19, Honolulu, HI 23
Thomas [email protected]
Jet Propulsion LaboratoryCalifornia Institute of Technology
If you want to go fast, go Alone. If you want to go far, go Together.African Proverb