Reverse Time Migration via Resilient Distributed Datasets: Towards In-Memory
Coherence of Seismic-Reflection Wavefields using Apache Spark
Ian Lumb
HPCS 2015 - Montreal
http://hpcs.ca
Outline
● The challenges and opportunities of RTM
● Refactoring RTM with Spark/RDDs
  o Spark’ing coherence between wavefields
● Summary
http://www.acceleware.com/technical-papers
Zhou 2014, Fig. 7.25
Motivation
● RTM is performance-challenged
  o Algorithms research remains topical
    ▪ GPUs responsible for compelling results
● Revisit RTM as a ‘Big Data problem’
  o In-memory analytics has the potential to
    ▪ Improve performance of data and wavefield manipulations in concert with computations
    ▪ Introduce new prospects for imaging conditions
Key Performance Challenges
● RTM modeling kernel is compute intensive
  o Stable, non-dispersive solution via FDM requires
    ▪ Small time steps and small grid intervals
    ▪ Higher-order approximations of the spatial derivatives
● RTM wavefields exceed memory capacity
  o Multiple-TB source volumes must be stored to disk
e.g., Liu et al., Computers & Geosciences 59 (2013) 17–23
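The two challenges above can be made concrete with a rough back-of-envelope sketch. The grid dimensions, velocity, and record length below are illustrative assumptions, not figures from the talk. The stability bound is the standard CFL limit for an explicit scheme that is second-order in time and space on a uniform grid; higher-order spatial stencils generally tighten it further.

```python
import math


def cfl_max_dt(dx_m, v_max_mps, ndim=3):
    """Largest stable time step (s) for a 2nd-order explicit acoustic scheme
    on a uniform grid: dt <= dx / (v_max * sqrt(ndim))."""
    return dx_m / (v_max_mps * math.sqrt(ndim))


def wavefield_bytes(nx, ny, nz, nt, bytes_per_sample=4):
    """Bytes needed to keep every time step of one wavefield (float32 by default)."""
    return nx * ny * nz * nt * bytes_per_sample


# Hypothetical survey parameters, chosen only for illustration.
dx = 10.0          # grid interval in metres
v_max = 4500.0     # fastest velocity in the model, m/s
record_s = 8.0     # simulated record length in seconds

dt = cfl_max_dt(dx, v_max)
nt = math.ceil(record_s / dt)
size = wavefield_bytes(1000, 1000, 500, nt)

print(f"max stable dt: {dt * 1e3:.3f} ms, time steps: {nt}")
print(f"full source wavefield: {size / 1e12:.2f} TB")
```

Halving the grid interval in this sketch halves the allowed time step and multiplies the number of grid points by eight, which is why the stored source volume grows so quickly.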
Resilient Distributed Datasets (RDDs)
● Abstraction for in-memory computing
● Fault-tolerant, parallel data structures
  o Cluster-ready
● Optionally persistent
● Can be partitioned for optimal placement
● Manipulated via operators
Zaharia et al., NSDI 2012
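The properties listed above can be illustrated without a cluster. What follows is a toy, single-process sketch written for this transcript, not Spark's API: the `ToyRDD` class and its method names are hypothetical. It shows partitioned data, a lazy operator (`map`), optional persistence (`persist`), and lineage-based recovery, where a lost cached partition is recomputed from its parent instead of being restored from a replica.

```python
class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self._source = partitions   # list of lists, only for a root dataset
        self._parent = parent       # lineage: the dataset this one derives from
        self._fn = fn               # per-element operator applied to the parent
        self._cache = {}            # partition index -> materialized list
        self._persist = False

    @property
    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions

    def map(self, fn):
        # Lazy: only records the operator, nothing is computed yet.
        return ToyRDD(parent=self, fn=fn)

    def persist(self):
        # Ask for computed partitions to be kept in memory.
        self._persist = True
        return self

    def compute(self, i):
        # Return partition i, using the cache or recomputing via lineage.
        if i in self._cache:
            return self._cache[i]
        if self._parent is None:
            data = list(self._source[i])
        else:
            data = [self._fn(x) for x in self._parent.compute(i)]
        if self._persist:
            self._cache[i] = data
        return data

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def lose_partition(self, i):
        # Simulate a worker failure that drops one cached partition.
        self._cache.pop(i, None)


samples = ToyRDD(partitions=[[1, 2], [3, 4], [5, 6]])
squared = samples.map(lambda x: x * x).persist()

first = squared.collect()      # materializes and caches all three partitions
squared.lose_partition(1)      # "failure": cached partition 1 disappears
second = squared.collect()     # partition 1 is rebuilt from its lineage
```

Only the lost partition is recomputed in the second `collect`; the other two are served from the in-memory cache.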
RTM via RDDs: Implementation using Spark
● Apache Spark is an implementation of RDDs
● Make use of HDFS or alternative FS
  o GPFS, AWS S3, OpenStack Swift, Ceph or Lustre
● Choose appropriate programming model(s)
  o Not limited to MapReduce
  o Iterative and/or interactive (including streaming)
● Manage Spark workloads
  o Built-in mode or YARN mode, Mesos
  o Univa Universal Resource Broker
after Lumb, insideBIGDATA
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
RTM via RDDs: Implementation using Spark (2)
● Deployable on bare metal … clouds
  o Monitoring/management
    ▪ Bright Cluster Manager
● Introduces analytics possibilities for RTM
  o Program in Java (C/C++ via JNA), Scala or Python
● Uptake is significant - rapidly growing community
● Results are extremely impressive
  o Exploit CPUs and/or GPUs
after Lumb, insideBIGDATA
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
RTM via RDDs: Opportunities
● Apply RDDs to gathers of seismic data
  o Partition RDDs optimally for wavefield calculations
● Apply RDDs to source wavefields
  o Partition RDDs optimally for cross-correlation of forward and reverse time wavefields
    ▪ Significantly reduce/eliminate disk I/O
● Investigate alternate imaging conditions
  o Machine-learning and/or graph-analytics algorithms in addition to cross-correlation
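A minimal sketch of the cross-correlation opportunity above, in plain Python so that it runs without a Spark installation. It assumes the standard zero-lag cross-correlation imaging condition, I(x) = Σ_t S(x, t) R(x, t), and a hypothetical layout in which both wavefields are keyed by spatial block, so the forward (source) and reverse-time (receiver) samples for a block sit together and can be correlated without moving data. The function names and the tiny wavefields are illustrative only.

```python
def partition_by_block(wavefield, block_size):
    """Split {time: [samples over x]} into {block_id: {time: [samples]}}."""
    blocks = {}
    for t, row in wavefield.items():
        for start in range(0, len(row), block_size):
            block_id = start // block_size
            blocks.setdefault(block_id, {})[t] = row[start:start + block_size]
    return blocks


def image_block(src_block, rcv_block):
    """Zero-lag cross-correlation over time for the samples of one block."""
    times = sorted(src_block)
    n = len(src_block[times[0]])
    return [sum(src_block[t][i] * rcv_block[t][i] for t in times)
            for i in range(n)]


def migrate(source, receiver, block_size):
    """Image every block independently, then stitch the blocks back together."""
    src = partition_by_block(source, block_size)
    rcv = partition_by_block(receiver, block_size)
    image = []
    for block_id in sorted(src):
        # Each iteration touches only one block of each wavefield, so in a
        # distributed setting this loop body could run once per partition.
        image.extend(image_block(src[block_id], rcv[block_id]))
    return image


# Tiny made-up wavefields: 2 time steps, 4 grid points along x.
source = {0: [1.0, 0.0, 2.0, 0.0], 1: [0.0, 1.0, 0.0, 2.0]}
receiver = {0: [0.5, 0.0, 1.0, 3.0], 1: [0.0, 2.0, 0.0, 1.0]}

image = migrate(source, receiver, block_size=2)
print(image)
```

In Spark, a comparable layout could be obtained by keying both wavefields by block and co-partitioning them (for example with `partitionBy`) before a `join`. That mapping is a suggestion, not something shown in the talk.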
[Diagram labels: Spark Workers; Spark (YARN) Master; Spark or YARN]
http://www.informationweek.com/big-data/big-data-analytics/apache-spark-3-promising-use-cases/a/d-id/1319660
http://ipython.org/notebook.html
Thunder: Initial Impressions
● Written in Spark's Python API (PySpark)
  o Makes use of scipy, numpy, and scikit-learn
● IPython Notebook serves as interactive GUI
  o Runs in a Web browser
  o Notebooks can include text and graphics
  o Secure, remote access to an in-cluster IPython Notebook server
● Includes modular functions for time-series analysis
● Can interface with C/C++ from Python
http://thunder-project.org/
Is there a case for migration?
● In-memory computing via RDDs is promising
  o Application to gathers and wavefields
● Spark provides analytics upside
  o Imaging conditions other than cross-correlation
● Spark may be applicable to modeling kernels
● Spark can be easily incorporated into pre-existing IT infrastructures
  o Complements existing HPC environments
http://rice2015oghpc.rice.edu/technical-program/
Summary
● Is there a case for migration?
  o From: RTM via HPC
  o To: RTM via Big Data or ( Big Data and HPC )
● Does it make sense to refactor other HPC problems as ‘Big Data problems’?
Refactoring HPC with Spark/RDDs …
● Could Spark/RDDs replace MPI?
  o Spark has primitives for distributed in-memory parallel computing … including fault tolerance
Acknowledgements
● M. Zaharia et al. for RDDs
● Communities responsible for Spark, Python & Thunder
● M. Lamarca, P. Labropoulos, D. Shestakov & L. Gibbons at Bright Computing
Questions?
Ian Lumb
[email protected]@brightcomputing.com
Resources
● RTM's scientific context
● Spark support in Bright Cluster Manager for Apache Hadoop