Reverse Time Migration via Resilient Distributed Datasets: Towards In-Memory Coherence of Seismic-Reflection Wavefields using Apache Spark

Ian Lumb

HPCS 2015 - Montreal

http://hpcs.ca

Outline

● The challenges and opportunities of RTM
● Refactoring RTM with Spark/RDDs
  o Spark’ing coherence between wavefields
● Summary

http://www.acceleware.com/technical-papers

Zhou 2014, Fig. 7.25

http://www.cambridge.org/ca/academic/subjects/earth-and-environmental-science/solid-earth-geophysics/practical-seismic-data-analysis

Motivation

● RTM is performance-challenged
  o Algorithms research remains topical
    ■ GPUs responsible for compelling results
● Revisit RTM as a ‘Big Data problem’
  o In-memory analytics has the potential to
    ■ Improve performance of data and wavefield manipulations in concert with computations
    ■ Introduce new prospects for imaging conditions

Key Performance Challenges

● RTM modeling kernel is compute intensive
  o Stable, non-dispersive solution via FDM requires
    ■ Small time steps and small grid intervals
    ■ Higher-order approximations of the spatial derivatives
● RTM wavefields exceed memory capacity
  o Multiple-TB source volumes must be stored to disk

e.g., Liu et al., Computers & Geosciences 59 (2013) 17–23
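
To put rough numbers on the two challenges above, the following minimal Python sketch estimates (a) the largest stable time step for a second-order explicit finite-difference scheme on a uniform 3-D grid, using the common CFL-type bound dt <= dx / (v_max * sqrt(3)), and (b) the storage needed to keep every time step of one source wavefield. The grid dimensions, velocity, record length and sample size are illustrative assumptions, not figures from the talk; higher-order spatial stencils generally tighten the time-step bound further.

```python
import math


def max_stable_dt(dx_m, v_max_m_s, ndim=3):
    """CFL-type bound for a 2nd-order explicit scheme on a uniform grid."""
    return dx_m / (v_max_m_s * math.sqrt(ndim))


def wavefield_bytes(nx, ny, nz, nt, bytes_per_sample=4):
    """Bytes needed to store every time step of one wavefield volume."""
    return nx * ny * nz * nt * bytes_per_sample


# Hypothetical survey parameters (assumptions for illustration only).
dx = 10.0          # grid interval in metres
v_max = 4500.0     # fastest velocity in the model, m/s
record_s = 8.0     # record length in seconds

dt = max_stable_dt(dx, v_max)
nt = math.ceil(record_s / dt)
size_tb = wavefield_bytes(1000, 1000, 500, nt) / 1e12

print(f"dt <= {dt * 1e3:.3f} ms, {nt} time steps, about {size_tb:.1f} TB per source wavefield")
```

Halving the grid interval in this sketch also halves the admissible time step, so the number of stored samples grows by roughly a factor of sixteen, which is why small grid intervals and small time steps drive both the compute and the storage cost.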

Resilient Distributed Datasets (RDDs)

● Abstraction for in-memory computing
● Fault-tolerant, parallel data structures
  o Cluster-ready
● Optionally persistent
● Can be partitioned for optimal placement
● Manipulated via operators

Zaharia et al., NSDI 2012
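
The properties listed above can be illustrated with a toy, single-process model. The sketch below is not Spark's API: `ToyRDD`, `lose_partition` and the sample data are hypothetical names invented for this illustration. It only shows the idea that a partitioned dataset which remembers its lineage (the chain of operators applied to its input) can rebuild a lost in-memory partition by recomputation rather than by replication.

```python
class ToyRDD:
    """Toy stand-in for an RDD: input partitions plus the lineage to rebuild them."""

    def __init__(self, source_partitions, transforms=()):
        self._source = source_partitions        # one list per partition, never mutated
        self._transforms = tuple(transforms)    # lineage: per-partition functions
        self._persisted = False
        self._cache = {}                        # partition index -> materialised data

    def map(self, fn):
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._transforms + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._transforms + (step,))

    def persist(self):
        """Ask for computed partitions to be kept in memory (optional persistence)."""
        self._persisted = True
        return self

    def _compute(self, i):
        data = list(self._source[i])
        for step in self._transforms:
            data = step(data)
        return data

    def _partition(self, i):
        if i in self._cache:
            return self._cache[i]
        data = self._compute(i)                 # rebuilt from lineage on a cache miss
        if self._persisted:
            self._cache[i] = data
        return data

    def lose_partition(self, i):
        """Simulate a worker failure that drops one cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        out = []
        for i in range(len(self._source)):
            out.extend(self._partition(i))
        return out


# Hypothetical data: two partitions of integer samples.
samples = ToyRDD([[1, 2, 3], [4, 5, 6]])
even_squares = samples.map(lambda x: x * x).filter(lambda x: x % 2 == 0).persist()

before = even_squares.collect()
even_squares.lose_partition(1)      # partition 1 is gone from memory
after = even_squares.collect()      # recomputed from the source and the lineage
print(before == after)
```

In Spark itself the corresponding operators are provided by the RDD API and the recomputation happens across cluster nodes; the toy class only mirrors the concept on a single machine.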

RTM via RDDs: Implementation using Spark

● Apache Spark is an implementation of RDDs
● Make use of HDFS or alternative FS
  o GPFS, AWS S3, OpenStack Swift, Ceph or Lustre
● Choose appropriate programming model(s)
  o Not limited to MapReduce
  o Iterative and/or interactive (including streaming)
● Manage Spark workloads
  o Built-in mode or YARN mode, Mesos
  o Univa Universal Resource Broker

after Lumb, insideBIGDATA

http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/


RTM via RDDs: Implementation using Spark (2)

● Deployable on bare metal … clouds
  o Monitoring/management
    ■ Bright Cluster Manager
● Introduces analytics possibilities for RTM
  o Program in Java (C/C++ via JNA), Scala or Python
● Uptake is significant - rapidly growing community
● Results are extremely impressive
  o Exploit CPUs and/or GPUs

after Lumb, insideBIGDATA
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/


RTM via RDDs: Opportunities

● Apply RDDs to gathers of seismic data
  o Partition RDDs optimally for wavefield calculations
● Apply RDDs to source wavefields
  o Partition RDDs optimally for cross-correlation of forward and reverse time wavefields
    ■ Significantly reduce/eliminate disk I/O
● Investigate alternate imaging conditions
  o Machine-learning and/or graph-analytics algorithms in addition to cross-correlation
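
As a rough illustration of the cross-correlation opportunity above, the sketch below applies a zero-lag cross-correlation imaging condition, image(x) = sum over t of S(x, t) * R(x, t), to wavefields that have been split into spatial blocks keyed by block index. It is plain Python standing in for keyed, co-partitioned RDDs, not Spark code; the function names, block size and the tiny wavefields are hypothetical, and both wavefields are assumed to be sampled at the same time steps and grid points.

```python
def split_into_blocks(wavefield, block_size):
    """Key each spatial block of wavefield[t][x] by its block index."""
    nx = len(wavefield[0])
    return {
        start // block_size: [snap[start:start + block_size] for snap in wavefield]
        for start in range(0, nx, block_size)
    }


def correlate_block(src_block, rcv_block):
    """Zero-lag cross-correlation over time for one block: sum_t S(x, t) * R(x, t)."""
    nx = len(src_block[0])
    return [
        sum(s[ix] * r[ix] for s, r in zip(src_block, rcv_block))
        for ix in range(nx)
    ]


def image_from_blocks(src_blocks, rcv_blocks):
    """Combine per-block images; equal keys mean the two blocks can sit together."""
    image = []
    for key in sorted(src_blocks):
        image.extend(correlate_block(src_blocks[key], rcv_blocks[key]))
    return image


# Hypothetical wavefields: 3 time steps, 4 grid points each.
source = [[1.0, 0.0, 2.0, 0.0],
          [0.0, 1.0, 0.0, 2.0],
          [1.0, 1.0, 0.0, 0.0]]
receiver = [[2.0, 0.0, 1.0, 0.0],
            [0.0, 3.0, 0.0, 1.0],
            [1.0, 0.0, 0.0, 5.0]]

image = image_from_blocks(split_into_blocks(source, 2),
                          split_into_blocks(receiver, 2))
print(image)
```

Because each block's image depends only on the source and receiver samples for that same block, giving both wavefields the same partitioning lets the correlation run block by block in memory, which is the sense in which optimal partitioning could reduce disk I/O.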

[Figure: cluster diagram with labels "Spark (YARN) Master", "Spark Workers" and "Spark or YARN"]

http://www.informationweek.com/big-data/big-data-analytics/apache-spark-3-promising-use-cases/a/d-id/1319660

http://ipython.org/notebook.html

Thunder: Initial Impressions

● Written in Spark's Python API (PySpark)
  o Makes use of scipy, numpy, and scikit-learn
● IPython Notebook serves as interactive GUI
  o Runs in a Web browser
  o Notebooks can include text and graphics
  o Secure, remote access to an in-cluster IPython Notebook server
● Includes modular functions for time-series analysis
● Can interface with C/C++ from Python

http://thunder-project.org/

Is there a case for migration?

● In-memory computing via RDDs is promising
  o Application to gathers and wavefields
● Spark provides analytics upside
  o Imaging conditions other than cross-correlation
● Spark may be applicable to modeling kernels
● Spark can be easily incorporated into pre-existing IT infrastructures
  o Complements existing HPC environments

http://rice2015oghpc.rice.edu/technical-program/

Summary

● Is there a case for migration?
  o From: RTM via HPC
  o To: RTM via Big Data or ( Big Data and HPC )

● Does it make sense to refactor other HPC problems as ‘Big Data problems’?

Refactoring HPC with Spark/RDDs …

● Could Spark/RDDs replace MPI?
  o Spark has primitives for distributed in-memory parallel computing … including fault tolerance

Acknowledgements

● M. Zaharia et al. for RDDs
● Communities responsible for Spark, Python & Thunder
● M. Lamarca, P. Labropoulos, D. Shestakov & L. Gibbons at Bright Computing

Questions?

Ian Lumb

[email protected]@brightcomputing.com


Resources

● RTM's scientific context
● Spark support in Bright Cluster Manager for Apache Hadoop

https://ianlumb.wordpress.com/2015/04/01/possibilities-for-reverse-time-seismic-migration-rtm-using-apache-spark/

http://www.brightcomputing.com/News-Bright-Computing-Highlights-Fully-Integrated-Support-for-Apache-Spark-at-Spark-Summit-2015
