TRANSCRIPT
Analyzing astronomical data with Apache Spark
Julien Peloton, in collaboration with Christian Arnault & Stéphane Plaszczynski
Laboratoire de l'Accélérateur Linéaire
Statistical challenges for large-scale structure in the era of LSST
Julien Peloton Analyzing astronomical data with Apache Spark
Motivation
On the one hand...
Future telescopes will collect huge amounts of data (O(1) TB/day). This is unprecedented in the field of astronomy.
... on the other hand,
Big Data communities have been dealing with such data volumes (and even more!) for many years. An efficient framework to tackle Big Data problems is Apache Spark.
Apache Spark
Apache Spark is a cluster-computing framework. It started as a research project at UC Berkeley in 2009, is released under an open-source license (Apache 2.0), and is used by more than 1,000 companies around the world.
Apache Spark, Hadoop & HDFS
Credo: "It is cheaper to move computation than data."
Spark was initially developed to overcome limitations of the MapReduce paradigm (Read → Map → Reduce). To work, Spark needs:
a cluster manager, and a distributed storage system (e.g. HDFS).
Data sources
Spark has built-in data sources, but mostly for naively structured data.
A popular file format to store and manipulate astronomical data is FITS.
The FITS format has a complex, heterogeneous structure (image/table HDUs, headers, data blocks).
Structured data source formats require a specific implementation to distribute the data.
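The complex structure mentioned above is nevertheless well specified: a FITS header is a sequence of 80-character ASCII "cards" packed into 2880-byte blocks and terminated by an END card. Below is a minimal sketch of card parsing in Python; it is illustrative only, not spark-fits code, and real readers also handle quoted strings, comments, and CONTINUE cards.

```python
def parse_fits_header(raw: bytes) -> dict:
    """Parse FITS header cards (80-byte ASCII records) into a dict.

    Illustrative sketch: handles simple 'KEY = value' cards, strips
    inline comments after '/', and stops at the END card.
    """
    cards = {}
    for i in range(0, len(raw), 80):
        card = raw[i:i + 80].decode("ascii")
        key = card[:8].strip()
        if key == "END":
            break
        if card[8:10] == "= ":
            cards[key] = card[10:].split("/")[0].strip()
    return cards

# A minimal mock header (each card padded to 80 characters):
header = b"".join(c.ljust(80).encode("ascii") for c in [
    "SIMPLE  =                    T / conforms to FITS standard",
    "BITPIX  =                    8",
    "NAXIS   =                    2",
    "END",
])
print(parse_fits_header(header))  # {'SIMPLE': 'T', 'BITPIX': '8', 'NAXIS': '2'}
```

The keyword layout (8-character keyword, "= " value indicator) is fixed by the FITS standard, which is what makes a header machine-parseable without external metadata.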
Handling FITS data in Spark: several challenges
How do we read a FITS file in Scala?
How do we access the data of a FITS file distributed across machines?
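One answer to the second question is to split a binary-table data block into partitions whose boundaries fall on row boundaries, so that no row straddles two executors. A hedged pure-Python sketch of that offset arithmetic follows; the function name and the fixed block size are illustrative assumptions, not the actual spark-fits implementation.

```python
def row_aligned_partitions(data_start, data_size, row_bytes, block_bytes):
    """Split a fixed-row-size table into (offset, length) partitions
    aligned on row boundaries, so no row is cut between machines.

    Illustrative sketch; a real connector also handles headers,
    padding, and multi-HDU files.
    """
    rows_per_block = max(1, block_bytes // row_bytes)
    n_rows = data_size // row_bytes
    partitions = []
    start_row = 0
    while start_row < n_rows:
        n = min(rows_per_block, n_rows - start_row)
        partitions.append((data_start + start_row * row_bytes, n * row_bytes))
        start_row += n
    return partitions

# 10 rows of 32 bytes after a 2880-byte header, 128-byte blocks:
# partitions of 4, 4, and 2 rows
print(row_aligned_partitions(2880, 320, 32, 128))
# [(2880, 128), (3008, 128), (3136, 64)]
```

The key design choice is that each executor receives a byte range it can read independently, which is what lets a FITS table dropped into HDFS be processed in parallel.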
spark-fits
Seamless integration with Apache Spark: a simple "drag-and-drop" of a FITS file into e.g. HDFS gives you full access to the whole framework!
Support for catalogs (image support in preparation).
API for Scala, Python, R, and Java.
https://github.com/JulienPeloton/spark-fits
spark-fits API: quick start
In Scala
// Define a DataFrame with the data from the first HDU
val df = spark.read.format("com.sparkfits")
.option("hdu", 1)
.load("hdfs://...")
// The DataFrame schema is inferred from the header
df.show(4)
+----------+---------+----------+-----+-----+
| target| RA| Dec|Index|RunId|
+----------+---------+----------+-----+-----+
|NGC0000000| 3.448297| -0.338748| 0| 1|
|NGC0000001| 4.493667| -1.441499| 1| 1|
|NGC0000002| 3.787274| 1.329837| 2| 1|
|NGC0000003| 3.423602| -0.294571| 3| 1|
+----------+---------+----------+-----+-----+
spark-fits API: quick start
In Python
## Define a DataFrame with the data from the first HDU
df = spark.read.format("com.sparkfits")\
.option("hdu", 1)\
.load("hdfs://...")
## The DataFrame schema is inferred from the header
df.show(4)
+----------+---------+----------+-----+-----+
| target| RA| Dec|Index|RunId|
+----------+---------+----------+-----+-----+
|NGC0000000| 3.448297| -0.338748| 0| 1|
|NGC0000001| 4.493667| -1.441499| 1| 1|
|NGC0000002| 3.787274| 1.329837| 2| 1|
|NGC0000003| 3.423602| -0.294571| 3| 1|
+----------+---------+----------+-----+-----+
I/O performances
VirtualData @ Université Paris-Sud: 1 driver + 9 executors.
9 × 17 Intel Core (Haswell) processors @ 2.6 GHz, 2 GB RAM each.
Scala, Python, Java, and R API
spark-fits is written in Scala, but takes advantage of Spark's interfacing mechanisms.
Performance is of the same order across the different APIs as far as I/O is concerned.
[Benchmark chart: iteration time (s). Scala API: first iteration 84.0 s, later iterations 2.4 s. Python API: first iteration 96.0 s, later iterations 2.7 s.]
First application: LSST-like catalogs
Problem: distribute galaxy catalog data (> 10^10 objects), and manipulate the data efficiently using spark-fits.
Data set: LSST-like data from CoLoRe, 110 GB (10-year target).
a) Ref: just count the data (I/O reference).
b) Shells: split the data into redshift shells with ∆z = 0.1, project them into Healpix maps, and reduce the data.
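The "Shells" step above amounts to bucketing each galaxy by redshift before projecting and reducing. A minimal pure-Python sketch of the binning logic, assuming ∆z = 0.1 shells, is given below; it is illustrative only, since the actual pipeline does this with Spark transformations plus a Healpix projection.

```python
def redshift_shell(z, z_min=0.0, dz=0.1):
    """Index k of the shell [z_min + k*dz, z_min + (k+1)*dz) containing z."""
    return int((z - z_min) // dz)

# Bucket a few mock galaxies into Delta-z = 0.1 shells
galaxies = [0.15, 0.23, 0.27, 0.38, 0.41]
shells = {}
for z in galaxies:
    shells.setdefault(redshift_shell(z), []).append(z)
print(sorted(shells.items()))
```

In the distributed version, the shell index plays the role of a reduction key: all galaxies sharing a shell (and a Healpix pixel) are aggregated together, which is a natural fit for Spark's map-reduce model.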
First application: LSST-like catalogs (shells)
[Healpix maps: LSST 1Y galaxy counts in four redshift shells, 0.1-0.2, 0.2-0.3, 0.3-0.4, and 0.4-0.5 (color scale from -1 to 1).]
spark-fits at NERSC
Although not ideal, Spark can be used to process data stored in HPC-style shared file systems.
We processed 1.2 TB on Cori (NERSC) with spark-fits, using 40 Haswell nodes (1280 cores in total).
[Benchmark chart: iteration time (s) on Cori. 1 OST: first iteration 69.5 s, later iterations 2.4 s. 8 OSTs: first iteration 57.1 s, later iterations 2.1 s.]
Conclusion & perspectives
A set of new tools for astronomy to enter the Big Data era:
Library to manipulate FITS in Scala.
Spark connector to distribute FITS data across machines and perform efficient data exploration.
Proof of concept:
Demonstrated robustness against a wide range of data set sizes. Distributed up to 1.2 TB of FITS data across machines, and explored the data interactively.
Perspectives:
Keep extending the tools, e.g. to manipulate image HDUs.
Develop science cases, and integrate them into the current efforts.
Your collaboration is welcome!
Backup: First application: LSST-like catalogs
Problem: distribute galaxy catalog data (> 10^10 objects), and manipulate the data efficiently using spark-fits.
Data set: LSST-like data from CoLoRe, 110 GB (10-year target).
a) Ref: just count the data (I/O reference).
b) Shells: split the data into redshift shells with ∆z = 0.1, project them into Healpix maps, and reduce the data.
c) Neighbours: find all galaxies in the catalogs contained in a circle of center (x0, y0) and radius 1 deg, and reduce the data.
d) Cross-match: find the common objects between two sets of catalogs of size 110 GB and 3.5 GB, respectively, and reduce the data.
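The neighbour search in item c boils down to an angular-separation cut around a reference position. A hedged pure-Python sketch follows: the great-circle formula is the standard spherical law of cosines, but the function names, the mock catalog, and the 1-degree driver are illustrative assumptions, not spark-fits code.

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two sky positions
    given in degrees (spherical law of cosines; fine at degree scales)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(dec1) * math.sin(dec2)
               + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    # Clamp to [-1, 1] to guard against floating-point overshoot
    return math.degrees(math.acos(min(1.0, max(-1.0, cos_sep))))

def neighbours(catalog, ra0, dec0, radius_deg=1.0):
    """Keep catalog entries (ra, dec) within radius_deg of (ra0, dec0)."""
    return [(ra, dec) for ra, dec in catalog
            if angular_sep_deg(ra, dec, ra0, dec0) <= radius_deg]

mock = [(10.0, 5.0), (10.5, 5.2), (12.0, 7.0)]
print(neighbours(mock, 10.0, 5.0))  # only the two close objects survive
```

In the distributed case, this per-object cut is a Spark filter; the cross-match in item d relies on the same separation measure, applied between objects of the two catalogs rather than against a fixed center.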
Backup: Distribution(s) of FITS
Problem: how to distribute and manipulate the data of a FITS file?
Backup: Spark, Scala, and Python
In the Scala world, there is no real equivalent to the numpy, matplotlib, and scipy packages.
Using the Jep package, we can interface Scala with the Python world.
Installation:
Pip install: pip install jep --user
Build with sbt: unmanagedBase := file("/path/to/python3.5/site-packages/jep")
Backup: HDF5
https://github.com/valiantljk/h5spark
https://www.hdfgroup.org/downloads/spark-connector