TensorFlow & TensorFrames w/ Apache Spark
Presents...
Marco Saviano
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
Numerical computing
• Queries and algorithms are computation-heavy
• Numerical algorithms, like ML, use very simple data types: integer/floating-point operations, vectors, matrices
• Not necessarily a lot of data movement
• Numerical bottlenecks are good targets for optimization
Evolution of computing power
(chart: computing platforms plotted along "scale up" vs. "scale out" axes)
HPC Frameworks
(chart: HPC frameworks plotted along the same "scale up" vs. "scale out" axes)
Today's talk: Spark + TensorFlow = TensorFrames
Open source successes: commits on the master branch on GitHub
Apache Spark – 1015 contributors
Google TensorFlow – 582 contributors
Spark enterprise users
TensorFlow enterprise users
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing, with built-in
modules for streaming, SQL, machine learning and graph processing
Spark Unified Stack
How does it work? (1/3)
Spark is written in Scala and runs on the Java Virtual Machine.
Every Spark application consists of a driver program: it contains the main function, defines distributed datasets on the cluster and then applies operations to them.
Driver programs access Spark through a SparkContext object.
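A minimal sketch of a driver program (assumes a local PySpark installation; the app name and data are illustrative):

from pyspark import SparkConf, SparkContext

# The driver defines the distributed dataset and the operations on it.
conf = SparkConf().setAppName("example").setMaster("local[*]")
sc = SparkContext(conf=conf)

# parallelize() turns a local collection into a distributed dataset (RDD);
# the map/sum below become tasks executed by the executors.
rdd = sc.parallelize(range(10))
print(rdd.map(lambda v: v * 2).sum())  # 90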
How does it work? (2/3)
To run the operations defined in the application, the driver typically manages a number of nodes called executors. These operations result in tasks that the executors have to perform.
How does it work? (3/3)
Managing and manipulating datasets distributed over a cluster by writing just a driver program, without taking care of the distributed system, is possible because of:
• Cluster managers (resource management, networking, …)
• SparkContext (task definition from more abstract operations)
• RDD (Spark's main programming abstraction to represent distributed datasets)
RDD vs DataFrame
• RDD: immutable distributed collection of elements of your data, partitioned across the nodes of your cluster, that can be operated on in parallel with a low-level API, i.e. transformations and actions.
• DataFrame: immutable distributed collection of data, organized into named columns. It is like a table in a relational database.
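A minimal sketch contrasting the two APIs (assumes the sc and sqlContext objects used elsewhere in this deck; the data is illustrative):

# RDD: low-level transformations and actions on arbitrary Python objects
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
totals = rdd.reduceByKey(lambda u, v: u + v).collect()  # e.g. [('a', 4), ('b', 2)]

# DataFrame: the same data organized into named columns, like a relational table
df = sqlContext.createDataFrame(rdd, ["key", "value"])
df.groupBy("key").sum("value").show()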
DataFrame: pros and cons
• Higher-level API, which makes Spark available to a wider audience
• Performance gains thanks to the Catalyst query optimizer
• Space efficiency by leveraging the Tungsten subsystem:
• off-heap memory management, i.e. managing memory explicitly
• cache-aware computation, which improves the speed of data processing through more effective use of the L1/L2/L3 CPU caches
• Higher-level API may limit expressiveness
• Complex transformations are better expressed using the RDD API
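One way to see Catalyst at work is to ask a DataFrame for its query plan (a sketch; reuses the illustrative df above):

# explain(True) prints the logical plan before and after Catalyst's optimizations.
df.filter(df.value > 1).select("key").explain(True)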
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
Google TensorFlow
• Programming system in which you represent computations as graphs
• Google Brain Team (https://research.google.com/teams/brain/)
• Very popular for deep learning and neural networks
Google TensorFlow
• Core written in C++
• Interfaces in C++ and Python
(diagram: the C++ and Python front ends sit on top of the Core TensorFlow Execution System, which targets CPU, GPU, Android, iOS, …)
Google TensorFlow adoption
Tensors
• Big idea: express a numeric computation as a graph
• Graph nodes are operations, which have any number of inputs and outputs
• Graph edges are tensors, which flow between nodes
• Tensors can be viewed as multidimensional arrays of numbers:
• a scalar is a tensor,
• a vector is a tensor,
• a matrix is a tensor,
• and so on…
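For instance, with the TensorFlow 1.x API used throughout this deck:

import tensorflow as tf

scalar = tf.constant(3.0)                       # rank-0 tensor, shape ()
vector = tf.constant([1.0, 2.0])                # rank-1 tensor, shape (2,)
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank-2 tensor, shape (2, 2)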
Programming model
import tensorflow as tf

x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")

session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})
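With the feeds above, the session evaluates z = x + 3*y = 3 + 3*5, so output_value is 18.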
(graph: placeholders x:int32 and y:int32; y feeds a mul node with the constant 3, and the result is added to x to produce z)
Tensorflow Demo
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
TensorFrames
• TensorFrames (TensorFlow on Spark DataFrames) lets you manipulate Spark's DataFrames with TensorFlow programs.
• Code can be written in Python, in Scala, or directly by passing a protocol buffer description of the operations graph
• Built on the javacpp project
• Officially supported Spark versions: 1.6+
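TensorFrames is distributed through spark-packages.org, so a typical way to try it is pyspark's --packages flag with a coordinate of the form databricks:tensorframes:<version> (the exact coordinate depends on the release and the Scala version).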
Spark with TensorFlow
(diagram: a Spark worker process and a worker Python process; each row travels Tungsten binary format → Java object → Python pickle, over to the Python worker, then Python pickle → C++ buffer for TensorFlow)
TensorFrames: native embedding of TensorFlow
(diagram: everything stays inside the Spark worker process: Tungsten binary format → Java object → C++ buffer, with no Python worker in the loop)
Programming model
• Integrate the TensorFlow API in Spark DataFrames

import tensorflow as tf
import tensorframes as tfs

df = sqlContext.createDataFrame(zip(range(0, 10), range(1, 11))).toDF("x", "y")

x = tfs.row(df, "x")
y = tfs.row(df, "y")
output = tf.add(x, 3 * y, name="z")
output_df = tfs.map_rows(output, df)

output_df.collect()
(graph: as before, x:int32 and y:int32 with the constant 3 feeding mul, then z)
df: DataFrame[x: int, y: int]
output_df: DataFrame[x: int, y: int, z: int]
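For example, the first row (x=0, y=1) gives z = 0 + 3*1 = 3, and the last row (x=9, y=10) gives z = 39.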
Demo
Tensors
• TensorFlow expresses operations on tensors: homogeneous data structures that consist of an array and a shape
• In TensorFrames, tensors are stored in Spark DataFrames
Spark DataFrame:
x  y
1  [1.1 1.2]
2  [2.1 2.2]
3  [3.1 3.2]
Chunk and distribute the table across the cluster:
Node 1: x = 1, y = [1.1 1.2]
Node 2: x = 2, y = [2.1 2.2] and x = 3, y = [3.1 3.2]
Map operations
• TensorFrames provides most operations in two forms:
• a row-based version
• a block-based version
• The block transforms are usually more efficient: there is less overhead in calling TensorFlow, and they can manipulate more data at once.
• In some cases it is not possible to treat a sequence of rows as a single tensor, because the data must be homogeneous
row-based:
process_row: x = 1, y = [1.1 1.2]
process_row: x = 2, y = [2.1 2.2]
process_row: x = 3, y = [3.1 3.2]
block-based:
process_block: x = [1], y = [1.1 1.2]
process_block: x = [2 3], y = [[2.1 2.2] [3.1 3.2]]
x  y
1  [1]
2  [1 2]
3  [1 2 3]
(here y has a different length in every row, so the rows cannot be stacked into a single block tensor)
Row-based vs Block-based
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row
from pyspark.sql.functions import *

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)

with tf.Graph().as_default() as g:
    x = tfs.row(df, "x")
    z = tf.add(x, 3, name='z')
    df2 = tfs.map_rows(z, df)

df2.show()
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row
from pyspark.sql.functions import *

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)

with tf.Graph().as_default() as g:
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name='z')
    df2 = tfs.map_blocks(z, df)

df2.show()
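Both programs produce the same df2 (x plus a z column equal to x + 3); the difference is that the block version hands TensorFlow one chunk of rows per call instead of a single row, which is usually faster.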
Reduction operations
• Reduction operations coalesce a pair or a collection of rows and transform them into a single row, until there is one row left.
• The transforms must be algebraic (associative): the order in which they are applied should not matter:
f(f(a, b), c) == f(a, f(b, c))
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row
from pyspark.sql.functions import *

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)

with tf.Graph().as_default() as g:
    x_input = tfs.block(df, "x", tf_name="x_input")
    x = tf.reduce_sum(x_input, name='x')
    res = tfs.reduce_blocks(x, df)

print(res)
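Here the blocks of x are summed, so res is 0 + 1 + 2 + 3 + 4 = 10.0.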
Demo
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
Improving communication
(diagram: inside the Spark worker process, the Tungsten binary format is copied directly into the C++ buffer, skipping the Java object step)
Improving communication
(diagram: columnar storage in the Spark worker process is copied directly into the C++ buffer)
Future
• Integration with Tungsten:
• direct memory copy
• columnar storage
• Better integration with MLlib data types
• Improving GPU support
Questions?