TensorFlow & TensorFrames w/ Apache Spark
Presents...
Marco Saviano
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
Numerical computing
• Queries and algorithms are computation-heavy
• Numerical algorithms, like ML, use very simple data types: integer/floating-point operations, vectors, matrices
• Not necessarily a lot of data movement
• Numerical bottlenecks are good targets for optimization
Evolution of computing power
(chart: computing platforms plotted along "scale up" vs. "scale out" axes)
HPC Frameworks
(chart: HPC frameworks plotted along the same "scale up" vs. "scale out" axes)
Today's talk: Spark + TensorFlow = TensorFrames
Open source successes: commits on the master branch on GitHub
Apache Spark – 1015 contributors
Google TensorFlow – 582 contributors
Spark enterprise users
TensorFlow enterprise users
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
Apache Spark
Apache Spark™ is a fast and general engine for large-scale data processing, with built-in
modules for streaming, SQL, machine learning and graph processing
Spark Unified Stack
How does it work? (1/3)
Spark is written in Scala and runs on the Java Virtual Machine.
Every Spark application consists of a driver program: it contains the main function, defines distributed datasets on the cluster and then applies operations to them.
Driver programs access Spark through a SparkContext object.
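A minimal sketch of a driver program (assumes a local PySpark installation; the app name and data are illustrative):

from pyspark import SparkConf, SparkContext

# The driver defines the distributed dataset and the operations on it.
conf = SparkConf().setAppName("example").setMaster("local[*]")
sc = SparkContext(conf=conf)

# parallelize() turns a local collection into a distributed dataset (RDD);
# the map/sum below become tasks executed by the executors.
rdd = sc.parallelize(range(10))
print(rdd.map(lambda v: v * 2).sum())  # 90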
How does it work? (2/3)
To run the operations defined in the application, the driver typically manages a number of nodes called executors. These operations result in tasks that the executors have to perform.
How does it work? (3/3)
Managing and manipulating datasets distributed over a cluster by writing just a driver program, without taking care of the distributed system, is possible because of:
• Cluster managers (resource management, networking, …)
• SparkContext (task definition from more abstract operations)
• RDD (Spark's main programming abstraction to represent distributed datasets)
RDD vs DataFrame
• RDD: immutable distributed collection of elements of your data, partitioned across the nodes of your cluster, that can be operated on in parallel with a low-level API, i.e. transformations and actions.
• DataFrame: immutable distributed collection of data, organized into named columns. It is like a table in a relational database.
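A minimal sketch contrasting the two APIs (assumes the sc and sqlContext objects used elsewhere in this deck; the data is illustrative):

# RDD: low-level transformations and actions on arbitrary Python objects
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
totals = rdd.reduceByKey(lambda u, v: u + v).collect()  # e.g. [('a', 4), ('b', 2)]

# DataFrame: the same data organized into named columns, like a relational table
df = sqlContext.createDataFrame(rdd, ["key", "value"])
df.groupBy("key").sum("value").show()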
DataFrame: pros and cons
• Higher-level API, which makes Spark available to a wider audience
• Performance gains thanks to the Catalyst query optimizer
• Space efficiency by leveraging the Tungsten subsystem:
• off-heap memory management, i.e. managing memory explicitly
• cache-aware computation, which improves the speed of data processing through more effective use of the L1/L2/L3 CPU caches
• Higher-level API may limit expressiveness
• Complex transformations are better expressed using the RDD API
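One way to see Catalyst at work is to ask a DataFrame for its query plan (a sketch; reuses the illustrative df above):

# explain(True) prints the logical plan before and after Catalyst's optimizations.
df.filter(df.value > 1).select("key").explain(True)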
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
Google TensorFlow
• Programming system in which you represent computations as graphs
• Google Brain Team (https://research.google.com/teams/brain/)
• Very popular for deep learning and neural networks
Google TensorFlow
• Core written in C++
• Interfaces in C++ and Python
(diagram: the C++ and Python front ends sit on top of the Core TensorFlow Execution System, which targets CPU, GPU, Android, iOS, …)
Google TensorFlow adoption
Tensors
• Big idea: express a numeric computation as a graph
• Graph nodes are operations, which have any number of inputs and outputs
• Graph edges are tensors, which flow between nodes
• Tensors can be viewed as multidimensional arrays of numbers:
• a scalar is a tensor,
• a vector is a tensor,
• a matrix is a tensor,
• and so on…
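For instance, with the TensorFlow 1.x API used throughout this deck:

import tensorflow as tf

scalar = tf.constant(3.0)                       # rank-0 tensor, shape ()
vector = tf.constant([1.0, 2.0])                # rank-1 tensor, shape (2,)
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank-2 tensor, shape (2, 2)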
Programming model
import tensorflow as tf

x = tf.placeholder(tf.int32, name="x")
y = tf.placeholder(tf.int32, name="y")
output = tf.add(x, 3 * y, name="z")

session = tf.Session()
output_value = session.run(output, {x: 3, y: 5})
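With the feeds above, the session evaluates z = x + 3*y = 3 + 3*5, so output_value is 18.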
(graph: placeholders x:int32 and y:int32; y feeds a mul node with the constant 3, and the result is added to x to produce z)
Tensorflow Demo
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
TensorFrames
• TensorFrames (TensorFlow on Spark DataFrames) lets you manipulate Spark's DataFrames with TensorFlow programs.
• Code can be written in Python, in Scala, or directly by passing a protocol buffer description of the operations graph
• Built on the javacpp project
• Officially supported Spark versions: 1.6+
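TensorFrames is distributed through spark-packages.org, so a typical way to try it is pyspark's --packages flag with a coordinate of the form databricks:tensorframes:<version> (the exact coordinate depends on the release and the Scala version).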
Spark with TensorFlow
(diagram: a Spark worker process and a worker Python process; each row travels Tungsten binary format → Java object → Python pickle, over to the Python worker, then Python pickle → C++ buffer for TensorFlow)
TensorFrames: native embedding of TensorFlow
(diagram: everything stays inside the Spark worker process: Tungsten binary format → Java object → C++ buffer, with no Python worker in the loop)
Programming model
• Integrate the TensorFlow API in Spark DataFrames

import tensorflow as tf
import tensorframes as tfs

df = sqlContext.createDataFrame(zip(range(0, 10), range(1, 11))).toDF("x", "y")

x = tfs.row(df, "x")
y = tfs.row(df, "y")
output = tf.add(x, 3 * y, name="z")
output_df = tfs.map_rows(output, df)

output_df.collect()
(graph: as before, x:int32 and y:int32 with the constant 3 feeding mul, then z)
df: DataFrame[x: int, y: int]
output_df: DataFrame[x: int, y: int, z: int]
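For example, the first row (x=0, y=1) gives z = 0 + 3*1 = 3, and the last row (x=9, y=10) gives z = 39.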
Demo
Tensors
• TensorFlow expresses operations on tensors: homogeneous data structures that consist of an array and a shape
• In TensorFrames, tensors are stored in Spark DataFrames
Spark DataFrame:
x  y
1  [1.1 1.2]
2  [2.1 2.2]
3  [3.1 3.2]
Chunk and distribute the table across the cluster:
Node 1: x = 1, y = [1.1 1.2]
Node 2: x = 2, y = [2.1 2.2] and x = 3, y = [3.1 3.2]
Map operations
• TensorFrames provides most operations in two forms:
• a row-based version
• a block-based version
• The block transforms are usually more efficient: there is less overhead in calling TensorFlow, and they can manipulate more data at once.
• In some cases it is not possible to treat a sequence of rows as a single tensor, because the data must be homogeneous
row-based:
process_row: x = 1, y = [1.1 1.2]
process_row: x = 2, y = [2.1 2.2]
process_row: x = 3, y = [3.1 3.2]
block-based:
process_block: x = [1], y = [1.1 1.2]
process_block: x = [2 3], y = [[2.1 2.2] [3.1 3.2]]
x  y
1  [1]
2  [1 2]
3  [1 2 3]
(here y has a different length in every row, so the rows cannot be stacked into a single block tensor)
Row-based vs Block-based
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row
from pyspark.sql.functions import *

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)

with tf.Graph().as_default() as g:
    x = tfs.row(df, "x")
    z = tf.add(x, 3, name='z')
    df2 = tfs.map_rows(z, df)

df2.show()
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row
from pyspark.sql.functions import *

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)

with tf.Graph().as_default() as g:
    x = tfs.block(df, "x")
    z = tf.add(x, 3, name='z')
    df2 = tfs.map_blocks(z, df)

df2.show()
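Both programs produce the same df2 (x plus a z column equal to x + 3); the difference is that the block version hands TensorFlow one chunk of rows per call instead of a single row, which is usually faster.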
Reduction operations
• Reduction operations coalesce a pair or a collection of rows and transform them into a single row, until there is one row left.
• The transforms must be algebraic (associative): the order in which they are applied should not matter:
f(f(a, b), c) == f(a, f(b, c))
import tensorflow as tf
import tensorframes as tfs
from pyspark.sql import Row
from pyspark.sql.functions import *

data = [Row(x=float(x)) for x in range(5)]
df = sqlContext.createDataFrame(data)

with tf.Graph().as_default() as g:
    x_input = tfs.block(df, "x", tf_name="x_input")
    x = tf.reduce_sum(x_input, name='x')
    res = tfs.reduce_blocks(x, df)

print(res)
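Here the blocks of x are summed, so res is 0 + 1 + 2 + 3 + 4 = 10.0.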
Demo
TensorFlow & TensorFrames w/ Apache Spark
1. Numerical Computing
2. Apache Spark
3. Google TensorFlow
4. TensorFrames
5. Future developments
Improving communication
(diagram: inside the Spark worker process, the Tungsten binary format is copied directly into the C++ buffer, skipping the Java object step)
Improving communication
(diagram: columnar storage in the Spark worker process is copied directly into the C++ buffer)
Future
• Integration with Tungsten:
• direct memory copy
• columnar storage
• Better integration with MLlib data types
• Improving GPU support
Questions?