
Machine Learning with Apache Spark

Mathijs Kattenberg, Jeroen Schot

PTC workshop, 2018-02-13

About us

Mathijs Kattenberg

Technical consultant at SURFsara since 2013

● Working with Big Data technologies (Hadoop, Spark, Kafka)

Before:

● Scientific programmer at VU Amsterdam
● MSc Artificial Intelligence at VU Amsterdam

Jeroen Schot

Technical consultant at SURFsara since 2012

● Working with Big Data technologies (Hadoop, Spark, Kafka)

Before:

● MSc Physics at Utrecht University

Program for today

09:00 - 09:15  Welcome & introduction
09:15 - 10:30  Apache Spark core and structured APIs
10:30 - 10:45  Coffee break
10:45 - 12:00  Hands-on Jupyter notebooks
12:00 - 13:00  Lunch
13:00 - 14:30  Apache Spark MLlib
14:30 - 14:45  Coffee break
14:45 - 16:15  Hands-on Jupyter notebooks
16:15 - 16:30  Coffee break
16:30 - 17:00  Practical advice, summary

Apache Spark core and structured APIs

● Differences with traditional HPC approaches

● Distributed data processing

● Resilient Distributed Datasets (RDDs)

● DataFrames (DFs)

“Traditional” (scientific) software applications

Application developed as:

• Stand-alone binary application

• Assumes a specific environment (e.g. Linux OS, CLI)

• Operates on input files and parameters

• Produces output files

• Researcher specifies input files and params via CLI

Scaling “traditional” applications

Now the one running the application needs to:

• Distribute and split data

• Handle faults and errors inherent with scale

• Submit and track applications

An example

Consider a tweet from which we want to extract:

• Names of persons
• Names of organisations
• Locations and places

“I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting!”

A straightforward implementation

• Store tweets on disk
• Small Python program uses NLTK and Stanford NER to tag
• Write output back to disk

But…

http://bit.ly/1rxKY0n


Scaling Bottlenecks

• Store tweets on disk: it will eventually fill, many readers

• Small Python program: it can do a tweet every few msecs/secs so need to run separate processes

• Write output back to disk: it will eventually fill, many writers

• Run separate processes: they all need input

Scalability: Design

• Data is growing faster than computing power and IO

=> distributed computing necessary

• Most standard applications cannot run in a distributed fashion

=> applications need to be designed with scalability from the start

Machine Limits

Current system limits:

• ~256 CPU cores

• 2TB of RAM

• ~500 TB disk space

= expensive (cost does not scale linearly)

Scale out instead of scale up!

Parallel programming is hard

Scalability: Design

Idea: take a step back and consider:

• Work without mutable state

• Restrict the programming interface so that more can be done automatically.

Turns out: we can use ideas from functional programming and declarative languages

Scalable programs

Consider: declarative vs. imperative

Functional Programming

Restrict the programming interface so that the system can do more automatically. Use ideas from functional programming:

“Here is a function, apply it to all of the data”

• I don't care where it runs (the system should handle that)

• Feel free to run it twice on different nodes (no side effects!)

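Map

Map takes as input a function and a list, and applies the function to every element:

    I → 1, like → 4, traffic → 7, lights → 6

    map(len, ['I', 'like', 'traffic', 'lights'])

which in Python yields 1, 4, 7, 6. In Python 3 map returns a lazy iterator, so wrap the call in list() to get the list [1, 4, 7, 6].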

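Reduce

Reduce takes as input a binary function and a list. A binary function is a function with two arguments, like add, subtract, multiply, etc.

    def add(x, y):
        return x + y

    reduce(add, [47, 11, 42, 13])

returns 113. The list is folded from left to right: 47 + 11 = 58, 58 + 42 = 100, 100 + 13 = 113. In Python 3, reduce has to be imported from functools.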
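The Map and Reduce examples above can be reproduced in plain Python 3. This is a minimal sketch that only uses the standard library; `operator.add` stands in for the hand-written `add` function.

```python
from functools import reduce
from operator import add

words = ['I', 'like', 'traffic', 'lights']

# map applies len to every element; list() forces the lazy iterator.
lengths = list(map(len, words))

# reduce folds the list from left to right with a binary function:
# ((47 + 11) + 42) + 13
total = reduce(add, [47, 11, 42, 13])

print(lengths)  # [1, 4, 7, 6]
print(total)    # 113
```

Neither function call says where or in which order the work happens, which is what lets a framework distribute it.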

MapReduce programming model

Input: set of input key/value pairs, a Map function and a Reduce function

Output: set of output key/value pairs

Map function is applied to every input pair to produce an intermediate key/value pair

All intermediate pairs are grouped by key

Reduce function is applied to every key and set of values for that key
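The three steps above can be sketched on a single machine in plain Python. This is only a model of the programming model, not of Hadoop itself; the names `map_fn`, `reduce_fn` and `map_reduce` are made up for this illustration, and the input is the word count example used later in these slides.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: emit one (word, 1) pair per word in the input line."""
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    """Reduce: combine all values that share a key."""
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn):
    # 1. Apply the map function to every input pair.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # 2. Group all intermediate pairs by key (the "shuffle").
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # 3. Apply the reduce function to every key and its list of values.
    return dict(reduce_fn(key, values) for key, values in groups.items())

# Input pairs: (line number, line of text).
inputs = [(0, "the cat sat on the mat"), (1, "the aardvark sat on the sofa")]
counts = map_reduce(inputs, map_fn, reduce_fn)
print(counts)
```

In a real framework steps 1 and 3 run in parallel on many machines, and step 2 moves data between them.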

Hadoop MapReduce

MapReduce strengths

MapReduce framework handles a lot of work for its end-user:

● Splitting work in independent tasks
● Task scheduling, retrying on failure
● Data grouping/shuffling, in-memory/spilling to disk

MapReduce limitations

● Very low level: decomposing problems in (multiple) MapReduce jobs is hard

● Batch-oriented: unsuited for interactive use or realtime processing

● Disk sync: performance issues when chaining jobs (iterative algorithms)

Higher Level Frameworks

= SQL on Hadoop

= Pig - dataflow DSL

= Dataflow API in Java

= Graph processing

All translated to MR jobs

Apache Spark: a general framework

• Spark can be seen as a successor of Hadoop MapReduce and is a simplified framework for writing large-scale data-intensive applications

• Write programs in terms of distributed datasets and operations on them

• Accessible from multiple programming languages:

Scala

Java

Python

R (only via DataFrames)

Spark components

https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Resilient Distributed Dataset (RDD)

• Abstraction for a collection of objects/elements/records

• Spread over many machines

• Built through parallel transformations

• Immutable

Creation of RDD

• Transforming an existing RDD
• Through SparkContext:
  - From internal data structure
  - From reading in file (HDFS or otherwise)

    text = "This is a sample text."
    # A string is a sequence of characters, so this gives an RDD of single characters
    textRDD = sc.parallelize(text)

    lines = sc.textFile('../data/links.tsv')

Operations on RDDs

Transformations:

• Create new RDD

• Lazily computed

• Example: ‘map’, ‘filter’

Actions:

• Return a value to the driver or cause a side effect

• Triggers computation

• Example: ‘count’, ‘saveAsTextFile’
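The difference between lazy transformations and actions can be illustrated with a plain-Python analogy. This is not Spark code: Python's built-in `map` and `filter` are lazy in a similar way, and `list()` plays the role of an action. The helper `traced_double` is made up for this sketch.

```python
log = []

def traced_double(x):
    log.append(x)      # record that the function really ran
    return 2 * x

data = [1, 2, 3, 4]

# "Transformations": building the chain runs nothing yet.
doubled = map(traced_double, data)
large = filter(lambda x: x > 4, doubled)
ran_before_action = len(log)

# "Action": asking for a result triggers the whole chain.
result = list(large)
ran_after_action = len(log)

print(ran_before_action, ran_after_action, result)  # 0 4 [6, 8]
```

Because nothing runs until the result is requested, the system sees the whole chain before executing it, which is what gives Spark room to plan the work.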

Transformations

RDDs can be created from other RDDs using transformations:

• map(f) Apply function f to each element of the RDD

• flatMap(f) Apply function f to each element of the RDD and unpack lists etc.

• filter(pred) Apply predicate pred to each element of the RDD and keep those that pass pred

• distinct() Remove duplicate entries in RDD

Actions

• collect() Returns all elements of the RDD in a list

• count() Returns the number of elements in the RDD

• take(n) Returns the first n elements of the RDD

• reduce(f) Returns the combined result of f on the RDD.

map vs flatmap

• Different from Reduce as in MapReduce

• Aggregates all elements to a single value

Example reduce

Function with two arguments

reduceByKey

x is not the key but the accumulated value!

Pseudo set operations

Pair RDDs

• The elements of a Pair RDD are pairs (k,v)

• k is interpreted as the key, v as the value

• Very much like Hadoop’s MapReduce

• Pair RDD have extra methods

Pair RDD transformations

• groupByKey() Returns an RDD with elements (key, valuelist)

• reduceByKey(f(x,y)) Applies f to all values of each key (similar to Hadoop MapReduce)

• join(RDD) Joins two RDDs on their keys

• mapValues(f) Apply f to the values, not the keys of the RDD
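The semantics of these pair transformations can be modelled on ordinary Python lists of (key, value) tuples. This is a single-machine sketch, not the Spark API; the function names are chosen to mirror the RDD methods. Note that the function passed to `reduce_by_key` receives the value accumulated so far and the next value, never the key.

```python
def group_by_key(pairs):
    """Collect all values per key into a list."""
    groups = {}
    for k, v in pairs:
        groups.setdefault(k, []).append(v)
    return groups

def reduce_by_key(pairs, f):
    """Fold the values of each key with the binary function f."""
    acc = {}
    for k, v in pairs:
        # f(accumulated value, next value); the key is not passed to f
        acc[k] = f(acc[k], v) if k in acc else v
    return acc

def map_values(pairs, f):
    """Apply f to the values only; keys are left untouched."""
    return [(k, f(v)) for k, v in pairs]

pairs = [("a", 1), ("b", 5), ("a", 3), ("b", 2), ("a", 4)]

grouped = group_by_key(pairs)
maxima = reduce_by_key(pairs, max)
scaled = map_values(pairs, lambda v: v * 10)

print(grouped)  # {'a': [1, 3, 4], 'b': [5, 2]}
print(maxima)   # {'a': 4, 'b': 5}
print(scaled)   # [('a', 10), ('b', 50), ('a', 30), ('b', 20), ('a', 40)]
```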

Actions on pair RDDs

Word Count

Input

    the cat sat on the mat
    the aardvark sat on the sofa

Output

    aardvark 1
    cat      1
    mat      1
    on       2
    sat      2
    sofa     1
    the      4

    lines = sc.textFile(file)
    words = lines.flatMap(lambda s: s.split())
    pairs = words.map(lambda w: (w, 1))
    counts = pairs.reduceByKey(lambda x, y: x + y)

Actions vs. transformations

● Try to do as much as possible on executors

● Prefer transformations over actions

● Use collect() only on small data sets

RDD limitations

• Low level: a lot of key-value juggling

• Little room for optimizations by Spark (it cannot assume structure on the data - no schema)

• Good for unstructured data (text), but what if our data has structure (csv, json, table etc.)?

DataFrames

● A DataFrame is a distributed collection of data organized into named columns. Conceptually equivalent to a table in a relational database or a dataframe in R/Python Pandas.

● DataFrames can be constructed from a wide array of sources, such as structured data files, external databases, or existing RDDs.

DataFrames

• Collection of Row objects with schema

• Like RDDs, DataFrames are immutable

• Also distributed over machines in cluster

• Transformations and actions

• Lazy but schema is checked eagerly

• Spark makes use of schema information for query optimization

Operations on DataFrames

• Not arbitrary functions but given operations that are understood by Spark and can be optimized

• Like RDDs, transformations and actions

• Transformations like relational operators

• Also an SQL interface

DataFrame API

Relational operators, for example:

• select
• where
• join
• limit
• groupBy
• orderBy

SparkSQL

Can you forget about RDDs?

In practice RDDs are often used together with DataFrames:

• When we need to tweak schemas
• When we need to clean or wrangle data
• When we want more control
• When we deal with unstructured data
• When we want something just a bit different

Hands-on with Jupyter notebooks

● https://prace.jove.surfsara.nl
● Username/password: see handout

About the environment

● You will be working in a Jupyter notebook environment
● The notebooks run on hardware at SURFsara and are accessible via the browser
● Spark is not connected to a cluster, but runs in local mode
● Each of you has an environment with:
  ○ 2 cores
  ○ 6GB memory
● When running multiple notebooks simultaneously you can run into out-of-memory errors, so shut down each notebook before starting the next

Apache Spark - MLlib and advanced

Spark: What Runs Where?

• At first glance: Spark code and RDD variables look local

• Important to keep track of local variables and references to distributed data (variables of type RDD)

An Executing Application

PySpark & Py4J

https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals

Spark modes

● SparkContext: contains information about the cluster and is the linking pin between your code and the cluster.
● Local mode: single machine, using multiple cores. For testing and training purposes.
● Cluster mode:
  ○ Stand-alone: dedicated Spark cluster
  ○ Hadoop/YARN: cluster per application
  ○ Mesos: cluster per application, coarse- or fine-grained modes
  ○ Kubernetes: experimental

Spark on ‘classical’ HPC clusters (SGE/Slurm/PBS)

● Spark was not designed to run on a ‘classical’ HPC cluster
● Standard recipe:
  ○ create a multi-node job submission script
  ○ start Spark master on node 0
  ○ start Spark executors on other nodes
● Filesystem access/assumptions:
  ○ Access to shared file system from all executors (bulk R/W)
  ○ Fast local disk per executor for small file I/O
● Hard to get this secure!
● Helper scripts:
  ○ https://github.com/LLNL/magpie
  ○ https://github.com/glennklockwood/myhadoop

From High Performance Spark, Holden Karau and Rachel Warren

Machine Learning: MLlib

Why another one?

Spark MLlib

• Scale: many data sets/models become too big for single machine

• Spark is good at training models in a distributed fashion

• Not so good at predicting with very low latency (startup overhead of Spark jobs)

Machine learning

1. Data exploration

2. Data preprocessing

3. Model training

4. Model evaluation

5. Model inspection

MLlib: Spark’s Machine Learning library

● The Apache Spark core distribution has included a machine learning library called ‘MLlib’ since its inception
● MLlib was based on the RDD API
● Spark 1.2 introduced a new package called spark.ml
● spark.ml is a high-level interface based on DataFrames
● Since Spark 2.0 both are called MLlib
  ○ DataFrames API is the primary API
  ○ RDD API is in maintenance mode
  ○ RDD API expected to be deprecated in 2.3, removed in 3.0

MLlib data types

MLlib (RDD) uses some numerical data types backed by Breeze

● Local vector
  ○ Dense and sparse vectors of doubles
● Labeled point
  ○ Local vector + a label, used by supervised learning algorithms
● Local matrix
  ○ Dense and sparse matrices stored on a single machine
● Distributed matrix
  ○ Row, column indices with double values stored in one or more RDDs

MLlib

Common machine learning algorithms on top of Spark:

• classification: SVM, Naive Bayes, Random Forests

• regression: logistic regression, decision trees, isotonic regression

• clustering: K-means, PIC, LDA

• collaborative filtering: alternating least squares

• dimensionality reduction: SVD, PCA

Pipeline stages

The pipeline concept is the basis for spark.ml and is based on the same idea as in scikit-learn. There are three main components:

● Transformer: transforms a DataFrame to a new DataFrame
● Estimator: needs fitting on data to produce a model (which is a Transformer)
● Pipeline: chain of multiple Transformers and Estimators together

Pipeline

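    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    # Split the text column into words
    tok = Tokenizer().setInputCol("text").setOutputCol("words")

    # Hash the words into a fixed-size feature vector
    htf = HashingTF().setInputCol("words") \
                     .setOutputCol("features") \
                     .setNumFeatures(200)

    lr = LogisticRegression().setMaxIter(10) \
                             .setRegParam(0.3) \
                             .setElasticNetParam(0.8)

    # Fitting the pipeline fits every stage in order
    pipeline = Pipeline().setStages([tok, htf, lr])
    model = pipeline.fit(training_data)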

Pipeline

    [...]
    pipeline = Pipeline().setStages([tok, htf, lr])
    model = pipeline.fit(training_data)

    # The fitted pipeline model is itself a Transformer
    predictions = model.transform(test_data)

Model selection (hyperparameter tuning)

Model selection can be done using the CrossValidator and TrainValidationSplit tools. They use as input:

● Estimator or Pipeline: the algorithm to optimize
● Set of parameter maps: the parameter grid
● Evaluator: metric of the performance of a model

CrossValidator and TrainValidationSplit are Estimators themselves!

Model selection

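    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

    # Python takes a list of values here (Array(...) is the Scala syntax)
    paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()

    # 80% of the data is used for training, 20% for validation
    trainValidationSplit = (TrainValidationSplit()
        .setEstimator(pipeline)
        .setEvaluator(RegressionEvaluator())
        .setEstimatorParamMaps(paramGrid)
        .setTrainRatio(0.8))

    model = trainValidationSplit.fit(training_data)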
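A parameter grid is the set of all combinations of the values added for each parameter, and every combination is one model to fit and evaluate. The following plain-Python sketch models that idea with `itertools.product`; it is not Spark's implementation, `build_param_grid` is a made-up helper, and the values for `elasticNetParam` are invented for illustration.

```python
from itertools import product

def build_param_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

grid = {
    "regParam": [0.1, 0.01],
    "elasticNetParam": [0.0, 0.5, 1.0],  # hypothetical values
}
param_maps = build_param_grid(grid)

for param_map in param_maps:
    print(param_map)
print(len(param_maps))  # 6
```

The number of models grows multiplicatively with every parameter added to the grid, so large grids quickly become expensive to evaluate.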


Extending MLlib

● You can write your own Estimators and Transformers
● They need to implement the pipeline interfaces
● They can be used in Pipelines and mixed with existing ones


Additional I/O

● Reading/writing labeled data in LIBSVM format
  ○ Using MLUtils.loadLibSVMFile() and MLUtils.saveAsLibSVMFile()
● Models can be persisted after a job
  ○ Using model.save() / model.load() methods
  ○ Internal Spark-only format
● Some models can be exported in PMML format
  ○ KMeansModel, LassoModel, LinearRegressionModel, LogisticRegressionModel, RidgeRegressionModel, SVMModel, StreamingKMeansModel
  ○ Importing PMML is not supported

https://www.csie.ntu.edu.tw/~cjlin/libsvm/
https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language


Using MLlib workflow

1. Read the official website documentation: https://spark.apache.org/docs/2.1.1/ml-guide.html
2. Read the Python API docs: https://spark.apache.org/docs/2.1.1/api/python/index.html
3. Read the Scala API docs: https://spark.apache.org/docs/2.1.1/api/scala/index.html
4. Read the Scala source code: https://github.com/apache/spark/tree/master/mllib/src

Optional: consult Google, Stack Overflow, Spark JIRA

Make sure you read the documentation of your Spark version!


Alternatives to Spark MLlib

These libraries can use Spark as a backend and have their own API

● Sparkling Water (H2O) - https://www.h2o.ai/sparkling-water/
● DL4J - https://deeplearning4j.org/
● Apache Mahout - https://mahout.apache.org/


Hands-on with Jupyter notebooks

● https://prace.jove.surfsara.nl
● Username/password: see handout

Practical advice & summary

Scala (vs Python/Java)

To get the most out of Spark you should use Scala (or at least know a little)

● Scala performs better than Python
  ○ dynamic typing, JVM communication
● Scala API is nicer than the Java API
  ○ although this has improved with Java 8

But there are companies running PySpark in production, and with DataFrames the performance gap is smaller than with RDDs

Community packages

spark-packages.org is an index of third-party Spark packages

Examples:

● graphframes: DataFrame-based Graphs
● elasticsearch-hadoop: integration with ElasticSearch
● thunder: neural data analysis framework
● spark-nlp: Natural Language Processing for Spark

Currently ‘only’ 394 packages, quality varies

Serialization

Spark stores intermediate data in memory when needed/possible

There are three options:

● In-memory as deserialized Java objects
  ○ Fast, might be inefficient wrt space
● In-memory as serialized data (using Kryo)
  ○ More CPU-intensive, but memory-efficient
  ○ Not needed/possible for Python
● On-disk
  ○ When it doesn’t fit in memory, write to disk
  ○ Slow, but fault-tolerant

Serialization - manual control

● When using an RDD multiple times inside the same job, you want to control where/how this RDD is persisted
● Based on five attributes:
  ○ useDisk
  ○ useMemory
  ○ useOffHeap
  ○ deserialized
  ○ replication
● Controlled by calling rdd.persist(TYPE)
● When memory or disk are full, Spark will use a Least Recently Used (LRU) policy to delete partitions

Serialization - manual control

● Example: rdd.persist(DISK_ONLY_2)
  ○ useDisk = True
  ○ useMemory = False
  ○ useOffHeap = False
  ○ deserialized = False
  ○ replication = 2

Serialization - manual control

● Example: rdd.persist(MEMORY_ONLY_SER)
  ○ useDisk = False
  ○ useMemory = True
  ○ useOffHeap = False
  ○ deserialized = False (the data is kept in serialized form)
  ○ replication = 1

Serialization - when reading/writing

Reading data can be made more performant by writing it in a good format

● Compression codec that favors (de)compression speed over compression ratio
  ○ Because of this BZip2 is usually a bad choice
● Serialization format that stores the structure of the data

General advice:

● For RDDs, use Hadoop SequenceFile or ORCFile with LZO or Snappy compression
● For DataFrames, use the Parquet format

HDF5 / netCDF

● Official HDF5 Spark Connector (Beta) - https://www.hdfgroup.org/downloads/spark-connector
● Loading netCDF / HDF using SciSpark (NASA JPL) - https://scispark.jpl.nasa.gov/
● H5Spark - https://github.com/valiantljk/h5spark


MapPartitions

● RDD operations seen so far work on
  ○ single records (map, filter)
  ○ whole RDDs (join, union)
● MapPartitions works on a whole partition
● Allows sharing state between multiple records in the same partition
● Use this to share ‘expensive’ operations
  ○ creating a DB connection, initializing a Tokenizer, …
● Use this to do secondary sorting or custom aggregations (be careful)

MapPartitions

● Input: iterator over records in a single partition
● Output: iterator over transformed records of this partition
● The full partition might not fit in memory, so avoid creating a list/full buffering

    def tokenize(records):
        tokenizer = StringTokenizer()  # expensive to start, created once per partition
        for line in records:
            # yield keeps the output lazy: no full partition is buffered
            yield tokenizer.tokenize(line)

    tokenized_rdd = rdd.mapPartitions(tokenize)
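The saving can be made visible without Spark by modelling partitions as plain lists and counting how often the expensive object is created. In this sketch `CountingTokenizer` is a hypothetical stand-in for a tokenizer that is costly to initialise, and the two helper functions mimic what `mapPartitions` and `map` would call.

```python
class CountingTokenizer:
    """Hypothetical stand-in for a tokenizer that is expensive to create."""
    instances = 0

    def __init__(self):
        CountingTokenizer.instances += 1

    def tokenize(self, line):
        return line.split()

def tokenize_partition(records):
    tokenizer = CountingTokenizer()        # created once per partition
    for line in records:
        yield tokenizer.tokenize(line)

def tokenize_record(line):
    return CountingTokenizer().tokenize(line)   # created once per record

# Two "partitions" holding five records in total.
partitions = [["a b", "c d e"], ["f", "g h", "i j k"]]

CountingTokenizer.instances = 0
per_partition = [list(tokenize_partition(iter(p))) for p in partitions]
setups_per_partition = CountingTokenizer.instances

CountingTokenizer.instances = 0
per_record = [[tokenize_record(line) for line in p] for p in partitions]
setups_per_record = CountingTokenizer.instances

print(setups_per_partition, setups_per_record)  # 2 5
```

Both approaches produce the same tokens; only the number of initialisations differs, and that difference grows with the number of records per partition.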

MapPartitions

● mapPartitions has an optional argument preservesPartitioning (False by default)

● Set this to True iff the function works on a PairRDD and doesn’t modify the keys

MapPartitions

mapPartitions can be used to implement many other transformations such as map, flatMap and filter

    # f is the function to apply, p the predicate to test
    def do_map(records):
        for i in records:
            yield f(i)

    def do_filter(records):
        for i in records:
            if p(i):
                yield i

    def do_flatmap(records):
        for i in records:
            # f returns an iterable for each record; its items are yielded one by one
            for j in f(i):
                yield j

RDD - data sources

SparkContext methods to read different data formats

● textFile(path)
● wholeTextFiles(path)
● binaryFiles(path)
● binaryRecords(path, recordLength)
● newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass)
● sequenceFile(path)

Data can be on any Hadoop-supported filesystem (local, HDFS, S3) accessible to all executors

RDD - InputFormat example

WARC file format

● ‘Standard’ file format for web archives
  ○ Used by Internet Archive, Library of Congress, CommonCrawl
● WARC file: concatenation of WARC records (separated by two newlines)
● WARC record: header and content block
● Header contains information such as type, date, length

How to read this into Spark?

https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml
http://commoncrawl.org/2014/04/navigating-the-warc-file-format/


WARC file format

    WARC/1.0
    WARC-Type: response
    WARC-Date: 2013-12-04T16:47:32Z
    Content-Length: 73873
    Content-Type: application/http; msgtype=response
    WARC-IP-Address: 23.0.160.82
    WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
    WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
    WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU

    HTTP/1.0 200 OK
    Server: nginx
    Content-Type: text/html; charset=UTF-8
    Vary: Accept-Encoding
    Vary: Cookie
    X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
    Content-Encoding: gzip
    Date: Wed, 04 Dec 2013 16:47:32 GMT
    Content-Length: 18953
    Connection: close

    ...HTML Content...
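To make the header structure concrete, here is a minimal plain-Python sketch that splits the header of one record into a version line and named fields. The function `parse_warc_header` is made up for this illustration and only handles the simple "Name: value" lines shown above; for real data an existing WARC parsing library is the safer choice.

```python
def parse_warc_header(header_text):
    """Split a WARC record header into its version line and a dict of fields."""
    lines = header_text.strip().splitlines()
    version = lines[0].strip()
    fields = {}
    for line in lines[1:]:
        if not line.strip():
            break                          # a blank line ends the header
        # Split at the first colon only: values such as URLs contain colons too.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields

sample = """WARC/1.0
WARC-Type: response
WARC-Date: 2013-12-04T16:47:32Z
Content-Length: 73873
Content-Type: application/http; msgtype=response
WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/

HTTP/1.0 200 OK
"""

version, fields = parse_warc_header(sample)
print(version)                         # WARC/1.0
print(fields["WARC-Type"])             # response
print(int(fields["Content-Length"]))   # 73873
```

The Content-Length field tells a reader how many bytes of content block follow the header, which is how a parser knows where the next record starts.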

Reading WARC - attempt 1

● Read in whole files with sc.wholeTextFiles or sc.binaryFiles
● Use an existing WARC parsing library
● Use this library within flatMap or mapPartitions

    import warc
    from pyspark import SparkContext

    sc = SparkContext()
    warc_files = sc.binaryFiles("input/*.warc")
    warc_records = warc_files.flatMap(lambda f: [r for r in warc.read(f)])

Concerns:

● What if a single WARC file is 1GB in size? 10GB? 100GB?
● Can my WARC library read from a byte array?

Reading WARC - attempt 2

● Find or write a Hadoop InputFormat/RecordReader (in Java/Scala)
  ○ Using an existing WARC parsing library
● Read the data using sc.newAPIHadoopFile

    from pyspark import SparkContext

    sc = SparkContext()
    warc_records = sc.newAPIHadoopFile("input/*.warc",
                                       "nl.surfsara.warcutils.WarcInputFormat",
                                       "org.apache.hadoop.io.LongWritable",
                                       "org.apache.hadoop.io.Text")

https://github.com/sara-nl/warcutils

    public class WarcRecordReader extends RecordReader<LongWritable, WarcRecord> {
        private DataInputStream in;
        private long start;
        private long pos;
        private long end;
        private Seekable filePosition;

        private CompressionCodecFactory compressionCodecs = null;
        private CompressionCodec codec;
        private Decompressor decompressor;

        private LongWritable key = null;
        private WarcRecord value = null;
        private WarcReader warcReader;

        @Override
        public void initialize(InputSplit inputSplit, TaskAttemptContext context) throws IOException {
            FileSplit split = (FileSplit) inputSplit;
            Configuration conf = context.getConfiguration();
            final Path file = split.getPath();

            start = split.getStart();
            end = start + split.getLength();
            compressionCodecs = new CompressionCodecFactory(conf);
            codec = compressionCodecs.getCodec(file);

            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream fileIn = fs.open(split.getPath());

            if (isCompressedInput()) {
                in = new DataInputStream(codec.createInputStream(fileIn, decompressor));
                filePosition = fileIn;
            } else {
                fileIn.seek(start);
                in = fileIn;
                filePosition = fileIn;
            }

            warcReader = WarcReaderFactory.getReaderUncompressed(in);

            warcReader.setWarcTargetUriProfile(WarcIOConstants.URIPROFILE);
            warcReader.setBlockDigestEnabled(WarcIOConstants.BLOCKDIGESTENABLED);
            warcReader.setPayloadDigestEnabled(WarcIOConstants.PAYLOADDIGESTENABLED);
            warcReader.setRecordHeaderMaxSize(WarcIOConstants.HEADERMAXSIZE);
            warcReader.setPayloadHeaderMaxSize(WarcIOConstants.PAYLOADHEADERMAXSIZE);

            this.pos = start;
        }

        public boolean nextKeyValue() throws IOException {
            if (key == null) {
                key = new LongWritable();
            }
            pos = filePosition.getPos();
            key.set(pos);

            value = warcReader.getNextRecord();
            if (value == null) {
                return false;
            }
            return true;
        }

        @Override
        public LongWritable getCurrentKey() {
            return key;
        }

        @Override
        public WarcRecord getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() throws IOException {
            if (start == end) {
                return 0.0f;
            } else {
                return Math.min(1.0f, (getFilePosition() - start) / (float) (end - start));
            }
        }

        @Override
        public synchronized void close() throws IOException {
            try {
                if (in != null) {
                    in.close();
                }
            } finally {
                if (decompressor != null) {
                    CodecPool.returnDecompressor(decompressor);
                }
            }
        }

        [...]

Reading WARC - attempt 2

● Find or write a Hadoop InputFormat/RecordReader (in Java/Scala)
  ○ Using an existing WARC parsing library
● Read the data using sc.newAPIHadoopFile

Concerns:

● What if a single WARC record is 1GB in size? 10GB? 100GB?
  ○ Not suitable for any form of distributed computing?
● What if I don’t know Java/Scala?

Real example: unique IDs

● Problem: Algorithm expects records to have a unique integer ID for some field but your dataset has a unique string column (email, username, …)
● Solution(?): Use the monotonically_increasing_id function to add a new column to the DataFrame

    from pyspark.sql.functions import monotonically_increasing_id

    df = spark.read.csv("input/*")

    df_with_ids = df.withColumn("new_id", monotonically_increasing_id())

Real example: unique IDs

● Problem: monotonically_increasing_id generates 64-bit numbers, and the algorithm expects 32-bit numbers…
● Solution(?): We have fewer than 2^32 items, so just cast from Long to Int

    from pyspark.sql.functions import monotonically_increasing_id

    df = spark.read.csv("input/*")

    df_with_ids = df.withColumn("new_id", monotonically_increasing_id().cast("int"))

Real example: unique IDs

“The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.”

https://spark.apache.org/docs/2.1.1/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id

● Problem: By casting to Int we will have overlapping IDs!
● Solution: Use the MLlib StringIndexer instead
● Alternative: Convert to RDD and use zipWithIndex
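Why the cast produces overlapping IDs follows directly from the bit layout quoted above, and can be checked with plain integer arithmetic. This sketch does not call Spark: `monotonic_id` rebuilds the documented layout, and `cast_long_to_int` assumes the cast keeps only the low 32 bits, interpreted as a signed number (Java-style narrowing).

```python
def monotonic_id(partition_id, record_number):
    # Documented layout: partition ID in the upper 31 bits,
    # record number within the partition in the lower 33 bits.
    return (partition_id << 33) | record_number

def cast_long_to_int(value):
    # Assumption: the cast keeps the low 32 bits as a signed 32-bit integer.
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low

# Three partitions with two records each.
ids = [monotonic_id(p, r) for p in range(3) for r in range(2)]
cast_ids = [cast_long_to_int(i) for i in ids]

print(ids)       # [0, 1, 8589934592, 8589934593, 17179869184, 17179869185]
print(cast_ids)  # [0, 1, 0, 1, 0, 1]
```

The partition ID sits entirely above bit 32, so the cast throws it away and every partition starts counting from 0 again.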

Real example: unique IDs

“StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.”

https://spark.apache.org/docs/latest/ml-features.html#stringindexer

    from pyspark.ml.feature import StringIndexer

    df = spark.read.csv("input/*")
    string_indexer = StringIndexer(inputCol="id", outputCol="new_id", handleInvalid='error')
    model = string_indexer.fit(df)
    df_with_ids = model.transform(df)

Real example: unique IDs

“The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.”

https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex

    df = spark.read.csv("input/*")
    rdd_with_ids = df.rdd.zipWithIndex().map(...)
    df_with_ids = rdd_with_ids.toDF(schema)
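The ordering described in the quote can be modelled in plain Python with partitions represented as lists. This is a single-machine sketch of the semantics only; `zip_with_index` is a made-up function and the usernames are invented sample data.

```python
def zip_with_index(partitions):
    """Pair every item with a consecutive index, partition by partition."""
    index = 0
    result = []
    for partition in partitions:
        indexed = []
        for item in partition:
            indexed.append((item, index))
            index += 1
        result.append(indexed)
    return result

partitions = [["alice", "bob"], ["carol"], ["dave", "erin"]]
indexed = zip_with_index(partitions)
print(indexed)
# [[('alice', 0), ('bob', 1)], [('carol', 2)], [('dave', 3), ('erin', 4)]]
```

Unlike the IDs from monotonically_increasing_id, these indices are consecutive, so they fit in 32 bits as long as there are fewer than 2^31 records.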

Real example: ALS & RegressionEvaluator

● After building an ALS model we try to use the standard RegressionEvaluator to calculate the RMSE when predicting the validation set
● RegressionEvaluator seems to always return ‘nan’
● Searching the Internet reveals:

“When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.

The reason is in CrossValidator.scala line 109. The K-folds are randomly generated. For large and sparse datasets, there is a significant probability that at least one user of the validation set is missing in the training set, hence generating a few NaN estimation with transform method and NaN RegressionEvaluator's metrics too.”

https://issues.apache.org/jira/browse/SPARK-14489

● Reported 08-04-2016
● Fixed 28-02-2017 in version 2.2.0
● But we are running version 2.1.1 :(

Real example: ALS & RegressionEvaluator

● In version 2.2.0 ALS has an extra parameter coldStartStrategy which can be nan (old behaviour) or drop (drop all rows with NaN predictions)
● Workaround for version 2.1.1: drop NaN rows manually between transform and evaluate, or subclass the model or estimator

    # Option 1: drop the NaN rows before evaluating
    predictions = model.transform(validate).dropna()
    evaluator = RegressionEvaluator()
    rmse = evaluator.evaluate(predictions)

    # Option 2: an evaluator subclass that drops the NaN rows itself
    class MyEvaluator(RegressionEvaluator):
        def evaluate(self, df, params=None):
            df = df.dropna()
            return super().evaluate(df, params)

    predictions = model.transform(validate)
    evaluator = MyEvaluator()
    rmse = evaluator.evaluate(predictions)

Narrow transformations

Each child partition depends on a known subset of parent partitions

Wide transformations

● Wide transformations are the most expensive and should be avoided or optimized
● Wide transformations are caused by operations such as groupByKey, reduceByKey, sort and join
● Examples for optimizing:
  ○ Filter first
  ○ Use reduceByKey instead of groupByKey + map
● Join is one of the most expensive operations in Spark
  ○ Use distinct to prevent data explosion
  ○ Use cogroup instead of join
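The advice to use reduceByKey instead of groupByKey + map comes down to how many records have to be moved between machines. The plain-Python model below counts them for a small word count; it is a sketch of the idea only, with partitions represented as lists and all function names made up. It assumes, as Spark does for reduceByKey, that values are first combined within each partition before the shuffle.

```python
from operator import add

partitions = [
    [("the", 1), ("cat", 1), ("the", 1)],
    [("the", 1), ("sat", 1), ("sat", 1), ("the", 1)],
]

def shuffle_group_by_key(partitions):
    # groupByKey: every single record is sent over the network.
    return [pair for part in partitions for pair in part]

def shuffle_reduce_by_key(partitions, f):
    # reduceByKey: combine per key inside each partition first,
    # then send only one record per key per partition.
    shuffled = []
    for part in partitions:
        local = {}
        for k, v in part:
            local[k] = f(local[k], v) if k in local else v
        shuffled.extend(local.items())
    return shuffled

def final_reduce(shuffled, f):
    # Combine the shuffled records into the final result per key.
    totals = {}
    for k, v in shuffled:
        totals[k] = f(totals[k], v) if k in totals else v
    return totals

grouped = shuffle_group_by_key(partitions)
combined = shuffle_reduce_by_key(partitions, add)

print(len(grouped), len(combined))   # 7 4
print(final_reduce(grouped, add))    # {'the': 4, 'cat': 1, 'sat': 2}
print(final_reduce(combined, add))   # {'the': 4, 'cat': 1, 'sat': 2}
```

The result is identical, but the amount of shuffled data depends on the number of distinct keys per partition rather than on the number of records.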

Joins with DataFrames

● Less control with DataFrames
● Execution plan is determined by Catalyst
● Cannot change the Partitioner
● The advice on preventing joins with non-unique keys still holds!
