Machine Learning with Apache Spark
Mathijs Kattenberg, Jeroen Schot
PTC workshop, 2018-02-13
About us
Mathijs Kattenberg
Technical consultant at SURFsara since 2013
● Working with Big Data technologies (Hadoop, Spark, Kafka)
Before:
● Scientific programmer at VU Amsterdam
● MSc Artificial Intelligence at VU Amsterdam
Jeroen Schot
Technical consultant at SURFsara since 2012
● Working with Big Data technologies (Hadoop, Spark, Kafka)
Before:
● MSc Physics at Utrecht University
Program for today
09:00 - 09:15 Welcome & introduction
09:15 - 10:30 Apache Spark core and structured APIs
10:30 - 10:45 Coffee break
10:45 - 12:00 Hands-on Jupyter notebooks
12:00 - 13:00 Lunch
13:00 - 14:30 Apache Spark MLlib
14:30 - 14:45 Coffee break
14:45 - 16:15 Hands-on Jupyter notebooks
16:15 - 16:30 Coffee break
16:30 - 17:00 Practical advice, summary
Apache Spark core and structured APIs
● Differences with traditional HPC approaches
● Distributed data processing
● Resilient Distributed Datasets (RDDs)
● DataFrames (DFs)
“Traditional” (scientific) software applications
Application developed as:
• Stand-alone binary application
• Assumes a specific environment (e.g. Linux OS, CLI)
• Operates on input files and parameters
• Produces output files
• Researcher specifies input files and params via CLI
Scaling “traditional” applications
Now the one running the application needs to:
• Distribute and split data
• Handle faults and errors inherent with scale
• Submit and track applications
An example
Consider a tweet from which we want to extract:
• Names of persons
• Names of organisations
• Locations and places
"I will be watching the election results from Trump Tower in Manhattan with my family and friends. Very exciting!"
A straightforward implementation
• Store tweets on disk
• Small Python program uses NLTK and Stanford NER to tag
• Write output back to disk
Scaling Bottlenecks
• Store tweets on disk: it will eventually fill, many readers
• Small Python program: it can process one tweet every few milliseconds to seconds, so we need to run separate processes
• Write output back to disk: it will eventually fill, many writers
• Run separate processes: they all need input
Scalability: Design
• Data is growing faster than computing power and IO
=> distributed computing necessary
• Most standard applications cannot run in a distributed fashion
=> applications need to be designed with scalability from the start
Machine Limits
Current system limits:
• ~256 CPU cores
• 2TB of RAM
• ~500 TB disk space
= expensive (cost does not scale linearly)
Scale out instead of scale up!
Parallel programming is hard
Scalability: Design
Idea: take a step back and consider:
• Work without mutable state
• Restrict the programming interface so that more can be done automatically.
Turns out: we can use ideas from functional programming and declarative languages
Scalable programs
Consider: declarative vs. imperative
Functional Programming
Restrict the programming interface so that the system can do more automatically. Use ideas from functional programming:
“Here is a function, apply it to all of the data”
• I don't care where it runs (the system should handle that)
• Feel free to run it twice on different nodes (no side effects!)
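Map
[Figure: the words "I", "like", "traffic", "lights" mapped to their lengths 1, 4, 7, 6]
Map takes as input a function and a list:
map(len, ['I', 'like', 'traffic', 'lights'])
which in Python 2 returns [1, 4, 7, 6]; in Python 3 map is lazy, so wrap the call in list(...) to get the same list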
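Reduce
[Figure: the list 47, 11, 42, 13 reduced with add: 47 + 11 = 58, 58 + 42 = 100, 100 + 13 = 113]
Reduce takes as input a binary function and a list
A binary function is a function with two arguments, like add, subtract, multiply, etc.
def add(x, y):
    return x + y
reduce(add, [47, 11, 42, 13])
returns 113 (in Python 3, import reduce from functools first)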
MapReduce programming model
Input: set of input key/value pairs, Map function and Reduce function
Output: set of output key/value pairs
Map function is applied to every input pair to produce an intermediate key/value pair
All intermediate pairs are grouped by key
Reduce function is applied to every key and set of values for that key
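The three phases above (map, group by key, reduce) can be sketched in plain Python on a single machine. This is only an illustration of the programming model, not of how Hadoop implements it; the helper names map_fn, reduce_fn and map_reduce are made up for this sketch.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Reduce: combine all values that share a key into one output pair."""
    return (word, sum(counts))


def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input key/value pair.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: group all intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: apply reduce_fn to every key and its list of values.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]


# Input pairs: (line number, line of text), a hypothetical input layout.
inputs = [(0, "the cat sat on the mat"), (1, "the aardvark sat on the sofa")]
word_counts = map_reduce(inputs, map_fn, reduce_fn)
print(word_counts)
```

In a real framework the map calls and the reduce calls run on different machines, and the grouping step is the network shuffle.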
Hadoop MapReduce
MapReduce strengths
MapReduce framework handles a lot of work for its end-user:
● Splitting work in independent tasks
● Task scheduling, retrying on failure
● Data grouping/shuffling, in-memory/spilling to disk
MapReduce limitations
● Very low level: decomposing problems into (multiple) MapReduce jobs is hard
● Batch-oriented: unsuited for interactive use or realtime processing
● Disk sync: performance issues when chaining jobs (iterative algorithms)
Higher Level Frameworks
= SQL on Hadoop
= Pig - dataflow DSL
= Dataflow API in Java
= Graph processing
All translated to MR jobs
Apache Spark: a general framework
• Spark can be seen as a successor of Hadoop MapReduce and is a simplified framework for writing large-scale data-intensive applications
• Write programs in terms of distributed datasets and operations on them
• Accessible from multiple programming languages:
Scala
Java
Python
R (only via DataFrames)
Spark components
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Resilient Distributed Dataset (RDD)
• Abstraction for a collection of objects/elements/records
• Spread over many machines
• Built through parallel transformations
• Immutable
Creation of RDD
• Transforming an existing RDD
• Through SparkContext:
- From internal data structure
- From reading in file (HDFS or otherwise)
text = "This is a sample text."
textRDD = sc.parallelize(text)
lines = sc.textFile('../data/links.tsv')
Operations on RDDs
Transformations:
• Create new RDD
• Lazily computed
• Example: ‘map’, ‘filter’
Actions:
• Return some value or side-effect
• Triggers computation
• Example: ‘count’, ‘saveAsTextFile’
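The split between lazy transformations and computation-triggering actions can be mimicked with Python generators. This is an analogy only (no Spark involved, and Spark's scheduler works differently); the names square and calls are made up for the sketch.

```python
calls = []


def square(x):
    calls.append(x)  # record that the function really ran
    return x * x


data = [1, 2, 3, 4]

# "Transformations": generator expressions only describe the work.
squared = (square(x) for x in data)
even_squares = (y for y in squared if y % 2 == 0)
calls_before_action = len(calls)  # nothing has been computed yet

# "Action": consuming the result triggers the whole chain.
result = list(even_squares)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

As with an RDD, defining the chain is cheap; the cost is paid only when a result is requested.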
Transformations
RDDs can be created from other RDDs using transformations:
• map(f) Apply function f to each element of the RDD
• flatMap(f) Apply function f to each element of the RDD and unpack lists etc.
• filter(pred) Apply predicate pred to each element of the RDD and keep those that pass pred
• distinct() Remove duplicate entries in RDD
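What these four transformations do to the elements can be shown on an ordinary Python list. This is a local sketch of the semantics, not PySpark code; the sample lines are made up.

```python
from itertools import chain

lines = ["the cat sat", "on the mat"]

# map: one output element per input element (here: a list of words per line).
mapped = list(map(str.split, lines))

# flatMap: apply the function, then unpack the resulting lists into one flat sequence.
flat_mapped = list(chain.from_iterable(map(str.split, lines)))

# filter: keep only the elements for which the predicate holds.
filtered = [word for word in flat_mapped if len(word) > 2]

# distinct: remove duplicates (sorted here only to make the result reproducible;
# Spark gives no ordering guarantee for distinct()).
distinct = sorted(set(flat_mapped))

print(mapped)
print(flat_mapped)
print(filtered)
print(distinct)
```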
Actions
• collect() Returns all elements of the RDD in a list
• count() Returns the number of elements in the RDD
• take(n) Returns the first n elements of the RDD
• reduce(f) Returns the combined result of f on the RDD.
map vs flatmap
• Different from Reduce as in MapReduce
• Aggregates all elements to a single value
Example reduce
Function with two arguments
reduceByKey
x is not the key but the accumulated value!
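The remark about x can be checked with a small local imitation of reduceByKey. The helper reduce_by_key is made up for this sketch; it logs the arguments the binary function receives so you can see that both are values (accumulated and next), never keys.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, f):
    """Local stand-in for reduceByKey: group values by key, then fold each group with f."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(f, values) for key, values in groups.items()}


seen = []


def add(x, y):
    seen.append((x, y))  # x is the accumulated value so far, y the next value
    return x + y


pairs = [("a", 1), ("b", 5), ("a", 3), ("a", 10)]
totals = reduce_by_key(pairs, add)
print(totals)
print(seen)
```

Key "b" has a single value, so the function is never called for it. In Spark the order in which values are combined is not fixed, which is why the function should be associative and commutative.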
Pseudo set operations
Pair RDDs
• The elements of a Pair RDD are pairs (k,v)
• k is interpreted as the key, v as the value
• Very much like Hadoop’s MapReduce
• Pair RDD have extra methods
Pair RDD transformations
• groupByKey() Returns a RDD with elements (key, valuelist)
• reduceByKey(f(x,y)) Applies f to all values of each key (similar to Hadoop MapReduce)
• join(RDD) Joins two RDDs on their keys
• mapValues(f) Apply f to the values, not the keys of the RDD
Actions on pair RDDs
Word Count
Input
the cat sat on the mat
the aardvark sat on the sofa
Output
aardvark 1
cat 1
mat 1
on 2
sat 2
sofa 1
the 4
lines = sc.textFile(file)
words = lines.flatMap(lambda s: s.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda x, y: x + y)
Actions vs. transformations
● Try to do as much as possible on executors
● Prefer transformations over actions
● Use collect() only on small data sets
RDD limitations
• Low level: a lot of key-value juggling
• Little room for optimizations by Spark (it cannot assume structure on the data - no schema)
• Good for unstructured data (text), but what if our data has structure (csv, json, table etc.)?
DataFrames
● A DataFrame is a distributed collection of data organized into named columns. Conceptually equivalent to a table in a relational database or a dataframe in R/Python Pandas.
● DataFrames can be constructed from a wide array of sources, such as structured data files, external databases, or existing RDDs.
DataFrames
• Collection of Row objects with schema
• Like RDDs, DataFrames are immutable
• Also distributed over machines in cluster
• Transformations and actions
• Lazy but schema is checked eagerly
• Spark makes use of schema information for query optimization
Operations on DataFrames
• Not arbitrary functions but given operations that are understood by Spark and can be optimized
• Like RDDs, transformations and actions
• Transformations like relational operators
• Also an SQL interface
DataFrame API
Relational operators, for example:
select, where, join, limit, groupBy, orderBy
SparkSQL
Can you forget about RDDs?
In practice RDDs are used quite often together with DataFrames:
• When we need to tweak schemas
• When we need to clean or wrangle data
• When we want more control
• When we deal with unstructured data
• When we want something just a bit different
Hands-on with Jupyter notebooks
● https://prace.jove.surfsara.nl
● Username/password: see handout
About the environment
● You will be working in a Jupyter notebook environment
● The notebooks run on hardware at SURFsara and are accessible via the browser
● Spark is not connected to a cluster, but runs in local mode
● Each of you has an environment with:
○ 2 cores
○ 6GB memory
● Running multiple notebooks simultaneously can cause out-of-memory errors, so shut down each notebook before starting the next
Apache Spark - MLlib and advanced
Spark: What Runs Where?
• At first glance: Spark code and RDD variables look local
• Important to keep track of local variables and references to distributed data (variables of type RDD)
An Executing Application
PySpark & Py4J
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
Spark modes
● SparkContext: contains information about the cluster and is the link between your code and the cluster.
● Local mode: single machine, using multiple cores. For testing and training purposes.
● Cluster mode:
○ Stand-alone: dedicated Spark cluster
○ Hadoop/YARN: cluster per application
○ Mesos: cluster per application, coarse- or fine-grained modes
○ Kubernetes: experimental
Spark on ‘classical’ HPC clusters (SGE/Slurm/PBS)
● Spark was not designed to run on ‘classical’ HPC clusters
● Standard recipe:
○ create a multi-node job submission script
○ start Spark master on node 0
○ start Spark executors on other nodes
● Filesystem access/assumptions:
○ Access to shared file system from all executors (bulk R/W)
○ Per-executor fast local disk for small file I/O
● Hard to get this secure!
● Helper scripts:
○ https://github.com/LLNL/magpie
○ https://github.com/glennklockwood/myhadoop
From High Performance Spark, Holden Karau and Rachel Warren
Machine Learning: MLlib
Why another one?
Spark MLlib
• Scale: many data sets/models become too big for single machine
• Spark is good at training models in a distributed fashion
• Not so good in predicting with very low latency (overhead for startup Spark jobs)
Machine learning
1. Data exploration
2. Data preprocessing
3. Model training
4. Model evaluation
5. Model inspection
MLlib: Spark’s Machine Learning library
● The Apache Spark core distribution has included a machine learning library called ‘MLlib’ since its inception
● MLlib was based on the RDD API
● Spark 1.2 introduced a new package called spark.ml
● spark.ml is a high-level interface based on DataFrames
● Since Spark 2.0 both are called MLlib
○ DataFrames API is the primary API
○ RDD API is in maintenance mode
○ RDD API expected to be deprecated in 2.3, removed in 3.0
MLlib data types
MLlib (RDD) uses some numerical data types backed by Breeze
● Local vector
○ Dense and sparse vectors of doubles
● Labeled point
○ Local vector + a label, used by supervised learning algorithms
● Local matrix
○ Dense and sparse matrices stored on a single machine
● Distributed matrix
○ Row, column indices with double values stored in one or more RDDs
MLlib
Common machine learning algorithms on top of Spark:
• classification: SVM, Naive Bayes, Random Forests
• regression: logistic regression, decision trees, isotonic regression
• clustering: K-means, PIC, LDA
• collaborative filtering: alternating least squares
• dimensionality reduction: SVD, PCA
Pipeline stages
The pipeline concept is the basis for spark.ml and based on the same idea in scikit-learn. There are three main components:
● Transformer: transforms a DataFrame to a new DataFrame
● Estimator: needs fitting on data to produce a model (which is a Transformer)
● Pipeline: chain of multiple Transformers and Estimators together
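How the three components relate can be sketched with toy classes in plain Python. This is not the spark.ml API: all class names below are invented, and rows are plain dicts standing in for a DataFrame. The sketch only shows the contract: a Transformer has transform(), an Estimator has fit() returning a Transformer, and a Pipeline fits its stages in order.

```python
class Lowercase:
    """Toy Transformer: rows in, rows out, nothing to learn."""

    def transform(self, rows):
        return [dict(row, text=row["text"].lower()) for row in rows]


class MeanLengthEstimator:
    """Toy Estimator: fit() learns a threshold and returns a fitted model."""

    def fit(self, rows):
        mean_length = sum(len(row["text"]) for row in rows) / len(rows)
        return MeanLengthModel(mean_length)


class MeanLengthModel:
    """Fitted model, which is itself a Transformer."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(row, long=len(row["text"]) > self.threshold) for row in rows]


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator becomes a fitted Transformer
            rows = stage.transform(rows)  # feed transformed rows to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


training = [{"text": "Hi"}, {"text": "Hello Spark"}]
model = ToyPipeline([Lowercase(), MeanLengthEstimator()]).fit(training)
predictions = model.transform([{"text": "A much longer sentence"}, {"text": "ok"}])
print(predictions)
```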
Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tok = Tokenizer().setInputCol("text").setOutputCol("words")
htf = HashingTF().setInputCol("words") \
    .setOutputCol("features") \
    .setNumFeatures(200)
lr = LogisticRegression().setMaxIter(10) \
    .setRegParam(0.3) \
    .setElasticNetParam(0.8)
pipeline = Pipeline().setStages([tok, htf, lr])
model = pipeline.fit(training_data)
Pipeline
[...]
pipeline = Pipeline().setStages([tok, htf, lr])
model = pipeline.fit(training_data)
predictions = model.transform(test_data)
Model selection (hyperparameter tuning)
Model selection can be done using the CrossValidator and TrainValidationSplit tools. They use as input:
● Estimator or Pipeline: the algorithm to optimize
● Set of parameter maps: the parameter grid
● Evaluator: metric of the performance of a model
CrossValidator and TrainValidationSplit are Estimators themselves!
Model selection
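from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Python list instead of the Scala Array(...) shown on the original slide
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
trainValidationSplit = TrainValidationSplit() \
    .setEstimator(pipeline) \
    .setEvaluator(RegressionEvaluator()) \
    .setEstimatorParamMaps(paramGrid) \
    .setTrainRatio(0.8)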
model = trainValidationSplit.fit(training_data)
predictions = model.transform(test_data)
Extending MLlib
● You can write your own Estimators and Transformers
● They need to implement the pipeline interfaces
● They can be used in Pipelines and mixed with existing ones
Model selection as Pipeline
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
trainValidationSplit = TrainValidationSplit() \
    .setEstimator(pipeline) \
    .setEvaluator(RegressionEvaluator()) \
    .setEstimatorParamMaps(paramGrid) \
    .setTrainRatio(0.8)
model = trainValidationSplit.fit(training_data)
predictions = model.transform(test_data)
Additional I/O
● Reading/writing labeled data in LIBSVM format
○ Using MLUtils.loadLibSVMFile() and MLUtils.saveAsLibSVMFile()
● Models can be persisted after a job
○ Using model.save() / model.load() methods
○ Internal Spark-only format
● Some models can be exported in PMML format
○ KMeansModel, LassoModel, LinearRegressionModel, LogisticRegressionModel, RidgeRegressionModel, SVMModel, StreamingKMeansModel
○ Importing PMML is not supported
https://www.csie.ntu.edu.tw/~cjlin/libsvm/
https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
Using MLlib workflow
1. Read the official website documentation: https://spark.apache.org/docs/2.1.1/ml-guide.html
2. Read the Python API docs: https://spark.apache.org/docs/2.1.1/api/python/index.html
3. Read the Scala API docs: https://spark.apache.org/docs/2.1.1/api/scala/index.html
4. Read the Scala source code: https://github.com/apache/spark/tree/master/mllib/src
Optional: consult Google, Stack Overflow, Spark JIRA
Make sure you read the documentation of your Spark version!
Alternatives to Spark MLlib
These libraries can use Spark as a backend and have their own API
● Sparkling Water (H2O) - https://www.h2o.ai/sparkling-water/
● DL4J - https://deeplearning4j.org/
● Apache Mahout - https://mahout.apache.org/
Hands-on with Jupyter notebooks
● https://prace.jove.surfsara.nl
● Username/password: see handout
Practical advice & summary
Scala (vs Python/Java)
To get the most out of Spark you should use Scala (or at least know a little)
● Scala performs better than Python
○ dynamic typing, JVM communication
● Scala API is nicer than the Java API
○ although this has improved with Java 8
But there are companies running PySpark in production, and with DataFrames the performance gap is smaller than with RDDs
Community packages
spark-packages.org is an index of third-party Spark packages
Examples:
● graphframes: DataFrame-based Graphs
● elasticsearch-hadoop: integration with ElasticSearch
● thunder: neural data analysis framework
● spark-nlp: Natural Language Processing for Spark
Currently ‘only’ 394 packages, quality varies
Serialization
Spark stores intermediate data in memory when needed/possible
There are three options:
● In-memory as deserialized Java objects
○ Fast, might be inefficient wrt space
● In-memory as serialized data (using Kryo)
○ More CPU-intensive, but memory-efficient
○ Not needed/possible for Python
● On-disk
○ When it doesn’t fit in memory, write to disk
○ Slow, but fault-tolerant
Serialization - manual control
● When using an RDD multiple times inside the same job, you want to control where/how this RDD is persisted
● Based on five attributes:
○ useDisk
○ useMemory
○ useOffHeap
○ deserialized
○ replication
● Controlled by calling rdd.persist(TYPE)
● When memory or disk is full, Spark will use a Least Recently Used (LRU) policy to delete partitions
Serialization - manual control
● Example: rdd.persist(DISK_ONLY_2)
○ useDisk = True
○ useMemory = False
○ useOffHeap = False
○ deserialized = False
○ replication = 2
Serialization - manual control
● Example: rdd.persist(MEMORY_ONLY_SER)
○ useDisk = False
○ useMemory = True
○ useOffHeap = False
○ deserialized = False
○ replication = 1
Serialization - when reading/writing
Reading data can be made more performant by writing it in a good format
● Compression codec that favors (de)compression speed over compression ratio
○ Because of this BZip2 is usually a bad choice
● Serialization format that stores the structure of the data
General advice:
● For RDDs, use Hadoop SequenceFile or ORCFile with LZO or Snappy compression
● For DataFrames, use the Parquet format
HDF5 / netCDF
● Official HDF5 Spark Connector (Beta) - https://www.hdfgroup.org/downloads/spark-connector
● Loading netCDF / HDF using SciSpark (NASA JPL) - https://scispark.jpl.nasa.gov/
● H5Spark - https://github.com/valiantljk/h5spark
MapPartitions
● RDD operations seen so far work on
○ single records (map, filter)
○ whole RDDs (join, union)
● MapPartitions works on a whole partition
● Allows you to ‘share state’ between multiple records in the same partition
● Use this to share ‘expensive’ operations
○ creating a DB connection, initializing a Tokenizer, …
● Use this to do secondary sorting or custom aggregations (be careful)
MapPartitions
● Input: iterator over records in a single partition
● Output: iterator over transformed records of this partition
● The full partition might not fit in memory, so avoid creating a list/full buffering
def tokenize(iter):
    tokenizer = StringTokenizer()  # expensive to start
    for line in iter:
        yield tokenizer.tokenize(line)
tokenized_rdd = rdd.mapPartitions(tokenize)
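The benefit of the per-partition setup can be made visible without a cluster. The sketch below imitates mapPartitions over a list of lists; ExpensiveTokenizer and map_partitions are made-up stand-ins, and the instance counter shows that the costly object is created once per partition rather than once per record.

```python
class ExpensiveTokenizer:
    """Stand-in for an object that is costly to construct."""

    instances = 0

    def __init__(self):
        ExpensiveTokenizer.instances += 1

    def tokenize(self, line):
        return line.split()


def tokenize_partition(records):
    tokenizer = ExpensiveTokenizer()  # created once for the whole partition
    for line in records:
        yield tokenizer.tokenize(line)  # yield keeps memory use per record


def map_partitions(partitions, f):
    """Local imitation: call f once per partition with an iterator over its records."""
    # list(...) is used here only to display the result; Spark consumes the iterator lazily.
    return [list(f(iter(partition))) for partition in partitions]


partitions = [["a b", "c d e"], ["f"], ["g h", "i", "j k l"]]
tokenized = map_partitions(partitions, tokenize_partition)
print(tokenized)
print(ExpensiveTokenizer.instances)
```

Six records are processed with only three tokenizer constructions, one per partition.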
MapPartitions
● mapPartitions has an optional argument preservesPartitioning (False by default)
● Set this to True iff the function works on a PairRDD and doesn’t modify the keys
MapPartitions
mapPartitions can be used to implement many other transformations such as map, flatMap and filter
def do_map(iter):
    for i in iter:
        yield f(i)
def do_filter(iter):
    for i in iter:
        if p(i):
            yield i
def do_flatmap(iter):
    for i in iter:
        for j in f(i):  # f returns an iterable; emit its elements one by one
            yield j
RDD - data sources
SparkContext methods to read different data formats
● textFile(path)
● wholeTextFiles(path)
● binaryFiles(path)
● binaryRecords(path, recordLength)
● newAPIHadoopFile(path, inputFormatClass, keyClass, valueClass)
● sequenceFile(path)
Data can be on any Hadoop-supported filesystem (local, HDFS, S3) accessible to all executors
RDD - InputFormat example WARC file format
● ‘Standard’ file format for web archives
○ Used by Internet Archive, Library of Congress, CommonCrawl
● WARC file: concatenation of WARC records (separated by two newlines)
● WARC record: header and content block
● Header contains information such as type, date, length
How to read this into Spark?
https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml
http://commoncrawl.org/2014/04/navigating-the-warc-file-format/
WARC file format
WARC/1.0
WARC-Type: response
WARC-Date: 2013-12-04T16:47:32Z
Content-Length: 73873
Content-Type: application/http; msgtype=response
WARC-IP-Address: 23.0.160.82
WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU

HTTP/1.0 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Content-Encoding: gzip
Date: Wed, 04 Dec 2013 16:47:32 GMT
Content-Length: 18953
Connection: close

...HTML Content...
Reading WARC - attempt 1
● Read in whole files with sc.wholeTextFiles or sc.binaryFiles
● Use an existing WARC parsing library
● Use this library within flatMap or mapPartitions
import warc
from pyspark import SparkContext

sc = SparkContext()
warc_files = sc.binaryFiles("input/*.warc")
# warc.read(f) as on the slide; check your WARC library for the actual parsing call
warc_records = warc_files.flatMap(lambda f: [r for r in warc.read(f)])
Concerns:
● What if a single WARC file is 1GB in size? 10GB? 100GB?
● (Can my WARC library read from a byte array?)
Reading WARC - attempt 2
● Find or write a Hadoop InputFormat/RecordReader (in Java/Scala)
○ Using an existing WARC parsing library
● Read the data using sc.newAPIHadoopFile
from pyspark import SparkContext
sc = SparkContext()
warc_records = sc.newAPIHadoopFile("input/*.warc",
    "nl.surfsara.warcutils.WarcInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
public class WarcRecordReader extends RecordReader<LongWritable, WarcRecord> {
    private DataInputStream in;
    private long start;
    private long pos;
    private long end;
    private Seekable filePosition;

    private CompressionCodecFactory compressionCodecs = null;
    private CompressionCodec codec;
    private Decompressor decompressor;

    private LongWritable key = null;
    private WarcRecord value = null;
    private WarcReader warcReader;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit) inputSplit;
        Configuration conf = context.getConfiguration();
        final Path file = split.getPath();

        start = split.getStart();
        end = start + split.getLength();
        compressionCodecs = new CompressionCodecFactory(conf);
        codec = compressionCodecs.getCodec(file);

        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream fileIn = fs.open(split.getPath());

        if (isCompressedInput()) {
            in = new DataInputStream(codec.createInputStream(fileIn, decompressor));
            filePosition = fileIn;
        } else {
            fileIn.seek(start);
            in = fileIn;
            filePosition = fileIn;
        }

        warcReader = WarcReaderFactory.getReaderUncompressed(in);

        warcReader.setWarcTargetUriProfile(WarcIOConstants.URIPROFILE);
        warcReader.setBlockDigestEnabled(WarcIOConstants.BLOCKDIGESTENABLED);
        warcReader.setPayloadDigestEnabled(WarcIOConstants.PAYLOADDIGESTENABLED);
        warcReader.setRecordHeaderMaxSize(WarcIOConstants.HEADERMAXSIZE);
        warcReader.setPayloadHeaderMaxSize(WarcIOConstants.PAYLOADHEADERMAXSIZE);

        this.pos = start;
    }
https://github.com/sara-nl/warcutils
    public boolean nextKeyValue() throws IOException {
        if (key == null) {
            key = new LongWritable();
        }
        pos = filePosition.getPos();
        key.set(pos);

        value = warcReader.getNextRecord();
        if (value == null) {
            return false;
        }
        return true;
    }

    @Override
    public LongWritable getCurrentKey() {
        return key;
    }

    @Override
    public WarcRecord getCurrentValue() {
        return value;
    }

    @Override
    public float getProgress() throws IOException {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (getFilePosition() - start) / (float) (end - start));
        }
    }

    @Override
    public synchronized void close() throws IOException {
        try {
            if (in != null) {
                in.close();
            }
        } finally {
            if (decompressor != null) {
                CodecPool.returnDecompressor(decompressor);
            }
        }
    }
    [...]
Reading WARC - attempt 2
● Find or write a Hadoop InputFormat/RecordReader (in Java/Scala)
○ Using an existing WARC parsing library
● Read the data using sc.newAPIHadoopFile
Concerns:
● What if a single WARC record is 1GB in size? 10GB? 100GB?
○ Not suitable for any form of distributed computing?
● What if I don’t know Java/Scala?
Real example: unique IDs
● Problem: Algorithm expects records to have a unique integer ID for some field but your dataset has a unique string column (email, username, …)
● Solution(?): Use the MonotonicallyIncreasingID function to add a new column to the DataFrame
from pyspark.sql.functions import monotonically_increasing_id
df = spark.read.csv("input/*")
df_with_ids = df.withColumn("new_id", monotonically_increasing_id())
Real example: unique IDs
● Problem: MonotonicallyIncreasingID generates 64-bit numbers, and the algorithm expects 32-bit numbers…
● Solution(?): We have fewer than 2^32 items, so just cast from Long to Int
from pyspark.sql.functions import monotonically_increasing_id
df = spark.read.csv("input/*")
df_with_ids = df.withColumn("new_id", monotonically_increasing_id().cast("int"))
Real example: unique IDs
“The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.”
https://spark.apache.org/docs/2.1.1/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id
● Problem: By casting to Int we will have overlapping IDs!
● Solution: Use the MLlib StringIndexer instead
● Alternative: Convert to RDD and use zipWithIndex
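The overlap follows directly from the bit layout quoted above and can be reproduced with integer arithmetic alone. The sketch assumes that the cast keeps only the low 32 bits, as a Java long-to-int narrowing conversion does; the helper names are made up.

```python
def spark_style_id(partition_id, record_number):
    """Bit layout from the quoted documentation: partition ID in the upper 31 bits,
    record number within the partition in the lower 33 bits."""
    return (partition_id << 33) | record_number


def truncate_to_int32(value):
    """Keep the low 32 bits and interpret them as a signed 32-bit integer
    (assumption: this is what the Long to Int cast does)."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low


# Three partitions with two records each.
ids = [spark_style_id(p, r) for p in range(3) for r in range(2)]
as_int = [truncate_to_int32(i) for i in ids]

print(ids)
print(as_int)
```

The partition ID lives entirely above bit 32, so it is lost in the cast: the first record of every partition ends up with ID 0, the second with ID 1, and so on.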
Real example: unique IDs
“StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.”
https://spark.apache.org/docs/latest/ml-features.html#stringindexer
from pyspark.ml.feature import StringIndexer
df = spark.read.csv("input/*")
string_indexer = StringIndexer(inputCol="id", outputCol="new_id", handleInvalid='error')
model = string_indexer.fit(df)
df_with_ids = model.transform(df)
Real example: unique IDs
“The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index.”
https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex
df = spark.read.csv("input/*")
rdd_with_ids = df.rdd.zipWithIndex().map(...)
df_with_ids = rdd_with_ids.toDF(schema)
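The ordering rule quoted for zipWithIndex can be imitated locally to see why the resulting IDs are consecutive. This is a sketch of the semantics only; zip_with_index and the sample addresses are made up, and partitions are plain lists.

```python
def zip_with_index(partitions):
    """Local imitation: index by partition order first, then by position within the partition."""
    indexed = []
    offset = 0
    for partition in partitions:
        indexed.append([(item, offset + i) for i, item in enumerate(partition)])
        offset += len(partition)  # the next partition continues where this one stopped
    return indexed


partitions = [["alice@example.org", "bob@example.org"], [], ["carol@example.org"]]
indexed = zip_with_index(partitions)
print(indexed)
```

Each partition needs to know how many records precede it, so the sizes of the earlier partitions have to be known before indices can be assigned.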
Real example: ALS & RegressionEvaluator
● After building an ALS model we try to use the standard RegressionEvaluator to calculate the RMSE when predicting the validation set
● RegressionEvaluator seems to always return ‘nan’
● Searching the Internet reveals:
“When building a Spark ML pipeline containing an ALS estimator, the metrics "rmse", "mse", "r2" and "mae" all return NaN.
The reason is in CrossValidator.scala line 109. The K-folds are randomly generated. For large and sparse datasets, there is a significant probability that at least one user of the validation set is missing in the training set, hence generating a few NaN estimation with transform method and NaN RegressionEvaluator's metrics too.”
https://issues.apache.org/jira/browse/SPARK-14489
● Reported 08-04-2016
● Fixed 28-02-2017 in version 2.2.0
● But we are running version 2.1.1 :(
Real example: ALS & RegressionEvaluator
● In version 2.2.0 ALS has an extra parameter coldStartStrategy which can be nan (old behaviour) or drop (drop all rows with NaN predictions)
● Workaround for version 2.1.1: drop nan rows manually between transform and evaluate or subclass the model or estimator
predictions = model.transform(validate).dropna()
evaluator = RegressionEvaluator()
rmse = evaluator.evaluate(predictions)

class MyEvaluator(RegressionEvaluator):
    def evaluate(self, df, params=None):
        df = df.dropna()
        return super().evaluate(df, params)

predictions = model.transform(validate)
evaluator = MyEvaluator()
rmse = evaluator.evaluate(predictions)
Narrow transformations
Each child partition depends on a known subset of parent partitions
Wide transformations
● Wide transformations are the most expensive and should be avoided or optimized
● Wide transformations are caused by operations such as groupByKey, reduceByKey, sort and join
● Examples for optimizing:
○ Filter first
○ Use reduceByKey instead of groupByKey + map
● Join is one of the most expensive operations in Spark
○ Use distinct to prevent data explosion
○ Use cogroup instead of join
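Why reduceByKey is preferred over groupByKey + map can be illustrated by counting the records that would cross the network. The sketch below is a local simulation with made-up data: partitions are plain lists, and "shuffled" simply means the pairs that leave a partition.

```python
from collections import Counter, defaultdict

# Two partitions of words, standing in for a distributed RDD.
partitions = [
    ["the", "cat", "the", "the"],
    ["the", "mat", "cat"],
]

# groupByKey + map: every single (word, 1) pair is shuffled, then summed.
group_shuffled = [(word, 1) for partition in partitions for word in partition]
grouped = defaultdict(list)
for word, one in group_shuffled:
    grouped[word].append(one)
counts_group = {word: sum(ones) for word, ones in grouped.items()}

# reduceByKey: values are first combined inside each partition,
# so at most one pair per key and partition is shuffled.
reduce_shuffled = [pair for partition in partitions for pair in Counter(partition).items()]
counts_reduce = defaultdict(int)
for word, partial_count in reduce_shuffled:
    counts_reduce[word] += partial_count

print(len(group_shuffled), len(reduce_shuffled))
print(counts_group == dict(counts_reduce))
```

Both approaches give the same counts, but the second moves 5 pairs instead of 7; with many repeated keys per partition the difference grows accordingly.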
Joins with DataFrames
● Less control with DataFrames
● Execution plan is determined by Catalyst
● Cannot change the Partitioner
● The advice on preventing joins with non-unique keys still holds!
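The data explosion caused by non-unique keys is easy to reproduce on small lists. The sketch below uses a made-up inner_join helper that follows the usual inner-join semantics (every left value paired with every right value of the same key); the data is invented.

```python
from collections import defaultdict


def inner_join(left, right):
    """Local inner join on key: each left value is paired with every right value of that key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]


# Key "u1" occurs 3 times on the left and 4 times on the right.
left = [("u1", "click")] * 3 + [("u2", "click")]
right = [("u1", "nl")] * 4 + [("u2", "de")]

exploded = inner_join(left, right)

# Removing duplicates first keeps the join result small.
deduplicated = inner_join(sorted(set(left)), sorted(set(right)))

print(len(exploded), len(deduplicated))
```

The duplicated key alone produces 3 x 4 = 12 rows, so 5 + 4 input records turn into 13 output records; after distinct the same join yields 2.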