Fast and Expressive Big Data Analytics with Python
Fast and Expressive Big Data Analytics with Python
Matei Zaharia
UC Berkeley / MIT
spark-project.org
What is Spark?
Fast and expressive cluster computing system interoperable with Apache Hadoop

Improves efficiency through:
» In-memory computing primitives
» General computation graphs

Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell

Up to 100× faster (2-10× on disk)
Often 5× less code
Project History
Started in 2009, open sourced in 2010

17 companies now contributing code
» Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, …

Entered Apache incubator in June
Python API added in February
An Expanding Stack
Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
» Shark (SQL)
» Spark Streaming (real-time)
» MLbase (machine learning)
» GraphX (graph)
» …

More details: amplab.berkeley.edu
This Talk
Spark programming model
Examples
Demo
Implementation
Trying it out
Why a New Programming Model?
MapReduce simplified big data processing, but users quickly found two problems:
Programmability: a tangle of map/reduce functions
Speed: MapReduce is inefficient for apps that share data across multiple steps
» Iterative algorithms, interactive queries
Data Sharing in MapReduce
[Diagram: each iteration (iter. 1, iter. 2, …) reads its input from HDFS and writes its output back to HDFS, and each interactive query (query 1-3, result 1-3) re-reads the input from HDFS.]
Slow due to data replication and disk I/O
What We'd Like
[Diagram: input goes through one-time processing into distributed memory; iterations (iter. 1, iter. 2, …) and queries (query 1-3) then read from memory instead of HDFS.]
10-100× faster than network and disk
Spark Model
Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or on disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                    # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()           # action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver sends tasks to workers; each worker reads one HDFS block (Block 1-3), builds its partition of messages into a local cache (Cache 1-3), and returns results.]

Result: full-text search of Wikipedia in 2 sec (vs 30 s for on-disk data)
Result: scaled to 1 TB data in 7 sec (vs 180 sec for on-disk data)
Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data

messages = textFile(...).filter(lambda s: "ERROR" in s)
                        .map(lambda s: s.split("\t")[2])

HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)
Example: Logistic Regression
Goal: find line separating two sets of points
[Diagram: a scatter of + and – points, with a random initial line rotating toward the target separating line.]
Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print “Final w: %s” % w
Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30), up to ~4000 s, for Hadoop vs. PySpark.]
Hadoop: 110 s / iteration
PySpark: 80 s first iteration, 5 s further iterations
Demo
Supported Operators
map, filter, groupBy, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, flatMap, take, first, partitionBy, pipe, distinct, save, ...
Spark Community
1000+ meetup members
60+ contributors
17 companies contributing
This Talk
Spark programming model
Examples
Demo
Implementation
Trying it out
Overview
Spark core is written in Scala
PySpark calls the existing scheduler, cache, and networking layers (a 2K-line wrapper)
No changes to Python

[Diagram: your app talks to the PySpark client, which drives Spark workers; each worker launches Python child processes to run user code.]
Main PySpark author: Josh Rosen
cs.berkeley.edu/~joshrosen
Object Marshaling
Uses the pickle library for both communication and cached data
» Much cheaper than storing Python objects in RAM

Lambda marshaling library by PiCloud
Job Scheduler
Supports general operator graphs
Automatically pipelines functions
Aware of data locality and partitioning

[Diagram: a DAG of RDDs A-G linked by groupBy, map, union, and join, cut into Stages 1-3; shaded boxes mark cached data partitions.]
Interoperability
Runs in standard CPython, on Linux / Mac
» Works fine with extensions, e.g. NumPy

Input from local file system, NFS, HDFS, S3
» Only text files for now

Works in IPython, including the notebook
Works in doctests – see our tests!
Getting Started
Visit spark-project.org for video tutorials, online exercises, docs

Easy to run in local mode (multicore), standalone clusters, or EC2

Training camp at Berkeley in August (free video): ampcamp.berkeley.edu
Getting Started
Easiest way to learn is the shell:

$ ./pyspark
>>> nums = sc.parallelize([1, 2, 3])  # make RDD from array
>>> nums.count()
3
>>> nums.map(lambda x: 2 * x).collect()
[2, 4, 6]
Conclusion
PySpark provides a fast and simple way to analyze big datasets from Python

Learn more or contribute at spark-project.org
Look for our training camp on August 29-30!

My email: [email protected]
Behavior with Not Enough RAM
[Chart: iteration time (s) vs. % of working set in memory — cache disabled: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; fully cached: 11.5]
The Rest of the Stack
Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
» Shark (SQL)
» Spark Streaming (real-time)
» MLbase (machine learning)
» GraphX (graph)
» …

More details: amplab.berkeley.edu
Performance Comparison
[Charts: SQL – response time (s, 0-25) for Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem); Streaming – throughput (MB/s/node, 0-35) for Storm vs. Spark; Graph – response time (min, 0-30) for Hadoop, Giraph, GraphLab, GraphX.]