using apache spark for analytics in the...
TRANSCRIPT
![Page 1: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/1.jpg)
USING APACHE SPARK FORUSING APACHE SPARK FORANALYTICS IN THE CLOUDANALYTICS IN THE CLOUD
William C. BentonWilliam C. Benton
Principal Software Engineer
Red Hat Emerging Technology
June 24, 2015
![Page 2: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/2.jpg)
ABOUT MEABOUT MEDistributed systems and data science in Red Hat'sEmerging Technology groupActive open-source and Fedora developerBefore Red Hat: programming language research
![Page 3: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/3.jpg)
FORECASTFORECASTDistributed data processing: history and mythologyData processing in the cloudIntroducing Apache SparkHow we use Spark for data science at Red Hat
![Page 4: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/4.jpg)
Recent history & persistent mythology
DATA PROCESSINGDATA PROCESSING
![Page 5: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/5.jpg)
What makes distributed data processing difficult?
CHALLENGESCHALLENGES
![Page 6: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/6.jpg)
MAPREDUCE (2004)MAPREDUCE (2004)A novel application of some very old functionalprogramming ideas to distributed computingAll data are modeled as key-value pairs Mappers transform pairs; reducers merge severalpairs with the same key into one new pairRuntime system shuffles data to improve locality
![Page 7: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/7.jpg)
WORD COUNTWORD COUNT"a b" "c e" "a b" "d a" "d b"
![Page 8: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/8.jpg)
MAPPED INPUTSMAPPED INPUTS(a, 1)(b, 1)
(c, 1)(e, 1)
(a, 1)(b, 1)
(d, 1)(a, 1)
(d, 1)(b, 1)
![Page 9: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/9.jpg)
SHUFFLED RECORDSSHUFFLED RECORDS(a, 1)(a, 1)(a, 1)
(b, 1)(b, 1)(b, 1)
(c, 1) (d, 1)(d, 1)
(e, 1)
![Page 10: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/10.jpg)
REDUCED RECORDSREDUCED RECORDS
(a, 3) (b, 3) (c, 1) (d, 2) (e, 1)
![Page 11: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/11.jpg)
HADOOP (2005)HADOOP (2005)Open-source implementation of MapReduce, adistributed filesystem, and moreInexpensive way to store and process data withscale-out on commodity hardwareMotivates many of the default assumptions wemake about “big data” today
![Page 12: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/12.jpg)
“FACTS”“FACTS”You need an architecture that will scale out to manynodes to handle real-world data analyticsYour network and disks probably aren't fast enoughLocality is everything: you need to be able to runcompute jobs on the nodes storing your data
![Page 13: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/13.jpg)
“FACTS”“FACTS”You need an architecture that will scale out to manynodes to handle real-world data analyticsYour network and disks probably aren't fast enoughLocality is everything: you need to be able to runcompute jobs on the nodes storing your data
![Page 14: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/14.jpg)
...at least two analytics production clusters (atMicrosoft and Yahoo) have median job input sizesunder 14 GB and 90% of jobs on a Facebookcluster have input sizes under 100 GB.
Appuswamy et al., “Nobody ever got fired forbuying a cluster.” Microsoft Research Tech Report.
![Page 15: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/15.jpg)
Takeaway #1: you may need petascalestorage, but you probably don't evenneed terascale compute.
![Page 16: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/16.jpg)
Takeaway #2: moderately sizedworkloads benefit more fromscale-up than scale out.
![Page 17: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/17.jpg)
“FACTS”“FACTS”You need an architecture that will scale out to manynodes to handle real-world data analyticsYour network and disks probably aren't fast enoughLocality is everything: you need to be able to runcompute jobs on the nodes storing your data
![Page 18: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/18.jpg)
Contrary to our expectations ... CPU (and not I/O)is often the bottleneck [and] improving networkperformance can improve job completion time bya median of at most 2%
Ousterhout et al., “Making Sense of Performance inData Analytics Frameworks.” USENIX NSDI ’15.
![Page 19: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/19.jpg)
Takeaway #3: I/O is not the bottleneck(especially in moderately-sized jobs);focus on CPU performance.
![Page 20: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/20.jpg)
“FACTS”“FACTS”You need an architecture that will scale out to manynodes to handle real-world data analyticsYour network and disks probably aren't fast enoughLocality is everything: you need to be able to runcompute jobs on the nodes storing your data
![Page 21: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/21.jpg)
Takeaway #4: collocated data andcompute was a sensible choice forpetascale jobs in 2005, but shouldn'tnecessarily be the default today.
![Page 22: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/22.jpg)
FACTS (REVISED)FACTS (REVISED)You probably don't need an architecture that willscale out to many nodes to handle real-world dataanalytics (and might be better served by scaling up)Your network and disks probably aren't the problemYou have enormous flexibility to choose the besttechnologies for storage and compute
![Page 23: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/23.jpg)
HADOOP IN 2015HADOOP IN 2015MapReduce is low-level, verbose, and not an obviousfit for many interesting problemsNo unified abstractions: Hive or Pig for query,Giraph for graph, Mahout for machine learning, etc.Fundamental architectural assumptions need to berevisited along with the “facts” motivating them
![Page 24: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/24.jpg)
How our assumptions should change
DATA PROCESSINGDATA PROCESSINGIN THE CLOUDIN THE CLOUD
![Page 25: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/25.jpg)
COLLOCATED DATACOLLOCATED DATAAND COMPUTEAND COMPUTE
![Page 26: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/26.jpg)
ELASTIC RESOURCESELASTIC RESOURCES
![Page 27: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/27.jpg)
DISTINCT STORAGEDISTINCT STORAGEAND COMPUTEAND COMPUTECombine the best storage system for your applicationwith elastic compute resources.
![Page 28: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/28.jpg)
INTRODUCING SPARKINTRODUCING SPARK
![Page 29: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/29.jpg)
Apache Spark is a framework fordistributed computing based on ahigh-level, expressive abstraction.
![Page 30: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/30.jpg)
Query ML Graph StreamingSpark core
![Page 31: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/31.jpg)
ad hoc Mesos YARNQuery ML Graph StreamingSpark core
![Page 32: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/32.jpg)
ad hoc Mesos YARNSpark core
Query ML Graph Streaming
Language bindings for Scala,Java, Python, and R
Access data from JDBC,Gluster, HDFS, S3, and more
![Page 33: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/33.jpg)
ad hoc Mesos YARNA resilient distributed dataset is apartitioned, immutable, lazy collection.
![Page 34: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/34.jpg)
A resilient distributed dataset is apartitioned, immutable, lazy collection.
![Page 35: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/35.jpg)
A resilient distributed dataset is apartitioned, immutable, lazy collection.
The PARTITIONS making upan RDD can be distributedacross multiple machines
![Page 36: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/36.jpg)
A resilient distributed dataset is apartitioned, immutable, lazy collection.
TRANSFORMATIONS create new(lazy) collections; ACTIONS forcecomputations and return results
![Page 37: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/37.jpg)
CREATING AN RDDCREATING AN RDD
spark.parallelize(range(1, 1000))
spark.textFile("hamlet.txt")
spark.hadoopFile("...")spark.sequenceFile("...")spark.objectFile("...")
# from an in-memory collection
# from the lines of a text file
# from a Hadoop-format binary file
![Page 38: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/38.jpg)
TRANSFORMING RDDSTRANSFORMING RDDS
numbers.map(lambda x: x + 1)
lines.flatMap(lambda s: s.split(" "))
vowels = ['a', 'e', 'i', 'o', 'u']words.filter(lambda s: s[0] in vowels)
words.distinct()
# transform each element independently
# turn each element into zero or more elements
# reject elements that don't satisfy a predicate
# keep only one copy of duplicate elements
![Page 39: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/39.jpg)
TRANSFORMING RDDSTRANSFORMING RDDS
pairs.sortByKey()
pairs.reduceByKey(lambda x, y: max(x, y))
pairs.join(other_pairs)
# return an RDD of key-value pairs, sorted by# the keys of each
# combine every two pairs having the same key,# using the given reduce function
# join together two RDDs of pairs so that# [(a, b)] join [(a, c)] == [(a, (b, c))]
![Page 40: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/40.jpg)
CACHING RESULTSCACHING RESULTS
sorted_pairs = pairs.sortByKey()sorted_pairs.cache()
sorted_pairs.persist(MEMORY_AND_DISK)
sorted_pairs.unpersist()
# tell Spark to cache this RDD in cluster# memory after we compute it
# as above, except also store a copy on disk
# uncache and free this result
![Page 41: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/41.jpg)
COMPUTING RESULTSCOMPUTING RESULTS
numbers.count()
counts.collect()
words.saveAsTextFile("...")
# compute this RDD and return a# count of elements
# compute this RDD and materialize it# as a local collection
# compute this RDD and write each# partition to stable storage
![Page 42: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/42.jpg)
WORD COUNT EXAMPLEWORD COUNT EXAMPLE
f = spark.textFile("...")
words = f.flatMap(lambda line: line.split(" "))
occs = words.map(lambda word: (word, 1))
counts = occs.reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("...")
# create an RDD backed by the lines of a file
# ...mapping from lines of text to words
# ...mapping from words to occurrences
# ...reducing occurrences to counts
# POP QUIZ: what have we computed so far?
![Page 43: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/43.jpg)
PETASCALE STORAGE,PETASCALE STORAGE,IN-MEMORY COMPUTEIN-MEMORY COMPUTE
storage/compute nodes
compute-only nodes, whichprimarily operate oncached post-ETL data
![Page 44: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/44.jpg)
DATA SCIENCE ATDATA SCIENCE ATRED HATRED HAT
![Page 45: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/45.jpg)
THE EMERGING TECHTHE EMERGING TECHDATA SCIENCE TEAMDATA SCIENCE TEAMEngineers with distributed systems, data science,and scientific computing expertiseGoal: help internal customers solve data problemsand make data-driven decisionsPrinciples: identify best practices, question outdatedassumptions, use best-of-breed technology
![Page 46: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/46.jpg)
DEVELOPMENTDEVELOPMENT
Six compute-only nodes Two nodes for GlusterstorageApache Spark runningunder Apache MesosOpen-source “notebook”interfaces to analyses
![Page 47: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/47.jpg)
DATA SOURCESDATA SOURCES
FTP
S3
SQL
MongoDB
ElasticSearch
![Page 48: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/48.jpg)
INTERACTIVE QUERYINTERACTIVE QUERY
![Page 49: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/49.jpg)
TWO CASE STUDIESTWO CASE STUDIES
![Page 50: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/50.jpg)
ROLE ANALYSISROLE ANALYSISData source: historical configuration and telemetrydata for internal machines from ElasticSearchData size: hundreds of GBAnalysis: identify machine roles based on thepackages each has installed
![Page 51: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/51.jpg)
BUDGET FORECASTINGBUDGET FORECASTINGData sources: operational log data for OpenShift Online(from MariaDB), actual costs incurred by OpenShiftData size: over 120 GBAnalysis: identify operational metrics most stronglycorrelated with operating expenses; model dailyoperating expense as a function of these metrics
Aggregating performance metrics:17 hours in MariaDB, 15 minutes in Spark!
![Page 52: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/52.jpg)
NEXT STEPSNEXT STEPS
![Page 53: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/53.jpg)
DEMO VIDEODEMO VIDEOSee a video demo of Continuum Analytics, PySpark,and Red Hat Storage: or h.264 Ogg Theora
![Page 54: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/54.jpg)
WHERE FROM HEREWHERE FROM HERECheck out the Emerging Technology Data Scienceteam's library to help build your own data-drivenapplications: See my blog for articles about open source datascience: Questions?
https://github.com/willb/silex/
http://chapeau.freevariable.com
![Page 55: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/55.jpg)
THANKSTHANKS
![Page 56: USING APACHE SPARK FOR ANALYTICS IN THE CLOUDvideos.cdn.redhat.com/.../15605_using-apache-spark-for-analytics-in... · USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton](https://reader035.vdocuments.mx/reader035/viewer/2022062317/5a78cbf67f8b9a07028e5db8/html5/thumbnails/56.jpg)