apache spark intro
![Page 1: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/1.jpg)
Apache Spark Intro workshop
BigData Romania
![Page 2: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/2.jpg)
Apache Spark Intro
★ Apache Spark history
★ RDD
★ Transformations
★ Actions
★ Hands-on session
![Page 3: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/3.jpg)
Apache Spark History
https://www.reddit.com/r/IAmA/comments/31bkue/im_matei_zaharia_creator_of_spark_and_cto_at/
![Page 4: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/4.jpg)
Where to learn Spark?
http://spark.apache.org/
http://shop.oreilly.com/product/0636920028512.do
![Page 5: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/5.jpg)
Spark architecture
![Page 6: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/6.jpg)
Easy ways to run Spark
★ your IDE (e.g. Eclipse or IDEA)
★ Standalone Deploy Mode: simplest way to deploy Spark on a single machine
★ Docker & Zeppelin
★ EMR
★ Hadoop vendors (Cloudera, Hortonworks)
![Page 7: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/7.jpg)
Supported languages
![Page 8: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/8.jpg)
Spark basics
★ RDD
★ Operations: Transformations and Actions
![Page 9: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/9.jpg)
RDD
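An RDD (Resilient Distributed Dataset) is simply an immutable, distributed collection of objects, split into partitions that can be processed on different nodes.
[Figure: the elements a to q of one collection, spread across several partitions]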
![Page 10: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/10.jpg)
Creating RDD (I)
Python: lines = sc.parallelize(["workshop", "spark"])
Scala: val lines = sc.parallelize(List("workshop", "spark"))
Java: JavaRDD<String> lines = sc.parallelize(Arrays.asList("workshop", "spark"));
![Page 11: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/11.jpg)
Creating RDD (II)
Python: lines = sc.textFile("/path/to/file.txt")
Scala: val lines = sc.textFile("/path/to/file.txt")
Java: JavaRDD<String> lines = sc.textFile("/path/to/file.txt");
![Page 12: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/12.jpg)
RDD persistence
★ MEMORY_ONLY
★ MEMORY_AND_DISK
★ MEMORY_ONLY_SER
★ MEMORY_AND_DISK_SER
★ DISK_ONLY
★ MEMORY_ONLY_2
★ MEMORY_AND_DISK_2
★ OFF_HEAP
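A minimal PySpark sketch of how a storage level might be applied. The helper name `persist_for_reuse` is made up for this example, and which levels are available depends on the language API and the Spark version.

```python
def persist_for_reuse(rdd, storage_level=None):
    """Mark an RDD so its contents are kept after the first action computes them.

    `persist_for_reuse` is a hypothetical helper, not part of Spark.
    """
    if storage_level is None:
        # cache() is shorthand for persist() with the default storage level.
        return rdd.cache()
    # persist() only marks the RDD; nothing is stored until an action runs.
    return rdd.persist(storage_level)


# Possible usage, assuming a SparkContext is available as `sc`:
#   from pyspark import StorageLevel
#   lines = persist_for_reuse(sc.textFile("/path/to/file.txt"),
#                             StorageLevel.MEMORY_AND_DISK)
```

Both `cache()` and `persist()` return the RDD itself, so the call can be chained with further transformations.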
![Page 13: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/13.jpg)
Other data structures in Spark
★ Paired RDD
★ DataFrame
★ DataSet
![Page 14: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/14.jpg)
Paired RDD
Paired RDD = an RDD of key/value pairs
user1 user2 user3 user4 user5
id1/user1 id2/user2 id3/user3 id4/user4 id5/user5
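One way the id/user pairs above could be built in PySpark. This is only a sketch: the dict layout of a user record (an "id" and a "name" field) and the helper name `to_pairs` are assumptions made for illustration.

```python
def to_pairs(users):
    """Turn an RDD of user records into an RDD of (id, user) pairs."""
    # Each element becomes a 2-tuple; Spark treats the first item as the key,
    # which is what enables the *ByKey operations and join.
    return users.map(lambda user: (user["id"], user))


# Possible usage, assuming a SparkContext is available as `sc`:
#   users = sc.parallelize([{"id": "id1", "name": "user1"},
#                           {"id": "id2", "name": "user2"}])
#   pairs = to_pairs(users)
```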
![Page 15: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/15.jpg)
Spark operations
[Figure: diagram of RDD 1 to RDD 6, with Transformation and Action labels]
![Page 16: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/16.jpg)
Transformations
Transformations describe how to transform an RDD into another RDD.
[Figure: RDD 1 -> RDD 2]
![Page 17: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/17.jpg)
Transformations
Input RDD {1,2,3,4,5,6}
map x => x + 1: RDD {2,3,4,5,6,7}
filter x => x != 4: RDD {1,2,3,5,6}
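The map and filter from this slide might look as follows in PySpark. It is a sketch that assumes a SparkContext passed in as `sc`; the function name `map_and_filter` is invented for the example.

```python
def map_and_filter(sc):
    """Build the two RDDs from the slide; no data is processed yet."""
    numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
    # Both transformations start from the same input RDD, as on the slide.
    mapped = numbers.map(lambda x: x + 1)        # describes {2,3,4,5,6,7}
    filtered = numbers.filter(lambda x: x != 4)  # describes {1,2,3,5,6}
    return mapped, filtered


# Possible usage in the PySpark shell, where `sc` already exists:
#   mapped, filtered = map_and_filter(sc)
#   mapped.collect()
#   filtered.collect()
```

The input RDD is left unchanged: each transformation returns a new RDD.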
![Page 18: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/18.jpg)
Popular transformations
★ map
★ filter
★ sample
★ union
★ distinct
★ groupByKey
★ reduceByKey
★ sortByKey
★ join
![Page 19: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/19.jpg)
Actions
Actions compute a result from an RDD and either return it to the driver program or write it to external storage.
![Page 20: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/20.jpg)
Actions
Input RDD {1,2,3,4,5,6}
map x => x + 1: RDD {2,3,4,5,6,7}
filter x => x != 4: RDD {1,2,3,5,6}
count() = 6    take(2) = {1,2}    saveAsTextFile()
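A short PySpark sketch of the three actions. It assumes they are applied to the input RDD (the slide does not say which RDD each one runs on) and that a SparkContext is passed in as `sc`; `run_actions` and `output_dir` are names chosen for illustration.

```python
def run_actions(sc, output_dir):
    """Run count, take and saveAsTextFile on the input RDD from the slide."""
    numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
    total = numbers.count()       # number of elements, returned to the driver
    first_two = numbers.take(2)   # a plain Python list with the first 2 elements
    # Writes one part file per partition into output_dir;
    # the call fails if the directory already exists.
    numbers.saveAsTextFile(output_dir)
    return total, first_two


# Possible usage in the PySpark shell, where `sc` already exists:
#   total, first_two = run_actions(sc, "/tmp/numbers-out")
```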
![Page 21: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/21.jpg)
Popular actions
★ collect
★ count
★ first
★ take
★ takeSample
★ countByKey
★ saveAsTextFile
![Page 22: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/22.jpg)
Transformations and Actions
[Figure: users -> filter -> administrators -> take(3)]
![Page 23: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/23.jpg)
Transformations and Actions
[Figure: users -> filter() -> administrators -> take(3) and saveAsTextFile()]
![Page 24: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/24.jpg)
Transformations and Actions
[Figure: users -> filter() -> administrators, with persist() -> take(3) and saveAsTextFile()]
![Page 25: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/25.jpg)
Lazy evaluation
[Figure: users -> filter -> administrators -> take(3)]
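The users/administrators pipeline could be sketched in PySpark like this, with comments marking where work actually happens. The dict layout of a user record (an "occupation" field) and the function names are assumptions for the example.

```python
def is_administrator(user):
    """Predicate used by the filter; `user` is assumed to be a dict."""
    return user["occupation"] == "administrator"


def first_administrators(users, n=3):
    """Return up to n administrators from an RDD of user records."""
    # filter() is a transformation: it only records the step to perform.
    # No user record is read or tested at this point.
    administrators = users.filter(is_administrator)
    # take(n) is an action: only now does Spark run the filter, and it may
    # stop early once it has found n matching records.
    return administrators.take(n)


# Possible usage, assuming a SparkContext is available as `sc`:
#   users = sc.parallelize([{"name": "user1", "occupation": "administrator"},
#                           {"name": "user2", "occupation": "student"}])
#   first_administrators(users)
```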
![Page 26: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/26.jpg)
How Spark Executes Your Program
![Page 27: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/27.jpg)
Hands-on session
![Page 28: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/28.jpg)
MovieLens
MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)
Download link: http://grouplens.org/datasets/movielens/
![Page 29: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/29.jpg)
MovieLens dataset
user: user_id, age, gender, occupation, zipcode
user_rating: user_id, movie_id, rating, timestamp
movie: movie_id, title, release_date, video_release, imdb_url, genres...
![Page 30: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/30.jpg)
Exercises already solved!
★ Return only the users with occupation ‘administrator’
★ Increase the age of each user by one
★ Join user and rating datasets by user id
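One possible PySpark take on these three exercises, not necessarily the workshop's own solution. It assumes the 100K release layout, where the user file is pipe-separated and the ratings file is tab-separated; all function names and the file paths in the usage comment are made up for the example.

```python
def parse_user(line):
    """Parse one line of the user file, e.g. '1|24|M|technician|85711'."""
    user_id, age, gender, occupation, zipcode = line.split("|")
    return {"user_id": int(user_id), "age": int(age), "gender": gender,
            "occupation": occupation, "zipcode": zipcode}


def parse_rating(line):
    """Parse one tab-separated line of the ratings file."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    return {"user_id": int(user_id), "movie_id": int(movie_id),
            "rating": int(rating), "timestamp": int(timestamp)}


def administrators(users):
    """Keep only the users whose occupation is 'administrator'."""
    return users.filter(lambda u: u["occupation"] == "administrator")


def one_year_older(users):
    """Return new user records with age + 1; the input RDD is not modified."""
    return users.map(lambda u: dict(u, age=u["age"] + 1))


def join_users_ratings(users, ratings):
    """Join both datasets on user id; yields (user_id, (user, rating))."""
    users_by_id = users.map(lambda u: (u["user_id"], u))
    ratings_by_user = ratings.map(lambda r: (r["user_id"], r))
    return users_by_id.join(ratings_by_user)


# Possible usage, assuming a SparkContext `sc` and hypothetical paths:
#   users = sc.textFile("ml-100k/u.user").map(parse_user)
#   ratings = sc.textFile("ml-100k/u.data").map(parse_rating)
#   administrators(users).take(3)
```

The join needs both sides as paired RDDs keyed by user id, which is why each dataset is first mapped to (key, value) tuples.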
![Page 31: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/31.jpg)
Exercises to solve
★ How many men/women are registered on MovieLens?
★ Distribution of age for the male/female users registered on MovieLens
★ Which are the names of the movies with rating x?
★ Average rating by movie
★ Sort users by their occupation
![Page 32: Apache spark Intro](https://reader031.vdocuments.mx/reader031/viewer/2022021419/58870c9a1a28abf2228b5421/html5/thumbnails/32.jpg)
Congrats if you reached this slide !