Spark Basics 2
Lazy evaluation

We can combine map and reduce operations to perform more complex operations.

Suppose we want to compute the sum of the squares

$$\sum_i x_i^2$$

where the elements $x_i$ are stored in an RDD.
Create an RDD

In [2]: B = sc.parallelize(range(4))
        B.collect()
Out[2]: [0, 1, 2, 3]
Sequential syntax

Perform an assignment after each computation.

In [3]: Squares = B.map(lambda x: x*x)
        Squares.reduce(lambda x,y: x+y)
Out[3]: 14
Cascaded syntax

Combine the computations into a single cascaded command.

In [4]: B.map(lambda x: x*x)\
         .reduce(lambda x,y: x+y)
Out[4]: 14
Both syntaxes mean the same thing

The only difference:

- In the sequential syntax the intermediate RDD has a name, Squares.
- In the cascaded syntax the intermediate RDD is anonymous.

The execution is identical!
Sequential execution

The standard way that the map and reduce are executed is:

1. Perform the map and store the resulting RDD in memory.
2. Perform the reduce.

Disadvantages of sequential execution:

1. The intermediate result (Squares) requires memory space.
2. Two scans of memory (of B, then of Squares) - double the cache-misses.
Pipelined execution

Perform the whole computation in a single pass. For each element of B:

1. Compute the square.
2. Enter the square as input to the reduce operation.

Advantages of pipelined execution:

1. Less memory required - the intermediate result is not stored.
2. Faster - only one pass through the input RDD.
Lazy Evaluation

This type of pipelined evaluation is related to lazy evaluation. The word "lazy" is used because the first command (computing the square) is not executed immediately. Instead, the execution is delayed as long as possible so that several commands are executed in a single pass.

The delayed commands are organized in an execution plan.
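The same pipelined behavior can be sketched in plain Python (this is a stand-in for illustration, not Spark's actual machinery): Python's built-in map returns a lazy iterator, so no square is computed until reduce pulls it, and the whole computation happens in a single pass with no intermediate list.

```python
from functools import reduce

B = range(4)  # stands in for the RDD

# map() returns a lazy iterator: no square has been computed yet
squares = map(lambda x: x * x, B)

# reduce() pulls elements one at a time, so each square is computed
# and immediately consumed -- the intermediate results are never stored
total = reduce(lambda x, y: x + y, squares)
print(total)  # 0 + 1 + 4 + 9 = 14
```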
An instructive mistake

Here is another way to compute the sum of the squares, using a single reduce command. What is wrong with it?

In [5]: C = sc.parallelize([1,1,1])
        C.reduce(lambda x,y: x*x + y*y)

Out[5]: 5
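The problem is that reduce applies the function repeatedly to pairs, feeding each intermediate result back in as an input, so the second application squares a partial sum rather than an original element. (Spark's reduce also requires the function to be associative and commutative, since partition results are combined in arbitrary order.) A plain-Python trace of the same calls shows the failure:

```python
from functools import reduce

C = [1, 1, 1]

# The buggy version: step 1 computes 1*1 + 1*1 = 2, but step 2 then
# computes 2*2 + 1*1 = 5 -- the intermediate result 2 gets squared again.
wrong = reduce(lambda x, y: x * x + y * y, C)
print(wrong)   # 5, but the sum of squares of [1,1,1] is 3

# Correct: do the squaring in map, and keep reduce associative.
right = reduce(lambda x, y: x + y, map(lambda x: x * x, C))
print(right)   # 3
```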
Getting information about an RDD

RDDs typically have hundreds of thousands of elements. It usually makes no sense to print out the content of a whole RDD. Here are some ways to get manageable amounts of information about an RDD.

In [6]: n = 1000000
        B = sc.parallelize([0,0,1,0]*(n/4))
In [7]: # find the number of elements in the RDD
        B.count()
Out[7]: 1000000
In [8]: # get the first few elements of an RDD
        print 'first element=', B.first()
        print 'first 5 elements = ', B.take(5)

first element= 0
first 5 elements =  [0, 0, 1, 0, 0]
Sampling an RDD

RDDs are often very large. Aggregates, such as averages, can be approximated efficiently by using a sample. Sampling is done in parallel and it keeps the data local.

In [9]: # get a sample whose expected size is m
        m = 5.
        B.sample(False, m/n).collect()
Out[9]: [1, 0, 1, 0, 0, 0]
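Note that the sample size is only 5 in expectation: sample(False, fraction) keeps each element independently with probability fraction, which is why the cell above returned 6 elements rather than exactly 5. A plain-Python sketch of this Bernoulli sampling (an illustration, not Spark's implementation):

```python
import random

random.seed(0)  # fixed seed just to make this sketch reproducible
n = 1000000
m = 5
data = [0, 0, 1, 0] * (n // 4)

# keep each element independently with probability m/n,
# mirroring the semantics of RDD.sample(False, m/n)
sample = [x for x in data if random.random() < float(m) / n]
print(len(sample))  # close to m, but it varies from run to run
```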
Filtering an RDD

The method RDD.filter(func) returns a new dataset formed by selecting those elements of the source on which func returns true.

In [10]: # How many positive numbers?
         B.filter(lambda n: n > 0).count()
Out[10]: 250000
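The count above follows from the data: every fourth element of B is a 1, so 1000000/4 = 250000 elements pass the predicate. The same selection in plain Python (a sequential stand-in for the parallel RDD operation):

```python
# the same 1,000,000-element pattern used to build B above
data = [0, 0, 1, 0] * 250000

# filter keeps exactly the elements for which the predicate is True
positives = list(filter(lambda n: n > 0, data))
print(len(positives))  # 250000
```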
Removing duplicate elements from an RDD

The method RDD.distinct(numPartitions=None) returns a new dataset that contains the distinct elements of the source dataset.

The number of partitions is specified through the numPartitions argument. Each of these partitions is potentially on a different machine.

In [11]: # Remove the duplicate elements in DuplicateRDD; we get a distinct RDD
         DuplicateRDD = sc.parallelize([1,1,2,2,3,3])
         DistinctRDD = DuplicateRDD.distinct()
         DistinctRDD.collect()
Out[11]: [1, 2, 3]
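Internally, Spark implements distinct roughly by mapping each element x to a key-value pair (x, None), merging duplicates by key, and mapping back -- which is why it can run in parallel across partitions. A plain-Python sketch of that grouping-by-key idea (illustrative only):

```python
data = [1, 1, 2, 2, 3, 3]

# group by the element itself as the key; duplicate keys collapse,
# just as reduceByKey collapses them across Spark partitions
distinct = list({x: None for x in data})
print(distinct)  # [1, 2, 3]
```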
flatMap an RDD

The method RDD.flatMap(func) is similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

In [12]: text = ["you are my sunshine","my only sunshine"]
         text_file = sc.parallelize(text)
         # map each line in text to a list of words
         print 'map:', text_file.map(lambda line: line.split(" ")).collect()
         # create a single list of words by combining the words from all of the lines
         print 'flatmap:', text_file.flatMap(lambda line: line.split(" ")).collect()

map: [['you', 'are', 'my', 'sunshine'], ['my', 'only', 'sunshine']]
flatmap: ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']
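The relationship between the two is simply flatMap = map followed by flattening one level. A plain-Python equivalent of the cell above (itertools.chain does the flattening):

```python
from itertools import chain

text = ["you are my sunshine", "my only sunshine"]

# map: one list of words per line, giving a list of lists
mapped = [line.split(" ") for line in text]

# flatMap: the same per-line lists, flattened into a single list of words
flat = list(chain.from_iterable(line.split(" ") for line in text))

print(mapped)  # [['you', 'are', 'my', 'sunshine'], ['my', 'only', 'sunshine']]
print(flat)    # ['you', 'are', 'my', 'sunshine', 'my', 'only', 'sunshine']
```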
Set operations

In this part, we explore set operations, including union, intersection, subtract, and cartesian, in pyspark.

In [13]: rdd1 = sc.parallelize([1, 1, 2, 3])
         rdd2 = sc.parallelize([1, 3, 4, 5])
1. union(other): Return the union of this RDD and another one.

In [14]: rdd1.union(rdd2).collect()
Out[14]: [1, 1, 2, 3, 1, 3, 4, 5]
2. intersection(other): Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did. Note that this method performs a shuffle internally.

In [15]: rdd1.intersection(rdd2).collect()
Out[15]: [1, 3]
3. subtract(other, numPartitions=None): Return each value in self that is not contained in other.

In [16]: rdd1.subtract(rdd2).collect()
Out[16]: [2]
4. cartesian(other): Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.

In [17]: print rdd1.cartesian(rdd2).collect()
[(1, 1), (1, 3), (1, 4), (1, 5), (1, 1), (1, 3), (1, 4), (1, 5), (2, 1), (2, 3), (2, 4), (2, 5), (3, 1), (3, 3), (3, 4), (3, 5)]
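Note the different duplicate semantics: union keeps duplicates, intersection removes them, and subtract removes every occurrence of a matching value. The four operations on plain Python lists (a sequential sketch of what the RDDs compute in parallel):

```python
from itertools import product

rdd1 = [1, 1, 2, 3]
rdd2 = [1, 3, 4, 5]

# union: simple concatenation, duplicates are kept
union = rdd1 + rdd2                                  # [1, 1, 2, 3, 1, 3, 4, 5]

# intersection: duplicates removed, like the RDD version
intersection = sorted(set(rdd1) & set(rdd2))         # [1, 3]

# subtract: drop every value of rdd1 that appears anywhere in rdd2
subtract = [x for x in rdd1 if x not in set(rdd2)]   # [2]

# cartesian: all (a, b) pairs, 4 x 4 = 16 of them
cartesian = list(product(rdd1, rdd2))
print(union, intersection, subtract, len(cartesian))
```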