TRANSCRIPT
Types of Spark operations
There are three types of operations on RDDs: transformations, actions, and shuffles.
The most expensive operations are those that require communication between nodes.
Transformations: RDD → RDD.
Examples: map, filter, sample. No communication needed.

Actions: RDD → Python object in head node.
Examples: reduce, collect, count, take. Some communication needed.

Shuffles: RDD → RDD, shuffle needed.
Examples: repartition, sortByKey, reduceByKey, join. A LOT of communication needed.
Key/value pairs
A Python dictionary is a collection of key/value pairs. The key is used to find the set of pairs with a particular key. The value can be anything. Spark has a set of special operations for (key, value) RDDs.
Creating (key, value) RDDs

Method 1: parallelize a list of pairs.

In [2]:
pair_rdd = sc.parallelize([(1,2), (3,4)])
print pair_rdd.collect()
[(1, 2), (3, 4)]
Method 2: map() a function that returns a key/value pair.

In [3]:
regular_rdd = sc.parallelize([1, 2, 3, 4, 2, 5, 6])
pair_rdd = regular_rdd.map( lambda x: (x, x*x) )
print pair_rdd.collect()
[(1, 1), (2, 4), (3, 9), (4, 16), (2, 4), (5, 25), (6, 36)]
Some important Key-Value Transformations

1. reduceByKey(func): Apply the reduce function func to the values with the same key.
Note that although it is similar to the reduce function, it is implemented as a transformation and not as an action, because the dataset can have a very large number of keys. So it does not return values to the driver program; instead, it returns a new RDD.

In [8]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print "Original RDD :", rdd.collect()
print "After transformation : ", rdd.reduceByKey(lambda a,b: a+b).collect()

Original RDD : [(1, 2), (2, 4), (2, 6)]
After transformation :  [(1, 2), (2, 10)]
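The semantics of reduceByKey can be mimicked in plain Python with no cluster at all; the helper below is a hypothetical illustration of what the transformation computes, not how Spark implements it (Spark reduces within each partition first, then across partitions).

```python
# A plain-Python sketch of reduceByKey semantics: combine the values
# of equal keys pairwise with the given function.
def reduce_by_key(pairs, func):
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return sorted(merged.items())

print(reduce_by_key([(1, 2), (2, 4), (2, 6)], lambda a, b: a + b))
# [(1, 2), (2, 10)]
```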
2. sortByKey():
Sort the RDD by keys in ascending order.

Note: the output of sortByKey() is an RDD. This means that RDDs do have a meaningful order, which extends between partitions.

In [9]:
rdd = sc.parallelize([(2,2), (1,4), (3,6)])
print "Original RDD :", rdd.collect()
print "After transformation : ", rdd.sortByKey().collect()

Original RDD : [(2, 2), (1, 4), (3, 6)]
After transformation :  [(1, 4), (2, 2), (3, 6)]
3. mapValues(func):
Apply func to each value of the RDD without changing the key.

In [10]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print "Original RDD :", rdd.collect()
print "After transformation : ", rdd.mapValues(lambda x: x*2).collect()

Original RDD : [(1, 2), (2, 4), (2, 6)]
After transformation :  [(1, 4), (2, 8), (2, 12)]
4. groupByKey():
Returns a new RDD of (key, <iterator>) pairs, where the iterator iterates over the values associated with the key.

Iterators
Iterators are Python objects that generate a sequence of values. Writing a loop over n elements as

for i in range(n): ## do something

is inefficient because it first allocates a list of n elements and then iterates over it. Using the iterator xrange(n) achieves the same result without materializing the list; instead, elements are generated on the fly.

To materialize the list of values returned by an iterator we will use the list comprehension command:

[a for a in <iterator>]

In [11]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print "Original RDD :", rdd.collect()
print "After transformation : ", rdd.groupByKey().mapValues(lambda iterator: [a for a in iterator]).collect()
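Stripped of the distributed machinery, groupByKey gathers all values that share a key; the sketch below is a hypothetical plain-Python equivalent that uses a list per key instead of Spark's iterator.

```python
# A plain-Python sketch of groupByKey semantics: collect all values
# that share a key into one group per key (a list here, for simplicity;
# Spark returns an iterator per key).
def group_by_key(pairs):
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return sorted(groups.items())

print(group_by_key([(1, 2), (2, 4), (2, 6)]))
# [(1, [2]), (2, [4, 6])]
```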
5. flatMapValues(func):
func is a function that takes a single value as input and returns an iterator that generates a sequence of values. flatMapValues operates on a key/value RDD: it applies func to each value and gets a list (generated by the iterator) of values. It then combines each of those values with the original key to produce a list of key/value pairs. These lists are concatenated as in flatMap.

In [14]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print "Original RDD :", rdd.collect()
# the lambda function generates, for each number i, an iterator that produces i, i+1
print "After transformation : ", rdd.flatMapValues(lambda x: xrange(x,x+2)).collect()

Original RDD : [(1, 2), (2, 4), (2, 6)]
After transformation :  [(1, 2), (1, 3), (2, 4), (2, 5), (2, 6), (2, 7)]
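The same flatten-then-repair-the-key behavior can be sketched in plain Python; the helper below is a hypothetical illustration of the semantics, not Spark's implementation.

```python
# A plain-Python sketch of flatMapValues: apply func to each value,
# then pair every generated value with the original key, concatenating
# all the resulting pairs into one flat list.
def flat_map_values(pairs, func):
    result = []
    for key, value in pairs:
        for new_value in func(value):
            result.append((key, new_value))
    return result

print(flat_map_values([(1, 2), (2, 4), (2, 6)], lambda x: range(x, x + 2)))
# [(1, 2), (1, 3), (2, 4), (2, 5), (2, 6), (2, 7)]
```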
Transformations on two Pair RDDs

In [16]:
rdd1 = sc.parallelize([(1,2),(2,1),(2,2)])
rdd2 = sc.parallelize([(2,5),(3,1)])
a = rdd1.collect()
b = rdd2.collect()
print a, b
[(1, 2), (2, 1), (2, 2)] [(2, 5), (3, 1)]
1. subtractByKey:
Remove from RDD1 all elements whose key is present in RDD2.

In [17]:
print "RDD1:", a
print "RDD2:", b
print "Result:", rdd1.subtractByKey(rdd2).collect()

RDD1: [(1, 2), (2, 1), (2, 2)]
RDD2: [(2, 5), (3, 1)]
Result: [(1, 2)]
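In plain Python, subtractByKey amounts to filtering against the set of keys of the second dataset; the helper below is a hypothetical sketch of the semantics.

```python
# A plain-Python sketch of subtractByKey: drop every pair from the
# first dataset whose key also occurs in the second dataset.
def subtract_by_key(pairs1, pairs2):
    keys2 = set(key for key, _ in pairs2)
    return [(k, v) for k, v in pairs1 if k not in keys2]

print(subtract_by_key([(1, 2), (2, 1), (2, 2)], [(2, 5), (3, 1)]))
# [(1, 2)]
```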
2. join:
A fundamental operation in relational databases. It assumes two tables have a key column in common, and merges rows with the same key.

Suppose we have two (key, value) datasets:

dataset1                                dataset2
key=name  (gender, occupation, age)     key=name  hair color
John      (male, cook, 21)              Jill      blond
Jill      (female, programmer, 19)      Grace     brown
John      (male, kid, 2)                John      black
Kate      (female, wrestler, 54)
When join is called on datasets of type (Key, V) and (Key, W), it returns a dataset of (Key, (V, W)) pairs with all pairs of elements for each key. Joining the two datasets above yields:

key=name  ((gender, occupation, age), hair color)
John      ((male, cook, 21), black)
John      ((male, kid, 2), black)
Jill      ((female, programmer, 19), blond)
In [18]:
print "RDD1:", a
print "RDD2:", b
print "Result:", rdd1.join(rdd2).collect()

RDD1: [(1, 2), (2, 1), (2, 2)]
RDD2: [(2, 5), (3, 1)]
Result: [(2, (1, 5)), (2, (2, 5))]
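The "all pairs of elements for each key" rule can be made concrete with a plain-Python nested-loop sketch; this hypothetical helper illustrates the semantics only (Spark's join uses a shuffle, not a nested loop).

```python
# A plain-Python sketch of join: for every key present in both datasets,
# emit one (key, (v, w)) pair per combination of matching values.
def join(pairs1, pairs2):
    result = []
    for k1, v in pairs1:
        for k2, w in pairs2:
            if k1 == k2:
                result.append((k1, (v, w)))
    return result

print(join([(1, 2), (2, 1), (2, 2)], [(2, 5), (3, 1)]))
# [(2, (1, 5)), (2, (2, 5))]
```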
Actions on Pair RDDs

In [21]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
a = rdd.collect()
1. countByKey(): Count the number of elements for each key. Returns a dictionary for easy access to keys.

In [22]:
print "RDD: ", a
result = rdd.countByKey()
print "Result:", result

RDD:  [(1, 2), (2, 4), (2, 6)]
Result: defaultdict(<type 'int'>, {1: 1, 2: 2})
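countByKey ignores the values entirely and just tallies keys; the hypothetical helper below sketches this in plain Python (Spark returns a defaultdict, a plain dict is used here).

```python
# A plain-Python sketch of countByKey: count how many pairs carry each key,
# discarding the values.
def count_by_key(pairs):
    counts = {}
    for key, _ in pairs:
        counts[key] = counts.get(key, 0) + 1
    return counts

print(count_by_key([(1, 2), (2, 4), (2, 6)]))
# {1: 1, 2: 2}
```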
2. collectAsMap():
Collect the result as a dictionary to provide easy lookup. Note that because a dictionary holds one value per key, when a key appears more than once only one of its values is kept (here key 2 maps to 6, not 4).

In [23]:
print "RDD: ", a
result = rdd.collectAsMap()
print "Result:", result

RDD:  [(1, 2), (2, 4), (2, 6)]
Result: {1: 2, 2: 6}
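The duplicate-key behavior matches what building a dict from the pairs does in plain Python: a later value for the same key overwrites the earlier one.

```python
# Plain-Python analogue of collectAsMap: when a key repeats,
# the later value overwrites the earlier one.
result = dict([(1, 2), (2, 4), (2, 6)])
print(result)
# {1: 2, 2: 6}
```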
3. lookup(key):
Return all values associated with the provided key.

In [24]:
print "RDD: ", a
result = rdd.lookup(2)
print "Result:", result

RDD:  [(1, 2), (2, 4), (2, 6)]
Result: [4, 6]