TRANSCRIPT
Types of Spark operations
There are three types of operations on RDDs: transformations, actions, and shuffles.
The most expensive operations are those that require communication between nodes.
Transformations: RDD → RDD.
Examples: map, filter, sample. No communication needed.

Actions: RDD → Python object in head node.
Examples: reduce, collect, count, take. Some communication needed.

Shuffles: RDD → RDD, shuffle needed.
Examples: repartition, sortByKey, reduceByKey, join. A LOT of communication needed.
Key/value pairs
A Python dictionary is a collection of key/value pairs. The key is used to find the set of pairs with a particular key. The value can be anything. Spark has a set of special operations for (key, value) RDDs.
Creating (key, value) RDDs

Method 1: parallelize a list of pairs.

In [2]:
pair_rdd = sc.parallelize([(1,2), (3,4)])
print pair_rdd.collect()
[(1, 2), (3, 4)]
Method 2: map() a function that returns a key/value pair.

In [3]:
regular_rdd = sc.parallelize([1, 2, 3, 4, 2, 5, 6])
pair_rdd = regular_rdd.map( lambda x: (x, x*x) )
print pair_rdd.collect()
[(1, 1), (2, 4), (3, 9), (4, 16), (2, 4), (5, 25), (6, 36)]
Some important Key-Value Transformations

1. reduceByKey(func): Apply the reduce function func to the values with the same key.
Note that although it is similar to the reduce function, it is implemented as a transformation and not as an action, because the dataset can have a very large number of keys. So it does not return values to the driver program; instead, it returns a new RDD.

In [8]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print "Original RDD :", rdd.collect()
print "After transformation : ", rdd.reduceByKey(lambda a,b: a+b).collect()

Original RDD : [(1, 2), (2, 4), (2, 6)]
After transformation :  [(1, 2), (2, 10)]
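The semantics of reduceByKey can be mimicked in plain Python with no cluster at all; the helper below is a hypothetical illustration of what the transformation computes, not how Spark implements it (Spark reduces within each partition first, then across partitions).

```python
# A plain-Python sketch of reduceByKey semantics: combine the values
# of equal keys pairwise with the given function.
def reduce_by_key(pairs, func):
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return sorted(merged.items())

print(reduce_by_key([(1, 2), (2, 4), (2, 6)], lambda a, b: a + b))
# [(1, 2), (2, 10)]
```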
2. sortByKey():
Sort the RDD by keys in ascending order.

Note: the output of sortByKey() is an RDD. This means that RDDs do have a meaningful order, which extends between partitions.

In [9]:
rdd = sc.parallelize([(2,2), (1,4), (3,6)])
print "Original RDD :", rdd.collect()
print "After transformation : ", rdd.sortByKey().collect()

Original RDD : [(2, 2), (1, 4), (3, 6)]
After transformation :  [(1, 4), (2, 2), (3, 6)]
3. mapValues(func):
Apply func to each value of the RDD without changing the key.

In [10]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print "Original RDD :", rdd.collect()
print "After transformation : ", rdd.mapValues(lambda x: x*2).collect()

Original RDD : [(1, 2), (2, 4), (2, 6)]
After transformation :  [(1, 4), (2, 8), (2, 12)]
4. groupByKey():
Returns a new RDD of (key, <iterator>) pairs, where the iterator iterates over the values associated with the key.

Iterators
Iterators are Python objects that generate a sequence of values. Writing a loop over n elements as

for i in range(n): ## do something

is inefficient because it first allocates a list of n elements and then iterates over it. Using the iterator xrange(n) achieves the same result without materializing the list; instead, elements are generated on the fly.

To materialize the list of values returned by an iterator we will use the list comprehension command:

[a for a in <iterator>]

In [11]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print "Original RDD :", rdd.collect()
print "After transformation : ", rdd.groupByKey().mapValues(lambda iterator: [a for a in iterator]).collect()
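Stripped of the distributed machinery, groupByKey gathers all values that share a key; the sketch below is a hypothetical plain-Python equivalent that uses a list per key instead of Spark's iterator.

```python
# A plain-Python sketch of groupByKey semantics: collect all values
# that share a key into one group per key (a list here, for simplicity;
# Spark returns an iterator per key).
def group_by_key(pairs):
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    return sorted(groups.items())

print(group_by_key([(1, 2), (2, 4), (2, 6)]))
# [(1, [2]), (2, [4, 6])]
```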
5. flatMapValues(func):
func is a function that takes a single value as input and returns an iterator that generates a sequence of values. flatMapValues operates on a key/value RDD: it applies func to each value and gets a list (generated by the iterator) of values. It then combines each of those values with the original key to produce a list of key/value pairs. These lists are concatenated as in flatMap.

In [14]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print "Original RDD :", rdd.collect()
# the lambda function generates, for each number i, an iterator that produces i, i+1
print "After transformation : ", rdd.flatMapValues(lambda x: xrange(x,x+2)).collect()

Original RDD : [(1, 2), (2, 4), (2, 6)]
After transformation :  [(1, 2), (1, 3), (2, 4), (2, 5), (2, 6), (2, 7)]
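The same flatten-then-repair-the-key behavior can be sketched in plain Python; the helper below is a hypothetical illustration of the semantics, not Spark's implementation.

```python
# A plain-Python sketch of flatMapValues: apply func to each value,
# then pair every generated value with the original key, concatenating
# all the resulting pairs into one flat list.
def flat_map_values(pairs, func):
    result = []
    for key, value in pairs:
        for new_value in func(value):
            result.append((key, new_value))
    return result

print(flat_map_values([(1, 2), (2, 4), (2, 6)], lambda x: range(x, x + 2)))
# [(1, 2), (1, 3), (2, 4), (2, 5), (2, 6), (2, 7)]
```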
Transformations on two Pair RDDs

In [16]:
rdd1 = sc.parallelize([(1,2),(2,1),(2,2)])
rdd2 = sc.parallelize([(2,5),(3,1)])
a = rdd1.collect()
b = rdd2.collect()
print a, b
[(1, 2), (2, 1), (2, 2)] [(2, 5), (3, 1)]
1. subtractByKey:
Remove from RDD1 all elements whose key is present in RDD2.

In [17]:
print "RDD1:", a
print "RDD2:", b
print "Result:", rdd1.subtractByKey(rdd2).collect()

RDD1: [(1, 2), (2, 1), (2, 2)]
RDD2: [(2, 5), (3, 1)]
Result: [(1, 2)]
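In plain Python, subtractByKey amounts to filtering against the set of keys of the second dataset; the helper below is a hypothetical sketch of the semantics.

```python
# A plain-Python sketch of subtractByKey: drop every pair from the
# first dataset whose key also occurs in the second dataset.
def subtract_by_key(pairs1, pairs2):
    keys2 = set(key for key, _ in pairs2)
    return [(k, v) for k, v in pairs1 if k not in keys2]

print(subtract_by_key([(1, 2), (2, 1), (2, 2)], [(2, 5), (3, 1)]))
# [(1, 2)]
```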
2. join:
A fundamental operation in relational databases. It assumes two tables have a key column in common, and merges rows with the same key.

Suppose we have two (key, value) datasets:

dataset1                                dataset2
key=name  (gender, occupation, age)     key=name  hair color
John      (male, cook, 21)              Jill      blond
Jill      (female, programmer, 19)      Grace     brown
John      (male, kid, 2)                John      black
Kate      (female, wrestler, 54)
When join is called on datasets of type (Key, V) and (Key, W), it returns a dataset of (Key, (V, W)) pairs with all pairs of elements for each key. Joining the two datasets above yields:

key=name  ((gender, occupation, age), hair color)
John      ((male, cook, 21), black)
John      ((male, kid, 2), black)
Jill      ((female, programmer, 19), blond)
In [18]:
print "RDD1:", a
print "RDD2:", b
print "Result:", rdd1.join(rdd2).collect()

RDD1: [(1, 2), (2, 1), (2, 2)]
RDD2: [(2, 5), (3, 1)]
Result: [(2, (1, 5)), (2, (2, 5))]
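The "all pairs of elements for each key" rule can be made concrete with a plain-Python nested-loop sketch; this hypothetical helper illustrates the semantics only (Spark's join uses a shuffle, not a nested loop).

```python
# A plain-Python sketch of join: for every key present in both datasets,
# emit one (key, (v, w)) pair per combination of matching values.
def join(pairs1, pairs2):
    result = []
    for k1, v in pairs1:
        for k2, w in pairs2:
            if k1 == k2:
                result.append((k1, (v, w)))
    return result

print(join([(1, 2), (2, 1), (2, 2)], [(2, 5), (3, 1)]))
# [(2, (1, 5)), (2, (2, 5))]
```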
Actions on Pair RDDs

In [21]:
rdd = sc.parallelize([(1,2), (2,4), (2,6)])
a = rdd.collect()
1. countByKey(): Count the number of elements for each key. Returns a dictionary for easy access to keys.

In [22]:
print "RDD: ", a
result = rdd.countByKey()
print "Result:", result

RDD:  [(1, 2), (2, 4), (2, 6)]
Result: defaultdict(<type 'int'>, {1: 1, 2: 2})
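countByKey ignores the values entirely and just tallies keys; the hypothetical helper below sketches this in plain Python (Spark returns a defaultdict, a plain dict is used here).

```python
# A plain-Python sketch of countByKey: count how many pairs carry each key,
# discarding the values.
def count_by_key(pairs):
    counts = {}
    for key, _ in pairs:
        counts[key] = counts.get(key, 0) + 1
    return counts

print(count_by_key([(1, 2), (2, 4), (2, 6)]))
# {1: 1, 2: 2}
```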
2. collectAsMap():
Collect the result as a dictionary to provide easy lookup. Note that because a dictionary holds one value per key, when a key appears more than once only one of its values is kept (here key 2 maps to 6, not 4).

In [23]:
print "RDD: ", a
result = rdd.collectAsMap()
print "Result:", result

RDD:  [(1, 2), (2, 4), (2, 6)]
Result: {1: 2, 2: 6}
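The duplicate-key behavior matches what building a dict from the pairs does in plain Python: a later value for the same key overwrites the earlier one.

```python
# Plain-Python analogue of collectAsMap: when a key repeats,
# the later value overwrites the earlier one.
result = dict([(1, 2), (2, 4), (2, 6)])
print(result)
# {1: 2, 2: 6}
```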
3. lookup(key):
Return all values associated with the provided key.

In [24]:
print "RDD: ", a
result = rdd.lookup(2)
print "Result:", result

RDD:  [(1, 2), (2, 4), (2, 6)]
Result: [4, 6]