
Introduction into Big Data analytics

MapReduce

Janusz Szwabiński

Outline:

1. Motivation
2. Introduction
3. Distributed file systems
4. MapReduce
5. Algorithms using MapReduce

Further reading:

- Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman, Mining of Massive Datasets
- https://towardsdatascience.com/a-beginners-introduction-into-mapreduce-2c912bb5e6ac

Motivation

Let us assume we are given a list of strings and we are supposed to return the longest one. With the following helper function it is quite easy - we just go over the strings one by one, compute their lengths and keep the longest one:

In [1]: def find_longest_string(list_of_strings):
            longest_string = None
            longest_string_len = 0
            for s in list_of_strings:
                if len(s) > longest_string_len:
                    longest_string_len = len(s)
                    longest_string = s
            return longest_string

For small lists it works pretty fast:

In [2]: list_of_strings = ['abc', 'python', 'dima']
        %time print(find_longest_string(list_of_strings))

python
CPU times: user 29 µs, sys: 6 µs, total: 35 µs
Wall time: 37.7 µs

For larger lists it seems to be acceptable as well:

In [3]: large_list_of_strings = list_of_strings*1000
        %time print(find_longest_string(large_list_of_strings))

python
CPU times: user 415 µs, sys: 0 ns, total: 415 µs
Wall time: 330 µs


However, for huge lists the approach is slow:

In [4]: large_list_of_strings = list_of_strings*100000000
        %time max_length = max(large_list_of_strings, key=len)

CPU times: user 7.81 s, sys: 25 ms, total: 7.83 s
Wall time: 7.81 s

- 8 seconds response time is not acceptable in most applications
- solutions:
  - vertical scaling
    - improve the computation time by buying a much better and faster CPU
    - it won't work forever
      - nowadays, it is not trivial to find a CPU that works 10 times faster
      - upgrading the CPU every time the data gets bigger is not feasible
  - horizontal scaling
    - design the code to run in parallel
    - it will get much faster when we add more CPUs

For the problem at hand, the intuition for horizontal scaling is the following:

1. Break the data into many chunks.
2. Execute the find_longest_string function for every chunk in parallel.
3. Find the longest string among the outputs of all chunks.

However, instead of using our helper function, we will develop a more generic framework to solve the problem. Since the two main tasks in our code are computing the length of a string and comparing it to the longest one so far, we will break the code into 2 steps:

1. Compute the lengths of all strings.
2. Select the max value.

In [5]: %%time
        # step 1:
        large_list_of_string_lens = [len(s) for s in large_list_of_strings]
        large_list_of_string_lens = zip(large_list_of_strings, large_list_of_string_lens)
        #step 2:
        max_len = max(large_list_of_string_lens, key=lambda t: t[1])
        print(max_len)

('python', 6)
CPU times: user 27.3 s, sys: 464 ms, total: 27.8 s
Wall time: 27.8 s
Compiler : 1.09 s

- the code runs slower than before, but...
  - "step 2" gets as input not the original list of strings, but some preprocessed data
  - thus, it can also be executed on the output of other "step 2"s

Mappers and reducers

- "step 1" maps some value into some other value → a mapper
- "step 2" gets a list of values and produces a single (in most cases) value → a reducer

Let us introduce two helper functions:



In [6]: mapper = len  # just the len function

        def reducer(p, c):
            # returns the tuple with the bigger length
            if p[1] > c[1]:
                return p
            return c

Now, we can rewrite the code by making use of the map and reduce functions:

In [7]: from functools import reduce

In [8]: %%time
        #step 1
        mapped = map(mapper, large_list_of_strings)
        mapped = zip(large_list_of_strings, mapped)
        #step 2:
        reduced = reduce(reducer, mapped)
        print(reduced)

('python', 6)
CPU times: user 29.3 s, sys: 117 µs, total: 29.3 s
Wall time: 29.3 s

- the code does exactly the same thing as before:
  - "step 1" maps our list of strings into a list of tuples using the mapper function and the zip statement
  - "step 2" uses the reducer function, goes over the tuples from step one and applies the reducer one by one
  - the result is a tuple with the string of maximum length
- it looks fancier
- it is more generic and will help us parallelize it

We can now break our input into chunks and run a slightly modified code:

- in "step 1"
  - we go over the chunks
  - for each chunk, we find the longest string using map and reduce
- in "step 2"
  - we take the output of "step 1", which is a list of reduced values
  - we perform a final reduce to get the longest string

In [9]: def chunkify(lst, n):
            return [lst[i::n] for i in range(n)]



In [10]: data_chunks = chunkify(large_list_of_strings,30)
         #step 1:
         reduced_all = []
         for chunk in data_chunks:
             mapped_chunk = map(mapper, chunk)
             mapped_chunk = zip(chunk, mapped_chunk)
             reduced_chunk = reduce(reducer, mapped_chunk)
             reduced_all.append(reduced_chunk)
         #step 2:
         reduced = reduce(reducer, reduced_all)
         print(reduced)

('python', 6)

Let us rewrite the mapper function in order to add the first reduce step to it:

In [11]: def chunks_mapper(chunk):
             mapped_chunk = map(mapper, chunk)
             mapped_chunk = zip(chunk, mapped_chunk)
             return reduce(reducer, mapped_chunk)

Now, the code looks as follows:

In [12]: %%time
         data_chunks = chunkify(large_list_of_strings, 30)
         #step 1:
         mapped = map(chunks_mapper, data_chunks)
         #step 2:
         reduced = reduce(reducer, mapped)
         print(reduced)

('python', 6)
CPU times: user 42.9 s, sys: 652 ms, total: 43.5 s
Wall time: 43.5 s

Now, we can parallelize "step 1" using the multiprocessing module and the pool.map function instead of the regular map one:

In [13]: from multiprocessing import Pool
         pool = Pool(8)

In [14]: %%time
         data_chunks = chunkify(large_list_of_strings, 8)
         #step 1:
         mapped = pool.map(chunks_mapper, data_chunks)
         #step 2:
         reduced = reduce(reducer, mapped)
         print(reduced)

('python', 6)
CPU times: user 9.11 s, sys: 1.21 s, total: 10.3 s
Wall time: 14.6 s



- not a huge improvement, but...
  - the solution is scalable - if we have more data, we just need to add more processing units
  - it is generic - it supports a variety of tasks
- important note:
  - very often data is big and static
  - breaking it into chunks every time is inefficient and redundant
  - store the data in chunks (shards) from the very beginning

Introduction

- modern data-mining applications require us to manage immense amounts of data quickly
- in many of these applications the data is extremely regular, and there is ample opportunity to exploit parallelism:
  - ranking of Web pages by importance, which involves an iterated matrix-vector multiplication where the dimension is many billions
  - searches in "friends" networks at social-networking sites, which involve graphs with hundreds of millions of nodes and many billions of edges
- to deal with such applications, a new software stack has evolved:
  - heavy use of parallelism offered not by a "supercomputer", but by "computing clusters"
    - large collections of commodity hardware, including conventional processors ("compute nodes") connected by Ethernet cables or inexpensive switches
  - a new form of file system (a "distributed file system")
    - much larger storage units than the disk blocks in a conventional operating system
    - replication of data (redundancy) to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes
  - many different higher-level programming systems have been developed
- MapReduce is one of such systems
  - handles many of the most common calculations on large-scale data distributed over a computing cluster
  - tolerant to hardware failures
  - still evolving
  - it is common for MapReduce programs to be created from still higher-level programming systems like SQL


Distributed file systems

- in the past, applications that called for parallel processing (e.g. large scientific calculations) were done on special-purpose parallel computers with many processors and specialized hardware, i.e. supercomputers
- now more and more computing is done on installations with thousands of compute nodes operating more or less independently
  - the compute nodes are commodity hardware → reduced costs
  - specialized file systems are required to take advantage of such clusters

Architecture of a cluster

- compute nodes are mounted on racks (e.g. 8-64 per rack)
- nodes on a single rack are connected by a network, typically gigabit Ethernet
- many racks of compute nodes are possible
- racks are connected with each other by another level of network or a switch
- the bandwidth of inter-rack communication is somewhat greater than the intra-rack Ethernet
  - given the number of pairs of nodes that might need to communicate between racks, this bandwidth may be essential

Principal failure modes

- loss of a single node (e.g., the disk at that node crashes)
- loss of an entire rack (e.g., the network connecting its nodes to each other and to the outside world fails)
- if we had to abort and restart the computation every time one component failed, the computation might never complete successfully
- solution to this problem:
  - files must be stored redundantly
    - without duplicates of a file at several compute nodes, if one node failed, all its files would be unavailable until the node is fixed
    - if no backup is available, some files would be lost forever due to disk crashes
  - computations must be divided into tasks
    - if any one task fails to execute to completion, it can be restarted without affecting other tasks
    - this strategy is followed by MapReduce


Distributed file system (DFS)

- when to use?
  - huge files, even a terabyte in size - there is no point in using a DFS for small files only
  - files are rarely updated, but often read
- files are divided into chunks, which are typically 64 megabytes in size
- chunks are replicated a couple of times at different compute nodes
  - nodes holding copies of one chunk should be located on different racks
- normally, both the chunk size and the degree of replication can be decided by the user
- to find the chunks of a file, there is another small file called the master node or name node for that file
- the master node is itself replicated, and a directory for the file system as a whole knows where to find its copies
- the directory itself can be replicated, and all participants using the DFS know where the directory copies are

Implementations

- Google File System (GFS), the original of the class
- Hadoop Distributed File System (HDFS), an open-source DFS used with Hadoop and distributed by the Apache Software Foundation
- CloudStore, an open-source DFS originally developed by Kosmix
- IBM General Parallel File System
- GlusterFS
- Lustre


MapReduce

- a computing style implemented in several systems, including Google's internal implementation and Hadoop
- helps to manage many large-scale computations in a fault-tolerant way
- a user needs to write two functions, called Map and Reduce
- the system manages the parallel execution, the coordination of tasks that execute Map or Reduce, and also deals with the possibility that one of these tasks will fail to execute

Execution steps

1. Some number of Map tasks are each given one or more chunks from a distributed file system:
   - these Map tasks turn the chunk into a sequence of key-value pairs
   - the way key-value pairs are produced from the input data is determined by the code written by the user for the Map function
2. The key-value pairs from each Map task are collected by a master controller and sorted by key.
3. Keys are divided among all Reduce tasks, so all key-value pairs with the same key wind up at the same Reduce task.
4. The Reduce tasks work on one key at a time, and combine all the values associated with that key in some way:
   - the manner of combining values is determined by the code written by the user for the Reduce function

Map tasks

- input files for a Map task consist of elements of any type, e.g. a tuple or a document
- a chunk is a collection of elements
- no element is stored across two chunks
- technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-pair form
  - normally the keys of input elements are not relevant and we shall tend to ignore them
  - insisting on this form for inputs and outputs is motivated by the desire to allow composition of several MapReduce processes
- the Map function takes an input element as its argument and produces zero or more key-value pairs
- the types of keys and values are arbitrary
- keys are not "keys" in the usual sense - they do not have to be unique


- a Map task can produce several key-value pairs with the same key, even from the same element

Example - word count

- we want to count the number of occurrences of each word in a collection of documents
- the input file is a repository of documents
- each document is an element
- the Map function uses keys that are of type String (the words) and values that are integers
- a Map task reads a document and breaks it into its sequence of words $w_1, w_2, \ldots, w_n$
- it emits a sequence of key-value pairs where the value is always 1, i.e. $(w_1, 1), (w_2, 1), \ldots, (w_n, 1)$
- a single Map task will typically process many documents - all the documents in one or more chunks
- its output will be more than the sequence for the one document suggested above
- if a word $w$ appears $m$ times among all documents assigned to that process, then there will be $m$ key-value pairs $(w, 1)$ among its output
- an option could be to combine these $m$ pairs into a single pair $(w, m)$
  - only possible because Reduce tasks apply an associative and commutative operation (addition) to the values
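The whole word-count job can be imitated on a single machine in plain Python. The sketch below is only an illustration: the helper names map_word_count and reduce_word_count are made up, and a dictionary stands in for the grouping by key that a real MapReduce system performs.

from collections import defaultdict

def map_word_count(document):
    # Map: emit (word, 1) for every word in the document (an element)
    return [(word, 1) for word in document.split()]

def reduce_word_count(word, counts):
    # Reduce: sum all the 1's collected for a given word
    return (word, sum(counts))

documents = ["big data is big", "map reduce is a programming model"]

# map phase: apply the Map function to every element
mapped = [pair for doc in documents for pair in map_word_count(doc)]

# grouping by key (done by the system in a real MapReduce implementation)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# reduce phase: one reducer per key
print(dict(reduce_word_count(w, c) for w, c in groups.items()))
# {'big': 2, 'data': 1, 'is': 2, 'map': 1, 'reduce': 1, 'a': 1, 'programming': 1, 'model': 1}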

Grouping by key

- the output of the Map tasks is grouped by key
- the values associated with each key are formed into a list of values
- grouping is performed by the system, regardless of what the Map and Reduce tasks do
- the master controller process knows that there will be $r$ Reduce tasks
- the master controller picks a hash function that applies to keys and produces a bucket number from $0$ to $r-1$
- each key output by a Map task is hashed and its key-value pair is put in one of $r$ local files
- each file is destined for one of the Reduce tasks
- the master controller merges the files from each Map task that are destined for a particular Reduce task and feeds the merged file to that process as a sequence of key-list-of-values pairs, i.e. for each key $k$, the input to the Reduce task that handles key $k$ is a pair of the form $(k, [v_1, v_2, \ldots, v_n])$
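As a toy illustration of how key-value pairs could be routed to the $r$ Reduce tasks, the sketch below buckets them with Python's built-in hash. This is not how any particular MapReduce implementation does it (real systems use their own stable hash functions, and Python's string hashing is randomized between runs); it only shows the idea of the bucket number hash(key) mod r.

from collections import defaultdict

def partition(mapped_pairs, r):
    # one bucket per Reduce task; each bucket maps a key to its list of values
    buckets = [defaultdict(list) for _ in range(r)]
    for key, value in mapped_pairs:
        buckets[hash(key) % r][key].append(value)
    # each Reduce task receives a sequence of (key, list-of-values) pairs
    return [list(bucket.items()) for bucket in buckets]

pairs = [('big', 1), ('data', 1), ('big', 1), ('is', 1)]
for i, bucket in enumerate(partition(pairs, r=2)):
    print(f"Reduce task {i}: {bucket}")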

Reduce tasks

- the Reduce function takes a pair consisting of a key and its list of associated values as input
- the output of the Reduce function is a sequence of zero or more key-value pairs
- these key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks, but often they are of the same type
- the application of the Reduce function to a single key and its associated list of values is often called a reducer
- a Reduce task receives one or more keys and their associated value lists, i.e. it executes one or more reducers
- the outputs from all Reduce tasks are merged into a single file

Example - word count

- the Reduce function simply adds up all values
- the output of a reducer consists of the word and the sum
- the output of all the Reduce tasks is a sequence of $(w, m)$ pairs, where $w$ is a word that appears at least once among all the input documents and $m$ is the total number of occurrences of $w$ among those documents


Combiners

- the Reduce function in the word count example is associative and commutative, i.e. it does not matter how we group a list of numbers $v_1, v_2, \ldots, v_n$ - the sum will always be the same
- in such a case, we can push some of what the reducers do to the Map tasks
  - instead of the Map tasks producing many pairs $(w, 1), (w, 1), \ldots$, we could apply the Reduce function within the Map task, before the output of the Map tasks is subject to grouping and aggregation
  - these key-value pairs would thus be replaced by one pair with key $w$ and value equal to the sum of all the 1's in all those pairs
  - it is still necessary to do grouping and aggregation and to pass the result $(w, m)$ to the Reduce tasks, since there will typically be one key-value pair with key $w$ coming from each of the Map tasks
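A sketch of what a combiner changes in the word-count example: the $(w, 1)$ pairs produced from one chunk of documents are pre-summed into $(w, m)$ pairs inside the Map task, before the shuffle. This only works because the reducer (addition) is associative and commutative; collections.Counter is used here purely as a convenient local aggregator, it is not part of any MapReduce API.

from collections import Counter

def map_with_combiner(documents_chunk):
    # an ordinary Map would emit (w, 1) per occurrence; the combiner sums them locally
    counts = Counter()
    for doc in documents_chunk:
        counts.update(doc.split())
    # emit one (w, m) pair per distinct word seen in this chunk
    return list(counts.items())

print(map_with_combiner(["big data is big", "big deal"]))
# [('big', 3), ('data', 1), ('is', 1), ('deal', 1)]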


Details of execution

- the user program forks a Master controller process and some number of Worker processes at different compute nodes
- a Worker handles either Map tasks (a Map worker) or Reduce tasks (a Reduce worker), but not both
- the Master has many responsibilities:
  - to create some number of Map tasks
  - to create some number of Reduce tasks
  - to assign these tasks to Worker processes
  - to keep track of the status of each task (idle, executing at a particular Worker, or completed)
- it is reasonable to create one Map task for every chunk of the input file(s)
- we may wish to create fewer Reduce tasks
  - it is necessary for each Map task to create an intermediate file for each Reduce task
  - if there are too many Reduce tasks, the number of intermediate files explodes
- a Worker process reports to the Master when it finishes a task
- a new task is scheduled by the Master for that Worker process
- each Map task is assigned one or more chunks of the input file(s) and executes on them the code written by the user
- the Map task creates a file for each Reduce task on the local disk of the Worker that executes it
- the Master is informed of the location and sizes of each of these files and of the Reduce task for which each is destined
- a Reduce task assigned to a Worker process is given all the files that form its input
- the Reduce task executes code written by the user and writes its output to a file that is part of the surrounding distributed file system


Coping with node failures

- the worst thing that can happen is the failure of the compute node with the Master process
  - the entire MapReduce job must be restarted in this case
- other failures will be managed by the Master and the MapReduce job will complete eventually
- if a node with a Map worker fails:
  - this failure will be detected by the Master, because it periodically pings the Worker processes
  - all the Map tasks that were assigned to this Worker will have to be redone, even if they had completed, because their output is unavailable
  - the Master sets the status of each of these Map tasks to idle and will schedule them on a Worker when one becomes available
  - the Master must also inform each Reduce task that the location of its input from that Map task has changed
- if a node with a Reduce worker fails:
  - the Master simply sets the status of its currently executing Reduce tasks to idle
  - they will be rescheduled on another Reduce worker later

Algorithms using MapReduce

- the original purpose of Google's MapReduce implementation was to execute very large matrix-vector multiplications as are needed in the calculation of PageRank
- matrix-vector and matrix-matrix calculations fit nicely into the MapReduce style of computing
- relational-algebra operations are another important class of operations that can use MapReduce effectively
- MapReduce is not a solution to every problem, not even every problem that can profitably use many compute nodes operating in parallel


Matrix-vector multiplication by MapReduce

Let $M$ be an $n \times n$ matrix and $\vec{v}$ a vector of length $n$. Then the matrix-vector product is the vector $\vec{x}$ of length $n$, whose $i$-th element is given by

$$x_i = \sum_{j=1}^{n} m_{ij} v_j$$

- if $n = 100$, using a DFS or MapReduce for this calculation is not a good idea
- this sort of calculation is at the heart of the ranking of Web pages that goes on at search engines, and there $n$ is in the tens of billions (fortunately, the matrix is sparse, with 10 to 15 nonzero elements per row on average)

Vector $\vec{v}$ can fit into memory

Let us first assume that $n$ is large, but not so large that vector $\vec{v}$ cannot fit in main memory:

- vector $\vec{v}$ will be available to every Map task
- matrix $M$ and vector $\vec{v}$ will each be stored in a file of the DFS
- the row-column coordinates of each matrix element should be discoverable, either from its position in the file, or because it is stored with explicit coordinates, e.g. as a triple $(i, j, m_{ij})$
- the position of element $v_j$ in the vector $\vec{v}$ should be discoverable in the analogous way

Map function:

- if $\vec{v}$ is not already read into main memory at the compute node executing a Map task, then load it first into memory - it will be available to all applications of the Map function performed at this Map task
- the Map function is written to apply to one element of $M$
- from each matrix element $m_{ij}$ it produces the key-value pair $(i, m_{ij} v_j)$
- thus, all terms of the sum that make up the component $x_i$ of the matrix-vector product will get the same key $i$

Reduce function:

- it simply sums all the values associated with a given key $i$
- the result will be a pair $(i, x_i)$

If vector $\vec{v}$ cannot fit into memory

- the above procedure does not require that $\vec{v}$ fits into main memory at a compute node
- if it does not, however, there will be a very large number of disk accesses as we move pieces of the vector into main memory to multiply components by elements of the matrix
- alternative approach:
  - divide the vector into equal-sized subvectors that can fit in memory
  - accordingly, divide the matrix into stripes of equal width
  - the $i$-th stripe and the $i$-th subvector are independent from the other stripes/subvectors
  - each stripe is stored in a single file
  - same for the subvectors
  - use the previous algorithm for each stripe/subvector pair
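The in-memory case can be sketched in a few lines of Python. The triples below follow the $(i, j, m_{ij})$ representation from the text, while the dictionary that groups partial products by row index merely simulates what the system's shuffle would do.

from collections import defaultdict

M = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0), (2, 0, 4.0), (2, 2, 5.0)]  # sparse M as (i, j, m_ij)
v = [1.0, 2.0, 3.0]                                                     # v fits in memory

def map_mv(triple):
    i, j, m_ij = triple
    return (i, m_ij * v[j])      # every term of the sum for x_i gets the key i

mapped = [map_mv(t) for t in M]

# grouping by key i (done by the system)
groups = defaultdict(list)
for i, term in mapped:
    groups[i].append(term)

# Reduce: sum the terms for each row i
x = {i: sum(terms) for i, terms in groups.items()}
print(x)   # {0: 5.0, 1: 6.0, 2: 19.0}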


Relational-algebra operations

- in the relational model:
  - a relation is a table with column headers called attributes
  - the rows of the relation are called tuples
  - the set of attributes of a relation is called its schema
- relational algebra defines standard operations on relations that are used to implement queries:
  - selection - apply a condition $C$ to each tuple in the relation and produce as output only those tuples that satisfy $C$ (the result is usually denoted $\sigma_C(R)$)
  - projection - for some subset $S$ of the attributes of the relation, produce from each tuple only the components for the attributes in $S$ (denoted $\pi_S(R)$)
  - union, intersection and difference - these operations apply to the sets of tuples in two relations that have the same schema
  - natural join - the set of all combinations of tuples in relations $R$ and $S$ that are equal on their common attribute names (denoted $R \bowtie S$)
  - grouping and aggregation - given a relation $R$, partition its tuples according to their values in one set of attributes $G$; then, for each group, aggregate the values in certain other attributes
- a relation can be stored as a file in a distributed file system
- many operations on data can be described easily in terms of relational algebra, even though they are not executed within a database management system

Computing selections by MapReduce

- selections do not need the full power of MapReduce
- they can be done most conveniently in the map portion alone
- they could also be done in the reduce portion alone

Map function:

- each tuple $t$ in $R$ for which the condition $C$ is satisfied is output as the key-value pair $(t, t)$

Reduce function:

- the identity, it simply passes each key-value pair to the output
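A minimal sketch of $\sigma_C(R)$, with tuples as plain Python tuples and the condition $C$ passed as a predicate function; the identity reducer is only implied, since each key already carries its tuple.

R = [(1, 'a'), (2, 'b'), (3, 'a')]
C = lambda t: t[1] == 'a'                      # the condition C

# Map: emit (t, t) only for tuples that satisfy C
mapped = [(t, t) for t in R if C(t)]

# Reduce: the identity - simply pass each pair to the output
selected = [t for t, _ in mapped]
print(selected)                                # [(1, 'a'), (3, 'a')]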


Computing projections by MapReduce

- performed similarly to selection
- since projection may cause the same tuple to appear several times, the Reduce function must eliminate duplicates

Map function:

- for each tuple $t$ in $R$, construct a tuple $t'$ by eliminating from $t$ those components whose attributes are not in the subset $S$
- output the key-value pair $(t', t')$

Reduce function:

- for each key $t'$ produced by any of the Map tasks, there will be one or more key-value pairs $(t', t')$
- turn $(t', [t', t', \ldots, t'])$ into the output $(t', t')$

Important note:

- the duplicate elimination performed by the reducer is associative and commutative
- a combiner associated with each Map task can eliminate duplicates produced locally
- a Reduce task is still required to eliminate two identical tuples coming from different Map tasks
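A sketch of $\pi_S(R)$ with duplicate elimination in the reducer; here $S$ is given as a tuple of attribute positions, and a dictionary again plays the role of the grouping by key.

from collections import defaultdict

R = [(1, 'a', 10), (2, 'a', 20), (1, 'a', 30)]
S = (1,)                                   # project onto the second attribute only

# Map: build t' by keeping only the attributes in S, emit (t', t')
mapped = []
for t in R:
    t_prime = tuple(t[p] for p in S)
    mapped.append((t_prime, t_prime))

# grouping by key
groups = defaultdict(list)
for t_prime, value in mapped:
    groups[t_prime].append(value)

# Reduce: turn (t', [t', t', ...]) into a single (t', t') - duplicates are gone
projected = [t_prime for t_prime in groups]
print(projected)                           # [('a',)]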


Union, intersection, and difference by MapReduce

Relations $R$ and $S$ should have the same schema.

Union

- Map tasks will be assigned chunks from either $R$ or $S$
- they pass their input tuples as key-value pairs to the output
- Reduce tasks should only eliminate duplicates

Map function for union:

- turn each input tuple $t$ into the key-value pair $(t, t)$

Reduce function for union:

- one or two values will be associated with each key $t$
- produce the output $(t, t)$ in either case

Intersection

- same Map function as before
- the Reduce function must produce a tuple only if both relations have the tuple

Map function for intersection:

- turn each input tuple $t$ into the key-value pair $(t, t)$

Reduce function for intersection:

- if key $t$ has value list $[t, t]$, then produce $(t, t)$
- otherwise, produce nothing

Difference

- the only way a tuple $t$ can appear in the output is if it is in $R$ but not in $S$
- the Map function can pass tuples from $R$ and $S$ through, but must inform the Reduce function which relation each tuple came from

Map function for difference:

- for a tuple $t$ in $R$, produce the key-value pair $(t, R)$
- for a tuple $t$ in $S$, produce the key-value pair $(t, S)$
- only the name of the relation (or, even better, a bit indicating it) should be passed as the value, not the entire relation

Reduce function for difference:

- for each key $t$, if the associated value list is $[R]$, then produce $(t, t)$
- otherwise produce nothing
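The three operations share the same Map shape, so one sketch covers them all; as in the text, the Map function for the difference passes the relation's name (here just the strings 'R' and 'S') as the value, and the group helper is a stand-in for the system's shuffle.

from collections import defaultdict

def group(pairs):
    # stand-in for the system's grouping by key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

R = [(1, 'a'), (2, 'b')]
S = [(2, 'b'), (3, 'c')]

# union / intersection: Map turns every tuple t into (t, t)
g = group([(t, t) for t in R] + [(t, t) for t in S])
union        = list(g)                                        # every key survives
intersection = [t for t, vals in g.items() if len(vals) == 2]

# difference R - S: Map tags each tuple with its relation's name
g = group([(t, 'R') for t in R] + [(t, 'S') for t in S])
difference = [t for t, vals in g.items() if vals == ['R']]

print(union, intersection, difference)
# [(1, 'a'), (2, 'b'), (3, 'c')]  [(2, 'b')]  [(1, 'a')]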


Computing natural join by MapReduce

- two relations $R(A, B)$ and $S(B, C)$
- the goal is to find tuples that agree on their $B$ components
- the $B$-value of tuples from either relation may be used as the key
- the value will be the other component and the name of the relation

Map function:

- for each tuple $(a, b)$ of $R$ produce the key-value pair $(b, (R, a))$
- for each tuple $(b, c)$ of $S$ produce the key-value pair $(b, (S, c))$

Reduce function:

- each key value $b$ will be associated with a list of pairs that are either of the form $(R, a)$ or $(S, c)$
- construct all pairs consisting of one with first component $R$ and the other with first component $S$
- the output from this key and value list is a sequence of key-value pairs
  - the key is irrelevant
  - each value is one of the triples $(a, b, c)$ such that $(R, a)$ and $(S, c)$ are on the input list of values

If there are more than two attributes:

- $A$ represents the attributes in the schema of $R$ which are not in $S$
- $B$ represents the attributes in both schemas
- $C$ represents the attributes in $S$ only
- the key in the Map function is the list of values of all attributes that are in the schemas of both $R$ and $S$
- the value for a tuple from $R$ is the name $R$ together with the values of all the attributes belonging to $R$ but not to $S$
- similarly for a tuple from $S$
- the Reduce function looks at all key-value pairs with a given key and combines those values from $R$ with those values from $S$ in all possible ways
- from each pairing, the tuple produced has the values from $R$, the key values and the values from $S$
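A sketch of $R(A, B) \bowtie S(B, C)$ for the two-attribute case: the shared attribute $b$ is the key, each value carries the relation name, and the reducer combines every $(R, a)$ with every $(S, c)$ that share the same $b$; the example tuples are invented test data.

from collections import defaultdict

R = [('a1', 'b1'), ('a2', 'b1'), ('a3', 'b2')]     # tuples (a, b)
S = [('b1', 'c1'), ('b2', 'c2'), ('b2', 'c3')]     # tuples (b, c)

# Map: key on b, keep the other component plus the relation name
mapped = [(b, ('R', a)) for a, b in R] + [(b, ('S', c)) for b, c in S]

groups = defaultdict(list)
for b, value in mapped:
    groups[b].append(value)

# Reduce: pair every a from R with every c from S for the same b
joined = []
for b, values in groups.items():
    a_vals = [x for name, x in values if name == 'R']
    c_vals = [x for name, x in values if name == 'S']
    joined.extend((a, b, c) for a in a_vals for c in c_vals)

print(joined)
# [('a1','b1','c1'), ('a2','b1','c1'), ('a3','b2','c2'), ('a3','b2','c3')]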

Grouping and aggregation by MapReduce

Minimal example:

- relation $R(A, B, C)$
- we want to group by $A$ and aggregate on $B$, i.e. we have an operation $\gamma_{A, \theta(B)}(R)$ where $\theta$ is one of the five aggregation operations (SUM, COUNT, AVG, MAX, MIN)
- Map will perform the grouping
- Reduce does the aggregation

Map function:

- for each tuple $(a, b, c)$ produce the key-value pair $(a, b)$

Reduce function:

- each key $a$ represents a group
- apply the aggregation operator $\theta$ to the list $[b_1, b_2, \ldots, b_n]$ of $B$-values associated with key $a$
- output the pair $(a, x)$, where $x$ is the result of applying $\theta$ to the list
  - if $\theta$ is SUM, then $x = b_1 + b_2 + \cdots + b_n$
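A sketch of $\gamma_{A, \mathrm{SUM}(B)}(R)$ for the relation $R(A, B, C)$ from the minimal example; the aggregation $\theta$ is passed in as an ordinary Python function, so COUNT, MAX, etc. would work the same way.

from collections import defaultdict

R = [('x', 1, 'p'), ('x', 4, 'q'), ('y', 2, 'r')]   # tuples (a, b, c)
theta = sum                                         # the aggregation (SUM here)

# Map: drop attribute C and emit (a, b)
mapped = [(a, b) for a, b, _ in R]

# grouping by key a
groups = defaultdict(list)
for a, b in mapped:
    groups[a].append(b)

# Reduce: apply theta to the list of B-values of each group
print([(a, theta(bs)) for a, bs in groups.items()])   # [('x', 5), ('y', 2)]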


Matrix multiplication

Matrix $P$ is the product of two matrices $M$ and $N$ if

$$p_{ik} = \sum_{j} m_{ij} n_{jk}$$

It is required that the number of columns of $M$ equals the number of rows of $N$.

- a matrix can be interpreted as a relation with three attributes: the row number, the column number and the value in that row and column
  - matrix $M$ is a relation $M(I, J, V)$ with tuples $(i, j, m_{ij})$
  - matrix $N$ is a relation $N(J, K, W)$ with tuples $(j, k, n_{jk})$
  - we can omit the tuples for matrix elements that are 0 → the relational representation is a good choice for large and sparse matrices
- it is possible that $i$, $j$, and $k$ are implicit in the position of a matrix element in the file that represents it, rather than written explicitly with the element itself
  - in that case the Map function should construct the $I$, $J$, and $K$ components of tuples
- $P = MN$ is almost a natural join followed by grouping and aggregation
  - the natural join of $M(I, J, V)$ and $N(J, K, W)$, having only attribute $J$ in common, would produce tuples $(i, j, k, v, w)$ from each tuple $(i, j, v)$ in $M$ and tuple $(j, k, w)$ in $N$
  - this five-component tuple represents the pair of matrix elements $(m_{ij}, n_{jk})$
  - what we want instead is the product of these elements, i.e. the four-component tuple $(i, j, k, v \times w)$, because that represents the product $m_{ij} n_{jk}$
  - once we have this relation as the result of one MapReduce operation, we can perform grouping and aggregation, with $I$ and $K$ as the grouping attributes and the sum of $V \times W$ as the aggregation
- we can implement matrix multiplication as a cascade of two MapReduce operations

First Map function:

- for each matrix element $m_{ij}$ produce the key-value pair $(j, (M, i, m_{ij}))$
- for each matrix element $n_{jk}$ produce the key-value pair $(j, (N, k, n_{jk}))$
- use the names of the matrices (or even better a bit indicating them) for the $M$ and $N$ values in the above pairs

First Reduce function:

- for each key $j$, examine its list of associated values
- for each value $(M, i, m_{ij})$ that comes from $M$, and each value $(N, k, n_{jk})$ that comes from $N$, produce a key-value pair with key equal to $(i, k)$ and value equal to the product of these elements, $m_{ij} n_{jk}$

Second Map function:

- the identity - for every input element with key $(i, k)$ and value $v$, produce exactly this key-value pair

Second Reduce function:

- for each key $(i, k)$, produce the sum of the list of values associated with this key
- the result is a pair $((i, k), v)$, where $v$ is the value of the element in row $i$ and column $k$ of the matrix $P = MN$
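A compact sketch of the two-pass algorithm with both matrices stored as sparse triples; each dictionary below simulates one shuffle, and the example matrices are just small 2x2 test data.

from collections import defaultdict

M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # M(I, J, V) as (i, j, m_ij)
N = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # N(J, K, W) as (j, k, n_jk)

# first MapReduce: join on j and multiply matching elements
mapped = [(j, ('M', i, v)) for i, j, v in M] + [(j, ('N', k, w)) for j, k, w in N]
by_j = defaultdict(list)
for j, value in mapped:
    by_j[j].append(value)

products = []                                  # pairs ((i, k), m_ij * n_jk)
for j, values in by_j.items():
    from_M = [(i, v) for name, i, v in values if name == 'M']
    from_N = [(k, w) for name, k, w in values if name == 'N']
    products += [((i, k), v * w) for i, v in from_M for k, w in from_N]

# second MapReduce: identity Map, then sum the values for each key (i, k)
P = defaultdict(float)
for key, value in products:
    P[key] += value

print(dict(P))   # {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}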


Matrix multiplication with one MapReduce step

- we may want to perform matrix multiplication in a single MapReduce pass (although two passes are usually better in this particular task)
- possible if we put more work into the functions
- the Map function should be used to create the sets of matrix elements that are needed to compute each element of the answer $P$
- since an element of $M$ or $N$ contributes to many elements of the result, one input element will be turned into many key-value pairs
- keys will be pairs $(i, k)$, where $i$ is a row of $M$ and $k$ is a column of $N$

Map function:

- for each element $m_{ij}$ of $M$, produce all key-value pairs $((i, k), (M, j, m_{ij}))$ for $k = 1, 2, \ldots$ up to the number of columns of $N$
- for each element $n_{jk}$ of $N$, produce all key-value pairs $((i, k), (N, j, n_{jk}))$ for $i = 1, 2, \ldots$ up to the number of rows of $M$

Reduce function:

- each key $(i, k)$ will have an associated list with all values $(M, j, m_{ij})$ and $(N, j, n_{jk})$ for all possible values of $j$
- sort the values that begin with $M$ and the values that begin with $N$ by $j$, in separate lists
- the $j$-th values on each list must have their third components, $m_{ij}$ and $n_{jk}$, extracted and multiplied
- these products are summed and the result is paired with $(i, k)$ in the output
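The single-pass variant can be sketched the same way; note that the Map phase needs the dimensions of the result, since each element of M is replicated once per column of N and each element of N once per row of M. Instead of sorting the two lists by j, this sketch matches them via small dictionaries keyed on j, which gives the same sum.

from collections import defaultdict

M = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]   # (i, j, m_ij), a 2x2 matrix
N = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]   # (j, k, n_jk), a 2x2 matrix
rows_M, cols_N = 2, 2

# Map: replicate every element for each (i, k) it contributes to
mapped = []
for i, j, m_ij in M:
    mapped += [((i, k), ('M', j, m_ij)) for k in range(cols_N)]
for j, k, n_jk in N:
    mapped += [((i, k), ('N', j, n_jk)) for i in range(rows_M)]

groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: for key (i, k), match the M- and N-values on j and sum the products
P = {}
for (i, k), values in groups.items():
    m_vals = {j: v for name, j, v in values if name == 'M'}
    n_vals = {j: w for name, j, w in values if name == 'N'}
    P[(i, k)] = sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)

print(P)   # {(0, 0): 14.0, (0, 1): 12.0, (1, 0): 15.0, (1, 1): 18.0}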