COSC 6339
Big Data Analytics
Python MapReduce and
1st homework assignment
Edgar Gabriel
Spring 2018
pydoop
• Python interface to Hadoop that allows you to write
MapReduce applications in pure Python
• Offers several interesting features:
– MapReduce API that allows you to write pure Python mappers, reducers, record readers, record writers, partitioners, and combiners
• No Python method for creating your own InputFormat, but it is possible to include Java InputFormats
– Support for the HDFS API
Wordcount in pydoop
#!/usr/bin/env python
import pydoop.mapreduce.api as api
import pydoop.mapreduce.pipes as pp

class Mapper(api.Mapper):
    def map(self, context):
        # context.value holds one input record (here: one line of text)
        words = context.value.split()
        for w in words:
            context.emit(w, 1)

class Reducer(api.Reducer):
    def reduce(self, context):
        # context.values iterates over all counts emitted for context.key
        s = sum(context.values)
        context.emit(context.key, s)

def __main__():
    # pydoop submit invokes __main__ as the default entry point
    pp.run_task(pp.Factory(Mapper, Reducer))
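Since word counts are associative and commutative, the Reducer can double as a combiner to pre-aggregate counts on the map side and reduce shuffle traffic. A minimal sketch, assuming the pydoop version installed on the cluster supports the combiner_class argument of pp.Factory (check the pydoop documentation):

def __main__():
    # assumption: combiner_class is accepted by the installed pydoop version
    pp.run_task(pp.Factory(Mapper, Reducer, combiner_class=Reducer))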
Executing a pydoop job
• pydoop submit --num-reducers 5 --upload-file-to-cache wordcount_pydoop.py wordcount_pydoop /gabriel/books/ /gabriel/output
• pydoop submit --num-reducers 5 --input-format=org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat --upload-file-to-cache book_per_line.py book_per_line /gabriel/booklists.txt /gabriel/output8
pydoop submit -h
usage: pydoop submit [-h] [--num-reducers INT] [--no-override-home]
[--no-override-env] [--no-override-ld-path]
[--no-override-pypath] [--no-override-path]
[--set-env VAR=VALUE] [-D NAME=VALUE]
[--python-zip ZIP_FILE] [--upload-file-to-cache FILE]
[--upload-archive-to-cache FILE] [--log-level LEVEL]
[--job-name NAME] [--python-program PYTHON] [--pretend]
[--hadoop-conf HADOOP_CONF_FILE]
[--disable-property-name-conversion] [--mrv1]
[--local-fs] [--do-not-use-java-record-reader]
[--do-not-use-java-record-writer] [--input-format CLASS]
[--output-format CLASS]
[--job-conf NAME=VALUE [NAME=VALUE ...]]
[--libjars JAR_FILE] [--cache-file HDFS_FILE]
[--cache-archive HDFS_FILE] [--entry-point ENTRY_POINT]
[--avro-input k|v|kv] [--avro-output k|v|kv]
MODULE INPUT OUTPUT
gabriel@whale:> pydoop submit <omitting list of arguments>
17/02/12 08:38:33 INFO client.RMProxy: Connecting to ResourceManager at
whale/192.168.3.253:10040
17/02/12 08:38:34 WARN mapreduce.JobResourceUploader: No job jar file set.
User classes may not be found. See Job or Job#setJar(String).
17/02/12 08:38:34 INFO input.FileInputFormat: Total input paths to process: 1
17/02/12 08:38:35 INFO mapreduce.JobSubmitter: number of splits:1
…
17/02/12 08:38:49 INFO mapreduce.Job: map 0% reduce 0%
17/02/12 08:39:02 INFO mapreduce.Job: map 100% reduce 0%
17/02/12 08:39:12 INFO mapreduce.Job: map 100% reduce 100%
17/02/12 08:39:13 INFO mapreduce.Job: Job job_1486767070254_0001 completed
successfully
17/02/12 08:39:13 INFO mapreduce.Job: Counters: 51
File System Counters
FILE: Number of bytes read=2008
FILE: Number of bytes written=246163
…
Accessing files in HDFS using pydoop
import re
import pydoop.hdfs as hdfs

WORD_RE = re.compile(r"[\w']+")
text = hdfs.load(fulltitle)   # fulltitle holds the HDFS path of the input file
for w in WORD_RE.findall(text):
    print w
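The pydoop.hdfs module mirrors the usual file-system operations from Python. A short sketch of common calls, assuming your /bigd65 user directory as used in the HDFS examples later in these slides:

import pydoop.hdfs as hdfs

print hdfs.ls("/cosc6339_hw1/")            # list a directory
hdfs.dump("hello\n", "/bigd65/test.txt")   # write a string to an HDFS file
print hdfs.load("/bigd65/test.txt")        # read the file back as a string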
Cluster status webpage: https://whale.cs.uh.edu:8088/cluster
Debugging your application
• Major parts of your Python application can first be tested using the interactive Python shell
gabriel@whale> python
>>> import re
>>> import pydoop.hdfs as hdfs
>>> WORD_RE = re.compile(r"[\w']+")
>>> text = hdfs.load("/cosc6339_s17/books-shortlist/9055_bad+medicine.txt")
>>> for w in WORD_RE.findall(text):
...     print w
...
a
vlendish
manner
The
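You can also exercise the Mapper class itself outside Hadoop with a hand-rolled stand-in for the pydoop context. FakeContext below is hypothetical test scaffolding, not part of pydoop, and passing None to the Mapper constructor is an assumption that works for stateless mappers such as the wordcount one:

class FakeContext(object):
    # minimal stand-in for the pydoop context, for local testing only
    def __init__(self, value):
        self.value = value
        self.emitted = []
    def emit(self, key, value):
        self.emitted.append((key, value))

ctx = FakeContext("the quick brown fox the")
Mapper(None).map(ctx)   # Mapper from the wordcount example
print ctx.emitted       # [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('the', 1)]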
How to kill a job if it is hanging
bigd65@whale:> yarn application -list
16/02/05 17:01:23 INFO client.RMProxy: Connecting to ResourceManager at whale/192.168.3.253:10040
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1391622275771_0007 wordcount MAPREDUCE gabriel default RUNNING UNDEFINED 5% http://whale-001:58572
bigd65@whale:> yarn application -kill application_1391622275771_0007
16/02/05 17:01:38 INFO client.RMProxy: Connecting to ResourceManager at whale/192.168.3.253:10040
Killing application application_1391622275771_0007
16/02/05 17:01:38 INFO impl.YarnClientImpl: Killing application application_1391622275771_0007
• Retrieve the ID of your application:
bigd65@whale:> yarn application -list
• You can also retrieve the log files from across the cluster associated with your application after it finished or was killed, e.g.
bigd65@whale:> yarn logs -applicationId yourID
– You might want to redirect the output into a file, since logs can be large and exceed the buffering capability of your terminal, e.g.
yarn logs -applicationId yourID > output.log
Using HDFS
• If you want to run a MapReduce job on the cluster, the input data has to be in HDFS, and the result will also end up in HDFS
– The input data set is already available with read-only permission for all students in hdfs:///cosc6339_hw1/
• HDFS supports commands similar to those of a local UNIX file system:
hdfs dfs -ls /
hdfs dfs -ls /cosc6339_hw1/
hdfs dfs -mkdir /bigd65/newdir
hdfs dfs -rm /bigd65/file.txt
hdfs dfs -rm -r /bigd65/newdir
Using HDFS (II)
• Copying a file into HDFS:
hdfs dfs -put <localfilename> /bigd65/<remotefilename>
• Copying a file from HDFS into the local directory:
hdfs dfs -get /bigd65/output/part-r-00000 .
• Looking at the content of a file in HDFS:
hdfs dfs -cat /bigd65/filename.txt
• Merging multiple output files (each reducer produces a separate output file!):
hdfs dfs -getmerge /bigd65/output/part-* allparts.out
1st Homework
• Rules
– Each student should deliver
• Source code (.py files) compressed to a zip or tar.gz file
• Source code has to use Python 2.7
• Documentation (.pdf, .docx, or .txt file)
– explanations of the code
– answers to the questions
– Deliver electronically on Blackboard
– Expected by Friday, September 28, 11:59pm
– In case of questions: ask early!
1. Given a data set containing all flights which occurred between 2006 and 2008 in the US
– ~21 million flights listed in the file
– a small file with 286 flights is available in HDFS for code development
– each line is one flight, with information as listed on the next pages
a. Implement a MapReduce job which determines the percentage of delayed flights per Origin Airport
b. Implement a MapReduce job which determines the percentage of delayed flights per Origin Airport and Month
c. Determine the execution time of the code developed in parts a. and b. for the large data set using 1, 2, 4, and 8 reducers. Comment on the results.
Description of the input file
• Comma-separated list of data; the elements are explained on the next page
• more information available at
http://stat-computing.org/dataexpo/2009/the-data.html
2008,1,3,4,NA,905,NA,1025,WN,469,,NA,80,NA,NA,NA,LAX,SFO,337,NA,NA,1,A,0,NA,NA,NA,NA,NA
2008,1,3,4,1417,1345,1717,1645,WN,2524,N458WN,120,120,105,32,32,MDW,MHT,838,4,11,0,,0,28,0,0,0,4
2008,1,3,4,852,855,959,1015,WN,3602,N737JW,67,80,57,-16,-3,ONT,SMF,389,4,6,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,1726,1725,1932,1940,WN,563,N285WN,306,315,291,-8,1,RDU,LAS,2027,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,2014,1935,2129,2045,WN,1662,N461WN,75,70,47,44,39,SLC,BOI,291,3,25,0,,0,0,0,6,0,38
2008,1,4,5,1617,1610,1813,1810,WN,2374,N344SW,56,60,46,3,7,ABQ,MAF,332,3,7,0,,0,NA,NA,NA,NA,NA
2008,1,4,5,839,820,1019,1010,WN,535,N761RR,100,110,82,9,19,BWI,IND,515,5,13,0,,0,NA,NA,NA,NA,NA
2008,1,4,5,814,810,930,930,WN,502,N641SW,76,80,62,0,4,ELP,PHX,347,3,11,0,,0,NA,NA,NA,NA,NA
• Some values can be numeric or NA, and some values are missing entirely (i.e. there are two commas ,, in a row); see the parsing sketch below
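A minimal parsing sketch for one input line, assuming the 0-based field indices from the variable list on the next pages and treating a flight as delayed when ArrDelay is positive. Whether "delayed" should be based on ArrDelay, DepDelay, or both is your design decision for the assignment, so treat this purely as an illustration:

MONTH, ARR_DELAY, ORIGIN = 1, 14, 16   # 0-based column indices (assumed)

def parse_flight(line):
    fields = line.split(',')
    delay = fields[ARR_DELAY]
    if delay in ('NA', ''):
        return None                # no usable delay value (e.g. cancelled flight)
    # delayed flag: 1 if the flight arrived late, 0 otherwise (assumption)
    return fields[ORIGIN], fields[MONTH], 1 if int(delay) > 0 else 0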
Variable descriptions
Name Description
Year 1987-2008
Month 1-12
DayofMonth 1-31
DayOfWeek 1 (Monday) - 7 (Sunday)
DepTime actual departure time (local, hhmm)
CRSDepTime scheduled departure time (local, hhmm)
ArrTime actual arrival time (local, hhmm)
CRSArrTime scheduled arrival time (local, hhmm)
UniqueCarrier unique carrier code
FlightNum flight number
TailNum plane tail number
ActualElapsedTime in minutes
CRSElapsedTime in minutes
AirTime in minutes
Variable descriptions (II)
ArrDelay arrival delay, in minutes
DepDelay departure delay, in minutes
Origin origin IATA airport code
Dest destination IATA airport code
Distance in miles
TaxiIn taxi in time, in minutes
TaxiOut taxi out time, in minutes
Cancelled was the flight cancelled?
CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted 1 = yes, 0 = no
CarrierDelay in minutes
WeatherDelay in minutes
NASDelay in minutes
SecurityDelay in minutes
LateAircraftDelay in minutes
Input files
• Small input for development and testing available in HDFS in /cosc6339_hw1/flights-shortlist/sample-flights.csv
• Large input available in HDFS in /cosc6339_hw1/flights-longlist/allflights.csv
– Only use the large input file after you have confirmed that your code runs correctly with the small input file
• In fact, for the very first steps, you can probably initially create an even smaller test case with just a couple of lines/entries, e.g. as sketched below
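One way to create such a tiny test case (the local and HDFS file names here are just examples):

hdfs dfs -cat /cosc6339_hw1/flights-shortlist/sample-flights.csv | head -n 20 > tiny-flights.csv
hdfs dfs -put tiny-flights.csv /bigd65/tiny-flights.csv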
Output
• Every student has one directory in HDFS; please create subdirectories only in that directory!
hdfs dfs -ls /bigd65/
Documentation
• The documentation should contain
– (Brief) Problem description
– Solution strategy
– Description of how to run your code
– Results section
• Description of resources used
• Description of measurements performed
• Results (graphs/tables + findings)
• The document should not contain
– Replication of the entire source code – that’s why you
have to deliver the sources
– Screen shots of every single measurement you made
• Actually, no screen shots at all.
– The output files
Using the cluster
• Access to the cluster is only possible through ssh, e.g.
ssh -l bigd65 whale.cs.uh.edu
• Change the default password given to you upon first login: every other student knows your password as well!
use the passwd command
• The cluster will block your IP address for one hour after 5 unsuccessful login attempts
• Copying data files: use scp or sftp, e.g.
scp thisfile.py bigd65@whale.cs.uh.edu:
scp bigd65@whale.cs.uh.edu:thatfile.py .
– Be careful with editing files on Windows and then transferring them to the Linux cluster (the end-of-line marker is different!)
Additional resources
• Python: https://docs.python.org/2.7/tutorial/
• Pydoop: https://crs4.github.io/pydoop/index.html
• Whale cluster: http://pstl.cs.uh.edu/resources/whale
• Hadoop status: https://whale.cs.uh.edu:8088/cluster
– Webpage only available inside the university network or when using a VPN connection from outside the campus!