Megadata with Python and Hadoop

Post on 05-Dec-2014

DESCRIPTION

July 2010 Triangle Hadoop Users Group presentation

TRANSCRIPT

Processing Megadata With Python and Hadoop

July 2010 TriHUG

Ryan Cox

www.asciiarmor.com

0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999
0029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF108991999999999999999999
...

~130 GB NCDC CLIMATE DATASET

[Chart: x-axis years 1901 to 1997 in three-year steps, y-axis 0.0 to 5.0]

How can we make this scale? (and do more interesting things)

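# start below any real reading so an all-negative year is not reported as 0
high_temp = float("-inf")
for line in open('1901'):
    line = line.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    # "+9999" marks a missing reading; only quality codes 0, 1, 4, 5, 9 are kept
    if temp != "+9999" and quality in "01459":
        high_temp = max(high_temp, float(temp))

print(high_temp)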
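To see what the slices pick out, here is a minimal sketch that parses the first record of the sample data. The helper name `parse_record` and the two constants are illustrative and not part of the talk. In the NCDC format the temperature field is recorded in tenths of a degree Celsius, so `-0078` should read as -7.8 °C.

```python
# First record of the sample data shown earlier (station 029070, 1901-01-01 06:00).
RECORD = (
    "0029029070999991901010106004+64333+023450FM-12+000599999"
    "V0202701N015919999999N0000001N9-00781+99999102001"
    "ADDGF108991999999999999999999"
)

MISSING = "+9999"        # sentinel the slide's code treats as a missing temperature
GOOD_QUALITY = "01459"   # quality codes the slide's code accepts


def parse_record(line):
    """Hypothetical helper: return (year, temp, quality), or None if unusable."""
    line = line.strip()
    year, temp, quality = line[15:19], line[87:92], line[92:93]
    # An empty quality field would pass a bare `in` test, so reject it explicitly.
    if temp == MISSING or quality == "" or quality not in GOOD_QUALITY:
        return None
    return year, int(temp), quality


print(parse_record(RECORD))
```

Because the offsets are fixed, the same three slices work on every line of the dataset, which is what makes the per-line work easy to hand to a mapper.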

JEFFREY DEAN – GOOGLE – 2004

“Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”

MAPREDUCE IN PURE PYTHON

from functools import reduce

def mapper(line):
    line = line.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        return float(temp)
    return None

output = map(mapper, open('1901'))
# skipped records come back as None, so drop them before reducing
print(reduce(max, (t for t in output if t is not None)))
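The single-file version has no keys, so it never groups anything. As a rough sketch of the key/value flow the Dean quote describes, the following adds an explicit grouping step between map and reduce. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample readings are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply mapper to each record; a mapper yields zero or more (key, value) pairs."""
    for record in records:
        for key, value in mapper(record):
            yield key, value


def shuffle(pairs):
    """Group values by key, standing in for the framework's sort/shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Fold each key's values into a single result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Hypothetical, already-parsed (year, temperature) readings.
readings = [("1901", -78), ("1901", 22), ("1902", -11), ("1902", 5), ("1902", 61)]


def mapper(reading):
    year, temp = reading
    yield year, temp


result = reduce_phase(shuffle(map_phase(readings, mapper)), max)
print(result)
```

Only `shuffle` needs to see all the data at once; the map and reduce calls are independent per record and per key, which is the part a framework can spread across machines.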

HADOOP STREAMING

mapper.py

import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):
        print("%s\t%s" % (year, temp))

reducer.py

import sys

# input arrives sorted by key, so each year's lines are adjacent
(last_key, max_val) = (None, None)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if key != last_key:
        # key changed: emit the finished year, then start the next one
        if last_key is not None:
            print("%s\t%s" % (last_key, max_val))
        (last_key, max_val) = (key, int(val))
    else:
        max_val = max(max_val, int(val))

if last_key is not None:
    print("%s\t%s" % (last_key, max_val))

cat dataFile | mapper.py | sort | reducer.py
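The shell pipeline works because `sort` puts equal keys next to each other before the reducer sees them. A minimal in-process sketch of the same three stages follows. The `fake_record` helper and the sample values are invented for illustration and only place the fields at the offsets the slides use.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'year<TAB>temp' for each usable record, like mapper.py."""
    for line in lines:
        val = line.strip()
        year, temp, q = val[15:19], val[87:92], val[92:93]
        if temp != "+9999" and q and q in "01459":
            yield "%s\t%s" % (year, temp)


def reducer(sorted_lines):
    """Emit 'year<TAB>max' per key; relies on equal keys being adjacent."""
    pairs = (line.split("\t") for line in sorted_lines)
    for year, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%s" % (year, max(int(temp) for _, temp in group))


def fake_record(year, temp, quality="1"):
    """Hypothetical helper: pad a record so the fields sit at the slide's offsets."""
    return "0" * 15 + year + "0" * 68 + temp + quality


data = [
    fake_record("1902", "+0061"),
    fake_record("1901", "-0078"),
    fake_record("1901", "+0022"),
    fake_record("1902", "+9999"),       # missing reading, dropped by the mapper
    fake_record("1901", "+0300", "2"),  # rejected quality code
]

# Equivalent of: cat dataFile | mapper.py | sort | reducer.py
output = list(reducer(sorted(mapper(data))))
print("\n".join(output))
```

Dropping the `sorted` call would split a year into several groups whenever its records are not already adjacent, which is the same failure you would get by leaving `sort` out of the shell pipeline.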

DUMBO

def mapper(key, value):
    line = value.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        yield year, int(temp)

def reducer(key, values):
    yield key, max(values)

if __name__ == "__main__":
    import dumbo
    # the third argument is the combiner; max can safely be applied in stages
    dumbo.run(mapper, reducer, reducer)

DUMBO

• Ability to pass around Python objects
• Job / Iteration Abstraction
• Counter / Status Abstraction
• Simplified Joining mechanism
• Ability to use non-Java combiners
• Built-in library of mappers / reducers
• Excellent way to model MR algorithms
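On the combiner point: a combiner runs reducer-style logic on each map task's output before the shuffle, so less data is moved. The sketch below imitates that in plain Python without Dumbo or Hadoop. `run_job`, `group`, `run_phase` and the input splits are all hypothetical stand-ins, not Dumbo's API.

```python
def mapper(key, value):
    """Hypothetical mapper over already-parsed (year, temp) values."""
    year, temp = value
    yield year, temp


def reducer(key, values):
    yield key, max(values)


def group(pairs):
    """Group (key, value) pairs by key, as the framework would between phases."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return grouped


def run_phase(func, grouped):
    """Run a reducer-style function over every key group and collect its output."""
    return [pair for key, values in sorted(grouped.items()) for pair in func(key, values)]


def run_job(splits, mapper, reducer, combiner=None):
    shuffled = []
    for split in splits:                      # one map task per input split
        mapped = [p for k, v in split for p in mapper(k, v)]
        if combiner:                          # pre-aggregate on the map side
            mapped = run_phase(combiner, group(mapped))
        shuffled.extend(mapped)
    return shuffled, dict(run_phase(reducer, group(shuffled)))


splits = [
    [(0, ("1901", -78)), (1, ("1901", 22)), (2, ("1902", 5))],
    [(0, ("1901", 17)), (1, ("1902", 61)), (2, ("1902", -11))],
]

plain_traffic, plain = run_job(splits, mapper, reducer)
combined_traffic, combined = run_job(splits, mapper, reducer, combiner=reducer)
print(plain == combined, len(plain_traffic), len(combined_traffic))
```

Reusing the reducer as the combiner is only safe because taking a maximum of partial maxima gives the same answer; a mean, for example, would need a different combiner.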

ELASTIC MAP REDUCE

Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

CLI – API – Web Console

ELASTIC MAP REDUCE

Use Hadoop’s Job Tracker or Amazon’s ‘Debugger’

ELASTIC MAP REDUCE

CloudWatch metrics

QUIZ: HOW WOULD YOU DO THIS?

MAP REDUCE ALGORITHMS ARE DIFFERENT

BFS(G, s)    // G is the graph and s is the starting node
    for each vertex u ∈ V[G] - {s}
        do color[u] ← WHITE    // color of vertex u
           d[u] ← ∞            // distance from source s to vertex u
           π[u] ← NIL          // predecessor of u
    color[s] ← GRAY
    d[s] ← 0
    π[s] ← NIL
    Q ← Ø                      // Q is a FIFO queue
    ENQUEUE(Q, s)
    while Q ≠ Ø                // iterates as long as there are gray vertices
        do u ← DEQUEUE(Q)
           for each v ∈ Adj[u]
               do if color[v] = WHITE    // discover the undiscovered adjacent vertices
                      then color[v] ← GRAY    // enqueued whenever painted gray
                           d[v] ← d[u] + 1
                           π[v] ← u
                           ENQUEUE(Q, v)
           color[u] ← BLACK    // painted black whenever dequeued
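The queue in the textbook version is shared state, which a MapReduce job does not have. One common reformulation (not shown in the slides, so treat it as an assumption about what the quiz is hinting at) expands the frontier by one hop per MapReduce pass and repeats until no distance changes. A minimal single-process sketch, with all names and the example graph invented for illustration:

```python
INF = float("inf")


def bfs_mapper(node, state):
    """Re-emit the node's own record, then offer distance + 1 to each neighbour."""
    dist, adj = state
    yield node, ("node", dist, adj)
    if dist != INF:
        for neighbour in adj:
            yield neighbour, ("dist", dist + 1)


def bfs_reducer(node, values):
    """Keep the adjacency list and the smallest distance seen for this node."""
    best, adj = INF, []
    for value in values:
        if value[0] == "node":
            adj = value[2]
        best = min(best, value[1])
    yield node, (best, adj)


def run_round(graph):
    """One MapReduce pass: map every node, group by key, reduce every group."""
    grouped = {}
    for node, state in graph.items():
        for key, value in bfs_mapper(node, state):
            grouped.setdefault(key, []).append(value)
    return dict(pair for key, values in grouped.items() for pair in bfs_reducer(key, values))


def bfs(adjacency, source):
    graph = {n: (0 if n == source else INF, adj) for n, adj in adjacency.items()}
    while True:
        updated = run_round(graph)
        if updated == graph:           # no distance changed: the frontier is exhausted
            return {n: dist for n, (dist, _) in updated.items()}
        graph = updated


# Hypothetical example graph as an adjacency list.
adjacency = {"s": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"], "d": [], "e": ["s"]}
print(bfs(adjacency, "s"))
```

Each pass touches every node, so this does more total work than the queue-based version; the trade is that every pass is an ordinary map and reduce over independent records.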

MAP REDUCE ELSEWHERE

MongoDB

> m = function() { emit(this.user_id, 1); }
> r = function(k,vals) { return 1; }
> res = db.events.mapReduce(m, r, { query : {type:'sale'} });
> db[res.result].find().limit(2)
{ "_id" : 8321073716060 , "value" : 1 }
{ "_id" : 7921232311289 , "value" : 1 }

Riak

> {ok, [R]} = Client:mapred([{<<"groceries">>, <<"mine">>},
                             {<<"groceries">>, <<"yours">>}],
                            [{'map', {'qfun', Count}, 'none', false},
                             {'reduce', {'qfun', Merge}, 'none', true}]).
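Read as plain Python, the MongoDB example appears to compute the set of distinct users with at least one sale: the map emits `(user_id, 1)` and the reduce collapses each key to a single 1. The sketch below mirrors that logic over an in-memory list. The `map_reduce` function and the sample documents are hypothetical and do not use any MongoDB driver.

```python
# Hypothetical documents standing in for the `events` collection.
events = [
    {"user_id": 8321073716060, "type": "sale"},
    {"user_id": 8321073716060, "type": "sale"},
    {"user_id": 7921232311289, "type": "sale"},
    {"user_id": 1111111111111, "type": "view"},
]


def m(doc):
    """Counterpart of the JavaScript map function: emit (user_id, 1)."""
    yield doc["user_id"], 1


def r(key, values):
    """Counterpart of the JavaScript reduce function: collapse to a single 1."""
    return 1


def map_reduce(docs, mapper, reducer, query):
    """Filter by query, map each matching document, then reduce per key."""
    grouped = {}
    for doc in docs:
        if all(doc.get(field) == wanted for field, wanted in query.items()):
            for key, value in mapper(doc):
                grouped.setdefault(key, []).append(value)
    return [{"_id": key, "value": reducer(key, values)} for key, values in grouped.items()]


result = map_reduce(events, m, r, {"type": "sale"})
print(result)
```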

LEARN MORE

Hadoop: The Definitive Guide
http://www.hadoopbook.com

Dumbo
http://dumbotics.com/
http://github.com/klbostee/dumbo/

Elastic Map Reduce
http://aws.amazon.com/

Boto
http://github.com/boto

Getting Started Slides
http://www.slideshare.net/pacoid/getting-started-on-hadoop

DEMO
