Big Data with Python
DESCRIPTION
Online Python workshop at Mutirao Python, via pycursos.com https://www.youtube.com/watch?feature=player_embedded&v=DFh6l-h6-gw
TRANSCRIPT
![Page 1: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/1.jpg)
Big Data with Python
A gentle and simple introduction
Marcel Caraciolo
@marcelcaraciolo
Developer, scientist, contributor to the Crab recsys project; has worked with Python for 6 years; interested in mobile, education, machine learning and data!
Recife, Brazil - http://aimotion.blogspot.com
![Page 2: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/2.jpg)
https://class.coursera.org/datasci-001/
Disclaimer
Some slides were taken from the presentations of Bill Howe's course, Introduction to Data Science
![Page 3: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/3.jpg)
About me
Co-founder of Crab - Python recsys library
Chief Scientist at Atepassar, an e-learning social network
Co-Founder and Instructor of PyCursos, teaching Python on-line
Co-Founder of Pingmind, on-line infrastructure for MOOCs
Interested in Python, mobile, e-learning and machine learning!
![Page 4: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/4.jpg)
What is Big Data?
![Page 5: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/5.jpg)
Big Data
“Big Data is any data that is expensive to manage and hard to extract value from.”
Michael Franklin, Thomas M. Siebel Professor of Computer Science, Director of the Algorithms, Machines and People Lab, University of California, Berkeley
![Page 6: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/6.jpg)
Big Data
“The keepers of big data say they do it for the consumer’s benefit. But data have a way of being used for purposes other than originally intended.”
Erik Larson, 1989, Harper’s magazine
![Page 7: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/7.jpg)
Structured data
Multi-structured data
• Relational data with well-defined schemas.
• Social data, blogs, click stream, machine generated, XML, etc.
Big Data, lots of data.
![Page 8: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/8.jpg)
Challenges
Volume: the size of the data
Velocity: the latency of data processing relative to the growing demand for interactivity
Variety: the diversity of sources, formats, quality, structures
![Page 9: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/9.jpg)
Big data dimensions (the V's)
Volume
Variety, Velocity
Value
![Page 10: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/10.jpg)
Where does big data come from?
“data exhaust” from customers
new and pervasive sensors
the ability to “keep everything”
![Page 11: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/11.jpg)
Trends (Gartner)
Mobility, personal cloud
Multi-touch UI
Context-aware computing
Human-computer interface
App stores and marketplace
Advanced analytics, in-memory computing
Green data center, solid state drive
Flash memory
HTML5, social CRM
Mobile analytics
Big Data
![Page 12: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/12.jpg)
Where does big data come from?
4/28/13 Bill Howe, UW 10
photo in public domain
![Page 13: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/13.jpg)
Where does big data come from?
![Page 14: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/14.jpg)
The Problem: Facebook
[illegible] million active users as of March 2012
1 in 3 Internet users have a Facebook account
More than 30 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
Holds 30 PB of data for analysis, adds 12 TB of compressed data daily
![Page 15: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/15.jpg)
The Problem: Twitter
500 million users, 340 million daily tweets
1.6 billion search queries a day
[illegible] TB of data for analysis generated daily
![Page 16: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/16.jpg)
The Problem: Atepassar
[illegible] thousand users, [illegible] thousand items
Recommend items daily for [illegible] users
[illegible] of data for analysis generated daily
Traditional data storage, techniques & analysis tools just do not work at these scales!
![Page 17: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/17.jpg)
Scalability
![Page 18: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/18.jpg)
What does Scalable mean ?
Operationally:
In the past: “Works even if data doesn’t fit in main memory”
Now: “Can make use of 1000s of cheap computers”
Algorithmically:
In the past: If you have N data items, you must do no more than N^m operations -- “polynomial time algorithms”
Now: If you have N data items, you must do no more than N^m/k operations, for some large k -- “polynomial time algorithms must be parallelized”
Soon: If you have N data items, you should do no more than N * log(N) operations
![Page 19: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/19.jpg)
Example
Given a set of DNA sequences, find all sequences equal to “GATTACGATATTA”.
![Page 20: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/20.jpg)
GATTACGATATTA TACCTGCCGTAA
![Page 21: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/21.jpg)
5/15/13 Bill Howe, eScience Institute 4
TACCTGCCGTAA = GATTACGATATTA?
GATTACGATATTA
time = 0
No.
Time = 0
![Page 22: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/22.jpg)
Time = 1
CCCCCAATGAC = GATTACGATATTA?
GATTACGATATTA
time = 1
No.
![Page 23: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/23.jpg)
Time = 17
GATTACGATATTA contains GATTACGATATTA?
GATTACGATATTA
time = 17
Yes!
Send it to the output.
![Page 24: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/24.jpg)
GATTACGATATTA
40 records, 40 comparisons
N records, N comparisons
The algorithmic complexity is order N: O(N)
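The linear scan walked through on the previous slides can be sketched in a few lines of Python. This is only an illustration: the function name `linear_scan` and the four sample reads are assumptions, not part of the original deck.

```python
def linear_scan(sequences, target):
    """Return every sequence equal to target, checking one record at a time."""
    matches = []
    comparisons = 0
    for seq in sequences:
        comparisons += 1          # exactly one comparison per record: O(N)
        if seq == target:
            matches.append(seq)   # "send it to the output"
    return matches, comparisons


# Hypothetical sample data (the slides use 40 records).
reads = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA", "AAAATCCTGCA"]
found, n_comparisons = linear_scan(reads, "GATTACGATATTA")
print(found, n_comparisons)
```

The number of comparisons always equals the number of records, whatever the target is.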
![Page 25: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/25.jpg)
GATTACGATATTA
AAAATCCTGCA AAACGCCTGCA
TTTACGTCAA
TTTTCGTAATT
What if we sort the sequences?
![Page 26: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/26.jpg)
Time = 0
Start at the 50% mark
CTGTACACAACCT < GATTACGATATTA
No match.
Skip to the 75% mark
![Page 27: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/27.jpg)
Time = 1
GGATACACATTTA > GATTACGATATTA
No match.
Go back to the 62.5% mark
![Page 28: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/28.jpg)
GATATTTTAAGC < GATTACGATATTA
No match.
Skip forward to the 68.75% mark
![Page 29: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/29.jpg)
GATTACGATATTA = GATTACGATATTA
GATTACGATATTA
0% 100%
Match!
Walk through the records until we fail to match.
![Page 30: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/30.jpg)
GATTACGATATTA
0% 100%
How many comparisons did we do?
40 records, only 4 comparisons
N records, log(N) comparisons
This algorithm is O(log(N))
Far better scalability
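A minimal sketch of the sorted lookup, assuming the reads fit in a Python list. It uses the standard-library `bisect` module for the binary search; the helper name `sorted_lookup` and the sample reads (taken from the sequences shown on the slides) are illustrative choices.

```python
from bisect import bisect_left, bisect_right


def sorted_lookup(sorted_sequences, target):
    """Find all copies of target in a sorted list with O(log N) comparisons."""
    lo = bisect_left(sorted_sequences, target)   # first position >= target
    hi = bisect_right(sorted_sequences, target)  # first position > target
    return sorted_sequences[lo:hi]               # empty if target is absent


# Sorting is paid once; every later lookup is logarithmic.
reads = sorted([
    "TTTTCGTAATT", "GATTACGATATTA", "AAAATCCTGCA", "GGATACACATTTA",
    "CTGTACACAACCT", "AAACGCCTGCA", "GATATTTTAAGC", "TTTACGTCAA",
])
print(sorted_lookup(reads, "GATTACGATATTA"))
```

Slicing between the two bisection points plays the role of "walk through the records until we fail to match".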
![Page 31: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/31.jpg)
Relational Databases
• Databases are good at Needle in Haystack problems:
– Extracting small results from big datasets
– Transparently provide old style scalability
– Your query will always* finish, regardless of dataset size.
– Indexes are easily built and automatically used when appropriate
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq FROM sequence WHERE seq = 'GATTACGATATTA';
*almost
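The same index-backed lookup can be tried from Python with the standard-library `sqlite3` module. This is a sketch under assumptions: SQLite stands in for whatever database the slide had in mind, and the sample rows are made up; only the table, column and index names come from the slide's SQL.

```python
import sqlite3

# Hypothetical sample rows.
reads = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA", "AAAATCCTGCA"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (seq TEXT)")
conn.executemany("INSERT INTO sequence (seq) VALUES (?)", [(r,) for r in reads])
conn.execute("CREATE INDEX seq_idx ON sequence(seq)")

query = "SELECT seq FROM sequence WHERE seq = ?"
rows = conn.execute(query, ("GATTACGATATTA",)).fetchall()

# Ask SQLite how it plans to run the query; the last column describes the plan.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("GATTACGATATTA",)).fetchall()
plan_detail = plan[0][-1]
print(rows, plan_detail)
conn.close()
```

The plan text should mention `seq_idx`, which is the "automatically used when appropriate" part of the slide.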
![Page 32: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/32.jpg)
Example
New task: Read Trimming
Given a set of DNA sequences, trim the final n base pairs (bps) of each sequence
Generate a new dataset
![Page 33: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/33.jpg)
GATTACGATATTA TACCTGCCGTAA
![Page 34: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/34.jpg)
TACCTGCCGTAA becomes TACCT
time = 0
Time = 0
![Page 35: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/35.jpg)
Time = 1
CCCCCAATGAC becomes CCCCC
time = 1
![Page 36: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/36.jpg)
Time = 17
GATTACGATATTA becomes GATTA
time = 17
![Page 37: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/37.jpg)
Can we use an index?
No. We have to touch every record no matter what.
The task is fundamentally O(N)
Can we do any better?
![Page 38: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/38.jpg)
![Page 39: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/39.jpg)
time = 0
Time = 0
![Page 40: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/40.jpg)
Time = 1
time = 1
![Page 41: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/41.jpg)
Time = 2
time = 2
![Page 42: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/42.jpg)
Time = 3
time = 3
![Page 43: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/43.jpg)
Time = 7
time = 7
How much time did this take?
7 cycles
40 records, 6 workers
O(N/k)
![Page 44: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/44.jpg)
Schematic of a Parallel “Read Trimming” Task
You are given short “reads”: genomic sequences about 35-75 characters each
Distribute the reads among k computers
f is a function to trim a read; apply it to every item
Now we have a big distributed set of trimmed reads
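The schematic can be imitated on one machine, assuming a thread pool stands in for the k computers. The names `trim_read` and `parallel_trim` and the value n = 7 are hypothetical; a real deployment would use separate processes or machines rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def trim_read(read, n):
    """f: drop the final n characters (base pairs) of a single read."""
    return read[:-n] if n > 0 else read


def parallel_trim(reads, k=6, n=7):
    """Apply trim_read to every read using k workers; input order is kept."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(lambda r: trim_read(r, n), reads))


reads = ["TACCTGCCGTAA", "CCCCCAATGAC", "GATTACGATATTA"]
trimmed = parallel_trim(reads, k=2, n=7)
print(trimmed)
```

Every record is still touched once, so the total work stays O(N); only the elapsed time shrinks toward O(N/k).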
![Page 45: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/45.jpg)
New Task: Convert 405k TIFF images to PNG
http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/
You are given TIFF images
Distribute the images among k computers
f is a function to convert TIFF to PNG; apply it to every item
Now we have a big distributed set of converted images
![Page 46: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/46.jpg)
New Task: Run thousands of simulations
http://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloud
You have sets of parameters for thousands of small simulations
Divide the parameter sets among k computers
f runs the simulation and produces some output; apply it to every item
Now we have a big distributed set of simulation results
![Page 47: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/47.jpg)
Find the most common word in each document
You have millions of documents
Distribute the documents among k computers
f finds the most common word in a single document
Now we have a big distributed list of (doc_id, word) pairs
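One possible shape for f in this task, sketched with `collections.Counter`. The tokenizing regex, the function name `most_common_word` and the two tiny documents are assumptions made for illustration.

```python
import re
from collections import Counter


def most_common_word(doc_id, text):
    """f: map one document to a (doc_id, word) pair."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return doc_id, None                   # empty document: no word to report
    word, _count = Counter(words).most_common(1)[0]
    return doc_id, word


# Hypothetical documents keyed by id.
documents = {"d1": "the cat and the hat", "d2": "to be or not to be to"}
pairs = [most_common_word(doc_id, text) for doc_id, text in documents.items()]
print(pairs)
```

Each call looks at a single document only, which is what makes it safe to spread the calls over k computers.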
![Page 48: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/48.jpg)
Consider a slightly more general program to compute the word frequency of every word in a single document
Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. 
the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.
(people, 2) (government, 6) (assume, 1) (history, 2) …
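A sketch of the more general program, again assuming a simple regex tokenizer. The function name `word_frequencies` is hypothetical and the sample text is one sentence from the Declaration excerpt above, so the counts differ from the slide's totals for the full text.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Map one document to a list of (word, freq) pairs."""
    # Keep hyphenated and apostrophized words such as "self-evident" whole.
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    return list(Counter(words).items())


sentence = (
    "We hold these truths to be self-evident; that all men are created equal "
    "and independent; that from that equal creation they derive rights "
    "inherent and inalienable"
)
freq = dict(word_frequencies(sentence))
print(freq["that"], freq["equal"], freq["and"])
```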
![Page 49: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/49.jpg)
Compute the word frequency of 5M documents
You have millions of documents
Distribute the documents among k computers
For each document, f returns a set of (word, freq) pairs
Now we have a big distributed list of sets of word freqs.
![Page 50: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/50.jpg)
There’s a pattern here…
• A function that maps a read to a trimmed read
• A function that maps a TIFF image to a PNG image
• A function that maps a set of parameters to a simulation result
• A function that maps a document to its most common word
• A function that maps a document to a histogram of word frequencies
![Page 51: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/51.jpg)
What if we want to compute the word frequency across all documents?
US Constitution Declaration of Independence
Articles of Confederation
(people, 78) (government, 123)
(assume, 23) (history, 38)
…
![Page 52: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/52.jpg)
Compute the word frequency across 5M documents
You have millions of documents
Distribute the documents among k computers
map: For each document, return a set of (word, freq) pairs
Now what?
• But we don’t want a bunch of little histograms – we want one big histogram.
• How can we make sure that a single computer has access to every occurrence of a given word regardless of which document it appeared in?
![Page 53: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/53.jpg)
Compute the word frequency across 5M documents
Distribute the documents among k computers
map: For each document, return a set of (word, freq) pairs
Now we have a big distributed list of sets of word freqs.
reduce: Now just count the occurrences of each word
We have our distributed histogram
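The three phases can be simulated in a single Python process to see how the little histograms become one big histogram. This is a toy sketch, not a distributed implementation: the function names `map_doc`, `shuffle` and `reduce_word` and the three short documents are assumptions.

```python
from collections import Counter, defaultdict


def map_doc(text):
    """Map: one document -> list of (word, freq) pairs (a little histogram)."""
    return list(Counter(text.lower().split()).items())


def shuffle(mapped):
    """Shuffle: gather every freq emitted for a word, whatever its document."""
    groups = defaultdict(list)
    for pairs in mapped:
        for word, freq in pairs:
            groups[word].append(freq)
    return groups


def reduce_word(word, freqs):
    """Reduce: add up the per-document counts for one word."""
    return word, sum(freqs)


docs = [
    "the people and the government",
    "the history of the people",
    "government of the people",
]
mapped = [map_doc(d) for d in docs]
histogram = dict(reduce_word(w, f) for w, f in shuffle(mapped).items())
print(histogram)
```

In a real cluster the shuffle is the step that routes all pairs with the same word to the same reducer machine.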
![Page 54: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/54.jpg)
Some distributed algorithm…
Map
(Shuffle)
Reduce
![Page 55: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/55.jpg)
MapReduce Programming Model
• Input & Output: each a set of key/value pairs
• Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
– Processes input key/value pair
– Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
– Combines all intermediate values for a particular key
– Produces a set of merged output values (usually just one)
Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
slide source: Google, Inc.
![Page 56: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/56.jpg)
Example: What does this do?
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, 1);
reduce(String output_key, Iterator intermediate_values):
    // output_key: word
    // output_values: ????
    int result = 0;
    for each v in intermediate_values:
        result += v;
    Emit(result);
slide source: Google, Inc.
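A Python rendering of the pseudocode makes the answer visible: it counts how often each word occurs across all documents. `map_fn` and `reduce_fn` follow the pseudocode; the `run` driver, which sorts and groups the intermediate pairs in place of a real shuffle, is a hypothetical stand-in for the framework.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(input_key, input_value):
    # input_key: document name; input_value: document contents
    for word in input_value.split():
        yield word, 1                      # EmitIntermediate(w, 1)


def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: an iterator of 1s
    yield output_key, sum(intermediate_values)


def run(documents):
    """Hypothetical single-process driver: map, sort/group (shuffle), reduce."""
    intermediate = []
    for name, contents in documents.items():
        intermediate.extend(map_fn(name, contents))
    intermediate.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(key, (value for _, value in group)))
    return results


counts = run({"a": "to be or not to be", "b": "to do"})
print(counts)
```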
![Page 57: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/57.jpg)
Example
3/12/09 Bill Howe, eScience Institute 14
Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Document Processing
![Page 58: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/58.jpg)
Word length histogram example
Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Word length histogram
How many big, medium, and small words are used?
![Page 59: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/59.jpg)
Word length histogram example
Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General
Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from
that subordination in which they have hitherto remained, and to assume among powers of
the earth the equal and independent station to which the laws of nature and of nature's
god entitle them, a decent respect to the opinions of mankind requires that they should
declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent;
that from that equal creation they derive rights inherent and inalienable, among which are
the preservation of life, and liberty, and the pursuit of happiness; that to secure these
ends, governments are instituted among men, deriving their just power from the consent
of the governed; that whenever any form of government shall become destructive of these
ends, it is the right of the people to alter or to abolish it, and to institute new government,
laying it's foundation on such principles and organizing it's power in such form, as to
them shall seem most likely to effect their safety and happiness. Prudence indeed will
dictate that governments long established should not be changed for light and transient
causes: and accordingly all experience hath shewn that mankind are more disposed to
suffer while evils are sufferable, than to right themselves by abolishing the forms to
which they are accustomed. But when a long train of abuses and usurpations, begun at a
distinguished period, and pursuing invariably the same object, evinces a design to reduce
them to arbitrary power, it is their right, it is their duty, to throw off such government and
to provide new guards for future security. Such has been the patient sufferings of the
colonies; and such is now the necessity which constrains them to expunge their former
systems of government. the history of his present majesty is a history of unremitting
injuries and usurpations, among which no one fact stands single or solitary to contradict
the uniform tenor of the rest, all of which have in direct object the establishment of an
absolute tyranny over these states. To prove this, let facts be submitted to a candid world,
for the truth of which we pledge a faith yet unsullied by falsehood.
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
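The colour key translates directly into a map function that emits a size class instead of the word itself. A minimal sketch follows; the names `size_class`, `map_lengths` and `reduce_counts` are illustrative, and the sample is only the first clause of the text, not the whole excerpt.

```python
import re
from collections import Counter


def size_class(word):
    """Classify a word by length, following the slide's colour key."""
    n = len(word)
    if n >= 10:
        return "big"
    if n >= 5:
        return "medium"
    if n >= 2:
        return "small"
    return "tiny"


def map_lengths(text):
    """Map: emit (size_class, 1) for every word in the text."""
    for word in re.findall(r"[A-Za-z]+", text):
        yield size_class(word), 1


def reduce_counts(pairs):
    """Reduce: sum the 1s for each size class."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


sample = "When in the course of human events it becomes necessary for a people"
histogram = reduce_counts(map_lengths(sample))
print(histogram)
```

Only the key emitted by the map step changed compared with word counting; the reduce step is the same summation.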
Example: Word length histogram
![Page 60: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/60.jpg)
Word length histogram example
Abridged Declaration of Independence
Split the document into chunks and process each chunk on a different computer
Chunk 1
Chunk 2
Example: Word length histogram
![Page 61: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/61.jpg)
Word length histogram example
Abridged Declaration of Independence
Map Task 1 (204 words): (yellow, 17) (red, 77) (blue, 107) (pink, 3)
Map Task 2 (190 words): (yellow, 20) (red, 71) (blue, 93) (pink, 6)
Output format: (key, value)
Example: Word length histogram
![Page 62: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/62.jpg)
Word length histogram example
3/12/09 Bill Howe, eScience Institute 19
Map task 1: (yellow, 17) (red, 77) (blue, 107) (pink, 3)
Map task 2: (yellow, 20) (red, 71) (blue, 93) (pink, 6)
Shuffle step: (yellow, 17) (yellow, 20) (red, 77) (red, 71) (blue, 93) (blue, 107) (pink, 6) (pink, 3)
Reduce tasks: (yellow, 37) (red, 148) (blue, 200) (pink, 9)
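The word length histogram above can be sketched in plain Python, with no cluster involved. This is only an illustration of the three steps: the function names (`bucket`, `map_task`, `shuffle`, `reduce_task`) are hypothetical, and words are split on whitespace, so attached punctuation counts toward the length (an assumption, the slides do not say how words are tokenized).

```python
from collections import Counter, defaultdict


def bucket(word):
    """Map a word to the colour bucket used on the slides, by length."""
    n = len(word)
    if n >= 10:
        return "yellow"   # Big: 10+ letters
    if n >= 5:
        return "red"      # Medium: 5..9 letters
    if n >= 2:
        return "blue"     # Small: 2..4 letters
    return "pink"         # Tiny: 1 letter


def map_task(chunk):
    """Map: count the words of one chunk per bucket, emit (key, value) pairs."""
    return list(Counter(bucket(w) for w in chunk.split()).items())


def shuffle(map_outputs):
    """Shuffle: group the values emitted by all map tasks by key."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)


def reduce_task(key, values):
    """Reduce: add up the partial counts for one key."""
    return key, sum(values)


# Partial counts copied from the two map tasks on the slides.
map_1 = [("yellow", 17), ("red", 77), ("blue", 107), ("pink", 3)]
map_2 = [("yellow", 20), ("red", 71), ("blue", 93), ("pink", 6)]
totals = dict(reduce_task(k, v) for k, v in shuffle([map_1, map_2]).items())
print(totals)
```

Feeding the two partial results from the slides through `shuffle` and `reduce_task` gives the same totals as the reduce step: 37 yellow, 148 red, 200 blue and 9 pink.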
![Page 63: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/63.jpg)
MR Phases
• Each Map and Reduce task has multiple phases:
![Page 64: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/64.jpg)
MR Phases
MAP: (did1, v1), (did2, v2), (did3, v3), ... → (w1, 1), (w2, 1), (w1, 1), ...
Shuffle: (w1, (1, 1, 1, …, 1)), (w2, (1, 1, …)), (w3, (1, …)), ...
REDUCE: (w1, 25), (w2, 77), (w3, 12), ...
![Page 65: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/65.jpg)
MR Phases
5/15/13 Bill Howe, eScience Institute 21
Hadoop in One Slide
src: Huy Vo, NYU Poly
![Page 66: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/66.jpg)
Taxonomy of Parallel Architectures
Easiest to program, but $$$$
Scales to 1000s of computers
![Page 67: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/67.jpg)
![Page 68: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/68.jpg)
![Page 69: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/69.jpg)
Large-Scale Data Processing
• Many tasks process big data, produce big data
• Want to use hundreds or thousands of CPUs
  – ... but this needs to be easy
  – Parallel databases exist, but they are expensive, difficult to set up, and do not necessarily scale to hundreds of nodes.
• MapReduce is a lightweight framework, providing:
  – Automatic parallelization and distribution
  – Fault-tolerance
  – I/O scheduling
  – Status and monitoring
![Page 70: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/70.jpg)
MapReduce Contemporaries
• Dryad (Microsoft)
  – Relational Algebra
• Pig (Yahoo)
  – Near Relational Algebra over MapReduce
• HIVE (Facebook)
  – SQL over MapReduce
• Cascading
  – Relational Algebra
• Clustera
  – U of Wisconsin
• Hbase
  – Indexing on HDFS
![Page 71: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/71.jpg)
Talk is cheap, show me the code
![Page 72: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/72.jpg)
Examples
More Examples: Build an Inverted Index
Input:
tweet1, (“I love pancakes for breakfast”)
tweet2, (“I dislike pancakes”)
tweet3, (“What should I eat for breakfast?”)
tweet4, (“I love to eat”)
Desired output:
“pancakes”, (tweet1, tweet2)
“breakfast”, (tweet1, tweet3)
“eat”, (tweet3, tweet4)
“love”, (tweet1, tweet4)
…
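A minimal sketch of the inverted index as a map and a reduce step, run in memory on the four tweets above. The tokenization (lowercase, letters and apostrophes only) and the helper names are assumptions made for this illustration; the slides only give the input and the desired output.

```python
import re
from collections import defaultdict

TWEETS = {
    "tweet1": "I love pancakes for breakfast",
    "tweet2": "I dislike pancakes",
    "tweet3": "What should I eat for breakfast?",
    "tweet4": "I love to eat",
}


def mapper(doc_id, text):
    """Map: emit (word, doc_id) once per distinct word in the document."""
    for word in set(re.findall(r"[a-z']+", text.lower())):
        yield word, doc_id


def reducer(word, doc_ids):
    """Reduce: the posting list of a word is the sorted list of its documents."""
    return word, sorted(doc_ids)


def build_index(docs):
    grouped = defaultdict(list)            # stands in for the shuffle step
    for doc_id, text in docs.items():
        for word, value in mapper(doc_id, text):
            grouped[word].append(value)
    return dict(reducer(word, ids) for word, ids in grouped.items())


index = build_index(TWEETS)
print(index["pancakes"], index["breakfast"])
```

The key of the map output is the word, not the tweet, so the shuffle brings together every tweet that contains that word.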
![Page 73: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/73.jpg)
Examples
More Examples: Relational Join
Employee
Name  SSN
Sue   999999999
Tony  777777777
Assigned Departments
EmpSSN     DepName
999999999  Accounts
777777777  Sales
777777777  Marketing
Employee ⨝ Assigned Departments
Name  SSN        EmpSSN     DepName
Sue   999999999  999999999  Accounts
Tony  777777777  777777777  Sales
Tony  777777777  777777777  Marketing
![Page 74: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/74.jpg)
Examples
Relational Join in MapReduce: Before Map Phase
Employee
Name  SSN
Sue   999999999
Tony  777777777
Assigned Departments
EmpSSN     DepName
999999999  Accounts
777777777  Sales
777777777  Marketing
Key idea: Lump all the tuples together into one dataset
Employee, Sue, 999999999
Employee, Tony, 777777777
Department, 999999999, Accounts
Department, 777777777, Sales
Department, 777777777, Marketing
What is this for?
![Page 75: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/75.jpg)
Examples
Relational Join in MapReduce: Map Phase
Employee, Sue, 999999999
Employee, Tony, 777777777
Department, 999999999, Accounts
Department, 777777777, Sales
Department, 777777777, Marketing
key=999999999, value=(Employee, Sue, 999999999)
key=777777777, value=(Employee, Tony, 777777777)
key=999999999, value=(Department, 999999999, Accounts)
key=777777777, value=(Department, 777777777, Sales)
key=777777777, value=(Department, 777777777, Marketing)
Why do we use this as the key?
![Page 76: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/76.jpg)
Examples
Relational Join in MapReduce: Reduce Phase
key=999999999, values=[(Employee, Sue, 999999999), (Department, 999999999, Accounts)]
key=777777777, values=[(Employee, Tony, 777777777), (Department, 777777777, Sales), (Department, 777777777, Marketing)]
Sue, 999999999, 999999999, Accounts
Tony, 777777777, 777777777, Sales
Tony, 777777777, 777777777, Marketing
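The three phases of the join can be sketched in plain Python on the same tuples. This is an in-memory illustration, not a Hadoop job: the relation tag in the first field tells the mapper where the SSN sits and tells the reducer which side of the join a tuple belongs to. The function names are hypothetical.

```python
from collections import defaultdict

# All tuples lumped into one dataset, tagged with their relation name.
RECORDS = [
    ("Employee", "Sue", "999999999"),
    ("Employee", "Tony", "777777777"),
    ("Department", "999999999", "Accounts"),
    ("Department", "777777777", "Sales"),
    ("Department", "777777777", "Marketing"),
]


def mapper(record):
    """Map: key every tuple by the join attribute (the SSN)."""
    relation = record[0]
    ssn = record[2] if relation == "Employee" else record[1]
    yield ssn, record


def reducer(ssn, records):
    """Reduce: pair every Employee tuple with every Department tuple of one key."""
    employees = [r for r in records if r[0] == "Employee"]
    departments = [r for r in records if r[0] == "Department"]
    for _, name, emp_ssn in employees:
        for _, dep_ssn, dep_name in departments:
            yield name, emp_ssn, dep_ssn, dep_name


def join(records):
    grouped = defaultdict(list)            # stands in for the shuffle step
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    joined = []
    for key, values in grouped.items():
        joined.extend(reducer(key, values))
    return joined


joined = join(RECORDS)
for row in joined:
    print(", ".join(row))
```

Using the SSN as the key is what guarantees that an employee and all of its departments end up in the same reducer call.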
![Page 77: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/77.jpg)
Examples
Relational Join in MapReduce, again
5/15/13 Bill Howe, Data Science, Autumn 2012 19
Order(orderid, account, date)
1, aaa, d1
2, aaa, d2
3, bbb, d3
LineItem(orderid, itemid, qty)
1, 10, 1
1, 20, 3
2, 10, 5
2, 50, 100
3, 20, 1
Map (each tuple is tagged with its relation name)
Order
1, aaa, d1 → 1 : “Order”, (1, aaa, d1)
2, aaa, d2 → 2 : “Order”, (2, aaa, d2)
3, bbb, d3 → 3 : “Order”, (3, bbb, d3)
Line
1, 10, 1 → 1 : “Line”, (1, 10, 1)
1, 20, 3 → 1 : “Line”, (1, 20, 3)
2, 10, 5 → 2 : “Line”, (2, 10, 5)
2, 50, 100 → 2 : “Line”, (2, 50, 100)
3, 20, 1 → 3 : “Line”, (3, 20, 1)
Reducer for key 1
“Order”, (1, aaa, d1)
“Line”, (1, 10, 1)
“Line”, (1, 20, 3)
(1, aaa, d1, 1, 10, 1)
(1, aaa, d1, 1, 20, 3)
![Page 78: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/78.jpg)
Examples
Simple Social Network Analysis: Count Friends
Input: (Jim, Sue) (Sue, Jim) (Lin, Joe) (Joe, Lin) (Jim, Kai) (Kai, Jim) (Jim, Lin) (Lin, Jim)
MAP: (Jim, 1) (Sue, 1) (Lin, 1) (Joe, 1) (Jim, 1) (Kai, 1) (Jim, 1) (Lin, 1)
SHUFFLE: Jim, (1, 1, 1); Lin, (1, 1); Sue, (1); Kai, (1); Joe, (1)
REDUCE (desired output): (Jim, 3) (Lin, 2) (Sue, 1) (Kai, 1) (Joe, 1)
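The friend count follows the same pattern as a word count: emit a 1 per friendship edge, group by person, sum. A minimal in-memory sketch on the edge list from the slide (the helper names are hypothetical):

```python
from collections import defaultdict

EDGES = [
    ("Jim", "Sue"), ("Sue", "Jim"), ("Lin", "Joe"), ("Joe", "Lin"),
    ("Jim", "Kai"), ("Kai", "Jim"), ("Jim", "Lin"), ("Lin", "Jim"),
]


def mapper(edge):
    """Map: each (person, friend) edge contributes one friend to `person`."""
    person, _friend = edge
    yield person, 1


def reducer(person, ones):
    """Reduce: the number of friends is the sum of the emitted ones."""
    yield person, sum(ones)


def count_friends(edges):
    shuffled = defaultdict(list)           # stands in for the shuffle step
    for edge in edges:
        for key, value in mapper(edge):
            shuffled[key].append(value)
    counts = {}
    for key, values in shuffled.items():
        for person, total in reducer(key, values):
            counts[person] = total
    return counts


friend_counts = count_friends(EDGES)
print(friend_counts)
```

Because every friendship appears in both directions in the input, counting only the first field of each edge is enough.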
![Page 79: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/79.jpg)
Examples
Matrix Multiplication
5/11/13 Bill Howe, UW 1
Examples
Matrix Multiply in MapReduce
C = A X B
A has dimensions L,M
B has dimensions M,N
• In the map phase:
  – for each element (i,j) of A, emit ((i,k), A[i,j]) for k in 1..N
  – for each element (j,k) of B, emit ((i,k), B[j,k]) for i in 1..L
• In the reduce phase, emit
  – key = (i,k)
  – value = Sum_j (A[i,j] * B[j,k])
![Page 81: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/81.jpg)
Examples
A X B = AB
• One reducer per output cell
• Each reducer computes Sum_j (A[i,j] * B[j,k])
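The map and reduce rules for C = A X B can be sketched in plain Python with 0-based indices. One detail is an assumption added here: each emitted value also carries the matrix name and the index j, because the reducer needs them to pair A[i,j] with the matching B[j,k]; the slide only lists the numeric value.

```python
from collections import defaultdict


def map_phase(A, B):
    """Map: send every element to each output cell (i, k) that needs it."""
    L, M, N = len(A), len(B), len(B[0])
    for i in range(L):
        for j in range(M):
            for k in range(N):
                yield (i, k), ("A", j, A[i][j])
    for j in range(M):
        for k in range(N):
            for i in range(L):
                yield (i, k), ("B", j, B[j][k])


def reduce_phase(key, values):
    """Reduce: one call per output cell, computing Sum_j (A[i,j] * B[j,k])."""
    a = {j: v for tag, j, v in values if tag == "A"}
    b = {j: v for tag, j, v in values if tag == "B"}
    return key, sum(a[j] * b[j] for j in a)


def matmul(A, B):
    grouped = defaultdict(list)            # stands in for the shuffle step
    for key, value in map_phase(A, B):
        grouped[key].append(value)
    C = [[0] * len(B[0]) for _ in A]
    for key, values in grouped.items():
        (i, k), total = reduce_phase(key, values)
        C[i][k] = total
    return C


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Every element of A is replicated N times and every element of B is replicated L times, which is the price of letting each output cell be computed independently.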
![Page 82: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/82.jpg)
Meeting Hadoop
![Page 83: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/83.jpg)
Cluster Computing
• Large number of commodity servers, connected by high speed, commodity network
• Rack: holds a small number of servers
• Data center: holds many racks
CSE 344 – Winter 2012
Mining of Massive Datasets, by Rajaraman and Ullman
![Page 84: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/84.jpg)
Cluster Computing
• Massive parallelism:
  – 100s, or 1000s, or 10000s servers
  – Many hours
• Failure:
  – If mean time between failures is 1 year
  – Then 10000 servers have about one failure / hour
slide src: Dan Suciu and Magda Balazinska
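The failure estimate is simple arithmetic: a year has 8760 hours, so 10000 servers that each fail about once a year produce a little more than one failure per hour. A small sketch, assuming servers fail independently and at a constant rate (the function name is hypothetical):

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours


def expected_failures_per_hour(servers, mtbf_hours):
    """Expected failures per hour across a cluster, assuming each server
    fails independently at a constant rate of 1 / mtbf_hours."""
    return servers / mtbf_hours


rate = expected_failures_per_hour(10_000, HOURS_PER_YEAR)
print(f"{rate:.2f} failures per hour")
```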
![Page 85: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/85.jpg)
Distributed File System (DFS)
• For very large files: TBs, PBs
• Each file is partitioned into chunks, typically 64MB
• Each chunk is replicated several times (≥3), on different racks, for fault tolerance
• Implementations:
  – Google’s DFS: GFS, proprietary
  – Hadoop’s DFS: HDFS, open source
slide src: Dan Suciu and Magda Balazinska
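To get a feel for these numbers, the sketch below computes how many chunks and replicas a file produces with 64MB chunks and 3 replicas. It is back-of-the-envelope arithmetic only: the function name is hypothetical, sizes are taken as binary units, and the last partial chunk is counted as a full chunk when reporting the chunk count.

```python
CHUNK_SIZE = 64 * 1024**2   # 64 MB per chunk
REPLICATION = 3             # minimum replication factor from the slide


def chunk_plan(file_size, chunk_size=CHUNK_SIZE, replication=REPLICATION):
    """Return (chunks, chunk replicas stored, raw bytes stored) for one file."""
    chunks = -(-file_size // chunk_size)    # integer ceiling division
    return chunks, chunks * replication, file_size * replication


chunks, replicas, raw_bytes = chunk_plan(1024**4)   # a 1 TB file
print(chunks, replicas, raw_bytes // 1024**4)
```

A 1 TB file is split into 16384 chunks, stored as 49152 chunk replicas, and occupies 3 TB of raw disk space across the cluster.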
![Page 86: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/86.jpg)
Hadoop
![Page 87: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/87.jpg)
What is Hadoop ?
Flexible and available architecture for large scale distributed [...] processing on a network of commodity hardware.
![Page 88: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/88.jpg)
Apache top level project
[...] contributors
It has one of the strongest ecosystems, with a large number of sub-projects
Yahoo has one of the biggest installations, running thousands of servers on Hadoop
![Page 89: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/89.jpg)
Inspired by
Google GFS & Map Reduce & Big Table
Architecture behind Google's Search Engine
Creator of Hadoop project
![Page 90: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/90.jpg)
Use cases - What is Hadoop used for
Big/Social data analysis
Text mining, patterns search
Machine log analysis
Geo-spatial analysis
Genome Analysis
Drug Discovery
Fraud and compliance management
Video and image analysis
Trend Analysis
![Page 91: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/91.jpg)
Who uses Hadoop - Long List
• Amazon/A9
• Facebook
• Google
• IBM
• Disney
• Last.fm
• New York Times
• Yahoo!
• Twitter
• Linked in
![Page 92: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/92.jpg)
Hadoop ecosystem
HBase
MapReduce, HDFS
Chukwa
Hive, Pig
Avro
ZooKeeper
Sqoop
Flume, Oozie, Whirr
![Page 93: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/93.jpg)
Hadoop distribution
![Page 94: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/94.jpg)
![Page 95: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/95.jpg)
Hadoop distributed file system
![Page 96: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/96.jpg)
Hadoop components
Job tracker
Task Tracker
Name Node
Map Reduce
HDFS
![Page 97: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/97.jpg)
MapReduce
http://labs.google.com/papers/mapreduce.html
![Page 98: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/98.jpg)
MapReduce example
from mrjob.job import MRJob
import re

# A word is a run of letters, digits, underscores or apostrophes.
WORD_RE = re.compile(r"[\w']+")


class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word on the input line.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # Sum the ones emitted for each word.
        yield (word, sum(counts))


if __name__ == '__main__':
    MRWordFreqCount.run()
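What the framework does between `mapper` and `reducer` can be imitated locally with the standard library alone. The sketch below is not how mrjob is implemented; it is a stand-in that sorts the mapper output by key and groups it with `itertools.groupby`, which mirrors the sort-based shuffle. `run_local` and the sample lines are made up for this illustration.

```python
import re
from itertools import groupby
from operator import itemgetter

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Same logic as MRWordFreqCount.mapper, as a plain function."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Same logic as MRWordFreqCount.reducer, as a plain function."""
    yield word, sum(counts)


def run_local(lines):
    # Map every line, then sort by key so equal words become adjacent.
    pairs = sorted((pair for line in lines for pair in mapper(line)),
                   key=itemgetter(0))
    results = {}
    # groupby hands each reducer call all the values of one key.
    for word, group in groupby(pairs, key=itemgetter(0)):
        for key, total in reducer(word, (count for _, count in group)):
            results[key] = total
    return results


counts = run_local(["Talk is cheap", "show me the code", "Talk is cheap again"])
print(counts)
```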
![Page 99: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/99.jpg)
The MrJob project
Created by Yelp's engineering team
Fully open source
Written entirely in Python
Uses MapReduce for processing
Runs on both Amazon EMR and Hadoop
![Page 100: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/100.jpg)
MrJob goals
If you want to learn MapReduce, it is for you
If you have a huge problem that needs a lot of processing and you don't feel like messing with Hadoop
If you already have a Hadoop cluster and want to run Python scripts
If you want to migrate your Python code from Hadoop to EMR
If you don't want to write Python (impossible!), it is not for you!
![Page 101: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/101.jpg)
Important steps
sudo easy_install mrjob
![Page 102: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/102.jpg)
Let's go to a demo...
Text of Obama's 2009 inaugural address.
![Page 103: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/103.jpg)
MapReduce performance
![Page 104: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/104.jpg)
More information
http://packages.python.org/mrjob/
https://github.com/Yelp/mrjob
![Page 105: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/105.jpg)
Distributed Computing with mrJob
https://github.com/Yelp/mrjob
Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
![Page 106: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/106.jpg)
![Page 107: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/107.jpg)
Atepassar Recommendations
http://atepassar.com
Problem: how to recommend new friends?
![Page 108: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/108.jpg)
Atepassar Recommendations
http://atepassar.com
![Page 109: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/109.jpg)
Atepassar Recommendations
http://atepassar.com
marcel;jonas,maria,jose,amanda
maria;carol,fabiola,amanda,marcel
amanda;paula,patricia,maria,marcel
carol;maria,jose,patricia
fabiola;maria
paula;fabio,amanda
patricia;amanda,carol
jose;marcel,carol
jonas;marcel,fabio
fabio;jonas,paula
carla
"marcel" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["patricia", 1], ["paula", 1]]
"maria" [["jose", 2], ["patricia", 2], ["jonas", 1], ["paula", 1]]
"patricia" [["maria", 2], ["jose", 1], ["marcel", 1], ["paula", 1]]
"paula" [["jonas", 1], ["marcel", 1], ["maria", 1], ["patricia", 1]]
"amanda" [["carol", 2], ["fabio", 1], ["fabiola", 1], ["jonas", 1], ["jose", 1]]
"carol" [["amanda", 2], ["marcel", 2], ["fabiola", 1]]
"fabio" [["amanda", 1], ["marcel", 1]]
"fabiola" [["amanda", 1], ["carol", 1], ["marcel", 1]]
"jonas" [["amanda", 1], ["jose", 1], ["maria", 1], ["paula", 1]]
"jose" [["maria", 2], ["amanda", 1], ["jonas", 1], ["patricia", 1]]
$ python friends_recommender.py -r emr --num-ec2-instances 5 facebook_data.csv > output.dat
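The numbers in the sample output are consistent with a "friends of friends" count: for each user, every person listed by one of their friends is a candidate, unless it is the user or already a friend, and the score is how many friends lead to that candidate. The sketch below is a plain in-memory reading of that idea, not the `friends_recommender.py` job itself; `parse` and `recommend` are hypothetical names, and the input is the sample data from the slide.

```python
from collections import Counter

RAW = """\
marcel;jonas,maria,jose,amanda
maria;carol,fabiola,amanda,marcel
amanda;paula,patricia,maria,marcel
carol;maria,jose,patricia
fabiola;maria
paula;fabio,amanda
patricia;amanda,carol
jose;marcel,carol
jonas;marcel,fabio
fabio;jonas,paula
"""


def parse(raw):
    """Parse 'user;friend1,friend2,...' lines into a dict of friend lists."""
    friends = {}
    for line in raw.splitlines():
        if ";" not in line:
            continue                      # skip users without a friend list
        user, _, rest = line.partition(";")
        friends[user] = [f for f in rest.split(",") if f]
    return friends


def recommend(friends):
    """Rank candidates by how many of the user's friends list them."""
    result = {}
    for user, direct in friends.items():
        counts = Counter()
        for friend in direct:
            for candidate in friends.get(friend, []):
                if candidate != user and candidate not in direct:
                    counts[candidate] += 1
        # Highest count first, ties broken alphabetically.
        result[user] = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return result


recommendations = recommend(parse(RAW))
print(recommendations["marcel"])
```

For "marcel" this yields carol with a count of 2, followed by fabio, fabiola, patricia and paula with 1 each, the same ranking as the sample output.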
![Page 110: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/110.jpg)
Atepassar Recommendations
http://atepassar.com
marcel;jonas,maria,jose,amanda
carol;maria,jose,patricia
fabio;jonas,paula
...
![Page 111: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/111.jpg)
Atepassar Recommendations
http://atepassar.com
f_sim(u, i) = w1 · f1(u, i) + w2 · f2(u, i) + b
https://gist.github.com/3970945#file_atepassar_recommender.py
![Page 112: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/112.jpg)
Atepassar Recommendations
http://atepassar.com
Celery - schedules the collector and executor jobs.
mrJob - MapReduce and access to Hadoop
MongoDb - stores the recommendations
Boto - access to the files on S3.
![Page 113: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/113.jpg)
The best part!
![Page 114: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/114.jpg)
Interesting projects
Data structures for fast manipulation - slicing - indexing - subsetting
Handling missing data
Aggregations, time series
Pandas: a data analysis library for Python, poised to give R a run for its money… http://pandas.pydata.org/
![Page 115: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/115.jpg)
Interesting projects
Another framework for distributed computing with Python and MapReduce.
Created by the Nokia Research Center.
Its backend is written in Erlang (functional, concurrent and highly scalable!)
It does not use HDFS, the most widely used file system, but its own one (DDFS).
http://discoproject.org/
![Page 116: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/116.jpg)
Interesting projects
http://api.mongodb.org/python/2.0/examples/map_reduce.html
Python & MongoDb
MongoDb - non-relational database (NoSQL)
Has native, built-in support for MapReduce.
The code has to be written in JS, which is not very readable and ties you to Mongo ...
>>> from bson.code import Code
>>> reduce = Code("function (key, values) {"
...               "  var total = 0;"
...               "  for (var i = 0; i < values.length; i++) {"
...               "    total += values[i];"
...               "  }"
...               "  return total;"
...               "}")
![Page 117: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/117.jpg)
Interesting projects
https://github.com/klbostee/dumbo/wiki/Short-tutorial
Dumbo
One of the first libraries built on MapReduce and Python.
Hard to get started with, and it is out of date :(
![Page 118: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/118.jpg)
Interesting projects
A Python wrapper on top of Hadoop for distributed computing.
http://pydoop.sourceforge.net/docs/index.html
Nice, but it takes some work to configure.
![Page 119: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/119.jpg)
Interesting projects
MapReduce with Python on Google AppEngine
https://developers.google.com/appengine/docs/python/dataprocessing/
Still experimental, and it ties you to the AppEngine platform.
![Page 120: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/120.jpg)
Interesting projects
Machine learning algorithms
Supervised & unsupervised
Pre-processing, data extraction
Classifier evaluation, pipelines, feature selection.
http://scikit-learn.org/stable/
![Page 121: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/121.jpg)
Interesting projects
Natural language processing
Several tools for tokenization, POS tagging, named entity recognition, classifiers, etc.
Several corpora available!
http://nltk.org/
![Page 122: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/122.jpg)
Interesting projects
Keep an eye on... Pipeline for distributed Natural Language Processing, made in Python
https://github.com/NAMD/pypln
![Page 123: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/123.jpg)
Interesting projects
Our Project’s Home Page
http://muricoca.github.com/crab
![Page 124: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/124.jpg)
Future Releases
Planned Release 0.13
New home for python-recsys:
https://github.com/python-recsys/crab
Planned Release 0.14
Support for Item-Based Recommenders using MapReduce with MrJob
New committers: vinnigracindo, ocelma, fcurella
![Page 125: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/125.jpg)
Join us!
1. Read our Wiki Page
https://github.com/muricoca/crab/wiki/Developer-Resources
2. Check out our current sprints and open issues
https://github.com/muricoca/crab/issues
3. Forks, Pull Requests mandatory
4. Join us at irc.freenode.net #muricoca or at our discussion list
http://groups.google.com/group/scikit-crab
![Page 126: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/126.jpg)
Many others ...
Matplotlib
StatsModels
Scipy
Numpy
NetworkX
PyBrain
milk
Orange
![Page 127: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/127.jpg)
Recommended books
![Page 128: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/128.jpg)
Recommended books
http://shop.oreilly.com/product/0636920022640.do?cmp=il-radar-ebooks-big-data-now-radar
For free...
![Page 129: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/129.jpg)
Recommended books
For free...
http://infolab.stanford.edu/~ullman/mmds/book.pdf
![Page 130: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/130.jpg)
Recommended articles
For free...
http://aimotion.blogspot.com.br/2012/10/atepassar-recommendations-recommending.html
http://aimotion.blogspot.com.br/2012/08/introduction-to-recommendations-with.html
![Page 131: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/131.jpg)
Recommended articles
For free...
https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob
![Page 132: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/132.jpg)
![Page 133: Big Data com Python](https://reader034.vdocuments.mx/reader034/viewer/2022052410/554a06a3b4c9055b7a8b55b5/html5/thumbnails/133.jpg)
BigData with Python
A gentle and simple introduction
Marcel Caraciolo
@marcelcaraciolo
Developer, scientist, contributor to the Crab recsys project, has worked with Python for 6 years, interested in mobile, education, machine learning and dataaaaa!
Recife, Brazil - http://aimotion.blogspot.com