![Page 1: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/1.jpg)
Intro To Hadoop
Bill Graham - @billgraham
Data Systems Engineer, Analytics Infrastructure
Info 290 - Analyzing Big Data With Twitter
UC Berkeley Information School
September 2012
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
![Page 2: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/2.jpg)
Outline
• What is Big Data?
• Hadoop
  • HDFS
  • MapReduce
• Twitter Analytics and Hadoop
![Page 3: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/3.jpg)
What is big data?
• A bunch of data?
• An industry?
• An expertise?
• A trend?
• A cliche?
![Page 4: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/4.jpg)
Wikipedia big data
In information technology, big data is a loosely-defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools.
Source: http://en.wikipedia.org/wiki/Big_data
![Page 5: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/5.jpg)
How big is big?
• 2008: Google processes 20 PB a day
• 2009: Facebook has 2.5 PB user data + 15 TB/day
• 2009: eBay has 6.5 PB user data + 50 TB/day
• 2011: Yahoo! has 180-200 PB of data
• 2012: Facebook ingests 500 TB/day
![Page 6: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/6.jpg)
That’s a lot of data
Credit: http://www.flickr.com/photos/19779889@N00/1367404058/
![Page 7: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/7.jpg)
So what?
s/data/knowledge/g
![Page 8: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/8.jpg)
No really, what do you do with it?
• User behavior analysis
• A/B test analysis
• Ad targeting
• Trending topics
• User and topic modeling
• Recommendations
• And more...
![Page 9: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/9.jpg)
How to scale data?
![Page 10: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/10.jpg)
Divide and Conquer
[Diagram: “Work” is partitioned into w1, w2, w3, each handled by a “worker”; the partial results r1, r2, r3 are combined into the “Result”.]
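The partition / work / combine pattern can be sketched in plain Java. This is only an illustration under assumptions: the "work" is summing a list of numbers, the workers are threads in one JVM rather than machines, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DivideAndConquerSketch {

    // Partition: split the work into roughly equal chunks, one per worker.
    static List<List<Integer>> partition(List<Integer> work, int workers) {
        List<List<Integer>> chunks = new ArrayList<>();
        int chunkSize = (work.size() + workers - 1) / workers; // ceiling division
        for (int start = 0; start < work.size(); start += chunkSize) {
            chunks.add(work.subList(start, Math.min(start + chunkSize, work.size())));
        }
        return chunks;
    }

    // Each worker sums its own chunk; the partial sums are then combined.
    static long sumInParallel(List<Integer> work, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<Integer> chunk : partition(work, workers)) {
                partials.add(pool.submit(() -> {
                    long sum = 0;
                    for (int n : chunk) sum += n;
                    return sum;
                }));
            }
            // Combine: merge the partial results (r1, r2, r3) into the final result.
            long result = 0;
            for (Future<Long> partial : partials) result += partial.get();
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("a worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> work = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9);
        System.out.println(sumInParallel(work, 3)); // prints 45
    }
}
```

Even this toy version has to decide how to split the work and what to do when a worker throws, which is where the next slide picks up.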
![Page 11: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/11.jpg)
Parallel processing is complicated
• How do we assign tasks to workers?
• What if we have more tasks than slots?
• What happens when tasks fail?
• How do you handle distributed synchronization?
Credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/
![Page 12: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/12.jpg)
Data storage is not trivial
• Data volumes are massive
• Reliably storing PBs of data is challenging
• Disk/hardware/network failures
• Probability of a failure event increases with the number of machines
For example: with 1000 hosts, each with 10 disks, and a disk that lasts 3 years, how many disk failures occur per day?
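A back-of-the-envelope answer to the question above, as a small sketch. It assumes failures are spread evenly over a disk's lifetime (a simplification) and a 365-day year; the class and method names are invented for the example. With the slide's numbers it comes out at roughly nine disk failures a day.

```java
class DiskFailureEstimate {

    // Expected failures per day = total disks / disk lifetime in days,
    // assuming failures are spread evenly over the lifetime (a simplification).
    static double failuresPerDay(int hosts, int disksPerHost, double lifetimeYears) {
        double totalDisks = (double) hosts * disksPerHost;
        double lifetimeDays = lifetimeYears * 365.0;
        return totalDisks / lifetimeDays;
    }

    public static void main(String[] args) {
        // The slide's numbers: 1000 hosts, 10 disks each, 3-year disk lifetime.
        // 10,000 disks / 1,095 days is a little over 9 failures per day.
        System.out.println(failuresPerDay(1000, 10, 3.0));
    }
}
```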
![Page 13: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/13.jpg)
Hadoop cluster
Cluster of machines running Hadoop at Yahoo! (credit: Yahoo!)
![Page 14: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/14.jpg)
Hadoop
![Page 15: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/15.jpg)
Hadoop provides
• Redundant, fault-tolerant data storage
• Parallel computation framework
• Job coordination
http://hadoop.apache.org
![Page 16: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/16.jpg)
Joy
Credit: http://www.flickr.com/photos/spyndle/3480602438/
![Page 17: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/17.jpg)
Hadoop origins
• Hadoop is an open-source implementation based on GFS and MapReduce from Google
• Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System
• Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
![Page 18: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/18.jpg)
Hadoop Stack
[Diagram: layers of the stack]
Pig (Data Flow)
Hive (SQL)
Cascading (Java)
MapReduce (Distributed Programming Framework)
HBase (Columnar Database)
HDFS (Hadoop Distributed File System)
![Page 19: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/19.jpg)
HDFS
![Page 20: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/20.jpg)
HDFS is...
• A distributed file system
• Redundant storage
• Designed to reliably store data using commodity hardware
• Designed to expect hardware failures
• Intended for large files
• Designed for batch inserts
• The Hadoop Distributed File System
![Page 21: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/21.jpg)
HDFS - files and blocks
• Files are stored as a collection of blocks
• Blocks are 64 MB chunks of a file (configurable)
• Blocks are replicated on 3 nodes (configurable)
• The NameNode (NN) manages metadata about files and blocks
• The SecondaryNameNode (SNN) holds a periodically checkpointed backup of the NN metadata
• DataNodes (DN) store and serve blocks
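The block arithmetic can be made concrete with a short sketch. It uses the slide's defaults (64 MB blocks, 3 replicas) and a hypothetical 1 GB file; the class and method names are made up for illustration.

```java
class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file: ceiling of fileSize / blockSize.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster once every block is replicated.
    // A partially filled last block only takes up the space of the data it holds.
    static long rawBytesStored(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 1024 * MB; // a hypothetical 1 GB file
        long blocks = blockCount(fileSize, 64 * MB);
        // 16 blocks, each stored 3 times: 48 block replicas spread over the DataNodes.
        System.out.println(blocks + " blocks, " + (blocks * 3) + " block replicas");
        // 3072 MB of raw disk for 1024 MB of data.
        System.out.println(rawBytesStored(fileSize, 3) / MB + " MB of raw storage");
    }
}
```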
![Page 22: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/22.jpg)
Replication
• Multiple copies of a block are stored
• Replication strategy:
  • Copy #1 on another node on same rack
  • Copy #2 on another node on different rack
![Page 23: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/23.jpg)
HDFS - writes
[Diagram: a Client with a File, the NameNode (master), and three DataNodes (slave nodes) that each store a copy of the block, spread across Rack #1 and Rack #2.]
Note: Write path for a single block shown. Client writes multiple blocks in parallel.
![Page 24: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/24.jpg)
HDFS - reads
[Diagram: a Client, the NameNode (master), and three DataNodes (slave nodes) serving block 1, block 2, ... block N of a File.]
Client reads multiple blocks in parallel and re-assembles them into a file.
![Page 25: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/25.jpg)
What about DataNode failures?
• DNs check in with the NN to report health
• Upon failure, the NN orders DNs to replicate under-replicated blocks
Credit: http://www.flickr.com/photos/18536761@N00/367661087/
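A minimal sketch of the bookkeeping behind those two bullets, not HDFS's actual implementation. The class, the block and node names, and the 3-second timeout are all hypothetical: the "NameNode" records the last check-in time of each DataNode and reports blocks whose live replica count has dropped below the target.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class ReplicationMonitorSketch {
    private final int targetReplication;
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Map<String, Set<String>> blockLocations = new TreeMap<>();

    ReplicationMonitorSketch(int targetReplication, long timeoutMillis) {
        this.targetReplication = targetReplication;
        this.timeoutMillis = timeoutMillis;
    }

    // A DataNode checks in; remember when we last heard from it.
    void heartbeat(String dataNode, long nowMillis) {
        lastHeartbeat.put(dataNode, nowMillis);
    }

    // Record that a DataNode holds a replica of a block.
    void addReplica(String block, String dataNode) {
        blockLocations.computeIfAbsent(block, b -> new HashSet<>()).add(dataNode);
    }

    // Blocks whose number of replicas on live DataNodes is below the target.
    List<String> underReplicatedBlocks(long nowMillis) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : blockLocations.entrySet()) {
            int live = 0;
            for (String dataNode : entry.getValue()) {
                Long seen = lastHeartbeat.get(dataNode);
                if (seen != null && nowMillis - seen <= timeoutMillis) live++;
            }
            if (live < targetReplication) result.add(entry.getKey());
        }
        return result;
    }

    // Hypothetical scenario: dn1 stops checking in, so blk_1 loses a replica.
    static List<String> demo() {
        ReplicationMonitorSketch nameNode = new ReplicationMonitorSketch(3, 3000);
        for (String dn : new String[] {"dn1", "dn2", "dn3", "dn4"}) nameNode.heartbeat(dn, 0);
        for (String dn : new String[] {"dn1", "dn2", "dn3"}) nameNode.addReplica("blk_1", dn);
        for (String dn : new String[] {"dn2", "dn3", "dn4"}) nameNode.addReplica("blk_2", dn);
        for (String dn : new String[] {"dn2", "dn3", "dn4"}) nameNode.heartbeat(dn, 5000);
        return nameNode.underReplicatedBlocks(5000);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [blk_1]
    }
}
```

In a real cluster the next step would be to pick a healthy DataNode and order a copy of each reported block; that part is left out here.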
![Page 26: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/26.jpg)
MapReduce
![Page 27: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/27.jpg)
MapReduce is...
• A programming model for expressing distributed computations at a massive scale
• An execution framework for organizing and performing such computations
• An open-source implementation called Hadoop
![Page 28: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/28.jpg)
Typical large-data problem
• Iterate over a large number of records
• Extract something of interest from each (Map)
• Shuffle and sort intermediate results
• Aggregate intermediate results (Reduce)
• Generate final output
(Dean and Ghemawat, OSDI 2004)
![Page 29: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/29.jpg)
MapReduce paradigm
• Implement two functions:
  Map(k1, v1) -> list(k2, v2)
  Reduce(k2, list(v2)) -> list(v3)
• Framework handles everything else*
• Values with the same key go to the same reducer
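The two signatures can be written down as typed Java interfaces. This is a sketch for illustration, not Hadoop's API: the interface names and the sample "max value per key" functions are assumptions made up here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapReduceSignatures {

    // Map(k1, v1) -> list(k2, v2)
    interface MapFunction<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // Reduce(k2, list(v2)) -> list(v3)
    interface ReduceFunction<K2, V2, V3> {
        List<V3> reduce(K2 key, List<V2> values);
    }

    // Hypothetical mapper: turn an input line such as "a,5" into the pair (a, 5).
    static final MapFunction<Long, String, String, Integer> PARSE = (offset, line) -> {
        String[] parts = line.split(",");
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        out.add(new SimpleEntry<>(parts[0], Integer.parseInt(parts[1])));
        return out;
    };

    // Hypothetical reducer: keep the largest value seen for a key.
    static final ReduceFunction<String, Integer, Integer> MAX = (key, values) -> {
        int max = Integer.MIN_VALUE;
        for (int v : values) max = Math.max(max, v);
        List<Integer> out = new ArrayList<>();
        out.add(max);
        return out;
    };

    public static void main(String[] args) {
        System.out.println(PARSE.map(0L, "a,5"));
        List<Integer> valuesForA = new ArrayList<>();
        valuesForA.add(1);
        valuesForA.add(5);
        System.out.println(MAX.reduce("a", valuesForA));
    }
}
```

Everything between the two calls (grouping values by key and routing them to a reducer) is the part the framework takes care of.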
![Page 30: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/30.jpg)
MapReduce Flow
[Diagram]
Input: (k1, v1) ... (k6, v6), split across four map tasks
map output: (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
Shuffle and Sort: aggregate values by keys
reduce input: a → [1, 5] | b → [2, 7] | c → [2, 3, 6, 8]
reduce output: (r1, s1) | (r2, s2) | (r3, s3)
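The shuffle-and-sort step in the flow above can be imitated in a few lines. This is a single-process sketch with made-up names, fed with the diagram's four map outputs, where a pairs with 1 and 5, b with 2 and 7, and c with 3, 6, 2 and 8. A `TreeMap` keeps the keys sorted; values stay in arrival order, so c's values come out as [3, 6, 2, 8] rather than in the order drawn on the slide.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleAndSortSketch {

    // Group every emitted value under its key; the TreeMap keeps keys sorted.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Small helper to build a (key, value) pair.
    static Map.Entry<String, Integer> kv(String key, int value) {
        return new SimpleEntry<>(key, value);
    }

    // The four map outputs from the flow diagram.
    static Map<String, List<Integer>> slideExample() {
        return shuffle(Arrays.asList(
                Arrays.asList(kv("a", 1), kv("b", 2)),
                Arrays.asList(kv("c", 3), kv("c", 6)),
                Arrays.asList(kv("a", 5), kv("c", 2)),
                Arrays.asList(kv("b", 7), kv("c", 8))));
    }

    public static void main(String[] args) {
        System.out.println(slideExample()); // prints {a=[1, 5], b=[2, 7], c=[3, 6, 2, 8]}
    }
}
```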
![Page 31: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/31.jpg)
MapReduce - word count example
function map(String name, String document):
    for each word w in document:
        emit(w, 1)

function reduce(String word, Iterator partialCounts):
    totalCount = 0
    for each count in partialCounts:
        totalCount += count
    emit(word, totalCount)
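The pseudocode can be run end to end in plain Java without a cluster. This is a sketch under assumptions: the `run` method stands in for the framework (one process, no parallelism), and the two sample documents are invented.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InMemoryWordCount {

    // map(name, document): emit (word, 1) for every word in the document.
    static List<Map.Entry<String, Integer>> map(String name, String document) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String word : document.trim().split("\\s+")) {
            if (!word.isEmpty()) emitted.add(new SimpleEntry<>(word, 1));
        }
        return emitted;
    }

    // reduce(word, partialCounts): sum the partial counts for one word.
    static int reduce(String word, List<Integer> partialCounts) {
        int totalCount = 0;
        for (int count : partialCounts) totalCount += count;
        return totalCount;
    }

    // Stand-in for the framework: map each document, group by word, then reduce.
    static Map<String, Integer> run(Map<String, String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new TreeMap<>();
        documents.put("doc1", "to be or not to be");
        documents.put("doc2", "to do");
        System.out.println(run(documents)); // prints {be=2, do=1, not=1, or=1, to=3}
    }
}
```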
![Page 32: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/32.jpg)
MapReduce paradigm - part 2
• There’s more!
• Partitioners decide what key goes to what reducer
  • partition(k’, numPartitions) -> partNumber
  • Divides the key space into chunks for parallel reducers
  • Default is hash-based
• Combiners can combine Mapper output before sending to reducer
  • Reduce(k2, list(v2)) -> list(v3)
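Both hooks are small enough to sketch. The partition function below uses the mask-and-modulo idea behind Hadoop's default hash partitioner; the combiner assumes the reduce operation is a sum, and the class and method names are made up for the example.

```java
import java.util.Map;
import java.util.TreeMap;

class PartitionAndCombineSketch {

    // Hash-based partitioning: mask off the sign bit so the result is never negative.
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Combiner for a sum: collapse one mapper's output to a single value per key.
    static Map<String, Integer> combine(String[] keys, int[] values) {
        Map<String, Integer> combined = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            combined.merge(keys[i], values[i], Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // With 3 reducers: a goes to reducer 1, b to reducer 2, c to reducer 0.
        for (String key : new String[] {"a", "b", "c"}) {
            System.out.println(key + " -> reducer " + partition(key, 3));
        }
        // A mapper emitted (c, 3) and (c, 6); the combiner sends a single (c, 9) onward.
        System.out.println(combine(new String[] {"c", "c"}, new int[] {3, 6})); // prints {c=9}
    }
}
```

A combiner is only safe when applying it zero, one, or several times leaves the final result unchanged, which holds for sums and counts but not, for example, for a plain average.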
![Page 33: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/33.jpg)
MapReduce flow
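[Diagram]
Input: (k1, v1) ... (k6, v6), split across four map tasks
map output: (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
combine output: (a, 1) (b, 2) | (c, 9) | (a, 5) (c, 2) | (b, 7) (c, 8)
partition: each combined output is partitioned by key
Shuffle and Sort: aggregate values by keys
reduce input: a → [1, 5] | b → [2, 7] | c → [2, 9, 8] (without the combiner: c → [2, 3, 6, 8])
reduce output: (r1, s1) | (r2, s2) | (r3, s3)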
![Page 34: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/34.jpg)
MapReduce additional details
• Reduce starts after all mappers complete
• Mapper output gets written to disk
• Intermediate data can be copied sooner
• Reducer gets keys in sorted order
• Keys not sorted across reducers
• Global sort requires 1 reducer or smart partitioning
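One form of "smart partitioning" is range partitioning: if reducer 0 only receives keys below the first split point, reducer 1 the next range, and so on, then concatenating the reducers' sorted outputs yields a globally sorted result. A single-process sketch, with hypothetical keys and split points:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RangePartitionSortSketch {

    // Send a key to the first range whose upper bound is greater than the key.
    // splitPoints must be sorted; keys at or above the last one go to the last reducer.
    static int partition(String key, String[] splitPoints) {
        for (int i = 0; i < splitPoints.length; i++) {
            if (key.compareTo(splitPoints[i]) < 0) return i;
        }
        return splitPoints.length;
    }

    static List<String> globalSort(List<String> keys, String[] splitPoints) {
        // One input list per reducer.
        List<List<String>> reducers = new ArrayList<>();
        for (int i = 0; i <= splitPoints.length; i++) reducers.add(new ArrayList<>());
        for (String key : keys) reducers.get(partition(key, splitPoints)).add(key);

        List<String> output = new ArrayList<>();
        for (List<String> reducerInput : reducers) {
            Collections.sort(reducerInput); // each reducer sees its own keys in sorted order
            output.addAll(reducerInput);    // reducer outputs are concatenated in reducer order
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("pig", "hive", "hdfs", "crane", "oink", "scribe");
        // Three reducers: keys below "h", keys from "h" up to "p", and the rest.
        System.out.println(globalSort(keys, new String[] {"h", "p"}));
        // prints [crane, hdfs, hive, oink, pig, scribe]
    }
}
```

Picking split points that balance the load across reducers is the hard part; a hash partitioner balances well but gives no ordering across reducers.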
![Page 35: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/35.jpg)
MapReduce - jobs and tasks
• Job: a user-submitted map and reduce implementation to apply to a data set
• Task: a single mapper or reducer task
  • Failed tasks get retried automatically
  • Tasks run local to their data, ideally
• JobTracker (JT) manages job submission and task delegation
• TaskTrackers (TT) ask for work and execute tasks
![Page 36: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/36.jpg)
MapReduce architecture
[Diagram: a Client submits a Job to the JobTracker (master); three slave nodes each run a TaskTracker that executes Tasks.]
![Page 37: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/37.jpg)
What about failed tasks?
• Tasks will fail
• JT will retry failed tasks up to N attempts
• After N failed attempts for a task, the job fails
• Some tasks are slower than others
• Speculative execution: the JT starts multiple copies of the same task
• The first one to complete wins; the others are killed
Credit: http://www.flickr.com/photos/phobia/2308371224/
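The retry rule can be sketched as a simple loop. The names are hypothetical and N is set to 4 purely for the example; the task is modelled as a predicate that says whether a given attempt succeeds.

```java
import java.util.function.IntPredicate;

class TaskRetrySketch {

    // Run a task until it succeeds or maxAttempts is reached.
    // attemptSucceeds receives the 1-based attempt number and reports success.
    // Returns the attempt that succeeded, or -1 if every attempt failed (the job fails).
    static int runWithRetries(IntPredicate attemptSucceeds, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (attemptSucceeds.test(attempt)) return attempt;
        }
        return -1;
    }

    public static void main(String[] args) {
        // A hypothetical flaky task that fails twice and then succeeds.
        System.out.println(runWithRetries(attempt -> attempt >= 3, 4)); // prints 3
        // A task that always fails uses up its attempts, so the job fails.
        System.out.println(runWithRetries(attempt -> false, 4));        // prints -1
    }
}
```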
![Page 38: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/38.jpg)
MapReduce data locality
• Move computation to the data
• Moving data between nodes has a cost
• MapReduce tries to schedule tasks on nodes with the data
• When that is not possible, the TT has to fetch the data from a remote DN
![Page 39: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/39.jpg)
MapReduce - Java API
• Mapper:
  void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
• Reducer:
  void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
![Page 40: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/40.jpg)
MapReduce - Java API
• Writable
  • Hadoop wrapper interface
  • Text, IntWritable, LongWritable, etc.
• WritableComparable
  • Writable classes implement WritableComparable
• OutputCollector
  • Class that collects keys and values
• Reporter
  • Reports progress, updates counters
• InputFormat
  • Reads data and provides InputSplits
  • Examples: TextInputFormat, KeyValueTextInputFormat
• OutputFormat
  • Writes data
  • Examples: TextOutputFormat, SequenceFileOutputFormat
![Page 41: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/41.jpg)
MapReduce - Counters are...
• A distributed count of events during a job
• A way to indicate job metrics without logging
• Your friend
• Bad: System.out.println("Couldn't parse value");
• Good: reporter.incrCounter(BadParseEnum, 1L);
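The "count instead of print" idea can be tried without Hadoop. In this sketch the `CounterSketch` class is a single-process stand-in for the reporter, and the enum and its counter names are hypothetical; in a real job the framework would sum the per-task counters across the cluster.

```java
import java.util.EnumMap;
import java.util.Map;

class CounterSketch {
    // Hypothetical counter names; a real job would define its own enum.
    enum ParseCounter { GOOD_RECORDS, BAD_RECORDS }

    private final Map<ParseCounter, Long> counters = new EnumMap<>(ParseCounter.class);

    // Stand-in for reporter.incrCounter(key, amount).
    void incrCounter(ParseCounter key, long amount) {
        counters.merge(key, amount, Long::sum);
    }

    long get(ParseCounter key) {
        return counters.getOrDefault(key, 0L);
    }

    // Parse each record, counting failures instead of printing them.
    static CounterSketch parseAll(String[] records) {
        CounterSketch reporter = new CounterSketch();
        for (String record : records) {
            try {
                Integer.parseInt(record.trim());
                reporter.incrCounter(ParseCounter.GOOD_RECORDS, 1L);
            } catch (NumberFormatException e) {
                reporter.incrCounter(ParseCounter.BAD_RECORDS, 1L);
            }
        }
        return reporter;
    }

    public static void main(String[] args) {
        CounterSketch reporter = parseAll(new String[] {"1", "2", "oops", "4"});
        System.out.println("good=" + reporter.get(ParseCounter.GOOD_RECORDS)
                + " bad=" + reporter.get(ParseCounter.BAD_RECORDS)); // prints good=3 bad=1
    }
}
```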
![Page 42: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/42.jpg)
MapReduce - word count mapper

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input line: emit (word, 1) for every token.
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }
![Page 43: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/43.jpg)
MapReduce - word count reducer

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        // Called once per word: sum the partial counts and emit (word, total).
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
![Page 44: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/44.jpg)
MapReduce - word count main

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // Types of the keys and values the job emits.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The reducer doubles as the combiner, which works because summing is associative.
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] is the input directory, args[1] the output directory.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
![Page 45: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/45.jpg)
MapReduce - running a job
• To run word count, add files to HDFS and do:
$ bin/hadoop jar wordcount.jar org.myorg.WordCount input_dir output_dir
![Page 46: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/46.jpg)
MapReduce is good for...
• Embarrassingly parallel algorithms
• Summing, grouping, filtering, joining
• Off-line batch jobs on massive data sets
• Analyzing an entire large dataset
![Page 47: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/47.jpg)
MapReduce is ok for...
• Iterative jobs (e.g., graph algorithms)
  • Each iteration must read/write data to disk
  • IO and latency cost of an iteration is high
![Page 48: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/48.jpg)
MapReduce is not good for...
• Jobs that need shared state/coordination
  • Tasks are shared-nothing
  • Shared-state requires scalable state store
• Low-latency jobs
• Jobs on small datasets
• Finding individual records
![Page 49: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/49.jpg)
Hadoop combined architecture
[Diagram: the master runs the JobTracker and the NameNode, with a SecondaryNameNode as backup; each of the three slave nodes runs a TaskTracker and a DataNode.]
![Page 50: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/50.jpg)
NameNode UI
• Tool for browsing HDFS
![Page 51: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/51.jpg)
JobTracker UI
• Tool to see running/completed/failed jobs
![Page 52: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/52.jpg)
Running Hadoop
• Multiple options
  • On your local machine (standalone or pseudo-distributed)
  • Local with a virtual machine
  • On the cloud (e.g., Amazon EC2)
  • In your own datacenter
![Page 53: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/53.jpg)
Cloudera VM
• Virtual machine with Hadoop and related technologies pre-loaded
• Great tool for learning Hadoop
• Eases the pain of downloading/installing
• Pre-loaded with sample data and jobs
• Documented tutorials
• VM: https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM
• Tutorial: https://ccp.cloudera.com/display/SUPPORT/Hadoop+Tutorial
![Page 54: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/54.jpg)
Twitter Analytics and Hadoop
![Page 55: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/55.jpg)
Multiple teams use Hadoop
• Analytics
• Revenue
• Personalization & Recommendations
• Growth
![Page 56: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/56.jpg)
Twitter Analytics data flow
[Diagram. Labels shown:
• Production Hosts: log events, Scribe Aggregators
• Application Data (social graph, tweets, user profiles): MySQL/Gizzard
• Third Party Imports, Distributed Crawler
• Staging Hadoop Cluster: HDFS, Log Mover
• Main Hadoop DW, HBase, HCatalog
• Crane, Oink, Rasvelg
• MySQL, Vertica
• Analytics Web Tools
• Analysts, Engineers, PMs, Sales]
![Page 57: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/57.jpg)
Example: active users
[Diagram: job DAG for computing active users. Labels shown:
• Production Hosts, Scribe: web_events, sms_events
• Log mover (via staging cluster)
• MySQL/Gizzard, Crane: user_profiles
• Main Hadoop DW
• Oink/Pig (Cleanse, Filter, Transform, Geo lookup, Union, Distinct): user_sessions
• Rasvelg (Join, Group, Count): aggregations active_by_geo, active_by_device, active_by_client, ... (active_by_*)
• Crane: Vertica, MySQL
• Analytics Dashboards]
![Page 58: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/58.jpg)
Credits
• Data-Intensive Information Processing Applications ― Session #1, Jimmy Lin, University of Maryland
![Page 59: Intro To Hadoop...Intro To Hadoop Bill Graham - @billgraham Data Systems Engineer, Analytics Infrastructure Info 290 - Analyzing Big Data With Twitter UC Berkeley Information School](https://reader036.vdocuments.mx/reader036/viewer/2022071112/5fe7c5a65ed06410360deb66/html5/thumbnails/59.jpg)
Questions?
Bill Graham - @billgraham