Saturday, 15 October 2011 gotocon.com/dl/goto-amsterdam-2011/slides/...

3x Friso van Vollenhoven @fzk Saturday, 15 October 2011

Page 4:

86.88.37.142 - - [26/Jul/2011:00:01:46 +0200] "GET /nl/index.html?Referrer=ADVNLGOO22901030000bsl HTTP/1.1" 200 15551 "http://www.google.nl/search?sourceid=navclient&aq=0h&oq=b&hl=nl&ie=UTF-8&q=bol.com.nl" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)" "DYN_USER_ID=12660142780; DYN_USER_CONFIRM=8bc25ea623423bae5c4ce970faf1b13f4; BOL_RFID=ADVNLGOO1322090000bsl; BUI=86.55.31.109.1278181451852406" 0 "Ti3nysCoEI4AAGMfqZAAAAPD" "-" "325886" "ps316"

Millions of these, each day
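A minimal sketch of pulling fields out of one such access-log line. It assumes the line starts with the usual combined log layout (client IP, timestamp, request, status, size, referrer, user agent) and leaves the trailing site-specific fields unparsed. The names `LOG_PATTERN` and `parse_log_line` are made up for this example, and the sample line is abbreviated.

```python
import re

# Assumed layout: a combined-log prefix followed by extra quoted fields
# (cookies, IDs) that this sketch keeps unparsed in "rest".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
    r'(?P<rest>.*)'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


# Abbreviated version of the line on the slide (cookie field shortened).
sample = (
    '86.88.37.142 - - [26/Jul/2011:00:01:46 +0200] '
    '"GET /nl/index.html?Referrer=ADVNLGOO22901030000bsl HTTP/1.1" 200 15551 '
    '"http://www.google.nl/search?sourceid=navclient&q=bol.com.nl" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)" '
    '"DYN_USER_ID=12660142780" 0 "-" "325886" "ps316"'
)
record = parse_log_line(sample)
print(record["ip"], record["status"], record["path"])
```

Returning None for non-matching lines lets a job skip malformed input instead of failing on it.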

Page 5:

Egypt @ Jan 27, 2011

Page 6:

BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG||

Hundreds of millions of these, each day

the internet works because of these (and cables and routers and money and people and stuff)
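A rough sketch of splitting one such BGP update record into named fields. The slide does not label the columns, so the field names below are assumptions modelled on the pipe-separated one-line output commonly produced by MRT dump tools. `BgpUpdate` and `parse_bgp_line` are names made up for this example.

```python
from collections import namedtuple

# Field names are assumptions; the slide shows only the raw record.
BgpUpdate = namedtuple(
    "BgpUpdate",
    "timestamp kind peer_ip peer_as prefix as_path origin next_hop",
)


def parse_bgp_line(line):
    """Split a pipe-separated record and keep the first nine columns."""
    parts = line.split("|")
    return BgpUpdate(
        timestamp=int(parts[1]),   # seconds since the Unix epoch
        kind=parts[2],             # "A" is assumed to mean announcement
        peer_ip=parts[3],
        peer_as=int(parts[4]),
        prefix=parts[5],
        # Simplification: assumes a plain list of AS numbers, no AS sets.
        as_path=[int(asn) for asn in parts[6].split()],
        origin=parts[7],
        next_hop=parts[8],
    )


line = ("BGP4MP|980099497|A|193.148.15.68|3333|192.37.0.0/16|"
        "3333 5378 286 1836|IGP|193.148.15.140|0|0||NAG||")
update = parse_bgp_line(line)
print(update.prefix, update.as_path)
```

Under these assumptions the last entry of `as_path` is the network that originated the prefix.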

Page 10:

[Diagram: HDFS architecture]

Data Node (DISK, DISK, DISK), three times

Name Node: /some/file, /foo/bar

HDFS client: create file, write data, read data

Data Nodes: replicate

Node-local HDFS client: read data
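The roles in the diagram can be sketched as a toy model: the name node keeps only metadata (which data nodes hold a file's replicas), and a client running on a data node prefers the local copy. This is an illustration, not HDFS code. It ignores blocks, racks and failures, the replication factor of 3 is an assumption, and `ToyNameNode` and the node names are made up.

```python
import itertools


class ToyNameNode:
    """Keeps only metadata: which data nodes hold each file's replicas."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.locations = {}
        self._next = itertools.cycle(self.data_nodes)

    def create(self, path):
        # Pick replica holders round-robin (real placement is smarter).
        count = min(self.replication, len(self.data_nodes))
        chosen = [next(self._next) for _ in range(count)]
        self.locations[path] = chosen
        return chosen

    def locate(self, path, client_node=None):
        # A client that runs on a replica holder reads locally.
        replicas = self.locations[path]
        if client_node in replicas:
            return client_node
        return replicas[0]


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
print(name_node.create("/some/file"))
print(name_node.create("/foo/bar"))
print(name_node.locate("/foo/bar", client_node="dn2"))
```

The point of the node-local read is that processing can be scheduled where the data already sits, so less data crosses the network.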

Page 11:

Why?

scalable

open source

cost-efficient

storage and processing in one

good for analytics: schema-less, unstructured

Page 12:

Not for me...

I don’t have a lot of data.

I surely don’t have a cluster of machines to spare.

I just read the paper.

It’d be cool if I could try this stuff sometime, though...

Page 13:

Free data...

Page 14:

Getting it...

curl -u fzk:secret \
  https://stream.twitter.com/1/statuses/sample.json \
  > tweets.json

8 weeks == ~1/4 TB
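As a quick back-of-the-envelope check, the figure above works out to roughly 4.5 GB per day, or about 52 kB per second on average. The sketch assumes 1/4 TB means 0.25 x 10^12 bytes (decimal units) and that the stream ran without gaps.

```python
SECONDS_PER_DAY = 24 * 60 * 60

total_bytes = 0.25 * 10**12  # assumption: 1/4 TB in decimal units
days = 8 * 7                 # 8 weeks

bytes_per_day = total_bytes / days
bytes_per_second = total_bytes / (days * SECONDS_PER_DAY)

print(f"{bytes_per_day / 10**9:.1f} GB per day")
print(f"{bytes_per_second / 10**3:.1f} kB per second")
```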

Page 15:

Tens of millions of these

Page 16:

Good, now the cluster...

http://whirr.apache.org/

Page 17:

Step 1: Configure

Step 2: Launch

Step 3: ?

Step 4: Pay

Page 18:

whirr.service-name=hadoop
whirr.cluster-name=my-cluster
whirr.instance-templates=\
  1 hadoop-jobtracker+hadoop-namenode, \
  19 hadoop-datanode+hadoop-tasktracker

whirr.provider=aws-ec2
whirr.identity=SECRET
whirr.credential=EVEN-MORE-SECRET
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

whirr.hadoop-install-function=install_cdh_hadoop
whirr.hadoop-configure-function=configure_cdh_hadoop

whirr.hardware-id=c1.xlarge

Step 1: Configure
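To see how big the configured cluster is, the `whirr.instance-templates` value can be read as a comma-separated list of "count role+role" entries. The sketch below parses it that way; `parse_instance_templates` is a helper made up for this example and is not part of Whirr.

```python
def parse_instance_templates(value):
    """Split 'N role+role, M role+role' into (count, [roles]) pairs."""
    templates = []
    for entry in value.split(","):
        count, roles = entry.strip().split(None, 1)
        templates.append((int(count), roles.split("+")))
    return templates


# The value from the configuration, with the line continuations joined.
value = ("1 hadoop-jobtracker+hadoop-namenode, "
         "19 hadoop-datanode+hadoop-tasktracker")

templates = parse_instance_templates(value)
total_nodes = sum(count for count, _ in templates)
print(total_nodes)
```

That is one master machine plus 19 workers, all of the `c1.xlarge` hardware type named in the configuration.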

Page 19:

whirr launch-cluster --config cluster.properties

Step 2: Launch

bash .whirr/my-cluster/hadoop-proxy.sh

wait about 20 minutes...

Page 21:

Twitter mentions

What’s up with Microsoft?

Step 3:

Page 22:

“Hello, Oracle”

“Google vs. Microsoft vs. Apple”

“Apache rocks! Oracle not so much...”

“Apple == iAwesome”

Oracle, 1
Google, 1
Microsoft, 1
Apple, 1
Apache, 1
Oracle, 1
Apple, 1

input: text

split words

emit: $WORD, 1 for ‘interesting’ words

MAP
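The map step can be sketched in a few lines of plain Python (no Hadoop involved). The set of 'interesting' words and the name `map_tweet` are assumptions chosen to match the example tweets; the actual job is a Java program whose code is not shown here.

```python
import re

# Assumption: the 'interesting' words, compared case-insensitively.
INTERESTING = {"oracle", "google", "microsoft", "apple", "apache"}


def map_tweet(text):
    """Yield (word, 1) for every interesting word in the text."""
    for word in re.findall(r"[A-Za-z]+", text):
        if word.lower() in INTERESTING:
            yield (word, 1)


tweets = [
    "Hello, Oracle",
    "Google vs. Microsoft vs. Apple",
    "Apache rocks! Oracle not so much...",
    "Apple == iAwesome",
]
pairs = [pair for tweet in tweets for pair in map_tweet(tweet)]
for word, count in pairs:
    print(f"{word}, {count}")
```

Each call looks at one input record only, which is why the map step can run on many machines at once.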

Page 23:

MAGIC!

Page 24:

map(input record) => (key, value)

ORDER BY key GROUP BY key

reduce(key, values) => (key, value)
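The three lines above can be imitated on a single machine: sort the mapped pairs by key, group equal keys together, and hand each group to the reducer. This is a single-process sketch of the idea, not how Hadoop implements its shuffle; `shuffle` and `run_job` are names made up for this example.

```python
from itertools import groupby
from operator import itemgetter


def shuffle(pairs):
    """Sort pairs by key, then collect the values of each key."""
    ordered = sorted(pairs, key=itemgetter(0))               # ORDER BY key
    for key, group in groupby(ordered, key=itemgetter(0)):   # GROUP BY key
        yield key, [value for _, value in group]


def run_job(records, mapper, reducer):
    """Run map, shuffle and reduce in sequence over the records."""
    mapped = [pair for record in records for pair in mapper(record)]
    return [reducer(key, values) for key, values in shuffle(mapped)]


result = run_job(
    ["a b", "b c b"],
    mapper=lambda record: [(word, 1) for word in record.split()],
    reducer=lambda key, values: (key, sum(values)),
)
print(result)
```

`groupby` only merges adjacent equal keys, so the sort has to come first.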

Page 25:

Apache: [1]

Apple: [1,1]

Google: [1]

Microsoft: [1]

Oracle: [1,1]

REDUCE

Apache: 1
Apple: 2
Google: 1
Microsoft: 1
Oracle: 2

input: text, count

sum values

emit: $KEY, $SUM for all keys
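The reduce step for this example is a one-line sum per key. A plain-Python sketch follows, fed with the grouped values shown on this slide; `reduce_counts` is a name made up for this example.

```python
def reduce_counts(key, values):
    """Sum the counts collected for one key."""
    return (key, sum(values))


# Grouped values as they arrive at the reducer, sorted by key.
grouped = {
    "Apache": [1],
    "Apple": [1, 1],
    "Google": [1],
    "Microsoft": [1],
    "Oracle": [1, 1],
}
totals = [reduce_counts(key, values) for key, values in grouped.items()]
for key, total in totals:
    print(f"{key}: {total}")
```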

Page 27:

mvn clean install

export HADOOP_CONF_DIR=$HOME/.whirr/my-cluster

hadoop jar bigdata-twitter-0.1-SNAPSHOT-job.jar \
  -Dxebia.twitter.terms=oracle,google,microsoft,apache \
  s3://training-hdfs/twitter-sample/* /job-output

wait another 20 minutes...

Page 32:

hadoop fs -get /job-output/part-r-00000 .

whirr destroy-cluster --config cluster.properties

Page 33:

20110807 apache 2
20110807 google 422
20110807 microsoft 44
20110807 oracle 11
20110808 apache 25
20110808 google 1341
20110808 microsoft 160
20110808 oracle 37
20110809 apache 17
20110809 google 1431
20110809 microsoft 184
20110809 oracle 40
20110810 apache 12
20110810 google 1688
20110810 microsoft 179
20110810 oracle 51
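Once the job output is downloaded, it is small enough to process locally. A sketch that pivots the records (one "date term count" record per line) into a table per day, assuming the three fields are separated by whitespace; `pivot` is a name made up for this example, and only the first few records are used.

```python
from collections import defaultdict


def pivot(lines):
    """Turn 'date term count' lines into {date: {term: count}}."""
    table = defaultdict(dict)
    for line in lines:
        day, term, count = line.split()
        table[day][term] = int(count)
    return dict(table)


lines = [
    "20110807 apache 2",
    "20110807 google 422",
    "20110808 apache 25",
    "20110808 google 1341",
]
table = pivot(lines)
for day in sorted(table):
    print(day, table[day])
```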

Page 34:

From: [email protected]
Subject: AWS Billing Statement Available

Greetings from Amazon Web Services,

This e-mail confirms that your latest billing statement is available on the AWS web site. Your account will be charged the following:

Total: $218.02

Thank you for using Amazon Web Services.

Sincerely,
Amazon Web Services

Step 4: Pay
