Big Data Analytics in Practice (Advanced Hands-on Course)

Introduction to Big Data and Its Analytics Applications (Hands-on Course)

莊家雋

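Outline

• Another operating system: Linux
• Starting Hadoop
• Using a distributed storage system: HDFS
• Using a distributed computing system: MapReduce
• Using ready-made tools for clustering and recommendation: Mahout
• Receiving a never-ending stream of data: Flume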

Hands-on: Linux

A Quick Introduction to Linux

• Open a terminal
– Ctrl+Alt+T

• Commands used today
– Basic file operations
– The Vim text editor

Basic Linux Commands: ls, cp

http://linux.vbird.org/linux_basic/0220filemanager/0220filemanager-fc4.php#

• Copy files: cp
• List files: ls

Basic Linux Commands: mv, rm

http://linux.vbird.org/linux_basic/0220filemanager/0220filemanager-fc4.php

• Move or rename files: mv
• Delete files: rm

Basic Linux Commands: cat, mkdir

• Create a directory: mkdir
• View a file's contents: cat

http://linux.vbird.org/linux_basic/0220filemanager/0220filemanager-fc4.php

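Introduction to the Vim Text Editor

• Run "vi filename" to enter normal (command) mode
• Press i to enter insert mode and start editing text
• Press [ESC] to return to normal mode
• Press : to enter command-line mode, then save the file (w) and quit (q) the vi environment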

http://linux.vbird.org/linux_basic/0310vi.php#vi

Starting Hadoop

Hadoop System Architecture

• Master/slave architecture
– NameNode, DataNode
– ResourceManager, NodeManager

Host     HDFS daemon   YARN daemon
master   NN            RM
slave1   DN            NM
slave2   DN            NM

Poor Man's Hadoop Architecture

• All daemons run on a single host

Host     HDFS daemons   YARN daemons
master   NN, DN         RM, NM

Starting HDFS

• start-dfs.sh
• http://master:50070

Starting MapReduce

• start-yarn.sh
• http://master:50030/cluster

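Hands-on: HDFS

Distributed File System: HDFS

• Provides a single directory tree across a distributed storage environment
• Each file is split into many blocks, and every block is replicated on different nodes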
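The two points above can be illustrated with a toy model. The sketch below is only an illustration: the block size, the node names `dn1` to `dn3`, and the round-robin placement are assumptions made for the example and do not describe the real HDFS placement policy.

```python
def place_blocks(data, block_size, nodes, replicas):
    """Split data into fixed-size blocks and pick `replicas` distinct nodes per block."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = {}
    for idx in range(len(blocks)):
        # Round-robin placement: a stand-in policy chosen for this sketch only.
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
    return blocks, placement


# A 10-byte "file", 4-byte blocks, three hypothetical DataNodes, two copies per block.
blocks, placement = place_blocks(b"x" * 10, block_size=4,
                                 nodes=["dn1", "dn2", "dn3"], replicas=2)
print(len(blocks))   # number of blocks the file was cut into
print(placement)     # block index -> nodes holding a copy
```

Losing any single node in this toy layout still leaves one copy of every block, which is the point of replicating blocks across machines.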

[Figure: File 1 and File 2 stored in HDFS]

http://www.ewdna.com/2013/04/Hadoop-HDFS-Comics.html
http://bradhedlund.com/2011/09/10/understanding-hadoop-clusters-and-the-network/


HDFS Command-Line Operations

• Basic commands
– hadoop fs -ls <file_in_hdfs>
– hadoop fs -lsr <dir_in_hdfs>
– hadoop fs -rm <file_in_hdfs>
– hadoop fs -rmr <dir_in_hdfs>
– hadoop fs -mkdir <dir_in_hdfs>
– hadoop fs -cat <file_in_hdfs>

– hadoop fs -get <file_in_hdfs> <file_in_local>
– hadoop fs -put <file_in_local> <file_in_hdfs>

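Hands-on: MapReduce

Distributed Computing System: MapReduce

• A problem can be split into smaller sub-problems; solving the problem means solving all of its sub-problems.

• Divide and conquer, one sub-problem at a time
– The traditional approach

• Divide and conquer, all sub-problems "at the same time"
– MapReduce

• Map: solve each sub-problem
• Reduce: combine the answers to the sub-problems

• The analysis operates on key/value data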
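The contrast between the traditional approach and the MapReduce approach can be sketched on a single machine. This is a minimal illustration, assuming a thread pool as a stand-in for cluster nodes; the function names `solve_subproblem` and `combine` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_subproblem(chunk):
    """Map step: solve one sub-problem (here, sum one chunk of numbers)."""
    return sum(chunk)


def combine(partials):
    """Reduce step: merge the answers of all sub-problems."""
    return sum(partials)


numbers = list(range(1, 101))
chunks = [numbers[i:i + 25] for i in range(0, len(numbers), 25)]

# Traditional approach: handle the sub-problems one after another.
sequential = combine(solve_subproblem(c) for c in chunks)

# MapReduce style: hand the sub-problems to several workers at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = combine(pool.map(solve_subproblem, chunks))

print(sequential, parallel)  # both are 5050
```

Both paths give the same answer; only the scheduling of the sub-problems differs.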

How MapReduce Counts Words

Input:
This is a book
This is a pen
This is a desk
That is my book
That is my pen

Input splits:
map1: This is a book / This is a pen
map2: This is a desk / That is my book
map3: That is my pen

Map output:
<This,1>, <is,1>, <a,1>, <book,1>
<This,1>, <is,1>, <a,1>, <pen,1>
<This,1>, <is,1>, <a,1>, <desk,1>
<That,1>, <is,1>, <my,1>, <book,1>
<That,1>, <is,1>, <my,1>, <pen,1>

Shuffle (values grouped by key):
<This,[1,1,1]>, <That,[1,1]>, <is,[1,1,1,1,1]>, <my,[1,1]>, <a,[1,1,1]>, <book,[1,1]>, <pen,[1,1]>, <desk,[1]>

Reduce output:
<This,3>, <That,2>, <is,5>, <my,2>, <a,3>, <book,2>, <desk,1>, <pen,2>
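The word-count flow above can be reproduced in a few lines of plain Python. This is a single-process sketch, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for the example.

```python
from collections import defaultdict

lines = ["This is a book", "This is a pen", "This is a desk",
         "That is my book", "That is my pen"]
# Three input splits, one per mapper, as in the diagram.
splits = [lines[0:2], lines[2:4], lines[4:5]]


def map_phase(split):
    """Emit a <word, 1> pair for every word in the split."""
    return [(word, 1) for line in split for word in line.split()]


def shuffle(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the list of ones collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```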

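1. The RM allocates resources globally
2. Each NM periodically reports its current resource usage
3. Each job has its own AppMaster, which controls that job
4. Resource management is separated from job control
5. YARN is a general-purpose resource management system, so multiple frameworks can run on top of it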

This Is What a MapReduce Program Looks Like…

Step by Step

# vim wordcount.data
aaa bbb ccc ddd
bbb ccc ddd eee

# hadoop fs -mkdir mr.wordcount
# hadoop fs -put wordcount.data mr.wordcount
# hadoop fs -ls mr.wordcount

# hadoop jar MR-sample.jar org.nchc.train.mr.wordcount.WordCount mr.wordcount/wordcount.data output
...omit...
File Input Format Counters
    Bytes Read=32
File Output Format Counters
    Bytes Written=30

# hadoop fs -cat output/part-r-00000
aaa 1
bbb 2
ccc 2
ddd 2
eee 1

Hands-on: Clustering

Grouping the Data by Hand

ID      Chinese  Math
ID 1    0        10
ID 2    10       0
ID 3    10       10
ID 4    20       10
ID 5    10       20
ID 6    20       20
ID 7    50       60
ID 8    60       50
ID 9    60       60
ID 10   90       90


Try grouping them yourself…

Step by Step

# vi clustering.data
0 10
10 0
10 10
20 10
10 20
20 20
50 60
60 50
60 60
90 90

# hadoop fs -mkdir testdata
# hadoop fs -put clustering.data testdata
# hadoop fs -ls -R testdata
-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata/clustering.data

# mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job -t1 3 -t2 2 -i testdata -o output
...omit...
14/09/08 01:31:07 INFO clustering.ClusterDumper: Wrote 3 clusters
14/09/08 01:31:07 INFO driver.MahoutDriver: Program took 104405 ms (Minutes: 1.7400833333333334)

# mahout clusterdump --input output/clusters-0-final --pointsDir output/clusteredPoints
C-0{n=1 c=[9.000, 9.000] r=[]}
    Weight : [props - optional]: Point:
    1.0: [9.000, 9.000]
C-1{n=2 c=[5.833, 5.583] r=[0.167, 0.083]}
    Weight : [props - optional]: Point:
    1.0: [5.000, 6.000]
    1.0: [6.000, 5.000]
    1.0: [6.000, 6.000]
C-2{n=4 c=[1.313, 1.333] r=[0.345, 0.527]}
    Weight : [props - optional]: Point:
    1.0: [1:1.000]
    1.0: [0:1.000]
    1.0: [1.000, 1.000]
    1.0: [2.000, 1.000]
    1.0: [1.000, 2.000]
    1.0: [2.000, 2.000]
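To get a feel for what the thresholds -t1 and -t2 do, the canopy idea can be sketched in a single pass. This is a simplified textbook-style version written for illustration, not the Mahout job: it assumes Euclidean distance, processes points in input order, and uses the coordinates as they appear in the clusterdump output.

```python
import math


def canopy(points, t1, t2):
    """Single-pass canopy sketch; t1 (loose) must be larger than t2 (tight)."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)        # next unassigned point becomes a center
        members = [center]
        still_candidates = []
        for p in remaining:
            d = math.dist(center, p)     # Euclidean distance (an assumption here)
            if d < t1:                   # loosely bound: joins this canopy
                members.append(p)
            if d >= t2:                  # not tightly bound: stays a candidate
                still_candidates.append(p)
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


points = [(0, 1), (1, 0), (1, 1), (2, 1), (1, 2),
          (2, 2), (5, 6), (6, 5), (6, 6), (9, 9)]
for t1, t2 in [(3, 2), (6, 5)]:
    print(t1, t2, len(canopy(points, t1, t2)))
```

On these ten points the sketch produces four canopies for (3, 2) and three for (6, 5). The counts are not expected to match the Mahout job's output; the sketch only shows the trend that larger thresholds give fewer, broader groups.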

Let's Think About It

• Data preprocessing
– Convert the data into fields that Mahout can process

• Domain knowledge
– Why two clusters rather than three?

Hands-on: Recommender Systems

Recommender Systems Are All Around You

• YouTube
• 博客來 (Books.com.tw)

        book-a  book-b  book-c
User 1  5       4       5
User 2  4       5       4
User 3  5       4       4~5
User 4  1       2       1~2
User 5  2       1       1

How a Recommender System Works

        book-a  book-b  book-c
User 1  5       4       5
User 2  4       5       4
User 3  5       4
User 4  1       2
User 5  2       1       1

Step by Step

# vi recom.data
1,1,5
1,2,4
1,3,5
2,1,4
2,2,5
2,3,4
3,1,5
3,2,4
4,1,1
4,2,2
5,1,2
5,2,1
5,3,1

# hadoop fs -mkdir testdata
# hadoop fs -put recom.data testdata
# hadoop fs -ls -R testdata
-rw-r--r-- 3 root hdfs 288374 2014-02-05 21:53 testdata/recom.data

# mahout recommenditembased -s SIMILARITY_EUCLIDEAN_DISTANCE -i testdata -o output
...omit...
File Input Format Counters
    Bytes Read=287
File Output Format Counters
    Bytes Written=32
14/09/04 05:46:56 INFO driver.MahoutDriver: Program took 434965 ms (Minutes: 7.249416666666667)

# hadoop fs -cat output/part-r-00000
3 [3:4.4787264]
4 [3:1.5212735]



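Analyzing the Results

# hadoop fs -cat output/part-r-00000
3 [3:4.4787264]
4 [3:1.5212735]

1. We predict that User4 does not much like book-c, so we would not recommend book-c to User4
2. We predict that User3 likes book-c, so we would recommend book-c to User3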
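Where the two predicted scores come from can be sketched with a small item-based calculation. The sketch makes two assumptions that are not stated in the slides: a missing rating counts as 0 when measuring the distance between two books, and similarity is 1 / (1 + Euclidean distance). They are chosen because, on this data, they give values that agree with the output above; the sketch is not a description of Mahout's internals.

```python
import math

# user -> {item: rating}; the same triples as recom.data
ratings = {
    1: {1: 5, 2: 4, 3: 5},
    2: {1: 4, 2: 5, 3: 4},
    3: {1: 5, 2: 4},
    4: {1: 1, 2: 2},
    5: {1: 2, 2: 1, 3: 1},
}


def item_vector(item):
    """Ratings of one item by every user; unrated entries count as 0 (assumption)."""
    return [ratings[user].get(item, 0) for user in sorted(ratings)]


def similarity(a, b):
    """Euclidean-distance similarity between two items: 1 / (1 + distance)."""
    return 1.0 / (1.0 + math.dist(item_vector(a), item_vector(b)))


def predict(user, item):
    """Similarity-weighted average of the user's own ratings."""
    rated = ratings[user]
    weights = {j: similarity(item, j) for j in rated}
    return sum(weights[j] * rated[j] for j in rated) / sum(weights.values())


print(round(predict(3, 3), 4), round(predict(4, 3), 4))
```

Under these assumptions the sketch gives about 4.4787 for User3 and 1.5213 for User4 on book-c (item 3).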


Try It!

book1 book2 book3 book4 book5 book6 book7 book8 book9

User1 3 2 1 5 5 1 3 1

User2 2 3 1 3 5 4 3

User3 1 2 3 3 2 1

User4 2 1 2 1 1 2

User5 3 3 1 3 2 2 3 3 2

User6 1 3 2 2 1

User7 4 4 1 5 1 3 3 4

Users' ratings of the books

Hands-on: Flume

When Data Keeps Arriving Nonstop

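• Copy the data into HDFS by hand
• Use Flume to collect the data

[Figure: files in a data directory → Flume (Memory Channel → HDFS sink) → HDFS]

No Programming Needed: It Runs Automatically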

• Just define a config file

# vim example
agent.sources = source1
agent.channels = channel1
agent.sinks = sink1

agent.sources.source1.type = spooldir
agent.sources.source1.channels = channel1
agent.sources.source1.spoolDir = /home/hadoop/flumedata
agent.sources.source1.fileHeader = false

agent.sinks.sink1.type=hdfs
agent.sinks.sink1.channel=channel1
agent.sinks.sink1.hdfs.path=hdfs://master:9000/user/hadoop
agent.sinks.sink1.hdfs.fileType=DataStream
agent.sinks.sink1.hdfs.writeFormat=TEXT
agent.sinks.sink1.hdfs.rollSize = 0
agent.sinks.sink1.hdfs.rollCount = 0
agent.sinks.sink1.hdfs.idleTimeout = 0

agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 100

# cd ~/flume/conf
# flume-ng agent -n agent -c . -f ./example…
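What the config above wires together (a spooling-directory source, a bounded memory channel, and a sink) can be imitated in a toy, single-process form. The sketch below is not Flume: the function `run_once`, the local file used as the sink, and the `.COMPLETED` done-marker are assumptions of the example.

```python
import queue
import tempfile
from pathlib import Path


def run_once(spool_dir, sink_file, capacity=100):
    """Move every line of every new file in spool_dir to sink_file via a bounded queue."""
    channel = queue.Queue(maxsize=capacity)   # plays the role of the memory channel
    delivered = 0

    def drain():
        # Sink: take every event currently in the channel and append it to the sink file.
        nonlocal delivered
        with open(sink_file, "a") as out:
            while not channel.empty():
                out.write(channel.get() + "\n")
                delivered += 1

    # Source: read each unprocessed file line by line into the channel.
    for path in sorted(Path(spool_dir).iterdir()):
        if not path.is_file() or path.name.endswith(".COMPLETED"):
            continue
        for line in path.read_text().splitlines():
            if channel.full():
                drain()                        # let the sink catch up before adding more
            channel.put(line)
        path.rename(path.with_name(path.name + ".COMPLETED"))  # mark the file as done
    drain()
    return delivered


with tempfile.TemporaryDirectory() as tmp:
    spool = Path(tmp, "flumedata")
    spool.mkdir()
    (spool / "a.log").write_text("line1\nline2\n")
    sink = Path(tmp, "sink.txt")
    n = run_once(spool, sink, capacity=1)
    result = sink.read_text().splitlines()
    leftover = sorted(p.name for p in spool.iterdir())

print(n, result, leftover)
```

The small `capacity` in the usage example forces the sink to drain while the source is still reading, which is the role the channel capacity plays between a fast source and a slower sink.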

Summary

• Virtual machine skills +1
• Linux skills +1
• HDFS skills +1
• Flume skills +1
• MapReduce skills +1
• Clustering with Mahout skills +1
• Recommendation with Mahout skills +1

Backup

Hands-on: Virtual Machines

Starting the Virtual Machine

…canopy.Job -t1 3 -t2 2 -i testdata

Finds 3 clusters

…canopy.Job -t1 6 -t2 5 -i testdata

Finds 2 clusters
