20130325 MLDM Monday spideR
TRANSCRIPT
![Page 1: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/1.jpg)
20130325 MLDM Monday
An Arsenal for Writing spideRs in R
by c3h3
![Page 2: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/2.jpg)
TW useR Group & MLDM Monday
● http://www.meetup.com/Taiwan-useR-Group/
● http://www.facebook.com/TaiwanUseRGroup/
● http://www.youtube.com/user/TWuseRGroup/
● http://tw.use-r.net/
![Page 3: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/3.jpg)
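About the Speaker
● Chia-Chi Chang (c3h3)
● Chief Data Scientist at InnovoTECH
● Co-founder of TW useR Group / MLDM Monday
● Avid user of R, Python, and Maple
● Enjoys analyzing all kinds of data and trading financial products; also enjoys reading about mathematical theories, models, and their applications...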
![Page 4: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/4.jpg)
Talk Outline
● Background knowledge for writing a spideR
● Some small spideR examples
● The architecture of a spideR
● The workflow for writing a spideR
● Some spideR tips
![Page 5: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/5.jpg)
This talk is aimed at beginners. Experts, please bear with us!
![Page 6: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/6.jpg)
Background Knowledge
![Page 7: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/7.jpg)
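Background knowledge for writing a spideR
● What is a website?
● How is a website structured?
● The secrets of URLs
● What kinds of data does a website hold?
● Tools for analysis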
![Page 8: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/8.jpg)
What is a website?
![Page 9: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/9.jpg)
A website as ordinary users see it
![Page 10: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/10.jpg)
A website as designers see it
![Page 12: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/12.jpg)
So... what does a website look like to a spideR?
![Page 13: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/13.jpg)
How is a website structured?
![Page 14: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/14.jpg)
Website structure (categories)
● Front end vs. back end
● Model + View + Controller (MVC)
● Static vs. dynamic (Ajax)
![Page 15: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/15.jpg)
MVC structure
![Page 16: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/16.jpg)
Static vs. Dynamic (Ajax)
● Examples:
● [Ajax] http://shop.myer.com.au/shop/mystore/973607510
● [Static] http://tw.stock.yahoo.com/d/s/major_2451.html
![Page 17: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/17.jpg)
The secrets of URLs
![Page 18: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/18.jpg)
The secrets of URLs
● URL?var_1=val_1&var_2=val_2...
  ○ In effect, this is just like calling a function
  ○ The parameter names and values can be found in the page's form or in its JS code
  ○ http://finance.yahoo.com/q/hp?s=%5ETWII&a=06&b=2&c=1997&d=02&e=24&f=2013&g=d
  ○ http://www.taifex.com.tw/eng/eng3/eng3_2dl.asp?COMMODITY_ID=all&DATA_DATE=2012/11/01&DATA_DATE1=2012/11/15
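To make the "URL as a function call" idea concrete, here is a minimal sketch in base R. The helper name `build_url` is hypothetical (it is not from the talk); it simply assembles a query string from named arguments and reproduces the Yahoo Finance URL shown on the slide.

```r
# Hypothetical helper: build a query URL from named parameters,
# much like passing named arguments to a function.
build_url <- function(base, params) {
  # Percent-encode each value; "^" becomes "%5E", for example.
  values <- vapply(
    params,
    function(v) URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  query <- paste(names(params), values, sep = "=", collapse = "&")
  paste0(base, "?", query)
}

url <- build_url(
  "http://finance.yahoo.com/q/hp",
  list(s = "^TWII", a = "06", b = 2, c = 1997, d = "02", e = 24, f = 2013, g = "d")
)
print(url)
```

Changing one argument (say, `f = 2012`) yields a new URL without any string surgery, which is the main reason to treat the query string as a parameter list.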
![Page 19: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/19.jpg)
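The secrets of URLs
● Some URLs follow a pattern
  ○ Some sites encode information in the URL itself
  ○ The back end's URL dispatcher then parses it
● Example of a URL that follows a pattern:
  ○ http://tw.stock.yahoo.com/d/s/major_2451.html
  ○ URL pattern: major_StockID.html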
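Once a pattern such as `major_StockID.html` is known, a list of page URLs can be generated from a list of IDs. The sketch below uses base R only; the function names are hypothetical, and the IDs other than 2451 are illustrative placeholders.

```r
# Hypothetical helper: fill the stock ID into the URL pattern from the slide.
major_url <- function(stock_id) {
  sprintf("http://tw.stock.yahoo.com/d/s/major_%s.html", stock_id)
}

# Hypothetical helper: recover the ID from a URL that follows the pattern.
extract_id <- function(url) {
  sub("^.*/major_([0-9]+)\\.html$", "\\1", url)
}

stock_ids <- c("2451", "2330", "2317")  # 2330 and 2317 are illustrative IDs
urls <- major_url(stock_ids)
print(urls)
print(extract_id(urls))
```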
![Page 20: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/20.jpg)
What kinds of data does a website hold?
![Page 21: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/21.jpg)
What kinds of data does a website hold?
● Page (HTML)
● Data (JSON/XML...)
● File
![Page 24: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/24.jpg)
Common tools
![Page 25: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/25.jpg)
Common tools
● Google Chrome
  ○ Developer Tools
● Firefox
  ○ Firebug
  ○ Hackbar
  ○ Cookie Manager+
● cURL
● Wireshark
![Page 26: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/26.jpg)
Some small examples
![Page 27: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/27.jpg)
[Example1] Fetching stock IDs:
![Page 28: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/28.jpg)
Techniques used
● Example1_Extract_TWSE_Stock_IDs.R
● R
  ○ XML::htmlParse
  ○ XML::readHTMLTable
  ○ charToRaw
  ○ gsub
● Reference:
  ○ [Shared-notes blog] How to remove " "
  ○ Lecture notes on regular expressions in R
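The role of `charToRaw` and `gsub` here can be sketched as follows. This example assumes that the stray character in the scraped cells is a non-breaking space (U+00A0), which is a common leftover from `&nbsp;` in HTML tables; the talk's script may handle a different character. The vector `raw_ids` is made-up sample data standing in for a column returned by `XML::readHTMLTable`.

```r
# Made-up cell text standing in for a scraped table column;
# "\u00a0" is a non-breaking space (an assumption about the stray character).
raw_ids <- c("1101\u00a0", "\u00a01102", "1103")

# charToRaw exposes the bytes, so the invisible character shows up
# as the UTF-8 byte pair c2 a0.
print(charToRaw(raw_ids[1]))

# Remove the non-breaking space byte sequence, then trim ordinary whitespace.
clean_ids <- gsub("\u00a0", "", raw_ids, fixed = TRUE, useBytes = TRUE)
clean_ids <- gsub("^\\s+|\\s+$", "", clean_ids)
print(clean_ids)
```

Inspecting the bytes first is the useful habit: it tells you what to remove instead of guessing from how the text looks on screen.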
![Page 30: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/30.jpg)
Techniques used
● Example2_Extract_Stock_Major_Data_Fom_Kimo.R
● R
  ○ XML::htmlParse
  ○ XML::readHTMLTable
![Page 31: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/31.jpg)
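Homework:
● Combine the two previous examples:
  ○ Fetch all of the stock IDs
  ○ Fetch the OTC data
    ■ Hint: OTC_IDs
  ○ Give the Data Table for each ID its own name
    ■ Hint1: have the function output a Data Table
    ■ Hint2: the assign function also works
  ○ Add a new column to the Data Table to store the ID ===> build a master table
  ○ Add a new column to the Data Table to store the date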
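One possible shape for this homework, sketched in base R. The function `fetch_major_table` is a hypothetical stand-in for the scraper from Example2 and returns made-up rows; the column names and the date are assumptions for illustration only.

```r
# Hypothetical stand-in for the scraper: returns one data frame per stock ID.
fetch_major_table <- function(stock_id) {
  data.frame(broker = c("A", "B"), net_buy = c(10, -5), stringsAsFactors = FALSE)
}

stock_ids <- c("2451", "2330")          # illustrative IDs
fetch_date <- as.Date("2013-03-25")     # illustrative fetch date

# Hint1: let the function output a table, then tag it with ID and date columns.
tables <- lapply(stock_ids, function(id) {
  tbl <- fetch_major_table(id)
  tbl$stock_id <- id
  tbl$date <- fetch_date
  tbl
})
names(tables) <- stock_ids

# Hint2: assign gives each table its own variable name, e.g. major_2451.
for (id in stock_ids) {
  assign(paste0("major_", id), tables[[id]])
}

# The ID column is what makes a single master table possible.
master <- do.call(rbind, tables)
rownames(master) <- NULL
print(master)
```

Keeping the tables in a named list is usually easier to work with than many separately named variables, but both approaches from the hints are shown.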
![Page 32: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/32.jpg)
[Example3] Fetching the 0050 IDs:
![Page 33: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/33.jpg)
Techniques used
● Example3_Extract_0050_IDs.R
● R
  ○ XML::htmlParse
  ○ XPath Parser in XML
● Reference:
  ○ http://www.w3.org/TR/xpath/
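A minimal sketch of XPath extraction with the XML package. The HTML string, the table `id`, and the `class` names are invented for illustration; the real page behind Example3 will have a different layout, so the XPath expression would need to be adapted after inspecting it.

```r
library(XML)

# Invented HTML standing in for a downloaded page.
html <- paste0(
  "<html><body><table id='holdings'>",
  "<tr><td class='id'>2330</td><td>A</td></tr>",
  "<tr><td class='id'>2317</td><td>B</td></tr>",
  "</table><p class='id'>not a holding</p></body></html>"
)

doc <- htmlParse(html, asText = TRUE)

# Select only the ID cells inside the holdings table, then take their text.
ids <- xpathSApply(doc, "//table[@id='holdings']//td[@class='id']", xmlValue)
print(ids)
```

Compared with `readHTMLTable`, XPath lets you pick out exactly the nodes you want, which helps when the page mixes the target cells with unrelated markup.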
![Page 34: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/34.jpg)
[Example4] Using the IDs with quantmod:
![Page 35: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/35.jpg)
Techniques used
● Example4_Get_Stock_Data_From_YahooFinance.R
● R
  ○ quantmod::getSymbols
  ○ quantmod::chartSeries
  ○ get
  ○ assign
● Reference:
  ○ Quantmod Web
  ○ Quantmod Slide
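The combination of `getSymbols`, `assign`, and `get` might look like the sketch below. The function `load_symbols` and the `.TW` ticker suffix are assumptions, not taken from the talk's script. The demonstration swaps in a fake fetcher so that it runs offline; the default `fetch` argument calls `quantmod::getSymbols` and needs the package and a network connection.

```r
# Hypothetical helper: fetch each symbol and store it under its own name.
load_symbols <- function(symbols, env = new.env(),
                         fetch = function(s) {
                           quantmod::getSymbols(s, src = "yahoo", auto.assign = FALSE)
                         }) {
  for (s in symbols) {
    assign(s, fetch(s), envir = env)   # one object per ticker
  }
  env
}

# Offline demonstration with a fake fetcher that returns made-up prices.
fake_fetch <- function(s) data.frame(symbol = s, close = c(100, 101))
store <- load_symbols(c("2330.TW", "2317.TW"), fetch = fake_fetch)

# get retrieves an object by the name that assign gave it.
prices <- get("2330.TW", envir = store)
print(prices)

# With quantmod installed and network access, one could instead run:
#   store <- load_symbols("2330.TW")
#   quantmod::chartSeries(get("2330.TW", envir = store))
```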
![Page 36: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/36.jpg)
[Example5] What if you find JSON behind the page?
![Page 37: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/37.jpg)
Homework:
● Try using the rjson package in R to process the page above.
● Reference:
  ○ rjson: http://cran.r-project.org/web/packages/rjson/rjson.pdf
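As a starting point for this homework, here is a small sketch of turning a JSON response into a data frame with rjson. The JSON text and its field names are invented; the actual response from the site in Example5 will differ.

```r
library(rjson)

# Invented JSON standing in for a response found behind a page.
json_text <- paste0(
  '{"stock_id": "2451", "rows": [',
  '{"broker": "A", "net_buy": 10}, {"broker": "B", "net_buy": -5}]}'
)

# fromJSON turns JSON objects into named lists and JSON arrays into lists.
parsed <- fromJSON(json_text)

# Flatten the list of records into a data frame, one row per record.
rows <- do.call(rbind, lapply(parsed$rows, function(r) {
  data.frame(broker = r$broker, net_buy = r$net_buy, stringsAsFactors = FALSE)
}))
print(rows)
```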
![Page 38: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/38.jpg)
[Example6] When you need to download a file
![Page 39: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/39.jpg)
Techniques used
● Example6_Download_CSV_File_From_TWSE.R
● R
  ○ RCurl::getURL
  ○ file
    ■ writeLines
    ■ readLines
  ○ textConnection
  ○ read.table
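The file-handling functions listed above fit together roughly as in the following sketch. To stay runnable offline, `csv_text` is a made-up string standing in for what `RCurl::getURL(url)` would return; the title line and column names are assumptions about the file layout.

```r
# Made-up text standing in for the result of RCurl::getURL(url).
csv_text <- "Report title line\nid,name,close\n1101,Alpha,35.5\n1102,Beta,41.2"

# Save the raw text to disk through a file connection.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "w")
writeLines(csv_text, con)
close(con)

# Read it back line by line and keep only lines that look like CSV records.
lines <- readLines(tmp)
data_lines <- lines[grepl(",", lines, fixed = TRUE)]

# textConnection lets read.table parse lines held in memory as if they were a file.
tc <- textConnection(data_lines)
quotes <- read.table(tc, sep = ",", header = TRUE, stringsAsFactors = FALSE)
close(tc)
print(quotes)
```

Filtering lines before parsing is what makes this approach useful for downloaded reports, which often wrap the real table in title and footer lines.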
![Page 40: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/40.jpg)
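Homework:
● Continuing the example above...
  ○ Use apply to parse each line into fields
  ○ Use the field count (length) to drop unwanted records
  ○ Combine the remaining records with do.call(rbind, data_list)
  ○ Then convert the result to a data frame and save it in an RData file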
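The homework steps can be sketched as follows. The sample lines, the expected field count of 3, and the column names are assumptions for illustration; `lapply` is used here as the list-oriented member of the apply family.

```r
# Made-up lines standing in for the output of readLines on the downloaded file.
lines <- c("Daily report", "1101,Alpha,35.5", "1102,Beta,41.2", "Total,2", "")

# Step 1: parse every line into its fields.
fields <- lapply(strsplit(lines, ",", fixed = TRUE), trimws)

# Step 2: keep only records with the expected number of fields (assumed to be 3).
data_list <- fields[vapply(fields, length, integer(1)) == 3]

# Step 3: stack the records into a character matrix.
mat <- do.call(rbind, data_list)

# Step 4: build a typed data frame and save it to an RData file.
quotes <- data.frame(
  id = mat[, 1],
  name = mat[, 2],
  close = as.numeric(mat[, 3]),
  stringsAsFactors = FALSE
)
out_file <- file.path(tempdir(), "quotes.RData")
save(quotes, file = out_file)
print(quotes)
```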
![Page 41: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/41.jpg)
[Example7] Learning to write code by reading code
![Page 43: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/43.jpg)
Techniques used
● Example7_Download_ZIP_File_From_Taifex.R
● R
  ○ download.file
  ○ unzip
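A minimal sketch of how `download.file` and `unzip` combine. The function `fetch_zip` is hypothetical and is only defined here, not called, because calling it requires network access and a real ZIP URL. It also assumes the URL ends in a plain file name with no query string.

```r
# Hypothetical helper: download a ZIP archive and extract it.
# Returns the paths of the extracted files.
fetch_zip <- function(url, dest_dir = tempdir()) {
  # Assumes the last path segment of the URL is a usable file name.
  zip_path <- file.path(dest_dir, basename(url))
  # mode = "wb" keeps binary files intact on Windows.
  download.file(url, destfile = zip_path, mode = "wb")
  unzip(zip_path, exdir = dest_dir)
}
```

Usage would be along the lines of `files <- fetch_zip(zip_url)`, where `zip_url` is the archive address found by inspecting the site.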
![Page 44: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/44.jpg)
[Example8] When a cookie is required
![Page 45: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/45.jpg)
Techniques used
● Example8_Download_CSV_File_From_Taifex_With_Cookie.R
● R
  ○ RCurl::getCurlHandle
  ○ RCurl::getURL(url,curl=curlHandle)
  ○ XML::htmlParse
  ○ XML::xmlAttrs
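The cookie technique relies on reusing one curl handle across requests, so that cookies set by the first page are sent with the second. The sketch below is an assumption about how the pieces fit, not the talk's script: `fetch_with_cookies` is hypothetical and is not called here because it needs network access, while the `xmlAttrs` part runs on an invented HTML snippet.

```r
library(XML)

# Hypothetical helper: visit a landing page first so the server sets its
# cookies, then request the data with the same handle.
fetch_with_cookies <- function(landing_url, data_url) {
  # cookiefile = "" switches on curl's in-memory cookie handling.
  handle <- RCurl::getCurlHandle(cookiefile = "", followlocation = TRUE)
  RCurl::getURL(landing_url, curl = handle)
  RCurl::getURL(data_url, curl = handle)
}

# Invented page: the download link is read from the attributes of an <a> node.
html <- paste0(
  "<html><body><form action='/download' method='post'>",
  "<input type='hidden' name='token' value='abc123'/></form>",
  "<a href='/files/report.zip'>zip</a></body></html>"
)
doc <- htmlParse(html, asText = TRUE)
link_attrs <- xmlAttrs(getNodeSet(doc, "//a")[[1]])
href <- link_attrs[["href"]]
print(href)
```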
![Page 46: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/46.jpg)
Homework:
● Continuing the example above...
  ○ Practice using readLines to read the rpt file extracted by unzip
  ○ Convert the rpt file into the xts format that quantmod can analyze
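A possible outline for the conversion to xts. The layout of the rpt file is not given in the slides, so the comma-separated `date,close` format below is purely an assumption; the real file would need its own column handling.

```r
library(xts)

# Made-up lines standing in for readLines(path_to_rpt); the format is assumed.
rpt_lines <- c("date,close", "2012/11/01,7150", "2012/11/02,7210")

# Parse the in-memory lines into a data frame.
tc <- textConnection(rpt_lines)
rpt <- read.table(tc, sep = ",", header = TRUE, stringsAsFactors = FALSE)
close(tc)

# xts needs the data plus a time index; the dates become the index.
prices <- xts(
  as.matrix(rpt["close"]),
  order.by = as.Date(rpt$date, format = "%Y/%m/%d")
)
print(prices)
```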
![Page 47: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/47.jpg)
The architecture of a spideR
![Page 48: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/48.jpg)
The architecture of a spideR
● Web Connector
  ○ RCurl
● Data Parser (Cleaner)
  ○ XML
● Data Center
  ○ RData File
  ○ DB (SQLite, MySQL, PostgreSQL, MongoDB, Redis, ...)
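The three layers can be expressed as three small functions. This is only a sketch of the separation of concerns: all function names are hypothetical, the parser assumes CSV text for brevity (an HTML page would use the XML package instead), and a fake fetcher replaces the network call so the example runs offline.

```r
# Web Connector: fetch raw text. The default uses RCurl; a fake can be injected.
connect <- function(url, fetch = function(u) RCurl::getURL(u)) {
  fetch(url)
}

# Data Parser (Cleaner): turn raw text into a data frame (CSV assumed here).
parse_page <- function(page_text) {
  tc <- textConnection(page_text)
  on.exit(close(tc))
  read.table(tc, sep = ",", header = TRUE, stringsAsFactors = FALSE)
}

# Data Center: persist the records, here to an RData file.
store_records <- function(records, path) {
  save(records, file = path)
  invisible(path)
}

# The spider is the composition of the three layers.
run_spider <- function(url, path, fetch) {
  store_records(parse_page(connect(url, fetch)), path)
}

# Offline demonstration with a fake connector and a placeholder URL.
fake_fetch <- function(u) "id,close\n1101,35.5\n1102,41.2"
out <- run_spider(
  "http://example.com/quotes.csv",
  file.path(tempdir(), "spider.RData"),
  fake_fetch
)
print(out)
```

Keeping the connector swappable is what makes the parser and storage code testable without hitting the real site.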
![Page 49: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/49.jpg)
The workflow for writing a spideR
![Page 50: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/50.jpg)
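The workflow for writing a spideR
● Define the goal
● Inspect the web pages
● Classify the pages
● Implement a Connector for each page class
● Implement a Parser for each page class
● Compare against and write to the database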
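The last step, comparing new records against what is already stored, might be sketched as below. A data frame stands in for the database, and the function name and the `id` key column are assumptions; with a real DB the same comparison would be done by key lookup before insert.

```r
# Hypothetical helper: append only the incoming records whose key is not stored yet.
merge_new_records <- function(existing, incoming, key = "id") {
  is_new <- !(incoming[[key]] %in% existing[[key]])
  rbind(existing, incoming[is_new, , drop = FALSE])
}

# Made-up stored and freshly scraped records.
existing <- data.frame(id = c("1101", "1102"), close = c(35.5, 41.2),
                       stringsAsFactors = FALSE)
incoming <- data.frame(id = c("1102", "1103"), close = c(41.2, 12.0),
                       stringsAsFactors = FALSE)

updated <- merge_new_records(existing, incoming)
print(updated)
```

Running the spider repeatedly then adds only unseen records instead of duplicating earlier ones.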
![Page 51: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/51.jpg)
Some tips
![Page 53: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/53.jpg)
Tip 2 for finding the "back end": look for the form
![Page 54: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/54.jpg)
Tip 1 for finding the "data": reveal hidden elements
![Page 55: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/55.jpg)
Tip 2 for finding the "data": use jQuery
![Page 56: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/56.jpg)
Tip 3 for finding the "data": use JS debugger;
![Page 57: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/57.jpg)
Tip 4 for finding the "data": disable JS (before disabling)
![Page 58: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/58.jpg)
Tip 4 for finding the "data": disable JS (after disabling: the recommended products disappear)
![Page 59: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/59.jpg)
Q & A
![Page 60: 20130325 mldm monday spide r](https://reader031.vdocuments.mx/reader031/viewer/2022020218/55a441651a28ab5d538b47e1/html5/thumbnails/60.jpg)
Thank you, everyone