web scrain wit r - 统计之都 · web scrain wit r xiao nan @road2stat ... (than python &...

64
. . Web ScrainwitR Xiao Nan @road2stat 6th China R Beijing Xiao Nan @road2stat Web Scraping with R 1/43 . . . 1 / 43

Upload: leque

Post on 19-May-2018

213 views

Category:

Documents


0 download

TRANSCRIPT

.

..Web Scra�in� wit� R

Xiao Nan @road2stat

6th China R Beijing

Xiao Nan @road2stat Web Scraping with R 1/43...

1/43

.

Outline

• Overview• Toolkit• Exception Handling• Parallelization• Outro

Xiao Nan @road2stat Web Scraping with R 2/43...

2/43

..

Part 1Overview

Two Types ofScrapers / Crawlers

The REAL ones and the ...

...

Fake ones?

...

I’m not a fake!

.

I can crawl the web, too!

.

Two Types of Scrapers

• General-purpose Crawlers• Focused Crawlers (our focus today)

Xiao Nan @road2stat Web Scraping with R 8/43...

8/43

.

Browser Revisited

..

Xiao Nan @road2stat Web Scraping with R 9/43...

9/43

.

Browser Revisited

...

Analysis and Retrieve

.

Xiao Nan @road2stat Web Scraping with R 10/43...

10/43

.

Browser Revisited

...

Analysis and Retrieve

.

Receive and Parse

..

Xiao Nan @road2stat Web Scraping with R 11/43...

11/43

.

Comparing to Other LanguagesPros & Cons

.Pros..

.

• Lightweight• Easy to implement• Easy to debug• Seamless modeling integration: less I/O

Xiao Nan @road2stat Web Scraping with R 12/43...

12/43

.

Comparing to Other LanguagesPros & Cons

.Pros..

.

• Lightweight• Easy to implement• Easy to debug• Seamless modeling integration: less I/O

.Cons..

.

• Fewer libraries (than Python & Ruby)• Multi-Process Parallelization: forking is de cient ...

Xiao Nan @road2stat Web Scraping with R 13/43...

13/43

.

Choosing A Better Platform

..

Xiao Nan @road2stat Web Scraping with R 14/43...

14/43

.

Choosing A Better Platform

Why Linux?

• Network performance & mem. management → Faster• Better parallelization support → Faster• Uni ed encoding & locale → Faster (for coders)• More recent third party libs → Faster (less bugs)

Xiao Nan @road2stat Web Scraping with R 15/43...

15/43

.

Choosing A Better Platform

Why Linux?

• Network performance & mem. management → Faster

• Better parallelization support → Faster• Uni ed encoding & locale → Faster (for coders)• More recent third party libs → Faster (less bugs)

Xiao Nan @road2stat Web Scraping with R 15/43...

15/43

.

Choosing A Better Platform

Why Linux?

• Network performance & mem. management → Faster• Better parallelization support → Faster

• Uni ed encoding & locale → Faster (for coders)• More recent third party libs → Faster (less bugs)

Xiao Nan @road2stat Web Scraping with R 15/43...

15/43

.

Choosing A Better Platform

Why Linux?

• Network performance & mem. management → Faster• Better parallelization support → Faster• Uni ed encoding & locale → Faster (for coders)

• More recent third party libs → Faster (less bugs)

Xiao Nan @road2stat Web Scraping with R 15/43...

15/43

.

Choosing A Better Platform

Why Linux?

• Network performance & mem. management → Faster• Better parallelization support → Faster• Uni ed encoding & locale → Faster (for coders)• More recent third party libs → Faster (less bugs)

Xiao Nan @road2stat Web Scraping with R 15/43...

15/43

..

Part 2Toolkit

Retrieve & Parse

.

ToolkitAvailable R Packages

Pkg. Name Retrieve? Parse? Must-Know?RCurl Yes No YesXML Limited Yes Yesrjson No Yes YesRJSONIO No Yes Optionalhttr Yes Yes Optionalselectr No Yes OptionalROAuth No No Optional

Xiao Nan @road2stat Web Scraping with R 18/43...

18/43

.

ToolkitAvailable R Packages

..RCurl.

[XML]

.

.

XML.

.

selectr

.

.

ROAuth

.

.

httr

.

.

[JSON]

.

.

rjson

.

.

RJSONIO .

.

Parsing

.

Retrieving

. Heavyweight.Lightweight

Xiao Nan @road2stat Web Scraping with R 19/43...

19/43

.

ToolkitAvailable R Packages

• RCurl → Header Con gurationR不务正业之 RCurl: http://cos.name/cn/topic/17816

• XML → XPath, 3 Critical Functionshttp://www.road2stat.com/cn/r/rxml.html

• rjson → Of cially Listed / RJSONIO → by D.T.L.• httr → Simpli cation version RCurl + XML + rjson

Not Recommended for not discreet enough:http://randyzwitch.com/r-error-message-fun/

• ROAuth → Useful for APIs. see RWeibo of @lijian001• selectr → Translate CSS Selectors to XPath Expressions

Xiao Nan @road2stat Web Scraping with R 20/43...

20/43

.

ToolkitAvailable R Packages

• RCurl → Header Con gurationR不务正业之 RCurl: http://cos.name/cn/topic/17816

• XML → XPath, 3 Critical Functionshttp://www.road2stat.com/cn/r/rxml.html

• rjson → Of cially Listed / RJSONIO → by D.T.L.• httr → Simpli cation version RCurl + XML + rjson

Not Recommended for not discreet enough:http://randyzwitch.com/r-error-message-fun/

• ROAuth → Useful for APIs. see RWeibo of @lijian001• selectr → Translate CSS Selectors to XPath Expressions

Xiao Nan @road2stat Web Scraping with R 20/43...

20/43

.

ToolkitAvailable R Packages

• RCurl → Header Con gurationR不务正业之 RCurl: http://cos.name/cn/topic/17816

• XML → XPath, 3 Critical Functionshttp://www.road2stat.com/cn/r/rxml.html

• rjson → Of cially Listed / RJSONIO → by D.T.L.

• httr → Simpli cation version RCurl + XML + rjsonNot Recommended for not discreet enough:http://randyzwitch.com/r-error-message-fun/

• ROAuth → Useful for APIs. see RWeibo of @lijian001• selectr → Translate CSS Selectors to XPath Expressions

Xiao Nan @road2stat Web Scraping with R 20/43...

20/43

.

ToolkitAvailable R Packages

• RCurl → Header Con gurationR不务正业之 RCurl: http://cos.name/cn/topic/17816

• XML → XPath, 3 Critical Functionshttp://www.road2stat.com/cn/r/rxml.html

• rjson → Of cially Listed / RJSONIO → by D.T.L.• httr → Simpli cation version RCurl + XML + rjson

Not Recommended for not discreet enough:http://randyzwitch.com/r-error-message-fun/

• ROAuth → Useful for APIs. see RWeibo of @lijian001• selectr → Translate CSS Selectors to XPath Expressions

Xiao Nan @road2stat Web Scraping with R 20/43...

20/43

.

ToolkitAvailable R Packages

• RCurl → Header Con gurationR不务正业之 RCurl: http://cos.name/cn/topic/17816

• XML → XPath, 3 Critical Functionshttp://www.road2stat.com/cn/r/rxml.html

• rjson → Of cially Listed / RJSONIO → by D.T.L.• httr → Simpli cation version RCurl + XML + rjson

Not Recommended for not discreet enough:http://randyzwitch.com/r-error-message-fun/

• ROAuth → Useful for APIs. see RWeibo of @lijian001

• selectr → Translate CSS Selectors to XPath Expressions

Xiao Nan @road2stat Web Scraping with R 20/43...

20/43

.

ToolkitAvailable R Packages

• RCurl → Header Con gurationR不务正业之 RCurl: http://cos.name/cn/topic/17816

• XML → XPath, 3 Critical Functionshttp://www.road2stat.com/cn/r/rxml.html

• rjson → Of cially Listed / RJSONIO → by D.T.L.• httr → Simpli cation version RCurl + XML + rjson

Not Recommended for not discreet enough:http://randyzwitch.com/r-error-message-fun/

• ROAuth → Useful for APIs. see RWeibo of @lijian001• selectr → Translate CSS Selectors to XPath Expressions

Xiao Nan @road2stat Web Scraping with R 20/43...

20/43

.

ToolkitFront-End and Miscellaneous

• Chrome Developer Tools / FireBug →Analyzing AJAX Requests: http://cos.name/cn/topic/107729

• JSONView → Output Formatted JSON• Visual Event → Bounded event on DOM elements• tcpdump + Wireshark → Packet Capture & Protocol Analysis

Xiao Nan @road2stat Web Scraping with R 21/43...

21/43

.

ToolkitFront-End and Miscellaneous

• Chrome Developer Tools / FireBug →Analyzing AJAX Requests: http://cos.name/cn/topic/107729

• JSONView → Output Formatted JSON

• Visual Event → Bounded event on DOM elements• tcpdump + Wireshark → Packet Capture & Protocol Analysis

Xiao Nan @road2stat Web Scraping with R 21/43...

21/43

.

ToolkitFront-End and Miscellaneous

• Chrome Developer Tools / FireBug →Analyzing AJAX Requests: http://cos.name/cn/topic/107729

• JSONView → Output Formatted JSON• Visual Event → Bounded event on DOM elements

• tcpdump + Wireshark → Packet Capture & Protocol Analysis

Xiao Nan @road2stat Web Scraping with R 21/43...

21/43

.

ToolkitFront-End and Miscellaneous

• Chrome Developer Tools / FireBug →Analyzing AJAX Requests: http://cos.name/cn/topic/107729

• JSONView → Output Formatted JSON• Visual Event → Bounded event on DOM elements• tcpdump + Wireshark → Packet Capture & Protocol Analysis

Xiao Nan @road2stat Web Scraping with R 21/43...

21/43

..

Part 3Exception Handling

..

More than 70%

.

Exception HandlingCoding Strategy

• Dirty HTML & XML: Preprocess with htmltidy

• Build-in Condition/Error Handler Function: XML::xmlStructuredStop

• Coding Strategy: Interative until fault-tolerant.

Xiao Nan @road2stat Web Scraping with R 25/43...

25/43

.

Exception HandlingCoding Strategy

• Dirty HTML & XML: Preprocess with htmltidy• Build-in Condition/Error Handler Function: XML::xmlStructuredStop

• Coding Strategy: Interative until fault-tolerant.

Xiao Nan @road2stat Web Scraping with R 25/43...

25/43

.

Exception HandlingCoding Strategy

• Dirty HTML & XML: Preprocess with htmltidy• Build-in Condition/Error Handler Function: XML::xmlStructuredStop

• Coding Strategy: Interative until fault-tolerant.

Xiao Nan @road2stat Web Scraping with R 25/43...

25/43

.

Exception HandlingFAQ on COS BBS

• Cookie Operation → http://cos.name/cn/topic/108806• Referer Validation → http://cos.name/cn/topic/109407• Session Validation → http://cos.name/cn/topic/107802• Encoding Errors → Identify the Problem Source

Google: 编码 site:cos.name/cn/

Xiao Nan @road2stat Web Scraping with R 26/43...

26/43

.

Exception HandlingThe Various Data Source

• Choose Of cial API rst: NCBI with rOpenSci

• Restricted API usage: Private API key

Xiao Nan @road2stat Web Scraping with R 27/43...

27/43

.

Exception HandlingThe Various Data Source

• Choose Of cial API rst: NCBI with rOpenSci• Restricted API usage: Private API key

Xiao Nan @road2stat Web Scraping with R 27/43...

27/43

..

.

Exception HandlingThe Various Data Source

• Choose Of cial API rst (NCBI with rOpenSci)• Restricted API usage: Private API key• SSL and SSL Decryption: Trusted MITM

http://www.webos-internals.org/wiki/Decrypt_SSL_(trusted_

man-in-the-middle_technique)

Xiao Nan @road2stat Web Scraping with R 29/43...

29/43

..

Part 4Parallelization

..

.

ParallelizationThe best solution is?

A Conventional Way: RCurl::getURIAsynchronous()

• Native. Extremely easy to use.• Pitfalls: Have to control the process number by hand.

This seems weird!

Xiao Nan @road2stat Web Scraping with R 32/43...

32/43

.

ParallelizationThe best solution is?

A Conventional Way: RCurl::getURIAsynchronous()

• Native. Extremely easy to use.

• Pitfalls: Have to control the process number by hand.This seems weird!

Xiao Nan @road2stat Web Scraping with R 32/43...

32/43

.

ParallelizationThe best solution is?

A Conventional Way: RCurl::getURIAsynchronous()

• Native. Extremely easy to use.• Pitfalls: Have to control the process number by hand.

This seems weird!

Xiao Nan @road2stat Web Scraping with R 32/43...

32/43

.

ParallelizationThe best solution is?

A Better Way: doMC + foreach

• Full control / Easy to migrate / Natural to code:

require(doMC)registerDoMC(20)x <- foreach(i = 1:1e+5, ...) %dopar% {

xxx <- getURL(urls[i])}

Xiao Nan @road2stat Web Scraping with R 33/43...

33/43

.

ParallelizationThe best solution is?

A Better Way: doMC + foreach

• Full control / Easy to migrate / Natural to code:

require(doMC)registerDoMC(20)x <- foreach(i = 1:1e+5, ...) %dopar% {

xxx <- getURL(urls[i])}

Xiao Nan @road2stat Web Scraping with R 34/43...

34/43

.

ParallelizationThe best solution is?

A Better Way: doMC + foreach

• Full control / Easy to migrate / Natural to code:

require(doMC)registerDoMC(20)x <- foreach(i = 1:1e+5, ...) %dopar% {

xxx <- getURL(urls[i])}

• Single machine, registerDoMC(20), 10 min, 1e+5 pages.

Xiao Nan @road2stat Web Scraping with R 35/43...

35/43

.

ParallelizationThe best solution is?

A Better Way: doMC + foreach

• Full control / Easy to migrate / Natural to code:

require(doMC)registerDoMC(20)x <- foreach(i = 1:1e+5, ...) %dopar% {

xxx <- getURL(urls[i])}

• Single machine, registerDoMC(20), 10 min, 1e+5 pages.• Pitfalls: (Almost) Linux only.

Xiao Nan @road2stat Web Scraping with R 36/43...

36/43

.

ParallelizationMore Pitfalls

• Requires high-perf storage →Redis (rredis) or MongoDB (RMongo)?

• Memory leak (RCurl & XML) → Avoid long exec. time• Intensive testing before run → Minimize errors

Xiao Nan @road2stat Web Scraping with R 37/43...

37/43

..

Part 5Outro

.

RemarksWeb Crawler Ethics

• Web Crawler Ethics

• Honor robots.txt• A Balanced Crawling Rate• Spammer Shame• With great power comes great responsibility.

Xiao Nan @road2stat Web Scraping with R 39/43...

39/43

.

RemarksWeb Crawler Ethics

• Web Crawler Ethics• Honor robots.txt

• A Balanced Crawling Rate• Spammer Shame• With great power comes great responsibility.

Xiao Nan @road2stat Web Scraping with R 39/43...

39/43

.

RemarksWeb Crawler Ethics

• Web Crawler Ethics• Honor robots.txt• A Balanced Crawling Rate

• Spammer Shame• With great power comes great responsibility.

Xiao Nan @road2stat Web Scraping with R 39/43...

39/43

.

RemarksWeb Crawler Ethics

• Web Crawler Ethics• Honor robots.txt• A Balanced Crawling Rate• Spammer Shame

• With great power comes great responsibility.

Xiao Nan @road2stat Web Scraping with R 39/43...

39/43

.

RemarksWeb Crawler Ethics

• Web Crawler Ethics• Honor robots.txt• A Balanced Crawling Rate• Spammer Shame• With great power comes great responsibility.

Xiao Nan @road2stat Web Scraping with R 39/43...

39/43

..

.

Further Reading

1. XML & JSON Speci cation (esp. XPath)

2. RCurl & XML Documentation

3. Web Data Mining (Chapter 8) by Bing Liu

4. XML and Web Technologies for Data Sciences with Rby Duncan Temple Lang, et al. (Due Sep. 2013)

5. Curl.jl https://github.com/forio/Curl.jl

Xiao Nan @road2stat Web Scraping with R 41/43...

41/43

http://cos.name/cn/

Q & A