Scrape the Web: Strategies for programming websites that don’t expect it

Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org, +1-585-506-8865)

February 18, 2010

Outline

Intro

Programming the web

Stats pop quiz

The web: Round one

The web: HTTP and you

Recap and philosophy

Parser redux

Countermeasures

Automating the web browser

Other tricks

Conclusions

Intro

Meta

Hello

- You will learn neat tricks
- DO NOT BECOME AN EVIL COMMENT SPAMMER
- Theory, practice, and iterative development
- Brittle? Sometimes.
- The comics aren’t mine; ask me for references.

Format introduction

- I’ll stand up here and talk about things.
- You’ll ask me questions.

You know what sucks?

- It sucks when everyone’s thinking something and nobody’s saying it.
- If I am incoherent, stop me.

“Only” three hours

- Slow me down,
- or speed me up.
- Do this with your voice or by raising your hand.
- Don’t try to do it via Twitter.

What is screen scraping?

Photo

Photo

Brittle?

Remote procedure call

- Every time you press a key, you cause the remote computer to execute code.
- Every keypress causes a remote procedure call.
- If you understand this, you can document it as an API.

Power

- We get to interact with the raw data.
- We could write our own interface.
- We get to programmatically interact with a system that only expects humans at the door.

Independence

- Design choices and restrictions fall away.

Power, too much

- WE CAN SEND SPAM!
- Don’t do that.

Programming the web

Say

The Web

- It’s the twenty-first century.
- The Web is a massive, mostly-unrestricted remote procedure call system.
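
One way to picture the "remote procedure call" framing: the URL path names the procedure and the query string carries its arguments. The sketch below only builds such a URL; the endpoint and parameter names are hypothetical, chosen for illustration and not taken from any real service.

```python
from urllib.parse import urlencode


def remote_call_url(endpoint, **arguments):
    """Spell a 'procedure call' as a URL.

    The endpoint plays the role of the procedure name, and the
    keyword arguments become the query string.
    """
    # Sort the arguments so the resulting URL is stable across calls.
    query = urlencode(sorted(arguments.items()))
    return endpoint + "?" + query if query else endpoint


# Hypothetical endpoint and parameters, for illustration only.
print(remote_call_url("http://example.com/speak", voice="demo", text="hello world"))
```

Fetching that URL would be the "call"; the response body would be the "return value".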

Mac OS “say”

- I’m not hip enough to have “say”
- but I do have the Web

Cepstral demo

Curry

Delicious

Curry on the web

http://mehfilindian.com/LunchMenuTakeOut.htm

Beneath the covers...

- FrontPage 6.0 is from 2003
- Some really ugly HTML...
- I like to call this 1998-style HTML

The easy way

examples/curry/trivial.py

- urllib2.urlopen() gives you a file-like object
- Now you can read() it... (and you get a big ol’ byte string)
- Test its contents for “squash”, and you’re done.
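
The contents of examples/curry/trivial.py are not reproduced in these slides, so the following is only a minimal sketch of the three steps above, not the original script. It assumes Python 3, where `urllib.request.urlopen` is the successor to `urllib2.urlopen`; the function name `page_mentions` and the `opener` parameter are made up here so the check can be tried without touching the network.

```python
from urllib.request import urlopen  # Python 3 successor to urllib2.urlopen


def page_mentions(url, word, opener=urlopen):
    """Return True if the raw page bytes contain `word`, ignoring case.

    `opener` is any callable that takes a URL and returns a file-like
    object; it defaults to urlopen and exists only to allow offline use.
    """
    with opener(url) as response:
        body = response.read()  # one big byte string
    return word.lower().encode("utf-8") in body.lower()


# Against the menu page from the slides, this would be:
# page_mentions("http://mehfilindian.com/LunchMenuTakeOut.htm", "squash")
```

No parsing happens here at all: this is a substring test on bytes, which is exactly why it is the easy way and also why it can be fooled by the word appearing in markup or comments.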

The Web and standards

- We don’t have to resort to visual screen scraping.
- The web has a standard data format for marking up page content.
- What is it called?

XHTML and HTML

- It’s 2010.
- Surely XHTML has won by now.

“Extract some information”

- HTML
- vs. XHTML (2000)
- Both are trees of tags; both can be visualized in Firebug.
- ...did XHTML win?
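
To make "trees of tags" concrete without a browser, here is a small sketch that prints the nesting of a snippet as an indented outline. It assumes Python 3's `html.parser` and markup in which every opened tag is closed (or self-closed); the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Record each start tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-nested input; unclosed tags would skew the depth.
        self.depth -= 1


def outline(markup):
    builder = OutlineBuilder()
    builder.feed(markup)
    builder.close()
    return "\n".join(builder.lines)


print(outline("<html><body><ul><li>Dal</li><li>Naan</li></ul></body></html>"))
```

The same outline falls out whether the snippet is served as HTML or as XHTML, which is the sense in which both are trees.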

Stats pop quiz

(Stats from the MAMA survey published by Opera <http://dev.opera.com/articles/view/mama-key-findings/>.)

- Average page size? 16.5K
- HTML to XHTML ratio? 2:1
- Transitional vs. Strict/Frameset: 10:1
- How many in “Quirks” mode? 85%
- What’s more popular? TITLE or BODY? TITLE
- What percent validate in general? ca. 4.13%
- What percent of web pages that have validation badges validate? ca. 12

The web: Round one

Parsing considerations

A showcase of some of your options

I An example of valid HTML (written by hand) (examples/parsing/)
  I Parsed with HTMLParser

I An example of invalid HTML (cooked by hand) (examples/parsing/invalid-html/)
  I Parsed with HTMLParser

I An example of valid XHTML (written by hand) (examples/parsing/valid-xhtml/)
  I Parsed with xml.dom.minidom
  I Parsed with HTMLParser

I An example of invalid XHTML <http://www.washington.edu/accessit/webdesign/student/unit5/invalidHTML.htm> (examples/parsing/invalid-xhtml/)
  I In Firefox
  I In xml.dom.minidom
  I In HTMLParser

I If web HTML is not always parseable, we need a different approach.
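
To make the contrast between the two parsers concrete, here is a minimal sketch in Python 3, where the old HTMLParser module lives at html.parser. The two sample documents and the names TagCollector, tags_via_htmlparser, and title_via_minidom are made up for illustration; they are not the files under examples/parsing/.

```python
from html.parser import HTMLParser  # Python 3 home of the Python 2 HTMLParser module
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample documents, not the files from the talk's examples directory.
VALID_XHTML = "<html><head><title>Menu</title></head><body><p>Curry</p></body></html>"
# Same page, but with an unclosed <br>, so it is no longer well-formed XML.
INVALID_XHTML = "<html><head><title>Menu</title></head><body><p>Curry<br></body></html>"


class TagCollector(HTMLParser):
    """Record the name of every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_via_htmlparser(markup):
    """Feed markup to the event-based parser and return the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def title_via_minidom(markup):
    """Return the <title> text, or None if the markup is not well-formed XML."""
    try:
        document = minidom.parseString(markup)
    except ExpatError:
        return None
    return document.getElementsByTagName("title")[0].firstChild.data
```

In this sketch the event-based parser reports tags for both documents, while xml.dom.minidom gives up on the one with the unclosed tag. That is the gap the last bullet points at.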

Other ways to get information out of web pages?

I "squash" in page_contents.lower()

I re.search("squash", page_contents, re.IGNORECASE)
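
Both checks can be run against any string. The sketch below assumes the page has already been downloaded into a variable; the sample text is hypothetical.

```python
import re

# Hypothetical page text standing in for a downloaded menu page.
page_contents = (
    "<html><body><h1>Today's Specials</h1>"
    "<p>Butternut SQUASH curry</p></body></html>"
)

# Substring test: lowercase the page once, then look for the lowercase word.
found_by_substring = "squash" in page_contents.lower()

# Regular expression test: let the re module ignore case for us.
match = re.search("squash", page_contents, re.IGNORECASE)
found_by_regex = match is not None
```

Neither approach knows anything about tags; both simply answer whether the word appears somewhere in the raw text.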

Inspirational quote: JWZ

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
– Jamie Zawinski

What’s wrong with regular expressions for scraping

I <a href="/whatever/">

I <a href='whatever'>

I <a href=‘whatever”>

I Okay for “Reviews 1-10 of 430”

I Kodos: Regular expression GUI (since redemo.py seems unmaintained)
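
A small sketch of why the link examples are trouble while the review counter is fine. The pattern names and the helper review_range are invented for this illustration.

```python
import re

# A naive pattern that only understands double-quoted href values.
NAIVE_HREF = re.compile(r'<a href="([^"]*)">')

double_quoted = '<a href="/whatever/">'
single_quoted = "<a href='whatever'>"

# The naive pattern finds the first link but silently misses the second.
first_hit = NAIVE_HREF.search(double_quoted)
second_hit = NAIVE_HREF.search(single_quoted)

# Free text with a fixed shape is a much better fit for a regular expression.
REVIEW_RANGE = re.compile(r"Reviews (\d+)-(\d+) of (\d+)")


def review_range(text):
    """Return (first, last, total) as ints, or None if the text has no range."""
    found = REVIEW_RANGE.search(text)
    if found is None:
        return None
    return tuple(int(group) for group in found.groups())
```

Every new quoting style needs another patch to the link pattern, whereas the review counter has one shape and one pattern covers it.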

Inspirational quote: Jon Postel

Robustness principle: “Be conservative in what you do, be liberal in what you accept from others.”
– Jon Postel, Transmission Control Protocol, RFC 793

Inspirational quote: Leonard Richardson

“You didn’t write that awful page. You’re just trying to get some data out of it. Right now, you don’t really care what HTML is supposed to look like.”
– Leonard Richardson, author of BeautifulSoup

Back to curry

New goal for curry: Objectify

Map the menu to Python objects

I play with the source in BeautifulSoup

I ...this is a text processing problem, not tag processing.

Model the data

examples/curry/menu.py

class Entree:

I index

I name

I description

I long_winded_description

I price
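
One way such a class could look in modern Python is sketched below. This is not the contents of examples/curry/menu.py: the field types, the MENU_LINE layout, and the helper entree_from_line are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Entree:
    """One menu item, with the five fields listed on the slide."""

    index: int
    name: str
    description: str
    long_winded_description: str
    price: Decimal


# Hypothetical line layout: "<index>. <name> - <description> $<price>"
MENU_LINE = re.compile(r"^(\d+)\.\s+(.+?)\s+-\s+(.+?)\s+\$(\d+\.\d{2})$")


def entree_from_line(line, long_winded_description=""):
    """Build an Entree from one line of menu text, or return None if it does not fit."""
    found = MENU_LINE.match(line.strip())
    if found is None:
        return None
    index, name, description, price = found.groups()
    return Entree(int(index), name, description, long_winded_description, Decimal(price))
```

Keeping the price as a Decimal rather than a float avoids rounding surprises when prices are later summed or compared.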

Mini-lesson

I hand-written pages vs.

I machine-written pages

New goal: Scrape Yahoo! finance

I examples/tree-builders/beautifulsoup_yfinance.py

We’re done!

Right?

Trees of tags

What defines how HTML gets parsed?

Web browsers

Surfing tag trees in FireBug

I Or Opera Dragonfly

I Or Chrome’s Inspector

Parsing trees and finding elements

Early history

I 1998: HTML::TokeParser for Perl
  I $p->get_tag("title")

I 1999: W3C XPath standard
  I xmlDoc.selectNodes("//title")

I 2004: BeautifulSoup for Python, Release 1.0, “So rich and green”
  I soup("title")

I 2006: scrAPI for Ruby
  I CSS Selectors...
  I title
  I span.title

Recent history

I 2007: lxml.html improved, publicized by Ian Bicking
  I CSS selectors for Pythonistas

I 2007: html5lib: Parse web pages like a browser

I 2008: BeautifulSoup 3.1.0, the end of an era

I 2010: html5lib deprecates BeautifulSoup
  I “cannot correctly represent any HTML 5 tree (for lack of namespace support), and cannot represent at all any containing MathML or SVG”

Searching tag trees

I BeautifulSoup API (examples/tree-builders/beautifulsoup/search.py)

I html5lib creates BeautifulSoup objects (or others) (examples/tree-builders/html5lib/search.py)

I lxml provides XPath (examples/tree-builders/lxml/search_xpath.py)

I “minimal stable XPath”

I lxml provides CSSSelector (examples/tree-builders/lxml/search_css.py)
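
A rough sketch of searching a tag tree by path. It uses only the standard library's xml.etree.ElementTree, whose limited XPath subset stands in here for lxml's full XPath support. The sample markup, the "results" id, and the function name are invented for illustration, and the input must be well-formed. The id-anchored path is one reading of "minimal stable XPath": depend on as little page structure as possible.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page (real pages usually need lxml or html5lib).
SAMPLE = """
<html>
  <body>
    <div id="nav"><a href="/about">About</a></div>
    <div id="results">
      <a href="/one">First hit</a>
      <a href="/two">Second hit</a>
    </div>
  </body>
</html>
"""

def result_links(markup):
    """Return (href, text) pairs for links inside the div with id="results"."""
    root = ET.fromstring(markup)
    # Anchor on the id rather than on "the second div in body", so the
    # expression is less likely to break when the layout shifts.
    links = root.findall(".//div[@id='results']/a")
    return [(a.get("href"), a.text) for a in links]

print(result_links(SAMPLE))
```

With lxml installed, the same idea would normally be written with the parsed tree's xpath() method, which accepts full XPath 1.0.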

Page 162: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Interacting with the web

Page 163: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Basic Yahoo! search (hard-coded)

examples/search/yahoo.py

Page 164: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Basic Google! search (hard-coded)

examples/search/google.py

I Great code, but broken due to ?

Page 166: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Something’s wrong...

Page 168: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

The web: HTTP and you

Page 169: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

A network trace of an HTTP conversation

Page 170: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

User-Agent, and other headers the client sends

Page 171: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Status codes

I 2xx: Success

I 3xx: Redirection

I 4xx: Client error

I 402 Payment Required

I 404 Not Found

I 410 Gone

I 418 I’m a teapot
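
A small sketch of how a scraper might classify the codes above, using the standard library's http.HTTPStatus for the reason phrases. The status_class helper is a name made up for this example. 418 is left out because not every Python version lists it.

```python
from http import HTTPStatus

def status_class(code):
    """Map a numeric HTTP status code to its broad class."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")

for code in (200, 301, 402, 404, 410):
    # HTTPStatus(code).phrase is the standard reason phrase for that code.
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```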

Page 179: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

HTTP methods

I GET

I POST

I PUT

I BREW
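
A minimal sketch of how the method gets chosen when building requests with urllib.request (the Python 3 counterpart of urllib2). Nothing is sent over the network, and the URL is a placeholder.

```python
import urllib.request

URL = "http://example.com/resource"  # placeholder URL

# No body: urllib defaults to GET.
get_req = urllib.request.Request(URL)
# A body without an explicit method: urllib defaults to POST.
post_req = urllib.request.Request(URL, data=b"name=value")
# Any other method has to be named explicitly.
put_req = urllib.request.Request(URL, data=b"new contents", method="PUT")

for req in (get_req, post_req, put_req):
    print(req.get_method(), req.full_url)
```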

Page 184: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Once we set User-Agent, are we just like Firefox?

I JavaScript behavior

I Image download behavior

I Cookie behavior

I Invalid HTML handling behavior (?)

I Accept: headers

Page 190: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

What if we settle for approximate emulation?

Page 191: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Re-do of Google search with a cooked user-agent

examples/search/urllib2-user-agent/google_as_ie.py

Page 192: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Favorite User-Agent headers

I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)

I Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;(compatible;))

I I can’t believe it’s not Googlebot/2.1
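
A sketch of sending one of these strings, assuming Python 3's urllib.request in place of urllib2. The request is built but not sent, and the URL and the make_request helper are placeholders for this example.

```python
import urllib.request

# The first User-Agent string from the list above.
FAKE_IE = "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98;)"

def make_request(url, user_agent=FAKE_IE):
    """Build (but do not send) a request that carries a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

req = make_request("http://example.com/search?q=python")
# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
# To actually fetch the page: urllib.request.urlopen(req).read()
```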

Page 196: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

HTTP: State via cookies

I HTTP is stateless by itself; cookies re-implement state on top of it (even though TCP underneath is already stateful)
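
A sketch of the round trip at the header level, using the standard library's http.cookies. The cookie name and value are made up. A real client would also check the domain, path, and expiry before sending a cookie back.

```python
from http.cookies import SimpleCookie

# What a server might send in a Set-Cookie response header (made-up values).
set_cookie_header = "sessionid=abc123; Path=/"

cookies = SimpleCookie()
cookies.load(set_cookie_header)

# What the client sends back in the Cookie header of its next request.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in cookies.items()
)
print("Cookie:", cookie_header)
```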

Page 198: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

robots.txt

I User-agent: *

I Disallow: /

I Allow: /crawlme.html

I http://www.robotstxt.org/
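
A sketch of checking rules like these with the standard library's urllib.robotparser, fed from a string so nothing touches the network. The bot name and the example.com URLs are placeholders. Python's parser has historically applied the first rule that matches, while some crawlers prefer the most specific rule. The sketch therefore prints the verdicts for both orderings instead of assuming one.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Parse robots.txt text we already hold, without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# The rules from the slide, in the same order.
slide_order = parse_rules("User-agent: *\nDisallow: /\nAllow: /crawlme.html\n")
# The same rules with the Allow line first.
allow_first = parse_rules("User-agent: *\nAllow: /crawlme.html\nDisallow: /\n")

for name, parser in (("slide order", slide_order), ("allow first", allow_first)):
    for path in ("/crawlme.html", "/secret.html"):
        url = "http://example.com" + path
        print(name, path, parser.can_fetch("ExampleBot", url))
```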

Page 203: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

robots.txt and detectability

I “How does the server know you’re a robot?”

I Well, if you GET /robots.txt...

Page 206: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Filling out more forms: POST and GET

(Be sure to pay attention to the clock; minute 90 is when snack break starts.)

Page 207: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

POST: Cepstral Weather demo (by hand)

http://cepstral.com/cgi-bin/demos/weather

Page 208: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Note the URL we POST to

I from Firebug

Page 210: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Note the data we POST

I from Firebug

Page 212: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Write simple Python that also POSTs

examples/cepstral/just_post.py
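
This is not the talk's just_post.py, only a sketch of the general shape, assuming Python 3's urllib. The field names are hypothetical, and the URL is the demo page, which may differ from the form's real action URL. Read both from the form itself or from Firebug.

```python
import urllib.parse
import urllib.request

DEMO_URL = "http://cepstral.com/cgi-bin/demos/weather"

def build_post(url, fields):
    """Encode form fields the way a browser does and wrap them in a POST."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(url, data=body)

# Hypothetical field names, for illustration only.
req = build_post(DEMO_URL, {"city": "Atlanta", "state": "GA"})
print(req.get_method(), req.data)
# To submit for real: urllib.request.urlopen(req).read()
```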

Page 213: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Pull out the .wav file and play it with mplayer

examples/cepstral/play_wav.py
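
A sketch of the "pull out the .wav" step, assuming the response page links to the audio through an href or src attribute. The sample markup and the file name in it are invented. Playback is left as a comment because it needs mplayer installed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class WavLinkFinder(HTMLParser):
    """Collect href/src attribute values that end in .wav."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value and value.lower().endswith(".wav"):
                self.found.append(value)

def find_wav_urls(html, base_url):
    """Return absolute URLs for every .wav link found in the page."""
    finder = WavLinkFinder()
    finder.feed(html)
    return [urljoin(base_url, link) for link in finder.found]

# Made-up response body standing in for what the server returns.
SAMPLE = '<p>Done. <a href="/demos/out/weather123.wav">Listen</a></p>'
urls = find_wav_urls(SAMPLE, "http://cepstral.com/cgi-bin/demos/weather")
print(urls)
# To play the first hit: subprocess.call(["mplayer", urls[0]])
```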

Page 214: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

POST: Cepstral weather demo (via mechanize)

examples/cepstral/just_post_via_mechanize.py

Page 215: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Basic Yahoo! search (via mechanize)

examples/search/yahoo_mechanize.py

I Great code, but broken due to robots.txt

Page 217: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Basic Yahoo! search (via mechanize, handle_robots=False)

examples/search/yahoo_mechanize_norobots.py
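
A rough sketch of what such a script could look like, not the talk's actual file. It assumes the third-party mechanize package is installed and that the search box is the first form on the page. The function name, its parameters, and the field name in the usage comment are hypothetical.

```python
def search_with_mechanize(url, field, query, user_agent=None):
    """Fill in the first form on a page and return the response body."""
    import mechanize  # third-party package, imported only when called

    br = mechanize.Browser()
    br.set_handle_robots(False)   # do not fetch or obey robots.txt
    if user_agent:
        br.addheaders = [("User-agent", user_agent)]
    br.open(url)
    br.select_form(nr=0)          # assumption: the search box is form number 0
    br[field] = query
    return br.submit().read()

# Hypothetical usage; "p" is a guess at the name of the query field.
# html = search_with_mechanize("http://search.yahoo.com/", "p", "pycon")
```

Turning robots handling off is a deliberate choice, so it is worth checking the site's terms before running anything like this in a loop.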

Page 218: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Basic Google! search (via mechanize, handle_robots=False, change user-agent)

examples/search/google_mechanize.py

Page 219: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Cookies

Page 220: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

emusic: Log in and verify that we logged in successfully (with cookielib) (optional)

examples/cookies/emusic_login_byhand.py
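
A sketch of the by-hand approach, assuming Python 3, where cookielib is named http.cookiejar and urllib2 is urllib.request. The login URL, the field names, and the "Sign out" marker in the commented-out flow are placeholders, not emusic's real ones.

```python
import http.cookiejar
import urllib.request

def make_session():
    """Return an opener that remembers cookies, plus the jar that stores them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

opener, jar = make_session()
print(len(jar))  # number of cookies held so far

# Hypothetical login flow (every name below is a placeholder):
# import urllib.parse
# data = urllib.parse.urlencode({"email": "...", "password": "..."}).encode()
# opener.open("https://www.example.com/login", data)
# page = opener.open("https://www.example.com/account").read()
# logged_in = b"Sign out" in page
```

Every request made through the same opener shares the jar, which is what keeps the session alive between the login POST and the later page loads.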

Page 221: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

emusic: Log in and verify that we logged in successfully (with mechanize)

examples/cookies/emusic_login_mechanize.py

Page 222: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

emusic: Check how many downloads we have left (with mechanize)

examples/cookies/emusic_check_downloads.py

Page 223: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Now we’re done, right?

Whew.

Page 225: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Recap and philosophy

Page 226: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Recap

We’ve seen:

I Loading web pages from the network with urllib2

I Parsing web pages (even broken ones)

I Scraping that page into a set of structured Python objects

I HTTP status codes

I Faking the user agent header

I Submitting forms

I Keeping a session with cookies

Page 234: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

“Play nice” on the web

I Ignore Terms of Service at your own peril

I robots.txt

I DO NOT BECOME AN EVIL COMMENT SPAMMER

Page 238: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Why scrape the web?

I Anger

I Interoperation with unmaintained systems

I “Rogue interoperability”

Page 242: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Web APIs

Page 243: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

Facebook uses standards!

I XMPP chat doesn’t support:

I grouping contacts

I status messages

I large profile images

I notifications

I What’s the point?

Page 250: Scrape the Web: Strategies for programming …Scrape the Web: Strategies for programming websites that don’t expect it Presenter: Asheesh Laroia, @asheeshlaroia (scrape-pycon@asheesh.org,

“Sorry”

I Ohloh: “Sorry, it is not currently possible to get the list of commits through the API.”

I Flickr: No way to get a user avatar via the API.

I API keys are evidence of submission.

I Where is the love?

I Why even play this game?

Parser redux

Choosing a parser

I Performance

I Ease-of-use

I Quality

I Especially as relates to cleaning broken HTML

I HTML: 1998-style, or 2003-style?

Benchmarks by Ian Bicking

I Benchmarks run by me this morning

I same results as Ian

Ease of use

Tree fixups

I lxml ≈ BeautifulSoup

I lxml ≈ html5lib

I BeautifulSoup 3.0.7 > BeautifulSoup 3.1.0

A winner

I lxml!

I ...?

More about CSS selectors

I FireQuark

I http://www.imdb.com/title/tt0111161/

I h5:contains("Release")

I CSS...
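
The selector above asks for every h5 element whose text contains "Release". As a rough illustration of what that match means, here is a sketch that uses only the standard library's html.parser rather than lxml or FireQuark. The sample markup is invented for the example and is not IMDb's actual page; H5Finder is a name made up here.

```python
from html.parser import HTMLParser


class H5Finder(HTMLParser):
    """Collect the text of every <h5> element that contains a given word."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.in_h5 = False
        self.buffer = []
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == "h5":
            self.in_h5 = True
            self.buffer = []

    def handle_data(self, data):
        # Only text that sits inside an open <h5> is of interest.
        if self.in_h5:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h5" and self.in_h5:
            text = "".join(self.buffer).strip()
            if self.needle in text:
                self.matches.append(text)
            self.in_h5 = False


# Invented sample markup, loosely shaped like a movie page.
SAMPLE = "<div><h5>Director:</h5><h5>Release Date:</h5><p>14 October 1994</p></div>"

finder = H5Finder("Release")
finder.feed(SAMPLE)
print(finder.matches)
```

With a selector engine, the whole class collapses into the one-line selector, which is the point of the slide.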

Countermeasures

Easy

Imagine a really stupid bot

Check Referer header

I mechanize solves this
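
For a sense of what is being checked, here is a minimal sketch that sets the Referer header by hand with the standard library's urllib.request instead of mechanize. Both URLs are placeholders, and build_request is a helper name made up for this example.

```python
import urllib.request


def build_request(url, came_from):
    """Build a GET request that claims to follow a link on `came_from`."""
    request = urllib.request.Request(url)
    # A naive bot sends no Referer at all; a browser following a link does.
    request.add_header("Referer", came_from)
    return request


# Placeholder URLs; nothing is fetched, the request is only constructed.
req = build_request("http://example.com/download", "http://example.com/")
print(req.get_header("Referer"))
```

Doing this by hand for every request gets tedious, which is why a browser-like library that tracks the previous page is the easier route.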

Extra hidden form fields

I mechanize solves this
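
The idea is that the server plants fields in the form and expects them back unchanged. A sketch of doing that without mechanize, using only the standard library; the form, the field name "token", and its value are all invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldCollector(HTMLParser):
    """Record the name and value of every <input type="hidden">."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            if attrs.get("name"):
                self.fields[attrs["name"]] = attrs.get("value") or ""


# Invented form; the token name and value are hypothetical.
FORM = """
<form action="/comment" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="body">
</form>
"""

collector = HiddenFieldCollector()
collector.feed(FORM)

# Send back the hidden fields untouched, plus the field we actually fill in.
payload = dict(collector.fields, body="hello")
print(urlencode(payload))
```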

Requiring cookies

I mechanize solves this
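
Without mechanize, the standard library can keep cookies too. A minimal sketch, assuming plain urllib.request is enough for the site in question; the URLs in the comments are placeholders and no request is actually made.

```python
import http.cookiejar
import urllib.request

# The jar remembers cookies set by earlier responses.
jar = http.cookiejar.CookieJar()

# Every request made through this opener stores and replays cookies from the jar.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Usage (performs network I/O, so it is left commented out):
# opener.open("http://example.com/login")
# opener.open("http://example.com/members")  # sends back the session cookie
print(len(jar))
```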

Countermeasures: hard

Per-IP address query limits

Example: Yahoo web search API

I Use more IPs

I Tor, or

I your own machines

I Use SOCKS (plus SSH) to make this easy

CAPTCHAs

Example: Google web search (when you exceed undeclared query limits).

I uh-oh

JavaScript

Example: “Hash cash” system for avoiding comment spam.

I uh-oh

Invisible countermeasures

Behavior profiling

I Time-based?

Inserting false links visible only to bots

I “Tarpits”

robots.txt access

I As soon as you access it, you lose.

Getting around IP address limits

Understand

I We still have to stay within the limits. We can just take advantage of IPs we do have.

ssh -D

I Borrow the IP of any machine you can log in to

I ssh -D 1080 asheesh.org

socks_monkey

I SOCKSify Python from within Python

I examples/ip-limits/socks_monkey.py

tsocks

I SOCKSify Python via LD_PRELOAD

I examples/ip-limits/tsocks/

tor

“The onion router”

I SOCKSify but borrow someone else’s IP

I (play nice...)

Cycling strategies

I Drain it dry

I easy to implement first

I Round-robin

I generally preferable
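
The two strategies can be sketched as generators that hand out which proxy to use for the next request. This is a minimal illustration, assuming every proxy has the same per-IP quota; the proxy addresses and function names are made up here.

```python
import itertools


def drain_it_dry(proxies, quota):
    """Use one proxy until its quota is spent, then move to the next."""
    for proxy in proxies:
        for _ in range(quota):
            yield proxy


def round_robin(proxies, quota):
    """Rotate through the proxies, spending each one's quota evenly."""
    remaining = {proxy: quota for proxy in proxies}
    for proxy in itertools.cycle(proxies):
        if not any(remaining.values()):
            return  # every proxy has used up its quota
        if remaining[proxy] > 0:
            remaining[proxy] -= 1
            yield proxy


# Hypothetical local SOCKS endpoints, e.g. one per `ssh -D` tunnel.
PROXIES = ["localhost:1080", "localhost:1081"]

print(list(drain_it_dry(PROXIES, 2)))
print(list(round_robin(PROXIES, 2)))
```

Both stay within each IP's limit. Round-robin spreads consecutive requests across IPs, so each address sees a lower request rate.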

Return to JavaScript: breaking Hash Cash

Detecting its presence

I Attempt to submit a comment with JS disabled

I Attempt to submit a comment with JS enabled

I Trace the second attempt in Firebug

Rewriting the JavaScript as Python

I You may think I’m joking, but this is a common strategy.
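
To make the strategy concrete, here is a sketch of porting a proof-of-work check to Python. The scheme is invented for illustration (a challenge string, a counter, SHA-1, and a required prefix of zeros); it is not the algorithm of any particular hash cash plugin, and a real site's routine has to be read from its own script and ported line by line.

```python
import hashlib
import itertools


def solve_challenge(challenge, zeros=2):
    """Find the smallest nonce whose SHA-1 digest starts with `zeros` hex zeros.

    Mirrors a made-up JavaScript routine, not any real plugin's check.
    """
    prefix = "0" * zeros
    for nonce in itertools.count():
        digest = hashlib.sha1(f"{challenge}:{nonce}".encode("utf-8")).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest


# "page-token-42" stands in for a value scraped from the comment form.
nonce, digest = solve_challenge("page-token-42")
print(nonce, digest)
```

The scraper would then submit the nonce in whatever form field the page's JavaScript would have filled in.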

DOMForm

I Good news

“DOMForm is a Python module for web scraping and web testing. It knows how to evaluate embedded JavaScript code in response to appropriate events.” – John J. Lee of mechanize

I Bad news

“This module is unmaintained. Maybe someday...”

Also, it does not execute page-global JavaScript, which is where HashCash is implemented.

python-spidermonkey

I Good news

I “Python/JavaScript bridge module, making use of Mozilla’s spidermonkey JavaScript implementation.”

I Bad news

I ...do you really want to parse the web page for JavaScript and execute it?

I examples/javascript/hashcash.py

Ick

I None of this is as clean and automated as mechanize.

“Breaking” CAPTCHAs

Fallback: yourself

I Can always just prompt the operator to figure it out and enter it

Mailinator: “Enter these words to delete the email”

I Only so many different images

I So build a look-up table

I ...indexed by URL?

I ...indexed by image contents?

I ...indexed by fuzzy image contents?

(I don’t have a good tool for the last one.)
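
A sketch of the "indexed by image contents" variant, falling back to the operator for images not seen before. It assumes the site serves byte-identical files for the same challenge; if images are re-encoded, an exact hash misses, and the fuzzy variant is not attempted here. All names and the placeholder bytes are made up for the example.

```python
import hashlib

# Maps a digest of the image bytes to the words shown in that image.
known_answers = {}


def image_key(image_bytes):
    """Index by exact image contents; byte-identical images share a key."""
    return hashlib.sha256(image_bytes).hexdigest()


def solve(image_bytes, ask_operator):
    """Return the cached answer, asking the operator only for unseen images."""
    key = image_key(image_bytes)
    if key not in known_answers:
        known_answers[key] = ask_operator(image_bytes)
    return known_answers[key]


# Stand-in for a human typing the answer; a real script might display the
# image and call input(). The bytes below are placeholders, not image data.
calls = []


def fake_operator(image_bytes):
    calls.append(image_bytes)
    return "two words"


first = solve(b"fake-image-1", fake_operator)
second = solve(b"fake-image-1", fake_operator)
print(first, second, len(calls))
```

The operator is asked once per distinct image, so the table pays off quickly when there are only so many images.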

Audio captchas: “Simple” signal analysis

I Should be doable in pylab/matplotlib with fast Fourier transforms
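
To give a feel for the signal-analysis step, here is a sketch that finds the loudest frequency in a block of audio samples. It uses a small pure-Python radix-2 FFT rather than pylab so that it has no dependencies; the function names are made up, the input length must be a power of two, and this is nowhere near a full audio CAPTCHA solver.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    if n == 1:
        return [complex(samples[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC component."""
    spectrum = fft(samples)
    half = len(samples) // 2
    # Skip bin 0 (the DC offset) and the mirrored upper half of the spectrum.
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


# A 500 Hz test tone sampled at 8 kHz, 1024 samples long.
tone = [math.sin(2 * math.pi * 500 * t / 8000) for t in range(1024)]
```

Running `dominant_frequency` over short consecutive windows of a recording would give a rough "which tone is playing when" picture, which is the kind of feature a digit recognizer could start from.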

JavaScript CAPTCHAs (like reCAPTCHA)

I re-implement CAPTCHA-downloading logic in Python

I ...or execute the JavaScript with spidermonkey

...JDownloader

I “Again, our captcha team did a great job and implemented many new captcha methods.”

The website from Hell: US PTO Public PAIR

http://portal.uspto.gov/external/portal/pair

Start with a CAPTCHA

Solve it and move on to...

I document.write()

The page is invisible.

Automating the web browser

Selenium Remote Control

examples/seleniumrc/start.py

Selenium IDE

I Our friend, XPath

I FireBug
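
As a reminder of what an XPath locator looks like outside the browser, here is a sketch using the limited XPath subset in the standard library's ElementTree. The page snippet, the form id, and the function name are invented for illustration, and the markup has to be well-formed for this parser to accept it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment.
PAGE = """<html><body>
<form id="login">
  <input name="user" type="text" />
  <input name="go" type="submit" value="Log in" />
</form>
</body></html>"""


def find_submit_name(page):
    """Return the name of the submit button inside the login form, or None."""
    root = ET.fromstring(page)
    # Any <form id="login"> in the tree, then its submit-type <input> child.
    button = root.find(".//form[@id='login']/input[@type='submit']")
    return None if button is None else button.get("name")
```

The same kind of expression, found by poking around in FireBug, is what gets pasted into a Selenium locator.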

Why don’t we just do this all the time?

I Firefox memory footprint

I Flexibility

Other tricks

Your parser may fail

Text encoding

I Look in the HTTP header!

I Try UTF-8!

I ...chardet, if you must
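
The three steps above can be sketched as one function. This is an illustration under assumptions: `decode_body` is a made-up name, the caller passes in the raw `Content-Type` header value, chardet is used only if it happens to be installed, and the final Latin-1 fallback is my addition so the function always returns text.

```python
from email.message import Message


def decode_body(body, content_type=None):
    """Decode response bytes: HTTP header first, then UTF-8, then chardet."""
    # 1. Look in the HTTP header for a charset parameter.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            try:
                return body.decode(charset)
            except (LookupError, UnicodeDecodeError):
                pass  # unknown or wrong charset; keep going
    # 2. Try UTF-8.
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # 3. chardet, if you must (optional third-party package).
    try:
        import chardet
    except ImportError:
        chardet = None
    if chardet is not None:
        guess = chardet.detect(body)["encoding"]
        if guess:
            try:
                return body.decode(guess)
            except (LookupError, UnicodeDecodeError):
                pass
    # Last resort (an addition, not from the slides): Latin-1 never fails.
    return body.decode("latin-1")
```

Trying the header before guessing matters because a wrong guess produces mojibake silently, while a wrong header at least fails loudly on decode.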

Automatically reverse-engineer templates

I templatemaker by Adrian Holovaty

I everyblock templatemaker
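
To show the idea behind template reverse-engineering, here is a toy sketch built on the standard library's `difflib`. It is not templatemaker's algorithm or API: it just treats the text two pages have in common as the template and whatever sits between those shared chunks as the data. Character-level matching like this is brittle on real pages, and both function names are made up.

```python
from difflib import SequenceMatcher


def learn_template(page_a, page_b):
    """Return, in order, the chunks of text that both sample pages share."""
    matcher = SequenceMatcher(None, page_a, page_b, autojunk=False)
    return [page_a[m.a:m.a + m.size]
            for m in matcher.get_matching_blocks() if m.size]


def extract(chunks, page):
    """Return the variable text found between consecutive fixed chunks."""
    holes = []
    pos = 0
    for i, chunk in enumerate(chunks):
        idx = page.find(chunk, pos)
        if idx == -1:
            raise ValueError("page does not fit the template")
        if i > 0:
            holes.append(page[pos:idx])
        pos = idx + len(chunk)
    return holes
```

For example, learning from `"<b>Hello, Alice!</b>"` and `"<b>Hello, Bob!</b>"` and then extracting from `"<b>Hello, Carol!</b>"` pulls out the name without anyone writing a regular expression for it.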

table2dict

I Python bug tracker
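
The code behind this slide is not reproduced in the transcript, so what follows is only a guess at the general shape of a table-to-dict helper: take the first row of an HTML table as column names and turn every later row into a dict. It uses the standard library's `html.parser`, ignores nested tables and `colspan`, and the sample table is invented.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self.rows:
                self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table2dict(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]


SAMPLE = ("<table><tr><th>ID</th><th>Title</th></tr>"
          "<tr><td>1</td><td>Crash on import</td></tr></table>")
```

A bug tracker's issue listing is a natural fit: each row becomes something like `{"ID": ..., "Title": ...}` and the rest of the program never touches HTML.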

Conclusions

Scaling and stability

I Choosing reliable queries from web pages

I Expanding to more IP addresses when necessary using SSH (and Python 2.6 multiprocessing for a plausible model of how to rotate SOCKS proxies)

I Tor (and other proxy considerations)

I registrar.py: was seven years stable...
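
A very small sketch of the proxy-rotation idea: hand out URLs to proxies round-robin. It assumes each proxy is a local SOCKS port opened with something like `ssh -D 1080 somehost`; the addresses below and the function name are placeholders, and the actual fetching (for instance one multiprocessing worker per proxy) is left out.

```python
from itertools import cycle


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy, cycling through the proxies in order."""
    if not proxies:
        raise ValueError("need at least one proxy")
    return list(zip(urls, cycle(proxies)))


# Hypothetical local SOCKS endpoints, one per SSH tunnel.
PROXIES = ["localhost:1080", "localhost:1081"]
```

Each `(url, proxy)` pair can then be handed to a worker process, so consecutive requests leave from different IP addresses.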

Summary

I If it’s on a web page, you can scrape it out.

I “Now you have an API for everything.”

Future directions

I More automation

I Using cssselect everywhere, geez it’s cool

Bonus time

If we have time:

I Greasemonkey demo: scraping in the browser

I Audience-suggested scraping lab

I Workshopping on queries or regular expressions
