web scraping for code-ophobes

Web Scraping @AnnieCushing For Code-ophobes

Upload: annie-cushing

Post on 13-May-2015


DESCRIPTION

Learn to scrape data in Google Docs using ImportFeed, ImportHTML, and ImportXML. Annie Cushing, Senior SEO at SEER Interactive (@AnnieCushing on Twitter), isn't a developer, so she breaks the process down into easy-to-understand steps and provides a link to a Google Doc where you can follow along and learn.

TRANSCRIPT

Page 1: Web Scraping for Code-ophobes

Web Scraping

@AnnieCushing

For Code-ophobes

Page 2: Web Scraping for Code-ophobes

What I’m not

@AnnieCushing

Page 3: Web Scraping for Code-ophobes

What I am

Page 4: Web Scraping for Code-ophobes

THE WIND BENEATH MY WEB-SCRAPING WINGS

@djchrisle

@ethanlyon

@AnnieCushing

Page 5: Web Scraping for Code-ophobes

3 WAYS TO SCRAPE IN GOOGLE DOCS

• ImportFeed
• ImportHTML
• ImportXML

@AnnieCushing

Page 6: Web Scraping for Code-ophobes

=ImportFeed

Page 7: Web Scraping for Code-ophobes

ImportFeed

=ImportFeed(URL, query, headers, numItems)

http://bit.ly/importfeed

@AnnieCushing

=ImportFeed("http://feeds.searchengineland.com/searchengineland")

OR

=ImportFeed(C4) (my preference: put the feed URL in a cell and reference the cell)
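For readers curious about what a feed import does under the hood, here is a minimal Python sketch. It is not how Google Docs implements ImportFeed; the `import_feed` helper and the inline `SAMPLE_RSS` feed are hypothetical, chosen so the example runs without a network connection. It reads each item's title and link, and an optional `num_items` limit plays the role of the numItems argument.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline RSS sample so the sketch runs offline.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
    <item><title>Third post</title><link>http://example.com/3</link></item>
  </channel>
</rss>"""


def import_feed(rss_text, num_items=None):
    """Return (title, link) rows for the feed's items, like rows in a sheet."""
    root = ET.fromstring(rss_text)
    rows = [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]
    # No limit given: return every item; otherwise keep the first num_items.
    return rows if num_items is None else rows[:num_items]


rows = import_feed(SAMPLE_RSS, num_items=2)
for title, link in rows:
    print(title, link)
```

Each tuple corresponds to one row that the spreadsheet function would spill into the cells below the formula.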

Page 8: Web Scraping for Code-ophobes

@AnnieCushing

Page 9: Web Scraping for Code-ophobes

@AnnieCushing

http://slidesha.re/stalker-wil

STALKING FOR LINKS

BY @WILREYNOLDS

Page 10: Web Scraping for Code-ophobes

=ImportHTML

Page 11: Web Scraping for Code-ophobes

ImportHTML

• Table
• List

TWO OPTIONS

@AnnieCushing

Page 12: Web Scraping for Code-ophobes

=ImportHtml(URL, query, index)

URL: "www.domain.com/whatever" OR cell reference
query: "table" or "list" OR cell reference
index: If multiple lists or tables, which one (3 = 3rd table)

@AnnieCushing
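To make the index argument concrete, here is a rough Python sketch of the "table" case. It is only an illustration, not Google's implementation: the `TableGrabber` class, the `import_html_table` helper, and the `SAMPLE_HTML` page are all hypothetical, and the sketch ignores nested tables. It collects every table on a page in order, then hands back the one you asked for, counting from 1 as the slide describes.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect the rows of every <table> on a page, in document order."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row a list of cells
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if self.tables and self.tables[-1]:
                self.tables[-1][-1].append("".join(self._cell).strip())
            self._cell = None


def import_html_table(html, index):
    """Return the index-th table (1 = first table) as rows of cell text."""
    parser = TableGrabber()
    parser.feed(html)
    return parser.tables[index - 1]


# Hypothetical page with two tables.
SAMPLE_HTML = """
<table><tr><th>Country</th><th>Capital</th></tr>
<tr><td>France</td><td>Paris</td></tr></table>
<table><tr><th>Tool</th></tr><tr><td>ImportHtml</td></tr></table>
"""

first = import_html_table(SAMPLE_HTML, 1)
second = import_html_table(SAMPLE_HTML, 2)
print(first)
print(second)
```

Changing the index from 1 to 2 is the same move as changing the third argument of the formula when a page has more than one table.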

Page 13: Web Scraping for Code-ophobes

Table Example of ImportHTML

@AnnieCushing

Page 14: Web Scraping for Code-ophobes

List Example of ImportHTML

@AnnieCushing

Page 15: Web Scraping for Code-ophobes

=ImportXML

Page 16: Web Scraping for Code-ophobes

ImportXML

http://bit.ly/xpath-tutorial

=ImportXML(URL, query)

@AnnieCushing

Page 17: Web Scraping for Code-ophobes

Simple Explanation of XPath

XPath uses path expressions to select nodes or node-sets in an XML document.

@AnnieCushing

Page 18: Web Scraping for Code-ophobes

@AnnieCushing

Page 19: Web Scraping for Code-ophobes

7 Types of Nodes

@AnnieCushing

Page 20: Web Scraping for Code-ophobes

Simple Explanation of XPath

<div> <p> <blockquote> <price> <ul>

ELEMENTS

@AnnieCushing

Page 21: Web Scraping for Code-ophobes

• As you drill down, you separate nodes with /

• Ex: /html/div/ul/li/a

PARENT-CHILD NODES

@AnnieCushing
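The drill-down idea can be tried outside a spreadsheet too. The sketch below uses Python's built-in `xml.etree.ElementTree`, which understands a small subset of XPath; the `PAGE` snippet is a made-up example. One assumption to note: the parsed root is already the `<html>` element, so the path starts with `./` and continues from there instead of beginning with `/html`.

```python
import xml.etree.ElementTree as ET

# Made-up page: html > body > div > ul > li > a
PAGE = """<html><body><div><ul>
  <li><a href="/one">One</a></li>
  <li><a href="/two">Two</a></li>
</ul></div></body></html>"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Each / steps from a parent node down to its child node.
links = root.findall("./body/div/ul/li/a")
link_texts = [a.text for a in links]
print(link_texts)  # ['One', 'Two']
```

If any step in the chain is missing or misspelled, nothing matches, which is why long absolute paths tend to be fragile.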

Page 22: Web Scraping for Code-ophobes

class
id
size

Look for the = sign

ATTRIBUTES

@AnnieCushing

Page 23: Web Scraping for Code-ophobes

Simple Explanation of XPath

/: Starts at the root
//: Starts wherever
@: Selects attributes
[]: Answers the question "Which one?"
[*]: All

KEY CHARACTERS

@AnnieCushing
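Here is a small sketch of the key characters in action, using Python's built-in `xml.etree.ElementTree` (which supports only a subset of XPath) on a made-up `PAGE`. Two assumptions to flag: this module writes "start wherever" as `.//` because searches are relative to the element you call them on, and the last line shows `*` as a wildcard step that matches any element.

```python
import xml.etree.ElementTree as ET

# Made-up page used only for this illustration.
PAGE = """<html><body>
  <div class="main">
    <p>Intro</p>
    <ul>
      <li><a href="/a" class="nav">A</a></li>
      <li><a href="/b">B</a></li>
      <li><a href="/c" class="nav">C</a></li>
    </ul>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# // : find <a> elements wherever they sit in the document
anywhere = [a.text for a in root.findall(".//a")]

# @ and [] : keep only the <a> elements whose class attribute equals 'nav'
nav_links = [a.text for a in root.findall(".//a[@class='nav']")]

# [] with a number : the second <li> under its parent
second_item = root.find(".//li[2]/a").text

# * : every child element of the div, whatever its name
div_children = [child.tag for child in root.findall("./body/div/*")]

print(anywhere, nav_links, second_item, div_children)
```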

Page 24: Web Scraping for Code-ophobes

Let’s Start Simple

@AnnieCushing

Page 25: Web Scraping for Code-ophobes

Magic!

@AnnieCushing

Page 26: Web Scraping for Code-ophobes

Grab the URLs

@AnnieCushing

Page 27: Web Scraping for Code-ophobes

Because it’s an @tribute!
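A quick sketch of grabbing URLs, on a made-up `PAGE`. In ImportXML an expression ending in `@href` returns the attribute values themselves. Python's built-in `xml.etree.ElementTree` selects elements rather than attribute values, so this sketch finds the `<a>` elements that have an href and then reads the attribute with `.get("href")`. That swap is an adaptation for this example, not a description of what the spreadsheet does internally.

```python
import xml.etree.ElementTree as ET

# Made-up list of links; the last <a> has no href attribute.
PAGE = """<ul>
  <li><a href="http://example.com/one">One</a></li>
  <li><a href="http://example.com/two">Two</a></li>
  <li><a>No link here</a></li>
</ul>"""

root = ET.fromstring(PAGE)

# [@href] keeps only the <a> elements that carry an href attribute.
urls = [a.get("href") for a in root.findall(".//a[@href]")]
print(urls)
```

The anchor text and the URL live in different places: the text is the element's content, while the URL is an attribute on the element.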

Page 28: Web Scraping for Code-ophobes

Let’s dial it up

@AnnieCushing

http://bit.ly/distilled-xml

Page 29: Web Scraping for Code-ophobes

@AnnieCushing

Page 30: Web Scraping for Code-ophobes

@AnnieCushing

What if your child nodes look like this?

Page 31: Web Scraping for Code-ophobes

Let’s dial it up

@AnnieCushing

Page 32: Web Scraping for Code-ophobes

Could do it this way

@AnnieCushing

Page 33: Web Scraping for Code-ophobes

At your own risk

@AnnieCushing

Page 34: Web Scraping for Code-ophobes

Better plan

@AnnieCushing

Page 35: Web Scraping for Code-ophobes

The world according to Annie

// = blah blah yada yada

@AnnieCushing

Page 36: Web Scraping for Code-ophobes

Can even be in the middle of the XPath

//div[@class='main']//blockquote[2]

@AnnieCushing
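The mid-path `//` can be checked with a short sketch. It uses Python's built-in `xml.etree.ElementTree` on a made-up `PAGE`, with a leading `.` added because this module searches relative to the element you start from. The second blockquote sits inside an extra `<section>` wrapper, and the `//` in the middle skips over that wrapper without naming it.

```python
import xml.etree.ElementTree as ET

# Made-up page: the main div wraps its blockquotes in a <section>.
PAGE = """<html><body>
  <div class="sidebar">
    <blockquote>Side one</blockquote>
    <blockquote>Side two</blockquote>
  </div>
  <div class="main">
    <section>
      <blockquote>Main one</blockquote>
      <blockquote>Main two</blockquote>
    </section>
  </div>
</body></html>"""

root = ET.fromstring(PAGE)

# Find the div with class 'main' anywhere, then the 2nd blockquote anywhere below it.
hits = root.findall(".//div[@class='main']//blockquote[2]")
quotes = [q.text for q in hits]
print(quotes)  # ['Main two']
```

The sidebar's blockquotes are ignored because the first predicate narrows the search to the main div before the second `//` runs.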

Page 37: Web Scraping for Code-ophobes

Other ways to tell “which one” in XPath

STARTS-WITH

@AnnieCushing

Page 38: Web Scraping for Code-ophobes

Other ways to tell “which one” in XPath

@AnnieCushing

CONTAINS
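XPath has `starts-with()` and `contains()` functions for matching part of an attribute value. Python's built-in `xml.etree.ElementTree` does not support those functions, so the sketch below imitates them with ordinary string checks on a made-up `PAGE`. The XPath shown in the comments is what you would pass to ImportXML; the Python lines are a stand-in with the same intent.

```python
import xml.etree.ElementTree as ET

# Made-up list with class names that share prefixes and fragments.
PAGE = """<ul>
  <li class="post-featured">Alpha</li>
  <li class="post-regular">Beta</li>
  <li class="ad-banner">Gamma</li>
</ul>"""

root = ET.fromstring(PAGE)
items = root.findall(".//li[@class]")

# Same intent as //li[starts-with(@class, 'post')]
starts_with_post = [li.text for li in items
                    if li.get("class").startswith("post")]

# Same intent as //li[contains(@class, 'regular')]
contains_regular = [li.text for li in items
                    if "regular" in li.get("class")]

print(starts_with_post, contains_regular)
```

Partial matching helps when class names or IDs carry changing suffixes, so an exact `=` match would miss them.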

Page 39: Web Scraping for Code-ophobes

Other ways to tell “which one” in XPath

@AnnieCushing

Page 40: Web Scraping for Code-ophobes

Other ways to tell “which one” in XPath

INDEX VALUE

@AnnieCushing

Page 41: Web Scraping for Code-ophobes

Other ways to tell “which one” in XPath

LAST()

@AnnieCushing
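Index values and `last()` can be tried in a few lines. This sketch uses Python's built-in `xml.etree.ElementTree` on a made-up `PAGE`. Positions count from 1, and `last()` picks the final matching child however long the list is.

```python
import xml.etree.ElementTree as ET

# Made-up ordered list.
PAGE = """<ol>
  <li>Gold</li>
  <li>Silver</li>
  <li>Bronze</li>
</ol>"""

root = ET.fromstring(PAGE)

second = root.find("./li[2]").text        # index value: the 2nd <li>
final = root.find("./li[last()]").text    # last(): the final <li>
print(second, final)  # Silver Bronze
```

Using `last()` instead of a hard-coded number keeps the expression working when items are added to the list.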

Page 42: Web Scraping for Code-ophobes

Become a scraping FOOL

@NicoMiceli

@AnnieCushing

• Pull queries from Topsy
• Pull product feeds
• Pull specific elements from a sitemap
• Scrape Twitter followers
• Pull GA metrics
• Scrape HTML tables (e.g., list of countries from Wikipedia)
• Scrape lists (e.g., scraped lists of consumer review sites to create a custom search engine, top sports blogs, etc.)
• Scrape rankings
• Scrape GA codes / Adsense IDs / IPs / IP Country Codes
• Find de-indexed sites
• Scrape directories
• Scrape Yahoo / Google for relevant pages from directory listings
• Scrape title / h1 / meta descriptions
• Scrape page URLs to find if someone is linking to you
• Scrape Google to find snippets of text on a list of domains (for link networks)
• Scrape Quora

Page 43: Web Scraping for Code-ophobes


SEE IMPORT FUNCTIONS IN THEIR NATURAL HABITAT!

http://bit.ly/annies-gdoc

@AnnieCushing

Page 44: Web Scraping for Code-ophobes

AWWW YEAHHH!

Page 45: Web Scraping for Code-ophobes

TO PLAY …

1. Log in
2. File > Make a copy…
3. Poke around and test

@AnnieCushing

Page 46: Web Scraping for Code-ophobes

RESOURCES

XPath Tutorial: http://bit.ly/xpath-tutorial
Annie’s Gdoc: http://bit.ly/annies-gdoc
Distilled Guide: http://bit.ly/distilled-guide
SEER Cookbook: http://bit.ly/seer-cookbook

@AnnieCushing