The SEO's Guide to Scraping Everything

The SEO’s Guide to: SCRAPING EVERYTHING
@eppievojt, digital marketing consultant, JPL

Upload: eppievojt

Post on 30-Nov-2014


TRANSCRIPT

Page 1: The SEO's Guide to Scraping Everything

The SEO’s Guide to: SCRAPING EVERYTHING

@eppievojt, digital marketing consultant, JPL

Page 2: The SEO's Guide to Scraping Everything

NEXT LEVEL XPATH-ING

Use Case 1:

Does site x link to any page on eppie.net?

Page 3: The SEO's Guide to Scraping Everything

NEXT LEVEL XPATH-ING

Scrape partial matches using XPath’s “contains” function to find inexact data.

What we know:

1)  Link will contain http://www.eppie.net in the href attribute

2)  Some people like to hurt the internet by capitalizing URLs, so we’ll need to account for that

3)  People who link to you don’t care about your desire for canonicalization

Page 4: The SEO's Guide to Scraping Everything

DO YOU LINK TO ME?

//a[contains(@href,'http://www.eppie.net')]

PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY

Page 5: The SEO's Guide to Scraping Everything

DO YOU LINK TO ME?

//a[contains(translate(@href,'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'http://www.eppie.net')]

Add translate() to normalize case
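
The same case-insensitive check can be sketched outside a spreadsheet. This is a minimal illustration in plain Python using the standard library's html.parser rather than an XPath engine; `LinkCollector`, `links_to`, and the sample markup are hypothetical names and data, not part of the original deck. Lowercasing both sides plays the role of translate().

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but not values.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_to(html, target):
    """Return every href that contains `target`, ignoring case."""
    parser = LinkCollector()
    parser.feed(html)
    needle = target.lower()
    return [h for h in parser.hrefs if needle in h.lower()]


# Hypothetical sample page: one shouty link, one unrelated, one normal.
sample = (
    '<p><a href="HTTP://WWW.EPPIE.NET/Blog">one</a> '
    '<a href="http://example.com/">two</a> '
    '<a href="http://www.eppie.net/about">three</a></p>'
)
print(links_to(sample, "http://www.eppie.net"))
```

On the sample, both eppie.net links are returned and the example.com link is skipped. Searching for a shorter needle such as `eppie.net` would also catch links that drop the `www.`, which is one way to handle the canonicalization point above.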

Page 6: The SEO's Guide to Scraping Everything

DO YOU LINK TO ME?

How you can use this:

Get notified when a link is removed + Make contact to potentially save dropping link (friendly reminder, buy expiring domain, recreate dead resource)

Integrate into link outreach process + Get notification when link goes live

Page 7: The SEO's Guide to Scraping Everything

NEXT LEVEL XPATH-ING

Use Case 2:

Find every external link from cnn.com

Page 8: The SEO's Guide to Scraping Everything

NEXT LEVEL XPATH-ING

Combine attribute selectors to more accurately target useful information

What we know:

1)  External links all contain http://

2)  Internal links can also use http://

3)  So we need to exclude http:// links to the current domain

Page 9: The SEO's Guide to Scraping Everything

SCRAPE ALL EXTERNAL LINKS

//a[contains(@href,'http://') and not(contains(@href,'cnn.com'))]
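
A variation on the same idea, sketched in plain Python with the standard library instead of XPath. Rather than searching the whole href for the domain, this compares the hostname, so a link like `http://example.com/?ref=cnn.com` still counts as external. `LinkCollector`, `external_links`, and the sample markup are hypothetical, introduced only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, own_domain):
    """Return hrefs whose hostname is not `own_domain` or a subdomain of it."""
    parser = LinkCollector()
    parser.feed(html)
    own = own_domain.lower()
    found = []
    for href in parser.hrefs:
        host = (urlparse(href).hostname or "").lower()
        if not host:
            continue  # relative links and mailto: have no hostname
        if host == own or host.endswith("." + own):
            continue  # same site, including subdomains
        found.append(href)
    return found


# Hypothetical sample markup standing in for a cnn.com page.
sample = (
    '<a href="/world">World</a>'
    '<a href="http://www.cnn.com/us">US</a>'
    '<a href="http://edition.cnn.com/">Intl</a>'
    '<a href="http://www.bbc.co.uk/news">BBC</a>'
    '<a href="http://example.com/?ref=cnn.com">Ex</a>'
)
print(external_links(sample, "cnn.com"))
```

Counting the returned list gives the external link count for a page.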

Page 10: The SEO's Guide to Scraping Everything

SCRAPE ALL EXTERNAL LINKS

How you can use this:

Identify if a page is too spammed out to bother with by pulling external link counts

Find expired or expiring domains being linked to from authority sites. Purchase and rebuild or redirect those sites.

Broken link building automation

Page 11: The SEO's Guide to Scraping Everything

LINK TYPE IDENTIFICATION

Use Case 3:

How are they ranking? What kind of links do they have?

Page 12: The SEO's Guide to Scraping Everything

LINK TYPE IDENTIFICATION

XPath’s ancestor axis lets us leverage semantic markup to ID link types.

What we know:

A link inside a containing element with an id or class name including the word “comment,” “footer,” or “blogroll” is highly suggestive of type

Page 13: The SEO's Guide to Scraping Everything

LINK TYPE IDENTIFICATION

"//a[@href='http://randfishkin.com/blog']/ancestor::*[contains(@id,'comment') or contains(@class,'comment')]"

Was Rand comment-spamming his way to the top? This + OSE tells the story...
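
The ancestor check can also be sketched in plain Python. This is an illustration using the standard library's html.parser, not an XPath engine: it keeps a stack of open elements and, when it reaches the target link, looks at the id and class of everything still open. `LinkTypeParser`, `link_types`, and the sample markup are hypothetical. Unlike XPath's contains(), this sketch lowercases id and class before matching.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
LINK_TYPES = ("comment", "footer", "blogroll")


class LinkTypeParser(HTMLParser):
    def __init__(self, target_href):
        super().__init__()
        self.target_href = target_href
        self.stack = []    # (tag, "id class" text) for each open element
        self.matches = []  # one set of link types per matching <a>

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href") == self.target_href:
            # Only ancestors are on the stack at this point.
            labels = " ".join(text for _, text in self.stack)
            self.matches.append({t for t in LINK_TYPES if t in labels})
        if tag not in VOID_TAGS:
            text = f"{attr_map.get('id') or ''} {attr_map.get('class') or ''}"
            self.stack.append((tag, text.lower()))

    def handle_endtag(self, tag):
        # Pop back to the most recent open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def link_types(html, href):
    """Return, for each link to `href`, the set of types its ancestors suggest."""
    parser = LinkTypeParser(href)
    parser.feed(html)
    return parser.matches


# Hypothetical sample: one editorial link, one link inside a comment list.
sample = """
<div id="content"><p><a href="http://randfishkin.com/blog">Rand</a></p></div>
<div class="Comment-List"><ul><li><img src="a.png">
  <a href="http://randfishkin.com/blog">rand</a></li></ul></div>
<div id="footer"><a href="http://example.com/">other</a></div>
"""
print(link_types(sample, "http://randfishkin.com/blog"))
```

An empty set means no containing element hinted at a placed link, which is a reasonable first filter before manual review.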

Page 14: The SEO's Guide to Scraping Everything

SCRAPE ALL EXTERNAL LINKS

Why you might use this:

Analyze competitors’ strategies for acquiring links

Find what types of links are being used to get good anchor text

Improve workflow: Ignore placed links (comments, directory submissions, article submissions, blog networks, etc.) and work on a smaller subset of EARNED links for manual analysis

Page 15: The SEO's Guide to Scraping Everything

REGEX TO THE RESCUE

Use Case 4:

I’ve scraped some data, now I need to extract some small portion of it that XPath can’t do on its own (easily)

Page 16: The SEO's Guide to Scraping Everything

REGEX TO THE RESCUE

Use regular expressions to pattern match structured text

Example:

Extract all @mentions of a specific user from a tweet or page

Page 17: The SEO's Guide to Scraping Everything

REGEX TO THE RESCUE

Page 18: The SEO's Guide to Scraping Everything

REGEX TO THE RESCUE

Page 19: The SEO's Guide to Scraping Everything

REGEX TO THE RESCUE

Page 20: The SEO's Guide to Scraping Everything

REGEX TO THE RESCUE

Page 21: The SEO's Guide to Scraping Everything

EXTRACT @MENTIONS

/(?:^|\s)@([a-z0-9_]+)/gi
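
A minimal sketch of the same pattern in Python's re module. The character class is spelled out as `A-Za-z0-9_`, since a single range running from uppercase A to lowercase z would also admit the punctuation that sits between Z and a in ASCII. The `mentions` helper and the sample tweet are hypothetical.

```python
import re

# An @ at the start of the text or after whitespace, then the username.
MENTION = re.compile(r"(?:^|\s)@([A-Za-z0-9_]+)")


def mentions(text, user=None):
    """Return all @mentions in `text`, or only those of `user` if given."""
    found = MENTION.findall(text)
    if user is not None:
        found = [m for m in found if m.lower() == user.lower()]
    return found


# Hypothetical sample text; the email address must not count as a mention.
tweet = "@eppievojt great deck! cc @JPLcreative, mail me at me@example.com"
print(mentions(tweet))                 # ['eppievojt', 'JPLcreative']
print(mentions(tweet, "jplcreative"))  # ['JPLcreative']
```

Requiring whitespace or start-of-text before the @ is what keeps email addresses out of the results.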

Page 22: The SEO's Guide to Scraping Everything

REGEX TO THE RESCUE

Why you might use this:

Pull contact information from a web site (Twitter username, email address) to improve outreach efforts

Extract code fragments (like Analytics IDs and AdSense IDs) for improved competitive research
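
A rough sketch of the code-fragment idea in Python. The ID formats here are assumptions, not something the deck specifies: classic Google Analytics property IDs are taken to look like `UA-1234567-1`, and AdSense publisher IDs like `pub-` followed by 16 digits, optionally prefixed with `ca-`. The `extract_ids` helper and the sample source are hypothetical.

```python
import re

# Assumed formats (see the note above); adjust if the IDs you see differ.
ANALYTICS_ID = re.compile(r"\bUA-\d{4,10}-\d{1,4}\b")
ADSENSE_ID = re.compile(r"\b(?:ca-)?pub-\d{16}\b")


def extract_ids(page_source):
    """Return the unique tracking and ad IDs found in a page's source."""
    return {
        "analytics": sorted(set(ANALYTICS_ID.findall(page_source))),
        "adsense": sorted(set(ADSENSE_ID.findall(page_source))),
    }


# Hypothetical page source with made-up IDs.
source = """
<script>_gaq.push(['_setAccount', 'UA-1234567-1']);</script>
<script>google_ad_client = "ca-pub-1234567890123456";</script>
"""
print(extract_ids(source))
```

Two sites sharing the same ID is a hint, not proof, that they share an owner.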

Page 23: The SEO's Guide to Scraping Everything

BEYOND THE SPREADSHEET

Use Case 5:

I want to chain processes together, process lots of data, or allow multiple users to leverage what I build.

Page 24: The SEO's Guide to Scraping Everything

BEYOND THE SPREADSHEET

Scraping outside the spreadsheet allows for more complex systems to be built.

PHP Scraping Overview:

1)  cURL target page

2)  Convert to DOM Object

3)  Run XPath Queries

4)  Store Data or Hit API
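
The four steps above can be sketched as follows. This is a Python illustration of the pipeline's shape, not the PHP scraper class the deck refers to. It uses only the standard library: urllib stands in for cURL, xml.etree.ElementTree for the DOM object (it assumes well-formed XHTML and supports only a subset of XPath), and a CSV writer for the storage step. All function names and the sample page are hypothetical.

```python
import csv
import io
import urllib.request
import xml.etree.ElementTree as ET


def fetch(url):
    """Step 1: download the page (the role cURL plays in the PHP version)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(markup):
    """Steps 2 and 3: parse into a tree, then run an XPath-style query."""
    root = ET.fromstring(markup)  # assumes well-formed XHTML
    return [(a.get("href"), "".join(a.itertext()).strip())
            for a in root.findall(".//a[@href]")]


def store(rows, out):
    """Step 4: write rows out; CSV stands in for a database or an API call."""
    writer = csv.writer(out)
    writer.writerow(["href", "anchor_text"])
    writer.writerows(rows)


# Hypothetical page used instead of a live fetch so the sketch runs offline.
page = ('<html><body><p><a href="http://www.eppie.net/">Eppie</a> and '
        '<a href="http://jplcreative.com/">JPL <b>Creative</b></a></p>'
        '</body></html>')
buffer = io.StringIO()
store(extract_links(page), buffer)
print(buffer.getvalue())
```

For a live page, `extract_links(fetch(url))` chains the first three steps. Real-world HTML is often not well-formed, so a more forgiving parser would be needed in practice.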

Page 25: The SEO's Guide to Scraping Everything

BEYOND THE SPREADSHEET

Simple PHP Scraper Class:

http://www.scrapeeverything.com

Page 26: The SEO's Guide to Scraping Everything

SHOW SOME LOVE

I’m @eppievojt and I work for @jplcreative

eppie.net linkdetective.com jplcreative.com