the seo's guide to scraping everything

THE SEO’S GUIDE TO SCRAPING EVERYTHING

@eppievojt, digital marketing consultant, JPL

NEXT LEVEL XPATH-ING

Use Case 1:

Does site x link to any page on eppie.net?


Scrape partial matches using XPath’s “contains” function to find inexact data.

What we know:

1)  Link will contain http://www.eppie.net in the href attribute

2)  Some people like to hurt the internet by capitalizing URLs, so we’ll need to account for that

3)  People who link to you don’t care about your desire for canonicalization

DO YOU LINK TO ME?

//a[contains(@href,'http://www.eppie.net')]

PROBLEM: FAILS TO ACCOUNT FOR CASE SENSITIVITY

//a[contains(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'http://www.eppie.net')]

Add translate() to normalize case
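
Outside a spreadsheet, the same check can be sketched in a few lines of Python. This is a minimal illustration, not the talk's own code: it uses the standard library's html.parser instead of an XPath engine, lower() stands in for translate(), and the names LinkCollector and links_to are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, but not values.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_to(html, needle="http://www.eppie.net"):
    """Return hrefs that contain `needle`, ignoring case."""
    collector = LinkCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if needle.lower() in h.lower()]


sample = (
    '<p><a href="HTTP://WWW.EPPIE.NET/About">me</a> '
    '<a href="http://example.com/">someone else</a></p>'
)
print(links_to(sample))  # ['HTTP://WWW.EPPIE.NET/About']
```

Like the XPath, this only finds links that include the www prefix; passing a shorter needle such as "eppie.net" would also catch non-canonical links.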

How you can use this:

Get notified when a link is removed + Make contact to potentially save the dropping link (friendly reminder, buy expiring domain, recreate dead resource)

Integrate into link outreach process + Get notification when link goes live


Use Case 2:

Find every external link from cnn.com


Combine attribute selectors to more accurately target useful information

What we know:

1)  External links all contain http://

2)  Internal links can also use http://

3)  So we need to exclude http:// links to the current domain

SCRAPE ALL EXTERNAL LINKS

//a[contains(@href,'http://') and not(contains(@href,'cnn.com'))]
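
As a rough sketch of the same filter in Python: the two substring tests mirror the XPath exactly, and the per-domain tally on top is an addition of this example (handy for link counts). It relies only on the standard library; HrefParser, external_links and external_domain_counts are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, own_domain="cnn.com"):
    """Same two tests as the XPath: contains http://, lacks the site's own domain."""
    parser = HrefParser()
    parser.feed(html)
    return [h for h in parser.hrefs if "http://" in h and own_domain not in h]


def external_domain_counts(html, own_domain="cnn.com"):
    """Tally external links by the hostname they point to."""
    return Counter(urlparse(h).hostname for h in external_links(html, own_domain))


sample = (
    '<a href="http://www.cnn.com/world">internal, absolute</a>'
    '<a href="/us">internal, relative</a>'
    '<a href="http://example.org/a">external 1</a>'
    '<a href="http://example.org/b">external 2</a>'
    '<a href="http://example.net/">external 3</a>'
)
print(external_links(sample))
print(external_domain_counts(sample))
```

Because the test is a plain substring match, https:// links are skipped and any external URL that happens to contain "cnn.com" is excluded, just as with the XPath.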


How you can use this:

Identify if a page is too spammed out to bother with by pulling external link counts

Find expired or expiring domains being linked to from authority sites. Purchase and rebuild or redirect those sites.

Broken link building automation

LINK TYPE IDENTIFICATION

Use Case 3:

How are they ranking? What kind of links do they have?


XPath’s ancestor axis lets us leverage semantic markup to ID link types.

What we know:

A link inside a containing element whose id or class name includes the word “comment,” “footer,” or “blogroll” strongly suggests the link’s type


//a[@href='http://randfishkin.com/blog']/ancestor::*[contains(@id,'comment') or contains(@class,'comment')]

Was Rand comment-spamming his way to the top? This + OSE tells the story...
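
The ancestor check can also be approximated without XPath. The sketch below is an assumption-heavy illustration using Python's standard html.parser: it keeps a stack of open elements and, for each link to the target URL, reports which of the three keywords show up in any ancestor's id or class. LinkTypeParser and classify are hypothetical names, and unlike the XPath the keyword match here ignores case.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
LINK_TYPES = ("comment", "footer", "blogroll")  # keywords named on the slide


class LinkTypeParser(HTMLParser):
    """Label links to `target` by keywords in any ancestor's id or class."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.stack = []  # (tag, "id class") for each currently open element
        self.found = []  # one set of matching link types per matching link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") == self.target:
            labels = " ".join(markers for _, markers in self.stack)
            self.found.append({t for t in LINK_TYPES if t in labels})
        if tag not in VOID_TAGS:
            markers = f"{attrs.get('id') or ''} {attrs.get('class') or ''}".lower()
            self.stack.append((tag, markers))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def classify(html, target):
    parser = LinkTypeParser(target)
    parser.feed(html)
    return parser.found


sample = (
    '<div id="content"><a href="http://randfishkin.com/blog">editorial</a></div>'
    '<ul class="comment-list"><li><img src="a.png">'
    '<a href="http://randfishkin.com/blog">Rand</a></li></ul>'
    '<div id="footer"><a href="http://randfishkin.com/blog">credit</a></div>'
)
print(classify(sample, "http://randfishkin.com/blog"))
```

An empty set means no keyword was found in any ancestor, which is a hint (not proof) that the link is editorial. Markup with unclosed tags such as a bare <p> can leave stale entries on the stack, so a real DOM parser is the safer choice for messy pages.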


Why you might use this:

Analyze competitors’ strategies for acquiring links

Find what types of links are being used to get good anchor text

Improve workflow: Ignore placed links (comments, directory submissions, article submissions, blog networks, etc.) and work on a smaller subset of EARNED links for manual analysis

REGEX TO THE RESCUE

Use Case 4:

I’ve scraped some data; now I need to extract a small portion of it that XPath can’t (easily) isolate on its own


Use regular expressions to pattern match structured text

Example:

Extract all @mentions of a specific user from a tweet or page

EXTRACT @MENTIONS

/(?:^|\s)@([a-z0-9_]+)/gi
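
A minimal Python version of this pattern, offered as a sketch: re.findall plays the role of the g flag, and the character class is spelled out as A-Za-z0-9_ because a bare A-z range also spans the punctuation that sits between Z and a in ASCII. The mentions helper and its user argument are hypothetical.

```python
import re

# An @ preceded by start-of-string or whitespace, then a username.
MENTION = re.compile(r"(?:^|\s)@([A-Za-z0-9_]+)")


def mentions(text, user=None):
    """Return every @mention in `text`, or only those of `user` (case-insensitive)."""
    names = MENTION.findall(text)
    if user is None:
        return names
    return [n for n in names if n.lower() == user.lower()]


tweet = "@eppievojt great deck! cc @JPLcreative, email me at me@example.com"
print(mentions(tweet))                 # ['eppievojt', 'JPLcreative']
print(mentions(tweet, "jplcreative"))  # ['JPLcreative']
```

Requiring whitespace or start-of-string before the @ is what keeps the email address from being counted as a mention.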


Why you might use this:

Pull contact information from a web site (Twitter username, email address) to improve outreach efforts

Extract code fragments (like Analytics IDs and AdSense IDs) for improved competitive research

BEYOND THE SPREADSHEET

Use Case 5:

I want to chain processes together, process lots of data, or allow multiple users to leverage what I build.


Scraping outside the spreadsheet allows for more complex systems to be built.

PHP Scraping Overview:

1)  cURL target page

2)  Convert to DOM object

3)  Run XPath queries

4)  Store data or hit API
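
The four steps above could be wired together roughly as follows. This is a sketch in Python rather than PHP, and it is not the scraper class linked below: urllib stands in for cURL, the standard html.parser stands in for the DOM plus XPath steps (it only pulls link hrefs), and CSV output stands in for "store data". Every function name here is hypothetical.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: download the target page."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HrefParser(HTMLParser):
    """Steps 2-3: parse the markup and pull out every link href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html):
    parser = HrefParser()
    parser.feed(html)
    return parser.hrefs


def store(rows, out):
    """Step 4: write (page, href) rows as CSV to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["page", "href"])
    writer.writerows(rows)


def scrape(urls, out, fetch_page=fetch):
    """Run all four steps for each URL and return the number of links stored."""
    rows = []
    for url in urls:
        rows.extend((url, href) for href in extract_links(fetch_page(url)))
    store(rows, out)
    return len(rows)


# Offline demo: a dict lookup replaces the network call.
pages = {"http://example.com/": '<a href="http://eppie.net/">x</a><a href="/about">y</a>'}
buffer = io.StringIO()
count = scrape(pages, buffer, fetch_page=pages.get)
print(count)
print(buffer.getvalue())
```

Passing the fetcher in as an argument keeps the pipeline testable without touching the network; for a live run, leave fetch_page at its default and hand scrape an open file instead of the StringIO buffer.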


Simple PHP Scraper Class:

http://www.scrapeeverything.com

SHOW SOME LOVE

I’m @eppievojt and I work for @jplcreative

eppie.net linkdetective.com jplcreative.com
