Keith Enlow Heritrix Mobile

Post on 14-Jan-2016


TRANSCRIPT

Page 1: Keith Enlow Mobile Heritrix Mobile. Introduction Heritrix 3.1 Mobile Finder Web Service 2 Options Crawl desktop web pages (default) Crawl mobile web pages

Keith Enlow

Heritrix Mobile

Page 2

Introduction
Heritrix 3.1
Mobile Finder Web Service
2 Options
  Crawl desktop web pages (default)
  Crawl mobile web pages using Mobile Finder and exclude mobile web pages that use media queries.

Page 3

Experiment
[Diagram: Decision Making; Heritrix; Web Service (Mobile Finder); Heritrix]

Modified Heritrix 3.1 to include two options for crawling
  Option 0: Crawl with desktop user agent
  Option 1: Crawl with mobile user agent using Mobile Finder
Added a built-in mobile user agent adapted from Googlebot
Crawled a small set of URLs
Used Mobile Finder to find whether the given URL has a mobile version
Wrote a small script to discover differences between the mobile and desktop versions
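
The two crawl options above can be pictured as a small decision step. The sketch below is only an illustration in Python, not the actual Heritrix modification (Heritrix itself is written in Java). The function name `plan_fetch`, the `find_mobile` callback standing in for the Mobile Finder Web Service, the placeholder user-agent strings, and the fallback to the original URL when no mobile version is reported are all assumptions made for this example.

```python
from typing import Callable, Optional, Tuple

# Placeholder user-agent strings (hypothetical values for illustration only).
DESKTOP_UA = "desktop-user-agent"
MOBILE_UA = "mobile-user-agent"


def plan_fetch(
    url: str,
    crawl_option: int,
    find_mobile: Callable[[str], Optional[str]],
) -> Tuple[str, str]:
    """Return the (url, user_agent) pair to fetch for the given crawl option."""
    if crawl_option == 0:
        # Option 0: crawl the URL as given with the desktop user agent.
        return url, DESKTOP_UA
    if crawl_option == 1:
        # Option 1: ask the stand-in Mobile Finder for a mobile version.
        mobile_url = find_mobile(url)
        # Assumption: fall back to the original URL if none is reported.
        return (mobile_url or url), MOBILE_UA
    raise ValueError(f"unknown crawl option: {crawl_option}")


# Stand-in lookup built from two desktop/mobile URL pairs in this deck.
KNOWN_MOBILE = {
    "www.nasa.gov": "mobile.nasa.gov",
    "www.stanford.edu": "m.stanford.edu",
}

print(plan_fetch("www.nasa.gov", 1, KNOWN_MOBILE.get))
print(plan_fetch("www.nasa.gov", 0, KNOWN_MOBILE.get))
```

Passing the lookup as a callback keeps the decision step independent of how the web service is actually queried, which the slides do not specify.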

Page 4

<property name="userAgentTemplate"
          value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>

<property name="userAgentTemplateMobile"
          value="Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_1 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8B117 Safari/6531.22.7 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/>

<!-- Option # = Description
     0 [Default] Crawl using desktop user agent
     1           Crawl using mobile user agent + Mobile Finder Web Service
-->
<property name="CrawlOption" value="0" />
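
To make the placeholders in the user-agent templates concrete, here is a minimal Python sketch of how such a template could expand into the header value that is actually sent. Heritrix performs this substitution itself; the helper `expand_user_agent`, the version string, and the contact URL below are hypothetical and only illustrate the idea, assuming the placeholder spellings `@VERSION@` and `@OPERATOR_CONTACT_URL@`.

```python
def expand_user_agent(template: str, version: str, contact_url: str) -> str:
    """Fill the @VERSION@ and @OPERATOR_CONTACT_URL@ placeholders."""
    return (
        template.replace("@VERSION@", version)
        .replace("@OPERATOR_CONTACT_URL@", contact_url)
    )


DESKTOP_TEMPLATE = (
    "Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"
)

# Hypothetical version and operator contact URL, used only for this example.
print(expand_user_agent(DESKTOP_TEMPLATE, "3.1.0", "http://example.org/crawl-info"))
```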

Page 5

URLs Crawled

Desktop URL              Mobile URL
www.huffingtonpost.com   m.huffpost.com
www.foxnews.com          foxnews.mobi
www.nbcnews.com          www.nbcnews.com
www.whitehouse.gov       m.whitehouse.gov
www.nasa.gov             mobile.nasa.gov
www.ssa.gov              www.ssa.gov/mobile
www.cornell.edu          m.cornell.edu/#home
www.stanford.edu         m.stanford.edu
www.mit.edu              m.mit.edu / mobile.mit.edu

Page 15

Redirection/Delivery
200 Response (server side redirect)
302 “Temporary” relocation
301 “Permanent” relocation
JavaScript Redirection (client side redirect)
Media Queries
Style Sheets
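
The delivery mechanisms listed above can be told apart, roughly, from a single HTTP response. The following Python sketch is a heuristic written for illustration, not the method used in the experiment: the function name `classify_delivery` and both regular expressions are assumptions, and a plain 200 response cannot by itself reveal server-side delivery based on the user agent (that would require comparing the desktop and mobile fetches).

```python
import re
from typing import Optional

# Heuristic patterns (assumptions): a script assigning to location,
# and CSS media queries given inline or on a <link> element.
JS_REDIRECT = re.compile(
    r"(?:window\.|document\.)?location(?:\.href)?\s*=|location\.replace\s*\(",
    re.IGNORECASE,
)
MEDIA_QUERY = re.compile(r"@media\b|<link[^>]+\bmedia\s*=", re.IGNORECASE)


def classify_delivery(status: int, location: Optional[str], body: str) -> str:
    """Guess how a response delivers (or points to) a mobile version."""
    if status == 301 and location:
        return "301 permanent relocation"
    if status == 302 and location:
        return "302 temporary relocation"
    if status == 200:
        if JS_REDIRECT.search(body):
            return "JavaScript redirection (client side)"
        if MEDIA_QUERY.search(body):
            return "media queries"
        return "200 response"
    return "other"


print(classify_delivery(302, "http://m.example.org/", ""))
print(classify_delivery(200, None, "<script>window.location = 'http://m.example.org/';</script>"))
print(classify_delivery(200, None, "<style>@media (max-width: 480px) { body { margin: 0 } }</style>"))
```

A crawler without a JavaScript engine can detect that a script assigns to `location` in this textual way, but it cannot evaluate the script to learn where the redirect actually leads.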

Page 16

Tiny Limits
No JavaScript Engine
  Heritrix cannot execute JavaScript code.
  It cannot catch client-side redirection and will instead continue to crawl the desktop version of the web page.
  Note: The Mobile Finder Web Service will find the mobile page, and therefore Heritrix will continue the crawl.

www.nasa.gov
www.ssa.gov
www.cornell.edu

Page 17

Desktop vs Mobile: Total Link Count

Site          Desktop   Mobile
Huffington      56774     2134
Fox News        12703      110
NBC News         8894     3545
NASA             4960       63
SSA              2380       53
White House      8121      570
Stanford         2351      116
Cornell          2901       94
MIT               120      124
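
One way to read the totals above is as the mobile crawl's link count relative to the desktop crawl's, per site. The Python sketch below computes that ratio from the slide's numbers. It assumes the first row of figures is the desktop crawl and the second the mobile crawl (the slide does not label the rows), and the helper name `mobile_share` is made up for this example.

```python
SITES = ["Huffington", "Fox News", "NBC News", "NASA", "SSA",
         "White House", "Stanford", "Cornell", "MIT"]
# Assumption: first row of the slide is desktop, second row is mobile.
DESKTOP = [56774, 12703, 8894, 4960, 2380, 8121, 2351, 2901, 120]
MOBILE = [2134, 110, 3545, 63, 53, 570, 116, 94, 124]


def mobile_share(sites, desktop, mobile):
    """Map each site to its mobile link count divided by its desktop count."""
    return {site: m / d for site, d, m in zip(sites, desktop, mobile)}


shares = mobile_share(SITES, DESKTOP, MOBILE)
for site, share in shares.items():
    print(f"{site:<12} {share:7.1%}")
```

Under that row assumption, MIT is the only site whose mobile crawl found more links than its desktop crawl.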

Page 18

HTML Distribution

Site          Desktop   Mobile
Huffington      11550      493
Fox News         2681       35
NBC News         2302      488
NASA              851       18
SSA                20        0
White House      3251       76
Stanford          385       16
Cornell           596       31
MIT                12       26

Page 19

JavaScript Distribution

Site          Desktop   Mobile
Huffington        245       33
Fox News          107        4
NBC News           46       14
NASA              589        8
SSA                12        0
White House        83       13
Stanford          104        4
Cornell           525        8
MIT                 2        0

Page 20

CSS Distribution

Site          Desktop   Mobile
Huffington        587       36
Fox News          301        3
NBC News           72       17
NASA              304        1
SSA                 1        0
White House       154       19
Stanford          214        8
Cornell            86        4
MIT                 3        3

Page 21

Image Distribution

Site          Desktop   Mobile
Huffington      38671     1227
Fox News         8893       59
NBC              5852     2769
NASA             2908       28
SSA                17        0
White House      4187      436
Stanford         1460       74
Cornell          1484        4
MIT                87       89
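
The slides do not show the script that produced these per-type distributions. As one possible approach, the Python sketch below buckets the MIME types of crawled resources into the four categories used in the deck. The function names, the list of MIME types treated as JavaScript, and the sample input are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable

JS_TYPES = ("application/javascript", "application/x-javascript", "text/javascript")


def bucket(mime_type: str) -> str:
    """Map a Content-Type value to HTML, JavaScript, CSS, Image, or Other."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    mime = mime_type.split(";", 1)[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return "HTML"
    if mime in JS_TYPES:
        return "JavaScript"
    if mime == "text/css":
        return "CSS"
    if mime.startswith("image/"):
        return "Image"
    return "Other"


def tally(mime_types: Iterable[str]) -> Counter:
    """Count crawled resources per category."""
    return Counter(bucket(m) for m in mime_types)


# Made-up sample input, not data from the experiment.
sample = ["text/html; charset=UTF-8", "image/png", "image/jpeg",
          "text/css", "application/javascript", "application/pdf"]
print(tally(sample))
```

Running the same tally over the desktop crawl and the mobile crawl of a site would give one row of each distribution table.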

Page 22

FIN