Factors Affecting Website Reconstruction from the Web Infrastructure Frank McCown, Norou Diawara, and Michael L. Nelson Old Dominion University Computer Science Department Norfolk, Virginia, USA JCDL 2007 Vancouver, BC June 20, 2007

Factors Affecting Website Reconstruction from the Web Infrastructure

Frank McCown, Norou Diawara, and Michael L. Nelson

Old Dominion University
Computer Science Department

Norfolk, Virginia, USA

JCDL 2007
Vancouver, BC
June 20, 2007

Outline

• Web-repository crawling with Warrick
• How successful is a reconstruction?
• Reconstruction experiment
• Significant findings

Image credits:
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg

Crawling the Crawlers

[Diagram labels: World Wide Web; Repo1, Repo2, ..., Repon; Web crawling; Web-repository crawling]

Cached Image

Cached PDF

http://www.fda.gov/cder/about/whatwedo/testtube.pdf

[Figure labels: canonical; MSN version; Yahoo version; Google version]

• McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.

• McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM IEEE JCDL 2007.

• McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.

• McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.

Available at http://warrick.cs.odu.edu/

Measuring the Difference

(rc, rm, ra) = (changed, missing, added)

Apply Recovery Vector for each resource

Compute Difference Vector for website

Some Difference Vectors

D = (changed, missing, added)

(0,0,0) – Perfect recovery

(1,0,0) – All resources are recovered but changed

(0,1,0) – All resources are lost

(0,0,1) – All recovered resources are at new URIs
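
As a rough illustration, a difference vector can be computed by comparing the original website with its reconstruction. The following is a minimal Python sketch under assumptions that the slides do not state: the function name `difference_vector` and the URI-to-digest maps are hypothetical, changed and missing are normalized by the number of original resources, and added is normalized by the number of recovered resources.

```python
from typing import Dict, Tuple


def difference_vector(
    original: Dict[str, str], recovered: Dict[str, str]
) -> Tuple[float, float, float]:
    """Return (changed, missing, added) for a reconstruction.

    Both arguments map a URI to a content digest (a hypothetical
    representation chosen for this sketch).
    """
    orig_uris = set(original)
    rec_uris = set(recovered)

    # Resources present in both sets whose content differs.
    changed = sum(1 for uri in orig_uris & rec_uris
                  if original[uri] != recovered[uri])
    # Original resources that were not recovered at all.
    missing = len(orig_uris - rec_uris)
    # Recovered resources at URIs the original did not have.
    added = len(rec_uris - orig_uris)

    # Assumed normalization: changed and missing over the original size,
    # added over the recovered size. Empty sets yield 0.0.
    d_c = changed / len(original) if original else 0.0
    d_m = missing / len(original) if original else 0.0
    d_a = added / len(recovered) if recovered else 0.0
    return (d_c, d_m, d_a)
```

Under these assumptions an identical copy gives (0,0,0), a copy in which every resource differs gives (1,0,0), and an empty reconstruction gives (0,1,0).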

How Much Change is a Bad Thing?

[Figure labels: Lost; Recovered]

Assigning Penalties

Apply to each resource

Penalty Adjustment: (Pc, Pm, Pa)

Apply to each resource or to the difference vector

Defining Success

success = 1 – dm, where dm is the missing component of the difference vector

Equivalent to percent of recovered resources

Scale: 0 (less successful) to 1 (more successful)
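
The success measure can be read directly off a difference vector. A minimal Python sketch, assuming the vector is a (changed, missing, added) tuple of fractions between 0 and 1; the function name `success` is hypothetical.

```python
from typing import Tuple


def success(difference: Tuple[float, float, float]) -> float:
    """Return 1 - dm for a (changed, missing, added) difference vector."""
    _changed, missing, _added = difference
    # Only the missing component counts: a resource that was recovered
    # in changed form still counts as recovered.
    return 1.0 - missing
```

For example, a reconstruction with 39% of its resources missing scores about 0.61 regardless of how many recovered resources changed.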

Reconstruction Experiment

• 300 websites chosen randomly from Open Directory Project (dmoz.org)

• Crawled and reconstructed each website every week for 14 weeks

• Examined change rates, age, decay, growth, recoverability

Success of website recovery each week

On average, we recovered 61% of a website in any given week.

Recovery of Textual Resources

Recovery by TLD

Birth and Decay

Recovery of HTML Resources

Recovery by Age

Statistics for Repositories

Which Factors Are Significant?

• External backlinks
• Internal backlinks
• Google’s PageRank
• Hops from root page
• Path depth
• MIME type
• Query string params
• Age
• Resource birth rate
• TLD
• Website size
• Size of resources

Mild Correlations

• Hops and
  – website size (0.428)
  – path depth (0.388)

• Age and # of query params (-0.318)

• External links and
  – PageRank (0.339)
  – website size (0.301)
  – hops (0.320)

Regression Analysis

• No surprises: all variables are significant, but overall model only explains about half of the observations

• Three most significant variables: PageRank, hops and age (R-squared = 0.1496)

Conclusions

• Most of the sampled websites were relatively stable
  – One third of the websites never lost a single resource
  – Half of the websites never added any new resources

• The typical website can expect to get back 61% of its resources if it were lost today (77% textual, 42% images and 32% other)

• How to improve recovery from the Web Infrastructure (WI)? Improve PageRank, decrease number of hops to resources, create stable URLs

Thank You

Frank McCown

http://www.cs.odu.edu/~fmccown/

Sorry, Dad… You lost me in the first two minutes.

Injecting Server Components into Crawlable Pages

[Diagram labels: Erasure codes; HTML pages; Recover at least m blocks]
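
To make the "recover at least m blocks" idea concrete, here is a minimal Python sketch of the simplest possible erasure code: m data blocks plus one XOR parity block, so that any m of the m + 1 stored blocks suffice to rebuild the data. The slide does not say which erasure code is used, so this single-parity scheme, the equal-length block assumption, and the names `encode` and `recover` are all illustrative assumptions rather than the authors' method.

```python
from functools import reduce
from typing import List, Optional, Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data_blocks: Sequence[bytes]) -> List[bytes]:
    """Return the m data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(blocks: Sequence[Optional[bytes]]) -> List[bytes]:
    """Rebuild the m data blocks when at most one stored block is None."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("single-parity code needs at least m of m + 1 blocks")
    restored = list(blocks)
    if missing:
        # XOR of all surviving blocks equals the one lost block.
        present = [block for block in blocks if block is not None]
        restored[missing[0]] = xor_blocks(present)
    # Drop the parity block; the rest is the original data.
    return restored[:-1]
```

In this sketch each stored block could, for instance, be embedded in a different crawlable HTML page; losing any single page still leaves enough blocks to recover the original data. Codes that tolerate more than one loss need a stronger scheme than single parity.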

[Diagram labels: Web Server; Database; Perl script; config; Static files (HTML files, PDFs, images, style sheets, JavaScript, etc.); Dynamic page; Web Infrastructure; Recoverable; Not Recoverable]