

Page 1: Thumbnail Summarization Techniques For Web Archives Ahmed AlSum * Stanford University Libraries Stanford CA, USA aalsum@stanford.edu 1 Michael L. Nelson

Thumbnail Summarization Techniques For Web Archives

Ahmed AlSum*

Stanford University Libraries

Stanford CA, USA

aalsum@stanford.edu

Michael L. Nelson

Old Dominion University

Norfolk VA, USA

The 36th European Conference on Information Retrieval. ECIR 2014, Amsterdam, Netherlands, 2014

*Ahmed AlSum did this work while he was a PhD student at Old Dominion University

ECIR 2014 Amsterdam, Netherlands

Page 2

What is a Web Archive?

http://www.cs.odu.edu

Page 3

Thumbnails in Web Archive

[Screenshots: Internet Archive; UK Web Archive]

Page 4

Memento Terminology

• Original Resource (URI-R, R): http://www.amazon.com

• Memento (URI-M, M): http://web.archive.org/web/20110411070244/http://amazon.com

• TimeMap (URI-T, TM)

Page 5

Thumbnails Creation Challenges

• Scalability in Time
  • The Internet Archive (IA) may need 361 years to create a thumbnail for each memento using one hundred machines.

• Scalability in Space
  • IA will need 355 TB to store one thumbnail per memento.

• Page quality

Page 6

Thumbnails Usage Challenges

• This is a partial view of the first 700 thumbnails out of the 10,500 available mementos for www.apple.com

Page 7

From 10,500 Mementos to 69 Thumbnails.

Page 8

How many thumbnails do we need?

www.unfi.com on the live Web

Page 9

How many thumbnails do we need?

www.unfi.com on the live Web

Page 10

40 Thumbnails are good.

Page 11

METHODOLOGY

Page 12

Visual Similarity and Text Similarity

[Figure labels: Similar, Different, HTML Text]

Page 13

Correlation between Visual Similarity and Text Similarity

• Text Similarity
  • SimHash
  • DOM Tree
  • Embedded resources
  • Memento Datetime (Capture time)

• Visual Similarity
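
To make the idea of correlating a text-side measure with the visual measure concrete, here is a minimal sketch. The slides do not say which correlation coefficient was used, so the choice of Pearson's r is an assumption, and the function name `pearson` and the sample numbers are purely illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers.

    Assumption: Pearson's r is used here only for illustration; the
    slides do not name the coefficient.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)

# Hypothetical per-pair distances: one text-side measure, one visual measure.
text_distances = [1, 4, 9, 16]
visual_distances = [10, 35, 80, 170]
r = pearson(text_distances, visual_distances)
```

Each position in the two lists stands for one pair of mementos; a value of r close to 1 would suggest that the text-side measure tracks the visual difference.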

Page 14

Text Similarity

SimHash

• Computes 64-bit SimHash fingerprints with k = 4 for two pages
  • Full HTML text ✔
  • The main content from the web page
  • All the text
  • Templates including the text
  • The template excluding the text

• Calculate the differences using Hamming Distance
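
The SimHash step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slide does not say what k denotes, so the sketch treats k = 4 as the word-shingle length, and it uses MD5 only as a convenient 64-bit hash source. The names `shingles`, `simhash64`, and `hamming` are hypothetical.

```python
import hashlib

def shingles(tokens, k=4):
    """Overlapping runs of k tokens (assumption: k is the shingle length)."""
    if len(tokens) < k:
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def simhash64(text, k=4):
    """64-bit SimHash fingerprint of whitespace-tokenized text."""
    weights = [0] * 64
    for shingle in shingles(text.split(), k):
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for bit in range(64):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

page_a = "<html><body><h1>Welcome</h1><p>Fresh produce every day</p></body></html>"
page_b = "<html><body><h1>Welcome</h1><p>Fresh produce every week</p></body></html>"
difference = hamming(simhash64(page_a), simhash64(page_b))
```

A small Hamming distance between two fingerprints indicates that the two HTML texts are close; identical inputs always give a distance of 0.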

Page 15

Text Similarity

DOM Tree

• Transform each web page into a DOM tree

• Calculate the difference using Levenshtein distance
  • Levenshtein distance: the minimum number of insert, update, and delete operations needed to turn one into the other.
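
A minimal sketch of the DOM comparison follows. It makes a simplifying assumption that is not in the slides: each page is flattened to the sequence of its opening tags, and Levenshtein distance is computed over those sequences. The authors may instead compare the trees directly. `TagCollector`, `tag_sequence`, and `levenshtein` are hypothetical names.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records opening tags in document order (a flattened view of the DOM)."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags

def levenshtein(a, b):
    """Minimum number of insert, update, and delete operations turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete item_a
                current[j - 1] + 1,                    # insert item_b
                previous[j - 1] + (item_a != item_b),  # update, or keep if equal
            ))
        previous = current
    return previous[-1]

old_page = "<html><body><p>a</p><p>b</p></body></html>"
new_page = "<html><body><div>a</div><p>b</p><img></body></html>"
dom_distance = levenshtein(tag_sequence(old_page), tag_sequence(new_page))
```

In this example one tag is updated (p to div) and one is inserted (img), so the distance between the two tag sequences is 2.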

Page 16

Text Similarity

Embedded resources

• Extract the embedded resources for each page

• Calculate the total number of new resources that have been added and the resources that have been removed.

• For example, the difference between M1 and M2:
  • Addition of 5 resources (2 JavaScript files and 3 images)
  • Removal of 2 resources (1 JavaScript file and 1 image).
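
The embedded-resource comparison can be sketched with plain set operations. The resource names below are made up to mirror the M1/M2 example above, and summing additions and removals into one total is an assumed reading of "total number"; `resource_difference` is a hypothetical helper.

```python
def resource_difference(old_resources, new_resources):
    """Return (added, removed, total) counts between two lists of resource URIs."""
    old_set, new_set = set(old_resources), set(new_resources)
    added = new_set - old_set
    removed = old_set - new_set
    # Assumption: the overall difference is additions plus removals.
    return len(added), len(removed), len(added) + len(removed)

# Hypothetical embedded resources of two mementos, M1 and M2.
m1 = ["base.css", "old.js", "banner.png"]
m2 = ["base.css", "app.js", "track.js", "hero.png", "logo.png", "promo.png"]
added, removed, total = resource_difference(m1, m2)
```

Here M2 adds 2 JavaScript files and 3 images and drops 1 JavaScript file and 1 image, matching the counts in the example.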

Page 17

Text Similarity

Memento datetime

• Calculate the difference between the capture times of the two pages, in seconds.
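
A short sketch of the capture-time difference is shown below. It assumes the capture time can be read from the 14-digit timestamp that Wayback-style URI-Ms carry, as in the web.archive.org example earlier in the deck; the second URI-M and the helper names `memento_datetime` and `capture_gap_seconds` are hypothetical.

```python
import re
from datetime import datetime

# Assumption: the URI-M embeds a YYYYMMDDhhmmss timestamp between slashes.
TIMESTAMP = re.compile(r"/(\d{14})/")

def memento_datetime(uri_m):
    match = TIMESTAMP.search(uri_m)
    if match is None:
        raise ValueError("no 14-digit timestamp found in " + uri_m)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")

def capture_gap_seconds(uri_m1, uri_m2):
    """Absolute difference between two capture times, in seconds."""
    delta = memento_datetime(uri_m2) - memento_datetime(uri_m1)
    return abs(delta.total_seconds())

first = "http://web.archive.org/web/20110411070244/http://amazon.com"
second = "http://web.archive.org/web/20110412070244/http://amazon.com"  # hypothetical
gap = capture_gap_seconds(first, second)
```

The two example captures are exactly one day apart, so the gap is 86,400 seconds.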

Page 18

Visual Similarity

• Measurement: the number of different pixels between two thumbnails

• To compare two thumbnails,
  • Resize them into different dimensions: 64x64, 128x128, 256x256, and 600x600.
  • Calculate the Manhattan distance and Zero distance between each pair
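
The two pixel measures can be sketched on tiny grayscale grids. Two assumptions are made here: the thumbnails have already been resized to the same dimensions and are represented as nested lists of pixel intensities, and "Zero distance" is read as the count of pixel positions that differ. The function names are hypothetical.

```python
def _check_same_shape(a, b):
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("thumbnails must have identical dimensions")

def manhattan_distance(a, b):
    """Sum of absolute intensity differences over all pixel positions."""
    _check_same_shape(a, b)
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def zero_distance(a, b):
    """Number of pixel positions whose values differ (assumed meaning)."""
    _check_same_shape(a, b)
    return sum(1 for ra, rb in zip(a, b) for p, q in zip(ra, rb) if p != q)

# Two hypothetical 2x2 grayscale thumbnails.
thumb_a = [[0, 10], [20, 30]]
thumb_b = [[0, 12], [25, 30]]
l1 = manhattan_distance(thumb_a, thumb_b)
l0 = zero_distance(thumb_a, thumb_b)
```

For these grids the Manhattan distance is 2 + 5 = 7 and two pixels differ. Real thumbnails would first be loaded and resized with an imaging library, which is left out of this sketch.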

Page 19

Correlation between Visual Similarity and Text Similarity

[Figure panels: SimHash; DOM tree; Embedded resources; Memento Datetime]

SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]

Page 20

SELECTION ALGORITHMS

Page 21

Threshold Grouping

Page 22

Threshold Grouping

Page 23

Clustering technique

• Input:
  • TimeMap with n mementos
  • A set of features.
    • For example, F = {SimHash, Memento-Datetime}

• Task:
  • Cluster n mementos in K clusters.

Page 24

Clustering technique

[Figure panels: SimHash Feature; SimHash and Datetime Features]

Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
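
As a rough illustration of clustering n mementos into K clusters around medoids, the sketch below alternates between assigning points to their nearest medoid and re-picking each cluster's medoid. It is a simplified version written for this transcript, not a faithful reproduction of the cited algorithm: the initialization rule, the one-dimensional feature values, and the names `assign` and `k_medoids` are all assumptions.

```python
def assign(dist, medoids):
    """Map each medoid index to the indices of the points closest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        # A medoid always stays in its own cluster, so no cluster is empty.
        nearest = i if i in clusters else min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters

def k_medoids(dist, k, max_iter=100):
    """Cluster points given a full pairwise distance matrix."""
    # Assumed initialization: the k points with the smallest total distance.
    medoids = sorted(range(len(dist)), key=lambda i: sum(dist[i]))[:k]
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid: the member with the smallest total distance to its cluster.
        updated = [min(members, key=lambda c: sum(dist[c][j] for j in members))
                   for members in clusters.values()]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return assign(dist, medoids)

# Hypothetical single-feature values for six mementos.
values = [1, 2, 3, 100, 101, 102]
dist = [[abs(a - b) for b in values] for a in values]
clusters = k_medoids(dist, k=2)
```

With these values the mementos split into two groups whose medoids are the middle element of each group. One natural use, assumed here rather than stated on the slide, is to keep the medoid of each cluster as its representative thumbnail.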

Page 25

Time Normalization

Page 26

Selection Algorithms Comparison

                        Threshold Grouping   K clustering   Time Normalization
TimeMap Reduction       27%                  9% to 12%      23%
Image Loss              28                   78 to 101      109
# Features              1 feature            1 or more      1 feature
Preprocessing required  Yes                  Yes            No
Efficient processing    Medium               Extensive      Light
Incremental             Yes                  No             Yes
Online/offline          Both                 Both           Both

Page 27

Generalization outside the Web Archive

• Get k thumbnails from a website that has n pages

Page 28

Conclusions

• We explored the similarity between the text and visual appearance of the web page.
  • We found that SimHash and Levenshtein distance have the highest correlation

• We presented three algorithms to select k thumbnails from n mementos per TimeMap.

aalsum@stanford.edu, @aalsum
