Characterizing Search Behavior in Web Archives
Miguel Costa, Mário J. Silva
LaSIGE @ Faculty of Sciences, University of Lisbon
Foundation for National Scientific Computing
TWAW2011, Hyderabad, India
Ephemeral Web
• The web contains unique and valuable information
– news, interviews, opinions, feelings
• 80% of web documents are unavailable after 1 year.
• Knowledge gap for future generations
Web Archiving Initiatives
• 42 web archiving initiatives in 26 countries.
• More than 180 billion documents archived since 1996.
Web Archiving Workflow
[Diagram: Acquisition, Storage, Searching, Preservation, Presentation]
Searching
• Search technology based on web search engines
– ignores the temporal dimension
– does not understand the end users
First: Understanding Users
• Why do users search? (information needs)
• What do users search for? (topics)
• How do users search? (search behavior) – this study: a first characterization
Predicting users’ behavior can improve
• Response time – e.g. cache, special indexes
• Quality of results – e.g. better ranking, suggest queries
• Web design – e.g. make most used functionalities stand out
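As one illustration of the response-time point, repeated queries could be answered from a cache instead of the index. The sketch below is only an assumption about how such a cache might look: `run_search` is a hypothetical stand-in for an expensive index lookup and is not part of the archive's actual system.

```python
from functools import lru_cache

calls = []  # records which queries actually reached the (hypothetical) index


def run_search(query):
    """Hypothetical stand-in for an expensive index lookup."""
    calls.append(query)
    return (f"result for {query}",)


@lru_cache(maxsize=1024)
def cached_search(query):
    """Answer repeated queries from memory; only misses reach run_search."""
    return run_search(query)


# The second identical query is served from the cache.
cached_search("lisboa")
cached_search("lisboa")
```

Whether a cache of this kind pays off depends on how often queries repeat in the logs.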
Portuguese Web Archive
• Archives the Portuguese Web ≈ .PT domain
• ≈ 182M documents:
– searchable by full-text and URL
– ranging from 1996 to 2009
• Search available since 2010.
http://archive.pt
Interface: full-text search
Results Page
Interface: URL search
Versions Page
same text box
Methodology
Search Log Analysis
• Pros
– Large and varied
– Less bias
– Cheaper
– Non-intrusive
– Real information needs
• Cons
– Lack of context
– Lack of control
Dataset of Search Logs
• ≈ 10K sessions over 7 months of 2010
• Procedure
– cleansing: normalized and excluded invalid sessions & queries
– session delimitation: used IP, user session and a 30 minute gap
• Users
– 72% of IP addresses → Portugal
– 89% of interactions → PT language interface
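The session delimitation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the log entry layout `(ip, session_id, timestamp, query)` and the function name `delimit_sessions` are assumptions; only the IP, user session and 30 minute gap criteria come from the slide.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # gap stated on the slide


def delimit_sessions(entries):
    """Group (ip, session_id, timestamp, query) entries into sessions.

    Entries sharing the same (ip, session_id) pair stay in one session
    until two consecutive entries are more than SESSION_GAP apart.
    """
    sessions = []
    last_seen = {}  # (ip, session_id) -> (last timestamp, open session)
    for ip, session_id, ts, query in sorted(entries, key=lambda e: e[2]):
        key = (ip, session_id)
        previous = last_seen.get(key)
        if previous is None or ts - previous[0] > SESSION_GAP:
            current = []  # start a new session
            sessions.append(current)
        else:
            current = previous[1]
        current.append((ts, query))
        last_seen[key] = (ts, current)
    return sessions


# Made-up example entries (not taken from the real logs).
log = [
    ("10.0.0.1", "a", datetime(2010, 3, 1, 10, 0), "lisboa"),
    ("10.0.0.1", "a", datetime(2010, 3, 1, 10, 10), "lisboa 1998"),
    ("10.0.0.1", "a", datetime(2010, 3, 1, 11, 0), "expo 98"),
    ("10.0.0.2", "b", datetime(2010, 3, 1, 10, 5), "publico.pt"),
]
sessions = delimit_sessions(log)
```

In this made-up log, the third entry of the first user arrives 50 minutes after the previous one, so it opens a new session.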
How do users search?
General Statistics
• Full-text sessions + URL sessions ≈ 90%
• Full-text sessions / URL sessions ≈ 2:1
• A typical full-text session:
– 1 or 2 queries
– 1 to 3 terms per query
– 1 or 2 result pages seen per query
– 1 click per query
• A typical URL session:
– 1 or 2 queries
– 1 or 2 clicks per query
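Figures such as "1 or 2 queries" and "1 to 3 terms per query" could be derived from delimited sessions roughly as below. This is a sketch under assumptions: sessions are represented as lists of query strings, terms are split on whitespace, and the median is used as the notion of "typical"; the slides do not say which statistic was used.

```python
from statistics import median


def session_summary(sessions):
    """Summarize sessions given as lists of query strings."""
    queries_per_session = [len(s) for s in sessions]
    # Assumption: a term is any whitespace-separated token.
    terms_per_query = [len(q.split()) for s in sessions for q in s]
    return {
        "median_queries_per_session": median(queries_per_session),
        "median_terms_per_query": median(terms_per_query),
    }


# Made-up example sessions.
example = [["arquivo web"], ["lisboa", "lisboa expo 98"], ["sapo"]]
summary = session_summary(example)
```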
Query Distribution
[Figure: # full-text queries per session; x-axis: # queries (1 to ≥10); y-axis: % sessions (0% to 70%); annotation: 85%]
Query Refinement
[Figure: # full-text terms changed; x-axis: # terms changed (≤-5 to ≥5); y-axis: % queries (0% to 35%); annotation: 71%]
• terms added in 42% of queries
• terms removed in 25% of queries
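One way to measure how many terms changed between consecutive queries of a session is sketched below. It assumes that "terms changed" means the difference in the number of whitespace-separated terms; the slides do not define the measure, so this is an illustration rather than the study's method.

```python
from collections import Counter


def term_count_changes(session_queries):
    """Change in number of terms between consecutive queries.

    Positive values mean terms were added, negative values mean removed.
    """
    lengths = [len(q.split()) for q in session_queries]
    return [after - before for before, after in zip(lengths, lengths[1:])]


def change_histogram(sessions):
    """Count how often each change value occurs over all sessions."""
    counts = Counter()
    for queries in sessions:
        counts.update(term_count_changes(queries))
    return counts


# Made-up example session: 1 term, then 3 terms, then 2 terms.
changes = term_count_changes(["lisboa", "lisboa expo 98", "expo 98"])
histogram = change_histogram([["lisboa", "lisboa expo 98", "expo 98"], ["sapo"]])
```

A session with a single query contributes nothing to the histogram, since there is no refinement to measure.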
Exploring Popularity
[Figure: power law sketch; labels: Many, Few, Popular, Rare]
• Queries, terms, clicks and archived pages seen follow a power law distribution
– 27% top queries → 50% query volume
– 6% top terms → 50% query volume
– 10% top pages seen → 26% all pages seen
– 66% clicks → 1st result page
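A statement such as "27% top queries → 50% query volume" can be computed from a list of logged queries as sketched below. The function name `top_share_for_volume` and the example data are assumptions made for illustration.

```python
from collections import Counter


def top_share_for_volume(items, volume=0.5):
    """Fraction of distinct items (most frequent first) needed to cover
    the given fraction of all occurrences."""
    counts = sorted(Counter(items).values(), reverse=True)
    target = volume * sum(counts)
    covered = 0
    for rank, count in enumerate(counts, start=1):
        covered += count
        if covered >= target:
            return rank / len(counts)
    return 1.0  # only reached for an empty input


# Made-up example: 4 distinct queries, 10 occurrences in total.
queries = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
share = top_share_for_volume(queries)
```

In this made-up example, the single most frequent query already accounts for half of the volume, so the share is one of four distinct queries.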
How do users search?
• Spend little time and effort on individual searches
• Search and explore following power law distributions
• Search in web archives as in web search engines
– Excite (U.S.), Fast (Europe), Tumba! (Portugal)
– Slightly fewer queries, but slightly longer ones
But, what about time?
1/3 of Queries are Restricted by Date
[Figure: % queries restricted by date; x-axis: restriction (start date, end date, start & end date); y-axis: % queries (0% to 35%); series: full-text, URL]
Oldest Versions are more Searched
[Figure: x-axis: years (1996 to 2009); y-axis: % queries restricted by date (30 to 100); series: full-text queries, URL queries]
Oldest Versions are more Clicked
[Figure: y-axis: # clicks / # times displayed (0 to 60); x-axis ticks: 1 to 9, axis name not recoverable]
Conclusions
• Web archive users:
– search as in web search engines
– prefer full-text search over URL search
– prefer the oldest documents over the newest
Future Work
• Validate results
– larger datasets, other sources, throughout time
• Use results to improve:
– ranking considering time
– throughput and response speed
– user interface