When Search becomes Research and Research becomes Search
SIGIR’13 Workshop on Exploration, Navigation and Retrieval of Information in Cultural Heritage (ENRICH)
August 1, 2013, Dublin, Ireland
Jaap Kamps, University of Amsterdam
(Re)search, (Re)searchers
• My current main interest is search related to/supporting research (amongst a few dozen other things)
• So what’s different if your searchers are researchers, and their search is (part of) their research?
• This talk is rather speculative -- no iron-clad formal results -- but I hope to convince you that this is (at least) an interesting use case
• And an area with great opportunities to work in...
Outline
• DATA: The Web and Online Heritage
• Issues: Archival Silence
• USERS: Digital Heritage -- Digital Humanities
• Challenges: Digital Methods
• TOOLS: Supporting Complex Search Tasks
• (Re)search: Digital Methods <-> Complex Search
Lots of CH online
CH is digitized on a massive scale
Europeana: millions of objects from 1000s of providers
The UK Web Archive
• Permission-based selective archiving since 2004
• 30% success rate
• 131,164 websites, 54,604 instances, ~14TB WARCs
• Domain crawl from 12 April 2013 to implement non-print legal deposit
• Expected to crawl between 4-5 million UK websites
• Access in reading rooms only
http://www.webarchive.org.uk
Terabytes of Archived Web Data
(From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)
What’s the problem?
Not really that much traffic...
Europeana Web Traffic Report – Q4 2012: Month by Month Overview
Month         | Visits  | Unique Visitors | Page Views | Time on site/visit (hh:mm:ss) | Bounce rate
October 2012  | 534,830 | 441,096         | 2,017,751  | 00:02:17                      | 50.27%
November 2012 | 612,902 | 505,177         | 2,299,244  | 00:02:16                      | 49.79%
December 2012 | 530,747 | 439,919         | 2,079,335  | 00:02:19                      | 48.80%
2. Portal Search
• 338,574 Visits with Search (36.10% increase from Q3 2012, 52.89% increase from Q4 2011): the number of visits during which at least one portal search occurred
• 743,292 Total Unique Searches (37.82% increase from Q3 2012, 31.67% increase from Q4 2011): the number of times a search is performed on Europeana (duplicate searches within a single visit are excluded)
3. Object Views, Social Actions & Click-throughs
• 2,361,589 Object Views (8.10% increase from Q3 2012, 45.55% increase from Q4 2011): the number of times Europeana object pages have been viewed; repeated views of a single page are counted
• 778,046 Search Result Views (10.06% increase from Q3 2012, 8.32% decrease from Q4 2011): the number of times Europeana search results pages have been viewed; repeated views of a single page are counted
• 2,975 Social Actions (46.19% increase from Q3 2012, 22.27% increase from Q4 2011): the number of times a user has clicked on a social share icon within the portal
• KPI 27: 30,000 object shares in 2012; Jan – Dec 2012: 9,609 shares (from portal)
Let’s say: less traffic than we hoped for...
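To put the reported numbers in proportion, here is a small back-of-the-envelope calculation. It only uses figures quoted from the Q4 2012 report above; the variable names are made up for illustration, and it assumes that the "visits with search" count covers the same three months as the monthly table.

```python
# Figures copied from the Europeana Web Traffic Report (Q4 2012) quoted above.
monthly = {
    "October 2012": {"visits": 534_830, "page_views": 2_017_751},
    "November 2012": {"visits": 612_902, "page_views": 2_299_244},
    "December 2012": {"visits": 530_747, "page_views": 2_079_335},
}
visits_with_search = 338_574   # Q4 2012
shares_2012 = 9_609            # object shares from the portal, Jan - Dec 2012
kpi_target = 30_000            # KPI 27 target for 2012

total_visits = sum(m["visits"] for m in monthly.values())
total_page_views = sum(m["page_views"] for m in monthly.values())

# Average number of pages viewed per visit over the quarter.
pages_per_visit = total_page_views / total_visits

# Assumption: visits_with_search refers to the same quarter as the table.
search_share = visits_with_search / total_visits

# Fraction of the yearly sharing target that was reached.
kpi_progress = shares_2012 / kpi_target

print(f"Total visits in Q4 2012:  {total_visits:,}")
print(f"Page views per visit:     {pages_per_visit:.2f}")
print(f"Visits with a search:     {search_share:.1%}")
print(f"KPI 27 progress:          {kpi_progress:.1%}")
```

Under these assumptions, roughly one visit in five involves a portal search, and the sharing KPI ends at about a third of its target.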
How often are web archives used?
• Archiving institutions’ focus on data collection, not usage
• 19 of 29 IIPC members’ archives (listed on website) have full or partial online access, often permission-based
• Large scale national web archives have restricted access – dark archives, e.g. Danish National Web Archive: over 280TB, online access for researchers with PhD or higher level, 20 users since 2005
• “Document-centric” access methods
• No agreed way of calculating / benchmarking access statistics
• Little evidence of scholarly use of web archives, making it difficult to understand requirements
(From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)
Archival Silence
• Many online collections suffer from low traffic...
• After years of hard work, the data is there
• But the users aren’t queuing up to come and explore the data
• Why is that happening?
Online Digital Heritage collections are incunabula
Our infrastructure changed in a revolutionary way
Our technology changed in a revolutionary way
How radically did information access methods change?
Think outside the box?
• Are we too “framed” by the type of systems that we had before?
• And by those that emerged on the Web?
• (cf. Diane Kelly’s “Contours and Convergence”, KSJ lecture at ECIR’13.)
Wrap Up (1)
• We have made wonderful progress: CH data is out there in huge volume
• More, better, richer, ... every day
• Use of the data is often lagging behind
• We should learn from “the Web”
• But also do really different things!
• (This takes time -- at least a generation)
Right, something really different -- but what?
CH as Web search?
• Should we really try to “copy” the Web?
• Web search optimizes fast, shallow search
• on highly dynamic data with massive #s of user signals
• Could we be *ahead* of the Web (rather than following them)?
Let’s do the obvious :)
• Look seriously at the scholarly use of the CH information we have accumulated?
• Get in touch with researchers and find out how they (want to) use the data and why they are *not* using our tools
• (In fact, heritage institutions traditionally focused on scholars, emphasis on the general public is quite recent...)
Digital Heritage -- Digital Humanities
e-Humanities
The Times They are a-Changin’ ?
Something exciting is happening!
• Digital Humanities emerging fast in response to massive volume of data
• Digitization of historic sources
• Heritage of the future is digital
• User-generated content in new media
• In short: for many research questions a lot of relevant data is available!
Change in Character
1.0                 | 2.0
Collection-centered | User-centered
Supply-driven       | Demand-driven
Professionals       | Amateurs
Individual scholar  | Team or lab
Small scale         | Large scale
Qualitative         | Quantitative
Change = Radical!
• Change in research paradigm?
• Traditional humanities based on interpretative paradigm
• Empirical sciences based on a truth-finding paradigm
• Did the “success criterion” change?
• Use tools of the exact sciences for the benefit of the traditional paradigm?
(Actual empirical science is also less rigorous)
DH requires new data-driven research methods
"Google and the politics of tabs" by Govcom.org, Amsterdam, 2008.
Website historiography
Innovation and Evaluation of Information
A CHI98 Workshop
Gene Golovchinsky and Nicholas J. Belkin
Abstract
This report summarizes a workshop held at CHI 98 that focused on several aspects of information exploration, including user interfaces, theory, and evaluation. Information exploration is a common activity that spans a variety of media and is an integral component of many information seeking behaviors that people engage in. The complexity of this activity, and the need to support it appropriately, led us to propose this workshop. Over the course of two days, we examined several aspects of this problem, struggled with a few definitions, and came away with a better understanding of the design space. Here we summarize those efforts.
Introduction
Traditional Information Retrieval is concerned with improving effectiveness of indexing and retrieval mechanisms, and with supporting one information seeking behavior: specified searching through query formulation. This has been predicated on support for one kind of user population, with one kind of information need. But the networked information environment has resulted in a shift in the user population of information retrieval systems. This change has introduced new classes of users, in the sense of levels of expertise, and has also made clear that there are different kinds of information needs and different kinds of information seeking behaviors than those supported by traditional IR systems and techniques. This workshop focused on developing understanding of one such information seeking behavior, Information Exploration, on interface design for supporting this behavior, and on evaluation methods and measures for assessing such interfaces.
Information Exploration addresses the goal of refining a vague concept into a more thorough understanding of the problem which led to the information interaction. We believe that information exploration research falls squarely in the domain of human-computer interaction with some emphasis on information retrieval, rather than vice versa. Thus one of the thrusts of this workshop was to attempt to characterize the activities users engage in, to design for those activities, and to identify evaluation techniques and measures that provide appropriate insights into users' behavior and performance.
Organization
About 20 people participated in the workshop. They were chosen on the basis of initial brief submitted position papers, and represented a broad spectrum of industry and academia. Participants came from France, Canada, Germany, and the U.S. After acceptance, participants were asked to submit longer (4-5 page) position statements that described relevant research and perspectives a few weeks prior to the workshop. These papers were made available through the workshop web site, and participants were encouraged to review and comment on them.
Submissions were organized into three categories: Interface, Evaluation and Theory. Each category was further subdivided into themes that suggested themselves. Thus a number of interface submissions concerned information visualization; three of five evaluation-related submissions focused on expertise, and the theory section split evenly between frameworks and representation of information.
On the morning of the first day, workshop activities were organized based on the three topics we had initially defined. After the morning introductory session, we split the workshop into three new working groups, based on the results of that discussion.
Figure 1. Information exploration (gray box) situated in the broader task. The black "method" box may involve a recursive information exploration step to identify information sources.
Discussion Highlights
It seems obligatory for a workshop to debate the definition of the concept that brought people together; we embraced this orthodoxy with a vengeance. One of the recurring themes of
(From: Golovchinsky and Belkin, SIGCHI Bulletin, Volume 31, Number 1, January 1999, p. 22)
Essentially these are complex search strategies!
Wrap Up (II)
• Digital Humanities is emerging fast and leads to new data-driven research methods
• Motivated by humanities research questions
• Essentially they are crawling, cleaning, tokenizing, ranking, exploring, visualizing
• Basically the stuff *we* are experts in
• Can we build tools that support their research task from beginning to end?
(Re)search?
• Interactively construct complex strategy
• data sources, selections, processing, back-and-forth, ...
• Explore all results using facets/aspects
• explore the whole data set -- not just 10 links
• Store, share, and refine search strategies
• “Session” may take minutes, hours, days, ...
How to get there?
(1) Intensive collaborations with CH institutions
(2) Include researchers: Co-creation, Living Lab, ...
(3) Build not a tool, but the toolmaker’s tools
Team up with Arjen de Vries and Spinque :)
Search strategy from building blocks
Strategy Builder: each block = data or manipulations
Build dedicated search engine “on the fly”
Research methods become search strategies
Store, refine, reuse, share strategies
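The building-block idea can be illustrated with a minimal Python sketch. This is not the Strategy Builder itself: the `Strategy` class, the `source`, `select` and `contains` blocks, and the toy records are all hypothetical, chosen only to show how data blocks and manipulation blocks might be chained, stored, and refined.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Block = Callable[[Iterable[Record]], Iterable[Record]]


def source(records: List[Record]) -> Block:
    """Data block: ignores its input and emits a collection."""
    def block(_upstream: Iterable[Record]) -> Iterable[Record]:
        return iter(records)
    return block


def select(key: str, value: str) -> Block:
    """Manipulation block: keep records whose field equals a value."""
    def block(upstream: Iterable[Record]) -> Iterable[Record]:
        return (r for r in upstream if r.get(key) == value)
    return block


def contains(key: str, term: str) -> Block:
    """Manipulation block: keep records whose field mentions a term."""
    def block(upstream: Iterable[Record]) -> Iterable[Record]:
        return (r for r in upstream if term.lower() in r.get(key, "").lower())
    return block


@dataclass
class Strategy:
    """A named, reusable chain of blocks (hypothetical structure)."""
    name: str
    blocks: List[Block] = field(default_factory=list)

    def then(self, block: Block) -> "Strategy":
        # Return a new strategy so the stored one stays reusable.
        return Strategy(self.name, self.blocks + [block])

    def run(self) -> List[Record]:
        records: Iterable[Record] = []
        for block in self.blocks:
            records = block(records)
        return list(records)


# Toy records standing in for an archive (illustrative only).
archive = [
    {"host": "nu.nl", "year": "2012", "text": "Hurricane Sandy hits New York"},
    {"host": "nu.nl", "year": "2012", "text": "US election results"},
    {"host": "example.org", "year": "2012", "text": "Sandy aftermath"},
    {"host": "nu.nl", "year": "2011", "text": "Sandy beaches in summer"},
]

base = (Strategy("nu.nl in 2012")
        .then(source(archive))
        .then(select("host", "nu.nl"))
        .then(select("year", "2012")))
sandy = base.then(contains("text", "sandy"))  # refine a stored strategy

print(len(base.run()), "records in the base selection")
print([r["text"] for r in sandy.run()])
```

Because `then` returns a new strategy, a stored strategy can be shared and refined without changing the original.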
(Re)search!
Web Archive (New Media scholars)
Thaer Samar, PhD/programmer
Hugo Huurdeman, PhD researcher
Anat Ben-David, Postdoc
Arjen de Vries, Jaap Kamps, Richard Rogers
Paul Doorenbosch, René Voorburg
Victor-Jan Vos
WebART Goals
• Evaluating current curation and selection procedures of Web archives
• Getting insights into current use of Web archives
• Developing new methods and tools for research using Web archives
Flickr: koninklijkebibliotheek
KB: Web archive since 2007
Statistics:
• 4,000+ websites
• 17,000+ harvests
• 7+ Terabyte
Selective approach
“Wayback Machine” interface
• WebARTist (pilot - beta 1)
• Initial dataset (corpus)
• 432 crawls, 16 months (13.64 GB)
Full-text search engine
KB CommonCrawl+nu.nl
(Dutch news aggregator)
WebARTist: Use case
• Digital Methods Winter School (Jan. ’13)
• Co-design workshop (“Living Lab”)
• researchers & developers
• first use of WebARTist
Word frequency analysis
[Chart: word frequency (axis 0 to 800) over time, 17/05/2011 to 06/01/2013]
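A word frequency analysis over dated crawls can be sketched in a few lines of Python. This is an illustration under assumptions, not the WebARTist implementation: the function name, the simple `\w+` tokenizer, and the toy snapshots are all made up for the example.

```python
import re
from collections import Counter
from datetime import date

TOKEN = re.compile(r"\w+")


def term_frequency_by_date(snapshots, term):
    """Count occurrences of `term` per crawl date.

    `snapshots` is an iterable of (crawl_date, page_text) pairs;
    several pages may share the same crawl date.
    """
    counts = Counter()
    needle = term.lower()
    for crawl_date, text in snapshots:
        tokens = TOKEN.findall(text.lower())
        counts[crawl_date] += tokens.count(needle)
    return sorted(counts.items())


# Toy snapshots (illustrative only, not real archive data).
snapshots = [
    (date(2012, 10, 29), "Sandy approaches the coast. Sandy is a hurricane."),
    (date(2012, 10, 29), "Sandy closes airports"),
    (date(2012, 11, 6), "Election day in the US"),
]

for crawl_date, freq in term_frequency_by_date(snapshots, "Sandy"):
    print(crawl_date.isoformat(), freq)
```

Plotting the resulting (date, count) pairs gives a time series of the kind shown on the slide.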
Co-Word Analysis
[Charts: counts per linked host (e.g. edition.cnn.com, bbc.co.uk, reuters.com), with panels labeled Syria, Sandy, and US Election]
Outlink Analysis
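An outlink analysis of this kind boils down to extracting links from archived pages and counting the hosts they point to. The sketch below uses only the Python standard library; the `LinkCollector` class, the `outlink_hosts` function, and the sample page are hypothetical and are not taken from the WebART tools.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outlink_hosts(page_url, html):
    """Count links per external host for one archived page."""
    parser = LinkCollector()
    parser.feed(html)
    own_host = urlparse(page_url).netloc
    hosts = Counter()
    for href in parser.hrefs:
        # Resolve relative links against the page URL first.
        host = urlparse(urljoin(page_url, href)).netloc
        if host and host != own_host:
            hosts[host] += 1
    return hosts


# Toy page (illustrative only).
page_url = "http://www.nu.nl/article.html"
html = """
<p>Sources:
  <a href="http://edition.cnn.com/a">CNN</a>
  <a href="http://edition.cnn.com/b">CNN again</a>
  <a href="http://www.bbc.co.uk/news">BBC</a>
  <a href="/internal.html">related article</a>
</p>
"""

for host, count in outlink_hosts(page_url, html).most_common():
    print(count, host)
```

Summing these counters over all pages that match a topic would give per-topic host counts.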
Geomapping location Wire service
Temporal Image Analyses
Pilot Tools: Scalable Full Text Search++
User interface
Search engine
Inverted index
Hadoop Distributed Filesystem
Some Lessons (pilot)
• Fun, creative (but hard for control freaks)
• unexpected really new ideas!
• It is really co-design -- a dialog:
• researchers keep talking in “solutions”
• unaware of the full potential?
• Search engine used to explore
• Then want to use their own tools
• Emphasis on aggregates, visualizations
Ongoing
• Started designing support for the whole task
• Want folks to stay in the system!
• Connect source data to later “information graphics”
• For the research prototype: no polished graphics
• Volume/Hadoop slow things down
• 1. Port “search by strategy” to Hadoop (slow, asynchronous)
• 2. After (complex) selection on Hadoop, instantiate a dedicated environment (fast, interactive, bounded size)
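The two-step plan can be sketched as follows: a slow pass over the full collection selects a bounded subset, and a small inverted index over that subset answers queries interactively. This is a single-machine illustration under assumptions, not Hadoop code: `select_subset` merely stands in for the asynchronous batch job, and all names and toy documents are hypothetical.

```python
import re
from collections import defaultdict


def select_subset(corpus, predicate, max_docs):
    """Step 1 (slow): scan the full collection, keep a bounded selection.

    Stands in for a batch job; here it is a plain loop.
    """
    subset = {}
    for doc_id, doc in corpus.items():
        if predicate(doc):
            subset[doc_id] = doc
            if len(subset) >= max_docs:
                break
    return subset


def build_inverted_index(subset):
    """Step 2 (fast): index only the selected documents in memory."""
    index = defaultdict(set)
    for doc_id, doc in subset.items():
        for token in re.findall(r"\w+", doc["text"].lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing all query terms."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Toy collection (illustrative only).
corpus = {
    "d1": {"year": 2012, "text": "Hurricane Sandy reaches New York"},
    "d2": {"year": 2012, "text": "Election night in New York"},
    "d3": {"year": 2011, "text": "Sandy beaches"},
}

subset = select_subset(corpus, lambda d: d["year"] == 2012, max_docs=1000)
index = build_inverted_index(subset)
print(sorted(search(index, "new york")))
print(sorted(search(index, "sandy")))
```

The `max_docs` bound is what keeps the second step interactive, whatever the size of the full archive.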
Projects with museums, archives, libraries, archaeology
Wrap Up (III)
• How far can we push this to support research in a generic way?
• Working on many sources, processing components and ways to combine them into search strategies
• Working on richer data (also from research use)
• Working on scale
• Data is still a crucial issue/factor
• Researchers always want what isn’t there
• Data quality/noise/completeness issues
Work on (Re)search?
• (Re)search leads to radically different modes of information access!
• (NB: Recall the panel!)
• Digital humanities is happening right now
• No shortage of data, dedicated users, ...
• Still lots of low-hanging fruit
• Great opportunities for young researchers!
Questions?
• We’re hiring!
• 2 PhD (4y), 2 Postdocs (6m/1y).
• WebART: http://webarchiving.nl/
• ExPoSe: http://staff.science.uva.nl/~kamps/expose/
• Thank you to all collaborators: Arjen de Vries, Richard Rogers, Hugo Huurdeman, Thaer Samar, Anat Ben-David, Maarten Marx, Wouter Alink, ...