When Search becomes Research and Research becomes Search

SIGIR’13 Workshop on Exploration, Navigation and Retrieval of Information in Cultural Heritage (ENRICH)

August 1, 2013, Dublin, Ireland

Jaap Kamps, University of Amsterdam

(Re)search, (Re)searchers

• My current main interest is search related to/supporting research (amongst a few dozen other things)

• So what’s different if your searchers are researchers, and their search is (part of) their research?

• This talk is rather speculative -- no iron-clad formal results -- but I hope to convince you that this is (at least) an interesting use case

• And an area with great opportunities to work in...

Outline

• DATA: The Web and Online Heritage

• Issues: Archival Silence

• USERS: Digital Heritage -- Digital Humanities

• Challenges: Digital Methods

• TOOLS: Supporting Complex Search Tasks

• (Re)search: Digital Methods <-> Complex Search

Lots of CH online

CH is digitized on a massive scale

Europeana: millions of objects from 1000s of providers

The UK Web Archive

• Permission-based selective archiving since 2004

• 30% success rate

• 131,164 websites, 54,604 instances, ~14TB WARCs

• Domain crawl from 12 April 2013 to implement non-print legal deposit

• Expected to crawl between 4-5 million UK websites

• Access in reading rooms only

http://www.webarchive.org.uk

Terabytes of Archived Web Data

(From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)

What’s the problem?

Not really that much traffic...

Europeana Web Traffic Report – Q4 2012

Month by Month Overview

Month           Visits    Unique Visitors   Page Views   Time on site/visit (mm:ss)   Bounce rate
October 2012    534,830   441,096           2,017,751    00:02:17                     50.27%
November 2012   612,902   505,177           2,299,244    00:02:16                     49.79%
December 2012   530,747   439,919           2,079,335    00:02:19                     48.80%

2. Portal Search

• 338,574 Visits with Search (36.10% increase from Q3 2012, 52.89% increase from Q4 2011). Visits with Search is the number of visits during which at least one portal search occurred.

• 743,292 Total Unique Searches (37.82% increase from Q3 2012, 31.67% increase from Q4 2011). Total Unique Searches is the number of times a search is performed on Europeana (duplicate searches within a single visit are excluded).

3. Object Views, Social Actions & Click-throughs

• 2,361,589 Object Views (8.10% increase from Q3 2012, 45.55% increase from Q4 2011). The number of times Europeana object pages have been viewed. Repeated views of a single page are counted.

• 778,046 Search Result Views (10.06% increase from Q3 2012, 8.32% decrease from Q4 2011). The number of times Europeana search results pages have been viewed. Repeated views of a single page are counted.

• 2,975 Social Actions (46.19% increase from Q3 2012, 22.27% increase from Q4 2011). The number of times a user has clicked on a social share icon within the portal.

KPI 27: 30,000 object shares in 2012. Jan – Dec 2012: 9,609 shares (from portal).

Let’s say: less traffic than we hoped for...
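To put rough numbers on "less than we hoped for", the report figures quoted above can be combined in a few lines. This is only a back-of-the-envelope sketch: it assumes that the 338,574 visits with search is a total for the whole quarter, and the variable names are made up for illustration.

```python
# Figures copied from the Europeana Q4 2012 traffic report quoted above.
monthly_visits = {
    "October 2012": 534_830,
    "November 2012": 612_902,
    "December 2012": 530_747,
}
monthly_page_views = {
    "October 2012": 2_017_751,
    "November 2012": 2_299_244,
    "December 2012": 2_079_335,
}
visits_with_search = 338_574   # assumed to cover the whole quarter
shares_2012 = 9_609            # object shares from the portal, Jan to Dec 2012
share_target_2012 = 30_000     # KPI 27

total_visits = sum(monthly_visits.values())

# Fraction of visits in which at least one portal search occurred.
search_share = visits_with_search / total_visits

# Average number of pages viewed per visit, per month.
pages_per_visit = {
    month: monthly_page_views[month] / monthly_visits[month]
    for month in monthly_visits
}

# How much of the yearly sharing target was reached.
kpi_progress = shares_2012 / share_target_2012

print(f"Q4 visits: {total_visits:,}")
print(f"Visits with search: {search_share:.1%}")
print(f"KPI 27 reached: {kpi_progress:.1%}")
for month, ratio in pages_per_visit.items():
    print(f"{month}: {ratio:.2f} pages per visit")
```

Under these assumptions roughly one visit in five involves a search, and about a third of the sharing target was met.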

How often are web archives used?

• Archiving institutions’ focus on data collection, not usage

• 19 of 29 IIPC members’ archives (listed on website) have full or partial online access, often permission-based

• Large scale national web archives have restricted access – dark archives

• e.g. Danish National Web Archive, over 280TB

• online access for researchers with PhD or higher level

• 20 users since 2005

• “Document-centric” access methods

• No agreed way of calculating / benchmarking access statistics

• Little evidence of scholarly use of web archives, making it difficult to understand requirements

(From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)

Archival Silence

• Many online collections suffer from low traffic...

• After years of hard work, the data is there

• But the users aren’t queuing up to come and explore the data

• Why is that happening?

Online Digital Heritage collections are incunabula

Our infrastructure changed in a revolutionary way

Our technology changed in a revolutionary way

How radically did information access methods change?

Think outside the box?

• Are we too “framed” by the type of systems that we had before?

• And by those that emerged on the Web?

• (cf. Diane Kelly’s “Contours and Convergence”, KSJ lecture at ECIR’13.)

Wrap Up (1)

• We have made wonderful progress: CH data is out there in huge volume

• More, better, richer, ... every day

• Use of the data is often lagging behind

• We should learn from “the Web”

• But also do really different things!

• (This takes time -- at least a generation)

Right, something really different -- but what?

CH as Web search?

• Should we really try to “copy” the Web?

• Web search optimizes fast, shallow search

• on highly dynamic data with massive #s of user signals

• Could we be *ahead* of the Web (rather than following them)?

Let’s do the obvious :)

• Look seriously at the scholarly use of the CH information we have accumulated?

• Get in touch with researchers and find out how they (want to) use the data and why they are *not* using our tools

• (In fact, heritage institutions traditionally focused on scholars; the emphasis on the general public is quite recent...)

Digital Heritage -- Digital Humanities

e-Humanities

The Times They are a-Changin’ ?

Something exciting is happening!

• Digital Humanities emerging fast in response to massive volume of data

• Digitization of historic sources

• Heritage of the future is digital

• User-generated content in new media

• In short: for many research questions a lot of relevant data is available!

Change in Character

1.0                   2.0
Collection-centered   User-centered
Supply-driven         Demand-driven
Professionals         Amateurs
Individual scholar    Team or lab
Small scale           Large scale
Qualitative           Quantitative

Change = Radical!

• Change in research paradigm?

• Traditional humanities based on interpretative paradigm

• Empirical sciences based on a truth-finding paradigm

• Did the “success criterion” change?

• Use tools of the exact sciences for the benefit of the traditional paradigm?

(Actual empirical science is also less rigorous)

DH requires new data-driven research methods

"Google and the politics of tabs" by Govcom.org, Amsterdam, 2008.

Website historiography

Innovation and Evaluation of Information

A CHI98 Workshop

Gene Golovchinsky and Nicholas J. Belkin

Abstract

This report summarizes a workshop held at CHI 98 that focused on several aspects of information exploration, including user interfaces, theory, and evaluation. Information exploration is a common activity that spans a variety of media and is an integral component of many information seeking behaviors that people engage in. The complexity of this activity, and the need to support it appropriately, led us to propose this workshop. Over the course of two days, we examined several aspects of this problem, struggled with a few definitions, and came away with a better understanding of the design space. Here we summarize those efforts.

Introduction

Traditional Information Retrieval is concerned with improving effectiveness of indexing and retrieval mechanisms, and with supporting one information seeking behavior: specified searching through query formulation. This has been predicated on support for one kind of user population, with one kind of information need. But the networked information environment has resulted in a shift in the user population of information retrieval systems. This change has introduced new classes of users, in the sense of levels of expertise, and has also made clear that there are different kinds of information needs and different kinds of information seeking behaviors than those supported by traditional IR systems and techniques. This workshop focused on developing understanding of one such information seeking behavior, Information Exploration, on interface design for supporting this behavior, and on evaluation methods and measures for assessing such interfaces.

Information Exploration addresses the goal of refining a vague concept into a more thorough understanding of the problem which led to the information interaction. We believe that information exploration research falls squarely in the domain of human-computer interaction with some emphasis on information retrieval, rather than vice versa. Thus one of the thrusts of this workshop was to attempt to characterize the activities users engage in, to design for those activities, and to identify evaluation techniques and measures that provide appropriate insights into users' behavior and performance.

Organization

About 20 people participated in the workshop. They were chosen on the basis of initial brief submitted position papers, and represented a broad spectrum of industry and academia. Participants came from France, Canada, Germany, and the U.S. After acceptance, participants were asked to submit longer (4-5 page) position statements that described relevant research and perspectives a few weeks prior to the workshop. These papers were made available through the workshop web site, and participants were encouraged to review and comment on them.

Submissions were organized into three categories: Interface, Evaluation and Theory. Each category was further subdivided into themes that suggested themselves. Thus a number of interface submissions concerned information visualization; three of five evaluation-related submissions focused on expertise, and the theory section split evenly between frameworks and representation of information.

On the morning of the first day, workshop activities were organized based on the three topics we had initially defined. After the morning introductory session, we split the workshop into three new working groups, based on the results of that discussion.

Figure 1. Information exploration (gray box) situated in the broader task. The black "method" box may involve a recursive information exploration step to identify information sources.

Discussion Highlights

It seems obligatory for a workshop to debate the definition of the concept that brought people together; we embraced this orthodoxy with a vengeance. One of the recurring themes of

SIGCHI Bulletin Volume 31, Number 1 January 1999 22

Essentially these are complex search strategies!

Wrap Up (II)

• Digital Humanities is emerging fast and leads to new data-driven research methods

• Motivated by humanities research questions

• Essentially they are crawling, cleaning, tokenizing, ranking, exploring, visualizing

• Basically the stuff *we* are experts in

• Can we build tools that support their research task from beginning to end?

(Re)search?

• Interactively construct complex strategy

• data sources, selections, processing, back-and-forth, ...

• Explore all results using facets/aspects

• explore the whole data set -- not just 10 links

• Store, share, and refine search strategies

• “Session” may take minutes, hours, days, ...

How to get there?

(1) Intensive collaborations with CH institutions

(2) Include researchers: Co-creation, Living Lab, ...

(3) Build not a tool, but the toolmaker’s tools

Team up with Arjen de Vries and Spinque :)

Search strategy from building blocks

Strategy Builder

Each block = data or manipulations

Build dedicated search engine “on the fly”

Research methods become search strategies

Store, refine, reuse, share strategies
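The idea of composing a search strategy from blocks can be sketched in plain Python. This is an illustration only, not Spinque's actual system or API: the block names (`select`, `contains`, `rank_by`, `strategy`), the record fields, and the toy corpus are all made up for the example.

```python
from typing import Callable, Dict, Iterable, Iterator

Doc = Dict[str, str]
Block = Callable[[Iterable[Doc]], Iterator[Doc]]


def select(field: str, value: str) -> Block:
    """Block that keeps documents whose field equals the given value."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        return (d for d in docs if d.get(field) == value)
    return block


def contains(term: str) -> Block:
    """Block that keeps documents whose text mentions the term."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        return (d for d in docs if term.lower() in d.get("text", "").lower())
    return block


def rank_by(field: str, reverse: bool = False) -> Block:
    """Block that orders documents by a field."""
    def block(docs: Iterable[Doc]) -> Iterator[Doc]:
        return iter(sorted(docs, key=lambda d: d.get(field, ""), reverse=reverse))
    return block


def strategy(*blocks: Block) -> Block:
    """Chain blocks into one reusable strategy, applied left to right."""
    def run(docs: Iterable[Doc]) -> Iterator[Doc]:
        for block in blocks:
            docs = block(docs)
        return iter(docs)
    return run


# Hypothetical toy corpus standing in for archived pages.
corpus = [
    {"url": "http://example.org/a", "crawl_date": "2012-10-30", "lang": "nl",
     "text": "Storm Sandy raakt New York"},
    {"url": "http://example.org/b", "crawl_date": "2012-11-07", "lang": "nl",
     "text": "Obama wint verkiezingen"},
    {"url": "http://example.org/c", "crawl_date": "2012-10-29", "lang": "en",
     "text": "Sandy makes landfall"},
    {"url": "http://example.org/d", "crawl_date": "2012-10-28", "lang": "nl",
     "text": "Sandy nadert de kust"},
]

# A strategy is an ordinary value: it can be stored, shared, and refined.
sandy_nl = strategy(select("lang", "nl"), contains("sandy"), rank_by("crawl_date"))
results = [d["url"] for d in sandy_nl(corpus)]
print(results)
```

Because the strategy is a plain function, refining it means adding or swapping a block, and reusing it means calling it on another collection.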

(Re)search!

Web Archive (New Media scholars)

Thaer Samar, PhD/programmer

Hugo Huurdeman, PhD researcher

Anat Ben-David, Postdoc

Arjen de Vries, Jaap Kamps, Richard Rogers

Paul Doorenbosch, René Voorburg

Victor-Jan Vos

WebART Goals

• Evaluating current curation and selection procedures of Web archives

• Getting insights into current use of Web archives

• Developing new methods and tools for research using Web archives

Flickr: koninklijkebibliotheek

KB: Web archive since 2007

Statistics:

• 4,000+ websites

• 17,000+ harvests

• 7+ Terabyte

Selective approach

”Wayback Machine” interface

• WebARTist (pilot - beta 1)

• Initial dataset (corpus)

• 432 crawls, 16 months (13.64 GB)

Full-text search engine

KB CommonCrawl+nu.nl

(Dutch news aggregator)

WebARTist: Use case

• Digital Methods Winter School (Jan. ’13)

• Co-design workshop (“Living Lab”)

• researchers & developers

• first use of WebARTist

Word frequency analysis

[Figure: word frequency over time; vertical axis from 0 to 800, horizontal axis from 17/05/2011 to 06/01/2013]
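A word frequency analysis of this kind boils down to counting a term per crawl date. The sketch below is a minimal illustration under stated assumptions: the snapshots are invented, and `tokenize` and `term_frequency_by_date` are hypothetical helper names, not part of WebARTist.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

TOKEN = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return [token.lower() for token in TOKEN.findall(text)]


def term_frequency_by_date(
    snapshots: Iterable[Tuple[str, str]], term: str
) -> Dict[str, int]:
    """Count occurrences of a term per crawl date (ISO dates sort correctly)."""
    counts: Dict[str, int] = defaultdict(int)
    term = term.lower()
    for date, text in snapshots:
        counts[date] += Counter(tokenize(text))[term]
    return dict(sorted(counts.items()))


# Invented (crawl date, page text) pairs standing in for archived pages.
snapshots = [
    ("2012-10-28", "Sandy nadert de kust"),
    ("2012-10-30", "Sandy raakt New York. Schade door Sandy loopt op"),
    ("2012-10-30", "Verkiezingen in de VS"),
    ("2012-11-07", "Obama wint"),
]

series = term_frequency_by_date(snapshots, "sandy")
for date, count in series.items():
    print(date, count)
```

Dates without any hit stay in the series with a count of zero, which keeps gaps visible when the series is plotted.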

Co-Word Analysis
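Co-word analysis counts how often two terms appear in the same document. A minimal sketch, assuming document-level co-occurrence and a hand-picked stopword list; the example texts and the function name `coword_counts` are made up for illustration.

```python
import re
from collections import Counter
from itertools import combinations
from typing import FrozenSet, Iterable, Tuple


def coword_counts(
    documents: Iterable[str], stopwords: FrozenSet[str] = frozenset()
) -> "Counter[Tuple[str, str]]":
    """Count, for each pair of terms, the number of documents containing both."""
    pairs: "Counter[Tuple[str, str]]" = Counter()
    for text in documents:
        # Use a sorted set so each pair is counted once per document
        # and always appears in the same (alphabetical) order.
        terms = sorted(
            {t for t in re.findall(r"\w+", text.lower()) if t not in stopwords}
        )
        pairs.update(combinations(terms, 2))
    return pairs


# Invented headlines standing in for archived news pages.
headlines = [
    "Obama and Romney debate",
    "Romney concedes to Obama",
    "Obama wins",
]

cowords = coword_counts(headlines, stopwords=frozenset({"and", "to"}))
for (left, right), count in cowords.most_common(3):
    print(left, right, count)
```

The resulting pair counts can serve as edge weights in a term network, which is the usual way such results are visualized.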

[Figure: lists of web hosts with counts, in three panels labeled Syria, Sandy, and US Election]

Outlink Analysis
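An outlink analysis counts which external hosts a set of pages links to. The sketch below uses only the Python standard library; the page URL and HTML fragment are invented, and `LinkExtractor` and `outlink_hosts` are hypothetical names, not the tooling used in the pilot.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect the href values of all anchor tags in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outlink_hosts(page_url: str, html: str) -> "Counter[str]":
    """Count links per external host; links to the page's own host are skipped."""
    parser = LinkExtractor()
    parser.feed(html)
    own_host = urlsplit(page_url).hostname
    hosts: "Counter[str]" = Counter()
    for href in parser.hrefs:
        # Resolve relative links against the page URL before taking the host.
        host = urlsplit(urljoin(page_url, href)).hostname
        if host and host != own_host:
            hosts[host] += 1
    return hosts


# Invented article page; only the host names are realistic.
page = (
    '<p><a href="http://www.bbc.co.uk/news/1">BBC</a> '
    '<a href="/binnenland">intern</a> '
    '<a href="http://www.bbc.co.uk/news/2">BBC</a> '
    '<a href="http://edition.cnn.com/x">CNN</a></p>'
)

outlinks = outlink_hosts("http://www.nu.nl/buitenland/artikel.html", page)
for host, count in outlinks.most_common():
    print(count, host)
```

Summing such counters over all pages retrieved for a topic gives a host ranking per topic.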

Geomapping location Wire service

http://www.webarchiving.nl/news/webartprojectreportsearchingthenewsarchives

























Temporal Image Analyses

Timeline

http://labs.timelessfuture.com/timeline/


Pilot Tools: Scalable Full Text Search++

User interface!

Search engine!

Inverted Index!

Hadoop Distributed Filesystem!

Some Lessons (pilot)

• Fun, creative (but hard for control freaks)

• unexpected really new ideas!

• It is really co-design -- a dialog:

• researchers keep talking in “solutions”

• unaware of the full potential?

• Search engine used to explore

• Then want to use their own tools

• Emphasis on aggregates, visualizations

Ongoing

• Started designing support for the whole task

• Want folks to stay in the system!

• Connect source data to later “information graphics”

• For the research prototype: no polished graphics

• Volume/Hadoop slow things down

• 1. Port “search by strategy” to Hadoop (slow, asynchronous)

• 2. After (complex) selection on Hadoop, instantiate a dedicated environment (fast, interactive, bounded size)
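The two-step plan can be illustrated without Hadoop. In this sketch a plain Python loop stands in for the slow batch selection, and a small in-memory inverted index stands in for the dedicated interactive environment. All names (`batch_select`, `SubsetIndex`), the size limit, and the records are assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

Record = Dict[str, str]


def batch_select(
    records: Iterable[Record], predicate: Callable[[Record], bool], limit: int
) -> List[Record]:
    """Step 1 stand-in: one slow pass over everything, keeping at most `limit` hits."""
    selected: List[Record] = []
    for record in records:
        if predicate(record):
            selected.append(record)
            if len(selected) >= limit:
                break
    return selected


class SubsetIndex:
    """Step 2 stand-in: in-memory inverted index over the selected records only."""

    def __init__(self, records: Iterable[Record]) -> None:
        self.records = list(records)
        self.postings: Dict[str, Set[int]] = defaultdict(set)
        for position, record in enumerate(self.records):
            for term in re.findall(r"\w+", record["text"].lower()):
                self.postings[term].add(position)

    def search(self, query: str) -> List[Record]:
        """Return the records that contain every query term."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.records[position] for position in sorted(matches)]


# Invented records standing in for the full archive.
archive = [
    {"date": "2012-10-30", "text": "Sandy raakt New York"},
    {"date": "2011-06-01", "text": "Sandy beach holiday"},
    {"date": "2012-11-07", "text": "Obama wint verkiezingen"},
    {"date": "2012-10-29", "text": "Orkaan Sandy nadert New York"},
]

subset = batch_select(archive, lambda r: r["date"].startswith("2012"), limit=1000)
index = SubsetIndex(subset)
hits = index.search("sandy new york")
print([hit["date"] for hit in hits])
```

The size limit is what keeps the second step interactive: the index is rebuilt per selection and never has to hold the full archive.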

Projects with museums, archives, libraries, archaeology

Wrap Up (III)

• How far can we push this to support research in a generic way?

• Working on many sources, processing components, and ways to combine them into search strategies

• Working on richer data (also from research use)

• Working on scale

• Data is still a crucial issue/factor

• Researchers always want what isn’t there

• Data quality/noise/completeness issues

Work on (Re)search?

• (Re)search leads to radically different modes of information access!

• (NB: Recall the panel!)

• Digital humanities is happening right now

• No shortage of data, dedicated users, ...

• Still lots of low-hanging fruit

• Great opportunities for young researchers!

Questions?

• We’re hiring!

• 2 PhDs (4y), 2 Postdocs (6m/1y).

• WebART: http://webarchiving.nl/

• ExPoSe: http://staff.science.uva.nl/~kamps/expose/

• Thank you to all collaborators: Arjen de Vries, Richard Rogers, Hugo Huurdeman, Thaer Samar, Anat Ben-David, Maarten Marx, Wouter Alink, ...
