when search becomes research and research becomes search

When Search becomes Research and Research becomes Search SIGIR’13 Workshop on Exploration, Navigation and Retrieval of Information in Cultural Heritage (ENRICH) August 1, 2013, Dublin, Ireland Jaap Kamps University of Amsterdam

Page 1: When Search becomes Research and Research becomes Search

When Search becomes Research and Research becomes Search

SIGIR’13 Workshop on Exploration, Navigation and Retrieval of Information in Cultural Heritage (ENRICH)

August 1, 2013, Dublin, Ireland

Jaap Kamps, University of Amsterdam

Page 2: When Search becomes Research and Research becomes Search

(Re)search, (Re)searchers

• My current main interest is search related to/supporting research (amongst a few dozen other things)

• So what’s different if your searchers are researchers, and their search is (part of) their research?

• This talk is rather speculative -- no iron-clad formal results -- but I hope to convince you that this is (at least) an interesting use case

• And an area with great opportunities to work in...

Page 3: When Search becomes Research and Research becomes Search

Outline

• DATA: The Web and Online Heritage

• Issues: Archival Silence

• USERS: Digital Heritage -- Digital Humanities

• Challenges: Digital Methods

• TOOLS: Supporting Complex Search Tasks

• (Re)search: Digital Methods <-> Complex Search

Page 4: When Search becomes Research and Research becomes Search

Lots of CH online

Page 5: When Search becomes Research and Research becomes Search

CH is digitized on a massive scale

Page 6: When Search becomes Research and Research becomes Search

Europeana: millions of objects from 1000s of providers

Page 7: When Search becomes Research and Research becomes Search

The UK Web Archive

• Permission-based selective archiving since 2004

• 30% success rate

• 131,164 websites, 54,604 instances, ~14TB WARCs

• Domain crawl from 12 April 2013 to implement non-print legal deposit

• Expected to crawl between 4-5 million UK websites

• Access in reading rooms only

http://www.webarchive.org.uk

Terabytes of Archived Web Data

(From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)

Page 8: When Search becomes Research and Research becomes Search

What’s the problem?

Page 9: When Search becomes Research and Research becomes Search

Not really that much traffic...

Page 10: When Search becomes Research and Research becomes Search

Europeana Web Traffic Report – Q4 2012

Month by Month Overview

Month           Visits    Unique Visitors   Page Views   Time on site/visit (mm:ss)   Bounce rate
October 2012    534,830   441,096           2,017,751    00:02:17                     50.27%
November 2012   612,902   505,177           2,299,244    00:02:16                     49.79%
December 2012   530,747   439,919           2,079,335    00:02:19                     48.80%

2. Portal Search

• 338,574 Visits with Search (36.10% increase from Q3 2012; 52.89% increase from Q4 2011). Visits with Search is the number of visits during which at least one portal search occurred.

• 743,292 Total Unique Searches (37.82% increase from Q3 2012; 31.67% increase from Q4 2011). Total Unique Searches is the number of times a search is performed on Europeana (duplicate searches within a single visit are excluded).

3. Object Views, Social Actions & Click-throughs

• 2,361,589 Object Views (8.10% increase from Q3 2012; 45.55% increase from Q4 2011). The number of times Europeana object pages have been viewed. Repeated views of a single page are counted.

• 778,046 Search Result Views (10.06% increase from Q3 2012; 8.32% decrease from Q4 2011). The number of times Europeana search results pages have been viewed. Repeated views of a single page are counted.

• 2,975 Social Actions (46.19% increase from Q3 2012; 22.27% increase from Q4 2011). The number of times a user has clicked on a social share icon within the portal.

KPI 27: 30,000 object shares in 2012. Jan – Dec 2012: 9,609 shares (from portal).

Let’s say: less traffic than we hoped for...

Page 11: When Search becomes Research and Research becomes Search

How often are web archives used?

• Archiving institutions’ focus on data collection, not usage

• 19 of 29 IIPC members’ archives (listed on website) have full or partial online access, often permission-based

• Large scale national web archives have restricted access – dark archives

• e.g. Danish National Web Archive, over 280TB

• online access for researchers with PhD or higher level

• 20 users since 2005

• “Document-centric” access methods

• No agreed way of calculating / benchmarking access statistics

• Little evidence of scholarly use of web archives, making it difficult to understand requirements

(From: Hockx-Yu, Web Archiving and Scholarly Use of Web Archives, 2013)

Page 12: When Search becomes Research and Research becomes Search

Archival Silence

• Many online collections suffer from low traffic...

• After years of hard work, the data is there

• But the users aren’t queuing up to come and explore the data

• Why is that happening?

Page 13: When Search becomes Research and Research becomes Search

Digital Heritage online are incunabula

Page 14: When Search becomes Research and Research becomes Search

Our infrastructure changed in a revolutionary way

Page 15: When Search becomes Research and Research becomes Search

Our technology changed in a revolutionary way

Page 16: When Search becomes Research and Research becomes Search

How radically did information access methods change?

Page 17: When Search becomes Research and Research becomes Search
Page 18: When Search becomes Research and Research becomes Search

Think outside the box?

• Are we too “framed” by the type of systems that we had before?

• And by those that emerged on the Web?

• (Cf. Diane Kelly’s “Contours and Convergence”, KSJ lecture at ECIR’13.)

Page 19: When Search becomes Research and Research becomes Search

Wrap Up (1)

• We have made wonderful progress: CH data is out there in huge volume

• More, better, richer, ... every day

• Use of the data is often lagging behind

• We should learn from “the Web”

• But also do really different things!

• (This takes time -- at least a generation)

Page 20: When Search becomes Research and Research becomes Search

Right, something really different -- but what?

Page 21: When Search becomes Research and Research becomes Search

CH as Web search?

• Should we really try to “copy” the Web?

• Web search optimizes fast, shallow search

• on highly dynamic data with massive #s of user signals

• Could we be *ahead* of the Web (rather than following them)?

Page 22: When Search becomes Research and Research becomes Search

Let’s do the obvious :)

• Look seriously at the scholarly use of the CH information we have accumulated?

• Get in touch with researchers and find out how they (want to) use the data and why they are *not* using our tools

• (In fact, heritage institutions traditionally focused on scholars, emphasis on the general public is quite recent...)

Page 23: When Search becomes Research and Research becomes Search

Digital Heritage Digital Humanities

e-Humanities

Page 24: When Search becomes Research and Research becomes Search

The Times They are a-Changin’ ?

Page 25: When Search becomes Research and Research becomes Search

Something exciting is happening!

• Digital Humanities emerging fast in response to massive volume of data

• Digitization of historic sources

• Heritage of the future is digital

• User-generated content in new media

• In short: for many research questions a lot of relevant data is available!

Page 26: When Search becomes Research and Research becomes Search

Change in Character

1.0                   2.0
Collection-centered   User-centered
Supply-driven         Demand-driven
Professionals         Amateurs
Individual scholar    Team or lab
Small scale           Large scale
Qualitative           Quantitative

Page 27: When Search becomes Research and Research becomes Search

Change = Radical!

• Change in research paradigm?

• Traditional humanities based on interpretative paradigm

• Empirical sciences based on a truth-finding paradigm

• Did the “success criterion” change?

• Use tools of the exact sciences for the benefit of the traditional paradigm?

Page 28: When Search becomes Research and Research becomes Search

(Actual empirical science is also less rigorous)

Page 29: When Search becomes Research and Research becomes Search

DH requires new data-driven research methods

Page 30: When Search becomes Research and Research becomes Search

"Google and the politics of tabs" by Govcom.org, Amsterdam, 2008.

Website historiography

Page 31: When Search becomes Research and Research becomes Search

Innovation and Evaluation of Information

A CHI98 Workshop
Gene Golovchinsky and Nicholas J. Belkin

Abstract

This report summarizes a workshop held at CHI 98 that focused on several aspects of information exploration, including user interfaces, theory, and evaluation. Information exploration is a common activity that spans a variety of media and is an integral component of many information seeking behaviors that people engage in. The complexity of this activity, and the need to support it appropriately, led us to propose this workshop. Over the course of two days, we examined several aspects of this problem, struggled with a few definitions, and came away with a better understanding of the design space. Here we summarize those efforts.

Introduction

Traditional Information Retrieval is concerned with improving effectiveness of indexing and retrieval mechanisms, and with supporting one information seeking behavior: specified searching through query formulation. This has been predicated on support for one kind of user population, with one kind of information need. But the networked information environment has resulted in a shift in the user population of information retrieval systems. This change has introduced new classes of users, in the sense of levels of expertise, and has also made clear that there are different kinds of information needs and different kinds of information seeking behaviors than those supported by traditional IR systems and techniques. This workshop focused on developing understanding of one such information seeking behavior, Information Exploration, on interface design for supporting this behavior, and on evaluation methods and measures for assessing such interfaces.

Information Exploration addresses the goal of refining a vague concept into a more thorough understanding of the problem which led to the information interaction. We believe that information exploration research falls squarely in the domain of human-computer interaction with some emphasis on information retrieval, rather than vice versa. Thus one of the thrusts of this workshop was to attempt to characterize the activities users engage in, to design for those activities, and to identify evaluation techniques and measures that provide appropriate insights into users' behavior and performance.

Organization

About 20 people participated in the workshop. They were chosen on the basis of initial brief submitted position papers, and represented a broad spectrum of industry and academia. Participants came from France, Canada, Germany, and the U.S. After acceptance, participants were asked to submit longer (4-5 page) position statements that described relevant research and perspectives a few weeks prior to the workshop. These papers were made available through the workshop web site, and participants were encouraged to review and comment on them.

Submissions were organized into three categories: Interface, Evaluation and Theory. Each category was further subdivided into themes that suggested themselves. Thus a number of interface submissions concerned information visualization; three of five evaluation-related submissions focused on expertise, and the theory section split evenly between frameworks and representation of information.

On the morning of the first day, workshop activities were organized based on the three topics we had initially defined. After the morning introductory session, we split the workshop into three new working groups, based on the results of that discussion.

Figure 1. Information exploration (gray box) situated in the broader task. The black "method" box may involve a recursive information exploration step to identify information sources.

Discussion Highlights

It seems obligatory for a workshop to debate the definition of the concept that brought people together; we embraced this orthodoxy with a vengeance. One of the recurring themes of

(SIGCHI Bulletin, Volume 31, Number 1, January 1999, p. 22)

Essentially these are complex search strategies!

Page 32: When Search becomes Research and Research becomes Search

Wrap Up (II)

• Digital Humanities is emerging fast and leads to new data-driven research methods

• Motivated by humanities research questions

• Essentially they are crawling, cleaning, tokenizing, ranking, exploring, visualizing

• Basically the stuff *we* are experts in

• Can we build tools that support their research task from beginning to end?

Page 33: When Search becomes Research and Research becomes Search

(Re)search?

• Interactively construct complex strategy

• data sources, selections, processing, back-and-forth, ...

• Explore all results using facets/aspects

• explore whole data set -- no 10 links

• Store, share, and refine search strategies

• “Session” may take minutes, hours, days, ...

Page 34: When Search becomes Research and Research becomes Search

How to get there?

Page 35: When Search becomes Research and Research becomes Search

(1) Intensive collaborations with CH institutions

Page 36: When Search becomes Research and Research becomes Search

(2) Include researchers: Co-creation, Living Lab, ...

Page 37: When Search becomes Research and Research becomes Search

(3) Build not a tool, but the toolmaker’s tools

Page 38: When Search becomes Research and Research becomes Search

Team up with Arjen de Vries and Spinque :)

Page 39: When Search becomes Research and Research becomes Search

Search strategy from building blocks

Page 40: When Search becomes Research and Research becomes Search

Strategy Builder: each block = data or manipulations

Build dedicated search engine “on the fly”
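
The idea of assembling a strategy from blocks can be pictured with a minimal sketch. This is not Spinque's actual interface; the `Strategy` class, the block functions `contains`, `since` and `rank_by_term_count`, and the toy `corpus` are all hypothetical names invented for illustration. Each block is just a function from documents to documents, and a strategy is a data source plus an ordered list of such blocks.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

Doc = dict  # hypothetical record: {"url": ..., "text": ..., "date": "YYYY-MM-DD"}
Block = Callable[[Iterable[Doc]], Iterable[Doc]]


@dataclass
class Strategy:
    """A search strategy: one data source followed by manipulation blocks."""
    source: Callable[[], Iterable[Doc]]
    blocks: List[Block] = field(default_factory=list)

    def then(self, block: Block) -> "Strategy":
        # Return a new strategy, so a stored strategy can be refined and reused.
        return Strategy(self.source, self.blocks + [block])

    def run(self) -> List[Doc]:
        docs = self.source()
        for block in self.blocks:
            docs = block(docs)
        return list(docs)


def contains(term: str) -> Block:
    # Selection block: keep documents whose text mentions the term.
    return lambda docs: (d for d in docs if term.lower() in d["text"].lower())


def since(date: str) -> Block:
    # Selection block: ISO dates compare correctly as plain strings.
    return lambda docs: (d for d in docs if d["date"] >= date)


def rank_by_term_count(term: str) -> Block:
    # Ranking block: order by raw occurrences of the term (a deliberately crude score).
    return lambda docs: sorted(
        docs, key=lambda d: d["text"].lower().count(term.lower()), reverse=True
    )


# Toy data, made up for this example.
corpus = [
    {"url": "a", "text": "Web archive of the archive", "date": "2012-05-01"},
    {"url": "b", "text": "An archive page", "date": "2011-03-01"},
    {"url": "c", "text": "News about elections", "date": "2012-06-01"},
    {"url": "d", "text": "Archive", "date": "2013-01-01"},
]

base = Strategy(lambda: corpus).then(contains("archive"))
recent = base.then(since("2012-01-01")).then(rank_by_term_count("archive"))
```

Because `then` returns a new object, `base` stays available as a shared starting point while `recent` refines it, which is roughly what storing, refining and sharing strategies would require.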

Page 41: When Search becomes Research and Research becomes Search

Research methods become search strategies

Store, refine, reuse, share strategies

(Re)search!

Page 42: When Search becomes Research and Research becomes Search

Web Archive (New Media scholars)

Page 43: When Search becomes Research and Research becomes Search

Thaer Samar (PhD/programmer)

Hugo Huurdeman (PhD researcher)

Anat Ben-David (Postdoc)

Arjen de Vries, Jaap Kamps, Richard Rogers

Paul Doorenbosch, René Voorburg

Victor-Jan Vos

Page 44: When Search becomes Research and Research becomes Search

WebART Goals

• Evaluating current curation and selection procedures of Web archives

• Getting insights into current use of Web archives

• Developing new methods and tools for research using Web archives

Page 45: When Search becomes Research and Research becomes Search

Flickr: koninklijkebibliotheek

KB: Web archive since 2007

Statistics:

• 4,000+ websites

• 17,000+ harvests

• 7+ Terabyte

Selective approach

Page 46: When Search becomes Research and Research becomes Search


Page 47: When Search becomes Research and Research becomes Search
Page 48: When Search becomes Research and Research becomes Search
Page 49: When Search becomes Research and Research becomes Search

“Wayback Machine” interface

Page 50: When Search becomes Research and Research becomes Search

• WebARTist (pilot - beta 1)

• Initial dataset (corpus)

• 432 crawls, 16 months (13.64 GB)

Full-text search engine

KB CommonCrawl+nu.nl

(Dutch news aggregator)

Page 51: When Search becomes Research and Research becomes Search
Page 52: When Search becomes Research and Research becomes Search
Page 53: When Search becomes Research and Research becomes Search
Page 54: When Search becomes Research and Research becomes Search
Page 55: When Search becomes Research and Research becomes Search
Page 56: When Search becomes Research and Research becomes Search
Page 57: When Search becomes Research and Research becomes Search
Page 58: When Search becomes Research and Research becomes Search
Page 59: When Search becomes Research and Research becomes Search

WebARTist: Use case

• Digital Methods Winter School (Jan. ’13)

• Co-design workshop (“Living Lab”)

• researchers & developers

• first use of WebARTist

Page 60: When Search becomes Research and Research becomes Search

Word frequency analysis

[Chart: word frequency over time; y-axis from 0 to 800, x-axis from 17/05/2011 to 06/01/2013]
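
A word frequency analysis of this kind can be sketched in a few lines, assuming each crawl is available as a date plus extracted text. The function name `term_frequency_by_month`, the input shape, and the sample crawls are assumptions made for this example, not the WebARTist implementation.

```python
import re
from collections import defaultdict


def term_frequency_by_month(crawls, term):
    """Count occurrences of `term` per month.

    crawls: iterable of (iso_date, text) pairs, e.g. ("2012-10-29", "...").
    Returns a dict {"YYYY-MM": count}, ordered by month.
    """
    counts = defaultdict(int)
    needle = term.lower()
    for iso_date, text in crawls:
        tokens = re.findall(r"\w+", text.lower())
        counts[iso_date[:7]] += tokens.count(needle)
    return dict(sorted(counts.items()))


# Toy crawls, made up for this example.
sample_crawls = [
    ("2012-10-29", "Sandy hits New York. Sandy floods"),
    ("2012-10-30", "sandy aftermath"),
    ("2012-11-06", "Election day"),
]
monthly = term_frequency_by_month(sample_crawls, "Sandy")
```

The resulting dictionary is already in the shape a plotting tool expects for a time series, one value per month.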

Page 61: When Search becomes Research and Research becomes Search

Co-Word Analysis
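
As a rough illustration of co-word analysis, the sketch below counts how often pairs of terms from a chosen vocabulary appear in the same document. The function `co_word_counts`, the document-level co-occurrence window, and the sample texts are assumptions for this example; the analysis shown on the slide may have been set up differently.

```python
import re
from collections import Counter
from itertools import combinations


def co_word_counts(documents, vocabulary):
    """Count document-level co-occurrence of vocabulary terms.

    Returns a Counter keyed by alphabetically ordered (term_a, term_b) pairs.
    """
    vocab = {w.lower() for w in vocabulary}
    pairs = Counter()
    for text in documents:
        present = sorted(vocab & set(re.findall(r"\w+", text.lower())))
        pairs.update(combinations(present, 2))
    return pairs


# Toy documents, made up for this example.
sample_docs = [
    "Syria conflict and refugees",
    "Refugees flee Syria",
    "Election results",
]
pairs = co_word_counts(sample_docs, ["syria", "refugees", "election"])
```

The pair counts can be read as weighted edges of a term network, which is the usual input for a co-word visualization.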

Page 62: When Search becomes Research and Research becomes Search

Outlink Analysis

[Figure: number of outlinks per target domain (e.g. edition.cnn.com, bbc.co.uk, reuters.com, nytimes.com, volkskrant.nl, peil.nl), in three panels: Syria, Sandy, US Election]
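
An outlink analysis boils down to extracting the links from each archived page and counting the hosts they point to. The sketch below uses only the Python standard library; the function `outlink_hosts`, the `(page_url, html)` input shape, and the sample pages are assumptions for illustration rather than the tooling used in the workshop.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outlink_hosts(pages):
    """pages: iterable of (page_url, html). Returns a Counter of external target hosts."""
    counts = Counter()
    for page_url, html in pages:
        collector = _LinkCollector()
        collector.feed(html)
        source_host = urlparse(page_url).netloc
        for href in collector.hrefs:
            # Resolve relative links against the page URL before taking the host.
            target_host = urlparse(urljoin(page_url, href)).netloc
            if target_host and target_host != source_host:
                counts[target_host] += 1
    return counts


# Toy pages, made up for this example.
sample_pages = [
    ("http://www.nu.nl/a",
     '<a href="http://edition.cnn.com/x">cnn</a> <a href="/local">local</a> '
     '<a href="http://www.bbc.co.uk/news">bbc</a>'),
    ("http://www.nu.nl/b", '<a href="http://edition.cnn.com/y">cnn</a>'),
]
hosts = outlink_hosts(sample_pages)
```

Running this per topic (for instance over the pages matching a query) would give one host ranking per panel.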

Page 63: When Search becomes Research and Research becomes Search
Page 64: When Search becomes Research and Research becomes Search
Page 65: When Search becomes Research and Research becomes Search
Page 66: When Search becomes Research and Research becomes Search

Geomapping location Wire service

Page 67: When Search becomes Research and Research becomes Search

Temporal Image Analyses

Page 69: When Search becomes Research and Research becomes Search

Pilot Tools: Scalable Full Text Search++

User interface

Search engine

Inverted index

Hadoop Distributed Filesystem
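
To make the middle layers of this stack concrete, here is a minimal in-memory sketch of an inverted index with boolean AND search. It ignores everything that makes the real system hard (scale, distribution over Hadoop, ranking); the function names `build_inverted_index` and `search` and the sample documents are assumptions for this example.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict {doc_id: text}. Returns {token: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (boolean AND)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Toy documents, made up for this example.
sample_index = build_inverted_index({
    1: "Web archive search",
    2: "Archive of the web",
    3: "Search engines",
})
```

For example, `search(sample_index, "web archive")` returns the ids of the two documents that contain both terms.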

Page 70: When Search becomes Research and Research becomes Search

Some Lessons (pilot)

• Fun, creative (but hard for control freaks)

• unexpected really new ideas!

• It is really co-design -- a dialog:

• researchers keep talking in “solutions”

• unaware of the full potential?

• Search engine used to explore

• Then want to use their own tools

• Emphasis on aggregates, visualizations

Page 71: When Search becomes Research and Research becomes Search

Ongoing

• Started designing support for the whole task

• Want folks to stay in the system!

• Connect source data to later “information graphics”

• For the research prototype: no polished graphics

• Volume/Hadoop slow things down

• 1. Port “search by strategy” to Hadoop (slow, asynchronous)

• 2. After (complex) selection on Hadoop, instantiate a dedicated environment (fast, interactive, bounded size)

Page 72: When Search becomes Research and Research becomes Search

Projects with museums, archives, libraries, archaeology

Page 73: When Search becomes Research and Research becomes Search

Wrap Up (III)

• How far can we push this to support research in a generic way?

• Working on many sources, processing components and ways to combine them into search strategies

• Working on richer data (also from research use)

• Working on scale

• Data is still a crucial issue/factor

• Researchers always want what isn’t there

• Data quality/noise/completeness issues

Page 74: When Search becomes Research and Research becomes Search

Work on (Re)search?

• (Re)search leads to radically different modes of information access!

• (NB: Recall the panel!)

• Digital humanities is happening right now

• No shortage of data, dedicated users, ...

• Still lots of low-hanging fruit

• Great opportunities for young researchers!

Page 75: When Search becomes Research and Research becomes Search

Questions?

• We’re hiring!

• 2 PhD (4y), 2 Postdocs (6m/1y).

• WebART: http://webarchiving.nl/

• ExPoSe: http://staff.science.uva.nl/~kamps/expose/

• Thank you to all collaborators: Arjen de Vries, Richard Rogers, Hugo Huurdeman, Thaer Samar, Anat Ben-David, Maarten Marx, Wouter Alink, ...