Preserving Websites Of Research & Development Projects

Daniel Bicho [email protected]

Daniel Gomes [email protected]

Post on 27-Jul-2018

Project arcomem: ARchive COmmunities MEMories

http://www.arcomem.eu/

Information about scientific events

Dissemination and training materials

Demonstrations about the project

Content is not available anymore

http://www.arcomem.eu/

Waste of significant investments

The European Union's FP7 Work Programme invested 59 million euros in R&D projects.

Part of this funding was spent developing R&D project websites.

Loss of Knowledge

R&D project sites publish important scientific outputs (e.g. data sets, tools).

They also work as aggregators of project outputs (e.g. news and events).

How to preserve R&D websites?

Web Archives need to identify project URLs to preserve R&D project websites.

Where to get project URLs?

https://open-data.europa.eu/

European Open Data Portal Datasets

The CORDIS datasets list EU research projects funded under the work programmes FP4, FP5, FP6 and FP7.

They include information such as project URL, acronym, title, dates, funding and objectives.

Web-archived project URLs since FP4

Project URLs of EU-funded research are mostly being preserved outside Europe

Valid project URLs in 2016

CORDIS datasets provide incomplete information

25 000 projects were funded by the FP7 work programme.

Only 8% have an associated project URL.
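
The share of projects with a project URL can be measured directly from a dataset export. The sketch below is only an illustration: the column name `projectUrl`, the comma delimiter and the two sample rows are assumptions, not the actual CORDIS file layout.

```python
import csv
import io

# Hypothetical sample rows; the real CORDIS export may use other column
# names and another delimiter.
SAMPLE = """acronym,title,projectUrl
DIP3,The 3Ps of Distributed Information delivery ...,
ARCOMEM,ARchive COmmunities MEMories,http://www.arcomem.eu/
"""


def url_coverage(csv_text, url_column="projectUrl"):
    """Return (projects_with_url, total_projects) for a CSV export."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    with_url = sum(1 for row in rows if (row.get(url_column) or "").strip())
    return with_url, len(rows)


with_url, total = url_coverage(SAMPLE)
print(f"{with_url} of {total} projects have a URL ({with_url / total:.0%})")
```

Counting empty and whitespace-only cells as missing keeps the measure conservative.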

Missing information regarding project URL

Problem

R&D websites provide important information but quickly disappear.

Datasets about funded projects are incomplete.

How to automatically identify project URLs to be preserved (with limited resources)?

Proposed approach

Search engines already index the Web.

The Open Data Portal provides metadata about R&D projects.

Combine open data sets with search engines to identify R&D project URLs.

Automatic workflow to identify R&D project URLs

ACRONYM   TITLE                                             PROJECT URL
DIP3      The 3Ps of Distributed Information delivery ...   missing

ACRONYM   TITLE                                             PROJECT URL
DIP3      The 3Ps of Distributed Information delivery ...   www.dip-3.eu

Identify project URL through Bing Web Search API
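
The step from a record with a missing URL to a completed record can be sketched as follows. This is a minimal illustration, not the authors' implementation: the field names and `fill_missing_url` are hypothetical, and the search engine is passed in as a plain function so that no real API signature is assumed. `fake_search` is a stand-in that only knows the example from the slide.

```python
from typing import Callable, Dict, List


def fill_missing_url(project: Dict[str, str],
                     search: Callable[[str], List[str]]) -> Dict[str, str]:
    """Return a copy of the record, taking the top search hit as project URL."""
    filled = dict(project)
    if filled.get("project_url"):
        return filled  # URL already present, nothing to do
    query = f'{project["acronym"]} {project["title"]}'
    results = search(query)
    filled["project_url"] = results[0] if results else ""
    return filled


def fake_search(query: str) -> List[str]:
    # Stand-in for a real search client (hypothetical, for illustration only).
    return ["www.dip-3.eu"] if "DIP3" in query else []


record = {
    "acronym": "DIP3",
    "title": "The 3Ps of Distributed Information delivery ...",
    "project_url": "",
}
print(fill_missing_url(record, fake_search)["project_url"])
```

Injecting the search function keeps the workflow testable offline and independent of any particular search provider.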

Web search by project Acronym +Title

Experiment

Evaluate heuristics to maximize the performance of the automatic identification of R&D project URLs.

+Acronym +Title
+Acronym +Title -Cordis
+Acronym +Title -Cordis -EC
+Acronym +Title -Cordis -EC +CommonTerms
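
One possible reading of the four heuristics is sketched below. The slides do not say how each label was implemented, so everything here is an assumption: `-Cordis` and `-EC` are interpreted as discarding results hosted on `cordis.europa.eu` and `ec.europa.eu`, and `+CommonTerms` as appending the term `project` to the query. The names `HEURISTICS`, `build_query` and `first_allowed` are hypothetical, as is the first sample URL.

```python
from urllib.parse import urlparse

# Assumed interpretation of the heuristic labels (not confirmed by the slides).
CORDIS_HOST = "cordis.europa.eu"
EC_HOST = "ec.europa.eu"

HEURISTICS = {
    "+Acronym +Title": {"exclude": (), "extra_terms": ()},
    "+Acronym +Title -Cordis": {"exclude": (CORDIS_HOST,), "extra_terms": ()},
    "+Acronym +Title -Cordis -EC": {
        "exclude": (CORDIS_HOST, EC_HOST),
        "extra_terms": (),
    },
    "+Acronym +Title -Cordis -EC +CommonTerms": {
        "exclude": (CORDIS_HOST, EC_HOST),
        "extra_terms": ("project",),
    },
}


def build_query(acronym, title, extra_terms=()):
    """Join acronym, title and any extra terms into one query string."""
    return " ".join([acronym, title, *extra_terms])


def first_allowed(urls, exclude=()):
    """Return the first URL whose host is not an excluded host or a subdomain of one.

    URLs are expected to carry a scheme; without one, urlparse yields an
    empty host and the URL is never excluded.
    """
    for url in urls:
        host = urlparse(url).netloc.lower()
        if not any(host == h or host.endswith("." + h) for h in exclude):
            return url
    return None


ranked = ["http://cordis.europa.eu/project/example", "http://www.arcomem.eu/"]
rule = HEURISTICS["+Acronym +Title -Cordis"]
print(build_query("ARCOMEM", "ARchive COmmunities MEMories", rule["extra_terms"]))
print(first_allowed(ranked, rule["exclude"]))
```

Keeping the heuristics in a table makes it easy to run the same project list through each variant and compare the results.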

Evaluating the heuristics

A test collection was built based on the FP7 dataset by manually validating each R&D project URL.

The project URLs returned by each heuristic were compared to the test collection to measure if they matched.
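
Comparing returned URLs with the validated ones requires some notion of when two URLs match. The sketch below assumes a simple rule that may differ from the one used in the study: URLs match when host and path agree after ignoring the scheme, a leading `www.`, letter case and a trailing slash. The function names and the validated sample URL are hypothetical.

```python
from urllib.parse import urlparse


def normalize(url):
    """Reduce a URL to a comparable 'host + path' key."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url  # allow scheme-less entries such as www.dip-3.eu
    # Lowercasing the whole URL is a simplification; paths can be case-sensitive.
    parsed = urlparse(url.lower())
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def count_matches(returned, validated):
    """Count projects whose returned URL matches the validated URL."""
    return sum(
        1
        for acronym, url in returned.items()
        if acronym in validated and normalize(url) == normalize(validated[acronym])
    )


validated = {"DIP3": "http://www.dip-3.eu/"}
returned = {"DIP3": "www.dip-3.eu"}
print(count_matches(returned, validated))
```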

Heuristics Performance (F-measure)

F-measure is a combination of recall and precision.

Heuristic                                  Top 1 results   Top 10 results
+Acronym +Title                            44%             12%
+Acronym +Title -Cordis                    45%             11%
+Acronym +Title -Cordis -EC                47%             11%
+Acronym +Title -Cordis -EC +project       48%             11%
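
For reference, the F-measure can be computed as below, assuming the balanced form (F1, the harmonic mean of precision and recall). The slides do not state which weighting was used, and the input values in the example are arbitrary, not taken from the study.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was found
    return 2 * precision * recall / (precision + recall)


# Arbitrary illustrative values, not results from the study.
print(round(f_measure(0.5, 0.25), 3))
```

Because it is a harmonic mean, the score stays low whenever either precision or recall is low.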

Preserving R&D project websites

Applied the best-performing heuristic to the 23 588 projects that were missing a project URL.

Identified 20 429 new project URLs.
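
As a quick check of the two figures on this slide, the share of previously missing projects that received a candidate URL can be computed directly. The variable names are illustrative; the numbers are the ones stated above.

```python
missing_before = 23_588   # projects without a project URL
newly_identified = 20_429  # project URLs returned by the heuristic

share = newly_identified / missing_before
print(f"{share:.1%} of the projects missing a URL received a candidate URL")
```

This measures how many records were filled, not how many of the filled URLs are correct.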

Before: 23 588 missing URLs

After: 20 429 new URLs

Crawled the identified R&D project URLs to be preserved.

Number of project URL seeds: 20 429

Stored content (compressed): 1.4 TB

[Diagram labels: Incomplete Dataset, Identify R&D websites, Completed Dataset, Preserve R&D websites]

Conclusions

R&D websites are important but are quickly disappearing.

European datasets about funded R&D projects are incomplete: only 8% of projects have an associated project URL.

54% of the EU-funded R&D project URLs are being web-archived, mostly outside the European Union.

Conclusions

Automatic heuristics to identify R&D project URLs.

All outputs of this study are available in open access:

○ Test collection
○ Completed CORDIS data sets with new project URLs
○ Extensive technical report

https://github.com/arquivo/Research-Websites-Preservation

Thank You
