Webarchivering in het Audiovisuele Domein / Web archiving in the audiovisual domain

Julia Vytopil, Nederlands Instituut voor Beeld en Geluid (Netherlands Institute for Sound and Vision)



Our history of web archiving

2008-2010

2011-2012


Purposes of web archiving

What Web archiving is not

Web archiving as a context collection


Current project: selection of sites: broadcaster

Current project: selection of sites

Issues and challenges


Current status

Front end & back end


Web archiving in the audiovisual field

Studiedag webarchivering in Nederland (Study day on web archiving in the Netherlands), Hilversum, October 30, 2014

Chloé

http://archivethe.net

Web archiving

What? & Why?

What is a Web archive?

• A copy of a website

• Recorded by a crawler

• At a specific date and time

• Looks and feels like a real website
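As a rough illustration of the definition above, a single capture can be thought of as a URL paired with the moment the crawler fetched it. The `Capture` class, its field names and the example URL are assumptions made for this sketch, not the format used by any particular archiving tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Capture:
    """One archived copy of a page (hypothetical structure, for illustration only)."""
    url: str
    captured_at: datetime  # when the crawler fetched the page
    body: bytes            # the content as it was served at that moment

    def key(self) -> str:
        # Identify a capture by its capture time followed by the URL.
        return f"{self.captured_at:%Y%m%d%H%M%S}/{self.url}"


capture = Capture(
    url="http://example.org/",
    captured_at=datetime(2014, 10, 30, 12, 0, 0, tzinfo=timezone.utc),
    body=b"<html><body>Hello</body></html>",
)
print(capture.key())
```

Two captures of the same URL taken at different times get different keys, which is what separates an archive from a single saved copy.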

For whom?

Any institution whose aim is to collect & preserve web/media material for historical, cultural, heritage or legal (compliance) purposes

Why?

Web content is:

• Pervasive

• Dynamic

• Valuable

• Available in a variety of formats

• Ephemeral

How?

• Collection policy

• Management tools

• Quality control

• Access

Web Archiving Team

• Put in place a cross-disciplinary team

‣ Curator / Librarian / Archivist

‣ Information system technician

• Train a team

‣ Web archivist / Project Manager

‣ Engineer(s) to design & monitor the whole process (for an in-house solution)

• Web archiving requires critical skills and experience, especially from engineers in the case of an in-house solution

Collection policy

Extensive collection vs. intensive collection

How to improve selection policy

IMR value propositions:

• [Topic crawls] Percolable, a tool to discover relevant sources

• [Crawl of active sources] Automated refresh rate

• [Large Crawls] Smart discovery crawl based on topic or language


Archivethe.net

User Interface

Challenges: Video

B&G Screenshot

OurTube / Our Tweet screenshot

Challenges: Social Media

Quality Assurance

Access

Access & Search

• Browsing in the archive

• URL

• Full-text search with Elasticsearch
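To make the URL access path above more concrete, here is a minimal sketch of looking up the capture of a URL closest to a requested date. The `UrlIndex` class and its methods are hypothetical names for this illustration; they do not describe how Archivethe.net is implemented, and full-text search is left out.

```python
from bisect import bisect_left
from datetime import datetime


class UrlIndex:
    """Hypothetical in-memory index mapping a URL to its sorted capture times."""

    def __init__(self):
        self._captures = {}

    def add(self, url, captured_at):
        # Keep each URL's capture times in ascending order.
        times = self._captures.setdefault(url, [])
        times.insert(bisect_left(times, captured_at), captured_at)

    def closest(self, url, wanted):
        # Return the capture time nearest to `wanted`, or None if the URL is unknown.
        times = self._captures.get(url)
        if not times:
            return None
        i = bisect_left(times, wanted)
        candidates = times[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda t: abs(t - wanted))


index = UrlIndex()
index.add("http://example.org/", datetime(2014, 10, 30))
index.add("http://example.org/", datetime(2014, 1, 1))
nearest = index.closest("http://example.org/", datetime(2014, 9, 1))
```

Only the two neighbours around the insertion point are compared, so a lookup stays cheap even when a URL has many captures.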

+

• Branding (search, web archive)

• Automatic redirection

• Automated categorization

• Semantic expansion

Extract valuable information from your large corpus for users / researchers

• Cleaned text

• Keywords to build a keyword cloud

• Outlinks to analyze link graphs

• Structure for unstructured data (forums, ...)

• Named entities

• More coming soon...
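The first three items in the list above can be sketched with the Python standard library alone. This is a simplified illustration, not IMR's pipeline: the `PageExtractor` class, the `extract` function and the crude "four letters or more" keyword rule are all assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class PageExtractor(HTMLParser):
    """Collects visible text and outlinks from one archived HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.outlinks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.outlinks.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    # Naive keyword rule (assumption): the most frequent words of 4+ letters.
    words = re.findall(r"[a-z]{4,}", text.lower())
    keywords = [word for word, _ in Counter(words).most_common(3)]
    return text, parser.outlinks, keywords


page = ('<html><head><style>p{}</style></head><body>'
        '<p>Archive the web archive</p>'
        '<a href="http://example.org/">link</a></body></html>')
text, outlinks, keywords = extract(page)
```

The outlinks collected per page are the raw material for a link graph, and the keyword counts for a keyword cloud; named entities would need a dedicated language-processing step that is out of scope here.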

About IMR

Internet Memory Research

✓ Spin-off of the Internet Memory Foundation, French start-up, founded in 2011

✓ 20+ engineers actively engaged in the Web archiving and information mining field

✓ EU projects: DOPA, Annomarket, TrendMiner, Rethink Big, ASAP

✓ Large-scale, high-performance crawler

✓ Scalable platform based on a distributed architecture and Big Data components (Hadoop, HBase, HDFS, …)

✓ Innovative infrastructure with low consumption


Any questions?

http://archivethe.net

Twitter: ArchiveTheNet