SCAPE Webinar: Tools for uncovering preservation risks in large repositories
DESCRIPTION
This presentation originates from a webinar given by Luís Faria. The webinar presents Scout and C3PO, two tools developed by the SCAPE project, and demonstrates how to identify preservation risks in your content while sharing your content profile information with others to open new opportunities. Scout, the preservation watch system, centralizes the necessary knowledge on a single platform and cross-references it to uncover preservation risks. Scout automatically fetches information from several sources to populate its knowledge base; for example, it integrates with C3PO to obtain large-scale characterization profiles of content. Scout also aims to be a knowledge exchange platform that lets the community bring the necessary information together in one system. Sharing information opens new opportunities for joining forces against common problems. The webinar was held on 26 June 2014.
TRANSCRIPT
![Page 1: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/1.jpg)
Luis Faria KEEP SOLUTIONS www.keep-solutions.com
SCAPE webinar July 26, 2014
Tools for uncovering preservation risks in your large repositories
![Page 2: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/2.jpg)
Why do we need monitoring?
Repository
• Format obsolescence
• Emerging technology
• Consumer trends
• New standards
• Organisation mission
• Bit rot
• Resource capability
• System availability
• Security breach
• Economical limitations
• Social and political factors
• Producer trends
• Organisation policies
![Page 3: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/3.jpg)
Why do we need monitoring?
Repository (same surrounding factors as on the previous slide)
Risks
Opportunities
![Page 4: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/4.jpg)
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
Preservation monitoring survey
181 valid participants
What descriptions fit your organization?
• Other: 5.41%
• Data intensive industry: 0.77%
• Non affiliated: 1.54%
• Big data science: 1.93%
• Digital preservation vendor: 2.32%
• Research funder: 2.70%
• Large enterprise: 2.70%
• Publisher or content producer: 5.02%
• Small or medium enterprise: 7.34%
• Local government institution: 9.27%
• National government institution: 15.83%
• Memory institution or content holder: 26.64%
• University: 28.57%
![Page 5: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/5.jpg)
Preservation monitoring survey
Values per risk: importance (normalized mean) / monitoring / not monitoring / uncertain or no answer
• File corruption: 92% / 41% / 18% / 41%
• Backup failure: 89% / 51% / 9% / 40%
• Staff not enough or adequate: 78% / 41% / 18% / 41%
• Software platform obsolescence: 77% / 40% / 13% / 46%
• Hardware platform obsolescence: 76% / 44% / 12% / 44%
• Lack of context information: 76% / 23% / 24% / 53%
• Incorrect action results: 75% / 27% / 22% / 51%
• Consumers misalignment: 74% / 17% / 25% / 58%
• Outdated preservation plans: 69% / 28% / 25% / 47%
• Producers misalignment: 68% / 25% / 19% / 55%
• Content not aligned with policies: 64% / 30% / 23% / 46%
![Page 6: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/6.jpg)
Tools for uncovering preservation risks
Content → FITS → C3PO → Scout
• FITS: FITS output (XML)
• C3PO: File characteristics distribution (graphs and drill-down analysis)
• Scout: File and world properties throughout time, and notifications
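The step from per-file FITS reports to a characteristics distribution can be sketched in a few lines of Python. This is an illustration only, not how C3PO is implemented: the two sample reports are hand-written stand-ins, and the namespace URI and the identification/identity element layout are assumptions about the FITS output schema that should be checked against a report produced by your FITS version.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Iterable, List

# Assumed FITS output namespace; verify against a real report.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

PNG_REPORT = (
    '<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">'
    "<identification>"
    '<identity format="Portable Network Graphics" mimetype="image/png"/>'
    "</identification>"
    "</fits>"
)
PDF_REPORT = (
    '<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">'
    "<identification>"
    '<identity format="Portable Document Format" mimetype="application/pdf"/>'
    "</identification>"
    "</fits>"
)
# Hand-written stand-ins for real FITS reports (illustrative only).
SAMPLE_REPORTS = [PNG_REPORT, PDF_REPORT, PNG_REPORT]


def identified_formats(fits_xml: str) -> List[str]:
    """Return the format name of every identity element in one report."""
    root = ET.fromstring(fits_xml)
    identities = root.findall("fits:identification/fits:identity", FITS_NS)
    return [identity.get("format", "unknown") for identity in identities]


def format_distribution(reports: Iterable[str]) -> Counter:
    """Count how often each format name occurs across all reports."""
    counts: Counter = Counter()
    for report in reports:
        counts.update(identified_formats(report))
    return counts


if __name__ == "__main__":
    for fmt, n in format_distribution(SAMPLE_REPORTS).most_common():
        print(f"{fmt}: {n}")
```

A real profile would aggregate many more properties than the format name; the format count is just the simplest case of the distributions mentioned on the slide.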
![Page 7: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/7.jpg)
FITS - File Information Tool Set
• http://fitstool.org
• Characterization
  • Identification
  • Feature extraction
  • Validation
• Support for:
  • DROID
  • JHove
  • Apache Tika
  • ADL Tool
  • Exiftool
  • FFIdent
  • File Utility (Windows port)
  • NLNZ Metadata Extractor
  • OIS Audio, File and XML Information
• https://github.com/keeps/fits/tree/keeps
• Developed by KEEPS
• Added support for:
  • FIDO
  • Microsoft Office
  • Adobe Illustrator
  • Corel Draw
  • Email (EML)
  • Autocad (DWG)
  • Shapefile
  • RTF, TXT
  • Databases (DBML)
![Page 8: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/8.jpg)
FITS - File Information Tool Set
• Demonstration
• Download from http://fitstool.org
• Execute for a file
./fits.sh -i file.png
• Execute for a directory
./fits.sh -r -i source_directory/ -o output_directory/
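When FITS has to be called from a script rather than by hand, the two invocations above can be wrapped as follows. This is a minimal sketch that uses only the `-i`, `-o` and `-r` flags shown on the slide; the helper names `fits_command` and `run_fits` are hypothetical, and the default script path assumes the working directory is the FITS installation directory.

```python
import subprocess
from typing import List, Optional


def fits_command(source: str, output: Optional[str] = None,
                 recursive: bool = False,
                 fits_script: str = "./fits.sh") -> List[str]:
    """Build the argument list for one FITS run."""
    cmd = [fits_script]
    if recursive:
        cmd.append("-r")          # recurse into sub-directories
    cmd += ["-i", source]         # input file or directory
    if output is not None:
        cmd += ["-o", output]     # output file or directory
    return cmd


def run_fits(source: str, output: Optional[str] = None,
             recursive: bool = False,
             fits_script: str = "./fits.sh") -> subprocess.CompletedProcess:
    """Run FITS; raises CalledProcessError on a non-zero exit status."""
    cmd = fits_command(source, output, recursive, fits_script)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only print the commands here, so the sketch runs without FITS installed.
    print(" ".join(fits_command("file.png")))
    print(" ".join(fits_command("source_directory/", "output_directory/",
                                recursive=True)))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.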
![Page 9: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/9.jpg)
FITS performance
• https://github.com/keeps/fits-testing
• 3 to 6 seconds per file
• 12 TB - a year
  • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits
• Other options for scalability:
  • Fido
  • Apache Tika
  • Nanite
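To see what 3 to 6 seconds per file means for a given collection, a back-of-the-envelope estimate is enough. The sketch below is an assumption-laden illustration: the collection size of one million files is hypothetical, and the `workers` parameter assumes perfectly parallel runs, which real hardware will not deliver.

```python
SECONDS_PER_DAY = 86_400


def fits_runtime_days(n_files: int, seconds_per_file: float,
                      workers: int = 1) -> float:
    """Estimated wall-clock days to characterise n_files with FITS."""
    total_seconds = n_files * seconds_per_file / workers
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    n_files = 1_000_000  # hypothetical collection size
    for seconds_per_file in (3, 6):
        days = fits_runtime_days(n_files, seconds_per_file)
        print(f"{seconds_per_file} s/file: {days:.1f} days")
```

For one million files this gives roughly 35 to 69 days on a single worker, which is why the slide lists lighter-weight alternatives for scalability.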
![Page 10: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/10.jpg)
C3PO - Clever, Crafty Content Profile of Objects
• http://ifs.tuwien.ac.at/imp/c3po
• Web application
• Content characteristics aggregation
• Drill-down analysis
![Page 11: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/11.jpg)
C3PO install
• Download binaries at:
  • http://dl.bintray.com/peshkira/c3po/
• Install MongoDB:
  • http://www.mongodb.org/
• Install Apache Tomcat:
  • http://tomcat.apache.org/
• Put the C3PO web app in Apache Tomcat
  • Remove the ROOT dir from webapps and rename the C3PO web app to ROOT.war
• Start Apache Tomcat and connect to:
  • http://localhost:8080/
• Usage guide:
  • https://github.com/peshkira/c3po/wiki/Usage-Guide
![Page 12: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/12.jpg)
C3PO performance
Dataset: Statsbiblioteket (Denmark)
• Size: 440M files (12 TB)
• Process time: 388h (16 days) / 50h for XML report
• Average time: 2.5s per 1000 files
• The web application has a limit of 2.5 million FITS files
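The figures above can be related to each other with a little arithmetic. The sketch below uses only the numbers from the slide; the idea of splitting the dataset into several profiles so that each stays within the web application limit is an assumption for illustration, not something the slides prescribe.

```python
import math

FILES = 440_000_000        # dataset size from the slide
PROCESS_HOURS = 388        # total processing time from the slide
WEB_APP_LIMIT = 2_500_000  # FITS files the web application can handle


def seconds_per_thousand(files: int, hours: float) -> float:
    """Average processing time per 1000 files implied by a total run time."""
    return hours * 3600 / files * 1000


def batches_needed(files: int, limit: int) -> int:
    """Number of profiles needed if each one may hold at most `limit` files."""
    return math.ceil(files / limit)


if __name__ == "__main__":
    print(f"{PROCESS_HOURS / 24:.1f} days in total")
    print(f"{seconds_per_thousand(FILES, PROCESS_HOURS):.2f} s per 1000 files")
    print(f"{batches_needed(FILES, WEB_APP_LIMIT)} profiles of at most "
          f"{WEB_APP_LIMIT} files")
```

The implied average over the full 388 hours is a little above the 2.5 s per 1000 files quoted on the slide, so the two numbers probably measure different parts of the run; the source does not say which.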
![Page 13: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/13.jpg)
Scout: a preservation watch system
Monitors aspects of the world to detect preservation risks and opportunities
Content, Policies, Web, Human knowledge, Registries → Scout → Risk notification
![Page 14: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/14.jpg)
Information Sources
• Format registries & software catalogues
• Digital repositories & web archives
• Organizational objectives
• Experiments
• Simulation
• Human knowledge
![Page 15: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/15.jpg)
Current information sources
• Repository content and events
• SCAPE Policy model
• PRONOM
• Web semantic extraction
• Web page renderability experiments
![Page 16: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/16.jpg)
Define triggers
• Notify me when there are tools that can render the format X.
![Page 17: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/17.jpg)
Define triggers
Simple query with templates
![Page 18: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/18.jpg)
Receive notifications
HTTP Push API
There are tools that can render format X.
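The trigger and notification idea can be illustrated with a toy model. Everything in this sketch is hypothetical: the `Fact` type, the "renders" relation, the tool names and the template are invented for illustration and are not Scout's data model, query language or API. The `notify` callback stands in for whatever delivery channel is configured, such as an HTTP push.

```python
from dataclasses import dataclass
from string import Template
from typing import Callable, List

# Hypothetical trigger template; "$fmt" is the parameter the user fills in.
RENDER_TEMPLATE = Template("There are tools that can render format $fmt.")


@dataclass(frozen=True)
class Fact:
    """One statement in a toy knowledge base: subject, relation, object."""
    subject: str
    relation: str
    obj: str


def tools_rendering(facts: List[Fact], fmt: str) -> List[str]:
    """Names of all tools recorded as able to render the given format."""
    return sorted(f.subject for f in facts
                  if f.relation == "renders" and f.obj == fmt)


def check_trigger(facts: List[Fact], fmt: str,
                  notify: Callable[[str], None]) -> bool:
    """Evaluate the trigger once; notify and return True if it fires."""
    tools = tools_rendering(facts, fmt)
    if not tools:
        return False
    notify(RENDER_TEMPLATE.substitute(fmt=fmt) + " Tools: " + ", ".join(tools))
    return True


if __name__ == "__main__":
    knowledge_base = [Fact("ToolA", "renders", "X"),
                      Fact("ToolB", "renders", "Y")]
    check_trigger(knowledge_base, "X", print)  # fires and prints a message
    check_trigger(knowledge_base, "Z", print)  # no matching fact, stays silent
```

A watch system would re-run such checks whenever the knowledge base changes, so the notification arrives when the condition first becomes true.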
![Page 19: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/19.jpg)
Interfaces
Web page
REST API
![Page 20: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/20.jpg)
How to be a part of Scout
• Check out
  • Site: http://openplanets.github.io/scout/
  • Report: http://www.scape-project.eu/deliverable/d12-2-final-version-of-the-preservation-watch-component
  • Demo: http://scout.scape.keep.pt
• Integrate your content
• Contribute with information (soon)
  • Use Scout form for manual input of knowledge
![Page 21: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/21.jpg)
Roadmap
• User support
• More trigger templates
• More adaptors
  • KrakeN / Propminer
  • Software catalogues
  • Other format registries
  • Other experiments information sources
  • Manual input (human knowledge)
  • Simulation
![Page 22: SCAPE Webinar: Tools for uncovering preservation risks in large repositories](https://reader034.vdocuments.mx/reader034/viewer/2022051817/54833cdcb4af9f690d8b49ab/html5/thumbnails/22.jpg)
Luis Faria KEEP SOLUTIONS www.keep-solutions.com
SCAPE webinar July 26, 2014
Tools for uncovering preservation risks in large repositories