Public services for dark archives
![Page 1: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/1.jpg)
Webarchiv.cz: Postscript to the talk on running the memorial of the Czech web.
![Page 2: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/2.jpg)
2266 domains
![Page 3: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/3.jpg)
Docker?
![Page 4: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/4.jpg)
Monitrix
https://github.com/ukwa/monitrix
Prototype 1
Monitoring / front end for Heritrix 3
Analytics of a running harvest / probably aggregates data from only one machine
Prototype 2
ELK: Elasticsearch / Logstash / Kibana
25 million log lines / 26 GB on disk / 4 vCPU / 20 GB RAM; the open question is how to scale this to whole-domain harvests
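The figures quoted for the second prototype allow a rough back-of-the-envelope estimate of disk usage for a larger harvest. The sketch below is only an illustration: it assumes decimal gigabytes and strictly linear growth, which a real Elasticsearch index may not follow, and the target line count is a hypothetical number, not one from the talk.

```python
# Figures quoted on the slide for the ELK prototype.
LOG_LINES = 25_000_000
DISK_BYTES = 26 * 10**9  # assumption: "26 GB" means decimal gigabytes


def bytes_per_line(lines: int = LOG_LINES, disk_bytes: int = DISK_BYTES) -> float:
    """Average on-disk footprint of one indexed log line."""
    return disk_bytes / lines


def estimated_disk_gb(target_lines: int) -> float:
    """Linear extrapolation of disk usage (in GB) for a given number of log lines."""
    return target_lines * bytes_per_line() / 10**9


if __name__ == "__main__":
    print(bytes_per_line())  # average bytes per log line
    hypothetical_lines = 1_000_000_000  # hypothetical size of a whole-domain harvest log
    print(estimated_disk_gb(hypothetical_lines))
```

The estimate covers disk space only; it says nothing about the CPU and RAM needed to keep such an index queryable.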
![Page 5: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/5.jpg)
QA
a process for analysing the report of unharvested websites and re-harvesting them
a process for analysing discovered but unharvested URLs
a process for checking harvests of special sites such as YouTube, Facebook and Twitter
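The analysis of discovered but unharvested URLs comes down to a set difference. A minimal sketch, assuming both lists have already been extracted from the crawler's reports as plain URL strings; the function name and the sample URLs are hypothetical, and nothing here reflects how Webarchiv.cz actually implements the step.

```python
from typing import Iterable, List


def unharvested(discovered: Iterable[str], harvested: Iterable[str]) -> List[str]:
    """Return the discovered URLs that are missing from the harvested set, sorted."""
    done = set(harvested)
    return sorted(set(discovered) - done)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    discovered = [
        "http://example.cz/",
        "http://example.cz/a",
        "http://example.cz/b",
    ]
    harvested = ["http://example.cz/", "http://example.cz/b"]
    for url in unharvested(discovered, harvested):
        print(url)
```

The resulting list could then seed a follow-up harvest, which matches the re-harvesting step named in the first process.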
![Page 6: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/6.jpg)
Webarchiv.cz: Where to head next?
![Page 7: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/7.jpg)
Services
![Page 8: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/8.jpg)
CDX SERVER API
![Page 9: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/9.jpg)
CDX SERVER API
http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=2&filter=!statuscode:200 will return 2 capture results with non-200 status codes.
http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=10&filter=!statuscode:200&filter=!mimetype:text/html&filter=digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV will return 10 capture results with non-200 status codes and mime types that are not text/html but which match a specific content digest
https://github.com/iipc/openwayback/tree/master/wayback-cdx-server-webapp
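The two example queries can also be assembled programmatically. The sketch below uses only the Python standard library; `build_cdx_query` and `rows_to_dicts` are hypothetical helper names, and `rows_to_dicts` assumes that the `output=json` response is a list of rows whose first row holds the field names.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, filters=(), limit=None, output="json", endpoint=CDX_ENDPOINT):
    """Build a CDX server query URL; each filter becomes its own filter= parameter."""
    params = [("url", url), ("output", output)]
    if limit is not None:
        params.append(("limit", str(limit)))
    params.extend(("filter", f) for f in filters)
    # Keep '!', ':' and '/' readable so the result matches the hand-written queries.
    return endpoint + "?" + urlencode(params, safe="!:/")


def rows_to_dicts(rows):
    """Turn a header row plus data rows into a list of dicts (assumed response layout)."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


if __name__ == "__main__":
    print(build_cdx_query("archive.org", ["!statuscode:200"], limit=2))
    print(build_cdx_query(
        "archive.org",
        ["!statuscode:200", "!mimetype:text/html"],
        limit=10,
    ))
```

Fetching the URL and decoding the JSON body is left out on purpose, so the sketch runs without network access.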
![Page 10: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/10.jpg)
WAT
>>> data['Envelope']['WARC-Header-Metadata']['WARC-Type']
"response"
>>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['Headers']['Server']
"Apache"
>>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Head']['Title']
"BBC NEWS | Africa | Namibia braces for Nujoma exit"
>>> len(data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'])
42
>>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'][28]
{"path": "A@/href", "title": "Home of BBC Sport on the internet", "url": "http://news.bbc.co.uk/sport1/hi/default.stm"}
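Direct indexing as in the session above raises `KeyError` as soon as a record lacks one of the nested keys, which is likely for non-HTML responses. A minimal sketch of a more tolerant lookup, assuming the WAT record has already been parsed into a Python dict; `dig` and `outlinks` are hypothetical helpers, and the sample record is a hand-made stand-in that only mirrors the keys shown on the slide.

```python
from typing import Any, Dict, List


def dig(record: Any, *keys: str, default: Any = None) -> Any:
    """Follow a chain of dict keys, returning default if any step is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def outlinks(record: Dict[str, Any]) -> List[str]:
    """Collect the 'url' of every entry under HTML-Metadata/Links, if present."""
    links = dig(
        record,
        "Envelope", "Payload-Metadata", "HTTP-Response-Metadata",
        "HTML-Metadata", "Links",
        default=[],
    )
    return [link["url"] for link in links if "url" in link]


if __name__ == "__main__":
    # Hand-made sample, not a real WAT record.
    sample = {
        "Envelope": {
            "WARC-Header-Metadata": {"WARC-Type": "response"},
            "Payload-Metadata": {
                "HTTP-Response-Metadata": {
                    "Headers": {"Server": "Apache"},
                    "HTML-Metadata": {
                        "Links": [
                            {"path": "A@/href",
                             "url": "http://news.bbc.co.uk/sport1/hi/default.stm"},
                        ],
                    },
                },
            },
        },
    }
    print(dig(sample, "Envelope", "WARC-Header-Metadata", "WARC-Type"))
    print(outlinks(sample))
```

Pairs of page URL and outlink URL gathered this way are the raw material for a link graph.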
![Page 11: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/11.jpg)
WAT
Usage: https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification
WAT specification: https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview
Workshop on building a graph from WAT files: https://home.archive.org/~vinay/archive-web-graphs-workshop/
![Page 12: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/12.jpg)
Common Crawl
Amazon infrastructure can be used to run analytics over Common Crawl data
monthly growth of more than roughly 100 TB
Common Crawl: https://commoncrawl.org/the-data/get-started/
Examples of Common Crawl data use: http://commoncrawl.org/the-data/examples/
CDX Server API with a GUI for browsing CDX files: http://index.commoncrawl.org
![Page 13: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/13.jpg)
Full-text search
![Page 14: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/14.jpg)
Portuguese full-text search prototype
http://www.arquivo.pt/resawdev
The login is: resaw/resaw.eu
https://sobre.arquivo.pt/news/a-first-attempt-to-archive-the-.eu-domain?set_language=en
https://netpreserveblog.wordpress.com/2015/06/03/a-first-attempt-to-archive-the-eu-domain/
Thesis http://sobre.arquivo.pt/sobre/publicacoes-1/Documentos-acerca-do-Arquivo.pt/information-search-in-web-archives
Slides from IIPC GA 2015 http://www.netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_11_Gomes.pptx
A colleague's notes: https://www.evernote.com/shard/s43/sh/e6e12603-ecb2-42ae-8532-67d2779b4a86/3b2162e0bcc710d847b6fa5e86cc70b2
![Page 15: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/15.jpg)
UK Web Archive full-text search prototype: Shine
Prototype: https://www.webarchive.org.uk/shine/search/advanced
Wiki: https://github.com/ukwa/shine/wiki/Specification
Code: https://github.com/ukwa/shine
Presentation by Helen Hockx-Yu: http://www.netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_08_Hockx.ppt
Video: https://www.youtube.com/watch?v=o4iIdZP4rg8
![Page 16: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/16.jpg)
Further examples
![Page 17: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/17.jpg)
Website Classification Dataset
http://data.webarchive.org.uk/opendata/ukwa.ds.1/classification/
![Page 18: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/18.jpg)
HTTP Archive
In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.
http://httparchive.org/trends.php?s=All&minlabel=Nov+15+2010&maxlabel=Sep+15+2015
http://httparchive.org/interesting.php
![Page 19: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/19.jpg)
Talks from Stanford on current thinking about web archives
![Page 20: Veřejné služby pro Dark archives](https://reader034.vdocuments.mx/reader034/viewer/2022042906/58a1f60a1a28abac528b4ee1/html5/thumbnails/20.jpg)
IIPC GA 2015
https://www.youtube.com/channel/UCkUsw2Lo1ahekgy_xEb11BA/videos