CRATE: A Simple Model for Self-Describing Web Resources
International Web Archiving Workshop 2007
Joan A. Smith & Michael L. Nelson
Old Dominion University
Department of Computer Science
Norfolk, VA 23529
{jsmit, mln}@cs.odu.edu
Slide # 2 IWAW ‘07 {jsmit,mln}@cs.odu.edu
WWW and Digital Libraries: Vastly Different Worlds
World Wide Web
– A disorganized free-for-all
– Near-zero metadata
– Unpredictable additions, deletions, modifications
– No preservation policy
Crawlapalooza
Digital Library
– Organized
– Groomed content
– Lots of metadata
– Structured changes
– Active preservation policies
Harvester Home Companion
Slide # 3
Web Sites: Metadata Challenged
% telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà"#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ
HTML metadata
JPEG metadata
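To make the point of this slide concrete, the following minimal sketch parses the response head from the session above and lists everything a plain HTTP exchange reveals about the resource. The function name `parse_response_head` is an assumption made for illustration; it is not part of mod_oai or any library.

```python
def parse_response_head(raw: str):
    """Split a raw HTTP response head into (status_line, headers).

    Header names are lower-cased so lookups are case-insensitive.
    """
    lines = [ln for ln in raw.strip().splitlines() if ln.strip()]
    status_line = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers


# Response head copied from the telnet session shown on the slide.
RAW_HEAD = """\
HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg
"""

status, headers = parse_response_head(RAW_HEAD)
for name in sorted(headers):
    print(f"{name}: {headers[name]}")
```

Only a handful of fields come back, and none of them describe the image itself: that is the gap the rest of the talk addresses.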
Slide # 4
Archives: Metadata-Rich
Slide # 5
YAMM?! (Yet Another Metadata Model?)
Slide # 6
The MPEG-21 DIDL Model
Slide # 7
Preservation & Metadata
[Figure: probability of preservation (low to high) versus resource metadata available (less to more). Plotted labels: HTTP/HTML; automatic metadata utilities/CRATE; Archival Information Package (AIP).]
Slide # 8
# Webs >> # Archivists
Archivist
Web Sites
Typical ingest scenario
Slide # 9
Harnessing the Web Server
Archivist: mod_oai GetRecord request and response
User: standard GET request and response
Self-describing resource
Slide # 10
What is a “Self-Describing” Resource?
Standard HTTP Headers:
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Content-Length: 15986
Content-Type: image/jpeg

PLUS: Output from built-in utilities:

EXIF TOOL:
File Name: 103_0315.JPG
Camera Model Name: Canon EOS DIGITAL REBEL
Date/Time Original: 2003:09:30 13:37:51
Shooting Mode: Sports
Shutter Speed: 1/2000
Aperture: 7.1
Metering Mode: Evaluative
Exposure Compensation: 0
ISO: 400
Lens: 75.0 - 300.0mm
Focal Length: 300.0mm
Image Size: 3072x2048
Quality: Normal
Flash: Off
White Balance: Auto
Focus Mode: AI Servo AF
Contrast: +1
Sharpness: +1
Saturation: +1
Color Tone: Normal
File Size: 1606 kB
File Number: 103-0315

JHOVE TOOL:
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
LastModified: 2007-01-16 23:09:07 EST
Size: 27750
Format: JPEG
Version: 1.00
Status: Well-Formed and valid
SignatureMatches: JPEG-hul
MIMEtype: image/jpeg
Profile: JFIF
JPEGMetadata:
CompressionType: Huffman coding, Baseline DCT
Images:
Number: 1
Image:
NisoImageMetadata:
MIMEType: image/jpeg
ByteOrder: big-endian
CompressionScheme: JPEG
ColorSpace: YCbCr
SamplingFrequencyUnit: inch
XSamplingFrequency: 33
YSamplingFrequency: 26
ImageWidth: 172
ImageLength: 146
BitsPerSample: 8, 8, 8
SamplesPerPixel: 3
Scans: 1
QuantizationTables:
QuantizationTable:
Precision: 8-bit
DestinationIdentifier: 0
Comments: LEAD Technologies Inc. V1.01
ApplicationSegments: APP0

File/Magic:
JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26

MD5 Hash:
58a54e8638db432f4515eedf89f44505

…CRATE: Wrapped together with the resource in simple XML
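As a rough illustration of what the simplest of these utilities contribute, the sketch below computes an MD5 digest and guesses a MIME type from the leading bytes of a resource, in the spirit of `md5sum` and `file`. It is a stand-in written for this transcript, not the code mod_oai runs; the names `SIGNATURES`, `sniff_mime`, and `describe` are hypothetical, and the signature table covers only two formats.

```python
import hashlib

# Leading byte signatures for two formats. This is a tiny illustrative
# subset of what the real `file` utility recognizes.
SIGNATURES = {
    b"\xff\xd8\xff": "image/jpeg",
    b"%PDF-": "application/pdf",
}


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from the first bytes of the resource."""
    for magic, mime in SIGNATURES.items():
        if data.startswith(magic):
            return mime
    return "application/octet-stream"


def describe(data: bytes) -> dict:
    """Collect a few facts that can be derived from the bytes alone."""
    return {
        "size": len(data),
        "md5": hashlib.md5(data).hexdigest(),
        "sniffed_mime": sniff_mime(data),
    }


# A fabricated 20-byte buffer that merely starts like a JPEG file.
sample = b"\xff\xd8\xff\xe0" + b"\x00" * 16
print(describe(sample))
```

Richer tools such as ExifTool and JHOVE follow the same pattern: they read the resource and emit text, which CRATE then carries alongside the resource.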
Slide # 11
Metadata Generation Utility Examples
Name          Description
JHOVE         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file
Slide # 12
Web Server Configuration: “conf” file
### Section 1: Global Environment
#
ServerType standalone
ServerRoot "/etc/httpd"
PidFile /var/run/httpd.pid
ResourceConfig /dev/null
AccessConfig /dev/null
Timeout 300
KeepAlive On
MaxKeepAliveRequests 0
KeepAliveTimeout 15
MinSpareServers 16
MaxSpareServers 64
StartServers 16
MaxClients 512
MaxRequestsPerChild 100000

### Section 2: 'Main' server configuration
#
Port 80

<IfDefine SSL>
Listen 80
Listen 443
</IfDefine>

User www
Group www
ServerAdmin [email protected]
ServerName www.openna.com
DocumentRoot "/home/httpd/ona"

<Directory />
Options None
AllowOverride None
Order deny,allow
Deny from all
</Directory>

<Directory "/home/httpd/ona">
Options None
AllowOverride None
Order allow,deny
Allow from all
</Directory>

<Files .pl>
Options None
AllowOverride None
Order deny,allow
Deny from all
</Files>

<IfModule mod_dir.c>
DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
</IfModule>

#<IfModule mod_include.c>
#Include conf/mmap.conf
#</IfModule>

UseCanonicalName On

<IfModule mod_mime.c>
TypesConfig /etc/httpd/conf/mime.types
</IfModule>

DefaultType text/plain
HostnameLookups Off

• Operational Rules
• Modules (mod_perl, etc.)
• Security
• Virtual Hosts
Slide # 13
Apache: mod_oai Location Directive
<Location /modoai>
  SetHandler modoai-handler
  modoai_oai_active ON
  <modoai_plugin>
    label "md5sum"
    exec "/usr/bin/md5sum %s"
    version "/usr/bin/md5sum --version"
    mime "*/*"
  </modoai_plugin>
  <modoai_plugin>
    label "file"
    exec "/usr/bin/file -kz %s"
    version "/usr/bin/file -v"
    mime "*/*"
  </modoai_plugin>
  <modoai_plugin>
    label "jhove"
    exec "/opt/jhove/jhove -m pdf-hul %s"
    version "/opt/jhove/jhove -v"
    mime "application/pdf"
  </modoai_plugin>
  <modoai_plugin>
    label "pronom"
    exec "java -jar DROID.jar -L %s"
    version "java -jar DROID.jar -V"
    mime "*/*"
  </modoai_plugin>
</Location>
• Scripts
• Pipes
• Executables
• MIME-based selective processing
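The "MIME-based selective processing" bullet can be sketched as a simple filter over the configured plug-ins. The plug-in entries below are copied from the Location directive on this slide; the matching rule (shell-style wildcards via `fnmatchcase`) and the helper names `select_plugins` and `command_for` are assumptions, since the slides do not say how mod_oai performs the match internally.

```python
from fnmatch import fnmatchcase

# Plug-in entries as configured in the <Location /modoai> block.
PLUGINS = [
    {"label": "md5sum", "exec": "/usr/bin/md5sum %s", "mime": "*/*"},
    {"label": "file", "exec": "/usr/bin/file -kz %s", "mime": "*/*"},
    {"label": "jhove", "exec": "/opt/jhove/jhove -m pdf-hul %s",
     "mime": "application/pdf"},
    {"label": "pronom", "exec": "java -jar DROID.jar -L %s", "mime": "*/*"},
]


def select_plugins(mime_type, plugins=PLUGINS):
    """Return the plug-ins whose mime pattern matches the resource type."""
    return [p for p in plugins if fnmatchcase(mime_type, p["mime"])]


def command_for(plugin, path):
    """Substitute the resource path for the %s placeholder."""
    return plugin["exec"].replace("%s", path)


for mime in ("image/jpeg", "application/pdf"):
    labels = [p["label"] for p in select_plugins(mime)]
    print(mime, "->", labels)
```

With this configuration a JPEG would skip the PDF-specific JHOVE module, while a PDF would be passed to all four utilities.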
Slide # 14
Building a CRATE
• URI, UUID
• Standard HTTP Headers
• Plug-In Metadata
• Base64-Encoded Resource
CRATE
CRATE ID
METADATA
RESOURCE
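The three parts of the diagram (CRATE ID, metadata, resource) can be assembled with a few lines of standard-library code. This is a sketch under stated assumptions: the element names `crateMetadata`, `crateContent`, `description`, `label`, and `data` follow the example response in these slides, while `crate`, `crateId`, `uri`, `uuid`, and `httpHeader` are invented here because no schema had been released at the time of the talk.

```python
import base64
import uuid
import xml.etree.ElementTree as ET


def build_crate(uri, http_headers, plugin_outputs, resource):
    """Wrap identifiers, metadata, and the resource bytes in one XML tree."""
    crate = ET.Element("crate")  # hypothetical root element name

    # CRATE ID: the resource URI plus a freshly generated UUID.
    ident = ET.SubElement(crate, "crateId")
    ET.SubElement(ident, "uri").text = uri
    ET.SubElement(ident, "uuid").text = str(uuid.uuid4())

    # METADATA: standard HTTP headers, then one entry per plug-in.
    meta = ET.SubElement(crate, "crateMetadata")
    for name, value in http_headers.items():
        ET.SubElement(meta, "httpHeader", {"name": name}).text = value
    for label, output in plugin_outputs.items():
        desc = ET.SubElement(meta, "description")
        ET.SubElement(desc, "label").text = label
        ET.SubElement(desc, "data").text = output

    # RESOURCE: the bytes themselves, Base64-encoded so they survive in XML.
    content = ET.SubElement(crate, "crateContent")
    data = ET.SubElement(content, "data", {"encoding": "base64"})
    data.text = base64.b64encode(resource).decode("ascii")
    return crate


crate = build_crate(
    "http://foo.edu/jackJill.jpg",
    {"Content-Type": "image/jpeg"},
    {"md5sum": "58a54e8638db432f4515eedf89f44505"},
    b"\xff\xd8\xff\xe0",  # placeholder bytes, not the real image
)
print(ET.tostring(crate, encoding="unicode"))
```

Base64 inflates the payload by roughly a third, which is the price of keeping the resource and its description in a single self-contained document.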
Slide # 15
CRATE example from mod_oai
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding="base64"</mimeType>
        <data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc </data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>"file magic"</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>"jhove"</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data><![CDATA[
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
LastModified: 2007-01-16 23:09:07 EST
Size: 27750
Format: JPEG
Version: 1.00
Status: Well-Formed and valid
SignatureMatches: JPEG-hul
MIMEtype: image/jpeg
Profile: JFIF
JPEGMetadata:
CompressionType: Huffman coding, Baseline DCT
Images:
Number: 1
Image:
NisoImageMetadata:
MIMEType: image/jpeg
ByteOrder: big-endian
CompressionScheme: JPEG
ColorSpace: YCbCr
SamplingFrequencyUnit: inch
XSamplingFrequency: 33
YSamplingFrequency: 26
ImageWidth: 172
ImageLength: 146
BitsPerSample: 8, 8, 8
SamplesPerPixel: 3
Scans: 1
QuantizationTables:
QuantizationTable:
Precision: 8-bit
DestinationIdentifier: 0
Comments: LEAD Technologies Inc. V1.01
ApplicationSegments: APP0
]]></data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
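From the archivist's side, the exchange above amounts to building a GetRecord URL and unpacking the response. The sketch below does both with the Python standard library. It assumes, as in the example on this slide, that the CRATE elements sit in the default OAI-PMH namespace; the helper names `getrecord_url` and `unpack_crate` are hypothetical, and `SAMPLE` is a trimmed stand-in response (its payload is the Base64 of the bytes `hello`), not real mod_oai output.

```python
import base64
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def getrecord_url(base_url, identifier, prefix="crate"):
    """Build a GetRecord request; urlencode percent-escapes the identifier."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": prefix,
    })
    return f"{base_url}?{query}"


def unpack_crate(xml_text):
    """Return (identifier, resource bytes, plug-in labels) from a response."""
    root = ET.fromstring(xml_text)
    record = root.find(f"{OAI_NS}GetRecord/{OAI_NS}record")
    identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
    payload = record.findtext(f"{OAI_NS}crateContent/{OAI_NS}data")
    labels = [d.findtext(f"{OAI_NS}label")
              for d in record.iter(f"{OAI_NS}description")]
    return identifier, base64.b64decode(payload.strip()), labels


SAMPLE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
      </header>
      <crateContent>
        <data>aGVsbG8=</data>
      </crateContent>
      <crateMetadata>
        <description><label>file magic</label><data>stand-in</data></description>
        <description><label>jhove</label><data>stand-in</data></description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""

print(getrecord_url("http://foo.edu/modoai/", "http://foo.edu/jackJill.jpg"))
print(unpack_crate(SAMPLE))
```

Note that the generated URL percent-encodes the identifier, whereas the slide shows it unescaped for readability.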
Slide # 16
Automatic, Best-Effort Metadata
• Automatic
– Generated at time of dissemination
– Integrates preservation functions with the web server
• Unverified
– Utility results are not cross-checked
– Output of analyses goes directly into the XML response
• Undifferentiated
– No categorization of output
– Resource and metadata form a complex-object response
A simple, easy-to-implement option for improving available preservation metadata for web resources
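The "automatic" and "unverified" properties can be sketched as a function that runs one utility and passes its text output through untouched, recording a failure instead of aborting the response. This is an illustration only: `run_plugin` is a hypothetical helper, the command is given as an argument list rather than the quoted string used in the mod_oai configuration, and the demo runs the Python interpreter as a stand-in utility so that it works without JHOVE or `file` installed.

```python
import subprocess
import sys


def run_plugin(argv_template, path, timeout=30):
    """Run one metadata utility on `path` and return its raw text output.

    The output is not parsed or cross-checked; a utility that is missing
    or times out yields an error entry rather than stopping the response.
    """
    argv = [path if arg == "%s" else arg for arg in argv_template]
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return {"ok": False, "data": str(exc)}
    return {"ok": done.returncode == 0,
            "data": (done.stdout or done.stderr).strip()}


# Stand-in "utility": the Python interpreter echoing its argument.
echo = [sys.executable, "-c", "import sys; print(sys.argv[1])", "%s"]
print(run_plugin(echo, "jackJill.jpg"))

# A utility that does not exist is reported, not fatal.
print(run_plugin(["/no/such/utility", "%s"], "jackJill.jpg"))
```

Whether a failed utility should be reported in the response or silently omitted is a design choice the slides leave open.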
Slide # 17
Issues - Or Not?
• Web Server Performance
– Academic vs dot-com expectations
– Solution options
• Utility Efficiency
– Java-based vs C-based
– Market pressures
• Security
– Metadata vs risk
– Access controls
Slide # 18
Next Up…
• mod_oai Open Source release
• Formalize/release CRATE schema definition (XSD)
• Metrics Collection & Evaluation
– Academic sites
– Dot-Com sites
– Examine utility compatibility and issues
– Address security concerns
Slide # 19
Demo
TODAY:
• http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg
• http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://localhost/modoaitest/diag.jpg
AT MODOAI.ORG:
• http://www.modoai.org/demos.html
Slide # 20
Further Information
• The mod_oai project home page: http://www.modoai.org/
• JCDL 2007: Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination
• Authors’ webs:
– http://www.cs.odu.edu/~mln/pubs/
– http://www.joanasmith.com/pubs.html