
CRATE: A Simple Model for Self-Describing Web Resources

International Web Archiving Workshop 2007

Joan A. Smith & Michael L. Nelson

Old Dominion University

Department of Computer Science

Norfolk, VA 23529

{jsmit, mln}@cs.odu.edu

Slide # 2

WWW and Digital Libraries: Vastly Different Worlds

World Wide Web
– A disorganized free-for-all
– Near-zero metadata
– Unpredictable additions, deletions, modifications
– No preservation policy

Crawlapalooza

Digital Library
– Organized
– Groomed content
– Lots of metadata
– Structured changes
– Active preservation policies

Harvester Home Companion

Slide # 3

Web Sites: Metadata Challenged

% telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.

GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà"#2s¡35Rq‘±³ÁÂ$%Ccruƒ“¢ÃÒÿÄ

HTML metadata

JPEG metadata
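To make the point about sparse metadata concrete, the following is a minimal Python sketch that splits the response head from this slide into its fields. The function name parse_response_head and the choice of which fields count as "descriptive" are illustrative assumptions, not part of the original talk.

```python
def parse_response_head(raw: str) -> tuple[str, dict[str, str]]:
    """Split a raw HTTP response head into its status line and a header dict."""
    lines = [line for line in raw.strip().splitlines() if line.strip()]
    status_line = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers


# Response head copied from the telnet session on this slide.
RESPONSE_HEAD = """\
HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg
"""

status, headers = parse_response_head(RESPONSE_HEAD)

# Assumption: only these four fields say anything about the resource itself;
# the rest describe the server or the transfer.
descriptive = {
    name: headers[name]
    for name in ("last-modified", "etag", "content-length", "content-type")
}
```

Everything an archivist learns from a plain GET is in that small dictionary; the image's own embedded metadata stays locked inside the binary body.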

Slide # 4

Archives: Metadata-Rich

Slide # 5

YAMM?! (Yet Another Metadata Model?)

Slide # 6

The MPEG-21 DIDL Model

Slide # 7

Preservation & Metadata

[Figure: Probability of Preservation (vertical axis, Low to High) plotted against Resource Metadata Available (horizontal axis, Less to More). Labeled points: HTTP/HTML; Automatic metadata utilities/CRATE; Archival Information Package (AIP).]

Slide # 8

# Webs >> # Archivists

Archivist

Web Sites

Typical ingest scenario

Slide # 9

Harnessing the Web Server

Archivist: mod_oai GetRecord request and response

User: standard GET request and response

Self-describing resource
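The two access paths on this slide differ only in the URL the client asks for. As a rough sketch, the archivist's GetRecord URL could be assembled as follows; the base URL and resource URL are taken from the CRATE example later in the deck, while the helper name getrecord_url is hypothetical. The identifier is percent-encoded here, which is standard query-string practice but differs from the unencoded form shown on the slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def getrecord_url(base_url: str, identifier: str, metadata_prefix: str = "crate") -> str:
    """Build an OAI-PMH GetRecord URL for one web resource."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


# User path: the ordinary resource URL, fetched with a standard GET.
resource_url = "http://foo.edu/jackJill.jpg"

# Archivist path: the same resource, requested through the mod_oai handler.
archivist_url = getrecord_url("http://foo.edu/modoai/", resource_url)

# Decoding the query string recovers the original identifier.
decoded = parse_qs(urlsplit(archivist_url).query)
```

Both requests go to the same web server; only the second one asks it to wrap the resource together with its metadata.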

Slide # 10

What is a “Self-Describing” Resource?

EXIF TOOL:
File Name              103_0315.JPG
Camera Model Name      Canon EOS DIGITAL REBEL
Date/Time Original     2003:09:30 13:37:51
Shooting Mode          Sports
Shutter Speed          1/2000
Aperture               7.1
Metering Mode          Evaluative
Exposure Compensation  0
ISO                    400
Lens                   75.0 - 300.0mm
Focal Length           300.0mm
Image Size             3072x2048
Quality                Normal
Flash                  Off
White Balance          Auto
Focus Mode             AI Servo AF
Contrast               +1
Sharpness              +1
Saturation             +1
Color Tone             Normal
File Size              1606 kB
File Number            103-0315

Standard HTTP Headers:
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Content-Length: 15986
Content-Type: image/jpeg

PLUS: Output from built-in utilities:

JHOVE TOOL:
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
LastModified: 2007-01-16 23:09:07 EST
Size: 27750
Format: JPEG
Version: 1.00
Status: Well-Formed and valid
SignatureMatches: JPEG-hul
MIMEtype: image/jpeg
Profile: JFIF
JPEGMetadata:
CompressionType: Huffman coding, Baseline DCT
Images:
Number: 1
Image:
NisoImageMetadata:
MIMEType: image/jpeg
ByteOrder: big-endian
CompressionScheme: JPEG
ColorSpace: YCbCr
SamplingFrequencyUnit: inch
XSamplingFrequency: 33
YSamplingFrequency: 26
ImageWidth: 172
ImageLength: 146
BitsPerSample: 8, 8, 8
SamplesPerPixel: 3
Scans: 1
QuantizationTables:
QuantizationTable:
Precision: 8-bit
DestinationIdentifier: 0
Comments: LEAD Technologies Inc. V1.01
ApplicationSegments: APP0

File/Magic:
JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26

MD5 Hash:
58a54e8638db432f4515eedf89f44505

…CRATE: Wrapped together with the resource in simple XML

Slide # 11

Metadata Generation Utility Examples

Name          Description
Jhove         Analysis by type (img, audio, text)
Kea           Key phrase extraction
OTS           Open Text Summarizer
ExifTool      Image/video metadata extractor
PDFlib-pCOS   Extract PDF metadata
MP3-Tag       Extract audio file tags
Essence       Customized information extraction
GDFR          MIME++
MD5           Message Digest
File Magic    Uses content-identification bits of the file

Slide # 12

Web Server Configuration: “conf” file

### Section 1: Global Environment
#
ServerType standalone
ServerRoot "/etc/httpd"
PidFile /var/run/httpd.pid
ResourceConfig /dev/null
AccessConfig /dev/null
Timeout 300
KeepAlive On
MaxKeepAliveRequests 0
KeepAliveTimeout 15
MinSpareServers 16
MaxSpareServers 64
StartServers 16
MaxClients 512
MaxRequestsPerChild 100000

### Section 2: 'Main' server configuration
#
Port 80

<IfDefine SSL>
    Listen 80
    Listen 443
</IfDefine>

User www
Group www
ServerAdmin [email protected]
ServerName www.openna.com
DocumentRoot "/home/httpd/ona"

<Directory />
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
</Directory>

<Directory "/home/httpd/ona">
    Options None
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>

<Files .pl>
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
</Files>

<IfModule mod_dir.c>
    DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
</IfModule>

#<IfModule mod_include.c>
#Include conf/mmap.conf
#</IfModule>

UseCanonicalName On

<IfModule mod_mime.c>
    TypesConfig /etc/httpd/conf/mime.types
</IfModule>

DefaultType text/plain
HostnameLookups Off

• Operational Rules
• Modules (mod_perl, etc.)
• Security
• Virtual Hosts

Slide # 13

Apache: mod_oai Location Directive

<Location /modoai>
    SetHandler modoai-handler
    modoai_oai_active ON
    <modoai_plugin>
        label "md5sum"
        exec "/usr/bin/md5sum %s"
        version "/usr/bin/md5sum --version"
        mime "*/*"
    </modoai_plugin>
    <modoai_plugin>
        label "file"
        exec "/usr/bin/file -kz %s"
        version "/usr/bin/file -v"
        mime "*/*"
    </modoai_plugin>
    <modoai_plugin>
        label "jhove"
        exec "/opt/jhove/jhove -m pdf-hul %s"
        version "/opt/jhove/jhove -v"
        mime "application/pdf"
    </modoai_plugin>
    <modoai_plugin>
        label "pronom"
        exec "java -jar DROID.jar -L %s"
        version "java -jar DROID.jar -V"
        mime "*/*"
    </modoai_plugin>
</Location>

• Scripts
• Pipes
• Executables
• MIME-based selective processing
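MIME-based selective processing can be illustrated with a short Python sketch. The plugin table below mirrors the four plugin entries in the Location directive on this slide; the matching rule (shell-style wildcards via fnmatchcase) and the function name commands_for are assumptions made for illustration, since the slide does not say how mod_oai itself evaluates the mime patterns.

```python
import shlex
from fnmatch import fnmatchcase

# Plugin entries copied from the Location directive on this slide.
PLUGINS = [
    {"label": "md5sum", "exec": "/usr/bin/md5sum %s", "mime": "*/*"},
    {"label": "file", "exec": "/usr/bin/file -kz %s", "mime": "*/*"},
    {"label": "jhove", "exec": "/opt/jhove/jhove -m pdf-hul %s", "mime": "application/pdf"},
    {"label": "pronom", "exec": "java -jar DROID.jar -L %s", "mime": "*/*"},
]


def commands_for(mime_type, path, plugins=PLUGINS):
    """Return (label, argv) pairs for every plugin whose pattern matches mime_type."""
    selected = []
    for plugin in plugins:
        # Assumption: patterns behave like shell wildcards, so "*/*" matches all types.
        if fnmatchcase(mime_type, plugin["mime"]):
            # Replace the %s placeholder with the resource path, token by token,
            # so a path containing spaces stays a single argument.
            argv = [path if token == "%s" else token for token in shlex.split(plugin["exec"])]
            selected.append((plugin["label"], argv))
    return selected


jpeg_commands = commands_for("image/jpeg", "/home/crate/apache/htdocs/jackJill.jpg")
pdf_commands = commands_for("application/pdf", "/tmp/example.pdf")  # hypothetical path
```

With this table a JPEG is handed to md5sum, file, and pronom, while a PDF additionally goes through the jhove entry. Each argv list could then be passed to subprocess.run to capture the utility's output.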

Slide # 14

Building a CRATE

• URI, UUID

• Standard HTTP Headers

• Plug-In Metadata

• Base64-Encoded Resource

CRATE

CRATE ID

METADATA

RESOURCE
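A minimal Python sketch of assembling these parts into XML follows. The element names (record, header, identifier, crateContent, mimeType, data, crateMetadata, description, label) are borrowed from the mod_oai example later in the deck; the function build_crate and the placeholder resource bytes are hypothetical. The UUID and the HTTP headers are left out because the example does not show how they are encoded.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def build_crate(identifier, mime_type, resource, descriptions):
    """Wrap a resource and its plug-in metadata in one XML record."""
    record = ET.Element("record")

    # CRATE ID: here only the resource URI.
    header = ET.SubElement(record, "header")
    ET.SubElement(header, "identifier").text = identifier

    # RESOURCE: the raw bytes, Base64-encoded so they are safe inside XML.
    content = ET.SubElement(record, "crateContent")
    ET.SubElement(content, "mimeType").text = mime_type
    ET.SubElement(content, "data").text = base64.b64encode(resource).decode("ascii")

    # METADATA: one description per utility, output stored as-is.
    metadata = ET.SubElement(record, "crateMetadata")
    for label, output in descriptions:
        description = ET.SubElement(metadata, "description")
        ET.SubElement(description, "label").text = label
        ET.SubElement(description, "data").text = output
    return record


# Placeholder bytes standing in for the resource; not a real image.
RESOURCE = b"\xff\xd8\xff\xe0 placeholder bytes"

crate = build_crate(
    "http://foo.edu/jackJill.jpg",
    "image/jpeg",
    RESOURCE,
    [("md5sum", hashlib.md5(RESOURCE).hexdigest())],
)
xml_text = ET.tostring(crate, encoding="unicode")
```

Decoding the data element with base64.b64decode returns the original bytes, so the record carries both the resource and what the utilities said about it.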

Slide # 15

CRATE example from mod_oai

http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg"
           metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding="base64"</mimeType>
        <data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpI+hlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc </data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>"file magic"</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>"jhove"</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data><![CDATA[
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
LastModified: 2007-01-16 23:09:07 EST
Size: 27750
Format: JPEG
Version: 1.00
Status: Well-Formed and valid
SignatureMatches: JPEG-hul
MIMEtype: image/jpeg
Profile: JFIF
JPEGMetadata:
CompressionType: Huffman coding, Baseline DCT
Images:
Number: 1
Image:
NisoImageMetadata:
MIMEType: image/jpeg
ByteOrder: big-endian
CompressionScheme: JPEG
ColorSpace: YCbCr
SamplingFrequencyUnit: inch
XSamplingFrequency: 33
YSamplingFrequency: 26
ImageWidth: 172
ImageLength: 146
BitsPerSample: 8, 8, 8
SamplesPerPixel: 3
Scans: 1
QuantizationTables:
QuantizationTable:
Precision: 8-bit
DestinationIdentifier: 0
Comments: LEAD Technologies Inc. V1.01
ApplicationSegments: APP0
          ]]></data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
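On the harvesting side, a response like the one above could be unpacked roughly as follows. This is a Python sketch under two stated assumptions: that the crate elements inherit the OAI-PMH default namespace, as the example's markup implies, and that an abbreviated sample (the SAMPLE string below, shortened from the slide) is representative. The function name summarize_crate is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: crate elements live in the OAI-PMH default namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated from the example on this slide; utility output is shortened.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
      </header>
      <crateMetadata>
        <description>
          <label>file magic</label>
          <data>JPEG image data, JFIF standard 1.00</data>
        </description>
        <description>
          <label>jhove</label>
          <data><![CDATA[Format: JPEG
Status: Well-Formed and valid]]></data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>"""


def summarize_crate(xml_text):
    """Return the record identifier and a {label: output} dict of plug-in metadata."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", OAI_NS)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
    outputs = {}
    for description in record.iterfind("oai:crateMetadata/oai:description", OAI_NS):
        label = description.findtext("oai:label", namespaces=OAI_NS)
        outputs[label] = description.findtext("oai:data", namespaces=OAI_NS)
    return identifier, outputs


identifier, outputs = summarize_crate(SAMPLE)
```

Because each utility's output is stored as uninterpreted text, the harvester gets a label-to-text mapping and is left to decide what, if anything, to parse further.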

Slide # 16

Automatic, Best-Effort Metadata

• Automatic
  – Generated at time of dissemination
  – Integrates preservation functions with the web server

• Unverified
  – Utility results are not cross-checked
  – Output of analyses goes directly into the XML response

• Undifferentiated
  – No categorization of output
  – Resource and metadata form a complex-object response

A simple, easy-to-implement option for improving available preservation metadata for web resources

Slide # 17

Issues - Or Not?

• Web Server Performance
  – Academic vs dot-com expectations
  – Solution options

• Utility Efficiency
  – Java-based vs C-based
  – Market pressures

• Security
  – Metadata vs risk
  – Access controls

Slide # 18

Next Up…

• mod_oai Open Source release
• Formalize/release CRATE schema definition (XSD)
• Metrics Collection & Evaluation
  – Academic sites
  – Dot-Com sites
  – Examine utility compatibility and issues
  – Address security concerns

Slide # 19

Demo

TODAY:

• http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg

• http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://localhost/modoaitest/diag.jpg

AT MODOAI.ORG:

• http://www.modoai.org/demos.html

Slide # 20

Further Information

• The mod_oai project home page: http://www.modoai.org/

• JCDL 2007: Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination

• Authors’ webs:
  – http://www.cs.odu.edu/~mln/pubs/
  – http://www.joanasmith.com/pubs.html