Steps Towards Better Digital Archives
Hiroyuki Kawano
Department of Information and Telecommunication, Faculty of Mathematical Sciences and Information Engineering, Nanzan University, Japan
Adjunct researcher in Digital Library Section at National Diet Library (Japan)
Outline: Today’s Talk
・ Fragile Intelligence on Internet
- Disappearing Scientific/Artistic/Cultural Contents
- Statistics of web contents in Japan (by NDL)
・ Problems of building Digital Archive
- Technology / Legislation / Organization
・ Towards Better Digital Archive
- Technical problems: Distributed crawling programs; Huge storage systems using hierarchical architecture
- Social problems: Intellectual properties (copyright law, creative commons)
Self-introduction
Background of My Research: Mondou
Search results by Mondou
Related keywords provided by association rule mining
Text/Web Mining Mondou ( 1996 )
Relevant keywords provided by text mining association rules
Document clustering
Information visualization
Discover web communities
Distributed and cooperative web robots
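As a rough illustration of how association rules can suggest related keywords for a query term, the sketch below computes support and confidence for rules of the form query -> keyword over a toy document set. The documents, the thresholds, and the function name `related_keywords` are hypothetical and are not taken from Mondou.

```python
def related_keywords(docs, query, min_support=0.3, min_confidence=0.6):
    """Return (keyword, support, confidence) for rules query -> keyword."""
    n = len(docs)
    with_query = [d for d in docs if query in d]
    if not with_query:
        return []
    candidates = set().union(*with_query) - {query}
    results = []
    for k in sorted(candidates):
        both = sum(1 for d in with_query if k in d)
        support = both / n                    # share of all documents with query and k
        confidence = both / len(with_query)   # share of query documents that also have k
        if support >= min_support and confidence >= min_confidence:
            results.append((k, support, confidence))
    return results

# Toy keyword sets, one per document (hypothetical data).
docs = [
    {"web", "archive", "crawler"},
    {"web", "archive", "metadata"},
    {"web", "search", "crawler"},
    {"library", "metadata"},
]
print(related_keywords(docs, "web"))
```

With these toy documents, only "archive" and "crawler" pass both thresholds for the query "web".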
Differences between Search Engine and Web Archive
Crawling
Web Search Engine
・ Freshness by time stamps and informative file types: html, text, pdf, doc and others
Web Archive
・ Accurate crawling of entire web pages stored in target web sites, as rapidly as possible
Quality
Web Search Engine
・ Focusing on special attributes and descriptions:
- title, meta, hyperlink tags
Web Archive
・ Quality control is strongly required
- Original/Master copies
- Archiving shots management
Search
Web Search Engine
・ Recall and Precision
・ Results are sometimes influenced by commerciality.
・ Simple and easy query input
Web Archive
・ Difficulties of document searches
- Historical change and heterogeneous keywords
- Evolution of hyperlink structures
Preservation
Web Search Engine
・ Short time: several months
- Most users prefer popular and fresh web pages.
Web Archive
・ Long time: several centuries, as paper, microfilm etc.
- migration, transformation
Adjunct researcher (2002-) in Digital Library Section at National Diet Library (Japan)
Roles of NDL KANSAI-kan
Collaborative Service between East and West
PhD. Thesis (45%), Journal & Magazine (29%), Reports of Grant-in-Aid for Scientific Research of MEXT (15%), Scientific Reports (7%), Asian Library (4%)
Digital Portal in Japan
Digital Library (in Meiji-era, during 1868-1911) (2007/7)
97,000 Titles, about 143,000 Books
WARP (Web Archiving Project)
1,499 Titles (E-book, journal, article, white report etc.)
46 Government Organizations, 1,907 Cooperative Organizations
Governmental contents are also edited, modified and deleted…
Dnavi
9,900 directories
Add 1,100 URL/yr
Investigate 2,300 URLs deleted among 5,600 URLs
WARP (Web Archiving Project) The House of Councilors
Consolidation of cities, organizations, universities etc.
[Figure: volume of recorded information; time axis B.C. 300, 2001, 2003, 2005; volume axis 1PB, 10PB, 100PB]
Science: 9,000 (1900), 90,000 (1950), 0.9 million (2000)
Alexander Library: 0.5 million, 32TB
% of Archive: 30 ~ 50%
Web Pages
Surface Web: 14TB (1 billion pages), Deep Web: 7.5PB (550 billion pages)
Surface Web: 167TB, Deep Web: 67 ~ 92PB
Web (Japan): 0.45 billion pages (18.4TB), Web (go.jp): 20 million pages (1.6TB)
Book, reports, others: 782 million (50PB = 50,000TB)
Books (Public): 4.8 million (308TB)
Books (Current): 3.20 million (205TB)
Book (Unknown): 24 million (1540TB)
Scan This Book! http://www.nytimes.com/
Statistics of Web Sites
2001: 1 billion pages (Surface Web), 550 billion pages (Deep Web; 7.5PB) http://www.brightplanet.com/technology/deepweb.asp
2002: 2 billion pages (English: 56.4%, German: 7.7%, French: 5.6%, Japanese: 4.9%) http://www.netz-tipp.de/languages.html
2003: 167TB (Surface Web), 92PB (Deep Web) http://www2.sims.berkeley.edu/research/projects/how-much-info-2003
January 2005: Searchable Web pages: 11.5 billion pages in 75 languages http://www.cs.uiowa.edu/~asignori/web-size/
Survey Report of Japanese Web Sites (by NDL, 2005)
Web Data
HTML files: about 44 million files
Picture files: about 55 million files
Estimated total number of files: 450 million files
Estimated total volume of data: 18.4TB
jp domain: 182,093 hosts
go.jp hosts: 2,336 hosts (1.28%), Files 4.4%, Volume 8.5%
http://www.ndl.go.jp/jp/aboutus/bulkresearch2005summary.html
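The survey figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the go.jp share of hosts, files, and volume from the totals quoted in the survey; the average file size is a derived estimate that assumes 1TB = 10^12 bytes, which the survey summary does not state.

```python
# Figures quoted from the NDL 2005 survey summary.
total_files = 450_000_000        # estimated total number of files
total_volume_tb = 18.4           # estimated total volume of data (TB)
jp_hosts = 182_093
go_jp_hosts = 2_336

go_jp_host_share = go_jp_hosts / jp_hosts * 100   # percent of jp hosts
go_jp_files = total_files * 0.044                 # 4.4% of files
go_jp_volume_tb = total_volume_tb * 0.085         # 8.5% of volume

# Assumption: 1 TB = 10**12 bytes and 1 KB = 10**3 bytes.
avg_file_kb = total_volume_tb * 1e12 / total_files / 1e3

print(f"go.jp host share: {go_jp_host_share:.2f}%")
print(f"go.jp files: {go_jp_files / 1e6:.1f} million")
print(f"go.jp volume: {go_jp_volume_tb:.2f} TB")
print(f"average file size: {avg_file_kb:.1f} KB")
```

The results (about 19.8 million files and about 1.56TB for go.jp) are roughly consistent with the 20 million pages and 1.6TB quoted for go.jp elsewhere in the talk.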
Digital Archives
Organization / From / Characteristics
Internet Archive 1996 Wayback Machine (Fair Use)
Austria, National Lib. 1996/6 Legislation
Sweden, Royal Lib. 1996/9 Legislation
Denmark, Royal Lib. 1997/6 Legislation
Australia, National Lib. 1997/6 Discussion
France, National Lib. 1999 Discussion
USA, Lib. of Congress 2000 NDIIPP
Finland, National Lib. 2000/8 Proposal of Legislation
Britain, Lib. 2001/5 2003, Legislation (non-print material)
China, Lib. 2003/1 WICP “Discussing Legislation for Networked Electric Publishing” ( 2003/5 )
Korea, National Lib. 2006/2 OASIS, Discussing Legislation; National Digital Library is under construction to open in 2008
Germany, National Lib. 2006/6 Legislation
Japan, National Diet Lib. 2002/6 Middle term planning (2004)
Towards Better Digital Archives
Preserve Fragile Born-digital Contents
- Academic/Scientific/Artistic/Cultural Resources
Archive of Digital Information
- Technologies of Long Term Preserving
- Legislation of Long Term Preserving
- Organization of Long Term Preserving
Organization
- National libraries for digital preservation projects
- IIPC (International Internet Preservation Consortium)
National Archive Library for preserving Digital Information
Organization (Mandatory) Belief
●National Diet Library
●National Archives of Japan
●Public/Private Libraries
●NII
●Government
Legislation (Law, Consensus) Commons
●Law of National Diet Library
●Law of Libraries
●Law of National Archive
●Law of Museums etc.
●Intellectual Properties
●Copyright Law
●Copyleft/Creative Commons
Technologies (Architecture) Mission-driven
●Internet Technologies
●Natural Language (CJK) including Vietnamese
●Information Retrieval
●Database Technologies
●Archive Technologies
Various Technical Problems
Programs for crawling contents from surface and deep webs provided by dynamic web services
- emulation and migration of dynamic content
- Heritrix
Collaboration and optimization of distributed systems
- preserve monotonically increasing digital contents
- crawling, storage, information retrieval with time-line
- Wera (Web ARchive Access), OAIS, DSpace etc.
Metadata formats
- URI, RDF, MODS (Metadata Object Description Schema)
Various Technical Issues: before Constructing Web Archives
Problems of web crawling
- Discovery of starting URL
- Frequency of retrieving
- Target file extensions
- Domain, directory, depth
- Contents in cross-domains
- Scripting URL: javascript, java, flash etc.
Quality control required
- No missing pages
- No imperfect capturing
- Imperfection caused by timeout
Cost performance of advanced storage systems
- Properties of various storage media
- Archiving units in web sites
- Compression techniques
- Differentiating archives
- Duplication prepared for troubles and disasters
Conservation of originality
- Certification of master copy
- Hyperlink, Coding, Layout, Script etc.
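Several of the crawling issues above (target file extensions, domain, directory, depth, cross-domain content, scripting URLs) come down to a scope decision per discovered URL. The following is a minimal sketch of such a decision using only the Python standard library; the function name `in_scope`, the extension list, the depth limit, and the example host are hypothetical, and this is not how Heritrix or WARP define scope.

```python
from urllib.parse import urlparse

# Assumed target types; "" covers directory-style URLs without an extension.
ALLOWED_EXTENSIONS = {"", ".html", ".htm", ".txt", ".pdf", ".doc"}

def in_scope(url, seed, depth, max_depth=3):
    """Decide whether a discovered URL belongs to the crawl started at `seed`."""
    u, s = urlparse(url), urlparse(seed)
    if u.scheme not in ("http", "https"):
        return False                       # skips javascript:, mailto:, etc.
    if u.hostname != s.hostname:
        return False                       # cross-domain content is out of scope here
    seed_dir = s.path.rsplit("/", 1)[0] + "/"
    if not (u.path or "/").startswith(seed_dir):
        return False                       # stay at or below the seed directory
    if depth > max_depth:
        return False                       # too many link hops from the seed
    last = u.path.rsplit("/", 1)[-1]
    ext = "." + last.rsplit(".", 1)[-1].lower() if "." in last else ""
    return ext in ALLOWED_EXTENSIONS

seed = "http://www.example.jp/info/index.html"
print(in_scope("http://www.example.jp/info/report.pdf", seed, depth=1))
print(in_scope("http://other.example.jp/info/a.html", seed, depth=1))
print(in_scope("javascript:void(0)", seed, depth=1))
```

A real crawler would also need to resolve relative links and handle redirects and timeouts, which this sketch leaves out.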
Hidden webs and archiving
Advanced techniques
- KQML, Mediator, Wrapper, Association rules
Web mining
- Knowledge and rules derived from Metadata Repository, Web summaries
[Figure: Web Servers; Agents communicating via KQML; Search; Archiving Robots; Web Archiving Systems (Metadata, Site Summaries, Frequent navigational patterns, Representative web contents)]
How do we archive contents stored in hidden webs?
Growth of Storage Market
Trend of Storage Volume: 10 times in 2010
2010: volume of storage 1370PB (10 times the volume in 2005)
Growth rate 56.9%/year (IDC Japan)
[Figure: Storage Market in Japan, Unit: ¥100M, TB (JEITA); legend: Dell, Others, Hitachi]
¥13.07M/TB, ¥8.42M/TB, ¥5.02M/TB, ¥2.73M/TB
Next DVD: 25-30GB
2010 Holographic Disc: 200GB-1TB
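The "10 times in 2010" figure can be checked against the quoted annual growth rate with a short compound-growth calculation. The sketch assumes the 56.9%/year rate applies uniformly over the five years from 2005 to 2010, which the slide implies but does not state.

```python
rate = 0.569            # annual growth rate quoted from IDC Japan
years = 5               # 2005 to 2010 (assumed compounding period)
volume_2010_pb = 1370   # quoted storage volume for 2010, in PB

factor = (1 + rate) ** years               # compound growth over the period
implied_2005_pb = volume_2010_pb / factor  # back-calculated 2005 volume

print(f"growth factor over {years} years: {factor:.2f}")
print(f"implied 2005 volume: {implied_2005_pb:.0f} PB")
```

The compound factor comes out at roughly 9.5, which agrees with the rounded "10 times" on the slide.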
Architecture of hierarchical storage
First Level Storage (plain files, full text search)
Second Level Storage (compressed files, partial indexing)
Third Level Storage (archiving multiple files with compression, low cost devices)
Cache Storage, prefetch
Operational Database of Archiving System (log files of web robots, search queries, navigational patterns)
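One way to picture the three storage levels is as a placement rule driven by access frequency taken from the operational database. The sketch below is only an illustration under that assumption; the thresholds, the function name `choose_level`, and the sample access counts are hypothetical and do not describe the NDL system.

```python
def choose_level(accesses_last_year, hot=100, warm=5):
    """Map an archived item to a storage level by its access count.

    1 = plain files with full text search
    2 = compressed files with partial indexing
    3 = packed multi-file archives on low cost devices
    """
    if accesses_last_year >= hot:
        return 1
    if accesses_last_year >= warm:
        return 2
    return 3

# Hypothetical access counts, as might be derived from search and navigation logs.
access_log = {
    "http://www.example.jp/index.html": 1500,
    "http://www.example.jp/report2004.pdf": 12,
    "http://www.example.jp/old/notice1999.html": 0,
}
placement = {url: choose_level(count) for url, count in access_log.items()}
for url, level in placement.items():
    print(level, url)
```

Importance could be folded in the same way, for example by forcing master copies to a fixed level regardless of access counts.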
Guidelines of Metadata http://www.loc.gov/standards/
Various Formats and Standards
Resource Description Formats
- MARC 21 formats: Representation and communication of descriptive metadata about information items
- MARCXML: MARC 21 data in an XML structure
- MODS (Metadata Object Description Schema): XML markup for selected metadata from existing MARC 21 records as well as original resource description
- MADS (Metadata Authority Description Schema): XML markup for selected authority data from MARC 21 records as well as original authority data
- EAD (Encoded Archival Description): XML markup designed for encoding finding aids
Digital Library Standards
- METS (Metadata Encoding & Transmission Standard): Structure for encoding descriptive, administrative, and structural metadata (www.loc.gov/mets)
- MIX (NISO Metadata for Images in XML): XML schema for encoding technical data elements required to manage digital image collections
- PREMIS (Preservation Metadata): A data dictionary and supporting XML schemas for core preservation metadata needed to support the long-term preservation of digital materials.
Options (metadata) of OPAC and WARP
OPAC / WARP
Title Title
Authors/Editors Authors
Editors
Location Start URL
Year Duration
Category
Category No. ( NDC, NDLC, LCC, DDC,UDC, GPO ) NDC
Standard No. (ISBN, ISSN, CODEN, UTM, ISRN, ISMN etc. )
ISSN+ISBN
Book ID ( JAPANMARC, USMARC, UKMARC, OCLC etc.)
Management No. Meta ID
Codes (Language, Original Language, Gov., Univ. etc.)
Japanese/Western Books, Digital Contents, Music/Video, Ashihara Collection etc.
Collections
NDL Resource
Guideline of NDL Meta Data
NDL-DA (NDL-Digital Archive) System is based on OAIS reference model
Information Package consists of
- Content Information
- Metadata
Organizing Unit
- Bibliography, Volume, Number, Article
- Web Site, Web pages
http://www.ndl.go.jp/jp/standards/da/index.html
OAIS (Open Archival Information System) http://www.rlg.org/en/pdfs/rlgnews/news56.pdf
Submission Information Package
Archival Information Package
Dissemination Information Package
Descriptive Metadata is stored separately
NDL: Meta Data
Information Package Metadata
- Preserving contents and associated metadata
Description Metadata
- Bibliography: Title, Publisher, Volume, Number etc.
Technical Metadata
- CPU, Hardware, Operating System, Software etc.
Preservation Metadata
- Long-term preservation: Ingest/Migration history etc.
Rights Metadata
- Permission, Creator, Authority, Audience etc.
Control Metadata
- Other Data for Preservation/Utilization/Management
NDL: Meta Data
Information Package Metadata: METS 1.6
- METS (Metadata Encoding and Transmission Standard)
Description Metadata: MODS 3.2 and NDL-DA Metadata Scheme
- MODS 3.2 (Metadata Object Description Schema)
- MODS is a derivative of MARC21, and it is not so complex
Technical Metadata: PREMIS based Scheme
Preservation Metadata: PREMIS based Scheme
Rights Metadata: PREMIS based Scheme
- PREMIS (PREservation Metadata: Implementation Strategies)
- View Path is “Preservation Layer Model” in DIAS (Digital Information Archiving System, Netherlands)
Control Metadata: NDL-DA Metadata Scheme
Sample: Attribute Values
typeOfResource (based on MARC21)
- text; cartographic; notated music; sound recording; sound recording-musical; sound recording-nonmusical; still image; moving image; three dimensional object; software, multimedia; mixed material
digitalOrigin (based on MODS)
- born digital; reformatted digital; digitized microfilm; digitized other analog
Japanese Kana: script or transliteration
<titleInfo>
<title>国立国会図書館</title>
</titleInfo>
<titleInfo script="Kana">
<title>コクリツ コッカイ トショカン</title>
</titleInfo>
<titleInfo script="latn">
<title>kokuritsu kokkai toshokan</title>
</titleInfo>
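The three title variants can also be generated programmatically. The sketch below builds them with Python's standard `xml.etree.ElementTree`, using the MODS version 3 namespace; the helper name `title_info` is hypothetical, and the `script` values are copied from the sample rather than checked against the NDL-DA guideline.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

def title_info(parent, text, script=None):
    """Append a <titleInfo><title>text</title></titleInfo> element to parent."""
    attrs = {"script": script} if script else {}
    info = ET.SubElement(parent, f"{{{MODS_NS}}}titleInfo", attrs)
    ET.SubElement(info, f"{{{MODS_NS}}}title").text = text
    return info

mods = ET.Element(f"{{{MODS_NS}}}mods")
title_info(mods, "国立国会図書館")                          # original script
title_info(mods, "コクリツ コッカイ トショカン", script="Kana")  # Kana reading
title_info(mods, "kokuritsu kokkai toshokan", script="latn")   # romanized form

xml_text = ET.tostring(mods, encoding="unicode")
print(xml_text)
```

Keeping all three forms in one record lets a search interface match queries typed in kanji, kana, or romaji against the same title.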
Information Package <mets>
Contents <fileSec>/<fileGrp>
Control MD <amdSec>
Structure Map <structMap>
PDF File <file id="001" amdid="201 401">
Technical ・ Preservation MD <techMD id="201">
Rights MD <rightsMD id="301">
Preservation ・ Management MD <digiprovMD id="401">
Description MD <dmdSec id="101">
Bibliographic Unit <div dmdid="101" amdid="301">
<fptr fileid="001"/></div>
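The way the package sections point at each other through IDs can be sketched as a small skeleton plus a reference check. The code below uses Python's standard `xml.etree.ElementTree` and the METS namespace with the attribute names `ID`, `ADMID`, `DMDID`, and `FILEID`. It is not schema-valid METS (the metadata sections are left empty), the helper names are hypothetical, and the letter prefixes on the IDs are an assumption added because XML IDs cannot begin with a digit.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"

def el(parent, name, **attrs):
    """Create a METS-namespaced child element with the given attributes."""
    return ET.SubElement(parent, f"{{{METS}}}{name}", attrs)

mets = ET.Element(f"{{{METS}}}mets")
el(mets, "dmdSec", ID="dmd101")                    # description metadata
amd = el(mets, "amdSec")                           # administrative metadata
el(amd, "techMD", ID="amd201")
el(amd, "rightsMD", ID="amd301")
el(amd, "digiprovMD", ID="amd401")
grp = el(el(mets, "fileSec"), "fileGrp")
el(grp, "file", ID="file001", ADMID="amd201 amd401")   # e.g. a PDF file
div = el(el(mets, "structMap"), "div", DMDID="dmd101", ADMID="amd301")
el(div, "fptr", FILEID="file001")

def dangling_references(root):
    """Return referenced IDs that are not defined anywhere in the document."""
    ids = {e.get("ID") for e in root.iter() if e.get("ID")}
    refs = set()
    for e in root.iter():
        for attr in ("ADMID", "DMDID", "FILEID"):
            refs.update(e.get(attr, "").split())
    return sorted(refs - ids)

print(dangling_references(mets))
```

An empty result means every file and structural division points at metadata sections that actually exist in the package.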
Conclusion
Web archive is one of the dominant information infrastructures in the digital information society.
- Technical problems: Distributed crawling, long-term huge storage, advanced IR
- Social problems: Intellectual properties (copyright law, creative commons)
Huge volume and long-term preserving
- Distributed crawling programs: Surface and hidden webs, complex web services
- Huge storage systems using hierarchical architecture: Storage media, archiving formats, compression methods and rates
- Retrieving mechanism: navigational pattern mining in web archive
- Preserving strategies by importance and access frequencies
- Effective emulation and migration of dynamic contents
Discussion
Digital Archives
- Infrastructure of Digital Contents
Problems of Digital Archives
- Technology: Collaboration of Standardization
- Legislation: Consensus among Stake Holders
- Organization: Store/Preservation/Utilization
Towards Better Digital Archives
- Collaboration for Integrated Digital Archives
- Library, National Archive, Museum, University, Laboratory, Company etc.