Digital Library Projects at
Bibliotheca Alexandrina
Noha Adly
16 January 2006
Noha Adly 06 Bibliotheca Alexandrina 2
Infrastructure and Connectivity
Network
– Fiber optic backbone
• The 11 floors of the library
• The BA Conference Center (BACC)
• The Science Museum and Planetarium
– FTP (foiled twisted pair) used for horizontal cabling (2200+ outlets)
– Gigabit Ethernet technology is deployed
– Leased lines used to connect remote branches
• CULTNAT
• Shallalat
• Swedish Institute (Anna-Lindh foundation)
Internet Connectivity
– Bandwidth from 10 Mbps to 155 Mbps (STM1)
– Plans for wireless Internet access using Wi-Fi hotspots
– Full Internet access through Internet Cafe
– BA Conference Center for journalists and press agents
System Architecture
[Diagram: External, Firewall, DMZ, Network Backbone, Servers, PCs]
Infrastructure Overview
Public PCs
Reading Tables               145
Study Rooms                   85
OPAC                          26
Print Servers                 12
Young People                  13
Children Library               7
Taha Hussein Lib              10
Internet Cafe                  7
Information Literacy Lab       9
Museums                       16
Total                        330
Staff PCs
FAP                          158
LIS                          155
ICT                          230
CUL                          233
EXT                           51
Others                        71
Total                        898
74 servers
2 firewalls
Corporate antivirus
Server Room
Services
[Diagram: Intranet, ERP, Backup, Office, MMV, Anti-Virus, Streaming, Security, Web, VTLS, Video Conf., Webcasting, etc.]
Video Conferencing
Access Control System – Staff
Ticketing Control System
ILS – Integrated Library System
Library Information System
Web based
Supports Arabisation
Trilingual interface (Arabic, French, English)
Integrated with Multimedia system
Available 24x7
In-house development tools
– Payment Card System
– Automated Circulation Overdue Notices
– Membership system
– Cataloging Performance Tracking
– Circulation Reports and Statistics
– Customized Reports
– etc.
BA Website
Statistics – www.bibalex.org
Statistics
ISIS - Mission & Goals
Mission
– Initiate, carry out and promote research and development activities and projects related to building a universal knowledge center
– Act as an incubator for digital and technological projects, promoting and nurturing innovations in accordance with BA goals
Goals
– Preserving the heritage for future generations
– Universal access to human knowledge
Overview
Internet Archive
Million Book Project
UDBE: Universal Digital Book Encoder
DAR: Digital Assets Repository
The Digital Modern History of Egypt
– Gamal Abdel Nasser
– Description de l’Egypte
OACIS
Internet Archive
Overview
Web: 10 billion pages from 1996-2001
Television: 2000 hours of Egyptian and US TV
Movies: 1000 archival films
100 Terabytes of data
Storage on 200 computers
The second copy worldwide, after the original copy in San Francisco
Access Statistics
Second Generation Machines: Petabox
Designed to store and process one petabyte (a million gigabytes).
Features:
– Low power: 6 kW per rack, and 60 kW for the whole system
– High density: 64 TB per 40U rack
– Local computing to process the data: 800 low-end PCs
– Multiple operating systems
– Software to automate mirroring
– Easy maintenance: one system administrator per petabyte
– Inexpensive design
– Inexpensive storage
Single Rack Configuration
[Diagram: 43U rack]
Components (count): Router/Firewall (1), Switch (2), Admin Node (2), Data Node (80)
Specifications shown in the diagram:
– 48 x 100Mbps, 2 x 1Gbps
– 1.2TB, 1GHz, 100Mbps
– 1.2TB, 1GHz, 2 x 100Mbps
– 2 x 3GHz, 2 GB, 4 x 1Gbps
All boxes 1U, except Router/Firewall 2U
Progress
An agreement with the Internet Archive for building the Petabox has been signed
Hard disks for 2 Petabytes have been purchased
3700 hard disks to reach IA by February 1st 2006
IA will build the machines and load them with the data of the web collection of 2002, 2003, 2004 and 2005
1300 hard disks will be delivered at BA to be assembled locally
New machines for the 2006 collection will be designed and manufactured locally.
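As a quick sanity check on these figures, the two disk shipments can be related to the 2-petabyte purchase. This is only a back-of-the-envelope sketch: it assumes decimal units (1 PB = 1,000,000 GB) and that all disks have the same capacity, neither of which the slide states.

```python
# Back-of-the-envelope check of the disk purchase figures from the slide.
# Assumptions (not stated in the source): decimal units, identical disks.
DISKS_TO_IA = 3700       # disks shipped to the Internet Archive
DISKS_TO_BA = 1300       # disks delivered to BA for local assembly
TOTAL_PETABYTES = 2      # total capacity purchased
GB_PER_PB = 1_000_000    # decimal petabyte

total_disks = DISKS_TO_IA + DISKS_TO_BA
gb_per_disk = TOTAL_PETABYTES * GB_PER_PB / total_disks

# Capacity carried by each shipment, in petabytes.
ia_share_pb = DISKS_TO_IA * gb_per_disk / GB_PER_PB
ba_share_pb = DISKS_TO_BA * gb_per_disk / GB_PER_PB

print(f"{total_disks} disks, about {gb_per_disk:.0f} GB each")
print(f"IA batch: {ia_share_pb:.2f} PB, BA batch: {ba_share_pb:.2f} PB")
```

Under these assumptions the purchase works out to 5000 disks of about 400 GB each.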
Million Book Project
Goals
Long-term: Capture all books in digital format;
Short-term: Digitize 1 million books by 2007;
Provide a test bed to support research areas, such as
– Scanning techniques;
– Optical character recognition;
– Intelligent indexing;
– Machine translation;
– Information retrieval.
Partners
USA
– Carnegie Mellon University
– Internet Archive
China
– Beijing University
– Chinese Academy of Science
– Fudan University
– Chinese Ministry of Education
– Nanjing University
– State Planning Commission of China
– Tsinghua University
– Zhejiang University
Partners
India
– Indian Institute of Science
– International Institute of Information Technology, Hyderabad
– Arulmigu Kalasalingam College of Engineering
– Goa University
– Indian Institute of Information Technology, Allahabad
– Shanmugha Arts, Science, Technology & Research Academy
– Tirumala Tirupati Devasthanams
– Maharashtra Industrial Development Corporation
– University of Pune
– Anna University
… Now increased to 22 centers
Digital Lab Workflow
Image Processing
Enhances the quality of the scanned images
– Removes noise
– Reduces file size
Functions performed
– Despeckle – removes isolated black pixels
– Deskew – detects and removes skew
– Crop – removes the extra white spaces
– Curvature correction
– Removal of margins
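Two of the functions listed above, despeckle and crop, can be illustrated on a tiny black-and-white image. This is a minimal sketch for explanation only, not how the tools used in the lab work internally: the image is assumed to be a list of rows with 1 for black and 0 for white, and "isolated" is assumed to mean a black pixel with no black pixel among its eight neighbours.

```python
from typing import List

Image = List[List[int]]  # 1 = black pixel, 0 = white pixel (assumed encoding)

def despeckle(img: Image) -> Image:
    """Remove black pixels that have no black neighbour (8-connectivity)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            has_neighbour = any(
                img[ny][nx] == 1
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                out[y][x] = 0
    return out

def crop(img: Image) -> Image:
    """Trim all-white rows and columns around the remaining content."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    if not rows:
        return []
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]

page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],  # isolated speck
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],  # "text"
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = crop(despeckle(page))
print(cleaned)  # [[1, 1], [1, 1]]
```

Running despeckle before crop matters here: if the speck were still present, the crop would keep the white space between it and the text.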
Image Processing Procedure
[Workflow diagram. Steps: ScanFix (noise), Photoshop (black edge, centering), ScanFix (skew), Photoshop (resize), ACDSee (compress). File stages: OTIFF, PTIFF.X, PTIFF. Two "Recover" paths.]
OCR - Arabic
Poses unique challenges
– Written cursively, with blocks of connected characters
– A ‘block of characters’ can have more than one baseline.
– Uses external objects such as dots, 'Hamza' and 'Madda'.
– Diacritization
– Characters can have more than one shape according to their position
– Overlapping makes it difficult to determine the spacing
Sakhr Automatic Reader is used
Tricky with old books
Requires learning
Pre-OCR Text Enhancement
Condition of Arabic printings varies
– Old/new
– Light/heavy
– Solid/dot-matrix
ScanFix’s smoothing and completion features improve recognition accuracy
Separate from actual processing phase
– Must be tested under OCR right away
– OCR specialists have a better feel for “good text”
Font Libraries
Improvement of Arabic OCR results through
– Tweaking of OCR engine settings
– Learning Libraries for different fonts have been built to achieve higher recognition rates
Databases of character glyphs that describe a particular type of script and improve OCR accuracy
Built on a carefully selected and classified high-variety set of scanned images belonging to a batch of about 1000 books that boiled down to 15 font groups
Font Classification
Classification criteria:
– Script type
• TA: Traditional Arabic
• AR: Arabic Transparent
• DT: Deco type Naskh and Deco type Naskh extension
– Printing quality: High (H), Medium (M), and Low (L)
– Font size: 1 (largest) to 5 (smallest)
“Group X” – virtual font to tag unclassifiable printings and handwriting
Minimum accuracy number assigned to each group based on testing results
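The three criteria combine into short group labels such as TA-H2. A small decoder makes the scheme concrete. It is a sketch under assumptions: the label layout (script code, hyphen, quality letter, size digit) is inferred from the group names used in this talk, and the function and class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Codes taken from the classification criteria on the slide.
SCRIPTS = {"TA": "Traditional Arabic", "AR": "Arabic Transparent", "DT": "Deco type Naskh"}
QUALITIES = {"H": "High", "M": "Medium", "L": "Low"}

@dataclass(frozen=True)
class FontGroup:
    script: str
    quality: str
    size: int  # 1 (largest) to 5 (smallest)

def parse_font_group(code: str) -> Optional[FontGroup]:
    """Decode a label such as 'TA-H2'; 'X' marks unclassifiable material."""
    if code == "X":
        return None
    script, rest = code.split("-")           # assumed layout: SCRIPT-QUALITYSIZE
    quality, size = rest[0], int(rest[1:])
    if script not in SCRIPTS or quality not in QUALITIES or not 1 <= size <= 5:
        raise ValueError(f"not a valid font group code: {code!r}")
    return FontGroup(SCRIPTS[script], QUALITIES[quality], size)

print(parse_font_group("TA-H2"))
```

Returning None for "X" keeps the virtual group distinct from real fonts, so code that looks up a minimum accuracy can treat it separately.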
Font Groups
Font     Low Bound   High Point   % Books
AR-H1    97.70%      99.50%        0.43%
AR-H2    97.60%      99.50%        3.42%
AR-H3    97.04%      99.10%        8.53%
AR-H4    Under construction
AR-L4    92.70%      96.70%        5.63%
DT-M1    Under construction
DT-L2    88.40%      96.80%        7.73%
TA-H1    97.30%      99.10%        2.03%
TA-H2    97.60%      99.20%       14.15%
TA-H3    Under construction
TA-H4    96.50%      97.74%        2.75%
TA-L1    94.00%      97.70%        1.81%
TA-L4    94.00%      97.90%        8.08%
TA-M2    95.80%      98.80%       28.46%
TA-M4    94.50%      97.50%       12.58%
X                                  4.39%
Progress
Five scanning stations since October 2003
As of January 1st 2006:
– 22,214 books digitized & processed (6.7 million pages)
– 15,550 books OCRed (4.6 million pages)
• 11,101 Arabic books (3.3 million pages)
• 4,449 Latin books (1.3 million pages)
Daily Rates
– Scanning: ≈ 2000 pages/person
– Processing: ≈ 1800 pages/person
– Latin OCR: ≈ 4000 pages/person
– Arabic OCR: ≈ 1500 pages/person
The target is to scan and process 5000 pages/day/scanner, leading to ≈ 25,000 books/year
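The step from 5000 pages/day/scanner to about 25,000 books/year can be reproduced with a few lines of arithmetic. This is a rough sketch: the number of working days per year is an assumption (300 is used here), and the average book length is derived from the progress figures above rather than stated on the slide.

```python
# Figures from the slide.
PAGES_DIGITIZED = 6_700_000
BOOKS_DIGITIZED = 22_214
SCANNERS = 5
TARGET_PAGES_PER_SCANNER_PER_DAY = 5000

# Assumption, not from the slide.
WORKING_DAYS_PER_YEAR = 300

# Average book length implied by the progress so far.
pages_per_book = PAGES_DIGITIZED / BOOKS_DIGITIZED

# Yearly page output if every scanner meets the daily target.
pages_per_year = SCANNERS * TARGET_PAGES_PER_SCANNER_PER_DAY * WORKING_DAYS_PER_YEAR
books_per_year = pages_per_year / pages_per_book

print(f"average book length: {pages_per_book:.0f} pages")
print(f"projected output: {books_per_year:,.0f} books/year")
```

Under these assumptions the projection lands close to the ≈ 25,000 books/year quoted on the slide.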
Publishing
Challenges
– Preservation of layout
– Searchability of content and metadata
– Efficient image compression
– Accommodating low bandwidth users
– Easy browsing of books
– Multipaging
– Multilingual text support
Image-on-Text
Multilayered:
– Visible page image
– Hidden OCR text
View exact original layout while searching and highlighting
Supported with some OCR suites only
Supported formats: DjVu and PDF
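The idea behind image-on-text can be sketched in a few lines: the reader sees the page image, while a search runs over the hidden OCR words and returns their positions so that matches can be highlighted on the image. The data model below (a word with a pixel bounding box) and all names are illustrative assumptions, not the internal layout of DjVu or PDF.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height in page-image pixels

@dataclass
class Word:
    text: str  # recognised text, never drawn on screen
    box: Box   # where the word sits on the visible page image

def find_highlights(hidden_layer: List[Word], query: str) -> List[Box]:
    """Return the boxes of hidden-layer words matching the query (case-insensitive)."""
    needle = query.casefold()
    return [w.box for w in hidden_layer if w.text.casefold() == needle]

page_words = [
    Word("Bibliotheca", (40, 30, 180, 24)),
    Word("Alexandrina", (230, 30, 190, 24)),
    Word("bibliotheca", (40, 90, 175, 24)),
]
boxes = find_highlights(page_words, "Bibliotheca")
print(boxes)  # [(40, 30, 180, 24), (40, 90, 175, 24)]
```

A viewer would draw a translucent rectangle over each returned box, which is why the original layout stays intact while search still works.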
UDBE – Universal Digital Book Encoder
Built around a Common OCR Format (COF)
[Diagram. Conversion: OCR Engine (A), (B), (C) → OCR Converter (A), (B), (C) → COF. Encoding: COF → Format Handler (X), (Y) → Target Format (X), (Y)]
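The benefit of routing everything through a common format is that each OCR engine needs one converter and each target format needs one handler, instead of one encoder per engine/format pair. The sketch below illustrates that hub design only; the engine names, format names, and the toy dictionary standing in for COF are all hypothetical.

```python
from typing import Callable, Dict, List

COF = Dict[str, List[str]]  # toy stand-in for the common format: {"words": [...]}

# One converter per OCR engine: engine-specific output -> COF.
converters: Dict[str, Callable[[str], COF]] = {
    "engine_a": lambda raw: {"words": raw.split()},
    "engine_b": lambda raw: {"words": raw.split("|")},
}

# One handler per target format: COF -> encoded document.
handlers: Dict[str, Callable[[COF], str]] = {
    "format_x": lambda cof: "X:" + " ".join(cof["words"]),
    "format_y": lambda cof: "Y:" + ",".join(cof["words"]),
}

def encode(engine: str, raw_output: str, target: str) -> str:
    cof = converters[engine](raw_output)  # conversion stage
    return handlers[target](cof)          # encoding stage

print(encode("engine_b", "digital|book", "format_x"))  # X:digital book
```

Adding a third engine means registering one more converter; every existing target format becomes available to it without further work.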
Common OCR Format (COF)
Captures necessary image-on-text document information
Inspired by DjVuXML and DAFS (Document Attribute Format Specification)
XML-compliant – simple integration
[Diagram: COF element hierarchy with the elements Document, Map, Preference, Metadata, Page, Image, Text, Area, PageColumn, Region, Paragraph, Line, Word, Character]
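To give a feel for what an XML document built from these elements might look like, the sketch below assembles a small tree with Python's standard library. The element names come from the slide, but the nesting order, the attributes (`number`, `src`) and the file name are assumptions made for illustration; the actual COF schema is not given in the talk.

```python
import xml.etree.ElementTree as ET

def build_cof(lines):
    """Build a COF-like tree; nesting and attributes are assumed, not specified."""
    doc = ET.Element("Document")
    page = ET.SubElement(doc, "Page", number="1")
    ET.SubElement(page, "Image", src="page-0001.tif")  # hypothetical file name
    text = ET.SubElement(page, "Text")
    para = ET.SubElement(text, "Paragraph")
    for line_words in lines:
        line = ET.SubElement(para, "Line")
        for w in line_words:
            ET.SubElement(line, "Word").text = w
    return doc

doc = build_cof([["Digital", "Library"], ["Projects"]])
xml_text = ET.tostring(doc, encoding="unicode")
words = [w.text for w in doc.iter("Word")]

print(xml_text)
print(words)  # ['Digital', 'Library', 'Projects']
```

Because the result is plain XML, any standard parser can read it back, which is the "simple integration" point made on the slide.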
Implementation
OCR Converter for Automatic Reader:
– Supports 18 Latin languages, Arabic, and Persian
– Features font learning capabilities
Format Handlers:
– DjVu:
• MRC imaging model: high-quality/low-file-size image compression from AT&T Labs
• Implemented around DjVuLibre and LizardTech’s Document Express
– PDF:
• Widely-used PostScript-like Portable Document Format from Adobe
• Implemented in Java based on iText
UDBE Performance
Progress
A database for the books, metadata and status has been designed and implemented.
The complete cycle of the workflow for producing digital books has been automated, and integrated with the ILS.
This work has been extended to accommodate other types of materials including slides, maps, images, audio and video.
DAR
Digital Assets Repository
Goals
Automation of the digitization process
Integrating the actual content and metadata of varieties of object types into one homogeneous repository
Preservation and archiving of digital media produced by the Digital Lab or acquired by the Library in digital format
Enhancing the interoperability and seamless access to the Library digital assets
Standards
Digital objects descriptive metadata
– VRA Core Categories
– MARC 21
Metadata presentation
– XML
– MARC format
– Dublin Core
Content dissemination
– OAI-PMH
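Content dissemination over OAI-PMH works through plain HTTP requests whose query string names a verb and a metadata format. The sketch below builds a `ListRecords` request asking for Dublin Core (`oai_dc`) records. The verb and parameter names are those of the OAI-PMH protocol; the endpoint URL and the helper function are hypothetical, and nothing here reflects the actual address or configuration of the BA gateway.

```python
from urllib.parse import urlencode

BASE_URL = "https://repository.example.org/oai"  # hypothetical endpoint

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by date
    if set_spec:
        params["set"] = set_spec    # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"

url = list_records_url(BASE_URL, from_date="2005-07-01")
print(url)
```

A harvester would fetch this URL and read the XML response, repeating the request with the returned resumption token until the list is complete.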
System Architecture
[Diagram. Components: DAF/DAK APIs; Digital Assets Keeper (DAK); Repository Database; Authentication and Authorization Subsystem; Users/groups/permissions Database; Storage Subsystem; Offline Storage; Online Storage; Integrated Library System; Catalog Database; User Interface; Administration Tool; Digitization Client; Archiving Tool; Cataloging Tool; Publishing Interface; OAI Gateway; Digital Assets Factory (DAF); Digitization Database; Encoding Tool]
Progress
DAF has been fully deployed since March 2004 for books
In January 2005, support for images and other material was introduced.
The DAK first version was deployed in July 2005, with some parts still in the beta version.
A publishing tool has been implemented with a special viewer for digitized assets, and a viewer for books using image-on-text technology.
The Digital Modern History of Egypt
Gamal Abdel Nasser Collection
Nasser – Objectives
Digitize and publish the collection of the eminent Arab and Egyptian president Gamal Abdel Nasser
Provide online access to his collection through a web based system mainly intended for research purposes and documentation
Nasser – Collection
Documents published by the Public Records Office, London, UK (53,000+ pages)
Documents published by the United States Department of State (30,000+ pages)
Over 1,300 speeches, audio and printed
Over 51,000 photos and 1,000 portraits
More than 1,000 videos (50+ hours)
A complete archive of the articles published in the newspapers
The decrees issued by the Revolutionary Command Council (RCC)
The daily news of the President
Nasser – Collection
Minutes of the Central Committee for Arab Socialist Union (ASU)
140+ handwritten documents with 593 papers
A complete archive of the "Bisaraha" articles by Mohammed Hassanein Haikal
Caricature, stamps, coins and plastic arts illustrations
Books written by and about Nasser
More than 1,200 national songs
Over 130 poems
Nasser
The entire collection has been digitized
Database designed and populated with the digital objects and their metadata
Backend applications
– Managing the contents
– Categorization
– Adding and refining descriptions
– Adding keywords
Integration of all the different information sources and media under a single interface
Front end
– A web based interface
– Full text Arabic and English search engine
Nasser – Website
Description De L’Egypte
Description de l’Egypte
The work includes
– 11 plates volumes (950+ pages)
– 9 text volumes (7500+ pages)
– Index book
The volumes recorded
– Antiquities
– Modern state
– Natural history
They described cities, buildings, temples, monuments, arts, animals, plants, minerals, society, etc.
Digitization
The complete volumes of plates and text have been fully digitized.
Processing
Virtual Browser
The whole collection has been integrated on a virtual browser and made accessible to the public.
First Release
Provide the collection on DVD, in both English and French languages, for the public and for researchers
A relation established between text and images in a searchable form
Published with two versions of pictures
– Low resolution for quick browsing
– High resolution for zooming with dynamic loading
Digitizing of the Botroseyya Collection
Botroseyya – Overview
This project aims at digitizing the documents pertaining to the Botros Ghaly family
The family has saved a large number of documents related to its political role since the late 1800’s.
The project will attempt to
– digitize the entire multilingual (Arabic, English, French, German, Italian and Turkish) collection, and
– provide it in searchable form for historians, politicians and researchers.
Digitization of Mohamed Mahmoud Pasha Collection
Mohamed Mahmoud Pasha Collection – Overview
This project aims at digitizing the documents pertaining to Mohamed Mahmoud Pasha, one of the most famous Egyptian Prime Ministers
The project will attempt to
– digitize the entire collection of rich and rare historical documents and materials that have never been published before, and
– provide it in searchable form for historians, politicians and researchers.
Al Hilal Digital Collection
Al Hilal – Overview
This project aims to publish an exhaustive digital copy of the issues of Al-Hilal since its first publication in 1892
Al-Hilal is considered the oldest continuously published cultural journal in the Arab world, and the only regular journal that has been issued for more than a hundred years
It had a marked effect on the history of the Arab world in general and the history of Egypt in particular
It played a leading role in modernizing Arab intellectual thinking, and opened new collaborations towards the cultural evolution
Al Hilal – Progress
The volumes of years 1 to 50 were completely scanned, processed and indexed (about 51,000 pages).
An application has been implemented for browsing through the digital copies with searching facilities. The hierarchy for titles and subtitles helps users select the desired issues
The issues of each decade are to be compiled on a CD including necessary browsing and searching tools.
OACIS
Online Access to Consolidated Information on Serials
OACIS – Mission
Create a publicly and freely accessible, continuously updated listing of Middle East journals and serials, including those available in print, microform, and online
Improve access to Middle Eastern serials in libraries in the
– United States
– Europe
– Middle East
Make scholarly literature from, and about, the Middle East widely and easily available to scholars around the world
OACIS – Statistics
Holds: 23,000+ unique title records
OACIS – BA contribution
Over 400 records have been uploaded in the database
23 volumes have been digitized
Digitized documents have been integrated into the OACIS system through a digital viewer
Content retrieval web interface for the digitized serials has been developed
Regular update of the OACIS catalog is taking place on a quarterly basis
A mirror site of the system at BA has been set up and released on 25th January 2005 (http://oacis.bibalex.org)
OACIS – Website
OACIS – Digital Viewer
OACIS – Search Contents
Arabic and Middle Eastern Electronic Library (AMEEL)
AMEEL – Overview
Develop an Arabic and Middle Eastern Electronic Library (AMEEL) containing a large collection of significant Middle Eastern resources
Bring together qualified partners to create a Middle East electronic library including:
– Digital representations of traditional materials
– “Born digital” contemporary materials
– A service structure for Inter Library Loan
Building an access portal
Thank You