building a digital library of agricultural documents … informatics lab, iit-bombay building a...

11
Developmental Informatics Lab, IIT-Bombay Building A Digital Library of Agricultural Documents Using Open Source Software Chaitra Bahuman Research Assistant Developmental Informatics Lab IIT-Bombay Anil Bahuman Project Manager Developmental Informatics Lab IIT-Bombay Siddharth Nair Research Associate Developmental Informatics Lab IIT-Bombay 1. Document Scope and Purpose This paper provides an overview of the implementation of an online digital library for Krishi Vigyan Kendras of the Indian Council of Agricultural Research. The library includes a collection of crop diseases (Crop Doctor), a collection of crop recommendations from agricultural universities (Crop Reco) and a collection of translated aAQUA threads (Agro-Explorer) using an open source 1 digital library software called Greenstone. Sections 1-5 describe the information architecture and other details of the library software. Section 6 describes how each of the libraries can be created, edited and managed by a librarian with basic computer skills and examples of metadata 2 used in building these collections. (The library can be accessed at http://www.mlasia.iitb.ac.in/gsdl/cgi-bin/library for reference). 2. Introduction The Internet is slowly emerging as a low-cost medium for information dissemination with the computerization in Indian government organizations such as KVKs. This new medium has embraced Indian Language content with various vernacular websites now becoming more and more common. The spread is enabling a shift from traditional communication patterns between organizations and individuals. For the first time it has become possible to cater to the needs of individual farmers over a large geographic area allowing creation and instantaneous dissemination of research results, recommendations, diagnosis, epidemic alerts etc. Digitization of academic and research papers, books, journals etc are being taken up on a large scale to make them widely available. Digital content repositories can also be shared within an organization to capture both explicit and implicit knowledge in a local network called an Intranet. Most organizations are creating websites with information regarding their missions, objectives, staff and their activities. Many such sites provide useful knowledge resources scanned from printed material or typed in HTML documents (HTML is a format used to compose web documents). Web documents are catalogued and made searchable by very powerful search engines such as Google and Yahoo. The limitation is that the lay user typically is presented with a large number of results (hits) of which only a few could be relevant. There is a need for a more focused and organized collection of web content that is easily navigable, supporting various digital data formats such as text, photographs (images), audio and video. Digital libraries are easier to maintain since the chances of having broken links (links that refer to pages that no longer exist) is less than that of websites with standard HTML pages. This is because the links are created by the digital library software and not manually inserted. The Digital Library project at the Developmental Informatics Lab in IIT Bombay was initiated to create a repository of agricultural documents supporting various file formats and Indian 1 Software that is available with the original source code and can be modified and distributed by others 2 "Data about data", that describes the content, quality, condition, and other characteristics of data. Metadata is vital in helping users to find needed data similar to an index provided at the end of a book.

Upload: dinhdang

Post on 14-Mar-2018

216 views

Category:

Documents


2 download

TRANSCRIPT

Developmental Informatics Lab, IIT-Bombay

Building A Digital Library of Agricultural Documents Using Open Source Software

Chaitra Bahuman Research Assistant Developmental Informatics Lab IIT-Bombay

Anil Bahuman Project Manager Developmental Informatics Lab IIT-Bombay

Siddharth Nair Research Associate Developmental Informatics Lab IIT-Bombay

1. Document Scope and Purpose This paper provides an overview of the implementation of an online digital library for Krishi Vigyan Kendras of the Indian Council of Agricultural Research. The library includes a collection of crop diseases (Crop Doctor), a collection of crop recommendations from agricultural universities (Crop Reco) and a collection of translated aAQUA threads (Agro-Explorer) using an open source1 digital library software called Greenstone. Sections 1-5 describe the information architecture and other details of the library software. Section 6 describes how each of the libraries can be created, edited and managed by a librarian with basic computer skills and examples of metadata2 used in building these collections. (The library can be accessed at http://www.mlasia.iitb.ac.in/gsdl/cgi-bin/library for reference).

2. Introduction The Internet is slowly emerging as a low-cost medium for information dissemination with the computerization in Indian government organizations such as KVKs. This new medium has embraced Indian Language content with various vernacular websites now becoming more and more common. The spread is enabling a shift from traditional communication patterns between organizations and individuals. For the first time it has become possible to cater to the needs of individual farmers over a large geographic area allowing creation and instantaneous dissemination of research results, recommendations, diagnosis, epidemic alerts etc.

Digitization of academic and research papers, books, journals etc are being taken up on a large scale to make them widely available. Digital content repositories can also be shared within an organization to capture both explicit and implicit knowledge in a local network called an Intranet.

Most organizations are creating websites with information regarding their missions, objectives, staff and their activities. Many such sites provide useful knowledge resources scanned from printed material or typed in HTML documents (HTML is a format used to compose web documents). Web documents are catalogued and made searchable by very powerful search engines such as Google and Yahoo. The limitation is that the lay user typically is presented with a large number of results (hits) of which only a few could be relevant. There is a need for a more focused and organized collection of web content that is easily navigable, supporting various digital data formats such as text, photographs (images), audio and video.

Digital libraries are easier to maintain since the chances of having broken links (links that refer to pages that no longer exist) is less than that of websites with standard HTML pages. This is because the links are created by the digital library software and not manually inserted.

The Digital Library project at the Developmental Informatics Lab in IIT Bombay was initiated to create a repository of agricultural documents supporting various file formats and Indian

1 Software that is available with the original source code and can be modified and distributed by others 2 "Data about data", that describes the content, quality, condition, and other characteristics of data. Metadata is vital in helping users to find needed data similar to an index provided at the end of a book.

Developmental Informatics Lab, IIT-Bombay

Languages. The content has been provided by Krishi Vigyan Kendra, Baramati, who have collected recommendations from agricultural universities, residential experts, journal articles, disease forecast reports etc.

What is a Digital Library (DL)? Some definitions: “A digital library is an organized and focused collection of digital objects, including text, images, video and audio, along with methods for access and retrieval, and for selection, creation, organization, and maintenance of the collection.” Ian Witten et al. (Univ. of Waitako, NZ) “A Digital Library is a managed collection of information, with associated services, where the information is stored in digital formats and accessible over a network. A crucial part of this definition is that the information is managed. A stream of data sent to earth from a satellite is not a library. The same data when organized systematically becomes a digital library.” William Arms (Univ. of Cornell)

What can Digital Libraries do, that traditional library cannot? • They provide a time and place independent information service to the user/student.

• They are more than just a collection of hyperlinks on a web site. They form a comprehensive storehouse of data that can be indexed and searched, all in one place.

• Corporate organizations and public bodies possess knowledge that can be reused for in-house training. DLs have emerged as a powerful knowledge management tool in such scenarios.

• Useful in distance education where issuing books is a challenge.

• Digital libraries are able to integrate freely available info on the web.

• Preserve and archive digital works or scanned documents, manuscripts, paintings etc. over a long time. An open source digital library software known as “Greenstone” was investigated for indexing, building and distributing digital collections of Agricultural content on the Internet or on CD-ROM.

The collection on crop diseases (Crop Doctor) aims at providing crop diagnostics via images. The current collection contains around 60 images of diseased crops with details on symptoms, causative agents, preventive measures and other details. The collection on crop recommendations contains about 35 plus documents with relevant recommendations from various agricultural universities across the state of Maharashtra compiled by KVK experts. The collection of translated threads on aAQUA contains documents that have been translated into Hindi and English by the Translation group at CFILT, IIT-Bombay, after correcting the grammar and spelling mistakes in the original Marathi thread. The collection of translated documents contains about 38 documents each in Hindi, English and Marathi.

3. Why Greenstone? Greenstone was used in building the content repository after doing a comparative study of 3 open source digital library softwares – DSpace, EPrints, and Greenstone (Refer Table 1).

Features DSpace Greenstone EPrints Open Source Yes Yes Yes Unicode Compliant

Yes Yes Yes

Hindi Interface (Existing)

No Yes No

Developmental Informatics Lab, IIT-Bombay

Built in HTML viewers

No Yes No

Size 6 MB 31 MB 2 MB Server Unix Only Unix/Windows Unix Only Offline viewing/Browsing

No No No

The features supported by Greenstone that were found useful are as follows: 1. Support for digital formats and Content types: Plugins distributed with Greenstone process plain text, HTML, WORD and PDF documents, Power Point presentations, images, spreadsheets, E-mail messages, etc. New plugins can be written for different document types. 2. Access and Navigation: The users access the libraries via a feature rich iconic interface. The collections are listed on one interface and the user can navigate easily through the collections. The user also has the option of changing his/her language, search and presentation preferences through the web-based interface. The documents are uploaded through a Librarian interface GUI that allows the librarian to assign metadata, choose plugins, assign classifiers for browsing indexes, etc. The documents can be grouped by titles, authors, date, organizations, etc. 3. Search and Retrieval: Greenstone uses MG (Managing Gigabytes) for compression, indexing and searching the textual information in the collections. It has a facility for “cross-collection searching,” which allows several collections to be searched at once, with the results combined behind the scenes as though you were searching a single unified collection. Any of the open collections can be searched: the Search Preferences page allows the user to choose the collections to be included in the searches. 4. Technology Platform: The system is designed to run on UNIX, Windows, and Mac OS/X servers and comprises other open source middleware and tools. The code is in the Java and C++ programming language. All versions of Greenstone use the Gnu Database Manager, GDBM. It is supplied with all Windows versions of Greenstone and installed automatically during the installation procedure. It also uses a Web server (Apache from the Apache Foundation) and other useful libraries. [The windows installation comes in 2 flavors – the local library version with a built in web server that can be used for sharing the library on the intranet and the web library version for sharing on the Internet] 5. Compliance to standards and protocols: Unicode is used throughout Greenstone. This allows any language to be processed and displayed in a consistent manner. Collections have been built containing Arabic, Chinese, English, French, Maori, Hindi, Spanish, etc. Besides, there are standard templates available in Indian Languages like Hindi, Kannada, Gujarati, etc. Greenstone also follows the OAI Protocol for metadata Harvesting and the documents are tagged with metadata that follow the Dublin Core standard. It also adheres to the Z39.50 Information Retrieval protocols.

The library can be saved on the computer and viewed even when not connected to the internet and also distributed on CDs. Various formats are supported (such as Adobe PDF, MS Power Point, MS Word etc.) and can be viewed in the library even without having the applications installed on the computer (i.e. Acrobat Reader, MS Office are not necessary).

Table 1 - Comparison matrix of various digital library softwares

Developmental Informatics Lab, IIT-Bombay

4. Examples

5. Architecture The following diagrams (Figs 1 & 2) show the Architecture of Greenstone library system

The documents are imported into the XML-compliant Greenstone Archive Format. Then the archive files are built into various searchable indexes and a collection information database that includes the

a) UNESCO - www.unesco.org b) Human Info NGO - www.humaninfo.org

c) The Indian Labour Archives - www.indialabourarchives.org

d) IIM- Kozhikode - http://intranet.iimk.ac.in/gsdl/cgi-bin/library

Fig. 1 - Overview of the Greenstone system

Fig. 2 - Overview of the Runtime systemFig.1 - Overview of the Greenstone system

Developmental Informatics Lab, IIT-Bombay

hierarchical structures that support browsing are built. This done, the collection is ready to go online and respond to requests for information.

6. Building Digital Collections The collections were built using the Graphical Librarian Interface (GLI), bundled along with the Greenstone digital library software. (See Fig. 3) This interface allows the librarian (content builder) to build collections of different file types viz., doc, PDF, post script (.ps files), rtf, jpg, png, etc. The collection building process occurs in the following steps: 1. Documents are gathered. 2. Metadata are assigned to the gathered documents. 3. The plugins necessary to process the document types are included. 4. The search types are assigned and the search indexes are assigned from the metadata set used in the collection. 5. Browsing classifiers are assigned to the document set. These classifiers help the user browse through the documents based on the different parameters assigned to the classifier. 6. The page layout of the document is designed using the format tags provided by Greenstone.

6.1. Crop Doctor : The initial collection sent by KVK contained around 60 images of diseased crops. After a meeting with the KVK experts, it was decided that the experts would send in the following details for each image of a diseased crop: 1. Commodity Name 2. Disease Name 3. Causative Factors 4. Symptoms 5. Pre-disposable Factors 6. Prevention 7. Control Measures 8. Organic/Natural Control methods (if any) 9. Description about the pest/disease 10. General Information 11. Alternative/Extra Images In this collection, the images have been tagged with metadata (mentioned above) and these metadata serve as the content that goes along with the images. Fig 4 shows a sample image of Tomato infected with Helecovherpa (fruit borer) insect and the metadata tagged to the image viz. measures undertaken to prevent and control the pest and some general information about the pest and an image describing the pest’s life cycle.

Fig.3 –The Greenstone Librarian’s Interface (GLI)

Developmental Informatics Lab, IIT-Bombay

The pages are constructed on the fly by Greenstone, by integrating the metadata in the order specified by the format tags. The page generated by Greenstone for the above sample image would appear as shown in Fig.5

The interface allows the user to search for documents, browse through all the documents, browse by commodity and diseases and a special section on biological control of pests and diseases. Main Page:

Fig. 4 – A sample image of tomato infected by the insect helecovherpa and the various metadata assigned to the image.

Fig. 5 –Document generated by Greenstone using the metadata shown in Fig.4

Developmental Informatics Lab, IIT-Bombay

Search View:

Browse View:

Browse by Commodity:

Browse by Disease:

Fig. 6 – Options available on the main page of Crop Doctor

Fig. 9 – Browse by commodities page showing the list of available commodities on top and the diseases infecting the particular commodity listed at the next level.

Fig.10– Browse by diseases page showing the list of all diseases at level 1 and the commodities which are affected by the particular disease at level 2

Fig. 8 – Browse Page showing all the images in the collection

Fig. 7 – A sample page showing search query results

Developmental Informatics Lab, IIT-Bombay

6.2. Crop recommendations: The collection on crop recommendations contains about 35 documents, sent by KVK, containing the relevant recommendations from various agricultural universities across the state of Maharashtra collected by KVK experts The documents in this collection are grouped by the following categories: 1. Vegetables 2. Cereals and Pulses 3. Fruits 4. Flowers 5. Others The entire document is imported into the Greenstone system, unlike Crop Doctor where the page content is supplied as metadata. The interface allows the user to search for documents, browse through all the documents and browse by different categories (vegetables, cereals, etc). Main Page:

Browse Page: Fig.13 shows the browse view where all the commodities, for which recommendations are available in the collection, are listed in alphabetical order.

Biological control:

Fig.13 - Browse through all the available recommendations

Fig. 12 – A snapshot of the main page showing the available options

Fig. 11 – A special section on Biological control of pests and diseases

Developmental Informatics Lab, IIT-Bombay

6.3. Collection of Translated Documents: The translation group at CFILT has translated a few threads from aAQUA (Question Answer service run by Developmental Informatics Lab with support from KVK Baramati) into Hindi and English. A collection of these translated documents has been developed and deployed. Each document has been attached with the following metadata: 1. Subject/ Title of the thread 2. Commodity Name 3. Language The interface allows the user to search for documents, browse by subject, commodity and language. Main Page:

Browse by vegetables, cereals and pulses, flowers, fruits:

Fig. 18 – A snapshot of the browsing options available on the collection of translated documents

Fig. 14 – Browse through recommendations on vegetables

Fig. 15 – Browse through recommendations on cereals and pulses

Fig. 16 – Browse through recommendations on fruits

Fig. 17 – Browse through recommendations on flowers

Developmental Informatics Lab, IIT-Bombay

This collection has been linked to the keyword search on Agro Explorer (A multilingual meaning based search engine developed by Centre for Indian Language Technology-IIT-Bombay). Any keyword submitted to Agro Explorer is passed to the DL and the search is executed on the collection of translated documents and displayed.

7. Challenges and Future Directions In the currently implemented system, the authors do not provide all the metadata required by the library. This being important for the extensibility and searchablity of the library, requires more attention. One solution is providing a site for authors to upload content and also attach required metadata entries with suggestions on choosing appropriate metadata. A lot of content is being generated by aAQUA- a multilingual online question and answer forum and portal for farmers. Almost All Questions Answered (aAQUA) provides answers to questions asked over the internet and has been deployed successfully since December 2003. The farmers in Pune District and the vicinity have sent more than 1000 questions so far with total number of posts exceeding 2500 answered by KVK Baramati. The digital library can also be accessed from aAQUA.

Browse by subjects:

Browse by Commodities:

Browse by language:

Fig. 20 – A snapshot of the browse commodities page with the commodity listed at top level and the relevant threads grouped by subjects and listed in all 3 languages

Fig. 19 – A snapshot of the browse by subjects page with the subjects (in all 3 languages) at the top level and the corresponding documents in all 3 languages at the next level

Fig. 21 – A snapshot of the browse by language page with the languages at top level and the relevant threads in the chosen language at the next level

Developmental Informatics Lab, IIT-Bombay

The challenges in organizing the aAQUA content are many. They arise due to improper grammar and spelling, requiring corrections, manual choice of keyword meta-text. Some Marathi or Hindi questions are written using English alphabets and need to be rewritten. The solution mentioned in the previous paragraph of encouraging the author (here farmer or computer operator on his behalf) to provide relevant keywords and meta-text may not be practical. This will need separate resources for cleaning, organizing and identifying good metadata. Currently a user who is bilingual, searching the digital library will have to search separately in each of the languages. We are working on adding a multilingual dictionary to enable a user to search documents across languages e.g. on searching on “agriculture”, the digital library will be able to also search using the Hindi and Marathi words for “agriculture”. This will allow a person fluent in these languages to find all matching documents.