Page 1: Digital Libraries, Archives, and Large Data Sets Alexa T. McCray National Library of Medicine Bethesda, Maryland USA mccray@nlm.nih.gov WHOI, June 3, 2004

Digital Libraries, Archives, and Large Data Sets

Alexa T. McCray

National Library of Medicine

Bethesda, Maryland USA

mccray@nlm.nih.gov

WHOI, June 3, 2004

Page 2

What is a digital library?

“… an electronic information access system that offers the user a coherent view of an organized, selected, and managed body of information.” (Lynch, 1995)

An organization that provides the resources “… to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use…” (Waters, 1998)

Page 3

Conceptual Model of a Digital Library

[Diagram labels: Content Creators; Data Creation; Data Capture, Management, and Preservation; Data Access; Users]

Page 4

Long Term Preservation & Archiving

•OAIS (Open Archival Information System) standard

-Developed by NASA for long term preservation, archiving, data management, and access

• Both digital and physical archives

•Address impacts of changing technology

-New media and data formats

-Changing user community

Page 5

OAIS

•Framework for data management

•Functional model for

-Preservation planning

-Data management

-Archival storage

-Persistent access

Page 6

Digital Libraries Initiative

•Research initiative led by the National Science Foundation in collaboration with a number of other Federal agencies

•Research goal is to investigate improved methods for creating, managing and accessing large information resources and repositories

Page 7

Research Foci

•Content and Collections

•Systems-centered digital library research

•Human-centered digital library research

•Testbeds and Applications

Page 8

Content and Collections

•Data capture, representation, preservation

•Metadata

•Domain specific information objects

• Intellectual property rights

•New economic and business models for digital libraries

Page 9

Systems-centered Research

•Open, networked architectures

•System scalability

• Intelligent agents

•Systems evaluation and performance studies

•Data compression

•Authentication

Page 10

Human-centered Research

• Information discovery and retrieval methods

• Intelligent user interfaces

• Information visualization

•User and usability studies

•Social implications of digital libraries

Page 11

Testbeds and Applications

•Specialized tools, e.g., for

-Document mark up

-Metadata encoding

•Specialized applications for specific domains

•Allow development of new methods for knowledge discovery and data mining

Page 12

The Vocabulary Problem

•Same string, different meaning

•Different string, same meaning

•Different string, similar meaning

-Unrecognized relationship

• Implicit conventions

• Implicit hierarchies

-Variety of relationships

Page 13

Unified Medical Language System (UMLS)

•Long term National Library of Medicine project

•Problem the UMLS is attempting to solve:

-Provide integrated access to biomedical information in disparate biomedical information systems

• Bibliographic, factual databases, decision support systems, knowledge-based systems

Page 14

UMLS Knowledge Sources

•Metathesaurus

-Large number of biomedical concepts

•SPECIALIST Lexicon

-General English and biomedical lexical items, tools for recognizing linguistic variation

•Semantic Network

-Conceptual framework for the UMLS

Page 15

Metathesaurus

•Metathesaurus

-Over one million concepts; 90 families of vocabularies

-Broad coverage of the vocabulary used in the biomedical sciences

• Basic science research

• Clinical medicine

• Health services

Page 16

Broad Coverage of Biomedicine

• Several perspectives

- clinical terms (SNOMED)

- information sciences (MeSH, CRISP)

- administrative terminologies (ICD-CM, CPT-4)

• Specialized vocabularies

- genomics (Gene Ontology, NCBI organism taxonomy)

- medical devices (UMD)

- anatomy (UWDA, Neuronames)

Page 17

From the Vocabularies to the Metathesaurus

•Vocabularies

- terms

-hierarchies

•Metathesaurus

-organizes terms

-organizes concepts

-Relates concepts to other concepts

•Metathesaurus = Thesaurus of Thesauri

Page 18

Common UMLS Representation

•One concept, multiple terms and strings

- renal cell carcinoma

• CUI: C0007134, LUI: L0007134, SUI: S0425056

- renal cell carcinomas

• CUI: C0007134, LUI: L0007134, SUI: S0081526

- hypernephroma

• CUI: C0007134, LUI: L0020489, SUI: S0420320

- Grawitz tumor

• CUI: C0007134, LUI: L0018219, SUI: S0375417
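
The four strings on this slide can be grouped by their identifiers to make the three levels visible: one concept (CUI), three terms (LUI), four strings (SUI). A minimal Python sketch using only the identifiers listed on the slide; the tuple layout and the function name are illustrative assumptions, not the Metathesaurus file format:

```python
from collections import defaultdict

# (string, CUI, LUI, SUI) rows copied from the slide.
ROWS = [
    ("renal cell carcinoma",  "C0007134", "L0007134", "S0425056"),
    ("renal cell carcinomas", "C0007134", "L0007134", "S0081526"),
    ("hypernephroma",         "C0007134", "L0020489", "S0420320"),
    ("Grawitz tumor",         "C0007134", "L0018219", "S0375417"),
]

def group_by_concept(rows):
    """Nest strings under their term (LUI), and terms under their concept (CUI)."""
    concepts = defaultdict(lambda: defaultdict(list))
    for string, cui, lui, sui in rows:
        concepts[cui][lui].append((sui, string))
    return concepts

concepts = group_by_concept(ROWS)
for cui, terms in concepts.items():
    print(cui)
    for lui, strings in terms.items():
        print("  ", lui, [s for _, s in strings])
```

The singular and plural forms share a LUI because they are lexical variants of one term, while the synonyms each get their own LUI under the same CUI.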

Page 19

Lexical Tools

•Manage lexical variation

-Perform lexical transformations

• Generate inflectional variants, normalized forms

•Depend on the SPECIALIST Lexicon

•Used for preliminary algorithmic mapping as new vocabularies are added to the Metathesaurus
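
To illustrate what a normalized form buys during preliminary mapping, here is a deliberately crude Python sketch. It is not the SPECIALIST lexical tools: the function name `toy_norm` and its rules (lowercase, drop punctuation, strip a trailing plural "s", sort the words) are assumptions made for illustration, and a real system would consult a lexicon rather than strip suffixes.

```python
import re

def toy_norm(term: str) -> str:
    """Crude normalized form: lowercase, drop punctuation, strip a plural 's', sort words."""
    words = re.sub(r"[^a-z0-9]+", " ", term.lower()).split()
    stems = []
    for w in words:
        # Naive uninflection; a lexicon-based tool would handle irregular forms.
        if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
            w = w[:-1]
        stems.append(w)
    return " ".join(sorted(stems))

variants = ["Renal cell carcinoma", "renal cell carcinomas", "Carcinoma, Renal Cell"]
print({toy_norm(v) for v in variants})
```

The three variants collapse to a single key, so an incoming vocabulary's strings can be matched against existing terms before human review. Synonyms such as "hypernephroma" do not match lexically and still need a conceptual link.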

Page 20

Digital Library Case Study: ClinicalTrials.gov

•Centralized system at NLM

-Content provided by individual data providers, both federal and from the private sector

•Standard set of data elements in XML (eXtensible Markup Language) format

-Summary; recruitment information; eligibility criteria; study design; intervention being studied; location and contact information
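
A standard set of XML data elements lets a central system check each submitted record mechanically. The Python sketch below shows the idea only: the element names and the sample record are hypothetical and are not the ClinicalTrials.gov schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names are invented for illustration and
# are not the ClinicalTrials.gov schema.
RECORD = """
<study>
  <summary>Example study of a hypothetical intervention.</summary>
  <recruitment_status>Recruiting</recruitment_status>
  <eligibility>Adults 18 years and older</eligibility>
  <study_design>Randomized, double-blind</study_design>
  <intervention>Example drug</intervention>
  <location>Example Clinic, Bethesda, Maryland</location>
</study>
"""

REQUIRED = ["summary", "recruitment_status", "eligibility",
            "study_design", "intervention", "location"]

def missing_elements(xml_text, required=REQUIRED):
    """Return the required element names that are absent or empty in the record."""
    root = ET.fromstring(xml_text)
    return [name for name in required
            if not (root.findtext(name) or "").strip()]

print(missing_elements(RECORD))
```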

Page 21

ClinicalTrials.gov

Page 22

ClinicalTrials.gov

Page 23

System Architecture: ClinicalTrials.gov

Page 24

Digital Library Case Study: Profiles in Science

•Large scale digital conversion project

•Archival collections of eminent biomedical scientists of the twentieth century

-Books, journal volumes, pamphlets, diaries, letters, manuscripts, photographs

•Materials in a variety of formats

-Text, audio, still images, video

•Testbed for experiments in digital preservation

Page 25

Profiles in Science

Page 26

Profiles in Science

Page 27

Metadata-driven Document Conversion

• Interpret metadata in broadest sense

•Use metadata to drive the entire system

•Metadata record is the basic unit in the system, managing the

-Digitization process

-Display and organization of the data

-Network-based resource discovery

-Archiving and Preservation

Page 28

Metadata: Framework for Collection Management

•Metadata entry system manages all aspects of digitization process

-Unique identifiers bind digital master files, Web-derivatives, and metadata records

-Enforces quality control (pull-down menus, validation, error messages)

-Reports that manage workflow

-Security measures
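
One way to picture the identifier binding and entry-time quality control is a record type that derives every file name from a single identifier and reports validation errors. This Python sketch is an assumption-laden illustration: the field names, the controlled list, and the `.tif`/`.jpg` naming are hypothetical and do not describe the actual Profiles in Science system.

```python
from dataclasses import dataclass

# Hypothetical controlled list standing in for a pull-down menu.
RESOURCE_TYPES = {"text", "audio", "still image", "video"}

@dataclass
class MetadataRecord:
    identifier: str      # hypothetical unique ID shared by all related files
    title: str
    resource_type: str

    def validate(self):
        """Return a list of error messages; an empty list means the record passes."""
        errors = []
        if not self.identifier:
            errors.append("identifier is required")
        if not self.title.strip():
            errors.append("title is required")
        if self.resource_type not in RESOURCE_TYPES:
            errors.append(f"unknown resource type: {self.resource_type!r}")
        return errors

    def file_names(self):
        """Derive master and Web-derivative names from the one identifier."""
        return {"master": f"{self.identifier}.tif",
                "web": f"{self.identifier}.jpg"}

rec = MetadataRecord("ABC123", "Letter to a colleague", "text")
print(rec.validate(), rec.file_names())
```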

Page 29

Metadata: Display and Organization of the Data

•Series of programs generate Web pages from metadata database

- Include consistency checking, validation

•Programs generate alternative views

-alphabetical, chronological, resource type, content area
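
Generating alternative views amounts to sorting or grouping the same metadata rows by different fields. A minimal Python sketch, assuming hypothetical records and field names (`title`, `date`, `type`) that are invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical metadata rows; field names and values are invented for illustration.
RECORDS = [
    {"title": "Notebook entry", "date": "1952-03-01", "type": "text"},
    {"title": "Award lecture",  "date": "1960-11-15", "type": "audio"},
    {"title": "Lab portrait",   "date": "1948-07-09", "type": "still image"},
    {"title": "Draft article",  "date": "1955-01-20", "type": "text"},
]

def alphabetical(records):
    return sorted(records, key=lambda r: r["title"].lower())

def chronological(records):
    # ISO 8601 dates sort correctly as plain strings.
    return sorted(records, key=itemgetter("date"))

def by_resource_type(records):
    # groupby only merges adjacent items, so sort by the grouping key first.
    ordered = sorted(records, key=itemgetter("type"))
    return {k: [r["title"] for r in g]
            for k, g in groupby(ordered, key=itemgetter("type"))}

print([r["title"] for r in alphabetical(RECORDS)])
print([r["title"] for r in chronological(RECORDS)])
print(by_resource_type(RECORDS))
```

Because every view is computed from the database rather than edited by hand, a corrected metadata record propagates to all pages on the next run.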

Page 30

Metadata: Network-based Resource Discovery

•“Dublin Core” metadata elements derived from metadata entry system

- simplicity

- semantic interoperability

- international consensus
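
Deriving Dublin Core from a richer internal record is essentially a field mapping. The Python sketch below assumes hypothetical internal field names and renders HTML `<meta>` tags with the `DC.` name prefix commonly used to embed Dublin Core in Web pages; the element names on the right (`title`, `creator`, `date`, `type`, `identifier`) are Dublin Core elements, everything else is illustrative.

```python
from html import escape

# Hypothetical internal field names mapped to Dublin Core element names.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "date": "date",
    "resource_type": "type",
    "identifier": "identifier",
}

def dublin_core_meta(record):
    """Render HTML <meta> tags for the fields that have a Dublin Core mapping."""
    tags = []
    for field, element in FIELD_TO_DC.items():
        value = record.get(field)
        if value:
            content = escape(value, quote=True)
            tags.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(tags)

record = {"title": 'Notes on "method"', "author": "Example, A.",
          "date": "1955-01-20", "resource_type": "Text", "identifier": "ABC123"}
print(dublin_core_meta(record))
```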

Page 31

Metadata: Ensuring Preservation and Persistence

•Archiving responsibility

•Permanence rating

•Preservation actions

•History of origin

Page 32

Broad Categories of Metadata Elements

•Content specific

•Medium specific

•Process specific

•Storage information

•Physical characteristics

•Preservation/provenance information

Page 33

System Architecture: Profiles in Science

Page 34

Digital Resources at the National Library of Medicine

•Four levels of permanence

-Permanent: unchanging content, e.g., Profiles in Science scanned document

-Permanent: stable content, e.g., MEDLINE record

-Permanent: dynamic content, e.g., NLM home page

-Permanence not guaranteed, e.g., fact sheets
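
The four levels can be carried as a controlled value in each metadata record. A small Python sketch, with the level names and examples taken from the slide; the class and function names are illustrative assumptions:

```python
from enum import Enum

class Permanence(Enum):
    UNCHANGING = "Permanent: unchanging content"
    STABLE = "Permanent: stable content"
    DYNAMIC = "Permanent: dynamic content"
    NOT_GUARANTEED = "Permanence not guaranteed"

# Examples taken from the slide.
EXAMPLES = {
    "Profiles in Science scanned document": Permanence.UNCHANGING,
    "MEDLINE record": Permanence.STABLE,
    "NLM home page": Permanence.DYNAMIC,
    "fact sheet": Permanence.NOT_GUARANTEED,
}

def is_permanent(level):
    """True when the resource itself is rated permanent, whatever happens to its content."""
    return level is not Permanence.NOT_GUARANTEED

for name, level in EXAMPLES.items():
    print(f"{name}: {level.value} (permanent={is_permanent(level)})")
```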

Page 35

Preservation of Digital Information

•“The conclusion reached by the impressive group of 21 experts was alarming – there is, at present no way to guarantee the preservation of digital information.” (Rothenberg, 1999)

•“Technological obsolescence [is] the greatest threat to digital collections.” (Kenney & Rieger, 2000)

Page 36

Preservation Research at the National Library of Medicine

• Image Migration Framework

-Prototype for image conversion, analysis, and preservation

-Associated preservation metadata

-Current experiments converting from one image format to another

Page 37

TIFF to PNG to TIFF
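
A round-trip experiment of this kind can be checked by comparing a digest of the decoded pixel data before and after migration. The Python sketch below shows only the checking logic and is not the Image Migration Framework: `zlib` stands in for a lossless codec, and a real experiment would plug in TIFF and PNG encoders and decoders from an imaging library.

```python
import hashlib
import zlib

def digest(pixels: bytes) -> str:
    """SHA-256 of decoded pixel data, used as a fixity value."""
    return hashlib.sha256(pixels).hexdigest()

def round_trip(pixels: bytes, steps) -> bytes:
    """Apply each (encode, decode) pair in turn and return the final decoded pixels."""
    for encode, decode in steps:
        pixels = decode(encode(pixels))
    return pixels

# Stand-in lossless codec (assumption); replace with real format converters.
lossless = (zlib.compress, zlib.decompress)

original = bytes(range(256)) * 4          # fake 1024-byte pixel buffer
migrated = round_trip(original, [lossless, lossless])
print(digest(original) == digest(migrated))
```

Recording the digest alongside each migration step is one way to populate the associated preservation metadata.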

Page 38

Concluding Remarks

•Digital library data management

-Requires technical decisions

• Adherence to standards, planning for change

- Involves social issues

• Sharing of data and knowledge

• Open access to information

- Implies promises to our users

• Integrity, currency, and persistence of data