Page 1: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Sheila Morrissey, John Meyer, Sushil Bhattarai, Sachin Kurdikar, Jie Ling, Matthew Stoeffler, Umadevi Thanneeru

Page 2: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Portico & JSTOR: Committed to Preserving the Scholarly Record

JATS-CON 2010

ITHAKA

Ithaka helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways.

• Digitization for Preservation & Access
• Digital Preservation
• “Dark Archive”
• “Light Archive”

Page 3: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Portico Archive

• Portico’s objective is to help libraries make a secure and reliable transition from print to a reliance on e-content.

• Maintains archiving agreements with publishers to collect and preserve content.

• Receives content directly from publishers.

• Preserves:
– Current journals (born digital)
– Back file journals (reborn digital)
– E-books
– Digitized historical collections


Page 4: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

An “Insurance Policy” for e-Content

• Provide libraries with access to archived content when it becomes lost, orphaned, or abandoned (regardless of a library’s past or current subscription):
1. Publisher ceases operation
2. Publisher discontinues title
3. Publisher drops back file

• Provide libraries with post-cancellation access, if the publisher specifically names Portico.

• About 90% of titles in the Archive are covered by Portico post-cancellation access rights.

• Libraries are asked to pay an annual Archive support payment to defray the cost of preservation, in effect an “insurance premium”.

Page 5: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Portico Archive as of July 19, 2010

Category                         Files          %
Images                           84,215,731     47.93%
Publisher Supplied Text          47,393,731     26.98%
Portico Created Archival Text    43,689,083     24.87%
Application Specific Files       232,732        0.13%
Multi-file Packages              140,333        0.08%
Videos                           20,604         0.01%
Audio                            570            <0.01%
Executable                       6              <0.01%
Total                            175,692,826    100%

• 114 publisher participants
• 11,788 committed journal titles
• 43,253 committed e-books
• 13 committed digitized collections
• >14 million articles ingested
• 688 library participants (48% outside US)
• 4 trigger events
• 15 post-cancellation access claims


Page 6: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Portico Preservation Infrastructure


• Publisher supplies XML source file (including the text and images) and PDF page rendition.

• Best approach for preserving the intellectual content of the article or book.

• Authenticate: verify that preserved content is what it purports to be.

• Verify format: ensure the file meets the syntactic and semantic rules of the format specification.

• Repair

• Normalize (XML)

• Create preservation metadata

• Assess archival robustness of file format.

• Migrate files to ensure future usability of content.

• Replicate objects and metadata to protect against bit rot and media deterioration

• Render articles to meet viewing requirements of delivery platform.
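Two of the steps above, format verification and replication against bit rot, can be illustrated with a short Python sketch. It is not Portico's tooling: the function names and the choice of SHA-256 are assumptions made for illustration, and the well-formedness check covers only the syntactic half of format verification (checking validity against a DTD would need a validating parser).

```python
import hashlib
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML (syntax only, no DTD validation)."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


def fixity_digest(data: bytes) -> str:
    """Compute a SHA-256 hex digest to store alongside the archived file."""
    return hashlib.sha256(data).hexdigest()


def replica_is_intact(original_digest: str, replica: bytes) -> bool:
    """Compare a replica against the stored digest; any flipped bit changes the digest."""
    return fixity_digest(replica) == original_digest
```

Storing the digest in the preservation metadata lets each replica be rechecked later without access to the original copy.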

Page 7: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Key Challenges for an Archival DTD

In December 2001, Inera’s “E-Journal Archive DTD Feasibility Study” highlighted these key challenges for an archival DTD:

• Use of generated and boilerplate text, especially in
– Label text for figure captions
– Citation text
– Author name and affiliation
– Dates

• Expression of links between author and affiliation
• Reference elements
• Expression of non-article and other content
• Abbreviations and definitions


Page 8: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Key Challenges for an Archival DTD

• Keywords
• Sections, including handling of sections without headers
• Placement of floating objects, such as figures, tables, graphs
• Tables, including cell formatting issues (cells with figures, content alignment, etc.)
• Math
• Intra-, inter- and extra-article linking
• Publisher-specific elements

When reviewing the minutes of the Working Group and the evolution of the DTD, we can confirm that these areas have been the main focus of discussion.


Page 9: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Some Design Constraints

• IMPLIED, not REQUIRED attributes

• CDATA instead of controlled list

• Optional Elements, or relaxed order of elements

• Surprising location of Elements

• No Domain Specific Elements


Page 10: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Publisher/Domain Specific Elements

• Custom-Meta
– Business Data
– Allowed in journal-meta, article-meta, front-stub
– Name/Value pair (may contain 38 different Elements)

• Named-Content
– Semantic Significance
– Allowed in 112 Elements
– May contain 59 different Elements
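For readers unfamiliar with the name/value shape of custom-meta, here is a minimal Python sketch that builds one. The helper name `make_custom_meta` and the sample name and value are hypothetical; only the element names custom-meta, meta-name, and meta-value come from the tag set.

```python
import xml.etree.ElementTree as ET


def make_custom_meta(name: str, value: str) -> ET.Element:
    """Build <custom-meta> holding one <meta-name>/<meta-value> pair."""
    custom = ET.Element("custom-meta")
    ET.SubElement(custom, "meta-name").text = name
    ET.SubElement(custom, "meta-value").text = value
    return custom


# Hypothetical business datum carried over from a publisher's source file.
element = make_custom_meta("publisher-batch-id", "B-0042")
print(ET.tostring(element, encoding="unicode"))
```

The sketch uses plain-text values only; since the slide notes that the pair may contain many different elements, a real conversion would have to carry child markup as well.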


Page 11: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Challenges posed by source DTDs

Extended Semantics for Named-Content

• Price in Citation
– Becomes <named-content content-type="price">

<citation reference="1" id="R1" type="serial">
  <author order="1">
    <name><first>S. P.</first><last>Morgan</last></name>
  </author>
  <journal>
    <sertitle>J. Appl. Phys.</sertitle>
    <URI type="ISSN">0030-3941</URI>
    <price>$01.00</price>
    <volume>29</volume>
    <pages><first>1358</first><last>1368</last></pages>
    <pubdate>1958</pubdate>
  </journal>
  <title>General solution of the Luneburg lens problem</title>
</citation>
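The renaming described on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `rename_to_named_content` is hypothetical, the input is a trimmed fragment of the citation above, and a real conversion would also have to map the surrounding citation elements, which this sketch leaves untouched.

```python
import xml.etree.ElementTree as ET


def rename_to_named_content(root: ET.Element, source_tag: str, content_type: str) -> int:
    """Turn every <source_tag> under root into <named-content content-type="...">.

    Text, tail, and children are kept; source attributes are dropped
    (an assumption made for this sketch). Returns the number of elements renamed.
    """
    matches = list(root.iter(source_tag))  # snapshot before mutating tags
    for elem in matches:
        elem.tag = "named-content"
        elem.attrib.clear()
        elem.set("content-type", content_type)
    return len(matches)


fragment = "<journal><sertitle>J. Appl. Phys.</sertitle><price>$01.00</price><volume>29</volume></journal>"
journal = ET.fromstring(fragment)
rename_to_named_content(journal, "price", "price")
print(ET.tostring(journal, encoding="unicode"))
```

Dropping the source attributes is the lossy part of this approach; a migration that needs them would have to record them elsewhere, for example in the content-type value or in separate metadata.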


Page 12: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Challenges posed by source DTDs

More Extended Semantics for Named-Content

• Affiliation in Footnotes/P
– Becomes <named-content content-type="aff" id="AFF2">

<FOOTNOTE ID="N101" TYPE="AFF">
  <P ALPHABET="LATIN" TYPE="INDENT">
    <AFF ID="AFF2"><IT>Corresponding author address:</IT> Nicholas M. J. Hall, Dept. of Atmospheric and Oceanic Sciences, McGill University, 805 Sherbrooke St. W., Montreal PQ H3A 2K6, Canada.</AFF>
  </P>
</FOOTNOTE>


Page 13: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Challenges posed by source DTDs

More Extended Semantics for Named-Content

• Funding in Acknowledgments/P
– Becomes <named-content content-type="funding">

<ack>
  <sectitle>ACKNOWLEDGMENTS</sectitle>
  <p>Q.W.&#x2019;s research is partially supported by AFOSR Grant No. <funding source="USAFOSR"><contract>F49550-05-1-0025</contract></funding> and NSF Grants No. <funding source="NSF"><contract>DMS-0204243</contract></funding>, No. <funding source="NSF"><contract>DMS-0605029</contract></funding>, and No. <funding source="NSF"><contract>DMS-0626180</contract></funding>. P.Z. is partially supported by the special funds for major State Research Projects <funding source="UNSPECIFIED"><contract>2005CB321704</contract></funding> and National Science Foundation of China for Distinguished Young Scholars <funding source="NSFC"><contract>10225103</contract></funding>. H.Z.&#x2019;s work is supported in part by the Naval Postgraduate School Research Initiation Program.</p>
</ack>


Page 14: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Challenges posed by source DTDs

More Extended Semantics for Named-Content

• Organization Division in Affiliation
– Becomes <named-content content-type="division">

<Affiliation ID="Aff12">
  <OrgDivision>Optisches Institut</OrgDivision>
  <OrgName>Technische Universität Berlin</OrgName>
  <OrgAddress>
    <City>Berlin</City>
    <Country>Germany</Country>
  </OrgAddress>
</Affiliation>


Page 15: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Challenges posed by source DTDs

More Extended Semantics for Named-Content

• Generic Element (addinfo)
– Becomes <named-content content-type="addinfo">

<ref-conf id="CIT0045">
  <ref-conf-text><author-ref-text><surname>Bishop</surname> <givenname>CJ</givenname></author-ref-text>, <author-ref-text><surname>Aanenses</surname> <givenname>DM</givenname></author-ref-text>, <author-ref-text><surname>Jordan</surname> <givenname>GE</givenname></author-ref-text>, <author-ref-text><surname>Kilian</surname> <givenname>M</givenname></author-ref-text>, <author-ref-text><surname>Hanage</surname> <givenname>WP</givenname></author-ref-text>, <author-ref-text><surname>Spratt</surname> <givenname>BG.</givenname></author-ref-text> <presentationtitle>Electronic taxonomy: assigning strains to bacterial species via the internet</presentationtitle>. <collectworktitle>BMC Biology</collectworktitle> <publicationfield-text><year>2009</year>; <year>7</year></publicationfield-text>: <firstpage>3</firstpage>. <addinfo>doi:10.1186/1741-7007-7-3</addinfo>.</ref-conf-text>
</ref-conf>


Page 16: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Challenges posed by source DTDs

Target DTD Structural Constraints that force the use of Named-Content

• Table in Table
– TD contains named-content, which contains a table
<td><named-content content-type="table"><table-wrap>

• Figure in Table
– TD contains named-content, which contains a fig
<td><named-content content-type="figure"><fig>

• Display-Formula in Title
– Title contains named-content, which contains a display-formula
<title><named-content content-type="display-formula"><display-formula>
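The wrapping pattern in these three cases is the same: an element that the target DTD does not allow directly in its parent is enclosed in a named-content wrapper. A minimal Python sketch of that step follows; the helper name `wrap_in_named_content` is hypothetical, and the sketch makes no claim about how Portico's own transforms are written.

```python
import xml.etree.ElementTree as ET


def wrap_in_named_content(parent: ET.Element, child: ET.Element, content_type: str) -> ET.Element:
    """Replace child inside parent with <named-content content-type="..."> containing child."""
    index = list(parent).index(child)      # position of child among parent's children
    wrapper = ET.Element("named-content", {"content-type": content_type})
    wrapper.tail = child.tail              # text after the child now follows the wrapper
    child.tail = None
    parent.remove(child)
    wrapper.append(child)
    parent.insert(index, wrapper)
    return wrapper


# A table cell that directly contains a table, as in the first case above.
td = ET.fromstring("<td><table-wrap><table/></table-wrap> trailing text</td>")
wrap_in_named_content(td, td.find("table-wrap"), "table")
print(ET.tostring(td, encoding="unicode"))
```

The same call with "figure" or "display-formula" as the content type would handle the other two cases.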


Page 17: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Challenges posed by source DTDs

• Question/Answer
– Generic and Structural
– Is saying <list list-content="question"> enough?

<Question-Answer>
  <Q><P><L>1</L>. The major advantage of amniotic membrane transplantation in pterygium surgery is</P></Q>
  <A><P><L>A</L>. reduction in surgical time</P></A>
  <A><P><L>B</L>. preservation of conjunctiva</P></A>
  <A><P><L>C</L>. better cosmetic outcomes compared with conjunctival autografting</P></A>
  <A><P><L>D</L>. lowest recurrence rate among the surgical techniques</P></A>
</Question-Answer>
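To make the slide's question concrete, here is a Python sketch of one possible flat list encoding. The mapping is an assumption made for illustration (the function name `question_answer_to_list` is hypothetical, and this is not necessarily what Portico produces): every Q and A becomes a list-item with a label and a p.

```python
import xml.etree.ElementTree as ET


def question_answer_to_list(qa: ET.Element) -> ET.Element:
    """Map <Question-Answer> to a flat <list list-content="question">."""
    lst = ET.Element("list", {"list-content": "question"})
    for entry in qa:                       # <Q> and <A> children, in document order
        para = entry.find("P")
        if para is None:
            continue
        item = ET.SubElement(lst, "list-item")
        src_label = para.find("L")
        if src_label is not None:
            ET.SubElement(item, "label").text = src_label.text
            # Text after </L>; drop the ". " separator that followed the label.
            body = (src_label.tail or "").lstrip(". ")
        else:
            body = para.text or ""
        ET.SubElement(item, "p").text = body
    return lst
```

In the result, the question stem and the answer options are all plain list-item elements, so the distinction between Q and A survives only in the label text. That loss is what the slide's question is pointing at.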


Page 18: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Challenges posed by source DTDs

• Synonymy
– Domain and Semantic
– Is saying <list list-content="synonymy"> enough?
– Or <named-content content-type="synonymy"> because of the semantic meaning?

<SYNONYMY>
  <HEAD>ECHINOSTELIALES</HEAD>
  <ITEM><P><GENSP>Clastoderma debaryanum</GENSP> A. Blytt</P></ITEM>
  <ITEM><P><GENSP>Echinostelium apitectum</GENSP> K.D. Whitney, MC</P></ITEM>
  <ITEM><P><GENSP>Echinostelium coelocephalum</GENSP> T.E. Brooks &amp; H.W. Keller, MC</P></ITEM>
  <ITEM><P><GENSP>Echinostelium minutum</GENSP> de Bary, MC</P></ITEM>
</SYNONYMY>

Synonyms are different scientific names that pertain to the same taxon


Page 19: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Challenges posed by source DTDs

• Decision Tree (Taxonomic Key)
– Domain, Semantic, Structural, and Presentation

<KEY>
  <COUPLET>
    <DESCR><NO>1.</NO>Hypostomal setae (Hy) shorter than half the width of labrum</DESCR>
    <RESP><GENSP>Sycophila mellea</GENSP> (Curtis, 1831), <GENSP>Tetramesa </GENSP>Walker, 1848</RESP>
  </COUPLET>
  <COUPLET>
    <DESCR><NO></NO>--Hypostomal setae longer or about as long as half the width of labrum</DESCR>
    <RESP>2</RESP>
  </COUPLET>
  <COUPLET>
    <DESCR><NO>2.</NO>More than two dorsal setae (D) present on abdominal segments A6-8</DESCR>
    <RESP>3</RESP>
  </COUPLET>
  <COUPLET>
    <DESCR><NO></NO>--At least one of abdominal segments A6-8 with only two dorsal setae</DESCR>
    <RESP>4</RESP>
  </COUPLET>
  <COUPLET>
    <DESCR><NO>3.</NO>Mandibles bidentate</DESCR>
    <RESP><GENSP>E. (Ahtola) atra</GENSP> (Walker, 1832)</RESP>
  </COUPLET>
  <COUPLET>
    <DESCR><NO></NO>--Mandibles unidentate</DESCR>
    <RESP><GENSP>E. nodularis</GENSP> Boheman</RESP>
  </COUPLET>
  <COUPLET>
    <DESCR><NO>4.</NO>Mandibles bidentate</DESCR>
    <RESP><GENSP>Eurytoma appendigaster</GENSP> group</RESP>
  </COUPLET>
  <COUPLET>
    <DESCR><NO></NO>--Mandibles unidentate</DESCR>
    <RESP><GENSP>Eurytoma heriadi</GENSP> Zerova</RESP>
  </COUPLET>
</KEY>

A decision tree is a tree-like model of decisions and their possible outcomes.
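The structural side of a taxonomic key can be made visible with a small Python sketch that reads the KEY markup into a dictionary. Everything here is an illustrative assumption: the function name `parse_key` is hypothetical, every DESCR is assumed to start with a NO element, and an empty NO is read as the second lead of the preceding couplet number.

```python
import xml.etree.ElementTree as ET


def parse_key(key: ET.Element) -> dict:
    """Map couplet number -> list of (lead text, outcome).

    An outcome is an int when RESP points to another couplet,
    otherwise the taxon text found in RESP.
    """
    couplets = {}
    current = None
    for couplet in key.findall("COUPLET"):
        number_elem = couplet.find("DESCR").find("NO")
        number = (number_elem.text or "").strip().rstrip(".")
        if number:                          # an empty <NO> continues the previous couplet
            current = int(number)
        lead = (number_elem.tail or "").strip().lstrip("-")
        resp = "".join(couplet.find("RESP").itertext()).strip()
        outcome = int(resp) if resp.isdigit() else resp
        couplets.setdefault(current, []).append((lead, outcome))
    return couplets
```

A numeric outcome is in effect a link to another couplet, while a textual outcome is a leaf naming a taxon. A flat list or a table encoding would keep the text but lose that distinction.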


Page 20: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Concluding Question

How to support Publisher/Domain Specific constructs in the Archival DTD?

• Continue use of Named-Content

• New Miscellaneous Element

• Support for adding namespaced elements

• Other


Page 21: Portico: A Case Study in the Migration of Proprietary Formats to the JATS Archiving Format

Questions/Answers?

Thank you

John Meyer
Director of Data Technologies
100 Campus Drive, Suite 100
Princeton, NJ 08540
609 [email protected]
