Representing and Publishing Schottlaender’s “Anything But Routine” As Linked Data

Bradley P. Allen
Chief Architect, Elsevier
@bradleypallen

LD4 Conference on Linked Data in Libraries
Session on Special Formats 2
Boston, MA, USA
2019-05-11

The work of the American Beat Generation author William S. Burroughs is notorious among collectors and bibliographers for its scale and complexity. In this talk, we describe our objectives and experiences in representing, enhancing and publishing the information in Brian E.C. Schottlaender’s “ANYTHING BUT ROUTINE: A Selectively Annotated Bibliography of William S. Burroughs v. 4.0” as linked data using the BIBFRAME 2.0 and ARM ontologies.




I began collecting the work of William S. Burroughs in 1984, when my search for a first edition of Naked Lunch led me to Red Stodolsky’s Baroque Bookstore in Hollywood, CA. Red, a legendary Los Angeles bookman, knew a mark when he saw one; he set the hook deep, and started me on a journey that resulted in a still-growing collection of over 300 items by or about Burroughs.

The motivation for the work described in this talk comes from the desire to share information about my collection with others over the Web. Initially this was done using social media services such as Flickr or Pinterest, but more recently, I’ve begun to explore how this information might be more usefully published as linked data.


Any serious collector of the work of an author depends deeply on a good annotated bibliography. Over the last forty years, a number of dedicated scholars have grappled with documenting Burroughs’ creative output, describing it at a work and instance level across a broad range of genres, from books to mixed media. Particularly challenging is the description of Burroughs’ manifold contributions to periodicals, of which there are hundreds, sometimes appearing in complex ways, e.g., as periodicals-within-periodicals. More than twenty years after Burroughs’ death, his archives continue to yield new works.

Released in four versions between 2008 and 2016, Brian E.C. Schottlaender’s “ANYTHING BUT ROUTINE: A Selectively Annotated Bibliography of William S. Burroughs v. 4.0” (ABR) is today the standard reference for Burroughs collectors. The convenient fact that ABR is available in PDF makes it straightforward to mine for work and instance data that can be used to describe collection items.

[Figure: workflow diagram. Anything But Routine v. 4.0 (.pdf) → automated PDF-to-text conversion → extracted text (.txt) → automated text-to-BIBFRAME transformation → extracted works and instances (.ttl) → manual review, cleansing, enhancement, and linking → reviewed works and instances (.ttl)]

To transform ABR from PDF into works and instances represented in BIBFRAME, I implemented a simple workflow that has both automated and manual steps.

1. A Python script is used to convert the PDF file into text files, one per work category.

2. Another Python script then uses regular expression pattern matching against the work identifiers to split the text into chunks, one per instance entry. Regular expressions are then used to extract publications, titles and dates from each entry. The extracted data for a given instance is then serialized to a Turtle file. A Turtle file is also generated for each work, incorporating the links to their instances.

3. Because there can be errors in this transformation, as a third and final step, each Turtle file is manually reviewed using a text editor, compared to the corresponding entry in the original document, and corrections made as needed. Python notebooks are also leveraged in this step to perform analytics useful for further enhancing and validating descriptions, or for refining the code used in the extraction scripts.
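The second step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the line layout of the sample text, the three regular expressions, and the names `parse_entries`, `WORK_RE`, `INSTANCE_RE` and `PUB_RE` are all assumptions made for the example, and the real ABR text is considerably messier.

```python
import re

# Assumed, simplified layout: a work line ("A52. Title. ...") followed by
# one line per instance ("a. Place: Publisher, Year. ...").
SAMPLE = """A52. Sinki's Sauna. Illustrated by James Kearns.
a. New York: Pequod Press, 1982. Staplebound (no hardbound issued).
"""

WORK_RE = re.compile(r"^([A-Z]\d+)\.\s+(.*)$")      # e.g. "A52. ..."
INSTANCE_RE = re.compile(r"^([a-z])\.\s+(.*)$")      # e.g. "a. ..."
PUB_RE = re.compile(
    r"^(?P<place>[^:]+):\s*(?P<agent>[^,]+),\s*(?P<date>\d{4})\."
)


def parse_entries(text):
    """Split text into instance records keyed by work identifier + letter."""
    entries = []
    work_id = None
    title = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        work = WORK_RE.match(line)
        if work:
            work_id = work.group(1)
            # Treat the first sentence after the identifier as the title.
            title = work.group(2).split(". ")[0].rstrip(".")
            continue
        inst = INSTANCE_RE.match(line)
        if inst and work_id:
            record = {"id": work_id + inst.group(1), "work": work_id, "title": title}
            pub = PUB_RE.match(inst.group(2))
            if pub:
                # Adds "place", "agent" and "date" when the pattern matches.
                record.update(pub.groupdict())
            entries.append(record)
    return entries
```

Entries that the publication pattern does not match simply come back without place, agent or date, which is one reason the manual review in step 3 remains necessary.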


abri:A52a a bf:Instance ;
    rdfs:label "Sinki's Sauna" ;
    bf:classification abrc:A ;
    bf:contributor [ a bf:Agent, bf:Person ;
            rdfs:label "William S. Burroughs" ;
            bf:role "author" ] ;
    bf:identifiedBy [ a bf:Identifier ;
            bf:source "Schottlaender v4.0" ;
            rdf:value "A52a" ] ;
    bf:instanceOf abrw:A52 ;
    bf:note [ a bf:Note ;
            rdf:value "Illustrated by James Kearns." ],
        [ a bf:Note ;
            rdf:value "Staplebound (no hardbound issued)." ],
        [ a bf:Note ;
            rdf:value "Limited to 500 numbered copies." ] ;
    bf:provisionActivity [ a bf:ProvisionActivity, bf:Publication ;
            bf:agent [ a bf:Agent, bf:Organization ;
                    rdfs:label "Pequod Press" ] ;
            bf:date "1982" ;
            bf:place [ a bf:Place ;
                    rdfs:label "New York" ] ] ;
    bf:title [ a bf:Title ;
            rdfs:label "Sinki's Sauna" ] .

A52. Sinki’s Sauna. Illustrated by James Kearns. A. New York: Pequod Press, 1982. Staplebound (no hardbound issued).

➢ Limited to 500 numbered copies.

Most of the information in works and instances can be easily captured using a handful of BIBFRAME attributes, specifically:

● bf:contributor
● bf:copyrightDate
● bf:identifiedBy
● bf:instanceOf
● bf:note
● bf:provisionActivity
● bf:title
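To make the mapping from an extracted record to these properties concrete, here is a small sketch that emits a Turtle description using plain string templates. It is an illustration under assumptions: the record layout and the function names `turtle_literal` and `instance_to_turtle` are invented for the example, prefix declarations are omitted (as on the slide), and a real implementation might well use an RDF library instead.

```python
SOURCE_LABEL = "Schottlaender v4.0"  # value of bf:source in the slide example


def turtle_literal(value):
    """Quote a string as a Turtle literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def instance_to_turtle(record):
    """Render one instance record (a plain dict) as a Turtle description."""
    ident = record["id"]
    work = record["work"]
    title = turtle_literal(record["title"])
    parts = [
        f"abri:{ident} a bf:Instance",
        f"rdfs:label {title}",
        "bf:identifiedBy [ a bf:Identifier ; "
        f"bf:source {turtle_literal(SOURCE_LABEL)} ; "
        f"rdf:value {turtle_literal(ident)} ]",
        f"bf:instanceOf abrw:{work}",
    ]
    # One bf:note per note string; RDF itself does not record their order.
    for note in record.get("notes", []):
        parts.append(f"bf:note [ a bf:Note ; rdf:value {turtle_literal(note)} ]")
    parts.append(
        "bf:provisionActivity [ a bf:ProvisionActivity, bf:Publication ; "
        f"bf:agent [ a bf:Agent, bf:Organization ; rdfs:label {turtle_literal(record['agent'])} ] ; "
        f"bf:date {turtle_literal(record['date'])} ; "
        f"bf:place [ a bf:Place ; rdfs:label {turtle_literal(record['place'])} ] ]"
    )
    parts.append(f"bf:title [ a bf:Title ; rdfs:label {title} ]")
    # Predicate-object pairs are separated by ";" and the subject ends with ".".
    return " ;\n    ".join(parts) + " ."


EXAMPLE = {
    "id": "A52a",
    "work": "A52",
    "title": "Sinki's Sauna",
    "notes": ["Illustrated by James Kearns.", "Limited to 500 numbered copies."],
    "agent": "Pequod Press",
    "date": "1982",
    "place": "New York",
}
```

Calling `instance_to_turtle(EXAMPLE)` yields a description with the same overall shape as the A52a example, minus the classification and contributor statements, which this sketch leaves out for brevity.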

There are a number of modeling challenges yet to be addressed:

● Preserving the ordering of notes: one of the long-standing and less-charming aspects of RDF is its lack of respect for order in data elements. This causes trouble when generating a human-readable version of ABR directly from an RDF graph. There are hacks for this, but they add complexity to what is currently a pretty simple representation. Dan Brickley of Google has recently begun to grapple with this issue in the context of schema.org and the evolution of JSON-LD, so maybe in the future we can solve this by moving from Turtle to JSON-LD as our choice of serialization.

● Representing uncertainty about publication agents, dates and places: ABR makes liberal use of cataloging conventions for lexically encoding uncertainty into strings, using square brackets and question marks. I'd like to address capturing this by using arm:InaccuracyNote in conjunction with bf:provisionActivity.

● Representing citations within notes: ABR frequently adds citations to specific notes that make reference to other bibliographies or catalogs. I'd like to understand how to augment a bf:Note with that information.

● Converting internal references into links: Notes also make reference to other items in the bibliography, for example when expressing the relationship between a book and earlier appearances of material in article form.

● Balancing descriptive information between works and instances: My first cut at the work/instance partitioning of ABR skews heavily towards instances, with little data expressed at the bf:Work level other than the title and links to the various instances of the work. I believe a better design would make choices for each bf:Work as to which pieces of descriptive information are hung on a given work and implicitly inherited down to their instances, and which have their scope restricted to specific instances. I'm optimistic that these choices can be automated by being cleverer about exploiting the structure in the ABR text.
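Two of the challenges above lend themselves to a short sketch: decoding the bracket and question-mark conventions, and spotting internal references inside notes. What follows is only one possible approach under stated assumptions. It assumes that square brackets wrap a whole value and that a trailing question mark marks doubt, and that internal references appear in notes as bare instance identifiers such as "A52a". The function names are hypothetical; the instance URL pattern is the one used by the published A52a example.

```python
import re

INSTANCE_BASE = "https://wsburroughs.link/anything-but-routine/4.0/instance/"

# Assumed shape of an instance identifier: capital letter, digits, lowercase letter.
REF_RE = re.compile(r"\b[A-Z]\d+[a-z]\b")


def parse_uncertain(raw):
    """Separate a transcribed value from its bracket and question-mark markers."""
    text = raw.strip()
    supplied = text.startswith("[") and text.endswith("]")
    if supplied:
        text = text[1:-1].strip()
    uncertain = text.endswith("?")
    if uncertain:
        text = text[:-1].strip()
    return {"value": text, "supplied": supplied, "uncertain": uncertain}


def find_instance_links(note):
    """Return (identifier, URL) pairs for instance identifiers found in a note."""
    return [(ref, INSTANCE_BASE + ref) for ref in REF_RE.findall(note)]
```

The flags returned by `parse_uncertain` could then drive whether an arm:InaccuracyNote is attached to the corresponding bf:provisionActivity, while the pairs from `find_instance_links` are candidates for review rather than links to be asserted automatically.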


abri:A52a a bf:Instance ;
    rdfs:label "Sinki's Sauna" ;
    bf:classification abrc:A ;
    bf:contributor [ a bf:Agent, bf:Person ;
            rdfs:label "William S. Burroughs" ;
            bf:role "author" ],
        [ a bf:Agent, bf:Person ;
            rdfs:label "James Kearns" ;
            bf:role "illustrator" ] ;
    bf:identifiedBy [ a bf:Identifier ;
            bf:source "Schottlaender v4.0" ;
            rdf:value "A52a" ] ;
    bf:instanceOf abrw:A52 ;
    bf:note [ a bf:Note ;
            rdf:value "Illustrated by James Kearns." ],
        [ a bf:Note ;
            rdf:value "Staplebound (no hardbound issued)." ],
        [ a bf:Note ;
            rdf:value "Limited to 500 numbered copies." ] ;
    bf:provisionActivity [ a bf:ProvisionActivity, bf:Publication ;
            bf:agent [ a bf:Agent, bf:Organization ;
                    rdfs:label "Pequod Press" ] ;
            bf:date "1982" ;
            bf:place [ a bf:Place ;
                    rdfs:label "New York" ] ] ;
    bf:title [ a bf:Title ;
            rdfs:label "Sinki's Sauna" ] ;
    ns1:hasPart [ a arm:Binding ;
            bf:note [ a bf:Note, arm:DescriptiveNote ;
                    rdf:value "Staplebound (no hardbound issued)." ] ] .

The text in bf:Notes can be used to create additional bf:contributors or ARM descriptions of features of the instance, e.g. arm:Bindings. Terms from the LC MARC Code List for Relators are used as the values of the bf:roles of a bf:Agent.
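A rough sketch of how note text might be mined for such enhancements is shown below. The patterns are assumptions for illustration only: a note of the form "Illustrated by NAME." is taken to yield an illustrator, and a note containing one of a tiny assumed vocabulary of binding terms is flagged as a binding description. The function name `enhance_from_notes` is hypothetical.

```python
import re

# Assumed note pattern: the entire note is "Illustrated by <name>."
ILLUSTRATOR_RE = re.compile(r"^Illustrated by (?P<name>.+?)\.$")

# Assumed, deliberately tiny vocabulary of binding terms.
BINDING_RE = re.compile(r"\b(staplebound|hardbound)\b", re.IGNORECASE)


def enhance_from_notes(notes):
    """Derive candidate contributors and binding descriptions from note strings."""
    contributors = []
    bindings = []
    for note in notes:
        match = ILLUSTRATOR_RE.match(note)
        if match:
            # The role is the relator term used in the example description.
            contributors.append({"label": match.group("name"), "role": "illustrator"})
        if BINDING_RE.search(note):
            bindings.append(note)
    return {"contributors": contributors, "bindings": bindings}


NOTES = [
    "Illustrated by James Kearns.",
    "Staplebound (no hardbound issued).",
    "Limited to 500 numbered copies.",
]
```

The results are candidates only; as with the rest of the workflow, they would be checked by hand before being written back as bf:contributor or arm:Binding descriptions.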


abri:A52a a bf:Instance ;
    rdfs:label "Sinki's Sauna" ;
    bf:classification abrc:A ;
    bf:contributor [ a bf:Agent, bf:Person ;
            rdfs:label "William S. Burroughs" ;
            bf:role "author" ;
            owl:sameAs wd:Q188176 ],
        [ a bf:Agent, bf:Person ;
            rdfs:label "James Kearns" ;
            bf:role "illustrator" ;
            owl:sameAs wd:Q20871041 ] ;
    bf:identifiedBy [ a bf:Identifier ;
            bf:source "Schottlaender v4.0" ;
            rdf:value "A52a" ] ;
    bf:instanceOf abrw:A52 ;
    bf:note [ a bf:Note ;
            rdf:value "Illustrated by James Kearns." ],
        [ a bf:Note ;
            rdf:value "Staplebound (no hardbound issued)." ],
        [ a bf:Note ;
            rdf:value "Limited to 500 numbered copies." ] ;
    bf:provisionActivity [ a bf:ProvisionActivity, bf:Publication ;
            bf:agent [ a bf:Agent, bf:Organization ;
                    rdfs:label "Pequod Press" ] ;
            bf:date "1982" ;
            bf:place [ a bf:Place ;
                    rdfs:label "New York" ;
                    owl:sameAs wd:Q60 ] ] ;
    bf:title [ a bf:Title ;
            rdfs:label "Sinki's Sauna" ] ;
    ns1:hasPart [ a arm:Binding ;
            bf:note [ a bf:Note, arm:DescriptiveNote ;
                    rdf:value "Staplebound (no hardbound issued)." ] ] .

The rdfs:labels of bf:contributors or bf:places can be used to query name authorities to generate candidate matching entities. These are then manually verified and, if correct, have owl:sameAs links to their URLs added to the respective bf:Agent or bf:Place description.
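The talk does not say which lookup service is used, but the wd: identifiers in the example suggest Wikidata. Assuming that, candidate generation could look something like the sketch below, which builds a request URL for Wikidata's `wbsearchentities` API action and turns a decoded JSON response into candidate URIs. The function names are hypothetical, and the network call itself is left out so the example stays self-contained.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
WIKIDATA_ENTITY = "http://www.wikidata.org/entity/"


def search_url(label, language="en", limit=5):
    """Build a wbsearchentities request URL for an rdfs:label value."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


def candidates(payload):
    """Extract candidate entities from a decoded wbsearchentities response."""
    return [
        {"uri": WIKIDATA_ENTITY + hit["id"], "label": hit.get("label", "")}
        for hit in payload.get("search", [])
    ]
```

Each candidate still needs the manual verification described above before an owl:sameAs link is added, since a label match alone does not establish identity.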

[Figure: publishing pipeline. Reviewed works & instances (.ttl) → CI build triggered by Git commit → (1) Markdown rendering of works & instances (.md), deployed as a Markdown file on GitHub Pages; (2) dataset dump (.ttl) and VoID dataset description (.ttl), served by a Flask server with data deployed on AWS Lambda]

The linked data generated by this workflow is published to the Web in two ways:

1. As Markdown content with bibliographic information rendered in a layout modeled on the PDF version of ABR.

2. As a minimalist linked data service providing HTTP GET access to a VoID description of the dataset, a dump of the dataset, and individual bf:Works and bf:Instances, with content negotiation supporting multiple RDF serializations.

The Travis CI continuous integration service is used so that updates to the GitHub repository trigger automatic builds, which push updated Markdown to GitHub Pages and the server code and data files to an AWS Lambda function.

Assuming a modest level of traffic, this combination of GitHub, Travis CI, and AWS Lambda provides a very low-cost and low-overhead way to host the data.
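The content negotiation mentioned in item 2 can be illustrated with a small, framework-independent sketch that picks an RDF serialization from an HTTP Accept header. This is not the project's server code: the set of supported media types, the mapping to serializer names, and the function name `negotiate` are assumptions, and partial wildcards such as `text/*` are deliberately not handled.

```python
# Assumed set of offered media types, mapped to assumed serializer names.
SUPPORTED = {
    "text/turtle": "turtle",
    "application/ld+json": "json-ld",
    "application/rdf+xml": "xml",
    "application/n-triples": "nt",
}


def negotiate(accept_header, default="text/turtle"):
    """Return the best supported media type, or None if nothing is acceptable."""
    if not accept_header or not accept_header.strip():
        return default
    ranked = []
    for position, part in enumerate(accept_header.split(",")):
        pieces = [piece.strip() for piece in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        quality = 1.0
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        # Sort by descending quality, then by order of appearance.
        ranked.append((-quality, position, media_type))
    for neg_quality, _, media_type in sorted(ranked):
        if neg_quality == 0:
            continue  # q=0 means "not acceptable"
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return default
    return None
```

A request handler would call `negotiate` with the incoming Accept header, answer with HTTP 406 when it returns None, and otherwise serialize the requested bf:Work or bf:Instance in the chosen format.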

[Figure: envisioned ecosystem. An annotated bibliography linked with catalogs and open knowledge resources; Git-based repositories for collaborative editing, review, and publishing; Python notebooks using rdflib, pandas, scikit-learn, keras, snorkel, spaCy, etc. for analytics and enhancement]

In the long run, I am working towards realizing the following scenario:

1. A distributed team of collaborators using a Git-hosted repository maintains an annotated bibliography as a linked open data resource.

2. Collectors independently publish item catalogs as data linked to the bibliography, reducing duplicative descriptive effort and making work and instance information consistent across catalogs.

3. Scholars use analytics tools to query those resources, together with other open knowledge resources, to conduct research into an author and her work in a more powerful and efficient manner than possible using traditional print-centric resources.

4. Research revealing relevant information results in pull requests for bibliography changes from researchers, to be reviewed and merged by committers for the bibliography repository.

Links

Project repository https://github.com/bradleypallen/anything-but-routine-ld

Anything But Routine v. 4.0 https://escholarship.org/uc/item/0xj4d6bm

Anything But Routine LD http://bradleypallen.org/anything-but-routine-ld

Anything But Routine LD VOID description https://wsburroughs.link/anything-but-routine/4.0/.well-known/void

Schottlaender A52a in BIBFRAME https://wsburroughs.link/anything-but-routine/4.0/instance/A52a

Collection catalog http://bradleypallen.org/wsb-catalog/

Image of Baroque Bookstore and Red Stodolsky https://bukowskiforum.com/threads/red-stodolsky-memorial.6416/

License CC BY-NC-SA 4.0

Thanks for the opportunity to share this work at LD4. Here are links to the project repository on GitHub, live examples of data and content generated to date, and other related resources.