BL Labs presentation at Liverpool John Moores University
TRANSCRIPT
1 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
British Library Labs
What is British Library Labs and what have we learned over the last four years?
Mahendra Mahey
1315 – 1400, 22 March 2017
Learning the Lessons of working with the British Library’s Digital Content and Data for your research (History UK with Liverpool John Moores University)
2 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
The British Library
St Pancras, London, UK: Legal Deposit Library – Reference only
Inside the British Library: space for 1200 readers, around 400,000 visitors per year
Many books are stored 4 storeys below the building
Boston Spa: Document Supply and Storage
Uses low oxygen and robots; reading room and delivery to London
Stockton-on-Tees: authors' right to payment each time their books are borrowed from public libraries.
3 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Living Knowledge Vision (2015 – 2023)
Custodianship Research Business
Culture Learning International
To make our intellectual heritage accessible to everyone, for research, inspiration and enjoyment and be the most open, creative
and innovative institution of its kind by 2023.
Document: http://goo.gl/h41wW7 Speech: https://goo.gl/Py9uHK
Roly Keating (Chief Executive Officer of the British Library)
4 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Collections – not just books!
> 180* million items
> 0.8* m serial titles
> 8* m stamps
> 14* m books
> 3* m sound recordings
> 4* m maps
> 1.6* m musical scores
> 0.3* m manuscripts
> 60* m patents
King’s Library
*Estimates
5 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
http://www.bl.uk/projects/british-library-labs
Funded by the Andrew W. Mellon Foundation
7 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Wider…not just Researchers
Researchers https://goo.gl/WutNyi
Artists http://goo.gl/nNKhQ2
Librarians
Curators
https://goo.gl/9NWZUW
Software Developers https://goo.gl/7QQ5Tf
Archivists https://goo.gl/x7b4tg
Educators
https://goo.gl/qh01Mi
8 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Stakeholders involved in Labs
[Diagram. Labels: BL Labs; British Library; Digital Scholarship; Digital Research; Digital Content; Curators / Researchers; Developers / Technical Staff; Access & Reuse Group (©); Project Board; Advisory Board; Universities & wider Researchers; United Kingdom; The World]
9 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Digital research methods
Visualisations
Using Application Programming Interfaces for datasets e.g. Metadata, Images Annotation
Location based searching & Geo-tagging
Crowdsourcing
Human Computation
10 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
How are we doing this?
11 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Competition: Tell us your ideas of what to do with our digital content
Awards: Show us what you have already done with our digital content in research, artistic, commercial and learning and teaching categories
Projects: Talk to us about working on collaborative projects
12 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Why are we doing this?
• Working closely with and listening to those who want to use our digital collections and data for their work and helping to build services, tools and processes to support them
• We can learn how we are and should be supporting them.
– Is the access to digital collections we provide sufficient?
– Do we have the right tools?
– Do we provide the right support?
– Where are the gaps between what they want and what we can give?
– How do we build the bridges to overcome them?
– Many more reasons…
13 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Digital / Data all around us!
Knowledge Quarter London: 55 knowledge organisations within 1 mile radius of Kings Cross, http://www.knowledgequarter.london
https://goo.gl/pGO7QY
14 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Digitisation
1-2%* digitised (*estimate)
Partnerships: Commercial & Other Organisations
Amount increasing rapidly
Bias in digitisation
Sample Generator: http://goo.gl/bR9UJL
15 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
So little digitised…?
• Common misconception that all our physical items are digitised. No! Costs time and resources!
• Still a big number though!
• Dialogue is either:
– you are lucky and we have the digital content relevant to your research
– we don’t have exactly what you’re looking for, but this is what we have; is there anything of interest?
• We tend to attract researchers with ‘fuzzier’ research boundaries
• Artists find this dialogue easier
• Access easier for out of copyright content
16 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Challenges of Digital access at the Library
British Library:
online and open
online behind paywall
only in Reading Rooms due to ©
only on site due to © or ethical etc
not online / available – various storage devices, personal data
17 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
The Story of the Collection!
Collection
Curator
Who paid for the digitisation?
Who did the digitisation?
Technology used
Born digital?
Published
Unpublished
Where is it?
Can it still be accessed?
Generates income
Reputational Risk
Legalities
Political
Ego
Surprises
Metadata
Old format not supported
What media was the digitisation done from?
Documentation
No Metadata
Messy Metadata
Still there?
Sometimes it’s complicated, better to know as much as possible, if you want to open it up!
18 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Finding Open Digital Collections at the British Library
• Curated? Learn the story behind a collection! Is there a human who knows the ‘story’ about the collection, who wants it used, are there any surprises lurking?
• Where is it, is it accessible?
• Licensing? Internal Access and Reuse and Licensing Group (Risk assessment group – Strategic, Commercial, Copyright, Curatorial, Technical)
• Metadata available? What state is it in and does it need cleaning?
https://goo.gl/Qjeqo1
https://goo.gl/Kfc4qc
Access & Reuse Group
©
19 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Open Licensed Digital Content?
15% Openly Licensed, 85% Available onsite
Working through
Breakdown by collection*
Manuscripts 59%
Books 9%
Maps and Views 7%
Newspapers 3%
Archives and Records 3%
Paintings, Prints and Drawings 2%
*Based on digitisation projects
Largest proportion of funding: Public / Private Partnership
20 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
READING ROOM
ON SITE
NOT ONLINE
OPEN
British Library
£
Digital access at the Library
Labs Residency Model
21 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Digital collections and Datasets at the British Library
Playbills, Books, Newspapers (includes OCR)
British National Bibliography
http://bnb.data.bl.uk
http://sounds.bl.uk
http://dml.city.ac.uk/
Music (Recordings & Sheet) & Sounds http://goo.gl/frSMJt
Broadcast News (TV and Radio)
http://goo.gl/cwThHw
http://goo.gl/pBkisZ
http://goo.gl/E8aRyQ
Usage data
Images, Manuscripts & Maps
http://www.qdl.qa/ Qatar Digital Library
http://idp.bl.uk/ International Dunhuang Project
Maps http://www.bl.uk/maps/
Hebrew Manuscripts http://goo.gl/4sbCp9
Flickr & Wikimedia Commons
https://goo.gl/LZRmaZ
22 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Cultural Heritage Datasets
Datasets about our collections: Bibliographic datasets relating to our published and archival holdings
Datasets for content mining: Content suitable for use in text and data mining research
Datasets for image analysis: Image collections suitable for large-scale image-analysis-based research
Datasets from UK Web Archive: Data and API services available for accessing UK Web Archive
Digital mapping: Geospatial data, cartographic applications, digital aerial photography and scanned historic map materials
https://data.bl.uk
Discussion list: http://www.jiscmail.ac.uk/CULTURAL-HERITAGE-DATASETS
23 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Taking a peek at our Open Data
24 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
002819694
27 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Optically Character Recognised (OCR) generated Text
Scanned Page
Image on Flickr Commons
https://goo.gl/AC43vs
28 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
OCR XML generated by ABBYY FineReader
29 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Taking a peek at our onsite only accessible data
30 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL
• Need to be security cleared so that you exist as a BL entity
– Hence ‘Researcher in Residence Model’
• Permission required from internal IP department and perhaps commercial company involved in the digitisation
• 20 % rule in terms of re-use in research
• Learning pathways so that this becomes ‘everyday’
31 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL
1
Results of digitisation exist on Windows file shares!
Windows 7, external access possible through Citrix Server
32 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL (JISC 1)
2
12 Volumes, each with terabytes of data
33 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL
3 to 12 [screenshots of steps 3 to 12, slides 33 to 42; no text recoverable]
43 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL
13
Accessing original master image (not cropped or post processed)
or Service Copy (post processed) and results of OCR available as ALTO XML
44 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL
14a
Accessing original master image (not cropped or post processed)
45 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL
Accessing original master image (not cropped or post processed)
14b
46 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL
15a
Accessing Service Copy (post processed) and results of OCR available as ALTO XML
47 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL
Accessing Service Copy (post processed)
15b
48 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers onsite at the BL
15c
Accessing OCR as ALTO XML
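A minimal sketch of how the text might be read back out of an ALTO XML file like the one on this slide. It relies only on the ALTO convention that each word is a String element with a CONTENT attribute (and optionally a WC word-confidence attribute) inside a TextLine. The sample document, the function names and the 0.7 threshold are illustrative assumptions, not the Library's actual files or tooling.

```python
import xml.etree.ElementTree as ET

# Tiny made-up ALTO fragment (an assumption for illustration, not BL data).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine>
            <String CONTENT="Great" WC="0.95"/>
            <SP/>
            <String CONTENT="Chartist" WC="0.80"/>
            <SP/>
            <String CONTENT="Meeting" WC="0.62"/>
          </TextLine>
          <TextLine>
            <String CONTENT="at" WC="0.99"/>
            <SP/>
            <String CONTENT="Kennington" WC="0.71"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop any '{namespace}' prefix so different ALTO versions are handled alike."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the OCR text as a list of lines, one per TextLine element."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        words = [child.get("CONTENT", "") for child in element
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines


def low_confidence_words(xml_text, threshold=0.7):
    """Return (word, confidence) pairs whose WC value is below the threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    for element in root.iter():
        if local_name(element.tag) == "String" and element.get("WC") is not None:
            confidence = float(element.get("WC"))
            if confidence < threshold:
                flagged.append((element.get("CONTENT", ""), confidence))
    return flagged


if __name__ == "__main__":
    for line in alto_lines(SAMPLE_ALTO):
        print(line)
    print(low_confidence_words(SAMPLE_ALTO))
```

Listing low-confidence words first is one way to get a feel for how messy a page is before deciding whether it needs manual clean-up.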
49 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers through Gale Interface (subscription)
1
50 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Accessing digitised newspapers through Gale Interface (subscription)
2
51 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Virtual Infrastructure for OCR text
OCR text scraped from digitised newspapers and in cloud
Jupyter notebook http://jupyter.org
Write code in browser
Results in browser
52 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
What did people actually do?
53 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Machine Learning / Reading
• Analogies to how humans read
• Machines acquire ‘knowledge’ and use that knowledge to make sense of new situations
• BL doing this on a case by case basis.
• Need computational and human effort
• Human input as to where to look > computational ‘lasso throwing’ > human sift
• Legalities of this process being ‘ironed’ out with publishers
• Not well understood area…
54 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
The smell of soup!
Thanks to Memo Akten (@memotv on twitter) for the inspiration!
https://goo.gl/toq4Bo Nasreddin, 13th Century Turkish Sufi
http://web2.uvcs.uvic.ca/elc/studyzone/330/reading/smell1.htm
55 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Messy Data!
Optical Character Recognition
56 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Finding things in messy data
Mrs Folly
• Clean up some manually
• Get ‘ground truth’
• Write code to find things reliably in it automatically
• Try code on messy content
• Tweak if necessary
• Digital lasso around content
• Manually sift through
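The workflow above could be sketched roughly as follows: throw a fuzzy-matching 'lasso' over OCR lines, then compare the catch with a small hand-checked ground truth. This is only an illustration using Python's standard difflib; the sample lines, the target phrase, the 0.8 threshold and the function names are all assumptions, not the code any Labs researcher actually used.

```python
import re
from difflib import SequenceMatcher

# Made-up OCR lines with typical character confusions (illustrative only).
LINES = [
    "A public meeting was held at Manchester",
    "A pubIic meetlng of the inhabitants",
    "Price of corn at market this week",
    "The pnblic rneeting adjourned at nine",
    "The pubhc mectmg was well attended",
]
# Hand-checked 'ground truth': indices of lines that really mention the phrase.
GROUND_TRUTH = [0, 1, 3, 4]


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lasso(lines, target, threshold=0.8):
    """Return indices of lines with a word window that roughly matches target."""
    size = len(target.split())
    hits = []
    for index, line in enumerate(lines):
        words = re.findall(r"\S+", line)
        windows = [" ".join(words[start:start + size])
                   for start in range(max(1, len(words) - size + 1))]
        if any(similarity(window, target) >= threshold for window in windows):
            hits.append(index)
    return hits


def precision_recall(found, truth):
    """Compare lassoed indices against the ground truth indices."""
    found, truth = set(found), set(truth)
    correct = len(found & truth)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    found = lasso(LINES, "public meeting")
    print(found, precision_recall(found, GROUND_TRUTH))
```

Lowering the threshold widens the lasso: more badly garbled lines are caught, at the cost of more false hits for the human sift that follows.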
57 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Katrina Navickas (2015) Political Meetings Mapper
http://politicalmeetingsmapper.co.uk
https://goo.gl/Qq78Oa
Labs Symposium 2015
https://goo.gl/BSA3be
Interview 2015
The Chartist Newspaper http://goo.gl/vOLSnH
Chartist Monster Meeting
Chartists Re-enactment London
58 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Black Abolitionist Performances & their Presence in Britain (2016) – Hannah-Rose Murray
Frederick Douglass
Ellen Craft
Josiah Henson
Ida B Wells
A Performance by Joe Williams & Martelle Edinborough
http://frederickdouglassinbritain.com/
59 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Use of Overproof – OCR Correction
Also just re-OCR?
60 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
See Bob Nicholson – Looking for Jokes
See Jennifer Batt – Looking for Poems
61 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
What can 65,000 books tell us?
Image: Artwork by Alicia Martin
Just one open digital collection
62 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Snipping out images from 65,000 Digitised Books*
Mechanical Curator
http://mechanicalcurator.tumblr.com
Posts image every 30 minutes
Flickr
http://www.flickr.com/photos/britishlibrary/
1,020,418 images need tagging!
Since Dec 2013
>600,000,000 views
>15,500,000 tags
https://goo.gl/FgZ4HM
Creative uses of images
Face recognition
Worked better for female faces than men’s
Press
http://goo.gl/qPPgxX
Work @ BL by Ben O’Steen, Labs and Digital Research Team
*Matt Prior - http://goo.gl/j29Tnx
63 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Opportunities – increasing traffic to Library services
You can purchase a ‘High Res’ Copy
View in the Library Item Viewer
Download .pdf
All illustrations in book
Other illustrations in books published in same year
View the item in the Library Catalogue
Tags auto generated
User generated Tag
Grouping for image
64 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Tagging a million images
Iterative Crowdsourcing
http://goo.gl/j6fxac
Cardiff University’s Lost Visions Project
Metadata Games http://www.metadatagames.org/
James Heald
Mario Klingemann
Chico 45
Use computational methods
Human Tagger
Top British Library Flickr Commons Taggers
http://goo.gl/8SkfM1
Machine Learning
Search Engine & Google Image search
65 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Special Jury’s Prize (2015)
James Heald – Wikimedia and Map work
https://goo.gl/WYZCB2
http://goo.gl/HNQq5e
https://goo.gl/VPgffL
https://commons.wikimedia.org/
https://goo.gl/djtm1b
Labs Symposium (2015)
Geotagging maps
54,000 Maps
66 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Adam Crymble (2015)
Crowdsource Arcade
What if crowd sourcing looked like this?
http://goo.gl/LBfJ4W
http://goo.gl/OH9pOZ
https://goo.gl/7z0j8p
30 mins talk, Labs Symposium (2015)
https://goo.gl/SSRsdd
5 min interview (2015)
http://goo.gl/0APpE8
Game Jam
67 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
SherlockNet: Competition Winner 2016
Karen Wang, Luda Zhao and Brian Do
Using Convolutional Neural Networks to Automatically Tag and Caption the British Library Flickr Commons 1 million Image Collection
Classify into one of 12 categories
>20 million tags added (total now 20 million overall)
>100,000 experimental captions
Data available soon!
bit.ly/sherlocknet
Pooled surrounding Optical Character Recognised text on page from similar images
Used Microsoft COCO (photographs) & British Museum Prints and Drawings collections as training sets.
Tags
Captions
68 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Artistic / Creative Works
http://goo.gl/dM8ieA
Mario Klingemann (2015)
http://www.crossroadsofcuriosity.com
David Normal 2014 and 2015
https://www.youtube.com/watch?v=-GRgj7Q5OM0
Rob Walker 2014
http://goo.gl/bNxGZZ
Kris Hoffman (2016)
https://goo.gl/QilqqT
Jiayi Chong 2016
Ling Low 2016
https://www.youtube.com/watch?v=bcOP1E5bRE0
https://www.facebook.com/RealmlandStory/
Paul Rand Pierce 2016
A Hat on the Ground Spells trouble
Tragic Looking Women
44 Men who Look 44
(Notice the direction faces)
69 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Mario Klingemann 2016
https://www.youtube.com/watch?v=xgnxnmqnR7Y
Google Arts and Culture Lab – Experiments with Machine Learning
https://artsexperiments.withgoogle.com/
70 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Imaginary Cities – BL Labs Project 16-17
Michael Takeo Magruder
https://goo.gl/4ARwTy
An artistic exploration seeking to create provocative fictional cityscapes for the Information Age from the British Library’s digital collection of historic urban maps
71 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Some Lessons Learned and Challenges so far…
• Everything starts from a conversation (external and internal)!
• Need to have several conversations with several stakeholders and tap into their tacit knowledge that isn’t always written down (esp. internal).
• It’s hard work at the beginning!
• Expectations change when researchers actually see the data, systems and experience the ‘culture’ of the organisation.
• We tend to work with researchers who can be ‘flexible’ with their research questions and are willing to embrace challenges.
• Often misunderstandings because of jargon & different meaning of words.
• Embrace dirty data, it may never be perfect!
72 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Some Lessons Learned and Challenges so far…(2)
• Many researchers have the domain knowledge but lack the technical skills to use Digital Research methods. Should they be teamed up with those that have problems that need solving (Computing) or get trained?
• Identifying / bridging gaps for researchers to use data, help them ‘navigate’ through the Library to get the data they want (sometimes).
• Huge appetite to use digital content & data (e.g. Flickr Commons stats).
• Start small and simple, but think big!
• Create and embrace serendipity, stimulate the imagination, work fast, give it energy.
• Letting go of the emotional and psychological connection to “my” collection
• If digitised collections are not used, what is the point of digitising them?
• Fail faster (don’t be afraid), small experiments, reject perfectionism, good enough
73 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
The Magic of Openness!
• By opening collections up we are creating the possibility to have them used in ways only restricted by human imagination.
• Need to work hard to tell people about our Digital Collections and Data especially if not easy to find, creating serendipity and opportunities for use!
• Give plenty of examples to inspire use!
• Support and celebrate the use!
74 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Exercise – Explore or Imagine Our Data!
• CSV of Metadata: https://data.bl.uk/digbks/dig19cbooks-mdata-csv.csv
• 19th Century Books - Book Metadata - 01/09/2013. https://data.bl.uk/digbks/db21.html
• Digitised Books - Flickr Tag History - Dec 2013 to March 2016. TSV https://data.bl.uk/digbks/db15.html
• Digitised Hebrew Manuscripts - Metadata https://data.bl.uk/hebrewmanuscripts/heb1.html
• Digitised Hebrew Manuscripts: Or 2210 - Or 2364 https://data.bl.uk/hebrewmanuscripts/heb8.html
• Theatrical playbills from Britain and Ireland (OCR text only) https://data.bl.uk/playbills/pb2.html
• Portraits of actors, views of theatres and playbills (covering 1750 - 1821 in a single volume) https://data.bl.uk/singlesheet/por1.html
• Volumes of Lysons Collectanea (Amusements), comprising broadsides, cuttings, advertisements on amusements. 1660-1840. https://data.bl.uk/singlesheet/ad1.html
Work in pairs!
https://data.bl.uk
• Report back on Data!
• Data Quality
• Issues
Or an idea you have thought of what to do with the data! http://labs.bl.uk/Ideas+for+Labs
Smaller datasets
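For the 'Data Quality' part of the exercise, a first pass over a metadata CSV might look something like the sketch below: count the rows, count empty cells per column, and list year values that are not plain numbers. The sample rows and the column names (identifier, title, author, year, place) are invented for illustration; the real files on data.bl.uk will have their own headers, so adjust accordingly.

```python
import csv
import io
from collections import Counter

# Invented sample rows; column names are assumptions, not the real BL headers.
SAMPLE_CSV = """identifier,title,author,year,place
000000001,A History of Liverpool,"Smith, J.",1852,London
000000002,Poems,,1871,
000000003,Travels in Wales,"Jones, M.",[1860?],London
"""


def profile(csv_text):
    """Return (row count, {column: number of empty cells}) for a CSV string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if not (value or "").strip():
                missing[column] += 1
    return len(rows), dict(missing)


def non_numeric_values(csv_text, column="year"):
    """List non-empty values in a column that are not made up of digits only."""
    reader = csv.DictReader(io.StringIO(csv_text))
    odd = []
    for row in reader:
        value = (row.get(column) or "").strip()
        if value and not value.isdigit():
            odd.append(value)
    return odd


if __name__ == "__main__":
    print(profile(SAMPLE_CSV))
    print(non_numeric_values(SAMPLE_CSV))
```

To try it on a downloaded file, read the file's text (for example with open(path, encoding="utf-8").read()) and pass it to the same functions; the encoding is also an assumption worth checking.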
75 @BL_Labs @BL_DigiSchol #bldigital https://goo.gl/Mj9DWR
Contact us
Mahendra Mahey
Manager of BL Labs
[email protected]@bl.uk