The Content Mine (presented at UKSG)
A presentation to the UK Serials Group (UKSG) about the value of content mining of scientific literature and the need to allow it without a licence
![Page 1: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/1.jpg)
The Content Mine
Peter Murray-Rust
University of Cambridge and Open Knowledge Foundation
A community of people and machines to extract 100,000,000 scientific facts from the scholarly literature
Slides: CC-BY
Images © Wikimedia CC-BY-SA
![Page 2: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/2.jpg)
![Page 3: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/3.jpg)
If you’re bored …
THREE most important Open Access publishers? (besides BMC and PLoS)
THREE most important Open Access repositories?
![Page 4: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/4.jpg)
TDM* (“Text-and-Data-mining”) is the use of machines to read and understand massive amounts of documents
“The Right to Read is the Right to Mine” (PMR + OKFN)
“Closed Access means people die” (PM-R)
“Text and Data Mining saves Lives” (John McNaught)
*PMR uses “content mining”
![Page 5: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/5.jpg)
(Credit: Seth Rosenblatt/CNET)
Who’s this?
![Page 6: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/6.jpg)
(Credit: Seth Rosenblatt/CNET)
http://news.cnet.com/8301-13578_3-57611642-38/call-to-action-kicks-off-second-aaron-swartz-hackathon/
Aaron Swartz
Died 2013-01-11
Facing 30 years in jail for downloading JSTOR
![Page 7: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/7.jpg)
Typical papers destroy data
Numeric: astro1307.5851v4.pdf
Diagram: birds1471-2148-11-313.pdf
![Page 8: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/8.jpg)
[Neelie Kroes, EU Vice-President, speaking at Research Data Alliance:] we are entering a new “era of open science”, which will be “good for citizens, good for scientists and good for society”. She explicitly highlighted the transformative potential of open access, open data, open software and open educational resources, mentioning the EU’s policy requiring open access to all publications and data resulting from EU-funded research.
http://blog.okfn.org/2013/03/21/we-are-entering-an-era-of-open-science-says-eu-vp-neelie-kroes/
RCUK, Wellcome, ERC, NSF …
require fully OPEN
![Page 9: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/9.jpg)
Content Mining
• Make science discoverable
• Extract facts for research
• Build reusable objects
• Aggregate
• Create new businesses
• Check for errors => better science
![Page 10: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/10.jpg)
Content Mining Problems
• Secondary publishers create walled gardens
• Publishers’ contracts ban content mining
• Publishers cut off universities that mine
• Publishers lobby governments to require “licences for content mining”
• UK Hargreaves legislation will override this by law. Starts 2014.
http://blogs.ch.cam.ac.uk/pmr/2013/10/02/text-and-data-mining-fighting-for-our-digital-future-peter-murray-rust-is-the-problem/
![Page 11: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/11.jpg)
Walled Gardens (“Free” but not “Open”)
A service provider has control over applications, content, and media, and restricts convenient access to non-approved applications or content.
Examples: Mendeley, Facebook, Cambridge Crystallographic Data Centre, OCLC
#animalgarden “Walled Gardens” https://vimeo.com/34323486
![Page 12: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/12.jpg)
![Page 13: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/13.jpg)
http://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden
![Page 14: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/14.jpg)
Licences destroy Content Mining
STM Publishers Licence: 2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: PMR has NO rights)
• [cannot publish to:] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• British Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
![Page 15: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/15.jpg)
Licensing TDM is like publishers taxing spectacles
![Page 16: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/16.jpg)
We can’t turn a hamburger into a cow
But we can now turn PDFs into Science
![Page 17: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/17.jpg)
Zoom in …
![Page 18: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/18.jpg)
Plot annotations: UNITS, TICKS, QUANTITY, SCALE, TITLES
DATA!! 2000+ points
![Page 19: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/19.jpg)
Dumb PDF → CSV → Semantic Spectrum (automatic extraction)
Gaussian Filter, 2nd Derivative
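The Gaussian filter and 2nd derivative named on this slide are standard steps for locating peaks in a spectrum once the points have been recovered as numbers. The following is a minimal sketch of that idea, not AMI's actual code: the function names, the `sigma` and `min_height` parameters, and the synthetic two-peak spectrum are all assumptions made for illustration.

```python
import math

def gaussian_kernel(sigma, radius):
    """Return a normalised 1-D Gaussian kernel of length 2*radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(ys, sigma=2.0):
    """Gaussian-filter ys; indices are clamped at both ends."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(ys)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * ys[j]
        out.append(acc)
    return out

def second_derivative(ys):
    """Central-difference 2nd derivative on a unit-spaced grid (ends left at 0)."""
    d2 = [0.0] * len(ys)
    for i in range(1, len(ys) - 1):
        d2[i] = ys[i - 1] - 2.0 * ys[i] + ys[i + 1]
    return d2

def find_peaks(ys, sigma=2.0, min_height=0.1):
    """Indices where the smoothed curve has a local maximum with negative curvature."""
    s = smooth(ys, sigma)
    d2 = second_derivative(s)
    return [i for i in range(1, len(s) - 1)
            if s[i] > s[i - 1] and s[i] >= s[i + 1]
            and d2[i] < 0 and s[i] >= min_height]

# Synthetic stand-in for an extracted spectrum: two peaks, at x = 30 and x = 70.
spectrum = [math.exp(-((x - 30) ** 2) / 18.0)
            + 0.5 * math.exp(-((x - 70) ** 2) / 32.0)
            for x in range(100)]
peaks = find_peaks(spectrum)
print(peaks)
```

Smoothing before differentiating matters because a finite-difference 2nd derivative amplifies point-to-point noise; real extracted data would also need the x-axis spacing taken into account.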
![Page 20: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/20.jpg)
Evolution of ultraviolet vision in the largest avian radiation - the passerines Anders Ödeen 1* , Olle Håstad 2,3 and Per Alström 4
HTML
Styles, superscripts
And diåcritics preserved!
AMI
![Page 21: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/21.jpg)
PDF: Turdus iliacus, Taeniopygia guttata, Serinus canaria, Lanius excubitor, Melopsittacus undulatus, Pavo cristatus, Sturnus vulgaris, Dolichonyx oryzivorus, Ficedula hypoleuca, Vaccinium myrtillus, Falco tinnunculus
Turdus, Pomatostomus, Leothrix, Amytornis, Acanthisitta, Orthonyx x 2, Malurus, Cnemophilus x 4, Philesturnus x 2, Motacilla x 2, Toxorhampus x 2
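The “Genus x N” tallies on this slide can be produced from a list of extracted binomial names with a simple count. This is only a sketch under assumptions: `genus_tally` is a hypothetical helper, not part of AMI, and the last name in the sample list is added purely to show a repeated genus and does not come from the slide.

```python
from collections import Counter

def genus_tally(binomials):
    """Count genera (first word of each binomial) in first-seen order."""
    counts = Counter(name.split()[0] for name in binomials)
    return [genus if n == 1 else f"{genus} x {n}"
            for genus, n in counts.items()]

names = [
    "Turdus iliacus",
    "Taeniopygia guttata",
    "Serinus canaria",
    "Turdus merula",  # illustrative extra entry, not from the slide
]
tally = genus_tally(names)
print(tally)
```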
![Page 22: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/22.jpg)
Typical phylo tree: 60 nodes, complex and minuscule annotation, vertical text, hyphenation and valuable branch lengths. AMI extracts ALL
![Page 23: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/23.jpg)
Family: Acanthisittidae, Acanthizidae, Acrocephalidae, Callaeidae, Campephagidae, Cnemophilidae, Corvidae
Genus: Acanthisitta, Acrocephalus, Ailuroedus, Ailuroedus, Amytornis, Camptostoma
Posterior probability: 0.84, 0.91, 0.93, 0.95
AMI can MEASURE branch lengths! AMI: 23.12, 34.54, 37.21, 38.55
NexML, HTML
![Page 24: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/24.jpg)
10 million spectra published per year
![Page 25: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/25.jpg)
![Page 26: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/26.jpg)
![Page 27: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/27.jpg)
Review of the NMR data reported in the Supporting Information in this article evidences instances where some of the spectra were inappropriately edited to remove impurities. A coauthor and former student, Dr. Bruno Anxionnat, has shared with me formal communication in which he states “I would like to take full responsibility for this entire situation. I was in charge of making the SI of my papers and I erased some peaks without telling anybody. All my supervisors (Pr. Cossy, Dr. Gomez Pardo and Dr. Ricci) trusted me and I wasn't dependable. I am the only one who has to be blamed for all that, in any case them. I know my behavior is highly unethical. I am deeply sorry for what I have done and for hurting people….”
![Page 28: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/28.jpg)
Crystallography Walled Garden
A service provider has control over applications, content, and media, and restricts convenient access to non-approved applications or content.
![Page 29: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/29.jpg)
From Saulius Grazulis
![Page 30: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/30.jpg)
Crystaleye
• A database of 200,000 crystal structures scraped from publications’ CIF supplemental information
• CML molecules and name-value pairs
• Re-usable as fragment base
Nick Day, Jim Downing, Sam Adams, N. W. England and Peter Murray-Rust*
J. Appl. Cryst. (2012). 45, 316–323, doi:10.1107/S0021889812006462
http://wwmm.ch.cam.ac.uk/crystaleye
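To give a feel for what “name-value pairs” from CIF supplemental information look like, here is a deliberately simplified sketch. It is not Crystaleye's code: `parse_cif_pairs` and `numeric_value` are hypothetical helpers, the sample CIF fragment contains invented values, and the parser assumes each value sits on the same line as its data name (it ignores `loop_` tables and multi-line text fields).

```python
import re

def parse_cif_pairs(text):
    """Collect '_name value' pairs that appear on a single line."""
    pairs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("_"):
            continue  # skip data_ headers, loop rows, comments, blanks
        parts = line.split(None, 1)
        if len(parts) == 2:  # a bare name is a loop column header, so skip it
            pairs[parts[0]] = parts[1].strip("'\"")
    return pairs

def numeric_value(value):
    """Parse a number such as '10.000(1)', dropping the standard uncertainty."""
    match = re.fullmatch(r"([-+]?\d+(?:\.\d+)?)(?:\(\d+\))?", value)
    return float(match.group(1)) if match else None

# Invented sample fragment, for illustration only.
sample = """\
data_example
_chemical_formula_sum 'C6 H6'
_cell_length_a 10.000(1)
loop_
_atom_site_label
_atom_site_fract_x
C1 0.0569
"""

pairs = parse_cif_pairs(sample)
print(pairs)
print(numeric_value(pairs["_cell_length_a"]))
```

A production reader would need a full CIF grammar; the point here is only that the supplemental files are already structured enough for machines to harvest.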
![Page 31: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/31.jpg)
![Page 32: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/32.jpg)
“Nuggets” in a scientific paper: quantity, units, value ranges, chemical, project, places
Humans aren’t designed to mine this …
![Page 33: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/33.jpg)
The Content Mine
A community of people and machines to extract scientific facts from the scholarly literature on a global scale.
https://vimeo.com/78353557
![Page 34: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/34.jpg)
100,000 lines of Open code for translating PDFs to science. 10 years’ work (PMR).
AMI works!
AMI
![Page 35: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/35.jpg)
We have friends
• ProPublica is a NY digital-democracy newspaper
• Tabula is an Open PDF-table extractor
• Mozilla fights for web freedom
![Page 36: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/36.jpg)
![Page 37: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/37.jpg)
Boot-Camps and hacks
Open Science, Oxford 2013-11-27 (sold out before announcement!)
![Page 38: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/38.jpg)
Collaborators:
I have talked with:
• BMC
• PLoS
• British Library
• Mozilla
• Software Carpentry
• EuropePMC
• Creative Commons
• OKFN
I hope to talk with:
• Wellcome
• JISC
• Ubiquity
• Royal Society
• Kitware
• SPARC
• …
• …
![Page 39: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/39.jpg)
• “The right to read is the right to mine”
• Unrestricted TDM saves lives
• Libraries – reject TDM restrictions
• Publishers – Damascene conversion
• Funders – insist on CC-BY
@petermurrayrust
http://blogs.ch.cam.ac.uk/pmr
![Page 40: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/40.jpg)
![Page 41: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/41.jpg)
3 most important Open Access repositories?
• Wikimedia
• GitHub, StackOverflow
• National libraries and museums
3 most important Open Access publishers?
• Wikipedia
• NIH + EBI + other bio databases
• arXiv, CERN/SCOAP
• + PLoS + BMC
![Page 42: The Content Mine (presented at UKSG)](https://reader035.vdocuments.mx/reader035/viewer/2022062512/55495f09b4c905f24e8b5802/html5/thumbnails/42.jpg)
300 Billion USD annually on Science+Medicine
FACTS! LOST! FACTS! LOST! FACTS! LOST! FACTS!
• “we repeat about 25% of our chemistry because we didn’t know we’d done it already”
• 10,000 phylogenetic trees at 25,000 USD each; only 4% have data (loss = 240 Million USD)
• Computational chemistry – materials: NO DATA, perhaps 1,000,000,000 USD
FACTS! LOST! FACTS! LOST! FACTS! LOST! FACTS!
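The phylogenetic-tree loss figure follows directly from the three numbers on the slide. A quick check, using only those numbers (the variable names are mine):

```python
trees = 10_000              # phylogenetic trees
cost_per_tree_usd = 25_000  # cost of each tree
percent_with_data = 4       # share of trees published with data

total_usd = trees * cost_per_tree_usd
lost_usd = total_usd * (100 - percent_with_data) // 100
print(total_usd, lost_usd)
```

So 250 million USD is spent in total, and the 96% of trees without data account for the 240 million USD loss quoted.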