Transcript
Page 1: Digital Scholarship: Enlightenment or Devastated Landscape?

Digital Scholarship: Enlightenment or Devastated Landscape?

Peter Murray-Rust, University of Cambridge

IT Future Conference, Informatics Forum, Edinburgh, UK 2015-12-17

(Glen Feshie, remains of forest, CC-BY-SA 2.0 Ian Shiell http://www.geograph.org/uk/photo/3944612.jpg )

Page 2: Digital Scholarship: Enlightenment or Devastated Landscape?

University of Stirling 1972 student occupations and sit-ins

University of Stirling

Used without permission but with thanks and love

Liverpool, Warwick, Emmanuel Coll Camb., UCL, Glasgow, Middlesex, …

Peter Murray-Rust, Lecturer

Page 3: Digital Scholarship: Enlightenment or Devastated Landscape?

Output of scholarly publishing

[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg

586,364 Crossref DOIs per month (201507) [1]

> 2.5 million (papers + supplemental data) / year*

4500 m high per year [2]

Representing ? 500 Billion USD public funding

[1] http://www.crossref.org/01company/crossref_indicators.html

Page 4: Digital Scholarship: Enlightenment or Devastated Landscape?

Refs: Erriquez_Daniela_tesi, Fiorentina_Elena_tesi, Gou_Qian_Tesi, mbarontini_tesid, terracciano_maria_tesi

BagOfWords for Italian Theses
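The slide only names the bag-of-words technique, so the following is a minimal sketch of the idea rather than the ContentMine implementation. The function name `bag_of_words` and the sample Italian sentence are hypothetical, chosen for illustration.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count how often each word occurs, ignoring case, digits and punctuation."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented Italian words survive.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)


# Hypothetical sample sentence (not taken from any of the theses on the slide).
sample = "La sintesi chimica e la sintesi organica"
counts = bag_of_words(sample)
print(counts.most_common(2))
```

A real pipeline would usually also remove stop words such as "la" and "e" before comparing documents.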

Page 5: Digital Scholarship: Enlightenment or Devastated Landscape?

http://chemicaltagger.ch.cam.ac.uk/


Typical chemical synthesis

Page 6: Digital Scholarship: Enlightenment or Devastated Landscape?

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 7: Digital Scholarship: Enlightenment or Devastated Landscape?

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 8: Digital Scholarship: Enlightenment or Devastated Landscape?

https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA

Page 9: Digital Scholarship: Enlightenment or Devastated Landscape?

“Root” 4500 papers each with 1 tree

Page 10: Digital Scholarship: Enlightenment or Devastated Landscape?

OCR (Tesseract)

Norma (image analysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
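The parenthesised string on this slide is in Newick format, which is what makes the extracted tree computable. As a rough illustration, here is a minimal Newick reader. It is a sketch, not the ContentMine code: the names `Node`, `parse_newick` and `leaf_names` are hypothetical, and the input is a small complete fragment built from the first three taxa on the slide, because the full string there is elided.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a tree read from Newick text."""
    name: str = ""
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    """Parse a simple Newick string such as '((A:1,B:2):3,C:4);'."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        node = Node()
        if text[pos] == "(":
            pos += 1  # consume '('
            while True:
                node.children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # end of this group
                    break
                else:
                    raise ValueError(f"unexpected {text[pos]!r} at position {pos}")
        # Optional label, then optional ':branch_length'.
        start = pos
        while text[pos] not in ",):;":
            pos += 1
        node.name = text[start:pos]
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",);":
                pos += 1
            node.length = float(text[start:pos])
        return node

    root = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected text before final ';'")
    return root


def leaf_names(node: Node) -> List[str]:
    """Collect leaf labels from left to right."""
    if not node.children:
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names


# Complete fragment using taxa and branch lengths from the slide's tree.
fragment = "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301);"
tree = parse_newick(fragment)
print(leaf_names(tree))
```

This reader handles only plain labels and numeric branch lengths; quoted labels and comments, which Newick also allows, are left out to keep the sketch short.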

Page 11: Digital Scholarship: Enlightenment or Devastated Landscape?

Supertree for 924 species

Tree

Page 12: Digital Scholarship: Enlightenment or Devastated Landscape?

Supertree created from 4300 papers

Page 13: Digital Scholarship: Enlightenment or Devastated Landscape?

Systematic reviews of the Neuroscience literature:
• 30,000 papers in 1 year
• Extraction of data from graphs

Malcolm Macleod, Professor of Neurology and Translational Neuroscience at the Centre for Clinical Brain Sciences, University of Edinburgh, with ContentMine 2015

Page 14: Digital Scholarship: Enlightenment or Devastated Landscape?
Page 15: Digital Scholarship: Enlightenment or Devastated Landscape?

UNITS

TICKS

QUANTITY

SCALE

TITLES

DATA!! 2000+ points

VECTOR PDF

Page 16: Digital Scholarship: Enlightenment or Devastated Landscape?

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter

Automatic extraction
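The slide names two processing steps for an extracted spectrum: a Gaussian smoothing filter and a second derivative. The sketch below shows one plausible way to chain them in plain Python. The order of the steps, the parameter values and the function names (`gaussian_kernel`, `smooth`, `second_derivative`) are assumptions, and the input is a synthetic peak rather than data from the slide.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values: List[float], sigma: float = 2.0) -> List[float]:
    """Convolve with a Gaussian kernel, clamping indices at the edges."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # repeat the end values
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values: List[float], step: float = 1.0) -> List[float]:
    """Central finite difference; the two endpoints are left at 0.0."""
    out = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        out[i] = (values[i - 1] - 2.0 * values[i] + values[i + 1]) / (step * step)
    return out


# Synthetic spectrum: a single Gaussian peak centred at x = 50 (assumed data).
raw = [math.exp(-((x - 50) ** 2) / (2.0 * 5.0 ** 2)) for x in range(101)]
smoothed = smooth(raw, sigma=2.0)
d2 = second_derivative(smoothed)

# A peak maximum shows up as the most negative second derivative.
peak_index = min(range(len(d2)), key=lambda i: d2[i])
print(peak_index)
```

Smoothing first matters because differentiating twice amplifies point-to-point noise, which is common in data recovered from a plotted curve.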

Page 17: Digital Scholarship: Enlightenment or Devastated Landscape?

Polly has 20 seconds to read this paper…

…and 10,000 more

Page 18: Digital Scholarship: Enlightenment or Devastated Landscape?

ContentMine software can cut the effort by 50%

Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
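The effort figures in Polly's account can be checked with a few lines of arithmetic. This is a worked check only: the variable names are illustrative, and the last step simply applies the 50% reduction claimed on the slide to her numbers.

```python
abstracts = 10_000
researchers = 6

# Even split: about 1,667 abstracts each (the quote rounds to ~1,600).
per_researcher = abstracts / researchers

# Each researcher spent 2-3 full days on screening.
min_person_days = researchers * 2   # the "12 days" minimum in the quote
max_person_days = researchers * 3

# Applying the slide's claimed 50% cut in effort.
reduction = 0.5
min_after = min_person_days * (1 - reduction)
max_after = max_person_days * (1 - reduction)

print(round(per_researcher), min_person_days, max_person_days, min_after, max_after)
```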

Page 19: Digital Scholarship: Enlightenment or Devastated Landscape?

ContentMine Tools*

http://iucn.contentmine.org (endangered species) http://fotd.contentmine.org (fact of the day) http://bubbles.contentmine.org (network analysis of papers)

*Dr. Mark MacGillivray, Informatics Forum, University of Edinburgh

Page 20: Digital Scholarship: Enlightenment or Devastated Landscape?

Fact of the Day
• http://fotd.contentmine.co/?s=daily20151209

(images from https://en.wikipedia.org/wiki/Caenorhabditis_elegans CC-BY-SA)

Page 21: Digital Scholarship: Enlightenment or Devastated Landscape?

Facts in context

daily IUCN endangered species news

en.wikipedia.org CC By-SA

Page 22: Digital Scholarship: Enlightenment or Devastated Landscape?

http://www.budapestopenaccessinitiative.org/read

… an unprecedented public good. …

… completely free and unrestricted access to [digital scholarship] by all scientists, scholars, teachers, students, and other curious minds. …

…share the learning of the rich with the poor and the poor with the rich, … and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2003)

Page 23: Digital Scholarship: Enlightenment or Devastated Landscape?

DNADigest + ContentMine looking for DNA datasets in the literature

European Bioinformatics Institute, 2015-12-11

Page 24: Digital Scholarship: Enlightenment or Devastated Landscape?

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

Page 25: Digital Scholarship: Enlightenment or Devastated Landscape?

After AMI2 processing…..

… AMI2 has detected a square

Page 26: Digital Scholarship: Enlightenment or Devastated Landscape?
Page 27: Digital Scholarship: Enlightenment or Devastated Landscape?

Chris Hartgerink, University of Tilburg

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress.

…I am content mining results reported in the psychology literature

Page 28: Digital Scholarship: Enlightenment or Devastated Landscape?

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.

Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop.

My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.

I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2
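The server-load figures quoted above follow directly from the two numbers given (about 30 GB in about 10 days). A short worked check, with illustrative variable names:

```python
total_gb = 30.0   # approximate download volume quoted in the post
days = 10         # approximate duration quoted in the post

gb_per_day = total_gb / days        # 3 GB/day
gb_per_hour = gb_per_day / 24       # 0.125 GB/h
gb_per_minute = gb_per_hour / 60    # about 0.0021 GB/min

print(gb_per_day, gb_per_hour, round(gb_per_minute, 4))
```

All three values agree with the rates stated in the post, assuming the download was spread evenly over the whole period.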

Chris Hartgerink’s blog post

“Elsevier stopped me doing my research”

Page 29: Digital Scholarship: Enlightenment or Devastated Landscape?

The Right to Read is

The Right to Roam

The Right to Mine

Kinder Mass Trespass used without permission but with love and thanks

Page 30: Digital Scholarship: Enlightenment or Devastated Landscape?

The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org

Page 31: Digital Scholarship: Enlightenment or Devastated Landscape?

2014 UK “Hargreaves” reform

Page 32: Digital Scholarship: Enlightenment or Devastated Landscape?
Page 33: Digital Scholarship: Enlightenment or Devastated Landscape?

Proposed amendment after publisher lobbying

Julia Reda’s report

Page 34: Digital Scholarship: Enlightenment or Devastated Landscape?

STM Publishers Licence 2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: we have NO rights)
• [cannot publish to:] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”

Heather Piwowar: “negotiating with publishers [made me physically ill]”

WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …

Licences destroy Content Mining

Page 35: Digital Scholarship: Enlightenment or Devastated Landscape?

Julia Reda MEP

The current copyright regime is undermining our ability to produce evidence. It is time that academics in large numbers … speak up about this issue. Decreasing the very substantial burdens and transaction costs for research and education is one of the declared goals of the Commission’s copyright reform proposal, and the European Parliament has echoed that sentiment in my report.

Prof Ian Hargreaves: … make sure that the voices of the digital many are not drowned out in policy discussions by the digitally self-interested few.

http://www.create.ac.uk/blog/2015/09/16/epip2015-opening-keynote-response-transcript/

there’s a serious risk of Europe digging itself deeper into a digital black hole on copyright,

Page 36: Digital Scholarship: Enlightenment or Devastated Landscape?

http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html

We were stunned recently when we stumbled across an article by European researchers in Annals of Virology [1982]*: “The results seem to indicate that Liberia has to be included in the Ebola virus endemic zone.” In the future, the authors asserted, “medical personnel in Liberian health centers should be aware of the possibility that they may come across active cases and thus be prepared to avoid nosocomial epidemics,” referring to hospital-acquired infection.

*Still behind a 35USD paywall

Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)

Vera Mussah (director of county health services)

Cameron Nutt (Ebola response adviser to Partners in Health)

A System Failure of Scholarly Publishing

Page 37: Digital Scholarship: Enlightenment or Devastated Landscape?

[1] The Military-Industrial-Academic complex (1961)(Dwight D Eisenhower, US President)

Publishers

Academia

Glory+?

$$, MS review

Taxpayer

Student

Researcher

$$ $$

in-kind

The Publisher-Academic complex[1]

Page 38: Digital Scholarship: Enlightenment or Devastated Landscape?

Panton Principles for Open Scientific Data

Jenny Molloy

Ross Mounce, Sam Moore, Peter Kraker, Rosie Gray, Sophie Kay

PANTON ARMS

Panton Fellows

CC0

2010

http://pantonprinciples.org/about/

Page 39: Digital Scholarship: Enlightenment or Devastated Landscape?

Elsevier wants to control Open Data

[asked by Michelle Brook]

Page 40: Digital Scholarship: Enlightenment or Devastated Landscape?

Scholarly infrastructure becomes closed

No accountability for monitoring and control

Page 41: Digital Scholarship: Enlightenment or Devastated Landscape?

Thanks to some Children of the Digital Enlightenment

• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• The ContentMine Team
• Mark MacGillivray
• Rufus Pollock
• Jonathan Gray
• Sophie Kay
• Aaron Swartz
• Chris Hartgerink

Jean-Claude Bradley [1], a chemist, developed Open notebook science: making the entire primary record of a research project publicly available online as it is recorded. (WP)

J-C promoted these ideas with UNDERGRADUATE scientists.

[1] Unfortunately J-C died in 2014; we held a memorial meeting in Cambridge

Sophie Kay

Page 42: Digital Scholarship: Enlightenment or Devastated Landscape?

http://www.budapestopenaccessinitiative.org/read

… an unprecedented public good. …

… completely free and unrestricted access to [digital scholarship] by all scientists, scholars, teachers, students, and other curious minds. …

…share the learning of the rich with the poor and the poor with the rich, … and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2003)

Page 43: Digital Scholarship: Enlightenment or Devastated Landscape?

Discussion

• Let’s concentrate on what we can do to create positive change, rather than explain why we can’t do anything.*

• * “It’s not our fault, it’s (a) librarians (b) researchers (c) publishers (d) funders (e) governments (f) scholarly societies (g) principals/Vice-chancellors …”

Page 44: Digital Scholarship: Enlightenment or Devastated Landscape?
