digital scholarship: enlightenment or devastated landscape?

Post on 13-Apr-2017

Digital Scholarship: Enlightenment or Devastated Landscape?
Peter Murray-Rust, University of Cambridge
IT Future Conference, Informatics Forum, Edinburgh, UK 2015-12-17

(Glen Feshie, remains of forest, CC-BY-SA 2.0 Ian Shiell http://www.geograph.org/uk/photo/3944612.jpg )

Hi, I'm here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture.

In this talk, I'm going to stress the importance of data in a specific format and its utility for automated machine processing. Then I'm going to demonstrate AMI's architecture and the transformation of data as it flows through the process. I'm going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I'm going to introduce Andy's ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor within the realms of validation and metabolism.
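The idea of visitors as pluggable extractors over SVG can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`Visitor`, `TextVisitor`, `LineVisitor`), the `run_visitors` helper and the sample SVG are hypothetical and are not AMI's actual API or data.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical sample input: one line segment and one text label.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <line x1="0" y1="0" x2="10" y2="0"/>
  <text x="2" y="5">OH</text>
</svg>"""


class Visitor:
    """Hypothetical base class: each visitor extracts one kind of content."""

    def visit(self, element):
        raise NotImplementedError

    def results(self):
        raise NotImplementedError


class TextVisitor(Visitor):
    """Collects the character content of every SVG <text> element."""

    def __init__(self):
        self.strings = []

    def visit(self, element):
        if element.tag == SVG_NS + "text" and element.text:
            self.strings.append(element.text.strip())

    def results(self):
        return self.strings


class LineVisitor(Visitor):
    """Collects the end points of every SVG <line> element."""

    def __init__(self):
        self.segments = []

    def visit(self, element):
        if element.tag == SVG_NS + "line":
            keys = ("x1", "y1", "x2", "y2")
            self.segments.append(tuple(float(element.get(k, "0")) for k in keys))

    def results(self):
        return self.segments


def run_visitors(svg_text, visitors):
    """Walk the SVG tree once and offer every element to every visitor."""
    root = ET.fromstring(svg_text)
    for element in root.iter():
        for visitor in visitors:
            visitor.visit(element)
    return {type(v).__name__: v.results() for v in visitors}


if __name__ == "__main__":
    print(run_visitors(SAMPLE_SVG, [TextVisitor(), LineVisitor()]))
```

Adding a new kind of extraction then means writing one more visitor class; the tree walk itself does not change.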

University of Stirling 1972: student occupations and sit-ins

University of Stirling
Used without permission but with thanks and Love
Liverpool, Warwick, Emmanuel Coll. Camb., UCL, Glasgow, Middlesex
Peter Murray-Rust, Lecturer

Output of scholarly publishing

586,364 Crossref DOIs per month (2015-07) [1]
>2.5 million (papers + supplemental data) /year*
*4500 m high per year [2]
Representing ? 500 Billion USD public funding
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg

Refs: Erriquez_Daniela_tesi, Fiorentina_Elena_tesi, Gou_Qian_Tesi, mbarontini_tesid, terracciano_maria_tesi

BagOfWords for Italian Theses

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

What is Content?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS, MAPS, TABLES, CHEMISTRY, TEXT, MATH
contentmine.org tackles these

https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA

Root

4500 papers each with 1 tree

OCR (Tesseract)
Norma (image analysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
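The bracketed string on the previous slide is in Newick notation (nested parentheses, `name:branch_length` pairs, a closing semicolon), which is what makes the output computable. As a minimal sketch, assuming only that subset of Newick, a few lines of Python can turn such a string back into a tree. The function names and the dict layout are illustrative choices, not part of the ContentMine tools; the sample is one subtree copied from the slide.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    Quoted labels and comments are not handled in this sketch.
    """
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_token(stop_chars):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stop_chars:
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1  # consume ")"
        name = read_token(",():;")
        length = None
        if peek() == ":":
            pos += 1  # consume ":"
            length = float(read_token(",();"))
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if peek() != ";":
        raise ValueError("expected ';' at end of tree")
    return root


def leaf_names(node):
    """Return the names of all leaves, left to right."""
    if not node["children"]:
        return [node["name"]]
    names = []
    for child in node["children"]:
        names.extend(leaf_names(child))
    return names


# One subtree taken from the slide's string.
SUBTREE = "(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158;"

if __name__ == "__main__":
    print(leaf_names(parse_newick(SUBTREE)))
```

Once trees are in this form, operations such as listing species or merging trees from many papers become ordinary tree code.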

Supertree for 924 species
Tree

Supertree created from 4300 papers

Systematic reviews of the Neuroscience literature: 30,000 papers in 1 year
Extraction of data from graphs

Malcolm Macleod, Professor of Neurology and Translational Neuroscience at the Centre for Clinical Brain Sciences, University of Edinburgh, with ContentMine 2015

UNITS, TICKS, QUANTITY, SCALE

TITLES
DATA!! 2000+ points
VECTOR PDF

Dumb PDF, CSV, Semantic Spectrum

2nd Derivative
Smoothing Gaussian Filter
Automatic extraction
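The slide names two steps, a smoothing Gaussian filter and a 2nd derivative. How they are combined in the actual software is not stated, so the following is only a sketch of one common arrangement: smooth the extracted points, take a discrete second derivative, and report the positions where it has a pronounced negative minimum as peaks. The synthetic spectrum, the function names and the `rel_depth` cutoff are assumptions made for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=2.0):
    """Gaussian smoothing; indices beyond the ends are clamped."""
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values):
    """Central second difference with unit spacing; end points are set to 0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2.0 * values[i] + values[i + 1]
    return d2


def find_peaks(values, sigma=2.0, rel_depth=0.1):
    """Indices where the smoothed 2nd derivative has a clear negative minimum."""
    d2 = second_derivative(smooth(values, sigma))
    deepest = min(d2) if d2 else 0.0
    if deepest >= 0:
        return []
    cutoff = rel_depth * deepest  # ignore shallow minima (assumed threshold)
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < cutoff and d2[i] < d2[i - 1] and d2[i] <= d2[i + 1]]


def synthetic_spectrum(n=100, centres=(30, 70), width=3.0):
    """Made-up test data: a sum of Gaussian peaks at the given centres."""
    return [sum(math.exp(-((x - c) ** 2) / (2.0 * width * width)) for c in centres)
            for x in range(n)]


if __name__ == "__main__":
    print(find_peaks(synthetic_spectrum()))
```

Smoothing before differentiating matters because a second difference amplifies point-to-point noise in the extracted data.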

Polly has 20 seconds to read this paper and 10,000 more

ContentMine software can cut the effort by 50%
Polly: there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).

ContentMine Tools*
http://iucn.contentmine.org (endangered species)
http://fotd.contentmine.org (fact of the day)
http://bubbles.contentmine.org (network analysis of papers)

*Dr. Mark MacGillivray, Informatics Forum, University of Edinburgh

Fact of the Day
http://fotd.contentmine.co/?s=daily20151209 (images from https://en.wikipedia.org/wiki/Caenorhabditis_elegans CC-BY-SA)

Facts in context
daily IUCN endangered species news

en.wikipedia.org CC By-SA

http://www.budapestopenaccessinitiative.org/read
an unprecedented public good.

completely free and unrestricted access to [digital scholarship] by all scientists, scholars, teachers, students, and other curious minds.

share the learning of the rich with the poor and the poor with the rich, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2002)

DNADigest + ContentMine looking for DNA datasets in the literature
European Bioinformatics Institute, 2015-12-11

C) What's the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084-4087

Original thanks to ChemBark

ChemBark

After AMI2 processing: AMI2 has detected a square

Chris Hartgerink, University of Tilburg
I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress.

I am content mining results reported in the psychology literature

Elsevier stopped me doing my research
0000-0003-1050-6809

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started bulk downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.

Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 35KB/s, 0.0021GB/min, 0.125GB/h, 3GB/day.
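The rates quoted above follow directly from the two totals given (30 GB over 10 days). A minimal sketch of the arithmetic, assuming decimal units (1 GB = 1,000,000 KB); the function name is illustrative:

```python
def download_rates(total_gb, days):
    """Average transfer rates for total_gb spread evenly over the given days."""
    gb_per_day = total_gb / days
    gb_per_hour = gb_per_day / 24
    gb_per_minute = gb_per_hour / 60
    kb_per_second = gb_per_minute * 1_000_000 / 60  # decimal units assumed
    return {
        "gb_per_day": gb_per_day,
        "gb_per_hour": gb_per_hour,
        "gb_per_minute": gb_per_minute,
        "kb_per_second": kb_per_second,
    }


if __name__ == "__main__":
    for name, value in download_rates(30, 10).items():
        print(f"{name}: {value:.4f}")
```

Rounded, these give 3 GB/day, 0.125 GB/h, 0.0021 GB/min and about 35 KB/s, matching the figures in the post.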

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.

I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods, 1-22. doi: 10.3758/s13428-015-0664-2

[MINOR EDITS: the link to the article was broken, should be fixed now. Also, I made the mistake of using "0.0021GB/s" which is now changed into "0.0021GB/min"; I also added "35KB/s" for completeness. One last thing: I am aware of Elsevier's TDM License agreement, and I nonetheless thank those who directed me towards it.]
