High throughput mining of the plant-science literature

Mining science from the plant literature. ContentMine. Rothamsted Research, Harpenden, UK, 2016-09-12. Peter Murray-Rust, [1] University of Cambridge, [2] TheContentMine. 5,000 scholarly publications every day. How many relate to plants?

Upload: petermurrayrust

Post on 14-Feb-2017


Page 1: High throughput mining of the plant-science literature

Mining science from the plant literature

ContentMine

Rothamsted Research, Harpenden, UK, 2016-09-12

Peter Murray-Rust, [1] University of Cambridge, [2] TheContentMine

5,000 scholarly publications every day. How many relate to plants?

Page 2: High throughput mining of the plant-science literature

Overview

• Scholarly literature
• Automation of downloading, normalization
• Discipline-dependent semantics/ontology
• Classification
• Extraction
• Annotation
• Mining diagrams
• Politics of mining

Page 3: High throughput mining of the plant-science literature

The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org

Page 4: High throughput mining of the plant-science literature

(2x digital music industry!)

Page 5: High throughput mining of the plant-science literature

Output of scholarly publishing

• 586,364 Crossref DOIs/month (2015-07) [1]
• 8,000 papers/day
• 2.5-3 million (papers + supplemental data)/year
• Each 3 mm thick: a stack 4,500 m high per year [2]
• Most is not publicly readable

[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
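The figures on this slide can be cross-checked with a few lines of arithmetic. This is only a sketch: it takes the slide's own numbers (8,000 papers/day, 3 mm per item) at face value, and the function and constant names are illustrative.

```python
# Cross-check of the slide's publication figures (values copied from the slide).
PAPERS_PER_DAY = 8_000
THICKNESS_MM = 3  # assumed thickness of one printed item


def papers_per_year(per_day=PAPERS_PER_DAY):
    """Yearly output, assuming the daily rate holds for 365 days."""
    return per_day * 365


def stack_height_m(items, thickness_mm=THICKNESS_MM):
    """Height in metres of a stack of `items` printed items."""
    return items * thickness_mm / 1000


yearly = papers_per_year()
print(f"{yearly:,} papers/year")
print(f"{stack_height_m(yearly):,.0f} m stack per year")
```

At 8,000 papers/day the yearly total falls inside the slide's 2.5-3 million range. The stack height is sensitive to the assumed thickness: the slide's 4,500 m would correspond to an average item of roughly 1.5 mm, or to a smaller yearly count, rather than to 3 mm at this rate.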

Page 6: High throughput mining of the plant-science literature

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 7: High throughput mining of the plant-science literature

MozFest 2015

ContentMine + TGAC / hack

Page 8: High throughput mining of the plant-science literature

Terpinome Phytochemists!

Salvia officinalis

Salvia microphylla

Origanum vulgare

Ocimum basilicum

Laurus nobilis [1]

[1] Lauraceae

Page 9: High throughput mining of the plant-science literature

We can search for

• Plants
• Compounds
• Other species
• Diseases
• Frequent terms

• We’ll need: sources, dictionaries, software

Page 10: High throughput mining of the plant-science literature

Europe PubMedCentral

Over 1 million biomedical papers

Page 11: High throughput mining of the plant-science literature
Page 12: High throughput mining of the plant-science literature
Page 13: High throughput mining of the plant-science literature

Dictionaries!

Diseases (WHO)

Page 14: High throughput mining of the plant-science literature
Page 15: High throughput mining of the plant-science literature
Page 16: High throughput mining of the plant-science literature

CONTENTMINE SOFTWARE (architecture diagram, latest 20150908)

• Sources: Crossref; EuPMC, arXiv, CORE, HAL, (university repositories); publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)
• Discovery: catalogue, query, DailyCrawl; input as URLs and DOIs
• Download: getpapers (queries), quickscrape (scrapers)
• Formats handled: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
• norma: Normalizer, Structurer, SemanticTagger (taggers), giving text, data and figures, sectioned into abstract, methods, references, captioned figures (Fig. 1) and HTML tables
• Semantic ScholarlyHTML (W3C community group); 100,000 pages/day
• ami: search and lookup, with plugins for Chem, Phylo, Trials, Crystal, Plants
• Output: facts; visualization and analysis
• Around the pipeline: CONTENT MINING, COMMUNITY

Page 17: High throughput mining of the plant-science literature

What plants produce Carvone?

https://en.wikipedia.org/wiki/Carvone


Page 18: High throughput mining of the plant-science literature

Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
  Search “carvone”, output to carvone/, format XML, limit 100 hits

• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs. Results in dataTables.html and results/…/results.xml (includes W3C annotation)

• python cmhypy.py carvone/ -u petermr <key>
  Send annotations to hypothes.is
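The local search step can be pictured as a dictionary lookup over each normalized paper. The sketch below is not the cmine or ami implementation: the dictionary contents, the `search_terms` helper and the sample sentence are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries; real ContentMine dictionaries are far larger.
DICTIONARIES = {
    "species": ["Mentha spicata", "Carum carvi", "Anethum graveolens"],
    "phytochem": ["carvone", "limonene"],
}


def search_terms(text, dictionaries):
    """Count case-insensitive whole-phrase matches of each dictionary term."""
    hits = {}
    for facet, terms in dictionaries.items():
        counts = Counter()
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            n = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if n:
                counts[term] = n
        hits[facet] = counts
    return hits


# Made-up sample sentence, standing in for the text of a downloaded paper.
sample = ("Carvone is abundant in oil of Carum carvi and Mentha spicata; "
          "limonene accompanies carvone in both.")
result = search_terms(sample, DICTIONARIES)
for facet, counts in result.items():
    print(facet, dict(counts))
```

Grouping the counts by facet (species, phytochem, and so on) gives the kind of per-article table that can then be browsed or aggregated across a whole download directory.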

Page 19: High throughput mining of the plant-science literature

Search for carvone

Page 20: High throughput mining of the plant-science literature

https://en.wikipedia.org/wiki/Carvone

WIKIDATA

Page 21: High throughput mining of the plant-science literature

Carvone in Wikidata

Also SPARQL endpoint

WP identifier

Chemical type

Chemical identifier
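Because Wikidata exposes a SPARQL endpoint, the question "What plants produce carvone?" can in principle be phrased as a query. The sketch below only builds the request URL and does not send it. It assumes that the relevant statements use property P703 ("found in taxon") and that the compound can be located by its English label; whether Wikidata actually holds such statements for carvone is not checked here, and the helper names are illustrative.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def taxon_query(compound_label):
    """Build a SPARQL query for taxa in which a labelled compound is found.

    Assumes property P703 ("found in taxon"); matching by English label is
    a simplification and may return several items.
    """
    return f"""
SELECT ?compound ?taxon ?taxonLabel WHERE {{
  ?compound rdfs:label "{compound_label}"@en ;
            wdt:P703 ?taxon .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT 100
""".strip()


def request_url(compound_label):
    """Return a GET URL asking the endpoint for JSON results."""
    params = {"query": taxon_query(compound_label), "format": "json"}
    return ENDPOINT + "?" + urlencode(params)


print(taxon_query("carvone"))
print(request_url("carvone"))
```

Matching on the Wikidata item identifier instead of the label would be more robust once the identifier for the compound is known.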

Page 22: High throughput mining of the plant-science literature

ARTICLES FACETS

gene disease drug Phytochem

species genus words

Page 23: High throughput mining of the plant-science literature

Suggest the title of this article

Page 24: High throughput mining of the plant-science literature
Page 25: High throughput mining of the plant-science literature
Page 26: High throughput mining of the plant-science literature

species words

drug Phytochem disease

Page 27: High throughput mining of the plant-science literature

species words

drug Phytochem disease

disease

Page 28: High throughput mining of the plant-science literature

Annotation (entity in context)

prefix

surface

label

location

suffix
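One way to picture an "entity in context" record is a match together with the text on either side of it. The sketch below is an assumption about the shape of such a record, using the slide's field names (prefix, surface, label, location, suffix); the `annotate` helper, the context width and the sample text are illustrative, not ContentMine's own format.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    prefix: str    # text immediately before the match
    surface: str   # the matched string as it appears in the text
    label: str     # dictionary term that matched
    location: int  # character offset of the match in the text
    suffix: str    # text immediately after the match


def annotate(text, terms, width=20):
    """Return one Annotation per case-insensitive match of each term."""
    found = []
    for term in terms:
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            found.append(Annotation(
                prefix=text[max(0, m.start() - width):m.start()],
                surface=m.group(0),
                label=term,
                location=m.start(),
                suffix=text[m.end():m.end() + width],
            ))
    return sorted(found, key=lambda a: a.location)


# Made-up sentence for illustration only.
text = "Essential oil of Salvia officinalis was analysed for Carvone."
for a in annotate(text, ["salvia officinalis", "carvone"]):
    print(a)
```

Keeping the prefix and suffix alongside the offset lets an annotation be re-anchored even if the character position shifts between versions of the document.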

Page 29: High throughput mining of the plant-science literature

Annotation with Hypothes.is

Original publication “on publisher’s site”

Annotation “on Hypothes.is site”

Page 30: High throughput mining of the plant-science literature

Amanuens.is

Hypothes.is link

Hypothes.is markup of article

Page 31: High throughput mining of the plant-science literature

http://chemicaltagger.ch.cam.ac.uk/


Typical chemical synthesis

Page 32: High throughput mining of the plant-science literature

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Page 33: High throughput mining of the plant-science literature
Page 34: High throughput mining of the plant-science literature
Page 35: High throughput mining of the plant-science literature

Automatic extraction of plant species from the literature

Lars Willighagen, ContentMine Fellow 2016, NL

https://larsgw.github.io/contentmine-fellowship/html/card_c03-d.html

Page 36: High throughput mining of the plant-science literature

Mining diagrams

Page 37: High throughput mining of the plant-science literature

[Figure: plot with y-axis "Ln Bacterial load per fly" (ticks from 6.0 to 11.5) and x-axis "Days post-infection" (0 to 5)]

Bitmap Image and Tesseract OCR

Page 38: High throughput mining of the plant-science literature

“Root”

Page 39: High throughput mining of the plant-science literature

OCR (Tesseract)

Norma (image analysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
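The bracketed string above is in Newick format, which is what makes the output computable. As a minimal sketch, leaf names and branch lengths can be pulled out with a regular expression. The fragment used here is copied from the intact start of the tree (spellings as they appear in the output) and closed so that it is valid on its own; the `leaves` helper is illustrative rather than part of Norma or ami, and a full parser would also need to track nesting.

```python
import re

# Fragment copied from the start of the tree above, closed so it is valid Newick.
NEWICK = ("((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
          "Synergistes_jonesii:301);")

# A leaf is a name followed by ":" and a length, preceded by "(" or ",".
# Internal-node lengths follow ")" and are therefore skipped.
LEAF = re.compile(r"[(,]([A-Za-z_]+):(\d+(?:\.\d+)?)")


def leaves(newick):
    """Return {leaf name: branch length} for every named leaf."""
    return {name: float(length) for name, length in LEAF.findall(newick)}


for name, length in leaves(NEWICK).items():
    print(f"{name.replace('_', ' ')}: {length:g}")
```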

Page 40: High throughput mining of the plant-science literature

Supertree created from 4300 papers

Page 41: High throughput mining of the plant-science literature

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

Page 42: High throughput mining of the plant-science literature

After AMI2 processing…

… AMI2 has detected a square

Page 43: High throughput mining of the plant-science literature
Page 44: High throughput mining of the plant-science literature
Page 45: High throughput mining of the plant-science literature

https://contentmine-demo.herokuapp.com/

ContentMine data visualizations, Chris Kittel

Page 46: High throughput mining of the plant-science literature

https://contentmine-demo.herokuapp.com/trending

1 month, commonest disease terms

Page 47: High throughput mining of the plant-science literature

Terms from dictionaries

Page 48: High throughput mining of the plant-science literature

Co-occurrence of gene names in same sentence
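Sentence-level co-occurrence can be sketched as: split the text into sentences, find which dictionary names occur in each, and count every unordered pair. The gene names and text below are placeholders, the sentence splitter is deliberately naive, and the `cooccurrence` helper is an illustration rather than the code behind the visualization.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence(text, names):
    """Count unordered pairs of names that appear in the same sentence."""
    pairs = Counter()
    # Naive sentence split: ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = sorted(
            n for n in names
            if re.search(r"\b" + re.escape(n) + r"\b", sentence)
        )
        pairs.update(combinations(present, 2))
    return pairs


# Placeholder names and text, not real findings.
genes = ["GENE1", "GENE2", "GENE3"]
text = ("GENE1 and GENE2 were both expressed. GENE3 was not detected. "
        "Loss of GENE2 preceded changes in GENE1 and GENE3.")
for pair, n in cooccurrence(text, genes).most_common():
    print(pair, n)
```

The resulting pair counts are the edge weights of a co-occurrence network, with one node per gene name.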

Page 50: High throughput mining of the plant-science literature

Systematic Reviews

Can we:
• eliminate true negatives automatically?
• extract data from formulaic language?
• mine diagrams?
• annotate existing sources?
• forward-reference clinical trials?

Page 51: High throughput mining of the plant-science literature

Polly has 20 seconds to read this paper…

…and 10,000 more

Page 52: High throughput mining of the plant-science literature

ContentMine software can do this in a few minutes

Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”

Page 53: High throughput mining of the plant-science literature

400,000 Clinical Trials in 10 government registries

Mapping trials => papers

http://www.trialsjournal.com/content/16/1/80

2009 => 2015. What’s happened in last 6 years??

Search the whole scientific literature for “2009-0100068-41”
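Searching the literature for a registry identifier amounts to pattern matching over full text. The sketch below assumes two identifier shapes: ClinicalTrials.gov numbers ("NCT" followed by eight digits) and a year-number-check pattern loose enough to match the identifier quoted on the slide. Both patterns, the made-up NCT number and the sample text are illustrative assumptions, not ContentMine's Trials plugin.

```python
import re

# Assumed identifier shapes; real registries may use other formats.
PATTERNS = {
    "nct": re.compile(r"\bNCT\d{8}\b"),
    "year-number-check": re.compile(r"\b\d{4}-\d{6,7}-\d{2}\b"),
}


def find_trial_ids(text):
    """Return {pattern name: sorted unique identifiers found in text}."""
    return {name: sorted(set(p.findall(text))) for name, p in PATTERNS.items()}


# Made-up sentence; NCT01234567 is a placeholder, not a real registration.
sample = ("The trial (NCT01234567) was also registered as 2009-0100068-41; "
          "see NCT01234567 for the protocol.")
print(find_trial_ids(sample))
```

Run over every paper in a corpus, the identifiers found give a direct mapping from each trial to the papers that cite it.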

Page 54: High throughput mining of the plant-science literature

(2x digital music industry!)

Contentmine.org

Non-profit

Collaborations include:
• University of Cambridge Plant Sciences
• TGAC/Open Plant
• EuropePMC
• Wikimedia
• Some publishers