High-throughput mining of the plant-science literature

Mining science from the plant literature

ContentMine

Rothamsted Research, Harpenden, UK, 2016-09-12

Peter Murray-Rust, [1] University of Cambridge, [2] The ContentMine

5,000 scholarly publications every day. How many relate to plants?

Overview

• Scholarly literature
• Automation of downloading, normalization
• Discipline-dependent semantics/ontology
• Classification
• Extraction
• Annotation
• Mining diagrams
• Politics of mining

The Right to Read is the Right to Mine*
* Peter Murray-Rust, 2011

http://contentmine.org

(2x digital music industry!)

Output of scholarly publishing

• 586,364 Crossref DOIs per month (201507) [1]
• 8,000 papers/day
• 2.5–3 million (papers + supplemental data) per year
• Each 3 mm thick: 4,500 m high per year [2]
* Most is not publicly readable

[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

MozFest 2015

ContentMine + TGAC / hack

Terpinome Phytochemists!

Salvia officinalis

Salvia microphylla

Origanum vulgare

Ocimum basilicum

Laurus nobilis [1]

[1] Lauraceae

We can search for

• Plants
• Compounds
• Other species
• Diseases
• Frequent terms

• We’ll need: sources, dictionaries, software

Europe PubMed Central

Over 1 million biomedical papers

Dictionaries!

Diseases (WHO)

CONTENTMINE SOFTWARE (pipeline diagram, latest 20150908)

• Sources: Crossref; EuPMC, arXiv, CORE, HAL, (univ. repos); publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)
• Acquisition: catalogue, DailyCrawl, getpapers (queries), quickscrape (scrapers); URLs, DOIs
• Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
• norma: Normalizer, Structurer, Semantic Tagger (taggers); text, data, figures
• Semantic ScholarlyHTML (W3C community group): abstract, methods, references, captioned figures (Fig. 1), HTML tables
• ami: search, lookup; plugins: Chem, Phylo, Trials, Crystal, Plants; COMMUNITY
• CONTENT MINING output: facts; visualization and analysis
• Throughput: 100,000 pages/day

What plants produce Carvone?

https://en.wikipedia.org/wiki/Carvone



Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
  Search "carvone", output to carvone/, format XML, limit 100 hits

• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs
  Results in dataTables.html and results/…/results.xml (includes W3C annotation)

• python cmhypy.py carvone/ -u petermr <key>
  Send annotations -> hypothes.is
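The first step above can also be scripted. The following is a minimal sketch that only assembles the getpapers command line shown on the slide; the helper name `getpapers_command` is hypothetical, and actually running the command assumes getpapers is installed and on the PATH.

```python
import shlex

def getpapers_command(query, outdir, limit=100, xml=True):
    """Build the argument list for a getpapers search (flags as on the slide)."""
    argv = ["getpapers", "-q", query, "-o", outdir]
    if xml:
        argv.append("-x")        # request full-text XML
    argv += ["-k", str(limit)]   # cap the number of hits
    return argv

cmd = getpapers_command("carvone", "carvone", limit=100)
print(shlex.join(cmd))

# To run the search (assumes getpapers is installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the arguments as a list rather than a single string makes it easy to loop over many compound names without worrying about shell quoting.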

Search for carvone


WIKIDATA

Carvone in Wikidata
Also SPARQL endpoint
WP identifier

Chemical type

Chemical identifier
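As a sketch of how the Wikidata SPARQL endpoint might be asked "what plants produce carvone?", the code below builds a query and a request URL without sending it. It assumes that the compound is labelled "carvone" in English and that taxa are linked through the "found in taxon" property (P703); whether Wikidata holds such statements for any given compound is not guaranteed. The function names are hypothetical.

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def taxa_containing(compound_label, lang="en"):
    """Return a SPARQL query for taxa linked to a compound via P703 (found in taxon)."""
    return f'''
SELECT ?taxon ?taxonLabel WHERE {{
  ?compound rdfs:label "{compound_label}"@{lang} ;
            wdt:P703 ?taxon .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{lang}". }}
}}
LIMIT 100'''.strip()

def request_url(query):
    """Encode the query as a GET request URL asking for JSON results."""
    return WIKIDATA_SPARQL + "?" + urlencode({"query": query, "format": "json"})

query = taxa_containing("carvone")
url = request_url(query)
print(query)
```

Fetching `url` with any HTTP client would return the bindings as JSON; the request itself is left out here so the sketch runs offline.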

ARTICLES / FACETS

Facets: gene, disease, drug, phytochem, species, genus, words

Suggest the title of this article

[Figure: facet panels for individual articles (species, words, drug, phytochem, disease)]

Annotation (entity in context)

• prefix
• surface
• label
• location
• suffix
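The five fields above can be illustrated with a small dictionary lookup. This is a sketch under stated assumptions, not the ami implementation: "location" is taken to be a character offset pair, "label" the name of the dictionary that matched, and the sample sentence and dictionaries are invented for illustration.

```python
import re

def annotate(text, dictionary, context=20):
    """Return one record per dictionary hit: prefix, surface, label, location, suffix."""
    hits = []
    for label, terms in dictionary.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append({
                    "prefix": text[max(0, m.start() - context):m.start()],
                    "surface": m.group(0),             # the text as it appears
                    "label": label,                    # which dictionary matched
                    "location": (m.start(), m.end()),  # character offsets
                    "suffix": text[m.end():m.end() + context],
                })
    return sorted(hits, key=lambda h: h["location"])

# Invented example sentence and dictionaries
text = "Carvone was isolated from Salvia officinalis leaves."
dictionary = {"species": ["Salvia officinalis"], "phytochem": ["carvone"]}
hits = annotate(text, dictionary)
for hit in hits:
    print(hit)
```

Keeping the prefix and suffix with each hit lets a reader, or a later disambiguation step, judge the entity in its context without reopening the paper.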

Annotation with Hypothes.is

Original publication “on publisher’s site”
Annotation “on Hypothes.is site”

Amanuens.is
Hypothes.is link

Hypothes.is markup of article
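An annotation of this kind can be expressed as a W3C-style text quote selector. The sketch below builds such a payload as a plain dictionary; it assumes the field layout of the public Hypothes.is annotation API, is not taken from cmhypy.py, and uses a hypothetical article URL. Nothing is sent over the network.

```python
import json

def hypothesis_payload(uri, exact, prefix, suffix, text, tags=()):
    """Build an annotation body anchored to a quoted passage in the document at uri."""
    return {
        "uri": uri,
        "text": text,
        "tags": list(tags),
        "target": [{
            "source": uri,
            "selector": [{
                "type": "TextQuoteSelector",  # anchors by quote plus context
                "exact": exact,
                "prefix": prefix,
                "suffix": suffix,
            }],
        }],
    }

payload = hypothesis_payload(
    uri="https://example.org/article",  # hypothetical article URL
    exact="Salvia officinalis",
    prefix="e was isolated from ",
    suffix=" leaves.",
    text="species: Salvia officinalis",
    tags=["species"],
)
print(json.dumps(payload, indent=2))
```

Because the selector carries the quoted text and its surroundings rather than a page position, the annotation can stay on the annotation service while the article stays on the publisher's site.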

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Automatic extraction of plant species from the literature

Lars Willighagen, ContentMine Fellow 2016, NL
https://larsgw.github.io/contentmine-fellowship/html/card_c03-d.html



Mining diagrams

[Figure: plot of Ln bacterial load per fly (y-axis, 6.0 to 11.5) against days post-infection (x-axis, 0 to 5)]

Bitmap Image and Tesseract OCR

“Root”

OCR (Tesseract)

Norma (image analysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);

Semantic re-usable/computable output (ca 4 secs/image)
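To show why this output is computable, here is a minimal sketch that pulls taxon names and branch lengths out of a Newick string with a regular expression. The sample tree is a small subtree taken from the names on the slide; the function name is hypothetical, and a real pipeline would more likely use a dedicated phylogenetics library.

```python
import re

def newick_leaves(newick):
    """Return (taxon, branch_length) pairs for the leaves of a Newick string."""
    # A leaf is a name that directly follows '(' or ',' and precedes ':'.
    # Internal branch lengths follow ')' and are therefore skipped.
    pairs = re.findall(r"[(,]\s*([A-Za-z_][\w.]*):(\d+(?:\.\d+)?)", newick)
    return [(name.replace("_", " "), float(length)) for name, length in pairs]

# Small subtree using names from the slide
tree = "((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301);"
leaves = newick_leaves(tree)
for taxon, length in leaves:
    print(taxon, length)
```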

Supertree created from 4300 papers

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

After AMI2 processing…

… AMI2 has detected a square

https://contentmine-demo.herokuapp.com/

ContentMine data visualizations, Chris Kittel



https://contentmine-demo.herokuapp.com/trending

1 month, commonest disease terms



Terms from dictionaries

Co-occurrence of gene names in same sentence

https://zenodo.org/record/61334#.V9XKT4XerCk
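Sentence-level co-occurrence of dictionary terms can be counted in a few lines. The following is a minimal sketch, not the code behind the visualization: it assumes a naive sentence splitter and exact, case-sensitive matching, and the gene names and text are an invented example.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(text, terms):
    """Count how often each pair of terms appears in the same sentence."""
    counts = Counter()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = sorted(
            t for t in terms
            if re.search(r"\b" + re.escape(t) + r"\b", sentence)
        )
        counts.update(combinations(present, 2))
    return counts

# Invented example text and gene dictionary
text = ("BRCA1 interacts with TP53. TP53 and MDM2 form a loop. "
        "BRCA1, TP53 and MDM2 were measured.")
terms = {"BRCA1", "TP53", "MDM2"}
counts = cooccurrence(text, terms)
for pair, n in counts.most_common():
    print(pair, n)
```

The resulting pair counts are the edge weights of a co-occurrence network, which is one way to draw this kind of figure.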

Systematic Reviews

Can we:
• eliminate true negatives automatically?
• extract data from formulaic language?
• mine diagrams?
• annotate existing sources?
• forward-reference clinical trials?

Polly has 20 seconds to read this paper…

…and 10,000 more

ContentMine software can do this in a few minutes

Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”

400,000 clinical trials in 10 government registries

Mapping trials => papers

http://www.trialsjournal.com/content/16/1/80

2009 => 2015. What’s happened in last 6 years??

Search the whole scientific literature for “2009-0100068-41”
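Mapping a trial to the papers that cite it comes down to searching full texts for the registry identifier. Below is a minimal sketch that does this over an in-memory collection; the function name, paper IDs and paper texts are hypothetical, and only the identifier comes from the slide.

```python
import re

def papers_mentioning(trial_id, papers):
    """Return the IDs of papers whose text contains the exact trial identifier."""
    # Reject matches embedded in a longer run of letters, digits or hyphens.
    needle = re.compile(r"(?<![\w-])" + re.escape(trial_id) + r"(?![\w-])")
    return sorted(pid for pid, text in papers.items() if needle.search(text))

# Hypothetical paper texts keyed by hypothetical paper IDs
papers = {
    "paper-A": "The trial was registered as 2009-0100068-41.",
    "paper-B": "This study was registered under a different identifier.",
    "paper-C": "See record 2009-0100068-412 for details.",
}
matches = papers_mentioning("2009-0100068-41", papers)
print(matches)
```

The boundary checks matter because registry numbers are digit strings that can appear inside longer identifiers; a plain substring search would also report paper-C.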

Contentmine.org
Non-profit
Collaborations include:
• University of Cambridge Plant Sciences
• TGAC/Open Plant
• EuropePMC
• Wikimedia
• Some publishers