high throughput mining of the plant-science literature
Mining science from the plant literature
ContentMine
Rothamsted Research, Harpenden, UK, 2016-09-12
Peter Murray-Rust: [1] University of Cambridge, [2] The ContentMine
5,000 scholarly publications every day. How many relate to plants?
Overview
• Scholarly literature
• Automation of downloading, normalization
• Discipline-dependent semantics/ontology
• Classification
• Extraction
• Annotation
• Mining diagrams
• Politics of mining
The Right to Read is the Right to Mine* (*Peter Murray-Rust, 2011)
http://contentmine.org
(2x digital music industry!)
Output of scholarly publishing
586,364 Crossref DOIs per month (201507) [1]
8000 papers/day
2.5 to 3 million (papers + supplemental data) per year
Each 3 mm thick: 4500 m high per year [2]
* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
MozFest 2015
ContentMine + TGAC / hack
Terpinome Phytochemists!
Salvia officinalis
Salvia microphylla
Origanum vulgare
Ocimum basilicum
Laurus nobilis [1]
[1] Lauraceae
We can search for
• Plants
• Compounds
• Other species
• Diseases
• Frequent terms
• We’ll need: sources, dictionaries, software
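A minimal sketch of what dictionary-based search amounts to, assuming a dictionary is just a named list of terms and a paper is plain text. The dictionary contents, the sample sentence and the function name `search_with_dictionaries` are illustrative assumptions, not ContentMine's own code or data.

```python
import re
from collections import Counter

# Hypothetical mini-dictionaries; real ones would hold thousands of terms.
DICTIONARIES = {
    "plant": ["Salvia officinalis", "Ocimum basilicum", "Laurus nobilis"],
    "compound": ["carvone", "limonene"],
}

def search_with_dictionaries(text, dictionaries):
    """Count case-insensitive whole-term matches for every dictionary entry."""
    hits = {}
    for name, terms in dictionaries.items():
        counts = Counter()
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            n = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if n:
                counts[term] = n
        hits[name] = counts
    return hits

# Made-up sentence, for illustration only.
sample = ("Carvone was detected in Ocimum basilicum; "
          "Laurus nobilis contained limonene and traces of carvone.")
result = search_with_dictionaries(sample, DICTIONARIES)
for name, counts in result.items():
    print(name, dict(counts))
```

Keeping each dictionary separate means the same text can be faceted by plant, compound, disease and so on without re-reading it.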
Europe PubMedCentral
Over 1 million biomedical papers
Dictionaries!
Diseases (WHO)
CONTENTMINE SOFTWARE (pipeline diagram, latest 20150908)
• Sources: Crossref; EuPMC, arXiv, CORE, HAL, (univ repos); publisher sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
• getpapers: queries, catalogue, daily crawl
• quickscrape: scrapers for publisher sites, driven by URLs and DOIs
• Input formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
• norma: normalizer, structurer, semantic tagger for text, data and figures
• Taggers: abstract, methods, references, captioned figures (Fig. 1), HTML tables
• Output: Semantic ScholarlyHTML (W3C community group), 100,000 pages/day
• ami: search and lookup, with community plugins (chem, phylo, trials, crystal, plants)
• Results: facts, visualization and analysis
What plants produce Carvone?
https://en.wikipedia.org/wiki/Carvone
Mining for phytochemicals
• getpapers -q carvone -o carvone -x -k 100
  Search for “carvone”, output to carvone/, format XML, limit 100 hits
• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs
  Results in dataTables.html and results/…/results.xml (includes W3C annotation)
• python cmhypy.py carvone/ -u petermr <key>
  Send annotations to hypothes.is
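When the same search has to be repeated for many compounds, the getpapers call can be assembled from a script. The sketch below only builds the argument list, using the four flags shown on this slide; the helper name `getpapers_command` is hypothetical. Running it for real is left commented out, since it needs getpapers installed.

```python
import shlex

def getpapers_command(query, outdir, limit):
    """Build the getpapers call: -q query, -o output dir, -x XML, -k hit limit."""
    return ["getpapers", "-q", query, "-o", outdir, "-x", "-k", str(limit)]

cmd = getpapers_command("carvone", "carvone", 100)
print(" ".join(shlex.quote(part) for part in cmd))

# To execute (requires getpapers on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```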
Search for carvone
https://en.wikipedia.org/wiki/Carvone
WIKIDATA
Carvone in Wikidata
Also SPARQL endpoint
WP identifier
Chemical type
Chemical identifier
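As a sketch of how the SPARQL endpoint could address "What plants produce carvone?", the code below builds a query against the public Wikidata Query Service. It assumes the compound is found by an exact English label and that plant sources are recorded with the "found in taxon" property (P703). What comes back depends entirely on what Wikidata holds, and the function names are hypothetical. No network request is made here.

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def taxa_query(compound_label):
    """SPARQL asking which taxa a compound is recorded as 'found in' (P703)."""
    return f'''
SELECT ?compound ?taxon ?taxonLabel WHERE {{
  ?compound rdfs:label "{compound_label}"@en ;
            wdt:P703 ?taxon .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
'''.strip()

def query_url(compound_label):
    """URL that would return the query results as JSON."""
    params = {"query": taxa_query(compound_label), "format": "json"}
    return WIKIDATA_SPARQL + "?" + urlencode(params)

print(taxa_query("carvone"))
print(query_url("carvone"))
```

The label match is exact and case-sensitive, so stereoisomers held as separate items would need their own labels or identifiers.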
ARTICLES and FACETS: gene, disease, drug, phytochem, species, genus, words
Suggest the title of this article (facets shown: species, words, drug, phytochem, disease)
Annotation (entity in context)
prefix
surface
label
location
suffix
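The five fields above can be illustrated with a small sketch. It assumes "surface" is the matched string, "prefix" and "suffix" are the characters on either side, "location" is a pair of character offsets and "label" is the entity type. Those readings, the function name `annotate` and the sample sentence are assumptions made for illustration.

```python
def annotate(text, surface, label, context=20):
    """Return one record per occurrence of `surface` in `text`."""
    records = []
    start = text.find(surface)
    while start != -1:
        end = start + len(surface)
        records.append({
            "prefix": text[max(0, start - context):start],
            "surface": surface,
            "suffix": text[end:end + context],
            "location": (start, end),   # character offsets into text
            "label": label,
        })
        start = text.find(surface, end)
    return records

# Made-up sentence, for illustration only.
sentence = "Essential oil of Salvia officinalis was analysed."
records = annotate(sentence, "Salvia officinalis", "species")
for record in records:
    print(record)
```

Storing the surrounding context alongside the offsets lets an annotation be re-anchored if the page text shifts slightly.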
Annotation with Hypothes.is
Original publication “on publisher’s site”
Annotation “on Hypothes.is site”
Amanuens.is
Hypothes.is link
Hypothes.is markup of article
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
Automatic semantic markup of chemistry
Could be used for analytical, crystallization, etc.
Automatic extraction of plant species from the literature
Lars Willighagen, ContentMine Fellow 2016, NL
https://larsgw.github.io/contentmine-fellowship/html/card_c03-d.html
Mining diagrams
[Plot: Ln bacterial load per fly (y axis, 6.0 to 11.5) against days post-infection (x axis, 0 to 5)]
Bitmap Image and Tesseract OCR
“Root”
OCR (Tesseract)
Norma (image analysis)
(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
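Because the output is a Newick string, it can be processed further without the image. The sketch below pulls the named tips and their branch lengths out of a sub-tree copied from the start of the string above. It is a regular-expression shortcut that assumes integer branch lengths and unnamed internal nodes. It is not a full Newick parser, and the function name `leaves` is hypothetical.

```python
import re

# Sub-tree copied from the start of the Newick string (first three taxa).
subtree = ("((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
           "Synergistes_jonesii:301);")

def leaves(newick):
    """Return (taxon, branch_length) pairs for named tips only."""
    return [(name, int(length))
            for name, length in re.findall(r"([A-Za-z_]+):(\d+)", newick)]

tips = leaves(subtree)
for taxon, length in tips:
    print(taxon.replace("_", " "), length)
```

Internal branch lengths such as `:86` are skipped because no name precedes the colon.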
Semantic re-usable/computable output (ca 4 secs/image)
Supertree created from 4300 papers
C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
After AMI2 processing…
… AMI2 has detected a square
https://contentmine-demo.herokuapp.com/
ContentMine data visualizations, Chris Kittel
https://contentmine-demo.herokuapp.com/trending
1 month, commonest disease terms
Terms from dictionaries
Co-occurrence of gene names in same sentence
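A minimal sketch of sentence-level co-occurrence counting, assuming a naive sentence splitter and a small list of gene names. The gene list, the sample text and the function name `cooccurrence` are made up for illustration and do not reproduce how the visualization was built.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(text, terms):
    """Count how often each pair of terms appears in the same sentence."""
    pairs = Counter()
    # Naive split: sentence ends at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        present = sorted(t for t in terms
                         if re.search(r"\b" + re.escape(t) + r"\b", sentence))
        pairs.update(combinations(present, 2))
    return pairs

genes = ["BRCA1", "TP53", "EGFR"]  # hypothetical dictionary entries
text = ("BRCA1 and TP53 were both mutated. EGFR was unchanged. "
        "TP53 loss preceded BRCA1 silencing.")
counts = cooccurrence(text, genes)
print(counts)
```

Sorting the terms found in each sentence keeps every pair in a fixed order, so (A, B) and (B, A) are counted together.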
https://zenodo.org/record/61334#.V9XKT4XerCk
Systematic Reviews
Can we:
• eliminate true negatives automatically?
• extract data from formulaic language?
• mine diagrams?
• annotate existing sources?
• forward-reference clinical trials?
Polly has 20 seconds to read this paper…
…and 10,000 more
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
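The figures in the quote can be checked with a few lines of arithmetic. The sketch takes the lower bound of "2-3 days" and assumes an 8-hour working day, which the quote does not state.

```python
abstracts = 10_000
researchers = 6
days_each = 2        # lower bound of "2-3 days"
hours_per_day = 8    # assumption, not stated in the quote

per_researcher = abstracts / researchers
person_days = researchers * days_each
seconds_per_abstract = person_days * hours_per_day * 3600 / abstracts

print(round(per_researcher))           # abstracts per researcher
print(person_days)                     # person-days of full-time work
print(round(seconds_per_abstract, 1))  # seconds available per abstract
```

Under these assumptions each abstract gets roughly half a minute, the same order of magnitude as the 20 seconds quoted on the earlier slide.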
400,000 clinical trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in last 6 years??
Search the whole scientific literature for “2009-0100068-41”
Contentmine.org
Non-profit
Collaborations include:
• University of Cambridge Plant Sciences
• TGAC/Open Plant
• EuropePMC
• Wikimedia
• Some publishers