can machines understand the scientific literature

Can machines “understand” the scientific literature? Peter Murray-Rust, Reader Emeritus, Dept of Chemistry, Univ Cambridge and Founder TheContentMine Trinity College Science Society, Cambridge UK, 2017-02-21 contentmine.org is supported by a grant to PMR as a

Upload: petermurrayrust

Post on 10-Apr-2017

TRANSCRIPT

Page 1: Can machines understand the scientific literature

Can machines “understand” the scientific literature?

Peter Murray-Rust, Reader Emeritus, Dept of Chemistry, Univ Cambridge

and Founder TheContentMine

Trinity College Science Society, Cambridge UK, 2017-02-21

contentmine.org is supported by a grant to PMR as a

Page 2: Can machines understand the scientific literature

(2x digital music industry!)

ContentMine is an OpenLocked Non-Profit company

Page 3: Can machines understand the scientific literature

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 4: Can machines understand the scientific literature

AMI! Tell me what YOU know about moxonidine?

Page 5: Can machines understand the scientific literature

Wikipedia

Page 6: Can machines understand the scientific literature

Wikidata for moxonidine

Page 7: Can machines understand the scientific literature

Wikidata for moxonidine

Page 8: Can machines understand the scientific literature

Entity extraction

OPSIN says this name is wrong!

OSIRIS will interpret this structure

Including the annotation

Page 9: Can machines understand the scientific literature

Reaction Schemes

Page 10: Can machines understand the scientific literature

Tables

Page 11: Can machines understand the scientific literature

Tables

Page 12: Can machines understand the scientific literature

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 13: Can machines understand the scientific literature

Automatic semantic markup of chemistry

Could be used for analytical, crystallization, etc.

Page 14: Can machines understand the scientific literature

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.

CLICK HERE FOR ANIMATION

(may be browser dependent)

Page 15: Can machines understand the scientific literature

6 ContentMine Fellows for 6 months

Page 16: Can machines understand the scientific literature

Neo Christopher Chung

Warsaw, Computational Biology

Wants: find out geographic and temporal differences in the use of genomic software tools

Page 17: Can machines understand the scientific literature

Paola Masuzzo

Ghent, Computational Omics and Systems Biology

Wants: mine the literature on cell migration and invasion to 1) create a collection of minimum requirements, 2) check for nomenclature consistency and 3) construct a knowledge map

Page 18: Can machines understand the scientific literature

Alexandra Bannach-Brown

Edinburgh, Neuroscience

Problem: there is a huge body of work in animal studies of depression, and systematic review is the main approach for gaining insight.

Wants: identify papers for a systematic review of depressive behaviour in animals (which drugs, which methods, which outcomes and signs/phenotypes), then use the outcomes for document clustering.

and expedite scientific advances."

Corpus: 70,000 papers

Page 19: Can machines understand the scientific literature

Alexandre Hannud Abdo

“Our goal is to mine facts from global health research and provide automated referenced summaries to practitioners and agents who don’t have the means or the time to navigate the literature.”

From Brazil, Life Sciences; works on a project about the evolution of oncology

Wants: extract facts from cancer research conference papers and global health papers

OPEN NOTEBOOK RESEARCH

Page 20: Can machines understand the scientific literature

Alexandre Hannud Abdo

“Our goal is to mine facts from global health research and provide automated referenced summaries to practitioners and agents who don’t have the means or the time to navigate the literature.”

From Brazil, Life Sciences; works on a project about the evolution of oncology

“I am extremely happy to join this first cohort of ContentMine Fellows. I participated in a ContentMine workshop in 2014 and have been following the progress of the project ever since, looking for an opportunity to collaborate which now materializes.”

Problem: get text and metadata out of old conference proceedings and measure the evolution of ideas and practice using entity analysis, especially trends.

Wants: extract facts from cancer research conference papers and global health papers; extract topics (innovations, developments) and compare the two types of publication; find out which facts from conferences are later published in articles.

Has some issues with software

Page 21: Can machines understand the scientific literature

Guanyang Zhang

Biology, Arizona

“My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records.”

“Motivation. Comprising ~70,000 described and 220,000 estimated species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals.”

“Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered in 300 years of historical literature and difficult to access.”

Weevil-plant association network graph made with Google Fusion Table. Each blue circle is a weevil tribe and each yellow circle a plant genus. The size of a circle represents the number of associations.

Page 22: Can machines understand the scientific literature

Lars Willighagen

15 years old, NL

Wants: extract data about conifers (relations to chemicals, height etc.)

Outcome: database with webpage containing conifer properties

Table Facts Visualiser DEMO Card DEMO Word Cloud

“I applied to this fellowship to learn new things and combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and the ContentMine at large.”

Page 23: Can machines understand the scientific literature

Multisegment diagram

Page 24: Can machines understand the scientific literature

Multisegment diagram

Whitespace “corridors”

Superpixel

Bounding box

Semantic labels
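The whitespace “corridors” idea on this slide can be illustrated with a minimal sketch. This is not AMI's implementation: it assumes the diagram has already been reduced to a grid of 0 (background) and 1 (ink), and the function name `whitespace_corridors` is hypothetical. It simply reports runs of completely blank rows and columns, which are the gaps along which a multisegment diagram could be cut into panels.

```python
def whitespace_corridors(image):
    """Return (row_runs, col_runs): runs of fully blank rows and columns.

    `image` is a list of equal-length rows; 0 is background, 1 is ink.
    Each run is an inclusive (start, end) index pair.
    """
    def runs(blank_flags):
        found, start = [], None
        for i, blank in enumerate(blank_flags):
            if blank and start is None:
                start = i                      # a blank run begins here
            elif not blank and start is not None:
                found.append((start, i - 1))   # the run ended on the previous index
                start = None
        if start is not None:                  # run reaches the edge of the image
            found.append((start, len(blank_flags) - 1))
        return found

    blank_rows = [not any(row) for row in image]
    blank_cols = [not any(col) for col in zip(*image)]
    return runs(blank_rows), runs(blank_cols)


# Toy diagram (illustrative data only): two segments split by blank columns 2-3.
diagram = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
]
row_runs, col_runs = whitespace_corridors(diagram)
```

Each corridor gives a candidate cut line; the bounding boxes of the segments lie between consecutive corridors.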

Page 25: Can machines understand the scientific literature

Chemical Computer Vision

Raw mobile photo; problems: shadows, contrast, noise, skew, clipping

Page 26: Can machines understand the scientific literature

Binarization (pixels = 0,1)

Irregular edges
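A minimal sketch of binarization, assuming a grayscale image held as rows of 0-255 values and a single fixed threshold. Both are assumptions for illustration; the slide does not say how the threshold is chosen, and `binarize` is a hypothetical name.

```python
def binarize(gray, threshold=128):
    """Map each grayscale value to 1 (ink) if darker than `threshold`, else 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


# Toy 3x3 photo (illustrative values): a dark stroke on a light background.
photo = [
    [250, 240, 30],
    [245, 120, 20],
    [130, 10, 15],
]
binary = binarize(photo)
```

Pixels close to the threshold (120 and 130 here) fall on opposite sides of it, which is one plausible source of the irregular edges noted on the slide.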

Page 27: Can machines understand the scientific literature

Posterisation

Extracted because of its unique posterised colour
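Posterisation can be sketched as snapping every colour channel to a small number of bands, after which all pixels sharing one posterised colour can be pulled out as a mask. This is a sketch under assumptions: the band count of 4, the RGB tuples and the names `posterize` and `extract_colour` are all hypothetical, not taken from the slides.

```python
def posterize(pixel, levels=4):
    """Snap each 0-255 channel to the lower bound of one of `levels` equal bands."""
    step = 256 // levels
    return tuple((channel // step) * step for channel in pixel)


def extract_colour(image, target, levels=4):
    """Return a 0/1 mask of the pixels whose posterised colour equals `target`."""
    return [[1 if posterize(p, levels) == target else 0 for p in row] for row in image]


# Toy 2x2 image (illustrative values): two slightly different reds, white, blue.
image = [
    [(250, 10, 12), (255, 255, 255)],
    [(240, 30, 5), (20, 20, 200)],
]
red = posterize((250, 10, 12))
mask = extract_colour(image, red)
```

The two reds differ in their raw values but collapse to the same posterised colour, so a single equality test selects both.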

Page 28: Can machines understand the scientific literature

Note jagged and broken pixels

NEW Bacteria must have a phylogenetic tree

Length, Weight, Binomial Name, Culture/Strain, GENBANK ID, Evolution Rate

Page 29: Can machines understand the scientific literature

Supertree for 924 species

Tree

Page 30: Can machines understand the scientific literature

UNITS

TICKS

QUANTITY

SCALE

TITLES

DATA!! 2000+ points

VECTOR PDF
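Once ticks and scale have been read from a vector plot, recovering the data points reduces to a linear map from page coordinates to data values. The sketch below assumes linear axes and two labelled ticks per axis; the tick positions, values and the name `make_axis_map` are invented for illustration and do not come from the slides.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Build a linear page-coordinate -> data-value map from two labelled ticks."""
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical calibration: x ticks at 100 -> 0.0 and 500 -> 10.0;
# y ticks at 400 -> 0.0 and 80 -> 8.0 (page y grows downwards).
to_x = make_axis_map(100, 0.0, 500, 10.0)
to_y = make_axis_map(400, 0.0, 80, 8.0)

# A plotted point found at page position (300, 240).
point = (to_x(300), to_y(240))
```

Logarithmic axes would need the same map applied to the logarithm of the tick values instead.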

Page 31: Can machines understand the scientific literature

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter

Automatic extraction
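The two processing steps named on this slide, Gaussian smoothing and the second derivative, can be sketched in plain Python. This is an illustration under assumptions rather than AMI's code: the kernel width (`sigma`, `radius`), the edge clamping, the central finite difference and all function names are choices made here. A peak in the spectrum shows up as the most negative second derivative.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Convolve `values` with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp to a valid index
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values):
    """Central finite difference; the two end points are left at 0.0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2 * values[i] + values[i + 1]
    return d2


# Toy spectrum (illustrative values) with a single peak at index 4.
spectrum = [0, 0, 1, 4, 9, 4, 1, 0, 0]
smoothed = smooth(spectrum)
curvature = second_derivative(smoothed)
peak_index = min(range(len(curvature)), key=curvature.__getitem__)
```

Smoothing before differentiating matters because finite differences amplify pixel-level noise; the trade-off is that a wide kernel also flattens narrow peaks.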

Page 32: Can machines understand the scientific literature

C) What’s the problem with this spectrum?

Org. Lett., 2011, 13 (15), pp 4084–4087

Original thanks to ChemBark

Page 33: Can machines understand the scientific literature

After AMI2 processing…

… AMI2 has detected a square

Page 35: Can machines understand the scientific literature

AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home

Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:

AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.

CLICK HERE FOR ANIMATION

(may be browser dependent)

Page 36: Can machines understand the scientific literature

Search on publicly accessible papers on “Zika”

https://rawgit.com/ContentMine/amidemos/master/zika/full.dataTables.html

Page 39: Can machines understand the scientific literature

“… simulated by 21cmFAST is in principle independent”

“it is a feature of the 21cmFAST code, and is explained in §3.1.”

SciCodes[1]: Searching for software in arXiv[2]

[1] Proposal to LJ Arnold Foundation (Alice Allen ASCL and PMR)

Using the semi-numerical simulation, 21cmFAST,

[2] arxiv.org: the physics/maths/astronomy… preprint server

The language identifies the software!

arXiv has >500 mentions of “21cmFAST”
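The point that the surrounding language identifies the software can be sketched with a few regular expressions. The three cue patterns below are hypothetical, modelled only on the three snippets quoted on this slide; they are not the SciCodes method, and a real search would need many more cues and a way to filter false positives.

```python
import re

# Hypothetical context cues, one per snippet quoted on the slide.
NAME = r"(?P<name>[A-Za-z0-9][\w.+-]*)"
CUES = [
    re.compile(r"simulated by " + NAME),
    re.compile(r"feature of the " + NAME + r" code"),
    re.compile(r"simulation, " + NAME + r","),
]


def find_software_mentions(text):
    """Return candidate software names found next to any of the cue phrases."""
    names = []
    for cue in CUES:
        for match in cue.finditer(text):
            names.append(match.group("name"))
    return names


snippets = [
    "… simulated by 21cmFAST is in principle independent",
    "it is a feature of the 21cmFAST code, and is explained in §3.1.",
    "Using the semi-numerical simulation, 21cmFAST,",
]
mentions = [find_software_mentions(s) for s in snippets]
```

None of the patterns contains the name 21cmFAST itself; the name is recovered purely from its context, which is what would let the same cues surface software that is not yet in any registry.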

Page 40: Can machines understand the scientific literature

Questions and comments

Thanks:

• Andy Howlett, Dept Chemistry, Cambridge
• Mark Williamson, Dept Chemistry, Cambridge
• Ross Mounce, Biology, University of Bath
• Shuttleworth Foundation

PM-R has offered to mentor an MSc project this summer for anyone interested.

contentmine.org
