Content Mining of Science in Europe

Peter Murray-Rust, ContentMine.org, University of Cambridge & Open Forum Europe

OFA, Brussels, BE 2015-10-22

What is mining?

Why is it useful?

How YOU can do it without using publishers’ APIs

Copyright and restrictive practices are still a major problem

The Right to Read is the Right to Mine*

*Peter Murray-Rust, 2011

http://contentmine.org

My European Heroes

Young People (ContentMine)

NEELIE KROES

Use Cases of ContentMining

• Epidemiology of obesity (Cambridge U)
• (OKF, OpenTrials) Mapping clinical trials repositories to reports in scientific literature
• Mining chemical reactions from patents
• Creating a bacterial supertree-of-life from 4500 papers

Polly has 20 seconds to read this paper…

…and 10,000 more

ContentMine software can do this in a few minutes

Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
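As a rough check on Polly's figures, the arithmetic can be sketched as below. The numbers are taken from the quote; the variable names are illustrative only.

```python
# Figures quoted by Polly; names are illustrative, not from any ContentMine tool.
abstracts = 10_000
researchers = 6
days_per_researcher = (2, 3)  # "about 2-3 days of work" each

# Share of abstracts per researcher (the quote rounds this to ~1,600).
abstracts_each = abstracts / researchers

# Total effort in person-days: the lower bound is the "12 days" in the quote.
person_days = tuple(researchers * d for d in days_per_researcher)

print(round(abstracts_each), person_days)
```

The lower bound reproduces the 12 days of full-time work that Polly mentions.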

400,000 Clinical Trials in 10 government registries

Mapping trials => papers

http://www.trialsjournal.com/content/16/1/80

2009 => 2015. What has happened in the last 6 years?

Search the whole scientific literature for “2009-0100068-41”
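A minimal sketch of such a search, assuming the papers are already available as plain text. The `documents` mapping and the function name `find_identifier` are hypothetical; only the identifier comes from the slide.

```python
import re

def find_identifier(documents, identifier):
    """Return the names of documents whose text contains the exact identifier.

    The lookarounds reject matches embedded in a longer run of digits,
    letters or hyphens, so a longer number that merely contains the
    identifier is not reported.
    """
    pattern = re.compile(r"(?<![\w-])" + re.escape(identifier) + r"(?![\w-])")
    return sorted(name for name, text in documents.items() if pattern.search(text))

# Hypothetical mini-corpus standing in for "the whole scientific literature".
documents = {
    "report-a": "The trial was registered as 2009-0100068-41 in the registry.",
    "report-b": "No trial identifier is given in this paper.",
    "report-c": "See 12009-0100068-412 for details.",
}

hits = find_identifier(documents, "2009-0100068-41")
print(hits)
```

Matching on the whole identifier rather than a substring keeps near-miss numbers out of the trial-to-paper mapping.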

ContentMine-ing strategy

• Discover. Crawl the COMPLETE relevant literature. => bibliography
• Scrape (download). ALL papers
• Index papers => Facts
• Search/analyze papers => complex science
• Extract, Annotate, Aggregate (“Transformative”)
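The strategy above can be sketched as a toy pipeline. This is not ContentMine's implementation: the in-memory corpus, the term list and every function name are hypothetical, and no network access is involved.

```python
from collections import Counter

# Hypothetical in-memory "literature": paper id -> full text.
LITERATURE = {
    "paper-1": "Mycobacterium tuberculosis was cultured with Borrelia burgdorferi.",
    "paper-2": "Obesity prevalence was measured in a cohort study.",
    "paper-3": "Borrelia burgdorferi infection was studied in mice.",
}

# Hypothetical dictionary of terms to index as facts.
SPECIES = ["Mycobacterium tuberculosis", "Borrelia burgdorferi"]

def discover(query):
    """Discover: build a bibliography of every paper mentioning the query."""
    return sorted(pid for pid, text in LITERATURE.items()
                  if query.lower() in text.lower())

def scrape(paper_ids):
    """Scrape: fetch the text of ALL papers in the bibliography."""
    return {pid: LITERATURE[pid] for pid in paper_ids}

def index(papers, terms):
    """Index: record which dictionary terms (facts) occur in each paper."""
    return {pid: [t for t in terms if t in text] for pid, text in papers.items()}

def aggregate(facts):
    """Aggregate: count how many papers mention each term."""
    return Counter(term for terms in facts.values() for term in terms)

bibliography = discover("borrelia")
papers = scrape(bibliography)
facts = index(papers, SPECIES)
counts = aggregate(facts)
print(bibliography, dict(counts))
```

Each stage consumes only the previous stage's output, so any one of them could be swapped for a real crawler, scraper or tagger without touching the rest.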

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

TABLES

CHEMISTRY

TEXT

contentmine.org tackles these

[Diagram: ContentMine architecture]

Sources: EuPMC, arXiv, CORE, HAL, university repositories, ToC services, publisher sites (PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …)

Tools: catalogue, getpapers, quickscrape, DailyCrawl (scrapers, queries), norma (Normalizer, Structurer, Semantic Tagger), taggers, search, Lookup, community plugins, Visualization and Analysis

Input formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs

Extracted content: abstract, methods, references, captioned figures (Fig. 1), HTML tables, data, figures

Content mining domains: Trials, Crystal, Plants

Output: 30,000 pages/day of semantic ScholarlyHTML

CONTENTMINE Complete OPEN Platform for Mining Scientific Literature

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have processed 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B EUR.

Facts in context

daily IUCN endangered species news

en.wikipedia.org CC BY-SA

ContentMine Fact of The Day

• Fact of the day
• Endangered species in recent science
• Facts
• Bubbles

https://en.wikipedia.org/wiki/Tree_of_life CC BY-SA

“Root” 4500 papers each with 1 tree

OCR (Tesseract)

Norma (image analysis)

(((((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,Synergistes_jonesii:301):131,Thermotoga_maritime:357):12,(Mycobacterium_tuberculosis:223,Bifidobacterium_longum:333):158):10,((Optiutus_terrae:441,(((Borrelia_burgdorferi:…202):91):22):32,(Proprinogenum_modestus:124,Fusobacterium_nucleatum:167):217):11):9);
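The parenthesised string above is in Newick tree notation, which is why the output is computable. A minimal sketch of reading leaf names and branch lengths from one complete sub-tree of that string follows. The function name `leaf_lengths` is hypothetical, and the regular expression is an illustration only, not a full Newick parser and not the Norma code.

```python
import re

# One complete sub-tree copied from the string above.
NEWICK = ("((Pyramidobacter_piscolens:195,Jonquetella_anthropi:135):86,"
          "Synergistes_jonesii:301);")

def leaf_lengths(newick):
    """Map each leaf (species) name to its integer branch length.

    A leaf is a name that directly follows "(" or "," and is followed by
    ":<length>". Internal nodes follow ")" and are therefore skipped.
    """
    pairs = re.findall(r"[(,]([A-Za-z_]+):(\d+)", newick)
    return {name: int(length) for name, length in pairs}

leaves = leaf_lengths(NEWICK)
for name, length in leaves.items():
    print(name.replace("_", " "), length)
```

Real trees with quoted labels, comments or decimal branch lengths would need a proper parser.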

Semantic re-usable/computable output (ca 4 secs/image)

Supertree for 924 species

Supertree created from 4300 papers

Copyright and Mining

• PMR-premise: You cannot do reproducible scientific mining and avoid violating copyright.

• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data analytics”
– legitimizes copying (?to disk), but not publishing

*teaching, textbooks, etc. may be “commercial”

Publishing and ICT

Trust these as much as you trust these

Elsevier Microsoft

Mendeley (Elsevier) Facebook

Digital Science/Macmillan Apple

Wiley etc.

STM Publishers prevent Mining

• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC Rightfind)
• Technical obstruction (Wiley Captcha, Macmillan Readcube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)

[1] [You may not] utilize the TDM Output to enhance … subject repositories in a way that would [… ] have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.

WILEY … “new security feature… to prevent systematic download of content”

“[limit of] 100 papers per day”

“essential security feature … to protect both parties (sic)”
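To see what a limit of 100 papers per day means in practice, here is a small worked calculation using corpus sizes mentioned earlier in this talk (10,000 abstracts, 4500 tree papers). The function name is illustrative, and the calculation assumes a single user downloading from a single publisher site.

```python
import math

DAILY_LIMIT = 100  # "[limit of] 100 papers per day"

def days_needed(papers, daily_limit=DAILY_LIMIT):
    """Whole days required to download `papers` under a per-day cap."""
    return math.ceil(papers / daily_limit)

for corpus in (10_000, 4_500):
    print(corpus, "papers ->", days_needed(corpus), "days")
```

Under that assumption, a corpus the software can process in minutes would take weeks or months just to obtain.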

CAPTCHA

User has to type words

ContentMine working with Libraries

• Cambridge: Library, Plant Sciences, Epidemiology, Chemistry

• Cochrane Collaboration on Systematic Reviews of Clinical Trials

• FutureTDM (H2020, LIBER)

• Running workshops and training
