ContentMine: Open Data and Social Machines
DESCRIPTION
Scientific information is often hidden or not published properly. The ContentMine is a Social Machine consisting of semantic software and communities of domain expertise; it aims to liberate all scientific facts from the published literature on a daily basis. The talk, delivered to the Computational Institute, was followed by a hands-on workshop on how to use the technology and work as a community.
TRANSCRIPT
![Page 1: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/1.jpg)
ContentMine: Open Data and Social Machines
Peter Murray-Rust,
Computation Lab, Univ of Chicago, 2014-11-12
![Page 2: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/2.jpg)
ContentMine: We use machines to liberate[1] 100 million facts/yr from the scientific scholarly literature and make them free for everyone (WikiData)
WikiData and ContentMines are social machines
There are no longer any technical obstacles, only people.
[1] Friday workshop: build your own social machine: scraping XML, PDFs, web pages, FORTRAN output; chemistry, evolutionary biology, computational materials science, social science…
![Page 3: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/3.jpg)
Liberation Software
![Page 4: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/4.jpg)
http://en.wikipedia.org/wiki/Tim_Berners-Lee
Everything in this presentation is ODOSOS (Open Data, Open Standards, Open Source)
CC0, CC-BY, W3C etc., Apache2, etc.
Open = “Free to use, re-use and redistribute”
http://contentmine.org
http://bitbucket.org/petermr
http://wwmm.ch.cam.ac.uk
A promise: I (Petermr) will never sell out to non-transparent organizations.
![Page 5: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/5.jpg)
http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. …
…Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2002)
![Page 6: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/6.jpg)
Scientific and Medical publication (STM)[+]
• World Citizens pay $400,000,000,000 …
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*] …
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of citizens of the world …
[+] Figures probably ± 50%
[*] arXiv preprint server costs $7 USD per paper
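The per-article figures on this slide can be checked with simple division. The sketch below assumes, which the slide does not state, that the rounded $300,000 and $7000 values are just the totals divided by the article count; the variable names are illustrative only.

```python
# Totals as quoted on the slide (the author notes they are probably +/- 50 %).
total_research_spend_usd = 400_000_000_000
articles_per_year = 1_500_000
library_spend_usd = 10_000_000_000
arxiv_cost_per_paper_usd = 7

# Assumption: per-article costs are plain quotients of the totals.
cost_to_create = total_research_spend_usd / articles_per_year
cost_to_publish = library_spend_usd / articles_per_year
publish_vs_arxiv = cost_to_publish / arxiv_cost_per_paper_usd

print(f"research spend per article: ${cost_to_create:,.0f}")
print(f"library spend per article:  ${cost_to_publish:,.0f}")
print(f"ratio to arXiv cost per paper: {publish_vs_arxiv:,.0f}x")
```

Under that assumption the quotients come out a little below the rounded slide values, well inside the stated ± 50% margin.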
![Page 7: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/7.jpg)
petermr: I believe in Wikipedia
• 2006 http://en.wikipedia.org/wiki/User:Petermr
• 2006 started Open Data (term unknown then!)
• 2009: “the bit of Wikipedia that I wrote is correct” [challenging the idea of “WP is junk”]
• 2009: “Wikipedia is the digital library of this century”
• 2012: I alert WP that Springer has copyrighted > 1000 of our images [Springergate]
• 2014: “For facts in maths, physical and biological sciences I trust Wikipedia.” (Wikimania2014)
![Page 8: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/8.jpg)
A meritocratic critical volunteer community
![Page 9: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/9.jpg)
Volunteer community in chemistry: Open Data/Source/Standards
![Page 10: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/10.jpg)
4 Billion USD on human genome yielded 800 Billion USD and 4 M job-years
![Page 11: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/11.jpg)
Gloom Warning
![Page 12: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/12.jpg)
…three problems—flawed design, non-publication, and poor reporting—together meant >85% of research funds were wasted, a global total loss >100 billion USD per year. [Lancet 2009]
[Even more] waste clearly occurs after publication: from poor access, poor dissemination, and poor uptake of the findings of research. [PLOS Medicine 2014-05-27]
Bad publication wastes science
![Page 13: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/13.jpg)
Publishers’ PDFs destroy science
PDFs do not contain words or subscripts!
PDFs do not contain tables and do not have columns
SVG is turned into JPEG because it’s easier to process
![Page 14: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/14.jpg)
Elsevier wants to control Open Data
[asked by Michelle Brook]
![Page 15: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/15.jpg)
STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
Licences destroy Content Mining
![Page 16: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/16.jpg)
CLOSED ACCESS MEANS PEOPLE DIE
CLOSED DATA MEANS PEOPLE DIE
![Page 17: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/17.jpg)
Happiness Restored
![Page 18: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/18.jpg)
The scientist’s amanuensis
• "The bane of my life is doing things I know computers could do for me" (Dan Connolly, W3C)
Example: A semantic amanuensis could
• Give me a daily digest of mineralogy papers
• Extract all the crystal structures from them
• Compute physical properties with GULP and NWChem
• Compare the results statistically
• Preserve and distribute the complete operation
• Prepare the results for publication
The semantic web is having a personal amanuensis
![Page 19: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/19.jpg)
Artificial Intelligence in science
In 1970 chess and chemistry were the sandboxes for AI. Some approaches:
• Lookup (Knowledge)
• Natural Language Processing (NLP)
• Brute force calculation (inc. physical methods)
• Tree-pruning and heuristics
• Logic (cf. OWL-DL)
• Human-machine integration (crowdsourcing)
• Computer Vision
Domain-specific Turing test: Can a machine pass a first-year chemistry exam?
![Page 20: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/20.jpg)
The Semantic Web
"The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001
CC-BY-SA Images from Wikipedia
![Page 21: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/21.jpg)
“Which Rivers flow into the Rhine and are longer than 50 kilometers?” or “Which Skyscrapers in China have more than 50 floors and have been constructed before the year 2000?”
Open Crystallography?
“Which countries where tropical diseases are endemic have published structures of chiral natural products?”
Linked Open data from Wikipedia
CC-BY-SA from Wikipedia
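Questions like the Rhine example become answerable once facts are stored as (subject, predicate, object) triples. The sketch below is a toy illustration in plain Python, not a real linked-data query: the river names, lengths, predicate names and helper functions are all invented for the example.

```python
# Toy triples (subject, predicate, object). All names and numbers are invented.
triples = [
    ("RiverA", "flowsInto", "Rhine"),
    ("RiverA", "lengthKm", 120),
    ("RiverB", "flowsInto", "Rhine"),
    ("RiverB", "lengthKm", 35),
    ("RiverC", "flowsInto", "Danube"),
    ("RiverC", "lengthKm", 400),
]

def objects(subject, predicate):
    """All objects recorded for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def rivers_into(target, min_length_km):
    """Rivers that flow into `target` and are longer than `min_length_km`."""
    tributaries = [s for s, p, o in triples if p == "flowsInto" and o == target]
    return sorted(
        river for river in tributaries
        if any(length > min_length_km for length in objects(river, "lengthKm"))
    )

result = rivers_into("Rhine", 50)
print(result)
```

A production system would express the same join and filter in a query language over a triple store; the point here is only that the question reduces to matching two predicates and comparing a number.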
![Page 23: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/23.jpg)
• Science can be read and understood by human-machine Amanuensis-symbionts.
• Amanuenses are based on Wikipedia, databases and software (e.g. ContentMine’s AMI)
• The results are fed back into WP and WikiData
http://en.wikipedia.org/wiki/Symbiosis http://en.wikipedia.org/wiki/Eric_Fenby
![Page 24: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/24.jpg)
• Crawl scientific literature (Open Bibliography)
• Scrape each scientific article (ContentMine-quickscrape)
• Extract the facts (ContentMine-AMI)
• Index (Wikipedia)
• Republish (WikiData)
Machine Extraction of scientific facts
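The five stages listed on this slide can be pictured as a chain of small functions. This is a hypothetical sketch only: the function names, the in-memory "literature", the toy sentences and the vocabulary lookup are assumptions made for illustration and are not the quickscrape or AMI interfaces.

```python
from collections import defaultdict

# Hypothetical stand-in for the open literature: identifier -> article text.
LITERATURE = {
    "article-1": "Aspergillus oryzae was grown on soybean.",
    "article-2": "No organisms are mentioned here.",
}
# Hypothetical vocabulary of terms treated as facts worth indexing.
VOCABULARY = {"Aspergillus oryzae", "soybean"}

def crawl():
    """Crawl: list the identifiers of the articles to process."""
    return sorted(LITERATURE)

def scrape(article_id):
    """Scrape: fetch the content of one article."""
    return LITERATURE[article_id]

def extract(article_id, text):
    """Extract: emit (article, term) pairs for vocabulary terms found in the text."""
    return [(article_id, term) for term in sorted(VOCABULARY) if term in text]

def index(facts):
    """Index: group article identifiers by the term they mention."""
    by_term = defaultdict(list)
    for article_id, term in facts:
        by_term[term].append(article_id)
    return dict(by_term)

def republish(indexed):
    """Republish: serialise the index as simple statement lines."""
    return [
        f"{term} mentionedIn {article_id}"
        for term, ids in sorted(indexed.items())
        for article_id in ids
    ]

facts = [f for aid in crawl() for f in extract(aid, scrape(aid))]
statements = republish(index(facts))
print(statements)
```

Keeping each stage a separate function mirrors the slide's point that crawling, scraping, extraction, indexing and republishing can be done by different tools and communities.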
![Page 25: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/25.jpg)
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific literature
Science Plugins
Science Volunteers
![Page 26: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/26.jpg)
Linked Open Data – the world’s knowledge
very little physical science
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
DBPedia
BIO
Comp
Lib
PDB
Ontologies
GOV
GOV.uk
Music, Art, Literature
Social
Knowledgebases
RDF triples
![Page 27: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/27.jpg)
Part of a COD RDF entry
The Semantic Web understands this
![Page 28: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/28.jpg)
Mathematics Markup Language
Energy of c.c.p. lattice of argon
4 pages clipped
Human-friendly
Machine-friendly
Many editors and tools exist
We used MathWeaver
Automatic!
MathML
![Page 29: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/29.jpg)
CML (Chemical Markup Language)
Human-friendly Machine-friendly
Automatic!
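To give a feel for the "machine-friendly" side, here is a minimal sketch that builds a small CML-style molecule with Python's standard XML library. The element and attribute names (`molecule`, `atomArray`, `atom`, `elementType`, `bondArray`, `bond`, `atomRefs2`, `order`) and the namespace follow CML as commonly written, but the output is not validated against the schema, and water is just an illustrative choice that does not come from the slide.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
ET.register_namespace("cml", CML_NS)

def q(tag):
    """Qualify a tag name with the CML namespace."""
    return f"{{{CML_NS}}}{tag}"

# Build a water molecule: one oxygen, two hydrogens, two single bonds.
molecule = ET.Element(q("molecule"), {"id": "water"})
atoms = ET.SubElement(molecule, q("atomArray"))
for atom_id, element in [("a1", "O"), ("a2", "H"), ("a3", "H")]:
    ET.SubElement(atoms, q("atom"), {"id": atom_id, "elementType": element})
bonds = ET.SubElement(molecule, q("bondArray"))
for first, second in [("a1", "a2"), ("a1", "a3")]:
    ET.SubElement(bonds, q("bond"), {"atomRefs2": f"{first} {second}", "order": "1"})

cml_text = ET.tostring(molecule, encoding="unicode")
print(cml_text)
```

Because atoms and bonds are explicit elements, a program can count, compare or recompute them without any guessing, which is what a drawing or a PDF cannot offer.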
![Page 30: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/30.jpg)
Innovation with Componentisation
Individual, manual, unreusable, flaky
Commodity, standard, reliable, re-usable
![Page 31: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/31.jpg)
Non-semantic data
Data extraction difficult and incomplete
Human readers
Current scientific information flow … is broken for data-rich science
PDF
Lineprinter output
Text files
Human input
![Page 32: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/32.jpg)
Semantic network closes the loop
Data mined from document
Data available for e-science and re-use
Computation
Measurement
Semantic Authoring
Community
Analysis
![Page 33: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/33.jpg)
The network grows autonomously
Machine-machine
Machine-human
Human-machine
Human-human
![Page 34: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/34.jpg)
Humans and machines use different languages
![Page 35: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/35.jpg)
How a machine reads a chemical thesis
nodes are compounds; arrows are reactions
![Page 36: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/36.jpg)
Human-machine symbionts can read science!
WP_Lion
WP_Aspergillus_oryzae
WP_Soybean
![Page 37: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/37.jpg)
Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
![Page 38: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/38.jpg)
“nuggets” in a scientific paper
quantity
units
Value ranges
Humans aren’t designed to mine this … chemical
project places
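Nuggets such as quantities, units and value ranges are regular enough for pattern matching. The sketch below uses a regular expression as a stand-in; it is not the ContentMine extractor, and the unit list and the sample sentence are assumptions made up for the example.

```python
import re

# A number, an optional range end, then a unit from a small illustrative list.
QUANTITY = re.compile(
    r"(?P<low>\d+(?:\.\d+)?)"                         # first number
    r"(?:\s*(?:-|–|to)\s*(?P<high>\d+(?:\.\d+)?))?"   # optional end of a range
    r"\s*(?P<unit>°C|mL|mg|g|h|min)\b"                # unit (hypothetical list)
)

# Invented sentence in the style of a synthesis procedure.
text = "The mixture was stirred at 60-65 °C for 2 h, then 250 mL of water was added."

nuggets = [(m["low"], m["high"], m["unit"]) for m in QUANTITY.finditer(text)]
for low, high, unit in nuggets:
    value = f"{low} to {high}" if high else low
    print(f"value: {value}, unit: {unit}")
```

A fixed unit list is the main limitation of this approach; real papers need a much larger dictionary and handling for superscripts, Greek letters and compound units.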
![Page 39: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/39.jpg)
Parsing chemical sentences
A FACT, uncopyrightable, and representable by triples
![Page 40: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/40.jpg)
http://wwmm.ch.cam.ac.uk/chemicaltagger
Typical chemical synthesis
![Page 41: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/41.jpg)
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
![Page 42: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/42.jpg)
We can’t turn a hamburger into a cow
But we can now turn PDFs into Science
![Page 43: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/43.jpg)
UNITS
TICKS
QUANTITY
SCALE
TITLES
DATA!! 2000+ points
![Page 44: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/44.jpg)
Dumb PDF
CSV
Semantic Spectrum
2nd Derivative
Gaussian Filter
Automatic extraction
Takes < 1 second
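The slide names a Gaussian filter and a second derivative without saying how they are combined. One common use, sketched here as an assumption, is peak picking: smooth the extracted curve, then look for strong local minima of the second derivative. The spectrum is synthetic, and the function names, `sigma` and `threshold` values are illustrative.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian kernel truncated at about 4 sigma."""
    radius = int(4 * sigma)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def peak_positions(intensity, sigma=3.0, threshold=-1e-3):
    """Indices where the smoothed second derivative has a strong local minimum."""
    smooth = np.convolve(intensity, gaussian_kernel(sigma), mode="same")
    d2 = np.gradient(np.gradient(smooth))  # finite-difference second derivative
    i = np.arange(1, len(d2) - 1)
    is_local_min = (d2[i] < d2[i - 1]) & (d2[i] <= d2[i + 1])
    return i[is_local_min & (d2[i] < threshold)]

# Synthetic spectrum: two Gaussian bands centred at x = 60 and x = 140.
x = np.arange(200)
spectrum = (np.exp(-0.5 * ((x - 60) / 5.0) ** 2)
            + 0.6 * np.exp(-0.5 * ((x - 140) / 8.0) ** 2))

peaks = peak_positions(spectrum)
print(peaks.tolist())
```

Smoothing first matters because differentiation amplifies noise; on real extracted data the `sigma` and `threshold` values would need tuning.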
![Page 45: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/45.jpg)
Chemical Computer Vision
Raw mobile photo; problems: shadows, contrast, noise, skew, clipping
![Page 46: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/46.jpg)
Binarization (pixels = 0,1)
Irregular edges
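Binarization reduces a greyscale image to ink (1) and background (0). The slide does not say which thresholding method is used; the sketch below swaps in Otsu's method as a well-known choice, with an invented 5×5 test image of a dark stroke on light paper.

```python
import numpy as np

def otsu_threshold(gray):
    """Grey level (0-255) that maximises the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)               # weight of the class "<= t"
    w1 = 1.0 - w0                      # weight of the class "> t"
    mu_cum = np.cumsum(prob * levels)  # cumulative mean up to t
    mu_total = mu_cum[-1]
    denom = w0 * w1
    between = np.zeros(256)
    valid = denom > 0                  # skip thresholds that leave a class empty
    between[valid] = (mu_total * w0[valid] - mu_cum[valid]) ** 2 / denom[valid]
    return int(np.argmax(between))

def binarize(gray):
    """Return 1 for dark (ink) pixels and 0 for light (background) pixels."""
    return (gray <= otsu_threshold(gray)).astype(np.uint8)

# Invented test image: light background (200) with a dark vertical stroke (40).
image = np.full((5, 5), 200, dtype=np.uint8)
image[:, 2] = 40
binary = binarize(image)
print(binary)
```

A single global threshold struggles with the shadows and uneven contrast of phone photos; local or adaptive thresholding is the usual remedy and is not shown here.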
![Page 47: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/47.jpg)
Thinning: thick lines to 1-pixel
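Thinning can be sketched with the classic Zhang-Suen algorithm. The slide does not name the algorithm actually used, so this is a stand-in chosen because it is short and widely known; the test image is an invented 3-pixel-thick bar.

```python
import numpy as np

def zhang_suen_thin(binary):
    """Thin a 0/1 image to roughly 1-pixel-wide lines (Zhang-Suen)."""
    img = np.pad(binary.astype(np.uint8), 1)  # zero border simplifies neighbour access
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r, c in zip(*np.nonzero(img)):
                # Neighbours P2..P9, clockwise starting from the pixel above.
                p = [int(v) for v in (
                    img[r - 1, c], img[r - 1, c + 1], img[r, c + 1], img[r + 1, c + 1],
                    img[r + 1, c], img[r + 1, c - 1], img[r, c - 1], img[r - 1, c - 1])]
                ink_neighbours = sum(p)
                # Number of 0 -> 1 transitions walking once around the ring.
                transitions = sum(p[i] == 0 and p[(i + 1) % 8] == 1 for i in range(8))
                if not (2 <= ink_neighbours <= 6 and transitions == 1):
                    continue
                p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                if step == 0:
                    removable = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                else:
                    removable = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                if removable:
                    to_clear.append((r, c))
            # Pixels are cleared only after the whole sub-step has been evaluated.
            for r, c in to_clear:
                img[r, c] = 0
            changed = changed or bool(to_clear)
    return img[1:-1, 1:-1]

# Invented test image: a horizontal bar three pixels thick.
bar = np.zeros((7, 12), dtype=np.uint8)
bar[2:5, 1:11] = 1
skeleton = zhang_suen_thin(bar)
print(skeleton)
```

The pure-Python loop is slow on large images; it is written for readability rather than speed.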
![Page 48: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/48.jpg)
Chemical Optical Character Recognition
Small alphabet, clean typefaces, clear boundaries make this relatively tractable. Problems are “I” “O” etc.
![Page 49: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/49.jpg)
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-frame animation showing how the 12 reactions lead into each other.
CLICK HERE FOR ANIMATION
(may be browser dependent)
![Page 50: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/50.jpg)
AMI Demo
http://www.mdpi.com/2218-1989/2/1/39/pdf
https://bitbucket.org/AndyHowlett/ami2-poc
ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor
May take time to start if not connected to web
Output: ./target/output/reactionsexample/
SVG: ./page1annotated.svg
CML: image.g.1.4.svg.reaction0.cml
AvogadroViewer:
![Page 51: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/51.jpg)
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
WP: Clostridium_butyricum
Genbank ID
American Type Culture Collection
![Page 52: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/52.jpg)
(http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933 – “Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada”)
((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),((((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n160))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))))))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((n53,n131),n159)))))));
http://en.wikipedia.org/wiki/Digital_image_processing
http://en.wikipedia.org/wiki/Newick_format http://en.wikipedia.org/wiki/Phylogenetics
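The bracketed string on this slide is a tree in Newick format: parentheses group sibling nodes, commas separate them, and a semicolon ends the tree. A minimal sketch of a reader for this simple subset (leaf names only, with no branch lengths or internal node labels) follows; the function names and the short test string are illustrative.

```python
def parse_newick(text):
    """Parse a Newick string (leaf names only) into nested lists."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                      # consume ")"
            return children
        start = pos
        while pos < len(text) and text[pos] not in "(),;":
            pos += 1
        return text[start:pos]            # a leaf name

    tree = node()
    if text[pos:].strip() != ";":
        raise ValueError("Newick string must end with ';'")
    return tree

def leaves(tree):
    """Flatten the nested lists into leaf names, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

tree = parse_newick("((n1,n2),(n3,(n4,n5)));")
print(tree)
print(leaves(tree))
```

Pasting the full string from the slide into `parse_newick` should work the same way, since it uses only names, commas and parentheses.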
![Page 53: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/53.jpg)
Open notebook science is the practice of making the entire primary record of a research project publicly available online as it is recorded. (WP)
Jean-Claude Bradley was a chemist who actively promoted Open Science in chemistry … He coined the term Open Notebook Science. … A memorial symposium was held July 14, 2014 at Cambridge University, UK.
![Page 54: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/54.jpg)
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific literature
Science Plugins
Science Volunteers
![Page 55: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/55.jpg)
Thanks
• Shuttleworth Foundation and Fellowship
• Contentmine.org: Michelle Brook, Jenny Molloy, Ross Mounce, Richard Smith-Unna, CottageLabs, Charles Oppenheim
• Open Knowledge Foundation Community
• Wikimedia Community
• Blue Obelisk Community
![Page 56: ContentMine: Open Data and Social Machines](https://reader035.vdocuments.mx/reader035/viewer/2022062419/5590378c1a28ab84018b4599/html5/thumbnails/56.jpg)
My/our Dream
• An Open Bibliography of science, updated daily
• An interface for ContentMine to feed new facts into WikiData
• Domain-specific enthusiasts to create and run fact extraction and validation
• Wikipedia to become a C21 publisher of reference science