Overview of Practical Content Mining
Peter Murray-Rust, JISC, London, 2014-12-01


Page 1: Overview of Practical Content Mining

Overview of Practical Content Mining

Peter Murray-Rust

JISC, London, 2014-12-01

Page 2: Overview of Practical Content Mining

What is Content Mining

• Mining Text, Tables and Lists, Diagrams, Images
• Born-digital documents
• High-throughput (millions of items/year)
• Formal and Informal Collaboration
• Role of UK
• Hands-on
• Everything is OPEN (OSI, CC-BY, CC0)

Page 3: Overview of Practical Content Mining

The Right to Read is the Right to Mine

http://contentmine.org

Page 4: Overview of Practical Content Mining

ContentMine

• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
  – Bioscience (EuropePMC, BBSRC)
  – Disease (Ebola)
  – Astrophysics (Stray Toaster)
  – Chemistry (TSB, EBI, PennState - Citeseer)

• We fight for Justice and Freedom

Page 5: Overview of Practical Content Mining

ContentMine People

• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN

Page 6: Overview of Practical Content Mining

ContentMine Workshops (1-hour -> full day or more)

2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Wash DC, US

Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO

Page 7: Overview of Practical Content Mining

Ebola Collaborators (Atlanta)

Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin

Page 8: Overview of Practical Content Mining

Regular Expressions (Easier than Crosswords or Sudoku)

Text to match                      Regex                    Note
Ebola                              Ebola
Mali (not Malicious)               Mali\W                   end of word
Bat or bat                         [Bb]at                   alternatives
bat or bats                        bats?                    optional letter
Bat or Bats or bat or bats         [Bb]ats?
Sudden onset                       [Ss]udden\s+onset        space/s
Panthera leo or Gorilla gorilla    [A-Z][a-z]+\s+[a-z]+     ranges of letters
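The patterns on this slide can be tried directly in Python's standard `re` module. The sketch below copies the patterns from the slide; the dictionary keys, the helper name `find_all` and the sample sentences are invented for illustration.

```python
import re

# Patterns copied from the slide; the keys are illustrative labels only.
PATTERNS = {
    "mali": r"Mali\W",                    # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                   # Bat, bat, Bats or bats
    "sudden_onset": r"[Ss]udden\s+onset", # one or more whitespace characters between words
    "binomial": r"[A-Z][a-z]+\s+[a-z]+",  # Capitalised word, then a lowercase word
}


def find_all(name, text):
    """Return every non-overlapping match of the named pattern in text."""
    return re.findall(PATTERNS[name], text)


if __name__ == "__main__":
    print(find_all("mali", "Cases in Mali, none Malicious."))
    print(find_all("bat", "Bats and a bat"))
    print(find_all("sudden_onset", "sudden  onset of fever"))
    print(find_all("binomial", "Panthera leo and Gorilla gorilla"))
```

Two caveats are worth knowing. `Mali\W` needs a character after the word, so "Mali" at the very end of a text is missed; `\b` (word boundary) is the usual alternative in Python. The binomial pattern is deliberately crude: it matches any capitalised word followed by a lowercase word, not only species names.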

Page 9: Overview of Practical Content Mining

Ebola regex

<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
</compoundRegex>

15 mins to create, 15 mins to install and test
Or run online at CottageLabs
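A minimal sketch of how such a compoundRegex file might be loaded and run over lines of text, using only the Python standard library. This is not the AMI implementation. The XML is a verbatim subset of the slide; the function names and the `score` rule (sum of the weights of the distinct fields that matched) are assumptions, since the slides do not say how weights are combined. The "drc" and "etu" entries are left out because, read as plain regexes, their two adjacent groups would have to match back to back.

```python
import re
import xml.etree.ElementTree as ET

# Verbatim subset of the compoundRegex shown on the slide.
COMPOUND_XML = r"""
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola">(Ebola)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
</compoundRegex>
"""


def load_patterns(xml_text):
    """Return a list of (field, weight, compiled_pattern) from a compoundRegex."""
    root = ET.fromstring(xml_text.strip())
    return [
        (el.get("fields"), float(el.get("weight")), re.compile(el.text))
        for el in root.findall("regex")
    ]


def scan(lines, patterns):
    """Yield (line_number, field, weight, captured_text); line numbers start at 1."""
    for number, line in enumerate(lines, start=1):
        for field, weight, pattern in patterns:
            for match in pattern.finditer(line):
                yield number, field, weight, match.group(1)


def score(hits):
    """Hypothetical scoring rule: sum the weights of the distinct fields that matched."""
    return sum({field: weight for _, field, weight, _ in hits}.values())


if __name__ == "__main__":
    sample = [
        "There have been 14 413 reported Ebola cases in eight countries",
        "Case incidence continues to increase in Sierra Leone, and",
    ]
    found = list(scan(sample, load_patterns(COMPOUND_XML)))
    for hit in found:
        print(hit)
    print("score:", score(found))
```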

Page 10: Overview of Practical Content Mining

Results of Regex on Ebola

<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
        name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
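Output in this shape can be read back with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; note that the outer `regex` elements sit in the `http://www.xml-cml.org/ami` namespace while `hits` and `hit` carry no namespace. The sample is the fragment from the slide with closing `</results>` and `</resultsList>` tags added so that it parses (an assumption, since the slide cuts off before the end), and the function name `extract_hits` is invented here.

```python
import xml.etree.ElementTree as ET

AMI = "{http://www.xml-cml.org/ami}"  # namespace prefix in ElementTree's notation

# Fragment from the slide; the two closing tags at the end are added for well-formedness.
SAMPLE = r"""
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
        name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
  </results>
</resultsList>
"""


def extract_hits(xml_text):
    """Return (line_number, field, matched_text) for every hit in a resultsList."""
    root = ET.fromstring(xml_text.strip())
    found = []
    # Only the outer, namespaced <regex> elements carry lineNumber.
    for outer in root.iter(AMI + "regex"):
        line_number = int(outer.get("lineNumber"))
        for hit in outer.iter("hit"):
            for field, value in hit.attrib.items():
                found.append((line_number, field, value))
    return found


if __name__ == "__main__":
    for row in extract_hits(SAMPLE):
        print(row)
```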

Page 11: Overview of Practical Content Mining

Demo of Content Mining

ChemicalTagger (Lezan Hawizy): a shallow, domain-specific semantic parser for un/natural language.

Page 12: Overview of Practical Content Mining

Bacterial WP_phylogenetic tree

Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy

Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)

[Figure labels: WP: Clostridium_butyricum; Genbank ID; American Type Culture Collection]

Page 13: Overview of Practical Content Mining

RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs

[Diagram labels: Queues; Repos; Scientific literature; Science Plugins; Science Volunteers]

Collaboration with Open Access Button

Page 14: Overview of Practical Content Mining

AMI (extraction) architecture

[Diagram labels: PDF2SVG; Image analysis; SVG2XML; Regex, Species, Phylo, Chem; AMI; tables, sections, captioned diagrams]
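The diagram names Regex, Species, Phylo and Chem as separate components around AMI. One way to picture that arrangement is a small plugin registry, sketched below in Python. Everything here is hypothetical: the registry, the function names and the two toy plugins are illustrations, not the AMI code, and the upstream conversion steps (PDF2SVG, SVG2XML, image analysis) are not modelled at all.

```python
import re
from typing import Callable, Dict, List

# A plugin takes plain text and returns the strings it recognised (illustrative type).
Plugin = Callable[[str], List[str]]


def regex_plugin(text: str) -> List[str]:
    """Toy 'Regex' plugin using one pattern from the regular-expression slide."""
    return re.findall(r"[Ss]udden\s+onset", text)


def species_plugin(text: str) -> List[str]:
    """Toy 'Species' plugin: the crude binomial pattern from the same slide."""
    return re.findall(r"[A-Z][a-z]+\s+[a-z]+", text)


# Hypothetical registry; 'phylo' and 'chem' would be registered the same way.
PLUGINS: Dict[str, Plugin] = {
    "regex": regex_plugin,
    "species": species_plugin,
}


def run_plugins(text: str, names: List[str]) -> Dict[str, List[str]]:
    """Run the named plugins over the text and collect their results by name."""
    return {name: PLUGINS[name](text) for name in names}


if __name__ == "__main__":
    print(run_plugins("sudden onset after contact with Gorilla gorilla",
                      ["regex", "species"]))
```

Keeping each extractor behind the same call signature means a new discipline can be added without touching the code that feeds documents in.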

Page 15: Overview of Practical Content Mining

Immediate Stakeholders

– Researchers (bio, EBI, chem, materials, astro)
– Funders (WT, FWF (Austria), RCUK)
– Libraries (repositories, theses)
– Service providers (EuropePMC)
– Knowledge-based SMEs
– Library organisations (JISC, RLUK, LIBER, SPARC)
– Non-profits (Wikimedia, WHO, Mozilla)

Page 16: Overview of Practical Content Mining

Content production

• Scholarly articles
• Theses
• Repositories
• Grey scientific literature
• Grey politico-socio-legal literature
• Company output (reports, accounts, contracts) (e.g. OpenOil)

Page 17: Overview of Practical Content Mining

STM Publishers Licence

2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee”

Heather Piwowar: “negotiating with publishers [made me physically ill]”

WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R

Licences destroy Content Mining

Page 18: Overview of Practical Content Mining

Challenges

• Active opposition from content “owners” including serious lobbying and FUD
• Ignorance and apathy from universities; inappropriate reward system
• Sub-optimal technology of publishers
• Lack of common infrastructure, technology, APIs
• And it’s objectively messy anyway

Page 19: Overview of Practical Content Mining

Technical problems

• PDF: lacks words, tables, diagrams
• Non-Unicode character sets (or worse)
• Graphics objects largely destroyed (converted to PNG or worse)
• No communal ontology for document structure
• HTML carries PublisherJunk and Javascript
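As an illustration of the last point, here is a minimal sketch of stripping script and style content from an HTML page before mining it, using only Python's standard `html.parser`. It is not part of the ContentMine tools; the class name `TextOnly` and function `visible_text` are invented here, and real publisher pages need far more than this (navigation, adverts and so on are not handled).

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the page text with script/style content removed, joined by spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    page = ("<html><head><script>var a = 1;</script><style>p{}</style></head>"
            "<body><p>Ebola cases in <b>Mali</b></p></body></html>")
    print(visible_text(page))
```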

Page 20: Overview of Practical Content Mining

Goals of Mining

• Classification of resources
• Entity extraction and indexing
• Aggregation within discipline
• Inter-disciplinary, e.g. biodiversity, phytochemistry
• Repurposing (twitter, ePub, annotation)
• Semantification/intelligent documents
• Detection of error and fraud

Page 21: Overview of Practical Content Mining

What we need

• Inter/national commitment to infrastructure
• Common ontologies and APIs
• Development of community
• Go beyond academia; non-academic reward system