text mining: the next data frontier · 2016-05-11 · to document and classify text mining...

Page 1

TEXT MINING:

THE NEXT DATA FRONTIER An Infrastructural Approach

@openminted_eu

Dr. Petr Knoth CORE (core.ac.uk)

Knowledge Media Institute, The Open University, United Kingdom

Page 2

OpenMinTeD

Establish an open and sustainable Text and Data Mining (TDM) platform and infrastructure where researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific and scholarly-related sources.

Page 3

Beyond Open Access

MAKING SENSE OF LARGE VOLUMES OF SCIENTIFIC CONTENT

Page 4

The phases of text mining

[Figure: the text mining pipeline in four stages (STAGE 1 to STAGE 4). Labels in the figure: Information Retrieval, Information Extraction, NLP Analysis, Entity Recognition, Data Mining, Knowledge Discovery.]

OPENMINTED - The Open Mining Infrastructure for Text and Data

Page 5

TDM challenges for researchers

1. Content challenges

- Barriers and obstacles due to non-availability, technical restrictions, copyright law or licensing issues

- No uniform way to search for, retrieve and access content for TDM

Page 6

TDM challenges for researchers

2. Services challenges

How to identify the most fitting TDM service?

How to combine with other TDM services I have access to?

How to use them on my content?

Page 7

TDM challenges for researchers

3. Processing challenges

Where to deploy? Are my machines powerful enough?

How can I get access to powerful machines?

Where to store intermediate and final results?

How to ensure persistence of storage?

Page 8

OpenMinTeD – Provides solutions

An open and sustainable TDM infrastructure where researchers can collaboratively create, discover, share and re-use knowledge from a wide range of text-based scientific-related sources.

Page 9

OpenMinTeD – working on many fronts

- ACCESSIBLE CONTENT: Via standardised programmatic interfaces

- DISCOVERABLE SERVICES: Well-documented, easily discoverable text mining services and workflows which process, analyse and annotate text

- EFFICIENT PROCESSING: Operate on public e-Infrastructures via standardised APIs

- RESEARCH COMMUNITIES: Different scientific communities have different challenges

- VALUE ADDED APPS: Community-driven applications to illustrate the value of the infrastructure. Engage with industry.

Page 10

The project

Started: June 2015

Duration: 3 years

Budget of: €6 million

Grant of: €5.3 million

16 Partners:

- 6 mining research groups

- 3 content providers

- 1 data center

- 1 library association

- 2 legal experts

- 6 community related partners

- 2 SMEs

PARTNERS

Athena RIC, Univ. of Manchester (NaCTeM), Univ. of Darmstadt, INRA, EMBL-EBI, Agro-Know, LIBER, Univ. of Amsterdam, Open University UK (CORE), EPFL, CNIO, Univ. of Sheffield (GATE), GESIS, GRNET, Frontiers, Univ. of Stirling

Page 11

The OpenMinTeD landscape

Page 12

Infrastructural approach

OpenMinTeD does not build new services, but adopts and adapts existing services for new communities.

Page 13

Infrastructural approach

Focuses on interoperability across text mining services and content provision outlets.

Page 14

Infrastructural approach

Creates an open and collaborative space for researchers to use the best-fitting text mining services available, building on the cloud computing philosophy.

Page 15

Overview

[Figure: OpenMinTeD architecture overview with three interoperability layers. Labels recoverable from the figure are listed below.]

Users: researchers, curators, text-miners and new services developers

Platform services: Registry, Workflow Management, Auth2 & Policy management, Annotator, Accounting

Layer 1: Interoperability of text mining services (platforms or components)

Layer 2: Interoperability of language resources & corpora

Layer 3: Interoperability to shared storage and computing resources

Other figure labels: Mining Platforms; Proprietary architectures; Language resources and corpora registry service; Language resources; Publisher text corpus; OpenAIRE/CORE text corpus; PMC text corpus; Other text corpora; Other types of text corpora; Data centre; Data centre in public cloud

Page 16

Interoperability framework

Bringing together mining tools, resources and content

1. Content metadata & transfer standards

To document scientific literature, language resources, taxonomies and provenance as well as transfer protocols for full text retrieval

Page 17

Interoperability framework

Bringing together mining tools, resources and content

2. Service metadata & pipelining

To document and classify text mining services, how they receive input, in what form they output their results, how they combine for workflows, what granularity to consider.
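
As an illustration only, the kind of service metadata described above could be sketched as a small record that states each service's task, input form, output form and granularity, plus a check that consecutive services in a workflow fit together. The class name, field names and format labels below are all hypothetical; the slides do not specify an actual OpenMinTeD schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescriptor:
    """Hypothetical metadata record for one text mining service."""
    name: str
    task: str           # classification of the service, e.g. "tokenisation"
    input_format: str   # form in which the service receives input
    output_format: str  # form in which the service outputs results
    granularity: str    # unit processed, e.g. "document" or "sentence"


def validate_workflow(services):
    """Return a list of format mismatches between consecutive services."""
    problems = []
    for upstream, downstream in zip(services, services[1:]):
        if upstream.output_format != downstream.input_format:
            problems.append(
                f"{upstream.name} outputs {upstream.output_format}, "
                f"but {downstream.name} expects {downstream.input_format}"
            )
    return problems


# Made-up example services; the format labels are assumptions.
pdf_extractor = ServiceDescriptor(
    "pdf-extractor", "text extraction", "pdf", "plain-text", "document")
tokeniser = ServiceDescriptor(
    "tokeniser", "tokenisation", "plain-text", "token-annotations", "document")
tagger = ServiceDescriptor(
    "entity-tagger", "entity recognition",
    "token-annotations", "entity-annotations", "sentence")

good = validate_workflow([pdf_extractor, tokeniser, tagger])
bad = validate_workflow([pdf_extractor, tagger])
```

Here `good` is empty because every output form matches the next input form, while `bad` reports that the tagger cannot consume plain text directly. A real framework would also need to reconcile granularity and annotation type systems, which this sketch ignores.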

Page 18

Interoperability framework

Bringing together mining tools, resources and content

3. IPR and licensing

To study IPR restrictions, describe license metadata for the re-use of content and of TDM services & tools, and give information on how to apply them for academic and non-commercial mining research

Page 19

OpenMinTeD users

1. End users

- Researchers, database curators, …

- Novice: use services to advance their science

- Advanced: combine TDM services into complex workflows

Page 20

OpenMinTeD users

2. Content and service providers

- Publishers, libraries, scientific database centres, …

- TDM researchers

- SMEs

Page 21

Bottom-up approach

OpenMinTeD works with 4 use cases, which give their requirements and evaluate the results.

- RESEARCH ANALYTICS

- LIFE SCIENCES

- AGRICULTURE

- SOCIAL SCIENCES

Page 22

OpenMinTeD use case 1

Scholarly communication analytics

• Semantic search and discovery of open scientific outcomes

• Map of academia – scholarly communication network

• Research monitoring and analytics

Partners: CORE/OU, OpenAIRE/ARC, Frontiers

Page 23

OpenMinTeD use case 2

Life sciences

• Assisted curation of the EMBL-EBI chemical databases for metabolomics

• Curation of the neurosciences resources KnowledgeBase and Neurolex

Partners: EBI - Metabolomics, Human Brain Project

Page 24

OpenMinTeD use case 3

Agriculture and biodiversity

• Enrich agricultural databases to assist food- and water-borne disease outbreak alerts and product recalls

• Image, figure and dataset discovery in the AGRIS

Partners: INRA, AGRO-KNOW

Page 25

OpenMinTeD use case 4

Social sciences

Develop and evaluate methods for the automatic detection and linking of named entities, citation traces and intentions in social science scientific publications

Partners: GESIS

Page 26

What can OpenMinTeD do for you?

Are you a content provider?

Make your content available for mining

Register your collections in the OpenMinTeD registry and let others discover them.

Page 27

What can OpenMinTeD do for you?

Are you a TDM service provider?

Share and collaborate with other TDM services

Register your TDM service in the OpenMinTeD registry and let others discover it.
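
To make the register-and-discover idea concrete, here is a minimal in-memory sketch. It is not the OpenMinTeD registry or its API, which the slides do not describe; the `Registry` class, its methods and the metadata keys are hypothetical and chosen only for illustration.

```python
class Registry:
    """Hypothetical in-memory registry; not the OpenMinTeD API."""

    def __init__(self):
        self._entries = {}

    def register(self, identifier, **metadata):
        """Store a service under a unique identifier with free-form metadata."""
        if identifier in self._entries:
            raise ValueError(f"{identifier!r} is already registered")
        self._entries[identifier] = dict(metadata)

    def discover(self, **criteria):
        """Return sorted identifiers whose metadata matches every criterion."""
        return sorted(
            identifier
            for identifier, metadata in self._entries.items()
            if all(metadata.get(key) == value for key, value in criteria.items())
        )


registry = Registry()
# Made-up services and metadata keys, used only to exercise the sketch.
registry.register("example-tagger", task="entity recognition", language="en")
registry.register("example-tokeniser", task="tokenisation", language="en")
registry.register("example-tagger-de", task="entity recognition", language="de")

english_taggers = registry.discover(task="entity recognition", language="en")
all_taggers = registry.discover(task="entity recognition")
```

Discovery here is exact matching on metadata fields. A shared registry would more plausibly rely on a controlled metadata vocabulary so that providers describe tasks and languages consistently.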

Page 28

What can OpenMinTeD do for you?

Are you a text miner or a researcher who can benefit from text mining?

Use OpenMinTeD (when launched)

Page 29

Conclusions

- The ability to text-mine research literature at scale can redefine the way we do research

- OpenMinTeD is laying the groundwork (interoperability) and building the cloud infrastructure for text-mining research literature

- Building an open, transparent infrastructure that enables others to participate

Page 30

Contact us

www.openminted.eu

twitter.com/openminted_eu

facebook.com/openminted

bit.do/openmintedlinkedin

vimeo.com/openminted

bit.do/openmintedplus