Paul Groth: Data Analysis in a Changing Discourse: The Challenges of Scholarly Communication


Upload: cost-action-td1210

Post on 17-Jul-2015


Page 1: Paul Groth: Data Analysis in a Changing Discourse: The Challenges of Scholarly Communication

Data Analysis in a Changing Discourse |


Data Analysis in a Changing Discourse

The Challenges of Scholarly Communication

Paul Groth @pgroth

Page 2

Page 3

Page 4

Page 5

Page 6

[Figure: keyword co-occurrence network; node labels are stemmed keywords]

Keyword co-occurrence network in WWW 2008

web, query, online, mobile
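A keyword co-occurrence network like the one on this slide can be approximated by counting how often two keywords appear on the same paper. The following is a minimal sketch under stated assumptions, not the method used for the slide: the `papers` list is hypothetical, the keywords are assumed to be stemmed already, and the function names `cooccurrence_edges` and `degree` are illustrative.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(papers):
    """Count how often each unordered keyword pair appears on the same paper."""
    edges = Counter()
    for keywords in papers:
        # Deduplicate and sort so (a, b) and (b, a) map to the same edge.
        for pair in combinations(sorted(set(keywords)), 2):
            edges[pair] += 1
    return edges


def degree(edges):
    """Weighted degree of each keyword: total co-occurrences over all its edges."""
    totals = Counter()
    for (a, b), weight in edges.items():
        totals[a] += weight
        totals[b] += weight
    return totals


# Hypothetical, already-stemmed keyword lists; not data from the talk.
papers = [
    ["web", "queri", "search"],
    ["web", "mobil", "search"],
    ["web", "queri", "log"],
]
edges = cooccurrence_edges(papers)
hubs = degree(edges).most_common(1)
```

Comparing the highest-degree keywords across two conference years is one simple way to see the vocabulary shift that the two network slides illustrate.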

Page 7

[Figure: keyword co-occurrence network; node labels are stemmed keywords]

Keyword co-occurrence network in WWW 2010

search, social, data

Page 8

Figure 1. Evolution of the number of classes of the three branches of the Gene Ontology.

Dameron O, Bettembourg C, Le Meur N (2013) Measuring the Evolution of Ontology Complexity: The Gene Ontology Case Study. PLoS ONE 8(10): e75993. doi:10.1371/journal.pone.0075993 http://127.0.0.1:8081/plosone/article?id=info:doi/10.1371/journal.pone.0075993

Page 9

Table 2. Gene Ontology complexity variations.

Dameron O, Bettembourg C, Le Meur N (2013) Measuring the Evolution of Ontology Complexity: The Gene Ontology Case Study. PLoS ONE 8(10): e75993. doi:10.1371/journal.pone.0075993 http://127.0.0.1:8081/plosone/article?id=info:doi/10.1371/journal.pone.0075993

Page 10

•  The most recent changes to the GO term “apoptotic process” as displayed in QuickGO [20]. In total there have been 54 changes over the lifetime of the term.

•  Huntley et al. GigaScience 2014 3:4 doi:10.1186/2047-217X-3-4

Definitions change

Page 11

Ramifications

Page 12

Page 13

What happens to the long tail?

Page 14

CHEMBL 15: Targets are now proteins

http://chembl.blogspot.nl/2013/01/chembl-15-schema-changes.html

Page 15

Page 16

Downstream effects

Page 17

The growth of data munging

Page 18

https://storify.com/chenghlee/dataformathell

http://isps.yale.edu/sites/default/files/files/IDCC14_DQR_PeerGreenStephenson.pdf

Page 19

“60 % of time is spent on data preparation”

NASA, A.40 Computational Modeling Algorithms and Cyberinfrastructure, tech. report, NASA, 19 Dec. 2011

Page 20

Search target Oxidoreductase: 481 targets from different species

Selection of all the oxidoreductases and filtering bioactivities with the criterion IC50 < 100 (no units could be selected): 11,497 data points obtained

Table exported to an Excel spreadsheet and manually filtered

From Mabel Loza - USC team
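The manual filtering step above exists because a bare threshold such as IC50 < 100 means different things depending on the unit. The following is a minimal sketch of normalising units before filtering. It assumes the intended threshold is 100 nM, which the slide does not state, and the `TO_NM` table, the record layout, and the function names are hypothetical.

```python
# Conversion factors to nanomolar; a hypothetical, deliberately small table.
TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}


def ic50_in_nm(value, unit):
    """Return the IC50 in nM, or None when the unit is missing or unknown."""
    factor = TO_NM.get(unit)
    return None if factor is None else value * factor


def filter_potent(records, threshold_nm=100.0):
    """Keep records whose IC50 is below the threshold once expressed in nM."""
    kept = []
    for record in records:
        nm = ic50_in_nm(record["ic50"], record.get("unit"))
        if nm is not None and nm < threshold_nm:
            kept.append(record)
    return kept


# Hypothetical bioactivity records; not data from the talk.
records = [
    {"compound": "A", "ic50": 50.0, "unit": "nM"},
    {"compound": "B", "ic50": 50.0, "unit": "uM"},
    {"compound": "C", "ic50": 0.05, "unit": "uM"},
    {"compound": "D", "ic50": 50.0, "unit": None},
]

# Unit-blind filter, as in the slide: every record passes.
naive = [r["compound"] for r in records if r["ic50"] < 100]
# Unit-aware filter: B is too weak and D has no usable unit.
potent = [r["compound"] for r in filter_potent(records)]
```

Dropping records with unknown units is a design choice in this sketch; a real pipeline might instead flag them for review.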

Page 21

The Seven Deadly Sins of Bioinformatics

Professor Carole Goble

The University of Manchester, UK

The myGrid project

OMII-UK

Page 22

Andy Law's Third Law

•  “The number of unique identifiers assigned to an individual is never less than the number of Institutions involved in the study”... and is frequently many, many more.

http://bioinformatics.roslin.ac.uk/lawslaws.html

Page 23

PubChem Drugbank ChemSpider

Imatinib

Mesylate

What Is Gleevec?

Page 24

Some Solutions

Page 25

Issue: Identifiers aren’t the same and we can’t agree on when one thing equals another

Solution: Adaptive identifier mapping based on profiles

Strict: analysing

Relaxed: browsing
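One way to picture profile-based identifier mapping is to tag each cross-database link with the reason it exists and let each profile choose which reasons count as "equal". The following is a minimal sketch under stated assumptions, not the system described in the talk: the identifiers, the link reasons `same_structure` and `salt_form`, and the profile contents are all hypothetical.

```python
from collections import deque

# Hypothetical links between database records, each tagged with why they are linked.
LINKS = [
    ("pubchem:imatinib", "drugbank:imatinib", "same_structure"),
    ("drugbank:imatinib", "chemspider:imatinib", "same_structure"),
    ("chemspider:imatinib", "chemspider:imatinib_mesylate", "salt_form"),
]

# Each profile names the link reasons it is willing to treat as equivalence.
PROFILES = {
    "strict": {"same_structure"},
    "relaxed": {"same_structure", "salt_form"},
}


def equivalents(identifier, profile):
    """Return every identifier reachable through links the profile allows."""
    allowed = PROFILES[profile]
    seen = {identifier}
    queue = deque([identifier])
    while queue:
        current = queue.popleft()
        for left, right, reason in LINKS:
            if reason not in allowed:
                continue
            # Links are symmetric: follow them in either direction.
            if left == current:
                neighbour = right
            elif right == current:
                neighbour = left
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


strict = equivalents("pubchem:imatinib", "strict")
relaxed = equivalents("pubchem:imatinib", "relaxed")
```

Under the strict profile the mesylate salt stays a separate entity, which suits analysis; under the relaxed profile it is folded in, which suits browsing.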

Page 26

Issue: There’s no one data model of science

Solution: Simple “common sense” driven data model primarily focused on user interface needs

Page 27

provbook.org

Page 28

Page 29

My Questions:

15/03/15  


Page 30

[Gray et al. ISWC 2014]

Page 31

Page 32

We have to rely on computers


Page 33

Contact: Elsevier Labs

•  Paul Groth
•  http://pgroth.com
•  @pgroth


Page 34

Questions

•  What is the interplay between data munging and concept drift?
•  What happens when humans are not in the loop?
•  What’s our tolerance for fuzziness?
•  Should we worry about the long tail?