Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Rathachai CHAWUTHAI; Hideaki TAKEDA (Professor); Tsuyoshi HOSOYA (Mycologist)
JIST 2014, Chiang Mai, Thailand, November 10th, 2014

Upload: asian-institute-of-technology

Post on 02-Jul-2015


DESCRIPTION

Linked Open Data for ACademia (LODAC), together with the National Museum of Nature and Science, has started collecting linked data of interspecies interactions and making link predictions for future observations. The initial data is very sparse and disconnected, making it very difficult to predict potential missing links using collaborative filtering alone. In this paper, we introduce Link Prediction on Interspecies Interaction (LPII) to solve this situation using a hybrid recommendation approach. Our prediction model is a combination of three scoring functions, and takes into account collaborative filtering, community structure, and biological classification. We have found our approach, LPII, to be more accurate than other combinations of prediction models. Using statistical significance testing, we demonstrate that these scoring functions are important and play different roles depending on the conditions of the linked data. This shows that LPII can be applied to other real-world link prediction situations.

TRANSCRIPT

Page 1: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Rathachai CHAWUTHAI
Hideaki TAKEDA, Professor
Tsuyoshi HOSOYA, Mycologist

JIST 2014, Chiang Mai, Thailand, November 10th, 2014

Page 2: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Linked Open Data for ACademia (LODAC)

[Figure: example resource lodac:Salix_pierotii, labelled "Salix pierotii", linked by species:hasSuperTaxon to lodac:Salix.]

Page 3: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

National Museum of Nature and Science

• 30,000 Interactions
• 4,000 Fungi
• 7,000 Hosts

Page 4: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Let’s find the missing links between species

LPII: Link Prediction on Interspecies Interactions

Objective: To predict missing links between fungi and hosts

Page 5: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Agenda

• Dataset
• Introduction
• Hybrid Recommendation
  • Collaborative Filtering
  • Community Structure
  • Biological Classification
• Evaluation
• Summary
• Future work

Page 6: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Dataset

# Fungus
lodac:Melampsora_yezoensis
    rdfs:label "Melampsora yezoensis"@la ;
    species:hasTaxonRank species:Species ;
    species:hasSuperTaxon lodac:Melampsora .

lodac:Melampsora species:hasTaxonRank species:Genus .

# Host
lodac:Salix_pierotii
    rdfs:label "Salix pierotii"@la ;
    rdf:type species:ScientificName ;
    species:hasSuperTaxon lodac:Salix .

lodac:Salix species:hasTaxonRank species:Genus .

# Link
lodac:Melampsora_yezoensis species:growsOn lodac:Salix_pierotii .
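To make the data model concrete, here is a minimal Python sketch of how the triples above could be turned into the two structures the later slides rely on: the fungus-to-hosts links and the species-to-genus mapping. The in-memory `TRIPLES` list and the `build_graph` helper are illustrative assumptions, not part of the LODAC tooling; a real pipeline would presumably obtain the triples with SPARQL.

```python
from collections import defaultdict

# Hypothetical in-memory copy of the triples shown on the slide.
TRIPLES = [
    ("lodac:Melampsora_yezoensis", "species:hasSuperTaxon", "lodac:Melampsora"),
    ("lodac:Salix_pierotii", "species:hasSuperTaxon", "lodac:Salix"),
    ("lodac:Melampsora_yezoensis", "species:growsOn", "lodac:Salix_pierotii"),
]


def build_graph(triples):
    """Split triples into fungus->hosts links and species->genus links."""
    hosts_of = defaultdict(set)
    genus_of = {}
    for subject, predicate, obj in triples:
        if predicate == "species:growsOn":
            hosts_of[subject].add(obj)
        elif predicate == "species:hasSuperTaxon":
            genus_of[subject] = obj
    return dict(hosts_of), genus_of


hosts_of, genus_of = build_graph(TRIPLES)
print(hosts_of)
print(genus_of)
```

The `hosts_of` dictionary is the bipartite fungus-host graph, and `genus_of` carries the biological classification one level up.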

Page 7: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

[Figure: RDF graph. lodac:Melampsora_yezoensis species:hasSuperTaxon lodac:Melampsora; lodac:Salix_pierotii species:hasSuperTaxon lodac:Salix; lodac:Melampsora_yezoensis species:growsOn lodac:Salix_pierotii.]

Page 8: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Selected:

• 903 Rust Fungi
• 2,001 Hosts
• 2,966 Links
• Biological Classification of Fungi
• Biological Classification of Hosts

Page 9: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Introduction

[Figure: LPII workflow in three stages.
DATA PREPARATION: a biologist making observations provides the fungus-host interaction dataset, which is transformed using a weight function.
LPII APPROACH: missing links are found with three scores (collaborative filtering, community structure, biological classification), which are then combined.
RESULT: a list of fungus-host interactions with predictive scores is generated.]

Page 10: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

1. Collaborative Filtering

Fungi found at the same host are common neighbors.

If some close neighbors of the fungus f are found at a host h, the fungus f may be found at the host h.

Page 11: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

[Figure: bipartite graph of fungi (f1 to f5) and hosts (h1 to h5).]

Page 12: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Collaborative Filtering for Link Prediction

PCF(f1, h2) = ?

Sum of similarities between fungi with common hosts

[Figure: fungi f1, f2 and hosts h1, h2.]

Page 13: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

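Jaccard Index

[Figure: bipartite graph of fungi (f1 to f5) and hosts (h1 to h5); the weight w = ? between two fungi is computed with the Jaccard index.]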

Page 14: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

[Figure: the same bipartite graph of fungi (f1 to f5) and hosts (h1 to h5), now with weights w = 0.50 and w = 0.33 between fungi.]

Page 15: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Predictive Score using Collaborative Filtering

PCF(f1, h2) = 0.50
PCF(f2, h3) = 0.33
PCF(f1, h3) = ???
PCF(f4, h3) = ???
PCF(f4, h5) = ???
etc.

[Figure: bipartite graph of fungi (f1 to f5) and hosts (h1 to h5) with weights w = 0.50 and w = 0.33. Dashed red lines are predicted links.]
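The collaborative-filtering score can be sketched in a few lines of Python. This is a minimal illustration, assuming that the weight between two fungi is the Jaccard index of their host sets and that PCF(f, h) sums the weights between f and every other fungus already observed on h. The toy graph `HOSTS_OF` is an assumption chosen so that the weights come out as 0.50 and 0.33; it is not the graph drawn on the slides.

```python
# Toy bipartite graph (assumed): fungus -> set of hosts it is observed on.
HOSTS_OF = {
    "f1": {"h1"},
    "f2": {"h1", "h2"},
    "f3": {"h2", "h3"},
}


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pcf(f, h, hosts_of):
    """Sum of similarities between f and the other fungi found on host h."""
    return sum(
        jaccard(hosts_of[f], hosts)
        for other, hosts in hosts_of.items()
        if other != f and h in hosts
    )


print(round(pcf("f1", "h2", HOSTS_OF), 2))  # 0.5
print(round(pcf("f2", "h3", HOSTS_OF), 2))  # 0.33
```

A fungus with no common hosts contributes a similarity of zero, which is why collaborative filtering alone struggles on sparse, disconnected data.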

Page 16: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

2. Community Structure

If a host h is commonly found in the community of the fungus f, the fungus f may be found at the host h.

Page 17: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

[Figure: left, the bipartite graph of fungi (f1 to f5) and hosts (h1 to h5); right, its projection onto the fungi, with edge weights 0.50 and 0.33.]

Page 18: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Community Structure of Rust Fungi

Using Modularity with Random Walk

Page 19: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Community Structure

[Figure: projection of fungi (f1 to f5, edge weights 0.50 and 0.33) grouped into Community #1, Community #2, and Community #3, connected to hosts h1 to h5.]

$$\mathrm{PCS}(f,h) = \frac{\text{number of links between the community of the fungus } f \text{ and the host } h}{\text{number of all links given by the community of the fungus } f}$$

$$\mathrm{PCS}(f_3,h_1) = \frac{2}{5} = 0.40$$
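The community-structure ratio is easy to check with a small Python sketch. The community labels are supplied by hand here; in the talk they come from a community detection method, which this sketch does not implement. The toy graph and the labels in `COMMUNITY_OF` are assumptions chosen to reproduce the 2/5 example, and `pcs` is an illustrative helper name.

```python
# Toy data (assumed): fungus -> hosts, and fungus -> community label.
HOSTS_OF = {
    "f1": {"h1"},
    "f2": {"h1", "h2"},
    "f3": {"h2", "h3"},
    "f4": {"h4"},
    "f5": {"h4", "h5"},
}
COMMUNITY_OF = {"f1": 1, "f2": 1, "f3": 1, "f4": 2, "f5": 2}


def pcs(f, h, hosts_of, community_of):
    """Share of the links of f's community that point at host h."""
    members = [g for g in hosts_of if community_of[g] == community_of[f]]
    all_links = sum(len(hosts_of[g]) for g in members)
    links_to_h = sum(1 for g in members if h in hosts_of[g])
    return links_to_h / all_links if all_links else 0.0


print(pcs("f3", "h1", HOSTS_OF, COMMUNITY_OF))  # 0.4
```

In this toy graph the community of f3 has five links in total, two of which reach h1, so the score is 2/5 even though f3 itself shares no host with f1.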

Page 20: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

How to deal with many very small communities?

Page 21: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

3. Biological Classification

If a host h is commonly found in the biological classification of the fungus f, the fungus f may be found at the host h.

Page 22: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

BIOLOGICAL CLASSIFICATION (TAXONOMY)

Classification | Example
Domain | Eukaryota
Kingdom | Fungi
Phylum | Basidiomycota
Class | Urediniomycetes
Order | Uredinales
Family | Melampsoraceae
Genus | Melampsora
Species | Melampsora yezoensis

Page 23: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Biological Classification

[Figure: bipartite graph of fungi (f1 to f5) and hosts (h1 to h5), with the fungi grouped by biological classification into G1 and G2.]

$$\mathrm{PBC}(f,h) = \frac{\text{number of links between the biological classification of the fungus } f \text{ and the host } h}{\text{number of all links given by the biological classification of the fungus } f}$$

$$\mathrm{PBC}(f_4,h_2) = \frac{1}{4} = 0.25$$
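The same ratio can be computed when fungi are grouped by taxon instead of by detected community. The following Python sketch assumes grouping at genus level (G1, G2); the toy graph and the genus assignment in `GENUS_OF` are assumptions chosen to reproduce the 1/4 example, and `pbc` is an illustrative helper name.

```python
# Toy data (assumed): fungus -> hosts, and fungus -> genus.
HOSTS_OF = {
    "f1": {"h1"},
    "f2": {"h1", "h2"},
    "f3": {"h2", "h3"},
    "f4": {"h4"},
    "f5": {"h5"},
}
GENUS_OF = {"f1": "G1", "f2": "G1", "f3": "G2", "f4": "G2", "f5": "G2"}


def pbc(f, h, hosts_of, genus_of):
    """Share of the links of f's genus that point at host h."""
    members = [g for g in hosts_of if genus_of[g] == genus_of[f]]
    all_links = sum(len(hosts_of[g]) for g in members)
    links_to_h = sum(1 for g in members if h in hosts_of[g])
    return links_to_h / all_links if all_links else 0.0


print(pbc("f4", "h2", HOSTS_OF, GENUS_OF))  # 0.25
```

Because every fungus has a genus, this score stays defined even for a fungus with no common neighbors and a tiny community, which is the case the conclusion singles out.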

Page 24: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Hybrid Recommender Approach

PII(f, h) is a combination of:

• PCF(f, h): Collaborative Filtering
• PCS(f, h): Community Structure
• PBC(f, h): Biological Classification

Page 25: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Evaluation


Page 26: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Learning and Testing

• Training set (2,500 links)
• Test set (500 links)
• Candidates (400,000 links)

[Figure: three bipartite graphs of fungi (f1 to f5) and hosts (h1 to h5). All possible links divide into existent links and missing links. Scores shown in the figure: 0.421, 0.864, 0.466, 0.490, 0.366, 0.515, 0.313, 0.076, 0.362, 0.902, 0.069, 0.524, 0.876, 0.464, 0.839, 0.504.]

Page 27: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

AUC: Area Under the receiver operating characteristic Curve

For n comparisons,

• n' is the number of times the test links have a higher score than the missing links.
• n" is the number of times the test links have the same score as the missing links.

Predicted List #1 (sorted by predictive score)

① PII(f1, h2) = 0.70
② PII(f2, h3) = 0.60
③ PII(f1, h3) = 0.50
④ PII(f4, h3) = 0.40
⑤ PII(f2, h2) = 0.30
⑥ PII(f3, h3) = 0.20
⑦ PII(f4, h3) = 0.10

Predicted List #2 (sorted by predictive score)

① PII(f1, h2) = 0.70
② PII(f2, h2) = 0.60
③ PII(f3, h3) = 0.50
④ PII(f2, h3) = 0.50
⑤ PII(f1, h3) = 0.40
⑥ PII(f4, h3) = 0.30
⑦ PII(f4, h3) = 0.10

(Red scores are test links. One list illustrates a high AUC and the other a low AUC.)
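With n, n' and n" defined as above, the usual link-prediction form of the metric is AUC = (n' + 0.5 n") / n. The Python sketch below computes it by comparing every test link with every missing link, which is an assumption; an evaluation may instead sample n random pairs. The scores are made-up illustration values, not the ones on the slide.

```python
def auc(test_scores, missing_scores):
    """AUC = (n' + 0.5 * n'') / n over all test-vs-missing comparisons."""
    n = len(test_scores) * len(missing_scores)
    if n == 0:
        raise ValueError("need at least one test link and one missing link")
    higher = sum(1 for t in test_scores for m in missing_scores if t > m)
    same = sum(1 for t in test_scores for m in missing_scores if t == m)
    return (higher + 0.5 * same) / n


# Hypothetical scores: held-out test links vs. links absent from the data.
test_scores = [0.70, 0.60, 0.40]
missing_scores = [0.50, 0.40, 0.10]
print(round(auc(test_scores, missing_scores), 3))  # 0.833
```

An AUC of 0.5 corresponds to random ranking, so the further a scoring function lifts the test links above the missing links, the closer the value gets to 1.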

Page 28: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

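AUC: Area Under the receiver operating characteristic Curve

Combination | Scoring Function(s) | AUC
Stand-alone function | PCF | 0.859
Stand-alone function | PCS | 0.823
Stand-alone function | PBC | 0.680
Summation of functions | PCF + PCS | 0.867
Summation of functions | PCF + PBC | 0.876
Summation of functions | PCS + PBC | 0.865
Summation of functions | PCF + PCS + PBC | 0.892
Multiplication of functions | PCF × PCS | 0.817
Multiplication of functions | PCF × PBC | 0.862
Multiplication of functions | PCS × PBC | 0.827
Multiplication of functions | PCF × PCS × PBC | 0.818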

Page 29: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Overall

[Figure: overall LPII workflow.
DATA PREPARATION: RDF data of interspecies interactions is retrieved by SPARQL querying into a bipartite graph, which is transformed using a weight function into the projection of fungi.
LPII APPROACH: the scoring functions take as input the selected connected fungi (collaborative filtering), clusters from a community detection method (community structure), and clusters from the biological classification. PII(f,h) = PCF(f,h) + PCS(f,h) + PBC(f,h) is used to find missing links.
RESULT: predicted missing fungus-host links, ranked by prediction score in decreasing order, go to the domain expert, who makes observations. If a link is found, the knowledge base is updated and shared on the LOD Cloud.
Legend: data, process, third-party method, scoring function, input argument, linear operation, decision, dataflow.]

Page 30: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Hybrid Recommender Approach

PII(f, h) = α·PCF(f, h) + β·PCS(f, h) + γ·PBC(f, h)

γ should be very low, about 0.1 to 0.2.
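A minimal Python sketch of the weighted combination, assuming the three scores are combined as a weighted sum with weights α, β, γ. The default weights (α = β = 1.0, γ = 0.1) and the candidate scores are hypothetical illustration values; the slide only states that γ should be low.

```python
def pii(pcf, pcs, pbc, alpha=1.0, beta=1.0, gamma=0.1):
    """Weighted sum of the three scores (the weights are assumptions)."""
    return alpha * pcf + beta * pcs + gamma * pbc


# Hypothetical (PCF, PCS, PBC) scores for three candidate links.
candidates = {
    ("f1", "h2"): (0.50, 0.40, 0.00),
    ("f2", "h3"): (0.33, 0.20, 0.25),
    ("f4", "h2"): (0.00, 0.00, 0.25),
}

# Rank candidate links by predictive score, in decreasing order.
ranked = sorted(candidates, key=lambda link: pii(*candidates[link]), reverse=True)
for link in ranked:
    print(link, round(pii(*candidates[link]), 3))
```

With a small γ, the biological classification mostly acts as a tie-breaker, yet it still gives a non-zero score to a link such as (f4, h2) that the other two functions cannot see at all.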

Page 31: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Conclusion

Informatics

• RDF model for interspecies interaction
• Improve the use of collaborative filtering with a sparse dataset using
  • community structure
  • and biological classification
• It has been found that
  • in the general case, PCF + PCS is enough;
  • but when a node
    • has few common neighbors
    • and is located in a small community,
    PBC becomes a key player for making link predictions.

Biology

• This model supports the view that most fungi under the same genus have similar parasitic behavior.
• Some predicted links with high predictive scores, such as
  • Phragmidium mucronatum on ハマナス (hamanasu, Rosa rugosa)
  • Phragmidium fusiforme on ハマナス (hamanasu, Rosa rugosa)
  • Phragmidium potentillae on イワキンバイ (iwakinbai, a Potentilla species)
  have been found in other literature.
• The next enhancement is to analyze fungal species by fungal spore type.

Page 32: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Future Work

[Figure: PII(f, h) combines PCF(f, h), PCS(f, h), and PBC(f, h) with weights α, β, γ, together with further functions x1(f, h), x2(f, h), x3(f, h).]

Page 33: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Overall

[Figure: the overall LPII workflow (DATA PREPARATION, LPII APPROACH, RESULT), annotated with notation: the bipartite graph GBipt including the existent links LExist; the projection of fungi, NFungi-Projection or GProjFungi; the missing links LMiss; and the weights α, β, γ on PCF(f,h), PCS(f,h), and PBC(f,h), which are summed into PII(f,h). RDF data is retrieved by SPARQL querying, transformed using a weight function, clustered using a community detection method and using the biological classification, and the predicted missing fungus-host links are ranked by prediction score in decreasing order for the domain expert, who makes observations, updates the knowledge base when a link is found, and shares it on the LOD Cloud.
Legend: data, process, third-party method, scoring function, input argument, linear operation, decision, dataflow.]

Page 34: Link Prediction in Linked Data of Interspecies Interactions using Hybrid Recommendation Approach

Any ideas for improvement?