1 Dejing Dou Computer and Information Science University of Oregon, Eugene, Oregon September, 2010@ Kent State University


Dejing Dou

Computer and Information Science

University of Oregon, Eugene, Oregon

September, 2010@ Kent State University


Outline

• Introduction
    • Ontology and the Semantic Web
    • Biomedical Ontology Development
    • Challenges for Data-driven Approaches
• The NEMO Project
    • Mining ERP Ontologies (KDD’07)
    • Modeling NEMO Ontology Databases (SSDBM’08, JIIS’10)
    • Mapping ERP Metrics (PAKDD’10)
• Ongoing Work


What is Ontology?

A formal specification of a vocabulary of domain concepts and the relationships among them.


A Genealogy Ontology

[Figure: genealogy ontology diagram. Classes: Individual, Family, Event, Male, Female, MarriageEvent, DivorceEvent, DeathEvent, BirthEvent, Gender. Properties (edges): husband, wife, childIn, marriage, divorce, birth, sex.]

Classes: Individual, Male, Female, Family, MarriageEvent, …

Properties: sex, husband, wife, birth, …

Axioms: If there is a MarriageEvent, there will be a Family related to the husband and wife properties.

Ontology languages: OWL, KIF, OBO …
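
The marriage axiom above can be read as an integrity check over instance data. The following is a minimal sketch in plain Python rather than OWL or KIF. The dataclass fields and the helper name `violates_marriage_axiom` are illustrative assumptions, not part of the ontology in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Individual:
    name: str
    sex: str  # "Male" or "Female"


@dataclass(frozen=True)
class MarriageEvent:
    husband: Individual
    wife: Individual


@dataclass(frozen=True)
class Family:
    husband: Individual
    wife: Individual
    marriage: MarriageEvent


def violates_marriage_axiom(events, families):
    """Return the MarriageEvents that have no Family with the same husband and wife."""
    covered = {(f.husband, f.wife) for f in families}
    return [e for e in events if (e.husband, e.wife) not in covered]


# Made-up individuals, used only to exercise the check.
john = Individual("John", "Male")
mary = Individual("Mary", "Female")
wedding = MarriageEvent(husband=john, wife=mary)
family = Family(husband=john, wife=mary, marriage=wedding)

missing_without_family = violates_marriage_axiom([wedding], [])
missing_with_family = violates_marriage_axiom([wedding], [family])
```

An ontology reasoner would state the axiom declaratively and derive the missing Family. The sketch only detects when the axiom is not satisfied.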


Current WWW

The majority of data resources on the WWW are in human-readable format only (e.g., HTML).

[Figure: human and WWW]


The Semantic Web

One major goal of the Semantic Web is that web-based agents can process and “understand” data [Berners-Lee et al 2001].

Ontologies formally describe the semantics of data, and web-based agents can take web documents (e.g., in RDF, OWL) as a set of assertions and draw inferences from them.

[Figure: human, SW, and web-based agents]
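
To make "a set of assertions" and "draw inferences" concrete, here is a small sketch that treats a document as subject/predicate/object triples and forward-chains two RDFS-style subclass rules. The triples, the class names `Event` and `Thing`, and the function `infer` are assumptions chosen for illustration. A real agent would use an RDF/OWL toolkit and a full reasoner.

```python
def infer(triples):
    """Forward-chain two subclass rules until no new triples appear.

    Rule 1: (x type C) and (C subClassOf D)        gives (x type D)
    Rule 2: (C subClassOf D) and (D subClassOf E)  gives (C subClassOf E)
    """
    facts = set(triples)
    while True:
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                if o != s2 or p2 != "subClassOf":
                    continue
                if p == "type":
                    new.add((s, "type", o2))
                elif p == "subClassOf":
                    new.add((s, "subClassOf", o2))
        if new <= facts:  # fixed point reached
            return facts
        facts |= new


# Hypothetical assertions in the spirit of the genealogy example.
assertions = {
    ("MarriageEvent", "subClassOf", "Event"),
    ("Event", "subClassOf", "Thing"),
    ("wedding1", "type", "MarriageEvent"),
}
closure = infer(assertions)
```

The inferred closure contains facts that were never stated directly, such as `wedding1` being an `Event`.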


Biomedical Ontologies

• The Gene Ontology (GO): standardizes the formal representation of gene and gene product attributes across all species and gene databases (e.g., zebrafish, mouse, fruit fly)
    • Classes: cellular component, molecular function, biological process, …
    • Properties: is_a, part_of

• The Unified Medical Language System (UMLS): a comprehensive thesaurus and ontology of biomedical concepts.

• The National Center for Biomedical Ontology (NCBO) at Stanford University
    • >200 ontologies (hundreds to thousands of concepts each)
    • 4 million mappings


Biomedical Ontology Development

Typically knowledge driven: a top-down process

Some basic steps and principles:

• Discussions among domain experts and ontology engineers
• Select basic (root) classes and properties (i.e., terms)
• Go deeper for sub-concepts and relationships. Modularization may be considered if the ontology is expected to be large.
• Add constraints (axioms)
• Add unique IDs (e.g., URLs) and textual definitions for terms
• Consistency checking
• Updating and evolution (e.g., GO is updated every 15 minutes)


Challenges: Knowledge Sharing Does Not Automatically Help Data Sharing

Annotation (like tags) helps search in text (e.g., papers), but is not good for experimental data (e.g., numerical values).

Three main challenges for knowledge/data sharing:

• Heterogeneity: different labs use different analysis methods, spreadsheet attributes, and DB schemas.
• Reusability: knowledge mined from different experimental data may not be consistent and sharable.
• Scalability: experimental data grow much larger than ontologies. Ontology-based reasoning (e.g., ABox reasoning) over large data is a headache.


Case Study: EEG data

Electroencephalogram (EEG) data

Observing Brain Functions through EEG

• Brain activity occurs in the cortex, and cortical activity generates the scalp EEG

• EEG data (dense-array, 256 channels) has high temporal (1 ms) / poor spatial resolution (2D); imaging with fMRI or PET has good spatial (3D) / poor temporal resolution (~1.0 sec)


ERP data and Pattern Analysis

Event-related potentials (ERPs) are created by averaging segments of EEG data across trials, time-locked (e.g., every 2 seconds) to stimulus events or responses.

Some existing tools (e.g., Net Station, EEGLAB, APECS, the Dien PCA Toolbox) can process ERP data and do pattern analysis.
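
The averaging step that turns EEG into an ERP can be sketched in a few lines. This is a toy single-channel illustration, not the pipeline of any of the tools named above. The function name `average_erp`, the onset indices, and the signal values are assumptions made up for the example.

```python
def average_erp(eeg, event_onsets, epoch_len):
    """Average fixed-length EEG segments that start at each event onset (sample index)."""
    epochs = [eeg[t:t + epoch_len] for t in event_onsets if t + epoch_len <= len(eeg)]
    if not epochs:
        raise ValueError("no complete epochs")
    # zip(*epochs) groups the samples that share the same latency after the event.
    return [sum(samples) / len(epochs) for samples in zip(*epochs)]


# Toy signal: the same 4-sample response follows each stimulus,
# shifted by offsets (+0.5, -0.5) that cancel out in the average.
response = [0.0, 1.0, 2.0, 1.0]
eeg = [r + 0.5 for r in response] + [r - 0.5 for r in response]
erp = average_erp(eeg, event_onsets=[0, 4], epoch_len=4)
```

Activity that is not time-locked to the event tends to cancel across trials, which is why the averaged waveform recovers the underlying response here.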


(A) 128-channel ERPs to visual word and nonword stimuli. (B) Time course for P100 pattern by PCA. (C) Scalp topography (spatial distribution) of P100 pattern.


NEMO: NeuroElectroMagnetic Ontologies

Some challenges in ERP studies:

• Patterns can be difficult to identify, and definitions vary across research labs.
• Methods for ERP analysis differ across research sites.
• It is hard to compare and share results across experiments and across labs.

The NEMO (NeuroElectroMagnetic Ontologies) project addresses those challenges by developing ontologies to support ERP data and pattern representation, sharing, and meta-analysis. It has been funded by the NIH as an R01 project since 2009.


Architecture


Progress in Data Driven Approaches

• Mining ERP Ontologies (KDD’07) -- Reusability

• Modeling NEMO Ontology Databases (SSDBM’08, JIIS’10) -- Scalability

• Mapping ERP Metrics (PAKDD’10) -- Heterogeneity


Ontology Mining

Ontology mining is a process for learning an ontology, including classes, class taxonomy, properties and axioms, from data.

• Existing ontology mining approaches focus on text mining or web mining (web content, usage, structure, user profiles). Clustering and association rule mining have been used for classes and properties [Li&Zhong @ TKDE 18(4), Maedche&Staab @ EKAW’00, Reinberger et al @ ODBASE’03].

• The NetAffx Gene Ontology mining tool has been applied to microarray data [Cheng et al @ Bioinformatics 20 (9)].

Our approach uses hierarchical clustering and classification to mine the class taxonomy, properties and axioms of a first-generation ERP data-specific ontology from spreadsheets, which is novel.


Knowledge Reuse in KDD

[Figure: the KDD process. Stages shown: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation. Annotation on the figure: "? Lack of formal semantics".]


Our Framework (KDD’07)

A semi-automatic framework for mining ontologies


Four General Procedures

Classes <= Clustering-based Classification

Class Taxonomy <= Hierarchical Clustering

Properties <= Classification

Axioms <= Association Rule Mining and Classification


Experiments on ERP Data

Preprocessing Data with Temporal PCA

Mining ERP Classes with Clustering-based Classification

Mining ERP Class Taxonomy with Hierarchical Clustering

Mining Properties and Axioms (Rules) with Classification

Discovering Axioms among Properties with Association Rules Mining


Input Raw ERP data

Subject  Condition  Channel#  Time1(µV)  Time2(µV)  Time3(µV)  Time4(µV)  Time5(µV)  Time6(µV)
S01      A          1         0.077      0.136      0.075      0.095      0.188      0.097
S01      A          2         0.891      1.780      0.895      0.805      1.612      0.813
S01      A          3         0.014      0.018      0.013      0.040      0.066      0.035
S01      A          4         0.657      1.309      0.657      0.789      1.571      0.785
S01      A          5         0.437      0.864      0.432      1.007      2.002      1.003
S01      B          1         0.303      0.603      0.303      0.128      0.250      0.123
S01      B          2         0.477      0.951      0.483      0.418      0.841      0.418
S01      B          3         0.538      0.073      0.038      0.029      0.043      0.022
S01      B          4         0.509      1.061      0.533      0.628      1.254      0.626
S01      B          5         1.497      1.024      0.510      0.218      0.434      0.219
S02      A          1         1.275      2.987      1.500      0.382      0.769      0.386
S02      A          2         0.666      2.555      1.281      0.326      0.648      0.329
S02      A          3         0.673      1.321      0.666      1.026      2.051      1.029
S02      A          4         0.284      1.341      0.678      1.966      3.914      1.966
S02      A          5         0.980      0.564      0.292      0.511      1.012      0.507
S02      B          1         0.367      1.960      0.978      1.741      3.486      1.739
S02      B          2         0.864      0.721      0.365      1.470      2.934      1.472
S02      B          3         0.568      1.729      0.866      1.342      2.680      1.337
S02      B          4         0.149      1.134      0.575      0.210      0.423      0.215
S02      B          5         0.042      0.287      0.151      0.433      0.860      0.433

• Sampling rate: 250 Hz for 1500 ms (375 samples)
• Experiment 1-2: 89 subjects and 6 experiment conditions
• Experiment 3: 36 subjects and 4 experiment conditions


Data Preprocessing (1)

Temporal PCA Decomposition

[Figure: component 1 + component 2 = complex waveform, decomposed by PCA]

PCA extracts as many factors (components) as there are variables (i.e., the number of samples). We retain the first 15 PCA factors, which account for most of the variance (> 75%). The remaining factors are assumed to contain “noise”.
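
The retention rule (keep the leading factors until their share of the variance passes a threshold) can be sketched as follows. The eigenvalues below are made-up numbers and `n_components_for` is a hypothetical helper. The talk reports 15 retained factors but does not give the underlying variances.

```python
def n_components_for(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative variance share exceeds threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total > threshold:
            return k
    return len(ordered)


# Illustrative component variances (eigenvalues), not taken from the ERP data.
eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
k = n_components_for(eigenvalues, threshold=0.75)
```

In this toy case the first two components carry 80% of the variance, so the rule keeps two and treats the rest as noise.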


Data Preprocessing (2)

Intensity, spatial, temporal and functional metrics (attributes) for each factor


ERP Factors after PCA Decomposition

TI-max (µs)  IN-mean (ROI) (µV)  IN-mean (ROCC) (µV)  ...  SP-min (channel#)
128          4.2823              4.7245               …    24
96           1.2223              1.3955               …    62
164          -6.6589             -4.7608              …    59
220          -3.635              -2.0782              …    58
244          -0.81322            0.29263              …    65

• For Experiment 1 data, number of factors = (474) (594)
• For Experiment 2 data, number of factors = (588) (598)
• For Experiment 3 data, number of factors = 708


Mining ERP Classes with Clustering (1)

We use EM (Expectation-Maximization) clustering.

E.g., for Experiment 1 group 2 data:

Pattern \ Cluster    0     1     2     3
P100                 0    76     0     2
N100               117     1     0    54
lateN1/N2           13    14     0   104
P300                 0    61   110    42
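
One way to read the table is to ask which cluster dominates each expert-labeled pattern. The sketch below uses the counts exactly as shown. The helper name `dominant_cluster` and the `agreement` summary are assumptions for illustration and are not a measure reported in the talk.

```python
# Rows: expert-labeled patterns; columns: EM clusters 0..3 (counts from the table above).
counts = {
    "P100":      [0, 76, 0, 2],
    "N100":      [117, 1, 0, 54],
    "lateN1/N2": [13, 14, 0, 104],
    "P300":      [0, 61, 110, 42],
}


def dominant_cluster(row):
    """Index of the cluster holding most of this pattern's factors."""
    return max(range(len(row)), key=row.__getitem__)


mapping = {pattern: dominant_cluster(row) for pattern, row in counts.items()}
total = sum(sum(row) for row in counts.values())
# Share of factors that fall in their pattern's dominant cluster.
agreement = sum(row[mapping[p]] for p, row in counts.items()) / total
```

Each pattern lands in a different dominant cluster, which is what makes the clusters usable as candidate classes.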


Mining ERP Classes with Clustering (2)

We use OWL to represent ERP Classes


Mining ERP Class Taxonomy with Hierarchical Clustering

We use EM clustering in both divisive and agglomerative ways. E.g., for Experiment 3 data


Mining ERP Class Taxonomy with Hierarchical Clustering

We use OWL to represent class taxonomy


Mining Properties and Axioms with Clustering-based Classification (1)

We use decision tree learning (C4.5) to do classification with the training data labeled by clustering results.


Mining Properties and Axioms with Clustering-based Classification (2)

We use OWL to represent datatype properties which are based on those attributes with high information gain (e.g., top 6).
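
Ranking attributes by information gain can be sketched as below. This computes plain information gain for one numeric attribute at one split point; C4.5 itself normalizes this into a gain ratio. The cluster labels and the split value are hypothetical, while the TI-max values are the five shown in the factor table.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, split):
    """Entropy reduction from splitting a numeric attribute at `split`."""
    left = [l for v, l in zip(values, labels) if v <= split]
    right = [l for v, l in zip(values, labels) if v > split]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder


ti_max = [96, 128, 164, 220, 244]          # TI-max values from the factor table
cluster = ["c1", "c1", "c2", "c2", "c2"]   # hypothetical labels from the clustering step
gain = information_gain(ti_max, cluster, split=140)
```

Here the split separates the two labels perfectly, so the gain equals the full entropy of the labels. Attributes with the highest such scores are the candidates for datatype properties.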


Mining Properties and Axioms with Clustering-based Classification (3)

We use SWRL to represent axioms. In FOL:


Discovering Axioms among Properties with Association Rule Mining

We use the Apriori algorithm to find association rules among properties. The split points are determined by classification rules. In FOL, they look like:
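
A compact sketch of the level-wise Apriori search, followed by a rule confidence computation, is given here. The discretized items (property tests at made-up split points) and the four transactions are assumptions for illustration and are not rules mined in the study.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in items if support(s) >= min_support}
    frequent = {}
    while level:
        frequent.update({s: support(s) for s in level})
        size = len(next(iter(level))) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Apriori pruning: every (size-1)-subset of a candidate must be frequent.
        level = {c for c in candidates
                 if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
                 and support(c) >= min_support}
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    return frequent[antecedent | consequent] / frequent[antecedent]


# Hypothetical discretized properties; split points are invented.
a, b, c = "TI-max<=140", "IN-mean(ROI)>0", "SP-min<=30"
transactions = [frozenset({a, b, c}), frozenset({a, b}), frozenset({a, b, c}), frozenset({c})]
frequent = apriori(transactions, min_support=0.5)
rule_confidence = confidence(frequent, frozenset({a}), frozenset({b}))
```

Rules with high confidence, such as `a -> b` in this toy data, are the ones that would be written down as axioms among properties.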


Rule Optimization

Idea: (A → B) ∧ (A ∧ B → C) ⇒ (A → C)
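
The simplification is sound because whenever A holds, the first rule gives B, and then the second rule gives C. A brute-force truth-table check makes this explicit. The function names are illustrative.

```python
from itertools import product


def implies(p, q):
    """Material implication p -> q."""
    return (not p) or q


def rule_simplification_is_valid():
    """Check that (A -> B) and (A and B -> C) entails (A -> C) for every truth assignment."""
    return all(
        implies(implies(a, b) and implies(a and b, c), implies(a, c))
        for a, b, c in product([False, True], repeat=3)
    )


valid = rule_simplification_is_valid()
```

So a mined rule with the longer antecedent A ∧ B can be replaced by the shorter A → C whenever A → B is already known.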


A Partial View of the Mined ERP Data Ontology


• Our first-generation ERP ontology consists of 16 classes, 57 properties and 23 axioms.


Ontology-based Data Modeling (SSDBM’08, JIIS’10)

In general, ontologies can be treated as one kind of conceptual model. Because the data (e.g., PCA factors) can be large, we propose storing them in relational databases instead of building a knowledge base.

We designed database schemas based on our ERP ontologies which include temporal, spatial and functional concepts.


Ontology Databases

[Figure: ontology constructs (Axioms, Class, Datatype, Objects, Facts) set against relational database constructs (Relation, Datatype, keys, constraints, triggers, tuples).]

Now we have bridged these.


Ontology Databases

[Figure: ontology constructs (Axioms, Class, Datatype, Objects, Facts) set against relational database constructs (Relation, Datatype, keys, constraints, views, triggers, tuples).]


Loading time in Lehigh University Benchmark

[Figure: load time for 1.5 million facts (10 universities, 20 departments)]


Query time

[Figure: query performance (logarithmic time scale)]


Ontology-based Data Modeling

For example, we represent the important subsumption axioms (e.g., subClassOf) of the current ERP ontologies with SQL triggers and foreign keys.
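
A minimal sketch of that idea, using SQLite through Python's standard `sqlite3` module: a trigger propagates each subclass instance up to the superclass table, and a foreign key keeps the two tables consistent. The table names `pattern` and `p100`, the trigger name, and the identifier `factor_17` are hypothetical and do not come from the NEMO schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE pattern (id TEXT PRIMARY KEY);
CREATE TABLE p100 (
    id TEXT PRIMARY KEY,
    FOREIGN KEY (id) REFERENCES pattern(id)
);
-- subClassOf(P100, Pattern): asserting a P100 also asserts it as a Pattern.
CREATE TRIGGER p100_is_pattern BEFORE INSERT ON p100
BEGIN
    INSERT OR IGNORE INTO pattern (id) VALUES (NEW.id);
END;
""")

conn.execute("INSERT INTO p100 (id) VALUES ('factor_17')")
patterns = [row[0] for row in conn.execute("SELECT id FROM pattern")]

# The foreign key blocks removing the superclass fact while the subclass fact exists.
try:
    conn.execute("DELETE FROM pattern WHERE id = 'factor_17'")
    delete_blocked = False
except sqlite3.IntegrityError:
    delete_blocked = True
```

With this layout a query over the superclass table returns the subclass instances directly, so the subsumption inference is paid for at load time rather than at query time.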


Ontology-based Data Modeling


The ER Diagram for the ERP ontology database shows tables (boxes) and foreign key constraints (arrows). The concepts pattern, factor, and channel are most densely connected (toward the right-side of the image) as expected.


NEMO Data Mapping (PAKDD’10)

Motivation

• Lack of meta-analysis across experiments, because different labs may use different metrics

Goal of the study

• Mapping alternative sets of ERP spatial and temporal metrics


Problem definition

Alternative sets of ERP metrics


Challenges

• Semi-structured data

• Uninformative column headers (string similarity matching does not work)

• Numerical values


Grouping and reordering



Sequence post-processing


Cross-spatial Join

Process all point-sequence curves

Calculate Euclidean distance between sequences in the Cartesian product set (Cross-spatial join)
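
The two steps above can be sketched as follows, assuming each metric (column) has already been turned into a point sequence of equal length. The sequences, the column names, and the final nearest-neighbor choice in `nearest_mapping` are illustrative assumptions. The slides do not spell out how the best match is selected.

```python
from itertools import product
from math import sqrt


def euclidean(seq_a, seq_b):
    """Euclidean distance between two equal-length point sequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(seq_a, seq_b)))


def cross_spatial_join(set1, set2):
    """Distance for every pair in the Cartesian product of two sets of sequences."""
    return {(a, b): euclidean(seq_a, seq_b)
            for (a, seq_a), (b, seq_b) in product(set1.items(), set2.items())}


def nearest_mapping(set1, set2):
    """Map each metric in set1 to the closest metric in set2 (an assumed selection rule)."""
    distances = cross_spatial_join(set1, set2)
    return {a: min(set2, key=lambda b: distances[(a, b)]) for a in set1}


# Made-up sequences standing in for two alternative metric sets.
metric_set1 = {"m1_a": [0.0, 1.0, 2.0], "m1_b": [5.0, 5.0, 5.0]}
metric_set2 = {"m2_x": [5.0, 5.0, 4.0], "m2_y": [0.0, 1.0, 3.0]}
mapping = nearest_mapping(metric_set1, metric_set2)
```

Because the comparison uses only the numeric curves, it does not depend on column headers, which is the point when the headers are uninformative.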

[Figure: point-sequence curves from Metric Set 1 compared with those from Metric Set 2]


Assumptions and Heuristics

The two datasets contain the same or similar ERP patterns if they come from the same paradigm (e.g., a visual or auditory oddball paradigm: watching or listening for uncommon or fake words among common words).


Wrong Mappings. Precision = 9/13

Gold standard mapping falls along the diagonal cells
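
Precision here is the fraction of proposed mappings that agree with the gold standard. The sketch below reproduces the 9/13 arithmetic with a synthetic example of 13 metrics whose gold mapping is the diagonal. Which four mappings are wrong is invented for the illustration.

```python
def mapping_precision(predicted, gold):
    """Fraction of predicted mappings that agree with the gold standard."""
    correct = sum(1 for source, target in predicted.items() if gold.get(source) == target)
    return correct / len(predicted)


# Gold standard: metric i in set 1 maps to metric i in set 2 (the diagonal).
gold = {i: i for i in range(13)}

# Synthetic prediction with two swapped pairs, i.e. four wrong mappings.
predicted = dict(gold)
predicted[0], predicted[1] = 1, 0
predicted[2], predicted[3] = 3, 2

precision = mapping_precision(predicted, gold)
```

With 9 of 13 mappings on the diagonal, the function returns 9/13.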


Experiment

Design of experiment data

• 2 simulated “subject groups” (samples)
    • SG1 = sample 1
    • SG2 = sample 2

• 2 data decompositions
    • tPCA = temporal PCA decomposition
    • sICA = spatial ICA (Independent Component Analysis) decomposition

• 2 sets of alternative metrics
    • m1 = metric set 1
    • m2 = metric set 2


Experiment Result

Overall Precision: 84.6%


NEMO Related Ongoing Work

• Application of our framework to other domains: microRNA, medical informatics, gene databases

• Mapping discovery and integration across ontologies related to different modalities (e.g., EEG vs. fMRI).


Joint EEG-fMRI Data Mapping


Joint work with:

Gwen Frishkoff, Jiawei Rong, Robert Frank, Paea LePendu, Haishan Liu, Allen Malony, and Don Tucker


Thanks for your attention!

Any questions?