ACTIVE LEARNING FOR UNBALANCED DATA IN THE CHALLENGE WITH MULTIPLE MODELS AND BIASING
Yukun Chen, Subramani Mani
Discovery Systems Lab (DSL), Department of Biomedical Informatics, Vanderbilt University
May 2010


TRANSCRIPT

Page 1: Active Learning for Unbalanced Data in the Challenge with Multiple Models and Biasing

Yukun Chen, Subramani Mani
Discovery Systems Lab (DSL), Department of Biomedical Informatics
Vanderbilt University, May 2010

Page 2: Outline

- Introduction
- Datasets in the challenge
- Probabilistic models
- Querying methods
- Other methods for active learning
- Experiments and Results
- Conclusion

Page 3: Introduction

The active learning challenge is based on the pool-based active learning model. In practice, labeling is costly, while observational data is abundantly available at low cost.

An active learner can find the most informative instances and achieve high learning accuracy with minimal querying cost.

In the challenge, we need to optimize the global score (the ALC score) by implementing a probabilistic prediction model, a querying strategy, and more.

Learning from the challenge datasets is not easy because the data is very sparse, has unbalanced class labels, has a high-dimensional feature space, and has missing values.

Uncertainty sampling with biasing consensus (USBC) is our basic active learning strategy for prediction and querying for labels.

Page 4: Datasets in the challenge

Predictive mapping from development to final datasets:

| Development Dataset | Final Dataset | Most common properties |
|---|---|---|
| ZEBRA | E | Feature type is continuous |
| NOVA | D | Number of features and sparsity rate are very high; feature type is binary |
| ORANGE | B | Missing rate is high |
| SYLVA | F | Size of training/testing set is high |
| HIVA | C | Number of features is high |
| IBN_SINA | A | Like a general sparse dataset |

Page 5: Probabilistic Model

The Random Forest (RF) classifier is the basic prediction model we used in this challenge.

We built a multi-model committee with multiple RF classifiers.

The final prediction was based on the consensus posterior probability (CPP):

$$\mathrm{CPP}(\mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} P_m(y = 1 \mid \mathbf{x};\, \theta_m)$$

where $M$ is the number of models in the committee and $\theta_m$ denotes the parameters of model $m$.

We also considered the variance of the posterior probabilities across the models; a high-variance filter based on it was used in the querying method.
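As an illustrative sketch (not the authors' released code), the CPP and the per-sample variance used by the high-variance filter could be computed from an RF committee as follows; the function name and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def committee_cpp(X_train, y_train, X_pool, n_models=5, seed=0):
    """Train a committee of M random forests and return the consensus
    posterior probability CPP(x) and its variance for each pool sample."""
    probs = []
    for m in range(n_models):
        # Committee members differ only in their random seed here;
        # y_train is assumed to contain both classes.
        rf = RandomForestClassifier(n_estimators=100, random_state=seed + m)
        rf.fit(X_train, y_train)
        probs.append(rf.predict_proba(X_pool)[:, 1])  # P_m(y = 1 | x)
    probs = np.vstack(probs)          # shape (M, n_pool)
    return probs.mean(axis=0), probs.var(axis=0)
```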

Page 6: Querying Method

The querying method ranks the samples by their informativeness values and outputs the most informative sample(s) to query.

Least confidence with bias (LCB) was our basic querying method.

The informativeness value of a sample is a function of the CPP and a bias factor pp (the positive fraction of the current training set in active learning).

$$Q_{\mathrm{LCB}}(\mathbf{x}, pp) =
\begin{cases}
\dfrac{1}{P_{\max}} \, \mathrm{CPP}(\mathbf{x}), & \text{if } \mathrm{CPP}(\mathbf{x}) \le P_{\max} \\[4pt]
\dfrac{1}{1 - P_{\max}} \, \bigl(1 - \mathrm{CPP}(\mathbf{x})\bigr), & \text{otherwise}
\end{cases}$$

where $P_{\max} = \max\bigl(0.5,\; 1 - \mathrm{mean}(pp)\bigr)$.

[Figure: function of least confidence with bias for a binary class, plotting Q(x) against P(y=1|x) for pp = 0.1, 0.3, and 0.5; each curve peaks at P_max.]
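A minimal sketch of the LCB score, assuming the peak-at-P_max form reconstructed above (the function name is hypothetical and the authors' exact implementation may differ):

```python
import numpy as np

def q_lcb(cpp, pp):
    """Least confidence with bias: informativeness score in [0, 1].
    cpp: consensus posterior probabilities P(y=1|x) for the pool;
    pp:  positive fraction of the current training set."""
    p_max = max(0.5, 1.0 - np.mean(pp))
    p_max = min(p_max, 1.0 - 1e-9)               # guard against division by zero
    return np.where(cpp <= p_max,
                    cpp / p_max,                  # rising branch
                    (1.0 - cpp) / (1.0 - p_max))  # falling branch

# The most informative samples have the largest scores, e.g.:
# query_idx = np.argsort(-q_lcb(cpp, pp))[:batch_size]
```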

Page 7: Semi-supervised Learning Method

It is very important to have a good starting point on the learning curve in active learning, i.e., the prediction performance when only one positive label is known.

A purely unsupervised method (for example, metrics based on distance, similarity, or clustering results) might not be good enough for prediction.

We combined unsupervised and supervised learning:
(1) For all samples, compute the cosine similarity to the positive-labeled seed.
(2) Assign negative labels to the K samples with the smallest cosine similarity values.
(3) Train our multiple models on the one given positive sample plus the K predicted negative samples, and predict for the remaining samples.
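A sketch of this bootstrap under the steps above (the function name and the value of K are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def bootstrap_from_seed(X, pos_idx, K=100):
    """Turn one positive seed into a small labeled set by treating the
    K least-similar samples as predicted negatives (steps 1-3 above)."""
    sims = cosine_similarity(X, X[pos_idx:pos_idx + 1]).ravel()  # step (1)
    neg_idx = np.argsort(sims)[:K]          # step (2): smallest similarity
    train_idx = np.concatenate(([pos_idx], neg_idx))
    y_train = np.array([1] + [0] * K)       # 1 positive + K pseudo-negatives
    return train_idx, y_train               # step (3): train the committee
```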

Here is a comparison of the initial AUC obtained by the cosine similarity function alone versus our semi-supervised learning method:

| Dataset | Initial AUC by cosine similarity | Initial AUC by semi-supervised learning |
|---|---|---|
| HIVA | 0.5441 ± 0.41% | 0.6502 ± 0.65% |
| IBN_SINA | 0.8335 ± 0.28% | 0.7900 ± 0.28% |
| NOVA | 0.5618 ± 0.39% | 0.6853 ± 0.38% |
| ORANGE | 0.5661 ± 0.51% | 0.5170 ± 0.78% |
| SYLVA | 0.6709 ± 0.27% | 0.8958 ± 0.22% |
| ZEBRA | 0.3758 ± 0.27% | 0.6751 ± 0.48% |

Page 8: Batch Size Validation

For some datasets (ZEBRA, ORANGE, HIVA, and NOVA), our models did not predict well when the training set was small. Poor initial performance can badly hurt the global score, which is based on the learning curve in log2 space (see the learning curves with respect to initial batch size below).

We ran batch size validation to search for the minimal sufficient size of the initial training set. This prevented a significant early drop in the performance of our prediction model.
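For intuition, the global score is (roughly) the normalized area under the learning curve with the number of labels queried on a log2 axis; a sketch under that assumption:

```python
import numpy as np

def global_alc(n_labels, aucs):
    """Approximate ALC: area under the learning curve (AUC vs. log2 of
    labels queried), normalized between random and perfect learners."""
    x = np.log2(np.asarray(n_labels, dtype=float))
    a = np.trapz(aucs, x)                       # our learning curve
    a_max = np.trapz(np.ones_like(x), x)        # perfect learner, AUC = 1
    a_rand = np.trapz(np.full_like(x, 0.5), x)  # random guessing, AUC = 0.5
    return (a - a_rand) / (a_max - a_rand)
```

Because the x-axis is logarithmic, the earliest queries carry large weight, which is why a poor start can dominate the global score.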

Batch size validation result figure for ZEBRA, IBN_SINA and NOVA:

[Figure: ALC score (0 to 1) vs. initial batch size (1 to 32768 on a log scale) for ZEBRA, IBN_SINA, and NOVA.]

Page 9: Batch Size Validation (for ZEBRA)

[Figure: six ZEBRA learning curves (area under the ROC curve vs. log2 of the number of labels queried) under USBC, one per initial batch size: 2 (global score 0.2876), 16 (0.3391), 256 (0.3164), 1024 (0.3846), 4096 (0.4218), and 16384 (0.5199).]

Page 10: Experimental Setup

(1) Initialization:
(1.1) Run preprocessing steps (missing-value imputation, PCA, etc.) if needed.
(1.2) Assign the batch size as a function of the iteration, based on the batch size validation result.
(2) Run semi-supervised learning for the initial prediction, and basic uncertainty sampling to rank and query samples.
(3) Run uncertainty sampling with biasing consensus (USBC) in the active learning iterations:
(3.1) Add predicted negative samples to the training set (if activated).
(3.2) Train 5 RF models and predict for all unlabeled samples.
(3.3) Run the high-variance filter (if activated).
(3.4) Run uncertainty sampling with bias to rank and query samples (the bias factor is a function of the positive fraction and the size of the training set).
(4) Output the learning curves and the global ALC score.
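A skeleton of steps (2) through (4), reusing the committee_cpp, q_lcb, and bootstrap_from_seed sketches from earlier slides; the oracle callback and the var_thresh cutoff are hypothetical, and predicted-negative reuse (step 3.1) is left out for brevity:

```python
import numpy as np

def usbc_loop(X, oracle, seed_idx, batch_sizes, use_filter=False,
              var_thresh=0.05):
    """Sketch of a USBC run; oracle(i) returns the true label of sample i."""
    # Step (2): semi-supervised start from the single positive seed.
    train_idx, y = bootstrap_from_seed(X, seed_idx)
    labeled = {seed_idx: 1}                      # truly labeled samples only
    for batch in batch_sizes:                    # step (3): USBC iterations
        pool = np.setdiff1d(np.arange(X.shape[0]), train_idx)
        # (3.2) Train the committee; predict for all unlabeled samples.
        cpp, var = committee_cpp(X[train_idx], y, X[pool])
        # (3.4) Rank by LCB; pp is the positive fraction of the training set.
        score = q_lcb(cpp, y.mean())
        if use_filter:                           # (3.3) high-variance filter
            score[var > var_thresh] = -np.inf
        query = pool[np.argsort(-score)[:batch]]
        labeled.update({i: int(oracle(i)) for i in query})
        # Rebuild the training set from truly labeled samples (pseudo-
        # negatives are dropped once real labels arrive); assumes both
        # classes are present after the first queries.
        train_idx = np.array(sorted(labeled))
        y = np.array([labeled[i] for i in train_idx])
    return labeled                               # step (4): curves/ALC from logs
```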

Page 11: Results (tables for development and final datasets)

The results for the development datasets:

| Dataset | ALC | AUC | Initial AUC | Initial Batch Size | Use Filter | Use Predicted Negative |
|---|---|---|---|---|---|---|
| HIVA | 0.3233 | 0.7468 ± 0.79% | 0.6502 ± 0.65% | 1 | No | No |
| IBN_SINA | 0.8705 | 0.9960 ± 0.09% | 0.7900 ± 0.28% | 1 | No | Yes |
| NOVA | 0.7675 | 0.9940 ± 0.14% | 0.6853 ± 0.38% | 16 | Yes | Yes |
| ORANGE | 0.2037 | 0.7630 ± 1.11% | 0.5170 ± 0.78% | 1 | No | Yes |
| SYLVA | 0.9484 | 0.9990 ± 0.04% | 0.8958 ± 0.22% | 1 | No | No |
| ZEBRA | 0.5199 | 0.8318 ± 0.56% | 0.6751 ± 0.48% | 16384 | No | No |

The results for the final datasets:

| Dataset | ALC | AUC | Initial AUC | Initial Batch Size | Use Filter | Use Predicted Negative | Rank |
|---|---|---|---|---|---|---|---|
| A | 0.3609 | 0.9615 ± 0.39% | 0.7500 | 1 | No | Yes | 9 |
| B | 0.1297 | 0.6484 ± 0.44% | 0.5000 | 1 | No | Yes | 12 |
| C | 0.1876 | 0.7715 ± 0.52% | 0.4500 | 1 | No | No | 12 |
| D | 0.5390 | 0.9554 ± 0.33% | 0.4500 | 16 | Yes | Yes | 12 |
| E | 0.6266 | 0.8939 ± 0.39% | 0.7300 | 30000 | No | No | 1 |
| F | 0.7853 | 0.9976 ± 0.09% | 0.5500 | 1 | No | No | 3 |

Page 12: Results (Active Learning Curves for final datasets)

[Figure: active learning curves for the final datasets.]

Dataset A: global score 0.36
Dataset B: global score 0.13
Dataset C: global score 0.19
Dataset D: global score 0.54
Dataset E: global score 0.63
Dataset F: global score 0.79

Page 13: Discussion

For dataset E, the global score benefited from batch size validation, and semi-supervised learning generated a good starting point. We won on dataset E.

For dataset F, the learning curve based on USBC was acceptable, except that the initial performance was not stable. We were ranked 3rd on F.

For dataset D, batch size validation was also effective, and the high-variance filter helped prevent a significant drop in the curve. However, the starting point was quite low.

For dataset A, USBC worked well once the training set reached at least 64 samples. However, the low initial performance hurt our global score.

Datasets B and C, like their development counterparts ORANGE and HIVA, were the hardest; our prediction models were not effective on them.

Page 14: Conclusion

Our strategy involves more than a prediction model and a querying model: semi-supervised learning and batch size validation are also important parts of the active learning process.

Our methods need further evaluation using additional datasets.

The active learning challenge remains a very open problem.

One possible future direction to explore is to automatically assign batch size as a function of predictive performance and informativeness.