TRANSCRIPT
Semi-supervised Learning on Partially Labeled Imbalanced Data
May 16, 2010
Jianjun Xie and Tao Xiong
What Problem We Are Facing
- Six data sets extracted from six different domains; the domains were not revealed in the contest
- They are all binary classification problems
- They are all imbalanced data sets
- The percentage of positive labels varies from 7.2% to 25.2%; this information was withheld during the competition, and the final sets differed significantly from the development sets
- They all start with one known label
Datasets Summary: Final Contest Datasets

Dataset  Domain                   Feature Number  Train Number  Positive Label %
A        Handwriting Recognition              92        17,535              7.23
B        Marketing                           250        25,000              9.16
C        Chemo-informatics                   851        25,720              8.15
D        Text Processing                  12,000        10,000             25.19
E        Embryology                          154        32,252              9.03
F        Ecology                              12        67,628              7.68
Stochastic Semi-supervised Learning
Conditions:
- The label distribution is highly imbalanced; positive labels are rare
- Known labels are few
- Unlabeled data are abundant
Approach to A, C, and D (our approach when the number of known labels is < 200):
- Randomly pick one record from the unlabeled data pool as a “negative”
- Use the given positive seed and the picked “negative” seed as the initial cluster centers for k-means clustering
- Label the cluster in which the positive seed resides as positive
- Repeat the above process n times
- Take the normalized cluster membership count of each data point as the first set of prediction scores (a minimal sketch follows below)
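For concreteness, here is a minimal Python sketch of this stochastic 2-means scoring loop. The use of scikit-learn, the function name, and the round count are our assumptions for illustration; the slides do not specify an implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def stochastic_kmeans_scores(X, pos_idx, n_rounds=100, seed=0):
    """Repeatedly seed 2-means with the known positive and one random
    pseudo-negative, then count how often each point lands in the
    positive seed's cluster (normalized over n_rounds)."""
    rng = np.random.default_rng(seed)
    unlabeled = np.setdiff1d(np.arange(len(X)), [pos_idx])
    counts = np.zeros(len(X))
    for _ in range(n_rounds):
        neg_idx = rng.choice(unlabeled)              # random "negative" seed
        init = X[[pos_idx, neg_idx]]                 # two initial cluster centers
        km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
        pos_cluster = km.labels_[pos_idx]            # cluster holding the positive seed
        counts += (km.labels_ == pos_cluster)
    return counts / n_rounds                         # normalized membership count
```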
Stochastic Semi-supervised Learning -- continued
Approach to A, C, and D (our approach when the number of known labels is < 200):
- When more labels are known after querying, use both the known labels and randomly picked “negative” seeds as the initial cluster centers
- Label clusters using the known positive seeds
- Discard clusters whose membership is not clear
- Store the cluster membership of each data point
- Use the normalized positive-cluster membership counts as the prediction score (see the sketch below)
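The slides do not say how many random negatives were drawn per round or how "not clear" membership was judged; the sketch below makes both concrete with illustrative choices: rand_neg extra pseudo-negative seeds per round, and a cluster counts as clearly positive only if it contains positive seeds and no negative seeds.

```python
import numpy as np
from sklearn.cluster import KMeans

def multi_seed_kmeans_scores(X, pos_idx, neg_idx, n_rounds=50, rand_neg=5, seed=0):
    """Seed k-means with all known labels plus a few random
    pseudo-negatives; keep only clusters whose seed labels agree,
    and count membership in the clearly positive clusters."""
    rng = np.random.default_rng(seed)
    pos_idx, neg_idx = np.asarray(pos_idx), np.asarray(neg_idx)
    pool = np.setdiff1d(np.arange(len(X)), np.r_[pos_idx, neg_idx])
    counts = np.zeros(len(X))
    for _ in range(n_rounds):
        extra = rng.choice(pool, size=rand_neg, replace=False)   # random pseudo-negatives
        seeds = np.r_[pos_idx, neg_idx, extra]
        km = KMeans(n_clusters=len(seeds), init=X[seeds], n_init=1).fit(X)
        pos_clusters = set(km.labels_[pos_idx])
        neg_clusters = set(km.labels_[np.r_[neg_idx, extra]])
        clear_pos = pos_clusters - neg_clusters                  # discard ambiguous clusters
        counts += np.isin(km.labels_, list(clear_pos))
    return counts / n_rounds                                     # normalized positive counts
```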
Stochastic Semi-supervised Learning -- continued
Approach to B, E, and F (our approach when the number of known labels is < 200):
- Randomly pick 20 unlabeled data points as “negative” labels for each known positive label
- Build an over-fit logistic regression model on the resulting dataset
- Repeat the random picking and model building process n times
- The final score is the average of the n models (a sketch follows below)
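A minimal Python sketch of this bagging-style recipe, again assuming scikit-learn; the large C value standing in for "over-fit" and the number of rounds are illustrative guesses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stochastic_lr_scores(X, pos_idx, n_models=50, neg_per_pos=20, seed=0):
    """For each round, pair every known positive with 20 random
    unlabeled points treated as negatives, fit a weakly regularized
    (over-fit) logistic regression, and average the model scores."""
    rng = np.random.default_rng(seed)
    pos_idx = np.asarray(pos_idx)
    unlabeled = np.setdiff1d(np.arange(len(X)), pos_idx)
    scores = np.zeros(len(X))
    for _ in range(n_models):
        neg = rng.choice(unlabeled, size=neg_per_pos * len(pos_idx), replace=False)
        X_fit = np.vstack([X[pos_idx], X[neg]])
        y_fit = np.r_[np.ones(len(pos_idx)), np.zeros(len(neg))]
        model = LogisticRegression(C=1e6, max_iter=1000).fit(X_fit, y_fit)
        scores += model.predict_proba(X)[:, 1]
    return scores / n_models                       # average of the n models
```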
Supervised Learning Using Gradient Boosting Decision Tree (TreeNet)
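TreeNet is Salford Systems' commercial gradient boosting package, and the slides give no settings for it. As a rough open-source stand-in (not the authors' configuration; all parameters are illustrative guesses), scikit-learn's gradient boosting could play the same role once enough labels have been queried:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-in for TreeNet; parameter values are guesses.
gbdt = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05,
                                  max_depth=3, subsample=0.8)
# gbdt.fit(X_labeled, y_labeled)
# scores = gbdt.predict_proba(X)[:, 1]
```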
Querying Strategy
One critical part of active learning is the query strategy. Popular approaches:
- Uncertainty sampling
- Expected model change
- Query by committee
What we tried:
- Uncertainty sampling + density-based selective sampling (see the sketch below)
- Random sampling (for large label purchases)
- Certainty sampling (to try to get more positive labels)
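An illustrative sketch of the uncertainty-plus-density idea: prefer points whose predicted probability is near 0.5, but weight by local density so isolated outliers are not wasted queries. The product combination rule and the neighborhood size are our guesses; the slides give no formula.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def query_uncertain_dense(proba, X, k=1):
    """Rank unlabeled points by uncertainty (closeness of the score
    to 0.5) times local density (inverse mean neighbor distance) and
    return the indices of the top-k query candidates."""
    uncertainty = 1.0 - 2.0 * np.abs(proba - 0.5)   # 1 at p=0.5, 0 at p in {0, 1}
    dist, _ = NearestNeighbors(n_neighbors=10).fit(X).kneighbors(X)
    density = 1.0 / (1e-9 + dist.mean(axis=1))
    return np.argsort(uncertainty * density)[::-1][:k]
```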
Dataset A: Handwriting Recognition
Global score = 0.623, rank 2nd.
Sequence  Num. Samples Purchased  Num. Labels Known  AUC   Sampling Strategy
1                            232                  1  0.67  Uncertainty/Selective
2                          1,959                233  0.82  Uncertainty/Selective
3                          4,286              2,192  0.92  Random
4                         11,057              6,478  0.94  Get All
5                              0             17,535  0.93
Dataset B: Marketing
Global score = 0.375, rank 2nd.
Dataset C: Chemo-informatics
Global score = 0.334, rank 4th. Passive learning.
Dataset D: Text Processing
Global score = 0.331, rank 18th.
Dataset E: Embryology
Global score = 0.533, rank 3rd.
Sequence  Num. Samples Purchased  Num. Labels Known  AUC   Sampling Strategy
1                              2                  1  0.75  Certainty
2                              3                  3  0.66  Uncertainty/Selective
3                              3                  6  0.67  Uncertainty/Selective
4                         32,243                  9  0.72  Get All
5                              0             32,252  0.86
Dataset E: Embryology
- Performance got worse with more labels
- Newly queried labels over-corrected the existing model
- This phenomenon was common in this contest
Global score = 0.533, rank 3rd.
Dataset F: Ecology
Global score = 0.77, rank 4th.
Sequence  Num. Samples Purchased  Num. Labels Known  AUC   Sampling Strategy
1                              2                  1  0.76  Uncertainty/Selective
2                              7                  3  0.73  Uncertainty/Selective
3                            542                 10  0.77  Uncertainty/Selective
4                          5,175                552  0.95  Random
5                         61,901              5,727  0.98  Get All
6                              0             67,628  0.99
Dataset F: Ecology
- Performance got worse when just 2 more labels were added at the beginning
- Most of the time, too many small queries did more harm than good to the global score
Summary on Results
Overall rank 3rd.
Dataset  Positive Label %  AUC    ALC    Num. Queries  Rank  Winner AUC  Winner ALC
A                    7.23  0.925  0.623             4     2       0.862       0.629
B                    9.16  0.767  0.375             2     2       0.733       0.376
C                    8.15  0.814  0.334             1     4       0.799       0.427
D                   25.19  0.890  0.331             3    18       0.964       0.745
E                    9.03  0.865  0.533             4     3       0.894       0.627
F                    7.68  0.988  0.771             5     4       0.999       0.802
Discussions
- How to consistently get better performance with only a few labels across different datasets?
- How to consistently improve model performance as the number of labels in a given dataset grows?
- Does the log2 scaling give too much weight to the first few queries? What if every dataset started with a few more labels? (A worked sketch follows below.)
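To make the log2 concern concrete, here is a rough Python sketch under the assumption that the global score integrates AUC over a log2-scaled label axis; the challenge's exact normalization constants are not reproduced. With the Dataset F numbers from the table above, the 9 labels added across the first three queries occupy axis width of about 2.5, while the 61,901 labels of the final purchase occupy only about 3.6, so the early AUCs dominate the score.

```python
import numpy as np

def alc_log2(num_labels, aucs):
    """Area under the learning curve on a log2 label axis (assumed
    scoring form): plot each AUC at log2(n + 1), integrate with the
    trapezoidal rule, and normalize by the axis width."""
    x = np.log2(np.asarray(num_labels, dtype=float) + 1.0)
    y = np.asarray(aucs, dtype=float)
    area = np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x))  # trapezoidal rule
    return area / (x[-1] - x[0])

# Dataset F learning-curve points (labels known, AUC) from the table above.
print(alc_log2([1, 3, 10, 552, 5727, 67628],
               [0.76, 0.73, 0.77, 0.95, 0.98, 0.99]))
```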