TRANSCRIPT
Weight Annealing: Data Perturbation for Escaping Local Maxima in Learning
Gal Elidan, Matan Ninio, Nir Friedman (Hebrew University)
{galel,ninio,nir}@cs.huji.ac.il
Dale Schuurmans (University of Waterloo)
[email protected]
The Learning Problem
Score(h, D, w) = Σ_m w_m f(X[m]; h) − Pen(h)

Density estimation: Σ_m w_m log P(X[m] | h) − Pen(h)
Classification: Σ_m w_m log P(C[m] | X[m], h) − Pen(h)
Logistic regression: Σ_m w_m log [1 / (1 + exp(−y[m] h(x[m])))] − Pen(h)

Learning task: search for h* = argmax_h Score(h, D)
[Figure: DATA plus per-instance weights are fed to the learner, which outputs a hypothesis h and its Score(h, D); the weighted learning task is h* = argmax_h Score(h, D, w)]
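As a concrete instance of the weighted score, the logistic-regression case above can be sketched in code. This is a minimal sketch: the function name and the quadratic penalty term are illustrative choices, not taken from the slides.

```python
import numpy as np

def weighted_logreg_score(h, X, y, w, penalty=0.0):
    """Weighted logistic-regression score:
    sum_m w[m] * log(1 / (1 + exp(-y[m] * <h, x[m]>))) - Pen(h).
    h: parameter vector, X: (M, d) instances, y: labels in {-1, +1},
    w: per-instance weights. Pen(h) is taken to be a quadratic penalty."""
    margins = y * (X @ h)                  # y[m] * <h, x[m]>
    log_lik = -np.log1p(np.exp(-margins))  # log of the logistic likelihood
    return float(w @ log_lik) - penalty * float(h @ h)
```

With uniform weights this is the ordinary logistic log-likelihood; perturbing `w` re-emphasizes individual instances without changing the optimizer.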
Optimization is hard!
Typically resort to local optimization methods: gradient ascent, greedy hill-climbing, EM
Escaping local maxima
Local methods converge to (one of many) local optima. Existing escape techniques work by perturbing the steps taken during the local search: TABU search, random restarts, simulated annealing.

[Figure: score as a function of the hypothesis h, with the local search stuck at a suboptimal peak ("Stuck here")]
Weight Perturbation

Our idea: perturbation of instance weights
Puts stronger emphasis on a subset of the instances
Allows the learning procedure to escape local maxima
[Figure: the original weights W over the DATA are perturbed into new weights W over the same DATA]
Iterative Procedure
[Figure: a loop alternating between LOCAL SEARCH, which produces a hypothesis h and its Score, and REWEIGHT, which updates the weights W over the DATA]
Benefits:
Generality: applies to a wide variety of learning scenarios
Modularity: the search procedure itself is unchanged
Effectiveness: allows global changes to the hypothesis
Iterative Procedure

Two methods for reweighting:
Random: sampling random weights
Adversarial: directed reweighting

To eventually maximize the original goal, the magnitude of the perturbations is slowly diminished.
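The iterative procedure can be sketched as a generic loop. The function names (`local_search`, `reweight`, `score`) and the geometric cooling schedule are illustrative assumptions, but the structure follows the slides: the search step is unchanged, only the instance weights are perturbed, and the perturbation magnitude is annealed.

```python
import numpy as np

def weight_anneal(data, local_search, reweight, score,
                  t0=1.0, cooling=0.9, n_iters=20, seed=0):
    """Generic weight-annealing loop (sketch).
    local_search(h, data, w) -> locally optimal hypothesis for weights w
    reweight(w_star, h, temp, rng) -> perturbed weights
    score(h, data, w) -> weighted score to maximize."""
    rng = np.random.default_rng(seed)
    w_star = np.ones(len(data)) / len(data)  # original (uniform) weights
    w, h, temp = w_star.copy(), None, t0
    best_h, best = None, -np.inf
    for _ in range(n_iters):
        h = local_search(h, data, w)   # search itself is unchanged (modularity)
        s = score(h, data, w_star)     # track quality on the ORIGINAL weights
        if s > best:
            best_h, best = h, s
        w = reweight(w_star, h, temp, rng)  # perturb the instance weights
        temp *= cooling                     # slowly diminish the perturbations
    return best_h, best
```

Because the score is tracked on the original weights, the returned hypothesis is the best one found for the original learning goal.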
Random Reweighting
When hot, the model can "go" almost anywhere and local maxima are bypassed. When cold, the search fine-tunes to find the optimum with respect to the original data.
[Figure: successive weights Wt, Wt+1, Wt+2 are sampled from a distribution P(W) whose mean is the original weight W* and whose variance scales with the temperature; the expected distance from the original W shrinks as the temperature drops]
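One plausible way to sample such weights, with mean at the original weights and spread scaling with the temperature, is multiplicative log-normal noise. The slides do not fix the exact sampling distribution; the log-normal choice here is an assumption that keeps weights positive.

```python
import numpy as np

def random_reweight(w_star, temp, rng):
    """Sample perturbed weights centered on the original weights w*.
    Hot (large temp) -> large perturbations; cold (temp -> 0) -> w*.
    Log-normal noise keeps every weight positive (an assumed choice)."""
    noise = rng.standard_normal(len(w_star)) * temp
    w = w_star * np.exp(noise)
    return w / w.sum() * w_star.sum()  # renormalize to the original total mass
```

At temperature zero this returns the original weights exactly, so the final iterations optimize the original objective.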
Adversarial Reweighting

Idea: challenge the model by increasing the weight w of "bad" (low-scoring) instances
Challenge the model by emphasizing bad samples (the adversary minimizes the score with respect to W): a min-max game between the re-weighting and the optimizer. The weights converge towards the original distribution by constraining the distance of Wt+1 from W*:

w_{t+1} = w* · exp(−η · temp · ∂Score/∂w_t)

(an exponentiated-gradient update in the style of Kivinen & Warmuth)
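A sketch of the adversarial update, read as an exponentiated-gradient step after Kivinen & Warmuth. The sign convention (descent, since the adversary minimizes the score) and the renormalization are assumptions consistent with the min-max description above.

```python
import numpy as np

def adversarial_reweight(w_star, grad, temp, eta=1.0):
    """Exponentiated-gradient step for the adversary (sketch).
    grad[m] = d Score / d w[m], typically the per-instance log-score.
    Moving AGAINST the gradient emphasizes badly-scored instances, and
    the multiplicative form keeps w close (in KL) to the original w*.
    As temp -> 0 the weights return to w*."""
    w = w_star * np.exp(-eta * temp * grad)
    return w / w.sum() * w_star.sum()  # renormalize to the original mass
```

The worst-scoring instance (most negative per-instance log-score) receives the largest weight, which is exactly the "challenge the model" behavior described above.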
Learning Bayesian Networks

A Bayesian network (BN) is a compact representation of a joint distribution
Learning a BN is a density estimation problem
[Figure: the Alarm network, a Bayesian network over 37 monitoring variables such as PCWP, CO, HRBP, CATECHOL, SAO2, VENTLUNG and MINVOLSET, learned from weighted DATA]
Learning task: find structure + parameters that maximize score
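Under the weighted density-estimation view, parameter estimation for a fixed structure simply replaces counts by sums of instance weights. Below is a sketch for a single conditional probability table; the Dirichlet pseudo-count `alpha` echoes the BDe prior, and all names are illustrative.

```python
import numpy as np

def weighted_cpt(child, parent, w, n_child, n_parent, alpha=1.0):
    """Weighted maximum-likelihood CPT P(child | parent) for one family
    of a Bayesian network: ordinary counts become sums of instance
    weights, plus a Dirichlet pseudo-count alpha.
    child, parent: integer value arrays over the M instances."""
    counts = np.full((n_parent, n_child), float(alpha))
    np.add.at(counts, (parent, child), w)  # weighted co-occurrence counts
    return counts / counts.sum(axis=1, keepdims=True)
```

Reweighting the data therefore changes the fitted parameters (and, through the score, the preferred structure) without any change to the search code.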
[Figure: log-loss per instance on test data (−15.5 to −15.15) over 5 to 40 iterations]
With similar running time: Random is superior to random re-starts, and a single Adversary run is competitive with Random.
Structure Search Results

Super-exponential combinatorial search space
Search uses local ops: add/remove/reverse edge
Optimize the Bayesian Dirichlet score (BDe)
[Figures: (left) log-loss per instance on test data over the run from HOT to COLD, comparing the BASELINE, Random annealing and the Adversary against the TRUE STRUCTURE; (right) percent of runs at least this good, with the BASELINE and the GENERATING MODEL marked. Alarm network: 37 variables, 1000 samples]
Search with missing values

[Figure: log-loss per instance on test data, axis from −15.1 to −14.96]
Missing values introduce many local maxima. Structural EM (SEM) combines structure search with parameter estimation.
With similar running time: over 90% of the Random runs are better than standard SEM, and the Adversary run is best.

[Plot annotations: 90% of Random runs beat the baseline; the distance to the true generating model is halved (RANDOM, ADVERSARY)]
Real-life Datasets

6 real-life examples, with and without missing values
Dataset    Variables  Samples
Stock      20         1512
Soybean    36         446
Rosetta    30         300
Audio      70         200
Soy-M      36         546
Promoter   13         100
[Figure: log-loss per instance on test data relative to the BASELINE (−0.2 to 0.5) for each dataset, showing the Adversary and the 20-80% range of the Random runs]
With similar running time: Adversary is efficient and preferable Random takes longer for inferior results
Learning Sequence Motifs

DNA promoter sequences:

ATCTAGCTGAGAATGCACACTGATCGAGCCCCACCATATTCTTCGGACTGCGCTATATAGACTGCAACTAGTAGAGCTCTGCTAGAAACATTACTAAGCTCTATGACTGCCGATTGCGCCGTTTGGGCGTCTGAGCTCTTTGCTCTTGACTTCCGCTTATTGATATTATCTCTCTTGCTCGTGACTGCTTTATTGTGGGGGGGACTGCTGATTATGCTGCTCATAGGAGAGACTGCGAGAGTCGTCGTAGGACTGCGTCGTCGTGATGATGCTGCTGATCGATCGGACTGCCTAGCTAGTAGATCGATGTGACTGCAGAAGAGAGAGGGTTTTTTCGCGCCGCCCCGCGCGACTGCTCGAGAGGAAGTATATATGACTGCGCGCGCCGCGCGCCGGACTGCTTTATCCAGCTGATGCATGCATGCTAGTAGACTGCCTAGTCAGCTGCGATCGACTCGTAGCATGCATCGACTGCAGTCGATCGATGCTAGTTATTGGATGCGACTGAACTCGTAGCTGTAGTTATT

Represent using a motif Position Specific Scoring Matrix (PSSM):

Score(S, w) = Σ_{n=1}^{N} w_n · logistic( log Σ_j exp( Σ_i θ_{i, S[n, j+i]} ) )

Highly non-linear: score optimization is hard!
     pos 1  pos 2  pos 3  pos 4  pos 5
A    0.97   0      0      0.02   0
C    0      0.01   0.99   0      0.2
G    0      0.99   0.01   0      0.8
T    0.03   0      0      0.98   0
Segal et al., RECOMB 2002
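Scoring a sequence against a PSSM can be sketched as follows. For simplicity the soft-max over windows in the score above is replaced by a hard max over windows, and the `eps` smoothing is an assumption added to avoid log(0); neither detail is specified on the slides.

```python
import numpy as np

def best_window_logprob(seq, pssm, eps=1e-6):
    """Log-probability of the best-matching window of a sequence under
    a PSSM: max_j sum_i log theta[i, s[j+i]] (hard-max variant of the
    soft-max motif score).
    pssm: (K, 4) position-specific probabilities over A, C, G, T."""
    idx = {c: i for i, c in enumerate("ACGT")}
    s = np.array([idx[c] for c in seq])
    K = pssm.shape[0]
    logp = np.log(pssm + eps)            # eps avoids log(0) for hard zeros
    scores = [logp[np.arange(K), s[j:j + K]].sum()
              for j in range(len(s) - K + 1)]
    return max(scores)
```

A window matching the motif's consensus scores near 0; each mismatch against a near-zero PSSM entry costs roughly log(eps).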
Real-life Motifs Results

Construct PSSM: find the parameters θ that maximize the score
Experiments on 9 transcription factors (motifs)
[Figure: log-loss on test data (−5 to 50) for the motifs ACE2, FKH1, FKH2, MBP1, MCM1, NDD1, SWI4, SWI5 and SWI6, comparing the BASELINE with the Adversary and the 20-80% range of the Random runs. PSSM: 4 letters x 20 positions, 550 samples]
With similar running time: both methods are better than standard ascent, and the Adversary is efficient and best in 6 of 9 cases.
Simulated Annealing

Simulated annealing: allow "bad" moves with some probability

P(move) ∝ f(temp, ΔScore)

[Figure: score as a function of the hypothesis h]
Wasteful propose, evaluate, reject cycle
Needs a long time to escape local maxima
WORSE than the baseline on Bayesian networks!
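For contrast, the propose/evaluate/reject cycle criticized above uses the standard Metropolis acceptance rule. This is a generic sketch of that rule, not code from the talk.

```python
import math
import random

def accept_move(delta_score, temp, rng=random):
    """Metropolis acceptance rule used in simulated annealing:
    always accept improving moves; accept a "bad" move (delta_score < 0)
    with probability exp(delta_score / temp)."""
    if delta_score >= 0:
        return True
    if temp <= 0:
        return False
    return rng.random() < math.exp(delta_score / temp)
```

Every rejected proposal still costs a full score evaluation, which is why the cycle is wasteful compared to perturbing the data weights directly.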
Summary and Future Work
General method applicable to a variety of learning scenarios: decision trees, clustering, phylogenetic trees, TSP…
Promising empirical results: the method approaches the "achievable" maximum
The BIG challenge:
THEORETICAL INSIGHTS
Adversary ≠ Boosting

            Adversary                            Boosting
Output:     single hypothesis                    an ensemble
Weights:    converge to original distribution    diverge from original distribution
Learning:   h_{t+1} depends on h_t               h_{t+1} depends only on w_{t+1}
The same comparison holds for Random vs. Bagging/Bootstrap.
Other Annealing Methods

Simulated annealing: allow "bad" moves with some probability

P(move) ∝ f(temp, ΔScore)

[Figure: score landscapes over the hypothesis h]
Deterministic annealing: change the scenery by changing the family of h, moving gradually from simple to complex hypotheses

Neither works well here: simulated annealing is not good on Bayesian networks, and deterministic annealing is not naturally applicable!