10/29/19
Machine Learning for Intelligent Systems
Instructors: Nika Haghtalab (this time) and Thorsten Joachims
Lecture 17: Statistical Learning Theory 1
Reading: UML 4
Prelim
Curved: $CurvedGrade = 100 - 0.75 \, (95 - RawPoints)$
Harder time with the following concepts:
1. Perceptron update bound
2. Leave-one-out error of kernelized SVM
3. Neural network construction
xkcd comics
Replication Crisis in Science
Convince me of your Psychic Abilities?
Game:
• I'm thinking of $m$ bits (0/1).
• If somebody in the class guesses my bit sequence, that person clearly has telepathic abilities – right?
• Think of a 6-digit 0/1 sequence, e.g. 101010.
Questions:
• If at least one of $|H|$ players guesses the bit sequence correctly, is there any significant evidence that they have telepathic abilities?
• How large would $m$ and $|H|$ have to be for us to trust this test?
Testing for Psychic Power
Setup:
• $|H|$ students, $H = \{h_1, \dots, h_{|H|}\}$
• $m$ bits (length of the sequence)
• $p = 0.5$: probability of error on a single bit, if you're not psychic.
Prob. that student $i$ guesses my code without being psychic?
$P(h_i \text{ correct} \mid h_i \text{ not psychic}) = (1 - p)^m$
Prob. that at least one student guesses my code, without anyone being psychic?
$P(h_1 \text{ correct} \vee \dots \vee h_{|H|} \text{ correct} \mid \text{nobody is psychic}) = 1 - \left(1 - (1-p)^m\right)^{|H|}$
How long should the sequence be, so that we are $1 - \delta$ confident?
$m > \log_{1-p}\!\left(1 - (1-\delta)^{1/|H|}\right)$
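The formulas above are easy to evaluate numerically; a minimal sketch (the function names `prob_false_positive` and `min_bits` are mine, not from the slides):

```python
import math

def prob_false_positive(num_students, m, p=0.5):
    """Chance that at least one of |H| non-psychic students guesses
    all m bits, when each bit is guessed correctly w.p. 1 - p."""
    single = (1 - p) ** m                      # one student gets every bit right
    return 1 - (1 - single) ** num_students

def min_bits(num_students, delta, p=0.5):
    """Smallest m satisfying m > log_{1-p}(1 - (1 - delta)^(1/|H|)),
    i.e. making the false-positive probability at most delta."""
    target = 1 - (1 - delta) ** (1 / num_students)
    return math.ceil(math.log(target) / math.log(1 - p))

# With 30 students and only 6 bits, a "telepath" appears roughly 38% of the
# time by pure chance; 10 bits pushes the false-positive rate below 5%.
print(prob_false_positive(30, 6))
print(min_bits(30, 0.05))
```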
Useful Formulas
• Binomial distribution: the probability of observing $k$ heads in $m$ independent coin tosses, where each toss is heads with probability $p$, is
$\Pr(X = k \mid p, m) = \frac{m!}{k! \, (m-k)!} \, p^k (1-p)^{m-k}.$
• Hoeffding's inequality: in the above binomial distribution,
$\Pr\left( \left| \frac{k}{m} - p \right| > \epsilon \right) \le 2 \exp(-2 m \epsilon^2).$
• Union bound: for any events $E_i$,
$\Pr(E_1 \vee E_2 \vee \dots \vee E_k) \le \sum_{i=1}^{k} \Pr(E_i).$
• No-name lemma: $1 - \epsilon \le e^{-\epsilon}$.
[Figure: plot of $\exp(-x)$ lying above $1 - x$; Venn diagram illustrating $\Pr(E_1 \vee E_2) \le \Pr(E_1) + \Pr(E_2)$.]
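These formulas can be sanity-checked numerically; a small sketch (the function name `binom_tail` is mine):

```python
import math

def binom_tail(m, p, eps):
    """Exact P(|k/m - p| > eps) for k ~ Binomial(m, p),
    summed directly from the binomial pmf."""
    return sum(math.comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(m + 1) if abs(k / m - p) > eps)

m, p, eps = 100, 0.5, 0.1
exact = binom_tail(m, p, eps)                  # about 0.035
hoeffding = 2 * math.exp(-2 * m * eps**2)      # about 0.271
assert exact <= hoeffding                      # Hoeffding's bound holds

# No-name lemma: 1 - x <= exp(-x) at a few sample points
assert all(1 - x <= math.exp(-x) for x in [-1.0, 0.0, 0.2, 0.5, 1.0, 5.0])
```

Note the Hoeffding bound is loose here (0.27 vs. an exact tail of about 0.035), but it holds for every $m$, $p$, and $\epsilon$, which is what the later arguments need.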
Fundamental Questions
Questions in statistical learning theory:
• Trying to learn a classifier from $H$?
• How good is the learned rule after $m$ examples?
• How many examples are needed for the learned rule to be accurate?
• What can be learned and what cannot?
• Is there a universally best learning algorithm?
In particular, we will address:
• What kind of a guarantee on the true error of a classifier can I get if I know its training error?
Recall: Prediction as Learning
Sample & Generalization Errors
Sample (Empirical) Error
The sample error of hypothesis $h$ on sample $S = \{(x_1, y_1), \dots, (x_m, y_m)\}$, denoted $err_S(h)$, is
$err_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}(h(x_i) \ne y_i).$
Generalization (Prediction/True) Error
The generalization error of hypothesis $h$ on distribution $P(X, Y)$, denoted $err_P(h)$, is
$err_P(h) = \Pr_{(x, y) \sim P}\left( h(x) \ne y \right) = \sum_{(x, y)} \mathbf{1}(h(x) \ne y) \cdot P(X = x, Y = y).$
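The two definitions can be compared directly on a toy finite distribution; a minimal sketch (the distribution `P` and hypothesis `h` below are made up for illustration):

```python
import random

def sample_error(h, S):
    """err_S(h): fraction of sample points the hypothesis misclassifies."""
    return sum(h(x) != y for x, y in S) / len(S)

def generalization_error(h, P):
    """err_P(h): exact sum over a finite distribution P[(x, y)] = prob."""
    return sum(prob for (x, y), prob in P.items() if h(x) != y)

# Toy distribution on X = {0, 1, 2}, Y = {-1, +1}.
P = {(0, 1): 0.3, (1, 1): 0.2, (1, -1): 0.2, (2, -1): 0.3}
h = lambda x: 1 if x <= 1 else -1       # predicts +1 on {0, 1}

print(generalization_error(h, P))       # 0.2: the mass of (1, -1)
random.seed(0)
S = random.choices(list(P), weights=P.values(), k=1000)
print(sample_error(h, S))               # close to 0.2 on a large i.i.d. sample
```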
Prediction as Learning
Real-world process $P(X, Y)$:
• Training dataset $S_{train} = \{(x_1, y_1), \dots, (x_m, y_m)\}$, drawn i.i.d. from $P$.
• Testing dataset $S_{test} = \{(x_{m+1}, y_{m+1}), \dots\}$, drawn i.i.d. from $P$.
• The learner maps $S_{train}$ to a hypothesis $h$.
Goal: find $h$ with small prediction error $err_P(h)$ on $P(X, Y)$.
Strategy: find an $h \in H$ with small sample error $err_{S_{train}}(h)$ on the training dataset $S_{train}$; test the learned $h$ on a separate testing dataset $S_{test}$ to measure its test error $err_{S_{test}}(h)$.
Let's come back
Generalization Error Bounds
What kind of a guarantee on the true error of a classifier can I get if I know its training error?
Today's plan:
• Zero empirical error: if the rule I learned from $H$ achieves zero error on the samples ($err_S(h) = 0$), how large is $err_P(h)$?
• Non-zero empirical error: how good is the true error of a hypothesis from $H$ that performs well on the samples?
Today's assumption: the hypothesis set $H$ is finite.
Zero Empirical Error
If the hypothesis I learned from $H$ achieves zero error on the samples ($err_S(h) = 0$), how large is $err_P(h)$?
• Assume $H$ is finite.
• Assume realizability: there is a consistent classifier, i.e., there is always one $h \in H$ with $err_P(h) = 0$ – one person is psychic.
Algorithm $\mathcal{L}$ takes a set $S$ of $m$ samples from $P$ and picks $h_S$ that has 0 empirical error. What's the bound on $err_P(h_S)$?
1. Fix a hypothesis $h \in H$ before seeing $S$. What's the probability that $err_P(h) > \epsilon$, but $err_S(h) = 0$?
2. What's the probability that $err_P(h_S) > \epsilon$, but $err_S(h_S) = 0$?
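The two questions above are answered with the no-name lemma and the union bound from the Useful Formulas slide; a sketch of the standard argument:

```latex
% Step 1: fix h with err_P(h) > eps before seeing S. Each of the m i.i.d.
% samples is classified correctly with probability at most 1 - eps, so
\Pr[\mathrm{err}_S(h) = 0] \le (1 - \epsilon)^m \le e^{-\epsilon m}.
% Step 2: h_S depends on the data, so union-bound over every h in H:
\Pr[\exists h \in H :\ \mathrm{err}_S(h) = 0 \wedge \mathrm{err}_P(h) > \epsilon]
  \le |H| \, e^{-\epsilon m}.
```

Step 2 is exactly the bound stated in the theorem on the next slide; it is the telepathy test again, with each hypothesis playing the role of one guessing student.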
Sample Complexity – 0 Empirical Error
Theorem
For any instance space $X$ and set of labels $Y = \{-1, 1\}$, and for any distribution $P$ on $X \times Y$, given a set $S$ of $m$ i.i.d. samples from $P$, we have
$\Pr_{S \sim P^m}\left( \exists h \in H \text{ such that } err_S(h) = 0 \text{ but } err_P(h) > \epsilon \right) \le |H| \, e^{-\epsilon m}.$
Theorem: Sample Complexity (Zero Empirical Error)
Let $m \ge \frac{1}{\epsilon}\left( \ln|H| + \ln\frac{1}{\delta} \right)$. For any instance space $X$, labels $Y = \{-1, 1\}$, and distribution $P$ on $X \times Y$, with probability $1 - \delta$ over i.i.d. draws of a set $S$ of $m$ samples, we have:
any $h \in H$ that has 0 empirical error has true error $err_P(h) \le \epsilon$.
Learning algorithm: given a sample set $S$ and hypothesis class $H$, if there is an $h_S \in H$ that is consistent with $S$, return $h_S$. (Equivalently, return an $h_S$ in the version space $VS(H, S)$.)
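The sample-complexity expression in the second theorem is easy to evaluate; a small sketch (the function name is mine):

```python
import math

def sample_complexity(H_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta)).
    With that many i.i.d. samples, w.p. >= 1 - delta every h in H that is
    consistent with S (zero empirical error) has err_P(h) <= eps."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# e.g. |H| = 2**20 hypotheses, eps = 0.05, delta = 0.01 -> 370 samples
print(sample_complexity(2**20, 0.05, 0.01))
```

Note that the dependence on $|H|$ is only logarithmic, which is why even a class of a million hypotheses is learnable from a few hundred samples in the realizable case.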