ece6980 an algorithmic and information theoretic toolbox for … · 2016-12-21 · information...

25
ECE6980 An Algorithmic and Information Theoretic Toolbox for Massive Data

Upload: others

Post on 09-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

ECE6980

An Algorithmic and Information Theoretic

Toolbox for Massive Data

Page 2: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Instructor:Jayadev AcharyaEmail:[email protected]:TuTh 1.25-2.40,203PhillipsOfficeHours:MoTh 3-4,304RhodesWebsite:http://people.csail.mit.edu/jayadev/ece6980

Logistics

Page 3: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Grading• Scribealecture:10%

• Encouragedtofillinthedetails,provideexamples

• Assignments30-60%• 2-3assignments• Typeset?

• Projectreportandpresentation:40-60%• Readanewrelatedpaper• Presentasummaryinyourownwords• Canchoosefromalist

• Interruptions: 5%

Page 4: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Lectures• Lecturesprimarilyontheboard• Derivethings(mostlyfromscratch)

Page 5: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Courseoverview• Lotofinterestindatascience

• Numberofcoursesonoffer• Manyaspectscanbecovered

• Thiscourse:• Coreprimitives• Efficientalgorithms• Fundamental limits• Mostlytheoretical,encourage implementation

Page 6: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Prerequisites• Undergraduateprobability/random processes• Basiccombinatorics

• Whatisthevarianceofarandomvariable?• Whatisabinomialdistribution?• Whenaretworandomvariables independent?

Page 7: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Whatyoushouldlearn?• Fastalgorithmsforstatisticalproblems

• Learningdiscretedistributions• Finitesamplehypothesistesting

• Howtoproveinformationtheoretic lowerbounds

Probabilisticthinking

Page 8: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Classical

Smalldomain𝐷

𝑛𝑙𝑎𝑟𝑔𝑒, 𝐷 𝑠𝑚𝑎𝑙𝑙

Modern

Largedomain𝐷

𝑛𝑠𝑚𝑎𝑙𝑙 , 𝐷 𝑙𝑎𝑟𝑔𝑒

Oldquestions,newissues

Domain:

𝑛 = 1000 tosses

AsymptoticanalysisComputationnotcrucial

Newchallenges

Onehumangenome

Domain:

Page 9: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

ResourcesSamples

• Howmuchdataneeded?• Inferencewhendataisscarce

Computation• Howdoesrun-timescalewithdataanddomainsize?• Evenquadratic mightbeprohibitive

Otherresources• Storage:Notenoughspacetostorealldata• Communication:Distributeddataacrossservers

Page 10: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

GoalsForstatisticalinference• Design efficientalgorithms• Understand fundamental limits

INFORMATIONTHEORY

MACHINELEARNINGSTATISTICS

ALGORITHMS

Page 11: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Distributionlearning

Page 12: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Asimplesetting• Support set𝒳• Distribution𝑝:𝒳 → ℝ45 ,suchthat∑ 𝑝 𝑥8∈𝒳 = 1• Samples𝑥: = 𝑥;𝑥< … 𝑥: drawnfrom𝑝• Outputadistribution𝑞(𝑥:) afterobserving𝑥:

Tossacoin:HTTTHTTH

Throwadie:31344536

Page 13: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Whatisagoodestimator• Wouldlike𝑞 tobecloseto𝑝• 𝐿(𝑝, 𝑞):Lossforestimating𝑝 with𝑞

• Totalvariationdistance,KLdivergence,…

• Findanestimatorwithsmall𝐿• Foragivenlossfunction,howmanysamplesneeded?

• Empiricalestimators:• 𝑞 𝐻 = C

D

• Analyzetheperformance ofempiricalestimators

Page 14: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Learning• Givensamples fromaGaussiandistribution𝑁(𝜇, 𝜎<)• LearnwithaGaussiandistribution

• Relativelysimple

Page 15: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Learning

Ratioofbreadthtoheightof1000crabsbyW.WeldonNotnormallydistributed,morethanonespecies?KarlPearson:MixturesofGaussians (muchharder!!)

Page 16: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Distributiontesting

Page 17: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

PolishMultilotek:• Picks20numbersbetween1,…,80

Isitfair?

Testinguniformity

Page 18: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Thanks to Krzysztof Onak (pointer) and Eric Price (graph)(FigurebyOnak,Price,Rubinfeld)

Testinguniformity(contd)

Page 19: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Asimplesetting• 𝒳 = 𝑘• 𝑢: uniformdistributionover𝒳• 𝑥:: 𝑛 samples fromadistribution𝑝

Question: Is𝑝 = 𝑢 OR 𝑝 − 𝑢 ; ≥ 𝜀?

Howmanysamplesdoweneed?Takeaguess…

Page 20: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

𝑋:: asportsarticle𝑌:: areligiousarticle

𝑍:oneword

Q:Is𝑍morelikelytoappear insportsorreligion?

Necessarily assign𝑍 towhereitappearsmoreoften?

Asimpleclassificationproblem

Page 21: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Propertyestimation

Page 22: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Predictingnewelements

...

Howmanynewspecies?

Corbet collectedbutterflies inMalayaforoneyear

Applications of estimating the unseen

Corbett collected butterflies in Malaya for 1 year

Frequency 1 2 3 4 5 6 7 ..Species 118 74 44 24 29 22 20 ..

# of seen species = 118 + 74 + 44 + 24 + . . .

# of new species in the next year?

# of words in a book..

17 /46

Howmanynewspecies ifhegoesforonemoreyear?

Page 23: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Entropyestimation

Measuring Randomness in Data

Estimating randomness of the observed data:

Neural signal processing Feature selection for machine learning

Image Registration

Approach: Estimate the “entropy” of the generating distribution

Shannon entropy H(p)

def=

Px �px log px

1

Howmuchrandomnessinneuralspikes?

Howtoestimateentropy fromobservations?

Page 24: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Entropyestimation• 𝒳 = 𝑘• 𝑥:: 𝑛 samples fromadistribution𝑝

Question: Estimate𝐻(𝑝)

Page 25: ECE6980 An Algorithmic and Information Theoretic Toolbox for … · 2016-12-21 · Information Theoretic Toolbox for Massive Data. Instructor: Jayadev Acharya Email: ... • Undergraduate

Resourceconstraints

DatatoobigtobestoredinasinglemachineLotofrecent interest

samplestream

limitedmemory

decision

distributeddata

limitedcommunication