CompSci 590.02 · Instructor: Ashwin Machanavajjhala · 2013. 1. 10.
Algorithms for Big-Data Management
CompSci 590.02
Instructor: Ashwin Machanavajjhala
Lecture 1: 590.02 Spring 13
Administrivia
http://www.cs.duke.edu/courses/spring13/compsci590.2/
• Tue/Thu 3:05–4:20 PM
• “Reading Course + Project”
  – No exams!
  – Every class is based on 1 (or 2) assigned papers that students must read.
• Projects: (50% of grade)
  – Individual or groups of size 2-3
• Class Participation + assignments (other 50%)
• Office hours: by appointment
Administrivia
• Projects: (50% of grade)
  – Ideas will be posted in the coming weeks
• Goals:
  – Literature review
  – Some original research/implementation
• Timeline (details will be posted on the website soon)
  – ≤ Feb 12: Choose Project (ideas will be posted … new ideas welcome)
  – Feb 21: Project proposal (1-4 pages describing the project)
  – Mar 21: Mid-project review (2-3 page report on progress)
  – Apr 18: Final presentations and submission (6-10 page conference style paper + 20 minute talk)
Why should you take this course?
• Industry, academic, and government research identifies the value of analyzing large data collections in all walks of life.
  – “What Next? A Half-Dozen Data Management Research Goals for Big Data and Cloud”, Surajit Chaudhuri, Microsoft Research
  – “Big data: The next frontier for innovation, competition, and productivity”, McKinsey Global Institute Report, 2011
Why should you take this course?
• Very active field and tons of interesting research.
  We will read papers in:
  – Data Management
  – Theory
  – Machine Learning
  – …
Why should you take this course?
• Intro to research by working on a cool project
  – Read scientific papers
  – Formulate a problem
  – Perform a scientific evaluation
Today
• Course overview
• An algorithm for sampling
INTRODUCTION
What is Big Data?
http://visual.ly/what-big-data
3 Key Trends
• Increased data collection
• (Shared nothing) Parallel processing frameworks on commodity hardware
• Powerful analysis of trends by linking data from heterogeneous sources
Big-Data impacts all aspects of our life
The value in Big-Data…
+250% clicks vs. editorial one size fits all
+79% clicks vs. randomly selected
+43% clicks vs. editor selected
Recommended links   Personalized News Interests
Top Searches
The value in Big-Data…
“If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year.”
McKinsey Global Institute Report
Example: Google Flu
http://www.ccs.neu.edu/home/amislove/twittermood/
Course Overview
• Sampling
  – Reservoir Sampling
  – Sampling with indices
  – Sampling from Joins
  – Markov chain Monte Carlo sampling
  – Graph Sampling & PageRank
Course Overview
• Sampling
• Streaming Algorithms
  – Sketches
  – Online Aggregation
  – Windowed queries
  – Online learning
Course Overview
• Sampling
• Streaming Algorithms
• Parallel Architectures & Algorithms
  – PRAM
  – MapReduce
  – Graph processing architectures: Bulk Synchronous Parallel and asynchronous models
  – (Graph connectivity, Matrix Multiplication, Belief Propagation)
Course Overview
• Sampling
• Streaming Algorithms
• Parallel Architectures & Algorithms
• Joining datasets & Record Linkage
  – Theta Joins: or how to optimally join two large datasets
  – Clustering similar documents using minHash
  – Identifying matching users across social networks
  – Correlation Clustering
  – Markov Logic Networks
SAMPLING
Why Sampling?
• Approximately compute quantities when
  – Processing the entire dataset takes too long.
    How many tweets mention Obama?
  – Computation is intractable.
    Number of satisfying assignments for a DNF.
  – We do not have access to the entire data, or access is expensive.
    How many restaurants does Google know about?
    Number of users in Facebook whose birthday is today.
    What fraction of the population has the flu?
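Zero-One Estimator Theorem
Input: A universe of items U (e.g., all tweets)
       A subset G (e.g., tweets mentioning Obama)
Goal: Estimate $\mu = |G| / |U|$
Algorithm:
• Pick N samples from U: $\{x_1, x_2, \ldots, x_N\}$
• For each sample, let $Y_i = 1$ if $x_i \in G$, and $Y_i = 0$ otherwise.
• Output: $Y = \sum_i Y_i / N$
Theorem: Let $\varepsilon < 2$. If $N > \frac{1}{\mu} \cdot \frac{4 \ln(2/\delta)}{\varepsilon^2}$, then
$\Pr[(1-\varepsilon)\mu < Y < (1+\varepsilon)\mu] > 1 - \delta$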
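A minimal Python sketch of the estimator and of the sample-size bound in the theorem. Some things here are assumptions and not from the slides: samples are drawn uniformly with replacement, the universe fits in a list, and since μ is unknown in practice, `required_samples` takes a caller-supplied lower bound on μ. The function names and the toy membership test are hypothetical.

```python
import math
import random

def required_samples(mu_lower_bound, eps, delta):
    """Smallest integer N with N > (1/mu) * 4 ln(2/delta) / eps^2.

    mu_lower_bound is an assumed lower bound on the true fraction mu.
    """
    bound = (1.0 / mu_lower_bound) * 4.0 * math.log(2.0 / delta) / eps ** 2
    return math.floor(bound) + 1

def zero_one_estimate(universe, in_g, num_samples, rng=random):
    """Estimate mu = |G|/|U| as the fraction of sampled items that lie in G."""
    hits = 0
    for _ in range(num_samples):
        x = rng.choice(universe)      # uniform draw, with replacement (assumption)
        if in_g(x):                   # Y_i = 1 if x_i is in G, else 0
            hits += 1
    return hits / num_samples         # Y = sum(Y_i) / N

# Toy example: U = {0, ..., 9999}, G = multiples of 10, so the true mu is 0.1.
rng = random.Random(0)
universe = list(range(10_000))
n_samples = required_samples(0.1, eps=0.2, delta=0.05)
estimate = zero_one_estimate(universe, lambda x: x % 10 == 0, n_samples, rng)
```

With ε = 0.2 and δ = 0.05 the bound asks for 3689 samples when μ is at least 0.1; the 1/μ factor means rarer sets G need proportionally more samples.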
Zero-One Estimator Theorem
Proof: Homework
Simple Random Sample
• Given a table of size N, pick a subset of n rows, such that each subset of n rows is equally likely.
• How to sample n rows?
• … if we don’t know N?
Reservoir Sampling
Highlights:
• Make one pass over the data
• Maintain a reservoir of n records.
• After reading t rows, the reservoir is a simple random sample of the first t rows.
Reservoir Sampling [Vitter ACM ToMS ‘85]
Algorithm R:
• Initialize reservoir to the first n rows.
• For the (t+1)st row R,
  – Pick an integer m uniformly at random between 1 and t+1 (inclusive)
  – If m <= n, then replace the m-th row in the reservoir with R
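The steps of Algorithm R can be sketched in Python as follows. This is an illustrative sketch, not code from the paper: the function name `reservoir_sample` and the optional `rng` parameter are hypothetical, and the input is assumed to be any iterable of rows.

```python
import random
from itertools import islice

def reservoir_sample(rows, n, rng=random):
    """One-pass sample of n rows from an iterable of unknown length."""
    it = iter(rows)
    reservoir = list(islice(it, n))   # initialize with the first n rows
    t = len(reservoir)                # number of rows read so far
    for row in it:                    # row is the (t+1)st row
        m = rng.randint(1, t + 1)     # uniform integer in 1..t+1, inclusive
        if m <= n:
            reservoir[m - 1] = row    # replace the m-th reservoir entry
        t += 1
    return reservoir

sample = reservoir_sample(range(1_000_000), 5, random.Random(42))
```

If the input has fewer than n rows, this sketch simply returns all of them.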
Proof
• If N = n, then P[row is in sample] = 1. Hence, the reservoir contains all the rows in the table.
• Suppose for N = t, the reservoir is a simple random sample. That is, each row has n/t chance of appearing in the sample.
• For N = t+1:
  – The (t+1)st row is included in the sample with probability n/(t+1)
  – Any other row:
    P[row is in reservoir] = P[row is in reservoir after t steps] * P[row is not replaced] = n/t * (1 − 1/(t+1)) = n/(t+1)
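The factor 1 − 1/(t+1) in the last step is not spelled out on the slide. One way to see it, assuming the row is currently in the reservoir: it is evicted only if the new row is inserted (probability n/(t+1)) and the uniformly chosen slot m is the one holding that row (probability 1/n).

```latex
\begin{align*}
P[\text{row is replaced at step } t+1]
  &= \frac{n}{t+1} \cdot \frac{1}{n} = \frac{1}{t+1}, \\
P[\text{row is in reservoir after } t+1 \text{ rows}]
  &= \frac{n}{t}\left(1 - \frac{1}{t+1}\right)
   = \frac{n}{t} \cdot \frac{t}{t+1}
   = \frac{n}{t+1}.
\end{align*}
```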
Complexity
• Running time: O(N)
• Number of calls to random number generator: O(N)
• Expected number of elements that may appear in the reservoir:
  $n + \sum_{t=n}^{N-1} \frac{n}{t+1} = n(1 + H_N - H_n) \approx n(1 + \ln(N/n))$
• Is there a way to sample faster, in time $O(n(1 + \ln(N/n)))$?
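As a quick numeric check of the approximation above, the sketch below evaluates the exact sum and the logarithmic estimate side by side. The function names and the choice n = 10, N = 10^6 are illustrative assumptions.

```python
import math

def expected_reservoir_entries(n, N):
    """Exact value of n + sum_{t=n}^{N-1} n/(t+1) = n(1 + H_N - H_n)."""
    return n + sum(n / (t + 1) for t in range(n, N))

def log_approximation(n, N):
    """The estimate n(1 + ln(N/n))."""
    return n * (1 + math.log(N / n))

exact = expected_reservoir_entries(10, 1_000_000)
approx = log_approximation(10, 1_000_000)
```

For these values both quantities are close to 125, so out of a million rows only about 125 are ever written into a reservoir of size 10; the rest are read and discarded.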
Faster algorithm
• Algorithm R skips over (does not insert into the reservoir) most records: in expectation about $N - n(1 + \ln(N/n))$ of them.
• At any step t, let S(n,t) denote the number of rows skipped by Algorithm R.
  – Skipping them costs O(S) time and O(S) calls to the random number generator.
• P[S(n,t) = s] = ?
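Faster algorithm
• At any step t, let S(n,t) denote the number of rows skipped by Algorithm R.
• P[S(n,t) = s] = P[for all t < x <= t+s, row x was not inserted into the reservoir, but row t+s+1 is inserted]
  $= \left(1 - \frac{n}{t+1}\right)\left(1 - \frac{n}{t+2}\right)\cdots\left(1 - \frac{n}{t+s}\right)\cdot\frac{n}{t+s+1}$
• We can derive an expression for the CDF:
  $P[S(n,t) \le s] = 1 - \frac{t}{t+s+1}\cdot\frac{t-1}{t+s}\cdot\frac{t-2}{t+s-1}\cdots\frac{t-n+1}{t+s-n+2}$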
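The two expressions above can be checked against each other with exact rational arithmetic. This is a small sketch under the assumption t ≥ n ≥ 1; the function names `skip_pmf` and `skip_cdf` are hypothetical.

```python
from fractions import Fraction

def skip_pmf(n, t, s):
    """P[S(n,t) = s]: rows t+1..t+s are not inserted, row t+s+1 is inserted."""
    p = Fraction(n, t + s + 1)
    for x in range(t + 1, t + s + 1):
        p *= 1 - Fraction(n, x)
    return p

def skip_cdf(n, t, s):
    """P[S(n,t) <= s] = 1 - prod_{i=0}^{n-1} (t - i) / (t + s + 1 - i)."""
    tail = Fraction(1)
    for i in range(n):
        tail *= Fraction(t - i, t + s + 1 - i)
    return 1 - tail

# Partial sums of the PMF agree with the closed-form CDF.
n, t = 3, 5
running = Fraction(0)
for s in range(20):
    running += skip_pmf(n, t, s)
    assert running == skip_cdf(n, t, s)
```

The closed form has n factors regardless of s, which is what makes sampling the skip length directly attractive when skips are long.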
Faster Algorithm
Algorithm X
• Initialize reservoir with first n rows.
• After seeing t rows, randomly sample a skip s = S(n,t) from the CDF
• Pick a number m between 1 and n
• Replace the m-th row in the reservoir with the (t+s+1)st row.
• Set t = t+s+1
Faster Algorithm
Algorithm X
• Initialize reservoir with first n rows.
• After seeing t rows, randomly sample a skip s = S(n,t) from the CDF
  – Pick a random U between 0 and 1
  – Find the minimum s such that P[S(n,t) <= s] >= 1 − U (equivalently, P[S(n,t) > s] <= U)
• Pick a number m uniformly at random between 1 and n
• Replace the m-th row in the reservoir with the (t+s+1)st row.
• Set t = t+s+1
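A hedged Python sketch of the skip-based procedure. It reads the search step as inverse-CDF sampling: find the smallest s with P[S(n,t) > s] <= U, where P[S(n,t) > s] is the product of (1 − n/x) over x = t+1, …, t+s+1. That reading, the function name `reservoir_sample_skips`, the sentinel `_END`, and the use of floating-point arithmetic are assumptions of this sketch, not details taken from the slides.

```python
import random
from itertools import islice

_END = object()  # sentinel marking the end of the input (hypothetical helper)

def reservoir_sample_skips(rows, n, rng=random):
    """Reservoir sampling that draws the skip length S(n,t) directly."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(rows)
    reservoir = list(islice(it, n))          # first n rows
    if len(reservoir) < n:
        return reservoir                     # fewer than n rows in total
    t = n                                    # rows seen so far
    while True:
        u = 1.0 - rng.random()               # uniform in (0, 1]
        s = 0
        tail = (t + 1 - n) / (t + 1)         # P[S > 0]
        while tail > u:                      # smallest s with P[S > s] <= u
            s += 1
            tail *= (t + s + 1 - n) / (t + s + 1)
        row = next(islice(it, s, None), _END)  # skip s rows, read row t+s+1
        if row is _END:
            return reservoir                 # input ended during the skip
        reservoir[rng.randrange(n)] = row    # overwrite a uniformly chosen slot
        t += s + 1

sample = reservoir_sample_skips(range(1_000_000), 5, random.Random(42))
```

Each insertion uses two random draws (one for the skip, one for the slot), but the inner loop still does O(s) work per skip, so this sketch saves random-number calls without reducing the overall O(N) running time.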
Algorithm X
• Running time:
  Each skip takes O(s) time to compute
  Total time = sum of all the skips = O(N)
• Expected number of calls to the random number generator
  = 2 * expected number of rows in the reservoir
  = $O(n(1 + \ln(N/n)))$, which is optimal!
See paper for an algorithm which has optimal running time
Summary
• Sampling is an important technique for computation when data is too large, or the computation is intractable, or if access to data is limited.
• Reservoir sampling techniques allow computing a sample even without knowledge of the size of the data.
  – Also can do weighted sampling [Efraimidis, Spirakis IPL 2006]
• Very useful for sampling from streams (e.g., twitter stream)
References
• J. Vitter, “Random Sampling with a Reservoir”, ACM Transactions on Mathematical Software, 1985
• P. Efraimidis, P. Spirakis, “Weighted random sampling with a reservoir”, Information Processing Letters, 97(5), 2006
• R. Karp, R. Luby, N. Madras, “Monte Carlo Approximation Algorithms for Enumeration Problems”, Journal of Algorithms, 1989