![Page 1: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/1.jpg)
Data Mining Distributed Streams
Edo Liberty, Principal Scientist, Amazon Web Services
![Page 2: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/2.jpg)
Data
Computation Result
The World
Single machine data processing
![Page 3: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/3.jpg)
Data Data Data Data
Computation Result
The World
Distributed storage
![Page 4: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/4.jpg)
Data + Compute (eight nodes)
Computation Result
The World
Distributed compute (map/reduce, MPI, …)
![Page 5: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/5.jpg)
Data + Compute (eight nodes)
Computation Result
The World
Computation Query
Distributed model (indexes, databases, Spark…)
![Page 6: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/6.jpg)
207 big-data infographics (a meta infographic)
![Page 7: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/7.jpg)
![Page 8: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/8.jpg)
Sketch
The World
Query Algorithm Result Query
Result
Compute
The streaming model
![Page 9: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/9.jpg)
Merge + Sketch
The World
Query Algorithm Result Query
Result
Compute + Sketch (four nodes)
The distributed streaming model
![Page 10: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/10.jpg)
Sketch
Result
Iterator
Computation
The streaming model (more accurately)
O(n) items
O(polylog(n)) space
O(polylog(n)) computation per item
1 7 8 1 0 1 7 7
![Page 11: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/11.jpg)
Sketch Result
Iterator Iterator
Communication complexity
1 7 8 1 0 1 7 7
![Page 12: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/12.jpg)
What can we do in this model?

Items (words, IP-addresses, events, clicks, ...)
• Item frequencies
• Approximate quantiles
• Counting distinct elements
• Moment and entropy estimation
• Approximate set operations
• Sampling

Vectors (text documents, images, example features, ...)
• Dimensionality reduction
• Clustering (k-means, k-median, …)
• Linear regression
• Machine learning (some of it at least)

Matrices (text corpora, recommendations, ...)
• Covariance estimation matrix
• Low rank approximation
• Sparsification

Graphs* (social networks, communications, ...)
• Connectivity
• Cut sparsification
• Weighted matching
![Page 13: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/13.jpg)
![Page 14: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/14.jpg)
Frequency Counting
Misra, Gries. Finding repeated elements, 1982.
Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet streams with limited space, 2002.
Karp, Shenker, Papadimitriou. A simple algorithm for finding frequent elements in streams and bags, 2003.
The name "Lossy Counting" was used for a different algorithm by Manku and Motwani, 2002.
Metwally, Agrawal, Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, 2006.
Charikar, Chen, Farach-Colton. Finding frequent items in data streams, 2002.
Cormode, Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications.
![Page 15: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/15.jpg)
Problem Definition
A stream of $n$ items.
Example: $f(\cdot) = 5$ for the pictured item.
Goal: an estimate $f'$ with $|f' - f| < \varepsilon n$
![Page 16: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/16.jpg)
Can we do better than sampling?
$f'(\cdot) = 3 \cdot n/\ell$
$\ell = O(1/\varepsilon^2)$
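The sampling baseline on this slide can be sketched roughly as follows. This is a minimal sketch, assuming a uniform reservoir sample of `ell` items taken in one pass; the function names `reservoir_sample` and `estimate_frequencies` and the `seed` parameter are illustrative and do not come from the talk.

```python
import random
from collections import Counter


def reservoir_sample(stream, ell, rng):
    """Keep a uniform random sample of up to `ell` items from a one-pass stream."""
    sample = []
    n = 0
    for item in stream:
        n += 1
        if len(sample) < ell:
            sample.append(item)
        else:
            j = rng.randrange(n)  # uniform index in [0, n)
            if j < ell:
                sample[j] = item
    return sample, n


def estimate_frequencies(stream, ell, seed=0):
    """Estimate f'(x) = (copies of x in the sample) * n / ell for sampled items."""
    sample, n = reservoir_sample(stream, ell, random.Random(seed))
    if not sample:
        return {}, 0
    scale = n / len(sample)
    return {x: c * scale for x, c in Counter(sample).items()}, n
```

An item that appears 3 times in the sample gets the estimate 3 · n/ℓ, as on the slide. Items that never make it into the sample are estimated as 0.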
![Page 17: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/17.jpg)
ℓ
[Slides 17 to 23: figure-only frames, each labeled ℓ]
![Page 24: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/24.jpg)
ℓ
$f'(\cdot) = 0$ and $f'(\cdot) = 2$ (for two different pictured items)
![Page 25: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/25.jpg)
Analysis
Assume we delete $t$ times.
First fact: $f'(x) \le f(x)$
Second fact: $f'(x) \ge f(x) - t$
Therefore: $|f'(x) - f(x)| \le t$
![Page 26: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/26.jpg)
Analysis
We delete $\ell$ different items every time!
Third fact: $t \le n/\ell$
We get that: $|f'(x) - f(x)| < \varepsilon n$
When: $\ell = 1/\varepsilon$ (much better than sampling!)
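The three facts in the analysis can be checked against a small implementation. The following is a minimal sketch of a Misra-Gries style counter, assuming it keeps fewer than `ell` counters so that every delete step removes one copy of `ell` different items (the new item plus each tracked item). The function name `misra_gries` and its return values are illustrative, not taken from the talk.

```python
def misra_gries(stream, ell):
    """Return (counters, n, t): approximate counts, stream length, delete steps."""
    counters = {}
    n = 0
    t = 0
    for item in stream:
        n += 1
        if item in counters:
            counters[item] += 1
        elif len(counters) < ell - 1:
            counters[item] = 1
        else:
            # Delete one copy of ell different items: the new item plus
            # one copy of each of the ell - 1 items currently tracked.
            t += 1
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters, n, t
```

The estimate for an item `x` is `counters.get(x, 0)`. It never exceeds the true count and is at most `t` below it, and since each delete step removes `ell` items from a stream of `n`, `t * ell <= n`.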
![Page 27: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/27.jpg)
Analysis
Items' exact probability: $p(x) = f(x)/n$
Approximate probability: $p'(x) = f'(x)/n$
We get: $|p'(x) - p(x)| \le 1/\ell$
If $\ell = 10{,}000$ we get only a 0.01% error in our estimations.
We would need 10 billion samples to get the same accuracy!
![Page 28: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/28.jpg)
Email threads
A simple email thread (that's not very hard to do…)
![Page 29: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/29.jpg)
Threading Machine Generated Email
Ailon, Karnin, Maarek, Liberty. Threading Machine Generated Email, WSDM 2013
![Page 30: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/30.jpg)
Threading Machine Generated Email
![Page 31: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/31.jpg)
Threading Machine Generated Email
![Page 32: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/32.jpg)
Streaming quantiles
Manku, Rajagopalan, Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets.
Munro, Paterson. Selection and sorting with limited storage.
Greenwald, Khanna. Space-efficient online computation of quantile summaries.
Wang, Luo, Yi, Cormode. Quantiles over data streams: An experimental study.
Greenwald, Khanna. Quantiles and equidepth histograms over streams.
Agarwal, Cormode, Huang, Phillips, Wei, Yi. Mergeable summaries.
Felber, Ostrovsky. A randomized online quantile summary in O((1/ε) log(1/ε)) words.
Lang, Karnin, Liberty. Optimal Quantile Approximation in Streams.
![Page 33: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/33.jpg)
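Problem Definition
A stream of $n$ items; ranks run from 0 through $n/2$ to $n$.
Example: $R(\cdot) = 0.6 \cdot n$ for the pictured item.
Goal: $|R' - R| < \varepsilon n$
Sampling $O(1/\varepsilon^2)$ values gives this guarantee; can we do better?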
![Page 34: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/34.jpg)
The basic buffer idea
1 0 3 5 4 7
Buffer of size k
![Page 35: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/35.jpg)
The basic buffer idea
Stores k stream entries
1 0 3 5 4 7
![Page 36: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/36.jpg)
The basic buffer idea
The buffer sorts k stream entries
0 1 3 4 5 7
![Page 37: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/37.jpg)
The basic buffer idea
Deletes every other item
0 1 3 4 5 7
![Page 38: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/38.jpg)
The basic buffer idea
And outputs the rest with double the weight
0 3 5
![Page 39: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/39.jpg)
The basic buffer idea
Sorted buffer: 0 1 3 4 5 7
For a query $x$ with $R(x) = 2$: $R'(x) = 2$ or $R'(x) = 2$
For a query $x$ with $R(x) = 5$: $R'(x) = 4$ or $R'(x) = 6$
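The buffer step and its effect on ranks can be reproduced with a few lines of Python. This is a minimal sketch, assuming the rank of `x` counts the items strictly smaller than `x`; the names `compact` and `weighted_rank` and the `keep_even` flag are illustrative and not from the talk.

```python
def compact(buffer, keep_even=True):
    """Sort the buffer and delete every other item.

    Each surviving item stands for two original items (double weight).
    """
    ordered = sorted(buffer)
    start = 0 if keep_even else 1
    return ordered[start::2]


def weighted_rank(items, weight, x):
    """Rank of x: total weight of the items strictly smaller than x."""
    return weight * sum(1 for v in items if v < x)
```

For the buffer 1 0 3 5 4 7, a query between 1 and 3 has exact rank 2 and keeps rank 2 after compaction whichever half survives. A query between 5 and 7 has exact rank 5 and gets rank 6 or 4, so one compaction changes a rank by at most the weight of one item.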
![Page 40: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/40.jpg)
The basic buffer idea
Repeat $n/k$ times until the end of the stream.
$|R'(x) - R(x)| < n/k$
![Page 41: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/41.jpg)
Manku-Rajagopalan-Lindsay (MRL) sketch
$\log_2(n)$ buffers of size $k$
$|R'(x) - R(x)| \le n \log_2(n)/k$
![Page 42: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/42.jpg)
Manku-Rajagopalan-Lindsay (MRL) sketch
If we set $k = \log_2(n)/\varepsilon$
We get $|R'(x) - R(x)| \le \varepsilon n$
And we maintain only $\log_2^2(n)/\varepsilon$ items from the stream!
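A stack of such buffers can be sketched as below. This is a simplified illustration in the spirit of the MRL construction, not the published algorithm: it assumes an even buffer size `k`, always keeps the even-indexed items when compacting, and gives items at level `h` the weight `2**h`. The class name `BufferHierarchy` and its methods are hypothetical.

```python
class BufferHierarchy:
    """Levels of buffers of size k; items at level h carry weight 2**h."""

    def __init__(self, k):
        if k < 2 or k % 2:
            raise ValueError("k must be an even integer >= 2")
        self.k = k
        self.levels = [[]]
        self.n = 0

    def update(self, item):
        self.n += 1
        self.levels[0].append(item)
        h = 0
        # Compact every full buffer and push the survivors one level up.
        while len(self.levels[h]) >= self.k:
            survivors = sorted(self.levels[h])[::2]
            self.levels[h] = []
            if h + 1 == len(self.levels):
                self.levels.append([])
            self.levels[h + 1].extend(survivors)
            h += 1

    def rank(self, x):
        """Estimate R'(x): weighted count of stored items smaller than x."""
        return sum((2 ** h) * sum(1 for v in level if v < x)
                   for h, level in enumerate(self.levels))

    def size(self):
        """Number of stream items currently stored."""
        return sum(len(level) for level in self.levels)
```

Each compaction at level `h` shifts a rank by at most `2**h`, and at most `n / (k * 2**h)` compactions happen at that level, so every level contributes at most `n/k` error. With about `log2(n)` levels this matches the `n log2(n)/k` bound on the slide.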
![Page 43: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/43.jpg)
Greenwald-Khanna (GK) sketch
Uses a completely different construction.
It gets $|R'(x) - R(x)| \le \varepsilon n$
And maintains only $O(\log(n)/\varepsilon)$ items from the stream!
![Page 44: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/44.jpg)
Agarwal, Cormode, Huang, Phillips, Wei, Yi (1)
$\log(1/\varepsilon)$ buffers of size $k$; start sampling after $O(1/\varepsilon^2)$ items.
Reduces space usage to $\log^2(1/\varepsilon)/\varepsilon$ items from the stream.
![Page 45: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/45.jpg)
Agarwal, Cormode, Huang, Phillips, Wei, Yi (2)
$R'(x)$ is a random variable now and $E[R'(x)] = R(x)$.
Example: buffer 5 7 and a query $x$ with $R(x) = 1$ give $R'(x) = 2$ or $R'(x) = 0$.
Reduces space usage to $\log^{3/2}(1/\varepsilon)/\varepsilon$ items from the stream.
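The unbiasedness claim can be illustrated with a randomized version of the buffer step. This is a minimal sketch, assuming the buffer keeps either the even-indexed or the odd-indexed items of the sorted buffer with probability 1/2 each; the function names are illustrative and not from the talk.

```python
import random


def random_compact(buffer, rng):
    """Sort, then keep the even- or odd-indexed items with probability 1/2 each."""
    ordered = sorted(buffer)
    offset = rng.randrange(2)  # 0 keeps indices 0, 2, ...; 1 keeps 1, 3, ...
    return ordered[offset::2]


def expected_rank_after_compaction(buffer, x):
    """Average the doubled-weight rank of x over both equally likely outcomes."""
    ordered = sorted(buffer)
    ranks = [2 * sum(1 for v in ordered[offset::2] if v < x)
             for offset in (0, 1)]
    return sum(ranks) / 2, ranks
```

For the buffer 5 7 and a query between the two values, the possible ranks are 2 and 0, and their average is the exact rank 1.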
![Page 46: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/46.jpg)
Lang, Karnin, Liberty (1)
Exponentially shrinking buffers
Reduces space usage to $\sqrt{\log(1/\varepsilon)}/\varepsilon$ items from the stream.
![Page 47: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/47.jpg)
Lang, Karnin, Liberty (2)
Exponentially decreasing buffer sizes
GK Sketch
Reduces space usage to $\log\log(1/\varepsilon)/\varepsilon$ items from the stream.
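To make "exponentially decreasing buffer sizes" concrete, the snippet below computes one possible set of capacities. It is only a sketch under stated assumptions: the slides do not give the shrink factor, so `c = 2/3` and the minimum capacity of 2 are assumed values, and the function name `buffer_capacities` is hypothetical.

```python
import math


def buffer_capacities(k, num_levels, c=2.0 / 3.0, min_capacity=2):
    """Capacities from the lowest level (index 0) to the top level.

    The top buffer holds k items; each level below shrinks by a factor c
    (an assumed value), never dropping under min_capacity.
    """
    top = num_levels - 1
    return [max(min_capacity, math.ceil(k * c ** (top - h)))
            for h in range(num_levels)]
```

Because the capacities form a geometric series, the total space stays within a constant factor of `k` plus a small term per level, rather than growing as `k` times the number of levels.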
![Page 48: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/48.jpg)
Some experimental results
[Figure, two panels titled "Lazy KLL versus (Sketch Library and Two Variants)". Panel 1: Error against Number of Items in Randomly Permuted Stream. Panel 2: Space Used For Storing Samples against Number of Items in Randomly Permuted Stream. Series: Sketch Library, Variant 1, Variant 2, Lazy KLL.]
![Page 49: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/49.jpg)
Count Distinct (Demo Only)
![Page 50: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/50.jpg)
Assume you need to estimate the number of unique numbers in a file.
>> head data.csv
0
1
0
3
0
2
3
7
3
2
In this one, row i takes a value from [0, i] uniformly at random.
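A file with this distribution could be produced along the following lines. The talk does not show how `data.csv` was created, so this is only a sketch, assuming integer values and one value per line; the function names and the `seed` parameter are illustrative.

```python
import random


def generate_rows(num_rows, seed=0):
    """Yield one value per row; row i is drawn uniformly from {0, ..., i}."""
    rng = random.Random(seed)
    for i in range(num_rows):
        yield rng.randint(0, i)  # randint includes both endpoints


def write_file(path, num_rows, seed=0):
    """Write the generated values to `path`, one per line."""
    with open(path, "w") as out:
        for value in generate_rows(num_rows, seed):
            out.write(f"{value}\n")
```

Calling `write_file("data.csv", 10_000_000)` would give a file of the same shape as the one in the demo, although not the same values.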
![Page 51: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/51.jpg)
Some stats: there are 10,000,000 such numbers in this ~76 MB file.
>> time wc -lc data.csv
10000000 76046666 data.csv
real 0m0.101s
user 0m0.072s
sys 0m0.021s
Reading the file takes ~1/10 seconds. We don't foresee IO being an issue.
![Page 52: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/52.jpg)
To count the number of distinct items you might try this:
>> sort data.csv | uniq | wc -l
However, it is faster to "uniqify" while sorting:
>> sort data.csv -u | wc -l
>> time sort data.csv -u | wc -l
5001233
real 2m37.071s
user 2m36.587s
sys 0m0.376s
![Page 53: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/53.jpg)
Still, most of the time is spent on comparing strings....
>> sort data.csv -u -n -S100% | wc -l
>> time sort data.csv -u -n | wc -l
5001233
real 0m11.809s
user 0m11.587s
sys 0m0.228s
This is much better!
![Page 54: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/54.jpg)
This is the way to do this with the sketching library:
>> sketchuniq data.csv
>> time sketchuniq data.csv
Estimate : 4974249
Upper Bound: 5116569
Lower Bound: 4835874
real 0m1.527s
user 0m1.506s
sys 0m0.152s
Too fast to use the system monitor UI...
It uses ~32k of memory!
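The talk does not say which algorithm `sketchuniq` uses. As a stand-in, the following sketches a K-Minimum-Values (KMV) distinct counter, a different and simpler technique chosen here only to show how a small summary can estimate the number of unique values. The class name `KMVSketch`, the default `k = 1024`, and the choice of BLAKE2b as the hash are all assumptions.

```python
import hashlib
import heapq


class KMVSketch:
    """Keep the k smallest 64-bit hash values seen so far."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # max-heap of kept hashes, stored negated
        self._members = set()  # kept hashes, for duplicate detection

    @staticmethod
    def _hash(item):
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) * 2.0 ** 64 / kth_smallest
```

Feeding every line of the file to `update` and then calling `estimate` takes a single pass and stores at most `k` hash values, regardless of how many rows the file has.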
![Page 55: Data Mining Distributed Streams - GitHub Pages · Frequency Counting Misra, Gries. Finding repeated elements, 1982. Demaine, Lopez-Ortiz, Munro. Frequency estimation of internet packet](https://reader034.vdocuments.mx/reader034/viewer/2022050601/5fa8b3ff18a93b0cc27c363b/html5/thumbnails/55.jpg)
Thank you!