Distributed Data Management (dbis.informatik.uni-kl.de/files/teaching/ss15/ddm/lecture9.pdf)
TRANSCRIPT
Distributed Data Management
Summer Semester 2015
TU Kaiserslautern
Prof. Dr.-Ing. Sebastian Michel
Databases and Information Systems Group (AG DBIS)
http://dbis.informatik.uni-kl.de/
Distributed Data Management, SoSe 2015, S. Michel 1
Announcements
• Please send specific questions you have to Evica Milchevski or Kiril Panev ([email protected] or [email protected] )
• They will collect them and compile a FAQ list to walk through in the exam preparation session on July 21 during regular exercise session hours.
• Please do not forget to register for the exam also at the examination office.
• Since we did not cover CQL in today’s lecture, assignment 3 on sheet 6 is optional for the next exercise session; everyone gets the “point” for this assignment.
(DISTRIBUTED) DATA STREAM PROCESSING
Data Stream
• A data stream is a sequence of data tuples.
• Think of standard tuples of relational databases.
• With time information (timestamps).
• Tuples are generated one after the other, or in batches.
• That means data is moving! Continuously generated (assumed infinite!)
• Potentially at a high pace.
• The system has to process data without first storing everything (how would that be possible anyway if the stream is infinite?!)
Sensor Networks as Data Stream Origin
• E.g., in environmental monitoring
StationStream(timestamp, humidity, solarRadiation, windSpeed, snowHeight)
• Various application scenarios:
– avalanche risk level computation
– insights for agriculture
– air pollution (urban) monitoring
Sample Applications: Pothole Patrol
• The Pothole Patrol
• Detecting and reporting the surface conditions of roads, using sensors in vehicles
• Using a 3-axis accelerometer + GPS + learning
Eriksson et al. The Pothole Patrol: Using a Mobile Sensor Network for Road Surface Monitoring. MobiSys 2008.
Sample Application: Swiss Experiment
• Environmental monitoring
• Sensor data management and meta data sharing.
• Across many different types of measurement: (hydrology, alpine monitoring, atmospheric phenomena, earthquakes, …)
• Also higher level applications like putting sensors and interpretations on maps, computing statistics over streams.
http://www.swiss-experiment.ch
Sensors
• Mobile vs. static
• Large vs. tiny (smart dust!)
• bytes/hours vs. > GB/minute
(Pictures: ambient temperature sensor of a car; sensors at LHC@CERN; tiny sensor at U Michigan)
Example Sensor (Tinynode) on Top of Extension Board
(Figure labels: Antenna; Light Sensor; Connector Board for extras; RS-232 for programming; Actual Sensor)
http://www.tinynode.com/
Social Sensors
• Explicitly: Snow Tweets (http://snowcore.uwaterloo.ca/snowtweets/)
– #snowtweets 50.0 cm. at K1A 0A2
– #snowtweets 10.0 in. at 20500
– #snowtweets 4 cm at Palmerston North 4414
• Implicitly: By mentioning topics, people, in social communication
Earthquake News on Twitter
source: http://blog.socialflow.com/
Classic Example: Stock Market
• Real-time analysis of stock market changes
• Computing statistics over streams, e.g., for decision support
• Opportunities for reacting in real-time
• Even with fully automated means: algorithmic trading
So Far: Databases/NoSQL Datastores
• Data is changing, yes, but this is more due to inserts and updates to stored data items
• Historic data is kept
• Queries operate on full data (tables)
• MapReduce is the extreme case: write once, read many times
• Data warehousing, too: periodically loading data in store for deep(er) analytics
• Data mining
Data Stream Management vs. Traditional Data Management
• At query time, data is accessed as a whole
• Data is persistently stored
• Queries are ad-hoc (mainly)
(Figure: a data base/store that receives Insert, Update, and Delete operations and answers queries with results)
Data Stream Management vs. Traditional Data Management (Cont’d)
• Data is moving! Continuously generated (assumed infinite!)
• At high pace
• Queries are (mainly) continuous (aka standing): registered once, observed “forever”
• Answers to queries are (often) required in (near) real-time
• Probabilistic methods for efficiency, or considering only part of the stream (sliding window)
(Figure: a data stream passing a set of registered queries)
DBMS vs. DSMS
| Database management system (DBMS) | Data stream management system (DSMS) |
|---|---|
| Persistent data (relations) | Volatile data streams |
| Random access | Sequential access |
| One-time queries | Continuous queries |
| (Theoretically) unlimited secondary storage | Limited main memory |
| Only the current state is relevant | Consideration of the order of the input |
| Relatively low update rate | Potentially extremely high update rate |
| Little or no time requirements | Real-time requirements |
| Assumes exact data | Assumes outdated/inaccurate data |
| Plannable query processing | Variable data arrival and data characteristics |
http://en.wikipedia.org/wiki/Data-stream_management_system
Data Stream Model
• The stream of data items is unbounded (available memory is not)
• No way to store the entire stream (how could we, it's (probably) not ending)
• To compute query results, we need to devise algorithms with little memory consumption
Overview of Data Stream Topics
• Synopses:
– concise representations of stream content
– tailored to tasks, e.g., counting distinct elements
– usually not exact, but approximations (estimators) of true values
– generally useful for representing data compactly
– We will look at some of them today
• (Sliding) Windows:
– focus on a certain recent subset of the data
– computation of functions/joins over the content of window(s)
– Will look at the CQL language: think “SQL” for streaming data
Data Stream Mining: Teasers
• I tell you integer numbers between 1 and N
• I will tell all but one number
• After N-1 numbers I ask: which number was missing?
481 324 122 412 871 231 849 447 641 …
Data Stream Mining: Teasers (Cont’d)
• Keep Boolean array of length N:
– Mark position for observed number
– Size required: N
– Computation at end: N to find missing number
• Much better:
– keep sum of numbers: S
– Missing number is N*(N+1)/2 - S
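The sum trick can be sketched in a few lines. This is a minimal illustration, assuming the stream is any Python iterable of integers; the function name `missing_number` is made up for this example.

```python
def missing_number(stream, n):
    """Return the one number from 1..n that never appears in the stream.

    Only a single running sum is kept, so memory use is independent of n
    (apart from the size of the sum itself).
    """
    total = 0
    for x in stream:          # one pass, items are never stored
        total += x
    return n * (n + 1) // 2 - total


print(missing_number([4, 1, 5, 2], 5))  # 3
```

Compared with the Boolean array, nothing has to be scanned at the end: the answer is one subtraction.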
Counting Occurrences
• Consider a stream of elements ai
…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …
• How often does a2 occur?
• How to implement?
• Keep a counter for each id
• Required space: #ids (= N)
• Not feasible if N is very large
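For comparison, the exact approach with one counter per id might look as follows. This is only a sketch using Python's standard `collections.Counter`; the item names are taken from the example stream above.

```python
from collections import Counter


def count_occurrences(stream):
    """Exact counting: one counter per distinct id seen in the stream."""
    counts = Counter()
    for item in stream:
        counts[item] += 1
    return counts


stream = ["a2", "a84", "a41", "a2", "a77", "a231", "a2", "a4", "a54"]
counts = count_occurrences(stream)
print(counts["a2"])   # 3
print(len(counts))    # 7 counters, one per distinct id
```

The number of counters grows with the number of distinct ids, which is exactly the space problem noted above.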
Probabilistic Counting: Count-Min Sketch
• Keep a 2-dimensional array of counters with h rows and r columns
• h hash functions h_j that map to the range 0…(r-1)
• Arriving item x:
• For each j: array[j, h_j(x)]++
(Figure: counter array with rows h1–h4 and columns 0–5)
Cormode, Muthukrishnan (2004). An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms 55: 29–38.
Count-Min Sketch: Insert Example
Data stream is: a, b, a, a, c, a, c, …
Consider the following 4 hash functions, for ease of use displayed by their values when applied to a, b, or c:
| x | h1(x) | h2(x) | h3(x) | h4(x) |
|---|---|---|---|---|
| a | 4 | 5 | 0 | 2 |
| b | 3 | 5 | 1 | 3 |
| c | 2 | 2 | 0 | 3 |
| … | … | … | … | … |
Initially, the counter array (rows h1–h4, columns 0–5) is empty.
Array after inserting the first item (a):
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 0 | 0 | 1 | 0 |
| h2 | 0 | 0 | 0 | 0 | 0 | 1 |
| h3 | 1 | 0 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 1 | 0 | 0 | 0 |
Array after inserting a, b:
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 0 | 1 | 1 | 0 |
| h2 | 0 | 0 | 0 | 0 | 0 | 2 |
| h3 | 1 | 1 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 1 | 1 | 0 | 0 |
Array after inserting a, b, a:
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 0 | 1 | 2 | 0 |
| h2 | 0 | 0 | 0 | 0 | 0 | 3 |
| h3 | 2 | 1 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 2 | 1 | 0 | 0 |
Array after inserting a, b, a, a:
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 0 | 1 | 3 | 0 |
| h2 | 0 | 0 | 0 | 0 | 0 | 4 |
| h3 | 3 | 1 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 3 | 1 | 0 | 0 |
Array after inserting a, b, a, a, c:
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 0 | 0 | 1 | 1 | 3 | 0 |
| h2 | 0 | 0 | 1 | 0 | 0 | 4 |
| h3 | 4 | 1 | 0 | 0 | 0 | 0 |
| h4 | 0 | 0 | 3 | 2 | 0 | 0 |
Imagine this continues for a while; then we might end up with ……
Count-Min Sketch: Counting
• How often did we see item a?
• Recall the hash function values for a:
h1(a) = 4, h2(a)=5, h3(a)=0, h4(a)=2
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| h1 | 5 | 3 | 4 | 4 | 9 | 3 |
| h2 | 4 | 7 | 1 | 4 | 4 | 8 |
| h3 | 8 | 4 | 6 | 7 | 2 | 1 |
| h4 | 3 | 1 | 4 | 8 | 7 | 5 |
• Look at positions: h1(a) = 4, h2(a)=5, h3(a)=0, h4(a)=2
• Take minimum of the corresponding values: Here: 4
– Values at these positions: 9, 8, 8, 4
Is this estimator generally underestimating or overestimating or can’t we say anything about that?
• The estimate never underestimates the true count
• Overestimation is probabilistically bounded
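The insert and counting steps can be put together in a small sketch. The class name `CountMinSketch` and its methods are made up for this illustration, and the four "hash functions" are simply lookups into the table of the insert example (a real sketch would use proper, independently chosen hash functions).

```python
class CountMinSketch:
    """Count-Min sketch: one row of counters per hash function."""

    def __init__(self, hash_functions, width):
        self.hash_functions = hash_functions
        self.width = width
        self.table = [[0] * width for _ in hash_functions]

    def add(self, item, count=1):
        # For each row j: array[j, h_j(x)] += count
        for row, h in zip(self.table, self.hash_functions):
            row[h(item) % self.width] += count

    def estimate(self, item):
        # Minimum over the counters the item maps to; collisions can only
        # add to a counter, so the result never underestimates.
        return min(row[h(item) % self.width]
                   for row, h in zip(self.table, self.hash_functions))


# Hash values from the insert example (h1..h4 applied to a, b, c).
H = {"a": (4, 5, 0, 2), "b": (3, 5, 1, 3), "c": (2, 2, 0, 3)}
hash_functions = [lambda x, j=j: H[x][j] for j in range(4)]

sketch = CountMinSketch(hash_functions, width=6)
for item in "abaacac":        # the stream a, b, a, a, c, a, c
    sketch.add(item)

print(sketch.estimate("a"))   # 4
print(sketch.estimate("b"))   # 1
print(sketch.estimate("c"))   # 2
```

In this tiny example all three estimates happen to be exact, because every item has at least one counter it does not share with another item.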
Unbiased vs. Biased Estimators
• Given a real number $n$ and an estimator of it, denoted as $\hat{n}$
• E.g., number of distinct elements in a set S
• $\hat{n}$ is called an unbiased estimator if $E[\hat{n}] = n$
• and biased otherwise, in which case $E[\hat{n}] \neq n$
Counting Distinct Elements
• Consider a stream of elements ai
…, a2, a84, a41, a2, a77, a231, a2, a4, a54, …
• How to compute/estimate the number of distinct elements observed?
Application
• Streams (one pass, little memory footprint)
• Distributed systems: compact data exchange (recall Bloom filter)
• But also counting in Database Systems
• Sketches for partial data can often be merged for global view
(Figure: two data sets, each summarized by a sketch)
Efficient counting, comparing: How many distinct elements altogether?
Flajolet Martin (FM) Sketch (aka. Hash Sketch)
• Allocate a bitvector B of size m = log(N)
• Hash items to bitvector positions:
– Hash each item i to an m-bit number h(i)
– Compute position k of the least-significant “1” of h(i)
– Set the bit B[k] to “1”
Given data stream 17, 5, 19, 211, 17, 5, 31
h(17) = 010100, then least-sig. 1 bit = 2
h(5) = 000101, then least-sig. 1 bit = 0
...
Proposed originally by Flajolet and Martin in 1985
FM-Sketch: Estimator
• In the end B might look like this: B: 111010
• Then get the position t of the left-most “0” bit of B
• Count-distinct estimate of the real distinct number n: $\hat{n} = 2^t / \varphi$ with $\varphi \approx 0.77351$
– here, with t = 3: $\hat{n} = 2^3 / 0.77351 \approx 10.34$
• Improvement:
– If you use more bitmaps and compute an average position t, you can improve the count-distinct estimate
Note: Be careful with left-most bit, least significant bit; depends on interpretation of bits
FM Sketch: Intuition/Idea
• B[0] is set approximately n/2 times
• B[1] is set approximately n/4 times
• B[i] = 0 if i >> log2(n)
• B[i] = 1 if i<<log2(n)
• “Mix” of 1s and 0s around i≈log2(n)
• Use the left-most zero as indicator for log2(n):
$n \approx 2^{t}$, where t is the position of the left-most zero bit
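A single-bitmap FM sketch can be sketched as follows. This is a minimal illustration under several assumptions: items are hashed with MD5 truncated to 32 bits (any well-mixing hash would do), the bitvector is a Python list whose index 0 is the "left-most" position, and the constant 0.77351 is the correction factor from the cited Flajolet-Martin paper. All function names are made up for this example.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin (1985)


def hash32(item):
    """Hash an item to a 32-bit integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def lsb_position(value, m=32):
    """Position of the least-significant 1 bit (m - 1 if value is 0)."""
    if value == 0:
        return m - 1
    return (value & -value).bit_length() - 1


def fm_sketch(stream, m=32):
    """Build the bitvector B: B[k] = 1 if some item hashed to position k."""
    bits = [0] * m
    for item in stream:
        bits[lsb_position(hash32(item), m)] = 1
    return bits


def fm_estimate(bits):
    """Estimate the distinct count from the position t of the first 0 bit."""
    t = bits.index(0) if 0 in bits else len(bits)
    return 2 ** t / PHI


print(lsb_position(0b010100))   # 2, as for h(17) on the slide
print(lsb_position(0b000101))   # 0, as for h(5) on the slide
print(fm_estimate(fm_sketch([17, 5, 19, 211, 17, 5, 31])))
```

Duplicates set a bit that is already set, so they cannot change the sketch; this is why the structure counts distinct elements. A single bitmap gives only a coarse estimate (a power of two divided by the constant), which is why averaging over several bitmaps is suggested.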
FM-Sketch: Union
• Given: two multisets S and T and their sketches $B_S$ and $B_T$ of size m
• Then: the sketch $B_S \lor B_T$ (bitwise OR) is the sketch of $S \cup T$
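The union property is easy to check experimentally. The sketch below stores the bitvector as an integer bitmask so that the union is a single bitwise OR; the MD5-based hashing and the name `fm_bits` are assumptions made for this illustration only.

```python
import hashlib


def fm_bits(items, m=32):
    """FM sketch as an integer bitmask: bit k is set if an item hashed to k."""
    bits = 0
    for item in items:
        h = int.from_bytes(hashlib.md5(str(item).encode("utf-8")).digest()[:4], "big")
        k = (h & -h).bit_length() - 1 if h else m - 1  # least-significant 1 bit
        bits |= 1 << k
    return bits


S = [17, 5, 19, 211, 17]
T = [5, 31, 42]

# OR-ing the two sketches gives exactly the sketch of the combined multiset.
print((fm_bits(S) | fm_bits(T)) == fm_bits(S + T))  # True
```

This holds for any inputs as long as both sketches use the same hash function, because each bit depends only on whether some item mapped to it.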
K-Min Value (KMV) Synopsis
[Beyer et al., 2007]
• Given set S of values. Want number of distinct elements n := D(S) (notation)
• Hashing outputs values uniformly in [0,1]
• KMV synopsis is the ordered set of the k smallest hash values: $L = \{U_{(1)}, U_{(2)}, \ldots, U_{(k)}\}$
(Figure: the hash values on the interval from 0 to 1, with the k-min values at the left end)
• Unbiased estimator: $\hat{n} = (k-1)/U_{(k)}$
– Exact error analysis based on theory of order statistics
– Asymptotically optimal as k becomes large
Slide based on PPT slides from Beyer et al. ‘07
K-Min Value (KMV) Synopsis (Cont’d)
Why does this work?
• KMV synopsis: $L = \{U_{(1)}, U_{(2)}, \ldots, U_{(k)}\}$
• Hashing = dropping the n DVs uniformly on [0,1]
(Figure: the interval from 0 to 1 with the k-min values $U_{(1)}, U_{(2)}, \ldots, U_{(k)}$ marked, neighbors about 1/n apart)
• Distance between neighboring hash values is about 1/n (assuming uniformity)
• We are interested in knowing the value of n
• Natural derivation of basic estimator:
– Observation: $E[U_{(k)}] \approx k/n$
– Thus: $\hat{n} = k/U_{(k)}$
• This is a biased estimator. Using (k-1) instead of k renders it unbiased
Slide based on PPT slides from Beyer et al. ‘07
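A KMV synopsis can be maintained in one pass with a bounded heap. The sketch below is an illustration under stated assumptions: items are hashed with MD5 and mapped to [0,1] (one of many possible choices), and the names `unit_hash`, `kmv_synopsis`, `basic_estimate`, and `unbiased_estimate` are made up for this example. If fewer than k distinct values were seen, the synopsis size itself is returned as the exact count.

```python
import hashlib
import heapq


def unit_hash(item):
    """Map an item to a pseudo-uniform value in [0, 1] via MD5."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def kmv_synopsis(stream, k):
    """Return the k smallest distinct hash values, sorted ascending."""
    heap = []        # max-heap via negated values, holds at most k hashes
    members = set()  # same values, for duplicate detection
    for item in stream:
        u = unit_hash(item)
        if u in members:
            continue                      # duplicate item: nothing changes
        if len(heap) < k:
            heapq.heappush(heap, -u)
            members.add(u)
        elif u < -heap[0]:                # smaller than current k-th value
            evicted = -heapq.heappushpop(heap, -u)
            members.discard(evicted)
            members.add(u)
    return sorted(members)


def basic_estimate(synopsis, k):
    """Biased basic estimator k / U_(k)."""
    if len(synopsis) < k:
        return float(len(synopsis))
    return k / synopsis[-1]


def unbiased_estimate(synopsis, k):
    """Unbiased estimator (k - 1) / U_(k)."""
    if len(synopsis) < k:
        return float(len(synopsis))
    return (k - 1) / synopsis[-1]


synopsis = kmv_synopsis((i % 1000 for i in range(5000)), k=64)
print(unbiased_estimate(synopsis, 64))  # an estimate of the 1000 distinct values
```

Memory use is bounded by k regardless of the stream length, and repeated items hash to the same value, so they never affect the synopsis.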
KMV Example
• Implemented hash function using MD5 (and mapped to [0,1]). Items are characters a-z.
The sorted (ascending) hash values for the 26 characters a-z are:
[0.043, 0.172, 0.281, 0.354, 0.382, 0.421, 0.443, 0.459, 0.463, 0.523, 0.556, 0.565, 0.569, 0.57, 0.59, 0.644, 0.652, 0.675, 0.682, 0.724, 0.818, 0.864, 0.89, 0.938, 0.994, 0.997]
| k | estimate $(k-1)/U_{(k)}$ | $U_{(k)}$ |
|---|---|---|
| 1 | 0.000 | 0.043 |
| 2 | 5.814 | 0.172 |
| 3 | 7.117 | 0.281 |
| 4 | 8.475 | 0.354 |
| 5 | 10.471 | 0.382 |
| 6 | 11.876 | 0.421 |
| 7 | 13.544 | 0.443 |
| 8 | 15.251 | 0.459 |
| 9 | 17.279 | 0.463 |
| 10 | 17.208 | 0.523 |
| 11 | 17.986 | 0.556 |
| 12 | 19.469 | 0.565 |
| 13 | 21.090 | 0.569 |
| 14 | 22.807 | 0.570 |
| 15 | 23.729 | 0.590 |
| 16 | 23.292 | 0.644 |
| 17 | 24.540 | 0.652 |
| 18 | 25.185 | 0.675 |
| 19 | 26.393 | 0.682 |
| 20 | 26.243 | 0.724 |
| 21 | 24.450 | 0.818 |
• In general, why are distinct elements counted?
• What if we have one KMV sketch for a multiset A and one for multiset B? Can we combine them?
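The estimates in the example can be recomputed directly from the listed hash values, using the estimator (k-1)/U(k). This is a small sketch; the function name `kmv_estimate` is made up, and the values are the rounded ones printed above, so results agree with the table only up to rounding.

```python
# Sorted hash values of the characters a-z, as listed in the example.
hash_values = [0.043, 0.172, 0.281, 0.354, 0.382, 0.421, 0.443, 0.459, 0.463,
               0.523, 0.556, 0.565, 0.569, 0.57, 0.59, 0.644, 0.652, 0.675,
               0.682, 0.724, 0.818, 0.864, 0.89, 0.938, 0.994, 0.997]


def kmv_estimate(sorted_hashes, k):
    """Estimator (k - 1) / U_(k); U_(k) is the k-th smallest hash value."""
    return (k - 1) / sorted_hashes[k - 1]


for k in (2, 5, 10, 21):
    print(k, round(kmv_estimate(hash_values, k), 3))
```

The true number of distinct values is 26; the estimates fluctuate for small k and settle in the mid-twenties as k grows.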
(Multiset) Union of Partitions
• Combine KMV synopses: $L = L_A \oplus L_B$
– Take the union of the values and consider again the k smallest ones
• Theorem: L is a KMV synopsis of $A \cup B$
(Figure: the k-min values of $L_A$ and of $L_B$ on [0,1], and below them the k-min values of the combined synopsis L up to $U_{(k)}$)
Slide based on PPT slides from Beyer et al. ‘07
(Multiset) Intersection of Partitions
• $L = L_A \oplus L_B$ as before (union, keeping the k smallest values): contains k elements
– L corresponds to a uniform random sample of DVs in $A \cup B$
• K = # values in L that are also in $D(A \cap B)$
– Theorem: K can be computed from $L_A$ and $L_B$ alone
• K/k estimates the Jaccard coefficient: $K/k$ estimates $|D(A \cap B)| / |D(A \cup B)|$
• Unbiased estimator of #DVs in the intersection: $\hat{D}_\cap = \frac{K}{k} \cdot \frac{k-1}{U_{(k)}}$
D(set) = distinct values
Slide based on PPT slides from Beyer et al. ‘07
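Both combination steps can be sketched on plain lists of hash values. This is a minimal illustration, assuming both synopses are non-empty, were built with the same hash function, and have the same size k; the function names and the example values are made up. K is counted as the number of values of L that occur in both input synopses.

```python
def kmv_union(la, lb, k):
    """Combined synopsis: the k smallest values of the union of L_A and L_B."""
    return sorted(set(la) | set(lb))[:k]


def kmv_intersection_estimate(la, lb, k):
    """Return (Jaccard estimate K/k, estimated #DVs in the intersection)."""
    l = kmv_union(la, lb, k)
    in_both = set(la) & set(lb)
    big_k = sum(1 for u in l if u in in_both)   # K: values of L in both inputs
    k_eff = len(l)
    jaccard = big_k / k_eff
    union_estimate = (k_eff - 1) / l[-1]        # (k - 1) / U_(k) on the union
    return jaccard, jaccard * union_estimate


la = [0.1, 0.2, 0.4, 0.5]   # hypothetical synopsis of A
lb = [0.2, 0.3, 0.5, 0.6]   # hypothetical synopsis of B

print(kmv_union(la, lb, 4))                  # [0.1, 0.2, 0.3, 0.4]
print(kmv_intersection_estimate(la, lb, 4))  # Jaccard 0.25, intersection about 1.875
```

Only the value 0.2 of the combined synopsis occurs in both inputs, so K = 1 and K/k = 0.25; the union estimate is 3/0.4 = 7.5, which gives about 1.875 distinct values in the intersection.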
Literature
• Graham Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55(1): 58-75 (2005)
• Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209 (1985)
• Kevin S. Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla: On synopses for distinct-value estimation under multiset operations. SIGMOD Conference 2007: 199-210