Generating Wide-Area Content-Based Publish/Subscribe Workloads
Albert Yu, Pankaj K. Agarwal, Jun Yang
Duke University
Overview
• Publish/Subscribe systems
• Data extraction
• Workload generation
• Conclusion and future work
Publish/Subscribe
[Figure: publishers, brokers, and subscribers connected through a broker network]
Two tasks
• Subscription processing
◦ Match and process each published event against a large set of subscriptions.
• Notification dissemination
◦ Notify the interested subscribers over a network.
Event and network spaces
• Event space
◦ An event is a point.
◦ A subscription defines a region (e.g., a rectangle).
• Network space
◦ A network location is a point.
◦ The distance between two network locations approximates the latency between them.
[Figure: events e1 to e4 and subscriptions S1 and S2 in the event space]
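As a minimal sketch of the event-space model on this slide, the following Python treats each event as a point and each subscription as an axis-aligned rectangle, and finds matches by brute force. The names `Rect`, `matches`, and `match_event`, and all coordinates, are illustrative assumptions and do not come from the talk.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle given by inclusive lower and upper corners."""
    lo: Tuple[float, ...]
    hi: Tuple[float, ...]


def matches(sub: Rect, event: Sequence[float]) -> bool:
    """True if the event point lies inside the subscription rectangle."""
    return all(l <= x <= h for l, x, h in zip(sub.lo, event, sub.hi))


def match_event(subs: Dict[str, Rect], event: Sequence[float]) -> List[str]:
    """Names of all subscriptions whose region contains the event."""
    return sorted(name for name, rect in subs.items() if matches(rect, event))


# Made-up coordinates, chosen only to exercise the matching logic.
subs = {"S1": Rect((0.0, 0.0), (2.0, 2.0)), "S2": Rect((1.0, 1.0), (4.0, 3.0))}
events = {"e1": (0.5, 0.5), "e2": (1.5, 1.5), "e3": (3.0, 2.0), "e4": (5.0, 5.0)}
matched = {name: match_event(subs, point) for name, point in events.items()}
```

A real subscription-processing engine would replace the linear scan with a spatial index; the scan is used here only to keep the semantics visible.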
Lack of publicly available, realistic workloads
• Privacy concerns and commercial interests
• Lack of widely deployed systems supporting powerful content-based subscriptions
Goal
• Collect the various statistics that are publicly available, limited as they are.
• Generate a workload consistent with these statistics.
• Generate other workloads according to user-defined deviations.
Workload components
• A set of subscriptions, each of which corresponds to:
◦ A rectangular region of interest in the event space
◦ A point in the network space
• An event distribution over the event space
• A set of brokers (optional)
◦ A point in the network space
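The components listed on this slide could be held in a structure like the one below. This is a sketch under assumptions: the class and field names (`Subscription`, `Workload`, `sample_event`, `nearest_broker`) are hypothetical, the event distribution is represented as a sampling function, and Euclidean distance in the network space stands in for latency as described earlier in the talk.

```python
import random
from dataclasses import dataclass, field
from math import dist
from typing import Callable, List, Tuple

Point = Tuple[float, ...]


@dataclass
class Subscription:
    lo: Point        # lower corner of the rectangle in the event space
    hi: Point        # upper corner of the rectangle in the event space
    location: Point  # point in the network space


@dataclass
class Workload:
    subscriptions: List[Subscription]
    sample_event: Callable[[], Point]                   # draws one event from the event distribution
    brokers: List[Point] = field(default_factory=list)  # optional broker locations


def nearest_broker(workload: Workload, sub: Subscription) -> int:
    """Index of the broker closest to the subscriber in the network space."""
    return min(range(len(workload.brokers)),
               key=lambda i: dist(workload.brokers[i], sub.location))


rng = random.Random(0)
workload = Workload(
    subscriptions=[Subscription((0.0, 0.0), (1.0, 1.0), (10.0, 0.0))],
    sample_event=lambda: (rng.random(), rng.random()),  # uniform events, purely illustrative
    brokers=[(0.0, 0.0), (12.0, 0.0)],
)
closest = nearest_broker(workload, workload.subscriptions[0])
```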
Motivation: Broker-subscriber assignment
How to assign subscribers to brokers?
• Balancing semantic similarity and network proximity in dissemination network design is a hard optimization problem.
• The optimal tradeoff depends on how many events match shared versus disjoint interests.
• Letting a broker handle subscribers that are far away
◦ Violates delivery latency requirements.
◦ Increases communication costs.
• Clustering subscribers with similar interests
◦ Potentially minimizes network traffic.
Motivation: Broker-subscriber assignment
• Take into account
◦ Subscription interest
◦ Subscription location
◦ Event distribution
• Exploring the correlation between event and network spaces provides more optimization opportunities.
Related work
• Characterizing pub/sub systems
◦ Properties of RSS feeds [Liu et al. ’05]
◦ Stock popularity in NYSE [Tock et al. ’05]
• Simple synthesized workloads
◦ Event space: uniform and Gaussian distributions [Baldoni et al. ’07]; Zipf distribution [Bianchi et al. ’07]
◦ Network space: subscribers are located uniformly or randomly in the network [Baldoni et al. ’07, Papaemmanouil and Cetintemel ’05]
Two work phases of our generator
• Data extraction
• Workload generation
Data extraction
[Figure: data extraction produces summary statistics of subscriber interests, locations, and events, which feed the data generator]
Data extraction (Cont’d)
[Figure: event space divided into cells]
For each cell:
• Subscription count
• Event count
• Distribution of subscribers over the network
Data extraction (Cont’d)
• Data from Google Groups
• Data from PlanetLab
• Our approach can be applied to other data sources that offer similar types of summary information.
Google Groups
• Google defines hierarchies over topics and regions.
• Tag each group with three attributes.
◦ Ex: Asian languages -> Eastern Asian languages -> Korean
• Treat topic and language as dimensions of the event space.
• Each interest
◦ Pair of topic and language.
Google Groups (Cont’d)
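[Figure: the language hierarchy (l1; l2, l3; l4 to l7) and the topic hierarchy (t1; t2, t3; t4 to t7). The leaves t4 to t7 and l4 to l7 form the two axes of the event space, and each cell is an interest, e.g., Interest = (t7, l4).]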
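To make the construction concrete, the sketch below builds the two hierarchies from the figure and enumerates the cells of the event space as (topic leaf, language leaf) pairs. The parent-child links (for example, t2 having children t4 and t5) are an assumption read off the figure layout, and the helper name `leaves` is hypothetical.

```python
from itertools import product

# Assumed parent -> children links, following the layout of the figure.
topic_children = {"t1": ["t2", "t3"], "t2": ["t4", "t5"], "t3": ["t6", "t7"]}
lang_children = {"l1": ["l2", "l3"], "l2": ["l4", "l5"], "l3": ["l6", "l7"]}


def leaves(children, node):
    """Leaves of the subtree rooted at node, in left-to-right order."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    return [leaf for kid in kids for leaf in leaves(children, kid)]


topic_leaves = leaves(topic_children, "t1")
lang_leaves = leaves(lang_children, "l1")

# Each cell of the event space is one interest: a (topic, language) pair.
cells = list(product(topic_leaves, lang_leaves))
```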
Google Groups (Cont’d)
• Collect a statistical summary for each interest
◦ # messages per month posted to groups associated with that interest.
◦ # members in each group associated with that interest.
Google Groups (Cont’d)
Divide all Google groups associated with the same interest by their geographic regions.
Count #members within each geographic region.
Rough indication of the distribution of subscribers over the network.
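A minimal sketch of this step, assuming each group record carries its interest, its geographic region, and its member count: sum the members per (interest, region) and normalize per interest. The record layout, the function name `region_distribution`, and the numbers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical group records: (interest, geographic region, # members).
groups = [
    (("soccer", "Korean"), "Asia", 300),
    (("soccer", "Korean"), "Asia", 100),
    (("soccer", "Korean"), "US", 100),
]


def region_distribution(records):
    """Per interest, the fraction of members found in each geographic region."""
    counts = defaultdict(lambda: defaultdict(int))
    for interest, region, members in records:
        counts[interest][region] += members
    result = {}
    for interest, by_region in counts.items():
        total = sum(by_region.values())
        result[interest] = {region: n / total for region, n in by_region.items()}
    return result


distribution = region_distribution(groups)
```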
Google Groups (Cont’d)
[Figure: event space divided into cells]
For each cell (interest):
• Subscription count
• Event count
• Distribution of subscribers by geographic region
Network Location
• Data from Google Groups gives us a rough distribution of subscriptions by geographic region.
• We still need actual network locations.
• PlanetLab nodes
◦ IP addresses
◦ Embed inter-node latencies in a low-dimensional Euclidean space [Dabek et al. ’04, Ledlie et al. ’02, Ng et al. ’02]
[Figure: Google Groups supplies geographic regions; PlanetLab nodes supply coordinates]
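One way the two sources might be combined is sketched below: a subscriber known only by geographic region is given the embedded coordinates of a node in that region, and the Euclidean distance between coordinates serves as the latency estimate. The slides do not say how a node is chosen within a region, so the uniform random choice is an assumption, as are the coordinates and the names `nodes_by_region`, `assign_location`, and `estimated_latency`.

```python
import random
from math import dist

# Hypothetical embedded coordinates of nodes, grouped by geographic region.
nodes_by_region = {
    "Asia": [(120.0, 40.0), (130.0, 35.0)],
    "US": [(0.0, 0.0), (30.0, 10.0)],
}


def assign_location(region, rng):
    """Pick a network location for a subscriber in the given region (assumed uniform)."""
    return rng.choice(nodes_by_region[region])


def estimated_latency(a, b):
    """Euclidean distance in the embedded space, used as a latency estimate."""
    return dist(a, b)


rng = random.Random(42)
location = assign_location("Asia", rng)
latency = estimated_latency((0.0, 0.0), (30.0, 10.0))
```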
Popularities of interests
• Removing the top 24 interests reduces # members from 8.1 million to 4.3 million.
• The top three are (business services, English), (small business, English), and (consulting, English).
[Figure: popularities of interests, with a "Super-interest" label]
Distribution of interests in event space for different geographic regions
[Figure: three panels (Asia, US, Europe), each marking Simplified Chinese and English in the event space; a "Super-interest" label is also shown]
Two work phases of our generator
• Data extraction
• Workload generation
Workload generation
[Figure: data extraction produces summary statistics of subscriber interests, locations, and events; workload generation turns them into a set of range subscriptions and a set of events]
Parameters of workload generation:
• Skewness parameter
• Interest generalization parameter
• Range perturbation parameter
• Workload size parameter
Workload generation
• Interest diffusion
• Interest generalization
• Categorical-to-range subscription conversion
• Workloads of different sizes
Workload generation
[Figure: subscription counts over the event space, shown with the original values (e.g., 10, 100, 300) and the adjusted values (e.g., 16.5, 97.5, 277.5)]
Workload generation
[Figure: topic hierarchy and language hierarchy]
Workload generation
[Figure: example interest (soccer, Korean)]
Interest diffusion
• Popularity of an interest = number of subscriptions in its subtree.
• Siblings of an interest are “related.”
• Reduce the popularity variance among the siblings.
[Figure: topic hierarchy with root t1, children t2 and t3, and leaves t4 to t7]
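The popularity definition can be sketched as a recursive sum over the hierarchy. The parent-child links are assumed from the figure layout, and the leaf subscription counts are made-up numbers used only for illustration.

```python
# Assumed parent -> children links for the topic hierarchy in the figure.
children = {"t1": ["t2", "t3"], "t2": ["t4", "t5"], "t3": ["t6", "t7"]}

# Made-up subscription counts at the leaves.
leaf_subscriptions = {"t4": 40, "t5": 10, "t6": 30, "t7": 20}


def popularity(node):
    """Number of subscriptions in the subtree rooted at node."""
    kids = children.get(node)
    if not kids:
        return leaf_subscriptions[node]
    return sum(popularity(kid) for kid in kids)


# Siblings t4 and t5 have popularities 40 and 10; diffusion would pull them closer.
sibling_popularities = [popularity(kid) for kid in children["t2"]]
```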
Interest diffusion
• Goal
◦ Given a user-specified value p, reduce all popularity variances by a factor of p (that is, multiply each variance by p) at all levels of granularity.
• Under the following constraints
◦ The total subscription count remains constant.
◦ Popularity of an interest = sum of its children's popularities.
[Figure: the topic hierarchy (t1; t2, t3; t4 to t7) and the language hierarchy (l1; l2, l3; l4 to l7) define the event space. Subscription counts are shown for the 4 x 4 grid of leaf cells (t4 to t7 by l4 to l7) and, aggregated, for the 2 x 2 grid of coarse cells (t2, t3 by l2, l3): 500, 230, 70, and 400.]
[Figure: subscription counts at the leaf level (t4 to t7 by l4 to l7) and at the coarse level (t2, t3 by l2, l3): 500, 230, 70, 400]
Mean: $(500 + 230 + 70 + 400) / 4 = 300$
Variance: $[(500 - 300)^2 + (230 - 300)^2 + (70 - 300)^2 + (400 - 300)^2] / 4 = 26950$
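The same arithmetic can be checked with Python's standard library. Note that the slide divides by the number of cells (4), so the population variance is the matching function, not the sample variance.

```python
from statistics import fmean, pvariance

coarse_counts = [500, 230, 70, 400]

mean = fmean(coarse_counts)          # arithmetic mean of the four coarse cells
variance = pvariance(coarse_counts)  # population variance: squared deviations divided by n
```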
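[Figure: leaf-level and coarse-level subscription counts, annotated with the mean and variance of each group of siblings]
• Coarse cells (500, 230, 70, 400): mean 300, variance 26950
• Children of the cell with count 500: mean 125, variance 11425
• Children of the cell with count 230: mean 57.5, variance 6768.75
• Children of the cell with count 70: mean 17.5, variance 68.75
• Children of the cell with count 400: mean 100, variance 3750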
Goal: Given a user-specified value p, reduce all popularity variances by a factor of p.
[Figure: the four coarse cells (t2, t3 by l2, l3) with old counts C1 to C4 and old mean C, and new counts C1* to C4* and new mean C*]
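One update rule that meets this goal is to scale each count's deviation from the mean by $\sqrt{p}$. This is a reconstruction rather than a formula stated on the slides, but it agrees with the deck's p = 0.81 example, where the count 500 with mean 300 becomes 480. Here $n$ is the number of siblings (4 in the example), and $C^{*}$ is fixed by the constraints: it equals $C$ at the coarsest level, and the parent's new count divided by $n$ below that.

```latex
C_i^{*} = C^{*} + \sqrt{p}\,\bigl(C_i - C\bigr), \qquad i = 1, \dots, n

\frac{1}{n}\sum_{i=1}^{n} C_i^{*} = C^{*} + \sqrt{p}\,\Bigl(\frac{1}{n}\sum_{i=1}^{n} C_i - C\Bigr) = C^{*}

\frac{1}{n}\sum_{i=1}^{n}\bigl(C_i^{*} - C^{*}\bigr)^{2} = p \cdot \frac{1}{n}\sum_{i=1}^{n}\bigl(C_i - C\bigr)^{2}
```

The second line shows that the new counts average to the new mean, so they sum to the required total; the third shows that the variance is multiplied by exactly p.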
[Figure: the same subscription counts, means, and variances as on the previous slide]
Proceed top-down from the coarsest level of granularity to the finest level of granularity.
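The top-down pass can be sketched as follows. The rule used here, scaling each sibling's deviation from the old mean by sqrt(p) around the new mean, is an assumption, chosen because it reproduces the numbers in the deck's p = 0.81 example; the function name `diffuse` is hypothetical. The grouping of the 16 leaf counts under the four coarse cells is inferred from the per-cell means and variances shown on the slides.

```python
from math import sqrt


def diffuse(old_children, new_parent_total, p):
    """Rescale sibling counts so their variance is multiplied by p
    and they sum to new_parent_total."""
    n = len(old_children)
    old_mean = sum(old_children) / n
    new_mean = new_parent_total / n
    return [new_mean + sqrt(p) * (c - old_mean) for c in old_children]


# Leaf counts keyed by the old count of their coarse cell (inferred grouping).
leaf_groups = {
    500: [10, 100, 90, 300],
    230: [200, 10, 10, 10],
    70: [10, 10, 30, 20],
    400: [50, 200, 50, 100],
}
p = 0.81

# Coarsest level first: the grand total stays constant.
coarse_old = list(leaf_groups)
coarse_new = diffuse(coarse_old, sum(coarse_old), p)

# Then the finer level: each parent's new count becomes its children's total.
leaves_new = {
    old: diffuse(kids, new, p)
    for (old, kids), new in zip(leaf_groups.items(), coarse_new)
}
```

With p = 0.81 the coarse counts come out near 480, 237, 93, and 390, and the children of the first cell near 16.5, 97.5, 88.5, and 277.5, up to floating-point rounding. Fractional counts are expected at this stage.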
[Figure: diffusion at the coarse level with p = 0.81; the leaf-level counts, means, and variances are unchanged from the previous slide]
• Coarse counts before diffusion: 500, 230, 70, 400
• Coarse counts after diffusion: 480, 237, 93, 390
• New means for the children of each coarse cell: 120, 59.25, 23.25, 97.5
[Figure: diffusion at the leaf level with p = 0.81, following the coarse-level step (500, 230, 70, 400 become 480, 237, 93, 390)]
• Leaf counts after diffusion: 16.5 (six cells), 25.5, 34.5, 52.5 (two cells), 88.5, 97.5 (two cells), 187.5 (two cells), 277.5
• Variances among siblings before diffusion: 11425, 6768.75, 68.75, 3750
• Variances among siblings after diffusion: 9254.25, 5482.69, 55.6875, 3037.5
Along the language dimension
[Figure: before diffusion vs. after diffusion]
Along the topic dimension
[Figure: before diffusion vs. after diffusion]
Conclusion and Future work
• Make the best of the limited amount of publicly available information to generate realistic workloads.
• Make deviations easy for users to understand and control.
• Extensions
◦ Changes to event distributions and subscriptions over time.
◦ Subscriptions beyond multi-dimensional range predicates.
◦ Statistical models.
Thank you