Karbon Insight: Realtime Reporting
Introduction to ad serving
Video player
Ad player
Distributor Tracker
Event tracking
• View (event ID 127)
• Click (event ID 128)
• ...and many more
What do our customers want?
• Any report they can dream up
• Right away!
Simple report: hour by ad and event
Realtime reporting
Multidimensional OLAP cube
Ad
Event
Time
ROLAP with star schema
Disadvantages of ROLAP
• Slow queries
• Lots of joins
• Expensive to scale
• SQL limitations
MOLAP to the rescue!
What is a counter?
You can’t always get what you want...
• Time
• Event
• Ad
• Device
• Category
• Location
• Tag
• Demography
Possible report dimensions
Many counters
8 dimensions, average size of 50
50^8 counters! (39 trillion)
Average campaign length:
21 days(504 hours)
Time flies like a banana
21 days = 39 trillion counters
42 days -> 78 trillion
84 days -> 156 trillion
365 days -> 677 trillion
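The arithmetic behind these figures is easy to check: 8 dimensions with an average of 50 values each give 50^8 possible counters per campaign, and the total then grows linearly with retained history. A minimal sketch (the 21-day campaign length is from the slides):

```java
public class CounterGrowth {
    // 8 report dimensions with an average of 50 values each
    // yields 50^8 possible counter combinations.
    static long countersPerCampaign() {
        long total = 1;
        for (int i = 0; i < 8; i++) total *= 50;
        return total; // 39,062,500,000,000 ~= 39 trillion
    }

    // Counter count grows linearly with retained history:
    // each 21 days of data costs one campaign's worth of counters.
    static double countersForDays(int days) {
        return countersPerCampaign() * (days / 21.0);
    }

    public static void main(String[] args) {
        System.out.println(countersPerCampaign());    // 39062500000000
        System.out.println(countersForDays(365));     // ~6.8e14 (hundreds of trillions)
        System.out.println(countersForDays(5 * 365)); // ~3.4e15 (quadrillions)
    }
}
```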
5 years down the road
3.39 quadrillion
3.39 quadrillion is a rather large number indeed
Number of stars in 7,500 galaxies like the Milky Way.
15% of the surveyed universe!
But you might just get what you need!
Fake it till you can make it
Don’t aggregate anything until they ask for it!
• Time period
• By hour
• And ad
• Views
• Clicks
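Such an on-demand aggregate definition might be modeled like this; the class and field names are hypothetical, not the production code:

```java
import java.util.List;

// Hypothetical model of an aggregate definition: counters for a report
// shape are only maintained once a customer has asked for that shape.
public class AggregateDefinition {
    final String id;               // e.g. "adef1"
    final List<String> dimensions; // e.g. ["ad"]
    final List<Integer> eventIds;  // e.g. [127, 128] (view, click)
    final String granularity;      // e.g. "hour"

    AggregateDefinition(String id, List<String> dimensions,
                        List<Integer> eventIds, String granularity) {
        this.id = id;
        this.dimensions = dimensions;
        this.eventIds = eventIds;
        this.granularity = granularity;
    }

    public static void main(String[] args) {
        AggregateDefinition adef = new AggregateDefinition(
                "adef1", List.of("ad"), List.of(127, 128), "hour");
        System.out.println(adef.id + " by " + adef.granularity); // adef1 by hour
    }
}
```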
Counter Storage
Why Cassandra?
• Fast writes
• Linear scaling
• Battle-hardened
• (Relatively) simple operations
• Great community!
[Architecture diagram: Trackers, Flushers, Aggregators, and Mergers connected through RabbitMQ queues (live00 ... live31, flush00 ... flush31, counter00 ... counter31), writing to Cassandra]
Our setup
• DataStax CE 1.1.9
• 18-node cluster
• 1 datacentre
Data model
• 1 keyspace (RF: 3)
• 1 column family
• Leveled compaction
Row keys
aggregate definition ID | dimension values | time granularity
adef1|(ad1:127)|hour
adef1|(ad1:128)|hour
adef1|(ad2:127)|hour
...
adef1|(ad5:128)|day
Example row keys
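The row key format above (aggregate definition ID, dimension values, and time granularity, joined with `|`) can be sketched as follows; the helper name is illustrative, not the production code:

```java
public class RowKeys {
    // Build a row key of the form shown above:
    // <aggregate definition ID>|(<dimension value>:<event ID>)|<granularity>
    static String rowKey(String adefId, String dimensionValue,
                         int eventId, String granularity) {
        return adefId + "|(" + dimensionValue + ":" + eventId + ")|" + granularity;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("adef1", "ad1", 127, "hour")); // adef1|(ad1:127)|hour
    }
}
```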
Columns
time value -> counter
transaction ID -> id
2013-09-10.18 -> 6348
txID -> 876219102
Example columns
2013-09-10.19 -> 9784
total -> 6348
txID -> 876219102
Columns for rows with no time aggregation
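The two column layouts above can be sketched as plain maps, one row per map; the column names follow the examples, but the map-based representation is purely illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CounterRow {
    // Hourly row: one column per time bucket -> counter value, plus a
    // txID column recording the last applied transaction (used to skip
    // duplicate increments on replay).
    static Map<String, Long> hourlyRow() {
        Map<String, Long> columns = new LinkedHashMap<>();
        columns.put("2013-09-10.18", 6348L);
        columns.put("2013-09-10.19", 9784L);
        columns.put("txID", 876219102L);
        return columns;
    }

    // Rows with no time aggregation keep a single running total instead.
    static Map<String, Long> totalRow() {
        Map<String, Long> columns = new LinkedHashMap<>();
        columns.put("total", 6348L);
        columns.put("txID", 876219102L);
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(hourlyRow()); // {2013-09-10.18=6348, 2013-09-10.19=9784, txID=876219102}
    }
}
```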
Reading counters
Build row key: adef1|(ad1:127)|hour
Prepare query:
keyspace
    .prepareQuery(columnFamily)
    .getKey(rowKey)
Column range:
2013-09-10.17 ... 2013-09-10.23
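Enumerating the hourly column names for such a range is straightforward; a sketch using the bucket format from the examples (the method name is illustrative):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class ColumnRange {
    static final DateTimeFormatter HOUR_BUCKET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd.HH");

    // Enumerate hourly column names between two instants (inclusive),
    // matching the "2013-09-10.17" bucket format used above.
    static List<String> hourlyColumns(LocalDateTime from, LocalDateTime to) {
        List<String> names = new ArrayList<>();
        for (LocalDateTime t = from; !t.isAfter(to); t = t.plusHours(1)) {
            names.add(t.format(HOUR_BUCKET));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> range = hourlyColumns(
                LocalDateTime.of(2013, 9, 10, 17, 0),
                LocalDateTime.of(2013, 9, 10, 23, 0));
        System.out.println(range.get(0) + " ... " + range.get(range.size() - 1));
        // 2013-09-10.17 ... 2013-09-10.23
    }
}
```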
Execute query asynchronously
Get column value: first byte is the counter type
(long, double, HyperLogLog)
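A hedged sketch of such a type-tagged encoding for long counters; the slides only say the first byte identifies the type, so the tag values here are invented:

```java
import java.nio.ByteBuffer;

public class CounterValue {
    // Hypothetical type tags; the actual byte values used in
    // production are not given in the slides.
    static final byte TYPE_LONG = 0;
    static final byte TYPE_DOUBLE = 1;

    // 1 type byte followed by the 8-byte big-endian long value.
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(9).put(TYPE_LONG).putLong(v).array();
    }

    static long decodeLong(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        if (buf.get() != TYPE_LONG) throw new IllegalArgumentException("not a long counter");
        return buf.getLong();
    }

    public static void main(String[] args) {
        System.out.println(decodeLong(encodeLong(6348))); // 6348
    }
}
```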
Writing counters
Flush shards
[Flush shards diagram: Flusher 1 handles shards 00-08, ..., Flusher 4 handles shards 24-32, feeding Cassandra]
Merge increment rows with read cache
Skip rows with the same transaction ID
Write rows in mutation batches (of 400)
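The three steps above can be sketched together: merge incoming increments against a read cache, skip rows whose transaction ID was already applied, and split what remains into mutation batches of 400. All names are hypothetical, and the read cache is reduced to a map from row key to last-applied transaction ID:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Merger {
    static final int BATCH_SIZE = 400;

    // Read cache: row key -> last applied transaction ID (illustrative).
    final Map<String, Long> lastTxId = new HashMap<>();

    // Apply increments (row key -> transaction ID), skipping any row
    // whose transaction ID was already applied, and return the accepted
    // row keys split into mutation batches of 400.
    List<List<String>> merge(Map<String, Long> increments) {
        List<String> accepted = new ArrayList<>();
        for (Map.Entry<String, Long> e : increments.entrySet()) {
            Long seen = lastTxId.get(e.getKey());
            if (seen != null && seen.equals(e.getValue())) continue; // duplicate, skip
            lastTxId.put(e.getKey(), e.getValue());
            accepted.add(e.getKey());
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < accepted.size(); i += BATCH_SIZE) {
            batches.add(accepted.subList(i, Math.min(i + BATCH_SIZE, accepted.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        Merger m = new Merger();
        Map<String, Long> inc = new HashMap<>();
        inc.put("adef1|(ad1:127)|hour", 876219102L);
        System.out.println(m.merge(inc)); // [[adef1|(ad1:127)|hour]]
        System.out.println(m.merge(inc)); // []  (duplicate transaction ID skipped)
    }
}
```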
Things we got wrong
Each CF has ~1 MB of heap overhead
Too many column families
Multi-tenancy FTW!
FAIL #1
CLI defaults to replication factor of 1!
Manual operations
Tools and automation FTW!
FAIL #2
No way to undo data loading
No snapshots
Automated snapshots FTW!
FAIL #3
Post-processing of queried data
Timezones
Store data in customer timezone
FAIL #4
10 TB of data
1,500 wps (writes per second)
40,000 rps (reads per second)
Q&A