Karbon Insight: Realtime Reporting
Introduction to ad serving
Video player
Ad player
Distributor Tracker
Event tracking
• View (event ID 127)
• Click (event ID 128)
• ...and many more
What do our customers want?
• Any report they can dream up
• Right away!
Simple report: hour by ad and event
Realtime reporting
Multidimensional OLAP cube
Ad
Event
Time
ROLAP with star schema
Disadvantages of ROLAP
• Slow queries
• Lots of joins
• Expensive to scale
• SQL limitations
MOLAP to the rescue!
What is a counter?
You can’t always get what you want...
• Time
• Event
• Ad
• Device
• Category
• Location
• Tag
• Demography
Possible report dimensions
Many counters
8 dimensions, average size of 50
50^8 counters! (39 trillion)
Average campaign length:
21 days(504 hours)
Time flies like a banana
21 days = 39 trillion counters
42 days -> 78 trillion
84 days -> 156 trillion
365 days -> 677 trillion
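The arithmetic behind these figures is easy to check: 8 dimensions with an average of 50 values each give 50^8 possible counters per campaign, and the total then grows linearly with retained history. A minimal sketch (the 21-day campaign length is from the slides):

```java
public class CounterGrowth {
    // 8 report dimensions with an average of 50 values each
    // yields 50^8 possible counter combinations.
    static long countersPerCampaign() {
        long total = 1;
        for (int i = 0; i < 8; i++) total *= 50;
        return total; // 39,062,500,000,000 ~= 39 trillion
    }

    // Counter count grows linearly with retained history:
    // each 21 days of data costs one campaign's worth of counters.
    static double countersForDays(int days) {
        return countersPerCampaign() * (days / 21.0);
    }

    public static void main(String[] args) {
        System.out.println(countersPerCampaign());    // 39062500000000
        System.out.println(countersForDays(365));     // ~6.8e14 (hundreds of trillions)
        System.out.println(countersForDays(5 * 365)); // ~3.4e15 (quadrillions)
    }
}
```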
5 years down the road
3.39 quadrillion
3.39 quadrillion is a rather large number indeed
Number of stars in 7,500 galaxies like the Milky Way.
15% of the surveyed universe!
But you might just get what you need!
Fake it till you can make it
Don’t aggregate anything until they ask for it!
• Time period
• By hour
• And ad
• Views
• Clicks
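Such an on-demand aggregate definition might be modeled like this; the class and field names are hypothetical, not the production code:

```java
import java.util.List;

// Hypothetical model of an aggregate definition: counters for a report
// shape are only maintained once a customer has asked for that shape.
public class AggregateDefinition {
    final String id;               // e.g. "adef1"
    final List<String> dimensions; // e.g. ["ad"]
    final List<Integer> eventIds;  // e.g. [127, 128] (view, click)
    final String granularity;      // e.g. "hour"

    AggregateDefinition(String id, List<String> dimensions,
                        List<Integer> eventIds, String granularity) {
        this.id = id;
        this.dimensions = dimensions;
        this.eventIds = eventIds;
        this.granularity = granularity;
    }

    public static void main(String[] args) {
        AggregateDefinition adef = new AggregateDefinition(
                "adef1", List.of("ad"), List.of(127, 128), "hour");
        System.out.println(adef.id + " by " + adef.granularity); // adef1 by hour
    }
}
```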
Counter Storage
Why Cassandra?
• Fast writes
• Linear scaling
• Battle-hardened
• (Relatively) simple operations
• Great community!
[Architecture diagram: Trackers, Flushers, Aggregators, and Mergers connected through RabbitMQ queues (live00 ... live31, flush00 ... flush31, counter00 ... counter31), writing to Cassandra]
Our setup
• DataStax CE 1.1.9
• 18-node cluster
• 1 datacentre
Data model
• 1 keyspace (RF: 3)
• 1 column family
• Leveled compaction
Row keys
aggregate definition ID | dimension values | time granularity
adef1|(ad1:127)|hour
adef1|(ad1:128)|hour
adef1|(ad2:127)|hour
...
adef1|(ad5:128)|day
Example row keys
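The row key format above (aggregate definition ID, dimension values, and time granularity, joined with `|`) can be sketched as follows; the helper name is illustrative, not the production code:

```java
public class RowKeys {
    // Build a row key of the form shown above:
    // <aggregate definition ID>|(<dimension value>:<event ID>)|<granularity>
    static String rowKey(String adefId, String dimensionValue,
                         int eventId, String granularity) {
        return adefId + "|(" + dimensionValue + ":" + eventId + ")|" + granularity;
    }

    public static void main(String[] args) {
        System.out.println(rowKey("adef1", "ad1", 127, "hour")); // adef1|(ad1:127)|hour
    }
}
```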
Columns
time value -> counter
transaction ID -> id
2013-09-10.18 -> 6348
txID -> 876219102
Example columns
2013-09-10.19 -> 9784
total -> 6348
txID -> 876219102
Columns for rows with no time aggregation
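The two column layouts above can be sketched as plain maps, one row per map; the column names follow the examples, but the map-based representation is purely illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CounterRow {
    // Hourly row: one column per time bucket -> counter value, plus a
    // txID column recording the last applied transaction (used to skip
    // duplicate increments on replay).
    static Map<String, Long> hourlyRow() {
        Map<String, Long> columns = new LinkedHashMap<>();
        columns.put("2013-09-10.18", 6348L);
        columns.put("2013-09-10.19", 9784L);
        columns.put("txID", 876219102L);
        return columns;
    }

    // Rows with no time aggregation keep a single running total instead.
    static Map<String, Long> totalRow() {
        Map<String, Long> columns = new LinkedHashMap<>();
        columns.put("total", 6348L);
        columns.put("txID", 876219102L);
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(hourlyRow()); // {2013-09-10.18=6348, 2013-09-10.19=9784, txID=876219102}
    }
}
```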
Reading counters
Build row key: adef1|(ad1:127)|hour
Prepare query:
keyspace
    .prepareQuery(columnFamily)
    .getKey(rowKey)
Column range:
2013-09-10.17 ... 2013-09-10.23
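Enumerating the hourly column names for such a range is straightforward; a sketch using the bucket format from the examples (the method name is illustrative):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class ColumnRange {
    static final DateTimeFormatter HOUR_BUCKET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd.HH");

    // Enumerate hourly column names between two instants (inclusive),
    // matching the "2013-09-10.17" bucket format used above.
    static List<String> hourlyColumns(LocalDateTime from, LocalDateTime to) {
        List<String> names = new ArrayList<>();
        for (LocalDateTime t = from; !t.isAfter(to); t = t.plusHours(1)) {
            names.add(t.format(HOUR_BUCKET));
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> range = hourlyColumns(
                LocalDateTime.of(2013, 9, 10, 17, 0),
                LocalDateTime.of(2013, 9, 10, 23, 0));
        System.out.println(range.get(0) + " ... " + range.get(range.size() - 1));
        // 2013-09-10.17 ... 2013-09-10.23
    }
}
```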
Execute query asynchronously
Get column value: first byte is the counter type
(long, double, HyperLogLog)
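A hedged sketch of such a type-tagged encoding for long counters; the slides only say the first byte identifies the type, so the tag values here are invented:

```java
import java.nio.ByteBuffer;

public class CounterValue {
    // Hypothetical type tags; the actual byte values used in
    // production are not given in the slides.
    static final byte TYPE_LONG = 0;
    static final byte TYPE_DOUBLE = 1;

    // 1 type byte followed by the 8-byte big-endian long value.
    static byte[] encodeLong(long v) {
        return ByteBuffer.allocate(9).put(TYPE_LONG).putLong(v).array();
    }

    static long decodeLong(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        if (buf.get() != TYPE_LONG) throw new IllegalArgumentException("not a long counter");
        return buf.getLong();
    }

    public static void main(String[] args) {
        System.out.println(decodeLong(encodeLong(6348))); // 6348
    }
}
```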
Writing counters
Flush shards
[Flush shards diagram: Flusher 1 handles shards 00-08, ..., Flusher 4 handles shards 24-32, feeding Cassandra]
Merge increment rows with read cache
Skip rows with the same transaction ID
Write rows in mutation batches (of 400)
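The three steps above can be sketched together: merge incoming increments against a read cache, skip rows whose transaction ID was already applied, and split what remains into mutation batches of 400. All names are hypothetical, and the read cache is reduced to a map from row key to last-applied transaction ID:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Merger {
    static final int BATCH_SIZE = 400;

    // Read cache: row key -> last applied transaction ID (illustrative).
    final Map<String, Long> lastTxId = new HashMap<>();

    // Apply increments (row key -> transaction ID), skipping any row
    // whose transaction ID was already applied, and return the accepted
    // row keys split into mutation batches of 400.
    List<List<String>> merge(Map<String, Long> increments) {
        List<String> accepted = new ArrayList<>();
        for (Map.Entry<String, Long> e : increments.entrySet()) {
            Long seen = lastTxId.get(e.getKey());
            if (seen != null && seen.equals(e.getValue())) continue; // duplicate, skip
            lastTxId.put(e.getKey(), e.getValue());
            accepted.add(e.getKey());
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < accepted.size(); i += BATCH_SIZE) {
            batches.add(accepted.subList(i, Math.min(i + BATCH_SIZE, accepted.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        Merger m = new Merger();
        Map<String, Long> inc = new HashMap<>();
        inc.put("adef1|(ad1:127)|hour", 876219102L);
        System.out.println(m.merge(inc)); // [[adef1|(ad1:127)|hour]]
        System.out.println(m.merge(inc)); // []  (duplicate transaction ID skipped)
    }
}
```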
Things we got wrong
Each CF has ~1 MB of heap overhead
Too many column families
Multi-tenancy FTW!
FAIL #1
CLI defaults to replication factor of 1!
Manual operations
Tools and automation FTW!
FAIL #2
No way to undo data loading
No snapshots
Automated snapshots FTW!
FAIL #3
Post-processing of queried data
Timezones
Store data in customer timezone
FAIL #4
10 TB of data
1,500 wps (writes per second)
40,000 rps (reads per second)
Q&A