Apache Cassandra at Videoplaza (Stockholm Cassandra Users, September 2013)
Posted on 29-Nov-2014
Description: Josh Glover, Software Engineer at Videoplaza, will introduce you to the domain of video advertising and show how Videoplaza uses Apache Cassandra as part of a system that solves the difficult problem of letting clients analyse the performance of their advertising campaigns in real time. Videoplaza needs to aggregate data for tens of thousands of combinations of dimensions and metrics for hundreds of clients from an incoming stream of thousands of requests per second, and do it fast enough that clients can see trends as they happen.
1. Karbon Insight: Realtime Reporting
2. Introduction to ad serving: Video player, Ad player, Distributor, Tracker
3. Event tracking: View (event ID 127), Click (event ID 128), and many more
4. What do our customers want? Any report they can dream up. Right away!
5. Simple report: hour by ad and event
6. Realtime reporting: Multidimensional OLAP cube (Ad, Event, Time)
7. ROLAP with star schema
8. Disadvantages of ROLAP: Slow queries, Lots of joins, Expensive to scale, SQL limitations
9. MOLAP to the rescue!
10. What is a counter?
11. You can't always get what you want...
12. Possible report dimensions: Time, Event, Ad, Device, Category, Location, Tag, Demography
13. Many counters: 8 dimensions, average size of 50. 50^8 counters! (39 trillion)
14. Average campaign length: 21 days (504 hours)
15. Time flies like a banana: 21 days = 39 trillion counters; 42 days -> 78 trillion; 84 days -> 156 trillion; 365 days -> 677 trillion
16. 5 years down the road: 3.39 quadrillion
17. 3.39 quadrillion is a rather large number indeed: Number of stars in 7500 galaxies like the Milky Way. 15% of the surveyed universe!
18. But you might just get what you need!
19. Fake it till you can make it: Don't aggregate anything until they ask for it!
20. Time period; By hour; And ad; Views; Clicks
21. Counter Storage
22. Why Cassandra? Fast writes, Linear scaling, Battle-hardened, (Relatively) simple operations, Great community!
23. [Diagram] Tracker, Flusher, Aggregator, Merger; live00 ... live31; RabbitMQ; flush00 ... flush31; counter00 ... counter31; Cassandra
24. Our setup: DataStax CE 1.1.9, 18 node cluster, 1 datacentre
25. Data model: 1 keyspace (RF: 3), 1 column family, Leveled compaction
26. Row keys: aggregate definition ID, dimension values, time granularity
27. Example row keys: adef1|(ad1:127)|hour, adef1|(ad1:128)|hour, adef1|(ad2:127)|hour, ..., adef1|(ad5:128)|day
28. Columns: time value -> counter; transaction ID -> id
29. Example columns: 2013-09-10.18 -> 6348; 2013-09-10.19 -> 9784; txID -> 876219102
30. Columns for rows with no time aggregation: total -> 6348; txID -> 876219102
31. Reading counters
32. Build row key: adef1|(ad1:127)|hour
33. Prepare query: keyspace.prepareQuery(columnFamily).getKey(rowKey)
34. Column ranges: 2013-09-10.17 ... 2013-09-10.23
35. Execute query asynchronously
36. Get column value: First byte is counter type (long, double, HyperLogLog)
37. Writing counters
38. Flush shards: Flusher 1 (shards 00-08) ... Flusher 4 (shards 24-32) -> Cassandra
39. Merge increment rows with read cache: Skip rows with the same transaction ID
40. Write rows in mutation batches (of 400)
41. Things we got wrong
42. FAIL #1: Too many column families. Each CF has 1 MB heap overhead. Multi-tenancy FTW!
43. FAIL #2: Manual operations. CLI defaults to replication factor of 1! Tools and automation FTW!
44. FAIL #3: No snapshots. No way to undo data loading. Automated snapshots FTW!
45. FAIL #4: Timezones. Post-processing of queried data. Store data in customer timezone.
46. 10 TB of data; 1500 wps; 40,000 rps
47. Q&A
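
The counter arithmetic on slides 13 to 16 can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the linear scaling from the rounded 21-day figure of 39 trillion is an assumption read off slide 15, not something the slides state explicitly.

```python
# Sketch of the counter-count arithmetic from slides 13 to 16.
# All names here are hypothetical; only the numbers come from the slides.

TRILLION = 10**12


def full_cube_counters(dimensions=8, average_size=50):
    """Counters needed to pre-aggregate every combination (slide 13)."""
    return average_size ** dimensions


def scaled_counters(days, base_counters=39 * TRILLION, base_days=21):
    """Assumed linear growth from the rounded 21-day figure (slide 15)."""
    return base_counters * days // base_days


def in_trillions(count):
    """Whole trillions, truncated the way the slide figures appear to be."""
    return count // TRILLION


if __name__ == "__main__":
    print(full_cube_counters())  # 39062500000000, i.e. about 39 trillion
    for days in (21, 42, 84, 365, 5 * 365):
        print(days, "days ->", in_trillions(scaled_counters(days)), "trillion")
```

The five-year figure comes out at roughly 3,389 trillion, which matches the 3.39 quadrillion on slide 16 after rounding.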
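
The read and write paths on slides 26 to 40 can be illustrated with an in-memory stand-in for the column family. This is a rough sketch, not Videoplaza's code: a plain dict replaces Cassandra, every function name is hypothetical, and the reading of slide 39 (an increment is skipped when the stored row already carries the same transaction ID) is an assumption.

```python
from datetime import datetime, timedelta

TX_COLUMN = "txID"  # column name as shown on slide 29


def build_row_key(aggregate_id, dimension_values, granularity):
    """Join the three row-key parts (slide 26) in the layout of slide 27."""
    dims = ":".join(str(value) for value in dimension_values)
    return f"{aggregate_id}|({dims})|{granularity}"


def hour_columns(first, last):
    """Hourly column names between two datetimes, inclusive (slide 34)."""
    names = []
    current = first
    while current <= last:
        names.append(current.strftime("%Y-%m-%d.%H"))
        current += timedelta(hours=1)
    return names


def read_counters(store, row_key, first_column, last_column):
    """Return the columns of one row whose names fall in the given range."""
    row = store.get(row_key, {})
    return {
        name: row[name]
        for name in sorted(row)
        if first_column <= name <= last_column
    }


def merge_increment(store, row_key, increments, tx_id):
    """Apply increments unless the row already carries this transaction ID.

    Returns True if the increment was applied, False if it was skipped.
    """
    row = store.setdefault(row_key, {})
    if row.get(TX_COLUMN) == tx_id:
        return False  # assumed meaning of "skip rows with the same transaction ID"
    for column, delta in increments.items():
        row[column] = row.get(column, 0) + delta
    row[TX_COLUMN] = tx_id
    return True


def batches(items, size=400):
    """Split a list of rows into mutation batches (slide 40 uses 400)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    store = {}
    key = build_row_key("adef1", ["ad1", 127], "hour")
    merge_increment(store, key, {"2013-09-10.18": 6348}, 876219102)
    merge_increment(store, key, {"2013-09-10.18": 6348}, 876219102)  # skipped
    print(key, read_counters(store, key, "2013-09-10.17", "2013-09-10.23"))
```

Because the hourly column names sort lexicographically in time order, a simple string range is enough to select a time window, and the `txID` column falls outside any such range.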