Data Lessons Learned at Scale

DESCRIPTION
Data lessons learned at scale while building AddThis' data processing system.

TRANSCRIPT
Charlie Reverte, VP Engineering
@numbakrrunch
Topic
Half of the work it takes to do data science is plumbing and wrangling.
Here are some lessons we’ve learned...
About AddThis
We make tools for websites
Our Data
We process website data:
● Visitation
● Sharing
● Following
● Content Classification

And use it to improve the site:
● Content Recommendation
● Personalization
● Analytics
At Scale...
● 14 million domains
● 100 billion views/month
● 50k events/sec
● 160k concurrent firewall sessions
● 500k unique ganglia metrics
Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
Hex: 4f6934b6f54bd7c1
Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
○ Birthday bound: C(m, 2) / n = (35,000 × 34,999 / 2) / 2^32 ≈ 0.142 collisions/sec (at m = 35k rq/sec, n = 2^32 random values)
● Naturally time ordered, with a built-in date of birth (DoB)
Compare to Twitter Snowflake: https://github.com/twitter/snowflake/
[Bit layout: time in bits 63–32, rand in bits 31–0]
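Below is a minimal sketch of this scheme (our illustration, not AddThis' production code), assuming 32 bits of wall-clock seconds in the high half and 32 random bits in the low half:

```java
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Base64;

// Illustrative session ID generator: time in bits 63-32, rand in bits 31-0.
public final class SessionId {
    private static final SecureRandom RNG = new SecureRandom();

    public static long next() {
        long seconds = System.currentTimeMillis() / 1000L; // time-ordered prefix
        long rand = RNG.nextInt() & 0xFFFFFFFFL; // disambiguates IDs minted in the same second
        return (seconds << 32) | rand;
    }

    public static void main(String[] args) {
        long id = next();
        byte[] bytes = ByteBuffer.allocate(8).putLong(id).array();
        System.out.printf("Hex:    %016x%n", id);
        System.out.println("Base64: " +
                Base64.getUrlEncoder().withoutPadding().encodeToString(bytes));
    }
}
```

Because the timestamp leads, IDs sort chronologically and the creation time can be read straight out of the top 32 bits.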
Counting Things
● Cardinality
● Set membership
● Top-k elements
● Frequency

● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus (see the sketch below)
○ Distributed counting
○ Checkpointing
Stream-lib: https://github.com/clearspring/stream-lib
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
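As a concrete example, here is a hedged sketch using stream-lib's HyperLogLog (linked above); the offer/merge/cardinality calls reflect the library's public API as we understand it, so check exact signatures against the repo:

```java
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

// Mergeable cardinality estimation: two nodes count unique IDs
// independently, then their sketches are merged -- the property that
// makes distributed counting and checkpointing cheap.
public class DistributedCount {
    public static void main(String[] args) throws CardinalityMergeException {
        HyperLogLog nodeA = new HyperLogLog(14); // 2^14 registers, ~0.8% error
        HyperLogLog nodeB = new HyperLogLog(14);

        for (int i = 0; i < 100_000; i++) nodeA.offer("uid-" + i);
        for (int i = 50_000; i < 150_000; i++) nodeB.offer("uid-" + i); // 50k overlap

        ICardinality merged = nodeA.merge(nodeB);
        System.out.println(merged.cardinality()); // ~150,000, not 200,000
    }
}
```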
Joining Data
● Value of data increases with higher dimensionality
○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest (see the sketch below)
○ Disk is cheap
● Join your data in client-side storage
○ Browsers as a lossy distributed database
● Oceans of data in the cloud...
“The value is in the join” (or something like that)
https://github.com/stewartoallen
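As a hypothetical illustration of ingest-time denormalization (RawEvent, EnrichedEvent, and Enricher are our stand-ins, not AddThis code):

```java
import java.util.Map;

// Join at ingest: look up the extra dimensions once, write a wide record.
record RawEvent(String uid, String url, String ip, long ts) {}
record EnrichedEvent(String uid, String url, long ts,
                     String country, String pageCategory) {}

class Enricher {
    private final Map<String, String> geoByIp;       // stand-in for a geo/IP lookup
    private final Map<String, String> categoryByUrl; // stand-in for classifier output

    Enricher(Map<String, String> geo, Map<String, String> cats) {
        this.geoByIp = geo;
        this.categoryByUrl = cats;
    }

    // Disk is cheap: store the joined, de-normalized row so queries never
    // have to re-join geo or page attributes.
    EnrichedEvent enrich(RawEvent e) {
        return new EnrichedEvent(e.uid(), e.url(), e.ts(),
                geoByIp.getOrDefault(e.ip(), "unknown"),
                categoryByUrl.getOrDefault(e.url(), "unclassified"));
    }
}
```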
Sharding and Sampling
● Choose your shard keys wisely
○ High-cardinality field to reduce lumpiness
○ What do you need to co-locate?
○ Storage is cheap: multiple copies?
● Shards also useful for sampling (see the sketch below)
○ Complete data subsets
● Can yield statistical significance
○ Depending on the question
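A toy sketch of shard-as-sample, assuming the user ID is the shard key (the shard count and helper names are illustrative):

```java
// Shard on a high-cardinality key so each shard is a complete,
// unbiased subset of users rather than a lumpy slice of traffic.
public class Sharding {
    static final int NUM_SHARDS = 64;

    static int shardOf(String userId) {
        // floorMod keeps the shard non-negative for negative hash codes;
        // a stronger hash (e.g. murmur) would spread keys more evenly.
        return Math.floorMod(userId.hashCode(), NUM_SHARDS);
    }

    // Reading only shard 0 gives a ~1.6% sample that still contains
    // every event for the users it covers -- a complete data subset.
    static boolean inSample(String userId) {
        return shardOf(userId) == 0;
    }
}
```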
Deployment
● Continuous deploy?
● Deploying our JavaScript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDoSed ourselves
○ Very interesting bugs
● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys
The Log
Jay Kreps: “Real-time data’s unifying abstraction”
● Centralized logging
● Loosely coupled consumers

Divide your dependencies:
● Synchronous: ZeroMQ (0mq)
● Asynchronous: Kafka (see the sketch below)

Distributed event logging
● Does determinism matter?

Log format durability?
● Protobuf?
http://bit.ly/thelog
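For the asynchronous half of that split, a minimal Kafka producer sketch (the broker address, topic name, and payload are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Publish events to a Kafka topic; loosely coupled consumers read the
// log at their own pace instead of being called synchronously.
public class EventLog {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by session ID keeps one session's events ordered on one partition.
            producer.send(new ProducerRecord<>("share-events",
                    "4f6934b6f54bd7c1", "{\"event\":\"share\",\"url\":\"...\"}"));
        }
    }
}
```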
Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● https://github.com/addthis/columncompressor
○ by @abramsm
[Diagram: row-oriented Input Data (Time | IP | UID | URL | Geo) is regrouped into per-column blocks in Stored Data; Block Size is tunable]
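The idea in the diagram can be shown with a toy pivot-and-compress pass (plain java.util.zip, not the columncompressor API): regrouping rows into per-column streams puts similar values next to each other, which on real-sized blocks is where the compressor savings come from.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.Deflater;

// Pivot a block of rows into per-column byte streams and compress each
// column separately; each column could even get a different codec.
public class ColumnBlock {
    static byte[] deflate(byte[] input) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64]; // enough for this toy data
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"1393000001", "10.0.0.1", "uid-a", "http://x.com/1", "US"},
                new String[]{"1393000002", "10.0.0.2", "uid-b", "http://x.com/2", "US"});

        for (int c = 0; c < rows.get(0).length; c++) {
            StringBuilder column = new StringBuilder();
            for (String[] row : rows) column.append(row[c]).append('\n');
            byte[] packed = deflate(column.toString().getBytes(StandardCharsets.UTF_8));
            System.out.printf("column %d: %d -> %d bytes%n",
                    c, column.length(), packed.length);
        }
    }
}
```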
Tunable QoS
Cassandra URL Store
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
○ Depending on write rate per record
● Global TTL knob (see the sketch below)
○ Turn up to reduce load for maintenance
○ Turn down to improve responsiveness
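A sketch of how such a knob might work (the names and constants are ours, not the production policy): per-record TTL scales inversely with its write rate, and a global multiplier shifts the whole fleet at once.

```java
// Hot records (frequent writes) get short TTLs so readers stay fresh;
// cold records can be cached much longer.
public class TtlPolicy {
    // Global knob, adjustable at runtime: raise it to shed load during
    // maintenance, lower it to improve responsiveness.
    static volatile double globalTtlMultiplier = 1.0;

    static final int MIN_TTL_SECONDS = 60;
    static final int MAX_TTL_SECONDS = 86_400;

    static int ttlFor(double writesPerHour) {
        double base = 3600.0 / Math.max(writesPerHour, 1.0); // ~one write per TTL window
        double scaled = base * globalTtlMultiplier;
        return (int) Math.min(MAX_TTL_SECONDS, Math.max(MIN_TTL_SECONDS, scaled));
    }
}
```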
Hydra
● Our custom processing system
● Optimized for real-time data
● Just open sourced: https://github.com/addthis/hydra

Go see @csby’s talk: Great Hall North @ 3:55pm
Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency for perfect
○ I’m still struggling with this