Data Lessons Learned at Scale

DESCRIPTION
Data lessons learned at scale while building AddThis' data processing system.

TRANSCRIPT
Charlie Reverte, VP Engineering
@numbakrrunch
Topic
Half of the work it takes to do data science is plumbing and wrangling.
Here are some lessons we’ve learned...
About AddThis
We make tools for websites
Our Data
We process website data:
● Visitation
● Sharing
● Following
● Content Classification

And use it to improve the site:
● Content Recommendation
● Personalization
● Analytics
At Scale...
● 14 million domains
● 100 billion views/month
● 50k events/sec
● 160k concurrent firewall sessions
● 500k unique ganglia metrics
Distributed ID Generation
● Session IDs are generated in the browser
● We concatenate time and a random value
Hex: 4f6934b6f54bd7c1
Base64: T2k0to403VS
● Time-bounded probabilistic uniqueness
○ Birthday bound: C(m, 2) / n = (35,000 × 34,999 / 2) / 2^32 ≈ 0.142 collisions/sec (at m = 35k rq/sec, n = 2^32 random values)
● Naturally time ordered, with a built-in date of birth (DoB)
Compare to Twitter Snowflake: https://github.com/twitter/snowflake/
[Bit layout: time in bits 63–32, rand in bits 31–0]
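Below is a minimal sketch of this scheme (our illustration, not AddThis' production code), assuming 32 bits of wall-clock seconds in the high half and 32 random bits in the low half:

```java
import java.nio.ByteBuffer;
import java.security.SecureRandom;
import java.util.Base64;

// Illustrative session ID generator: time in bits 63-32, rand in bits 31-0.
public final class SessionId {
    private static final SecureRandom RNG = new SecureRandom();

    public static long next() {
        long seconds = System.currentTimeMillis() / 1000L; // time-ordered prefix
        long rand = RNG.nextInt() & 0xFFFFFFFFL; // disambiguates IDs minted in the same second
        return (seconds << 32) | rand;
    }

    public static void main(String[] args) {
        long id = next();
        byte[] bytes = ByteBuffer.allocate(8).putLong(id).array();
        System.out.printf("Hex:    %016x%n", id);
        System.out.println("Base64: " +
                Base64.getUrlEncoder().withoutPadding().encodeToString(bytes));
    }
}
```

Because the timestamp leads, IDs sort chronologically and the creation time can be read straight out of the top 32 bits.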
Counting Things
● Cardinality
● Set membership
● Top-k elements
● Frequency

● Estimate when possible
● Sample when possible
● Often streaming vs. batch
● Mergeability is a big plus (see the sketch below)
○ Distributed counting
○ Checkpointing
Stream-lib: https://github.com/clearspring/stream-lib
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
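As a concrete example, here is a hedged sketch using stream-lib's HyperLogLog (linked above); the offer/merge/cardinality calls reflect the library's public API as we understand it, so check exact signatures against the repo:

```java
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;
import com.clearspring.analytics.stream.cardinality.ICardinality;

// Mergeable cardinality estimation: two nodes count unique IDs
// independently, then their sketches are merged -- the property that
// makes distributed counting and checkpointing cheap.
public class DistributedCount {
    public static void main(String[] args) throws CardinalityMergeException {
        HyperLogLog nodeA = new HyperLogLog(14); // 2^14 registers, ~0.8% error
        HyperLogLog nodeB = new HyperLogLog(14);

        for (int i = 0; i < 100_000; i++) nodeA.offer("uid-" + i);
        for (int i = 50_000; i < 150_000; i++) nodeB.offer("uid-" + i); // 50k overlap

        ICardinality merged = nodeA.merge(nodeB);
        System.out.println(merged.cardinality()); // ~150,000, not 200,000
    }
}
```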
Joining Data
● Value of data increases with higher dimensionality
○ Geo, user profile, page attributes, external data
● Join and de-normalize data when you ingest (see the sketch below)
○ Disk is cheap
● Join your data in client-side storage
○ Browsers as a lossy distributed database
● Oceans of data in the cloud...
“The value is in the join” (or something like that)
https://github.com/stewartoallen
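As a hypothetical illustration of ingest-time denormalization (RawEvent, EnrichedEvent, and Enricher are our stand-ins, not AddThis code):

```java
import java.util.Map;

// Join at ingest: look up the extra dimensions once, write a wide record.
record RawEvent(String uid, String url, String ip, long ts) {}
record EnrichedEvent(String uid, String url, long ts,
                     String country, String pageCategory) {}

class Enricher {
    private final Map<String, String> geoByIp;       // stand-in for a geo/IP lookup
    private final Map<String, String> categoryByUrl; // stand-in for classifier output

    Enricher(Map<String, String> geo, Map<String, String> cats) {
        this.geoByIp = geo;
        this.categoryByUrl = cats;
    }

    // Disk is cheap: store the joined, de-normalized row so queries never
    // have to re-join geo or page attributes.
    EnrichedEvent enrich(RawEvent e) {
        return new EnrichedEvent(e.uid(), e.url(), e.ts(),
                geoByIp.getOrDefault(e.ip(), "unknown"),
                categoryByUrl.getOrDefault(e.url(), "unclassified"));
    }
}
```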
Sharding and Sampling
● Choose your shard keys wisely
○ High-cardinality field to reduce lumpiness
○ What do you need to co-locate?
○ Storage is cheap: multiple copies?
● Shards also useful for sampling (see the sketch below)
○ Complete data subsets
● Can yield statistical significance
○ Depending on the question
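A toy sketch of shard-as-sample, assuming the user ID is the shard key (the shard count and helper names are illustrative):

```java
// Shard on a high-cardinality key so each shard is a complete,
// unbiased subset of users rather than a lumpy slice of traffic.
public class Sharding {
    static final int NUM_SHARDS = 64;

    static int shardOf(String userId) {
        // floorMod keeps the shard non-negative for negative hash codes;
        // a stronger hash (e.g. murmur) would spread keys more evenly.
        return Math.floorMod(userId.hashCode(), NUM_SHARDS);
    }

    // Reading only shard 0 gives a ~1.6% sample that still contains
    // every event for the users it covers -- a complete data subset.
    static boolean inSample(String userId) {
        return shardOf(userId) == 0;
    }
}
```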
Deployment
● Continuous deploy?
● Deploying our JavaScript costs $3k
○ Have to invalidate 1.4B browser caches
○ Several hours to flush to browsers (clench)
● 2PB of CDN data served per month
● Have DDoSed ourselves
○ Very interesting bugs
● Simulation is weak
○ The internet is a dirty place
○ Embrace incremental deploys
The Log
Jay Kreps: “Real-time data’s unifying abstraction”
● Centralized logging
● Loosely coupled consumers

Divide your dependencies:
● Synchronous: ZeroMQ (0mq)
● Asynchronous: Kafka (see the sketch below)

Distributed event logging
● Does determinism matter?

Log format durability?
● Protobuf?
http://bit.ly/thelog
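For the asynchronous half of that split, a minimal Kafka producer sketch (the broker address, topic name, and payload are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Publish events to a Kafka topic; loosely coupled consumers read the
// log at their own pace instead of being called synchronously.
public class EventLog {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by session ID keeps one session's events ordered on one partition.
            producer.send(new ProducerRecord<>("share-events",
                    "4f6934b6f54bd7c1", "{\"event\":\"share\",\"url\":\"...\"}"));
        }
    }
}
```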
Columnar Compression
● Columnar storage techniques for row data
● Better compressor efficiency
● Different compressors per column
● >20% size savings
● https://github.com/addthis/columncompressor
○ by @abramsm
[Diagram: row-oriented Input Data (Time | IP | UID | URL | Geo) is regrouped into per-column blocks in Stored Data; Block Size is tunable]
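The idea in the diagram can be shown with a toy pivot-and-compress pass (plain java.util.zip, not the columncompressor API): regrouping rows into per-column streams puts similar values next to each other, which on real-sized blocks is where the compressor savings come from.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.Deflater;

// Pivot a block of rows into per-column byte streams and compress each
// column separately; each column could even get a different codec.
public class ColumnBlock {
    static byte[] deflate(byte[] input) {
        Deflater d = new Deflater(Deflater.BEST_COMPRESSION);
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 64]; // enough for this toy data
        int n = d.deflate(buf);
        d.end();
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"1393000001", "10.0.0.1", "uid-a", "http://x.com/1", "US"},
                new String[]{"1393000002", "10.0.0.2", "uid-b", "http://x.com/2", "US"});

        for (int c = 0; c < rows.get(0).length; c++) {
            StringBuilder column = new StringBuilder();
            for (String[] row : rows) column.append(row[c]).append('\n');
            byte[] packed = deflate(column.toString().getBytes(StandardCharsets.UTF_8));
            System.out.printf("column %d: %d -> %d bytes%n",
                    c, column.length(), packed.length);
        }
    }
}
```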
Tunable QoS
Cassandra URL Store
● We scrape and classify 20M URLs/day
● 750 million active records
● 2.2B reads/day
● Variable cache TTLs
○ Depending on write rate per record
● Global TTL knob (see the sketch below)
○ Turn up to reduce load for maintenance
○ Turn down to improve responsiveness
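A sketch of how such a knob might work (the names and constants are ours, not the production policy): per-record TTL scales inversely with its write rate, and a global multiplier shifts the whole fleet at once.

```java
// Hot records (frequent writes) get short TTLs so readers stay fresh;
// cold records can be cached much longer.
public class TtlPolicy {
    // Global knob, adjustable at runtime: raise it to shed load during
    // maintenance, lower it to improve responsiveness.
    static volatile double globalTtlMultiplier = 1.0;

    static final int MIN_TTL_SECONDS = 60;
    static final int MAX_TTL_SECONDS = 86_400;

    static int ttlFor(double writesPerHour) {
        double base = 3600.0 / Math.max(writesPerHour, 1.0); // ~one write per TTL window
        double scaled = base * globalTtlMultiplier;
        return (int) Math.min(MAX_TTL_SECONDS, Math.max(MIN_TTL_SECONDS, scaled));
    }
}
```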
Hydra
● Our custom processing system
● Optimized for real-time data
● Just open sourced: https://github.com/addthis/hydra

Go see @csby’s talk: Great Hall North @ 3:55pm
Summary
● Are you more like the post office or the bank?
● Look for good-enough answers
● Fight your nerd tendency for perfect
○ I’m still struggling with this