The Big Data Revolution is an Evolution
DESCRIPTION
Dealing with data doesn't only require a data store, it requires an infrastructure. At SimpleReach, we have 5 data storage layers to service all of our data needs. These range from high volume, high velocity data ingestion with real-time analytics to ad-hoc style historical analysis with search capabilities. To communicate effectively between applications, data stores sit behind a service architecture for consistent data access patterns and failover/redundancy. This talk is a story of how we came to this architecture and some of the lessons we learned along the way.
TRANSCRIPT
Eric Lubow
@elubow
The Big Data Revolution is an Evolution
Big Data Revolution is an Evolution
Eric Lubow @elubow #NYCassandra2013
Overview
• Evolution
• SimpleReach
• Data Stores / Languages
• Architecture Implementation
We're in the midst of an evolution, not a revolution.
The 2 Truths
Even with the right tools, 80% of the work of building a big data system is acquiring and refining
The Real Truth
30m plays/day + 4m user ratings + 75k movies metadata + 24.4m users metadata =
David Fincher + Kevin Spacey + British House of Cards
Mitch Hurwitz + Will Arnett + Jason Bateman + Arrested Development
BRING IT TOGETHER
evolution
revolution
Insufficient Capabilities
Scale/Need Changes
Development & Integration
New Products
• Millions of URLs per day
• Over 1 billion pageviews per month
• 250m events per day (~3k events/second)
• Auto-scale 90-130 machines depending on traffic
SimpleReach
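As a quick sanity check on the figures above, 250m events per day works out to just under 3k per second when averaged over the day. A small sketch of the arithmetic, assuming traffic is spread evenly (the talk gives no peak figures, so none are modeled):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_second(events_per_day: int) -> float:
    """Average event rate, assuming an even spread across the day."""
    return events_per_day / SECONDS_PER_DAY


rate = events_per_second(250_000_000)
print(f"{rate:,.0f} events/second on average")
```

Real traffic is bursty, which is presumably why the machine count auto-scales rather than staying fixed.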
HUMBLE BEGINNINGS
Scale
AND THEN...
C*
• Large data volume ingestion at high velocity
• Really fast writes to many locations (eventual consistency)
• Query by column groups within rows (slicing)
• TTLs for small group aggregation
• Wrote Helenus, Node.js driver for Cassandra
Cassandra C*
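To make "slicing" and "TTLs for small group aggregation" concrete, here is a toy in-memory model of a single wide row. It is not the Cassandra API or Helenus: the `WideRow` class, the column naming scheme, and the explicit `now` argument are all assumptions made so the sketch runs without a cluster.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class WideRow:
    """Toy wide row: columns kept sorted by name, each with an optional expiry."""

    def __init__(self) -> None:
        self._names: List[str] = []  # sorted column names
        self._cells: Dict[str, Tuple[int, Optional[float]]] = {}  # name -> (value, expires_at)

    def insert(self, name: str, value: int, now: float, ttl: Optional[float] = None) -> None:
        if name not in self._cells:
            bisect.insort(self._names, name)
        expires_at = now + ttl if ttl is not None else None
        self._cells[name] = (value, expires_at)

    def slice(self, start: str, end: str, now: float) -> List[Tuple[str, int]]:
        """Return unexpired columns with start <= name <= end, in column order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        live = []
        for name in self._names[lo:hi]:
            value, expires_at = self._cells[name]
            if expires_at is None or now < expires_at:
                live.append((name, value))
        return live


row = WideRow()
row.insert("2013-03-20:14", 120, now=0, ttl=3600)  # hypothetical hourly bucket, one-hour TTL
row.insert("2013-03-20:15", 95, now=0, ttl=3600)
row.insert("2013-03-21:09", 40, now=0)             # no TTL
print(row.slice("2013-03-20:00", "2013-03-20:23", now=10))
print(row.slice("2013-03-20:00", "2013-03-21:23", now=7200))
```

Because columns are ordered, a range of time buckets is one contiguous read, and short-lived buckets disappear on their own once the TTL passes.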
• Fast atomic increments (Node.js is native JSON)
• Sharding
• Solid ORM for Rails (Mongoid)
• B-Tree Indexes
• Document based via JSON
• TTLs for ephemeral data
MongoDB
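The "fast atomic increments" bullet refers to MongoDB's `$inc` update operator. The sketch below shows the shape of such an update document and what it does to a JSON-style document. `apply_inc` is a hypothetical stand-in for the server so the example runs locally; it is not atomic itself, and the field names are invented for illustration.

```python
from typing import Any, Dict


def apply_inc(doc: Dict[str, Any], update: Dict[str, Dict[str, int]]) -> Dict[str, Any]:
    """Apply the "$inc" part of an update document to doc, in place."""
    for path, amount in update.get("$inc", {}).items():
        *parents, leaf = path.split(".")
        target = doc
        for key in parents:
            target = target.setdefault(key, {})       # create missing sub-documents
        target[leaf] = target.get(leaf, 0) + amount   # a missing field starts from 0
    return doc


article = {"url": "http://example.com/story", "pageviews": 41}
apply_inc(article, {"$inc": {"pageviews": 1, "referrers.twitter": 1}})
print(article)
```

On a real server the whole update is applied to one document atomically, so concurrent writers do not need a read-modify-write cycle.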
• Supports hundreds of thousands of transactions per second
• Great caching engine
• Supports useful variable types like sets, sorted sets, lists
• Everything is guaranteed to be in memory
Redis
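Sorted sets are the type that makes "top N" style questions cheap. The toy class below imitates the behavior of Redis's `ZINCRBY` and a reverse range read in plain Python. It is an assumption-laden sketch, not a Redis client: `ToySortedSet` and its method names are hypothetical, and tie ordering is simplified.

```python
from typing import Dict, List, Tuple


class ToySortedSet:
    """Members with scores, readable in descending score order."""

    def __init__(self) -> None:
        self._scores: Dict[str, float] = {}

    def incrby(self, member: str, amount: float) -> float:
        """Add amount to the member's score, creating it at 0 if absent."""
        self._scores[member] = self._scores.get(member, 0.0) + amount
        return self._scores[member]

    def top(self, n: int) -> List[Tuple[str, float]]:
        """Highest-scoring n members; ties broken by member name for determinism."""
        ranked = sorted(self._scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:n]


trending = ToySortedSet()
trending.incrby("/story-a", 3)
trending.incrby("/story-b", 5)
trending.incrby("/story-a", 4)
print(trending.top(2))
```

Redis keeps the set ordered as it is updated, so reads do not pay for a full sort the way this sketch does.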
• Works with standard MySQL driver
• Column Stores for ad-hoc analytics queries in SQL
• Heavy compression of data (avg 12:1)
Infobright
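Since Infobright speaks the MySQL protocol, ad-hoc analysis is ordinary SQL. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so it runs anywhere; the `events` table, its columns, and the sample rows are hypothetical and not taken from the talk.

```python
import sqlite3
from typing import List, Tuple


def top_urls(conn: sqlite3.Connection) -> List[Tuple[str, int]]:
    """Ad-hoc aggregate: pageviews per URL, most viewed first."""
    query = """
        SELECT url, COUNT(*) AS views
        FROM events
        WHERE event_type = 'pageview'
        GROUP BY url
        ORDER BY views DESC, url
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, url TEXT, event_type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2013-03-20", "/a", "pageview"),
        ("2013-03-20", "/a", "pageview"),
        ("2013-03-20", "/b", "pageview"),
        ("2013-03-20", "/b", "share"),
    ],
)
rows = top_urls(conn)
print(rows)
```

A query like this touches only two columns, which is the access pattern a column store is built for.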
• Polyglottany doesn’t only apply to data stores
• Each language has its own benefit to each stack layer
• Each language has its own individual benefits
• Each language has its own development benefits
The c0dez
Cons
• Redis - Can only utilize a single core. SerDe price.
• Infobright - DELETE/UPDATEs are VERY expensive
• Cassandra - No B-tree indexes or probabilistic counters
• Mongo - Indexes must fit in memory. Forced Replica ping times
• Python - Whitespace. Community
• Ruby - Not high performance enough for our standards
Evolution Takes Work
• Service Oriented Architecture (Internal API)
• Data accuracy checks: visual and programmatic
• Built framework for testing out engines (Storage, Queueing, etc)
• Access to many toolsets (for all languages, DBs, Engines)
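A programmatic data accuracy check can be as simple as comparing the same counts read from two stores. The sketch below is one possible shape for such a check; the function name, the 1% tolerance, and the sample numbers are assumptions, not the checks SimpleReach actually runs.

```python
from typing import Dict, List


def find_mismatches(
    primary: Dict[str, int], secondary: Dict[str, int], tolerance: float = 0.01
) -> List[str]:
    """Return keys whose counts differ by more than tolerance, relative to primary."""
    mismatched = []
    for key in sorted(set(primary) | set(secondary)):
        a = primary.get(key, 0)
        b = secondary.get(key, 0)
        allowed = tolerance * max(a, 1)  # avoid a zero allowance for empty keys
        if abs(a - b) > allowed:
            mismatched.append(key)
    return mismatched


realtime_counts = {"/a": 1000, "/b": 500, "/c": 10}
historical_counts = {"/a": 1005, "/b": 450}
print(find_mismatches(realtime_counts, historical_counts))
```

With eventually consistent stores, a small tolerance avoids flagging differences that are only replication lag.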
Service
[Diagram labels: Internal API, Solr, Real-time C*, C*]
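The description mentions data stores sitting behind a service layer for consistent access and failover. A minimal sketch of that idea, assuming each backend is just a callable that may raise; `InternalAPI`, `StoreUnavailable`, and the replica functions are hypothetical names, not the real internal API.

```python
from typing import Callable, Iterable, List


class StoreUnavailable(Exception):
    """Raised by a backend that cannot serve the request."""


class InternalAPI:
    """One access point for callers; tries replicas in order until one answers."""

    def __init__(self, replicas: Iterable[Callable[[str], str]]) -> None:
        self._replicas: List[Callable[[str], str]] = list(replicas)

    def get(self, key: str) -> str:
        last_error = None
        for replica in self._replicas:
            try:
                return replica(key)
            except StoreUnavailable as exc:
                last_error = exc  # remember the failure and try the next replica
        raise StoreUnavailable("all replicas failed") from last_error


def down(key: str) -> str:
    raise StoreUnavailable("replica is down")


def up(key: str) -> str:
    return f"value-for-{key}"


api = InternalAPI([down, up])
print(api.get("article:42"))
```

Callers only see `get`, so a store can be swapped or a replica lost without touching application code.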
Path of a Packet
[Diagram labels: Internet, EP, Internal API, Solr, C*, Mongo, Redis, IB, API, Fire Hose, SC, Consumers, Queue]
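The diagram itself did not survive extraction, so the flow below is an assumption: an endpoint accepts an event, puts it on a queue, and consumers fan it out to the stores. All function and variable names are hypothetical, and plain lists stand in for the real data stores.

```python
import json
import queue
from typing import Callable, Dict, List

Event = Dict[str, str]


def endpoint(raw_body: str, q: "queue.Queue[Event]") -> None:
    """Parse an incoming request body and hand the event to the queue."""
    q.put(json.loads(raw_body))


def run_consumer(q: "queue.Queue[Event]", sinks: List[Callable[[Event], None]]) -> int:
    """Drain the queue, writing every event to every sink; return events handled."""
    handled = 0
    while True:
        try:
            event = q.get_nowait()
        except queue.Empty:
            return handled
        for write in sinks:
            write(event)
        handled += 1


realtime_store: List[Event] = []   # stand-in for a real-time store
analytics_store: List[Event] = []  # stand-in for an analytics store
events: "queue.Queue[Event]" = queue.Queue()

endpoint('{"url": "/a", "type": "pageview"}', events)
endpoint('{"url": "/b", "type": "share"}', events)
handled = run_consumer(events, [realtime_store.append, analytics_store.append])
print(handled, len(realtime_store), len(analytics_store))
```

The queue decouples ingestion from storage, so a slow store delays consumers rather than the endpoint.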
Architecture Distribution
US-EAST-1a
• MONGO-SHARD-0001-B
• MONGO-SHARD-0000-A
• CASSANDRA-0001
• CASSANDRA-0010
• REDIS-0001A
• INFOBRIGHT-0001
• iAPI-0001
US-EAST-1b
• MONGO-SHARD-0002-B
• MONGO-SHARD-0001-A
• CASSANDRA-0002
• CASSANDRA-0011
• REDIS-0001B
• iAPI-0002
US-EAST-1e
• MONGO-SHARD-0002-A
• MONGO-SHARD-0000-B
• CASSANDRA-0003
• CASSANDRA-0012
• INFOBRIGHT-0002
• iAPI-0003
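The layout above can be checked mechanically: no Mongo shard should have both of its members in the same availability zone. The sketch below uses the host names from the slide and assumes the trailing A/B marks two members of the same shard, which the slide does not state outright.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Mongo hosts per availability zone, copied from the slide.
PLACEMENT: Dict[str, List[str]] = {
    "us-east-1a": ["MONGO-SHARD-0001-B", "MONGO-SHARD-0000-A"],
    "us-east-1b": ["MONGO-SHARD-0002-B", "MONGO-SHARD-0001-A"],
    "us-east-1e": ["MONGO-SHARD-0002-A", "MONGO-SHARD-0000-B"],
}


def shard_zones(placement: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each shard number to the set of zones that host one of its members."""
    zones: Dict[str, Set[str]] = defaultdict(set)
    for zone, hosts in placement.items():
        for host in hosts:
            match = re.fullmatch(r"MONGO-SHARD-(\d+)-[A-Z]", host)
            if match:
                zones[match.group(1)].add(zone)
    return dict(zones)


by_shard = shard_zones(PLACEMENT)
for shard, zones in sorted(by_shard.items()):
    print(shard, sorted(zones))
```

Every shard lands in two different zones, so losing any single zone leaves one member of each shard reachable.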
The Schrute of the Problem
Evolving Amazon Tools
• Full Featured API
• Simple Queue Service
• Data Pipeline
• OpsWorks
• CloudFormation
• Redshift Analytics
• CloudSearch
• Elastic Beanstalk
• Elastic MapReduce
• Simple Workflow Service
• S3 / Glacier
DevOps Wizardry
• Extensive use of AWS
• Monitor: Nagios, StatsD, and Graphite
• Manage: Chef, OpsWorks, csshX
• Deployments
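For the monitoring bullet, StatsD metrics are short plain-text lines of the form `name:value|type`, usually sent over UDP. The helpers below only format and send such lines; the metric names, host, and port default are assumptions for illustration, and `send` is defined but not called so the sketch runs without a StatsD server.

```python
import socket


def statsd_counter(name: str, value: int = 1) -> bytes:
    """Format a counter metric, e.g. b"events.ingested:1|c"."""
    return f"{name}:{value}|c".encode("ascii")


def statsd_timer(name: str, millis: int) -> bytes:
    """Format a timing metric in milliseconds, e.g. b"api.latency:12|ms"."""
    return f"{name}:{millis}|ms".encode("ascii")


def send(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send; host and port here are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


print(statsd_counter("events.ingested"))
print(statsd_timer("api.latency", 12))
```

Because the send is UDP, instrumenting a hot path costs little and a monitoring outage does not block the application.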
Summary
• Solutions Require Evolution
• Build, Use, and Integrate Tools
• Abstraction
• Distribution
• Monitoring & Automation
A revolution only lasts fifteen years, a period which coincides with the
Evolution Takes Time
We’re (Ask us about Food Coma Fridays)
Questions are guaranteed in life. Answers aren’t.
Eric Lubow
@elubow
Thank you.