Cassandra Summit 2014: Apache Cassandra Best Practices at eBay
DESCRIPTION
Presenter: Feng Qu, Principal DBA at eBay
Cassandra has been adopted widely at eBay in recent years and is used by many end-user-facing applications. I will introduce the best practices we have built over time around system design, capacity planning, deployment automation, monitoring integration, performance analysis and troubleshooting. I will also share our experience working with DataStax support to provide a highly available, highly scalable data store that fits into the eBay infrastructure.
TRANSCRIPT
Feng Qu, principal database engineer, ebay inc
September 11, 2014
Cassandra Best Practices at ebay inc
CassandraSummit2014 | #CassandraSummit
Agenda
• ebay inc Cassandra footprints
• NoSQL life cycle
• Cassandra best practices
• Q&A
ebay inc
ebay inc Database Platforms
• We manage thousands of databases powering eBay and PayPal
Why NoSQL?
• Challenges of traditional RDBMS
  • Performance penalty to maintain ACID features
  • Lack of native sharding and replication features
  • Lack of linear scalability
  • Cost of software/hardware
  • Higher cost of commit
• NoSQL used in eBay inc
  • Cassandra, Couchbase, MongoDB managed by DBA
  • HBase, Redis, OpenTSDB managed by developers
Cassandra @ ebay inc
• Started in 2011 at eBay and later expanded to PayPal
• Started with Apache Cassandra 0.8, now using Apache Cassandra 2.0 and DataStax Enterprise 4.0
• Over a dozen production clusters on hundreds of servers across 3 data centers
• Choice between a dedicated cluster for large/critical use cases and a multi-tenant cluster for small use cases
• Over 20 billion daily reads/writes to Cassandra
• Cluster size varies from 4 nodes to 80 nodes
• 100TB+ user data on HDD, local SSD and SSD array
• One cluster is estimated to grow to more than a few PB
NoSQL Life Cycle
• Use Case Analysis
• Data Modeling
• Capacity Planning
• Deployment
• Operation
Data Modeling Phase
• The development team requests a review meeting with the data architect for a new use case
• Once the data architect understands the requirements, they recommend a suitable data store: either an RDBMS or one of the NoSQL products we support
• Both parties work on data modeling together
• The outputs of the engagement are a set of tickets, kept for tracking purposes, which capture project information and data configuration for the chosen data store
Data Modeling Best Practices
• Data modeling for Cassandra is quite different from data modeling for a traditional RDBMS
  • Model around query patterns, not entities
  • De-normalize to improve read performance
  • Separate read-heavy data from write-heavy data
  • Store values in column names, as names are already physically sorted
• Former eBay architect Jay Patel published a few technical blog posts on Cassandra data modeling
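The point about storing values in column names can be illustrated with a toy, in-memory sketch. It is not Cassandra code: the `WideRowStore` class, the `user:42` row key and the (timestamp, item id) column names are all hypothetical, chosen only to show why a query pattern such as "items for a user in a time range" becomes a single slice over sorted column names.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """Toy stand-in for a column family: each row keeps its column
    names in sorted order, mimicking how columns are sorted within a row."""

    def __init__(self):
        self._rows = defaultdict(list)  # row key -> sorted list of column names

    def insert(self, row_key, column_name):
        # The value lives in the column name itself; no column value is stored.
        columns = self._rows[row_key]
        pos = bisect.bisect_left(columns, column_name)
        if pos == len(columns) or columns[pos] != column_name:
            columns.insert(pos, column_name)

    def slice(self, row_key, start, end):
        # Range read over the sorted column names of one row.
        columns = self._rows.get(row_key, [])
        lo = bisect.bisect_left(columns, start)
        hi = bisect.bisect_right(columns, end)
        return columns[lo:hi]


store = WideRowStore()
# Hypothetical query pattern: "items a user bid on, in time order".
# Column name = (timestamp, item_id), so inserts may arrive in any order.
store.insert("user:42", (1410400000, "item-b"))
store.insert("user:42", (1410300000, "item-a"))
store.insert("user:42", (1410500000, "item-c"))

# One slice answers the query; no sorting is needed at read time.
recent = store.slice("user:42", (1410350000, ""), (1410600000, ""))
```

Because the sort order is fixed at write time, the model has to be chosen from the query, which is the "query pattern, not entity" advice in practice.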
Data Modeling Best Practices - indexing
• Secondary index
  + Less overhead as it is built in
  + Data and index are changed atomically
  - Does not scale well with high cardinality data
• Column family as index
  + No hot spot
  - Index is maintained manually by the application
  - Index change is not atomic
• Avoid secondary index and use column family as index if possible
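A minimal sketch of the "column family as index" trade-off, using plain Python dictionaries in place of column families. The `UserStore` class and the email lookup are hypothetical; the point is only that the application performs two separate writes, which is why the index change is not atomic.

```python
class UserStore:
    """Toy model: one 'column family' for data, one maintained as an index."""

    def __init__(self):
        self.users = {}           # user_id -> {"email": ...}
        self.users_by_email = {}  # email -> user_id (the index column family)

    def save_user(self, user_id, email):
        old = self.users.get(user_id)
        # Write 1: the data row.
        self.users[user_id] = {"email": email}
        # Write 2: the application maintains the index itself. These are
        # separate writes, so a failure between them leaves the index stale.
        if old is not None and old["email"] != email:
            self.users_by_email.pop(old["email"], None)
        self.users_by_email[email] = user_id

    def find_by_email(self, email):
        # Lookup goes through the index first, then reads the data row.
        user_id = self.users_by_email.get(email)
        if user_id is None:
            return None
        return user_id, self.users[user_id]


store = UserStore()
store.save_user("u1", "a@example.com")
store.save_user("u1", "b@example.com")  # email change: index must be updated too
found = store.find_by_email("b@example.com")
stale = store.find_by_email("a@example.com")
```

In a real cluster each index key is its own row, so index rows are distributed like any other data, which is consistent with the "no hot spot" advantage listed above.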
Benchmark Testing
• Benchmark testing is key to capacity planning
• Performance baseline with near-real traffic in a production-size environment
  • for different types of hardware
  • for different software releases
  • for different use cases or workloads
• A proactive and repetitive process
Capacity Planning Phase
• Key to avoiding surprises in production
• The concept behind capacity planning is simple, but the mechanics are harder
• Business requirements may grow, so we need to forecast how many resources must be added to the system to keep the user experience uninterrupted
  • Input: a clearly defined capacity goal from business requirements and a performance baseline from benchmark tests
  • Output: the resources to be added, such as memory, CPU, storage, I/O, network
• Always prepare for peak + headroom
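The "peak + headroom" rule can be written as simple arithmetic. The sketch below is an illustration only: the function name, the 30% default headroom and all traffic numbers are assumptions, and it treats throughput as scaling linearly with node count while ignoring replication factor, storage and network limits.

```python
import math


def nodes_required(peak_ops_per_sec, per_node_ops_per_sec, headroom=0.3):
    """Nodes needed to serve peak traffic plus headroom.

    per_node_ops_per_sec is the baseline measured in a benchmark test;
    headroom is a fraction of peak (0.3 means 30% spare capacity).
    """
    if per_node_ops_per_sec <= 0:
        raise ValueError("per-node baseline must be positive")
    target = peak_ops_per_sec * (1 + headroom)
    return math.ceil(target / per_node_ops_per_sec)


# Made-up inputs: capacity goal from the business, baseline from a benchmark.
current_nodes = 12
needed = nodes_required(peak_ops_per_sec=200_000, per_node_ops_per_sec=15_000)
nodes_to_add = max(0, needed - current_nodes)
```

With these inputs the target is 260,000 operations per second, which rounds up to 18 nodes, so 6 nodes would be added to the existing 12.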
Deployment Best Practices
• Software packages with customized optimization
  • kernel, JVM heap, compaction
• Deployment automation for efficiency
• Multi data center deployment for load balancing and disaster recovery
• Vnode is a must for manageability
• SSD as default storage requires additional OS level tuning
Operation Best Practices
• Collect system and database metrics
• Monitoring and alerting
  • event driven and metrics driven alerts
• Operation runbook
  • Reduce human error
• Performance tuning runbook
  • nodetool tpstats for dropped requests
  • nodetool cfhistograms for latency distribution
• Troubleshooting runbook
  • Document previous incidents as future reference
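A metrics driven alert on dropped requests could be sketched as a comparison between two counter snapshots. This is a hedged illustration: it assumes the dropped counts have already been collected (for example from nodetool tpstats) into dictionaries and that they are cumulative; the function name and the counter values are made up, and parsing the tool output is left out.

```python
def dropped_request_alerts(previous, current, threshold=0):
    """Compare two snapshots of cumulative dropped-message counters.

    Each snapshot maps a message type to its dropped count so far.
    Returns {message_type: increase} for types whose increase exceeds
    the threshold. A counter that went down (e.g. after a restart)
    yields a negative increase and is ignored.
    """
    alerts = {}
    for message_type, count in current.items():
        increase = count - previous.get(message_type, 0)
        if increase > threshold:
            alerts[message_type] = increase
    return alerts


# Made-up counter values for illustration only.
earlier = {"READ": 10, "MUTATION": 0, "RANGE_SLICE": 2}
later = {"READ": 10, "MUTATION": 7, "RANGE_SLICE": 3}
alerts = dropped_request_alerts(earlier, later)
```

Alerting on the increase rather than the absolute count keeps an old incident from firing the same alert forever.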
Operation Best Practices
• Routine repair is not really needed if there are no deletes. You still need to run repair after bringing up a node that has been down for a while
• Use CNAME in client configuration to avoid client configuration changes when hardware is replaced with a new IP/name
• Reduce gc_grace to reduce overall data size
• Disable row cache, unless you have <100K rows
• Collect statistics, real-time or historical, to monitor overall system performance
• Disable swap to avoid a slow node
Capacity Review
• Routine capacity review and adjustment
• When to scale up and when to scale out
  • In general, scale out by adding nodes to increase capacity with NoSQL
  • Sometimes it's cost efficient to scale up at the component level by identifying the scaling bottleneck and resolving it accordingly
    • Network bandwidth: upgrade to 10 Gbps network
    • I/O latency: upgrade to (better) SSD
    • Storage: add/expand data volume
Typical Use Cases
• Write intensive: metrics collection, logging
  • Collecting metrics from tens of thousands of devices periodically
• Read intensive: home page feeds
  • Recommendation backend to generate dynamic taste graph
• Mixed workload: personalization, classification
  • Data is loaded from the data warehouse periodically in bulk and from user events continuously
  • Data is retrieved in real time when a user visits the eBay site
Metrics Collection Application
The End
• We are hiring for NoSQL talent.
• Contact:
  • [email protected]
  • www.linkedin.com/in/fengqu/
• Q&A