scalability broad strokes
Post on 15-Jan-2015
Description: A high-level view of scalability best practices.
1. Scalability Broad Strokes - Best practices
2. Definition
- Concurrency, a.k.a. the number of simultaneous requests
- Latency
- Throughput, a.k.a. the total number of items processed
- Extensibility: application design for the ability to add new features, etc.
- We'd mostly be talking about the first two.

3. Concurrency & Performance
- Scalability is measured as the number of requests/users an application can support without degrading performance.
- Performance is mostly a measure of the processing time of an individual request.

4. Handling Scale
- Throttling
- Cache
- Stateful vs. stateless
- Asynchronous vs. synchronous
- Service-oriented design

5. Where (Multi-tiered)
- At the client (browser)
  - HTTP headers
  - Asynchronous calls
  - Local DB
- At the server (web tier/application tier)
  - Cache (distributed)
  - Stateless
  - Asynchronous
- DB
  - CAP theorem

6. Client
- HTTP headers
  - Pragmatic headers not only cache on browsers but also help with intelligent proxies.
  - YSlow/Google PageSpeed guidelines are always useful.
  - ETags and long expiry are very good practices.
- Sprites and image maps
- Ajax is good for scalability but may sometimes cause performance issues.

7. Client-Server Network
- Always compress responses. Even on JSON, the bandwidth gains are great.
- In server-to-server calls, consider binary or otherwise more efficient protocols.
- Even on the web, network-layer protocols like SPDY are interesting.

8. Server: Numbers all should know
- http://static.googleusercontent.com/media/research.google.com/en//people/jeff/stanford-295-talk.pdf
- Writes are heavy.
- Disk seeks are heavier than a network round trip with a memory seek.
- Global shared data is expensive if locking is involved.
- Reads do not need to be transactional, just consistent.
- Eventual consistency is useful.

9. Server - Cache (Low latency)
- Cache
  - Complete HTML response
  - Output from the database
- Cache strategy is determined by: is it a broadcast? A multicast? A unicast?
- Cache works best for broadcast.
- Distributed caching with consistent hashing works very well.
- The pitfall is cache purge.

10. Server (Concurrency)
- Sequential processing leaves CPU and other resources unused.
- Write parallelism is very important, but shared globals are heavy, hence a trade-off.
- In the case of Java, an understanding of the JMM (Java Memory Model) is necessary.
- Amdahl's Law helps in determining the maximum gain that can be achieved with parallel implementations.
- Even when the work is made parallel, a small fraction of sequential work can cause a loss of throughput.

11. Server (State?full:less)
- Given that shared access is expensive, keeping state on the server is heavy.
- Sessions, if available in shared memory, are great.
- No session and share-nothing works best. Even cache is better.
- Generally, stateless code is modular, easier to unit test and easier to profile. Its data lives on the function stack rather than the heap.
- Stateless helps in scale-out. (Scale out??)

12. Server Synchronous/Asynchronous
- Waiting for I/O, network connections and DB queries is bad.
- How about a query of death? On write?
- Writes, if not very small, should be kept asynchronous. This helps with parallelization.
- Reliable queues can improve latency.
- Idempotent code helps in avoiding many pitfalls.
- Generally, asynchronous processing is achieved with
  - Queue/topic-based infrastructure
    - Good for event processing and propagation of events
  - Incremental batches
  - Async I/O servers: Node.js/nginx/Apache event MPM

13. Debugging for Scale
- Profile
- In Java
  - GC logs
  - JVisualVM
  - Thread and memory dumps
- GNU
  - hprof
  - strace
  - gdb
  - System utilities

14. Scale
- Horizontal vs. vertical
- For a stateless, asynchronous, idempotent and multithreaded application, horizontal scaling works very well.
- Easier to understand with storage, a.k.a. databases.

15. Database
- Which type of DBMS?
  - RDBMS
  - Key space-based multi-column family
  - Document-based
  - Graph
  - Any other NoSQL?
  - Solr and Elasticsearch

16. Database scale-out limitation
- CAP theorem
  - Consistency
  - Availability
  - Partition tolerance
- All three are not available simultaneously.
- Eventual consistency is the preferred choice.

17. RDBMS
- Index-based queries, always.
- For an RDBMS, a query of death is a death knock.
- Generally, write once and read at multiple slaves works better.
- To normalize or not to normalize for extensibility
- Use Solr/NoSQL for read scale.
- One complex multi-table join query or multiple simple queries? (performance/scale)

18. NoSQL
- Several options, ranging from document databases to multiple column family stores
- We mostly use
  - Mongo
  - Cassandra
  - Neo4j (in some cases)
  - Titan
- They provide very high throughput with manageable clustering/sharding.

19. Mongo (iBeat)
- Increasing data volume threatens scalability and availability.
- Though search is available, it's not very efficient.
- The limit of a single document is 16 MB.
- Repair DB and reindexing do impact performance.

20. Mongo (iBeat ..)
- Mongo sharding as a solution
- Data volume per replica set decreased.
- For the document size limit, GridFS was used.
- With less document volume, the overhead of indexes etc. was reduced.
- But sharding itself, with a large amount of data, was carried out over a long period of time.

21. Big Data
- Normally associated with data so large and complex that traditional data management/visualization tools fail to capture, curate or process it.
- The current definition covers three aspects, a.k.a. the 3Vs
  - Volume
  - Velocity
  - Variety
- General usage is in
  - Genetic algorithms
  - Machine learning
  - Natural language processing
  - Time series analysis (a.k.a. attribution analysis)
  - Visualizations

22. Big Data
- Our usage is
  - Analytics
  - User preference, personalization, profiling
  - Recommendation
  - Decision support system
- The standard known open source ecosystems
  - Hadoop
  - Event processors/stream engines, e.g. Storm, Spark, S4

23. Big data (Hadoop..)
- Hadoop: originally a component of Nutch, it is now one of the biggest drivers in big data technologies.
- MapReduce: a mechanism/framework to run massively parallel systems. Published originally by Google.
- MapReduce: the trick is distributed sorting.
- New languages for statistical computation, e.g. R

24. Hadoop stack components
- Image borrowed from http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop-2013-part-two-projects/

25. Big data - Real time analysis
- While MapReduce is a great throughput solution, it doesn't help with real-time or near-real-time processing.
- Ecosystems are evolving, either coupled with MapReduce or HDFS.
- Storm/Spark Streaming for augmenting MapReduce-based computations.

26. Most important
- Ability to determine the impact of changes
- Seamless deployments

27. ?
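
A rough sketch of the Amdahl's Law point from the Server (Concurrency) slide, showing how even a small sequential fraction caps the gain from adding workers. The function names and the 95% parallel fraction are illustrative assumptions, not figures from the talk.

```python
import math


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's Law: 1 / ((1 - p) + p / n).

    parallel_fraction (p): share of the work that can run in parallel, 0..1.
    workers (n): number of parallel workers (threads, cores, machines).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return math.inf if serial_fraction == 0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical workload: 95% of the request handling is parallelizable.
    p = 0.95
    for n in (1, 2, 8, 64, 1024):
        print(f"{n:5d} workers -> speedup {amdahl_speedup(p, n):.2f}x")
    print(f"limit with unlimited workers: about {max_speedup(p):.0f}x")
```

With a 5% sequential share the speedup can never exceed roughly 20x, however many workers are added, which is why trimming shared, sequential sections matters more than raw worker count.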