scalability broad strokes

Scalability Broad Strokes - Best practices

Upload: gagan-bajpai

Post on 15-Jan-2015


Category:

Software



DESCRIPTION

A high-level view of scalability best practices.

TRANSCRIPT

Page 1: Scalability   broad strokes

Scalability Broad Strokes - Best practices

Page 2: Scalability   broad strokes

Definition
● Concurrency, a.k.a. the number of simultaneous requests; latency
● Throughput, a.k.a. the total number of items processed
● Extensibility: application design that makes it possible to add new features, etc.
● We will mostly talk about the first two.

Page 3: Scalability   broad strokes

Concurrency & Performance

● Scalability is measured as the number of requests/users an application supports without degrading performance.

● Performance is mostly a measure of the processing time of an individual request.

Page 4: Scalability   broad strokes

Handling Scale
● Throttling
● Cache
● Stateful vs. stateless
● Asynchronous vs. synchronous
● Service-oriented design

Page 5: Scalability   broad strokes

Where (Multi-tiered)
● At the client (browser)
  ○ HTTP headers
  ○ Asynchronous calls
  ○ Local DB
● At the server (web tier/application tier)
  ○ Cache -- distributed
  ○ Stateless
  ○ Asynchronous
● DB
  ○ CAP theorem

Page 6: Scalability   broad strokes

Client
● HTTP headers
  ○ Pragmatic headers not only enable caching in browsers but also help intelligent proxies.
  ○ YSlow / Google PageSpeed guidelines are always useful.
  ○ ETags and long expiry times are very good practices.
  ○ Sprites and image maps
● Ajax is good for scalability but may sometimes cause performance issues.
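The ETag and long-expiry advice can be sketched as follows. This is a minimal illustration, not tied to any particular web framework; the helper names `make_cache_headers` and `respond` and the one-year `max-age` are assumptions made for the example.

```python
import hashlib


def make_cache_headers(body: bytes, max_age_seconds: int = 31536000) -> dict:
    """Build an ETag and a long-expiry Cache-Control header for a response body.

    The one-year default max-age is an assumption chosen for illustration.
    """
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    return {
        "ETag": etag,
        "Cache-Control": f"public, max-age={max_age_seconds}",
    }


def respond(body: bytes, request_headers: dict) -> tuple:
    """Return (status, headers, body).

    If the client's If-None-Match matches the current ETag, answer 304 with an
    empty body so the browser or proxy reuses its cached copy.
    """
    headers = make_cache_headers(body)
    if request_headers.get("If-None-Match") == headers["ETag"]:
        return 304, headers, b""
    return 200, headers, body
```

A client that already holds the resource sends the ETag back in `If-None-Match` and gets a 304 without the payload, which saves bandwidth at the server and at any caching proxy in between.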

Page 7: Scalability   broad strokes

Client Server Network
● Always compress the response.
● Even on JSON the bandwidth gains are great.
● In server-to-server calls, consider binary protocols or other more efficient ones.
● Even on the web, network-layer protocols like SPDY are interesting.
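The bandwidth gain on JSON is easy to check with the standard library. The snippet below is a small sketch; the sample `records` payload is made up for the demonstration, and the actual ratio depends on how repetitive the real data is.

```python
import gzip
import json


def compress_json(payload) -> bytes:
    """Serialize a Python object to JSON and gzip the resulting bytes."""
    raw = json.dumps(payload).encode("utf-8")
    return gzip.compress(raw)


# Hypothetical, repetitive payload of the kind a typical API list call returns.
records = [{"id": i, "status": "active", "category": "software"} for i in range(1000)]
raw = json.dumps(records).encode("utf-8")
packed = compress_json(records)
print(f"raw={len(raw)} bytes, gzip={len(packed)} bytes")
```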

Page 8: Scalability   broad strokes

Server -- Numbers everyone should know

● http://static.googleusercontent.com/media/research.google.com/en//people/jeff/stanford-295-talk.pdf
● Writes are heavy.
● A disk seek is heavier than a network round trip plus a memory read.
● Global shared data is expensive if locking is involved.
● Reads do not need to be transactional, just consistent.
● Eventual consistency is useful.

Page 9: Scalability   broad strokes

Server - Cache (Low latency)

● Cache
  ○ Complete HTML response
  ○ Output from the database
● Cache strategy is determined by
  ○ Is it a broadcast?
  ○ Is it a multicast?
  ○ A unicast?
● Cache works best for broadcast.
● Distributed caching with consistent hashing works very well.
● The pitfall is cache purging.
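Distributed caching with consistent hashing can be sketched as below. This is a minimal illustration, assuming MD5 as the hash function and 100 virtual nodes per cache server; the class name `ConsistentHashRing` and the node names are hypothetical.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps keys to cache nodes so that removing a node only remaps its own keys."""

    def __init__(self, nodes, vnodes: int = 100):
        self._vnodes = vnodes   # virtual nodes per server (assumed value)
        self._hashes = []       # sorted positions on the ring
        self._owners = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            position = self._hash(f"{node}#{i}")
            self._owners[position] = node
            bisect.insort(self._hashes, position)

    def remove_node(self, node: str) -> None:
        self._hashes = [h for h in self._hashes if self._owners[h] != node]
        self._owners = {h: n for h, n in self._owners.items() if n != node}

    def get_node(self, key: str) -> str:
        if not self._hashes:
            raise ValueError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._hashes)
        return self._owners[self._hashes[idx]]
```

When a cache server is removed, only the keys it owned move to other servers; every other key keeps its placement, so most of the cache stays warm.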

Page 10: Scalability   broad strokes

Server (Concurrency)
● Sequential processing leaves CPU and other resources unused.
● Write parallelism is very important.
● But shared globals are heavy, hence a trade-off.
● In the case of Java, understanding the Java Memory Model (JMM) is necessary.
● Amdahl's Law helps in determining the maximum gain that can be achieved with parallel implementations.
● When parallelizing, even a small fraction of sequential work can cause a loss of throughput.
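Amdahl's Law gives the speedup as 1 / ((1 - p) + p / n), where p is the fraction of work that can run in parallel and n is the number of workers. The function below is a small sketch of that formula; the name `amdahl_speedup` and the sample fractions are chosen for illustration.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup from Amdahl's Law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 5% sequential work, the speedup can never exceed 1 / 0.05 = 20,
# no matter how many workers are added.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```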

Page 11: Scalability   broad strokes

Server (State?full:less)
● Given that shared access is expensive, keeping state on the server is heavy.
● Sessions, if available in shared memory, are great.
● No session and share-nothing works best.
● Even a cache is better.
● Generally, stateless code is modular, easier to unit test and easier to profile.
● State lives on the function stack rather than on the heap.
● Stateless helps in scaling out. (Scale out??)

Page 12: Scalability   broad strokes

Server Synchronous/Asynchronous

● Waiting for I/O, network connections, or DB queries is bad.
● How about a “query of death”? On a write?
● Writes, if not very small, should be kept asynchronous.
● Helps with parallelization.
● Reliable queues can improve latency.
● Idempotent code helps in avoiding many pitfalls.
● Generally, asynchronous processing is achieved with
  ○ Queue/topic-based infrastructure
    ■ Good for event processing and propagation of events
  ○ Incremental batches
● Async I/O servers? Node.js / nginx / Apache event MPM?
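Asynchronous, idempotent writes through a queue can be sketched as follows. This is an in-process illustration only, using a standard-library queue as a stand-in for a reliable queue/topic infrastructure; the class `IdempotentWriter`, its `store` dictionary and the message IDs are all hypothetical.

```python
import queue
import threading


class IdempotentWriter:
    """Accepts writes without blocking the caller and applies each message once."""

    def __init__(self):
        self._queue = queue.Queue()
        self._seen = set()   # message IDs already applied
        self.store = {}      # stand-in for the real datastore (assumption)
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def submit(self, message_id: str, key: str, amount: int) -> None:
        """Enqueue the write and return immediately; the caller never waits on I/O."""
        self._queue.put((message_id, key, amount))

    def _drain(self) -> None:
        while True:
            message_id, key, amount = self._queue.get()
            try:
                # A redelivered message is ignored, so retries are safe.
                if message_id not in self._seen:
                    self._seen.add(message_id)
                    self.store[key] = self.store.get(key, 0) + amount
            finally:
                self._queue.task_done()

    def wait(self) -> None:
        """Block until every queued write has been processed."""
        self._queue.join()
```

Because a duplicate delivery has no effect, the producer can retry freely after a timeout, which is one of the pitfalls idempotent code avoids.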

Page 13: Scalability   broad strokes

Debugging for Scale
● Profile
  ○ In Java
    ■ GC logs
    ■ JVisualVM
    ■ Thread and memory dumps
  ○ GNU
    ■ hprof
    ■ strace
    ■ gdb
    ■ System utilities

Page 14: Scalability   broad strokes

Scale Horizontal vs. vertical

● For a stateless, asynchronous, idempotent and multithreaded application, horizontal scaling works very well.

● Easier to understand with storage a.k.a databases.

Page 15: Scalability   broad strokes

Database
● Which type of DBMS?
  ○ RDBMS
  ○ Key space based multi column family
  ○ Document based
  ○ Graph
  ○ Any other NoSQL?
  ○ Solr and Elasticsearch

Page 16: Scalability   broad strokes

Database scale out limitation

● CAP theorem
  ○ Consistency
  ○ Availability
  ○ Partition tolerance
  ○ All three are not available simultaneously
● Eventual consistency is the preferred choice.

Page 17: Scalability   broad strokes

RDBMS
● Always use index-based queries.
● For an RDBMS, a query of death is a death knell.
● Generally, writing at one node and reading from multiple slaves works better.
● To normalize or not?
● Normalize for extensibility.
● Use Solr/NoSQL for read scale.
● One complex multi-table join query or multiple simple queries? (performance/scale)
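The write-at-one-node, read-from-many-slaves pattern can be sketched as a small router. This is a simplified illustration; the class `ReadWriteRouter`, the round-robin policy and the connection names are assumptions, and the strings stand in for real database connections.

```python
import itertools


class ReadWriteRouter:
    """Sends SELECT statements to slaves in round-robin order, everything else to the master."""

    def __init__(self, master, slaves):
        if not slaves:
            raise ValueError("at least one slave is required")
        self._master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql: str):
        parts = sql.split()
        verb = parts[0].upper() if parts else ""
        if verb == "SELECT":
            return next(self._slaves)
        # INSERT/UPDATE/DELETE/DDL and anything unrecognized go to the master.
        return self._master


router = ReadWriteRouter("master", ["slave-1", "slave-2"])
print(router.route("SELECT * FROM users WHERE id = 1"))
print(router.route("UPDATE users SET name = 'x' WHERE id = 1"))
```

Reads served by a slave may lag behind the latest write, so this fits best where eventual consistency is acceptable.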

Page 18: Scalability   broad strokes

NoSQL
● Several options, ranging from document databases to multiple column family stores
● We mostly use
  ○ Mongo
  ○ Cassandra
  ○ Neo4j (in some cases)
  ○ Titan
● Provide very high throughput with manageable clustering/sharding

Page 19: Scalability   broad strokes

Mongo (iBeat)
● Increasing data volumes threaten scalability and availability.
● Though search is available, it is not very efficient.
● The limit of a single document is 16 MB.
● Repair DB and reindexing do impact performance.

Page 20: Scalability   broad strokes

Mongo (iBeat ..)
● Mongo sharding as a solution
● Data volume per replica set decreased.
● For the document size limit, GridFS was used.
● With less document volume, the overhead of indexes etc. was reduced.
● But sharding itself, with a large amount of data, had to be carried out over a long period of time.

Page 21: Scalability   broad strokes

Big Data
● Normally associated with data so large and complex that traditional data management/visualization tools fail to capture, curate or process it.
● The current definition covers three aspects, a.k.a. the 3 Vs
  ○ Volume
  ○ Velocity
  ○ Variety
● General usage is in
  ○ Genetic algorithms
  ○ Machine learning
  ○ Natural language processing
  ○ Time series analysis (a.k.a. attribution analysis)
  ○ Visualizations
  ○ and many more ...

Page 22: Scalability   broad strokes

Big Data
● Our usage is
  ○ Analytics
  ○ User preference, personalization, profiling
  ○ Recommendation
  ○ Decision support system
● The standard known open source ecosystems
  ○ Hadoop
  ○ Event processors / stream engines, e.g. Storm, Spark, S4

Page 23: Scalability   broad strokes

Big data (Hadoop..)
● Hadoop, originally a component of Nutch, is now one of the biggest drivers in big data technologies.
● MapReduce is a mechanism/framework to run massively parallel systems. It was originally published by Google.
● MapReduce: the trick is distributed sorting.
● New languages for statistical computation, e.g. R
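The role sorting plays in MapReduce can be seen in a single-process sketch of word count. This is not Hadoop code; the function names are hypothetical, and the `sorted` call stands in for the distributed sort/shuffle a real cluster performs between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Sort pairs by key so that equal keys end up adjacent.

    On a cluster this is the distributed sort; here it is an in-memory sort.
    """
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Sum the counts of each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    return dict(reduce_phase(shuffle_phase(map_phase(lines))))
```

Because the reducer only ever sees one contiguous run per key, each run can be handed to a different machine, which is what makes the sort step the key to parallelism.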

Page 24: Scalability   broad strokes

Hadoop stack components

Image borrowed from http://blogs.gartner.com/merv-adrian/2013/02/21/hadoop-2013-part-two-projects/

Page 25: Scalability   broad strokes

Big data - Real time analysis

● While MapReduce is a great throughput solution, it doesn't help with real-time or near-real-time processing.

● Ecosystems are evolving, coupled with either MapReduce or HDFS.

● Storm/Spark Streaming for augmenting MapReduce-based computations.

Page 26: Scalability   broad strokes

Most important
● Ability to determine the impact of changes
● Seamless deployments

Page 27: Scalability   broad strokes

?