ashish thusoo evolution of big data architectures
Post on 15-Jul-2015
Evolution of Big Data Architectures
Architecture Summit, Aug 2012
Ashish Thusoo
Outline
Demand for Big Data
Architectural Trade Offs and Evolution
Where next?
The Changing Planet
3 Technology Drivers
Devices
Infrastructure
Applications
Evolution: Devices
Key Capabilities
Connected
Location Aware
Sensory & Powerful
Evolution: Connectivity
Mobile Subscription Density 2004
Evolution: Connectivity
Mobile Subscription Density 2010
Evolution: Bandwidth
Evolution: Applications
Salient Traits
Cloud based
Web scale
Explosion in Data
Big Data
Volume
Velocity
Variety
Big Data: Volume
Volume:
2011: 1.8 zettabytes of digital universe
2009 - 2020: 35 zettabytes
Big Data: Velocity
Velocity
340 million tweets per day
72 hours of video uploaded every minute on YouTube
2.9 million emails a second
Big Data: Variety
Variety
Video
Pictures
Application Logs
etc. etc...
Disruptive Architectures
Disruptions in Data Arch
Change in Focus (1990s -> 2000s)
Performance -> Scalability & Availability
Rigid/Structured -> Flexible/Semistructured
Scalability & Availability
Towards Scalability
Problem
10K ops/sec -> 1M ops/sec
TB of data -> PB of data
Towards Scalability
Solution: SHARDING (Divide and Conquer)
Towards Scalability
How do we quickly route a record to a shard?
fn( ):
- Consistent Hashing
- Mapping Table
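The routing function named on this slide can be sketched with a consistent hash ring. This is a minimal illustration under assumptions, not the speaker's implementation: the class name `ConsistentHashRing`, the shard names, the use of MD5, and the `vnodes` parameter are all hypothetical choices made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    """Routes record keys to shards (illustrative names throughout)."""

    def __init__(self, shards, vnodes=64):
        # Each shard owns several virtual points on the ring to even out load.
        self._ring = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def route(self, key: str) -> str:
        """Return the shard owning the first ring point at or after hash(key)."""
        idx = bisect.bisect_left(self._positions, _hash(key))
        if idx == len(self._positions):
            idx = 0  # wrap around to the start of the ring
        return self._ring[idx][1]


ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
bigger = ConsistentHashRing(["shard-0", "shard-1", "shard-2", "shard-3"])

keys = [f"user:{n}" for n in range(1000)]
# Count how many keys change shard when a fourth shard is added.
moved = sum(1 for k in keys if ring.route(k) != bigger.route(k))
```

Adding a shard only moves keys onto the new shard; keys never shuffle between the existing ones. A mapping table would instead store an explicit key-range-to-shard lookup, trading this automatic placement for manual control.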
Towards Scalability
What happens if part of the record is in one shard and part in another?
Towards Scalability
Keep it Simple: Application deals with atomicity & consistency semantics
Towards Availability
What if my shard is down? Where do I put my record?
Towards Availability
Let's just replicate the shards and pray that one is available :)
Towards Availability
Replication strategies
What should be the number of replicas?
How to rebuild a replica?
How to propagate a record to a replica?
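The three replication questions above can be made concrete with a small in-memory sketch. It assumes one particular set of answers (three replicas, writes propagated synchronously to every live replica, rebuild by copying from a live peer); the slides do not prescribe these, and the `Replica` and `ReplicatedShard` classes are hypothetical.

```python
class Replica:
    """One in-memory copy of a shard (hypothetical, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True


class ReplicatedShard:
    def __init__(self, replica_count=3):
        self.replicas = [Replica(f"replica-{i}") for i in range(replica_count)]

    def put(self, key, value):
        """Propagate the record to every live replica; fail if none is up."""
        live = [r for r in self.replicas if r.up]
        if not live:
            raise RuntimeError("no replica available")
        for replica in live:
            replica.data[key] = value
        return len(live)

    def get(self, key):
        """Read from the first live replica."""
        for replica in self.replicas:
            if replica.up:
                return replica.data.get(key)
        raise RuntimeError("no replica available")

    def rebuild(self, index):
        """Rebuild a recovered replica by copying from a live peer."""
        target = self.replicas[index]
        source = next(r for r in self.replicas if r.up and r is not target)
        target.data = dict(source.data)
        target.up = True


shard = ReplicatedShard(replica_count=3)
shard.put("k1", "v1")
shard.replicas[0].up = False           # simulate a failed replica
shard.put("k2", "v2")                  # still succeeds on the two live replicas
value_during_outage = shard.get("k2")
shard.rebuild(0)                       # replica-0 catches up from a peer
```

The replica that was down misses the write to `k2`, which is why a rebuild step is needed before it can serve reads again.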
1990s vs 2000s
Different Focus:
1990s (Raw Performance)
Optimal I/O structures
Cache Sensitive Algorithms
2000s (Scalability, Availability)
Sharding
Replication
Flexibility/Semi-structure
Towards Flexibility
Problem
Does structure in a database make it slower to write applications (sprint vs waterfall model)?
What if my data is not records and tables?
Towards Flexibility
How does knowing my record structure help my data system?
Helps to optimize execution plans
Helps to optimize my storage layouts
Trade off?
Application change means database schema change, rebuilding indexes etc. etc.
Towards Flexibility
Most of my operations are simple lookups, range lookups and updates
Since the execution is simple we don’t need all the structure
Keep enough structure to support fast gets and puts
Towards Flexibility
Solution: Key-Value Stores (NoSQL)
[Diagram: a table with KEY and VALUE columns]
- Sorted HashMaps
- Sorted Files
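A store that keeps "enough structure to support fast gets and puts" can be sketched as a hash map plus a sorted key list, which also gives the range lookups mentioned earlier. This is an in-memory illustration only; the `SortedKVStore` name and the `user:` key convention are assumptions for the example.

```python
import bisect


class SortedKVStore:
    """Hash map for point lookups plus a sorted key list for range scans."""

    def __init__(self):
        self._keys = []     # always kept in sorted order
        self._values = {}   # key -> opaque value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key, default=None):
        return self._values.get(key, default)

    def range(self, start, end):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


store = SortedKVStore()
for k, v in [("user:3", "carol"), ("user:1", "alice"),
             ("user:2", "bob"), ("video:9", "cat.mp4")]:
    store.put(k, v)
store.put("user:2", "bobby")                    # update in place

# ";" sorts immediately after ":", so this scans every "user:" key.
users = list(store.range("user:", "user;"))
```

The value is opaque to the store: no schema has to change when the application changes what it puts there.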
Towards Flexibility
Need to update related “values” of a key (Some Atomicity)
[Diagram: a table with KEY, TAG and VALUE columns]
TAG = COLUMN FAMILY
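One way to picture a column family is as a nested map, key to tag to columns, where all related values under one key and tag are updated in a single step. The sketch below is an assumption-laden illustration: the `ColumnFamilyStore` class is hypothetical, and the single global lock is a simplification chosen only to make the "some atomicity" point visible.

```python
import threading
from collections import defaultdict


class ColumnFamilyStore:
    def __init__(self):
        # key -> column family (tag) -> column name -> value
        self._rows = defaultdict(lambda: defaultdict(dict))
        self._lock = threading.Lock()

    def put(self, key, family, columns):
        """Update several related values of one key as one atomic step."""
        with self._lock:
            self._rows[key][family].update(columns)

    def get(self, key, family):
        """Return a snapshot copy of all columns in one family of a key."""
        with self._lock:
            return dict(self._rows[key][family])


cf_store = ColumnFamilyStore()
cf_store.put("user:1", "profile", {"name": "alice", "city": "Pune"})
cf_store.put("user:1", "stats", {"logins": 3})
cf_store.put("user:1", "profile", {"city": "Delhi"})   # touches one family only

profile = cf_store.get("user:1", "profile")
stats = cf_store.get("user:1", "stats")
```

Atomicity here is scoped to a single key; nothing in this sketch coordinates updates across different keys.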
Towards Flexibility
Gets and puts are fine for online applications, BUT...
What about Analytics?
Transformations can be really complicated...
Towards Flexibility
Is there a simple construct that can solve a number of analytics queries?
Of course: SORT
And it can be parallelized too
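The claim that SORT answers many analytics queries and parallelizes well can be illustrated with a group-by count. In this sketch the event list is made-up sample data, and threads merely stand in for separate machines; it shows the shape of the idea rather than a tuned implementation.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

# Made-up sample data for the example.
events = ["video", "email", "tweet", "video", "tweet", "video", "email", "video"]

# Split the input into partitions and sort each partition independently.
partitions = [events[i::3] for i in range(3)]
with ThreadPoolExecutor(max_workers=3) as pool:
    sorted_parts = list(pool.map(sorted, partitions))

# Merge the sorted runs; after this, equal keys sit next to each other.
merged = list(heapq.merge(*sorted_parts))

# "GROUP BY event, COUNT(*)" is now a single linear pass over sorted data.
counts = {key: sum(1 for _ in group) for key, group in groupby(merged)}
```

Once equal keys are adjacent, counts, sums, distinct values and joins on that key all reduce to a sequential scan.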
Towards Flexibility
MAP/REDUCE (Scalable Parallel Pluggable SORT)
[Diagram: Mappers feeding Reducers]
m{ }: user defined map function
r{ }: user defined reduce function
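The "pluggable sort" view of MAP/REDUCE can be sketched in a single process: the framework owns partitioning and sorting, and the user plugs in `m` and `r`. The `map_reduce` driver below is a hypothetical toy, and word counting is just an assumed example workload.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, m, r, num_reducers=2):
    # Map phase: apply the user's m to every record, collecting (key, value) pairs.
    pairs = [kv for record in records for kv in m(record)]

    # Shuffle: partition by key so each key lands on exactly one reducer.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers].append((key, value))

    # Sort each partition by key, then hand each key's values to the user's r.
    output = {}
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))
        for key, group in groupby(bucket, key=itemgetter(0)):
            output[key] = r(key, [value for _, value in group])
    return output


def m(line):
    """User-defined map: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def r(word, counts):
    """User-defined reduce: total the counts for one word."""
    return sum(counts)


result = map_reduce(["big data", "Big architectures", "data data"], m, r)
```

Swapping in different `m` and `r` functions changes the query without touching the partition, sort, and group machinery.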
Towards Flexibility
MAP/REDUCE and Failures
[Diagram: Mappers and Reducers, with one node marked X as failed]
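The slide does not spell out how failures are handled, so the following is only one plausible reading: if each task is deterministic and the outputs of finished tasks are kept, a failure means re-running just the lost task. The `run_with_retries` scheduler and the `flaky_mapper` worker are hypothetical names for this sketch.

```python
def run_with_retries(tasks, worker, max_attempts=3):
    """Run each task; re-execute only the tasks that failed."""
    results = {}
    attempts = {task_id: 0 for task_id in tasks}
    pending = list(tasks)
    while pending:
        task_id = pending.pop(0)
        attempts[task_id] += 1
        try:
            results[task_id] = worker(task_id, tasks[task_id])
        except RuntimeError:
            if attempts[task_id] >= max_attempts:
                raise
            pending.append(task_id)  # reschedule just this task
    return results, attempts


failed_once = set()


def flaky_mapper(task_id, split):
    # Simulate a node failure the first time split "s1" is processed.
    if task_id == "s1" and task_id not in failed_once:
        failed_once.add(task_id)
        raise RuntimeError("mapper node lost")
    return sum(split)


splits = {"s0": [1, 2], "s1": [3, 4], "s2": [5, 6]}
results, attempts = run_with_retries(splits, flaky_mapper)
```

Completed results for `s0` and `s2` are never recomputed; only `s1` runs twice.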
1990s vs 2000s
Different Focus:
1990s (Raw Performance)
Structure important for speed optimizations
Stream everything through Query plan
2000s (Sprint mode of application development)
Support dev efficiency and data variety
Checkpointing for restartability
Where now?
The New Meets The Old
Disruption?
Well, we still need SQL
We still need to make these work with other components
Guess what? Efficiency is also important at scale
Where Does New Fail?
Transactions?
Moving money from one account to another
Graphs?
Networks everywhere
How to do second-order analysis on graphs?
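The transactions case can be made concrete with the money-transfer example from this slide. In the sketch below a plain dict stands in for a key-value store that offers only independent gets and puts; the `transfer` function and the `crash_between` flag are hypothetical devices for simulating a client failure between the two writes.

```python
accounts = {"alice": 100, "bob": 50}   # stand-in for a key-value store


def transfer(store, src, dst, amount, crash_between=False):
    """Two independent puts: nothing ties the debit to the credit."""
    store[src] = store[src] - amount          # put #1 (debit)
    if crash_between:
        raise RuntimeError("client crashed after debit")
    store[dst] = store[dst] + amount          # put #2 (credit)


total_before = sum(accounts.values())
try:
    transfer(accounts, "alice", "bob", 30, crash_between=True)
except RuntimeError:
    pass  # the debit has already been applied and is not rolled back
total_after = sum(accounts.values())
```

After the simulated crash, the debit is visible but the credit never happened, so money has disappeared from the system. Preventing that requires atomicity across keys, which a plain get/put interface does not provide.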
Thank You!