Kansas City Big Data: The Future of Insights - Keynote: "Big Data Technologies and Tools"
DESCRIPTION
Kansas City IT Professionals, a grassroots tech community of 9,000+ members, held an event on August 30th, 2012 entitled Big Data: The Future Of Insights (see: http://kcitp.me/M67S9M). The event consisted of 2 keynotes & a panel with expert data scientists, engineers, and data analysts from companies like Adknowledge and Cerner. This talk, entitled "Big Data Technologies and Tools", was delivered by Ryan Brush, Distinguished Engineer w/ Cerner.
TRANSCRIPT
Big Data Technologies and Techniques
Ryan Brush
Distinguished Engineer, Cerner Corporation
@ryanbrush
Relational Databases are Awesome
Atomic, transactional updates
Declarative queries
Guaranteed consistency
Easy to reason about
Long track record of success
Relational Databases are Awesome
…so use them!
But…
Those advantages have a cost
Global, atomic state means global, atomic coordination
Coordination does not scale linearly
The costs of coordination
Remember the network effect?
2 nodes = 1 channel
5 nodes = 10 channels
12 nodes = 66 channels
25 nodes = 300 channels
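The channel counts above follow the pairwise-connection formula n(n − 1)/2, the same "network effect" math the slide alludes to. A minimal sketch (the function name is illustrative):

```python
def channels(n):
    """Pairwise communication channels among n fully-connected nodes: n*(n-1)/2."""
    return n * (n - 1) // 2

# Growth is quadratic, not linear: coordination cost explodes with cluster size.
counts = [channels(n) for n in (2, 5, 12, 25)]
```

Doubling the node count roughly quadruples the coordination channels, which is why globally coordinated state cannot scale linearly.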
So we better be able to scale
Databases have optimized this in many clever ways, but a limit on scalability still exists
Let’s look at some ways to scale
Bulk processing billions of records
Data aggregation and storage
Real-time processing of updates
Serving data for: online apps, analytics
Let’s start with scalability of bulk processing
Quiz: which one is scalable?
A 1000-node Hadoop cluster where jobs depend on a common process
1000 Windows ME machines running independent Excel macros
Independence → Parallelizable
Parallelizable → Scalable
“Shared Nothing” architectures are the most scalable…
…but most real-world problems require us to share something…
…so our designs usually have a parallel part and a serial part
The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
Amdahl’s Law

S = 1 / ((1 − P) + P / N)

S: speed improvement
P: ratio of the problem that can be parallelized
N: number of processors
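Amdahl's Law makes the cost of the serial part concrete. A short sketch (function name is illustrative):

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: S = 1 / ((1 - P) + P / N)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 1000 processors, a 5% serial fraction caps speedup below 20x.
capped = amdahl_speedup(p=0.95, n=1000)
```

This is why the serial part of a design, however small, dominates at scale: throwing more nodes at the problem stops helping once the serial fraction is the bottleneck.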
MapReduce Primer
[Diagram: input data is divided into splits 1…N; each split feeds a mapper (Map phase); mapper output is shuffled to reducers 1…N (Reduce phase).]
MapReduce Example: Word Count
[Diagram: books are split among mappers that count words per book (Map phase); the shuffle routes each word to a reducer, and reducers sum words A–C, D–E, …, W–Z (Reduce phase).]
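The word-count flow above can be sketched in a few lines of Python. This is a single-process simulation of the map/shuffle/reduce phases, not Hadoop code; all names are illustrative:

```python
from collections import defaultdict
from itertools import groupby

def mapper(book_id, text):
    # Map phase: count words within one book, emitting (word, count) pairs.
    counts = defaultdict(int)
    for word in text.lower().split():
        counts[word] += 1
    yield from counts.items()

def shuffle(mapped):
    # Shuffle: group all emitted pairs by key, so each word's counts
    # land at a single reducer.
    for word, group in groupby(sorted(mapped), key=lambda kv: kv[0]):
        yield word, [n for _, n in group]

def reducer(word, counts):
    # Reduce phase: sum the per-book counts into a global total.
    return word, sum(counts)

books = {"b1": "big data big ideas", "b2": "big clusters"}
mapped = [kv for bid, text in books.items() for kv in mapper(bid, text)]
totals = dict(reducer(w, ns) for w, ns in shuffle(mapped))
```

Note that the mappers run independently per book; only the shuffle and reduce steps require coordination.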
Notice there is still a serial part of the problem: the output of the reducers must be combined
…but this is much smaller, and can be handled by a single process
Also notice that the network is a shared resource when processing big data
So rather than moving data to computation,we move computation to data.
MapReduce Data Locality
[Diagram: the same map/shuffle/reduce flow, with each split and the mapper that reads it co-located on the same physical machine.]
Data locality is only guaranteed in the Map phase
So the most data-intensive work should be done in the map, with smaller data sets sent to the reducer
Some Map/Reduce jobs have no reducer at all!
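A map-only job is the extreme case of this advice: with no shuffle and no reduce, every record is processed where it lives. A minimal sketch of such a record-at-a-time transform (names are illustrative):

```python
def map_only_clean(records):
    # No shuffle, no reduce: each record is transformed independently,
    # so every mapper works on its local block and writes output directly.
    for rec in records:
        rec = rec.strip()
        if rec:                      # drop blank records
            yield rec.lower()

cleaned = list(map_only_clean(["  Foo ", "", "BAR"]))
```

Typical uses are cleanup, parsing, and format conversion, where no cross-record aggregation is needed.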
MapReduce Gone Wrong
[Diagram: the word-count flow again, but every mapper and reducer now calls out to a remote "Word Addition Service" for each record.]
Even if our Word Addition Service is scalable, we’d need to scale it to the size of the largest Map/Reduce job that will ever use it
So for data processing, prefer embedded libraries over remote services
Use remote services for configuration, to prime caches, etc. – just not for every data element!
Joining a billion records
Word counts are great, but many real-world problems mean bringing together multiple data sets.
So how do we “join” with MapReduce?
Map-Side Joins
[Diagram: Data Set 1 is split across mappers as usual; a full copy of the small Data Set 2 is shipped to each mapper, so the join happens in the Map phase.]
When joining one big input to a small one, simply copy the small data set to each mapper
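The map-side join idea can be sketched as follows: the small data set fits in each mapper's memory, so the join needs no shuffle at all. A single-process simulation with illustrative names:

```python
def map_side_join(big_records, small_lookup):
    # small_lookup (the small Data Set 2) is copied whole to every mapper;
    # each mapper joins its split of Data Set 1 locally, with no shuffle.
    for key, value in big_records:
        if key in small_lookup:
            yield key, value, small_lookup[key]

orders = [("u1", "order-9"), ("u2", "order-3"), ("u9", "order-7")]
users = {"u1": "Ada", "u2": "Grace"}   # small enough to fit in memory
joined = list(map_side_join(orders, users))
```

In Hadoop terms, the small data set is typically shipped via the distributed cache; the pattern only works when one side is genuinely small.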
Merge in Reducer
[Diagram: Data Set 1 and Data Set 2 are both split and mapped; each mapper groups its records by key, and the shuffle routes common items to the same reducer, where they are merged.]
Route common items to the same reducer
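When both inputs are large, the shuffle itself does the routing: each record is tagged with its source, grouped by key, and merged at the reducer. A single-process sketch of this reduce-side join (names are illustrative):

```python
from itertools import groupby

def reduce_side_join(set1, set2):
    # Map phase tags each record with its source; the shuffle groups by key,
    # so one reducer sees every record for a given key from both data sets.
    tagged = [(k, ("s1", v)) for k, v in set1] + [(k, ("s2", v)) for k, v in set2]
    for key, group in groupby(sorted(tagged), key=lambda kv: kv[0]):
        vals = [tv for _, tv in group]
        lefts = [v for tag, v in vals if tag == "s1"]
        rights = [v for tag, v in vals if tag == "s2"]
        for l in lefts:
            for r in rights:
                yield key, l, r

purchases = [("u1", "order-9"), ("u2", "order-3"), ("u9", "order-7")]
names = [("u1", "Ada"), ("u2", "Grace")]
joined = list(reduce_side_join(purchases, names))
```

This handles two big inputs but pays for a full shuffle of both, which is why the map-side join is preferred when one side is small.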
Higher-Level Constructs
MapReduce is a primitive operation for higher-level constructs
Hive, Pig, Cascading, and Crunch all compile into MapReduce
Crunch!
Use one!
MapReduce and MPP Databases
MapReduce | MPP Databases
Data in a distributed filesystem | Data in sharded relational databases
Oriented towards unstructured or semi-structured data | Oriented towards structured data
Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL
Poor support for iterative operations | Good support for iterative operations
Arbitrarily complex programs running next to data | SQL and User-Defined Functions running next to data
Poor interactive query support | Good interactive query support
MapReduce and MPP Databases …are complementary!
Use Map/Reduce to clean, normalize, reconcile, and codify data, then load it into an MPP system for interactive analysis
Bulk processing billions of records
Data aggregation and storage
Hadoop Distributed Filesystem
Scales to many petabytes
Splits all files into blocks and spreads them across data nodes
The name node keeps track of what blocks belong to what file
All blocks written in triplicate
Write and append only – no random updates!
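The block-splitting and triple-replication scheme can be sketched as follows. This is a toy round-robin placement, not real HDFS behavior (actual placement is rack-aware and pluggable), and all names are hypothetical:

```python
def place_blocks(file_size, block_size, data_nodes, replicas=3):
    # Split a file into fixed-size blocks and assign each block to
    # `replicas` distinct data nodes; the name node would record this map.
    n_blocks = -(-file_size // block_size)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [data_nodes[(b + i) % len(data_nodes)]
                        for i in range(replicas)]
    return placement

# A 200 MB file with 64 MB blocks needs 4 blocks, each stored in triplicate.
layout = place_blocks(200, 64, ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on three machines, the loss of any single data node loses no data, and mappers can be scheduled next to any replica.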
HDFS Writes
[Diagram: the client asks the name node which data node to write to, writes each block to a data node, and that node replicates the block to two others.]
HDFS Reads
[Diagram: the client asks the name node for block locations, then reads the blocks directly from the data nodes.]
HDFS Shortcomings
No random reads
No random writes
Doesn’t deal well with many small files
Enter HBase
“Random Access To Your Planet-Size Data”
HBase
Emulates random I/O with a Write Ahead Log (WAL)
Periodically flushes the log to sorted files
Files accessible as tables, split across many regions, hosted by region servers
Preserves scalability, data locality, and Map/Reduce features of Hadoop
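The log-plus-flush trick is how HBase gets random reads and writes out of an append-only filesystem. The toy class below is a hypothetical sketch of that idea, not the HBase API: writes go to an in-memory map (HBase's memstore, durably backed by the WAL), which is flushed to immutable sorted files (HFiles) when it grows; reads check the memory layer first, then the sorted files, newest first:

```python
import bisect

class TinyWalStore:
    def __init__(self, flush_at=3):
        self.log = {}              # in-memory writes (memstore + WAL stand-in)
        self.sorted_files = []     # flushed, immutable sorted (key, value) lists
        self.flush_at = flush_at

    def put(self, key, value):
        self.log[key] = value
        if len(self.log) >= self.flush_at:
            # Flush: write the log out as a sorted, append-only file.
            self.sorted_files.append(sorted(self.log.items()))
            self.log = {}

    def get(self, key):
        if key in self.log:                      # freshest data is in memory
            return self.log[key]
        for f in reversed(self.sorted_files):    # then newest flushed file wins
            i = bisect.bisect_left(f, (key,))
            if i < len(f) and f[i][0] == key:
                return f[i][1]
        return None

store = TinyWalStore(flush_at=2)
store.put("a", 1)
store.put("b", 2)   # triggers a flush to a sorted file
store.put("a", 3)   # newer value stays in the in-memory log
```

Real HBase adds compaction of the sorted files, region splitting, and durability guarantees, but the read/write path follows this shape.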
Use HBase when:
You have noisy, semi-structured data
You want to apply massively parallel processing to your problem
To handle huge write loads
As a scalable key/value store
But there are drawbacks:
Limited schema support
Limited atomicity guarantees
No built-in secondary indexes
HBase is a great tool for many jobs, but not every job
The data store should align with the needs of the application
So a pattern is emerging:
Hadoop with HBase
[Diagram: source feeds (Millennium, CCDs, Claims, HL7) flow into Hadoop with HBase for collection and aggregation; MapReduce jobs process the data and load it into MPP, relational, and document stores, plus HBase, for storage.]
But we have a potential bottleneck
Direct inserts are designed for online updates, not massively parallel data loads
So shift the work into MapReduce, and pre-build files for bulk import
Oracle Loader for Hadoop
HBase HFile import
Bulk loads for MPP
And we’re missing an important piece:
[Diagram: the same pipeline, with a Realtime Processing path added alongside the batch Map/Reduce jobs.]
How do we make it fast?
Speed Layer
Batch Layer
http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
Speed Layer | Batch Layer
Low latency (seconds to process) | High latency (minutes or hours to process)
Move data to computation | Move computation to data
Hours of data | Years of data
Incremental updates | Bulk loads
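In this lambda-style architecture, a query merges the two layers: the batch layer's view covers years of data but lags by hours, and the speed layer's view covers only what has arrived since the last batch run. A minimal sketch with hypothetical view names:

```python
def query(batch_view, speed_view, key):
    # Merge the precomputed batch view with the incremental speed-layer view;
    # together they cover all of history up to now.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"pageviews:/home": 1_000_000}   # rebuilt in bulk, e.g. nightly
speed_view = {"pageviews:/home": 42}          # incremental recent updates
total = query(batch_view, speed_view, "pageviews:/home")
```

When the next batch run completes, it absorbs the recent data and the speed layer's view is discarded and restarted, which keeps the low-latency path small and simple.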
Speed Layer | Batch Layer
Storm, Complex Event Processing | MapReduce (Hadoop)
And now, the challenge…
Process all data overnight
Quickly create new data models
Simple correction of any bugs
Fast iteration cycles mean fast innovation
Much easier to understand and work with
Questions?