Kansas City Big Data: The Future of Insights - Keynote: "Big Data Technologies and Tools"
DESCRIPTION
Kansas City IT Professionals, a grassroots tech community of 9,000+ members, held an event on August 30th, 2012 entitled Big Data: The Future Of Insights (see: http://kcitp.me/M67S9M). The event consisted of 2 keynotes & a panel with expert data scientists, engineers, and data analysts from companies like Adknowledge and Cerner. This talk, entitled "Big Data Technologies and Tools", was delivered by Ryan Brush, Distinguished Engineer w/ Cerner.
TRANSCRIPT
Big Data Technologies and Techniques
Ryan Brush
Distinguished Engineer, Cerner Corporation
@ryanbrush
Relational Databases are Awesome
Atomic, transactional updates
Declarative queries
Guaranteed consistency
Easy to reason about
Long track record of success
Relational Databases are Awesome
…so use them!
But…
Those advantages have a cost
Global, atomic state means global, atomic coordination
Coordination does not scale linearly
The costs of coordination
Remember the network effect?
2 nodes = 1 channel
5 nodes = 10 channels
12 nodes = 66 channels
25 nodes = 300 channels
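The channel counts above follow the pairwise-connection formula n(n − 1)/2, the same "network effect" math the slide alludes to. A minimal sketch (the function name is illustrative):

```python
def channels(n):
    """Pairwise communication channels among n fully-connected nodes: n*(n-1)/2."""
    return n * (n - 1) // 2

# Growth is quadratic, not linear: coordination cost explodes with cluster size.
counts = [channels(n) for n in (2, 5, 12, 25)]
```

Doubling the node count roughly quadruples the coordination channels, which is why globally coordinated state cannot scale linearly.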
So we better be able to scale
Databases have optimized this in many clever ways, but a limit on scalability still exists
Let’s look at some ways to scale
Bulk processing billions of records
Data aggregation and storage
Real-time processing of updates
Serving data for: online apps, analytics
Let’s start with scalability of bulk processing
Quiz: which one is scalable?
A 1000-node Hadoop cluster where jobs depend on a common process
1000 Windows ME machines running independent Excel macros
Independence → Parallelizable
Parallelizable → Scalable
“Shared Nothing” architectures are the most scalable…
…but most real-world problems require us to share something…
…so our designs usually have a parallel part and a serial part
The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
Amdahl’s Law

S = 1 / ((1 − P) + P / N)

S: speed improvement
P: ratio of the problem that can be parallelized
N: number of processors
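Amdahl's Law makes the cost of the serial part concrete. A short sketch (function name is illustrative):

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: S = 1 / ((1 - P) + P / N)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 1000 processors, a 5% serial fraction caps speedup below 20x.
capped = amdahl_speedup(p=0.95, n=1000)
```

This is why the serial part of a design, however small, dominates at scale: throwing more nodes at the problem stops helping once the serial fraction is the bottleneck.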
MapReduce Primer
[Diagram: input data is divided into splits 1…N; each split feeds a mapper (Map phase); mapper output is shuffled to reducers 1…N (Reduce phase).]
MapReduce Example: Word Count
[Diagram: books are split among mappers that count words per book (Map phase); the shuffle routes each word to a reducer, and reducers sum words A–C, D–E, …, W–Z (Reduce phase).]
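The word-count flow above can be sketched in a few lines of Python. This is a single-process simulation of the map/shuffle/reduce phases, not Hadoop code; all names are illustrative:

```python
from collections import defaultdict
from itertools import groupby

def mapper(book_id, text):
    # Map phase: count words within one book, emitting (word, count) pairs.
    counts = defaultdict(int)
    for word in text.lower().split():
        counts[word] += 1
    yield from counts.items()

def shuffle(mapped):
    # Shuffle: group all emitted pairs by key, so each word's counts
    # land at a single reducer.
    for word, group in groupby(sorted(mapped), key=lambda kv: kv[0]):
        yield word, [n for _, n in group]

def reducer(word, counts):
    # Reduce phase: sum the per-book counts into a global total.
    return word, sum(counts)

books = {"b1": "big data big ideas", "b2": "big clusters"}
mapped = [kv for bid, text in books.items() for kv in mapper(bid, text)]
totals = dict(reducer(w, ns) for w, ns in shuffle(mapped))
```

Note that the mappers run independently per book; only the shuffle and reduce steps require coordination.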
Notice there is still a serial part of the problem: the output of the reducers must be combined
…but this is much smaller, and can be handled by a single process
Also notice that the network is a shared resource when processing big data
So rather than moving data to computation,we move computation to data.
MapReduce Data Locality
[Diagram: the same map/shuffle/reduce flow, with each split and the mapper that reads it co-located on the same physical machine.]
Data locality is only guaranteed in the Map phase
So the most data-intensive work should be done in the map, with smaller data sets sent to the reducer
Some Map/Reduce jobs have no reducer at all!
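A map-only job is the extreme case of this advice: with no shuffle and no reduce, every record is processed where it lives. A minimal sketch of such a record-at-a-time transform (names are illustrative):

```python
def map_only_clean(records):
    # No shuffle, no reduce: each record is transformed independently,
    # so every mapper works on its local block and writes output directly.
    for rec in records:
        rec = rec.strip()
        if rec:                      # drop blank records
            yield rec.lower()

cleaned = list(map_only_clean(["  Foo ", "", "BAR"]))
```

Typical uses are cleanup, parsing, and format conversion, where no cross-record aggregation is needed.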
MapReduce Gone Wrong
[Diagram: the word-count flow again, but every mapper and reducer now calls out to a remote "Word Addition Service" for each record.]
Even if our Word Addition Service is scalable, we’d need to scale it to the size of the largest Map/Reduce job that will ever use it
So for data processing, prefer embedded libraries over remote services
Use remote services for configuration, to prime caches, etc. – just not for every data element!
Joining a billion records
Word counts are great, but many real-world problems mean bringing together multiple data sets.
So how do we “join” with MapReduce?
Map-Side Joins
[Diagram: Data Set 1 is split across mappers as usual; a full copy of the small Data Set 2 is shipped to each mapper, so the join happens in the Map phase.]
When joining one big input to a small one, simply copy the small data set to each mapper
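The map-side join idea can be sketched as follows: the small data set fits in each mapper's memory, so the join needs no shuffle at all. A single-process simulation with illustrative names:

```python
def map_side_join(big_records, small_lookup):
    # small_lookup (the small Data Set 2) is copied whole to every mapper;
    # each mapper joins its split of Data Set 1 locally, with no shuffle.
    for key, value in big_records:
        if key in small_lookup:
            yield key, value, small_lookup[key]

orders = [("u1", "order-9"), ("u2", "order-3"), ("u9", "order-7")]
users = {"u1": "Ada", "u2": "Grace"}   # small enough to fit in memory
joined = list(map_side_join(orders, users))
```

In Hadoop terms, the small data set is typically shipped via the distributed cache; the pattern only works when one side is genuinely small.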
Merge in Reducer
[Diagram: Data Set 1 and Data Set 2 are both split and mapped; each mapper groups its records by key, and the shuffle routes common items to the same reducer, where they are merged.]
Route common items to the same reducer
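When both inputs are large, the shuffle itself does the routing: each record is tagged with its source, grouped by key, and merged at the reducer. A single-process sketch of this reduce-side join (names are illustrative):

```python
from itertools import groupby

def reduce_side_join(set1, set2):
    # Map phase tags each record with its source; the shuffle groups by key,
    # so one reducer sees every record for a given key from both data sets.
    tagged = [(k, ("s1", v)) for k, v in set1] + [(k, ("s2", v)) for k, v in set2]
    for key, group in groupby(sorted(tagged), key=lambda kv: kv[0]):
        vals = [tv for _, tv in group]
        lefts = [v for tag, v in vals if tag == "s1"]
        rights = [v for tag, v in vals if tag == "s2"]
        for l in lefts:
            for r in rights:
                yield key, l, r

purchases = [("u1", "order-9"), ("u2", "order-3"), ("u9", "order-7")]
names = [("u1", "Ada"), ("u2", "Grace")]
joined = list(reduce_side_join(purchases, names))
```

This handles two big inputs but pays for a full shuffle of both, which is why the map-side join is preferred when one side is small.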
Higher-Level Constructs
MapReduce is a primitive operation for higher-level constructs
Hive, Pig, Cascading, and Crunch all compile into MapReduce
Crunch!
Use one!
MapReduce and MPP Databases
MapReduce | MPP Databases
Data in a distributed filesystem | Data in sharded relational databases
Oriented towards unstructured or semi-structured data | Oriented towards structured data
Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL
Poor support for iterative operations | Good support for iterative operations
Arbitrarily complex programs running next to data | SQL and User-Defined Functions running next to data
Poor interactive query support | Good interactive query support
MapReduce and MPP Databases …are complementary!
Use Map/Reduce to clean, normalize, reconcile, and codify data, then load it into an MPP system for interactive analysis
Bulk processing billions of records
Data aggregation and storage
Hadoop Distributed Filesystem
Scales to many petabytes
Splits all files into blocks and spreads them across data nodes
The name node keeps track of what blocks belong to what file
All blocks written in triplicate
Write and append only – no random updates!
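The block-splitting and triple-replication scheme can be sketched as follows. This is a toy round-robin placement, not real HDFS behavior (actual placement is rack-aware and pluggable), and all names are hypothetical:

```python
def place_blocks(file_size, block_size, data_nodes, replicas=3):
    # Split a file into fixed-size blocks and assign each block to
    # `replicas` distinct data nodes; the name node would record this map.
    n_blocks = -(-file_size // block_size)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [data_nodes[(b + i) % len(data_nodes)]
                        for i in range(replicas)]
    return placement

# A 200 MB file with 64 MB blocks needs 4 blocks, each stored in triplicate.
layout = place_blocks(200, 64, ["dn1", "dn2", "dn3", "dn4"])
```

Because every block lives on three machines, the loss of any single data node loses no data, and mappers can be scheduled next to any replica.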
HDFS Writes
[Diagram: the client asks the name node which data node to write to, writes each block to a data node, and that node replicates the block to two others.]
HDFS Reads
[Diagram: the client asks the name node for block locations, then reads the blocks directly from the data nodes.]
HDFS Shortcomings
No random reads
No random writes
Doesn’t deal well with many small files
Enter HBase
“Random Access To Your Planet-Size Data”
HBase
Emulates random I/O with a Write Ahead Log (WAL)
Periodically flushes the log to sorted files
Files accessible as tables, split across many regions, hosted by region servers
Preserves scalability, data locality, and Map/Reduce features of Hadoop
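The log-plus-flush trick is how HBase gets random reads and writes out of an append-only filesystem. The toy class below is a hypothetical sketch of that idea, not the HBase API: writes go to an in-memory map (HBase's memstore, durably backed by the WAL), which is flushed to immutable sorted files (HFiles) when it grows; reads check the memory layer first, then the sorted files, newest first:

```python
import bisect

class TinyWalStore:
    def __init__(self, flush_at=3):
        self.log = {}              # in-memory writes (memstore + WAL stand-in)
        self.sorted_files = []     # flushed, immutable sorted (key, value) lists
        self.flush_at = flush_at

    def put(self, key, value):
        self.log[key] = value
        if len(self.log) >= self.flush_at:
            # Flush: write the log out as a sorted, append-only file.
            self.sorted_files.append(sorted(self.log.items()))
            self.log = {}

    def get(self, key):
        if key in self.log:                      # freshest data is in memory
            return self.log[key]
        for f in reversed(self.sorted_files):    # then newest flushed file wins
            i = bisect.bisect_left(f, (key,))
            if i < len(f) and f[i][0] == key:
                return f[i][1]
        return None

store = TinyWalStore(flush_at=2)
store.put("a", 1)
store.put("b", 2)   # triggers a flush to a sorted file
store.put("a", 3)   # newer value stays in the in-memory log
```

Real HBase adds compaction of the sorted files, region splitting, and durability guarantees, but the read/write path follows this shape.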
Use HBase when:
You have noisy, semi-structured data
You want to apply massively parallel processing to your problem
To handle huge write loads
As a scalable key/value store
But there are drawbacks:
Limited schema support
Limited atomicity guarantees
No built-in secondary indexes
HBase is a great tool for many jobs, but not every job
The data store should align with the needs of the application
So a pattern is emerging:
Hadoop with HBase
[Diagram: source feeds (Millennium, CCDs, Claims, HL7) flow into Hadoop with HBase for collection and aggregation; MapReduce jobs process the data and load it into MPP, relational, and document stores, plus HBase, for storage.]
But we have a potential bottleneck
Direct inserts are designed for online updates, not massively parallel data loads
So shift the work into MapReduce, and pre-build files for bulk import
Oracle Loader for Hadoop
HBase HFile import
Bulk loads for MPP
And we’re missing an important piece:
[Diagram: the same pipeline, with a Realtime Processing path added alongside the batch Map/Reduce jobs.]
How do we make it fast?
Speed Layer
Batch Layer
http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
Speed Layer | Batch Layer
Low latency (seconds to process) | High latency (minutes or hours to process)
Move data to computation | Move computation to data
Hours of data | Years of data
Incremental updates | Bulk loads
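In this lambda-style architecture, a query merges the two layers: the batch layer's view covers years of data but lags by hours, and the speed layer's view covers only what has arrived since the last batch run. A minimal sketch with hypothetical view names:

```python
def query(batch_view, speed_view, key):
    # Merge the precomputed batch view with the incremental speed-layer view;
    # together they cover all of history up to now.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"pageviews:/home": 1_000_000}   # rebuilt in bulk, e.g. nightly
speed_view = {"pageviews:/home": 42}          # incremental recent updates
total = query(batch_view, speed_view, "pageviews:/home")
```

When the next batch run completes, it absorbs the recent data and the speed layer's view is discarded and restarted, which keeps the low-latency path small and simple.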
Speed Layer | Batch Layer
Storm, Complex Event Processing | MapReduce (Hadoop)
And now, the challenge…
Process all data overnight
Quickly create new data models
Simple correction of any bugs
Fast iteration cycles mean fast innovation
Much easier to understand and work with
Questions?