cassandra and storm at health market sceince

Post on 06-May-2015

4.930 Views

Category:

Technology

3 Downloads

Preview:

Click to see full reader

TRANSCRIPT

STORM AND CASSANDRA AT HEALTH MARKET

SCIENCE

P. Taylor GoetzDevelopment Lead, Health Market Sciencetgoetz@healthmarketscience.com@ptgoetz

Cassandra @ HMS Storm Overview Storm-Cassandra Examples / Demo Future : Trident

Agenda

Our Products

Master Data ManagementGood, bad doctors?

Prescriber eligibility and remediation.

Cassandra to the Rescue

1000’s of Feeds

Δt

C* Masterfile

Big Data for us == Variety of Data

But…

Search unstructured data Real-time Analytics / Reporting Transactional Processing

Changes reflected immediately.Wide-row Indexes

What might that look like?

C*

RDBMS

I’m happy

Dro

pwiz

ard

wid

e-ro

w in

dex

Provide for Polyglot Persistence

What we did wrong…

Could not react to transactional changes Needed extra logic to track what changed Took too long

C* at HMS

Load, Standardize, Match, ConsolidateWrite results to C*

Track Changes over TimePractitioner DataFeed Quality

C* at HMS

Best PracticesPrefer Write over Read (esp. Read before Write)Avoid Queries/Scans

○ Fetch by key whenever possible○ Put Comparators to work○ Pre-compute whenever possible

How?

Treat All Data as ImmutableUpdates are inserts with new version/timestamp

Data ModelHeavy use of compositesTimestamps/Versions in Keys

Treat Feeds as Real-Time Streams

What Storm is to us…

CrudOp

A High Throughput Data Processing Pipeline

RDBMSSoR

Dimensional Counts

ETL Enrichment

Fuzzy Index

Enter Storm

Storm Overview

Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable Guaranteed Processing Operates on one or more streams of data

Anatomy of a Storm Cluster

NimbusMaster Node

ZookeeperCluster Coordination

SupervisorsWorker Nodes

Storm Primatives

StreamsUnbounded sequence of tuples

SpoutsStream Sources

BoltsUnit of Computation

TopologiesCombination of n Spouts and n BoltsDefines the overall “Computation”

Storm Spouts

Represents a source (stream) of dataQueues (JMS, Kafka, Kestrel, etc.)Twitter FirehoseSensor Data

Emits “Tuples” (Events) based on sourcePrimary Storm data structureSet of Key-Value pairs

Storm Bolts

Receive Tuples from Spouts or other Bolts Operate on, or React to Data

Functions/Filters/Joins/AggregationsDatabase writes/lookups

Optionally emit additional Tuples

Storm Topologies

Data flow between spouts and bolts Routing of Tuples between spouts/bolts

Stream “Groupings” Parallelism of Components Long-Lived

Storm Topologies

Storm and Cassandra

Use Cases:Write Storm Tuple data to C*

○ Computation Results○ Pre-computed indices

Read data from C* and emit Storm Tuples○ Dynamic Lookups

http://github.com/hmsonline/storm-cassandra

Storm Cassandra Bolt Types

CassandraBolt

C*CassandraLookupBolt

STORM

CassandraBoltWrites data to CassandraAvailable in Batching and Non-Batching

CassandraLookupBoltReads data from Cassandra

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project

Provides generic Bolts for writing/reading Storm Tuples to/from C*

TupleTuple

Mapper Rows

C*TuplesColumnsMapper Columns

STORM

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project

TupleMapper InterfaceTells the CassandraBolt how to write a tuple to an

arbitrary data model

Given a Storm Tuple:Map to Column FamilyMap to Row KeyMap to Columns

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project

ColumnsMapper InterfaceTells the CassandraLookupBolt how to transform a

C* row into a Storm Tuple

Given a C* Row Key and list of Columns:Return a list of Storm Tuples

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project Current State:

Version 0.4.0Uses Astyanax ClientSeveral out-of-the-box *Mapper Implementations:

○ Basic Key-Value Columns○ Value-less Columns○ Counter Columns○ Lookup by row key○ Lookup by range query

Composite Key/Column SupportTrident support

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project

Future Plans:Switch to CQLEnhanced Trident Support

http://github.com/hmsonline/storm-cassandra

Word Count Demo

http://github.com/hmsonline/storm-cassandra

DRPC

Reach Demo

Next Level : Trident

Trident

Provides a higher-level abstraction for stream processingConstructs for state management and Batching

Adds additional primitives that abstract away common topological patterns

Deprecates transactional topologies Distributes with Storm

Sample Trident Operations Partition Local

Functions ( execute(x) x + y )Filters ( isKeep(x) 0,x )PartitionAggregate

○ Combiner ( pairwise combining )○ Reducer ( iterative accumulation )○ Aggregator ( byoa )

A sample topologyTridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"),

new Split(), new Fields("word"))

.groupBy(new Fields("word")) .persistentAggregate(

MemcachedState.opaque(serverLocations), new Count(), new Fields("count"))

.parallelismHint(6);

https://github.com/nathanmarz/storm/wiki/Trident-state

Trident StateSequenced writes by batch/transaction id. Spouts

Transactional○ Batch contents never change

Opaque○ Batch contents can change

StateTransactional

○ Store tx_id with counts to maintain sequencing of writes.Opaque

○ Store previous value in order to overwrite the current value when contents of a batch change.

Shameless Shoutouts

HMS (https://github.com/hmsonline/)storm-cassandrastorm-elastic-searchstorm-jdbi (coming soon)

ptgoetz (https://github.com/ptgoetz) storm-jmsstorm-signals

P. Taylor GoetzDevelopment Lead, Health Market Sciencetgoetz@healthmarketscience.com@ptgoetz

THANKS!

top related