BECOMING FRIENDS WITH CASSANDRA & SPARK
DANI TRAPHAGEN & JON HADDAD
YOU
SPARK
C*
HOUSEKEEPING
RAISE YOUR HAND IF YOU DON’T HAVE THE VM OSCON2016.ZIP
VM INSTRUCTIONS
1. copy the VM files to a place of your choosing
2. open the packer .ovf in VirtualBox
3. import the .ovf as prompted
4. check out the VM
LET’S GET STARTED
WHAT ARE WE GOING TO COVER?
1. CASSANDRA ARCHITECTURE, CQL, DATA MODELING
2. SPARK DATAFRAMES
RDBMS & YOU
SQLITE, PYTHON SCRIPTS, LOG FILES
SUCH AS?
SMALL DATA
MOST WEB SITES
RDBMS
MEDIUM DATA
CAN RDBMS WORK FOR BIG DATA?
YOU BIG DATA
VERTICAL SCALE
STARTING MY BUSINESS
YAY!
OH, WHOA, THINGS ARE KICKING UP
ACID IS A LIE
ATOMICITY
CONSISTENCY
ISOLATION
DURABILITY
ASYNC REPLICATION != CONSISTENCY
CLIENT, MASTER, SLAVE
REPLICATION LAG
CONSISTENT?
IDK?
LOL NO!
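To make the replication-lag point concrete, here is a minimal toy sketch in plain Python. It is not how any real database ships writes; the class and method names are made up for illustration. It only shows that a read from the slave can miss a write the master has already accepted.

```python
class AsyncReplicatedStore:
    """Toy master/slave pair with asynchronous replication (illustrative only)."""

    def __init__(self):
        self.master = {}
        self.slave = {}
        self.pending = []  # writes accepted by the master, not yet shipped

    def write(self, key, value):
        # The client gets its answer as soon as the master has the data.
        self.master[key] = value
        self.pending.append((key, value))

    def read_from_slave(self, key):
        return self.slave.get(key)

    def replicate(self):
        # Stands in for the replication stream catching up after some lag.
        for key, value in self.pending:
            self.slave[key] = value
        self.pending.clear()


store = AsyncReplicatedStore()
store.write("favorite_movie", "Top Gun")
stale = store.read_from_slave("favorite_movie")  # slave has not caught up yet
store.replicate()
fresh = store.read_from_slave("favorite_movie")  # visible once the lag has passed
```

Between `write` and `replicate`, the two servers disagree, which is the window the slide is poking fun at.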
THIRD NORMAL FORM DOESN’T SCALE
▸ UNPREDICTABLE
▸ DATA > MEMORY?
▸ DISK SEEKS ALL DAY
▸ USERS = ANGRY
AWFUL
SHARDING
CLIENT
NIGHTMARE
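One reason client-side sharding turns into a nightmare can be sketched in a few lines. This assumes the simplest possible scheme, hash the key and take it modulo the number of shards; the key names and shard counts are made up for illustration.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


keys = ["user-%d" % i for i in range(1000)]

# Where each key lives with 4 shards, and where it lives after adding a 5th.
before = {key: shard_for(key, 4) for key in keys}
after = {key: shard_for(key, 5) for key in keys}

moved = sum(1 for key in keys if before[key] != after[key])
moved_fraction = moved / len(keys)
```

With modulo sharding, adding one shard relocates most keys, so growing the cluster means a large data migration that the application has to manage itself.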
AVAILABILITY?
NOT WITH THESE KNUCKLEHEADS
CONCLUSION: SCALING IS HARD
FRIEND #1: CASSANDRA
ARCHITECTURE
ARCHITECTURE
PEER TO PEER
▸ With Cassandra there is no master-slave hierarchy
▸ Every node is the captain of its own ship
▸ Processes within Cassandra make this possible:
  ▸ Replication
  ▸ Consistency level
NODE1
NODE2
NODE3
NODE4
WHAT DOES THIS GET US?
LINEAR SCALABILITY
HIGH AVAILABILITY
TOPOLOGY
CLIENT
OPERATION
ARCHITECTURE
REPLICATION IS HOW CASSANDRA DISTRIBUTES DATA
▸ Replication factor is the number of replicas/puppies
NODE1
NODE2
NODE3
NODE4
ARCHITECTURE
HOW DO WE ACKNOWLEDGE REPLICATION?
▸ The coordinator talks to the client, sending an ack for the write
NODE1
NODE2
COORDINATOR
NODE3
NODE4
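The coordinator's job can be sketched as a toy in plain Python. This is a simplification under stated assumptions: real Cassandra sends writes to replicas concurrently and has more machinery around failures, and the `Node` class and `coordinate_write` function here are hypothetical names, not Cassandra APIs.

```python
class Node:
    """Toy replica: stores data and acks a write only if it is up."""

    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.data = {}

    def write(self, key, value):
        if not self.up:
            return False  # no ack from a node that is down
        self.data[key] = value
        return True  # ack


def coordinate_write(replicas, key, value, required_acks):
    """Forward the write to every replica; ack the client if enough replicas acked."""
    acks = sum(1 for node in replicas if node.write(key, value))
    return acks >= required_acks


replicas = [Node("NODE1"), Node("NODE2"), Node("NODE3", up=False)]
acked_with_one = coordinate_write(replicas, "puppy", "corgi", required_acks=1)
acked_with_all = coordinate_write(replicas, "puppy", "corgi", required_acks=3)
```

With NODE3 down, the client still gets its ack when one ack is enough, and does not when every replica must answer.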
ARCHITECTURE
TUNABLE CONSISTENCY LEVELS
NODE1
NODE2
NODE3
NODE4
▸ One
▸ Quorum
▸ All
ARCHITECTURE
ONE
▸ One replica acks adorable puppy data
NODE1
NODE2
NODE3
NODE4
ARCHITECTURE
ALL
▸ All replicas ack adorable puppy data
NODE1
NODE2
NODE3
NODE4
ARCHITECTURE
QUORUM
▸ Quorum = (sum_of_replication_factors / 2) + 1, rounded down to a whole number
▸ How many nodes get puppies if our replication factor is 3, and we want quorum?
NODE1
NODE2
NODE3
NODE4
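The quorum formula above is easy to check by hand or in a couple of lines of Python. The function name is made up for illustration; integer division does the rounding down.

```python
def quorum(sum_of_replication_factors):
    """Quorum = (sum_of_replication_factors / 2) + 1, rounded down."""
    return sum_of_replication_factors // 2 + 1


# The question from the slide: replication factor 3, consistency level quorum.
replicas_needed = quorum(3)

# How many replicas can be down while a quorum request still succeeds.
replicas_that_can_fail = 3 - replicas_needed
```

With a replication factor of 3, two replicas must ack, so one replica can be unavailable without failing the request.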
MULTI-DC PARAMETERS
▸ Quorum vs. Local_Quorum
▸ One vs. Local_One
US-EAST US-WEST
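The difference between the global and local levels comes down to which replicas count. A small sketch, assuming a hypothetical cluster with a replication factor of 3 in each of the two data centers (the slide does not state the factors), and with a made-up helper name:

```python
def quorum(replica_count):
    return replica_count // 2 + 1


def required_acks(level, replication, local_dc):
    """Number of replica acks needed for a consistency level (toy model)."""
    if level in ("ONE", "LOCAL_ONE"):
        return 1
    if level == "QUORUM":
        # Counted across every data center.
        return quorum(sum(replication.values()))
    if level == "LOCAL_QUORUM":
        # Counted only among replicas in the coordinator's data center.
        return quorum(replication[local_dc])
    raise ValueError("unknown consistency level: %s" % level)


# Hypothetical replication factors per data center.
replication = {"us-east": 3, "us-west": 3}

global_quorum = required_acks("QUORUM", replication, "us-east")
local_quorum = required_acks("LOCAL_QUORUM", replication, "us-east")
```

ONE and LOCAL_ONE both need a single ack; the difference is that LOCAL_ONE only accepts it from a replica in the local data center, which keeps the request off the cross-country link.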
PARTITIONER
CONSISTENT HASHING
Just how is data actually distributed around the cluster?
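A minimal sketch of a consistent hashing ring, with simplifying assumptions: one token per node, and MD5 standing in for the hash (Cassandra's default partitioner uses Murmur3, and real clusters usually give each node many tokens). The `Ring` class is hypothetical, written only for this illustration.

```python
import bisect
import hashlib


def token_for(value):
    """Map any string onto the ring as a large integer token."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Each node sits on the ring at the token derived from its name.
        self._ring = sorted((token_for(node), node) for node in nodes)
        self._tokens = [token for token, _ in self._ring]

    def owner(self, partition_key):
        # Walk clockwise to the first node token >= the key's token,
        # wrapping around to the start of the ring if needed.
        index = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._ring[index % len(self._ring)][1]


keys = ["movie-%d" % i for i in range(1000)]
four_nodes = Ring(["NODE1", "NODE2", "NODE3", "NODE4"])
five_nodes = Ring(["NODE1", "NODE2", "NODE3", "NODE4", "NODE5"])

# Keys whose owner changes when NODE5 joins the cluster.
moved = [k for k in keys if four_nodes.owner(k) != five_nodes.owner(k)]
```

Every key in `moved` ends up on NODE5: a new node only takes over part of one neighbor's range, and nothing else is reshuffled.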
CASSANDRA DATA MODELING SOUNDS HARD
NOT REALLY
GAIN QUERY POWERS WITH CQL
DATA STRUCTURES IN CASSANDRA
KEYSPACE
TABLE
PARTITIONS
ROWS
PRIMARY KEY = PARTITION KEY + CLUSTERING COLUMNS
PARTITION KEY
THIS IS HOW YOU RETRIEVE A PARTITION
CLUSTERING COLUMNS
THIS IS HOW YOU GET SORTING, ORDER AND UNIQUE IDENTIFICATION
WHY ARE CLUSTERING COLUMNS SO COOL?
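One way to picture the two halves of the primary key is a dictionary of partitions, each holding rows keyed by the clustering column. This is a mental model in plain Python, not how Cassandra stores data on disk; the `Table` class and the string ids are made up for illustration.

```python
from collections import defaultdict


class Table:
    """Toy table: partition key -> {clustering key: value}."""

    def __init__(self):
        self.partitions = defaultdict(dict)

    def insert(self, partition_key, clustering_key, value):
        # The full primary key identifies a row, so writing the same
        # primary key again overwrites the earlier value.
        self.partitions[partition_key][clustering_key] = value

    def select(self, partition_key):
        # One lookup finds the partition; rows come back in clustering order.
        partition = self.partitions.get(partition_key, {})
        return [(key, partition[key]) for key in sorted(partition)]


# Modeled on a table keyed by (movie_id, user_id).
ratings_by_movie = Table()
ratings_by_movie.insert("top-gun", "user-2", 4)
ratings_by_movie.insert("top-gun", "user-1", 5)
ratings_by_movie.insert("top-gun", "user-2", 3)  # same primary key: overwrite

rows = ratings_by_movie.select("top-gun")
```

The partition key decides where to look, and the clustering column gives both the sort order inside the partition and the uniqueness of each row.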
HOW DO I USE CQL?
CQLSH
HOW DO I USE CQL?
SOME EXAMPLES FROM A MOVIE DB
CREATE A KEYSPACE
CREATE KEYSPACE movielens_small
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
CREATE A TABLE
CREATE TABLE movies (
  id uuid PRIMARY KEY,
  avg_rating float,
  genres set<text>,
  name text,
  release_date date,
  url text,
  video_release_date date
);
PRIMARY KEY: id
CREATE A TABLE
CREATE TABLE ratings_by_movie (
  movie_id uuid,
  user_id uuid,
  rating int,
  ts int,
  PRIMARY KEY (movie_id, user_id)
);
PRIMARY KEY: (movie_id, user_id), WITH movie_id AS THE PARTITION KEY AND user_id AS THE CLUSTERING COLUMN
INSERT STATEMENT EXAMPLE
INSERT INTO movies (id, name, genres)
  VALUES (976de5da-93ae-4bf0-b127-d19eea1c8ea4, 'My Awesome Movie (2016)', {'Comedy'});
THIS ALL LOOKS TOO FAMILIAR, DOESN’T IT?
BUT REMEMBER…
THIRD NORMAL FORM DOESN’T SCALE
▸ UNPREDICTABLE
▸ DATA > MEMORY?
▸ DISK SEEKS ALL DAY
▸ USERS = ANGRY
AWFUL
DATA MODELING PRO TIPS
▸ no joins
▸ query driven methodology, instead
▸ denormalize
▸ disks are cheap
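The tips above can be sketched with plain dictionaries standing in for tables. The idea is one table per query, written to at the same time. `ratings_by_user` is a hypothetical second table invented for this illustration, as are the ids.

```python
# One "table" per query we want to answer, instead of one normalized table.
ratings_by_movie = {}  # query: all ratings for a movie
ratings_by_user = {}   # query: all ratings made by a user (hypothetical table)


def record_rating(movie_id, user_id, rating):
    """Write the same fact twice, once for each query pattern."""
    ratings_by_movie.setdefault(movie_id, {})[user_id] = rating
    ratings_by_user.setdefault(user_id, {})[movie_id] = rating


record_rating("top-gun", "dani", 5)
record_rating("top-gun", "jon", 4)
record_rating("fifth-element", "jon", 5)

# Each question is answered by a single lookup, with no join.
top_gun_ratings = ratings_by_movie["top-gun"]
jon_ratings = ratings_by_user["jon"]
```

The cost is storing each rating twice, which is the trade the "disks are cheap" bullet is arguing for.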
JON & DANI, I’M STARTING TO GET COLD FEET!
I MISS THE WARM EMBRACE OF RDBMS
I DIDN’T HAVE TO DENORMALIZE
BACK THEN
CHILL OUT
& PREPARE TO BE WOWED
CDM
ROLL UP YOUR SLEEVES
TYPE STUFF
REMEMBER THAT VM?
1. use movielens_small;
2. desc tables;
3. desc movies;
4. select * from movies limit 10;
TRY IT OUT
YOU SHOULD GET…
YOUR 10 MOVIES
ADDING ON
5. select id, name from movies limit 100;
6. PICK YOUR FAVORITE MOVIE
BONUS: CAN YOU FIND THE AVERAGE
RATINGS FOR YOUR FAVORITE MOVIE?
MOVIE ID LIST
SELECT A MOVIE
TOP GUN EXAMPLE
FIFTH ELEMENT BECAUSE OBVIOUSLY
NICE WORK YOU!
FRIEND #2: SPARK
BATCH PROCESSING
LOTS OF DATA?
STREAMING & REAL TIME AGGREGATION
MACHINE LEARNING FOR THE INEVITABLE END OF TIMES
GRAPH ANALYTICS
2 WAYS OF WORKING
1. RDD
BASED ON FUNCTIONAL PROGRAMMING
blah.map(lambda x: x * 2)
COOL BUT NOT EASY
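For anyone new to the functional style, here is the same idea in plain Python with a list in place of an RDD. This is an analogy only, with no Spark involved; on a real RDD, `map` is lazy and nothing runs until an action such as `collect()` is called.

```python
from functools import reduce

numbers = [1, 2, 3]

# Apply a function to every element, as blah.map(...) would on an RDD.
doubled = list(map(lambda x: x * 2, numbers))

# Combine the elements into one value, similar in spirit to RDD.reduce.
total = reduce(lambda a, b: a + b, doubled)
```

Chains of small functions like these are powerful, but they get hard to read quickly, which is the point of the slide.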
2. DATAFRAMES
PRETTY EASY
SPARK-CASSANDRA-CONNECTOR
TODAY WE TALK BATCH WITH DATAFRAMES AND PYTHON
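As a preview, here is a minimal sketch of reading a Cassandra table into a DataFrame through the spark-cassandra-connector. The helper name is made up, and it assumes you are inside a PySpark session started with the connector on the classpath, where a `sqlContext` (or a `SparkSession`, which also exposes `.read`) already exists.

```python
def load_cassandra_table(sql_context, keyspace, table):
    """Return a DataFrame backed by a Cassandra table (hypothetical helper).

    sql_context is assumed to be a SQLContext or SparkSession from a PySpark
    session that has the spark-cassandra-connector available.
    """
    return (sql_context.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .load())


# Usage inside a PySpark shell, left commented out because it needs a
# running Spark and Cassandra:
#
#   movies = load_cassandra_table(sqlContext, "movielens_small", "movies")
#   movies.select("name", "avg_rating").show(10)
```

Once loaded, the table behaves like any other DataFrame, so filtering, grouping and aggregation work without writing RDD code.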
ROLL UP YOUR SLEEVES
OPEN THE OSCON TUTORIAL ON YOUR DESKTOP
FRIENDSHIP LEVELS
OTHER RESOURCES TO LEARN:
1. free courses - www.academy.datastax.com
2. our blogs - www.rustyrazorblade.com & www.dtrapezoid.com
3. our friend’s blog - https://lostechies.com/ryansvihla/
4. datastax blog - http://www.datastax.com/dev/blog
THANK YOU, MAGICAL HUMANS
@DTRAPEZOID @RUSTYRAZORBLADE