Real time data processing with Spark & Cassandra @ NoSQL matters 2015 Paris
TRANSCRIPT
@doanduyhai
Real time data processing with Spark & Cassandra
DuyHai DOAN, Technical Advocate
Who am I?
Duy Hai DOAN, Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• OSS Cassandra point of contact
[email protected] @doanduyhai
Datastax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 200+ employees
• Headquartered in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
Spark & Cassandra Integration
• Spark & its eco-system
• Cassandra & token ranges
• Stand-alone cluster deployment
What is Apache Spark?
• Apache project since 2010
• General data processing framework
• MapReduce is not the A & Ω
• One-framework-many-components approach
Spark characteristics
Fast
• 10x-100x faster than Hadoop MapReduce
• In-memory storage
• Single JVM process per node, multi-threaded
Easy
• Rich Scala, Java and Python APIs (R is coming …)
• 2x-5x less code
• Interactive shell
Spark code example
Setup

  val conf = new SparkConf(true)
    .setAppName("basic_example")
    .setMaster("local[3]")

  val sc = new SparkContext(conf)

Data-set (can be from text, CSV, JSON, Cassandra, HDFS, …)

  val people = List(("jdoe", "John DOE", 33),
                    ("hsue", "Helen SUE", 24),
                    ("rsmith", "Richard Smith", 33))
RDDs
RDD = Resilient Distributed Dataset

  val parallelPeople: RDD[(String, String, Int)] = sc.parallelize(people)

  val extractAge: RDD[(Int, (String, String, Int))] = parallelPeople
    .map(tuple => (tuple._3, tuple))

  val groupByAge: RDD[(Int, Iterable[(String, String, Int)])] = extractAge.groupByKey()

  val countByAge: Map[Int, Long] = groupByAge.countByKey()
RDDs
RDD[A] = distributed collection of A
• RDD[Person]
• RDD[(String, Int)], …

RDD[A] is split into partitions; partitions are distributed over n workers → parallel computing
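The RDD pipeline above can be mimicked with plain Scala collections (no Spark required) to see what each step computes. This simulation is an addition to the transcript, not part of the deck: `groupBy` stands in for `groupByKey` and a size count stands in for `countByKey`.

```scala
// Plain-Scala simulation of the RDD pipeline above (local collections stand in for RDDs)
val people = List(("jdoe", "John DOE", 33),
                  ("hsue", "Helen SUE", 24),
                  ("rsmith", "Richard Smith", 33))

// map(tuple => (tuple._3, tuple)): key each person by age
val extractAge = people.map(tuple => (tuple._3, tuple))

// groupByKey(): gather all persons sharing the same age
val groupByAge: Map[Int, List[(String, String, Int)]] =
  extractAge.groupBy(_._1).map { case (age, pairs) => (age, pairs.map(_._2)) }

// countByKey(): how many persons per age
val countByAge: Map[Int, Long] =
  groupByAge.map { case (age, persons) => (age, persons.size.toLong) }
// countByAge(33) == 2, countByAge(24) == 1
```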
Spark eco-system
[Stack diagram]
• Libraries: Spark Streaming, MLlib, GraphX, Spark SQL, …
• Spark Core Engine (Scala/Java/Python)
• Cluster managers: Local, Standalone cluster, YARN, Mesos
• Persistence layer
What is Apache Cassandra?
• Apache project since 2009
• Distributed NoSQL database
• Eventual consistency (A & P of the CAP theorem)
• Distributed table abstraction
Cassandra data distribution reminder
• Random distribution: token = hash(#partition)
• Hash space: ]-X, X], where X is a huge number (2^64/2 = 2^63)
[Ring diagram: 8 nodes, n1 … n8]
Cassandra token ranges (Murmur3 hash function)
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]
[Ring diagram: ranges A … H mapped one-to-one onto nodes n1 … n8]
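As a sketch (not from the deck), the ring's bookkeeping can be reproduced in a few lines of plain Scala. The eight ranges ]i·X/8, (i+1)·X/8] and the `ownerOf` lookup are illustrative assumptions; real Cassandra uses the full Murmur3 space and virtual nodes.

```scala
// Sketch: the 8 token ranges A…H over ]0, X], one per node (illustrative only)
val X = BigInt(2).pow(63)

// Range i covers ]X*i/8, X*(i+1)/8]
val ranges: Seq[(BigInt, BigInt)] =
  (0 until 8).map(i => (X * i / 8, X * (i + 1) / 8))

// Which node owns a given token? (lower bound exclusive, upper bound inclusive)
def ownerOf(token: BigInt): Int =
  ranges.indexWhere { case (lo, hi) => token > lo && token <= hi }
```

For example, token 1 falls into range A (index 0) and the maximum token X falls into range H (index 7).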
Linear scalability
[Ring diagram: partition keys user_id1 … user_id5 hashed onto token ranges A … H, spread across nodes n1 … n8]
Cassandra Query Language (CQL)

  INSERT INTO users(login, name, age) VALUES('jdoe', 'John DOE', 33);

  UPDATE users SET age = 34 WHERE login = 'jdoe';

  DELETE age FROM users WHERE login = 'jdoe';

  SELECT age FROM users WHERE login = 'jdoe';
Why Spark on Cassandra?
For Spark:
• Reliable persistent store (HA)
• Structured data (Cassandra CQL → DataFrame API)
• Multi data-center!
For Cassandra:
• Cross-table operations (JOIN, UNION, etc.)
• Real-time/batch processing
• Complex analytics (e.g. machine learning)
Use cases
• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion
Cluster deployment (stand-alone cluster)
[Diagram: five Cassandra (C*) nodes; one node also runs the Spark Master (SparkM), and every node runs a Spark Worker (SparkW)]
Cluster deployment
Cassandra – Spark placement: 1 Cassandra process ⟷ 1 Spark Worker
[Diagram: the Driver Program talks to the Spark Master; each Spark Worker hosts an Executor co-located with its C* node]
Spark & Cassandra Connector
• Core API
• Spark SQL
• Spark Streaming
Connector architecture
• All Cassandra types supported and converted to Scala types
• Server-side data filtering (SELECT … WHERE …)
• Uses the Java driver underneath
• Scala and Java support
Connector architecture – Core API
Cassandra tables exposed as Spark RDDs

Read from and write to Cassandra

Mapping of C* tables and rows to Scala objects
• CassandraRow
• Scala case class (object mapper)
• Scala tuples
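To illustrate what the object-mapper option does, here is a plain-Scala stand-in (no Spark or connector involved): `WordCount`, `fromRow`, and the `Map`-based row are made-up names for this sketch, with the `Map` playing the role of a CassandraRow.

```scala
// Sketch only: binding row columns to case class fields, the way the
// connector's object mapper does for a real CassandraRow.
case class WordCount(word: String, count: Int)

def fromRow(row: Map[String, Any]): WordCount =
  WordCount(
    word  = row("word").asInstanceOf[String],  // column "word"  -> field word
    count = row("count").asInstanceOf[Int]     // column "count" -> field count
  )

val wc = fromRow(Map("word" -> "bar", "count" -> 30))
// wc == WordCount("bar", 30)
```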
Connector architecture – Spark SQL
Mapping of a Cassandra table to a SchemaRDD
• CassandraSQLRow → SparkRow
• custom query plan
• push predicates down to CQL for early filtering

  SELECT * FROM user_emails WHERE login = 'jdoe';
Connector architecture – Spark Streaming
Streaming data INTO a Cassandra table
• trivial setup
• be careful about your Cassandra data model!
Streaming data OUT of Cassandra tables?
• work in progress …
Connector API
• Connector API
• Data locality implementation
Connector API
Connecting to Cassandra

  // Import Cassandra-specific functions on SparkContext and RDD objects
  import com.datastax.driver.spark._

  // Spark connection options
  val conf = new SparkConf(true)
    .setMaster("spark://192.168.123.10:7077")
    .setAppName("cassandra.demo")
    .set("cassandra.connection.host", "192.168.123.10") // initial contact
    .set("cassandra.username", "cassandra")
    .set("cassandra.password", "cassandra")

  val sc = new SparkContext(conf)
Connector API
Preparing test data

  CREATE TABLE test.words (word text PRIMARY KEY, count int);

  INSERT INTO test.words (word, count) VALUES ('bar', 30);
  INSERT INTO test.words (word, count) VALUES ('foo', 20);
Connector API
Reading from Cassandra

  // Use table as RDD
  val rdd = sc.cassandraTable("test", "words")
  // rdd: CassandraRDD[CassandraRow] = CassandraRDD[0]

  rdd.toArray.foreach(println)
  // CassandraRow[word: bar, count: 30]
  // CassandraRow[word: foo, count: 20]

  rdd.columnNames  // Stream(word, count)
  rdd.size         // 2

  val firstRow = rdd.first
  // firstRow: CassandraRow = CassandraRow[word: bar, count: 30]

  firstRow.getInt("count")  // Int = 30
Connector API
Writing data to Cassandra

  val newRdd = sc.parallelize(Seq(("cat", 40), ("fox", 50)))
  // newRdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[2]

  newRdd.saveToCassandra("test", "words", Seq("word", "count"))

  SELECT * FROM test.words;

   word | count
  ------+-------
    bar |    30
    foo |    20
    cat |    40
    fox |    50
Remember token ranges?
A: ]0, X/8]
B: ]X/8, 2X/8]
C: ]2X/8, 3X/8]
D: ]3X/8, 4X/8]
E: ]4X/8, 5X/8]
F: ]5X/8, 6X/8]
G: ]6X/8, 7X/8]
H: ]7X/8, X]
[Ring diagram: ranges A … H on nodes n1 … n8]
Data locality
[Diagram: each Spark RDD partition is built from the Cassandra token ranges owned by the local node (C* + SparkW on every node, SparkM on one)]
Data locality
Use the Murmur3Partitioner
[Same deployment diagram: C* + SparkW on every node, SparkM on one]
Read data locality
Reads from Cassandra are node-local; locality is then lost on Spark shuffle operations
Repartition before write
Write to Cassandra

  rdd.repartitionByCassandraReplica("keyspace", "table")
Or async batch writes
Async batches fan out writes to Cassandra (instead of Spark shuffle operations)
Write data locality
• either stream data with Spark using repartitionByCassandraReplica()
• or flush data to Cassandra in async batches
• in any case, there will be data movement on the network (sorry, no magic)
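What repartitionByCassandraReplica() buys can be sketched in plain Scala (no Spark, no connector). Everything here is a stand-in: `hashCode` replaces Murmur3, `nodeFor` is a made-up helper, and an 8-node cluster is assumed.

```scala
// Sketch only: group rows by the node owning their partition key, the way
// repartitionByCassandraReplica conceptually does before a write.
val rows = Seq(("bar", 30), ("foo", 20), ("cat", 40), ("fox", 50))

// Stand-in for Murmur3 token -> node assignment (8 nodes assumed)
def nodeFor(partitionKey: String, nodes: Int = 8): Int =
  ((partitionKey.hashCode % nodes) + nodes) % nodes

// After this "repartitioning", each group can be written to its node
// without further cross-node traffic at write time.
val byNode: Map[Int, Seq[(String, Int)]] =
  rows.groupBy { case (word, _) => nodeFor(word) }
```

The data movement happens once, up front, when rows are shuffled into their groups; the writes themselves then stay local.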
Joins with data locality

  CREATE TABLE artists(name text, style text, …, PRIMARY KEY(name));

  CREATE TABLE albums(title text, artist text, year int, …, PRIMARY KEY(title));

  val join: CassandraJoinRDD[(String, Int), (String, String)] =
    sc.cassandraTable[(String, Int)](KEYSPACE, ALBUMS)
      // Select only useful columns for join and processing
      .select("artist", "year")
      .as((_: String, _: Int))
      // Repartition RDDs by the "artists" partition key, which is "name"
      .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
      // Join with the "artists" table, selecting only "name" and "country" columns
      .joinWithCassandraTable[(String, String)](KEYSPACE, ARTISTS, SomeColumns("name", "country"))
      .on(SomeColumns("name"))
Joins pipeline with data locality

  val join: CassandraJoinRDD[(String, Int), (String, String)] =
    sc.cassandraTable[(String, Int)](KEYSPACE, ALBUMS)
      // Select only useful columns for join and processing
      .select("artist", "year")
      .as((_: String, _: Int))
      .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
      .joinWithCassandraTable[(String, String)](KEYSPACE, ARTISTS, SomeColumns("name", "country"))
      .on(SomeColumns("name"))
      .map(…)
      .filter(…)
      .groupByKey()
      .mapValues(…)
      .repartitionByCassandraReplica(KEYSPACE, ARTISTS_RATINGS)
      .joinWithCassandraTable(KEYSPACE, ARTISTS_RATINGS)
      …
Perfect data locality scenario
• read locally from Cassandra
• use operations that do not require a shuffle in Spark (map, filter, …)
• repartitionByCassandraReplica() to a table having the same partition key as the original table
• save back into this Cassandra table
Demo
https://github.com/doanduyhai/Cassandra-Spark-Demo
What's for the future?
Datastax Enterprise 4.7
• Cassandra + Spark + Solr as your analytics platform
• filter out as much data as possible with Solr from within Cassandra
• fetch the filtered data into Spark and perform aggregations
• save the final data back into Cassandra
What's for the future?
What about data locality?
  val join: CassandraJoinRDD[(String, Int), (String, String)] =
    sc.cassandraTable[(String, Int)](KEYSPACE, ALBUMS)
      // Select only useful columns for join and processing
      .select("artist", "year")
      .where("solr_query = 'style:*rock* AND ratings:[3 TO *]'")
      .as((_: String, _: Int))
      .repartitionByCassandraReplica(KEYSPACE, ARTISTS)
      .joinWithCassandraTable[(String, String)](KEYSPACE, ARTISTS, SomeColumns("name", "country"))
      .on(SomeColumns("name"))
      .where("solr_query = 'age:[20 TO 30]'")
What's for the future?
1. compute Spark partitions using Cassandra token ranges
2. on each partition, use Solr for local data filtering (no fan out!)
3. fetch the data back into Spark for aggregations
4. repeat 1 – 3 as many times as necessary
What's for the future?

  SELECT … FROM …
  WHERE token(#partition) > 3X/8
    AND token(#partition) <= 4X/8     -- token range D: ]3X/8, 4X/8]
    AND solr_query = 'full text search expression';

Advantages of the same-JVM Cassandra + Solr integration:
1. single-pass local full text search (no fan out)
2. data retrieval
Q & A