Tsunami alerting with Cassandra - lesfurets.github.io

From 0 to Cassandra on AWS in 30 days
Tsunami alerting with Cassandra

Andrei Arion - Geoffrey Bérard - Charles Herriau
@BeastieFurets

Post on 16-May-2018


1 Context

2 The project, step by step

3 Feedback

4 Conclusion


LesFurets.com

• 1st independent insurance aggregator

• A unique place to compare and buy hundreds of insurance products

• Key figures:

• 2 500 000 quotes/year

• 31% market share on car comparison (January 2015)

• more than 50 insurers


4 Teams | 25 Engineers | Lean Kanban | Daily delivery


Big Data @ Telecom ParisTech

Context

• Master Level course

• Big Data Analysis and Management

• Non relational Databases

• 30 hours

• Lectures: 25 %

• Hands-on: 75 %

http://www.telecom-paristech.fr/formation-continue/masteres-specialises/big-data.html

Context

• 30 students:

• various backgrounds: students and professionals

• various skill levels

• various areas of expertise: DevOps, data engineering, marketing

• SQL and Hadoop exposure


Main topics:


Project goals

• Choose the right technology

• Implement a full-stack solution from scratch

• Optimize the data model for a particular use case

• Deploy to the cloud

• Discuss tradeoffs

One month to deliver

The project, step by step

Let’s all become students

Labor day

Day 1

Discovering the subject


An earthquake occurs in the Sea of Japan. A tsunami is likely to hit the coast.

Notify the entire population within a 500 km radius as soon as possible.

Constraints:

• Team work (3 members)

• Choose a technology seen in the module

• Deploy on AWS (300€ budget)

• The closest data center is lost

• Pre-load the data
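The 500 km requirement comes down to a distance test between the epicenter and each network cell. A minimal sketch, assuming great-circle (haversine) distance on a spherical Earth; the function names and the epicenter are hypothetical, and the cell coordinates are borrowed from sample log lines elsewhere in the deck:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cells_in_range(epicenter, cells, radius_km=500.0):
    """Return the ids of the cells located within radius_km of the epicenter."""
    lat0, lon0 = epicenter
    return [cell_id for cell_id, (lat, lon) in cells.items()
            if haversine_km(lat0, lon0, lat, lon) <= radius_km]


# Hypothetical epicenter in the Sea of Japan; two cells from the sample logs.
cells = {
    "Osa_61": (34.232793, 135.906451),
    "Tok_14": (35.683104, 139.755020),
}
affected = cells_in_range((37.5, 133.0), cells)
print(affected)
```

Only the cells returned here need to be queried for phone numbers, which is why the data models later in the deck are keyed by cell.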

Day 2

Discovering Data

How to get the data?

Day 3

• Generated logs from telecom providers

• One month of position tracking

• 1 MB, 1 GB, 10 GB, 100 GB files

• Available on Amazon S3

Data distribution

Day 4

• 10 biggest cities

• 100 network cells per city

• 1 000 000 distinct phone numbers

Data format

Day 4

Fields: timestamp; cell ID; coordinates (latitude; longitude); phone number

2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924
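A log line in this format can be split on semicolons. The sketch below is illustrative only: the `LogRecord` type and its field names are assumptions, not part of the original project.

```python
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    instant: datetime
    cell: str
    latitude: float
    longitude: float
    phone_number: str


def parse_log_line(line: str) -> LogRecord:
    """Parse one 'timestamp;cell;lat;lon;phone' line."""
    timestamp, cell, lat, lon, phone = line.strip().split(";")
    # The timestamp uses a comma before the milliseconds.
    instant = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S,%f")
    return LogRecord(instant, cell, float(lat), float(lon), phone)


record = parse_log_line(
    "2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924"
)
print(record)
```

The phone number is kept as a string so that leading zeros (as in `012564` in the modeling slides) are not lost.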

Data distribution

Day 5

Full dataset (100 GB):

• 240 000 logs per hour per city

• ~ 2 000 000 000 rows
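The row count follows from the figures above. A quick check, assuming that "one month" means 30 days and that all 10 cities log at the stated rate; the byte estimate simply multiplies the row count by the length of one sample line:

```python
LOGS_PER_HOUR_PER_CITY = 240_000
CITIES = 10
HOURS_PER_DAY = 24
DAYS = 30  # assumption: "one month" is taken as 30 days

total_rows = LOGS_PER_HOUR_PER_CITY * CITIES * HOURS_PER_DAY * DAYS

# Size estimate from one sample line, plus one newline byte per row.
SAMPLE_LINE = "2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924"
estimated_bytes = total_rows * (len(SAMPLE_LINE) + 1)

print(f"{total_rows:,} rows, about {estimated_bytes / 1e9:.0f} GB")
```

Under these assumptions the result is about 1.7 billion rows and roughly 100 GB, consistent with the rounded figures on the slide.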

Sample log lines:

2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924
2015-01-04 17:10:52,928;Yok_85;35.494120;139.424249;121737
2015-01-04 17:10:52,423;Tok_14;35.683104;139.755020;731737
2015-01-04 17:10:53,923;Osa_61;34.232793;135.906451;861343
2015-01-04 17:10:53,153;Kyo_06;34.980933;135.777283;431737
...
2015-01-04 17:10:55,928;Yok_99;35.030989;140.126021;829924


Choosing the stack


Storage Technology short list

• Efficient for logs

• Fault Tolerance

• Geolocation

• Easy to transform logs into documents

Day 6

Install a local Cassandra cluster

Day 7

Data Modeling

Naive model

Day 9

Partition Osa_19:
    2015-05-19 15:25:57,369 -> 456859
    2015-05-21 15:00:57,551 -> 012564

CREATE TABLE phones1 (
    cell text,
    instant timestamp,
    phoneNumber text,
    PRIMARY KEY (cell, instant)
);

Avoid overwrites

Day 9

Partition Osa_19:
    2015-05-19 15:25:57,369 -> {456859, 659842}
    2015-05-21 15:00:57,551 -> {012564}

CREATE TABLE phones2 (
    cell text,
    instant timestamp,
    phoneNumbers set<text>,
    PRIMARY KEY (cell, instant)
);

Compound partition key

Day 10

Partition (Osa_19, 2015-05-19 15:25:57,369):
    456859 -> -
    659842 -> -

CREATE TABLE phones3 (
    cell text,
    instant timestamp,
    phoneNumber text,
    PRIMARY KEY ((cell, instant), phoneNumber)
);

Hourly bucketing

Day 10

Partition (Osa_19, 2015-05-19 15:00):
    456859 -> -
    659842 -> -

CREATE TABLE phones4 (
    cell text,
    instant timestamp,
    phoneNumber text,
    PRIMARY KEY ((cell, instant), phoneNumber)
);
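Hourly bucketing keeps the same schema but stores the timestamp truncated to the hour, so all phones seen in a cell during that hour land in one partition. A minimal sketch of the truncation on the client side; the helper names are hypothetical:

```python
from datetime import datetime


def hour_bucket(instant: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return instant.replace(minute=0, second=0, microsecond=0)


def partition_key(cell: str, instant: datetime) -> tuple:
    """Build the (cell, hour) pair used as the compound partition key."""
    return (cell, hour_bucket(instant))


first = partition_key("Osa_19", datetime(2015, 5, 19, 15, 25, 57, 369000))
second = partition_key("Osa_19", datetime(2015, 5, 19, 15, 59, 1))
print(first, first == second)
```

Two log entries from the same cell and hour produce the same key, which is what lets a single partition read return everyone to alert.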

Query first

Day 11

Partition (Osa_19, 2015-05-19 15:00):
    numbers = 456859,659842

CREATE TABLE phones5 (
    cell text,
    instant timestamp,
    numbers text,
    PRIMARY KEY ((cell, instant))
);


Importing Data

Import data into Cassandra:

Day 13

100 GB, 2 billion rows -> ?

Import data into Cassandra: COPY

Day 13

Import data into Cassandra: COPY

Day 14

Timestamp format: ISO-8601 vs RFC3339

100 hours* 200 minutes*

Pre-process COPY

100 GB, 2 billion rows

sed 's/,/./g' | awk -F';' '{ printf "\"%s\",\"%s\",\"%s\"",$2,$1,$5;print ""}'
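The same pre-processing step can be sketched in Python. This assumes the semicolon-separated log format shown earlier and a target column order of cell, instant, phone number, matching the tables above; `to_copy_row` is a hypothetical helper name:

```python
def to_copy_row(line: str) -> str:
    """Turn one raw log line into a quoted CSV row: cell, instant, phone."""
    timestamp, cell, _lat, _lon, phone = line.rstrip("\n").split(";")
    # Replace the comma before the milliseconds with a dot.
    instant = timestamp.replace(",", ".")
    return f'"{cell}","{instant}","{phone}"'


row = to_copy_row(
    "2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924\n"
)
print(row)
```

Splitting the fields before touching the timestamp avoids rewriting separators by accident, which is the failure mode of running a blanket substitution over the whole line.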

Import data into Cassandra: parallelize COPY

Day 14

100 GB, 2 billion rows -> pre-processing, splitting -> parallelized COPY

Import data into Cassandra: COPY

Day 15

How to speed up Cassandra import?

Day 16

Cassandra Write Path Documentation

Day 17

SSTableLoader

Day 17

SSTableWriter sstableloader

100 GB, 2 billion rows

300 hours* 2 hours*

Import data into Cassandra from Amazon S3

Day 17

©2011 http://improve.dk/pushing-the-limits-of-amazon-s3-upload-performance/

Import data into Cassandra from Amazon S3

Day 18

S3: 100 GB, 2 billion rows -> 2 billion inserts

Using Spark to parallelize the insert

Day 18

S3: 100 GB, 2 billion rows -> 2 billion inserts

Cheating / Modeling before inserting

Day 18

S3: 100 GB, 2 billion rows -> 2 billion inserts

Day 19

S3: 100 GB, 2 billion rows -> 1 million inserts
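The drop from 2 billion to about 1 million inserts comes from grouping log lines by (cell, hour) before writing, so each insert carries every phone number for that bucket. The deck suggests Spark for this step; the sketch below shows only the grouping logic in plain Python, with hypothetical names and a 30-day month assumed for the bucket count:

```python
from collections import defaultdict
from datetime import datetime


def aggregate(lines):
    """Group phone numbers by (cell, hour) and join them into one value."""
    buckets = defaultdict(set)
    for line in lines:
        timestamp, cell, _lat, _lon, phone = line.strip().split(";")
        instant = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S,%f")
        hour = instant.replace(minute=0, second=0, microsecond=0)
        buckets[(cell, hour)].add(phone)
    # One row (one insert) per bucket; numbers are sorted for stable output.
    return {key: ",".join(sorted(phones)) for key, phones in buckets.items()}


sample = [
    "2015-01-04 17:10:52,834;Osa_61;34.232793;135.906451;829924",
    "2015-01-04 17:10:53,923;Osa_61;34.232793;135.906451;861343",
    "2015-01-04 17:10:52,928;Yok_85;35.494120;139.424249;121737",
]
rows = aggregate(sample)

# Upper bound on buckets: cities x cells per city x hours x days (30 assumed).
max_inserts = 10 * 100 * 24 * 30
print(len(rows), max_inserts)
```

With 10 cities, 100 cells per city and hourly buckets over 30 days, there are at most 720 000 buckets, in line with the "1 million inserts" order of magnitude on the slide.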


Using AWS


Choosing the right AWS Instance type

Day 22

Shopping on Amazon

• What can 300€ get you: x hours of y instance types + z EBS
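The budget question is a simple division once prices are known. A rough sketch with placeholder prices: the hourly rate and storage cost below are hypothetical values chosen for illustration, not actual AWS pricing, and `affordable_hours` is an invented helper:

```python
def affordable_hours(budget_eur, nodes, price_per_node_hour_eur,
                     storage_cost_eur=0.0):
    """Hours of cluster uptime a budget buys after a fixed storage cost."""
    if nodes <= 0 or price_per_node_hour_eur <= 0:
        raise ValueError("nodes and hourly price must be positive")
    return (budget_eur - storage_cost_eur) / (nodes * price_per_node_hour_eur)


# Hypothetical figures: 6 nodes at 0.25 EUR per node-hour, 30 EUR of storage.
hours = affordable_hours(300.0, nodes=6, price_per_node_hour_eur=0.25,
                         storage_cost_eur=30.0)
print(f"{hours:.0f} hours of cluster uptime")
```

Whatever the real prices are, the shape of the result is the same: uptime shrinks linearly with node count, which is why minimizing cluster uptime matters for a fixed budget.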

Capacity planning

Day 23

Good starting point: 6 analytics nodes, M3.large or M3.xlarge, SSD

Finally: not so difficult to choose

Install from scratch

Day 24

• Advantages

• install an open source stack

• full control

• latest version

• Disadvantages

• DevOps skills required

• a lot of configuration to do

• version compatibility

• time / budget consuming

• NOT AN OPTION!

Start small

Day 25

• DataStax Enterprise AMI: 1 analytics node shared between Cassandra and Spark

Optimizing budget

Day 26

• Requirement:

• Preload data

• Goal:

• Minimize cluster uptime

• Strategies:

• use EBS -> persistent storage (expensive, slow)

• load the dataset on one node and turn off the others (free, time-consuming)

• do everything “the night before” (free, risky)


Preparing for D-Day

Day 28

• Preload the data into Cassandra on the AWS cluster

• Input: earthquake date and location

• Kill the closest node

• Start simulation

• Monitor the alerting performance in real time

• Tuning (model, import)

Lather, rinse, repeat...

How to stop a specific node of the cluster?

Day 29

• Alternatives:

• stop/kill the Cassandra process

• turn off gossip

• shut down the AWS instance

• The effect is the same

Pre-loading data

Day 29

Waiting our turn

Day 30

Conclusion


Data Model

Modeling in SQL vs CQL

● SQL => CQL: a shift in mindset

○ 3NF vs denormalization

○ model first vs query first

● SQL and CQL are syntactically very similar, which can be confusing

○ no joins, no range queries on partition keys...


Modeling data: cqlsh vs cli


Importing Data


Development


Architecture


Amazon Web Services


Quality


What is the main problem?

Modeling Data, Importing Data, Architecture, Development, Deployment, Quality

=> Multidisciplinary approach

Multidisciplinary approach

Team 1 2 3 4 5 6 7 8 9 10

Modeling Data √ √ √ √ √ √ √ √

Importing Data √ √ √

Architecture √ √ √ √ √ √

Development √ √ √ √ √ √

Deployment √ √ √ √ √ √ √ √

Quality √ √ √ √

Grade 8 10 12 13 13 14 14.5 16 16.5 17


Team Skills Balance


Conclusion

• Easy to learn and easy to teach

• Easy to deploy on AWS

• Performance monitor for tuning the models

• Moving from SQL to CQL can be challenging

• Fun to implement on a concrete use-case with time/budget constraints


Thank you

Andrei Arion - Geoffrey Bérard - Charles Herriau

@BeastieFurets