1 Million Writes Per Second on 60 Nodes with Cassandra and EBS


1 Million Writes Per Second w/ 60 Nodes: EBS and C*

Jim Plush - Sr Director of Engineering, CrowdStrike

Dennis Opacki - Sr Cloud Systems Architect

© 2015. All Rights Reserved.

An Introduction to CrowdStrike

We Are a Cybersecurity Technology Company

We Detect, Prevent And Respond To All Attack Types In Real Time, Protecting Organizations From Catastrophic Breaches

We Provide Next-Generation Endpoint Protection, Threat Intelligence & Pre & Post IR Services

NEXT-GEN ENDPOINT

INCIDENT RESPONSE

THREAT INTEL

http://www.crowdstrike.com/introduction-to-crowdstrike-falcon-host/

CrowdStrike Scale

•  Cloud based endpoint protection

•  Single customer can generate > 2TB daily

•  500K+ Events Per Second

•  Multiple petabytes of managed data


Truisms???

•  HTTPS is too slow to run everywhere

•  All you need is anti-virus

•  Never run Cassandra on EBS


What is EBS?

[Diagram: an EC2 instance with two EBS data volumes mounted at /mnt/foo and /mnt/bar]

•  Network-mounted hard drive
•  Ability to snapshot data
•  Data encryption at rest & in flight

Existing EBS Assumptions

•  Jittery I/O aka: Noisy neighbors

•  Single Point of Failure in a Region

•  Cost is too damn high

•  Bad volumes (dd and destroy)

A recent project: initial requirements

•  1PB of incoming event data from millions of devices

•  Modeled as a graph

•  1 million writes per second (burst)

•  Age data out after x days

•  95% write 5% read


We Tried

•  Cassandra + Titan

•  Sharding?

•  Neo4J

•  PostgreSQL, MySQL, SQLite

•  LevelDB/RocksDB


We have to make this work

•  Cassandra had the properties we needed
•  Time for a new approach?

http://techblog.netflix.com/2014/07/revisiting-1-million-writes-per-second.html

Number of Machines for 1PB

[Chart: number of machines needed for a 1PB cluster, 0 to 2,250 scale: i2.xlarge vs. c4.2xl with EBS]

Yearly Cost for 1PB Cluster

[Chart: yearly cost for a 1PB cluster in millions of $, 0 to 16 scale: i2.xlarge on-demand, i2.xlarge reserved, c4.2xl on-demand, and c4.2xl reserved, the c4.2xl options with EBS]

Initial Launch

Date Tiered Compaction


…more details by Jeff Jirsa, CrowdStrike

Cassandra Summit 2015 - DTCS

Initial Launch

•  Cassandra 2.0.12 (DSE)

•  m3.2xlarge 8 core

•  Single 4TB EBS GP2 ~10,000 IOPS

•  Default tunings


Performance was terrible

•  12 node cluster

•  ~60K writes per second RF2

•  ~10K writes per 8 core box

•  We went to the experts


At Cassandra Summit 2014, FamilySearch asked the same question: Where's the bottleneck?

https://www.youtube.com/watch?v=Qfzg7gcSK-g

IOPS Available

[Chart: IOPS available, 0 to 50,000 scale: i2.xlarge vs. c4.2xlarge]

1.3K IOPS?


IOPS I see you there,

but I can’t reach you!


The magic gates opened…

We hit 1 million writes per second RF3 on 60 nodes


Testing Setup!

Testing Methodology

•  Each test run: clean C* instances, old test keyspaces dropped
•  13+ TBs of data loaded during read testing
•  20 C4.4XL stress writers, each with its own 1-billion-key sequence range


Cluster Topology

[Diagram: cluster topology. 10 stress instances in AZ 1A and 10 in AZ 1B drive 20 C* nodes in each of AZs 1A, 1B, and 1C, all backed by EBS and monitored by OpsCenter]

Cassandra Stress 2.1.x

bin/cassandra-stress user duration=100000m cl=ONE profile=/home/ubuntu/summit_stress.yaml ops\(insert=1\) no-warmup -pop seq=1..1000000000 -mode native cql3 -node 10.10.10.XX -rate threads=1000 -errors ignore

PCSTAT - Al Tobey

http://www.datastax.com/dev/blog/compaction-improvements-in-cassandra-21

https://github.com/tobert/pcstat
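pcstat reports which pages of a file are resident in the Linux page cache, which is useful for seeing whether reads are served from memory or hit the disk. A minimal usage sketch against the stress keyspace path used later in this deck (the exact invocation is an assumption, not from the slides):

# show page-cache residency for a table's SSTable data files (path is illustrative)
sudo pcstat /mnt/cassandra/data/summit_stress/*/*-Data.db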


Netflix Test - What is C* capable of?

Netflix Test


1+ Million Writes Per Second at RF 3; 3+ Million Local Writes Per Second

NICE!

Netflix Test


Netflix Test


No Dropped Mutations, system healthy at 1.1M after 50 mins

Netflix Test


I/O util is not pegged; commit disk = steady!

Netflix Test


Low IO Wait

Netflix Test


95th Latency = Reasonable

Netflix Test - Read Fail


compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'}

https://issues.apache.org/jira/browse/CASSANDRA-10249 https://issues.apache.org/jira/browse/CASSANDRA-8894

Data Drive Pegged :(

Reading Data

•  24-hour read test
•  Over 10 TBs of data in the CF
•  Sustained > 350K reads per second over 24 hours
•  1M reads per second peak
•  CL ONE
•  12 C4.4XL stress boxes

Reading Data


Reading Data


Reading Data


Not Pegged :)

Reading Data


7.2ms 95th latency

Netflix Test resource usage

•  180 fewer cores (45 fewer i2.xlarge instances)
•  24-hour test (sans data transfer cost)
  –  Netflix cluster/stress cost: ~$6,300 (285 i2.xlarge at $0.85 per hour)
  –  CrowdStrike cluster/stress with EBS cost: ~$2,600 (60 C4.4XL at $0.88 per hour)

Read Notes with EBS

•  Our test was a single 10K IOPS volume
•  More/bigger reads?
  –  PIOPS gives you as much throughput as you need
  –  RAID0 multiple EBS volumes (see the sketch below)

[Diagram: a single /mnt/data filesystem striped across EBS Vol1 and EBS Vol2]
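A minimal sketch of the RAID0 option: striping two attached EBS volumes into a single /mnt/data filesystem with mdadm (device names are placeholders, not from the deck):

# assumes two GP2 volumes are attached as /dev/xvdb and /dev/xvdc
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
sudo mkfs.xfs /dev/md0                    # XFS, as used elsewhere in this setup
sudo mkdir -p /mnt/data
sudo mount -o noatime /dev/md0 /mnt/data  # Cassandra data directory lives here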

What Unlocked Performance!

Major Tweaks

•  Ubuntu HVM instance types: Enhanced Networking, now faster than PVM
•  Ubuntu distro tuned for cloud workloads
•  XFS filesystem

Major Tweaks

•  Cassandra 2.1
•  Java 8
•  G1 garbage collector (cassandra-env)

https://issues.apache.org/jira/browse/CASSANDRA-7486

Major Tweaks

•  C4.4XL: 16 cores, EBS Optimized
•  4TB, 10,000 IOPS EBS GP2 encrypted data drive (provisioning sketch below)
  –  160 MB/s throughput
•  1TB, 3,000 IOPS EBS GP2 encrypted commit log drive
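A hedged sketch of provisioning those two volumes with the AWS CLI. GP2 IOPS scale with volume size (roughly 3 IOPS per GB, capped at 10,000 at the time), so the sizes below imply the stated IOPS; the availability zone, volume ID, instance ID, and device name are placeholders:

# 4TB encrypted GP2 data volume (~10,000 IOPS at this size)
aws ec2 create-volume --size 4096 --volume-type gp2 --encrypted --availability-zone us-east-1a
# 1TB encrypted GP2 commit log volume (~3,000 IOPS)
aws ec2 create-volume --size 1024 --volume-type gp2 --encrypted --availability-zone us-east-1a
# attach a volume to the C* node
aws ec2 attach-volume --volume-id vol-xxxxxxxx --instance-id i-xxxxxxxx --device /dev/xvdf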

Major Tweaks

•  cassandra-env.sh
  •  MAX_HEAP_SIZE=8G
  •  JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
  •  Lots of other minor tweaks

cassandra-env.sh


Put PID in batch mode

Mask CPU0 from the process to reduce context switching

Magic From Al Tobey
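A hedged sketch of what those two annotations can look like as shell commands against a running Cassandra PID (the deck showed them as cassandra-env.sh additions; the exact lines from that screenshot are not reproduced here):

CASS_PID=$(pgrep -f CassandraDaemon)
# put the JVM into SCHED_BATCH so the scheduler treats it as a CPU-bound, non-interactive workload
sudo chrt -b -p 0 "$CASS_PID"
# keep Cassandra on cores 1-15 of the 16-core C4.4XL, masking CPU0 to reduce context switching
sudo taskset -cp 1-15 "$CASS_PID"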

YAML Settings

•  cassandra.yaml (based on 16 cores)
  •  concurrent_reads: 32
  •  concurrent_writes: 64
  •  memtable_flush_writers: 8
  •  trickle_fsync: true
  •  trickle_fsync_interval_in_kb: 1000
  •  native_transport_max_threads: 256
  •  concurrent_compactors: 4

cassandra.yaml

We found that a good portion of the CPU load was being spent on internode compression, which reduced write throughput:

internode_compression: none

Lessons Learned

•  EBS was never the bottleneck during testing; GP2 is legit
•  If you're doing batching, write to the same rowkey in the batch (see the sketch below)
•  Built-in types like list and map come at a performance penalty
  •  30% hit on our writes using the map type
•  DTCS is very young (see Jeff Jirsa's talk)
•  The 2.1 stress tool is tricky but great for modeling workloads
•  How will compression affect your read path?
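For the batching point above, a hedged illustration via cqlsh (the keyspace, table, and columns are hypothetical): an unlogged batch whose statements all share one partition key is applied as a single mutation on one replica set, instead of fanning out from the coordinator to many partitions.

cqlsh 10.10.10.XX <<'CQL'
BEGIN UNLOGGED BATCH
  INSERT INTO graph.events (device_id, ts, payload) VALUES ('dev-42', 1, 'a');
  INSERT INTO graph.events (device_id, ts, payload) VALUES ('dev-42', 2, 'b');
APPLY BATCH;
CQL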


Test your own!

https://github.com/CrowdStrike/cassandra-tools

It’s just python

•  Launch 20 nodes in us-east-1:
   python launch.py launch --nodes=20 --config=c4-ebs-hvm --az=us-east-1a
•  Bootstrap the new nodes with C*, RAID/format disks, etc…:
   fab -u ubuntu bootstrapcass21:config=c4-highperf
•  Run arbitrary commands:
   fab -u ubuntu cmd:config=c4-highperf,cmd="sudo rm -rf /mnt/cassandra/data/summit_stress"


Run custom stress profiles… multi-node support

Node 1 (export NODENUM=1):

ubuntu@ip-10-10-10.XX:~$ python runstress.py --profile=stress10 --seednode=10.10.10.XX --threads=50
Going to run: /home/ubuntu/apache-cassandra-2.1.5/tools/bin/cassandra-stress user duration=100000m cl=ONE profile=/home/ubuntu/summit_stress.yaml ops\(insert=1,simple=9\) no-warmup -pop seq=1..1000000000 -mode native cql3 -node 10.10.10.XX -rate threads=50 -errors ignore

Node 2 (export NODENUM=2):

ubuntu@ip-10-10-10.XX:~$ python runstress.py --profile=stress10 --seednode=10.10.10.XX --threads=50
Going to run: /home/ubuntu/apache-cassandra-2.1.5/tools/bin/cassandra-stress user duration=100000m cl=ONE profile=/home/ubuntu/summit_stress.yaml ops\(insert=1,simple=9\) no-warmup -pop seq=1000000001..2000000000 -mode native cql3 -node 10.10.10.XX -rate threads=50 -errors ignore

Where are we today?

•  ~3 months on our EBS-based cluster
•  Hundreds of TBs of graph data in C*, and growing
•  Billions of vertices/edges
•  Changing perceptions?

Special thanks to

•  Leif Jackson
•  Marcus King
•  Alan Hannan
•  Jeff Jirsa
•  Al Tobey
•  Nick Panahi
•  J.B. Langston
•  Marcus Eriksson
•  Iian Finlayson
•  Dani Traphagen

EBS heading into 2016

4TB (10k IOPS) GP2

IO hit? Not enough to faze C*


So why the hate for EBS?

Following the Crowd – Trust Issues

•  Used instance-store images and ephemeral drives
•  Painful to stop/start instances, resize
•  Couldn't avoid scheduled maintenance (i.e. Reboot-a-palooza)
•  Encryption required shenanigans

Guess What?

•  We still had failures
•  Now we get to rebuild from scratch

EBS's Troubled Childhood

What do you mean my volume is "stuck"?
•  April 2011 – Netflix, Reddit and Quora
•  October 2012 – Reddit, Imgur, Heroku
•  August 2013 – Vine, AirBNB

Kiss of Death

http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
•  Spread services across multiple regions
•  Test failure scenarios regularly (Chaos Monkey)
•  Make Cassandra databases more resilient by avoiding EBS

Redemption

Amazon moves quickly and quietly:
•  March 2011 – New EBS GM
•  July 2012 – Provisioned IOPS
•  May 2014 – Native encryption
•  Jun 2014 – GP2 (game changer)
•  Mar 2015 – 16TB / 10K GP2 / 20K PIOPS

Redemption

•  Prioritized EBS availability and consistency beyond features and functionality
•  Compartmentalized the control plane - broke cross-AZ dependencies for running volumes
•  Simplified workflows to favor sustained operation
•  Tested and simulated via TLA+/PlusCal - better understood corner cases
•  Dedicated a large fraction of engineering resources to reliability and performance

Reliability

The EBS team targets 99.999% availability, exceeding expectations

CrowdStrike Today

In the past 12 months, zero EBS-related failures
•  Thousands of GP2 data volumes (~2PB of data)
•  Transitioning all systems to EBS root drives
•  Moved all data stores to EBS (C*, Kafka, Elasticsearch, Postgres, etc.)

Staying Safe - Architecture

•  Select a region with >2 AZs (e.g. us-east-1 or us-west-2)
•  Use EBS GP2 or PIOPS storage
•  Separate volumes for data and commit logs

Staying Safe - Ops

•  Use EBS volume monitoring
•  Pre-warm EBS volumes?
•  Schedule snapshots for consistent backups (see the sketch below)
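A hedged sketch of those two ops tasks from the shell (the volume ID and device name are placeholders; for volumes restored from a snapshot in that era, pre-warming meant reading every block once):

# point-in-time EBS snapshot of the data volume, e.g. from cron after a nodetool flush/snapshot
aws ec2 create-snapshot --volume-id vol-xxxxxxxx --description "cassandra data volume backup"
# pre-warm a volume restored from snapshot by reading every block once
sudo dd if=/dev/xvdf of=/dev/null bs=1M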

Most Importantly

•  Challenge assumptions
•  Stay current on the AWS blog
•  Talk with your peers

Thank you @jimplush

@opacki
