montreal user group - cloning cassandra

Cloning Cassandra

Adam Hutson, Data Architect

@adamhutson

Who am I and What do we do?

• Adam Hutson
• Data Architect of DataScale -> www.datascale.io
• DataStax MVP for Apache Cassandra
• DataScale provides hosted data platforms as a service
• Offering Cassandra & Spark, with more to come
• Currently hosted in Amazon or Azure

http://www.datascale.io/

● Why Clone?
● Clone Overview
● Backup Existing Cluster
  ○ Backup Data, Schema, Tokens
  ○ Upload to Off-Site Storage
● Restore to New Cluster
  ○ Create Destination Cluster
  ○ Update Configuration Settings
  ○ Sync Schema & Tokens from Off-Site Storage
  ○ Sync Data from Off-Site Storage

Why Clone?

Why clone your data? Doesn’t replication do that? Can’t we just add a data center? All valid questions.

Scenarios:

1. Need an exact copy of Production for QA or Development.

2. Switching physical data location (e.g. on-prem to cloud, cloud to cloud)

3. Fire drills are fun, right?!

Clone Overview

Source cluster:

1. Per node, back up the existing cluster’s data, schema, & token assignment

Destination cluster:

2. Create schema on new cluster

3. Assign existing tokens

4. Add auto_bootstrap setting

5. Add load_ring_state setting

6. Restore backed up data to correct nodes

Backup Existing Cluster

Backup Data

Take a backup (aka a snapshot) with a descriptive tag so that you can identify it later.

The snapshot flushes all the data from MemTables out to SSTables.

It then creates hard links to the SSTables in the /data/<keyspace>/<table>/snapshots/ folder.

nodetool snapshot -t my_snapshot_tag

Personally, I use:

tag=$(date -u +%Y-%m-%d_%H-%M)

nodetool snapshot -t $tag
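
After the snapshot, it can be worth confirming that every table actually got a folder for the tag. A minimal sketch, assuming the data directory is passed in by the caller (the helper name list_snapshot_dirs is hypothetical, and /var/lib/cassandra/data is only the path used elsewhere in this deck):

```shell
# Hypothetical helper: list every snapshot folder that carries the given tag.
# $1 = Cassandra data directory (assumed, e.g. /var/lib/cassandra/data)
# $2 = snapshot tag passed to "nodetool snapshot -t"
list_snapshot_dirs() {
  data_dir=$1
  tag=$2
  # Snapshots live under <data_dir>/<keyspace>/<table>/snapshots/<tag>
  find "$data_dir" -type d -path "*/snapshots/$tag" | sort
}

# Example call (not executed here):
#   list_snapshot_dirs /var/lib/cassandra/data "$tag"
```

An empty result would suggest the tag was mistyped or the snapshot did not run on this node.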

Backup Schema

Need to recreate the data schema from the source cluster on the destination cluster.

The best way to preserve the current schema is in a flat file.

The cqlsh utility has a command line parameter for cql commands to run.

Then just redirect the command output to a new flat file.

cqlsh -e 'DESCRIBE SCHEMA' > schema.cql

Backup Tokens

Need to preserve the token assignment topology of the source cluster.

Have to extract the tokens and store them as a comma-separated list in a flat file.

The nodetool utility has a command that outputs all the tokens used in the ring.

nodetool ring | grep 0.0.0.0 | awk '{print $NF ", "}' | xargs | cut -d '=' -f 2 | sed 's/,$//' | sed 's/^/initial_token: /' > tokens.txt

This command creates a file starting with “initial_token: ” followed by the tokens as a comma-separated list without the trailing comma.

Replace the 0.0.0.0 with the IP that is listed as your broadcast_address.
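
The same extraction can be sketched as a single awk step. This is an alternative sketch, not the pipeline above: it assumes that each token line of nodetool ring starts with the node address and ends with the token, and it matches the address exactly so that 10.0.0.1 does not also pick up 10.0.0.10. The function name ring_to_initial_token is hypothetical.

```shell
# Hypothetical helper: read "nodetool ring" output on stdin and print a single
# "initial_token: t1, t2, ..." line for the node whose address is $1.
# Assumption: the address is the first column and the token is the last column.
ring_to_initial_token() {
  awk -v ip="$1" '
    $1 == ip { tokens = (tokens == "" ? $NF : tokens ", " $NF) }
    END      { if (tokens != "") print "initial_token: " tokens }
  '
}

# Example call (not executed here):
#   nodetool ring | ring_to_initial_token 10.0.0.1 > tokens.txt
```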

Upload to Off-Site Storage

Need to get all of the stuff we just created off the local node and off-site.

This can be anywhere … AWS, Azure, Google Cloud.

Doesn’t matter where, just put it somewhere so that it’s safe.

Keep it ordered, so that it’s easy to find later.
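
One way to keep it ordered is to stage each node's files under a predictable cluster/node/tag prefix before uploading. A minimal sketch under that assumption; the function names, the prefix layout, and the staging folder are all hypothetical, and the upload itself is left to whatever tool your storage provider offers:

```shell
# Hypothetical helper: build the off-site prefix for one node's backup.
# $1 = cluster name, $2 = node name or address, $3 = snapshot tag
offsite_prefix() {
  printf '%s/%s/%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: copy the files of one snapshot tag into a staging
# folder, keeping the <keyspace>/<table> layout so they are easy to find later.
# $1 = Cassandra data directory, $2 = snapshot tag, $3 = staging folder
stage_snapshot() {
  data_dir=$1
  tag=$2
  staging=$3
  find "$data_dir" -type d -path "*/snapshots/$tag" | while IFS= read -r snap; do
    # <data_dir>/<keyspace>/<table>/snapshots/<tag> -> <keyspace>/<table>
    table_dir=$(dirname "$(dirname "$snap")")
    rel=${table_dir#"$data_dir"/}
    mkdir -p "$staging/$rel"
    cp -R "$snap"/. "$staging/$rel"/
  done
}

# Example calls (not executed here):
#   stage_snapshot /var/lib/cassandra/data "$tag" /tmp/stage
#   cp schema.cql tokens.txt /tmp/stage/
#   offsite_prefix my_cluster node1 "$tag"
```

The schema.cql and tokens.txt files can sit at the top of the same staging folder, so that one prefix holds everything a node needs for the restore.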

Restore to New Cluster

Restoring to a new cluster is synonymous with cloning to a new cluster.

The idea is that you’re moving all of the existing important stuff so that you can have a copy running in a new cluster.

The following steps need to be performed in the specified order.

Almost all of these steps can take place in parallel on all nodes at the same time.

Create Destination Cluster

Cloning of a cluster only works if both the source and destination clusters have the same number of nodes. How you select the size/power of your machines is irrelevant to the clone. The only thing that matters is that the number of nodes matches.

Stop the Cluster

I know we just created a new cluster, but we’re going to stop the Cassandra service on all of the nodes.

We’re going to be clearing out the new cluster’s default data and settings.

We just have to get Cassandra to stop watching what we’re about to do.

sudo service cassandra stop

Delete the system keyspace data files

Seems scary, right? All of the important Cassandra metadata about itself is stored in the system keyspace. And we’re going to blow it all away? It’s all ok, though, because when the Cassandra service starts back up, it’s smart enough to recreate all the necessary metadata keyspace information based on its settings, which we’re about to change.

sudo rm -rf /var/lib/cassandra/data/system/

Add auto_bootstrap

When Cassandra starts back up, we don’t want it to try to bootstrap itself right away.

We need to add a line to the cassandra.yaml so that doesn’t happen. This single line contains the following setting, “auto_bootstrap: false”. Normally, having this setting omitted from the yaml defaults to a true value.

The simplest way to do this is with the following one-liner.

sudo sed -i '$ a\auto_bootstrap: false' /etc/cassandra/cassandra.yaml
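
Appending works on a fresh node, but running it twice leaves the setting in the file twice. A minimal sketch of a repeatable variant, assuming the setting is a plain top-level "key: value" line (the helper name set_top_level_setting is hypothetical, and writing under /etc normally needs root):

```shell
# Hypothetical helper: make sure file $1 holds exactly one "key: value" line
# for key $2 with value $3, replacing any existing uncommented line.
set_top_level_setting() {
  file=$1
  key=$2
  value=$3
  tmp=$(mktemp)
  # Keep every line except an existing uncommented line for this key.
  grep -v "^$key:" "$file" > "$tmp" || true
  printf '%s: %s\n' "$key" "$value" >> "$tmp"
  # Write back in place so the file keeps its owner and permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example call (not executed here; run as root):
#   set_top_level_setting /etc/cassandra/cassandra.yaml auto_bootstrap false
```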

Download Token File

Remember that token file we created during the backup of the existing cluster?

We need to get that file from the off-site storage location.

We need to establish a mapping of old node to new node.

That way, the tokens and the corresponding data from old node #1 both end up on new node #1.
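
The mapping can be as simple as two text files listing the nodes in the same order. A minimal sketch, assuming one address per line with no blank lines (the file names and the helper name build_node_map are hypothetical); it also refuses to continue when the node counts differ, since the clone needs them to match:

```shell
# Hypothetical helper: pair old nodes with new nodes, line by line.
# $1 = file listing source node addresses, one per line
# $2 = file listing destination node addresses, one per line
build_node_map() {
  old_count=$(grep -c . "$1")
  new_count=$(grep -c . "$2")
  if [ "$old_count" -ne "$new_count" ]; then
    echo "node count mismatch: $old_count source vs $new_count destination" >&2
    return 1
  fi
  # Prints "<old address> <new address>" for each pair.
  paste -d ' ' "$1" "$2"
}

# Example call (not executed here):
#   build_node_map old_nodes.txt new_nodes.txt > node_map.txt
```

The same map can then drive both the token download and the later data download.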

Add initial_token

Now that we have the token file local, we can use its data.

We just need to add the contents of the file to the cassandra.yaml file.

Aren’t we glad that we pre-formatted the contents to be exactly what we need? Yeah, me too.

tokens=$(cat tokens.txt)

sudo sed -i '$ a\'"$tokens" /etc/cassandra/cassandra.yaml
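
Before starting the node, it may be worth comparing the number of tokens in the file with the num_tokens setting in cassandra.yaml. As far as I know, Cassandra expects the two to agree, but treat that as an assumption to check against the documentation for your version. A minimal sketch with hypothetical helper names:

```shell
# Hypothetical helper: print how many tokens the initial_token line in
# file $1 carries (works on tokens.txt or on cassandra.yaml).
count_initial_tokens() {
  sed -n 's/^initial_token: *//p' "$1" | tr ',' '\n' | grep -c '[0-9]'
}

# Hypothetical helper: print the uncommented num_tokens value from file $1,
# or nothing if the setting is absent or commented out.
configured_num_tokens() {
  sed -n 's/^num_tokens: *\([0-9][0-9]*\).*/\1/p' "$1"
}

# Example comparison (not executed here):
#   [ "$(count_initial_tokens tokens.txt)" = \
#     "$(configured_num_tokens /etc/cassandra/cassandra.yaml)" ] \
#     || echo "token count and num_tokens differ" >&2
```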

Add load_ring_state

Within the options that are sent to the Java Virtual Machine, we need to tell the cluster to not try to figure out the ring state.

We do this by adding a JVM option, cassandra.load_ring_state=false, to the cassandra-env.sh file.

sudo sed -i '$ a\JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"' /etc/cassandra/cassandra-env.sh
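
If the step might be repeated on a node, an append that first checks for the line avoids duplicating the option. A minimal sketch (the helper name ensure_line is hypothetical, and writing under /etc normally needs root):

```shell
# Hypothetical helper: append line $2 to file $1 unless an identical line
# is already present. -x matches whole lines, -F treats the text literally.
ensure_line() {
  grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Example call (not executed here; run as root). Single quotes keep
# $JVM_OPTS literal so that it is expanded later by cassandra-env.sh:
#   ensure_line /etc/cassandra/cassandra-env.sh \
#     'JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"'
```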

Start the Cluster

We need to start the cluster back up so that the configuration changes can take hold.

When turning a cluster on, you need to start with the seeds, and then proceed to the non-seeds.

Always wait about two minutes after starting each node to make sure that the service has joined the ring and is listening for clients.

sudo service cassandra start
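
Instead of watching the clock, the wait can be sketched as a poll of nodetool status. This assumes the status output lists each node on a line that starts with the state code and the address, with UN meaning Up/Normal; the helper names and the defaults (24 attempts, 5 seconds apart, roughly two minutes) are hypothetical:

```shell
# Hypothetical helper: read "nodetool status" output on stdin and succeed
# only if the node with address $1 is reported as UN (Up/Normal).
node_is_up() {
  awk -v ip="$1" '$1 == "UN" && $2 == ip { found = 1 } END { exit !found }'
}

# Hypothetical helper: poll until the node at address $1 is up.
# $2 = maximum attempts (default 24), $3 = seconds between attempts (default 5)
wait_for_node() {
  attempts=${2:-24}
  delay=${3:-5}
  while [ "$attempts" -gt 0 ]; do
    if nodetool status 2>/dev/null | node_is_up "$1"; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Example call (not executed here), seeds first and then the non-seeds:
#   sudo service cassandra start && wait_for_node 10.0.0.1
```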

Download the Schema File

The schema from the existing cluster needs to be created on the new cluster.

The schema.cql file that we created and stored off-site needs to be downloaded.

This only needs to happen on one node.

Create the Schema

We have to execute the schema.cql file against the new cluster. The cqlsh utility has a command line parameter that accepts a text file to be executed. Once the schema has been created, Cassandra will create the necessary folder structure for storing the data files.

cqlsh -f schema.cql

Stop the Cluster

Now that we have the necessary folder structure in place for the data files, we can stop the cluster.

We will soon be downloading the backed up data files.

Cassandra needs to be turned off so that it doesn’t attempt to read them while the download is happening.

sudo service cassandra stop

Download the backup data files

We need to download all of the data files from the off-site storage.

The data files will need to be placed into the correct locations according to the schema that we created earlier.

Each keyspace & table in the schema has its own /folder/subfolder that data files need to go into.

Use the same mapping of old nodes to new nodes from the earlier token file step.
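
Placing the files can be sketched as a copy from a downloaded staging folder into the matching table folders. This assumes the staging folder uses a <keyspace>/<table> layout, and that table folders may carry an ID suffix (<table>-<id>) that differs between the old and new cluster, which, as far as I know, is how newer Cassandra versions name them. The helper name restore_table_files is hypothetical:

```shell
# Hypothetical helper: copy downloaded files into the new node's table folders.
# $1 = staging folder laid out as <keyspace>/<table>[-<id>]/
# $2 = Cassandra data directory on the new node
restore_table_files() {
  staging=$1
  data_dir=$2
  for src in "$staging"/*/*/; do
    [ -d "$src" ] || continue
    src=${src%/}
    keyspace=$(basename "$(dirname "$src")")
    table=$(basename "$src")
    table=${table%-*}   # drop a "-<id>" suffix, if there is one
    dest=
    # Accept either a plain <table> folder or <table>-<id>; first match wins.
    for candidate in "$data_dir/$keyspace/$table" "$data_dir/$keyspace/$table"-*; do
      if [ -d "$candidate" ]; then
        dest=$candidate
        break
      fi
    done
    if [ -z "$dest" ]; then
      echo "no destination folder for $keyspace.$table" >&2
      continue
    fi
    cp -R "$src"/. "$dest"/
  done
}

# Example call (not executed here; run as a user that can write to the data folder):
#   restore_table_files /tmp/stage /var/lib/cassandra/data
```

After copying, check that the files are owned by the account the Cassandra service runs as, since files written by root may not be readable by the service.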

Start the Cluster

Now we have all the configuration settings in place and all of the data files in their respective folders.

The last step is to start the cluster back up.

This will take a little bit longer to start since it has new data files to discover.

Once it’s up and running, you should be able to run queries against it to verify that all of the data from the existing cluster is now in the new cluster.

sudo service cassandra start

Thank You!

Questions?

Adam Hutson

@AdamHutson, @DataScaleInc
