Predix Columnar Store

© 2020 General Electric Company


Contents

Predix Columnar Store Service Overview
    About Columnar Store
    Features and Benefits
    About Cassandra
    Columnar Store Architecture

Getting Started with Predix Columnar Store
    Columnar Store Service Setup
    Data Modeling Concepts
    Data Replication Strategies
    Schema Design Strategies
    Creating a Predix Columnar Store Service Instance
    Binding a Service to an Application
    About Keyspaces

Using Predix Columnar Store
    Using the Query UI
    About Cluster Management
    Updating the Service
    Unbinding the Service From an Application
    Deleting the Service
    About Service Maintenance
    Repairing an Instance
    Configuring Read Repair
    Using Anti-Entropy Repair
    Using Compaction
    Backing Up Data
    Restoring Data From a Backup
    Viewing Logs
    About Service Monitoring
    Using DataStax OpsCenter

Copyright GE Digital

© 2020 General Electric Company.

GE, the GE Monogram, and Predix are either registered trademarks or trademarks of General Electric Company. All other trademarks are the property of their respective owners.

This document may contain Confidential/Proprietary information of General Electric Company and/or its suppliers or vendors. Distribution or reproduction is prohibited without permission.

THIS DOCUMENT AND ITS CONTENTS ARE PROVIDED "AS IS," WITH NO REPRESENTATION OR WARRANTIES OF ANY KIND, WHETHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF DESIGN, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. ALL OTHER LIABILITY ARISING FROM RELIANCE UPON ANY INFORMATION CONTAINED HEREIN IS EXPRESSLY DISCLAIMED.

Access to and use of the software described in this document is conditioned on acceptance of the End User License Agreement and compliance with its terms.


Predix Columnar Store Service Overview

About Columnar Store

Predix Columnar Store is a data storage service based on Cassandra, a NoSQL database designed to handle large data workloads across multiple nodes with no single point of failure.

Cassandra has a peer-to-peer distributed system architecture in which data is distributed among multiple homogeneous nodes. Nodes are organized into data centers, and clusters contain one or more data centers. Data is replicated across nodes and data centers to protect against catastrophic loss and to speed request processing. Any authenticated user can connect to any node in any data center and access data by using CQL (Cassandra Query Language, similar to SQL). Read and write requests can be sent to any node in a cluster, and the recipient node acts as a proxy between the client application and the nodes where the requested data is located. If a node or data center is down, data is retrieved from the nearest available node, and changes are synchronized when the failed node or data center is restored.

Cassandra Infrastructure Components

• Node: Data is stored in nodes, which can be virtual or physical locations.

• Data center: A group of related nodes, either physical or virtual, in the same physical location. Replication is configured at this level, and data can be written to multiple data centers. Distinct workloads should be handled by separate data centers to keep requests close to each other and reduce data latency.

• Cluster: A group of one or more data centers that can be distributed across multiple physical locations.

• Commit log: Data is first written to this log for durability, and then written to disk when log memory is full. After all data is written to disk, logs can be archived, deleted, or recycled.

• SSTable: A sorted string table file to which Cassandra writes data. These tables are append-only, stored to disk sequentially, and maintained for each Cassandra table.

• CQL Table: A collection of ordered columns that has a primary key and is fetched by table row.
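The relationship between the commit log and SSTables described above can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Cassandra's implementation: the class name ToyNode and the flush_threshold parameter are hypothetical, and the in-memory buffer stands in for the structure that Cassandra calls a memtable.

```python
from typing import Dict, List, Optional, Tuple


class ToyNode:
    """Toy model of a write path: commit log -> in-memory buffer -> SSTable."""

    def __init__(self, flush_threshold: int = 3) -> None:
        # Hypothetical limit used only for this sketch, not a Cassandra setting.
        self.flush_threshold = flush_threshold
        self.commit_log: List[Tuple[str, str]] = []
        self.memtable: Dict[str, str] = {}
        self.sstables: List[List[Tuple[str, str]]] = []

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # 1. append to the log for durability
        self.memtable[key] = value            # 2. buffer the write in memory
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # 3. persist the buffered rows as a new sorted, immutable table
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.commit_log.clear()               # flushed log entries can be recycled

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # the newest table wins
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None
```

Writing the keys c, a, b, and d to a ToyNode with a threshold of three leaves one sorted table holding a, b, and c, while d stays in memory and in the commit log until the next flush.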

Features and Benefits

Columnar Store provides you with all of the power and flexibility of the Cassandra database within the Predix platform, with pre-built infrastructure and integration and easy provisioning.

Columnar Store has the following features:

• Decentralized: Masterless architecture means all nodes are equal, and there is no single point of failure. Data can be written to and read from all nodes and is automatically distributed among nodes. Hardware failures therefore do not impact your important data, and network bottlenecks are eliminated.

• Fault tolerant: Columnar Store distributes your data across multiple nodes and data centers to provide even more failover protection. When nodes fail, they can be easily restored or replaced, and the commit log design prevents data loss.

• Scalable: Easy provisioning means you can quickly scale from three to n nodes as your needs evolve.

• Fully replicated: You can customize data replication by selecting a replication factor that meets your requirements.

About Cassandra

Columnar Store is based on Cassandra, a non-relational database that offers benefits not found in traditional RDBMS products.

The changing data landscape of today's online applications has created a need for data storage technologies with low latency and massive scalability, continuous uptime, and global data distribution with the ability to read and write in any location. These key requirements, along with the desire to reduce software and operational costs, are the reasons behind the growing popularity of non-relational database technologies.

Cassandra differs from a more traditional relational database, such as PostgreSQL, in the following ways:

Table 1: Relational Databases Compared to Cassandra

Relational Database                                   | Cassandra
------------------------------------------------------+-------------------------------------------------
Supports moderate incoming data velocity              | Supports high incoming data velocity
Incoming data from one or few locations               | Incoming data from many locations
Designed to manage mostly structured data             | Designed to manage all types of data
Supports complex and nested transactions              | Supports simple transactions
Single point of failure with failover                 | No single point of failure with continuous uptime
Handles moderate data volumes                         | Handles very high data volumes
Centralized architecture and deployment               | Decentralized architecture and deployment
Most data written in a single location                | Data written in many locations
Read scalability support, with consistency sacrifices | Read and write scalability support
Vertical scale-up deployment                          | Horizontal scale-out deployment

When deciding whether Columnar Store is the best choice for your data storage needs, consider the following questions:

• What volume of incoming data do you need to store?
• Do you anticipate that data volume will grow over time?
• What is the expected incoming data velocity?
• How many locations generate the data you need to store?
• Is your data structured or unstructured?
• What level of transaction complexity support do you need?
• How important are continuous uptime and data durability?

Columnar Store Architecture

Figure 1: Columnar Store Architecture

Predix Columnar Store can exchange data with Cloud Foundry apps, and receive inputs from other Predix services. Cloud Foundry apps can send data to Columnar Store and other Predix services. External cloud instances of apps and services are blocked from access to Columnar Store or any other components of the Predix Data Services Virtual Private Cloud (VPC).

Getting Started with Predix Columnar Store

Columnar Store Service Setup

The Columnar Store setup process includes several tasks. First, you prepare your development environment. Then, you create a service instance, bind the service to your application, create a keyspace, and apply your data model.

Prerequisites

Before you can begin your Columnar Store service setup, you need a Predix.io account and a fully prepared development environment, including a Cassandra data model, database instance, and tables.

Important: You must understand and plan the Cassandra data modeling strategies before setting up your Columnar Store service instance. Review the links provided below.

No. | Task | Information
1. | Familiarize yourself with Predix platform concepts. | What is Predix Platform?
2. | Set up your Predix development environment. | Task Roadmap: Predix Development Environment Setup
3. | Create the Predix Hello World application if you do not have your own application. | Creating and Deploying a Simple Web App to Cloud Foundry
4. | Familiarize yourself with Cassandra database concepts. | Data Modeling Concepts; Data Replication Strategies; Schema Design Strategies

Tip (task 3): You will later bind your service instance to an app to provision service details in the VCAP_SERVICES environment variable. The Cloud Foundry runtime uses the VCAP_SERVICES environment variable to communicate with a deployed application about its environment.

Setting Up a Local Cassandra Environment

No. | Task | Information
1. | Download and install Apache Cassandra on your local machine. | Download Cassandra from DataStax Academy
2. | Create a keyspace on your local machine. | About Keyspaces
3. | Design and create tables for your Cassandra instance. |
4. | Set up your Cassandra data model and apply it to the Cassandra instance on your local machine. |
5. | Decide on a data replication strategy. | Data Replication Strategies

Setting Up Columnar Store Service Instance

No. | Task | Information
1. | Create a Columnar Store service instance. | Creating a Predix Columnar Store Service Instance
2. | Bind the service to your application. | Binding a Service to an Application
3. | In Cloud Foundry, create a keyspace and apply your data model. | About Keyspaces

Data Modeling Concepts

Before you can set up and use Columnar Store, you must first create a data model.

To get the most from Columnar Store, proper data modeling and table design are crucial. Cassandra data modeling concepts and techniques are significantly different from those of traditional relational databases. For example, the familiar entity-relationship-attribute paradigm does not apply in the same manner as for relational databases. Unlike legacy RDBMS products, Cassandra can handle very large tables with tens of thousands of columns.

Cassandra data objects, such as tables, primary keys, and indexes, are similar to those of a relational database:

• Keyspace: The outermost container for data in Cassandra, similar to a database in many relational databases. A keyspace has a name and attributes that define keyspace-wide behavior, and data replication is defined at the keyspace level. While a database is a container for tables, a keyspace is a container for a list of one or more column families. A column family is similar to a table in a relational database, and is a container for a collection of rows that contain ordered columns. Column families represent the structure of your data, and each keyspace has one or more column families.

• Table: In Cassandra, a table is a container that holds rows of columns. Cassandra tables are similar to relational tables, but with vast data volume storage capabilities and very fast row insert and column-level read functionalities.

• Primary key: A unique identifier for a table row that is also used to distribute table rows across multiple nodes within a cluster. A primary key is composed of a partition key and, optionally, a clustering key.

◦ Partition key: A value that specifies the nodes in a cluster where data is stored. Simple partition keys are based on a single column. Composite partition keys are based on multiple columns, and are used when the volume of data is too large to store in a single partition. When you use a composite partition key, data is broken into buckets (chunks) for storage on multiple nodes. Cassandra is often used for time series data, and hotspotting (congestion while repeatedly writing data to one node) is a common issue. When incoming data is grouped into smaller chunks, for example, by the year:month:day:hour columns, hotspots can be mitigated.

◦ Clustering key: A value that determines how data is sorted within each partition, based on specific columns. Because data is distributed throughout Cassandra clusters, high latency can result when data is retrieved from a large partition that must be fully read to find a small amount of information. When rows for a partition key are stored in the order defined by a clustering key, retrieval is highly efficient. Using clustering values to group table data is the equivalent of JOINs in a relational database, but performance is much better because only one table is accessed.

• Index: Like a relational index, this object speeds some read operations, but it also differs from relational indexes in important ways. Indexes allow you to access data in Cassandra by using attributes other than the partition key for fast, efficient lookup of data that matches specified conditions. Column values are indexed in a separate table that is hidden from the table being indexed.
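The year:month:day:hour bucketing mentioned for partition keys can be sketched in Python as follows. This is only an illustration of how a composite key groups time series rows; the function name hourly_bucket and the sensor_id column are hypothetical and do not come from the Columnar Store documentation.

```python
from datetime import datetime, timezone


def hourly_bucket(sensor_id: str, ts: datetime) -> str:
    """Return a composite partition key of the form sensor:year:month:day:hour.

    The timestamp must be timezone-aware; it is normalized to UTC so that
    all writers agree on the bucket boundaries.
    """
    utc = ts.astimezone(timezone.utc)
    return f"{sensor_id}:{utc.strftime('%Y:%m:%d:%H')}"
```

Readings taken by the same sensor within the same hour share one partition, while the next hour starts a new partition, so writes for a long-running series are not all directed at a single, ever-growing partition.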

Important: Be aware that you can access your Columnar Store service instance only from the applications that you bind to it. You cannot access your service instance from a local machine or from other apps that run elsewhere.

Complete details and video tutorials on data modeling concepts and techniques are available in the DataStax Academy online course DS220: Data Modeling.

Data Replication Strategies

When you set up Columnar Store, replicas are created on multiple nodes, data centers (DC), and clusters to ensure reliability and fault tolerance. To specify the nodes where replicas are placed, you select a data replication strategy when you create a keyspace for your Columnar Store service instance.

The total number of replicas created across a cluster is known as the replication factor. For example, a replication factor of 1 specifies that there is one copy of each row on one node, and a replication factor of 2 specifies that there are two copies of each row on two nodes. There is no primary or master replica, and all replicas are equally important. As a general rule, do not specify a replication factor that exceeds the number of nodes in the cluster. However, you can increase the replication factor and add the desired number of nodes later on.

There are two configurations available: a single-DC cluster or a multi-DC cluster. At this time, only the network topology replication strategy is available. By using this strategy, you can specify the number of replicas you want in each data center. With NetworkTopologyStrategy, replicas within a data center are placed by walking the ring clockwise until the first node in another rack is reached. The strategy places replicas on distinct racks because nodes on the same rack, or in similar physical groupings, often fail simultaneously due to power, cooling, or network issues.

To decide how many replicas to configure in each data center, consider the importance of local read execution without cross-data-center latency, and keep failure scenarios in mind. The two most common ways to configure multiple data center clusters are:

• Two replicas in each data center: Choose this configuration to support failure tolerance of a single node per replication group without disabling local reads at a consistency level of ONE.

• Three replicas in each data center: Choose this configuration to support failure tolerance of either one node per replication group at a strong consistency level of LOCAL_QUORUM, or multiple nodes per data center at a consistency level of ONE.

Asymmetrical replication groupings are also possible. For example, you can have three replicas in one data center to serve real-time application requests and use a single replica elsewhere for running analytics.
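The failure tolerance figures quoted for two and three replicas per data center follow from simple arithmetic, sketched below in Python. The sketch assumes the usual Cassandra definition of a quorum (more than half of the replicas, that is, floor(RF / 2) + 1); the function names are hypothetical.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: more than half of them."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int, consistency: str) -> int:
    """Replicas in one data center that can be down while requests still succeed."""
    required = {
        "ONE": 1,
        "LOCAL_QUORUM": quorum(replication_factor),
    }[consistency]
    return replication_factor - required
```

With two replicas, one node can fail at consistency level ONE but none at LOCAL_QUORUM. With three replicas, one node can fail at LOCAL_QUORUM and two at ONE, which matches the two configurations listed above.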

Schema Design Strategies

If you have an RDBMS background, CQL (Cassandra Query Language) looks similar to SQL. However, the way you model data is different.

Important: Choosing the right data model is the trickiest part of using the Cassandra database. Cassandra is designed to be a high-performance database that discourages inefficient queries. Query efficiency is controlled by how data is stored in Cassandra.

Recommendations

The following are key points to consider when designing a schema in Cassandra. For more information, see the DataStax article Basic Rules of Cassandra Data Modeling.

Data Distribution

Data should be distributed evenly across the cluster so that every node holds roughly the same amount of data. Data distribution is based on the primary key: a hash is calculated for each partition key, and that hash value determines which node in the cluster stores the data. As such, choosing a good primary key is important.
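The effect of the partition key on data distribution can be sketched in Python. This is a simplified stand-in, not Cassandra's partitioner: Cassandra assigns token ranges on a ring and uses Murmur3 hashing by default, whereas this sketch uses MD5 and a modulo purely to stay within the standard library. The function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Iterable


def node_for(partition_key: str, node_count: int) -> int:
    """Map a partition key to a node index by hashing it (illustrative only)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return token % node_count


def distribution(keys: Iterable[str], node_count: int) -> Counter:
    """Count how many rows land on each node for the given partition keys."""
    return Counter(node_for(key, node_count) for key in keys)
```

Many distinct keys, such as one key per sensor, spread roughly evenly across the nodes. If every row shares the same partition key, for example a single date, all rows hash to the same node, which is the uneven distribution this section warns against.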

Reduce Number of Partitions Read

Partitions are groups of rows that share the same partition key. When a read query is performed, the coordinator node requests every partition that contains the requested data. Response time increases when the data is spread across many partitions, because each partition request adds overhead. A query should read as few partitions as possible.

To minimize partition reads, focus on modeling the data according to the types of queries being used. Minimizing partition reads involves the following considerations.

1. Data modeling according to the queries. When creating a schema, think about the queries you will issue to the Cassandra database. Read time is faster if the data for a query is contained in one table.

2. Schema (table) creation should satisfy queries by reading (roughly) one partition. Have one table per query pattern. Different tables should be created to satisfy different needs. It is OK to duplicate data among different tables. Focus on serving a read request from one table in order to optimize the read.
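The one-table-per-query-pattern idea can be sketched with plain Python dictionaries standing in for tables. This is a conceptual illustration only; the table names, the record function, and the sensor and day fields are hypothetical, and a real application would issue CQL writes to two Cassandra tables instead.

```python
from collections import defaultdict
from typing import Dict, List

# Each dictionary models one table; its key plays the role of the partition key.
readings_by_sensor: Dict[str, List[dict]] = defaultdict(list)  # query: all readings for a sensor
readings_by_day: Dict[str, List[dict]] = defaultdict(list)     # query: all readings for a day


def record(sensor_id: str, day: str, value: float) -> None:
    """Write the same reading to both tables, duplicating it on purpose."""
    row = {"sensor_id": sensor_id, "day": day, "value": value}
    readings_by_sensor[sensor_id].append(row)
    readings_by_day[day].append(dict(row))  # independent copy, as in a second table
```

Each query then reads exactly one partition: readings_by_sensor["s1"] answers the per-sensor query and readings_by_day["2020-05-20"] answers the per-day query, at the cost of one extra write per reading.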

Do Not Focus on Write Counts

Do not be concerned about the number of write requests to the Cassandra database. A write is more efficient than a read. Write the data in a way that improves the read query efficiency.

Do Not Focus on Data Duplication

Data duplication is necessary for a distributed database like Cassandra. To improve Cassandra reads, you should duplicate the data to ensure the availability of data in case of failures.

Creating a Predix Columnar Store Service Instance

Before You Begin

Complete all steps in Prerequisites.

Procedure

1. Log in to Predix and Cloud Foundry.

cf login -a <api-url> -u <username> -p <password> -o <organization> -s <space>

Note: If you are a GE employee, you must use the cf login --sso command to log in to Cloud Foundry. After you enter your SSO, you receive a one-time passcode URL. Copy this URL and paste it in a browser to retrieve your one-time passcode. Use this code with the cf command to complete the CF login process.

Option | Description
<api-url> | API endpoint (Predix URL)
<username> | Your Predix username
<password> | Your Predix password
<organization> | The target organization for your deployment
<space> | The target space in the organization for your deployment

2. Create a service instance.

Note: Cloud Foundry CLI syntax can differ between Windows and Linux operating systems. See the Cloud Foundry help for the appropriate syntax for your operating system. For example, to see help for the create service command, run cf cs.

For example:

cf create-service predix-columnar-store <plan-name> <your-instance-name> -c '{"cluster_name":"<your-cluster-name>", "nodes":<number-of-nodes>, "instance_size":"<your-instance-size>", "multi_dc":<true|false>, "solr":<true|false>}'

Option | Description
<plan-name> | The plan type you selected for the new service instance.
<your-instance-name> | The name you want to use for the new service instance.
<your-cluster-name> | The name you want to use for your Cassandra cluster.
<number-of-nodes> | The number of nodes you want to create in your cluster. You can increase the number of nodes later on, but you cannot remove nodes after initial cluster creation. Three nodes is the minimum number required for cluster creation. 20 nodes per data center is the maximum number allowed.
<your-instance-size> | The size of the service instance. Defaults to Small ("S") if no value is provided. Refer to the Predix Catalog Columnar Store service pricing options for service sizing options. Note: The instance size cannot be changed after the service instance is created.
solr | Defaults to false if no value is provided. Set to true to enable Cassandra search functionality, or false to disable search.
multi_dc | Defaults to false if no value is provided. Set to true to enable cross-data center functionality on your cluster. When you enable this option, the specified number of nodes is doubled. For example, if you specify "nodes":3 and "multi_dc":true, three nodes are created in each of two separate data centers, for a total of six nodes in the cluster.

The following example shows how to create a service instance with the Standard plan, three nodes, small instance size, multi-DC, and Solr.

cf create-service predix-columnar-store Standard columnar-test-instance -c '{"cluster_name":"test-cluster", "nodes":3, "instance_size":"S", "solr":true, "multi_dc":true}'

3. After you run the creation request, provisioning takes several minutes to complete. To check the status, run the following command:

cf service <your-instance-name>

This command returns values of Completed, In Progress, or Failed. If provisioning fails, try again or contact Predix Customer Support.

After the creation request is completed, one of the following status messages is returned.

Success

Creating service instance <your-instance-name> in org <your-org> / space <your-space> as <your-username>…
OK
Create in progress. Use 'cf services' or 'cf service <your-instance-name>' to check operation status

Failure

Server error, status code: <status code>, error code: <error code>, message: <error message>

4. After you successfully create a service instance, you must bind the service to an application before you can use Columnar Store.

Important: Be aware that you can access your Columnar Store service instance only from the applications that you bind to it. You cannot access your service instance from a local machine or from other apps that run elsewhere.
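The configuration rules from step 2 (at least three nodes, at most 20 nodes per data center, and node doubling with multi_dc) can be checked before the request is sent. The following Python sketch builds the JSON string for the -c option; the function name build_create_config is hypothetical, and the limits are taken only from the option descriptions above.

```python
import json
from typing import Tuple


def build_create_config(
    cluster_name: str,
    nodes: int,
    instance_size: str = "S",
    multi_dc: bool = False,
    solr: bool = False,
) -> Tuple[str, int]:
    """Return the JSON for the -c option and the total node count it implies."""
    if nodes < 3:
        raise ValueError("a cluster needs at least three nodes")
    if nodes > 20:
        raise ValueError("at most 20 nodes are allowed per data center")
    config = {
        "cluster_name": cluster_name,
        "nodes": nodes,
        "instance_size": instance_size,
        "multi_dc": multi_dc,
        "solr": solr,
    }
    # With multi_dc enabled, the requested nodes are created in each of two data centers.
    total_nodes = nodes * 2 if multi_dc else nodes
    return json.dumps(config), total_nodes
```

Generating the string with json.dumps also avoids hand-typed quoting mistakes. The result can be passed as a single argument after -c in the cf create-service command.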

Binding a Service to an Application

After you create a data model and service instance, you must bind your Columnar Store service instance to an application before you can create a keyspace.

About This Task

When you bind an application to your Predix Columnar Store instance, its connection details are provisioned in the VCAP_SERVICES environment variable. The Cloud Foundry runtime uses the VCAP_SERVICES environment variable to communicate with a deployed application about its environment.

Important: Be aware that you can access your Columnar Store service instance only from the applications that you bind to it. You cannot access your service instance from a local machine or from other apps that run elsewhere.

Procedure

1. Bind your application to your new Predix Columnar Store service instance.

cf bind-service <your-app-name> <your-instance-name>

2. Restage your application to ensure the environment variable changes take effect.

cf restage <your-app-name>

3. To retrieve the environment variables for your application and obtain the credentials to access your cluster, do one of the following.

• Run the following command:

cf env <your-app-name>

• Create a service key and use it to retrieve the environment variables for your application:

cf create-service-key my-new-instance my-new-instance-key
cf service-key my-new-instance my-new-instance-key

The following status message appears during environment variables retrieval:

Getting key my-new-instance-key for service instance my-new-instance as predix-data-services-deployer...

The result lists the VCAP_SERVICES environment variables, which contain the connection details for your cluster.

{
  "cqlsh_url": "https://cassini-admin-portal.run.aws-usw02-my.new.instance.predix.io/",
  "datacenters": [
    {
      "dc_name": "newinst-dc",
      "nodes": [
        "192.0.0.0",
        "192.0.0.0",
        "192.0.0.0",
        "192.0.0.0",
        "192.0.0.0"
      ]
    }
  ],
  "db_name": "mynewdb",
  "opscenter_info": [
    {"opscenter_url": "https://dataservices-opscenter-54509.run.aws-usw02-my.new.instance.predix.io"},
    {"password": "", "user": "Riverview"},
    {"password": "", "user": "deserver"},
    {"password": "", "user": "administer"}
  ],
  "password": "",
  "performance": "https://cassandra-stress-cf.run.aws-usw02-my.new.instance.predix.io/rest/",
  "port": 9042,
  "type": "SUPERUSER",
  "cluster_name": "Cassandra"
}

About Keyspaces

You create a keyspace and define a data replication strategy by executing a command from within a bound application.

When you use the CREATE KEYSPACE command, a top-level namespace is created. In the command, you set the keyspace name, replica placement strategy class, replication factor, and DURABLE_WRITES options for the keyspace. For more information about the replica placement strategy, see Data Replication Strategies. For details on how to create a keyspace, see the DataStax documentation Create Keyspace.

When you configure NetworkTopologyStrategy as the replication strategy, you set up one or more virtual data centers. Alternatively, you use the default data center.

You assign different nodes, depending on the type of workload, to separate data centers. For example, you can assign Hadoop nodes to one data center and Cassandra real-time nodes to another. By segregating workloads, you can ensure that only one type of workload is active per data center. Segregation prevents compatibility issues between workloads, such as inconsistent batch requirements that affect performance.

You use a map of properties and values to define a keyspace, for example:

{ 'class' : 'NetworkTopologyStrategy'[, '<data center>' : <integer>, '<data center>' : <integer>] . . . };


Property | Value | Description
'class' | 'NetworkTopologyStrategy' | Required. The name of the replica placement strategy class for the new keyspace.
'replication_factor' | 'number of replicas' | Not used. The number of replicas of data on multiple nodes.
'first data center' | 'number of replicas' | Required if class is NetworkTopologyStrategy and you provide the name of the first data center. This value is the number of replicas of data stored in the first data center.
'next data center' | 'number of replicas' | Required if class is NetworkTopologyStrategy and you provide the name of the second data center. This value is the number of replicas of data stored in that data center.
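As an illustration of how the properties in this table fit together, the following Python sketch assembles a CREATE KEYSPACE statement from a mapping of data center names to replica counts. The function name create_keyspace_cql and the keyspace name telemetry are hypothetical; the data center name newinst-dc is borrowed from the sample credentials shown earlier.

```python
from typing import Mapping


def create_keyspace_cql(
    keyspace: str,
    replicas_per_dc: Mapping[str, int],
    durable_writes: bool = True,
) -> str:
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy."""
    if not replicas_per_dc:
        raise ValueError("specify at least one data center")
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {count}" for dc, count in replicas_per_dc.items()]
    replication = "{" + ", ".join(options) + "}"
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} "
        f"WITH replication = {replication} "
        f"AND durable_writes = {str(durable_writes).lower()};"
    )
```

For example, create_keyspace_cql("telemetry", {"newinst-dc": 3}) yields a statement that keeps three replicas in the newinst-dc data center. The sketch does no escaping, so it is suitable only for trusted, fixed names.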


Using Predix Columnar Store

Using the Query UI

You can find the URL for the Query UI in the Cloud Foundry environment variables.

About This Task

The VCAP_SERVICES environment variables are applied to applications that you bind to your service instance. You can retrieve the variables from the command line, or create a Cloud Foundry service key that you can copy and paste to use with Cassandra or OpsCenter. For details on how to create a service key, see the Cloud Foundry documentation, Create a Service Key.

Procedure

1. Retrieve your application credentials.

cf env <your-app-name>

2. In the VCAP_SERVICES section, locate the following information:

• IP address values in the nodes array
• username and password fields in the credentials block
• port field

3. In the Query UI, specify the values you located in the previous step.

• Host: IP address value for one of your nodes
• Port: port value from VCAP_SERVICES
• Username: username value from VCAP_SERVICES
• Password: password value from VCAP_SERVICES
• Query: CQL query request, for example select * from system_schema.keyspaces
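An application can locate the same host and port values programmatically. The Python sketch below parses a VCAP_SERVICES document and returns the node addresses and port. It assumes the standard Cloud Foundry layout, in which each service label maps to a list of bindings with a credentials block, and it assumes that the label is predix-columnar-store (the name used with cf create-service). The function name contact_points is hypothetical.

```python
import json
from typing import List, Tuple

SERVICE_LABEL = "predix-columnar-store"  # assumed label, taken from the create-service command


def contact_points(vcap_services: str, label: str = SERVICE_LABEL) -> Tuple[List[str], int]:
    """Return (node IP addresses, CQL port) from a VCAP_SERVICES JSON string."""
    services = json.loads(vcap_services)
    credentials = services[label][0]["credentials"]  # first bound instance
    hosts = [ip for dc in credentials["datacenters"] for ip in dc["nodes"]]
    return hosts, credentials["port"]
```

In a deployed application, the input would typically come from os.environ["VCAP_SERVICES"]. Any of the returned addresses can be used as the Host value, together with the returned port.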

About Cluster Management

After you create a Columnar Store service instance and set up data centers and clusters, you can perform a few ongoing cluster management tasks.

Ongoing maintenance tasks for Columnar Store include the following:

• To increase the number of nodes in a cluster, you can update your service instance.
• To use Columnar Store with a different application or prepare to delete your instance, you can unbind the service.
• To remove your Columnar Store instance, you can delete the service.

Updating the Service

After you create your Predix Columnar Store service instance, you can update the service to increase the number of nodes in the cluster.

About This Task

The Enterprise Multi-DC plan has nodes split between two data centers, so when you update the service, you must specify the data center to which you want to add nodes. The names of the data centers are the "dc1_name" and "dc2_name" values in the binding credentials stored in the VCAP_SERVICES environment variables.

Note: This feature is currently available only in select environments. Contact Support to determine if this feature is enabled in your environment.

Procedure

1. To add nodes to a cluster, run:

cf update-service <your-instance-name> -c '{"add_nodes":<number-of-nodes-to-add>}'

2. To add nodes to a single data center of a cluster with Multi-DC enabled, run:

cf update-service <your-instance-name> -c '{"add_nodes":<number-of-nodes-to-add>, "dc_name":"<your-dc-name>"}'
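The JSON passed to -c must use straight double quotes and be wrapped in single quotes for the shell, which is easy to get wrong when typing by hand. As a minimal sketch, the following Python helper builds the argument list with json.dumps so the quoting is always valid. The helper name and the data center name dc1 are hypothetical. The add_nodes and dc_name parameters are the ones shown in the procedure above.

```python
import json
import shlex


def build_update_command(instance_name, nodes_to_add, dc_name=None):
    """Build the argument list for cf update-service.

    dc_name is only needed for a cluster with Multi-DC enabled.
    """
    if nodes_to_add < 1:
        raise ValueError("nodes_to_add must be at least 1")
    params = {"add_nodes": nodes_to_add}
    if dc_name is not None:
        params["dc_name"] = dc_name
    # json.dumps always emits straight double quotes, never smart quotes.
    return ["cf", "update-service", instance_name, "-c", json.dumps(params)]


# Print shell-ready command lines; shlex.join adds the single quotes.
print(shlex.join(build_update_command("my-new-instance", 2)))
print(shlex.join(build_update_command("my-new-instance", 1, dc_name="dc1")))
```

To run the command directly instead of printing it, the returned list can be passed to subprocess.run, which avoids shell quoting altogether.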

Unbinding the Service From an Application

Before you can delete your Columnar Store service instance, or change the application to which it is bound, you must unbind the service.

Procedure

• To unbind a Predix Columnar Store instance from an application, run:

cf unbind-service <your-app-name> <your-instance-name>

Deleting the Service

After you unbind your Predix Columnar Store instance from all bound applications, you can delete the instance.

About This Task

Deletion is asynchronous: the command can return before the instance is fully removed.

Procedure

• To delete your Predix Columnar Store instance, run:

cf delete-service <your-instance-name>

About Service Maintenance

To maintain your Columnar Store service instance, you can perform a few ongoing service maintenance tasks.

Ongoing service maintenance tasks include:

• Repairs
• Backups
• Recovery
• Viewing logs

Repairing an Instance

About This Task

Columnar Store instances may need repair in certain scenarios. For example, keyspace issues can arise during replication if data fails to fully propagate to all nodes on write. Another example is when nodes are unresponsive for long periods of time, which causes data inconsistencies across a cluster or data center. In cases like these, you can make repairs to restore data consistency across all replicas.

You can enable the DataStax OpsCenter Repair Service to schedule repairs that run automatically in the background. For details, see DataStax documentation, Repair Service.

Additionally, you can enable the Repair Service from the command line by using the cf update-service command. By default, the service is disabled at cluster creation.

Procedure

1. To enable the Repair Service from the command line, run:

cf update-service <your-instance-name> -c '{"repair":"start"}'

2. To disable the Repair Service from the command line, run:

cf update-service <your-instance-name> -c '{"repair":"stop"}'

Configuring Read Repair

The read repair feature is enabled by default, where a key is repaired when read. However, you may experience an additional load if the application is performing many read actions. You can disable or configure the read repair setting to mitigate performance impact.

The per-table parameter read_repair_chance defines the probability that a repair will be performed when reading. The read repair is defined at the column family level and is set to 0.1 by default. This means that 10% of the read requests will generate a request against all the replicas to compare the stored values and make them consistent, if needed.

To change the read repair value for a specific column family, run the following cqlsh command:

>alter table <column_family_name> with read_repair_chance=<value>;

Where <value>:

• 0 (zero) disables read repair
• 0.1 is the default
• 1 triggers read repair on every read request
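To get a feel for the load that a given setting implies, the following Python sketch estimates how many reads trigger a repair and formats the matching cqlsh statement. It is a back-of-the-envelope illustration, not a measurement. Both function names and the table name my_keyspace.sensor_data are hypothetical.

```python
def expected_read_repairs(read_requests, read_repair_chance):
    """Estimate how many of the given reads trigger a read repair."""
    if not 0.0 <= read_repair_chance <= 1.0:
        raise ValueError("read_repair_chance must be between 0 and 1")
    return round(read_requests * read_repair_chance)


def alter_read_repair_statement(table_name, read_repair_chance):
    """Format the ALTER TABLE statement for the chosen value."""
    if not 0.0 <= read_repair_chance <= 1.0:
        raise ValueError("read_repair_chance must be between 0 and 1")
    return (
        f"ALTER TABLE {table_name} "
        f"WITH read_repair_chance = {read_repair_chance};"
    )


# With the default of 0.1, about 1 in 10 reads contacts all replicas.
print(expected_read_repairs(10_000, 0.1))
# Hypothetical table name, for illustration only.
print(alter_read_repair_statement("my_keyspace.sensor_data", 0.05))
```

Validating the range before formatting the statement catches values such as 10, which is an easy mistake when 10% is intended.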


Using Anti-Entropy Repair

This type of repair is initiated by running the command nodetool repair and should be used in the following situations.

• Regularly as routine maintenance. The period in which all the nodes are repaired should be shorter than the per-table configuration parameter gc_grace_seconds, which is 864000 seconds (10 days) by default.
• To recover a node when it is brought back to the cluster after a failure or after being down for a long time.
• To recover missing or corrupted data.

During the repair process, nodes build Merkle trees, exchange them, and compare them. If any inconsistencies are found, data is streamed to any nodes with missing data. In this case, significant additional load may be generated. Typically, more data than necessary is streamed due to a design limitation.

Using Compaction

Compaction is essential to maintain the health of the Cassandra database. The compaction feature is part of the Cassandra write path.

Compactions are executed automatically based on the strategy used and the configuration parameters. Typically, SizeTieredCompactionStrategy will be used by default.

To learn more about how to choose the right compaction strategy and use cases, see DataStax documentation, Choosing a compaction strategy.

To learn more about how to configure compaction, see DataStax documentation, Configuring Compaction.

Compaction activity can be monitored from the DataStax OpsCenter under Running Tasks. Compactions cause additional load by using CPU and memory and by generating I/O operations. You should monitor compaction activity because a large number of pending compactions might indicate a platform issue that needs attention.

Note: It is normal to see a large increase in compaction activity as a consequence of repairs or a node rebuild.

Backing Up Data

A Cassandra database backup is a snapshot of all SSTables in a data directory.

Backups can be stored on every node (or server), or you can specify another location such as Amazon S3 or a local file system to copy the snapshots.

Recurring backups can be scheduled from DataStax OpsCenter. For more information, see DataStax documentation, Creating a recurring scheduled backup.

Restoring Data From a Backup

A data restore can be performed if data is lost or corrupted.

Data restore can be done from the DataStax OpsCenter. For more information, see DataStax documentation, Restoring from a backup.


Viewing Logs

To view logs, you can download a diagnostics .tar file, or use the OpsCenter UI.

About This Task

For instructions on how to download a diagnostics .tar archive of the latest system.log file from all nodes in a cluster, see DataStax documentation, Downloading Diagnostic Data. To use OpsCenter, perform the following steps.

Procedure

1. In OpsCenter, in the left pane, click one of the following:

• Cluster > Nodes > List View
• Cluster > Nodes > Ring View

2. Click a node to view its details.

3. In the Node Details dialog box, in the Recent Log Information pane, select one of the following options:

• Cassandra System Log
• Cassandra Debug Log
• OpsCenter Agent Log

4. Click Refresh.

The last 1000 lines of the log you selected appear in the Recent Log Information pane.

Figure 2: Recent Log Information Pane

About Service Monitoring

At this time, you can monitor your Columnar Store service instance by using the DataStax OpsCenter UI.

OpsCenter is the web-based visual management and monitoring solution for DataStax Enterprise (DSE). OpsCenter has integrated functionality for real-time monitoring, tuning, provisioning, backup, and security management. OpsCenter provides everything you need to intelligently manage and monitor your mission-critical systems at epic scale. For more information, see https://www.datastax.com/products/datastax-opscenter

OpsCenter key features include:

Dashboard monitoring

Overview that displays any alerts and condenses multiple clusters into a single dashboard. Clearly displays an overview of performance metrics, provides the ability to add and edit graphs, monitors capabilities of DSE In-Memory tables, and allows you to view the Spark console.

Configuration, security, and administration

• Basic cluster configuration
• Security options
• Administration tasks
• Multiple cluster management
• Automatic failover from the primary OpsCenter to a backup instance
• Rebalance data across a cluster
• Manage multiple nodes simultaneously
• Download a cluster report
• Generate a diagnostics tarball for support

Alerts

Alert warnings of events and issues with built-in external notification capabilities and customization.

Metrics

Metrics are collected every 60 seconds and stored in a keyspace created by OpsCenter. View historical records from more than a week prior.

DataStax Enterprise Management Services

• Backup service: Allows automatic/manual backup and restoration in DSE clusters
• Repair service: Continuously runs and performs repair operations across a DataStax Enterprise cluster
• Capacity service: Plan for future capacity and understand current cluster performance trends
• Best practice service
• Performance service

Lifecycle Manager

Centrally manage cluster, datacenter, and node configuration.

For more information, see https://docs.datastax.com/en/latest-opscenter/opsc/features_c.html

Using DataStax OpsCenter

After you create an application and bind it to your Columnar Store service instance, you can use OpsCenter to manage and monitor your cluster.

About This Task

Before you can use OpsCenter, you must create an application and bind it to your Columnar Store service instance.

Procedure

1. Bind your application to your Columnar Store service instance.

2. To retrieve application information, do one of the following:

• Run the following command:

cf env <your-app-name>

• Create a service key and use it to retrieve the environment variables for your application:

cf create-service-key my-new-instance my-new-instance-key
cf service-key my-new-instance my-new-instance-key

The following status message appears during environment variables retrieval:

Getting key my-new-instance-key for service instance my-new-instance as predix-data-services-deployer...

3. In VCAP_SERVICES, in the opscenter_info line, use the opscenter_url value as the OpsCenter login URL, and use the credentials in the lines below opscenter_url to log in.

Note: Three sets of credentials are provided (admin, dev, and view), each with different permissions.

Next Steps

For complete OpsCenter usage information, see https://docs.datastax.com/en/latest-opscenter/opsc/online_help/opscUsing_g.html.
