Streaming Analytics Pipeline
AWS Implementation Guide

Chris Rec

December 2016

Copyright (c) 2016 by Amazon.com, Inc. or its affiliates. Streaming Analytics Pipeline is licensed under the terms of the Amazon Software License available at https://aws.amazon.com/asl/


Amazon Web Services – Streaming Analytics Pipeline on the AWS Cloud

Contents

Overview
  Cost
  Architecture Overview
Design Considerations
  Regional Deployments
  Streaming Data Format
  Shard Count
  Multiple External Destinations
AWS CloudFormation Templates
Automated Deployment
  Prerequisites
    Amazon Redshift
    Amazon Elasticsearch Service
  What We’ll Cover
  Step 1. Launch the Stack
  Step 2. Validate and Start the Application
  Step 3. Start Streaming Data
Security
  Security Groups
  IAM Roles
  AWS KMS Encryption
Additional Resources
Appendix A: Modify Destination Parameters
Appendix B: YAML File Configuration
Appendix C: Sample Amazon Kinesis Analytics Applications
  Simple Continuous Filter
  Multiple-Step Application
  Pre-Processing Streams
Appendix D: Collection of Anonymous Data
Send Us Feedback
Document Revisions

About This Guide

This implementation guide discusses architectural considerations and configuration steps for

deploying the Streaming Analytics Pipeline on the Amazon Web Services (AWS) Cloud. It

includes links to AWS CloudFormation templates that launch, configure, and run the AWS

compute, network, storage, and other services required to deploy this solution on AWS, using

AWS best practices for security and availability.

The guide is intended for IT infrastructure architects, administrators, and DevOps

professionals who have practical experience architecting on the AWS Cloud, and are familiar

with streaming data and analytics.

Overview

Many Amazon Web Services (AWS) customers use streaming data to gain real-time insight

into customer activity and immediate business trends. Streaming data is generated

continuously from thousands of data sources, and this information can help companies

make well-informed decisions and proactively respond to changing conditions.

Amazon Kinesis, a platform for streaming data on AWS, offers powerful services that make

it easier to build data processing applications, load massive volumes of data from hundreds

of thousands of sources, and analyze streaming data in real time.

Amazon Kinesis services include Amazon Kinesis Streams, which enables you to build custom applications that process or analyze streaming data; Amazon Kinesis Firehose, which captures and automatically loads streaming data into Amazon Simple Storage Service (Amazon S3) and Amazon Redshift for near real-time analytics with your existing business intelligence tools; and Amazon Kinesis Analytics, which processes streaming data in real time with standard SQL.

To help customers more easily configure a streaming data architecture, AWS offers the

Streaming Analytics Pipeline. This solution automatically provisions and configures the

AWS services necessary to start consuming and analyzing streaming data in minutes. This


solution uses Amazon Kinesis Streams to load streaming data, Amazon Kinesis Analytics to

filter and process that data, and Amazon Kinesis Firehose to deliver the data to various data

stores for search, storage, or further analytics.

Cost

You are responsible for the cost of the AWS services used while running the Streaming

Analytics Pipeline. The total cost of this solution depends on the amount of data you stream

through the Streaming Analytics Pipeline. As of the date of publication, the cost of running

this solution with the default settings in the US East (N. Virginia) Region is approximately

$1.38 per hour.1 Prices are subject to change. For full details, see the pricing webpage for

each AWS service you will be using in this solution.

We recommend adjusting your AWS Lambda and Amazon Kinesis Firehose batch

configurations as your record count and data size increase to manage costs.

Architecture Overview

Deploying this solution with the default parameters builds the following environment in

the AWS Cloud.

Figure 1: Streaming Analytics Pipeline default architecture on AWS

By default, the AWS CloudFormation template creates a new Amazon Kinesis stream with

two shards, an Amazon Kinesis Firehose delivery stream that encrypts data with AWS Key

Management Service, an Amazon Simple Storage Service (Amazon S3) bucket to store raw

and analyzed data, and an AWS Identity and Access Management (IAM) role with least-privilege access permissions. The template also launches an AWS Lambda custom resource that creates an Amazon Kinesis Analytics application based on settings you specify in a YAML configuration file. For more information, see Appendix B. The application consumes records from the source Amazon Kinesis stream and puts records into the Amazon Kinesis Firehose delivery stream.

1 The cost estimate assumes the solution will stream 1,000 records per second with an average size of three kilobytes per record, and that the external destination is an Amazon Simple Storage Service (Amazon S3) bucket.

Note: If you do not specify a YAML configuration file, the Amazon Kinesis Analytics application will require further modification through the AWS Management Console and/or the service API to efficiently analyze your data.

If you choose to persist raw data, an AWS Lambda function is deployed. The Lambda

function gets raw records from the source Amazon Kinesis stream, decodes the Base64-

encoded data, batches the records, and puts them into another Amazon Kinesis Firehose

delivery stream for delivery to Amazon S3.
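A minimal sketch of that decode step, assuming the standard Kinesis event shape that Lambda receives (the Firehose delivery stream name in the comment is a hypothetical placeholder, not the name the solution actually creates):

```python
import base64

def decode_raw_records(event: dict) -> list:
    """Extract payloads from a Kinesis event and reverse the Base64 encoding.

    Sketch only: mirrors the decode step described above; the shipped
    Lambda function may differ in structure.
    """
    return [base64.b64decode(r["kinesis"]["data"])
            for r in event.get("Records", [])]

# Inside a handler, the decoded payloads would then be batched to Firehose,
# for example (delivery stream name is hypothetical):
#   import boto3
#   firehose = boto3.client("firehose")
#   firehose.put_record_batch(
#       DeliveryStreamName="raw-delivery-stream",
#       Records=[{"Data": d + b"\n"} for d in decode_raw_records(event)],
#   )
```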

The Streaming Analytics Pipeline can be customized to fit your needs. When you deploy the

solution, you can specify an existing Amazon Kinesis stream, a configuration for an Amazon

Kinesis Analytics application, whether or not to encrypt the data, and whether or not to

persist raw data from your source Amazon Kinesis stream to Amazon S3. You can also

choose from four destinations for your analyzed data: an Amazon S3 bucket (default), a pre-

configured Amazon Redshift cluster, a pre-configured Amazon Elasticsearch Service

domain, or an existing Amazon Kinesis stream.

Figure 2: Streaming Analytics Pipeline architecture on AWS


Design Considerations

Regional Deployments

The Streaming Analytics Pipeline uses AWS Lambda and Amazon Kinesis Analytics.

Therefore, you must deploy this solution in an AWS Region that supports both Lambda and

Amazon Kinesis Analytics. As of the date of publication, this includes the US East (N.

Virginia) Region, the US West (Oregon) Region, and the EU (Ireland) Region.

Streaming Data Format

Amazon Kinesis Analytics allows you to specify a schema to classify your streaming data

before it executes SQL queries against your input Amazon Kinesis stream. If you specify a

strict schema for all records, the analysis could fail if some records do not match the

expected format specified in the schema. For this solution, consider applying a flexible

schema to your streaming data to ensure all data is collected. Then, refine the schema using

standard SQL.

Shard Count

The number of shards you need for a new Amazon Kinesis stream depends on the amount

of streaming data you plan to produce. Each shard can support up to 1,000 records per

second for writes, up to a maximum total data write rate of 1 MB per second (including

partition keys). For example, an application that produces 100 records per second with a

size of 35 kilobytes per record for a total data input rate of 3.4 megabytes per second needs

4 shards.

The Streaming Analytics Pipeline AWS Lambda function processes data at a default rate of 1,000 records per second, but you can adjust the timeout and batch size to accommodate faster processing and delivery of raw data.

While there is no upper limit to the number of shards in a stream or account, each region

has a default shard limit. For information on shard limits, please visit Amazon Kinesis

Streams Limits. To request an increase in your shard limit, please use the Stream Limits

form.
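The shard arithmetic above can be written out explicitly. This sketch applies the two per-shard write limits quoted in this section (1,000 records per second and 1 MB per second):

```python
import math

def estimate_shards(records_per_sec: float, avg_record_kb: float) -> int:
    """Estimate the shard count for write-side capacity, using the
    per-shard limits of 1,000 records/s and 1 MB/s (including partition keys)."""
    shards_for_throughput = (records_per_sec * avg_record_kb) / 1024  # MB/s
    shards_for_records = records_per_sec / 1000
    return max(1, math.ceil(max(shards_for_throughput, shards_for_records)))

print(estimate_shards(100, 35))  # the guide's example: ~3.4 MB/s -> 4 shards
```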

Multiple External Destinations

Amazon Kinesis Analytics allows users to specify up to three external destinations for

analyzed data. By default, the Streaming Analytics Pipeline allows users to specify a single

external destination for their analyzed data. For customers who want to send analyzed data

to multiple external destinations, this solution includes a template (add-output) to allow


you to specify multiple destinations without the use of the AWS Command Line Interface or

a custom script.
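For reference, adding an output programmatically would go through the Kinesis Analytics AddApplicationOutput API. The sketch below only builds the request's Output structure; the application name, ARNs, and in-application stream name in the commented usage are hypothetical placeholders:

```python
def firehose_output(in_app_stream: str, firehose_arn: str, role_arn: str,
                    record_format: str = "CSV") -> dict:
    """Build the Output structure for kinesisanalytics.add_application_output.

    Sketch only; see the Kinesis Analytics API reference for the full shape.
    """
    return {
        "Name": in_app_stream,  # in-application stream feeding this output
        "KinesisFirehoseOutput": {"ResourceARN": firehose_arn,
                                  "RoleARN": role_arn},
        "DestinationSchema": {"RecordFormatType": record_format},
    }

# Hypothetical usage (names and ARNs are placeholders):
#   import boto3
#   kda = boto3.client("kinesisanalytics")
#   app = kda.describe_application(ApplicationName="my-analytics-app")
#   kda.add_application_output(
#       ApplicationName="my-analytics-app",
#       CurrentApplicationVersionId=app["ApplicationDetail"]["ApplicationVersionId"],
#       Output=firehose_output("DESTINATION_SQL_STREAM",
#                              "arn:aws:firehose:us-east-1:111122223333:deliverystream/second-output",
#                              "arn:aws:iam::111122223333:role/analytics-output-role"),
#   )
```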

AWS CloudFormation Templates

This solution uses AWS CloudFormation to automate the deployment of the Streaming

Analytics Pipeline on the AWS Cloud. It includes the following AWS CloudFormation

templates, which you can download before deployment:

streaming-analytics-pipeline.template: Use this template to

launch the Streaming Analytics Pipeline and all associated

components. The default configuration deploys an Amazon Kinesis stream, an AWS

Lambda function (optional), an Amazon Kinesis Firehose delivery stream, an Amazon

Simple Storage Service (Amazon S3) bucket, and an AWS Key Management Service

encryption key, but you can also customize the template based on your specific needs.

add-output.template: Use this template to specify more than one

external destination for the Streaming Analytics Pipeline.

Note: To cleanly delete this solution’s stack, you must delete the add-output stack before you delete the streaming-analytics-pipeline stack.

Automated Deployment

Before you launch the automated deployment, please review the architecture, configuration,

and other considerations discussed in this guide. Follow the step-by-step instructions in this

section to configure and deploy a Streaming Analytics Pipeline into your account.

Time to deploy: Approximately five (5) minutes

Prerequisites

If you choose Amazon Redshift or Amazon Elasticsearch Service as the destination for your

analyzed data, you must configure them to work with the Streaming Analytics Pipeline

solution.

Amazon Redshift

To configure Amazon Redshift, your Amazon Redshift cluster must have a table that is

configured to accept the data in the format that is output by the Amazon Kinesis Analytics

application. You must also have write permissions to write to that table. If your Amazon

Redshift cluster is located in an Amazon Virtual Private Cloud, the cluster must be publicly


accessible with a public IP address. The cluster’s Amazon Elastic Compute Cloud (Amazon

EC2) security group should allow access from the AWS Region’s Amazon Kinesis Firehose

IP addresses:

US East (N. Virginia) Region: 52.70.63.192/27

US West (Oregon) Region: 52.89.255.224/27

EU (Ireland) Region: 52.19.239.192/27
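The CIDR allowances above can also be applied programmatically. This is a hedged sketch using boto3's EC2 client; the region codes in the mapping and the security group ID in the commented usage are assumptions not given in this guide:

```python
# Firehose source CIDR per region, from the list above (region codes assumed):
FIREHOSE_CIDR = {
    "us-east-1": "52.70.63.192/27",
    "us-west-2": "52.89.255.224/27",
    "eu-west-1": "52.19.239.192/27",
}

def redshift_ingress_rule(region: str, port: int = 5439) -> dict:
    """Build the IpPermissions entry that opens the cluster's port to
    Firehose; pass it to ec2.authorize_security_group_ingress."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": FIREHOSE_CIDR[region],
                      "Description": "Amazon Kinesis Firehose"}],
    }

# Hypothetical usage (the security group ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   ec2.authorize_security_group_ingress(
#       GroupId="sg-0123456789abcdef0",
#       IpPermissions=[redshift_ingress_rule("us-east-1")],
#   )
```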

Amazon Elasticsearch Service

To configure Amazon Elasticsearch Service, your Amazon Elasticsearch Service domain

should have an existing index and type to which data can be assigned. We also recommend

you create and map your fields to the appropriate data type before you start the Amazon

Kinesis Analytics application to ensure that the solution assigns your data to the right type.

If you do not map the data types before you deploy the Streaming Analytics Pipeline, the solution will create data types for you, but these may not be the types you want.

What We’ll Cover

The procedure for deploying this architecture on AWS consists of the following steps. For

detailed instructions, follow the links for each step.

Step 1. Launch the stack

Launch the AWS CloudFormation template into your AWS account.

Enter values for required parameters.

Review the other template parameters, and adjust if necessary.

Step 2. Validate and Start the Application

Verify that the schema and application code are correct.

Start the application.

Step 3. Start Streaming Data

Start streaming data to the source Amazon Kinesis stream.

View results in your external destination.

Step 1. Launch the Stack

This automated AWS CloudFormation template deploys the Streaming Analytics Pipeline on

the AWS Cloud. Please make sure that you’ve configured your Amazon Redshift cluster or

Amazon Elasticsearch Service domain before launching the stack, if you chose one of those

as your destination.


Note: You are responsible for the cost of the AWS services used while running this solution. See the Cost section for more details. For full details, see the pricing webpage for each AWS service you will be using in this solution.

1. Log in to the AWS Management Console and choose Launch Solution to launch the streaming-analytics-pipeline AWS CloudFormation template.

You can also download the template as a starting point for your own implementation.

2. The template is launched in the US East (N. Virginia) Region by default. To launch this

solution in a different AWS Region, use the region selector in the console navigation bar.

Note: This solution uses AWS Lambda and Amazon Kinesis Analytics, which are currently available in the US East (N. Virginia) Region, the US West (Oregon) Region, and the EU (Ireland) Region. Therefore, you must launch this solution in one of those regions.2

3. On the Select Template page, verify that you selected the correct template and choose

Next.

4. On the Specify Details page, assign a name to your Streaming Analytics Pipeline

solution stack.

5. Under Parameters, review the parameters for the template and modify them as necessary. This solution uses the following default values.

New or Existing Stream (default: New Kinesis Stream)
The source Amazon Kinesis stream. Create a new stream or choose an existing stream.

New Stream Shard Count (requires input)
The number of shards to allot to your new stream.
Note: If you use an existing stream, leave this parameter blank.

Existing Stream Name (requires input)
The name of an existing stream in the same AWS Region where you launch the solution.
Note: If you use a new stream, leave this parameter blank.

External Destination (default: Amazon S3)
The destination for your analyzed data. Select Amazon S3 (default), Amazon Redshift, Amazon Elasticsearch Service, or Amazon Kinesis stream.
Note: If you choose Amazon Redshift, Amazon Elasticsearch Service, or Amazon Kinesis stream, you must configure the destination. See Appendix A for steps to configure the destination.

Configuration File Location (requires input)
The Amazon S3 bucket and key where the completed YAML configuration file is stored. For example, <bucket-name>/<key>. For information about the YAML file configuration, see Appendix B.

Encrypt Data at Rest? (default: Yes)
Specify whether or not the solution will create an AWS KMS encryption key and encrypt raw and analyzed data in Amazon S3.

Persist Raw Source Data? (default: Yes)
Specify whether or not the solution will persist raw streaming data from your source Amazon Kinesis stream to Amazon S3.

Destination Prefix (default: AggregateData)
The prefix name that will be created in the Amazon S3 bucket.
Note: Use this parameter only if you choose the default option (Amazon S3) as your destination.

Buffer Interval (default: 300)
The number of seconds (60-900) that Amazon Kinesis Firehose should buffer data before loading it to Amazon S3.

Buffer Size (default: 5)
The size of data in MB (1-128) that Amazon Kinesis Firehose should buffer before loading it to Amazon S3.

Send Anonymous Usage Data (default: Yes)
Send anonymous data to AWS to help us understand usage across our customer base as a whole. To opt out of this feature, choose No. For more information, see Appendix D.

2 For the most current Lambda and Amazon Kinesis Analytics availability by region, see the AWS service offerings by region.

6. Verify that you modified the correct parameters for your chosen destination.

7. Choose Next.

8. On the Options page, choose Next.

9. On the Review page, review and confirm the settings. Be sure to check the box acknowledging that the template will create IAM resources.

10. Choose Create to deploy the stack.

Page 11: Streaming Analytics Pipeline - s3. · PDF fileArchitecture Overview ... Amazon Elasticsearch Service ... The Streaming Analytics Pipeline AWS Lambda function processes data at a default

Amazon Web Services – Streaming Analytics Pipeline on the AWS Cloud December 2016

Page 11 of 20

You can view the status of the stack in the AWS CloudFormation Console in the Status

column. You should see a status of CREATE_COMPLETE in roughly five (5) minutes.

Step 2. Validate and Start the Application

Once the stack is created, complete the following steps.

1. Navigate to the stack Outputs tab.

2. Note the name of the Amazon Kinesis Analytics application.

3. Navigate to the Amazon Kinesis Analytics console.

4. Select the name of your Analytics application and choose Application Details.

Figure 3: Example Amazon Kinesis Analytics application details

5. To view your data schema, select the pencil icon next to the source Amazon Kinesis

stream, and scroll to the bottom of the page.

6. Under Real-time analytics, choose Go to SQL editor.

7. When asked if you want to start the application, select No, I’ll do this later.

8. Review your SQL code and edit as necessary. Then, choose Save and run SQL.

Your Amazon Kinesis Analytics application will change to a Starting state. Your application will start after 30-90 seconds.

Step 3. Start Streaming Data

Once you start your Amazon Kinesis Analytics application, configure your streaming data producers to send streaming records to your source Amazon Kinesis stream. For more information on how to configure streaming data producers, please visit Writing Data to Amazon Kinesis Streams.
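To sketch what a producer's write looks like: the stream name and record fields below are hypothetical, the boto3 call is shown commented out as a usage note, and the JSON encoding assumes a JSON input schema in the Analytics application:

```python
import json

def make_record(payload: dict) -> bytes:
    """Serialize one record for PutRecord; Kinesis payloads are raw bytes."""
    return json.dumps(payload).encode("utf-8")

# With boto3 (assumed installed and configured with credentials and a region):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(
#       StreamName="my-source-stream",   # hypothetical source stream name
#       Data=make_record({"sensor_id": 7, "temp_c": 21.5}),
#       PartitionKey="7",                # spreads records across shards
#   )
```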


Note: To test this solution with sample data, you can use the Amazon Kinesis Data Producer. The data producer generates records using random data based on a template you provide.

As data flows through the Amazon Kinesis stream, it is automatically processed by the Analytics application, and Amazon Kinesis Firehose delivers the results to the specified external destination.

Once enough data has been sent through the Streaming Analytics Pipeline, or after the Firehose buffer interval has been reached, analyzed data is sent to the destination. If you have chosen to persist raw streaming data to Amazon Simple Storage Service (Amazon S3), you will also see Base64-decoded record data in the solution’s Amazon S3 bucket under the prefix rawStreamData.

Security

When you build systems on AWS infrastructure, security responsibilities are shared between

you and AWS. This shared model can reduce your operational burden as AWS operates,

manages, and controls the components from the host operating system and virtualization

layer down to the physical security of the facilities in which the services operate. For more

information about security on AWS, visit the AWS Security Center.

Security Groups

The Streaming Analytics Pipeline does not create any security groups. However, we

recommend that you follow best practices for least-privilege access when creating access

rules for associated resources. If you selected an existing Amazon Redshift cluster as your

external destination, and your cluster is in an Amazon VPC with a publicly available IP

address, you must open the Amazon Redshift security group to the Amazon Kinesis Firehose

CIDR block for your AWS Region. For more information, see Prerequisites.

IAM Roles

AWS Identity and Access Management (IAM) roles enable customers to assign granular

access policies and permissions to services and users on the AWS Cloud. Depending on

your configuration, the Streaming Analytics Pipeline creates between two and five IAM

roles. The solution creates the following roles:

- A role with granular access policies for each Amazon Kinesis Firehose delivery stream that the solution creates. The policies allow the Amazon Kinesis Firehose delivery streams to log their events, get a particular AWS Key Management Service encryption key to encrypt data in a specific Amazon S3 prefix, and send streaming events to a specific Amazon S3 prefix (and to Amazon Elasticsearch Service if the customer has selected this as their external destination).

- A role for the new Amazon Kinesis Analytics application. This role grants the application least-privilege permissions to get streaming records from the source Amazon Kinesis stream, put the analyzed results to a specific Firehose delivery stream or Amazon Kinesis stream, and log its events.

- A role for the AWS Lambda custom resource that creates the Amazon Kinesis Analytics application. This role has permission to create, delete, describe, and list applications, log its events, and get details from AWS CloudFormation and Amazon CloudWatch for configuration and to gather metrics.

- A role that allows the AWS Lambda function to get records from the source Amazon Kinesis stream, put batch events to Amazon Kinesis Firehose, and log its events. This role is only created if you choose to persist raw streaming data to Amazon S3.

- A role that allows an Amazon CloudWatch rule to invoke an AWS Lambda function, which collects and sends anonymous metrics. This role is only created if you choose to send anonymous data to AWS.
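To illustrate the least-privilege pattern, a policy attached to the Amazon Kinesis Analytics application role might look roughly like the following. The account ID, region, resource names, and exact action list are illustrative assumptions, not the solution's actual policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadSourceStream",
      "Effect": "Allow",
      "Action": ["kinesis:DescribeStream", "kinesis:GetShardIterator", "kinesis:GetRecords"],
      "Resource": "arn:aws:kinesis:us-east-1:111122223333:stream/my-source-stream"
    },
    {
      "Sid": "WriteAnalyzedResults",
      "Effect": "Allow",
      "Action": ["firehose:DescribeDeliveryStream", "firehose:PutRecord", "firehose:PutRecordBatch"],
      "Resource": "arn:aws:firehose:us-east-1:111122223333:deliverystream/my-delivery-stream"
    }
  ]
}
```

Each statement scopes a small action set to a single named resource, matching the "get from the source stream, put to a specific delivery stream" description above.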

AWS KMS Encryption

This solution allows you to encrypt your data at rest when it reaches the destination. If you

choose to encrypt your data, the solution creates an AWS Key Management Service (AWS

KMS) encryption key, and automatically configures the Amazon Kinesis Firehose delivery

streams to use the key. By default, no services or users will have permission to use or

control the AWS KMS encryption key. To set access policies for the key, set them manually

in the AWS KMS console.

Additional Resources

AWS service documentation

AWS CloudFormation

Amazon Kinesis Analytics developer guide

Amazon Kinesis Streams developer guide

Amazon Kinesis Firehose developer guide

AWS Lambda developer guide

Amazon Redshift developer guide

Amazon Elasticsearch Service developer guide

Amazon CloudWatch user guide


Appendix A: Modify Destination Parameters

If you select Amazon Redshift, Amazon Elasticsearch Service, or an Amazon Kinesis stream as the destination for your analyzed data, you must modify the parameters for your selected destination.

For Amazon Redshift, modify the parameters in the following table.

Master User Name (requires input)
Username of the user with permissions to edit the specified table in the Amazon Redshift cluster.

Master User Password (requires input)
Password of the user with permissions to edit the specified table in the Amazon Redshift cluster.

JDBC URL (requires input)
The JDBC URL of the Amazon Redshift cluster. You can obtain this from the Amazon Redshift console. The URL has the following format: jdbc:redshift://endpoint:port/database

Table Name (requires input)
The name of an existing, preconfigured table in the specified Amazon Redshift cluster, to which the results of the Amazon Kinesis Analytics application will be loaded.

Column Pattern (requires input)
By default, Amazon Kinesis Firehose will copy records to Amazon Redshift in the same order they leave the Amazon Kinesis Analytics application. If you wish to change the order or enter analyzed data into certain columns, provide a comma-separated list of the column names in the desired order. For example, column1, column2, column3, column4.

Buffer Interval (default: 300)
The number of seconds (60-900) that Amazon Kinesis Firehose should buffer data before loading it to Amazon Redshift.

Buffer Size (default: 5)
The size of data in MB (1-128) that Amazon Kinesis Firehose should buffer before loading it to Amazon Redshift.

For Amazon Elasticsearch Service, modify the parameters in the following table.

Parameter Default Description

Domain Name <Requires input> The name of the Amazon Elasticsearch Service domain. You

must deploy the solution in the same AWS Region as the

domain.

Index Name <Requires input> The name of the index for analyzed data

Type Name <Requires input> The name of the type for analyzed data. We recommend that

you create the type before you start the Amazon Kinesis

Analytics application.

Index Rotation NoRotation The frequency at which the specified Amazon Elasticsearch

Service index rotates


Buffer Interval | 300 | The number of seconds (60-900) that Amazon Kinesis Firehose should buffer data before loading it to Amazon Elasticsearch Service.
Buffer Size | 5 | The size of data in MB (1-128) that Amazon Kinesis Firehose should buffer before loading it to Amazon Elasticsearch Service.

For an Amazon Kinesis stream, modify the Destination Stream Name parameter, which specifies the stream that will receive your analyzed data.

Appendix B: YAML File Configuration

The Streaming Analytics Pipeline includes a YAML file that contains configuration information for the Amazon Kinesis Analytics application that the solution creates. Review the parameters in the YAML file and modify them as necessary for your implementation. Then, upload the file to an Amazon S3 bucket.

streaming-analytics-pipeline-config.yaml: Use this file to specify your Amazon Kinesis Analytics application configuration.

Parameter | Default | Description
Input Format Type | CSV | The format of the records of the source stream. Choose CSV or JSON.
Record Column Delimiter | "," | The column delimiter of CSV-formatted data from the source stream. For example, "|" or ",". Note: Leave this parameter blank if you chose JSON as your Input Format Type.
Record Row Delimiter | "\n" | The row delimiter of CSV-formatted data from the source stream. For example, "\n". Note: Leave this parameter blank if you chose JSON as your Input Format Type.
Record Row Path | "$" | The path to the top-level parent that contains the records. Note: Leave this parameter blank if you chose CSV as your Input Format Type.
Output Format Type | CSV | The format of the analyzed data that is put in the output stream. Choose CSV or JSON.

View YAML file


Columns | <Requires input> | A list of dictionary values specifying the name, SQL type, and (if necessary) record row path mapping. For example, CSV: {Name: pressure, SqlType: DOUBLE} or JSON: {Name: pressure, SqlType: DOUBLE, Mapping: $.pressure}.
SQL Code | <Requires input> | The Amazon Kinesis Analytics application code. The code will be copied to the application.

The YAML configuration file is not required to run this solution. If you do not specify a file location, the solution launches an Analytics application with the following default configuration, which uses a "catch-all" schema.

Note: If you provide a YAML configuration file location, you must complete the Format section of the file.

# Update this file according to your Input Schema and application code
# Note: pay attention to indentation - it matters
format:
  InputFormatType: CSV
  RecordColumnDelimiter: ","
  RecordRowDelimiter: "\n"
  RecordRowPath: "$"
  OutputFormatType: CSV
columns:
  - {Name: temp, SqlType: TINYINT}
  - {Name: segmentId, SqlType: CHAR(4)}
  - {Name: sensorIp, SqlType: VARCHAR(15)}
  - {Name: pressure, SqlType: DOUBLE}
  - {Name: incline, SqlType: DOUBLE}
  - {Name: flow, SqlType: BIGINT}
  - {Name: captureTs, SqlType: TIMESTAMP}
  - {Name: sensorId, SqlType: CHAR(4)}
sql_code: |
  -- Paste your SQL code here
  CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
      temp TINYINT,
      sensorIp VARCHAR(15),
      sensorId CHAR(4),
      captureTs TIMESTAMP,
      pressure DOUBLE);
  CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
  SELECT STREAM "temp", "sensorIp", "sensorId", "captureTs", "pressure"
  FROM "SOURCE_SQL_STREAM_001";

Figure 2: Sample YAML configuration file
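The format settings in this file describe how the Analytics application splits raw source records into schema columns. As a rough illustration of what RecordRowDelimiter and RecordColumnDelimiter mean (a hypothetical helper; the real parsing happens inside Amazon Kinesis Analytics):

```python
# Illustration of RecordRowDelimiter ("\n") and RecordColumnDelimiter (",")
# applied to CSV source data, using the catch-all schema's column names.
# Hypothetical helper; actual parsing is done by Amazon Kinesis Analytics.
COLUMN_NAMES = ["temp", "segmentId", "sensorIp", "pressure",
                "incline", "flow", "captureTs", "sensorId"]

def split_records(raw: str, row_delim: str = "\n", col_delim: str = ","):
    """Split a raw payload into per-record dicts keyed by the schema columns."""
    rows = [r for r in raw.split(row_delim) if r]
    return [dict(zip(COLUMN_NAMES, row.split(col_delim))) for row in rows]

records = split_records("21,s001,10.0.0.5,101.3,0.2,42,2016-12-01 00:00:00,d001\n")
print(records[0]["pressure"])  # -> "101.3"
```

If your source data uses a different delimiter, such as "|", change Record Column Delimiter accordingly so that each field lands in the intended column.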

Appendix C: Sample Amazon Kinesis Analytics Applications

Amazon Kinesis Analytics implements the ANSI 2008 SQL standard with extensions that enable you to process streaming data. For detailed information on Amazon Kinesis Analytics SQL concepts, see the Amazon Kinesis Analytics SQL Reference.

Here are some examples of Amazon Kinesis Analytics application code.

Simple Continuous Filter

This application performs a continuous SELECT statement on stock ticker data in the source stream (SOURCE_SQL_STREAM_001) based on a WHERE condition, and inserts the results into an output in-application stream (DESTINATION_SQL_STREAM).

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16),
    price REAL,
    change REAL);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM ticker_symbol, sector, price, change
FROM "SOURCE_SQL_STREAM_001"
WHERE sector SIMILAR TO '%TECH%';
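The filter above can be read as a Python generator over an endless record stream, where SIMILAR TO '%TECH%' matches any sector containing "TECH" (a loose analogy; Kinesis Analytics evaluates the SQL server-side):

```python
# Python analogy of the continuous filter above: keep only records whose
# sector contains "TECH", mirroring WHERE sector SIMILAR TO '%TECH%'.
# Illustrative only; Kinesis Analytics evaluates the SQL server-side.
def tech_filter(records):
    for rec in records:  # records arrive continuously from the source stream
        if "TECH" in rec["sector"]:
            yield {k: rec[k] for k in ("ticker_symbol", "sector", "price", "change")}

source = [
    {"ticker_symbol": "AMZN", "sector": "TECHNOLOGY", "price": 780.0, "change": 1.2},
    {"ticker_symbol": "XOM",  "sector": "ENERGY",     "price": 87.0,  "change": -0.3},
]
print([r["ticker_symbol"] for r in tech_filter(source)])  # -> ['AMZN']
```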

Multiple-Step Application

This application uses multiple intermediate in-application streams (IN_APP_STREAM_001 and IN_APP_STREAM_002) to process data in multiple steps. The results of a query against one in-application stream feed into another in-application stream.

CREATE OR REPLACE STREAM "IN_APP_STREAM_001" (
    ingest_time TIMESTAMP,
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16),
    price REAL,
    change REAL);
CREATE OR REPLACE PUMP "PUMP_001" AS INSERT INTO "IN_APP_STREAM_001"
SELECT STREAM APPROXIMATE_ARRIVAL_TIME, ticker_symbol, sector, price, change
FROM "SOURCE_SQL_STREAM_001";

CREATE OR REPLACE STREAM "IN_APP_STREAM_002" (
    ingest_time TIMESTAMP,
    ticker_symbol VARCHAR(4),
    sector VARCHAR(16),
    price REAL,
    change REAL);
CREATE OR REPLACE PUMP "PUMP_002" AS INSERT INTO "IN_APP_STREAM_002"
SELECT STREAM ingest_time, ticker_symbol, sector, price, change
FROM "IN_APP_STREAM_001";
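Chaining in-application streams resembles composing generators, where each stage consumes the previous stage's output. A loose Python analogy (the second-step filter here is invented for illustration and is not part of the SQL sample):

```python
# Loose Python analogy of multi-step processing: stage_one enriches records
# (adding an ingest time, like APPROXIMATE_ARRIVAL_TIME) and stage_two consumes
# stage_one's output, mirroring how PUMP_002 reads from IN_APP_STREAM_001.
import time

def stage_one(source):
    for rec in source:
        yield {"ingest_time": time.time(), **rec}

def stage_two(stream):
    for rec in stream:
        # hypothetical second step: keep only positive price changes
        if rec["change"] > 0:
            yield rec

source = [
    {"ticker_symbol": "AMZN", "sector": "TECH",   "price": 780.0, "change": 1.2},
    {"ticker_symbol": "XOM",  "sector": "ENERGY", "price": 87.0,  "change": -0.3},
]
results = list(stage_two(stage_one(source)))
print([r["ticker_symbol"] for r in results])  # -> ['AMZN']
```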

Pre-Processing Streams

This application retrieves rows of specific types from the in-application input stream and inserts them into separate in-application streams. Once the record types have been filtered, you can perform analytics on a particular in-application stream.

CREATE OR REPLACE STREAM "Order_Stream" (
    "order_id" integer,
    "order_type" varchar(10),
    "ticker" varchar(4),
    "order_price" DOUBLE,
    "record_type" varchar(10));
CREATE OR REPLACE PUMP "Order_Pump" AS INSERT INTO "Order_Stream"
SELECT STREAM "Oid", "Otype", "Oticker", "Oprice", "RecordType"
FROM "SOURCE_SQL_STREAM_001"
WHERE "RecordType" = 'Order';

CREATE OR REPLACE STREAM "Trade_Stream" (
    "trade_id" integer,
    "order_id" integer,
    "trade_price" DOUBLE,
    "ticker" varchar(4),
    "record_type" varchar(10));
CREATE OR REPLACE PUMP "Trade_Pump" AS INSERT INTO "Trade_Stream"
SELECT STREAM "Tid", "Toid", "Tprice", "Tticker", "RecordType"
FROM "SOURCE_SQL_STREAM_001"
WHERE "RecordType" = 'Trade';

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "ticker" varchar(4),
    "trade_count" integer);
CREATE OR REPLACE PUMP "Output_Pump" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "ticker", count(*) as trade_count
FROM "Trade_Stream"
GROUP BY "ticker", FLOOR("Trade_Stream".ROWTIME TO MINUTE);
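The route-then-aggregate pattern above can be sketched in Python: split mixed records by record type, then count trades per ticker per minute, the rough analogue of GROUP BY "ticker", FLOOR(ROWTIME TO MINUTE). Field names follow the SQL sample; the epoch-seconds "rowtime" field is an assumption for illustration.

```python
# Sketch of the pre-processing pattern: route mixed records by RecordType,
# then count trades per (ticker, minute) - the Python analogue of a one-minute
# tumbling window. Illustrative only; "rowtime" in epoch seconds is assumed.
from collections import defaultdict

def route_and_count(records):
    orders, trade_counts = [], defaultdict(int)
    for rec in records:
        if rec["RecordType"] == "Order":
            orders.append(rec)
        elif rec["RecordType"] == "Trade":
            minute = rec["rowtime"] - (rec["rowtime"] % 60)  # floor to the minute
            trade_counts[(rec["Tticker"], minute)] += 1
    return orders, dict(trade_counts)

records = [
    {"RecordType": "Trade", "Tticker": "AMZN", "rowtime": 61},
    {"RecordType": "Trade", "Tticker": "AMZN", "rowtime": 95},
    {"RecordType": "Order", "Oticker": "XOM"},
]
orders, counts = route_and_count(records)
print(counts)  # -> {('AMZN', 60): 2}
```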

Appendix D: Collection of Anonymous Data

This solution includes an option to send anonymous usage data to AWS. We use this data to better understand how customers use this solution, to improve the services and products that we offer. When enabled, the following information is collected and sent to AWS every 15 minutes after you deploy the solution:

- Solution ID: The AWS solution identifier
- Unique ID (UUID): Randomly generated, unique identifier for each Streaming Analytics Pipeline deployment
- Timestamp: Data-collection timestamp
- Streaming Data Rate: Count of the number of records and bytes that enter your Amazon Kinesis Analytics application for analysis

Example data:

{"metrics":
  {"InputRecords":463000.0,"InputBytes":70931638.0}
}

Note that AWS will own the data gathered via this survey. Data collection will be subject to the AWS Privacy Policy. To opt out of this feature, set the SendAnonymousData parameter to No.
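A full payload along the lines described above could be assembled as follows. This is a sketch: the key names beyond "metrics" and the solution ID value are assumptions, since only the field categories are documented.

```python
# Sketch of assembling the anonymous usage payload described above.
# The "Solution", "UUID", and "TimeStamp" key names and the solution ID value
# are assumptions for illustration; only the "metrics" example is documented.
import json
import uuid
from datetime import datetime, timezone

payload = {
    "Solution": "SO0000",                       # hypothetical solution ID
    "UUID": str(uuid.uuid4()),                  # per-deployment random ID
    "TimeStamp": datetime.now(timezone.utc).isoformat(),
    "metrics": {"InputRecords": 463000.0, "InputBytes": 70931638.0},
}
body = json.dumps(payload)
print(json.loads(body)["metrics"]["InputRecords"])  # -> 463000.0
```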


Send Us Feedback

We welcome your questions and comments. Please post your feedback on the AWS Solutions Forum.

You can visit our GitHub repository to download the templates and scripts for this solution, and to share your customizations with others.

Document Revisions

Date | Change | In sections
December 2016 | Initial release | --

© 2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Notices

This document is provided for informational purposes only. It represents AWS’s current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

The Streaming Analytics Pipeline is licensed under the terms of the Amazon Software License available at https://aws.amazon.com/asl/.