Copyright (c) 2016 by Amazon.com, Inc. or its affiliates.
Streaming Analytics Pipeline is licensed under the terms of the Amazon Software License available at
https://aws.amazon.com/asl/
Streaming Analytics Pipeline AWS Implementation Guide
Chris Rec
December 2016
Contents
Overview
Cost
Architecture Overview
Design Considerations
Regional Deployments
Streaming Data Format
Shard Count
Multiple External Destinations
AWS CloudFormation Templates
Automated Deployment
Prerequisites
Amazon Redshift
Amazon Elasticsearch Service
What We’ll Cover
Step 1. Launch the Stack
Step 2. Validate and Start the Application
Step 3. Start Streaming Data
Security
Security Groups
IAM Roles
AWS KMS Encryption
Additional Resources
Appendix A: Modify Destination Parameters
Appendix B: YAML File Configuration
Appendix C: Sample Amazon Kinesis Analytics Applications
Simple Continuous Filter
Multiple-Step Application
Pre-Processing Streams
Appendix D: Collection of Anonymous Data
Send Us Feedback
Document Revisions
About This Guide
This implementation guide discusses architectural considerations and configuration steps for
deploying the Streaming Analytics Pipeline on the Amazon Web Services (AWS) Cloud. It
includes links to AWS CloudFormation templates that launch, configure, and run the AWS
compute, network, storage, and other services required to deploy this solution on AWS, using
AWS best practices for security and availability.
The guide is intended for IT infrastructure architects, administrators, and DevOps
professionals who have practical experience architecting on the AWS Cloud, and are familiar
with streaming data and analytics.
Overview
Many Amazon Web Services (AWS) customers use streaming data to gain real-time insight
into customer activity and immediate business trends. Streaming data is generated
continuously from thousands of data sources, and this information can help companies
make well-informed decisions and proactively respond to changing conditions.
Amazon Kinesis, a platform for streaming data on AWS, offers powerful services that make
it easier to build data processing applications, load massive volumes of data from hundreds
of thousands of sources, and analyze streaming data in real time.
Amazon Kinesis services include Amazon Kinesis Streams, which enables you to build your
own custom applications that process or analyze streaming data; Amazon Kinesis Firehose,
which captures and automatically loads streaming data into Amazon Simple Storage Service
(Amazon S3) and Amazon Redshift for near real-time analytics with your existing business
intelligence tools; and Amazon Kinesis Analytics, which processes streaming data in real
time with standard SQL.
To help customers more easily configure a streaming data architecture, AWS offers the
Streaming Analytics Pipeline. This solution automatically provisions and configures the
AWS services necessary to start consuming and analyzing streaming data in minutes. This
solution uses Amazon Kinesis Streams to load streaming data, Amazon Kinesis Analytics to
filter and process that data, and Amazon Kinesis Firehose to deliver the data to various data
stores for search, storage, or further analytics.
Cost
You are responsible for the cost of the AWS services used while running the Streaming
Analytics Pipeline. The total cost of this solution depends on the amount of data you stream
through the Streaming Analytics Pipeline. As of the date of publication, the cost of running
this solution with the default settings in the US East (N. Virginia) Region is approximately
$1.38 per hour, assuming the solution streams 1,000 records per second with an average size
of three kilobytes per record and that the external destination is an Amazon Simple Storage
Service (Amazon S3) bucket. Prices are subject to change. For full details, see the pricing
webpage for each AWS service you will be using in this solution.
To manage costs, we recommend adjusting your AWS Lambda and Amazon Kinesis Firehose
batch configurations as your record count and data size increase.
Architecture Overview
Deploying this solution with the default parameters builds the following environment in
the AWS Cloud.
Figure 1: Streaming Analytics Pipeline default architecture on AWS
By default, the AWS CloudFormation template creates a new Amazon Kinesis stream with
two shards, an Amazon Kinesis Firehose delivery stream that encrypts data with AWS Key
Management Service, an Amazon Simple Storage Service (Amazon S3) bucket to store raw
and analyzed data, and an AWS Identity and Access Management (IAM) role with least-
privilege access permissions. The template also launches an AWS Lambda custom resource
that creates an Amazon Kinesis Analytics application based on settings you specify in a
YAML configuration file. For more information, see Appendix B. The application consumes
records from the source Amazon Kinesis stream and puts records into the Amazon Kinesis
Firehose delivery stream.
Note: If you do not specify a YAML configuration file, the Amazon Kinesis Analytics application will require further modification through the AWS Management Console and/or the service API to efficiently analyze your data.
If you choose to persist raw data, an AWS Lambda function is deployed. The Lambda
function gets raw records from the source Amazon Kinesis stream, decodes the Base64-
encoded data, batches the records, and puts them into another Amazon Kinesis Firehose
delivery stream for delivery to Amazon S3.
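For illustration only, the following is a minimal Python sketch of that pattern; it is not the solution's actual function. The delivery stream name is a placeholder, and the batching simply respects the 500-record limit of the Firehose PutRecordBatch API.

import base64
import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "RawDataDeliveryStream"  # placeholder; not the solution's actual resource name

def handler(event, context):
    # Kinesis event records arrive Base64-encoded; decode each payload.
    records = [{"Data": base64.b64decode(r["kinesis"]["data"])} for r in event["Records"]]
    # PutRecordBatch accepts at most 500 records per call, so send in chunks.
    for i in range(0, len(records), 500):
        firehose.put_record_batch(
            DeliveryStreamName=DELIVERY_STREAM,
            Records=records[i:i + 500],
        )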
The Streaming Analytics Pipeline can be customized to fit your needs. When you deploy the
solution, you can specify an existing Amazon Kinesis stream, a configuration for an Amazon
Kinesis Analytics application, whether or not to encrypt the data, and whether or not to
persist raw data from your source Amazon Kinesis stream to Amazon S3. You can also
choose from four destinations for your analyzed data: an Amazon S3 bucket (default), a pre-
configured Amazon Redshift cluster, a pre-configured Amazon Elasticsearch Service
domain, or an existing Amazon Kinesis stream.
Figure 2: Streaming Analytics Pipeline architecture on AWS
Design Considerations
Regional Deployments
The Streaming Analytics Pipeline uses AWS Lambda and Amazon Kinesis Analytics.
Therefore, you must deploy this solution in an AWS Region that supports both Lambda and
Amazon Kinesis Analytics. As of the date of publication, this includes the US East (N.
Virginia) Region, the US West (Oregon) Region, and the EU (Ireland) Region.
Streaming Data Format
Amazon Kinesis Analytics allows you to specify a schema to classify your streaming data
before it executes SQL queries against your input Amazon Kinesis stream. If you specify a
strict schema for all records, the analysis could fail if some records do not match the
expected format specified in the schema. For this solution, consider applying a flexible
schema to your streaming data to ensure all data is collected. Then, refine the schema using
standard SQL.
Shard Count
The number of shards you need for a new Amazon Kinesis stream depends on the amount
of streaming data you plan to produce. Each shard can support up to 1,000 records per
second for writes, up to a maximum total data write rate of 1 MB per second (including
partition keys). For example, an application that produces 100 records per second with a
size of 35 kilobytes per record for a total data input rate of 3.4 megabytes per second needs
4 shards.
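To illustrate the arithmetic, the following is a simple helper for estimation purposes (not part of the solution) that derives the shard count from the two per-shard write limits described above.

import math

def shards_needed(records_per_second, avg_record_kb):
    # Per-shard write limits: 1,000 records/second and 1 MB/second (including partition keys).
    by_throughput_mb = (records_per_second * avg_record_kb) / 1024.0
    by_record_count = records_per_second / 1000.0
    return max(1, math.ceil(max(by_throughput_mb, by_record_count)))

print(shards_needed(100, 35))  # the example above: ~3.4 MB/s -> 4 shards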
The Streaming Analytics Pipeline AWS Lambda function processes data at a default rate of
1,000 records per second, but you can adjust the timeout and batch size to accommodate
faster processing and delivery of raw data.
While there is no upper limit to the number of shards in a stream or account, each region
has a default shard limit. For information on shard limits, please visit Amazon Kinesis
Streams Limits. To request an increase in your shard limit, please use the Stream Limits
form.
Multiple External Destinations
Amazon Kinesis Analytics allows users to specify up to three external destinations for
analyzed data. By default, the Streaming Analytics Pipeline allows users to specify a single
external destination for their analyzed data. For customers who want to send analyzed data
to multiple external destinations, this solution includes a template (add-output) to allow
you to specify multiple destinations without the use of the AWS Command Line Interface or
a custom script.
AWS CloudFormation Templates
This solution uses AWS CloudFormation to automate the deployment of the Streaming
Analytics Pipeline on the AWS Cloud. It includes the following AWS CloudFormation
templates, which you can download before deployment:
streaming-analytics-pipeline.template: Use this template to
launch the Streaming Analytics Pipeline and all associated
components. The default configuration deploys an Amazon Kinesis stream, an AWS
Lambda function (optional), an Amazon Kinesis Firehose delivery stream, an Amazon
Simple Storage Service (Amazon S3) bucket, and an AWS Key Management Service
encryption key, but you can also customize the template based on your specific needs.
add-output.template: Use this template to specify more than one
external destination for the Streaming Analytics Pipeline.
Note: To cleanly delete this solution’s stack, you must delete the add-output stack before you delete the streaming-analytics-pipeline stack.
Automated Deployment
Before you launch the automated deployment, please review the architecture, configuration,
and other considerations discussed in this guide. Follow the step-by-step instructions in this
section to configure and deploy a Streaming Analytics Pipeline into your account.
Time to deploy: Approximately five (5) minutes
Prerequisites
If you choose Amazon Redshift or Amazon Elasticsearch Service as the destination for your
analyzed data, you must configure them to work with the Streaming Analytics Pipeline
solution.
Amazon Redshift
To configure Amazon Redshift, your Amazon Redshift cluster must have a table that is
configured to accept the data in the format that is output by the Amazon Kinesis Analytics
application. You must also have permission to write to that table. If your Amazon
Redshift cluster is located in an Amazon Virtual Private Cloud, the cluster must be publicly
accessible with a public IP address. The cluster’s Amazon Elastic Compute Cloud (Amazon
EC2) security group should allow access from the AWS Region’s Amazon Kinesis Firehose
IP addresses:
US East (N. Virginia) Region: 52.70.63.192/27
US West (Oregon) Region: 52.89.255.224/27
EU (Ireland) Region: 52.19.239.192/27
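If you prefer to add that rule programmatically, the following is a minimal boto3 sketch for the US East (N. Virginia) CIDR block; the security group ID is a placeholder, and port 5439 is assumed (the Amazon Redshift default).

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder security group ID; use the group attached to your Amazon Redshift cluster.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,  # default Amazon Redshift port
        "ToPort": 5439,
        "IpRanges": [{"CidrIp": "52.70.63.192/27"}],  # Firehose CIDR for US East (N. Virginia)
    }],
)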
Amazon Elasticsearch Service
To configure Amazon Elasticsearch Service, your Amazon Elasticsearch Service domain
should have an existing index and type to which data can be assigned. We also recommend
you create and map your fields to the appropriate data type before you start the Amazon
Kinesis Analytics application to ensure that the solution assigns your data to the right type.
If you do not map the data types before you deploy the Streaming Analytics Pipeline, the
solution will create data types for you, but these may not be the types you want.
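For example, the following is a minimal sketch of creating an index with an explicit mapping before starting the application. The endpoint, index, type, and field names are placeholders, and the request assumes the domain's access policy permits unsigned requests from your IP address; otherwise the request must be signed.

import requests

endpoint = "https://search-mydomain-abc123.us-east-1.es.amazonaws.com"  # placeholder domain endpoint

mapping = {
    "mappings": {
        "sensor_data": {  # placeholder type name
            "properties": {
                "sensorId": {"type": "string", "index": "not_analyzed"},
                "pressure": {"type": "double"},
                "captureTs": {"type": "date"},
            }
        }
    }
}

# Create the index and mapping before the Analytics application starts delivering data.
response = requests.put(endpoint + "/sensor-index", json=mapping)
response.raise_for_status()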
What We’ll Cover
The procedure for deploying this architecture on AWS consists of the following steps. For
detailed instructions, follow the links for each step.
Step 1. Launch the stack
Launch the AWS CloudFormation template into your AWS account.
Enter values for required parameters.
Review the other template parameters, and adjust if necessary.
Step 2. Validate and Start the Application
Verify that the schema and application code are correct.
Start the application.
Step 3. Start Streaming Data
Start streaming data to the source Amazon Kinesis stream.
View results in your external destination.
Step 1. Launch the Stack
This automated AWS CloudFormation template deploys the Streaming Analytics Pipeline on
the AWS Cloud. Please make sure that you’ve configured your Amazon Redshift cluster or
Amazon Elasticsearch Service domain before launching the stack, if you chose one of those
as your destination.
Note: You are responsible for the cost of the AWS services used while running this solution. See the Cost section for more details. For full details, see the pricing webpage for each AWS service you will be using in this solution.
1. Log in to the AWS Management Console and choose Launch Solution to launch the
streaming-analytics-pipeline AWS CloudFormation template.
You can also download the template as a starting point for your own implementation.
2. The template is launched in the US East (N. Virginia) Region by default. To launch this
solution in a different AWS Region, use the region selector in the console navigation bar.
Note: This solution uses AWS Lambda and Amazon Kinesis Analytics, which are currently available in the US East (N. Virginia) Region, the US West (Oregon) Region, and the EU (Ireland) Region. Therefore, you must launch this solution in one of those Regions. For the most current Lambda and Amazon Kinesis Analytics availability by Region, see the AWS service offerings by region.
3. On the Select Template page, verify that you selected the correct template and choose
Next.
4. On the Specify Details page, assign a name to your Streaming Analytics Pipeline
solution stack.
5. Under Parameters, review the parameters for the template and modify them as
necessary. This solution uses the following default values.
New or Existing Stream (default: New Kinesis Stream): The source Amazon Kinesis stream. Create a new stream or choose an existing stream.
New Stream Shard Count (requires input): The number of shards to allot to your new stream. Note: If you use an existing stream, leave this parameter blank.
Existing Stream Name (requires input): The name of an existing stream in the same AWS Region where you launch the solution. Note: If you use a new stream, leave this parameter blank.
External Destination (default: Amazon S3): The destination for your analyzed data. Select Amazon S3 (default), Amazon Redshift, Amazon Elasticsearch Service, or Amazon Kinesis stream. Note: If you choose Amazon Redshift, Amazon Elasticsearch Service, or Amazon Kinesis stream, you must configure the destination. See Appendix A for steps to configure the destination.
Configuration File Location (requires input): The Amazon S3 bucket and key where the completed YAML configuration file is stored. For example, <bucket-name>/<key>. For information about the YAML file configuration, see Appendix B.
Encrypt Data at Rest? (default: Yes): Specify whether or not the solution will create an AWS KMS encryption key and encrypt raw and analyzed data in Amazon S3.
Persist Raw Source Data? (default: Yes): Specify whether or not the solution will persist raw streaming data from your source Amazon Kinesis stream to Amazon S3.
Destination Prefix (default: AggregateData): The prefix name that will be created in the Amazon S3 bucket. Note: Use this parameter only if you choose the default option (Amazon S3) as your destination.
Buffer Interval (default: 300): The number of seconds (60-900) that Amazon Kinesis Firehose should buffer data before loading it to Amazon S3.
Buffer Size (default: 5): The size of data in MB (1-128) that Amazon Kinesis Firehose should buffer before loading it to Amazon S3.
Send Anonymous Usage Data (default: Yes): Send anonymous data to AWS to help us understand usage across our customer base as a whole. To opt out of this feature, choose No. For more information, see Appendix D.
6. Verify that you modified the correct parameters for your chosen destination.
7. Click Next.
8. On the Options page, choose Next.
9. On the Review page, review and confirm the settings. Be sure to check the box
acknowledging that the template will create IAM resources.
10. Click Create to deploy the stack.
You can view the status of the stack in the AWS CloudFormation Console in the Status
column. You should see a status of CREATE_COMPLETE in roughly five (5) minutes.
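If you prefer to monitor the deployment from the AWS SDK, the following is a minimal boto3 sketch; the stack name is a placeholder for the name you assigned on the Specify Details page.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
stack_name = "streaming-analytics-pipeline"  # placeholder; use your stack name

# Block until stack creation finishes, then print the final status.
cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)
print(cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"])  # CREATE_COMPLETE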
Step 2. Validate and Start the Application
Once the stack is created, complete the following steps.
1. Navigate to the stack Outputs tab.
2. Note the name of the Amazon Kinesis Analytics application.
3. Navigate to the Amazon Kinesis Analytics console.
4. Select the name of your Analytics application and choose Application Details.
Figure 3: Example Amazon Kinesis Analytics application details
5. To view your data schema, select the pencil icon next to the source Amazon Kinesis
stream, and scroll to the bottom of the page.
6. Under Real-time analytics, choose Go to SQL editor.
7. When asked if you want to start the application, select No, I’ll do this later.
8. Review your SQL code and edit as necessary. Then, choose Save and run SQL.
Your Amazon Kinesis Analytics application will change to a Starting state. Your application will start after 30-90 seconds.
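If you prefer to start the application from the AWS SDK after reviewing the SQL code, the following is a minimal boto3 sketch; the application name is a placeholder for the value on the stack Outputs tab, and the sketch assumes the application has a single input.

import boto3

ka = boto3.client("kinesisanalytics", region_name="us-east-1")
app_name = "my-analytics-application"  # placeholder; use the name from the stack Outputs tab

# Look up the application's input ID, then start processing from the latest records.
detail = ka.describe_application(ApplicationName=app_name)["ApplicationDetail"]
input_id = detail["InputDescriptions"][0]["InputId"]

ka.start_application(
    ApplicationName=app_name,
    InputConfigurations=[{
        "Id": input_id,
        "InputStartingPositionConfiguration": {"InputStartingPosition": "NOW"},
    }],
)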
Step 3. Start Streaming Data
Once you start your Amazon Kinesis Analytics application, configure your streaming data producers to send streaming records to your source Amazon Kinesis stream. For more information on how to configure streaming data producers, please visit Writing Data to Amazon Kinesis Streams.
Note: To test this solution with sample data, you can use the Amazon Kinesis Data Producer. The data producer generates records using random data based on a template you provide.
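As a simple alternative, the following is a minimal boto3 sketch of a test producer that sends one CSV record matching the default schema described in Appendix B; the stream name is a placeholder for the stream created or specified when you launched the stack.

import datetime
import random
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
stream_name = "my-source-stream"  # placeholder; use your source stream name

# One CSV record matching the default schema in Appendix B:
# temp, segmentId, sensorIp, pressure, incline, flow, captureTs, sensorId
record = ",".join([
    str(random.randint(10, 40)),                               # temp
    "seg1",                                                    # segmentId
    "10.0.0.1",                                                # sensorIp
    str(round(random.uniform(90.0, 110.0), 2)),                # pressure
    str(round(random.uniform(0.0, 5.0), 2)),                   # incline
    str(random.randint(0, 1000)),                              # flow
    datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"),  # captureTs
    "s001",                                                    # sensorId
]) + "\n"

kinesis.put_record(StreamName=stream_name, Data=record.encode("utf-8"), PartitionKey="s001")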
As data flows through the Amazon Kinesis stream, it is automatically processed by the Analytics application, and Amazon Kinesis Firehose delivers the data to the specified external destination.
Once enough data has been sent through the Streaming Analytics Pipeline, or after the Firehose buffer interval has been reached, analyzed data is sent to the destination. If you chose to persist raw streaming data to Amazon Simple Storage Service (Amazon S3), you will also see Base64-decoded record data in the solution’s Amazon S3 bucket under the rawStreamData prefix.
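To confirm delivery, the following is a minimal boto3 sketch that lists the delivered raw objects; the bucket name is a placeholder for the bucket created by the solution (shown on the stack Outputs tab).

import boto3

s3 = boto3.client("s3")
bucket = "my-streaming-analytics-bucket"  # placeholder; use the solution's bucket name

# List objects that Firehose delivered under the rawStreamData prefix.
listing = s3.list_objects_v2(Bucket=bucket, Prefix="rawStreamData")
for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
    print(obj["Key"], len(body), "bytes")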
Security
When you build systems on AWS infrastructure, security responsibilities are shared between
you and AWS. This shared model can reduce your operational burden as AWS operates,
manages, and controls the components from the host operating system and virtualization
layer down to the physical security of the facilities in which the services operate. For more
information about security on AWS, visit the AWS Security Center.
Security Groups
The Streaming Analytics Pipeline does not create any security groups. However, we
recommend that you follow best practices for least-privilege access when creating access
rules for associated resources. If you selected an existing Amazon Redshift cluster as your
external destination, and your cluster is in an Amazon VPC with a publicly available IP
address, you must open the Amazon Redshift security group to the Amazon Kinesis Firehose
CIDR block for your AWS Region. For more information, see Prerequisites.
IAM Roles
AWS Identity and Access Management (IAM) roles enable customers to assign granular
access policies and permissions to services and users on the AWS Cloud. Depending on
your configuration, the Streaming Analytics Pipeline creates between two and five IAM
roles. The solution creates the following roles:
A role with granular access policies for each Amazon Kinesis Firehose delivery stream
that the solution creates. The policies allow the Amazon Kinesis Firehose delivery
streams to log their events, get a particular AWS Key Management Service encryption
key to encrypt data in a specific Amazon S3 prefix, and send streaming events to a
specific Amazon S3 prefix (and to Amazon Elasticsearch Service if the customer has
selected this as their external destination).
A role for the new Amazon Kinesis Analytics application. This role grants the application
least-privilege permissions to get streaming records from the source Amazon Kinesis
stream, put the analyzed results to a specific Firehose delivery stream or Amazon
Kinesis stream, and log its events.
A role for the AWS Lambda custom resource that creates the Amazon Kinesis Analytics
application. This role has permission to create, delete, describe, and list applications, log
its events, and get details from AWS CloudFormation and Amazon CloudWatch for
configuration and to gather metrics.
A role that allows an AWS Lambda function to get records from the source Amazon Kinesis
stream, put batch events to Amazon Kinesis Firehose, and log its events. This role is only
created if you choose to persist raw streaming data to Amazon S3.
A role that allows an Amazon CloudWatch rule to invoke an AWS Lambda function, which
collects and sends anonymous metrics. This role is only created if you choose to send
anonymous data to AWS.
AWS KMS Encryption
This solution allows you to encrypt your data at rest when it reaches the destination. If you
choose to encrypt your data, the solution creates an AWS Key Management Service (AWS
KMS) encryption key, and automatically configures the Amazon Kinesis Firehose delivery
streams to use the key. By default, no services or users will have permission to use or
control the AWS KMS encryption key. To set access policies for the key, set them manually
in the AWS KMS console.
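If you prefer the AWS SDK to the console, one option is to add a grant on the key; the following is a minimal boto3 sketch in which the key ARN and grantee principal are placeholders for the key created by the solution and the IAM principal that should decrypt the delivered objects.

import boto3

kms = boto3.client("kms", region_name="us-east-1")

# Placeholders: the solution's key and the principal that should read the encrypted objects.
kms.create_grant(
    KeyId="arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
    GranteePrincipal="arn:aws:iam::111122223333:role/analytics-readers",
    Operations=["Decrypt", "DescribeKey"],
)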
Additional Resources
AWS service documentation
AWS CloudFormation
Amazon Kinesis Analytics developer guide
Amazon Kinesis Streams developer guide
Amazon Kinesis Firehose developer guide
AWS Lambda developer guide
Amazon Redshift developer guide
Amazon Elasticsearch Service developer guide
Amazon CloudWatch user guide
Appendix A: Modify Destination Parameters
If you select Amazon Redshift, Amazon Elasticsearch Service, or an Amazon Kinesis stream as the destination for your analyzed data, you must modify the parameters for your selected destination.
For Amazon Redshift, modify the parameters in the following table.
Master User Name (requires input): Username of the user with permissions to edit the specified table in the Amazon Redshift cluster.
Master User Password (requires input): Password of the user with permissions to edit the specified table in the Amazon Redshift cluster.
JDBC URL (requires input): The JDBC URL of the Amazon Redshift cluster. You can obtain this from the Amazon Redshift console. The URL has the following format: jdbc:redshift://endpoint:port/database
Table Name (requires input): The name of an existing, preconfigured table in the specified Amazon Redshift cluster, to which the results of the Amazon Kinesis Analytics application will be loaded.
Column Pattern (requires input): By default, Amazon Kinesis Firehose will copy records to Amazon Redshift in the same order they leave the Amazon Kinesis Analytics application. If you wish to change the order or enter analyzed data into certain columns, provide a comma-separated list of the column names in the desired order. For example, column1, column2, column3, column4.
Buffer Interval (default: 300): The number of seconds (60-900) that Amazon Kinesis Firehose should buffer data before loading it to Amazon Redshift.
Buffer Size (default: 5): The size of data in MB (1-128) that Amazon Kinesis Firehose should buffer before loading it to Amazon Redshift.
For Amazon Elasticsearch Service, modify the parameters in the following table.
Domain Name (requires input): The name of the Amazon Elasticsearch Service domain. You must deploy the solution in the same AWS Region as the domain.
Index Name (requires input): The name of the index for analyzed data.
Type Name (requires input): The name of the type for analyzed data. We recommend that you create the type before you start the Amazon Kinesis Analytics application.
Index Rotation (default: NoRotation): The frequency at which the specified Amazon Elasticsearch Service index rotates.
Buffer Interval (default: 300): The number of seconds (60-900) that Amazon Kinesis Firehose should buffer data before loading it to Amazon Elasticsearch Service.
Buffer Size (default: 5): The size of data in MB (1-128) that Amazon Kinesis Firehose should buffer before loading it to Amazon Elasticsearch Service.
For an Amazon Kinesis stream, modify the Destination Stream Name parameter, which specifies the name of the stream that will serve as the destination for your analyzed data.
Appendix B: YAML File Configuration
The Streaming Analytics Pipeline includes a YAML file that contains configuration
information for the Amazon Kinesis Analytics application that the solution creates. Review
the parameters in the YAML file and modify them as necessary for your implementation.
Then, upload the file to an Amazon S3 bucket.
streaming-analytics-pipeline-config.yaml: Use this file to
specify your Amazon Kinesis Analytics application configuration.
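For example, the following is a minimal boto3 sketch of uploading the edited file; the bucket and key names are placeholders, and the resulting <bucket-name>/<key> value is what you enter for the Configuration File Location parameter.

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and key; enter "<bucket-name>/<key>" as the Configuration File Location parameter.
s3.upload_file(
    Filename="streaming-analytics-pipeline-config.yaml",
    Bucket="my-config-bucket",
    Key="streaming-analytics-pipeline-config.yaml",
)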
Input Format Type (default: CSV): The format of the records of the source stream. Choose CSV or JSON.
Record Column Delimiter (default: ","): The column delimiter of CSV-formatted data from the source stream. For example, "|" or ",". Note: Leave this parameter blank if you chose JSON as your Input Format Type.
Record Row Delimiter (default: "\n"): The row delimiter of CSV-formatted data from the source stream. For example, "\n". Note: Leave this parameter blank if you chose JSON as your Input Format Type.
Record Row Path (default: "$"): The path to the top-level parent that contains the records. Note: Leave this parameter blank if you chose CSV as your Input Format Type.
Output Format Type (default: CSV): The format of the analyzed data that is put in the output stream. Choose CSV or JSON.
Columns (requires input): A list of dictionary values specifying the name, SQL type, and (if necessary) record row path mapping. For example, CSV: {Name: pressure, SqlType: DOUBLE} or JSON: {Name: pressure, SqlType: DOUBLE, Mapping: $.pressure}.
SQL Code (requires input): The Amazon Kinesis Analytics application code. The code will be copied to the application.
The YAML configuration file is not required to run this solution. If you do not specify a file
location, the solution will launch an Analytics application with the following default
configuration, which uses a “catch all” schema.
Note: If you provide a YAML configuration file location, you must complete the Format section of the file.
# Update this file according to your Input Schema and application code
# Note: pay attention to indentation - it matters
format:
InputFormatType: CSV
RecordColumnDelimiter: ","
RecordRowDelimiter: "\n"
RecordRowPath: "$"
OutputFormatType: CSV
columns:
- {Name: temp, SqlType: TINYINT}
- {Name: segmentId, SqlType: CHAR(4)}
- {Name: sensorIp, SqlType: VARCHAR(15)}
- {Name: pressure, SqlType: DOUBLE}
- {Name: incline, SqlType: DOUBLE}
- {Name: flow, SqlType: BIGINT}
- {Name: captureTs, SqlType: TIMESTAMP}
- {Name: sensorId, SqlType: CHAR(4)}
sql_code: |
-- Paste your SQL code here
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
temp TINYINT,
sensorIp VARCHAR(15),
sensorId CHAR(4),
captureTs TIMESTAMP,
pressure DOUBLE);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO
"DESTINATION_SQL_STREAM"
SELECT STREAM "temp", "sensorIp", "sensorId", "captureTs",
"pressure"
FROM "SOURCE_SQL_STREAM_001";
Figure 4: Sample YAML configuration file
Appendix C: Sample Amazon Kinesis Analytics Applications
Amazon Kinesis Analytics implements the ANSI 2008 SQL standard with extensions. These
extensions enable you to process streaming data. For detailed information on Amazon
Kinesis Analytics SQL concepts, please see the Amazon Kinesis Analytics SQL Reference.
Here are some examples of Amazon Kinesis Analytics application code.
Simple Continuous Filter
This application performs a continuous SELECT statement on stock ticker data in the source
stream (SOURCE_SQL_STREAM_001) based on a WHERE condition, and inserts the results
into an output in-application stream (DESTINATION_SQL_STREAM).
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
ticker_symbol VARCHAR(4),
sector VARCHAR(16),
price REAL,
change REAL);
CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO
"DESTINATION_SQL_STREAM"
SELECT STREAM ticker_symbol, sector, price, change
FROM "SOURCE_SQL_STREAM_001"
WHERE sector SIMILAR TO '%TECH%';
Multiple-Step Application
This application uses multiple intermediate in-application streams (IN_APP_STREAM_001
and IN_APP_STREAM_002) to process data in multiple steps. The results of a query against
one in-application stream feed into another in-application stream.
CREATE OR REPLACE STREAM "IN_APP_STREAM_001" (
ingest_time TIMESTAMP,
ticker_symbol VARCHAR(4),
sector VARCHAR(16),
price REAL,
change REAL);
CREATE OR REPLACE PUMP "PUMP_001" AS INSERT INTO "IN_APP_STREAM_001"
SELECT STREAM APPROXIMATE_ARRIVAL_TIME, ticker_symbol, sector,
price, change
FROM "SOURCE_SQL_STREAM_001";
CREATE OR REPLACE STREAM "IN_APP_STREAM_002" (
ingest_time TIMESTAMP,
ticker_symbol VARCHAR(4),
sector VARCHAR(16),
price REAL,
change REAL);
CREATE OR REPLACE PUMP "PUMP_002" AS INSERT INTO "IN_APP_STREAM_002"
SELECT STREAM ingest_time, ticker_symbol, sector, price, change
Pre-Processing Streams
This application retrieves rows of specific types from the in-application input stream and
inserts them into separate in-application streams. Once the record types have been filtered,
you can perform analytics on a particular in-application stream.
CREATE OR REPLACE STREAM "Order_Stream" (
"order_id" integer,
"order_type" varchar(10),
"ticker" varchar(4),
"order_price" DOUBLE,
"record_type" varchar(10));
CREATE OR REPLACE PUMP "Order_Pump" AS INSERT INTO "Order_Stream"
SELECT STREAM "Oid", "Otype","Oticker", "Oprice", "RecordType"
FROM "SOURCE_SQL_STREAM_001"
WHERE "RecordType" = 'Order';
CREATE OR REPLACE STREAM "Trade_Stream" (
"trade_id" integer,
"order_id" integer,
"trade_price" DOUBLE,
"ticker" varchar(4),
"record_type" varchar(10));
CREATE OR REPLACE PUMP "Trade_Pump" AS INSERT INTO "Trade_Stream"
SELECT STREAM "Tid", "Toid", "Tprice", "Tticker", "RecordType"
FROM "SOURCE_SQL_STREAM_001"
WHERE "RecordType" = 'Trade';
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"ticker" varchar(4),
"trade_count" integer);
CREATE OR REPLACE PUMP "Output_Pump" AS INSERT INTO
"DESTINATION_SQL_STREAM"
SELECT STREAM "ticker", count(*) as trade_count
FROM "Trade_Stream"
GROUP BY "ticker", FLOOR("Trade_Stream".ROWTIME TO MINUTE);
Appendix D: Collection of Anonymous Data
This solution includes an option to send anonymous usage data to AWS. We use this data to
better understand how customers use this solution to improve the services and products
that we offer. When enabled, the following information is collected and sent to AWS every
15 minutes after you deploy the solution:
Solution ID: The AWS solution identifier
Unique ID (UUID): Randomly generated, unique identifier for each Streaming
Analytics Pipeline deployment
Timestamp: Data-collection timestamp
Streaming Data Rate: Count of the number of records and bytes that enter your
Amazon Kinesis Analytics application for analysis
Example data:
{"metrics":
{"InputRecords":463000.0,"InputBytes":70931638.0}
Note that AWS will own the data gathered via this survey. Data collection will be subject to
the AWS Privacy Policy. To opt out of this feature, set the SendAnonymousData
parameter to No.
Send Us Feedback
We welcome your questions and comments. Please post your feedback on the AWS
Solutions Forum.
You can visit our GitHub repository to download the templates and scripts for this solution,
and to share your customizations with others.
Document Revisions
December 2016: Initial release
© 2016, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Notices
This document is provided for informational purposes only. It represents AWS’s current product offerings and
practices as of the date of issue of this document, which are subject to change without notice. Customers are
responsible for making their own independent assessment of the information in this document and any use of
AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether express or
implied. This document does not create any warranties, representations, contractual commitments,
conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of
AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify,
any agreement between AWS and its customers.
The Streaming Analytics Pipeline is licensed under the terms of the Amazon Software License available
at https://aws.amazon.com/asl/.