1
Amazon Redshift
Author: Douglas Bernardini
2
What is Redshift?
• Cloud-hosted data warehouse service from AWS
• Massively parallel processing (MPP)
• Analytics workloads on large-scale datasets
• Data stored on a column-oriented DBMS principle
• Scales up to petabytes
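To make the column-oriented point concrete, here is a minimal sketch comparing a row layout with a column layout for a single aggregate. The table contents and field names are made up for illustration; this is a toy model, not Redshift's actual storage format.

```python
# Illustrative only: toy comparison of row-oriented and column-oriented layouts.
# The rows and field names below are hypothetical sample data.

rows = [
    {"song_id": 1, "artist": "A", "year": 1999, "duration": 210.0},
    {"song_id": 2, "artist": "B", "year": 2004, "duration": 185.5},
    {"song_id": 3, "artist": "A", "year": 2004, "duration": 240.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: a scan reads every field of every row, even for one column.
values_touched_row = sum(len(record) for record in rows)
avg_row = sum(record["duration"] for record in rows) / len(rows)

# Column layout: only the "duration" column has to be read.
values_touched_col = len(columns["duration"])
avg_col = sum(columns["duration"]) / len(columns["duration"])

print("average duration:", avg_row, avg_col)
print("values read (row vs column):", values_touched_row, values_touched_col)
```

Both layouts give the same answer, but the column layout reads far fewer values, which is why it suits analytics queries that aggregate a few columns over many rows.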
3
Features and Benefits
• Columnar storage
• Parallelized queries
• Multiple nodes
• Custom JDBC and ODBC drivers
• Ready-integrated with:
  • Amazon S3
  • Amazon DynamoDB
  • Amazon Elastic MapReduce
  • Amazon Kinesis
  • Any SSH-enabled host
• Fault tolerant
• Automated backups
• Fast restores
• Secure:
  • Encryption
  • Network isolation
  • Audit and compliance
• SQL friendly
4
Marketplace
BI Tools
• Actian
• Actuate Corporation
• Birst
• Chartio
• ClearStory Data
• Dundas Data Visualization
• Infor
• Jaspersoft
• JReport
• Logi Analytics
• Looker (Software)
• MicroStrategy
• Pentaho
• Periscope.io
• Qlik
• Redrock BI
• SAS (software)
• SiSense
• Spotfire
• Tableau Software
Data Integration Tools
• Attunity
• FlyData
• Informatica
• SnapLogic
• Talend
• Xplenty
5
Data Load
6
DynamoDB Integration
7
DynamoDB Integration
8
Business Case
9
Data growing fast!
• Enterprise data is growing at an exponential rate
  • Structured and unstructured data
  • Data requirements change rapidly
• Cost to maintain data is prohibitive
  • Hardware not scalable
  • Expensive to support
• Business agility suffers
  • Reporting unable to change with the pace of business
  • Data silos create bottlenecks
10
Solution Proposal
• Leverage the flexibility of Amazon Web Services
  • Scalable
  • Flexible
  • Cost-effective
• AWS Redshift
  • Data warehouse
• AWS S3
  • Persistent storage
• AWS Data Pipeline
  • Data orchestration and ETL
• AWS EC2 / MySQL
  • Transaction processing
• Qlik Sense Desktop
  • Business intelligence reporting
11
AWS Redshift
Petabyte-Scale Data Warehouse
• Optimized for DW
  • Columnar storage
  • Data compression
  • Zone maps to reduce I/O
• Scalable
  • Easily change the number of nodes
  • 1-32 node configurations
• Cost-efficient
  • On-demand pricing starts at $0.25/hr
  • Run for as little as $1,000 per TB per year
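The "zone maps to reduce I/O" bullet can be illustrated with a small sketch: keep the minimum and maximum value of each storage block, and skip any block whose range cannot contain the value being searched for. The block size and the sample `years` column are assumptions made for this example; it is a toy model, not Redshift's internal implementation.

```python
# Illustrative only: a toy zone map over one sorted column.
# Block size and data are hypothetical.


def build_zone_map(values, block_size):
    """Split a column into blocks and record (min, max) for each block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    zone_map = [(min(block), max(block)) for block in blocks]
    return blocks, zone_map


def scan_equal(blocks, zone_map, target):
    """Return the matching values and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for block, (low, high) in zip(blocks, zone_map):
        if target < low or target > high:
            continue  # this block cannot contain the target, so skip it
        blocks_read += 1
        matches.extend(value for value in block if value == target)
    return matches, blocks_read


years = [1990, 1991, 1991, 1995, 1998, 1999, 2001, 2004, 2004, 2010, 2011, 2012]
blocks, zone_map = build_zone_map(years, block_size=4)
matches, blocks_read = scan_equal(blocks, zone_map, target=2004)

print("zone map:", zone_map)
print("matches:", matches, "blocks read:", blocks_read, "of", len(blocks))
```

The benefit depends on how well the data is sorted: when values are scattered randomly across blocks, most block ranges overlap the target and few blocks can be skipped.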
12
AWS Redshift
Petabyte-Scale Data Warehouse
• Get started in minutes
  • Web console
  • CLI
• Fully managed
• Fault tolerant
• Automated backups / fast restores
• Encryption
  • Data at rest – AES-256
  • Can manage own keys
• Compatible
  • SQL
  • Data integrations
13
AWS Simple Storage Service (S3)
Online File/Object Storage
• Durable
  • Data redundantly stored across multiple facilities/devices
• Available
  • 99.99% availability
  • Choose from different AWS regions
• Secure
  • SSL – data transfer
  • At rest – auto-encrypted
• Scalable
  • Flexible capacity based on data demands
• Low cost
  • Pay for what you use
14
AWS Data Pipeline
Data Processing and Transfer Platform
• Reliable
  • Distributed infrastructure ensures activity completion
  • Integrated with SNS for event notifications
• Simple
  • Drag-and-drop console
  • Pre-built templates for other AWS services
  • Visual pipeline editor
• Scalable
  • Dispatch work to one machine or many
  • Serial and/or parallel processing
• Low cost
  • Charged per pipeline
    • Frequency
    • Volume
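The "serial and/or parallel processing" idea can be sketched as a dependency graph: activities with no unmet dependencies can run side by side, while dependent activities wait their turn. The sketch below uses Python's standard `graphlib` module (Python 3.9+) purely as an illustration. The activity names are hypothetical, `export_logs_to_s3` is an invented second source added only to show parallelism, and none of this is the AWS Data Pipeline API.

```python
from graphlib import TopologicalSorter

# Hypothetical activities; each name maps to the set of activities it depends on.
dependencies = {
    "export_mysql_to_s3": set(),
    "export_logs_to_s3": set(),  # invented second source, for illustration only
    "load_s3_to_redshift": {"export_mysql_to_s3", "export_logs_to_s3"},
}


def plan_batches(deps):
    """Group activities into batches; activities in one batch could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = plan_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Here the two exports form the first batch and the Redshift load runs alone in the second, since it needs both exports to finish first.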
15
AWS Elastic Compute Cloud (EC2) + MySQL
Cloud Infrastructure for Applications & Development
• Flexible
  • Linux and Windows virtual machines
  • Supports multiple instance types, software packages, resource configs
• Elastic
  • Increase/decrease capacity within minutes
  • Commission any number of server instances simultaneously
• Secure
  • Security Groups / Network ACLs
  • VPC / VPN
• Low cost
  • On-Demand / Reserved / Spot Instance options
16
Qlik Sense Desktop
Data Visualization / BI Tool
• Drag-and-drop Visualizations
• Smart Search
• Explore Multiple data sources in single dashboard/report
• Access analytics on multiple device types
• Collaborate and share insights within reports
• Enables self-service simplicity
17
Architecture
18
Demo
19
Tech Demo
• This demonstration covers the setup and use of Amazon Redshift as an on-demand, cloud-based data warehouse solution.
• Our sample data comes from the “Million Song Dataset” available from Columbia University - http://labrosa.ee.columbia.edu/millionsong/
• The BI Tool that is used to create a business-focused dashboard is Qlik Sense Desktop, a Windows-based desktop application - http://www.qlik.com/us/explore/products/sense
• In addition, the following services in the Amazon Web Services stack are used: Amazon Redshift, Amazon S3, AWS Data Pipeline, and Amazon EC2 (a Linux AMI running MySQL serves as the transactional database for the demo).
20
Demo Steps
1. Create new Linux AMI that will host MySQL for transaction data processing.
  • Start new Linux instance and update security groups for MySQL accessibility
  • Install MySQL
  • Create new MySQL users and database, and populate with demonstration dataset (using MySQL Workbench)
2. Create new S3 bucket for Pipeline ETL processes
3. Create Redshift Cluster (data warehouse)
  • Instantiate cluster
  • Connect using SQL Workbench (via JDBC)
  • Create initial data table
4. Create AWS Pipeline(s) for data processing
  • MySQL -> S3
    • Activate Pipeline for initial ETL from MySQL to S3
  • S3 -> Redshift
    • Activate Pipeline for initial ETL from S3 to Redshift
5. Install Qlik Sense Desktop
  • Install Redshift ODBC drivers locally on desktop
  • Create Qlik Sense “Report” (included in FP submission for simplicity). Verify initial data in report.
6. Solution Demonstration (using Amazon CLI – Command Line Interface)
  • Simulate transactional data load in MySQL
  • Verify new data (record count) in MySQL using MySQL Workbench
  • Delete initial data in S3 bucket (from Round 1)
  • Trigger AWS Pipeline that loads data to S3 from MySQL
  • Verify data load (CSV file) in S3 bucket
  • Trigger AWS Pipeline that loads data to Redshift from S3
  • Verify data load in Redshift (using SQL Workbench)
  • Refresh Qlik report to view analytics of initial data load
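For the S3 -> Redshift step, a common way to bulk-load CSV files is Redshift's `COPY` command. The demo itself runs this step through a pipeline, so the sketch below only shows what an equivalent hand-written statement might look like. The bucket name, key prefix, and IAM role ARN are hypothetical placeholders, and the header-row setting is an assumption about the exported CSV.

```python
# Illustrative only: builds the text of a Redshift COPY statement.
# Bucket, prefix, and role ARN below are hypothetical placeholders.


def build_copy_statement(table, bucket, key_prefix, iam_role_arn, ignore_header=1):
    """Return a COPY statement that loads CSV files under an S3 prefix.

    Values are interpolated directly into the SQL text, so pass only
    trusted, validated names; this helper does no escaping.
    """
    return (
        f"COPY {table}\n"
        f"FROM 's3://{bucket}/{key_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS CSV\n"
        f"IGNOREHEADER {ignore_header};"
    )


statement = build_copy_statement(
    table="song_data",
    bucket="example-pipeline-bucket",
    key_prefix="mysql-export/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(statement)
```

The resulting statement could be run from SQL Workbench against the cluster; set `ignore_header=0` if the exported CSV has no header row.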
21
Linux AMI hosts MySQL
22
Redshift Cluster
23
Pipes
24
QlikSense Desktop
25
Add New data into MySQL
Insert songs_data
Count (*)
26
Checking Redshift
Select count (*) from song_data
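The record-count check across the two systems can be sketched as follows. Two in-memory SQLite databases stand in for MySQL and Redshift so the example runs anywhere; a real check would use the respective database drivers instead. The table names follow the slides (`songs_data` on the MySQL side, `song_data` on the Redshift side), while the columns and sample rows are made up.

```python
import sqlite3

# Illustrative only: SQLite stands in for both MySQL (source) and Redshift
# (warehouse). Columns and sample rows are hypothetical.


def count_rows(connection, table):
    """Return SELECT COUNT(*) for a table.

    Table names cannot be bound as query parameters, so only trusted
    names should be passed here.
    """
    cursor = connection.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE songs_data (song_id INTEGER, title TEXT)")
source.executemany(
    "INSERT INTO songs_data VALUES (?, ?)",
    [(1, "first"), (2, "second"), (3, "third")],
)

# Simulate the ETL by copying every source row into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE song_data (song_id INTEGER, title TEXT)")
exported = source.execute("SELECT song_id, title FROM songs_data").fetchall()
warehouse.executemany("INSERT INTO song_data VALUES (?, ?)", exported)

source_count = count_rows(source, "songs_data")
warehouse_count = count_rows(warehouse, "song_data")
print("source:", source_count, "warehouse:", warehouse_count)
print("counts match:", source_count == warehouse_count)
```

Matching counts only show that the same number of rows arrived; comparing checksums or sampled rows would be a stronger check.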
27
Qlik Update
28
Results
• Amazon Web Services provides a powerful platform to extend on-premises infrastructure to the cloud
  • Enables massive data consolidation
  • Efficient ETL orchestration and workflow
  • Simplifies resource management and drives down computing costs across multiple services
• Changing needs of business executives can be met quickly and efficiently
  • AWS supports industry-standard data source connections
  • Existing reporting/dashboards can consume AWS Redshift data with no code changes