Introduction to Amazon Redshift

May, 2014 / Abdullah Cetin CAVDAR @accavdar

http://www.linkedin.com/in/accavdar

http://twitter.com/accavdar

What's Amazon Redshift?
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud

https://aws.amazon.com/redshift/


Features
Petabyte scale, massively parallel
Relational data warehouse
Fully managed, zero admin
SSD and HDD platforms
$999/TB/Year

Architecture

Client Applications
Integrates with various data loading and ETL (Extract, Transform, and Load) tools and business intelligence (BI) reporting, data mining, and analytics tools
Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes

Connections
Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers
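Because the drivers are the standard PostgreSQL ones, a connection string for a cluster looks like an ordinary PostgreSQL connection string. The sketch below only builds the strings and does not open a connection. The endpoint, database name, and user are hypothetical placeholders; 5439 is the default Redshift port.

```python
from urllib.parse import quote


def postgres_jdbc_url(host: str, database: str, port: int = 5439) -> str:
    """Build a PostgreSQL-style JDBC URL for a cluster endpoint."""
    return f"jdbc:postgresql://{host}:{port}/{quote(database)}"


def libpq_dsn(host: str, database: str, user: str,
              port: int = 5439, require_ssl: bool = True) -> str:
    """Build a libpq key/value DSN, the format PostgreSQL drivers accept."""
    parts = {"host": host, "port": port, "dbname": database, "user": user}
    if require_ssl:
        # Ask the driver to refuse unencrypted connections.
        parts["sslmode"] = "require"
    return " ".join(f"{key}={value}" for key, value in parts.items())


# Hypothetical endpoint, used for illustration only.
HOST = "examplecluster.abc123.us-east-1.redshift.amazonaws.com"

print(postgres_jdbc_url(HOST, "dev"))
print(libpq_dsn(HOST, "dev", "analyst"))
```

The password is deliberately left out of both strings; it would normally be supplied separately by the client tool or driver.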

Clusters
A cluster is composed of one or more compute nodes
Leader Node coordinates the compute nodes and handles external communication

Leader Node
Manage communications with client programs and communications with compute nodes
Store metadata
Coordinate query execution

Compute Nodes
Execute the compiled code and send intermediate results back to the leader node for final aggregation
Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type
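The division of labour between the leader node and the compute nodes can be illustrated with a toy scatter-gather aggregation. This is a simplified simulation, not Redshift's actual execution engine: the function names and sample rows are made up, and the "nodes" run one after another in a single process.

```python
from collections import Counter


def compute_node(rows):
    """One compute node: partial COUNT and SUM per group over its local rows."""
    counts, sums = Counter(), Counter()
    for group, value in rows:
        counts[group] += 1
        sums[group] += value
    return counts, sums


def leader_node(slices):
    """Leader: hand work to every node, then merge the intermediate results."""
    total_counts, total_sums = Counter(), Counter()
    for rows in slices:  # on a real cluster these would run in parallel
        counts, sums = compute_node(rows)
        total_counts.update(counts)  # Counter.update adds the partial counts
        total_sums.update(sums)
    # Final aggregation: (row count, average) per group.
    return {g: (total_counts[g], total_sums[g] / total_counts[g])
            for g in total_counts}


# Hypothetical data: (country, amount) rows spread over three nodes.
slices = [
    [("tr", 10), ("us", 20)],
    [("tr", 30)],
    [("us", 40), ("us", 60)],
]
print(leader_node(slices))
```

Each node returns a partial sum and count rather than a local average, because averages of averages cannot be merged correctly when the nodes hold different numbers of rows.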

Databases
A cluster contains one or more databases
User data is stored on the compute nodes
Amazon Redshift is a Relational Database Management System (RDBMS)
Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets
Amazon Redshift is based on PostgreSQL

Redshift reduces I/O
Column storage - read only the data you need
Data compression - analyzes and compresses your data
Zone maps
  Keep track of the minimum and maximum value for each block
  Skip over blocks that don't contain data needed for a given query
  Minimize unnecessary I/O
Direct attached storage
  Hardware optimized for high-performance data processing
Large data block sizes
  Large block sizes to make the most of each read
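The zone map idea above can be sketched in a few lines: store the minimum and maximum of every block, and skip any block whose range cannot overlap the query's range. This is an illustrative model only; the `Block` class, the block size of 10, and the sample column are assumptions, and real blocks are far larger.

```python
from dataclasses import dataclass


@dataclass
class Block:
    values: list
    min_value: int  # zone map entry: smallest value in the block
    max_value: int  # zone map entry: largest value in the block


def build_blocks(column, block_size):
    """Split a column into fixed-size blocks and record min/max for each."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def scan_range(blocks, lo, hi):
    """Return values in [lo, hi] and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block.max_value < lo or block.min_value > hi:
            continue  # the zone map proves nothing here matches: skip the read
        blocks_read += 1
        matches.extend(v for v in block.values if lo <= v <= hi)
    return matches, blocks_read


column = list(range(1, 101))       # a sorted column of 100 values
blocks = build_blocks(column, 10)  # 10 blocks of 10 values each
matches, blocks_read = scan_range(blocks, 42, 47)
print(len(blocks), blocks_read, matches)
```

In this run only one of the ten blocks has to be read. The benefit depends on the data being sorted on the filtered column; with randomly ordered values most blocks would span the whole range and few could be skipped.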

Redshift runs on optimized hardware
Optimized for I/O intensive workloads
High disk density
Runs in HPC - fast network

Redshift parallelizes and distributes everything
Query
Load
Backup/Restore
Resize

Redshift is easy to use
Provision in minutes
Monitor query performance
Point and click resize
Built in security
Automatic backups

Redshift has security built-in
SSL to secure data in transit
Encryption to secure data at rest
  AES 256 - hardware accelerated
  All blocks on disk and in Amazon S3 encrypted
No direct access to compute nodes
Amazon VPC support

Redshift backs up your data and recovers from failures
Replication within the cluster and backup to Amazon S3
Backups to Amazon S3 are continuous, automatic and incremental
Continuous monitoring and automated recovery from failures
Able to restore snapshots to any Availability Zone

Use Cases

Traditional Enterprise DW
Reduce costs by extending DW rather than adding HW
Migrate completely from existing DW systems
Respond faster to business

Companies with Big Data
Improve performance by an order of magnitude
Make more data available for analysis
Access business data via standard reporting tools

SaaS Companies
Add analytic functionality to applications
Scale DW capacity as demand grows
Reduce HW and SW costs by an order of magnitude

Use Case: SkillPages

http://www.skillpages.com/

Data Architecture

Redshift Implementation
High Storage Extra Large (XL) DW Node
ETL Activities
  Approx. 90 minutes including exports from RDBMS, copying to S3, loading stage tables, loading target tables, vacuuming and analysing tables
Schema
Compression
Retention
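The ETL steps listed above (load stage tables from S3, load target tables, vacuum, analyse) map onto a short sequence of SQL statements. The sketch below only generates that sequence as strings; it is not the SkillPages pipeline. The table names, S3 prefix, IAM role, and the pipe-delimited gzip file format are all assumptions made for illustration.

```python
def etl_statements(target: str, stage: str, s3_prefix: str, iam_role: str) -> list:
    """Return the SQL steps for one load cycle, in execution order.

    Values are interpolated directly, so pass trusted identifiers only.
    """
    return [
        # Start each cycle with an empty stage table.
        f"TRUNCATE {stage};",
        # Bulk-load the exported files from S3 into the stage table.
        f"COPY {stage} FROM '{s3_prefix}' IAM_ROLE '{iam_role}' DELIMITER '|' GZIP;",
        # Move the staged rows into the target table.
        f"INSERT INTO {target} SELECT * FROM {stage};",
        # Re-sort rows and reclaim space, then refresh planner statistics.
        f"VACUUM {target};",
        f"ANALYZE {target};",
    ]


# Hypothetical names, for illustration only.
for statement in etl_statements(
    target="users",
    stage="stage_users",
    s3_prefix="s3://example-bucket/exports/users/",
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
):
    print(statement)
```

Loading into a stage table first keeps the target table untouched if the COPY fails partway, at the cost of one extra copy of the batch.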

DW Anatomy

Why Redshift works for SkillPages?
Scale - MPP
Performance - Columnar data access and compression
Platform Integration - S3, DynamoDB
Operational Advantages
Ease of Access
Cost

Best Practices
Avoid large numbers of singleton Data Manipulation Language (DML) statements if possible
Use COPY for uploading large datasets
Choose SORT and DISTRIBUTION keys with care
Encode date and time with the TIMESTAMP data type
Experiment with WLM (Workload Manager) settings
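Why the distribution key deserves care can be seen in a small simulation: rows are assigned to slices by hashing the key, so a key with few distinct values piles most rows onto a few slices. The sketch below is a rough model under stated assumptions. It uses SHA-256 as a stand-in hash (Redshift's real distribution hash is internal), a made-up table of 10,000 rows, and 4 slices.

```python
import hashlib


def slice_for(key, num_slices: int) -> int:
    """Map a key to a slice with a stand-in hash (not Redshift's own)."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_slices


def rows_per_slice(rows, dist_column: str, num_slices: int) -> list:
    """Count how many rows land on each slice for a given distribution column."""
    counts = [0] * num_slices
    for row in rows:
        counts[slice_for(row[dist_column], num_slices)] += 1
    return counts


def skew(counts) -> float:
    """Fullest slice divided by the average slice; 1.0 means perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


# Hypothetical table: unique user_id, and a country column with two values.
rows = [{"user_id": i, "country": "tr" if i % 10 else "us"} for i in range(10_000)]

by_user = rows_per_slice(rows, "user_id", 4)
by_country = rows_per_slice(rows, "country", 4)
print("user_id:", by_user, round(skew(by_user), 2))
print("country:", by_country, round(skew(by_country), 2))
```

With the high-cardinality `user_id` the slices end up close to even, while `country` leaves at least two of the four slices empty, so the busiest slice does most of the work for every query.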

Slides
https://github.com/accavdar/AmazonRedshift

THE END
by Abdullah Cetin CAVDAR / @accavdar
