Introduction to Amazon Redshift
DESCRIPTION
This presentation summarizes the Amazon Redshift data warehouse service, its architecture, and best practices for application development using Amazon Redshift.
TRANSCRIPT
Introduction to Amazon Redshift
May, 2014 / Abdullah Cetin CAVDAR @accavdar
What's Amazon Redshift?
  Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud
https://aws.amazon.com/redshift/
Features
  Petabyte scale, massively parallel
  Relational data warehouse
  Fully managed, zero admin
  SSD and HDD platforms
  $999/TB/Year
Architecture
Client Applications
  Integrates with various data loading and ETL (Extract, Transform, and Load) tools and business intelligence (BI) reporting, data mining, and analytics tools
  Redshift is based on industry-standard PostgreSQL, so most existing SQL client applications will work with only minimal changes
Connections
  Redshift communicates with client applications by using industry-standard PostgreSQL JDBC and ODBC drivers
Clusters
  A cluster is composed of one or more compute nodes
  The leader node coordinates the compute nodes and handles external communication
Leader Node
  Manages communication with client programs and with compute nodes
  Stores metadata
  Coordinates query execution
Compute Nodes
  Execute the compiled code and send intermediate results back to the leader node for final aggregation
  Each node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type
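
The split of work between compute nodes and the leader node can be illustrated with a small, self-contained simulation. This is only a sketch, not how Redshift is implemented: the row layout, the modulo-based distribution, and the names `distribute`, `compute_node_partial`, and `leader_average` are all hypothetical.

```python
from collections import defaultdict


def distribute(rows, num_nodes, key):
    """Assign each row to a compute node by its distribution key (illustrative modulo scheme)."""
    slices = defaultdict(list)
    for row in rows:
        slices[row[key] % num_nodes].append(row)
    return [slices[i] for i in range(num_nodes)]


def compute_node_partial(rows, column):
    """A compute node works only on its own rows and returns an intermediate (sum, count)."""
    values = [row[column] for row in rows]
    return sum(values), len(values)


def leader_average(partials):
    """The leader node merges the intermediate results into the final aggregate."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Hypothetical data: eight rows spread across four compute nodes.
rows = [{"user_id": i, "amount": 10 * i} for i in range(1, 9)]
node_slices = distribute(rows, num_nodes=4, key="user_id")
partials = [compute_node_partial(s, "amount") for s in node_slices]
average = leader_average(partials)
print(partials, average)
```

Each node returns a small (sum, count) pair instead of its raw rows, so the leader only has to combine a handful of intermediate values.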
Databases
  A cluster contains one or more databases
  User data is stored on the compute nodes
  Amazon Redshift is a Relational Database Management System (RDBMS)
  Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets
  Amazon Redshift is based on PostgreSQL
Redshift reduces I/O
  Column storage - read only the data you need
  Data compression - analyzes and compresses your data
  Zone Map
    Keeps track of the minimum and maximum value for each block
    Skips over blocks that don't contain data needed for a given query
    Minimizes unnecessary I/O
  Direct attached storage
    Hardware optimized for high performance data processing
  Large data block sizes
    Large block sizes to make the most of each read
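
The zone map idea (per-block minimum and maximum, used to skip blocks) can be sketched in a few lines. This is an illustrative model rather than Redshift's internal format: the block size of 10 values and the function names `build_zone_map` and `scan_range` are assumptions made for the example.

```python
def build_zone_map(values, block_size):
    """Split a column into fixed-size blocks and record each block's (min, max)."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    zone_map = [(min(b), max(b)) for b in blocks]
    return blocks, zone_map


def scan_range(blocks, zone_map, low, high):
    """Return values in [low, high] and the number of blocks that had to be read."""
    matches, blocks_read = [], 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < low or block_min > high:
            continue  # the zone map proves this block holds no matching value
        blocks_read += 1
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read


# Hypothetical sorted column of 100 values stored in 10 blocks.
column = list(range(1, 101))
blocks, zone_map = build_zone_map(column, block_size=10)
matches, blocks_read = scan_range(blocks, zone_map, low=42, high=57)
print(blocks_read, "of", len(blocks), "blocks read")
```

In this model the skipping only pays off when the column is sorted or nearly sorted, so that each block covers a narrow range of values. On unsorted data most blocks would overlap the query range and still be read.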
Redshift runs on optimized hardware
  Optimized for I/O intensive workloads
  High disk density
  Runs in HPC - fast network
Redshift parallelizes and distributes everything
  Query
  Load
  Backup/Restore
  Resize
Redshift is easy to use
  Provision in minutes
  Monitor query performance
  Point and click resize
  Built in security
  Automatic backups
Redshift has security built-in
  SSL to secure data in transit
  Encryption to secure data at rest
    AES 256 - hardware accelerated
    All blocks on disk and in Amazon S3 encrypted
  No direct access to compute nodes
  Amazon VPC support
Redshift backs up your data and recovers from failures
  Replication within the cluster and backup to Amazon S3
  Backups to Amazon S3 are continuous, automatic, and incremental
  Continuous monitoring and automated recovery from failures
  Able to restore snapshots to any Availability Zone
Use Cases
Traditional Enterprise DW
  Reduce costs by extending DW rather than adding HW
  Migrate completely from existing DW systems
  Respond faster to business
Companies with Big Data
  Improve performance by an order of magnitude
  Make more data available for analysis
  Access business data via standard reporting tools
SaaS Companies
  Add analytic functionality to applications
  Scale DW capacity as demand grows
  Reduce HW and SW costs by an order of magnitude
Use Case: SkillPages
Data Architecture
Redshift Implementation
  High Storage Extra Large (XL) DW Node
  ETL Activities
    Approx. 90 minutes including exports from RDBMS, copying to S3, loading stage tables, loading target tables, vacuuming and analysing tables
  Schema
  Compression
  Retention
DW Anatomy
Why Redshift works for SkillPages?
  Scale - MPP
  Performance - Columnar data access and compression
  Platform Integration - S3, Dynamo
  Operational Advantages
  Ease of Access
  Cost
Best Practices
  Avoid a large number of singleton Data Manipulation Language (DML) statements if possible
  Use COPY for uploading large datasets
  Choose SORT and DISTRIBUTION keys with care
  Encode date and time with the TIMESTAMP data type
  Experiment with WLM (Workload Manager) settings
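
Several of these practices can be made concrete with a small sketch that builds the SQL text for a table definition and a bulk load. Everything specific here is an assumption for illustration and does not come from the slides: the table `page_views`, its columns, the S3 path, the IAM role ARN, and the helper names `create_table_sql` and `copy_sql`. The statements are only printed, not executed. `IAM_ROLE` is one of the authorization options that COPY accepts in current Redshift documentation.

```python
def create_table_sql(table, columns, distkey, sortkey):
    """Build a CREATE TABLE statement with explicit distribution and sort keys."""
    column_defs = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE TABLE {table} (\n  {column_defs}\n)\n"
        f"DISTKEY({distkey})\n"
        f"SORTKEY({sortkey});"
    )


def copy_sql(table, s3_path, iam_role):
    """Build one COPY statement that bulk-loads gzipped CSV files from S3."""
    # Plain string formatting is for illustration only; do not build SQL
    # this way from untrusted input.
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "GZIP;"
    )


# Hypothetical table: date and time stored as TIMESTAMP, not as text.
ddl = create_table_sql(
    "page_views",
    [("user_id", "BIGINT"), ("viewed_at", "TIMESTAMP"), ("url", "VARCHAR(2048)")],
    distkey="user_id",
    sortkey="viewed_at",
)
load = copy_sql(
    "page_views",
    "s3://example-bucket/page_views/",  # placeholder bucket and prefix
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # placeholder role
)
print(ddl)
print(load)
```

A single COPY from a set of files in S3 replaces what would otherwise be many singleton INSERT statements, and it lets the load be parallelized across the compute nodes.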
Slides
  https://github.com/accavdar/AmazonRedshift
THE END
by Abdullah Cetin CAVDAR / @accavdar