redshift vs teradata an in-depth comparison


Post on 16-Oct-2021



EBOOK

AMAZON REDSHIFT TERADATA


Table of Contents

Redshift Vs Teradata

Redshift Architecture & Its Features

Teradata Architecture & Its Features

Redshift Data Model

Teradata Data Model

Redshift Pros and Cons (Pros, Cons)

Teradata Pros and Cons (Pros, Cons)

Features supported only by Redshift, not Teradata

Features supported only by Teradata, not Redshift

Redshift Vs Teradata In A Nutshell

Pricing and Effort Comparison

When and How to Migrate data from Teradata to Redshift

Summary

ETL Challenges While Working With Amazon Redshift

   


Redshift Vs Teradata 

Redshift versus Teradata is one of the most debated data warehouse comparisons. This ebook compares the two in detail.

Redshift Architecture & Its Features  

Redshift is a fully managed, petabyte-scale data warehouse in the cloud. You can start with a few gigabytes or terabytes of data and scale up to petabytes as your business requires. A Redshift engine is called a cluster, and it is built from one or more nodes. There are two types of nodes: the Leader node and Compute nodes. Each Compute node contains two or more slices, depending on the node type. The Leader node plays several roles, including communicating with JDBC/ODBC clients and creating the query execution plan that it hands to the Compute node(s). A cluster is incomplete without a Leader node.
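The division of work between the Leader node and the slices can be pictured with a small Python sketch. This is only a toy model under stated assumptions: the functions build_cluster and leader_sum are hypothetical names, rows are spread round-robin, and real Redshift planning and execution are far more involved.

```python
def build_cluster(values, num_nodes=2, slices_per_node=2):
    """Model a cluster as a list of compute nodes, each a list of slices."""
    cluster = [[[] for _ in range(slices_per_node)] for _ in range(num_nodes)]
    slices = [s for node in cluster for s in node]
    # Spread the rows round-robin so every slice holds a share of the data.
    for i, value in enumerate(values):
        slices[i % len(slices)].append(value)
    return cluster


def leader_sum(cluster):
    """Leader role: collect one partial sum per slice, then combine them."""
    partials = [sum(s) for node in cluster for s in node]
    return sum(partials), partials


cluster = build_cluster(range(1, 11))
total, partials = leader_sum(cluster)
print(total, partials)
```

Each slice works only on its own share of the rows, and the leader combines the partial results into the final answer.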

You can check out our blog for a detailed article on Redshift Architecture. 


Teradata Architecture & Its Features   

Teradata is an RDBMS built for data warehousing and traditionally deployed on-premise. In that form it requires installation and setup rather than being offered as a fully managed cloud service, although you can spin up a Teradata instance on a cloud VM. Teradata is designed on an MPP (massively parallel processing) shared-nothing architecture.

[Figure: diagrammatic representation of the Teradata architecture]

        


The four major components of Teradata are as follows:

1. Node: The basic unit of a Teradata system. Each node has its own OS, CPU, RAM, disk space, etc.

2. Parsing Engine (PE): The Parsing Engine is responsible for preparing the query execution plan.

3. BYNET: BYNET receives the query execution plan from the PE and transfers it to the AMPs (also known as Virtual Processors), and carries results back. It is also called the message passing layer, and it handles communication between the AMPs. Multi-node systems have at least two BYNETs for fault tolerance: if the first BYNET fails, the additional BYNET takes over.

4. Access Module Processor (AMP): Each AMP manages the processing of its share of the data, which it stores in vDisks. A row can be stored on any AMP, depending on the result of the hash algorithm.
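To make the idea of hash-based row placement concrete, here is a minimal Python sketch. It is an illustration only: amp_for and NUM_AMPS are hypothetical names, and MD5 merely stands in for Teradata's own hashing, which is not reproduced here.

```python
import hashlib
from collections import Counter

NUM_AMPS = 4  # hypothetical small system


def amp_for(primary_index_value, num_amps=NUM_AMPS):
    """Map a primary index value to an AMP via a stable hash (MD5 is a stand-in)."""
    digest = hashlib.md5(str(primary_index_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


rows = [{"store_id": i, "name": f"store-{i}"} for i in range(1000)]
# Count how many rows land on each AMP.
placement = Counter(amp_for(row["store_id"]) for row in rows)
print(sorted(placement.items()))
```

Because the same value always hashes to the same AMP, a lookup by that value only needs to visit one AMP, and a column with many distinct values tends to spread rows fairly evenly.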

   


Redshift Data Model

The Redshift data model is designed for data warehousing. Its distinctive features make it a smart data warehouse choice.

1. Redshift is a fully managed data warehouse. You don't have to worry about setting up and installing the database. You just spin up your cluster and the database is ready.

2. Redshift's backup and restore are fully automatic. Through automatic snapshots, data in Redshift is backed up to S3 internally at regular intervals.

3. Access is controlled through inbound security rules and SSL connections. A cluster in VPC mode is protected by the VPC, and a classic-mode cluster by inbound security rules.

4. Redshift stores data in a columnar format, unlike row-based data warehouses. For example, if your query touches one specific column, Redshift reads only that column instead of entire rows. This can save a great deal of query processing time.

5. Data is stored in blocks of 1 MB instead of the typical 8 KB or 64 KB, which lets Redshift store more data in a single block.

6. Redshift does not have the concept of indexes. Instead, it has zone maps. A zone map records the lowest and highest value of a column in each block, so the cluster can tell which blocks need to be read and skip the rest.

7. Redshift has column compression (encoding). The ANALYZE COMPRESSION command suggests which encoding to apply to each column of a table. Redshift provides various encoding techniques. Refer to the AWS documentation for more details on encoding.


8. Redshift caches the results of repeat queries for faster performance. To check whether a query used the cache, look at the source_query column in SVL_QLOG. If the query used the cache, source_query holds the ID of the query whose cached result was reused; otherwise it is NULL.

Example:

SELECT USERID, QUERY, ELAPSED, SOURCE_QUERY
FROM SVL_QLOG
WHERE USERID IN (600, 601);

 

In the output below, query 853219, run by user 601, reused the cached result of query 123456, run by user 600. As a result, its elapsed time (in microseconds) dropped drastically.

USERID | QUERY  | ELAPSED | SOURCE_QUERY
-------+--------+---------+-------------
   600 | 123456 |   90000 | NULL
   600 | 567890 |   80000 | NULL
   601 | 853219 |      30 | 123456

 

9. The Redshift data model is similar to that of a typical data warehouse when it comes to analytical queries. You can create fact tables, dimension tables, and views. It supports all major query constructs, such as inner joins, outer joins, subqueries, and common table expressions (WITH clause).

 

10. From a storage perspective, a Redshift cluster maintains multiple copies of your data for fault tolerance.
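The zone map idea from point 6 can be sketched in a few lines of Python. This is a simplified model, not Redshift's implementation: build_blocks and scan_equal are hypothetical helpers, and blocks hold 10 values instead of 1 MB of data.

```python
def build_blocks(column_values, block_size):
    """Split a column into fixed-size blocks and record (min, max) per block."""
    blocks = []
    for start in range(0, len(column_values), block_size):
        chunk = column_values[start:start + block_size]
        blocks.append({"values": chunk, "min": min(chunk), "max": max(chunk)})
    return blocks


def scan_equal(blocks, target):
    """Return matching values and how many blocks actually had to be read."""
    matches, blocks_read = [], 0
    for block in blocks:
        # Zone map check: skip the block if the target cannot be inside it.
        if block["min"] <= target <= block["max"]:
            blocks_read += 1
            matches.extend(v for v in block["values"] if v == target)
    return matches, blocks_read


sorted_column = list(range(100))  # e.g. a column stored in sort order
interleaved = [i % 10 * 10 + i // 10 for i in range(100)]  # same values, scattered

print(scan_equal(build_blocks(sorted_column, 10), 42))
print(scan_equal(build_blocks(interleaved, 10), 42))
```

With the sorted column only one of the ten blocks overlaps the value 42, while the scattered column forces all ten blocks to be read. This is why zone maps work best together with well-chosen sort keys.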


Teradata Data Model

1. Teradata is a massively parallel data warehouse with a shared-nothing architecture. However, unlike Redshift, it stores data in a row-based format.

2. Teradata uses several kinds of indexes for fast data retrieval, including Primary, Secondary, Join, and Hash indexes. Note that a Secondary Index does not affect the distribution of rows across AMPs, although it does add processing overhead.

3. Teradata supports and enforces Primary and Secondary indexes.

4. Teradata has a hybrid storage concept in which frequently used data is stored on SSD while less frequently accessed data is stored on HDD.

5. Teradata supports table partitioning, unlike Redshift.

6. Teradata uses a hash algorithm to distribute data across its disk storage units.

7. Teradata can scale up to 2048 nodes. Its storage capacity ranges from 10 TB to 94 petabytes, which is higher than Redshift's.

8. Teradata supports the major SQL-related features (Primary Index, Secondary Index, Sequences, Stored Procedures, User Defined Functions, Macros, etc.) that are commonly expected of a data warehouse RDBMS.

9. Teradata's data model is designed to be fault tolerant. It is also designed to be scalable, with redundant network connectivity to ensure continuous data connectivity and availability.
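The practical difference between row-based and columnar storage shows up when a query needs only one column. The following Python sketch compares the two layouts for a simple SUM; the table contents and function names are invented for illustration, and "fields read" is only a rough proxy for I/O.

```python
# (sale_id, store, amount) for five hypothetical sales
rows = [(i, f"store-{i % 5}", i * 1.5) for i in range(1, 6)]

# Row layout: the fields of each row are kept together.
row_store = rows

# Columnar layout: one list per column.
column_store = {
    "sale_id": [r[0] for r in rows],
    "store": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def sum_amount_row_store(table):
    """Sum the amount field; every row is fetched in full."""
    total, fields_read = 0.0, 0
    for row in table:
        fields_read += len(row)
        total += row[2]
    return total, fields_read


def sum_amount_column_store(table):
    """Sum the amount column; only that column is touched."""
    amounts = table["amount"]
    return sum(amounts), len(amounts)


print(sum_amount_row_store(row_store))
print(sum_amount_column_store(column_store))
```

Both layouts return the same total, but the row layout touches three fields per row while the columnar layout touches one. On wide tables the gap grows accordingly, which is why a row store leans on indexes for this kind of query.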


Redshift Pros and Cons

Pros

1. Loading and unloading data is exceptionally fast. You can load data in parallel, and Redshift supports loading from compressed files even for high data volumes. AWS recommends loading data with the COPY command for faster performance.

2. You can load data from the NoSQL database service AWS DynamoDB. Refer to the AWS documentation for more detailed information about DynamoDB.

3. You can choose the node type of your cluster (Dense Storage or Dense Compute) depending on your data needs and business requirements.

4. You can scale your cluster's storage and CPU at any time for better performance, typically with limited disruption to the cluster.

5. You can migrate your data from various data warehouses into Redshift without much hassle. AWS provides a service for this called Database Migration Service (DMS). Refer to the AWS documentation for more detailed information.

6. Security is well covered: you can build your cluster inside a VPC and also use SSL encryption for further protection.

7. Redshift's backup and restore feature is simple. Through automatic snapshots, your data is backed up regularly. Snapshots are incremental, so you do not have to worry about missing changes. You can also copy snapshots to another region if the business needs it. Refer to the AWS documentation for more details on working with snapshots.


8. Redshift has an advanced feature called Redshift Spectrum. Using Redshift Spectrum, you can query huge amounts of data directly in S3, skipping the step of loading it through the COPY command or any other method. You can refer to the detailed guide on Redshift Spectrum for more information.

9. Using Sort Keys, data can be stored pre-sorted on specific columns, which improves the performance of queries that filter on those columns.

10. Using Distribution Keys, data can be distributed evenly across nodes to improve query performance.

11. Redshift provides various pre-built system tables and views to help developers and designers during ETL and other processes.

12. Setup-related commands can be run in several ways, such as the AWS console, the Command Line Interface (CLI), the API, etc.

13. AWS Redshift applies patches and upgrades to the cluster automatically during the maintenance window (a configurable value). Hence you do not have to worry about applying patches.
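For the parallel load mentioned in point 1, a common recommendation is to split the input into files whose count is a multiple of the number of slices. The Python sketch below models why, under the simplifying assumption that files are handed to slices round-robin; assign_files_to_slices and the file names are hypothetical, and Redshift's real assignment is internal to the service.

```python
def assign_files_to_slices(files, num_slices):
    """Round-robin the input files over the slices, one file at a time."""
    assignment = {s: [] for s in range(num_slices)}
    for i, name in enumerate(files):
        assignment[i % num_slices].append(name)
    return assignment


files = [f"sales_part_{i:02d}.csv.gz" for i in range(8)]

balanced = assign_files_to_slices(files, num_slices=4)    # 8 files, 4 slices
uneven = assign_files_to_slices(files[:6], num_slices=4)  # 6 files, 4 slices

print({s: len(f) for s, f in balanced.items()})
print({s: len(f) for s, f in uneven.items()})
```

With 8 files every slice gets the same amount of work. With 6 files two slices finish early and sit idle while the other two process their second file.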

Cons

1. At the time of writing, Redshift has no concept of triggers or stored procedures, and only limited support for user-defined functions.

2. There is no concept of a sequence column in Redshift. If you need to generate sequence numbers for a column, you have to handle it in your ETL logic.

3. Unlike other common data warehouses, Redshift does not enforce Primary Keys or Foreign Keys, which can create data integrity issues.


4. Only S3, DynamoDB, and EMR support a parallel load into Redshift. To load data from other services you need to write ETL scripts or use an ETL solution such as Hevo.

5. Redshift requires a good understanding of sort and distribution keys. There are some basic ground rules for choosing them, and a poor choice can hamper performance.

6. At the time of writing, a distribution key cannot be changed once the table is created. You need to be extremely careful while designing your tables, because a wrong distribution key can hamper overall performance.

7. Redshift has no concept of DBLink, so your queries cannot connect directly to tables in another database or data warehouse.

8. VACUUM and ANALYZE need to be run regularly on key tables. They can hurt performance badly if run during business hours, so they need to be scheduled carefully.

9. A Redshift cluster has limits on the number of nodes, databases, tables, etc. Its maximum storage is still lower than that of data warehouses like Teradata. Here is the list of node limits:

Node Type   | vCPU | Storage per Node | Node Range
------------+------+------------------+-----------
dc1.large   |    2 | 160 GB SSD       | 1-32
dc1.8xlarge |   32 | 2.56 TB SSD      | 2-128
dc2.large   |    2 | 160 GB NVMe-SSD  | 1-32
dc2.8xlarge |   32 | 2.56 TB NVMe-SSD | 2-128
ds2.xlarge  |    4 | 2 TB HDD         | 1-32
ds2.8xlarge |   36 | 16 TB HDD        | 2-128

 You can refer to AWS documentation to know more about the limits in Amazon Redshift.   


10. Although Redshift in classic mode is still in use, its cluster performance is relatively modest.

11. At the time of writing, Redshift supports only a single-AZ environment and does not support a multi-AZ environment.

12. Redshift has a practical limit of 15 on query concurrency, and a cluster can have a maximum of 8 queues. Unmanaged queues hinder performance.

13. Your design should make sure that the cluster is not in use during the maintenance window, otherwise jobs will fail.

14. There is no concept of table partitioning in Redshift.

15. Redshift has no concept of SET and MULTISET tables (SET tables are tables that do not allow duplicate rows). Duplicates need to be handled programmatically, otherwise they can lead to reporting errors.

You can refer to Hevo's blog, which covers the pros and cons of Amazon Redshift in complete detail.
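Two of the gaps above, the missing sequence column and the missing SET tables, are usually closed in the ETL layer before the load. Here is a minimal Python sketch of that idea; deduplicate_rows, assign_sequence, and the sample rows are hypothetical and stand in for whatever your pipeline actually uses.

```python
from itertools import count


def deduplicate_rows(rows):
    """Drop exact duplicate rows, keeping first occurrences in order (SET-like)."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def assign_sequence(rows, start=1):
    """Prefix each row with a running number, as a stand-in for a sequence column."""
    counter = count(start)
    return [(next(counter), *row) for row in rows]


staged = [("S1", "Pune"), ("S2", "Delhi"), ("S1", "Pune"), ("S3", "Goa")]
loaded = assign_sequence(deduplicate_rows(staged), start=101)
print(loaded)
```

In a real pipeline the start value would come from the current maximum key in the target table, and very large batches would be deduplicated in a staging table rather than in memory.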


Teradata Pros and Cons 

Pros

1. Teradata is a massively parallel data warehouse with a shared-nothing architecture.

2. Teradata provides pre-built utilities such as FastLoad, MultiLoad, TPT, and BTEQ.

3. Teradata is linearly scalable. If data volume rises, AMPs or nodes can be added.

4. Teradata has a fallback feature. If one AMP is down, another AMP takes over data retrieval.

5. Teradata provides an impressive tool called Teradata Visual Explain, which shows the execution plan of a query graphically. This helps developers and designers fine-tune their queries.

6. Teradata provides the Ferret utility to set and display storage space utilization.


Cons

1. One of the biggest drawbacks of Teradata is that it is not cloud-based unless it is deployed to run on the cloud. It requires initial setup, or you need to integrate with a cloud service provider such as AWS or Azure.

2. It is not a columnar data warehouse.

3. Since Teradata is not a columnar database, it reads the entire row even if you query a single column. You may end up with performance issues unless your data warehouse is properly designed.

4. A query that touches a set of different columns over a large dataset can run into performance issues unless it uses indexed columns.

5. Teradata supports a maximum of 128 joins in a single query. If you need more joins, you have to break the query into chunks and handle them accordingly.

6. Redshift is ahead of Teradata in analytical performance and in visualizing storage and CPU utilization. In Redshift, everything can be viewed in a single AWS console or through the CloudWatch monitor. Teradata, on the other hand, provides separate visual tools, and some checks require running commands in the Teradata client.

7. Teradata has no default column compression mechanism. Column compression has to be specified manually, and you can compress up to 255 distinct values per column.

8. Teradata has many limits on the number of columns, table values, and table name length. You can refer to the Teradata documentation for more detailed information.


Features supported only by Redshift, not Teradata

1. The most valuable feature of Redshift is that it is cloud-based and fully managed. Teradata, for its part, offers Teradata Database Developer (Single Node), a full-featured data warehouse software edition.

2. Backup and restore need little attention: snapshots are automatic, and manual snapshots and restores are also possible.

3. Backed-up data (snapshots) is automatically stored in S3. There is no need to worry about storing data on tape or in any outside system.

4. Redshift loads data through the COPY command in parallel mode, where all nodes and slices participate together to speed up the load.

5. Redshift performs automatic column-level compression, and it suggests compression encodings for all table columns (the command is ANALYZE COMPRESSION).

6. Thanks to the VPC feature in AWS, Redshift security is tight and well controlled.


Features supported only by Teradata, not Redshift

1. Teradata supports various features including Procedures, Triggers, etc.

2. Teradata has a column sequencing feature while Redshift doesn't.

3. Teradata provides various load and unload utilities, namely TPT, FastLoad, FastExport, MultiLoad, TPump, and BTEQ. You can choose among them depending on data volume and business logic, and use them in your ETL logic.

4. Teradata has a few visual utilities that Redshift lacks, such as Teradata Visual Explain. In Redshift, you need to run a query to view the Explain plan.

5. Teradata supports MULTISET and SET tables while Redshift doesn't.

6. Teradata supports Macros but Redshift doesn't. A Macro is a set of predefined SQL statements stored in the database. Macros also reduce LAN traffic.

Example:

CREATE MACRO Get_Sales AS (
    SELECT SalesId, StoreId, StoreName, StoreAddress
    FROM Stores
    ORDER BY StoreId;
);

EXEC Get_Sales;

Executing this macro retrieves all rows from the Stores table, ordered by StoreId.

   


Redshift Vs Teradata In A Nutshell 

Item | Redshift | Teradata
-----+----------+---------
Cloud perspective | Fully managed data warehouse in the cloud. | The core data warehouse is not cloud-based. Initial setup is required by DBAs/experts. Teradata can be deployed on the cloud (AWS/Azure) with a pay-as-you-go model.
Backup and restore strategy | Backups are handled automatically through the snapshot feature. Snapshots are stored internally in S3, which is highly durable. | Backup and restore can be manual or automated (using BAR), but the data is stored in an outside system.
Data load and unload | Data is loaded through the COPY command and unloaded through the UNLOAD command. COPY loads data in parallel so that all nodes participate equally for faster performance. | Separate utilities handle load and unload, such as TPT, FastExport, and FastLoad. They can be used as needed in your ETL/ELT.
Table storage | Columnar storage format. A query that touches only one column or a specific set of columns performs impressively. Aggregates are therefore very fast, since they read only the columns involved. | Row-level storage. Teradata requires proper indexing on columns so that data is stored properly across AMPs. If the indexes are poorly chosen, or a query hits a non-indexed column, performance can suffer.
Internal storage | Data is stored in 1 MB blocks per column. Each block has a zone map entry that stores the minimum and maximum value of the column in that block. | Data storage is managed by AMPs in vDisks. Data is distributed by a hash algorithm (based on the index defined, etc.) and retrieved accordingly.
Referential integrity model | Tables can declare Primary Keys and Foreign Keys, but they are not enforced. You need to enforce referential integrity in your own logic. | Tables have Primary Keys and Foreign Keys and they are enforced. This adds the overhead of reference checks during processing.
Sequence support | No concept of column sequencing. To create a sequence on a column you need to handle it programmatically. | You can define a Sequence on a column.
Triggers, Stored Procedures | No concept of Triggers or Stored Procedures. | You can create Triggers and Stored Procedures.
Visual features | Redshift is an integrated AWS service. Its performance can be monitored through the AWS console, CloudWatch, and automatic alerts. | It has a few visual tools such as Teradata Visual Explain, but they are cluttered.
Max concurrency | Maximum 15 concurrent queries. The default concurrency is 5. | Runs more than 15 concurrent queries.
Macros support | No concept of Macros. | Supports Macros.
NoSQL to Redshift feature | Redshift cannot load NoSQL data from other vendors, but it can load data from DynamoDB. | No such feature supported yet.
Maximum storage capacity | 2 PB (16 TB * 128 ds2.8xlarge nodes ~ 2 PB). | Much more than 2 PB of data.
Column compression | When a table is created, default compression is applied to all columns automatically. The ANALYZE COMPRESSION command also helps with choosing column compression. | You need to specify compression on individual columns. You can compress up to 255 distinct values per column in a table.
Maximum columns per table | Maximum 1600 columns per table. | Maximum 258 columns per row.
Maximum joins | No limit as such. | 128 joins per query block.
Data warehouse maintenance/updates | Redshift applies regular patches and performs automatic maintenance inside the maintenance window. | DBAs need to take care of all these activities manually or through some tool.
Table indexes | No table index concept, but performance is unaffected thanks to zone maps and sort keys. | Provides various types of index, such as Primary Index and Secondary Index.
Table partitioning | Redshift Spectrum supports it, but Redshift itself doesn't. | Tables can be partitioned.
Fault tolerance | Redshift is fault tolerant. If a node fails, Redshift automatically replaces it with a replacement node. However, multi-AZ is not supported. | Teradata is also fault tolerant. If an AMP fails, the fallback AMP takes over automatically.


Pricing and Effort Comparison

Redshift leads Teradata in effort and in-house pricing: it is cheaper and easier to run than Teradata. For Redshift, you only need to turn on the cluster, configure the security settings and a few other options (maintenance window, snapshot settings, etc.), and you are ready to go. This reduces the effort required from DBAs.

In terms of storage, however, Teradata has the upper hand because a Redshift cluster has size limits. In Redshift you can work around this through S3, which has no space limitation.

Remember, Teradata and Redshift are designed to solve different problems.

Refer to the Redshift and Teradata pricing pages for current prices.
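The storage ceiling of a Redshift cluster follows directly from the node limits listed earlier. The short Python calculation below multiplies storage per node by the maximum node count for each node type. The figures are the ones quoted in this ebook and may not match current AWS offerings.

```python
NODE_TYPES = {
    # node type: (storage per node in TB, maximum nodes per cluster)
    "dc1.large": (0.16, 32),
    "dc1.8xlarge": (2.56, 128),
    "dc2.large": (0.16, 32),
    "dc2.8xlarge": (2.56, 128),
    "ds2.xlarge": (2.0, 32),
    "ds2.8xlarge": (16.0, 128),
}


def max_cluster_storage_tb(node_type):
    """Largest cluster storage for a node type: per-node storage times node count."""
    per_node_tb, max_nodes = NODE_TYPES[node_type]
    return per_node_tb * max_nodes


for name in NODE_TYPES:
    print(f"{name}: {max_cluster_storage_tb(name):,.2f} TB")

largest = max(NODE_TYPES, key=max_cluster_storage_tb)
print(largest, max_cluster_storage_tb(largest) / 1024, "PB")
```

The largest configuration is 128 ds2.8xlarge nodes at 16 TB each, which gives 2048 TB, or 2 PB. That is the figure behind the Redshift storage limit used in this comparison.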


When and How to Migrate data from Teradata to Redshift

There are several things to consider before deciding whether to migrate from Teradata to AWS/cloud:

1) How stable is your Teradata warehouse?

2) How large is your Teradata data volume?

3) How complex is your Teradata data model?

4) What is your current Teradata data latency?

5) How good is your Teradata RDBMS performance?

6) How many BI tools are you using on your Teradata tables/views/cubes?

7) Are you using many Teradata features that Redshift does not support?

8) Will migrating your data warehouse from Teradata to Redshift break your system?

9) What is your budget for maintaining Redshift and other key AWS services post-migration?

If all these conditions are satisfied, you can migrate your data from Teradata to Redshift. AWS provides two useful services for this: Database Migration Service (DMS) and the Schema Conversion Tool (SCT). They are handy but not fully automated, so some minor manual effort is required.

Please refer to the AWS documentation for migrating data from Teradata to Redshift.


Summary

Choosing between Redshift and Teradata is a tough call, as they solve different problems. Redshift performs analytics and reporting extremely well. Since Redshift is a columnar data warehouse, it performs very well on queries that read specific columns of a table or view and on aggregate functions (sum, avg, count(*), etc). As Redshift is an AWS service, it is integrated with all vital AWS services. Hence you don't need to keep all of your data in Redshift alone, as you can archive old data in S3. If required, you can use Redshift Spectrum to build your analytics and reports on top of that archived data. Stored procedures can be handled through the AWS Lambda service. Redshift is a comparatively new data warehouse and is still developing features that other key data warehouses already offer.

Teradata, on the other hand, is old and mature. As an RDBMS, Teradata may not match Redshift's performance unless it has a properly designed data model, its features (FastLoad, MultiLoad, TPT, BTEQ, etc) are fully used, and its tables and views are properly tuned. Some established customers might be reluctant to migrate from Teradata to Redshift. They can also consider a hybrid model.

In conclusion, the debate is still ongoing: both Redshift and Teradata have their pros and cons.

   


ETL Challenges While Working With Amazon Redshift

Data loading is one of the biggest challenges of Redshift. To perform ETL to Redshift, you need to invest precious engineering resources to extract, clean, and enrich data and to build data pipelines. Writing complex scripts to automate all of this is not easy, and it gets harder if you want to stream your data in real time. Data loss becomes an everyday occurrence due to issues with changing sources, unstructured and unclean data, incorrect data mapping at the warehouse, and more.

A data integration platform like Hevo can address these Redshift ETL problems. With Hevo you can move data into Redshift in minutes. Hevo integrates with a variety of data sources, including SQL, NoSQL, SaaS, file storage, and webhooks.

Sign up for a free trial or view a quick video on how Hevo can help.

 

 

 

 

 

About Author: Ankur Shrivastava is an AWS Solution Designer with hands-on experience in Data Warehousing, ETL, and Data Analytics. He is an AWS Certified Solution Architect Associate. In his free time, he enjoys outdoor sports and practices.

 

 
