Redshift vs Teradata: An In-Depth Comparison


Author: others

Post on 16-Oct-2021

Redshift Data Model
Teradata Pros and Cons
  Pros
  Cons
Features supported only by Teradata, not Redshift
Redshift Vs Teradata In A Nutshell
Pricing and Effort Comparison
When and How to Migrate data from Teradata to Redshift
Summary
   
   
Redshift Vs Teradata 
Redshift versus Teradata has been one of the most debated data warehouse comparisons. This ebook covers a detailed comparison between Redshift and Teradata.
Redshift Architecture & Its Features   
Redshift is a fully managed, petabyte-scale data warehouse in the cloud. You can start with a few gigabytes or terabytes of data and scale up to petabytes as your business requires. The Redshift engine is called a cluster, and it is built from one or more nodes. There are two types of nodes: leader and compute. Each compute node contains two or more slices, depending on the node type. The leader node plays several roles, including communicating with JDBC/ODBC clients and creating the query execution plan that it passes to the compute node(s). A cluster is incomplete without a leader node.
You can check out our blog for a detailed article on Redshift Architecture. 
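
As a rough illustration of the leader/compute split described above, the Python sketch below models a leader that hands the same aggregation step to every slice and then merges the partial results. The names `Slice` and `leader_sum` are hypothetical and exist only for this example; it is a toy model under those assumptions, not a description of how Redshift is implemented internally.

```python
from dataclasses import dataclass, field


@dataclass
class Slice:
    """A hypothetical slice on a compute node, holding part of a table."""
    rows: list = field(default_factory=list)

    def partial_sum(self, column):
        # Each slice aggregates only the rows it stores.
        return sum(row[column] for row in self.rows)


def leader_sum(slices, column):
    """Leader role: send the same step to every slice, then merge results."""
    partials = [s.partial_sum(column) for s in slices]
    return sum(partials)


# Two compute nodes with two slices each, modelled here as four slices.
slices = [Slice() for _ in range(4)]
sales = [{"id": i, "amount": 10 * i} for i in range(1, 9)]
for i, row in enumerate(sales):
    slices[i % len(slices)].rows.append(row)  # spread rows round-robin

total = leader_sum(slices, "amount")
```

The point of the sketch is only that each slice works on its own share of the rows, so adding slices spreads the work while the leader does the final merge.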
Teradata Architecture & Its Features    
               
   

   
   
Redshift Data Model
The Redshift data model is designed for data warehousing. The following features make Redshift a smart data warehouse choice.

1. Redshift is a fully managed data warehouse. You don't have to worry about setting up and installing the database. You just spin up your cluster and the database is ready.
2. Redshift’s backup and restore are fully automatic. Through automatic snapshots, data in Redshift is backed up to S3 internally at regular intervals.
3. Data is secured by inbound security rules and SSL connections. Clusters in VPC mode are protected by the VPC, and classic-mode clusters by inbound security rules.
4. Redshift stores data in a columnar format, unlike the row-based storage of many other data warehouses. For example, if your query touches a specific column, Redshift reads only that column instead of entire rows. This saves an enormous amount of query processing time.
5. Data is stored in blocks of 1 MB instead of the typical 8 KB or 64 KB, which lets Redshift store more data in a single block.
6. Redshift does not have the concept of indexes. Instead, it has zone maps, which record the lowest and highest value of a column in each block. Zone maps tell the cluster which blocks need to be read.
7. Redshift has column compression (encoding). The ANALYZE COMPRESSION command suggests which compression encoding to apply to each column of a table. Redshift provides various encoding techniques. Refer to the AWS documentation for more details on encoding.
8. Redshift caches the results of repeated queries for faster performance. To check whether your query used the cache, look at the source_query column in SVL_QLOG. If your query used the cache, this column holds the ID of the query whose cached result was reused.

Example:
SELECT USERID, QUERY, ELAPSED, SOURCE_QUERY from SVL_QLOG WHERE
USERID in (600, 601); 
 
In the example below, query 853219, run by user 601, used the cached result of query 123456, run by user 600. As a result, its elapsed time (in microseconds) dropped drastically.
USERID | QUERY ID | ELAPSED | SOURCE_QUERY
--------+-------------+----------+---------------
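
To make the zone map idea from point 6 more concrete, here is a minimal Python sketch of block pruning. The helper names (`build_zone_map`, `blocks_to_read`) and the tiny three-value blocks are assumptions made for illustration; real Redshift blocks are 1 MB and the engine's internal structures are not shown here.

```python
def build_zone_map(blocks):
    """Record the (min, max) value of a column for each block."""
    return [(min(block), max(block)) for block in blocks]


def blocks_to_read(zone_map, low, high):
    """Return indexes of blocks whose value range overlaps [low, high]."""
    return [
        i for i, (block_min, block_max) in enumerate(zone_map)
        if block_max >= low and block_min <= high
    ]


# A sorted column split into small blocks (real blocks are 1 MB).
column_blocks = [[1, 4, 9], [12, 15, 20], [21, 30, 33], [40, 41, 50]]
zone_map = build_zone_map(column_blocks)

# A filter such as "value BETWEEN 14 AND 25" only needs blocks 1 and 2.
needed = blocks_to_read(zone_map, 14, 25)
```

Because only the minimum and maximum of each block are consulted, blocks 0 and 3 are skipped without being read. This is also why sorted data benefits most: the value ranges of the blocks overlap less.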
 
 
   

   
Redshift Pros and Cons
Pros

1. Loading and unloading of data is exceptionally fast. You can load data in parallel, and Redshift supports loading from compressed files even for high data volumes. Redshift recommends loading data with the COPY command for faster performance.
2. You can load data from the NoSQL database service Amazon DynamoDB. Refer to the AWS documentation for more detailed information about DynamoDB.
3. You can choose the node type (Dense Storage or Dense Compute) of your cluster depending on your data needs and business requirements.
4. You can scale your cluster's storage and CPU for better performance at any time with minimal impact on the cluster.
5. You can migrate your data from various data warehouses into Redshift without much hassle. AWS provides a service for this called Database Migration Service (DMS). Refer to the AWS documentation for more detailed information.
6. You do not have to worry about security, as you can build your cluster inside a VPC and also use SSL encryption for further protection.
7. Redshift's backup and restore feature is simple. Through automatic snapshots, your data is backed up regularly. Snapshots are incremental, so you do not have to worry about missing any changes. You can also copy snapshots to another region if the business needs it. Refer to the AWS documentation for more details on working with snapshots.
8. Redshift has an advanced feature called Redshift Spectrum. Using Redshift Spectrum, you can query huge amounts of data directly from S3 and skip loading it through the COPY command or any other method. You can refer to the detailed guide on Redshift Spectrum for more information.
9. Using sort keys, data can be pre-sorted on specific columns, which improves the performance of queries that filter on those columns.
10. Using distribution keys, data can be distributed evenly across nodes to increase query performance.
11. Redshift provides various pre-built system tables and views to help developers and designers during ETL and other processes.
12. Setup-related commands can be run through the AWS console, the Command Line Interface (CLI), the API, etc.
13. Redshift applies patches and upgrades to the cluster automatically during a configurable maintenance window. Hence, you do not have to worry about applying patches.
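
The effect of a distribution key (point 10) can be sketched in a few lines of Python. The example below assigns rows to slices by hashing the key value and counts the rows per slice. The function names are hypothetical, and `zlib.crc32` is only a stand-in hash chosen for this illustration; Redshift uses its own internal hash function.

```python
import zlib
from collections import Counter


def slice_for(key, num_slices):
    """Map a distribution-key value to a slice with a stable hash."""
    # crc32 is only a stand-in; Redshift's own hash function differs.
    return zlib.crc32(str(key).encode("utf-8")) % num_slices


def rows_per_slice(keys, num_slices):
    """Count how many rows land on each slice for the given key values."""
    counts = Counter(slice_for(k, num_slices) for k in keys)
    return [counts.get(s, 0) for s in range(num_slices)]


num_slices = 4
# High-cardinality key (e.g. an order id): rows can spread over all slices.
by_order_id = rows_per_slice(range(10_000), num_slices)
# Low-cardinality key (e.g. a country code): at most two slices get any rows.
by_country = rows_per_slice(["US"] * 9_000 + ["IN"] * 1_000, num_slices)
```

Printing `by_order_id` and `by_country` side by side shows why a key with few distinct values is a poor choice: one slice ends up holding at least 90% of the rows, and that slice becomes the bottleneck for every query on the table.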
Cons

1. In Redshift, there is no concept of functions, triggers, or procedures.
2. There is no concept of a sequence column in Redshift. If you need to generate sequence numbers for a column, you have to handle it in your ETL logic.
3. Unlike other common data warehouses, Redshift does not enforce primary keys or foreign keys, which can create data integrity issues.
4. Only S3, DynamoDB, and EMR support parallel loads into Redshift. To load data from other services, you need to write ETL scripts or use ETL solutions such as Hevo.
5. It requires a good understanding of sort and distribution keys. There are some basic ground rules for choosing them, and setting them improperly can hamper performance.
6. A distribution key cannot be changed once the table is created. You need to be extremely careful while designing your tables, because a wrong distribution key can hamper overall performance.
7. In Redshift, there is no concept of DBLink. You cannot directly connect to tables in another database or data warehouse from your queries.
8. In Redshift, VACUUM and ANALYZE are mandatory on key tables. They can hurt performance badly if run during business hours, so they need to be scheduled carefully.
9. In a Redshift cluster, there is a limit on the number of nodes, databases, tables, etc. The maximum storage limit is still lower than that of data warehouses like Teradata. Here is the node limitation list:
Node Type   | vCPU | Storage per Node | Node Range
------------+------+------------------+-----------
dc1.large   | 2    | 160 GB SSD       | 1-32
dc1.8xlarge | 32   | 2.56 TB SSD      | 2-128
dc2.large   | 2    | 160 GB NVMe-SSD  | 1-32
dc2.8xlarge | 32   | 2.56 TB NVMe-SSD | 2-128
ds2.xlarge  | 4    | 2 TB HDD         | 1-32
ds2.8xlarge | 36   | 16 TB HDD        | 2-128
You can refer to the AWS documentation to learn more about the limits in Amazon Redshift.
10. Although Redshift in classic mode is still in use, its cluster performance is relatively modest.
11. Redshift still supports only a single-AZ environment and does not support a multi-AZ environment.
12. Redshift has a query concurrency limit of 15. You can have a maximum of 8 queues in a cluster. If your queues are not managed well, performance suffers.
13. Your design should make sure that the cluster is not in use during the maintenance window, otherwise jobs will fail.
14. There is no concept of table partitioning in Redshift.
15. In Redshift, there is no concept of SET and MULTISET tables (SET tables are tables that do not allow duplicate rows). Duplicates need to be handled programmatically, and handling them inappropriately can lead to reporting errors.

You can refer to Hevo’s blog, which covers the pros and cons of Amazon Redshift in complete detail.
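
Point 15 says that duplicate handling has to be done programmatically. A minimal sketch of what that can look like in an ETL step is shown below in Python. The function name `load_as_set_table` and the sample rows are assumptions for illustration; the sketch drops exact duplicate rows before loading, which approximates SET-table behaviour, while plain concatenation behaves like a MULTISET table.

```python
def load_as_set_table(existing_rows, incoming_rows):
    """Append incoming rows, silently dropping exact duplicate rows."""
    seen = set(existing_rows)
    result = list(existing_rows)
    for row in incoming_rows:
        if row not in seen:  # whole-row comparison, like a SET table
            seen.add(row)
            result.append(row)
    return result


existing = [(1, "laptop", 2), (2, "phone", 1)]
incoming = [(2, "phone", 1), (3, "tablet", 5), (3, "tablet", 5)]

multiset_result = existing + incoming  # keeps every duplicate
set_result = load_as_set_table(existing, incoming)
```

Here `multiset_result` holds five rows, while `set_result` holds three. A report that sums quantities over the first list would double-count the phone and tablet rows, which is the kind of reporting error the point above warns about.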
Teradata Pros and Cons 
   
Cons

1. One of the biggest cons of Teradata is that it is not cloud-based unless it is set up to run over the cloud. It requires some initial setup, or you need to integrate with a cloud service provider, i.e., AWS or Azure.
2. It is not a columnar data warehouse.
3. Since Teradata is not a columnar database, it reads entire rows even if your query touches a single column. You may end up with performance issues unless your data warehouse is properly designed.
4. If a query runs on a set of different columns over a large dataset, it can lead to performance issues unless the query runs on indexed columns.
5. Teradata supports a maximum of 128 joins in a single query. If you want to perform more joins, you need to break the query into chunks and handle them accordingly.
6. Redshift outperforms Teradata in analytical performance and in the visualization of storage and CPU utilization. In Redshift, everything can be viewed in a single AWS console or through CloudWatch monitoring. Teradata, on the other hand, provides separate visual tools for some of this, while other checks require running commands in a Teradata client.
7. Teradata has no default column compression mechanism. Column compression needs to be set up manually, and you can compress up to 256 unique values per column.
8. There are many limits on the number of columns, table values, and table name length in Teradata. You can refer to the Teradata documentation for more detailed information.
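
Point 3 contrasts row-based and columnar storage. The Python sketch below counts how many values each layout has to touch to sum a single column. The sample data and function names are hypothetical, and counting values is only a simplified stand-in for disk I/O.

```python
rows = [
    {"order_id": 1, "customer": "a", "region": "east", "amount": 120},
    {"order_id": 2, "customer": "b", "region": "west", "amount": 80},
    {"order_id": 3, "customer": "c", "region": "east", "amount": 50},
]

# Column-oriented layout: one list of values per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_row_store(rows, column):
    """Scan whole rows; every field of every row is read."""
    values_read = sum(len(row) for row in rows)
    return sum(row[column] for row in rows), values_read


def sum_column_store(columns, column):
    """Scan a single column; other columns are never touched."""
    values = columns[column]
    return sum(values), len(values)


row_total, row_values_read = sum_row_store(rows, "amount")
col_total, col_values_read = sum_column_store(columns, "amount")
```

Both layouts return the same total, but the row layout touches all 12 stored values while the column layout touches only the 3 values of `amount`. The gap grows with the number of columns in the table.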
Features supported only by Teradata, not Redshift

1. Teradata supports various features including procedures, triggers, etc.
2. Teradata has a column sequencing feature while Redshift doesn't.
3. Teradata provides various load and unload utilities, i.e., TPT, FastLoad, FastExport, MultiLoad, TPump, and BTEQ. You can choose among them depending on data volume and business logic, and use them in your ETL logic.
4. Teradata has a few visual utilities that Redshift lacks, such as Teradata Visual Explain. In Redshift, you need to run a query to view the explain plan.
5. Teradata supports MULTISET and SET tables while Redshift doesn't.
6. Teradata supports macros but Redshift doesn't. Macros are sets of predefined SQL statements stored logically in the database. Macros also reduce LAN traffic.

Example:
);
Exec Get_Sales;
   
   
Redshift Vs Teradata In A Nutshell

Items | Redshift | Teradata
------+----------+---------
 | Fully managed data warehouse over the cloud. | The core data warehouse is not over the cloud. Initial setup is required by DBAs/experts. Teradata can be scaled to run over the cloud (AWS/Azure) with a pay-as-you-go model.
Backup and restore strategy | Backups are taken care of automatically through the snapshot feature. Snapshots are stored internally in S3, which is highly durable. | Teradata backup and restore can be manual or automated (using BAR), but data is stored in an outside system.
Data Load and Unload | Redshift loads data through the COPY command and unloads it through the UNLOAD command. With COPY, data is automatically loaded in parallel so that all nodes participate equally for faster performance. | In Teradata, separate utilities handle load and unload. Teradata provides TPT, FastExport, FastLoad, etc., which can be used as appropriate for your ETL/ELT.
Table Storage | Redshift follows a columnar storage format. If a query touches only a specific column or set of columns, it delivers impressive performance. Hence, aggregates are very fast in Redshift, as they read only the columns they need. | 
Internal Storage | In Redshift, the data of each column is stored in 1 MB blocks. Each block is covered by a zone map, which stores the minimum and maximum value of that column in the block. | In Teradata, data storage is managed by AMPs under vDisks. Data is distributed based on a hash algorithm (i.e., based on the index defined, etc.) and retrieved accordingly.
Referential Integrity Model | Redshift tables can have primary keys and foreign keys, but they are not enforced. You need to apply your own logic to maintain referential integrity on Redshift tables. | Teradata tables have primary keys and foreign keys, and they are enforced. Hence, there is an additional overhead of reference checks during processing.
Sequence Support | There is no concept of column sequencing. If you want to create a sequence on any column, you need to handle it programmatically. | You can define a sequence on a column.
Triggers, Stored Procedures | In Redshift, there is no concept of triggers or stored procedures. | You can create triggers and stored procedures in Teradata.
Visual Features | Redshift is part of AWS, an integrated service. All of Redshift's performance can be monitored through the AWS console, CloudWatch, and automatic alerts. | It has a few visual tools like Teradata Visual Explain, but they are cluttered.
NoSQL to Redshift Feature | Although Redshift cannot load NoSQL data from other vendors, it can load data from DynamoDB. | No such feature supported yet.
Maximum Storage Capacity | | Storage capacity of much more than 2 PB of data.
Column Compression | In Redshift, default compression is applied automatically to all columns when a table is created. It also provides the ANALYZE COMPRESSION command to help with column compression. | In Teradata, you need to specify compression on individual columns. You can compress up to 128 unique values per column in a table.
Maximum Columns Per Table | Maximum 1600 columns per table. | Maximum 258 columns per row.
Maximum Joins | No limit as such. | 64 joins per query block.
Data Warehouse Maintenance/Updates | Redshift applies regular patches and performs automatic maintenance inside the maintenance window. | In Teradata, DBAs need to take care of all these activities manually or through some tool.
Table Indexes | It does not have table index concept but its performance is | Teradata provides various types of index, i.e., Primary Index, Secondary Index, etc.
 | | Tables can be partitioned.
Fault Tolerance | Redshift is fault tolerant. If a node fails, Redshift automatically replaces it with a replacement node. However, multi-AZ is not supported in Redshift. | 
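
The "Sequence Support" row above says that sequence numbers have to be generated programmatically. One way this might be done inside an ETL step is sketched below in Python. The function `assign_sequence`, the `row_id` column name, and the starting value of 41 are all hypothetical; in practice the current maximum would come from the target table.

```python
from itertools import count


def assign_sequence(rows, current_max=0, key="row_id"):
    """Return copies of rows with consecutive ids starting after current_max."""
    next_id = count(current_max + 1)
    return [{key: next(next_id), **row} for row in rows]


# current_max would normally come from something like SELECT MAX(row_id).
batch = [{"name": "alice"}, {"name": "bob"}, {"name": "carol"}]
numbered = assign_sequence(batch, current_max=41)
```

The sketch assumes a single loader at a time. If several jobs load the same table concurrently, they would need to coordinate on the starting value to avoid assigning the same ids twice.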
   
Pricing and Effort Comparison
Redshift leads Teradata in effort and in-house pricing: it is cheaper and easier to run than Teradata. For Redshift, you only need to turn on the cluster, configure the security settings and a few other options (maintenance window, snapshot settings, etc.), and you are ready to go. This reduces the DBAs' effort.

However, in terms of storage, Teradata has the upper hand because a Redshift cluster has limits. In Redshift, you can still work around this through S3, which has no space limitation.

Remember, Teradata and Redshift are data warehouses designed for different purposes.

You can refer to the Redshift and Teradata pricing pages to learn about pricing.
When and How to Migrate data from Teradata to Redshift
There are various considerations when deciding whether to migrate from Teradata to AWS/cloud:

1) How stable is your Teradata warehouse?
2) How large is your Teradata data volume?
3) How complex is your Teradata data model?
4) How high is your current Teradata data latency?
5) How good is your Teradata RDBMS performance?
6) How many BI tools are you using on your Teradata tables/views/cubes?
7) Are you using many Teradata features that Redshift does not support?
8) Will migrating your data warehouse from Teradata to Redshift break your system?
9) What is your budget for maintaining Redshift and other key AWS services post-migration?

If all conditions are satisfied, you can easily migrate your data from Teradata to Redshift. AWS provides useful tools for this: the Database Migration Service (DMS) and the Schema Conversion Tool (SCT). Handy as they are, the process is not fully automated, and some minor manual effort is required.

Please refer to the AWS documentation for migrating data from Teradata to Redshift.