
CDP Data Migration Guide

Date published: 2020-06-01
Date modified: 2020-06-01

https://docs.cloudera.com/


Legal Notice

© Cloudera Inc. 2021. All rights reserved.

The documentation is and contains Cloudera proprietary information protected by copyright and other intellectual property rights. No license under copyright or any other intellectual property right is granted herein.

Copyright information for Cloudera software may be found within the documentation accompanying each component in a particular release.

Cloudera software includes software from various open source or other third party projects, and may be released under the Apache Software License 2.0 (“ASLv2”), the Affero General Public License version 3 (AGPLv3), or other license terms. Other software included may be released under the terms of alternative open source licenses. Please review the license and notice files accompanying the software for additional licensing information.

Please visit the Cloudera software product page for more information on Cloudera software. For more information on Cloudera support services, please visit either the Support or Sales page. Feel free to contact us directly to discuss your specific needs.

Cloudera reserves the right to change any products at any time, and without notice. Cloudera assumes no responsibility nor liability arising from the use of products, except as expressly agreed to in writing by Cloudera.

Cloudera, Cloudera Altus, HUE, Impala, Cloudera Impala, and other Cloudera marks are registered or unregistered trademarks in the United States and other countries. All other trademarks are the property of their respective owners.

Disclaimer: EXCEPT AS EXPRESSLY PROVIDED IN A WRITTEN AGREEMENT WITH CLOUDERA, CLOUDERA DOES NOT MAKE NOR GIVE ANY REPRESENTATION, WARRANTY, NOR COVENANT OF ANY KIND, WHETHER EXPRESS OR IMPLIED, IN CONNECTION WITH CLOUDERA TECHNOLOGY OR RELATED SUPPORT PROVIDED IN CONNECTION THEREWITH. CLOUDERA DOES NOT WARRANT THAT CLOUDERA PRODUCTS NOR SOFTWARE WILL OPERATE UNINTERRUPTED NOR THAT IT WILL BE FREE FROM DEFECTS NOR ERRORS, THAT IT WILL PROTECT YOUR DATA FROM LOSS, CORRUPTION NOR UNAVAILABILITY, NOR THAT IT WILL MEET ALL OF CUSTOMER’S BUSINESS REQUIREMENTS. WITHOUT LIMITING THE FOREGOING, AND TO THE MAXIMUM EXTENT PERMITTED BY APPLICABLE LAW, CLOUDERA EXPRESSLY DISCLAIMS ANY AND ALL IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY, QUALITY, NON-INFRINGEMENT, TITLE, AND FITNESS FOR A PARTICULAR PURPOSE AND ANY REPRESENTATION, WARRANTY, OR COVENANT BASED ON COURSE OF DEALING OR USAGE IN TRADE.


Contents

Introduction
Data Migration Tools and Methods
  Use Workload Manager to Migrate to CDP Public Cloud
    Introduction to Workload Manager
    Before You Burst - Select a Workload
    Bursting to the Cloud - Steps
  Use Replication Manager to migrate to CDP Public Cloud
    About Replication Manager
    Classic Clusters
    Prerequisites for Replication Manager
    Port Requirements for Replication Manager
    Cloudera Data Platform Private Cloud Base clusters
    Requirements while using CDH on-premise clusters
    Working with Cloud Credentials
    Snapshot Replication using CDH Clusters
    Hive Replication Concepts
    Data Replication Use Cases
  Use Replication Manager to migrate to CDP Private Cloud Base
    Replication Manager Overview
    Product Compatibility Matrix for Replication Manager
    Supported and Unsupported Replication Scenarios
    Data Replication
    Designating a Replication Source
    Configuring a Peer Relationship
    Modifying Peers
    Configuring Peers with SAML Authentication
    HDFS Replication
    Hive/Impala Replication
    Enabling, Disabling, or Deleting A Replication Policy
    Replicating Data to Impala Clusters
    Using Snapshots with Replication
    Enabling Replication Between Clusters with Kerberos Authentication
    Replication of Encrypted Data
    Security Considerations
    Snapshots
Data Warehouse to CDP
  Migrating Hive Data to CDP
    Handling Semantic and Syntax Changes
    Hive Configuration Property Changes
    Customizing Critical Hive Configurations
    Set Hive Configuration Overrides
    Hive Configuration Requirements and Recommendations
    Removing Hive on Spark Configurations
    Update Ranger Table Policies
    Setting Up Access Control Lists
    Configure HiveServer for ETL using YARN queues
    Configure Encryption Zone Security
    Use HWC/Spark Direct Reader for Spark Apps/ETL
    Configure HiveServer HTTP Mode
    Unsupported Interfaces and Features
    Changes to CDH Tables
    Changes to HDP Tables
  Migrating Impala Data to CDP
    Impala Changes between CDH and CDP
    Impala configuration differences in CDH and CDP
    Other Miscellaneous Changes in Impala
    Factors to Consider for Capacity Planning
    Planning Capacity Using WXM
    Performance Differences between CDH and CDP
  Migrating Kudu Data to CDP
    Backing up data in Kudu
    Restoring Kudu data into the new cluster
Operational Database to CDP
  Prepare for data migration
  Migrate Data from CDH or HDP to CDP Private Cloud Base
  Verify and validate if your Data is Migrated
Machine Learning and Data Engineering to CDP
  Cloudera Data Science Workbench to CDP
  Zeppelin to CDP
  Spark to CDP
  Livy to CDP
Streaming to CDP
  Migrating Streaming workloads from HDF to CDP Private Cloud Base
    Set Up a New Streaming Cluster in CDP Private Cloud Base
    Migrate Ranger Policies
    Migrate Schema Registry
    Migrate Streams Messaging Manager
    Migrate Kafka Using Streams Replication Manager
Data Flow to CDP
Security and Governance to CDP
  Migrating Security and Governance Data from CDH to CDP
  Migrating Security and Governance Data from HDP to CDP
Platform Components to CDP
  Cloudera Manager
  Fair Scheduler to Capacity Scheduler migration
    Plan your scheduler migration
    Use the fs2cs conversion utility
    Manual configuration of scheduler properties
  Migrating Oozie to CDP


Introduction

This guide describes how to migrate workloads from CDH or HDP clusters to CDP Public Cloud or CDP Private Cloud Base.

Some components require data conversion steps during the upgrade process, but data migration does not refer to upgrades:

• Data migration refers to moving existing CDH or HDP workloads to CDP Public Cloud or to a new installation of CDP Private Cloud Base.

• Upgrade refers to a full in-place upgrade of CDH or HDP to CDP Private Cloud Base, which is not addressed in this guide.

Data Migration Tools and Methods

Tools and methods you can use to migrate data from CDH and HDP to CDP.

Use Workload Manager to Migrate to CDP Public Cloud

Workload Manager can be used to migrate Impala workloads to CDP Public Cloud.

Workload Manager enables you to explore cluster and workload health before migrating data. You can identify workloads that are good candidates for cloud migration, and optimize workloads before migrating them to CDP Public Cloud.

• Generates a “cloud-friendliness” score
• Auto-generates a sizing/capacity plan for the target environment
• Works with Replication Manager to help you build a replication plan
• Mitigates the risk of run-away cloud costs caused by suboptimal workloads

Supported Scenarios

CDH to CDP Public Cloud (AWS only)

• Impala -- GA
• Hive -- Roadmap Q2
• Azure support -- Roadmap Q3

HDP to CDP Public Cloud

• Currently requires professional services
• Roadmap Q2

Related Cloudera Professional Services Offerings

• Assessment Services
• Migration Services
  • SmartMigrate – CDH/HDP to CDP Private Cloud Base
  • Shift2Cloud – CDH/HDP to CDP Public Cloud

Related Information
Cloudera Workload Manager


Introduction to Workload Manager

Workload Manager is a tool that provides insights to help you gain an in-depth understanding of the workloads you send to clusters managed by Cloudera Manager. In addition, it provides information that can be used for troubleshooting failed jobs and optimizing slow jobs that run on those clusters. After a job ends, information about job execution is sent to Workload Manager by the Telemetry Publisher, a role in the Cloudera Manager Management Service.

Workload Manager uses the information to display metrics about the performance of a job. Additionally, Workload Manager compares the current run of a job to previous runs of the same job by creating baselines. You can use the knowledge gained from this information to identify and address abnormal or degraded performance or potential performance improvements.

The Burst to Cloud functionality of Workload Manager utilizes Replication Manager and Data Warehouse to help you refine the parameters of your migration. This is all done seamlessly within the Burst to Cloud wizard. The sections below walk you through the burst to cloud process, step by step.

Before You Burst - Select a Workload

Before you burst a workload to the cloud, you'll need a workload to burst. Workload Manager provides two ways of defining workloads - auto-generated workloads, and manually-defined workloads.

You can use Workload Manager to analyze your data warehouse workloads at the cluster level, but using the Workload View feature, you can analyze workloads with much finer granularity. For example, you can analyze how queries that access a particular database or that use a specific resource pool are performing against SLAs. Or you can examine how all the queries are performing that a specific user sends to your cluster.

Use workload views that Workload Manager automatically generates, or manually define your own workload views to drill down on specific criteria:

Using Auto-Generated Workload Views

To immediately start analyzing your workloads with the Workload View feature, follow these steps to use auto-generated views:

1. Under Data Warehouse in the left menu, select Workloads:


2. Click Define New and choose Select recommended views from the drop-down menu:

3. Review the Criteria that are used to create the workload views, select the auto-generated workload views you want to use, and then click Add Selected:


After you click Add Selected, the workloads you selected are saved and can be viewed on the Data Warehouse Workloads page.

Using Manually-Defined Workload Views

Use the following steps to manually define your own Workload Views, or if you want to see this feature in action, view the following video:


Figure 1: Video: Classifying Workloads to Gain Insights

For better video quality, click YouTube in the lower right corner of the video player to watch this video on YouTube.com.

1. In the left menu under Data Warehouse, select Summary, and click the arrow next to the date range in the upper right corner of the page to select the date range for the workloads you want to analyze:

2. In the left menu, under the Data Warehouse heading, select Define New > Manually define view. Then click Define New:


3. On the Define Data Warehouse Workload View page, define a set of criteria that enables you to analyze a specific set of queries. For example, you can review all failed queries that use a specific database and are subject to a fifteen-second SLA:

The above workload view definition, which is named my_dino_view, monitors queries that use the dino database. When 15 percent of these queries miss a 15-second SLA, that is, total query execution time exceeds 15 seconds, the workload is flagged with a failing status. (A sketch of this SLA check appears after this procedure.)

4. After specifying your criteria, click Preview to display a summary of the queries matching these criteria:


5. If you are satisfied with the results of the criteria you specified, click Save in the lower right-hand corner of the page.

6. After saving the workload view, you are returned to the Data Warehouse Workloads page, where your workload appears in the list. Use the search bar to search for your workload and click on the workload to view the workload details:

7. The detail page for your workload view contains several graphs and tabs you can view to analyze how this group of queries is meeting its SLA. For example:

• In the Trend region, you can view the counts of executing queries By Status or By Statement Type:


Click the number under Total Queries, Failed Queries, and Query Active Time to view further details.

• Also in the Trend region, you can view the number of queries executing concurrently when you click the Concurrency tab:

The above example shows that the maximum concurrency for this workload view is 20. This means that for the queries in this workload view, a maximum of twenty queries access the same data at the same time during the specified time period. The bottom of the graph shows how concurrency fluctuates over the date range specified for the workload view. (The sketch after this list also shows how maximum concurrency can be computed from query start and end times.)

• You can also view the different statement types contained in the workload view, the active time of the queries, and you can drill down to view more granular details on each query.


• In the other widgets on the page you can view things like which queries took the most time, what statement type was the most common, and details of failed queries.
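The SLA check from the my_dino_view example and the maximum-concurrency figure discussed above both reduce to small computations over per-query records. The following Python sketch illustrates that logic; the function names, record shapes, and sample numbers are illustrative only and are not part of Workload Manager.

```python
from typing import List, Tuple

def workload_status(durations_s: List[float], sla_s: float = 15.0,
                    miss_threshold_pct: float = 15.0) -> str:
    """Flag a workload view as failing when the percentage of queries whose
    total execution time exceeds the SLA crosses the threshold, mirroring
    the my_dino_view example above."""
    if not durations_s:
        return "passing"
    missed = sum(1 for d in durations_s if d > sla_s)
    return "failing" if 100.0 * missed / len(durations_s) >= miss_threshold_pct else "passing"

def max_concurrency(intervals: List[Tuple[float, float]]) -> int:
    """Maximum number of queries running at the same time, computed from
    (start_time, end_time) pairs with a simple sweep over the endpoints."""
    events = sorted([(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals])
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak

# 2 of 10 queries (20%) exceed the 15-second SLA, so the view is flagged.
print(workload_status([3, 4, 16, 5, 2, 18, 6, 7, 1, 9]))      # failing
print(max_concurrency([(0, 10), (2, 6), (4, 12), (11, 15)]))  # 3
```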

Bursting to the Cloud - Steps

The sections below will guide you through the process of bursting your workload to the cloud.

1. Generate a cloud rating for your workload.
2. Use the performance rating to determine whether your workload is a good candidate for the cloud.
3. Open the Burst to Cloud wizard, which guides you through the burst to cloud process.
4. In the Burst to Cloud wizard, you'll create a replication policy.
5. After the replication runs, you can view the policy and job history in Replication Manager.
6. Next, you'll create a compute cluster, which can be a Data Warehouse Cluster or a Data Hub cluster.
7. Finally, you'll burst your workload to the cloud and verify that the burst has been successfully completed.

Generate a Performance Rating

Before you burst your workload to the cloud, Workload Manager will analyze the workload to determine whether it is a good candidate for the cloud.

1. To create a performance rating, select Burst to Cloud > Generate Performance Rating. When the performance rating is complete, a pop-up window will alert you that the rating has been generated.

2. Select Burst to Cloud > View Performance Rating Details to find out whether your workload is a good candidate for the cloud.

The Cloud Performance pop-up gives you a general rating as well as details about how Workload Manager determined that rating.

3. You can use the details of this rating to optimize your workload for the cloud. For example, in the image above, the workload has small repetitive reads, which are optimal for cloud migration, but you might also look for workloads that are more CPU-intensive (a purely illustrative sketch of this kind of heuristic follows these steps).

Workload Manager provides the Cloud Performance Rating to help you determine the best workloads to migrate to the cloud. A low rating does not prevent you from migrating the workload.

4. Click the Start Burst to Cloud Wizard to begin the migration.
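Workload Manager computes the Cloud Performance Rating internally, and its scoring model is not described here. Purely to illustrate the kind of signals mentioned above (small, repetitive reads rate well; CPU-heavy workloads rate lower), here is a toy heuristic; the thresholds, weights, and function name are invented for this sketch and are not Workload Manager's actual model.

```python
def toy_cloud_rating(avg_read_mb: float, read_repetition_ratio: float,
                     cpu_utilization_pct: float) -> str:
    """Toy heuristic only: reward small, repetitive reads and penalize
    CPU-bound workloads. Not Workload Manager's real scoring."""
    score = 0
    if avg_read_mb < 64:              # small reads
        score += 1
    if read_repetition_ratio > 0.5:   # repetitive access pattern
        score += 1
    if cpu_utilization_pct < 70:      # not CPU-bound
        score += 1
    return {3: "good", 2: "fair"}.get(score, "poor")

print(toy_cloud_rating(avg_read_mb=16, read_repetition_ratio=0.8,
                       cpu_utilization_pct=40))  # good
```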

Create a Replication Policy

Workload Manager utilizes the capabilities of Replication Manager to create the replication policy. The Burst to Cloud wizard takes you through the procedure to create a replication policy to define the source, destination, and schedule for the replication.


1. In the Burst to Cloud wizard, click the Create Replication Policy button.

2. This takes you to the first page of the wizard, where you can edit general information about the policy, such as name and description.

3. Click Next to go to the Select Source page, where you enter details about the source of the workload that you want to migrate.

On the Select Source page, enter the following information:

• Source Cluster - Select the cluster that contains the workload that you want to migrate.
• Source Databases and Tables - Select the databases and tables that you want to include in the migration.
• Run as Username (on source) - Enter the username of the user that you want to run the replication policy as.

4. Click Next to go to the next page, where you can choose your migration destination. Choose a data lake from the drop-down menu, enter your credentials, and click Next.

5. On the next page, you schedule when the replication policy will run. You can schedule it to run once or on a recurring basis.


6. After you schedule when the policy will run, click Next to go to the Additional Settings page, where you can configure details about how the policy will run, such as the YARN queue name, and whether you want to include Sentry permissions.


7. Click Next and a pop-up will inform you that your policy has been created successfully.
8. After you define your replication policy, you can go back to Workload Manager to see the status of your replication. In Workload Manager, click the Burst to Cloud button to see that the replication is running and to view the replication details in Replication Manager. You also have the option to refresh the performance rating of your workload.

Create a Compute Cluster

After you've created your replication policy, the last thing you need to do before you burst to the cloud is create a compute cluster. This will help you determine the size of the cloud cluster you need.


1. In the Burst to Cloud wizard, click the Create Compute button. This takes you to Cloudera Data Warehouse, where you can select a database catalog and create a virtual warehouse for your migration.

2. Click the plus symbol (+) to create a virtual warehouse. You can configure the warehouse parameters, such as size, headroom, and wait time.

3. After the virtual warehouse is created, open the menu for the warehouse and select Open DAS.

4. This opens the Data Analytics Studio, where you can verify that your workload is in the cloud. Select the Database tab and search for the database that you used for the workload. Under the Detailed Information tab, find the Location row, which shows you where the workload exists. As you can see in the example below, the database location is in the cloud.



Use Replication Manager to migrate to CDP Public Cloud

Replication Manager can be used to migrate Hive, Impala, and HDFS workloads to CDP Public Cloud.

Features

Replication Manager enables you to replicate data across data centers for disaster recovery scenarios. Replications can include data stored in HDFS, data stored in Hive tables, Hive metastore data, and Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore.

Supported Scenarios

• CDH to CDP Public Cloud (AWS and ABFS)
• HDP to CDP Public Cloud (Technical Preview and not available for general use)
• HBase replication - Roadmap Q2

Related Cloudera Professional Services Offerings

• HDP to CDP Public Cloud support

About Replication Manager

You must understand various components and related information pertaining to Replication Manager.

Accessing Replication Manager UI

You can access the Replication Manager user interface by logging into Cloudera Data Platform and selecting Replication Manager.


Replication Manager User Interface

The main Replication Manager user interface displays entities such as Classic Clusters, Replication Policies, Policy notifications, Replication Jobs, and Classic Clusters mapping, along with any other issues or updates.

Notice: Replication Manager service is generally available for use with Cloudera Distribution of Hadoop (CDH) clusters. For more information, please contact Cloudera Professional Services. If you have questions regarding this service, contact support by logging a case on our Cloudera Support Portal.

Product Compatibility Matrix for Replication Manager

This matrix contains compatibility information across features of Replication Manager.

Important: Currently, Replication Manager service does not fully support replication operations using HDP clusters. If you have questions regarding this, contact support by logging a case on our Cloudera Support Portal.

| Feature | Lowest Supported Cloudera Manager Version | Lowest Supported CDH Version | Lowest Supported CDP-DC Version | Supported Services |
| --- | --- | --- | --- | --- |
| Replication to Amazon S3 | 6.3.1 | 5.13+ | 7.1.x | HDFS, Hive, Sentry to Ranger |
| Replication to Microsoft Azure ADLS Gen2 (ABFS) | 7.1.1 | 5.13+ | 7.1.x | HDFS, Hive, Sentry to Ranger |

Note: While using CDP Private Cloud Base 7.1 clusters, only External Table replication is supported. Sentry to Ranger migration is NOT supported on source CDP Data Center clusters.

How Policies Work in Replication Manager

In Replication Manager, you create policies to establish the rules you want applied to your replication jobs. The policy rules you set can include which cluster is the source and which is the destination, what data is replicated, what day and time the replication job occurs, the frequency of job execution, and bandwidth restrictions.

When scheduling how often you want a replication job to run, you should consider the recovery point objective (RPO) of the data being replicated; that is, what is the acceptable lag time between the active site and the replicated data on the destination.
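A replication policy is essentially the set of rule fields listed above plus a schedule. The sketch below is a hypothetical representation of those fields, not Replication Manager's API; the only logic it adds is the RPO sanity check suggested in the previous paragraph (the schedule interval should not exceed the acceptable lag).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ReplicationPolicy:
    """Hypothetical container for the policy rules described above."""
    name: str
    source_cluster: str
    destination: str
    paths_or_tables: List[str]
    frequency_minutes: int                    # how often the job runs
    bandwidth_limit_mbps: Optional[int] = None

def meets_rpo(policy: ReplicationPolicy, rpo_minutes: int) -> bool:
    """The replicated copy can lag by up to one schedule interval, so the
    run frequency must be at least as tight as the recovery point objective."""
    return policy.frequency_minutes <= rpo_minutes

policy = ReplicationPolicy("sales-hdfs", "onprem-cdh", "s3a://dl-bucket/sales",
                           ["/data/sales"], frequency_minutes=60)
print(meets_rpo(policy, rpo_minutes=240))  # True: hourly runs satisfy a 4-hour RPO
```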

• The first time you execute a job (an instance of a policy) with data that has not been previously replicated, Replication Manager creates a new folder or database and bootstraps the data.

During a bootstrap operation, all data is replicated from the source cluster to the destination. As a result, the initial execution of a job can take a significant amount of time, depending on how much data is being replicated, network bandwidth, and so forth. So you should plan the bootstrap accordingly.

After the initial bootstrap, data replication is performed incrementally, so only updated data is transferred. Data is in a consistent state only after incremental replication has captured any new changes that occurred during bootstrap.

Classic Clusters

You must register your existing on-prem Cloudera Distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) clusters on the Management Console in order to burst your data to the cloud. In the CDP world, these clusters are called "classic clusters".

You can use the Classic Clusters panel of the Overview page to view the total number of clusters enabled for Replication Manager, the number that are in an error state, the number that are active, and the number of clusters for which a warning is issued. You can investigate the issues associated with clusters that have an error or warning status by navigating to the Ambari web UI in the case of HDP on-premise clusters and the Cloudera Manager UI for CDH on-premise clusters.

Active

Specifies the total number of clusters currently available to run replication jobs.

Warning

Specifies the total number of clusters for which remaining disk capacity is less than 10%. If this value is greater than zero, you can click the number to open a table that displays the cluster name and remaining capacity.

Total

Specifies the total number of clusters enabled for Replication Manager.

Error

Specifies the total number of clusters that are currently not running as expected.

Prerequisites for Replication Manager

You must perform the following prerequisite tasks before you can use the Replication Manager service.

• Set up the CDP Data Lake cluster in CDP Public Cloud.
• Classic Cluster registration of the CDH cluster.
• Network connectivity:
  • Outgoing SSH port should be open on the Cloudera Manager host.
  • For Hive replication, the Data Lake Cloudera Manager should be able to communicate with the on-premise Cloudera Manager.

Port Requirements for Replication Manager

While using a CDH on-premise cluster, make sure that the following ports are open and accessible on the source hosts to allow communication between the source on-premise cluster and CDP.

Ports that have to be configured on the source cluster:

• Incoming
  • 7180 or 7183, for the Data Lake Cloudera Manager to communicate with the on-premise Cloudera Manager.
• Outgoing
  • For AWS / ADLS Gen2, port 80 or 443 (SSL) should be open on all HDFS nodes.
  • n1, outgoing port for the CDP Management Console to communicate with Cloudera Manager.

Note: n1 indicates the user-chosen port.

Port access to be configured on the Data Lake cluster:

• Outgoing
  • 8443, outgoing port for the CDP Management Console to communicate with Cloudera Manager and Knox.
  • 9443, outgoing port for the CDP Management Console to communicate with FreeIPA.
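Before creating replication policies, it can save time to confirm that these ports are actually reachable. The following is a minimal sketch, not a Cloudera tool; the hostnames are placeholders and the port list simply mirrors the requirements above. Each check must be run from the host that initiates the connection in question.

```python
import socket

def port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Attempt a TCP connection to host:port and report whether it succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# Placeholder hostnames; replace them with your actual endpoints.
checks = [
    ("onprem-cm.example.com", 7180),         # run from the Data Lake side; 7183 if TLS is enabled
    ("s3.amazonaws.com", 443),               # run from an HDFS node: outbound to the object store
    ("datalake-cm.example.com", 8443),       # run from the Management Console network: CM and Knox
    ("datalake-freeipa.example.com", 9443),  # run from the Management Console network: FreeIPA
]
for host, port in checks:
    print(f"{host}:{port} -> {'open' if port_open(host, port) else 'unreachable'}")
```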

Cloudera Data Platform Private Cloud Base clusters

Replication Manager supports data migration from Cloudera Data Platform Private Cloud Base clusters.

CDP Private Cloud Base is the on-premises version of Cloudera Data Platform. CDP Private Cloud Base combines the best of Cloudera Enterprise Data Hub and Hortonworks Data Platform Enterprise along with new features and enhancements across the stack. This unified distribution is a scalable and customizable platform where you can securely run many types of workloads. For more information, see About CDP Private Cloud Base.

Requirements while using CDH on-premise clusters

While using Replication Manager with CDH on-premise clusters, you must be aware of certain requirements.


Note the following:

• The CDH on-premise cluster, through its Cloudera Manager instance, must be registered on the Management Console. For more information, see Add a CDH Cluster.
• You must upgrade to Cloudera Manager version 6.3 or above to use the Replication Manager service.
• You must plan to use CDH clusters version 5.13.x and above.
• While performing HDFS replication, you must ensure that the Replication Manager service interacts with the classic cluster registered Cloudera Manager instance.
• While performing Hive replication, you must ensure that the Replication Manager service interacts with the Data Lake Cloudera Manager instance and vice versa.
• For HDFS replication, you must ensure that you add an external account in the Cloudera Manager instance. You must also verify that the account has access to the bucket/container where the HDFS data gets copied. For more information, see How to Configure AWS Credentials and Configuring ADLS Access Using Cloudera Manager in the Cloudera Manager documentation.
• For Hive replication, in addition to adding an external account in the Cloudera Manager instance, you must add an IAM external account with bdr as the username in the Data Lake cluster Cloudera Manager instance. For more information, see IAM Role-based Authentication in the Cloudera Manager documentation.
• Additionally, for Hive replication, you must add the classic cluster Cloudera Manager as a source in the Data Lake Cloudera Manager instance. For more information, see Designating a Replication Source in the BDR documentation.

Note: If you are blocked with connectivity issues, contact Cloudera Professional Services to move your data to the Data Lake cluster.
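The version minimums in this list (Cloudera Manager 6.3 or above and CDH 5.13.x or above) are easy to gate on in a pre-migration checklist script. A minimal sketch with hypothetical helper names:

```python
def version_tuple(version: str) -> tuple:
    """Turn a version string such as '6.3.1' into (6, 3, 1) for comparison."""
    return tuple(int(part) for part in version.split(".") if part.isdigit())

def meets_replication_manager_requirements(cm_version: str, cdh_version: str) -> bool:
    """Check the minimums called out above: Cloudera Manager 6.3+ and CDH 5.13+."""
    return version_tuple(cm_version) >= (6, 3) and version_tuple(cdh_version) >= (5, 13)

print(meets_replication_manager_requirements("6.3.1", "5.16.2"))   # True
print(meets_replication_manager_requirements("5.16.0", "5.13.3"))  # False
```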

Working with Cloud Credentials

You need valid cloud account details before you register them with the Replication Manager service.

Adding Cloud Credentials

You must add the required cloud storage account and related credentials before you plan to submit the replication policy.

1. Go to Replication Manager > Cloud Credentials and click the Add button. The Add Cloud Credential window appears.

2. Enter the values as required and click Save.

The newly added cloud credential should be listed on the Cloud Credentials page.

Update cloud credentials

You can update cloud credentials based on various factors.

• Changes made to a bucket configuration (secret/access keys, bucket name/endpoint, and encryption type) can affect Replication Manager replication policy execution and might require an update to Replication Manager cloud credentials.

• Credential changes are picked up by the next run of the policy. Any policies being run when the credential changes are made could fail, but succeeding runs will pick up the changes.

Delete credentials

You can delete unwanted credentials from the Replication Manager UI.

• Users can delete cloud credentials, but this triggers failures of any policies based on the deleted cloud credentials.

• You must delete the Replication Manager cloud policies associated with the deleted credentials and recreate the policies with the new credentials. You can view a list of policies associated with specific credentials on the Cloud Credentials page.


Unregistered credentials

Unregistered credentials can impact the replication process.

• Unregistered credentials in Replication Manager are credentials associated with a cluster node that do not have updated credentials.

• An example of how this can arise: if a node was down when the credentials were changed on a bucket, it still has the old credentials when it is brought back up.

Snapshot Replication using CDH Clusters

For HDFS services, use the File Browser tab to view the HDFS directories associated with a service on your cluster.

Important: You must have the Cluster Administrator role to perform snapshot replication.

You can view the currently saved snapshots for your files, and delete or restore them. From the HDFS File Browser tab, you can:

• Designate HDFS directories to be "snapshottable" so snapshots can be created for those directories.
• Initiate immediate (unscheduled) snapshots of an HDFS directory.
• View the list of saved snapshots currently being maintained. These can include one-off immediate snapshots, as well as scheduled policy-based snapshots.
• Delete a saved snapshot.
• Restore an HDFS directory or file from a saved snapshot.
• Restore an HDFS directory or file from a saved snapshot to a new directory or file (Restore As).

Before using snapshots, note the following limitations:

• Snapshots that include encrypted directories cannot be restored outside of the zone within which they were created.

• The Cloudera Manager Admin Console cannot perform snapshot operations (such as create, restore, and delete) for HDFS paths with encryption-at-rest enabled. This limitation only affects the Cloudera Manager Admin Console and does not affect CDH command-line tools or actions not performed by the Admin Console.

Browsing HDFS Directories

To browse the HDFS directories to view snapshot activity:

1. From the Clusters tab, select your designated CDH XXXX HDFS service.
2. Go to the File Browser tab.

As you browse the directory structure of your HDFS, basic information about the directory you have selected is shown at the right (owner, group, and other details).

Hive Replication Concepts

Hive replication consists of multiple scenarios that are supported.

Hive tables - Managed and External

Managed tables are Hive-owned tables where the entire lifecycle of the tables' data is managed and controlled by Hive. External tables are tables where Hive has loose coupling with the data.

All the write operations to the Managed tables are performed using Hive SQL commands. If a Managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. The transactional semantics (ACID) are also supported only on Managed tables.

The writes on External tables can be performed using Hive SQL commands, but data files can also be accessed and managed by processes outside of Hive. If an External table or partition is dropped, only the metadata associated with the table or partition is deleted but the underlying data files stay intact. A typical example for an External table is to run analytical queries on HBase or Druid owned data via Hive, where data files are written by HBase or Druid and Hive reads them for analytics.

Hive supports replication of External tables with data to the target cluster and it retains all the properties of External tables.

The data files' permission and ownership are preserved so that the relevant external processes can continue to write to them even after failover.

Important: Hive Materialized Views replication is not supported. However, Replication Manager does not skip Materialized Views during replication; they may not work as expected in the target cluster.

For handling conflicts in External tables' data location due to replication from multiple source clusters to the same target cluster, Replication Manager assigns a unique base directory for each source cluster, under which the External tables data from the corresponding source cluster is copied. For example, if the External table location at a source cluster is /ext/hbase_data, after replication the location in the target cluster would be <base_dir>/ext/hbase_data. Users can track the new location of External tables using the DESCRIBE TABLE command.
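The base-directory rule described above is just a path prefix applied per source cluster. A small sketch of that mapping follows; the base-directory values and function name are examples chosen for illustration, not values Replication Manager picks for you.

```python
import posixpath

def target_external_location(source_cluster: str, source_path: str,
                             base_dirs: dict) -> str:
    """Prefix the source location with the base directory assigned to that
    source cluster, e.g. /ext/hbase_data -> <base_dir>/ext/hbase_data."""
    base = base_dirs[source_cluster]
    return posixpath.join(base, source_path.lstrip("/"))

# Example base directories, one per source cluster, chosen by the operator.
base_dirs = {"cluster-a": "/repl/cluster_a", "cluster-b": "/repl/cluster_b"}
print(target_external_location("cluster-a", "/ext/hbase_data", base_dirs))
# /repl/cluster_a/ext/hbase_data
```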

Caution: Replication Manager upgrade use cases: In a normal scenario, if you had External tables that were replicated as Managed tables, after the upgrade process you must drop those tables from the target and set the base directory. In the next instance they get replicated as External tables.

Important: When you are replicating from an HDP 2.6.5 to a 3.1 cluster, if the source table is managed by Hive but the table location is not owned by Hive, at the target cluster the table is created as a Managed table. Later, if you upgrade 2.6.5 to 3.1 using the upgrade tool, the Managed table is automatically converted to an External table. But the same rule is not followed during the replication process. It may happen that the table type is External at the source cluster but Managed at the target cluster. You must make sure that, before you upgrade, the Hive user has ownership of the table location in the source cluster.

Handle replication conflicts between HDFS and Hive External Table location:

When you run the Hive replication policy on an external table, the data is stored on the target directory at a specific location. Next, when you run the HDFS replication policy, which tries to copy data at the same external table location, DLM Engine ensures that the Hive data is not overridden by HDFS. For example: running the Hive policy on an external table creates a target directory named /tmp/db1/ext1. When the HDFS policy is executed, HDFS should not override data by replicating to the /tmp/db1/ext1 directory.

How to avoid conflicts during External Tables replication process

When two Hive replication policies on DB1 and DB2 (either from the same source cluster or different clusters) have external tables pointing to the same data location (example: /abc), and they are replicated to the same target cluster, you must set different paths for the external table base directory configuration for both policies (example: /db1 for DB1 and /db2 for DB2). This arrangement ensures that the target external table data location would be different for both DBs (/db1/abcd and /db2/abcd respectively).

Caution: Replication conflicts are NOT supported in the On-Premise to Cloud scenario.

Sentry Policy Replication

During Hive replication from an on-premise CDH cluster to a cloud cluster, the Replication Manager migrates Sentry authorization policies into Ranger as part of the replication policy.

Sentry policy migration takes place as part of a replication policy job. When you create the replication policy, you choose the resources that you want to migrate and the Sentry policies will be migrated for those resources. In the Additional Settings page of the Create Replication Policy wizard, you must select Include Sentry Permissions with Metadata. To perform the Sentry policy replication, you must be running the Sentry service on CDH 5.12 or higher, or any CDH 6.x version. The Ranger version running on your cloud cluster must be 3.1.

The Sentry Permissions section of the Create Replication Policy wizard contains the following options:

• Include Sentry Permissions with Metadata - Select this to migrate Sentry permissions during the replication job.
• Exclude Sentry Permissions with Metadata - Select this if you do not want to migrate Sentry permissions during the replication job.
• Skip URI Privileges - Select this if you do not want to include URI privileges when you migrate Sentry permissions. During the migration, URI privileges are translated to point to an equivalent location in S3. You might not want to migrate URI privileges because if those resources have a different location in S3, the URI privileges will not be valid.

The image below shows the settings in Replication Manager for including Sentry permissions in the replication policy. Click the option to include Sentry permissions, and you have the option of skipping the migration of URI privileges.

The migration of Sentry policies into Ranger is performed in the following operations:

• Export - The export operation runs in the source cluster. During this operation, the Sentry permissions are fetched and exported to a JSON file. This file might be in a local file system or HDFS or S3, based on the configuration that you provided.

• Translate and Ingest - These operations take place on the target cluster. In the translate operation, Sentry permissions are translated into a format that can be read by Ranger. The permissions are then imported into Ranger. When the permissions are imported, they are tagged with the source cluster name and the time that the ingest took place. After the import, the file containing the permissions is deleted.


A Ranger policy is created for each resource, such as a database, table, or column. The policy name is derived from the resource name. For example, the following resource:

Database: dinosaurs, table = theropods

Would result in this policy:

database=dinosaurs->table=theropods

The priority for migrated policies is set to normal in Ranger. The normal priority allows you to create another policy for the same resource that overrides the policy that is imported from Sentry.
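The naming convention shown above can be expressed directly. The sketch below simply mirrors the example (resource parts joined with "->"); the function name and the dictionary are illustrative and are not Ranger API calls.

```python
from typing import Optional

def ranger_policy_name(database: str, table: Optional[str] = None,
                       column: Optional[str] = None) -> str:
    """Build the policy name from the resource parts, as in
    database=dinosaurs->table=theropods."""
    parts = [f"database={database}"]
    if table:
        parts.append(f"table={table}")
    if column:
        parts.append(f"column={column}")
    return "->".join(parts)

policy = {"name": ranger_policy_name("dinosaurs", "theropods"),
          "priority": "normal"}  # normal priority lets a later policy override it
print(policy["name"])  # database=dinosaurs->table=theropods
```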

Sentry to Ranger Permissions

Because there is not a one-to-one mapping between Sentry privileges and Ranger service policies, the Sentry privileges are translated into their equivalents within Ranger service policies.

Read through the following points to see how Sentry privileges will appear in Ranger after the migration:

• Sentry permissions that are granted to roles will be granted to groups in Ranger.
• Sentry permissions that are granted to a parent object are granted to the child object as well. The migration process preserves the permissions that are applied to child objects. For example, a permission that is applied at the database level will also apply to the tables within that database.
• Sentry OWNER privileges are translated to the Ranger OWNER privilege.
• Sentry OWNER WITH GRANT OPTION privileges are translated to Ranger OWNER with Delegated Admin checked.
• Sentry does not differentiate between tables and views. When view permissions are migrated, they are treated as table names.
• Sentry privileges on URIs will use the object store location as the base location.
• If your cluster contains the Kafka service and the Kafka Sentry policy had the "action": "ALL" permission, the migrated Ranger policy for the "cluster" resource will be missing the "alter" permission. This is only applicable for the "cluster" resource. You will need to add the policy manually after the upgrade. This missing permission will not have any functional impact. Adding the "alter" permission post upgrade is needed only for completeness because the 'configure' permission will allow alter operations.

The table below shows how actions in Sentry will be applied to the corresponding action in Ranger:

Table 1: Sentry Actions to Ranger Actions

| Sentry Action | Ranger Action |
| --- | --- |
| SELECT | SELECT |
| INSERT | UPDATE |
| CREATE | CREATE |
| REFRESH | REFRESH |
| ALL | ALL |
| SELECT with Grant | INSERT |
| INSERT with Grant | INSERT |
| CREATE with Grant | CREATE |
| ALL with Grant | ALL with Delegated Admin Checked |
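For quick reference in scripts, the single-action rows of Table 1 can be written as a lookup. This is only a transcription of the table above, not code from the migration tool; the "with Grant" rows additionally influence the Delegated Admin setting, as the last row of the table shows.

```python
# Transcription of the single-action rows of Table 1 (Sentry action -> Ranger action).
SENTRY_TO_RANGER_ACTION = {
    "SELECT": "SELECT",
    "INSERT": "UPDATE",
    "CREATE": "CREATE",
    "REFRESH": "REFRESH",
    "ALL": "ALL",
}

def ranger_action(sentry_action: str) -> str:
    """Look up the Ranger equivalent of a plain Sentry action."""
    return SENTRY_TO_RANGER_ACTION[sentry_action.upper()]

print(ranger_action("insert"))  # UPDATE
```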


Table Level Replication

To enable Table Level replication, you must specify the list of tables to be replicated in the given replication policy.

Table level replication enables you to replicate only critical tables instead of replicating all the tables. Enabling a table level replication policy helps you speed up the replication process and also reduces network bandwidth utilization.

Attention: Currently, the Table Level replication feature is supported only on CDH clusters.

You can define a table level replication policy using regular expressions, for example, db.marketing_*. You can dynamically add or remove tables from the list by manually changing the replication policy during run time. Hive automatically bootstraps a table if it is dynamically added to the policy and automatically drops the table if it is dynamically excluded. Hive also automatically validates the rename table operation to check whether the new table name is included or excluded as per the defined replication policy and acts accordingly.

Hive supports a database level replication policy of the format <db_name>.*. In the real world, the policy format is similar to <db_name>.(t1, t3, …). The tables list can be specified using Java-supported regular expressions in a replication policy of the format <db_name>.<include_regex>.<exclude_regex>.

The replication policy has three parts separated by a DOT (.). The first part is the DB name, the second part is a single regex representing the included tables list, and the third part is a single regex representing the tables that need to be excluded from the list even if they match the include_regex.

For Example:

1. <db_name> -- Full DB replication, which is currently supported.
2. <db_name>.'.*?' -- Full DB replication.
3. <db_name>.'t1|t3' -- DB replication with a static list of tables t1 and t3 included.
4. <db_name>.'(t1*)|t2'.'t100' -- DB replication with all tables having prefix t1, also including table t2 which does not have prefix t1, and excluding t100 which has the prefix t1.
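The include/exclude semantics of the <db_name>.<include_regex>.<exclude_regex> format can be checked locally before you submit a policy. The sketch below is illustrative only: it assumes Java-style regexes behave like Python's re module for simple patterns, it uses its own example patterns rather than the ones above, and its parsing assumes the quoted regexes contain no '.' characters.

```python
import re

def table_included(policy: str, db: str, table: str) -> bool:
    """Interpret <db_name>.<include_regex>.<exclude_regex>: a table is
    replicated when it matches the include pattern and does not match the
    exclude pattern. Simplified parsing for illustration only."""
    parts = [p.strip("'") for p in policy.split(".", 2)]
    if parts[0] != db:
        return False
    include = parts[1] if len(parts) > 1 else ".*"
    exclude = parts[2] if len(parts) > 2 else None
    if not re.fullmatch(include, table):
        return False
    return not (exclude and re.fullmatch(exclude, table))

policy = r"sales.'t1\w*|t2'.'t100'"   # include t1-prefixed tables and t2, exclude t100
print(table_included(policy, "sales", "t12"))   # True
print(table_included(policy, "sales", "t100"))  # False (explicitly excluded)
print(table_included(policy, "sales", "t2"))    # True
```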

Limitations using Table Level Replication

• If any table is dynamically added for replication due to changes in the regular expression or added to the include list, the table's data may not be point-in-time consistent with other tables that are already replicated incrementally. However, this inconsistency lasts only for the short period until the next incremental replication completes after the tables are added in the bootstrapped manner.
• Hive does not support a single replication policy with tables from different databases. Each DB requires an independent policy.
• Hive does not support overlapping replication policies such as db.,, db.[t1], and *. to the same target database. However, it works fine if the target database is different.

Bootstrap and Incremental Replication

Replication Manager allows you to replicate Hive databases from a source cluster to a target location on a destination cluster.

When you initiate the replication of Hive data, all of the data from the source location is copied to the destination. This bootstrapping of data can take hours to days, depending on factors such as the amount of data being copied and the available network bandwidth. Subsequent replication jobs from the same source location to the same target on the destination are incremental, so only the changed data is copied.

If a bootstrap replication is interrupted, such as due to a network failure or an unrecoverable error, Replication Manager automatically retries the job. If a retry succeeds, the replication job continues from the point at which it was interrupted. If the automatic retries are not successful, you must manually correct the problem before running the policy again. When you activate the policy again, the replication job resumes from the point at which it was suspended.


After the bootstrap replication succeeds, an incremental replication is automatically performed. This job synchronizes, between the source and destination clusters, any events that occurred during the bootstrap process. After the data is synchronized, the replicated data is ready for use on the destination.

Functions such as user-defined functions (UDFs) in Hive are replicated. To enable this, UDFs must be created using the following syntax:

CREATE FUNCTION [db_name.]function_name AS class_name USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ;
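For example, a function registered with a statement like the following is picked up by Hive replication; the database name, class name, and JAR location shown are illustrative placeholders, not values from this guide:

CREATE FUNCTION sales_db.to_upper AS 'com.example.udf.ToUpper'
  USING JAR 'hdfs:///user/hive/udfs/example-udfs.jar';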

• ACID tables, external tables, storage handler-based tables (such as HBase), and column statistics are currently not replicated.

• When creating a schedule for a Hive replication policy, you should set the frequency so that changes are replicated often enough to avoid overly large copies.

Incremental Replication

The incremental replication in Hive is achieved using notification events maintained by Hive in the Hive Metastore.

Hive logs notification events for all operations (both metadata and data changes) on managed tables. For external tables, however, data writes cannot be tracked by Hive because they are performed directly by external sources without using Hive SQL commands. Therefore, Hive always copies the latest data from external tables to the target cluster to avoid any loss of data.

Data Replication Use Cases

This page provides information about the data replication use cases pertaining to the HDFS and Hive services in Replication Manager.

Replicating using CDH on-premise cluster

While using CDH clusters, data related to HDFS and Hive can be replicated.

Hive replication from On-premise to Cloud

To create a Hive metadata replication job from on-premise to a cloud account, you must register your cloud account credentials with the Replication Manager instance so that Replication Manager can access your cloud storage. The replication load happens on the source on-premise cluster. Before performing Hive replication using classic clusters, refer to "Requirements while using CDH on-premise clusters" and Working with Cloud Credentials on page 21.

Related Information
Requirements while using CDH on-premise clusters

Replicating Hive Metadata from On-premise to Cloud

You must create a new data replication policy to replicate Hive metadata from on-premise to the cloud.

About this task

Before you create a new replication policy, you must register the cloud account with the Replication Manager service. Before you commence Hive replication, make sure to go through Requirements while using CDH on-premise clusters.

Caution: You must set a Ranger policy for the hdfs user on the target cluster that allows all operations on all databases and tables, because the same user role is used for importing the Hive Metastore. On the target cluster, the hive user must have Ranger admin privileges, because the same hive user performs the metadata import operation.

The hdfs user must have access to all Hive datasets, including all operations. Otherwise, the Hive import fails during the replication process. To provide access, follow these steps:

1. Log in to the Ranger Admin UI.
2. Provide the hdfs user permission to "all-database, table, column" in hdfs under the Hadoop_SQL section.


Note: You can replicate data from on-premise to the cloud with a single cluster. The Metastore must be running in the cloud.

Caution: You can create a Hive replication policy using CDH on-premise clusters.

Procedure

1. Go to Management Console > Replication Manager > Policies and click Add Policy.

2. Select HIVE as the service in the Create Replication Policy page.

3. Enter the Hive replication Policy Name and Description. Click Next.

4. Select Source Cluster from the drop-down.

5. Enter the value for Source Databases and Tables.

You can click the add icon to include additional databases and tables.

6. Enter the value for Source User.

This user must have the necessary permissions to replicate data.

7. Click Next. The Destination Data Lake page appears.

8. Select the Destination Data Lake cluster from the drop-down.

The Warehouse Path and the Hive External Table Base Directory path are listed. For example, for S3: s3://bucket_name/path

For ABFS: abfs://<filesystem>@<storage_account>/cc-dmx-7y4aqf/warehouse/tablespace/external/hive

9. Select Cloud Credential from the drop-down.

Note: You can also add cloud credentials using the Add Cloud Credentials link.

10. Enter the Username.

11. Click Validate Policy. Replication Manager validates the policy and displays the status of the source and destination information.


12. Click Next to proceed to scheduling the replication policy. The replication policy schedule page provides two options:

• Run Now (Default) - The replication policy is immediately submitted and processed.
• Schedule Run - The replication policy can be scheduled to run at a specified time interval.

13. Click Next. The Additional Settings page appears. On this page, you can enter values for:

• YARN Queue Name
• Maximum Maps Slots
• Maximum Bandwidth

14. Click Create. After the replication policy is created successfully, view the replication job status on the Policies page. Verify that the job starts and runs as expected.

Note: If the CDH source database contains functions, you must explicitly run the RELOAD FUNCTION command to view the migrated functions in the target location.
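For example, the following command refreshes the function registry on the target warehouse; the Beeline connection URL is an illustrative placeholder:

beeline -u "jdbc:hive2://target-master-1.example.com:10000/default" -e "RELOAD FUNCTION;"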

HDFS replication from On-premise to Cloud

To create a replication job from on-premise to S3, you must register your cloud account credentials with the Replication Manager service so that Replication Manager can access your cloud storage. Before performing HDFS replication using classic clusters, refer to "Requirements while using CDH on-premise clusters" and Working with Cloud Credentials on page 21.

Related Information
Requirements while using CDH on-premise clusters

Replicating HDFS data from On-premise to Cloud

You must create a new replication policy to replicate data from on-premise to the cloud.

About this task

Before you create a new replication policy, you must register the cloud account with the Replication Manager service.

Note: You can replicate data from on-premise to a cloud storage account with a single cluster.


Procedure

1. Go to Management Console > Replication Manager > Policies and click Add Policy.

2. Select HDFS as the service in the Create Replication Policy page.

3. Enter the HDFS replication Policy Name and Description. Click Next.

4. Select Source Cluster from the drop-down.

5. Enter the value for Source Path where the source data resides.

6. Enter the Source User.

7. Click Next.

8. The destination Type is listed as S3 or ABFS.

9. Select Cloud Credential from the drop-down.

Note: You can also add cloud credentials using the Add Cloud Credentials link.

10. Provide a folder path in the form bucket_name/path for S3 cloud storage.

When you select ABFS as your target cloud storage, you must provide the file system and the storage account. For example:

abfs://<filesystem>@<storage_account>/<location>
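A hypothetical ABFS target might therefore look like the following; the container, storage account, and path are illustrative only:

abfs://replication-data@examplestorage.dfs.core.windows.net/backups/hdfs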

11. Click Validate Policy. Replication Manager validates the policy and displays the status of the source and destination information.

12. Click Next to proceed to scheduling the replication policy. The replication policy schedule page provides two options:

• Run Now (Default) - The replication policy is immediately submitted and processed.
• Schedule Run - The replication policy can be scheduled to run at a specified time interval.

13. Click Next. The Additional Settings page appears. On this page, you can enter values for:

• YARN Queue Name
• Maximum Maps Slots
• Maximum Bandwidth

14. Click Create. After the replication policy is created successfully, view the replication job status on the Policies page. Verify that the job starts and runs as expected.

Use Replication Manager to migrate to CDP Private Cloud Base

Replication Manager can be used to migrate Hive, Impala, and HDFS workloads to CDP Private Cloud Base.

Features

Replication Manager is a service for copying and migrating data between environments within the enterprise data cloud. It is a simple, easy-to-use, and feature-rich data movement capability to move existing data and metadata to the cloud to fuel new workloads.

Supported Scenarios

• CDH to CDP Private Cloud Base


Replication Manager Overview

Cloudera Manager provides an integrated, easy-to-use management solution for enabling data protection on the Hadoop platform.

Replication Manager enables you to replicate data across data centers for disaster recovery scenarios. Replications can include data stored in HDFS, data stored in Hive tables, Hive metastore data, and Impala metadata (catalog server metadata) associated with Impala tables registered in the Hive metastore. When critical data is stored on HDFS, Cloudera Manager helps to ensure that the data is available at all times, even in case of a complete shutdown of a data center.

You can also use the HBase shell to replicate HBase data. (Cloudera Manager does not manage HBase replications.)

To understand more about Cloudera license requirements, see Managing Licenses.

You can also use Cloudera Manager to schedule, save, and restore snapshots of HDFS directories and HBase tables.

Cloudera Manager provides key functionality in the Cloudera Manager Admin Console:

• Select - Choose datasets that are critical for your business operations.
• Schedule - Create an appropriate schedule for data replication and snapshots. Trigger replication and snapshots as required for your business needs.
• Monitor - Track progress of your snapshots and replication jobs through a central console and easily identify issues or files that failed to be transferred.
• Alert - Issue alerts when a snapshot or replication job fails or is aborted so that the problem can be diagnosed quickly.

Replication Manager functions consistently across HDFS and Hive:

• You can set it up on files or directories in HDFS and on External tables in Hive, without manual translation of Hive datasets to HDFS datasets, or vice versa. Hive Metastore information is also replicated.

• Applications that depend on External table definitions stored in Hive operate on both replica and source as table definitions are updated.

• The hdfs user must have access to all Hive datasets, including all operations. Otherwise, the Hive import fails during the replication process. To provide access, follow these steps:

1. Log in to the Ranger Admin UI.
2. Provide the hdfs user permission to "all-database, table, column" in hdfs under the Hadoop_SQL section.

You can also perform a “dry run” to verify the configuration and understand the cost of the overall operation before actually copying the entire dataset.


Product Compatibility Matrix for Replication Manager

This matrix contains compatibility information across features of Replication Manager. The supported CDH and Cloudera Manager versions are detailed in the table.

Feature                                               Lowest supported Cloudera Manager Version    Lowest supported CDH Version    Supported Services
Replication                                           Cloudera Manager 5.14+                       CDH 5.13+                       HDFS, Hive, Impala
Replication to and from Amazon S3*                    Cloudera Manager 5.14+                       CDH 5.13+                       HDFS, Hive, Impala
Snapshots                                             Cloudera Manager 5.15+                       CDH 5.15+                       HDFS, Hive, Impala
Replication to and from Microsoft ADLS Gen1           Cloudera Manager 5.15, 5.16, 6.1+            CDH 5.13+                       HDFS, Hive, Impala
Replication to and from Microsoft ADLS Gen2 (ABFS)    Cloudera Manager 6.1+                        CDH 5.13+                       HDFS, Hive, Impala

Attention: The Replication Manager service is supported from CDP Private Cloud Base 7.0.3 to 7.1.1 (source cluster versions). The Cloudera Manager version is 7.0.3 onwards.

Replication Manager does not support S3 as a source or destination when S3 is configured to use SSE-KMS.

Starting in Cloudera Manager 6.1.0, Replication Manager ignores Hive tables backed by Kudu during replication. The change does not affect functionality, since Replication Manager does not support tables backed by Kudu. This change was made to guard against data loss due to how the Hive Metastore, Impala, and Kudu interact.

Supported Replication Scenarios

Versions

To replicate data to or from clusters managed by Cloudera Manager 7.x, the source or destination cluster must be managed by Cloudera Manager 5.14 or higher. Note that some functionality may not be available in Cloudera Manager 5.14.0 and higher or 6.0.0 and higher.

Kerberos

Replication Manager supports the following replication scenarios when Kerberos authentication is used on a cluster:

• Secure source to a secure destination.
• Insecure source to an insecure destination.
• Insecure source to a secure destination. Keep the following requirements in mind:

  • In replication scenarios where a destination cluster has multiple source clusters, all the source clusters must either be secure or insecure. Replication Manager does not support replication from a mixture of secure and insecure source clusters.
  • The destination cluster must run Cloudera Manager 7.x or higher.
  • The source cluster must run a compatible Cloudera Manager version.
  • This replication scenario requires additional configuration. For more information, see Replicating from Unsecure to Secure Clusters.

Cloud Storage

Replication Manager supports replicating to or from Amazon S3, Microsoft Azure ADLS Gen1, and Microsoft Azure ADLS Gen2 (ABFS).

TLS


You can use TLS with Replication Manager. Additionally, Replication Manager supports replication scenarios where TLS is enabled for non-Hadoop services (Hive/Impala) and TLS is disabled for Hadoop services (such as HDFS, YARN, and MapReduce).

Unsupported Replication Scenarios

Kerberos

Replication Manager does not support the following replication scenarios when Kerberos authentication is used on a cluster:

• Secure source to an unsecure destination is not supported.

Hive Replication

While using Hive replication, Managed tables are not supported. Replication Manager does not support migration using Managed tables on source and destination clusters. Replication Manager stores the replicated table as an External table.

Supported and Unsupported Replication Scenarios

This page provides information about the supported and unsupported replication scenarios.

Scenarios that are supported

Versions - To replicate data to or from clusters managed by Cloudera Manager 7.x, the source or destination cluster must be managed by Cloudera Manager 5.14 or higher. Note that some functionality may not be available in Cloudera Manager 5.14.0 and higher or 6.0.0 and higher.

Kerberos - Replication Manager supports the following replication scenarios when Kerberos authentication is used on a cluster:

• Secure source to a secure destination.
• Insecure source to an insecure destination.
• Insecure source to a secure destination. Keep the following requirements in mind:

  • In replication scenarios where a destination cluster has multiple source clusters, all the source clusters must either be secure or insecure. Replication Manager does not support replication from a mixture of secure and insecure source clusters.
  • The destination cluster must run Cloudera Manager 7.x or higher.
  • The source cluster must run a compatible Cloudera Manager version.
  • This replication scenario requires additional configuration.

Hive - Managed tables from the source are translated as External tables after the replication process is completed.

Transport Layer Security (TLS) - You can use TLS with Replication Manager. Additionally, Replication Manager supports scenarios where TLS is enabled for non-Hadoop services (Hive/Impala) and TLS is disabled for Hadoop services (such as HDFS, YARN, and MapReduce).

Cloud Storage - Replication Manager supports replicating to or from Amazon S3, Microsoft Azure ADLS Gen1, and Microsoft Azure ADLS Gen2 (ABFS).

Scenarios that are not supported

Versions - Replicating to or from Cloudera Manager 6 managed clusters with Cloudera Manager versions earlier than 5.14.0 is not supported.

Hive Replication - Replication Manager does not support Managed to Managed table replication. It translates the Managed table from the source cluster to the CDP Private Cloud Base cluster as an External table. Replication Manager stores the replicated table as an External table.

General - Replicating data from CDP Private Cloud Base to CDH clusters is not supported.


Kerberos - When Kerberos authentication is used on a cluster, replication from a secure source to an insecure destination is not supported.

Data Replication

You must understand some of the requirements about data replication.

You can also use the HBase shell to replicate HBase data. (Cloudera Manager does not manage HBase replications.)

View a video about Backing up Data Using Cloudera Manager.

Figure 2: Video: Backing up Data Using Cloudera Manager

Cloudera License Requirements for Replication

Both the source and destination clusters must have a Cloudera Enterprise license.

To understand more about Cloudera license requirements, see Managing Licenses.

Replicating Directories with Thousands of Files and Subdirectories

The data being replicated may include a directory with several hundred thousand files or subdirectories. To handle such directories, increase the client heap size as follows.

Procedure

1. On the destination Cloudera Manager instance, go to the HDFS service page.

2. Click the Configuration tab.

3. Expand SCOPE and select HDFS service name (Service-Wide) option.

4. Expand CATEGORY and select Advanced option.

5. Locate the HDFS Replication Environment Advanced Configuration Snippet (Safety Valve) for hadoop-env.sh property.

6. Increase the heap size by adding a key-value pair, for instance, HADOOP_CLIENT_OPTS=-Xmx1g. In this example, 1g sets the heap size to 1 GB. This value should be adjusted depending on the number of files and directories being replicated.
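As a rough, hypothetical sizing illustration only, a replication covering several million files and directories might use a larger heap; the exact value must be tuned to your environment:

HADOOP_CLIENT_OPTS=-Xmx4g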

7. Enter a Reason for change, and then click Save Changes to commit the changes.

Replication Manager Log Retention

By default, Cloudera Manager retains Replication Manager logs for 90 days. You can change the number of days Cloudera Manager retains logs for, or disable log retention completely.

1. In the Cloudera Manager Admin Console, search for the following property: Backup and Disaster Log Retention.

2. Enter the number of days you want to retain logs for. To disable log retention, enter -1.

To set up the Backup and Disaster Log Retention property, navigate to the Cloudera Manager > HDFS > Configuration section.

Important: Automatic log expiration also purges custom-set replication log and metadata files. These paths are set by the Log Path and Directory for Metadata arguments that are present in the UI as part of the schedule fields. It is the user's responsibility to set valid paths (for example, specify legal HDFS paths that are writable by the current user) and to maintain this information for each replication schedule.

Replicating from Unsecure to Secure Clusters

You can use Replication Manager to replicate data from an unsecure cluster, one that does not use Kerberos authentication, to a secure cluster, a cluster that uses Kerberos. Note that the reverse is not true.


About this task

Replication Manager does not support replicating from a secure cluster to an unsecure cluster. To perform the replication, the destination cluster must be managed by Cloudera Manager 6.1.0 or higher. The source cluster must run Cloudera Manager 5.14.0 or higher in order to be able to replicate to Cloudera Manager 6.

Note: In replication scenarios where a destination cluster has multiple source clusters, all the source clusters must either be secure or unsecure. Replication Manager does not support replication from a mixture of secure and unsecure source clusters.

To enable replication from an unsecure cluster to a secure cluster, you need a user that exists on all the hosts on both the source cluster and destination cluster. Specify this user in the Run As Username field when you create a replication schedule.

Procedure

1. On a host in the source or destination cluster, add a user with the following command:

sudo -u hdfs hdfs dfs -mkdir -p /user/<username>

For example, the following command creates a user named milton:

sudo -u hdfs hdfs dfs -mkdir -p /user/milton

2. Set the permissions for the user directory with the following command:

sudo -u hdfs hdfs dfs -chown <username> /user/<username>

For example, the following command makes milton the owner of the milton directory:

sudo -u hdfs hdfs dfs -chown milton /user/milton

3. Create the supergroup group for the user you created in step 1 with the following command:

groupadd supergroup

4. Add the user you created in step 1 to the group you created:

usermod -G supergroup <username>

For example, add milton to the group named supergroup:

usermod -G supergroup milton

5. Repeat this process for all hosts in the source and destination clusters so that the user and group exist on all of them.
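As an illustrative sketch only, steps 3 and 4 could be repeated across hosts with a small shell loop; it assumes passwordless SSH as root, a hosts.txt file listing every host in both clusters, and that the milton OS account already exists on each host:

# groupadd -f exits successfully if the group already exists;
# usermod -a -G appends supergroup instead of replacing existing supplementary groups.
while read -r host; do
  ssh root@"$host" 'groupadd -f supergroup && usermod -a -G supergroup milton'
done < hosts.txt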

What to do next

After you complete this process, specify the user you created in the Run As Username field when you create a replication policy.

Designating a Replication Source

You must assign the source cluster to replicate the data.

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

The Cloudera Manager Server that you are logged into is the destination for replications set up using that Cloudera Manager instance. From the Admin Console of this destination Cloudera Manager instance, you can designate a peer Cloudera Manager Server as a source of HDFS and Apache Hive data for replication.

Configuring a Peer Relationship

You must connect Cloudera Manager with the peer and later test the connectivity.

About this task

If your cluster uses SAML Authentication, see Configuring Peers with SAML Authentication before configuring a peer.


Procedure

1. From Cloudera Manager, select Replication > Peers in the left navigation bar. If there are no existing peers, an Add Peer button appears in addition to a short message. If peers already exist, they display in the Peers list.

2. Click Add Peer.

3. In the Add Peer dialog box, provide a name, the peer URL (including the port) of the Cloudera Manager Server source for the data to be replicated, and the login credentials for that server.

Important: The role assigned to the login on the source server must be either a User Administrator or a Full Administrator.

Cloudera recommends that TLS/SSL be used. A warning is shown if the URL scheme is http instead of https. After configuring both peers to use TLS/SSL, add the remote source Cloudera Manager TLS/SSL certificate to the local Cloudera Manager truststore, and vice versa.

4. Click the Add button in the dialog box to create the peer relationship.

Results

The peer is added to the Peers list. Cloudera Manager automatically tests the connection between the Cloudera Manager Server and the peer. You can also click Test Connectivity to test the connection. Test Connectivity also tests the Kerberos configuration for the clusters.

Modifying Peers

You can modify peers.

1. Do one of the following:

• Edit

a. In the row for the peer, select Edit.
b. Make your changes.
c. Click Update Peer to save your changes.

• Delete - In the row for the peer, click Delete.

Configuring Peers with SAML Authentication

If your cluster uses SAML Authentication, perform the following before you create a peer.

Procedure

1. Create a Cloudera Manager user account that has the User Administrator or Full Administrator role.

You can also use an existing user that has one of these roles. Since you use this user to create the peer relationship, you can delete the user account after you add the peer.

2. Create or modify the peer, as described in this topic.

3. Delete the Cloudera Manager user account that was just created.

HDFS Replication

Replication related to HDFS data is discussed in this section.


This page contains references to CDH 5 components or features that have been removed from CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager 6.

Minimum Required Role: Replication Administrator (also provided by Full Administrator)

HDFS replication enables you to copy (replicate) your HDFS data from one HDFS service to another, synchronizing the data set on the destination service with the data set on the source service, based on a specified replication policy. The destination service must be managed by the Cloudera Manager Server where the replication is being set up, and the source service can be managed by that same server or by a peer Cloudera Manager Server. You can also replicate HDFS data within a cluster by specifying different source and destination directories.

Remote Replication Manager automatically copies HDFS metadata to the destination cluster as it copies files. HDFS metadata need only be backed up locally.

Source Data

When a replication job runs, ensure that the source directory is not modified.

A file added during replication does not get replicated. If you delete a file during replication, the replication fails.

Additionally, ensure that all files in the directory are closed. Replication fails if source files are open. If you cannot ensure that all source files are closed, you can configure the replication to continue despite errors. Uncheck the Abort on Error option for the HDFS replication.

After the replication completes, you can view the log for the replication to identify opened files. Ensure these files are closed before the next replication occurs.

Network Latency and Replication

High latency among clusters can cause replication jobs to run more slowly, but does not cause them to fail.

For best performance, latency between the source cluster NameNode and the destination cluster NameNode should be less than 80 milliseconds. (You can test latency using the Linux ping command.) Cloudera has successfully tested replications with latency of up to 360 milliseconds. As latency increases, replication performance degrades.

Performance and Scalability Limitations

HDFS replication has some limitations.

• Maximum number of files for a single replication job: 100 million.
• Maximum number of files for a replication policy that runs more frequently than once in 8 hours: 10 million.
• The throughput of the replication job depends on the absolute read and write throughput of the source and destination clusters.
• Regular rebalancing of your HDFS clusters is required for efficient operation of replications.

Note: Cloudera Manager provides downloadable data that you can use to diagnose HDFS replication performance.

Replication with Sentry Enabled

If the cluster has Sentry enabled and you are using Replication Manager to replicate files or tables and their permissions, configuration changes to HDFS are required.

Before you begin

The configuration changes are required due to how HDFS manages ACLs. When a user reads ACLs, HDFS provides the ACLs configured in the External Authorization Provider, which is Sentry. If Sentry is not available or it does not manage authorization of the particular resource, such as the file or directory, then HDFS falls back to its own internal ACLs. But when ACLs are written to HDFS, HDFS always writes these internal ACLs even when Sentry is configured. This causes HDFS metadata to be polluted with Sentry ACLs. It can also cause a replication failure in replication when Sentry ACLs are not compatible with HDFS ACLs.

To prevent issues with HDFS and Sentry ACLs, complete the following steps:

Procedure

1. Create a user account that is used only for Replication Manager jobs, since Sentry ACLs will be bypassed for this user. For example, create a user named bdr-only-user.

2. Configure HDFS on the source cluster:

a) In the Cloudera Manager Admin Console, select Clusters > HDFS Service.
b) Select Configuration and search for the following property: NameNode Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml.
c) Add the following property (see the example entry after this procedure):

Name: Use the following property name: dfs.namenode.inode.attributes.provider.bypass.users

Value: Provide the following information: <username>, <username>@<RealmName>

Replace <username> with the user you created in step 1 and <RealmName> with the name of the Kerberos realm.

For example, the user bdr-only-user on the realm ElephantRealm requires the following value:

bdr-only-user, bdr-only-user@ElephantRealm

Description: This field is optional.

d) Restart the NameNode.

3. Repeat step 2 on the destination cluster.

4. When you create a policy, specify the user you created in step 1 in the Run As Username and Run on Peer as Username (if available) fields.

Note: The Run As Username field is used to launch the MapReduce job for copying data. The Run on Peer as Username field is used to run the copy listing on the source, if different from Run As Username.
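For reference, the safety valve entry described in step 2 corresponds to an hdfs-site.xml property of the following form; the user and realm are the illustrative values from the example above:

<property>
  <name>dfs.namenode.inode.attributes.provider.bypass.users</name>
  <value>bdr-only-user, bdr-only-user@ElephantRealm</value>
</property>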

What to do next

Important: Make sure to set the value of Run on Peer as Username to the same value as Run As Username. Otherwise, Replication Manager reads ACLs from the source as the hdfs user, which pulls the Sentry-provided ACLs over to the target cluster and applies them to the files in HDFS. This can result in additional NameNode heap usage in the target cluster.

Guidelines for Snapshot Diff-based Replication

By default, Replication Manager uses snapshot differences ("diff") to improve performance by comparing HDFS snapshots and only replicating the files that are changed in the source directory.

While Hive metadata requires a full replication, the data stored in Hive tables can take advantage of snapshot diff-based replication.

To use this feature, follow these guidelines:

• If a Hive replication policy is created to replicate a database, ensure all the HDFS paths for the tables in that database are either snapshottable or under a snapshottable root. For example, if the database that is being replicated has external tables, all the external table HDFS data locations should be snapshottable too (see the example command after this list). Failing to do so might cause Replication Manager to fail to generate a diff report. Without a diff report, Replication Manager does not use snapshot diff.

• After every replication, Replication Manager retains a snapshot on the source cluster. Using the snapshot copy on the source cluster, Replication Manager performs incremental backups for the next replication cycle. Replication Manager retains snapshots on the source cluster only if:


• Source and target clusters in Cloudera Manager are 5.15 and higher.
• Source and target CDH versions are 5.13.3+, 5.14.2+, and 5.15+ respectively.
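For example, an administrator could make an external table's data location snapshottable with the standard HDFS command; the path shown is illustrative only:

sudo -u hdfs hdfs dfsadmin -allowSnapshot /warehouse/tablespace/external/hive/sales_db.db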

Configuring Replication of HDFS Data

You must set up your clusters before you plan HDFS data replication.

Procedure

1. Verify that your cluster conforms to one of the supported replication scenarios.

2. If you are using different Kerberos principals for the source and destination clusters, add the destination principal as a proxy user on the source cluster. For example, if you are using the hdfssrc principal on the source cluster and the hdfsdest principal on the destination cluster, add the following properties to the HDFS service Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml property on the source cluster:

<property>
  <name>hadoop.proxyuser.hdfsdest.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hdfsdest.hosts</name>
  <value>*</value>
</property>

Deploy the client configuration and restart all services on the source cluster, if the source cluster is managed by a different Cloudera Manager server than the destination cluster.

3. From Cloudera Manager, select Replication > Replication Policies.

4. Select HDFS Replication Policy.


The Create HDFS Replication Policy dialog box opens, displaying the General tab.

5. Select the General tab to configure the following:

• Click the Name field and add a unique name for the replication policy.
• Click the Source field and select the source HDFS service. You can select HDFS services managed by a peer Cloudera Manager Server or local HDFS services (managed by the Cloudera Manager Server for the Admin Console you are logged into).
• Enter the Source Path to the directory or file you want to replicate.
• Click the Destination field and select the destination HDFS service from the HDFS services managed by the Cloudera Manager Server for the Admin Console you are logged into.
• Enter the Destination Path where the source files should be saved.
• Select a Schedule:

  • Immediate - Run the schedule immediately.
  • Once - Run the schedule one time in the future. Set the date and time.
  • Recurring - Run the schedule periodically in the future. Set the date, time, and interval between runs.

• Enter the user to run the replication job in the Run As Username field. By default this is hdfs. If you want to run the job as a different user, enter the user name here. If you are using Kerberos, you must provide a user name here, and it must be one with an ID greater than 1000. You can also configure the minimum user ID number with the min.user.id property in the YARN or MapReduce service. Verify that the user running the job has a home directory, /user/username, owned by username:supergroup in HDFS. This user must have permissions to read from the source directory and write to the destination directory.

Note the following:

• The user must not be present in the list of banned users specified with the Banned System Users property in the YARN configuration (go to the YARN service, select the Configuration tab, and search for the property). For security purposes, the hdfs user is banned by default from running YARN containers.

• The requirement for a user ID that is greater than 1000 can be overridden by adding the user to the "white list" of users that is specified with the Allowed System Users property. (Go to the YARN service, select the Configuration tab, and search for the property.)

6. Select the Resources tab to configure the following:

Scheduler Pool – (Optional) Enter the name of a resource pool in the field. The value you enter is used by the MapReduce Service you specified when Cloudera Manager executes the MapReduce job for the replication. The job specifies the value using one of these properties:

• MapReduce – Fair scheduler: mapred.fairscheduler.pool
• MapReduce – Capacity scheduler: queue.name
• YARN – mapreduce.job.queuename


• Maximum Map Slots - Limits for the number of map slots per mapper. The default value is 20.
• Maximum Bandwidth - Limits for the bandwidth per mapper. The default is 100 MB.
• Replication Strategy - Whether file replication tasks should be distributed among the mappers statically or dynamically. (The default is Dynamic.) Static replication distributes file replication tasks among the mappers up front to achieve a uniform distribution based on the file sizes. Dynamic replication distributes file replication tasks in small sets to the mappers, and as each mapper completes its tasks, it dynamically acquires and processes the next unallocated set of tasks.

7. Select the Advanced Options tab to configure the following options:

• Add Exclusion - Click the link to exclude one or more paths from the replication. The Regular Expression-Based Path Exclusion field displays, where you can enter a regular expression-based path. When you add an exclusion, include the snapshotted relative path for the regex. For example, to exclude the /user/bdr directory, use the following regular expression, which includes the snapshots for the bdr directory:

.*/user/\.snapshot/.+/bdr.*

To exclude top-level directories from replication in a globbed source path, you can specify the relative path for the regex without including .snapshot in the path. For example, to exclude the bdr directory from replication, use the following regular expression:

.*/user+/bdr.*

Note: When you set a path exclusion filter (and have the delete policy set to delete), it is expected that the path on the target cluster remains the same. However, the current behavior is that the directories and files are deleted on the target cluster even if they match the exclusion filter.

You can add more than one regular expression to exclude.

• MapReduce Service - The MapReduce or YARN service to use.
• Log Path - An alternate path for the logs.
• Description - A description of the replication policy.
• Error Handling - You can select the following:

  • Skip Checksum Checks - Whether to skip checksum checks on the copied files. If checked, checksums are not validated. Checksums are checked by default.

Important: You must skip checksum checks to prevent replication failure due to non-matching checksums in the following cases:

• Replications from an encrypted zone on the source cluster to an encrypted zone on a destination cluster.
• Replications from an encryption zone on the source cluster to an unencrypted zone on the destination cluster.
• Replications from an unencrypted zone on the source cluster to an encrypted zone on the destination cluster.

Checksums are used for two purposes:

• To skip replication of files that have already been copied. If Skip Checksum Checks is selected, the replication job skips copying a file if the file lengths and modification times are identical between the source and destination clusters. Otherwise, the job copies the file from the source to the destination.
• To redundantly verify the integrity of data. However, checksums are not required to guarantee accurate transfers between clusters. HDFS data transfers are protected by checksums during transfer, and storage hardware also uses checksums to ensure that data is accurately stored. These two mechanisms work together to validate the integrity of the copied data.


• Skip Listing Checksum Checks - Whether to skip the checksum check when comparing two files to determine whether they are the same. If skipped, the file size and last modified time are used to determine if files are the same. Skipping the check improves performance during the mapper phase. Note that if you select the Skip Checksum Checks option, this check is also skipped.

• Abort on Error - Whether to abort the job on an error. If selected, files copied up to that point remain on the destination, but no additional files are copied. Abort on Error is off by default.

• Abort on Snapshot Diff Failures - If a snapshot diff fails during replication, Replication Manager uses a complete copy to replicate data. If you select this option, Replication Manager aborts the replication when it encounters an error instead.

• Preserve - Whether to preserve the block size, replication count, permissions (including ACLs), and extended attributes (XAttrs) as they exist on the source file system, or to use the settings as configured on the destination file system. By default, source system settings are preserved. When Permission is checked, and both the source and destination clusters support ACLs, replication preserves ACLs. Otherwise, ACLs are not replicated. When Extended attributes is checked, and both the source and destination clusters support extended attributes, replication preserves them. (This option only displays when both source and destination clusters support extended attributes.)

Note: To preserve permissions to HDFS, you must be running as a superuser on the destination cluster. Use the Run As Username option to ensure that is the case.

• Delete Policy - Whether files that were deleted on the source should also be deleted from the destination directory. This policy also determines the handling of files in the destination location that are unrelated to the source. Options include:

  • Keep Deleted Files - Retains the destination files even when they no longer exist at the source. (This is the default.)
  • Delete to Trash - If the HDFS trash is enabled, files are moved to the trash folder.
  • Delete Permanently - Uses the least amount of space; use with caution.

• Alerts - Whether to generate alerts for various state changes in the replication workflow. You can alert on failure, on start, on success, or when the replication workflow is aborted.

8. Click Save Policy. The replication task now appears as a row in the Replication Policies table. It can take up to 15 seconds for the task to appear.

If you selected Immediate in the Schedule field, the replication job begins running when you click Save Policy.

What to do next

To specify additional replication tasks, select Create > HDFS Replication.

Note: If your replication job takes a long time to complete, and files change before the replication finishes, the replication may fail. Consider making the directories snapshottable, so that the replication job creates snapshots of the directories before copying the files and then copies files from these snapshottable directories when executing the replication.

Limiting Replication Hosts

If your cluster has clients installed on hosts with limited resources, HDFS replication may use these hosts to run commands for the replication, which can cause performance degradation. You can limit HDFS replication to run only on selected DataNodes by specifying a "whitelist" of DataNode hosts.

Procedure

1. Click Clusters > HDFS Service > Configuration.

2. Type HDFS Replication in the search box.

3. Locate the HDFS Replication Environment Advanced Configuration Snippet (Safety Valve) property.


4. Add the HOST_WHITELIST property. Enter a comma-separated list of DataNode hostnames to use for HDFS replication. For example:

HOST_WHITELIST=host-1.mycompany.com,host-2.mycompany.com

5. Click Save Changes to commit the changes.

Viewing Replication Policies

The Replication Policies page displays a row of information about each scheduled replication job. Each row also displays recent messages regarding the last time the replication job ran.

Figure 3: Replication Policies Table

Only one job corresponding to a replication policy can occur at a time; if another job associated with that same replication policy starts before the previous one has finished, the second one is canceled.

You can limit the replication jobs that are displayed by selecting filters on the left. If you do not see an expected policy, adjust or clear the filters. Use the search box to search the list of policies for path, database, or table names.

The Replication Policies columns are described in the following table.

Table 2: Replication Policies Table

Column Description

ID An internally generated ID number that identifies the policy. Provides a convenient way to identify a policy.

Click the ID column label to sort the replication policy table by ID.

Name The unique name you specify when you create a policy.

Type The type of replication policy, either HDFS or Hive.

Source The source cluster for the replication.

Destination The destination cluster for the replication.

Throughput Average throughput per mapper/file of all the files written. Note that throughput does not include the following information: the combined throughput of all mappers and the time taken to perform a checksum on a file after the file is written.

Progress The progress of the replication.


Completed The time when the replication job completed.

Click the Completed column label to sort the replication policies table by time.

Next Run The date and time when the next replication is scheduled, based on the schedule parameters specified for the schedule. Hover over the date to view additional details about the scheduled replication.

Click the Next Run column label to sort the Replication Policies table by the next run date.

Actions The following items are available from the Action button:

• Show History - Opens the Replication History page for a replication.
• Edit Configuration - Opens the Edit Replication Policy page.
• Dry Run - Simulates a run of the replication task but does not actually copy any files or tables. After a Dry Run, you can select Show History, which opens the Replication History page where you can view any error messages and the number and size of files or tables that would be copied in an actual replication.
• Run Now - Runs the replication task immediately.
• Click Collect Diagnostic Data to open the Send Diagnostic Data screen, which allows you to collect replication-specific diagnostic data for the last 10 runs of the policy.

In the Send Diagnostic Data screen, select Send Diagnostic Data to Cloudera to automatically send the bundle to Cloudera Support. You can also enter a ticket number and comments when sending the bundle. After you click Collect and Send Diagnostic Data, Replication Manager generates the bundle and opens the Replications Diagnostics Command screen. When the command finishes, click Download Result Data to download a zip file containing the bundle.

• Disable | Enable - Disables or enables the replication policy. No further replications are scheduled for disabled replication policies.
• Delete - Deletes the policy. Deleting a replication policy does not delete copied files or tables.

• While a job is in progress, the Last Run column displays a spinner and progress bar, and each stage of the replication task is indicated in the message beneath the job's row. Click the Command Details link to view details about the execution of the command.

• If the job is successful, the number of files copied is indicated. If there have been no changes to a file at the source since the previous job, then that file is not copied. As a result, after the initial job, only a subset of the files may actually be copied, and this is indicated in the success message.

• If the job fails, an error icon displays.

• To view more information about a completed job, select Actions > Show History.

Viewing Replication History

You can view historical details about replication jobs on the Replication History page.

To view the history of a replication job:

1. From Cloudera Manager, select Replication > Replication Policies.

The list of available replication policies appears.

2. Select the policy, and click Actions > Show History.

Figure 4: Replication History Screen (HDFS)


Replication History Table

The Replication History page displays a table of previously run replication jobs with the following columns:


Column Description

Start Time Shows the details about the job.

You can expand the section to view the following job details:

• Started At - Displays the time the replication job started.
• Duration - Displays the time duration for the job to complete.
• Command Details - Displays the command details in a new tab after you click View.

The Command Details page displays the details and messages about each step during the command run. On this page, click Context to view the service status page relevant to the command, and click Download to download the summary as a JSON file.

To view the command details, expand the Step section and then choose Show All Steps, Show Only Failed Steps, or Show Only Running Steps. In this section, you can perform the following tasks:

• View the actual command string.
• View the start time and duration for the command run.
• View the host status page for the command by clicking the host link.
• View the full log file for the command by selecting the stdout or stderr tab.

See Viewing Running and Recent Commands.

• MapReduce Job - Click the link to view the job details.
• HDFS Replication Report - Click Download CSV to view the following options:

  • Listing - Click to download the CSV file that contains the replication report. The file lists the files and directories copied during the replication job.
  • Status - Click to download the CSV file that contains the complete status report. The file contains the full status report of the files where the status of the replication is one of the following:

    • ERROR – An error occurred and the file was not copied.
    • DELETED – A deleted file.
    • SKIPPED – A file where the replication was skipped because it was up-to-date.

  • Error Status Only - Click to download the CSV file that contains the status report of all copied files with errors. The file lists the status, path, and message for the copied files with errors.
  • Deleted Status Only - Click to download the CSV file that contains the status report of all deleted files. The file lists the status, path, and message for the databases and tables that were deleted.
  • Skipped Status Only - Click to download the CSV file that contains the status report of all skipped files. The file lists the status, path, and message for the databases and tables that were skipped.
  • Performance - Click to download a CSV file which contains a summary report about the performance of the running replication job. The performance summary report includes the last performance sample for each mapper that is working on the replication job.
  • Full Performance - Click to download the CSV file that contains the performance report of the job. The performance report shows the samples taken for all the mappers during the full execution of the replication job.

• (Dry Run only) View the number of Replicable Files. Displays the number of files that would be replicated during an actual replication.
• (Dry Run only) View the number of Replicable Bytes. Displays the number of bytes that would be replicated during an actual replication.
• View the number of Impala UDFs replicated. (Displays only for Hive/Impala replications where Replicate Impala Metadata is selected.)
• If a user was specified in the Run As Username field when creating the replication job, the selected user displays.

• View messages returned from the replication job.

Duration Amount of time the replication job took to complete.

Outcome Indicates success or failure of the replication job.

Files Expected Number of files expected to be copied and its file size based on the parameters of the replication policy.

Files Copied Number of files copied and its file size for the replication job.

Files Failed Number of files that failed to be copied and its file size for the replication job.

Files Deleted Number of files that were deleted and its file size for the replication job.


Files Skipped Number of files skipped and its file size for the replication job. The replication process skips files that already exist in the destination and have not changed.

Monitoring the Performance of HDFS Replications

You can monitor the progress of an HDFS replication policy using performance data that you download as a CSV file from the Cloudera Manager Admin Console.

About this task

This file contains information about the files being replicated, the average throughput, and other details that can help diagnose performance issues during HDFS replications. You can view this performance data for running HDFS replication jobs and for completed jobs.

To view the performance data for a running HDFS replication policy, perform the following steps:

Procedure

1. From Cloudera Manager, select Replication > Replication Policies.

2. Select the policy, and click Actions > Show History.

3. Click Download CSV, and then choose one of the following options to view the performance report:

• Performance - Click to download a CSV file which contains a summary report about the performance of the replication job. The performance summary report includes the last performance sample for each mapper that is working on the replication job.

• Full Performance - Click to download the CSV file that contains the performance report of the job. The complete performance report includes all the samples taken for all mappers during the full execution of the replication job.

4. To view the data, open the file in a spreadsheet program such as Microsoft Excel.
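If you prefer to inspect the report from a terminal rather than a spreadsheet, a quick command-line alternative is sketched below; the file name performance.csv is only an illustrative placeholder for whichever CSV you downloaded, and fields that contain embedded commas may not align perfectly.

# Render the comma-separated report as aligned columns and page through it
column -s, -t performance.csv | less -S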

What to do next

The following table shows the columns that you can view in the CSV file:


Table 3: HDFS Performance Report Columns

Performance Data Column - Description

Timestamp - Time when the performance data was collected.

Host - Name of the host where the YARN or MapReduce job was running.

Bytes Copied - Number of bytes copied for the file currently being copied.

Time Elapsed (ms) - Total time elapsed in milliseconds for the copy operation of the file currently being copied.

Files Copied - Number of files copied.

Avg Throughput (KB/s) - Average throughput since the start of the file currently being copied, in kilobytes per second.

Last File (bytes) - File size of the last file, in bytes.

Last File Time (ms) - Time taken to copy the last file, in milliseconds.

Last file throughput (KB/s) - Throughput since the start of the last file being copied, in kilobytes per second.

In addition to the performance reports, you can view the reports of files with errors, files that are deleted, and files that are skipped during the replication job. To view the reports, perform the following steps:

• On the Replication Policies page, select the policy and click Actions > Show History.

  The Replication History page for the replication policy appears. Expand to view the replication job details.

• Click Download CSV for the following options:

  • Listing - Click to download the CSV file that contains the replication report. The file lists the files and directories copied during the replication job.

  • Status - Click to download the CSV file that contains the complete status report. The file contains the full status report of the files where the status of the replication is one of the following:

    • ERROR – An error occurred and the file was not copied.
    • DELETED – A deleted file.
    • SKIPPED – A file where the replication was skipped because it was up-to-date.

  • Error Status Only - Click to download the CSV file that contains the status report of all copied files with errors. The file lists the status, path, and message for the copied files with errors.

  • Deleted Status Only - Click to download the CSV file that contains the status report of all deleted files. The file lists the status, path, and message for the databases and tables that were deleted.

  • Skipped Status Only - Click to download the CSV file that contains the status report of all skipped files. The file lists the status, path, and message for the databases and tables that were skipped.

  • Performance - Click to download a CSV file which contains a summary report about the performance of the running replication job. The performance summary report includes the last performance sample for each mapper that is working on the replication job.

  • Full Performance - Click to download the CSV file that contains the performance report of the job. The performance report shows the samples taken for all the mappers during the full execution of the replication job.

To view the data, open the file in a spreadsheet program such as Microsoft Excel.

The performance data is collected every two minutes. Therefore, no data is available during the initial execution of a replication job because not enough samples are available to estimate throughput and other reported data.



Note the following limitations and known issues:

• If you click the CSV download too soon after the replication job starts, Cloudera Manager returns an empty file or a CSV file that has column headers only and a message to try again later, when performance data has actually been collected.

• If you employ a proxy user with the form user@domain, performance data is not available through the links.

• If the replication job only replicates small files that can be transferred in less than a few minutes, no performance statistics are collected.

• For replication policies that specify the Dynamic Replication Strategy, statistics regarding the last file transferred by a MapReduce job hide previous transfers performed by that MapReduce job.

• Only the last trace per MapReduce job is reported in the CSV file.

Hive/Impala Replication

Hive/Impala replication enables you to copy (replicate) your Hive metastore and data from one cluster to another and synchronize the Hive metastore and data set on the destination cluster with the source, based on a specified replication policy.

This page contains references to CDH 5 components or features that have been removed from CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager 6.

Minimum Required Role: Replication Administrator (also provided by Full Administrator)

The destination cluster must be managed by the Cloudera Manager Server where the replication is being set up, and the source cluster can be managed by that same server or by a peer Cloudera Manager Server.

Caution: Because of the warehouse directory changes between CDH clusters and CDP-DC, Hive replication does not copy the table data from the database and tables specified in the source cluster; however, the replication job completes successfully without any disruption. While replicating from CDH clusters to CDP-DC, it is recommended that the HDFS Destination Path is defined. If HDFS Destination Path is not defined and Replicate HDFS Files is set to true, the data is replicated with the original source name. For example, the replicated table data was meant to reside under the /warehouse/tablespace/external/hive directory, but the data was replicated to the /user/hive/warehouse location. Also, not defining HDFS Destination Path before the replication process can result in a large chunk of HDFS space being used for unwanted data movement.

Important: Since Hive3 has a different default table type and warehouse directory structure, the following changes apply while replicating Hive data from CDH5 or CDH6 versions to CDP-DC:

• All tables become External tables during Hive replication. This is because the default table type is ACID in Hive3, which is the only managed table type. As of this release, BDR does not support Hive2 -> Hive3 replication into ACID tables, and all the tables are necessarily replicated as External tables.

• Replicated tables are created under the external Hive warehouse directory set by the hive.metastore.warehouse.external.dir Hive configuration parameter. Make sure that this has a different value than the hive.metastore.warehouse.dir Hive configuration parameter, which is the location of managed tables.

• If users want to replicate the same database from Hive2 to Hive3 (which will have different paths by design), they need to use the Force Overwrite option per policy to avoid any mismatch issues.


Note: While replicating from Sentry to Ranger, the minimum supported Cloudera Manager version is 6.3.1.

Configuration notes:

• If the hadoop.proxyuser.hive.groups configuration has been changed to restrict access to the Hive Metastore Server to certain users or groups, the hdfs group or a group containing the hdfs user must also be included in the list of groups specified for Hive/Impala replication to work. This configuration can be specified either on the Hive service as an override, or in the core-site HDFS configuration. This applies to configuration settings on both the source and destination clusters.

• If HDFS ACL synchronization is configured on the target cluster for the directory where HDFS data is copied during Hive/Impala replication, the permissions that were copied during replication are overwritten by the HDFS ACL synchronization and are not preserved.

Note: If your deployment includes tables backed by Kudu, Replication Manager filters out Kudu tables for a Hive replication in order to prevent data loss or corruption.

Host Selection for Hive/Impala Replication

If your cluster has Hive clients installed on hosts with limited resources, Hive/Impala replication may use these hosts to run commands for the replication, which can cause the performance of the replication to degrade.

About this task

To improve performance, you can specify the hosts (a "white list") to use during replication so that the lower-resource hosts are not used.

Procedure

1. Click Clusters > Hive Service > Configuration.

2. Type Hive Replication in the search box.

3. Locate the Hive Replication Environment Advanced Configuration Snippet (Safety Valve) property.

4. Add the HOST_WHITELIST property. Enter a comma-separated list of hostnames to use for Hive/Impala replication.

HOST_WHITELIST=host-1.mycompany.com,host-2.mycompany.com

5. Enter a Reason for change, and then click Save Changes to commit the changes.

Hive Tables and DDL Commands

The following applies when using the drop table and truncate table DDL commands.

• If you configure replication of a Hive table and then later drop that table, the table remains on the destination cluster. The table is not dropped when subsequent replications occur.

• If you drop a table on the destination cluster, and the table is still included in the replication job, the table is re-created on the destination during the replication.

• If you drop a table partition or index on the source cluster, the replication job also drops them on the destination cluster.

• If you truncate a table, and the Delete Policy for the replication job is set to Delete to Trash or Delete Permanently, the corresponding data files are deleted on the destination during a replication.

Replication of Parameters

Parameters of databases, tables, partitions, and indexes are replicated by default during Hive/Impala replications.

To disable replication of parameters, perform the following steps:

1. Log in to the Cloudera Manager Admin Console.
2. Go to the Hive service.
3. Click the Configuration tab.


4. Search for the Hive Replication Environment Advanced Configuration Snippet property.
5. Add the following parameter:

REPLICATE_PARAMETERS=false

6. Click Save Changes.

Hive Replication in Dynamic Environments

To use Replication Manager for Hive replication in environments where the Hive Metastore changes, such as when a database or table gets created or deleted, additional configuration is needed.

Procedure

1. Open the Cloudera Manager Admin Console.

2. Search for the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml property on the source cluster.

3. Add the following properties:

a) Name: replication.hive.ignoreDatabaseNotFound
   Value: true

b) Name: replication.hive.ignoreTableNotFound
   Value: true

4. Save the changes.

5. Restart the HDFS service.

Configuring Replication of Hive/Impala Data

To configure Hive/Impala data replication:

1. Verify that your cluster conforms to one of the supported replication scenarios.
2. If the source cluster is managed by a different Cloudera Manager server than the destination cluster, configure a peer relationship.
3. From Cloudera Manager, select Replication > Replication Policies.
4. Select Hive Replication Policy.


The Create Hive Replication Policy dialog box displays, opening to the General tab.

5. Select the General tab to configure the following:

a. Use the Name field to provide a unique name for the replication policy.

b. Use the Source drop-down list to select the cluster with the Hive service you want to replicate.

c. Use the Destination drop-down list to select the destination for the replication. If there is only one Hive service managed by Cloudera Manager available as a destination, this is specified as the destination. If more than one Hive service is managed by this Cloudera Manager, select from among them.

d. Based on the type of destination cluster you plan to use, select Use HDFS Destination.

e. Select one of the following permissions:

   • Do not import Sentry Permissions (Default)
   • If Sentry permissions were exported from the CDH cluster, import both Hive object and URL permissions
   • If Sentry permissions were exported from the CDH cluster, import only Hive object permissions

f. Leave Replicate All checked to replicate all the Hive databases from the source. To replicate only selected databases, clear this option and enter the database name(s) and tables you want to replicate.

   • You can specify multiple databases and tables using the plus symbol to add more rows to the specification.

   • You can specify multiple databases on a single line by separating their names with the pipe (|) character. For example: mydbname1|mydbname2|mydbname3.


   • Regular expressions can be used in either database or table fields, as described in the following table:

     Regular Expression - Result

     [\w].+ - Any database or table name.

     (?!myname\b).+ - Any database or table except the one named myname.

     db1|db2[\w_]+ - All tables of the db1 and db2 databases.

     db1[\w_]+ (click the "+" button and then enter db2[\w_]+) - All tables of the db1 and db2 databases (alternate method).

g. To specify the user that should run the MapReduce job, use the Run As Username option. By default, MapReduce jobs run as hdfs. To run the MapReduce job as a different user, enter the user name. If you are using Kerberos, you must provide a user name here, and it must have an ID greater than 1000.

   Note: The user running the MapReduce job should have read and execute permissions on the Hive warehouse directory on the source cluster. If you configure the replication job to preserve permissions, superuser privileges are required on the destination cluster.

h. Specify the Run on peer as Username option if the peer cluster is configured with a different superuser. This is only applicable while working in a Kerberized environment.

6. Select the Resources tab to configure the following:

• Scheduler Pool – (Optional) Enter the name of a resource pool in the field. The value you enter is used by the MapReduce Service you specified when Cloudera Manager executes the MapReduce job for the replication. The job specifies the value using one of these properties:

  • MapReduce – Fair scheduler: mapred.fairscheduler.pool
  • MapReduce – Capacity scheduler: queue.name
  • YARN – mapreduce.job.queuename

• Maximum Map Slots and Maximum Bandwidth – Limits for the number of map slots and for bandwidth per mapper. The default is 100 MB.

• Replication Strategy – Whether file replication should be static (the default) or dynamic. Static replication distributes file replication tasks among the mappers up front to achieve a uniform distribution based on file sizes. Dynamic replication distributes file replication tasks in small sets to the mappers, and as each mapper processes its tasks, it dynamically acquires and processes the next unallocated set of tasks.

7. Select the Advanced tab to specify an export location, modify the parameters of the MapReduce job that performs the replication, and set other options. You can select a MapReduce service (if there is more than one in your cluster) and change the following parameters:

• Clear the Replicate HDFS Files option to skip replicating the associated data files.

• If both the source and destination clusters use CDH 5.7.0 or later up to and including 5.11.x, select the Replicate Impala Metadata drop-down list and select No to avoid redundant replication of Impala metadata. (This option only displays when supported by both source and destination clusters.) You can select the following options for Replicate Impala Metadata:


  • Yes – replicates the Impala metadata.
  • No – does not replicate the Impala metadata.
  • Auto – Cloudera Manager determines whether or not to replicate the Impala metadata based on the CDH version.

To replicate Impala UDFs when the version of CDH managed by Cloudera Manager is 5.7 or lower, see Replication of Impala and Hive User Defined Functions (UDFs) for information on when to select this option.

• The Force Overwrite option, if checked, forces overwriting data in the destination metastore if incompatible changes are detected. For example, if the destination metastore was modified, and a new partition was added to a table, this option forces deletion of that partition, overwriting the table with the version found on the source.

  Important: If the Force Overwrite option is not set, and the Hive/Impala replication process detects incompatible changes on the source cluster, Hive/Impala replication fails. This sometimes occurs with recurring replications, where the metadata associated with an existing database or table on the source cluster changes over time.

• By default, Hive metadata is exported to a default HDFS location (/user/${user.name}/.cm/hive) and then imported from this HDFS file to the destination Hive metastore. In this example, user.name is the process user of the HDFS service on the destination cluster. To override the default HDFS location for this export file, specify a path in the Export Path field.

  Note: In a Kerberized cluster, the HDFS principal on the source cluster must have read, write, and execute access to the Export Path directory on the destination cluster.

• Number of concurrent HMS connections - The number of concurrent Hive Metastore connections. These connections are used to concurrently import and export metadata from Hive. Increasing the number of threads can improve Replication Manager performance. By default, a new replication policy uses five connections.

  If you set the value to 1 or more, Replication Manager uses multi-threading with the number of connections specified. If you set the value to 0 or fewer, Replication Manager uses single threading and a single connection.

  Note that the source and destination clusters must run a Cloudera Manager version that supports concurrent HMS connections: Cloudera Manager 5.15.0 or higher, or Cloudera Manager 6.1.0 or higher.

• By default, Hive HDFS data files (for example, /user/hive/warehouse/db1/t1) are replicated to a location relative to "/" (in this example, to /user/hive/warehouse/db1/t1). To override the default, enter a path in the HDFS Destination Path field. For example, if you enter /ReplicatedData, the data files would be replicated to /ReplicatedData/user/hive/warehouse/db1/t1.

• Select the MapReduce Service to use for this replication (if there is more than one in your cluster).

• Log Path - An alternative path for the logs.

• Description - A description for the replication policy.

• Skip Checksum Checks - Whether to skip checksum checks, which are performed by default.

• Skip Listing Checksum Checks - Whether to skip the checksum check when comparing two files to determine whether they are the same. If skipped, the file size and last modified time are used to determine if files are the same. Skipping the check improves performance during the mapper phase. Note that if you select the Skip Checksum Checks option, this check is also skipped.

• Abort on Error - Whether to abort the job on an error. If you select the check box, files copied up to that point remain on the destination, but no additional files are copied. Abort on Error is off by default.

• Abort on Snapshot Diff Failures - If a snapshot diff fails during replication, Replication Manager uses a complete copy to replicate data. If you select this option, the Replication Manager aborts the replication when it encounters an error instead.

• Delete Policy - Whether files that were deleted on the source should also be deleted from the destination directory. Options include:


• Preserve - Whether to preserve the Block Size, Replication Count, and Permissions as they exist on the source file system, or to use the settings as configured on the destination file system. By default, settings are preserved on the source.

  Note: You must be running as a superuser to preserve permissions. Use the "Run As Username" option to ensure that is the case.

• Alerts - Whether to generate alerts for various state changes in the replication workflow. You can alert On Failure, On Start, On Success, or On Abort (when the replication workflow is aborted).

8. Click Save Policy.

The replication task appears as a row in the Replications Policies table.

To specify additional replication tasks, select Create > Hive Replication.

Note: If your replication job takes a long time to complete, and tables change before the replication finishes, the replication may fail. Consider making the Hive Warehouse Directory and the directories of any external tables snapshottable, so that the replication job creates snapshots of the directories before copying the files.

Sentry to Ranger Replication

As part of a Hive replication policy, you can choose to migrate relevant Hive or Impala Sentry policies into Ranger.

When you choose to migrate the Sentry policies to Ranger, the Replication Manager performs the following tasks automatically:

1. Exports each Sentry policy as a single JSON file using the authzmigrator tool. The JSON file contains a list of resources, such as URI, database, table, or column, and the policies that apply to it.

2. Copies the exported Sentry policies to the target cluster using the DistCp tool.

3. Ingests the Sentry policies into Ranger after filtering the policies related to the replication job using the authzmigrator tool through the Ranger REST endpoint. To filter the policies, the Replication Manager uses a filter expression that is passed to the authzmigrator tool by Cloudera Manager.

Note: If you are replicating a subset of the tables in a database, database-level policies get converted to equivalent table-level policies for each table being replicated. (For example, ALL on database -> ALL on table individually for each table replicated.)

Caution: There will be no reference to the original role names in Ranger. The permissions are granted directly to groups and users with respect to the resource, not the role. This is a different format from the Sentry to Ranger migration performed during an in-place upgrade to CDP Private Cloud Base, which does import and use the Sentry roles.

Attention: Regardless of whether a policy was modified or not, each policy is re-created on each replication. If you wish to continue scheduling data replication but you also want to modify the target cluster's Ranger policies (and keep those modifications), you should disable the Sentry to Ranger migration on subsequent runs.

Replication of Impala and Hive User Defined Functions (UDFs)

By default, for clusters where the version of CDH is 5.7 or higher, Impala and Hive UDFs are persisted in the Hive Metastore and are replicated automatically as part of Hive/Impala replications.

To replicate Impala UDFs when the version of CDH managed by Cloudera Manager is 5.6 or lower, you can select the Replicate Impala Metadata option on the Advanced tab when creating a Hive/Impala replication policy.

After a replication job has run, you can see the number of Impala and Hive UDFs that were replicated during the last run of the schedule on the Replication Policies page. You can also view the number of replicated UDFs on the Replication History page for previously-run replications.


Monitoring the Performance of Hive or Impala Replications

You can monitor the progress of a Hive/Impala replication policy using performance data that you download as a CSV file from the Cloudera Manager Admin console.

Note: This page contains references to CDH 5 components or features that have been removed from CDH 6. These references are only applicable if you are managing a CDH 5 cluster with Cloudera Manager 6.

This file contains information about the tables and partitions being replicated, the average throughput, and other details that can help diagnose performance issues during Hive/Impala replications. You can view this performance data for running Hive/Impala replication jobs and for completed jobs.

To view the performance data for a running Hive/Impala replication policy:

1. In Cloudera Manager, select Replication > Replication Policies.
2. Locate the row for the policy, select the policy, and click Actions > Show History.
3. Click Download CSV for HDFS Replication Report, and then choose one of the following options to view the performance report:

   • Performance - Click to download a CSV file which contains a summary report about the performance of the replication job. The performance summary report includes the last performance sample for each mapper that is working on the replication job.

   • Full Performance - Click to download the CSV file that contains the performance report of the job. The complete performance report includes all the samples taken for all mappers during the full execution of the replication job.

4. To view the data, open the file in a spreadsheet program such as Microsoft Excel.

In addition to the performance reports, you can view the reports of files with errors, files that are deleted, and files that are skipped during the replication job. To view the reports, perform the following steps:

1. On the Replication Policies page, locate the policy and click Actions > Show History.

   The Replication History page for the replication policy appears. Expand to view the replication job details.

2. Click Download CSV for the following options:

   • Listing - Click to download the CSV file that contains the replication report. The file lists the files and directories copied during the replication job.

   • Status - Click to download the CSV file that contains the complete status report. The file contains the full status report of the files where the status of the replication is one of the following:

     • ERROR – An error occurred and the file was not copied.
     • DELETED – A deleted file.
     • SKIPPED – A file where the replication was skipped because it was up-to-date.

   • Error Status Only - Click to download the CSV file that contains the status report of all copied files with errors. The file lists the status, path, and message for the copied files with errors.

   • Deleted Status Only - Click to download the CSV file that contains the status report of all deleted files. The file lists the status, path, and message for the databases and tables that were deleted.

   • Skipped Status Only - Click to download the CSV file that contains the status report of all skipped files. The file lists the status, path, and message for the databases and tables that were skipped.

   • Performance - Click to download a CSV file which contains a summary report about the performance of the running replication job. The performance summary report includes the last performance sample for each mapper that is working on the replication job.

   • Full Performance - Click to download the CSV file that contains the performance report of the job. The performance report shows the samples taken for all the mappers during the full execution of the replication job.

3. To view the data, open the file in a spreadsheet program such as Microsoft Excel.


The performance data is collected every two minutes. Therefore, no data is available during the initial execution of a replication job because not enough samples are available to estimate throughput and other reported data.

To view the performance data for a completed Hive/Impala replication policy:

1. In Cloudera Manager, select Replication > Replication Policies.
2. Locate the row for the policy, select the policy, and click Actions > Show History.
3. To view performance of the Hive phase, click Download CSV next to the Hive Replication Report label and select one of the following options:

   • Results - Downloads a listing of replicated tables in a CSV file.
   • Performance - Downloads a performance report for the Hive replication in a CSV file.

   Note: The option to download the HDFS replication reports might not appear if the HDFS phase of the replication skipped all the HDFS files because they have not changed, or if the Replicate HDFS Files option (located on the Advanced tab when creating Hive/Impala replication policies) is not selected.

4. To view the data, open the file in a spreadsheet program such as Microsoft Excel.

The performance data is collected every two minutes. Therefore, no data is available during the initial execution of a replication job because not enough samples are available to estimate throughput and other reported data.

The data returned by the CSV files downloaded from the Cloudera Manager Admin console has the following structure:

Table 4: Hive Performance Report Columns

Hive Performance Data Column - Description

Timestamp - Time when the performance data was collected.

Host - Name of the host where the YARN or MapReduce job was running.

DbName - Name of the database.

TableName - Name of the table.

TotalElapsedTimeSecs - Number of seconds elapsed from the start of the copy operation.

TotalTableCount - Total number of tables to be copied. The value of the column will be -1 for replications where Cloudera Manager cannot determine the number of tables being changed.

TotalPartitionCount - Total number of partitions to be copied. If the source cluster is running Cloudera Manager 5.9 or lower, this column contains a value of -1 because older releases do not report this information.

DbCount - Current number of databases copied.

DbErrorCount - Number of failed database copy operations.

TableCount - Total number of tables (for all databases) copied so far.

CurrentTableCount - Total number of tables copied for the current database.

TableErrorCount - Total number of failed table copy operations.

PartitionCount - Total number of partitions copied so far (for all tables).

CurrPartitionCount - Total number of partitions copied for the current table.

PartitionSkippedCount - Number of partitions skipped because they were copied in the previous run of the replication job.

IndexCount - Total number of index files copied (for all databases).

CurrIndexCount - Total number of index files copied for the current database.


IndexSkippedCount - Number of index files skipped because they were not altered. Due to a bug in Hive, this value is always zero.

HiveFunctionCount - Number of Hive functions copied.

ImpalaObjectCount - Number of Impala objects copied.


Note the following limitations and known issues:

• If you click the CSV download too soon after the replication job starts, Cloudera Manager returns an empty file or a CSV file that has column headers only and a message to try again later, when performance data has actually been collected.

• If you employ a proxy user with the form user@domain, performance data is not available through the links.

• If the replication job only replicates small files that can be transferred in less than a few minutes, no performance statistics are collected.

• For replication policies that specify the Dynamic Replication Strategy, statistics regarding the last file transferred by a MapReduce job hide previous transfers performed by that MapReduce job.

• Only the last trace of each MapReduce job is reported in the CSV file.

Enabling, Disabling, or Deleting A Replication Policy

When you create a new replication policy, it is automatically enabled. If you disable a replication policy, it can be re-enabled at a later time.

About this task

Managing replication policies.

Procedure

1. From Cloudera Manager, select Replication > Replication Policies.

2. Select the Actions for Selected drop-down and choose Enable, Disable, or Delete as applicable.

To enable, disable, or delete multiple replication policies, you can select those policies from the Replication Policies page and repeat step 2.

Replicating Data to Impala Clusters

Impala metadata is replicated as part of regular Hive/Impala replication operations.

Replicating Impala Metadata

Note: This feature is not available if the source and destination clusters run CDH 5.12 or higher. This feature replicated legacy Impala UDFs, which are no longer supported.

Impala metadata replication is performed as a part of Hive replication. Impala replication is only supported between two CDH clusters. The Impala and Hive services must be running on both clusters.

To enable Impala metadata replication, perform the following tasks:

1. Schedule a Hive replication.
2. Confirm that the Replicate Impala Metadata option is set to Yes on the Advanced tab in the Create Hive Replication dialog.


When you set the Replicate Impala Metadata option to Yes, Impala UDFs (user-defined functions) will be available on the target cluster, just as on the source cluster. As part of replicating the UDFs, the binaries in which they are defined are also replicated.

Note: To run queries or execute DDL statements on tables that have been replicated to a destination cluster, you must run the Impala INVALIDATE METADATA statement on the destination cluster to prevent queries from failing.

Invalidating Impala Metadata

For Impala clusters that do not use LDAP authentication, you can configure Hive/Impala replication jobs to automatically invalidate Impala metadata after replication completes. If the clusters use Sentry, the Impala user should have permissions to run INVALIDATE METADATA.

The configuration causes the Hive/Impala replication job to run the Impala INVALIDATE METADATA statement per table on the destination cluster after completing the replication. The statement purges the metadata of the replicated tables and views within the destination cluster's Impala upon completion of replication, allowing other Impala clients at the destination to query these tables successfully with accurate results. However, this operation is potentially unsafe if DDL operations are being performed on any of the replicated tables or views while the replication is running. In general, directly modifying replicated data/metadata on the destination is not recommended. Ignoring this can lead to unexpected or incorrect behavior of applications and queries using these tables or views.

Note: If the source contains UDFs, you must run the INVALIDATE METADATA statement manually and without any tables specified, even if you configure the automatic invalidation.

To configure the option, perform the following tasks:

1. Schedule a Hive/Impala replication.
2. On the Advanced tab, select the Invalidate Impala Metadata on Destination option.

Alternatively, you can run the INVALIDATE METADATA statement manually for replicated tables.
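For reference, a manual invalidation can be issued from any Impala client on the destination cluster. The sketch below uses impala-shell; the host name and table name are illustrative placeholders, and the -k flag is only needed on Kerberos-enabled clusters.

# Refresh metadata for a specific replicated table on the destination cluster
impala-shell -k -i impalad-01.dst.example.com -q "INVALIDATE METADATA db1.t1;"

# If the source contains UDFs, run the statement without naming any table
impala-shell -k -i impalad-01.dst.example.com -q "INVALIDATE METADATA;"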

Using Snapshots with Replication

Some replications, especially those that require a long time to finish, can fail because source files are modified during the replication process.

You can prevent such failures by using Snapshots in conjunction with Replication. This use of snapshots is automatic with CDH versions 5.0 and higher. To take advantage of this, you must enable the relevant directories for snapshots (also called making the directory snapshottable).

When the replication job runs, it checks to see whether the specified source directory is snapshottable. Before replicating any files, the replication job creates point-in-time snapshots of these directories and uses them as the source for file copies. This ensures that the replicated data is consistent with the source data as of the start of the replication job. The latest snapshot for the subsequent runs is retained after the replication process is completed.

A directory is snapshottable because it has been enabled for snapshots, or because a parent directory is enabled for snapshots. Subdirectories of a snapshottable directory are included in the snapshot.

Hive/Impala Replication with Snapshots

If you are using Hive replication, Cloudera recommends that you make the Hive Warehouse Directory snapshottable.

The Hive warehouse directory is located in the HDFS file system in the location specified by the hive.metastore.warehouse.dir property. The default location is /user/hive/warehouse.

To access the hive.metastore.warehouse.dir property, perform the following steps:

1. Open Cloudera Manager and browse to the Hive service.
2. Click the Configuration tab.


3. In the Search box, type hive.metastore.warehouse.dir.

The Hive Warehouse Directory property appears.

If you are using external tables in Hive, also make the directories hosting any external tables not stored in the Hive warehouse directory snapshottable.

Similarly, if you are using Impala and are replicating any Impala tables using Hive/Impala replication, ensure that the storage locations for the tables and associated databases are also snapshottable.
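As an illustration, directories can be enabled for snapshots from the command line with the HDFS admin tool. This is a sketch only; the warehouse path shown is the default, the external table location is an example, and the commands must be run as a user with HDFS superuser privileges.

# Make the Hive warehouse directory snapshottable (default location shown)
hdfs dfsadmin -allowSnapshot /user/hive/warehouse

# Repeat for any directory that hosts external tables outside the warehouse
hdfs dfsadmin -allowSnapshot /data/external/sales

# Confirm which directories are currently snapshottable
hdfs lsSnapshottableDir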

Enabling Replication Between Clusters with Kerberos Authentication

To enable replication between clusters, additional setup steps are required to ensure that the source and destination clusters can communicate.

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

Important: Cloudera Replication Manager works with clusters in different Kerberos realms even without a Kerberos realm trust relationship. The Cloudera Manager configuration properties Trusted Kerberos Realms and Kerberos Trusted Realms are used for Cloudera Manager and CDH configuration, and are not related to Kerberos realm trust relationships.

If you are using standalone DistCp between clusters in different Kerberos realms, you must configure a realm trust.

Related Information
Port Requirements for Replication Manager

Ports

When using Replication Manager with Kerberos authentication enabled, Replication Manager requires all the ports listed in the "Port Requirements for Replication Manager" page.

Additionally, the port used for the Kerberos KDC Server and KRB5 services must be open to all hosts on the destination cluster. By default, this is port 88.
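A quick way to spot-check this from a destination host is a TCP probe of the KDC port, for example with netcat. The host name below is an illustrative placeholder, and note that Kerberos may also use UDP on the same port, which this check does not exercise.

# Verify that the source cluster's KDC is reachable on port 88 from a destination host
nc -vz kdc01.src.example.com 88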

Related Information
Port Requirements for Replication Manager

Considerations for Realm Names

If the source and destination clusters each use Kerberos for authentication, use one of the following configurations to prevent conflicts when running replication jobs.

• If the clusters do not use the same KDC (Kerberos Key Distribution Center), Cloudera recommends that you use different realm names for each cluster.

• You can use the same realm name if the clusters use the same KDC or different KDCs that are part of a unified realm, for example where one KDC is the master and the other is a slave KDC.

Note: If you have multiple clusters that are used to segregate production and non-production environments, this configuration could result in principals that have equal permissions in both environments. Make sure that permissions are set appropriately for each type of environment.

Important: If the source and destination clusters are in the same realm but do not use the same KDC, or the KDCs are not part of a unified realm, the replication job fails.

HDFS, Hive, and Impala Replication

Configuring source and destination clusters.

1. On the hosts in the destination cluster, ensure that the krb5.conf file (typically located at /etc/krb5.conf) on each host has the following information:

• The KDC information for the source cluster's Kerberos realm. For example:

[realms]


  SRC.EXAMPLE.COM = {
    kdc = kdc01.src.example.com:88
    admin_server = kdc01.example.com:749
    default_domain = src.example.com
  }
  DST.EXAMPLE.COM = {
    kdc = kdc01.dst.example.com:88
    admin_server = kdc01.dst.example.com:749
    default_domain = dst.example.com
  }

• Realm mapping for the source cluster domain. You configure these mappings in the [domain_realm] section. For example:

  [domain_realm]
  .dst.example.com = DST.EXAMPLE.COM
  dst.example.com = DST.EXAMPLE.COM
  .src.example.com = SRC.EXAMPLE.COM
  src.example.com = SRC.EXAMPLE.COM

Caution: If you have a scenario where the hostnames are inconsistent, you must navigate to Cloudera Manager > Hosts > All Hosts and ensure that all those hosts are covered in a similar manner as seen in the [domain_realm] section.

2. On the destination cluster, use Cloudera Manager to add the realm of the source cluster to the Trusted Kerberos Realms configuration property:

   a. Go to the HDFS service.
   b. Click the Configuration tab.
   c. In the search field, type Trusted Kerberos to find the Trusted Kerberos Realms property.
   d. Click the plus sign icon, and then enter the source cluster realm.
   e. Enter a Reason for change, and then click Save Changes to commit the changes.

3. Go to Administration > Settings.

4. In the search field, type domain name.

5. In the Domain Name(s) field, enter any domain or host names you want to map to the destination cluster KDC. Use the plus sign icon to add as many entries as you need. The entries in this property are used to generate the domain_realm section in krb5.conf.

6. If domain_realm is configured in the Advanced Configuration Snippet (Safety Valve) for remaining krb5.conf, remove the entries for it.

7. Enter a Reason for change, and then click Save Changes to commit the changes.

Kerberos Connectivity Test

As part of Test Connectivity, Cloudera Manager tests for properly configured Kerberos authentication on the source and destination clusters that run the replication. Test Connectivity runs automatically when you add a peer for replication, or you can manually initiate Test Connectivity from the Actions menu.

This feature is available when the source and destination clusters run Cloudera Manager 5.12 or later. You can disable the Kerberos connectivity test by setting feature_flag_test_kerberos_connectivity to false with the Cloudera Manager API: api/<version>/cm/config.
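For illustration, such a setting can typically be applied with an HTTP PUT against that endpoint. The request below is a sketch based on the general Cloudera Manager API convention for config resources; the host name, port, credentials, and API version are placeholders, and the exact payload shape should be confirmed against the API documentation for your Cloudera Manager release.

# Disable the Kerberos connectivity test via the Cloudera Manager API (illustrative values)
curl -u admin:admin -X PUT -H "Content-Type: application/json" \
  -d '{"items":[{"name":"feature_flag_test_kerberos_connectivity","value":"false"}]}' \
  "http://cm-host.example.com:7180/api/v19/cm/config"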

If the test detects any issues with the Kerberos configuration, Cloudera Manager provides resolution steps based on whether Cloudera Manager manages the Kerberos configuration file.

Cloudera Manager tests the following scenarios:

• Whether both clusters have Kerberos enabled or not.

• Replication is supported from an unsecure cluster to a secure cluster starting with Cloudera Manager 6.1 and later.

• Replication is not supported if the source cluster uses Kerberos and the target cluster is unsecure.


• Whether both clusters are in the same Kerberos realm. Clusters in the same realm must share the same KDC or the KDCs must be in a unified realm.

• Whether clusters are in different Kerberos realms. If the clusters are in different realms, the destination cluster must be configured according to the following criteria:

  • Destination HDFS services must have the correct Trusted Kerberos Realms setting.
  • The krb5.conf file has the correct domain_realm mapping on all the hosts.
  • The krb5.conf file has the correct realms information on all the hosts.

• Whether the local and peer KDC are running on an available port. This port must be open for all hosts in the cluster. The default port is 88.

After Cloudera Manager runs the tests, Cloudera Manager makes recommendations to resolve any Kerberos configuration issues.

Kerberos Recommendations

If Cloudera Manager manages the Kerberos configuration file, Cloudera Manager configures Kerberos correctly for you and then provides the set of commands that you must manually run to finish configuring the clusters.

If Cloudera Manager does not manage the Kerberos configuration file, Cloudera Manager provides the manual steps required to correct the issue.

Replication of Encrypted Data

HDFS supports encryption of data at rest, including data accessed through Hive.

This topic describes how replication works within and between encryption zones and how to configure replication to avoid failures due to encryption.

Encrypting Data in Transit Between Clusters

A source directory and destination directory may or may not be in an encryption zone. If the destination directory is in an encryption zone, the data on the destination directory is encrypted.

If the destination directory is not in an encryption zone, the data on that directory is not encrypted, even if the source directory is in an encryption zone. Encryption zones are not supported in CDH versions 5.1 or lower.

When you configure encryption zones, you also configure a Key Management Server (KMS) to manage encryption keys. During replication, Cloudera Manager uses TLS/SSL to encrypt the keys when they are transferred from the source cluster to the destination cluster.

When an HDFS replication command that specifies an encrypted source directory runs, Cloudera Manager temporarily copies the encryption keys from the source cluster to the destination cluster, using TLS/SSL (if configured for the KMS) to encrypt the keys. Cloudera Manager then uses these keys to decrypt the encrypted files when they are received from the source cluster before writing the files to the destination cluster.

Important: When you configure HDFS replication, you must select the Skip Checksum Checks property to prevent replication failure in the following cases:

• Replications from an encrypted zone on the source cluster to an encrypted zone on a destination cluster.

• Replications from an encryption zone on the source cluster to an unencrypted zone on the destination cluster.

• Replications from an unencrypted zone on the source cluster to an encrypted zone on the destination cluster.

Even when the source and destination directories are both in encryption zones, the data is decrypted as it is read from the source cluster (using the key for the source encryption zone) and encrypted again


when it is written to the destination cluster (using the key for the destination encryption zone). The data transmission is encrypted if you have configured encryption for HDFS Data Transfer.

Note: The decryption and encryption steps happen in the same process on the hosts where the MapReduce jobs that copy the data run. Therefore, data in plain text only exists within the memory of the Mapper task. If a KMS is in use on either the source or destination clusters, and you are using encrypted zones for either the source or destination directories, configure TLS/SSL for the KMS to prevent transferring the key to the mapper task as plain text.

During replication, data travels from the source cluster to the destination cluster using distcp. For clusters that use encryption zones, configure encryption of KMS key transfers between the source and destination using TLS/SSL.

To configure encryption of data transmission between source and destination clusters:

• Enable TLS/SSL for HDFS clients on both the source and the destination clusters. You may also need to configure trust between the SSL certificates on the source and destination.

• Enable TLS/SSL for the two peer Cloudera Manager Servers.

• Encrypt data transfer using HDFS Data Transfer Encryption.

The following blog post provides additional information about encryption with HDFS: https://blog.cloudera.com/blog/2013/03/how-to-set-up-a-hadoop-cluster-with-network-encryption/.

Security Considerations

The user you specify with the Run As field when scheduling a replication job requires full access to both the key and the data directories being replicated. This is not a recommended best practice for KMS management.

If you change permissions in the KMS to enable this requirement, you could accidentally provide access for this user to data in other encryption zones using the same key. If a user is not specified in the Run As field, the replication runs as the default user, hdfs.

To access encrypted data, the user must be authorized on the KMS for the encryption zones they need to interact with. The user you specify with the Run As field when scheduling a replication must have this authorization. The key administrator must add ACLs to the KMS for that user to prevent authorization failure.

Key transfer using the KMS protocol from source to the client uses the REST protocol, which requires that you configure TLS/SSL for the KMS. When TLS/SSL is enabled, keys are not transferred over the network as plain text.

Snapshots

You can create HBase and HDFS snapshots using Cloudera Manager or by using the command line.

• HBase snapshots allow you to create point-in-time backups of tables without making data copies, and with minimal impact on RegionServers. HBase snapshots are supported for clusters running CDH 4.2 or higher.

• HDFS snapshots allow you to create point-in-time backups of directories or the entire filesystem without actually cloning the data. They can improve data replication performance and prevent errors caused by changes to a source directory. These snapshots appear on the filesystem as read-only directories that can be accessed just like other ordinary directories.

Cloudera Manager Snapshot Policies

Cloudera Manager enables the creation of snapshot policies that define the directories or tables to be snapshotted, the intervals at which snapshots should be taken, and the number of snapshots that should be kept for each snapshot interval.

For example, you can create a policy that takes both daily and weekly snapshots, and specify that seven daily snapshots and five weekly snapshots should be maintained.

Minimum Required Role: Replication Administrator (also provided by Full Administrator)


Note: You can improve the reliability of replication by also using snapshots.

Managing Snapshot Policies

You must enable an HDFS directory for snapshots to allow snapshot policies to be created for that directory.

To create a snapshot policy:

1. From Cloudera Manager, select Replication > Snapshot Policies.

   Existing snapshot policies are shown in a table.

2. To create a new policy, click Create Snapshot Policy.
3. From the drop-down list, select the service (HDFS or HBase) and cluster for which you want to create a policy.
4. Provide a name for the policy. Optionally, provide a description.
5. Specify the directories, namespaces, or tables to include in the snapshot.

Important: Do not take snapshots of the root directory.

   • For an HDFS service, select the paths of the directories to include in the snapshot. The drop-down list allows you to select only directories that are enabled for snapshotting. If no directories are enabled for snapshotting, a warning displays.

     Click the add icon to add a path, or the remove icon to remove a path.

   • For an HBase service, list the tables to include in your snapshot. You can use a Java regular expression to specify a set of tables. For example, finance.* matches all tables with names starting with finance. You can also create a snapshot for all tables in a given namespace, using the {namespace}:.* syntax.

6. Specify the snapshot Schedule. You can schedule snapshots hourly, daily, weekly, monthly, or yearly, or any combination of those. Depending on the frequency you select, you can specify the time of day to take the snapshot, the day of the week, day of the month, or month of the year, and the number of snapshots to keep at each interval. Each time unit in the schedule information is shared with the time units of larger granularity. That is, the minute value is shared by all the selected schedules, hour by all the schedules for which hour is applicable, and so on. For example, if you specify that hourly snapshots are taken at the half hour, and daily snapshots taken at the hour 20, the daily snapshot will occur at 20:30.

   To select an interval, select its box. Fields display where you can edit the time and number of snapshots to keep.


7. Specify whether Alerts should be generated for various state changes in the snapshot workflow. You can alert on failure, on start, on success, or when the snapshot workflow is aborted.

8. Click Save Policy.

The new Policy appears on the Snapshot Policies page.

To edit or delete a snapshot policy:

1. From Cloudera Manager, select Replication > Snapshot Policies.

   Existing snapshot policies appear in a table.

2. To edit a snapshot policy, select the policy and click Actions > Edit.
3. To delete a snapshot policy, select the policy and click Actions > Delete.

Snapshots History

The Snapshots History page displays information about snapshot jobs that have been run or attempted.

The page displays a table of snapshot jobs with the following columns:

Table 5: Snapshots History

Column - Description

Start Time - Time when the snapshot job started execution. Click to display details about the snapshot. Click the View link to open the Managed scheduled snapshots Command page, which displays details and messages about each step in the execution of the command.

Outcome - Displays whether the snapshot succeeded or failed.

Paths | Tables Processed - HDFS snapshots: the number of Paths Processed for the snapshot. HBase snapshots: the number of Tables Processed for the snapshot.

Paths | Tables Unprocessed - HDFS snapshots: the number of Paths Unprocessed for the snapshot. HBase snapshots: the number of Tables Unprocessed for the snapshot.

Snapshots Created - Number of snapshots created.

Snapshots Deleted - Number of snapshots deleted.

Errors During Creation - Displays a list of errors that occurred when creating the snapshot. Each error shows the related path and the error message.

Errors During Deletion - Displays a list of errors that occurred when deleting the snapshot. Each error shows the related path and the error message.

Orphaned Snapshots
When a snapshot policy includes a limit on the number of snapshots to keep, Cloudera Manager checks the total number of stored snapshots each time a new snapshot is added, and automatically deletes the oldest existing snapshot if necessary.

When a snapshot policy is edited or deleted, files, directories, or tables that were removed from the policy may leave "orphaned" snapshots behind that are not deleted automatically because they are no longer associated with a current snapshot policy. Cloudera Manager never selects these snapshots for automatic deletion because selection for deletion only occurs when the policy creates a new snapshot containing those files, directories, or tables.

You can delete snapshots manually through Cloudera Manager or by creating a command-line script that uses the HDFS or HBase snapshot commands. Orphaned snapshots can be hard to locate for manual deletion. Snapshot policies automatically receive the prefix cm-auto followed by a globally unique identifier (GUID). You can locate all snapshots for a specific policy by searching for the prefix cm-auto-guid that is unique to that policy.


To avoid orphaned snapshots, delete snapshots before editing or deleting the associated snapshot policy, or record the identifying name for the snapshots you want to delete. This prefix is displayed in the summary of the policy in the policy list and appears in the delete dialog box. Recording the snapshot names, including the associated policy prefix, is necessary because the prefix associated with a policy cannot be determined after the policy has been deleted, and snapshot names do not contain recognizable references to snapshot policies.

Managing HDFS Snapshots
This topic demonstrates how to manage HDFS snapshots using either Cloudera Manager or the command line.

For HDFS services, use the File Browser tab to view the HDFS directories associated with a service on your cluster. You can view the currently saved snapshots for your files, and delete or restore them.

From the HDFS File Browser tab, you can:

• Designate HDFS directories to be "snapshottable" so snapshots can be created for those directories.
• Initiate immediate (unscheduled) snapshots of an HDFS directory.
• View the list of saved snapshots currently being maintained. These can include one-off immediate snapshots, as well as scheduled policy-based snapshots.
• Delete a saved snapshot.
• Restore an HDFS directory or file from a saved snapshot.
• Restore an HDFS directory or file from a saved snapshot to a new directory or file (Restore As).

Before using snapshots, note the following limitations:

• Snapshots that include encrypted directories cannot be restored outside of the zone within which they were created.

• The Cloudera Manager Admin Console cannot perform snapshot operations (such as create, restore, and delete) for HDFS paths with encryption-at-rest enabled. This limitation only affects the Cloudera Manager Admin Console and does not affect CDH command-line tools or actions not performed by the Admin Console, such as Replication Manager, which uses command-line tools. For more information about snapshot operations, see the Apache HDFS snapshots documentation.

Browsing HDFS Directories
You can browse through the HDFS directories to select the right cluster.

To browse the HDFS directories to view snapshot activity:

1. From the Clusters tab, select the CDH HDFS service.
2. Go to the File Browser tab.

As you browse the directory structure of your HDFS, basic information about the directory you have selected is shown at the right, such as owner, group, and so on.

Enabling and Disabling HDFS Snapshots
For snapshots to be created, HDFS directories must be enabled for snapshots. You cannot specify a directory as part of a snapshot policy unless it has been enabled for snapshots.

Minimum Required Role: Cluster Administrator (also provided by Full Administrator)

Enabling an HDFS Directory for Snapshots

1. From the Clusters tab, select the CDH HDFS service.
2. Go to the File Browser tab.
3. Go to the directory you want to enable for snapshots.
4. In the File Browser, click the drop-down menu next to the full file path and select Enable Snapshots.

Note: Once you enable snapshots for a directory, you cannot enable snapshots on any of its subdirectories. Snapshots can be taken only on directories that have snapshots enabled.


Disabling a Snapshottable Directory

To disable snapshots for a directory that has snapshots enabled, use Disable Snapshots from the drop-down menu button at the upper right. If snapshots of the directory exist, they must be deleted before snapshots can be disabled.

Taking and Deleting HDFS Snapshots
To manage HDFS snapshots, first enable an HDFS directory for snapshots.

Minimum Required Role: Replication Administrator (also provided by Full Administrator)

Taking Snapshots

Note: You can also schedule snapshots to occur regularly by creating a snapshot policy.

1. From the Clusters tab, select the CDH HDFS service.
2. Go to the File Browser tab.
3. Go to the directory for which you want to take a snapshot.
4. Click the drop-down menu next to the full path name and select Take Snapshot.
The Take Snapshot screen displays.
5. Enter a name for the snapshot.
6. Click OK.
The Take Snapshot button is present, enabling an immediate snapshot of the directory.
7. To take a snapshot, click Take Snapshot, specify the name of the snapshot, and click Take Snapshot.

The snapshot is added to the snapshot list.

Any snapshots that have been taken are listed by the time at which they were taken, along with their names and a menu button.

Deleting Snapshots

1. From the Clusters tab, select the CDH HDFS service.
2. Go to the File Browser tab.
3. Go to the directory with the snapshot you want to delete.
4. In the list of snapshots, locate the snapshot you want to delete and click the menu button.
5. Select Delete.

Restoring Snapshots
Before you restore from a snapshot, ensure that there is adequate disk space.

1. From the Clusters tab, select the CDH HDFS service.
2. Go to the File Browser tab.
3. Go to the directory you want to restore.
4. In the File Browser, click the drop-down menu next to the full file path (to the right of the file browser listings) and select one of the following:

• Restore Directory From Snapshot
• Restore Directory From Snapshot As...

The Restore Snapshot screen displays.
5. If you selected Restore Directory From Snapshot As..., enter the username to apply when restoring the snapshot.
6. Select one of the following:


• Use HDFS 'copy' command - This option executes more slowly and does not require credentials in a secure cluster. It copies the contents of the snapshot as a subdirectory or as files within the target directory.

• Use DistCp / MapReduce - This option executes more quickly and requires credentials (Run As) in secure clusters. It merges the target directory with the contents of the source snapshot. When you select this option, the following additional fields, which are similar to those available when configuring a replication, appear under More Options:

• When restoring HDFS data, if a MapReduce or YARN service is present in the cluster, DistributedCopy (DistCp) is used to restore directories, increasing the speed of restoration. The Restore Snapshot screen (under More Options) allows selection of either MapReduce or YARN as the MapReduce service. For files, or if a MapReduce or YARN service is not present, a normal copy is performed.

• Skip Checksum Checks - Whether to skip checksum checks (the default is to perform them). If checked, checksum validation will not be performed.

You must select this property to prevent failure when restoring snapshots in the following cases:

• Restoring a snapshot within a single encryption zone.
• Restoring a snapshot from one encryption zone to a different encryption zone.
• Restoring a snapshot from an unencrypted zone to an encrypted zone.

Data Warehouse to CDP

How to migrate data warehouse workloads from CDH and HDP to CDP.

Migrating Hive Data to CDP
After migrating Hive data to CDP using WXM or Replication Manager, there are additional tasks you might need to perform. You need to know the semantic differences between Hive 3.x and earlier versions. Some of these differences require you to change your Hive scripts or workflow. Also, you need to convert scripts that use Hive CLI, which CDP does not support, to Beeline.

You need to know where your tables are located and the property changes that the upgrade process makes. You need to perform some post-migration tasks before using Hive tables. Understanding Apache Hive 3 major design features, such as default ACID transaction processing, can help you use Hive to address the growing needs of enterprise data warehouse systems. For more information, see Apache Hive 3 Architectural Changes and Apache Hive Key Features.

Related Information
Apache Hive 3 Key Features
Apache Hive 3 Architectural Overview

Handling Semantic and Syntax Changes
You need to perform a number of migration-related changes due to semantic changes between previous versions of Hive running in CDH or HDP and Hive 3 in CDP. A couple of syntax changes in Hive 3, related to db.table references and DROP CASCADE, might require changes to your applications.

Casting Timestamps
Results of applications that cast numerics to timestamps differ from Hive 2 to Hive 3. Apache Hive changed the behavior of CAST to comply with the SQL Standard, which does not associate a time zone with the TIMESTAMP type.

Before Upgrade to CDP


Casting a numeric type value into a timestamp could be used to produce a result that reflected the time zone of the cluster. For example, 1597217764557 is 2020-08-12 00:36:04 PDT. Running the following query casts the numeric to a timestamp in PDT:

> SELECT CAST(1597217764557 AS TIMESTAMP);
| 2020-08-12 00:36:04 |

After Upgrade to CDP

Casting a numeric type value into a timestamp produces a result that reflects UTC instead of the time zone of the cluster. Running the following query casts the numeric to a timestamp in UTC:

> SELECT CAST(1597217764557 AS TIMESTAMP);
| 2020-08-12 07:36:04.557 |

Action Required

Change applications. Do not cast from a numeral to obtain a local time zone. Built-in functions from_utc_timestamp and to_utc_timestamp can be used to mimic behavior before the upgrade.
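
For example, a minimal sketch of reproducing the pre-upgrade, cluster-local result with from_utc_timestamp (the time zone name is an assumption; substitute your cluster's zone):

-- Convert the UTC interpretation back to a local rendering (example zone: America/Los_Angeles)
> SELECT from_utc_timestamp(CAST(1597217764557 AS TIMESTAMP), 'America/Los_Angeles');
| 2020-08-12 00:36:04.557 |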

Related Information
Apache Hive web site summary of timestamp semantics

Checking Compatibility of Column Changes
A default configuration change can cause applications that change column types to fail.

Before Upgrade to CDP

In HDP 2.x, hive.metastore.disallow.incompatible.col.type.changes is false by default to allow changes to incompatible column types. For example, you can change a STRING column to a column of an incompatible type, such as MAP<STRING, STRING>. No error occurs.

After Upgrade to CDP

In CDP, hive.metastore.disallow.incompatible.col.type.changes is true by default. Hive prevents changes to incompatible column types. Compatible column type changes, such as INT, STRING, BIGINT, are not blocked.

Action Required

Change applications to disallow incompatible column type changes to prevent possible data corruption. Check ALTER TABLE statements and change those that would fail due to incompatible column types.
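
As a rough illustration (the table and column names are hypothetical), the following shows the kind of statement to look for:

-- Fails under the new default: STRING to MAP<STRING,STRING> is an incompatible change
ALTER TABLE web_logs CHANGE COLUMN attrs attrs MAP<STRING,STRING>;
-- Still allowed: a compatible type change such as INT to BIGINT
ALTER TABLE web_logs CHANGE COLUMN hits hits BIGINT;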

Related Information
HIVE-12320

Creating a Table
To improve usability and functionality, Hive 3 significantly changed table creation.

Hive has changed table creation in the following ways:

• Creates an ACID-compliant table, which is the default in CDP
• Supports simple writes and inserts
• Writes to multiple partitions
• Inserts multiple data updates in a single SELECT statement
• Eliminates the need for bucketing

If you have an ETL pipeline that creates tables in Hive, the tables will be created as ACID. Hive now tightly controls access and performs compaction periodically on the tables. The way you access managed Hive tables from Spark and other clients changes. In CDP, access to external tables requires you to set up security access permissions.

Before Upgrade to CDP


In CDH and HDP 2.6.5, by default CREATE TABLE created a non-ACID table.

After Upgrade to CDP

In CDP, by default CREATE TABLE creates a full, ACID transactional table in ORC format.
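
For example, a minimal sketch of the difference in default behavior (the table names, columns, and location are hypothetical):

-- In CDP, this creates a managed, full ACID transactional table in ORC format by default:
CREATE TABLE sales (id INT, amount DECIMAL(10,2));
-- To keep data outside of Hive's tight control, create an external table explicitly:
CREATE EXTERNAL TABLE sales_ext (id INT, amount DECIMAL(10,2))
STORED AS TEXTFILE LOCATION '/data/sales_ext';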

Action Required

Perform one or more of the following actions.

• The upgrade process converts Hive managed tables in CDH to external tables. You must change your scripts to create the types of tables required by your use case. For more information, see Apache Hive 3 Tables.

• Configure legacy CREATE TABLE behavior (see link below) to create external tables by default.

• To read Hive ACID tables from Spark, you connect to Hive using the Hive Warehouse Connector (HWC) or the HWC Spark Direct Reader. To write ACID tables to Hive from Spark, you use the HWC and HWC API. Spark creates an external table with the purge property when you do not use the HWC API. For more information, see HWC Spark Direct Reader and Hive Warehouse Connector.

• Set up Ranger policies and HDFS ACLs for tables. For more information, see HDFS ACLs and HDFS ACL Permissions.

Related Information
Apache Hive 3 Key Features
Apache Hive 3 Tables
Configuring legacy CREATE TABLE behavior

Correcting `db.table` in Queries
For ANSI SQL compliance, Hive 3.x rejects `db.table` in SQL queries. A dot (.) is not allowed in table names.

You need to change queries that use such `db.table` references to prevent Hive from interpreting the entire db.table string as the table name. You enclose the database name and the table name in backticks.
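
For example, using the math.students table from the procedure below:

-- Rejected: the entire backticked string is read as one table name containing a dot
-- SELECT name, gpa FROM `math.students`;
-- Correct: enclose the database name and the table name in backticks separately
SELECT name, gpa FROM `math`.`students`;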

Related Information
Add Backticks to Table References

Add Backticks to Table References
CDP includes the HIVE-16907 bug fix, which rejects `db.table` in SQL queries. A dot (.) is not allowed in table names. You need to change queries that use such references to prevent Hive from interpreting the entire db.table string as the table name.

Procedure

1. Find a table having the problematic table reference.

math.students

appears in a CREATE TABLE statement.

2. Enclose the database name and the table name in backticks.

CREATE TABLE `math`.`students` (name VARCHAR(64), age INT, gpa DECIMAL(3,2));

Disabling Partition Type Checking
An enhancement in Hive 3 checks the types of partitions. This feature can be disabled by setting a property. For more information, see the ASF Apache Hive Language Manual.


Before Upgrade to CDP

In CDH 5.x, partition values are not type checked.

After Upgrade to CDP

Partition values specified in the partition specification are type checked, converted, and normalized to conform to their column types if the property hive.typecheck.on.insert is set to true (default). The values can be numbers.

Action Required

If type checking of partitions causes problems, disable the feature. To disable partition type checking, set hive.typecheck.on.insert to false. For example:

SET hive.typecheck.on.insert=false;

Related Information
Hive Language Manual: Alter Partition

Dropping Partitions
The OFFLINE and NO_DROP keywords in the CASCADE clause for dropping partitions cause performance problems and are no longer supported.

Before Upgrade to CDP

You could use the OFFLINE and NO_DROP keywords in the DROP CASCADE clause to prevent partitions from being read or dropped.

After Upgrade to CDP

OFFLINE and NO_DROP are not supported in the DROP CASCADE clause.

Action Required

Change applications to remove OFFLINE and NO_DROP from the DROP CASCADE clause. Use an authorization scheme, such as Ranger, to prevent partitions from being dropped or read.

Handling Output of greatest and least Functions

Before Upgrade to CDP

The greatest function returned the highest value of the list of values. The least function returned the lowest value of the list of values.

After Upgrade to CDP

The greatest and least functions return NULL when one or more arguments are NULL.

Action Required

Use NULL filters or the nvl function on the columns you use as arguments to the greatest or least functions.

SELECT greatest(nvl(col1, <default value in case of NULL>), nvl(col2, <default value in case of NULL>));

Renaming Tables
To harden the system, Hive data can be stored in HDFS encryption zones. RENAME has been changed to prevent moving a table outside the same encryption zone or into a no-encryption zone.

Before Upgrade to CDP

In CDH and HDP, renaming a managed table moves its HDFS location.

After Upgrade to CDP


Renaming a managed table moves its location only if the table is created without a LOCATION clause and is under its database directory.
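
For example (the table names are hypothetical), a rename such as the following relocates the table's HDFS directory only when that condition holds:

ALTER TABLE sales RENAME TO sales_archive;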

Action Required

None

Handling the Keyword APPLICATION
If you use the keyword APPLICATION in your queries, you might need to modify the queries to prevent failure.

To prevent a query that uses a keyword from failing, enclose the keyword in backticks.

Before Upgrade to CDP

In CDH releases, such as CDH 5.13, queries that use the word APPLICATION execute successfully. For example, you could use this word as a table name.

> select f1, f2 from application

After Upgrade to CDP

A query that uses the keyword APPLICATION fails.

Action Required

Change applications. Enclose the keyword in backticks. For example:
SELECT field1, field2 FROM `application`;

Hive Configuration Property Changes
You need to know the property value changes made by the upgrade process as the change might impact your work. You might need to consider reconfiguring property value defaults that the upgrade changes.

Hive Configuration Property Values

The upgrade process changes the default values of some Hive configuration properties and adds new properties. The following list describes those changes that occur after upgrading from CDH or HDP to CDP.

datanucleus.connectionPool.maxPoolSize

Before upgrade: 30

After upgrade: 10

datanucleus.connectionPoolingType

Before upgrade: BONECP

After upgrade: HikariCP

hive.auto.convert.join.noconditionaltask.size

Before upgrade: 20971520

After upgrade: 52428800

Exception: Preserves pre-upgrade value if old default is overridden; otherwise, uses new default.

hive.auto.convert.sortmerge.join

Before upgrade: FALSE in the old CDH; TRUE in the old HDP.

After upgrade: TRUE

hive.auto.convert.sortmerge.join.to.mapjoin

Before upgrade: FALSE


After upgrade: TRUE

hive.cbo.enable

Before upgrade: FALSE

After upgrade: TRUE

hive.cbo.show.warnings

Before upgrade: FALSE

After upgrade: TRUE

hive.compactor.worker.threads

Before upgrade: 0

After upgrade: 5

hive.compute.query.using.stats

Before upgrade: FALSE

After upgrade: TRUE

hive.conf.hidden.list

Before upgrade:

javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,hive.metastore.dbaccess.ssl.truststore.password,fs.s3.awsAccessKeyId,fs.s3.awsSecretAccessKey,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.access.key,fs.s3a.secret.key,fs.s3a.proxy.password,dfs.adls.oauth2.credential,fs.adl.oauth2.credential,fs.azure.account.oauth2.client.secret

After upgrade:

javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,hive.druid.metadata.password,hive.driver.parallel.compilation.global.limit

hive.conf.restricted.list

Before upgrade:

hive.security.authenticator.manager,hive.security.authorization.manager,hive.users.in.admin.role,hive.server2.xsrf.filter.enabled,hive.spark.client.connect.timeout,hive.spark.client.server.connect.timeout,hive.spark.client.channel.log.level,hive.spark.client.rpc.max.size,hive.spark.client.rpc.threads,hive.spark.client.secret.bits,hive.spark.client.rpc.server.address,hive.spark.client.rpc.server.port,hive.spark.client.rpc.sasl.mechanisms,hadoop.bin.path,yarn.bin.path,spark.home,bonecp.,hikaricp.,hive.driver.parallel.compilation.global.limit,_hive.local.session.path,_hive.hdfs.session.path,_hive.tmp_table_space,_hive.local.session.path,_hive.hdfs.session.path,_hive.tmp_table_space

After upgrade:

hive.security.authenticator.manager,hive.security.authorization.manager,hive.security.metastore.authorization.manager,hive.security.metastore.authenticator.manager,hive.users.in.admin.role,hive.server2.xsrf.filter.enabled,hive.security.authorization.enabled,hive.distcp.privileged.doAs,hive.server2.authentication.ldap.baseDN,hive.server2.authentication.ldap.url,hive.server2.authentication.ldap.Domain,hive.server2.authentication.ldap.groupDNPattern,hive.server2.authentication.ldap.groupFilter,hive.server2.authentication.ldap.userDNPattern,hive.server2.authentication.ldap.userFilter,hive.server2.authentication.ldap.groupMembershipKey,hive.server2.authentication.ldap.userMembershipKey,hive.server2.authentication.ldap.groupClassKey,hive.server2.authentication.ldap.customLDAPQuery,hive.privilege.synchronizer.interval,hive.spark.client.connect.timeout,hive.spark.client.server.connect.timeout,hive.spark.client.channel.log.level,hive.spark.client.rpc.max.size,hive.spark.client.rpc.threads,hive.spark.client.secret.bits,hive.spark.client.rpc.server.address,hive.spark.client.rpc.server.port,hive.spark.client.rpc.sasl.mechanisms,bonecp.,hive.druid.broker.address.default,hive.druid.coordinator.address.default,hikaricp.,hadoop.bin.path,yarn.bin.path,spark.home,hive.driver.parallel.compilation.global.limit,_hive.local.session.path,_hive.hdfs.session.path,_hive.tmp_table_space,_hive.local.session.path,_hive.hdfs.session.path,_hive.tmp_table_space

hive.default.fileformat.managed

Before upgrade: None

After upgrade: ORC

hive.default.rcfile.serde

Before upgrade: org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe

After upgrade: org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe

Not supported in Impala. Impala cannot read Hive-created RC tables.

hive.driver.parallel.compilation

Before upgrade: FALSE

After upgrade: TRUE

hive.exec.dynamic.partition.mode

Before upgrade: strict

After upgrade: nonstrict


In CDP Private Cloud Base, accidental use of the dynamic partitioning feature is not prevented by default.

hive.exec.max.dynamic.partitions

Before upgrade: 1000

After upgrade: 5000

In CDP Private Cloud Base, fewer restrictions on dynamic partitioning occur than in the pre-upgrade CDH or HDP cluster.

hive.exec.max.dynamic.partitions.pernode

Before upgrade: 100

After upgrade: 2000

In CDP Private Cloud Base, fewer restrictions on dynamic partitioning occur than in the pre-upgrade CDH or HDP cluster.

hive.exec.post.hooks

Before upgrade:

com.cloudera.navigator.audit.hive.HiveExecHookContext,org.apache.hadoop.hive.ql.hooks.LineageLogger

After upgrade: org.apache.hadoop.hive.ql.hooks.HiveProtoLoggingHook

hive.exec.reducers.max

Before upgrade: 1099

After upgrade: 1009

A prime number is recommended.

Exception: Preserves pre-upgrade value if old default is overridden; otherwise, uses new default.

hive.execution.engine

Before upgrade: mr

After upgrade: tez

Tez is now the only supported execution engine. Existing queries that change the execution mode to Spark or MapReduce within a session, for example, fail.

hive.fetch.task.conversion

Before upgrade: minimal

After upgrade: more

hive.fetch.task.conversion.threshold

Before upgrade: 256MB

After upgrade: 1GB

Exception: Preserves pre-upgrade value if old default is overridden; otherwise, uses new default.

hive.hashtable.key.count.adjustment

Before upgrade: 1

After upgrade: 0.99

Exception: Preserves pre-upgrade value if old default is overridden; otherwise, uses new default.

hive.limit.optimize.enable


Before upgrade: FALSE

After upgrade: TRUE

hive.limit.pushdown.memory.usage

Before upgrade: 0.1

After upgrade: 0.04

Exception: Preserves pre-upgrade value if old default is overridden; otherwise, uses new default.

hive.mapjoin.hybridgrace.hashtable

Before upgrade: TRUE

After upgrade: FALSE

hive.mapred.reduce.tasks.speculative.execution

Before upgrade: TRUE

After upgrade: FALSE

hive.metastore.aggregate.stats.cache.enabled

Before upgrade: TRUE

After upgrade: FALSE

hive.metastore.disallow.incompatible.col.type.changes

Before upgrade: FALSE

After upgrade: TRUE

Schema evolution is more restrictive in CDP Private Cloud Base than in CDH to avoid data corruption. The new default disallows column type changes if the old and new types are incompatible.

hive.metastore.dml.events

Before upgrade: FALSE

After upgrade: TRUE

hive.metastore.event.message.factory

Before upgrade: org.apache.hadoop.hive.metastore.messaging.json.ExtendedJSONMessageFactory

After upgrade: org.apache.hadoop.hive.metastore.messaging.json.gzip.GzipJSONMessageEncoder

hive.metastore.uri.selection

Before upgrade: SEQUENTIAL

After upgrade: RANDOM

hive.metastore.warehouse.dir

Before upgrade: /user/hive/warehouse

After upgrade from CDH: /user/hive/warehouse

After upgrade from HDP: /warehouse/tablespace/managed/hive

Tables existing in the old (HDFS) warehouse path are migrated to the new location.

hive.optimize.metadataonly

Before upgrade: FALSE

After upgrade: TRUE


hive.optimize.point.lookup.min

Before upgrade: 31

After upgrade: 2

hive.prewarm.numcontainers

Before upgrade: 10

After upgrade: 3

hive.script.operator.env.blacklist

Before upgrade: hive.txn.valid.txns,hive.script.operator.env.blacklist

After upgrade: hive.txn.valid.txns,hive.txn.tables.valid.writeids,hive.txn.valid.writeids,hive.script.operator.env.blacklist

hive.security.authorization.sqlstd.confwhitelist

Before upgrade:

hive\.auto\..*hive\.cbo\..*hive\.convert\..*hive\.exec\.dynamic\.partition.*hive\.exec\..*\.dynamic\.partitions\..*hive\.exec\.compress\..*hive\.exec\.infer\..*hive\.exec\.mode.local\..*hive\.exec\.orc\..*hive\.exec\.parallel.*hive\.explain\..*hive\.fetch.task\..*hive\.groupby\..*hive\.hbase\..*hive\.index\..*hive\.index\..*hive\.intermediate\..*hive\.join\..*hive\.limit\..*hive\.log\..*hive\.mapjoin\..*hive\.merge\..*hive\.optimize\..*hive\.orc\..*hive\.outerjoin\..*hive\.parquet\..*hive\.ppd\..*hive\.prewarm\..*hive\.server2\.proxy\.userhive\.skewjoin\..*hive\.smbjoin\..*hive\.stats\..*hive\.strict\..*hive\.tez\..*hive\.vectorized\..*mapred\.map\..*mapred\.reduce\..*mapred\.output\.compression\.codecmapred\.job\.queuenamemapred\.output\.compression\.typemapred\.min\.split\.sizemapreduce\.job\.reduce\.slowstart\.completedmapsmapreduce\.job\.queuenamemapreduce\.job\.tagsmapreduce\.input\.fileinputformat\.split\.minsizemapreduce\.map\..*mapreduce\.reduce\..*mapreduce\.output\.fileoutputformat\.compress\.codecmapreduce\.output\.fileoutputformat\.compress\.typeoozie\..*tez\.am\..*tez\.task\..*tez\.runtime\..*tez\.queue\.namehive\.transpose\.aggr\.joinhive\.exec\.reducers\.bytes\.per\.reducerhive\.client\.stats\.countershive\.exec\.default\.partition\.namehive\.exec\.drop\.ignorenonexistenthive\.counters\.group\.namehive\.default\.fileformat\.managedhive\.enforce\.bucketmapjoinhive\.enforce\.sortmergebucketmapjoinhive\.cache\.expr\.evaluationhive\.query\.result\.fileformathive\.hashtable\.loadfactorhive\.hashtable\.initialCapacityhive\.ignore\.mapjoin\.hinthive\.limit\.row\.max\.sizehive\.mapred\.modehive\.map\.aggrhive\.compute\.query\.using\.statshive\.exec\.rowoffsethive\.variable\.substitutehive\.variable\.substitute\.depthhive\.autogen\.columnalias\.prefix\.includefuncnamehive\.autogen\.columnalias\.prefix\.labelhive\.exec\.check\.crossproductshive\.cli\.tez\.session\.asynchive\.compathive\.exec\.concatenate\.check\.indexhive\.display\.partition\.cols\.separatelyhive\.error\.on\.empty\.partitionhive\.execution\.enginehive\.exec\.copyfile\.maxsizehive\.exim\.uri\.scheme\.whitelisthive\.file\.max\.footerhive\.insert\.into\.multilevel\.dirshive\.localize\.resource\.num\.wait\.attemptshive\.multi\.insert\.move\.tasks\.share\.dependencieshive\.support\.quoted\.identifiershive\.resultset\.use\.unique\.column\.nameshive\.analyze\.stmt\.collect\.partlevel\.statshive\.exec\.schema\.evolutionhive\.server2\.logging\.operation\.levelhive\.server2\.thrift\.resultset\.serialize\.in\.taskshive\.support\.special\.characters\.tablenamehive\.exec\.job\.debug\.capture\.stacktraceshive\.exec\.job\.debug\.timeouthive\.llap\.io


\.enabledhive\.llap\.io\.use\.fileid\.pathhive\.llap\.daemon\.service\.hostshive\.llap\.execution\.modehive\.llap\.auto\.allow\.uberhive\.llap\.auto\.enforce\.treehive\.llap\.auto\.enforce\.vectorizedhive\.llap\.auto\.enforce\.statshive\.llap\.auto\.max\.input\.sizehive\.llap\.auto\.max\.output\.sizehive\.llap\.skip\.compile\.udf\.checkhive\.llap\.client\.consistent\.splitshive\.llap\.enable\.grace\.join\.in\.llaphive\.llap\.allow\.permanent\.fnshive\.exec\.max\.created\.fileshive\.exec\.reducers\.maxhive\.reorder\.nway\.joinshive\.output\.file\.extensionhive\.exec\.show\.job\.failure\.debug\.infohive\.exec\.tasklog\.debug\.timeouthive\.query\.id

After upgrade:

hive\.auto\..*hive\.cbo\..*hive\.convert\..*hive\.druid\..*hive\.exec\.dynamic\.partition.*hive\.exec\.max\.dynamic\.partitions.*hive\.exec\.compress\..*hive\.exec\.infer\..*hive\.exec\.mode.local\..*hive\.exec\.orc\..*hive\.exec\.parallel.*hive\.exec\.query\.redactor\..*hive\.explain\..*hive\.fetch.task\..*hive\.groupby\..*hive\.hbase\..*hive\.index\..*hive\.index\..*hive\.intermediate\..*hive\.jdbc\..*hive\.join\..*hive\.limit\..*hive\.log\..*hive\.mapjoin\..*hive\.merge\..*hive\.optimize\..*hive\.materializedview\..*hive\.orc\..*hive\.outerjoin\..*hive\.parquet\..*hive\.ppd\..*hive\.prewarm\..*hive\.query\.redaction\..*hive\.server2\.thrift\.resultset\.default\.fetch\.sizehive\.server2\.proxy\.userhive\.skewjoin\..*hive\.smbjoin\..*hive\.stats\..*hive\.strict\..*hive\.tez\..*hive\.vectorized\..*hive\.query\.reexecution\..*reexec\.overlay\..*fs\.defaultFSssl\.client\.truststore\.locationdistcp\.atomicdistcp\.ignore\.failuresdistcp\.preserve\.statusdistcp\.preserve\.rawxattrsdistcp\.sync\.foldersdistcp\.delete\.missing\.sourcedistcp\.keystore\.resourcedistcp\.liststatus\.threadsdistcp\.max\.mapsdistcp\.copy\.strategydistcp\.skip\.crcdistcp\.copy\.overwritedistcp\.copy\.appenddistcp\.map\.bandwidth\.mbdistcp\.dynamic\..*distcp\.meta\.folderdistcp\.copy\.listing\.classdistcp\.filters\.classdistcp\.options\.skipcrccheckdistcp\.options\.mdistcp\.options\.numListstatusThreadsdistcp\.options\.mapredSslConfdistcp\.options\.bandwidthdistcp\.options\.overwritedistcp\.options\.strategydistcp\.options\.idistcp\.options\.p.*distcp\.options\.updatedistcp\.options\.deletemapred\.map\..*mapred\.reduce\..*mapred\.output\.compression\.codecmapred\.job\.queue\.namemapred\.output\.compression\.typemapred\.min\.split\.sizemapreduce\.job\.reduce\.slowstart\.completedmapsmapreduce\.job\.queuenamemapreduce\.job\.tagsmapreduce\.input\.fileinputformat\.split\.minsizemapreduce\.map\..*mapreduce\.reduce\..*mapreduce\.output\.fileoutputformat\.compress\.codecmapreduce\.output\.fileoutputformat\.compress\.typeoozie\..*tez\.am\..*tez\.task\..*tez\.runtime\..*tez\.queue\.namehive\.transpose\.aggr\.joinhive\.exec\.reducers\.bytes\.per\.reducerhive\.client\.stats\.countershive\.exec\.default\.partition\.namehive\.exec\.drop\.ignorenonexistenthive\.counters\.group\.namehive\.default\.fileformat\.managedhive\.enforce\.bucketmapjoinhive\.enforce\.sortmergebucketmapjoinhive\.cache\.expr\.evaluationhive\.query\.result\.fileformathive\.hashtable\.loadfactorhive\.hashtable\.initialCapacityhive\.ignore\.mapjoin\.hinthive\.limit\.row\.max\.sizehive\.mapred\.modehive\.map\.aggrhive\.compute\.query\.using\.statshive\.exec\.rowoffsethive\.variable\.substitutehive\.variable\.substitute\.depthhive\.autogen\.columnalias\.prefix\.includefuncnamehive\.autogen\.columnalias\.prefix\.labelhive\.exec\.check\.crossproductshive\.cli\.tez\.session\.asynchive\.compathive\.display\.partition\.cols\.separatelyhive\.error


\.on\.empty\.partitionhive\.execution\.enginehive\.exec\.copyfile\.maxsizehive\.exim\.uri\.scheme\.whitelisthive\.file\.max\.footerhive\.insert\.into\.multilevel\.dirshive\.localize\.resource\.num\.wait\.attemptshive\.multi\.insert\.move\.tasks\.share\.dependencieshive\.query\.results\.cache\.enabledhive\.query\.results\.cache\.wait\.for\.pending\.resultshive\.support\.quoted\.identifiershive\.resultset\.use\.unique\.column\.nameshive\.analyze\.stmt\.collect\.partlevel\.statshive\.exec\.schema\.evolutionhive\.server2\.logging\.operation\.levelhive\.server2\.thrift\.resultset\.serialize\.in\.taskshive\.support\.special\.characters\.tablenamehive\.exec\.job\.debug\.capture\.stacktraceshive\.exec\.job\.debug\.timeouthive\.llap\.io\.enabledhive\.llap\.io\.use\.fileid\.pathhive\.llap\.daemon\.service\.hostshive\.llap\.execution\.modehive\.llap\.auto\.allow\.uberhive\.llap\.auto\.enforce\.treehive\.llap\.auto\.enforce\.vectorizedhive\.llap\.auto\.enforce\.statshive\.llap\.auto\.max\.input\.sizehive\.llap\.auto\.max\.output\.sizehive\.llap\.skip\.compile\.udf\.checkhive\.llap\.client\.consistent\.splitshive\.llap\.enable\.grace\.join\.in\.llaphive\.llap\.allow\.permanent\.fnshive\.exec\.max\.created\.fileshive\.exec\.reducers\.maxhive\.reorder\.nway\.joinshive\.output\.file\.extensionhive\.exec\.show\.job\.failure\.debug\.infohive\.exec\.tasklog\.debug\.timeouthive\.query\.idhive\.query\.tag

hive.security.command.whitelist

Before upgrade: set,reset,dfs,add,list,delete,reload,compile

After upgrade: set,reset,dfs,add,list,delete,reload,compile,llap

hive.server2.enable.doAs

Before upgrade: TRUE (in case of an insecure cluster only)

After upgrade: FALSE (in all cases)

Affects only insecure clusters by turning off impersonation. Permission issues are expected to arise for affected clusters.

hive.server2.idle.session.timeout

Before upgrade: 12 hours

After upgrade: 24 hours

Exception: Preserves pre-upgrade value if old default is overridden; otherwise, uses new default.

hive.server2.max.start.attempts

Before upgrade: 30

After upgrade: 5

hive.server2.parallel.ops.in.session

Before upgrade: TRUE

After upgrade: FALSE

A Tez limitation requires disabling this property; otherwise, queries submitted concurrently on a single JDBC connection fail or execute slower.

hive.server2.support.dynamic.service.discovery

Before upgrade: FALSE

After upgrade: TRUE

hive.server2.tez.initialize.default.sessions

Before upgrade: FALSE


After upgrade: TRUE

hive.server2.thrift.max.worker.threads

Before upgrade: 100

After upgrade: 500

Exception: Preserves pre-upgrade value if the old default is overridden; otherwise, uses new default.

hive.server2.thrift.resultset.max.fetch.size

Before upgrade: 1000

After upgrade: 10000

hive.service.metrics.file.location

Before upgrade: /var/log/hive/metrics-hiveserver2/metrics.log

After upgrade: /var/log/hive/metrics-hiveserver2-hiveontez/metrics.log

This location change is due to a service name change.

hive.stats.column.autogather

Before upgrade: FALSE

After upgrade: TRUE

hive.stats.deserialization.factor

Before upgrade: 1

After upgrade: 10

hive.support.special.characters.tablename

Before upgrade: FALSE

After upgrade: TRUE

hive.tez.auto.reducer.parallelism

Before upgrade: FALSE

After upgrade: TRUE

hive.tez.bucket.pruning

Before upgrade: FALSE

After upgrade: TRUE

hive.tez.container.size

Before upgrade: -1

After upgrade: 4096

hive.tez.exec.print.summary

Before upgrade: FALSE

After upgrade: TRUE

hive.txn.manager

Before upgrade: org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager

After upgrade: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager

hive.vectorized.execution.mapjoin.minmax.enabled

Before upgrade: FALSE

After upgrade: TRUE


hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled

Before upgrade: FALSE

After upgrade: TRUE

hive.vectorized.use.row.serde.deserialize

Before upgrade: FALSE

After upgrade: TRUE

Related Information
Custom Configuration (about Cloudera Manager Safety Valve)

Customizing Critical Hive Configurations
You receive property configuration guidelines, including which properties you need to reconfigure after upgrading. You understand which overrides the upgrade process carries over from your old cluster to your new cluster.

The CDP upgrade process tries to preserve your Hive configuration property overrides. These overrides are the custom values you set to configure Hive in the old CDH or HDP cluster. The upgrade process does not preserve all overrides. For example, a custom value you set for hive.exec.max.dynamic.partitions.pernode is preserved. In the case of other properties, for example hive.cbo.enable, the upgrade ignores any override and just sets the CDP-recommended value. Hive Configuration Requirements and Recommendations (link below) indicates which overrides the upgrade process preserves or disregards (Safety Valve Overrides column).

The upgrade process does not preserve overrides to the configuration values of the following properties that you likely need to reconfigure to meet your needs:

• hive.conf.hidden.list

• hive.conf.restricted.list

• hive.exec.post.hooks

• hive.script.operator.env.blacklist

• hive.security.authorization.sqlstd.confwhitelist

• hive.security.command.whitelist

The Apache Hive Wiki (link below) describes these properties. The values of these properties are lists.

The upgrade process ignores your old list and sets a new generic list. For example, the hive.security.command.whitelist value is a list of security commands on the whitelist. Any whitelist overrides you set in the old cluster are not preserved. The new default is probably a shorter (more restrictive) list than the original default you were using in the old cluster. You need to customize the CDP whitelist to match your needs.

Check and change each property listed above after upgrading as described in the next topic.

Consider reconfiguring more property values than the six listed above. Even if you didn't override the default value in the old cluster, the CDP default might have changed in a way that impacts your work. Hive Configuration Changes (link below) lists the old CDH/HDP and new CDP defaults.

Related Information
Hive Configuration Property Changes
Apache Hive Wiki: Configuration Properties
Hive Configuration Requirements and Recommendations

Set Hive Configuration Overrides
You need to know how to configure the critical customizations that the upgrade process does not preserve from your old Hive cluster. Referring to your records about your old configuration, you follow steps to set at least six critical property values.


About this task
By design, the six critical properties that you need to customize are not visible in Cloudera Manager, as you can see from the Visible in CM column of Configurations Requirements and Recommendations (see link below). You use the Safety Valve to add these properties to hive-site.xml as shown in this task.

Procedure

1. In Cloudera Manager > Clusters, select the Hive on Tez service. Click Configuration, and search for hive-site.xml.

2. In Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml, click +.

3. In Name, add the hive.conf.hidden.list property.

4. In Value, add your custom list.

5. Customize the other critical properties: hive.conf.restricted.list, hive.exec.post.hooks, hive.script.operator.env.blacklist, hive.security.authorization.sqlstd.confwhitelist, and hive.security.command.whitelist.

6. Save the changes and restart the Hive service.

7. Look at the Configurations Requirements and Recommendations to understand which overrides were preserved or not.

Related Information
Hive Configuration Requirements and Recommendations

Hive Configuration Requirements and Recommendations
You need to set certain Hive and HiveServer (HS2) configuration properties after upgrading. You review recommendations for setting up CDP Private Cloud Base for your needs, and understand which configurations remain unchanged after upgrading, which impact performance, and default values.

Requirements and Recommendations

The following table includes the Hive service and HiveServer properties that the upgrade process changes. Other property values (not shown) are carried over unchanged from CDH or HDP to CDP.

• Set After Upgrade column: properties you need to manually configure after the upgrade to CDP. Pre-existing customized values are not preserved after the upgrade.

• Default Recommended column: properties that the upgrade process changes to a new value that you are strongly advised to use.

• Impacts Performance column: properties changed by the upgrade process that you set to tune performance.

• Safety Valve Overrides column: How the upgrade process handles Safety Valve overrides.


• Disregards: the upgrade process removes any old CDH Safety Valve configuration snippets from the new CDP configuration.

• Preserves: the upgrade process carries over any old CDH snippets to the new CDP configuration.

• Not applicable: the value of the old parameter is preserved.
• Visible in CM column: property is visible in Cloudera Manager after upgrading.

If a property is not visible, and you want to configure it, use the Cloudera Manager Safety Valve (see link below) to safely add the parameter to the correct file, for example to a cluster-wide hive-site.xml file.

Table 6:

Property | Set After Upgrade | Default Recommended | Impacts Performance | New Feature | Safety Valve Overrides | Visible in CM

In the rows that follow, a # indicates that the column applies to the property.

datanucleus.connectionPool.maxPoolSize # Preserve

datanucleus.connectionPoolingType # Disregard

hive.async.log.enabled Disregard #

hive.auto.convert.join.noconditionaltask.size Not applicable #

hive.auto.convert.sortmerge.join Preserve

hive.auto.convert.sortmerge.join.to.mapjoin Preserve

hive.cbo.enable Disregard #

hive.cbo.show.warnings Disregard

hive.compactor.worker.threads # Disregard #

hive.compute.query.using.stats # Disregard #

hive.conf.hidden.list # Disregard

hive.conf.restricted.list # Disregard

hive.default.fileformat.managed Disregard #

hive.default.rcfile.serde # Preserve

hive.driver.parallel.compilation Disregard #

hive.exec.dynamic.partition.mode Disregard

hive.exec.max.dynamic.partitions Preserve

hive.exec.max.dynamic.partitions.pernode Preserve

hive.exec.post.hooks # Disregard

hive.exec.reducers.max # or other prime number

Not applicable #

hive.execution.engine Disregard

hive.fetch.task.conversion # Not applicable #

hive.fetch.task.conversion.threshold # Not applicable #

hive.hashtable.key.count.adjustment # Preserve

hive.limit.optimize.enable # Disregard

hive.limit.pushdown.memory.usage # Not Applicable #

hive.mapjoin.hybridgrace.hashtable # # Disregard

hive.mapred.reduce.tasks.speculative.execution # Disregard


hive.metastore.aggregate.stats.cache.enabled # # Disregard

hive.metastore.disallow.incompatible.col.type.changes Disregard

hive.metastore.dml.events Disregard #

hive.metastore.event.message.factory # Disregard

hive.metastore.uri.selection # Disregard

hive.metastore.warehouse.dir Preserve #

hive.optimize.metadataonly # Disregard

hive.optimize.point.lookup.min Disregard

hive.prewarm.numcontainers Disregard

hive.script.operator.env.blacklist # Disregard

hive.security.authorization.sqlstd.confwhitelist # Disregard

hive.security.command.whitelist # Disregard

hive.server2.enable.doAs Disregard #

hive.server2.idle.session.timeout Not applicable #

hive.server2.max.start.attempts Preserve

hive.server2.parallel.ops.in.session Preserve

hive.server2.support.dynamic.service.discovery # Disregard #

hive.server2.tez.initialize.default.sessions # Disregard

hive.server2.thrift.max.worker.threads Not Applicable #

hive.server2.thrift.resultset.max.fetch.size Preserve

hive.service.metrics.file.location Disregard #

hive.stats.column.autogather # Disregard

hive.stats.deserialization.factor # Disregard

hive.support.special.characters.tablename # Disregard

hive.tez.auto.reducer.parallelism # Disregard #

hive.tez.bucket.pruning # Disregard #

hive.tez.container.size # Disregard #

hive.tez.exec.print.summary # Disregard #

hive.txn.manager # Disregard #

hive.vectorized.execution.mapjoin.minmax.enabled # Disregard

hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled # Disregard

hive.vectorized.use.row.serde.deserialize # Disregard

Related Information
Custom Configuration (about Cloudera Manager Safety Valve)

Removing Hive on Spark Configurations
Information presented to you includes the Hive on Spark configuration, which is no longer supported and which you must remove as instructed.

In CDP, there is no Hive-Spark dependency. The Spark site and libs are not in the classpath.


Before Upgrade to CDP

CDH supported Hive on Spark and the following configuration to enable Hive on Spark: set hive.execution.engine=spark

After Upgrade to CDP

CDP does not support Hive on Spark. Scripts that enable Hive on Spark do not work.

Action Required

Remove set hive.execution.engine=spark from your scripts.

Update Ranger Table Policies
Although the upgrade process makes no change to the location of external tables, if you moved tables during the upgrade process, you need to know the methods for accessing external tables in HDFS.

About this task

Set up access to external tables in HDFS using one of the following methods.

• Set up a Hive HDFS policy in Ranger (recommended) to include the paths to external table data.
• Put an HDFS ACL in place. Store the external text file, for example a comma-separated values (CSV) file, in HDFS to serve as the data source for the external table.

Setting Up Access Control Lists
Several sources of information about setting up HDFS ACLs, plus a brief Ranger overview and pointer to Ranger information, prepare you to set up Hive authorization.

In CDP, HDFS supports POSIX ACLs (Access Control Lists) to assign permissions to users and groups. In lieu of Ranger policies, you use HDFS ACLs to check and make any necessary changes to HDFS permissions. For more information, see HDFS ACLs, the Apache Software Foundation HDFS Permissions Guide, and HDFS ACL Permissions.

In Ranger, you give multiple groups and users specific permissions based on your use case. You apply permissions to a directory tree instead of dealing with individual files. For more information, see Authorizing Apache Hive Access.

Related Information
HDFS ACL Permissions Model
HDFS ACLs
Apache Hive 3 Architectural Overview
Configure a Resource-based Policy: Hive

Configure HiveServer for ETL using YARN queues
If you upgrade from CDH and want to run an ETL job, you need to add several configuration properties to allow placement of the Hive workload on the YARN queue manager.

Procedure

1. In Cloudera Manager, click Clusters > Hive > Configuration.

2. Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting.

3. In the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting, click +.

4. In Name enter the property hive.server2.tez.initialize.default.sessions and in value enter false.

5. In Name enter the property hive.server2.tez.queue.access.check and in value enter true.


6. In Name enter the property hive.server2.tez.sessions.custom.queue.allowed and in value enter true.

Configure Encryption Zone Security
Under certain conditions, you need to perform a security-related task to allow access to tables stored in encryption zones. You find out how to prevent access problems to these tables.

About this task
Hive on Tez cannot run some queries on tables stored in encryption zones under certain conditions. When the Hadoop Key Management Server (KMS) connection is SSL-encrypted and a self-signed certificate is used, perform the following procedure.

Procedure

1. Perform either of the following actions:

• Install the self-signed SSL certificate into the cacerts file on all hosts and skip the steps below.
• Perform the steps below.

2. Copy the ssl-client.xml to a directory that is available on all hosts.

3. In Cloudera Manager, click Clusters > Hive on Tez > Configuration.

4. Search for the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting.

5. In the Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml setting, click +.

6. In Name enter the property tez.aux.uris and in value enter path-to-ssl-client.xml.

Use HWC/Spark Direct Reader for Spark Apps/ETL
You need to know a little about the Hive Warehouse Connector (HWC) and how to find more information, because to access Hive from Spark you need to use HWC implicitly or explicitly.

HWC is a Spark library/plugin that is launched with the Spark app. Use the Spark Direct Reader and HWC for ETL.

The Hive Warehouse Connector is designed to access managed ACID v2 Hive tables from Spark. Apache Ranger and the HiveWarehouseConnector library provide fine-grained row and column access to the data. HWC supports spark-submit and pyspark. The Spark Thrift Server is not supported.

Related Information
Hive Warehouse Connector for accessing Apache Spark data
Spark Direct Reader for accessing Spark data

Configure HiveServer HTTP Mode
If you use Knox, you might need to change the HTTP mode configuration. If you installed Knox on CDP Private Cloud Base and want to proxy HiveServer with Knox, you need to change the default HiveServer transport mode (hive.server2.transport.mode) from binary to http.

Procedure

1. Click Cloudera Manager > Clusters > HIVE_ON_TEZ > Configuration

2. In Search, type transport.

3. In HiveServer2 Transport Mode, select http.


4. Save and restart Hive on Tez.

Unsupported Interfaces and Features
You need to know the interfaces available in HDP or CDH platforms that are no longer supported in CDP. Some features you might have used are also unsupported.

Unsupported Interfaces

• Druid
• Hcat CLI
• Hive CLI (replaced by Beeline)
• Hive View
• LLAP (available in CDP Public Cloud only)
• MapReduce execution engine (replaced by Tez)
• Pig
• S3 (available in CDP Public Cloud only)
• Spark execution engine (replaced by Tez)
• Spark thrift server

Spark and Hive tables interoperate using the Hive Warehouse Connector.
• SQL Standard Authorization
• Tez View
• WebHCat

You can use Hue in lieu of Hive View.

Unsupported Features

CDP does not support the following features that were available in HDP and CDH platforms:

• CREATE TABLE that specifies a managed table location

Do not use the LOCATION clause to create a managed table. Hive assigns a default location in the warehouse to managed tables.

• CREATE INDEX

Hive builds and stores indexes in ORC or Parquet within the main table, instead of in a different table, automatically. Set hive.optimize.index.filter to enable use (not recommended; use materialized views instead, as sketched after this list). Existing indexes are preserved and migrated in Parquet or ORC to CDP during upgrade.
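
As a rough, hedged illustration of the materialized-view alternative (the view, table, and column names are hypothetical, and the source table is assumed to be a transactional table):

CREATE MATERIALIZED VIEW mv_sales_by_region AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;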

Unsupported Connector Use

CDP does not support Sqoop exports using the Hadoop jar command (the Java API) that Teradata documents. For more information, see the link below.

Changes to CDH Tables
The location of existing tables after a CDH to CDP upgrade does not change. Upgrading CDH to CDP Private Cloud Base converts Hive managed tables to external tables in Hive 3.

When the upgrade process converts a managed table to external, it sets the table property external.table.purge to true. The table is equivalent to a managed table having purge set to true in your old CDH cluster.
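
For example, a quick way to confirm how a converted table is classified (the table name is hypothetical):

DESCRIBE FORMATTED sales;
-- In the output, look for Table Type: EXTERNAL_TABLE and the table property external.table.purge=true.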

Managed tables on the HDFS in /user/hive/warehouse before the upgrade remain there after the conversion to external. The upgrade process sets the hive.metastore.warehouse.dir property to this location, designating it the Hive warehouse location. New tables that you create in CDP are stored in the Hive warehouse you designate.

Tables that were external before the upgrade are not relocated. You need to set HDFS policies in Ranger to access external tables, or set up HDFS ACLs (see the link below).

You can change the location of the warehouse using the Hive Metastore Action menu in Cloudera Manager.

• Hive > Action Menu > Create Hive Warehouse Directory
• Hive > Action Menu > Create Hive Warehouse External Directory

Related Information
HDFS ACLs


Changes to HDP Tables
To locate and use your Apache Hive 3 tables after an upgrade, you need to understand the changes that occur during the upgrade process.

Managed, ACID tables that are not owned by the hive user remain managed tables after the upgrade, but hive becomes the owner.

After the upgrade, the format of a Hive table is the same as before the upgrade. For example, native or non-native tables remain native or non-native, respectively.

After the upgrade, the location of managed tables or partitions does not change under any one of the following conditions:

• The old table or partition directory was not in its default location /apps/hive/warehouse before the upgrade.
• The old table or partition is in a different file system than the new warehouse directory.
• The old table or partition directory is in a different encryption zone than the new warehouse directory.

Otherwise, the upgrade process from HDP to CDP moves managed files to the Hive warehouse /warehouse/tablespace/managed/hive. The upgrade process carries the external files over to CDP with no change in location. By default, Hive places any new external tables you create in /warehouse/tablespace/external/hive. The upgrade process sets the hive.metastore.warehouse.dir property to this location, designating it the Hive warehouse location.

Changes to table references using dot notation

Upgrading to CDP includes the Hive-16907 bug fix, which rejects `db.table` in SQL queries. The dot (.) is not allowed in table names. To reference the database and table in a table name, both must be enclosed in backticks as follows: `db`.`table`.
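For example, with hypothetical database and table names:

SELECT * FROM `sales_db`.`orders`;   -- valid: database and table quoted separately
SELECT * FROM `sales_db.orders`;     -- rejected: dot inside a single quoted identifier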

Changes to ACID properties

Hive 3.x in CDP Private Cloud Base supports transactional and non-transactional tables. Transactional tables have atomicity, consistency, isolation, and durability (ACID) properties. In Hive 2.x, the initial version of ACID transaction processing was ACID v1. In Hive 3.x, the mature version of ACID is ACID v2, which is the default table type in CDP Private Cloud Base.

Native and non-native storage formats

Storage formats are a factor in upgrade changes to table types. Hive 2.x and 3.x support the following native and non-native storage formats:

• Native: Tables with built-in support in Hive, such as those in the following file formats:

• Text
• Sequence File
• RC File
• AVRO File
• ORC File
• Parquet File

• Non-native: Tables that use a storage handler, such as the DruidStorageHandler or HBaseStorageHandler

CDP upgrade changes to HDP table types

The following table compares Hive table types and ACID operations before an upgrade from HDP 2.x and after an upgrade to CDP. The ownership of the Hive table file is a factor in determining table types and ACID operations after the upgrade.


Table 7: HDP 2.x and CDP Table Type Comparison

HDP 2.x Table Type | ACID v1 | Format | Owner (user) of Hive Table File | CDP Table Type | ACID v2
External | No | Native or non-native | hive or non-hive | External | No
Managed | Yes | ORC | hive or non-hive | Managed, updatable | Yes
Managed | No | ORC | hive | Managed, updatable | Yes
Managed | No | ORC | non-hive | External, with data delete | No
Managed | No | Native (but non-ORC) | hive | Managed, insert only | Yes
Managed | No | Native (but non-ORC) | non-hive | External, with data delete | No
Managed | No | Non-native | hive or non-hive | External, with data delete | No

Migrating Impala Data to CDP
Before migrating Impala workloads from the CDH platform to CDP, you must be aware of the semantic and behavioral differences between CDH and CDP Impala and the activities that need to be performed prior to the data migration.

To successfully migrate your critical Impala workloads to the cloud environment, you must learn about the capacity requirements in the target environment and understand the performance differences between your current environment and the target environment.

Related Information
On-demand Metadata
Impala Authorization
Impala metadata collection
SQL transactions in Impala

Impala Changes between CDH and CDP
There are some differences between Impala in CDH and Impala in CDP. These changes affect Impala after you migrate your workload from CDH 5.13-5.16 or CDH 6.1 or later to CDP Private Cloud Base or Public Cloud. Some of these differences require you to change your Impala scripts or workflow.

The version of Impala you used in CDH version 5.11 - 5.16 or 6.1 or later changes to Impala 3.4 when you migrate the workload to CDP Private Cloud Base or Public Cloud.

Change location of Datafiles
If Impala managed tables are located on the HDFS in /user/hive/warehouse before the migration, the tables, converted to external, remain there. The migration process sets the hive.metastore.warehouse.dir property to this location, designating it the Hive warehouse location. You can change the location of the warehouse using Cloudera Manager.

About this task

The location of existing tables does not change after a CDH to CDP migration. In CDP, there are separate HDFS directories for managed and external tables.


• The data files for managed tables are available in the warehouse location specified by the Cloudera Manager configuration setting, Hive Warehouse Directory.
• The data files for external tables are available in the warehouse location specified by the Cloudera Manager configuration setting, Hive Warehouse External Directory.

After migration, the hive.metastore.warehouse.dir property is set to /user/hive/warehouse, where the Impala managed tables are located.

You can change the location of the warehouse using the Hive Metastore Action menu in Cloudera Manager.

Procedure

Create Hive Directories using the Hive Configuration page

a) Hive > Action Menu > Create Hive User Directory
b) Hive > Action Menu > Create Hive Warehouse Directory
c) Hive > Action Menu > Create Hive Warehouse External Directory

Set Storage Engine ACLs
You must be aware of the steps to set ACLs for Impala to allow Impala to write to the Hive Warehouse Directory.


About this task

After migration, the hive.metastore.warehouse.dir property is set to /user/hive/warehouse, where the Impala managed tables are located. When the Impala workload is migrated from CDH to CDP, the ACL settings are automatically set for the default warehouse directories. If you changed the default location of the warehouse directories after migrating to CDP, then follow these steps to allow Impala to write to the Hive Warehouse Directory.

Complete the initial configurations in the free-form fields on the Hive/Impala Configuration pages in Cloudera Manager to allow Impala to write to the Hive Warehouse Directory.

Procedure

1. Set Up Impala User ACLs using the Impala Configuration page

a) Impala > Action Menu > Set the Impala user ACLs on warehouse directory
b) Impala > Action Menu > Set the Impala user ACLs on external warehouse directory

2. Cloudera Manager sets the ACL for the user "Impala". However, before starting the Impala service, verify the permissions and ACLs set on the individual database directories using the sub-commands getfacl and setfacl.

a) Verify the ACLs of HDFS directories for managed and external tables using getfacl.

Example:

$ hdfs dfs -getfacl hdfs:///warehouse/tablespace/managed/hive
# file: hdfs:///warehouse/tablespace/managed/hive
# owner: hive
# group: hive
user::rwx
group::rwx
other::---
default:user::rwx
default:user:impala:rwx
default:group::rwx
default:mask::rwx
default:other::---

$ hdfs dfs -getfacl hdfs:///warehouse/tablespace/external/hive
# file: hdfs:///warehouse/tablespace/external/hive
# owner: hive
# group: hive
# flags: --t
user::rwx
group::rwx
other::rwx
default:user::rwx
default:user:impala:rwx
default:group::rwx
default:mask::rwx
default:other::rwx

b) If necessary, set the ACLs of the HDFS directories using setfacl (for example, to grant the impala user default rwx access, as shown in the getfacl output above).

Example:

$ hdfs dfs -setfacl -m default:user:impala:rwx hdfs:///warehouse/tablespace/managed/hive
$ hdfs dfs -setfacl -m default:user:impala:rwx hdfs:///warehouse/tablespace/external/hive

For more information on using the sub-commands getfacl and setfacl, see Using CLI commands to create and list ACLs.

c) The above examples show the user Impala as part of the Hive group. If, in your setup, the user Impala does not belong to the group Hive, ensure that the group the user Impala belongs to has WRITE privileges assigned on the directory.

To view the groups that the user Impala belongs to:

$ id impala
uid=973(impala) gid=971(impala) groups=971(impala),972(hive)

Related Information
HDFS ACLs

Automatic Invalidation/Refresh of Metadata
To pick up new information when raw data is ingested into tables, you can use the hms_event_polling_interval_s flag.

New Default Behavior

When raw data is ingested into tables, new HMS metadata and filesystem metadata are generated. In CDH, to pick up this new information, you must manually issue an INVALIDATE or REFRESH command. However, in CDP, this feature is controlled by the hms_event_polling_interval_s flag. This flag is set to 2 seconds by default. This option automatically refreshes the tables as changes are detected in HMS. When automatic invalidate/refresh of metadata is enabled, the Catalog Server polls Hive Metastore (HMS) notification events at a configurable interval and automatically applies the changes to the Impala catalog. If specific tables that are not supported by event polling need to be refreshed, you must run a table-level INVALIDATE or REFRESH command.
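For example, for a table that is not covered by event polling, you can still refresh its metadata manually from Impala; the database and table names below are hypothetical:

-- Refresh file and partition metadata for one table.
REFRESH sales_db.orders;

-- Reload the table metadata completely if its structure changed outside Impala.
INVALIDATE METADATA sales_db.orders;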

For more information on Automatic Invalidation of metadata, see Automatic Invalidation

Related Information
Automatic Invalidation

Metadata Improvements
In CDP, all catalog metadata improvements are enabled by default. You can use the following knobs to control how Impala manages its metadata to improve performance and scalability.

use_local_catalog

In CDP, the on-demand use_local_catalog mode is set to True by default on all the Impala coordinators so that the Impala coordinators pull metadata as needed from catalogd and cache it locally.


This results in many performance and scalability improvements, such as reduced memory footprint on coordinators and automatic cache eviction.

catalog_topic_mode

The granularity of on-demand metadata fetches is at the partition level between the coordinator and catalogd. Common use cases like add/drop partitions do not trigger unnecessary serialization/deserialization of large metadata.

The feature can be used in either of the following modes.

Metadata on-demand mode

In this mode, all coordinators use the metadata on-demand.

Set the following on catalogd:

--catalog_topic_mode=minimal

Set the following on all impalad coordinators:

--use_local_catalog=true

Mixed mode

In this mode, only some coordinators are enabled to use the metadata on-demand.

Cloudera recommends that you use the mixed mode only for testing the local catalog’s impact on heap usage.

Set the following on catalogd:

--catalog_topic_mode=mixed

Set the following on impalad coordinators with metadata on-demand:

--use_local_catalog=true

Limitation:

HDFS caching is not supported on coordinators running in on-demand metadata mode.

Reference:

See Impala Metadata Management for the details about catalog improvements.

Related Information
Impala Metadata Management

Default Managed Tables
In CDP, managed tables are transactional tables with the insert_only property by default. You must be aware of the new default behavior for modifying file systems on a managed table in CDP and the methods to switch to the old behavior.

New Default Behavior

• You can no longer perform file system modifications (add/remove files) on a managed table in CDP. The directory structure for transactional tables is different from that of non-transactional tables, and any out-of-band files that are added may or may not be picked up by Hive and Impala.

• The insert_only transactional tables cannot currently be altered in Impala. The ALTER TABLE statement on a transactional table currently displays an error.

• Impala does not currently support compaction on transactional tables. You should use Hive to compact the tables.


• The SELECT, INSERT, INSERT OVERWRITE, and TRUNCATE statements are supported on the insert-only transactional tables.

Steps to switch to the CDH behavior:

• If you do not want transactional tables, set the DEFAULT_TRANSACTIONAL_TYPE query option to NONE so that any newly created managed tables are not transactional by default.

• External tables do not drop the data files when the table is dropped. To purge the data along with the table when the table is dropped, add external.table.purge = true in the table properties. When external.table.purge is set to true, the data is removed when the DROP TABLE statement is executed. See the example following this list.
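A minimal sketch of both options, using hypothetical table names; DEFAULT_TRANSACTIONAL_TYPE applies to the current Impala session:

-- Create new managed tables as non-transactional, as in CDH.
SET DEFAULT_TRANSACTIONAL_TYPE=NONE;
CREATE TABLE legacy_style_table (id INT);

-- Make an external table purge its data files on DROP TABLE.
ALTER TABLE ext_table SET TBLPROPERTIES ('external.table.purge'='true');
DROP TABLE ext_table;   -- data files are removed as well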

Automatic Refresh of Tables on Impala Clusters
The property enable_insert_events is used in CDP to refresh the tables or partitions automatically on other Impala clusters when Impala inserts into a table.

enable_insert_events

When Impala inserts into a table, it refreshes the underlying table or partition. When the enable_insert_events configuration is set to True (the default), Impala generates INSERT event types which, when received by other Impala clusters, automatically refresh the tables or partitions.

Note: Event processing must be ON for this property to work.

Related Information
Automatic Invalidation

Interoperability between Hive and Impala
This topic describes the changes made in CDP for optimal interoperability between Hive and Impala and an improved user experience.

Statistics Interoperability Between Hive and Impala

New default behavior:

Statistics for tables are now engine specific, namely Hive or Impala, so that each engine uses its own statistics and does not overwrite the statistics generated by the other engine.

When you issue the COMPUTE STATS statement in Impala, you need to issue the corresponding statement in Hive to ensure both Hive and Impala statistics are accurate.

The Impala COMPUTE STATS command does not overwrite the Hive stats for the same table.
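For example, for a hypothetical table named store_sales, you would keep the statistics aligned by running the Impala statement and its Hive counterparts:

-- In impala-shell:
COMPUTE STATS store_sales;

-- In Beeline (Hive), the corresponding statements:
ANALYZE TABLE store_sales COMPUTE STATISTICS;
ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS;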

Steps to switch to the CDH behavior:

There is no workaround.

Hive Default File Format Interoperability

New default behavior:

The managed tables created by Hive are of the ORC file format, by default, and support full transactional capabilities. If you create a table without specifying the STORED AS clause and load data from Hive, then such tables are not readable or writable by Impala. But Impala can continue to read non-transactional and insert-only transactional ORC tables.

Steps to switch to the CDH behavior:

• You must use the STORED AS PARQUET clause when you create tables in Hive if you want interoperability with Impala on those tables (see the example after this list).


• If you want to change this default file format at the system level, in the Hive_on_Tez service configuration in Cloudera Manager, set the hive_default_fileformat_managed field to parquet.
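A hedged sketch of the first option; the table and column names are hypothetical:

-- In Hive (Beeline), create the table as Parquet so Impala can also read and write it.
CREATE TABLE web_logs (ip STRING, ts TIMESTAMP, url STRING)
STORED AS PARQUET;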

Impala supports a number of file formats used in Apache Hadoop. It can also load and query data files produced by other Hadoop components such as Hive. After upgrading from any CDH 5.x version to CDP Private Cloud Base 7.1, if you create an RC file in Hive using the default LazyBinaryColumnarSerDe, Impala will not be able to read the RC file. However, you can set the configuration option hive.default.rcfile.serde to ColumnarSerDe to maintain interoperability between Hive and Impala.

Managed and External Tablespace Directories

New default behavior:

In CDP, there are separate HDFS directories for managed and external tables.

• The data files for managed tables are located in the warehouse location specified by the Cloudera Manager configuration setting, hive_warehouse_directory.
• The data files for external tables are located in the warehouse location specified by the Cloudera Manager configuration setting, hive_warehouse_external_directory.

If you perform file system level operations for adding/removing files on the table, you need to consider whether it is an external table or a managed table to find the location of the table directory.

Steps to switch to the CDH behavior:

Check the output of the DESCRIBE FORMATTED command to find the table location.
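For example, on a hypothetical table:

DESCRIBE FORMATTED web_logs;
-- The "Location:" row shows the table directory; the "Table Type:" row
-- (MANAGED_TABLE or EXTERNAL_TABLE) indicates which warehouse directory applies.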

Related Information
DATE Data Type
TIMESTAMP

ORC Support Disabled for Full-Transactional Tables
In CDP 7.2.0 and earlier versions, ORC table support is disabled for Impala queries. However, you have an option to switch to the CDH behavior by using the ENABLE_ORC_SCANNER query option.

New Default Behavior

In CDP 7.2.0 and earlier versions, if you use Impala to query ORC tables, the queries fail. To mitigate this situation, you must add an explicit STORED AS clause to the code creating Hive tables and use a format Impala can read. Another option is to set a global configuration in Cloudera Manager to change hive_default_fileformat_managed.

Steps to switch to the CDH behavior:

Set the query option ENABLE_ORC_SCANNER to TRUE to re-enable ORC table support.

This option does not work on a full transactional ORC table, and the queries return an error.

Note: If you are using CDP 7.2.1 or later versions, the ENABLE_ORC_SCANNER option is enabled by default and you can use Impala to query ORC tables without any manual intervention.

ORC vs Parquet in CDP

The differences between the Optimized Row Columnar (ORC) file format for storing Hive data and Parquet for storing Impala data are important to understand. Query performance improves when you use the appropriate format for your application. For a comparison of Hive and Impala support for ORC and Parquet in CDP Public Cloud and CDP Private Cloud Base, see ORC vs Parquet.

Authorization Provider for Impala
In CDP, Ranger is the authorization provider instead of Sentry. You can use the BDR service available in CDH to migrate the permissions to CDP. You must be aware of how Ranger enforces a policy in CDP, which may be different from using Sentry.


New behavior:

• The CREATE ROLE, GRANT ROLE, SHOW ROLE statements are not supported as Ranger currently does not support roles.

• When a particular resource is renamed, the policy is currently not automatically transferred to the newly renamed resource.

• SHOW GRANT with an invalid user/group does not return an error.

The following table lists the different access type requirements to run SQL statements in Impala.

SQL Statement | Impala Access Requirement
DESCRIBE view | VIEW_METADATA on the underlying tables
ALTER TABLE RENAME, ALTER VIEW RENAME | ALL on the target table / view, and ALTER on the source table / view
SHOW DATABASES, SHOW TABLES | VIEW_METADATA

where:

• VIEW_METADATA privilege denotes the SELECT, INSERT, or REFRESH privileges.
• ALL privilege denotes the SELECT, INSERT, CREATE, ALTER, DROP, and REFRESH privileges.

For more information on the minimum level of privileges and the scope required to execute SQL statements in Impala, see Impala Authorization.

Migrating Sentry Policies

As CDP leverages Ranger as its authorization service, you must migrate permissions from Sentry to Ranger. You can use the BDR service available in CDH to migrate the permissions. This service migrates Sentry authorization policies into Ranger as a part of the replication policy job. When you create the replication policy, choose the resources that you want to migrate and the Sentry policies are migrated for those resources. You can migrate all permissions or permissions on a set of objects to Ranger.

The Sentry Permissions section of the Create Replication Policy wizard contains the following options:

• Include Sentry Permissions with Metadata - Select this to migrate Sentry permissions during the replication job.
• Exclude Sentry Permissions from Metadata - Select this if you do not want to migrate Sentry permissions during the replication job.

The Replication Option section of the Create Replication Policy wizard contains the following options:

• Include Metadata and Data
• Include Metadata Only

Stages of Migration

Sentry and Ranger have different permission models. Sentry permissions are granted to roles and users. These are translated to permissions for groups and users since Ranger currently does not support roles. This is followed by grouping by resource because Ranger policies are grouped by resource. All the permissions that are granted to a resource are considered a single Ranger policy.

The migration of Sentry policies into Ranger is performed in the following operations:

• Export - The export operation runs in the source cluster. During this operation, the Sentry permissions are fetched and exported to a JSON file. This file might be in a local file system, HDFS, or S3, based on the configuration that you provided.


• Translate and Ingest - These operations take place on the target cluster. In the translate operation, Sentry permissions are translated into a format that can be read by Ranger. The permissions are then imported into Ranger. When the permissions are imported, they are tagged with the source cluster name and the time that the ingest took place. After the import, the file containing the permissions is deleted.

Because there is no one-to-one mapping between Sentry privileges and Ranger service policies, the Sentry privileges are translated into their equivalents within Ranger service policies. For more information on how each Sentry action is applied to the corresponding action in Ranger, see Sentry to Ranger Permissions.

Note: Because the authorization model in Ranger is different from Sentry's model, not all policies can be migrated using BDR. For certain resources you must manually create the permissions after migrating the workload from CDH to CDP.

Related Information
Impala Authorization
Sentry to Ranger Permissions
Cloudera Runtime Security and Governance

Data Governance Support by Atlas
Both CDH and CDP environments support governance functionality for Impala operations. When migrating your workload from CDH to CDP, you must migrate your Navigator metadata to Atlas manually because there is no automatic migration of Navigator metadata from CDH to CDP.

The two environments collect similar information to describe Impala activities, including:

• Audits for Impala access requests
• Metadata describing Impala queries
• Metadata describing any new data assets created or updated by Impala operations

The services that support these operations are different in the two environments. Functionality is distributed across services as follows:

Feature | CDH | CDP
Auditing: access requests | Audit tab in Navigator console | Audit page in Ranger console
Auditing: service operations that create or update metadata catalog entries | Audit tab in Navigator console | Audit tab for each entity in Atlas dashboard
Auditing: service operations in general | Audit tab in Navigator console | No other audits collected
Metadata Catalog: Impala operations (CREATE TABLE AS SELECT, CREATE VIEW, ALTER VIEW AS SELECT, INSERT INTO, INSERT OVERWRITE) | Process and Process Execution entities; column- and table-level lineage | Process and Process Execution entities; column- and table-level lineage

Migrating Navigator content to Atlas

As part of migrating your workload from CDH to CDP, you use Atlas in place of Cloudera Navigator Data Management for your cluster in CDP. You must migrate your Navigator metadata to Atlas manually because there is no automatic migration of Navigator metadata from CDH to CDP. Atlas 'rebuilds' the metadata for existing cluster assets and lineage using new operations. However, Navigator Managed metadata tags and any metadata you manually entered in CDH must be manually ported to Atlas Business Metadata Tags. If you have applications that use the Navigator APIs, you must port them to use Atlas APIs.

Note: Navigator Audit information is not ported. To keep legacy audit information, you can keep a "read-only" Navigator instance until it is no longer needed. You may need to upgrade CM/Navigator on your legacy cluster to a newer version to avoid EOL.

Migrating content from Navigator to Atlas involves three steps:

• extracting the content from Navigator
• transforming that content into a form Atlas can consume
• importing the content into Atlas

Related Information
Cloudera Runtime Security and Governance

Impala configuration differences in CDH and CDP
There are some configuration differences related to Impala in CDH and CDP. These differences are due to the changes made in CDP for the optimal interoperability between Hive and Impala for improved user experience. Review the changes before you migrate your Impala workload from CDH to CDP.

Default Value Changes in Configuration Options

Configuration Option | Scope | Default in CDH 6.x | Default in CDP
DEFAULT_FILE_FORMAT | Query | TEXT | PARQUET
hms_event_polling_interval_s | Catalogd | 0 | 2
ENABLE_ORC_SCANNER | Query | TRUE | FALSE
use_local_catalog | Coordinator / Catalogd | false | true
catalog_topic_mode | Coordinator | full | minimal

New Configuration Options

Configuration Option | Scope | Default Value
default_transactional_type | Coordinator | insert_only
DEFAULT_TRANSACTIONAL_TYPE | Query | insert_only
disable_hdfs_num_rows_estimate | Impalad | false
disconnected_session_timeout | Coordinator | 900
PARQUET_OBJECT_STORE_SPLIT_SIZE | Query | 256 MB
SPOOL_QUERY_RESULTS | Query | FALSE
MAX_RESULT_SPOOLING_MEM | Query | 100 MB
MAX_SPILLED_RESULT_SPOOLING_MEM | Query | 1 GB
FETCH_ROWS_TIMEOUT_MS | Query | N/A
DISABLE_HBASE_NUM_ROWS_ESTIMATE | Query | FALSE
enable_insert_events | | TRUE

Default File Formats
To improve usability and functionality, Impala significantly changed table creation. In CDP, the default file format of the tables is Parquet.


New Default Behavior

When you issue the CREATE TABLE statement without the STORED AS clause, Impala creates a Parquet table instead of a Text table as in CDH.

For example, if you create an external table based on a text file without providing the STORED AS clause and then issue a select query, the query fails in CDP, because Impala expects the file to be in the Parquet file format.

Steps to switch to the CDH behavior:

1. Add the explicit STORED AS clause to the CREATE TABLE statements if the file format is not Parquet, as shown in the example following these steps.
2. Start the coordinator with the default_transactional_type flag set to text for all tables.
3. Set the DEFAULT_FILE_FORMAT query option to TEXT to revert to the default Text format for one or more CREATE TABLE statements.
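A minimal sketch of steps 1 and 3, using hypothetical table names:

-- Step 1: state the file format explicitly when it is not Parquet.
CREATE TABLE raw_events (line STRING) STORED AS TEXTFILE;

-- Step 3: revert the session default so subsequent CREATE TABLE statements use Text.
SET DEFAULT_FILE_FORMAT=TEXT;
CREATE TABLE more_raw_events (line STRING);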

For more information on transactions supported by Impala, see SQL transactions in Impala

Reconnect to HS2 Session
Clients can disconnect from Impala while keeping the HiveServer2 (HS2) session running and can also reconnect to the same session by presenting the session_token.

New Default Behavior

By default, disconnected sessions are terminated after 15 min.

• Clients will not notice a difference because of this behavioral change.
• If clients are disconnected without the driver explicitly closing the session (for example, because of a network fault), disconnected sessions and the queries associated with them may remain open and continue consuming resources until the disconnected session is timed out. Administrators may notice these disconnected sessions and/or the associated resource consumption.

Steps to switch to the CDH behavior:

You can adjust the --disconnected_session_timeout flag to a lower value so that disconnected sessions are cleaned up quickly.
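For example, a value of 300 seconds (an illustrative choice, not a recommendation) cleans up disconnected sessions after five minutes; set it as an impalad startup flag:

--disconnected_session_timeout=300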

Automatic Row Count Estimation
To optimize complex or multi-table queries, Impala has access to statistics about the volume of data and how the values are distributed. Impala uses this information to help parallelize and distribute the work for a query.

New Default Behavior

The Impala query planner can make use of statistics about entire tables and partitions. This information includes physical characteristics such as the number of rows, number of data files, the total size of the data files, and the file format. For partitioned tables, the numbers are calculated per partition, and as totals for the whole table. This metadata is stored in the Metastore database, and can be updated by either Impala or Hive.

If there are no statistics available on a table, Impala estimates the cardinality by estimating the number of rows in the table based on the size of the table. This behavior is switched on by default and should result in better plans for most cases when statistics are not available.

For some edge cases, it is possible that Impala generates a bad plan (when compared to the same query in CDH) when statistics are not present on that table, which could negatively affect query performance.

Steps to switch to the CDH behavior:

Set the DISABLE_HDFS_NUM_ROWS_ESTIMATE query option to TRUE to disable this optimization.


Using Reserved Words in SQL Queries
For ANSI SQL compliance, Impala rejects reserved words in SQL queries in CDP. A reserved word is one that cannot be used directly as an identifier. If you need to use it as an identifier, you must quote it with backticks.

About this task
New reserved words were added in CDH 6. To port SQL statements from CDH 5, which has a different set of reserved words, you must change queries that reference tables or databases whose names use reserved words in the SQL syntax.

Procedure

1. Find a table having the problematic table reference, such as a CREATE TABLE statement that uses a reserved word such as select.

2. Enclose the table name in backticks.

CREATE TABLE select (x INT);   -- fails
CREATE TABLE `select` (x INT); -- succeeds

For more information, see Impala identifiers and Impala reserved words.

Other Miscellaneous Changes in Impala
Review the changes to Impala syntax or service that might affect Impala after migrating your workload from CDH version 5.13-5.16 or CDH version 6.1 or later to CDP Private Cloud Base or CDP Public Cloud. The version of Impala you used in CDH version 5.11 - 5.16 or 6.1 or later changes to Impala 3.4 in CDP Private Cloud Base.

Decimal V2 Default

In CDP, Impala uses DECIMAL V2 by default.

To continue using the first version of the DECIMAL type for the backward compatibility of your queries, set the DECIMAL_V2 query option to FALSE:

SET DECIMAL_V2=FALSE;

Column Aliases Substitution

To conform to the SQL standard, Impala no longer performs alias substitution in the subexpressions of GROUP BY, HAVING, and ORDER BY.

The example below references the actual column sum(ss_quantity) in the ORDER BY clause instead of the alias Total_Quantity_Purchased, and also references the actual column ss_item_sk in the GROUP BY clause instead of the alias Item, because aliases are no longer supported in the subexpressions.

select ss_item_sk as Item,
  count(ss_item_sk) as Times_Purchased,
  sum(ss_quantity) as Total_Quantity_Purchased
from store_sales
group by ss_item_sk
order by sum(ss_quantity) desc
limit 5;
+-------+-----------------+--------------------------+
| item  | times_purchased | total_quantity_purchased |
+-------+-----------------+--------------------------+
| 9325  | 372             | 19072                    |
| 4279  | 357             | 18501                    |
| 7507  | 371             | 18475                    |
| 5953  | 369             | 18451                    |
| 16753 | 375             | 18446                    |
+-------+-----------------+--------------------------+

Default PARQUET_ARRAY_RESOLUTION

The PARQUET_ARRAY_RESOLUTION query option controls the behavior of the index-based resolution for nested arrays in Parquet. In Parquet, you can represent an array using a 2-level or 3-level representation. The default value for PARQUET_ARRAY_RESOLUTION is THREE_LEVEL, to match the Parquet standard 3-level encoding. See Parquet_Array_Resolution Query Option for more information.

Clustered Hint Default

The clustered hint is enabled by default, which adds a local sort by the partitioning columns in HDFS and Kudu tables to a query plan. The noclustered hint, which prevents clustering in tables having ordering columns, is ignored with a warning.

Query Options Removed

The following query options have been removed:

• DEFAULT_ORDER_BY_LIMIT

• ABORT_ON_DEFAULT_LIMIT_EXCEEDED

• V_CPU_CORES

• RESERVATION_REQUEST_TIMEOUT

• RM_INITIAL_MEM

• SCAN_NODE_CODEGEN_THRESHOLD

• MAX_IO_BUFFERS

• DISABLE_CACHED_READS

Shell Option refresh_after_connect

The refresh_after_connect option for starting the Impala Shell is removed.

EXTRACT and DATE_PART Functions

The EXTRACT and DATE_PART functions changed in the following way:

• The output type of the EXTRACT and DATE_PART functions changed to BIGINT.
• Extracting the millisecond part from a TIMESTAMP returns the seconds component and the milliseconds component. For example, EXTRACT(CAST('2006-05-12 18:27:28.123456789' AS TIMESTAMP), 'MILLISECOND') returns 28123.

Port for SHUTDOWN Command

If you upgraded from CDH 6.1 or later and specified a port as a part of the SHUTDOWN command, change the port number parameter to use the KRPC (Kudu RPC) port for communication between the Impala brokers.

Change in Client Connection Timeout

The default behavior of client connection timeout changes after upgrading.

In CDH 6.2 and lower, the client waited indefinitely to open the new session if the maximum number of threads specified by --fe_service_threads had been allocated.


After upgrading, the server requires a new startup flag, --accepted_client_cnxn_timeout, to control the treatment of new connection requests when the configured number of server threads is insufficient for the workload.

If --accepted_client_cnxn_timeout > 0, new connection requests are rejected after the specified timeout.

If --accepted_client_cnxn_timeout=0, the client waits indefinitely to connect to Impala. This setting restores the pre-upgrade behavior.

The default timeout is 5 minutes.

Interoperability between Hive and Impala

Impala supports a number of file formats used in Apache Hadoop. It can also load and query data files produced by other Hadoop components such as Hive. After upgrading from any CDH 5.x version to CDP Private Cloud Base 7.1, if you create an RC file in Hive using the default LazyBinaryColumnarSerDe, Impala cannot read the RC file. However, you can set the configuration option hive.default.rcfile.serde to ColumnarSerDe to maintain interoperability between Hive and Impala.

Improvements in Metadata

After upgrading from CDH to CDP, the on-demand use_local_catalog mode is set to True by default on all the Impala coordinators so that the Impala coordinators pull metadata from catalogd and cache it locally. This reduces memory footprint on coordinators and automates the cache eviction.

In CDP, the catalog_topic_mode is set to minimal by default to enable on-demand metadata for all coordinators.

Recompute the Statistics

After migrating the workload from any CDH 5.x version to CDP Private Cloud Base 7.1, recompute the statistics for Impala. Even though CDH 5.x statistics are available after the upgrade, the queries do not benefit from the new features until the statistics are recomputed.

Mitigating Excess Network Traffic

The catalog metadata can become large and lead to excessive network traffic due to dissemination through the statestore. The --compact_catalog_topic flag was introduced to mitigate this issue by compressing the catalog topic entries to reduce their serialized size. This saves network bandwidth at the cost of a small quantity of CPU time. This flag is enabled by default.

Related Information
Decimal Data Type
Impala Aliases
Impala Query Options

Factors to Consider for Capacity Planning
Choosing the right size of your cloud environment before migrating your workload from CDH to CDP Public Cloud is critical for preserving performance characteristics. There are several factors from your query workload to consider when choosing the CDP capacity for your environment:

• Query memory requirements
• CPU utilization
• Disk bandwidth
• Working set size
• Concurrent query execution


Before getting into sizing specifics, it is important to understand the core hardware differences between Public Cloud (PC) and on-prem hosts:

Resource | CDH Host Recommendation | AWS R5D.4xlarge Instance
CPU cores | 20-80 | 16
Memory | 128 GB min., 256 GB+ recommended | 128 GB
Network | 10 Gbps min., 40 Gbps recommended | Up to 10 Gbps
Ephemeral Storage | 12 x 2 TB drives (1000 MBps sequential) | 2 x 300 GB NVMe SSD (1100 MBps sequential)
Network Storage | N/A | gp2, 250 MB/s per volume

An R5D.4xlarge instance closely matches the CDH recommended CPU, memory, and bandwidth specs and so it is recommended as the instance type for CDP. However, AWS ephemeral storage cannot be used as primary database storage since it is transient and lacks sufficient capacity. This core difference requires a different strategy for achieving good scan performance.

CDP Sizing and Scaling

Before migration, scaling and concurrency must be planned. In a public cloud environment, the ability to acquire better scaling and concurrency elastically in response to workload demand enables the system to operate at a lower cost than the maximum limits you plan for. If you configure your target environment to accommodate your peak workload as a constant default configuration, you might have cost overruns when system demand falls below that level.

In CDP, the T-Shirt size defines the number of executor instances for an individual cluster and hence determines memory limits and performance of individual queries. Conversely, the warehouse sizing parameter and auto-scaling determine how many clusters are allocated to support concurrent query execution.

Sizes | Number of Executors
X-Small | 2
Small | 10
Medium | 20
Large | 40

The T-shirt size must be at least large enough to support the highest memory usage by a single query. In most cases, the size will not need to be larger, but it may provide better data caching if there is commonality in working sets between queries. Increasing the T-Shirt size can directly increase single-user capacity and also increase multi-user capacity. This is because the additional memory and resources from the larger cluster allow larger datasets to be processed and can also support concurrent query execution by sharing resources. Choosing a size that is too small can lead to poor data caching, spilling of intermediate results, or memory paging. Choosing a size that is too large can incur excessive PC run cost due to idle executors.

One caveat to consider when choosing a T-shirt size based on existing hardware is what other processes are running on the same host in your on-prem environment. In particular, HDFS or other locally hosted filesystems may be consuming significant resources. You may be able to choose a smaller size for CDP since these processes will be isolated in their own pod in the CDP environment. It may be helpful to look at CM per-process metrics to isolate impalad and Impala frontend java processes that CDP will put on executor instances and to aggregate these metrics across your cluster. The java processes can accumulate significant memory usage due to metadata caching.


Concurrency

The size of your target environment corresponds to the peak concurrency the system can handle. Concurrency is the number of queries that can be run at the same time.

Each executor group can run 12 queries concurrently, and occasional peaks can be handled transparently using the auto-scaling feature. When auto-scaling adds one more executor group, query concurrency doubles to 24. Scaling the warehouse by adding more clusters allows additional concurrent queries to run but will not improve single-user capacity or performance. This is because executors from the additional clusters are private to that cluster. Concurrently executed queries will be routed to the different clusters and execute independently. The number of clusters can be changed to match concurrent usage by changing the autoscaling parameters.

Caching Hot Dataset

Currently, CDH supports caching mechanisms on the compute nodes to cache the working set that is read from remote filesystems, such as a remote HDFS data node, S3, ABFS, or ADLS. This offsets the input/output performance difference.

In CDP Public Cloud, frequently accessed data is cached in a storage layer on SSD so that it can be quickly retrieved for subsequent queries, which boosts performance. Each executor can have up to 200 GB of cache, so a medium-sized warehouse can keep 200 * 20 = 4 TB of data in cache. For columnar formats (for example, ORC), data in cache is decompressed but not decoded. If the expected size of the hot dataset is 6 TB, which requires approximately 30 executors, you can choose to overprovision (choose a large) to ensure full cache coverage, or underprovision (choose a medium) to reduce cost at the expense of a lower cache hit rate.

Note: The Impala data cache uses an LIRS-based algorithm.

Planning Capacity Using WXM
You can generate a capacity plan for your target environment using WXM if you have it deployed in your environment. To build a custom cloud environment that meets your capacity requirements, you must analyze your existing CDH architecture, understand your business needs, and generate a capacity plan.

There might be differences in how you should size your Impala compute clusters (either in Datahub or in the CDW service) because the compute node sizes (CPU and RAM) are different from what you are currently using in CDH. If you are currently using a 20 node CDH cluster, it does not necessarily mean that you will need a 20 node Datahub cluster or a 20 node Impala virtual warehouse in CDW.

Using WXM Functionality to Generate a Capacity Plan

Benefits of using WXM

• You can explore your cluster and analyze your workload before migrating the data. You can also identify Impala workloads that are good candidates for cloud migration.

• You can optimize your workload before migrating it to CDP Public Cloud. This mitigates the risk of run-away costs in the cloud due to suboptimal workloads.

• You can generate a cloud-friendliness score for your workload to be migrated.
• You have an option to auto-generate capacity for your target environment.
• WXM, in conjunction with the Replication Manager, automates the replication plan.

Prerequisites to Use WXM

Before you set up the Cloudera Manager Telemetry Publisher service to send diagnostic data to Workload XM, you must ensure you have the correct versions of Cloudera Manager and CDH.

To use Workload XM with CDH clusters managed by Cloudera Manager, you must have the following versions:


• For CDH 5.x clusters:

• Cloudera Manager version 5.15.1 or later
• CDH version 5.8 or later

• For CDH 6.x clusters:

• Cloudera Manager version 6.1 or later
• CDH version 6.1 or later

Note: Workload XM is not available on Cloudera Manager 6.0, whether you are managing CDH 5.x or CDH 6.x clusters.

After you have verified that you have the correct versions of Cloudera Manager and CDH, you must configure data redaction and your firewall.

For information on configuring a firewall, see Configuring a Firewall for Workload XM.

For information on redacting data before sending it to WXM, see Redaction Capabilities for Diagnostic Data.

Steps to Auto-generate Capacity Plan

If you have a Cloudera workload manager deployed in your environment, follow these high-level steps to generate the capacity plan and migrate the Impala workload to the cloud.

1. On the Cloudera workload manager page, choose a cluster to analyze your data warehouse workloads. The Summary page for your workload view contains several graphs and tabs you can view to analyze. Using the Workloads View feature, you can analyze workloads with much finer granularity. For example, you can analyze how queries that access a particular database or that use a specific resource pool are performing against SLAs, or you can examine how all the queries sent to your cluster by a specific user are performing.

2. On the Data Warehouse Workloads View page, you can choose an auto-generated workload view by clicking Define New and choosing Select recommended views from the drop-down menu. Review the criteria that are used to create the workload views, and select the auto-generated workload view that aligns with your requirements.

3. You can custom build the workload to be migrated by clicking Define New and choosing Manually define view from the drop-down menu. You have the option to define a set of criteria that enables you to analyze a specific set of your workloads.

4. If you chose to custom build, once the custom-built workload is generated, you are returned to the Data Warehouse Workloads page, where your workload appears in the list. Use the search bar to search for your workload and click the workload to view the workload details.

5. The detail page for your workload view contains several graphs and tabs you can view to analyze. Review the workload and make sure that this is the workload you want to migrate to the cloud.

6. After you are satisfied with the workload you want to burst, click the Burst to Cloud option and select View Performance Rating Details.

7. Review the cloud performance rating details and decide whether to proceed with the migration to the cloud by clicking Start Burst to Cloud Wizard.

8. The Burst to Cloud wizard walks you through the steps to generate the capacity plan and to replicate the workload you selected to the destination on the cloud.

Performance Differences between CDH and CDP
Assess the performance changes this migration can bring. If you are planning to migrate the current Impala workload to Public Cloud, conduct a performance impact analysis to evaluate how this migration will affect you.

IO Performance Considerations

On-prem CDH hosts often have substantial IO hardware attached to support large scan operations on HDFS, potentially providing tens of GB per second of bandwidth with many SSD devices and dedicated interconnects. Due to the transient nature and cost structure of cloud instances, such a model is not practical for primary storage in CDP.

Like many AWS database offerings, HFS in CDP uses EBS volumes for persistence. EBS gp2 has a bandwidth limit of 250 MB/sec/volume. In addition, EBS may be throttled to zero throughput for extended durations if bandwidth exceeds thresholds. Because of these limitations, it is not practical in many cases to rely on direct IO to EBS for performance. EBS is also routed over shared network hardware and may have additional performance limitations due to redundancy.

To mitigate the PC IO bandwidth discrepancy, ephemeral storage is relied on heavily for caching working sets. While this is existing Impala behavior carried over from CDH, the penalty for going to primary storage is much higher, so more data must be cached locally to maintain equivalent performance. Since ephemeral storage is also used for spilling of intermediate results, it is important to avoid excess spilling, which could compete for bandwidth.

Migrating Kudu Data to CDP
Learn about how to migrate Kudu data from CDH to CDP.

About this task
When you migrate your Kudu data from CDH to CDP, you have to use the Kudu backup tool to back up and then restore your Kudu data.

Note: Data migration refers to moving existing CDH workloads to CDP Public Cloud or to a new installation of CDP Private Cloud Base.

Procedure

1. Back up all your data in Kudu using the kudu-backup-tools.jar Kudu backup tool.

2. Manually apply any custom Kudu configuration in your new cluster that you had in your old cluster.

3. Copy your backed up data to the target CDP cluster.

4. Restore your backed up Kudu data using the Kudu backup tool.

Backing up data in Kudu
You can back up all your data in Kudu using the kudu-backup-tools.jar Kudu backup tool.

The Kudu backup tool runs a Spark job that builds the backup data file and writes it to HDFS or AWS S3, based on what you specify. Note that if you are backing up to S3, you have to provide S3 credentials to spark-submit as described in Specifying Credentials to Access S3 from Spark.

The Kudu backup tool creates a full backup of your data on the first run. Subsequently, the tool creates incremental backups.

Important: Incremental backup and restore functionality is available only in CDH 6.3.0 and later. Therefore, if you have active ingest processes, such as Spark jobs, Impala SQL batches, or NiFi inserting or updating data in Kudu, you might need to pause these processes before starting the full backup to avoid losing data changes happening after starting the Kudu backup process.

Run the following command to start the backup process:

spark-submit --class org.apache.kudu.backup.KuduBackup <path to kudu-backup2_2.11-1.12.0.jar> \
  --kuduMasterAddresses <addresses of Kudu masters> \
  --rootPath <path to store the backed up data> \
  <table_name>

where


• --kuduMasterAddresses is used to specify the addresses of the Kudu masters as a comma-separated list. For example, master1-host,master-2-host,master-3-host, which are the actual hostnames of Kudu masters.

• --rootPath is used to specify the path to store the backed up data. It accepts any Spark-compatible path.

  • Example for HDFS: hdfs:///kudu-backups
  • Example for AWS S3: s3a://kudu-backup/

  If you are backing up to S3 and see the “Exception in thread "main" java.lang.IllegalArgumentException: path must be absolute” error, ensure that the S3 path ends with a forward slash (/).

• <table_name> can be a table or a list of tables to be backed up.

Example:

spark-submit --class org.apache.kudu.backup.KuduBackup /opt/cloudera/parcels/CDH-7.2.1-1.cdh7.2.1.p0.4041380/lib/kudu/kudu-backup2_2.11.jar \
  --kuduMasterAddresses cluster-1.cluster_name.root.hwx.site,cluster-2.cluster_name.root.hwx.site \
  --rootPath hdfs:///kudu-backups \
  my_table

Restoring Kudu data into the new cluster
Once you have backed up your data in Kudu, you can copy the data to the target CDP cluster and then restore it using the Kudu backup tool.

Before you begin

If you applied any custom Kudu configurations in your old clusters, then you manually have to apply those configurations in your target cluster.

If you have changed the value of tablet_history_max_age_sec and you plan to run incremental backups of Kudu on the target cluster, we recommend resetting tablet_history_max_age_sec to the default value of 1 week (see https://issues.apache.org/jira/browse/KUDU-2677).

Examples of commonly modified configuration flags:

• rpc_max_message_size

• tablet_transaction_memory

• rpc_service_queue_length

• raft_heartbeat_interval

• heartbeat_interval_ms

• memory_limit_hard_bytes

• block_cache_capacity_mb

Once you manually applied custom configurations, restart the Kudu cluster.

Procedure

1. Copy your backed up data to the target CDP cluster in one of the following ways:

• Using distcp:

sudo -u hdfs hadoop distcp hdfs:///kudu/kudu-backups/* hdfs://cluster-2.cluster_name.root.hwx.site/kudu/kudu-backups/

• Using Replication Manager. For more information, see HDFS Replication.


2. Run the following command to restore the backup on the target cluster:

spark-submit --class org.apache.kudu.backup.KuduRestore <path to kudu-backup2_2.11-1.12.0.jar> \
  --kuduMasterAddresses <addresses of Kudu masters> \
  --rootPath <path to the stored backed up data> \
  <table_name>

where

• --kuduMasterAddresses is used to specify the addresses of the Kudu masters as a comma-separated list. For example, master1-host,master-2-host,master-3-host, which are the actual hostnames of Kudu masters.

• --rootPath is used to specify the path at which you stored the backed up data. It accepts any Spark-compatible path.

  • Example for HDFS: hdfs:///kudu-backups
  • Example for AWS S3: s3a://kudu-backup/

  If you backed up to S3 and see the “Exception in thread "main" java.lang.IllegalArgumentException: path must be absolute” error, ensure that the S3 path ends with a forward slash (/).

• <table_name> can be a table or a list of tables to be backed up.• Optional: --tableSuffix, if set, adds suffices to the restored table names. It can only be used

when the createTables property is true.• Optional: --timestampMs is a UNIX timestamp in milliseconds that defined the latest time to use

when selecting restore candidates. Its default value is System.currentTimeMillis().

sudo -u hdfs spark-submit --class org.apache.kudu.backup.KuduRestore /opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.3758356/lib/kudu/kudu-backup2_2.11.jar \
--kuduMasterAddresses cluster-1.cluster_name.root.hwx.site \
--rootPath hdfs:///kudu/kudu-backups \
my_table
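If you need to restore tables under different names, a variation like the following can be used. This is only a sketch: it reuses the placeholder jar path and master address from the preceding example, assumes the optional --tableSuffix flag described above, and assumes that table creation (the createTables property) is left at its default of true.

sudo -u hdfs spark-submit --class org.apache.kudu.backup.KuduRestore <path to kudu-backup2_2.11-1.12.0.jar> \
--kuduMasterAddresses cluster-1.cluster_name.root.hwx.site \
--rootPath hdfs:///kudu/kudu-backups \
--tableSuffix _restored \
my_table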

3. Restart the Kudu service in Cloudera Manager.

Operational Database to CDP

You can migrate your Apache HBase workloads from CDH and HDP to CDP Data Center. To successfully migrate your Apache HBase workloads, you must first understand the data management differences between the two platforms and prepare your source data to be compatible with the destination CDP platform.

Migrating your workload means migrating your data to CDP and making your applications access the data in CDP.

• When migrating your data to the CDP Private Cloud Base deployment, you must use the Apache HBase replication and snapshot features, along with the HashTable/SyncTable tool.

Using the Apache HBase replication and snapshot features ensures that you do not face any data migration bottlenecks even when you have large amounts of data in your source cluster. The HashTable/SyncTable tool ensures that the data migrated to the destination cluster is synchronized with your source cluster, and lets you verify if your migration is successful.


Prepare for data migration
Before you start the data migration from CDH 5.x or HDP 2.x to CDP, you must understand the requirements and complete certain tasks on CDH/HDP and CDP to ensure a successful migration.

Procedure

• If you are migrating from CDH, configure Ranger ACLs in CDP corresponding to the HBase ACLs in your existing CDH cluster.

• If you are migrating from HDP, perform the following steps:

a) Configure Ranger ACLs in CDP corresponding to the HBase or Ranger ACLs in your existing HDP cluster.

For more information, see Configure a resource-based service: HBase.

b) Migrate your applications to use the new HBase-Spark connector because the Spark-HBase connector that you were using in CDH or HDP is no longer supported in CDP.

For more information, see Using the HBase-Spark connector.

• Review the deprecated APIs and incompatibilities when upgrading from HDP 2.x or CDH 5.x to CDP.

For more information, see Deprecation Notices in Apache HBase.

• Ensure that all data has been migrated to a supported encoding type before the upgrade.

For more information, see Remove PREFIX_TREE Data Block Encoding.

• Ensure that you upgrade any external co-processors manually because they are not automatically upgraded during your upgrade.

Before upgrading, ensure that your co-processor classes are compatible with CDP. For more information, see Check co-processor classes.

Migrate Data from CDH or HDP to CDP Private Cloud Base
Before you migrate your data, you must have an Apache HBase cluster created on CDP Data Center. Your CDH or HDP cluster is your source cluster, and your CDP Private Cloud Base cluster is your destination cluster.

Procedure

1. Deploy HBase replication on both the source and the destination cluster.

For instructions, see Deploy HBase replication.

2. Enable replication on both the source and destination clusters by running the following commands in the HBase Shell.

On the source cluster

create 't1',{NAME=>'f1', REPLICATION_SCOPE=>1}

On the destination cluster

create 't1',{NAME=>'f1', KEEP_DELETED_CELLS=>'true'}

Note: Cloudera recommends enabling KEEP_DELETED_CELLS on column families in the destination cluster, where REPLICATION_SCOPE=1 in the source cluster.

3. Run the add_peer command in the HBase Shell on the source cluster to add the destination cluster as a peer.

add_peer 'ID', 'DESTINATION_CLUSTER_KEY'


You can get the DESTINATION_CLUSTER_KEY value from the HBase Master user interface that you can access using Cloudera Manager.
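For illustration only, the cluster key is typically composed of the destination cluster's ZooKeeper quorum, ZooKeeper client port, and HBase znode parent; the host names below are placeholders.

add_peer 'ID1', 'zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase'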

4. Run the disable_peer ("<peerID>") command in the HBase Shell on the source cluster to disable the peer in the source cluster.

disable_peer("ID1")

This stops replication with the peer, but the logs are retained for future reference.

5. Take a snapshot in Cloudera Manager.

a) Select the HBase service.
b) Click the Table Browser tab.
c) Click a table.
d) Click Take Snapshot.
e) Specify the name of the snapshot, and click Take Snapshot.

6. Run the ExportSnapshot command in the HBase Shell on the source cluster to export a snapshot from the source to the destination cluster. You must run the ExportSnapshot command as the hbase user or the user that owns the files.

The ExportSnapshot tool executes a MapReduce job similar to distcp to copy files to the other cluster. ExportSnapshot works at the file-system level, so the HBase cluster can be offline.

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot <snapshot name> -copy-to hdfs://destination:hdfs_port/hbase -mappers 16

Here, destination (hdfs://destination:hdfs_port/hbase) is the destination CDP Private Cloud Base cluster. Replace the HDFS server path and port with the ones you have used for your cluster.

Important: Snapshots must be enabled on the source and destination clusters. When you export a snapshot, the table's HFiles, logs, and the snapshot metadata are copied from the source cluster to the destination cluster.

7. Run the enable_peer command in the HBase Shell on the source cluster to enable the peer in the source and destination clusters.

enable_peer("ID1")

8. Run the HashTable command on the source cluster and the SyncTable command on the destination cluster to synchronize the table data between your source and destination clusters.

On the source cluster

HashTable [options] <tablename> <outputpath>

On the destination cluster

SyncTable [options] <sourcehashdir> <sourcetable> <targettable>

For more information and examples about using HashTable and SyncTable, see Use HashTable and SyncTable tool.
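For illustration only (table names, paths, and the ZooKeeper quorum are placeholders), a typical sequence could look like the following:

On the source cluster

hbase org.apache.hadoop.hbase.mapreduce.HashTable testTableA hdfs://nn:8020/hashes/testTableA

On the destination cluster

hbase org.apache.hadoop.hbase.mapreduce.SyncTable --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:8020/hashes/testTableA testTableA testTableB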


Verify and validate if your Data is Migrated
You can use the SyncTable command with the --dryrun parameter to verify if the tables are in sync between your source and your destination clusters. The --dryrun option makes this run of the SyncTable command read-only.

About this task

Use this command for a dry run SyncTable of tableA from a remote source cluster to a target tableB on thedestination Data Hub cluster.

Procedure

• Run this command in the HBase Shell of the destination Data Hub cluster

hbase org.apache.hadoop.hbase.mapreduce.SyncTable --dryrun=true --sourcezkcluster=zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase hdfs://nn:8020/hashes/testTable testTableA testTableB

Machine Learning and Data Engineering to CDP

How to migrate machine learning and data engineering workloads from CDH and HDP to CDP.

Cloudera Data Science Workbench to CDP
How to migrate Cloudera Data Science Workbench (CDSW) data from CDH to CDP.

CDSW with CDH or HDP to CDSW with CDP Private Cloud Base 7.x

1. Upgrade to the latest CDSW 1.7.x version.
2. Follow the documented migration steps to move the CDSW artifacts.

Cloudera Data Science Workbench is supported on both CDH and CDP, so you can run your CDSW workloads on CDP without any additional data migration steps.

CDSW to CML on CDP Public Cloud – Option 1

Migrate individual projects:

1. Enable a new CML workspace.
2. Create new projects and use code migration via Git.
3. The standard engine images should allow you to use your code as-is. If you created custom engine images, you must rebuild them.
4. In order for data to be accessed through CML, it must be migrated to CDP Public Cloud using the Replication Manager for access through CML. Be sure to update data access in your code to use the new locations.
5. Use jobs and models as needed by recreating jobs and deploying models in the new cluster.

CDSW to CML on CDP Public Cloud – Option 2

Administrator-level cluster migration:

1. Upgrade to the latest CDSW 1.7.x version.2. Create a CDSW backup.


3. Create a new CML Workspace. Do not log in or create CML projects and sessions until the migration is complete.

4. Import data from the backup for the DB, Project files, Livelog, S2I Registry, and the Git server.
5. In order for data to be accessed through CML, it must be migrated to CDP Public Cloud using the Replication Manager for access through CML. Be sure to update data access in your code to use the new locations.

Zeppelin to CDP
How to migrate Apache Zeppelin data from HDP to CDP.

Cloudera Data Science Workbench

Cloudera Data Science Workbench does not currently support Zeppelin as an editor, but CDSW support for Apache Zeppelin is planned for an upcoming release. At that point you will be able to manually copy Zeppelin notebooks from HDP to CDP.

Export and Import Zeppelin Notes from HDP to CDP

You can export individual notes from HDP and then import them into CDP Zeppelin running on a DataHub Data Engineering cluster in CDP Public Cloud, or to Zeppelin running on a CDP Private Cloud Base deployment.

Spark to CDP
How to migrate Apache Spark data to CDP.

CDP Spark Versions

• CDP Private Cloud Base ships with Spark 2.4 on YARN.
• CDP Public Cloud has two options:

• The DataHub Data Engineering template includes Spark 2.4 on YARN. An experimental Spark 3 template is also provided.

• CML provides Spark 2.4 on Kubernetes.

Spark Migration

• Primarily impacts CDH 5.x / 6.x or HDP 2.x due to Hive 3 changes -- ACID tables.

• You can use HDFS and Hive replication to move Spark data from CDH and HDP to CDP.

• Existing external tables are not impacted -- existing applications can continue using the Spark API with external tables without code changes.

• Managed tables that are migrated are Hive 3 ACID, which breaks Spark compatibility.
• You can use the Hive 3 metadata and table upgrade utilities to understand which HMS tables would be ACID.

Access to Hive 3 managed (ACID) tables requires the Hive Warehouse Connector (HWC).

• Supported applications: Spark Shell, PySpark, spark-submit.
• Include required HWC libraries and configurations.
• Read/Write operations use the HWC APIs (a read sketch follows this list).
• HiveServer2 and Apache Ranger provide fine-grained access control (FGAC). In CDP DC 7.0, read operations using HWC are limited; in CDP DC 7.1, read operations are improved for high volume.
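The following is a minimal sketch (not taken from this guide) of a Spark application reading a Hive 3 managed (ACID) table through HWC. It assumes the HWC jar and configurations mentioned above are supplied to the application (for example, via spark-submit), and the table name sales.transactions is a placeholder.

import com.hortonworks.hwc.HiveWarehouseSession;
import org.apache.spark.sql.SparkSession;

public class HwcReadExample {
    public static void main(String[] args) {
        // Build (or obtain) the SparkSession; HWC settings are normally passed as --conf options.
        SparkSession spark = SparkSession.builder().appName("HWC read example").getOrCreate();

        // Open an HWC session. Reads of managed (ACID) tables go through HiveServer2,
        // so Ranger fine-grained access control is enforced.
        HiveWarehouseSession hive = HiveWarehouseSession.session(spark).build();

        // "sales.transactions" is a placeholder table name.
        hive.executeQuery("SELECT * FROM sales.transactions LIMIT 10").show();
    }
}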


Livy to CDP
How to migrate Apache Livy data to CDP.

CDP Livy Versions

• Livy has been upgraded to version 0.7 in CDP Private Cloud Base 7.1.x.
• CDP Public Cloud includes a DataHub Data Engineering template that contains Livy and Spark.

Features

• JDBC/Thrift Server support has been added to Livy. This is an improvement over Spark Thrift Server, which is neither secure nor fault-tolerant.

• The Thrift server is disabled by default in Livy, but you can use Cloudera Manager to enable the Thrift Server.

Usage

• Use the HWC .jar and configuration when accessing managed tables.
• Writes to managed tables via the Livy Thrift Server are not currently supported.

Streaming to CDP

How to migrate Streaming workloads from HDF to CDP.

Migrating Streaming workloads from HDF to CDP Private Cloud Base
Learn how you can migrate streaming workloads from HDF to CDP Private Cloud Base.

In this scenario data is migrated from an HDF 3.4 cluster with Streams Messaging Manager (SMM), Kafka, and Schema Registry to a CDP Private Cloud Base 7.1 cluster with SMM, Kafka, Schema Registry, and Streams Replication Manager (SRM).

Multiple methods are provided for migrating SMM and Schema Registry. Kafka is migrated with SRM set up on the target CDP Private Cloud Base 7.1 cluster. Two distinct configuration methods are provided for the setup and configuration of SRM.


Complete the following steps in order to migrate your streaming workloads:

Set Up a New Streaming Cluster in CDP Private Cloud Base
How to set up a new Streaming cluster in CDP Private Cloud Base when migrating data from HDF.

In order to migrate your HDF workloads you need to set up a new Streaming cluster in CDP Private Cloud Base. See the CDP Private Cloud Base Installation Guide as well as the Streaming Documentation for Runtime for installation, setup, and configuration instructions.

Related Information
CDP Private Cloud Base Installation Guide
Streaming Documentation for Runtime

Migrate Ranger Policies
How to migrate Ranger policies for Streaming clusters from HDF to CDP Private Cloud Base.

Use the Ranger Console (Ranger UI) or the Ranger REST API to import or export policies. For more information, see the Ranger documentation for Runtime or the Ranger documentation for HDP.

Related Information
Ranger Documentation for Runtime
Ranger Documentation for HDP

Migrate Schema Registry
Overview of the methods that can be used to migrate Schema Registry from HDF to CDP Private Cloud Base.

There are two methods that you can use to migrate Schema Registry. You can either copy raw data or reuse existing storage. Review and choose one of the following methods:

Note: Cloudera recommends that you copy raw data.

Copy Raw Data
How to migrate Schema Registry from HDF to CDP Private Cloud Base by copying raw data.


Procedure

1. Stop existing Schema Registry clients.

2. Stop the HDF Schema Registry server.

3. Backup/restore the Schema Registry database from the old database to the new database (see the example after the following list):

• MySQL - See the MySQL Backup and Recovery documentation.
• PostgreSQL - See Chapter 24. Backup and Restore in the PostgreSQL documentation.
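For example, with MySQL the backup and restore could look like the following. The database name, user, and hosts are placeholders and must match your own deployment.

# On the old database host: dump the Schema Registry database
mysqldump -u registry_user -p registry > registry_backup.sql

# On the new database host: load the dump into the (pre-created) database
mysql -u registry_user -p registry < registry_backup.sql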

4. Copy all serdes from the HDF Schema Registry serdes jar location (local/HDFS) to the CDP Schema Registry serdes jar location (local/HDFS).

5. Configure CDP Schema Registry to connect to the new database:

a) In Cloudera Manager, select the Schema Registry service.
b) Go to Configuration.
c) Find and configure the following database related properties:

• Schema Registry Database Type
• Schema Registry Database Name
• Schema Registry Database Host
• Schema Registry Database Port
• Schema Registry Database User
• Schema Registry Database User Password

d) Click Save Changes.

6. Start the CDP Schema Registry Server.

7. Reconfigure Schema Registry clients to point to the CDP Schema Registry Server.

8. Restart Schema Registry clients.

Results
Schema Registry is migrated. The HDF Schema Registry is no longer required.

What to do next
Migrate Streams Messaging Manager.
Related Information
MySQL Backup and Recovery
PostgreSQL Backup and Restore
Migrate Streams Messaging Manager

Reuse Existing Storage
How to migrate Schema Registry from HDF to CDP Private Cloud Base by reusing existing storage.

Before you begin

Make sure that the existing database is compatible with and supported by CDP Private Cloud Base. For more information, see Database Requirements in the CDP Release Guide.

Procedure

1. Stop existing Schema Registry clients.

2. Stop the HDF Schema Registry Server.

3. Configure CDP Schema Registry to connect to the database previously owned by HDF Schema Registry:

a) In Cloudera Manager, select the Schema Registry service.
b) Go to Configuration.
c) Find and configure the following database related properties:


• Schema Registry Database Type
• Schema Registry Database Name
• Schema Registry Database Host
• Schema Registry Database Port
• Schema Registry Database User
• Schema Registry Database User Password

4. Configure the CDP Schema Registry serdes jar location to point to the location used by the old HDF Schema Registry:

a) In Cloudera Manager, select the Schema Registry service.
b) Go to Configuration.
c) Find and configure the following properties:

• Schema Registry Jar Storage Type
• Schema Registry Jar Storage Directory Path
• Schema Registry Jar Storage HDFS URL

d) Click Save Changes.

5. Start the CDP Schema Registry Server.

6. Reconfigure Schema Registry clients to point to the CDP Schema Registry Server.

7. Restart Schema Registry clients.

Results

Schema Registry is migrated. The HDF Schema Registry is no longer required.

What to do next
Migrate Streams Messaging Manager.
Related Information
CDP Database Requirements
Migrate Streams Messaging Manager

Migrate Streams Messaging Manager
Overview of the methods that can be used to migrate Streams Messaging Manager from HDF to CDP Private Cloud Base.

Streams Messaging Manager (SMM) migration involves the migration of alert policies. The SMM UI is stateless and no data migration is required.

Warning: SMM in HDF stores metrics in Ambari Metric Server (AMS). This data cannot be migrated. Therefore, historic data is lost during the migration.

There are two methods you can use to migrate SMM alert policies. You can either copy raw data or reuse existing storage.

In addition to these two migration methods, you can also choose to manually recreate alert policies in the new environment. Steps are provided for both storage reuse and data copy methods. For more information on manually recreating alert policies, see Managing Alert Policies and Notifiers in the SMM documentation.

Related Information
Managing Alert Policies and Notifiers

Copy Raw Data
How to migrate Streams Messaging Manager alert policies from HDF to CDP Private Cloud Base by copying raw data.


Procedure

1. Stop the HDF Streams Messaging Manager (SMM).

2. Backup/restore the SMM database from the old database to the new database (see the example after the following list):

• MySQL - See the MySQL Backup and Recovery documentation.
• PostgreSQL - See Chapter 24. Backup and Restore in the PostgreSQL documentation.
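For example, with PostgreSQL the backup and restore could look like the following. The database name, user, and hosts are placeholders and must match your own deployment.

# On the old database host: dump the SMM database
pg_dump -U smm_user -h old-db-host streamsmsgmgr > smm_backup.sql

# On the new database host: load the dump into the (pre-created) database
psql -U smm_user -h new-db-host streamsmsgmgr < smm_backup.sql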

3. Configure CDP SMM to connect to the new database:

a) In Cloudera Manager, select the SMM service.
b) Go to Configuration.
c) Find and configure the following database related properties:

• Streams Messaging Manager Database Type
• Streams Messaging Manager Database Name
• Streams Messaging Manager Database Host
• Streams Messaging Manager Database Port
• Streams Messaging Manager Database User
• Streams Messaging Manager Database User Password

d) Click Save Changes.
e) Start the service.

Results

SMM alert policies are migrated.

What to do next
Migrate Kafka using Streams Replication Manager.
Related Information
MySQL Backup and Recovery
PostgreSQL Backup and Restore
Migrate Kafka Using Streams Replication Manager

Reuse Existing Database
How to migrate Streams Messaging Manager alert policies from HDF to CDP Private Cloud Base by reusing existing storage.

Before you begin

Make sure that the existing database is compatible with and supported by CDP Private Cloud Base. For more information, see Database Requirements in the CDP Release Guide.

Procedure

1. Stop the HDF Streams Messaging Manager (SMM).

2. Configure CDP SMM to connect to the database previously owned by HDF SMM:

a) In Cloudera Manager, select the SMM service.
b) Go to Configuration.
c) Find and configure the following database related properties:

• Streams Messaging Manager Database Type
• Streams Messaging Manager Database Name
• Streams Messaging Manager Database Host
• Streams Messaging Manager Database Port
• Streams Messaging Manager Database User
• Streams Messaging Manager Database User Password


d) Click Save Changes.
e) Start the service.

Results

SMM alert policies are migrated.

What to do next
Migrate Kafka using Streams Replication Manager.
Related Information
CDP Database Requirements
Migrate Kafka Using Streams Replication Manager

Migrate Kafka Using Streams Replication Manager
Overview of the methods that can be used to migrate Kafka from HDF to CDP Private Cloud Base using Streams Replication Manager.

There are two approaches to using Streams Replication Manager (SRM) for data migration. You can either use SRM out of the box with its default replication policy, or configure it to use a custom one.

SRM replicates source topics to remote topics (replicated topics) on the target cluster. Because of how SRM's default replication policy works, remote topics on the target cluster will have different names than the topics on the source cluster. The name references the source cluster that the topic was replicated from in the form of a prefix. For example, the topic1 topic from the us-west source cluster creates the us-west.topic1 remote topic on the target cluster. As a result, the replicated environment differs from the source environment.

Kafka does not support renaming topics. Therefore, reconfiguring existing Kafka clients to use the remote topic names is required once all topics are replicated.

However, in many cases this approach may not be suitable, therefore, an alternative method is also provided. This involves configuring SRM to use a custom replication policy, which does not change the remote topic names.

Review the following notes and decide which approach is better suited for your environment.

Migration with the default replication policy

• SRM can be used out of the box.
• Remote topics will have a different name in the target cluster.
• Reconfiguration of all Kafka clients is required.

Migration with the custom replication policy

• Advanced configuration of SRM to use an alternate replication policy is required.
• Source and remote topics will have identical names.
• Reconfiguration of all Kafka clients is not required.
• Using an SRM service configured with the custom replication policy for any other scenario than data migration is not supported. Once migration is complete, the SRM instance you set up has to be reverted to use the default replication policy or deleted from the cluster.

Migrate Kafka Using the Default Replication Policy
How to migrate Kafka with Streams Replication Manager using the default replication policy.

About this task

You can use Streams Replication Manager (SRM) out of the box with its default replication policy to migrate Kafka data from an HDF cluster to a CDP Private Cloud Base cluster. With this method, the remote topics in the target cluster will have different names. As a result, the replicated environment will differ from the source environment. You will need to make significant changes to all Kafka producers and consumers, otherwise they will not be able to connect to the correct topics in the CDP Private Cloud Base cluster.

Before you begin

Setup and Configure Streams Replication Manager in the CDP Private Cloud Base cluster. For more information, see Add and Configure SRM and the SRM Configuration Examples in the SRM documentation for Runtime.

Procedure

1. Use the srm-control tool to whitelist every topic and every consumer group.

Whitelisting consumer groups is required for offset translation.

srm-control topics --source [SOURCE_CLUSTER] --target [TARGET_CLUSTER] --add ".*"

srm-control groups --source [SOURCE_CLUSTER] --target [TARGET_CLUSTER] --add ".*"

2. Validate that data is being migrated.

Use the Cluster Replications page on the Streams Messaging Manager (SMM) UI to monitor and validate the status of the migration.

3. Stop producers.

4. Stop consumers.

5. Reconfigure all consumers to read from CDP Private Cloud Base Kafka and apply offset translation using SRM.
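As an illustrative sketch only (this uses the upstream Kafka MirrorMaker 2 client API rather than a procedure from this guide; the broker address, cluster alias, and group id are placeholders), translated offsets for a consumer group can be looked up with RemoteClusterUtils and then applied with the consumer's seek() calls:

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class OffsetTranslationExample {
    public static void main(String[] args) throws Exception {
        // Connection properties for the target (CDP Private Cloud Base) Kafka cluster.
        Map<String, Object> props = new HashMap<>();
        props.put("bootstrap.servers", "cdp-broker-1:9092");

        // "HDF" is the source cluster alias used in the SRM replication, and
        // "my-group" is a placeholder consumer group id.
        Map<TopicPartition, OffsetAndMetadata> translated =
                RemoteClusterUtils.translateOffsets(props, "HDF", "my-group", Duration.ofSeconds(30));

        // A consumer on the target cluster can seek() to these offsets before resuming.
        translated.forEach((tp, offset) ->
                System.out.println(tp + " -> " + offset.offset()));
    }
}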

6. Start consumers.

7. Reconfigure all producers to write to CDP Private Cloud Base Kafka.

The HDF instances of Kafka and SMM are no longer required.

8. Start producers.

Results

Kafka is migrated. Kafka clients produce and consume from the CDP Private Cloud Base cluster. Migration is complete.

Related Information
SRM Configuration Examples
Add and Configure SRM

Migrate Kafka Using a Custom Replication Policy
How to migrate Kafka with Streams Replication Manager using a custom replication policy.

About this task
The default replication policy used by Streams Replication Manager (SRM) renames remote topics on target clusters. It adds the name of the source cluster as a prefix to the topic names. If this behaviour is not viable, you can configure SRM to use a custom replication policy that retains the original names of the topics.

Review the following notes about the custom replication policy:

• While this policy is in effect, SRM is unable to differentiate between local topics and replicated topics.

This is because replicated topics do not have the source cluster prefix.


• When this policy is used, the Streams Messaging Manager (SMM) UI incorrectly displays local topics as replicated in the Cluster Replications tab.

This is because replicated topics do not have the source cluster prefix.
• This replication policy is only supported with a unidirectional data replication setup where replication happens from a single source cluster to a single target cluster.

Configuring additional hops or bi-directional replication is not supported and can lead to severe replication issues.

• Using this replication policy in any other scenario than data migration is not supported.

Once migration is complete, you need to reconfigure SRM to use the default replication policy or delete the service from the cluster.

The following steps describe how you can configure SRM to use the custom replication policy and how data can be migrated once configuration is complete.

Before you begin
Setup and Configure SRM in the CDP Private Cloud Base cluster for unidirectional replication. You can configure unidirectional replication by adding and enabling a single replication in the Streams Replication Manager's Replication Configs property. For example:

HDF->CDP.enabled=true

For more information on setup and configuration, see Add and Configure SRM in the SRM documentation for Runtime.
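As a sketch only: SRM is based on MirrorMaker 2, so a unidirectional HDF to CDP setup conceptually boils down to configuration entries like the following. The aliases and broker addresses are placeholders, and in Cloudera Manager these values are typically spread across the cluster alias, bootstrap servers, and Replication Configs properties rather than entered as one block.

clusters=HDF,CDP
HDF.bootstrap.servers=hdf-broker-1:9092,hdf-broker-2:9092
CDP.bootstrap.servers=cdp-broker-1:9092,cdp-broker-2:9092
HDF->CDP.enabled=true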

Procedure

1. Implement, compile, and package (JAR) the following custom replication policy that overrides SRM's default behavior.

package com.cloudera.dim.mirror;

import java.util.Map;

import org.apache.kafka.common.Configurable;
import org.apache.kafka.connect.mirror.ReplicationPolicy;
import org.apache.kafka.connect.mirror.MirrorConnectorConfig;

public class MigratingReplicationPolicy implements ReplicationPolicy, Configurable {

    private String sourceClusterAlias;

    @Override
    public void configure(Map<String, ?> props) {
        // The source cluster alias cannot be determined just by looking at the prefix of the remote topic name.
        // We extract this info from the configuration.
        sourceClusterAlias = (String) props.get(MirrorConnectorConfig.SOURCE_CLUSTER_ALIAS);
    }

    @Override
    public String formatRemoteTopic(String sourceClusterAlias, String topic) {
        // We do not apply any prefix.
        return topic;
    }

    @Override
    public String topicSource(String topic) {
        // return from config
        return topic == null ? null : sourceClusterAlias;
    }

    @Override
    public String upstreamTopic(String topic) {
        return null;
    }
}

2. Modify the classpath of the SRM driver to include the compiled artifact when the SRM driver is started:

Important: Complete this step on all hosts that SRM is deployed on.

a) Find the srm-driver script located at /opt/cloudera/parcels/CDH/lib/streams_replication_manager/bin/srm-driver.

b) Modify the -cp flag in the srm-driver script to include the additional .jar. For example:

exec $JAVA $SRM_HEAP_OPTS $SRM_JVM_PERF_OPTS $SRM_KERBEROS_OPTS $GC_LOG_OPTS $SRM_JMX_OPTS -DdefaultConfig=$SRM_CONFIG_DIR/srm.properties -DdefaultYaml=$SRM_CONFIG_DIR/srm-service.yaml -cp [PATH_TO_CUSTOM_POLICY_JAR]:$SRM_LIB_DIR/srm-driver-1.0.0.7.1.1.0-567.jar:$SRM_LIB_DIR/srm-common-1.0.0.7.1.1.0-567.jar:...

3. Configure the SRM service to use the custom replication policy:

a) In Cloudera Manager, select the Streams Replication Manager service.
b) Go to Configuration.
c) Find the Streams Replication Manager's Replication Configs property and add the following:

• replication.policy.class=com.cloudera.dim.mirror.MigratingReplicationPolicy

• source.cluster.alias=[SOURCE_CLUSTER_ALIAS]

Replace [SOURCE_CLUSTER_ALIAS] with the alias of the source cluster where topics are being migrated from.

Setting the replication.policy.class property configures SRM to use the custom replication policy instead of the default one.

Setting source.cluster.alias is required in order for the SRM service to correctly identify the source cluster from the prefixless topic names on the target cluster.

d) Click Save Changes.
e) Restart the service.

4. Use the srm-control tool to whitelist every topic and every consumer group.

Whitelisting consumer groups is required for offset translation.

srm-control topics --source [SOURCE_CLUSTER] --target [TARGET_CLUSTER] --add ".*"

srm-control groups --source [SOURCE_CLUSTER] --target [TARGET_CLUSTER] --add ".*"

5. Validate that data is being migrated.

Use the Cluster Replications page on the SMM UI to monitor and validate the status of the migration.

6. Stop producers.

7. Stop consumers.

8. Reconfigure all consumers to read from CDP Private Cloud Base Kafka and apply offset translation using SRM.


9. Start consumers.

10. Reconfigure all producers to write to CDP Private Cloud Base Kafka.

The HDF instances of Kafka and SMM are no longer required.

11. Start producers.

Results

Kafka is migrated. Kafka clients produce and consume from the CDP Private Cloud Base cluster. Migration is complete.

Related Information
Add and Configure SRM

Data Flow to CDP

How to migrate data flow workloads to CDP.

Data Flow data migration information is available in the Cloudera Flow Management Migration guide.

Security and Governance to CDP

How to migrate security and governance data from CDH and HDP to CDP.

Migrating Security and Governance Data from CDH to CDP
How to migrate security and governance data from CDH to CDP.

Sentry to Ranger

• Hive/Impala replication in the Replication Manager can be used to convert and migrate Sentry policies to Ranger (for CDP Public Cloud).

• Kafka and Solr permissions must be manually converted to Ranger policies.
• HDFS ACLs that are automatically set up by Sentry must be manually converted to Ranger policies.

Key Trustee Server, Key Trustee KMS, Key HSM, HSM KMS

Use BDR/Replication Manager to migrate encrypted data to CDP Public Cloud or CDP Data Center.

NavEncrypt

• Migrate data from encrypted volumes to Cloud-native encrypted storage (for CDP Public Cloud) or to another NavEncrypt encrypted volume (in CDP Data Center).

• Data re-encryption will take place during the migration.

Navigator to Atlas Migration

• CDP has Atlas wired up to all workloads. Ported workloads will recreate lineage.
• Navigator-managed metadata tags and any manually entered data must be manually ported to Atlas Business Metadata Tags.
• Any applications using the Navigator SDK must be ported to use Atlas APIs.


• Navigator Audit information is not ported. To retain legacy audit information you can maintain a read-only Navigator instance until it is no longer needed. You may need to upgrade Cloudera Manager or Navigator on the legacy cluster to a newer version to avoid end-of-life issues.

Related Information
Use Replication Manager to migrate to CDP Public Cloud
Use Replication Manager to migrate to CDP Private Cloud Base

Migrating Security and Governance Data from HDP to CDP
How to migrate security and governance data from HDP to CDP.

Ranger Policy Migration

• The Ranger Policy Import/Export feature can be used to migrate existing Ranger resource-based and tag-based policies to CDP Public Cloud or CDP Private Cloud Base.

• You can use the Ranger UI or the REST API to export and import policies (see the example after this list).
• Supported formats: JSON, Excel, CSV
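As an illustration only, an export/import round trip via the Ranger REST API could look roughly like the following. The host names and credentials are placeholders, the endpoint names are the Apache Ranger export/import endpoints and should be verified against your Ranger version, and the import call may require additional parameters such as a services mapping; see the Ranger documentation referenced in the Related Information below.

# Export resource-based policies as JSON from the source Ranger
curl -u admin:password -o ranger_policies.json \
  "https://source-ranger-host:6182/service/plugins/policies/exportJson"

# Import the exported policies into the target Ranger
curl -u admin:password -X POST \
  -F 'file=@ranger_policies.json' \
  "https://target-ranger-host:6182/service/plugins/policies/importPoliciesFromFile?isOverride=true"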

Ranger KMS

• Use DistCp to copy data into Cloud-native encrypted storage (for CDP Public Cloud) or into another HDFS encryption zone (in CDP Private Cloud Base).

• Data re-encryption will take place during the copy.

Atlas Data Migration

• CDP has Atlas wired up to all workloads. Ported workloads will recreate lineage.
• Use Atlas Export/Import tools (targeted migration) to copy legacy Atlas data to a new deployment.
• Use the Atlas Migration utility tools (migrates all data) to manually migrate legacy Atlas data to a new deployment.

Related Information
Importing and exporting resource-based policies
Importing and exporting tag-based policies

Platform Components to CDP

How to migrate platform components from CDH and HDP to CDP.

Cloudera Manager
How to migrate from Ambari and Cloudera Manager 5/6 to CDP Cloudera Manager 7.

Ambari to Cloudera Manager

There is no data migration mechanism for moving from Ambari to Cloudera Manager (CM) – Ambari users must learn how to use Cloudera Manager for cluster management. However, there are a few points Ambari users should be aware of when moving to Cloudera Manager.

• Cloudera Manager enables you to consider new approaches to cluster management:

• Consider creating and managing multiple clusters with one instance of Cloudera Manager.
• Consider setting up separate compute and storage clusters.


• Other aspects of cluster lifecycle automation are different than Ambari:

• Install and upgrade of Cloudera Runtime is via parcels and the CM API, not RPMs.
• Other actions (applying standard configurations, backup) use the CM API.

• Update any scripting and alert monitoring to use the CM API (REST, Java, Python).
• Grafana is not natively supported for monitoring dashboards. Use health monitoring APIs and ts_query for metrics.
• What doesn't change: CM install and upgrade is via RPMs, alerts are via EMAIL/SNMP, etc.

CM 5/6 to CM 7

• If you are using the CM 5 API, update scripts and applications to use the new Swagger-based API client.

• Migrate CM configuration settings as appropriate based on usage:

• LDAP and Kerberos settings
• CM6+ custom user roles and assignments
• Alerts configuration
• 3rd party parcels

• For 3rd party custom services (CSDs), try to obtain a CM 7 compatible version from the vendor.

Fair Scheduler to Capacity Scheduler migration
You must migrate from Fair Scheduler to Capacity Scheduler when migrating your cluster to CDP. The migration process involves automatically converting certain Fair Scheduler configuration to Capacity Scheduler configuration prior to the migration and manual fine tuning after the migration.

In CDP, Capacity Scheduler is the default and supported scheduler. You have to transition from Fair Scheduler to Capacity Scheduler before migrating from CDH to CDP Public Cloud.

The scheduler transition process includes migrating the YARN settings from Fair Scheduler to Capacity Scheduler:

1. Pre-migration: Use the fs2cs conversion utility to automatically convert Fair Scheduler configuration into Capacity Scheduler configuration.
2. Migrate all of your data to CDP.
3. Post-migration: Manually configure and fine-tune the scheduler after you have migrated all of your data.

Important: The features of Capacity Scheduler are not the same as the features of Fair Scheduler. Hence, the fs2cs conversion utility cannot convert every Fair Scheduler configuration into a Capacity Scheduler configuration. After the automatic conversion and once the migration is completed, you must manually tune the scheduler configurations to ensure that the resulting scheduling configuration fits your organization's internal goals and SLAs.


Plan your scheduler migration
Before starting the scheduler migration, you must learn about what Fair Scheduler configuration can be converted into a Capacity Scheduler configuration prior to the migration, and what configuration requires manual configuration and fine-tuning.

The features of Capacity Scheduler are not exactly the same as the features of Fair Scheduler. Hence, the fs2cs conversion utility cannot convert every Fair Scheduler configuration into a Capacity Scheduler configuration. You must learn about what properties are auto-converted and what requires manual configuration. In addition, there are Fair Scheduler features that do not have an equivalent feature in Capacity Scheduler.

Scheduler migration limitations
There are some hard limitations on converting a Fair Scheduler configuration into a Capacity Scheduler configuration as these two schedulers are not equivalent. Learning about these major limitations can help you understand the difficulties you might encounter after the scheduler migration.

The features and configurations of Capacity Scheduler differ from the features and configurations of Fair Scheduler, resulting in scheduler migration limitations. These limitations can sometimes be overcome by manual configuration, fine-tuning, or some trial-and-error, but in many cases there is no workaround.

Note: This is not a complete list. It only contains the scheduler migration limitations that most commonly cause issues.

Static and dynamic leaf queues cannot be created on the same level

If you have a parent queue defined in the capacity-scheduler.xml file with at least a single leaf queue, it is not possible to dynamically create a new leaf under this particular parent.

Placement rules and mapping rules are different

Placement rules (used in Fair Scheduler) and mapping rules (used in Capacity Scheduler) are very different, therefore auto-conversion is not possible. You manually have to configure placement rules and mapping rules once the migration from CDH to CDP is completed. There are multiple reasons for this. The following are the most substantial differences:

• In Fair Scheduler you can use special placement rules like "default" or "specified" which are completely absent in Capacity Scheduler.

• In Fair Scheduler you can set a "create" flag for every rule. Mapping rules do not support this.
• In Fair Scheduler, in case of nested rules, the "create" flag is interpreted for both rules. This is not true in Capacity Scheduler.
• If a rule cannot return a valid queue in Fair Scheduler, it proceeds to the next rule. Capacity Scheduler, on the other hand, returns "root.default".

The capacity value of dynamic queues is fixed

In Fair Scheduler, fair shares are recalculated each time a new queue is created. In contrast, Capacity Scheduler assigns a predefined percentage value for dynamically created queues.


This predefined percentage can be changed, but it is fixed until the scheduler is reconfigured. Once this value reaches 100, the next dynamic queue will be created with the value 0. For example, if the value is set to 25.00, then the fifth queue under the same parent will have a capacity of 0.

Auto-converted Fair Scheduler properties
The fs2cs conversion utility automatically converts certain Fair Scheduler properties into Capacity Scheduler properties. Reviewing the list of auto-converted properties enables you to verify the conversion and plan the manual fine-tuning that needs to be done after the migration is completed.

Table 8: Queue resource-quota related features

• Pre-created hierarchical queues: The same queue hierarchy is achieved after conversion.
• <weight> (Weight): The steady fair share of a queue. The queue.capacity property will be set with the same ratio.
• <maxAMShare> (Maximum AM share): Limits the fraction of the queue's fair share that can be used to run application masters.
• <maxRunningApps> (Maximum running apps): Limits the number of apps from the queue to run at once.
• <maxContainerAllocation> (Maximum container allocation): Maximum amount of resources a queue can allocate for a single container.
• <schedulingPolicy>: Scheduling policy of a queue (for example, how submitted applications are ordered over time). It is converted with some limitations. For more information, see Fair Scheduler features and conversion details.
• <aclSubmitApps>, <aclAdministerApps> (ACL settings): List of users and/or groups that can submit apps to the queue or can administer a queue.
• maximum-applications: Specifies the maximum number of concurrent active applications at any one time in the queue.
• maximum-am-resource-percent: Specifies the maximum percentage of resources in the cluster which can be used to run application masters for the queue.
• acl_submit_applications: Specifies the ACL which controls who can submit applications to the given queue.
• acl_administer_queue: Specifies the ACL which controls who can administer applications in the given queue.
• ordering-policy: Specifies the queue ordering policies to FIFO or fair on the given queue.

Table 9: Global scheduling settings

• yarn.scheduler.fair.allow-undeclared-pools (Allow undeclared pools): Sets whether new queues can be created at application submission time.
• yarn.scheduler.fair.sizebasedweight (Size based weight): Whether to assign shares to individual apps based on their size, rather than providing an equal share to all apps regardless of size.
• <queueMaxAppsDefault> (Queue max apps default): Sets the default running app limit for all queues.
• <queueMaxAMShareDefault> (Default max AM share): Sets the default AM resource limit for a queue.
• yarn.scheduler.fair.locality.threshold.node (Locality threshold node): For applications that request containers on particular nodes, the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another node.
• yarn.scheduler.fair.locality.threshold.rack (Locality threshold rack): For applications that request containers on particular racks, the number of scheduling opportunities since the last container assignment to wait before accepting a placement on another rack.
• yarn.scheduler.fair.max.assign (Maximum assignments): If assignmultiple is true and dynamic.max.assign is false, the maximum amount of containers that can be assigned in one heartbeat.
• yarn.scheduler.fair.assignmultiple (Assign multiple): Whether to allow multiple container assignments in one heartbeat.
• yarn.resourcemanager.scheduler.monitor.enable: Allows higher-priority applications to preempt lower-priority applications.
• yarn.scheduler.capacity.maximum-am-resource-percent: Specifies the maximum percentage of resources in the cluster which can be used to run application masters.

Table 10: Preemption

• yarn.scheduler.fair.preemption (Fair Scheduler preemption turned on): After the conversion, Capacity Scheduler preemption is turned on by default using the default values.
• <allowPreemptionFrom> (Per-queue preemption disabled): After the conversion, the same queue preemption is disabled by default.
• yarn.scheduler.fair.waitTimeBeforeKill: Wait time before killing a container.
• disable_preemption: Disables preemption of application containers submitted to a given queue.

Fair Scheduler features and conversion details
Certain Fair Scheduler properties cannot be auto-converted by the fs2cs conversion utility. Review the list of these properties and whether they are supported in Capacity Scheduler and by the Queue Manager UI to learn how you can configure them.

Table 11: Queue resource-quota related features

• <minResources>: Minimum resources the queue is entitled to. Conversion: Partially supported in Capacity Scheduler. Ignored by the fs2cs conversion utility. Not supported by Queue Manager UI.
• <maxResources>: Maximum amount of resources that will be allocated to a queue. Conversion: There is an equivalent feature in Capacity Scheduler. Ignored by the fs2cs conversion utility; for each queue, max-capacity will be set to 100%. Supported by Queue Manager UI.
• <maxChildResources>: Maximum amount of resources that can be allocated to an ad hoc child queue. Conversion: There is an equivalent feature in Capacity Scheduler. Ignored by the fs2cs conversion utility. Its value can be two distinct percentages (vcore/memory) or absolute resources, but the leaf-queue-template only accepts a single percentage. Supported by Queue Manager UI.
• <schedulingPolicy>: Scheduling policy of a queue (for example, how submitted applications should be ordered over time). Conversion: There is an equivalent feature in Capacity Scheduler. Manual fine tuning might be necessary. Note: if DRF is used anywhere in Fair Scheduler, then the converted configuration utilizes DRF everywhere and it is not possible to place a queue with "Fair" policy under one which has "DRF" enabled. Supported by Queue Manager UI.

Table 12: Global scheduling settings

• <user name="..."><maxRunningApps>...</maxRunningApps></user> (Maximum running apps per user): There is an equivalent feature in Capacity Scheduler. Fine-tuning of the following three properties is required: Maximum apps per queue, User limit percent, User limit factor. Supported by Queue Manager UI.
• <userMaxAppsDefault> (Default maximum running apps): Not supported in Capacity Scheduler.
• yarn.scheduler.fair.max.assign (Dynamic maximum assign): There is an equivalent feature in Capacity Scheduler. Fine-tuning of the following three properties is required: yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled, yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments, and yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments. Supported by Queue Manager UI.
• yarn.scheduler.fair.user-as-default-queue (User as default queue): There is a very similar feature in Capacity Scheduler. Perform the following steps:

1. Create a queue, such as root.users, and enable the auto-create-child-queue setting for it.
2. Use the following placement rule: "u:%user:%user"

The following restrictions apply:

• It is not possible to have root as a parent for dynamically created queues.
• root.users cannot have static leafs, that is, queues that are defined in the capacity-scheduler.xml file.

For more information, see the Placement rules table. Supported by Queue Manager UI.

Table 13: Preemption

• yarn.scheduler.fair.preemption.cluster-utilization-threshold: The utilization threshold after which preemption kicks in. Conversion: There is an equivalent feature in Capacity Scheduler: yarn.resourcemanager.monitor.capacity.preemption.max_ignored_over_capacity. It specifies the resource usage threshold over its configured capacity that a queue must meet before it is eligible for preemption. Supported by Queue Manager UI.
• minSharePreemptionTimeout: The number of seconds the queue is under its minimum share before it will try to preempt containers to take resources from other queues. Conversion: Not supported in Capacity Scheduler.
• fairSharePreemptionTimeout: The number of seconds the queue is under its fair share threshold before it will try to preempt containers to take resources from other queues. Conversion: Partially supported in Capacity Scheduler. This can be achieved by using the following configurations together: yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor and yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill. Supported by Queue Manager UI.
• fairSharePreemptionThreshold: The fair share preemption threshold for the queue. Conversion: Partially supported in Capacity Scheduler. This can be achieved by using the following configurations together: yarn.resourcemanager.monitor.capacity.preemption.natural_termination_factor and yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill. Supported by Queue Manager UI.

Table 14: Placement rules

• create="false" or "true": Disable or enable creating a queue dynamically in YARN. This option can be specified on all rules. Conversion: Partially supported in Capacity Scheduler. Use the Capacity Scheduler Dynamic Queue Mappings policies: u:%user:[managedParentQueueName].[queueName], u:%user:[managedParentQueueName].%user, u:%user:[managedParentQueueName].%primary_group, u:%user:[managedParentQueueName].%secondary_group. Supported by Queue Manager UI.
• <rule name="specified"/>: If a user has submitted the application by specifying a queue name (other than the "default" queue), then this rule will be successful. Hence the remaining set of rules won't be executed. Conversion: Not supported in Capacity Scheduler.
• <rule name="primaryGroupExistingQueue"/>: If the submitting user's (userA) primary group name (groupA) exists, submit to groupA. Conversion: There is an equivalent placement rule in Capacity Scheduler: <value>u:%user:%primary_group</value>. Supported by Queue Manager UI.
• <rule name="secondaryGroupExistingQueue"/>: If the submitting user's (userA) secondary group name (groupA) exists, submit to groupA. Conversion: There is an equivalent placement rule in Capacity Scheduler: <value>u:%user:%secondary_group</value>. Supported by Queue Manager UI.
• <rule name="nestedUserQueue">: The embedded rule (any rule is allowed except for the reject rule) is executed to generate a parent queue, and the user's (userA) name is created as a child of that parent. Conversion: Not supported in Capacity Scheduler.
• <rule name="default" queue="qName"/>: Fallback policy by which the rule will fall back to the queue named in the 'queue' property, or the "default" queue if no queue property is specified (if all matches fail). Conversion: There is an equivalent placement rule in Capacity Scheduler: <value>u:%user:default</value>. Supported by Queue Manager UI.

Example: Convert weights of Fair Scheduler queues
By reviewing the example of how you can convert the Fair Scheduler queue weights to Capacity Scheduler queue capacity (percentage relative to its parents), you can understand the Fair Scheduler conversion using the fs2cs conversion utility.

Table 15: Weight conversion example

Queue Path | Weight | Capacity Scheduler equivalent (yarn.scheduler.capacity.<queue-path>.capacity)
root | 1 | 100%
root.default | 10 | 25%
root.users | 30 | 75%
root.users.alice | 1 | 33.333%
root.users.bob | 1 | 33.333%
root.users.charlie | 1 | 33.334%

The fs2cs conversion utility ensures that all percentages of direct children under one parent queue add up exactly to 100.000%, as demonstrated in the table. For example, all queues under root.users: root.users.alice + root.users.bob + root.users.charlie = 100.000%.

Weights are converted into percentage-based capacities the following way: On queue-level root, there are 2 queues: default and users. Because they are specified with weights 10 + 30 (40 altogether), 1 "unit of weight" is 2.5%. This is why root.default has 25% and root.users has 75% of the capacity. This calculation can be applied to all queue levels.

Use the fs2cs conversion utilityYou can use the fs2cs conversion utility to automatically convert certain Fair Scheduler configuration toCapacity Scheduler configuration.

About this task

From the CDP 7.1 release, Cloudera provides a conversion tool, called fs2cs conversion utility. This utilityis a CLI application that is part of the yarn CLI command. It generates capacity-scheduler.xml andyarn-site.xml as output files.

Important: The features of Capacity Scheduler are not exactly the same as the features of Fair Scheduler. Hence, the fs2cs conversion utility cannot convert every Fair Scheduler configuration into a Capacity Scheduler configuration. After the automatic conversion and once the migration is completed, you must manually tune the scheduler configuration to ensure that it fits your organization’s internal goals and SLAs.

Before you begin

• Be aware of the Fair Scheduler properties that are auto-converted, those that require manual configuration, and those that do not have an equivalent feature in Capacity Scheduler.

• You must have downloaded and distributed parcels for the target version of CDP.
• In VPC, to use your current Compute Cluster queue configurations in your new installation after the upgrade, you must have manually saved them before starting the update process and then added the configurations to your new installation. Otherwise, your Compute Cluster queue configurations will be lost because the Upgrade Wizard transitions only the queues from your Base Cluster.

1. In Cloudera Manager, navigate to Hosts > All Hosts.
2. Find the host with the ResourceManager role and click the YARN ResourceManager role.
3. Click the Processes tab.
4. Find and save the fair-scheduler.xml and yarn-site.xml configuration files for future reference.

Procedure

1. Download the Fair Scheduler configuration files from the Cloudera Manager data store

a) In Cloudera Manager, navigate to Hosts > All Hosts.
b) Find the host with the ResourceManager role and click the YARN ResourceManager role.
c) Click the Processes tab.
d) Find and save the fair-scheduler.xml and yarn-site.xml configuration files for future reference.

2. Use the fs2cs conversion utility

a) Use ssh to log in to the host machine where you downloaded the fair-scheduler.xml and yarn-site.xml files.

b) Create a new directory to save the capacity-scheduler.xml file that is generated by the fs2cs conversion utility:

$ mkdir -p output

c) Use the fs2cs conversion utility to auto-convert the structure of resource pools. Options listed between square brackets [] are optional:

$ yarn fs2cs [--cluster-resource <vcores/memory>] [--no-terminal-rule-check] \
    --yarnsiteconfig </path/to/yarn-site.xml> [--fsconfig </path/to/fair-scheduler.xml>] \
    --output-directory </output/path/> [--print] [--skip-validation]
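For example, assuming the downloaded configuration files were copied to /tmp/fs2cs and the output directory created in the previous step is /tmp/fs2cs/output (both paths are illustrative only), the invocation could look like this:

$ yarn fs2cs --yarnsiteconfig /tmp/fs2cs/yarn-site.xml \
    --fsconfig /tmp/fs2cs/fair-scheduler.xml \
    --output-directory /tmp/fs2cs/output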

3. Provide the generated Capacity Scheduler configuration in Cloudera Manager.

a) In Cloudera Manager, select the YARN service.
b) Click the Configuration tab.
c) Search for capacity-scheduler and find the Capacity Scheduler Configuration Advanced Configuration Snippet (Safety Valve).
d) Click View as XML and insert the full content of the capacity-scheduler.xml file that was generated by the converter tool.
e) Click Save Changes.
f) Search for yarn-site and find the YARN Service Advanced Configuration Snippet (Safety Valve) for yarn-site.xml.
g) Click View as XML and insert the full content of the yarn-site.xml file that was generated by the converter tool.
h) Click Save Changes.

4. Restart the YARN and Queue Manager services.

What to do next

Proceed with the migration to CDP.

After the migration is completed, manually tune the configuration generated by the fs2cs conversion utility using the Queue Manager UI and the Cloudera Manager Advanced Configuration Snippet (Safety Valve).

CLI options of the fs2cs conversion tool

List of the CLI options of the fs2cs conversion tool.

Option Description

-d,--dry-run  Performs a dry run of the conversion. Outputs whether the conversion is possible or not.

-f,--fsconfig <arg>  Absolute path to a valid fair-scheduler.xml configuration file.

By default, yarn-site.xml contains the property which defines the path of fair-scheduler.xml. Therefore, the -f / --fsconfig setting is optional.

-h,--help Displays the list of options.


-o,--output-directory <arg>  Output directory for the yarn-site.xml and capacity-scheduler.xml files. Must have write permission for the user who is running this script.

If -p or --print is specified, the xml files are emitted to the standard output, so the -o / --output-directory is ignored.

-p,--print  If defined, the converted configuration will only be emitted to the console.

If -p or --print is specified, the xml files are emitted to the standard output, so the -o / --output-directory is ignored.

-r,--rulesconfig <arg>  Optional parameter. If specified, should point to a valid path to the conversion rules file (property format).

-s,--skip-validation  Disables validation of the converted Capacity Scheduler configuration. By default, the utility starts an internal Capacity Scheduler instance to check whether it can start up properly. This switch disables that behaviour.

-t,--no-terminal-rule-check  Disables checking whether a placement rule is terminal, to maintain backward compatibility with configurations that were made before YARN-8967.

By default, Fair Scheduler performs a strict check of whether a placement rule is terminal. This means that a <reject> rule followed by a <specified> rule is not allowed, since the latter is unreachable. However, before YARN-8967, Fair Scheduler was more lenient and allowed certain sequences of rules that are no longer valid. Inside the tool, a Fair Scheduler instance is instantiated to read and parse the allocation file. To have Fair Scheduler accept such configurations, supply the -t or --no-terminal-rule-check argument so that the Fair Scheduler instance does not throw an exception.

-y,--yarnsiteconfig <arg> Path to a valid yarn-site.xml configuration file.
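For example, to check up front whether a given Fair Scheduler configuration can be converted, you can run the utility with the dry-run switch. The paths below are illustrative, and the output directory argument is kept here only for completeness, since a dry run is expected to report feasibility rather than produce converted files:

$ yarn fs2cs --dry-run \
    --yarnsiteconfig /tmp/fs2cs/yarn-site.xml \
    --fsconfig /tmp/fs2cs/fair-scheduler.xml \
    --output-directory /tmp/fs2cs/output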

Manual configuration of scheduler properties

After migrating to CDP, you must manually fine-tune the scheduler configurations using the YARN Queue Manager UI to ensure that the resulting configurations suit your requirements. You can use the Cloudera Manager Advanced Configuration Snippet (Safety Valve) to configure a property that is missing from the YARN Queue Manager UI.

The features of Capacity Scheduler are not exactly the same as the features of Fair Scheduler. Hence, the conversion utility cannot convert every Fair Scheduler configuration into a Capacity Scheduler configuration. Therefore, you must manually tune the scheduler configurations to ensure that the resulting scheduling configuration fits your organization’s internal goals and SLAs after conversion. If needed, further change the scheduler properties in the capacity-scheduler.xml and yarn-site.xml output files generated by the fs2cs conversion utility. For information about the Fair Scheduler properties that are auto-converted by the fs2cs conversion utility, see Auto-converted Fair Scheduler properties.
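For orientation, the generated capacity-scheduler.xml uses the standard Capacity Scheduler property naming scheme. The following is only a sketch based on the weight conversion example earlier in this guide; the real output contains many more properties, and the exact value formatting may differ:

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,users</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>25.000</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.users.capacity</name>
  <value>75.000</value>
</property>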

You can configure the properties manually using the YARN Queue Manager UI after the migration. If you see a property that is unavailable in the Queue Manager UI, you can use the Cloudera Manager Advanced Configuration Snippet (Safety Valve) to configure it.

Important: You must not use the Queue Manager UI and Cloudera Manager Safety Valves at the same time, because Safety Valves overwrite the configuration set using the Queue Manager UI.

Related Information

Auto-converted Fair Scheduler properties


Use YARN Queue Manager UI to configure scheduler properties

After migrating to CDP, you must configure the Capacity Scheduler properties using the output files generated by the fs2cs conversion utility. You can configure the properties manually using the YARN Queue Manager UI service.

Before you begin

• Use the fs2cs conversion utility to generate the capacity-scheduler.xml and yarn-site.xml output files.

• Complete the migration process.
• Identify properties that require manual configuration and can be configured using the Queue Manager UI.

For more information about scheduler properties, see Fair Scheduler feature and conversion details.

Procedure

1. In Cloudera Manager, click Clusters and select the YARN Queue Manager UI service.

2. In the YARN Queue Manager window, click the Scheduler Configuration tab.


3. In the Scheduler Configuration window, enter the value of the property and click Save.

Use Cloudera Manager Safety Valves to configure scheduler properties

Certain scheduler properties can neither be converted by the fs2cs conversion utility nor be configured using the YARN Queue Manager UI service. After migrating to CDP, you must manually configure these properties using the Cloudera Manager Advanced Configuration Snippet (Safety Valve).

Before you begin

• Use the fs2cs conversion utility to generate the capacity-scheduler.xml and yarn-site.xml output files.

• Complete the migration process.
• Identify the scheduler properties that need to be configured manually and are not supported by the Queue Manager UI.

Procedure

1. In Cloudera Manager, select the YARN service.

2. Click the Configuration tab.

3. Search for capacity-scheduler, and find the Capacity Scheduler Configuration Advanced Configuration Snippet (Safety Valve).

4. Click View as XML, and insert the complete capacity-scheduler.xml file generated by the converter tool.

5. Add the necessary configuration properties.

6. Click Save Changes.


7. Search for yarn-site, and find the YARN Service Advanced Configuration Snippet (Safety Valve) for yarn-site.xml.

8. Click View as XML and add the required configuration in an XML format.

Optionally, use + and - to add and remove properties.

9. Click Save Changes.

10. Restart the YARN service.

Migrating Oozie to CDP

After you migrate the Oozie data to CDP, you must first configure Oozie, and then migrate custom ShareLib jars to your new cluster.

You must configure Oozie to work with different CDP services including Sqoop actions, YARN jobs, and HDFS HA. For information on configuring these services, see Configure Oozie.

Note: By default, the Oozie service schedules an internal job that purges all Oozie workflows older than 30 days from the database. However, actions associated with long-running coordinators are not purged until the coordinators complete. Cloudera recommends that you configure Oozie with an empty database for CDP. To view the old data, you can take a backup and run SQL queries on that data. You must recreate long-running coordinator jobs because the CDP environment is different from that of HDP and CDH.

Procedure

After you configure Oozie, you must migrate the custom ShareLib jars to the new cluster.

1. Copy the Oozie ShareLib jar from your HDP or CDH cluster:

cp /user/oozie/share/lib/lib_{TIMESTAMP}/{COMPONENT}

The location of the Oozie ShareLib is the same across the HDP, CDH, and CDP environments.
2. Paste the ShareLib into the new file system of the CDP environment (see the copy sketch after the note below):

paste /user/oozie/share/lib/lib_{TIMESTAMP}/{COMPONENT}

Note: These files must be present on storage such as HDFS, S3, and so on, and not on the hosts where you installed Oozie.
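For example, one way to perform steps 1 and 2 is to pull the ShareLib from the source cluster's file system to local disk and push it to the target file system, or to copy it directly with DistCp. This is only a sketch: the NameNode host names below are placeholders, and the {TIMESTAMP} and {COMPONENT} placeholders are kept as in the steps above.

hdfs dfs -get hdfs://<source-namenode>:8020/user/oozie/share/lib/lib_{TIMESTAMP} /tmp/oozie-sharelib
hdfs dfs -put /tmp/oozie-sharelib hdfs://<target-namenode>:8020/user/oozie/share/lib/lib_{TIMESTAMP}

or, in one step:

hadoop distcp hdfs://<source-namenode>:8020/user/oozie/share/lib/lib_{TIMESTAMP} \
    hdfs://<target-namenode>:8020/user/oozie/share/lib/lib_{TIMESTAMP}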

3. Execute a ShareLib update:

oozie admin -oozie {OOZIE_URL} -sharelibupdate
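Optionally, you can verify that the updated ShareLib is visible to Oozie by listing its contents, using the same {OOZIE_URL}:

oozie admin -oozie {OOZIE_URL} -shareliblist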

After you migrate the custom ShareLib jars, update the workflow XML files for DFS paths, JDBC URLs (for example, Hive), and so on, to manage the new environment.
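As a sketch of the kind of values that typically change, the following entries are illustrative only. The property names (nameNode, resourceManager, hiveJdbcUrl) and host names are placeholders; whether such values live in the workflow XML itself or in an accompanying job.properties file depends on how your workflows are parameterized.

# illustrative job.properties values pointing at the new CDP environment
nameNode=hdfs://<new-namenode-host>:8020
resourceManager=<new-resourcemanager-host>:8032
hiveJdbcUrl=jdbc:hive2://<new-hiveserver2-host>:10000/default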
