Teradata Connector for Hadoop
Tutorial
Version: 1.4
December 2015
Teradata Connector for Hadoop Tutorial v1.4
Table of Contents
1 Introduction ................................................................................................................... 5
1.1 Overview ............................................................................................................... 5
1.2 Audience ............................................................................................................... 5
1.3 Architecture ........................................................................................................... 6
1.3.1 MapReduce .................................................................................................... 6
1.3.2 Controlling the Degree of Parallelism ............................................................. 7
1.3.3 Plugin Architecture ......................................................................................... 7
1.3.4 Stages of the TDCH job ................................................................................. 7
1.4 TDCH Plugins and Features ................................................................................. 9
1.4.1 Defining Plugins via the Command Line Interface .......................................... 9
1.4.2 HDFS Source and Target Plugins .................................................................. 9
1.4.3 Hive Source and Target Plugins ..................................................................... 9
1.4.4 HCatalog Source and Target Plugins ........................................................... 10
1.4.5 Teradata Source Plugins .............................................................................. 10
1.4.6 Teradata Target Plugins ............................................................................... 11
1.5 Teradata Plugin Space Requirements ................................................................ 12
1.5.1 Space Required by Teradata Target Plugins ............................................... 12
1.5.2 Storage Space Required for Extracting Data from Teradata ........................ 12
1.6 Teradata Plugin Privilege Requirements ............................................................ 13
2 Supported Plugin Properties ..................................................................................... 14
2.1 Source Plugin Definition Properties .................................................................... 14
2.2 Target Plugin Definition Properties ..................................................................... 15
2.3 Common Properties ............................................................................................ 16
2.4 Teradata Source Plugin Properties ..................................................................... 20
2.5 Teradata Target Plugin Properties ...................................................................... 26
2.6 HDFS Source Plugin Properties ......................................................................... 32
2.7 HDFS Target Properties ..................................................................................... 35
2.8 Hive Source Properties ....................................................................................... 38
2.9 Hive Target Properties ........................................................................................ 41
2.10 HCat Source Properties ................................................................................... 44
2.11 HCat Target Properties .................................................................................... 45
3 Installing Connector ................................................................................................... 46
3.1 Prerequisites ....................................................................................................... 46
3.2 Software Download ............................................................................................. 46
3.3 RPM Installation .................................................................................................. 46
3.4 ConfigureOozie Installation ................................................................................. 47
4 Launching TDCH Jobs ............................................................................................... 49
4.1 TDCH’s Command Line Interface ....................................................................... 49
4.2 Runtime Dependencies ...................................................................................... 49
4.3 Launching TDCH with Oozie workflows .............................................................. 50
4.4 TDCH’s Java API ................................................................................................ 50
5 Use Case Examples .................................................................................................... 52
5.1 Environment Variables for Runtime Dependencies ............................................ 52
5.2 Use Case: Import to HDFS File from Teradata Table ......................................... 53
5.2.1 Setup: Create a Teradata Table with Data ................................................... 53
5.2.2 Run: ConnectorImportTool command .......................................................... 54
5.3 Use Case: Export from HDFS File to Teradata Table ......................................... 54
5.3.1 Setup: Create a Teradata Table ................................................................... 54
5.3.2 Setup: Create an HDFS File ......................................................................... 55
5.3.3 Run: ConnectorExportTool command .......................................................... 55
5.4 Use Case: Import to Existing Hive Table from Teradata Table ........................... 56
5.4.1 Setup: Create a Teradata Table with Data ................................................... 56
5.4.2 Setup: Create a Hive Table .......................................................................... 56
5.4.3 Run: ConnectorImportTool Command .......................................................... 56
5.4.4 Run: ConnectorImportTool Command .......................................................... 57
5.5 Use Case: Import to New Hive Table from Teradata Table ................................ 57
5.5.1 Setup: Create a Teradata Table with Data ................................................... 57
5.5.2 Run: ConnectorImportTool Command .......................................................... 58
5.6 Use Case: Export from Hive Table to Teradata Table ........................................ 58
5.6.1 Setup: Create a Teradata Table ................................................................... 58
5.6.2 Setup: Create a Hive Table with Data .......................................................... 59
5.6.3 Run: ConnectorExportTool Command ......................................................... 60
5.7 Use Case: Import to Hive Partitioned Table from Teradata PPI Table ................ 60
5.7.1 Setup: Create a Teradata PPI Table with Data ............................................ 60
5.7.2 Setup: Create a Hive Partitioned Table ........................................................ 61
5.7.3 Run: ConnectorImportTool Command .......................................................... 61
5.8 Use Case: Export from Hive Partitioned Table to Teradata PPI Table ............... 61
5.8.1 Setup: Create a Teradata PPI Table ............................................................ 61
5.8.2 Setup: Create a Hive Partitioned Table with Data ........................................ 62
5.8.3 Run: ConnectorExportTool command .......................................................... 63
5.9 Use Case: Import to Teradata Table from HCatalog Table ................................. 63
5.9.1 Setup: Create a Teradata Table with Data ................................................... 63
5.9.2 Setup: Create a Hive Table .......................................................................... 64
5.9.3 Run: ConnectorImportTool Command .......................................................... 64
5.10 Use Case: Export from HCatalog Table to Teradata Table ............................. 64
5.10.1 Setup: Create a Teradata Table ................................................................. 64
5.10.2 Setup: Create a Hive Table with Data ........................................................ 65
5.10.3 Run: ConnectorExportTool Command ....................................................... 65
5.11 Use Case: Import to Teradata Table from ORC File Hive Table ...................... 66
5.11.1 Run: ConnectorImportTool Command ........................................................ 66
5.12 Use Case: Export from ORC File HCat Table to Teradata Table ..................... 66
5.12.1 Setup: Create the Source HCatalog Table ................................................. 66
5.12.2 Run: ConnectorExportTool Command ....................................................... 67
5.13 Use Case: Import to Teradata Table from Avro File in HDFS .......................... 67
5.13.1 Setup: Create a Teradata Table ................................................................. 67
5.13.2 Setup: Prepare the Avro Schema File ........................................................ 68
5.13.3 Run: ConnectorImportTool Command ........................................................ 69
5.14 Use Case: Export from Avro to Teradata Table ............................................... 69
5.14.1 Setup: Prepare the Source Avro File .......................................................... 69
5.14.2 Setup: Create a Teradata Table ................................................................. 69
5.14.3 Run: ConnectorExportTool Command ....................................................... 70
6 Performance Tuning ................................................................................................... 71
6.1 Selecting the Number of Mappers ...................................................................... 71
6.1.1 Maximum Number of Mappers on the Hadoop Cluster ................................ 71
6.1.2 Mixed Workload Hadoop Clusters and Schedulers ...................................... 71
6.1.3 TDCH Support for Preemption ..................................................................... 72
6.1.4 Maximum Number of Sessions on Teradata ................................................ 72
6.1.5 General Guidelines and Measuring Performance ......................................... 72
6.2 Selecting a Teradata Target Plugin .................................................................... 73
6.3 Selecting a Teradata Source Plugin ................................................................... 73
6.4 Increasing the Batchsize Value ........................................................................... 74
6.5 Configuring the JDBC Driver ............................................................................... 74
7 Troubleshooting .......................................................................................................... 75
7.1 Troubleshooting Requirements ........................................................................... 75
7.2 Troubleshooting Overview .................................................................................. 76
7.3 Functional: Understand Exceptions .................................................................... 77
7.4 Functional: Data Issues ...................................................................................... 78
7.5 Performance: Back of the Envelope Guide ......................................................... 78
7.6 Console Output Structure ................................................................................... 80
7.7 Troubleshooting Examples ................................................................................. 81
7.7.1 Database doesn’t exist ................................................................................. 81
7.7.2 Internal fast load server socket time out ....................................................... 82
7.7.3 Incorrect parameter name or missing parameter value in command line ..... 82
7.7.4 Hive partition column cannot appear in the Hive table schema ................... 83
7.7.5 String will be truncated if its length exceeds the Teradata String length (VARCHAR or CHAR) when running export job. ..................................................... 83
7.7.6 Scaling number of Timestamp data type should be specified correctly in JDBC URL in internal.fastload method .................................................................... 83
7.7.7 Existing Error table error received when exporting to Teradata in internal.fastload method .......................................................................................... 84
7.7.8 No more room in database error received when exporting to Teradata ....... 84
7.7.9 “No more spool space” error received when exporting to Teradata.............. 85
7.7.10 Separator is wrong or absent ..................................................................... 86
7.7.11 Date / Time / Timestamp format related errors ........................................... 86
7.7.12 Japanese language problem .................................................................... 87
8 FAQ .............................................................................................................................. 88
8.1 Do I need to install the Teradata JDBC driver manually? ................................... 88
8.2 What authorization is necessary for running the TDCH? .................................... 88
8.3 How do I use User Customized Text Format Parameters? ................................. 88
8.4 How to use Unicode character as the separator? ............................................... 88
8.5 Why is the actual number of mappers less than the value of -nummappers? ..... 89
8.6 Why don’t decimal values in Hadoop exactly match the value in Teradata? ...... 89
8.7 When should charset be specified in the JDBC URL? ........................................ 89
8.8 How do I configure the capacity scheduler to prevent task skew? ...................... 89
8.9 How can I build my own ConnectorDataTypeConverter ..................................... 89
9 Limitations & known issues ....................................................................................... 92
9.1 Teradata Connector for Hadoop ......................................................................... 92
9.2 Teradata JDBC Driver ........................................................................................ 93
9.3 Teradata Database ............................................................................................. 93
9.4 Hadoop Map/Reduce .......................................................................................... 93
9.5 Hive .................................................................................................................... 93
9.6 Avro data type conversion and encoding ............................................................ 94
1 Introduction
1.1 Overview
The Teradata Connector for Hadoop (TDCH) is a MapReduce application that supports high-performance, parallel, bi-directional data movement between Teradata systems and various Hadoop
ecosystem components.
TDCH can function as an end user tool with its own command-line interface, can be included in and
launched with custom Oozie workflows, and can also be integrated with other end user tools via its
Java API.
1.2 Audience
TDCH is designed and implemented for the Hadoop user audience. Users in this audience are
familiar with the Hadoop Distributed File System (HDFS) and MapReduce. They are also familiar
with other widely used Hadoop ecosystem components such as Hive, Pig, and Sqoop. They should be
comfortable with the command-line interfaces that many of these tools provide. Basic knowledge
of the Teradata database system is also assumed.
[Figure: TDCH in context — on the Teradata side, BI tools, ETL tools, Teradata Tools, and Teradata SQL interface with the Teradata DB; on the Hadoop side, Hive, Pig, Sqoop, and MapReduce sit atop HDFS. TDCH bridges the Teradata DB and HDFS.]
1.3 Architecture
TDCH is a bi-directional data movement utility which runs as a MapReduce application inside the
Hadoop cluster. TDCH employs an abstracted ‘plugin’ architecture which allows users to easily
configure, extend and debug their data movement jobs.
1.3.1 MapReduce
TDCH utilizes MapReduce as its execution engine. MapReduce is a framework designed for
processing parallelizable problems across huge datasets using a large number of computers (nodes).
When run against files in HDFS, MapReduce can take advantage of locality of data, processing data
on or near the storage assets to decrease transmission of data. MapReduce supports other distributed
filesystems such as Amazon S3. MapReduce is capable of recovering from partial failure of servers
or storage at runtime. TDCH jobs get submitted to the MapReduce framework, and the distributed
processes launched by the MapReduce framework make JDBC connections to the Teradata database;
the scalability and fault tolerance properties of the framework are key features of TDCH data
movement jobs.
[Figure: A TDCH job runs as TDCH mappers inside MapReduce tasks/containers on the datanodes, coordinated by the namenode and the JobTracker/ResourceManager; each TDCH mapper makes its own connection to the Teradata DB.]
1.3.2 Controlling the Degree of Parallelism
Both Teradata and Hadoop systems employ extremely scalable architectures, and thus it is very
important to be able to control the degree of parallelism when moving data between the two systems.
Because TDCH utilizes the MapReduce framework as its execution engine, the degree of parallelism
for TDCH jobs is defined by the number of mappers used by the MapReduce job. The number of
mappers used by the MapReduce framework can be configured via the command line parameter
‘nummappers’, or via the ‘tdch.num.mappers’ configuration property. General TDCH command line
parameters and their underlying properties are discussed in more detail in Section 2.1.
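As a rough sketch of how the mapper count drives parallelism (illustrative code, not TDCH internals — the real splits depend on the chosen source plugin):

```python
# Illustrative sketch: dividing a source data set of `total_rows` rows into
# `num_mappers` roughly equal splits, as a TDCH job conceptually does when
# -nummappers / tdch.num.mappers is set.

def compute_splits(total_rows, num_mappers):
    """Return (start, end) row ranges, one per mapper."""
    base, extra = divmod(total_rows, num_mappers)
    splits, start = [], 0
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        splits.append((start, start + size))
        start += size
    return splits

# Eight mappers over a 1,000,000-row source table:
print(compute_splits(1_000_000, 8)[0])  # (0, 125000)
```

Each of the N ranges is processed by one mapper, so raising the mapper count raises the number of concurrent connections to both systems.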
1.3.3 Plugin Architecture
The TDCH architecture employs a ‘plugin’ model, where the source and target of the TDCH job are
abstracted into source and target plugins, and the core TDCH ‘data bus’ is plugin agnostic. Plugins
are composed of a pre-defined set of classes - the source plugin is composed of the Processor,
InputFormat, Serde, and PlugInConfiguration classes while the target plugin is composed of the
Converter, Serde, OutputFormat, Processor and PlugInConfiguration classes. Some of these plugin-
specific classes implement the MapReduce API, and all are instantiated by generalized Connector
classes which also implement the MapReduce API. The generalized Connector classes then forward
method calls from the MapReduce framework to the encapsulated plugin classes at runtime. Source
plugins are responsible for generating records in ConnectorRecord form from the source data, while
target plugins are responsible for consuming ConnectorRecords and converting them to the target
data format. This design decouples the source of the TDCH job from the target, ensuring that
minimal knowledge of the source or target system is required by the opposite plugin.
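The decoupling described above can be sketched as follows (the class names below are simplified stand-ins for the actual TDCH plugin classes):

```python
# Simplified sketch of the plugin model: a source plugin emits records in a
# common ConnectorRecord form; a target plugin consumes them. Neither side
# needs to know the other's data format.

class ConnectorRecord:
    def __init__(self, fields):
        self.fields = fields

class CsvSourcePlugin:
    """Source plugin: parses its own format into ConnectorRecords."""
    def read(self, lines):
        for line in lines:
            yield ConnectorRecord(line.split(","))

class TsvTargetPlugin:
    """Target plugin: serializes ConnectorRecords into its own format."""
    def write(self, records):
        return ["\t".join(r.fields) for r in records]

source, target = CsvSourcePlugin(), TsvTargetPlugin()
out = target.write(source.read(["1,alice", "2,bob"]))
print(out)  # ['1\talice', '2\tbob']
```

Because both plugins speak only ConnectorRecord, either side can be swapped out (HDFS, Hive, HCatalog, Teradata) without changes to the other.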
1.3.4 Stages of the TDCH job
A TDCH job is made up of three distinct stages: the preprocessing stage, the data transfer stage, and
the postprocessing stage. During the preprocessing stage, the PlugInConfiguration classes are used to
set up a Hadoop configuration object with information about the job and the source and target
systems. The processor classes of both the input plugin and the output plugin then validate the
source and target properties, and take any necessary steps to prepare the source and target systems
for the TDCH job. Once preprocessing is complete, the job is submitted to the MapReduce
framework; this stage is referred to as the data transfer stage. During this stage, the input plugin is
responsible for generating ConnectorRecords from the source data, while the output plugin is
responsible for converting ConnectorRecords to the target data format. The processor classes are then
responsible for cleaning up the source and target environments during the postprocessing stage.
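The three stages can be sketched as follows (the classes and method names here are illustrative simplifications of those named above, not the real TDCH implementation):

```python
# Sketch of the three TDCH job stages: preprocessing, data transfer,
# and postprocessing.

class Plugin:
    def __init__(self, name):
        self.name, self.log = name, []
    def pre_process(self):
        self.log.append("pre")    # validate properties, prepare the system
    def post_process(self):
        self.log.append("post")   # clean up staging objects, etc.

def run_job(input_plugin, output_plugin, transfer):
    # Preprocessing stage: both processors validate and prepare.
    input_plugin.pre_process()
    output_plugin.pre_process()
    # Data transfer stage: the MapReduce job moves the records.
    result = transfer()
    # Postprocessing stage: both processors clean up.
    output_plugin.post_process()
    input_plugin.post_process()
    return result

src, tgt = Plugin("source"), Plugin("target")
run_job(src, tgt, lambda: "ok")
```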
[Figure: Stages of a TDCH MapReduce job. Preprocessing stage — ConnectorImport/ExportTool and ConnectorJobRunner invoke the input and output plugins’ Processor classes (inputPreProcessor() / outputPreProcessor()) against the JobContext and Input/OutputPlugInConfiguration. Data transfer stage — the input plugin’s InputFormat/RecordReader and SerDe convert data from the source format onto the TDCH data bus in ConnectorRecord format; the output plugin’s SerDe, Converter, and OutputFormat/RecordWriter convert it to the target format. Postprocessing stage — ConnectorJobRunner invokes outputPostProcessor() and inputPostProcessor().]
1.4 TDCH Plugins and Features
1.4.1 Defining Plugins via the Command Line Interface
When using TDCH’s command line interface, the ConnectorImportTool and
ConnectorExportTool classes are responsible for taking user-supplied command line parameters
and values and identifying the desired source and target plugins for the TDCH job. In most cases,
the plugins are identified by the system or component the plugin interfaces with and the file
format of the underlying data. Once the source and target plugins are identified, the plugins’
PlugInConfiguration classes are used to define job-specific properties in the Hadoop
Configuration object.
1.4.2 HDFS Source and Target Plugins
The Hadoop Distributed File System, or HDFS, is a distributed, scalable file system designed to
run on commodity hardware. HDFS is designed to reliably store very large files across machines
in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last
block are the same size. TDCH supports extracting data from and loading data into files and
directories in HDFS via the HDFS Source and Target Plugins. The HDFS Source and Target
Plugins support the following file formats:
TextFile
TextFile is structured as a sequence of lines of text, and each line consists of multiple fields.
Lines and fields are delimited by separator characters. TextFile is the easiest format for humans to read.
Avro
Avro is a serialization framework developed within Apache's Hadoop project. It uses JSON for
defining data types and protocols, and serializes data in a compact binary format. TDCH jobs
that read or write Avro files require an Avro schema to be specified inline or via a file.
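For reference, a minimal Avro record schema has the following shape (the record and field names here are hypothetical):

```json
{
  "type": "record",
  "name": "example_record",
  "fields": [
    {"name": "id",   "type": "int"},
    {"name": "name", "type": ["null", "string"]}
  ]
}
```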
1.4.3 Hive Source and Target Plugins
Hive is a data warehouse infrastructure built on top of Hadoop that provides tools for data
summarization, query, and analysis. It defines a simple SQL-like query language, called HiveQL,
which enables users familiar with SQL to query data in Hadoop. Hive executes queries via
different execution engines depending on the distribution of Hadoop in use. TDCH supports
extracting data from and loading data into Hive tables (both partitioned and non-partitioned) via
the Hive Source and Target Plugins. The Hive Source and Target Plugins support the following
file formats:
TextFile
TextFile is structured as a sequence of lines of text, and each line consists of multiple fields.
Lines and fields are delimited by separator characters. TextFile is the easiest format for humans to read.
SequenceFile
SequenceFile is a flat file consisting of binary key/value pairs. It is extensively used in
MapReduce as input/output formats.
RCFile
RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based
data warehouse systems, such as Hive. RCFile applies the concept of “first horizontally-partition,
then vertically-partition”. It combines the advantages of both row-store and column-store.
RCFile guarantees that data in the same row are located on the same node, and it can exploit
column-wise data compression and skip unnecessary column reads.
ORCFile
ORCFile (Optimized Row Columnar File) file format provides a highly efficient way to store
Hive data. It is designed to overcome limitations of the other Hive file formats. Using ORC files
improves performance when Hive is reading, writing, and processing data. ORC file support is
only available on Hadoop systems with Hive 0.11.0 or above installed.
1.4.4 HCatalog Source and Target Plugins
HCatalog is a table and storage management service for data created using Apache Hadoop.
HCatalog’s table abstraction presents users with a relational view of data in HDFS and ensures
that users need not worry about where or in what format their data is stored. TDCH supports
extracting data from and loading data into HCatalog tables via the HCatalog Source and Target
Plugins, though the plugins are being deprecated due to the inclusion of HCatalog within the
Hive project. The HCatalog Source and Target Plugins support the following file formats:
TextFile
TextFile is structured as a sequence of lines of text, and each line consists of multiple fields.
Lines and fields are delimited by separator characters. TextFile is the easiest format for humans to read.
1.4.5 Teradata Source Plugins
Teradata is an industry-leading data warehouse which employs an MPP, or massively parallel
processing, architecture. Because TDCH runs on the Hadoop cluster, it does not have direct
access to the underlying data in a Teradata system, and thus it does not make sense to define the
file formats supported by the Teradata source plugins. Rather, the Teradata source plugins are
defined by how they split the source Teradata data set (where the source data set can be a
table, view, or query) into N ‘splits’, with N being the number of mappers in use by the TDCH
job. The Teradata source plugins support the following split mechanisms:
split.by.hash
The Teradata split.by.hash source plugin utilizes each mapper in the TDCH job to retrieve rows
in a given hash range of the specified split-by column from a source table in Teradata. If the user
doesn’t define a split-by column, the first column of the table’s primary index is used by default.
The split.by.hash plugin supports more data types than the split.by.value plugin.
split.by.value
The Teradata split.by.value source plugin utilizes each mapper in the TDCH job to retrieve rows
in a given value range of the specified split-by column from a source table in Teradata. If the
user doesn’t define a split-by column, the first column of the table’s primary index is used by
default. The split.by.value plugin supports fewer data types than the split.by.hash plugin.
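As a rough illustration of the range-based split mechanisms above (the SQL-like predicates here are illustrative; they are not the exact queries TDCH generates):

```python
# Illustrative sketch of value-range splitting: given the min/max of the
# split-by column and N mappers, build one WHERE clause per mapper.
# split.by.hash works analogously over hash ranges of the column.

def value_range_predicates(column, lo, hi, num_mappers):
    width = (hi - lo + 1) // num_mappers
    preds = []
    for i in range(num_mappers):
        start = lo + i * width
        end = hi if i == num_mappers - 1 else start + width - 1
        preds.append(f"{column} BETWEEN {start} AND {end}")
    return preds

for p in value_range_predicates("order_id", 1, 100, 4):
    print(p)
```

Each mapper then issues its own SELECT with its assigned predicate, which is why the source plugins need spool space for N concurrent SELECTs (see Section 1.5.2).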
split.by.partition
The Teradata split.by.partition source plugin utilizes each mapper in the TDCH job to retrieve
rows in a given partition from a source table in Teradata. The split.by.partition plugin is used by
default when the source data set is defined by a query. The plugin creates a PPI, or partitioned
primary index, stage table with data from the source table when the source table is not already
a PPI table or when a query defines the source data set. To enable the creation of the staging table,
the split.by.partition plugin requires that the associated Teradata user has ‘create table’ and
‘create view’ privileges, as well as free perm space equivalent to the size of the source table.
split.by.amp
The Teradata split.by.amp source plugin utilizes each mapper in the TDCH job to retrieve rows
associated with one or more AMPs from a source table in Teradata. The split.by.amp plugin
delivers the best performance due to its use of a special table operator, which is available only
on Teradata 14.10+ database systems.
1.4.6 Teradata Target Plugins
The Teradata target plugins are defined by the mechanism they utilize to load data into the target
Teradata table. The Teradata target plugins implement the following load mechanisms:
batch.insert
The Teradata batch.insert target plugin associates an SQL JDBC session with each mapper in the
TDCH job when loading a target table in Teradata. The batch.insert plugin is the most flexible as
it supports most Teradata data types, requires no coordination between the TDCH mappers, and
can recover from mapper failure. If the target table is not NOPI, a NOPI stage table is created
and loaded as an intermediate step before moving the data to the target via a single
INSERT-SELECT SQL operation. To enable the creation of the staging table, the batch.insert plugin
requires that the associated Teradata user has ‘create table’ privileges, as well as free perm
space equivalent to the size of the source data set.
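A minimal sketch of the batching idea behind batch.insert (a pure-Python simulation; the real plugin submits these batches over per-mapper JDBC sessions, and the batch size here stands in for TDCH's batch size setting):

```python
# Each mapper buffers its rows and issues multi-row INSERTs batch by batch.

def batched(rows, batch_size):
    """Yield successive batches of rows, as a mapper would submit them."""
    for i in range(0, len(rows), batch_size):
        yield rows[i:i + batch_size]

rows = [("r%d" % i,) for i in range(10)]
batches = list(batched(rows, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Because each mapper's batches are independent, a failed mapper can simply be re-run, which is why batch.insert can recover from mapper failure.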
internal.fastload
The Teradata internal.fastload target plugin associates a FastLoad JDBC session with each
mapper in the TDCH job when loading a target table in Teradata. The internal.fastload method
utilizes a FastLoad ‘slot’ on Teradata, and implements coordination between the TDCH mappers
and a TDCH coordinator process (running on the edge node where the job was submitted) as
defined by the Teradata FastLoad protocol. The internal.fastload plugin delivers exceptional load
performance; however, it supports fewer data types than batch.insert and cannot recover from
mapper failure. If the target table is not NOPI, a NOPI stage table is created and loaded as an
intermediate step before moving the data to the target via a single INSERT-SELECT SQL operation.
To enable the creation of the staging table, the internal.fastload plugin requires that the
associated Teradata user has ‘create table’ privileges, as well as free perm space equivalent to
the size of the source data set.
1.5 Teradata Plugin Space Requirements
1.5.1 Space Required by Teradata Target Plugins
This section describes the permanent and spool space required by the Teradata target plugins on the
target Teradata system:
batch.insert
When the target table is not NoPI or is non-empty, the Teradata batch.insert target plugin creates
a temporary NoPI stage table. The Teradata batch.insert target plugin loads the source data into
the temporary stage table before executing an INSERT-SELECT operation to move the data
from the stage table into the target table. To support the use of a temporary staging table, the
target database must have enough permanent space to accommodate data in the stage table. In
addition to the permanent space required by the temporary stage table, the Teradata batch.insert
target plugin requires spool space equivalent to the size of the source data to support the
INSERT-SELECT operation between the temporary staging and target tables.
internal.fastload
When the target table is not NOPI or is non-empty, the Teradata internal.fastload target plugin
creates a temporary NOPI stage table. The Teradata internal.fastload target plugin loads the
source data into the temporary stage table before executing an INSERT-SELECT operation to
move the data from the stage table into the target table. To support the use of a temporary staging
table, the target database must have enough permanent space to accommodate data in the stage
table. In addition to the permanent space required by the temporary stage table, the Teradata
internal.fastload target plugin requires spool space equivalent to the size of the source data to
support the INSERT-SELECT operation between the temporary staging and target tables.
1.5.2 Storage Space Required for Extracting Data from Teradata
This section describes the permanent and spool space required by the Teradata source plugins on the
source Teradata system:
split.by.value
The Teradata split.by.value source plugin associates data in value ranges of the source table with
distinct mappers from the TDCH job. Each mapper retrieves the associated data via a SELECT
statement, and thus the Teradata split.by.value source plugin requires that the source database
have enough spool space to support N SELECT statements, where N is the number of mappers in
use by the TDCH job.
split.by.hash
The Teradata split.by.hash source plugin associates data in hash ranges of the source table with
distinct mappers from the TDCH job. Each mapper retrieves the associated data via a SELECT
statement, and thus the Teradata split.by.hash source plugin requires that the source database
have enough spool space to support N SELECT statements, where N is the number of mappers in
use by the TDCH job.
split.by.partition
When the source table is not partitioned, the Teradata split.by.partition source plugin creates a
temporary partitioned staging table and executes an INSERT-SELECT to move data from the
source table into the stage table. To support the use of a temporary partitioned staging table, the
source database must have enough permanent space to accommodate the source data set in the
stage table as well as in the source table. In addition to the permanent space required by the
temporary stage table, the Teradata split.by.partition source plugin requires spool space
equivalent to the size of the source data to support the INSERT-SELECT operation between the
source table and the temporary partitioned stage table.
Once a partitioned source table is available, the Teradata split.by.partition source plugin
associates partitions from the source table with distinct mappers from the TDCH job. Each
mapper retrieves the associated data via a SELECT statement, and thus the Teradata
split.by.partition source plugin requires that the source database have enough spool space to
support N SELECT statements, where N is the number of mappers in use by the TDCH job.
split.by.amp
The Teradata split.by.amp source plugin does not require any space on the source database due
to its use of the tdampcopy table operator.
1.6 Teradata Plugin Privilege Requirements
The following table defines the privileges required by the database user associated with the TDCH
job when the Teradata source or target plugins are in use.
NOTE: Create table privileges are only required by the batch.insert and internal.fastload Teradata
plugins when staging tables are required.
split.by.hash
Requires Create Table Privilege: No
Requires Create View Privilege: No
Select privilege required (‘usexviews’ enabled): DBC.COLUMNSX, DBC.INDICESX
Select privilege required (‘usexviews’ disabled): DBC.COLUMNS, DBC.INDICES
split.by.value
Requires Create Table Privilege: No
Requires Create View Privilege: No
Select privilege required (‘usexviews’ enabled): DBC.COLUMNSX, DBC.INDICESX
Select privilege required (‘usexviews’ disabled): DBC.COLUMNS, DBC.INDICES
split.by.partition
Requires Create Table Privilege: Yes
Requires Create View Privilege: Yes
Select privilege required (‘usexviews’ enabled): DBC.COLUMNSX, DBC.INDICESX
Select privilege required (‘usexviews’ disabled): DBC.COLUMNS, DBC.INDICES
split.by.amp
Requires Create Table Privilege: No
Requires Create View Privilege: No
Select privilege required (‘usexviews’ enabled): DBC.COLUMNSX, DBC.TABLESX
Select privilege required (‘usexviews’ disabled): DBC.COLUMNS, DBC.TABLES
batch.insert
Requires Create Table Privilege: Yes
Requires Create View Privilege: No
Select privilege required (‘usexviews’ enabled): DBC.COLUMNSX, DBC.INDICESX, DBC.TABLESX
Select privilege required (‘usexviews’ disabled): DBC.COLUMNS, DBC.INDICES, DBC.TABLES
internal.fastload
Requires Create Table Privilege: Yes
Requires Create View Privilege: No
Select privilege required (‘usexviews’ enabled): DBC.COLUMNSX, DBC.INDICESX, DBC.TABLESX, DBC.DATABASESX, DBC.TABLE_LEVELCONSTRAINTSX, DBC.TRIGGERSX
Select privilege required (‘usexviews’ disabled): DBC.COLUMNS, DBC.INDICES, DBC.TABLES, DBC.DATABASES, DBC.TABLE_LEVELCONSTRAINTS, DBC.TRIGGERS
2 Supported Plugin Properties
TDCH jobs are configured by associating a set of properties and values with a Hadoop configuration
object. The TDCH job’s source and target plugins should be defined in the Hadoop configuration
object using TDCH’s ConnectorImportTool and ConnectorExportTool command line utilities, while
other common and plugin-centric attributes can be defined either by command line arguments or
directly via their java property names. The table below provides some metadata about the
configuration property definitions in this section.
Java Property The fully-qualified configuration property
CLI Argument If available, the command line argument associated with the property
Tool Class If a command line argument is associated with the property, the Tool
class(es) which support the command line argument
Description A description about the property and information about how it affects
the job or plugin’s operation
Required If a user-defined specification of the property is required or optional
Supported Values The values supported by the property
Default Value The default value used if the property is not specified
Case Sensitive Indicates whether the property’s string values are case sensitive
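As noted above, a property can be supplied either through a tool’s command line argument or directly via its fully-qualified java property name using Hadoop’s generic ‘-D’ option. The sketch below shows both forms for the mapper count; the jar path, JDBC URL, credentials, and table names are placeholder assumptions that vary by installation, and the commands are assembled into variables for readability rather than executed:

```shell
# Two equivalent ways to set the number of mappers for a TDCH job.
# Jar path, URL, credentials, and table names are placeholders -- adjust for your site.
JAR=/usr/lib/tdch/1.4/lib/teradata-connector-1.4.jar

# 1) Via the CLI argument exposed by the tool class:
CMD_CLI="hadoop jar $JAR com.teradata.connector.common.tool.ConnectorImportTool \
  -url jdbc:teradata://tdhost/database=testdb -username dbc -password dbc \
  -jobtype hdfs -fileformat textfile \
  -sourcetable example_table -targetpaths /user/hadoop/example_table \
  -nummappers 8"

# 2) Via the fully-qualified java property, using Hadoop's generic -D option
#    (generic options must precede the tool-specific arguments):
CMD_PROP="hadoop jar $JAR com.teradata.connector.common.tool.ConnectorImportTool \
  -Dtdch.num.mappers=8 \
  -url jdbc:teradata://tdhost/database=testdb -username dbc -password dbc \
  -jobtype hdfs -fileformat textfile \
  -sourcetable example_table -targetpaths /user/hadoop/example_table"

echo "$CMD_CLI"
echo "$CMD_PROP"
```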
2.1 Source Plugin Definition Properties
Java Property tdch.plugin.input.processor
tdch.plugin.input.format
tdch.plugin.input.serde
CLI Argument method
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The three ‘tdch.plugin.input’ properties define the source plugin. When
using the ConnectorImportTool, the source plugin will always be one of
the four Teradata source plugins. Submitting a valid value to the
ConnectorImportTool’s ‘method’ command line argument causes the
three ‘tdch.plugin.input’ properties to be assigned values associated
with the selected Teradata source plugin. Users should not define the
‘tdch.plugin.input’ properties directly.
Required no
Supported Values The following values are supported by the ConnectorImportTool’s
‘method’ argument: split.by.hash, split.by.value, split.by.partition,
split.by.amp.
Default Value split.by.hash
Case Sensitive yes
Java Property tdch.plugin.input.processor
tdch.plugin.input.format
tdch.plugin.input.serde
CLI Argument jobtype + fileformat
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The three ‘tdch.plugin.input’ properties define the source plugin. When
using the ConnectorExportTool, the source plugin will always be one of
the plugins that interface with components in the Hadoop cluster.
Submitting valid values to the ConnectorExportTool’s ‘jobtype’ and
‘fileformat’ command line arguments causes the three
‘tdch.plugin.input’ properties to be assigned values associated with the
selected Hadoop source plugin. Users should not define the
‘tdch.plugin.input’ properties directly.
Required no
Supported Values The following combinations of values are supported by the
ConnectorExportTool’s ‘jobtype’ and ‘fileformat’ arguments: hdfs +
textfile | avrofile, hive + textfile | sequencefile | rcfile | orcfile, hcat +
textfile
Default Value hdfs + textfile
Case Sensitive yes
2.2 Target Plugin Definition Properties
Java Property tdch.plugin.output.processor
tdch.plugin.output.format
tdch.plugin.output.serde
tdch.plugin.data.converter
CLI Argument jobtype + fileformat
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The three ‘tdch.plugin.output’ properties and the
‘tdch.plugin.data.converter’ property define the target plugin. When
using the ConnectorImportTool, the target plugin will always be one of
the plugins that interface with components in the Hadoop cluster.
Submitting valid values to the ConnectorImportTool’s ‘jobtype’ and
‘fileformat’ command line arguments causes the three
‘tdch.plugin.output’ properties and the ‘tdch.plugin.data.converter’
property to be assigned values associated with the selected Hadoop
target plugin. Users should not define the ‘tdch.plugin.output’
properties directly.
Required no
Supported Values The following combinations of values are supported by the
ConnectorImportTool’s ‘jobtype’ and ‘fileformat’ arguments: hdfs +
textfile | avrofile, hive + textfile | sequencefile | rcfile | orcfile, hcat +
textfile
Default Value hdfs + textfile
Case Sensitive yes
Java Property tdch.plugin.output.processor
tdch.plugin.output.format
tdch.plugin.output.serde
tdch.plugin.data.converter
CLI Argument method
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The three ‘tdch.plugin.output’ properties and the
‘tdch.plugin.data.converter’ property define the target plugin. When
using the ConnectorExportTool, the target plugin will always be one of
the two Teradata target plugins. Submitting a valid value to the
ConnectorExportTool’s ‘method’ command line argument causes the
three ‘tdch.plugin.output’ properties and the ‘tdch.plugin.data.converter’
property to be assigned values associated with the selected Teradata
target plugin. Users should not define the ‘tdch.plugin.output’
properties directly.
Required no
Supported Values The following values are supported by the ConnectorExportTool’s
‘method’ argument: batch.insert, internal.fastload
Default Value batch.insert
Case Sensitive yes
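Putting the source and target definitions together, an export job combines the ‘jobtype’/‘fileformat’ arguments (source side) with the ‘method’ argument (target side). The sketch below reads a Hive text table and writes to Teradata via batch.insert; the jar path, JDBC URL, credentials, and table names are placeholder assumptions, and the command is assembled into a variable for readability:

```shell
# Sketch of a ConnectorExportTool job: Hive textfile source, Teradata
# batch.insert target. Jar path, URL, credentials, and names are placeholders.
CMD="hadoop jar /usr/lib/tdch/1.4/lib/teradata-connector-1.4.jar \
  com.teradata.connector.common.tool.ConnectorExportTool \
  -url jdbc:teradata://tdhost/database=testdb -username dbc -password dbc \
  -jobtype hive -fileformat textfile \
  -sourcetable web_logs \
  -method batch.insert \
  -targettable web_logs"
echo "$CMD"
```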
2.3 Common Properties
Java Property tdch.num.mappers
CLI Argument nummappers
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The number of mappers used by the TDCH job. It is also the number of
splits TDCH will attempt to create when utilizing a Teradata source
plugin. This value is only a recommendation to the MR framework, and
the framework may or may not spawn the exact number of mappers
requested by the user (this is especially true when using HDFS / Hive /
HCatalog source plugins; for more information see MapReduce’s split
generation logic).
Required no
Supported Values integers > 0
Default Value 2
Java Property tdch.throttle.num.mappers
CLI Argument throttlemappers
Description Forces the TDCH job to use only as many mappers as the queue
associated with the job can handle concurrently, overriding the
user-defined ‘nummappers’ value.
Required no
Supported Values true | false
Default Value false
Java Property tdch.input.converter.record.schema
CLI Argument sourcerecordschema
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description A comma separated list of data type names or references to java classes
which extend the ConnectorDataTypeConverter class. When the
‘tdch.input.converter.record.schema’ property is specified, the
‘tdch.output.converter.record.schema’ property should also be specified
and the number of items in the comma separated lists must match. Both
lists must be specified such that any references to java classes in the
‘tdch.input.converter.record.schema’ property have their conversion
method’s return type defined in the
‘tdch.output.converter.record.schema’ property. See section 8.10 for
more information about user-defined ConnectorDataTypeConverters.
Required no
Supported Values string
Default Value Comma-separated list of data type names representing the
ConnectorRecords generated by the source plugin.
Java Property tdch.output.converter.record.schema
CLI Argument targetrecordschema
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description A comma separated list of data type names. When the
‘tdch.input.converter.record.schema’ property is specified, the
‘tdch.output.converter.record.schema’ property should also be specified
and the number of items in the comma separated lists must match. Both
lists must be specified such that any references to java classes in the
‘tdch.input.converter.record.schema’ property have their conversion
method’s return type defined in the
‘tdch.output.converter.record.schema’ property. See section 8.10 for
more information about user-defined ConnectorDataTypeConverters.
Required no
Supported Values string
Default Value Comma-separated list of data type names representing the
ConnectorRecords expected by the target plugin.
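A minimal sketch of how the two schema lists pair up, assuming a three-column job where the second column is run through a user-supplied converter class. The class name ‘com.mycompany.StringToIntConverter’ is a hypothetical example, not a TDCH-shipped converter; the argument fragments are assembled into a variable for readability:

```shell
# Hypothetical source/target record schema pairing: the converter class in
# the source list must return the type named at the same position in the
# target list. 'com.mycompany.StringToIntConverter' is an assumption.
SCHEMA_ARGS="-sourcerecordschema \"string,com.mycompany.StringToIntConverter,string\" \
 -targetrecordschema \"string,int,string\""
echo "$SCHEMA_ARGS"
```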
Java Property tdch.input.date.format
CLI Argument sourcedateformat
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The parse pattern to apply to all input string columns during conversion
to the output column type, where the output column type is
determined to be a date column.
Required no
Supported Values string
Default Value yyyy-MM-dd
Case Sensitive yes
Java Property tdch.input.time.format
CLI Argument sourcetimeformat
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The parse pattern to apply to all input string columns during conversion
to the output column type, where the output column type is
determined to be a time column.
Required no
Supported Values string
Default Value HH:mm:ss
Case Sensitive yes
Java Property tdch.input.timestamp.format
CLI Argument sourcetimestampformat
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The parse pattern to apply to all input string columns during conversion
to the output column type, where the output column type is
determined to be a timestamp column.
Required no
Supported Values string
Default Value yyyy-MM-dd HH:mm:ss.SSS
Case Sensitive yes
Java Property tdch.input.timezone.id
CLI Argument sourcetimezoneid
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The source timezone used during conversions to or from date and time
types.
Required no
Supported Values string
Default Value hadoop cluster’s default timezone
Case Sensitive no
Java Property tdch.output.date.format
CLI Argument targetdateformat
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The format of all output string columns, when the input column type is
determined to be a date column.
Required no
Supported Values string
Default Value yyyy-MM-dd
Case Sensitive yes
Java Property tdch.output.time.format
CLI Argument targettimeformat
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The format of all output string columns, when the input column type is
determined to be a time column.
Required no
Supported Values string
Default Value HH:mm:ss
Case Sensitive yes
Java Property tdch.output.timestamp.format
CLI Argument targettimestampformat
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The format of all output string columns, when the input column type is
determined to be a timestamp column.
Required no
Supported Values string
Default Value yyyy-MM-dd HH:mm:ss.SSS
Case Sensitive yes
Java Property tdch.output.timezone.id
CLI Argument targettimezoneid
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description The target timezone used during conversions to or from date and time
types.
Required no
Supported Values string
Default Value hadoop cluster’s default timezone
Case Sensitive no
Java Property tdch.output.write.phase.close
CLI Argument debugoption
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description A performance debug option which allows users to determine the
amount of overhead associated with the target plugin; enabling this
property discards the data generated by the source plugin.
Required no
Supported Values 0 | 1
Default Value 0
Java Property tdch.string.truncate
CLI Argument stringtruncate
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
com.teradata.connector.common.tool.ConnectorExportTool
Description If set to 'true', strings will be silently truncated based on the length of the
target char or varchar column. If set to 'false', when a string is larger than
the target column an exception will be thrown and the mapper will fail.
Required no
Supported Values true | false
Default Value true
2.4 Teradata Source Plugin Properties
Java Property tdch.input.teradata.jdbc.driver.class
CLI Argument classname
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The JDBC driver class used by the source Teradata plugins when
connecting to the Teradata system.
Required no
Supported Values string
Default Value com.teradata.jdbc.TeraDriver
Case Sensitive yes
Java Property tdch.input.teradata.jdbc.url
CLI Argument url
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The JDBC url used by the source Teradata plugins to connect to the
Teradata system.
Required yes
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.teradata.jdbc.user.name
CLI Argument username
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The authentication username used by the source Teradata plugins to
connect to the Teradata system. Note that the value can include Teradata
Wallet references in order to use user name information from the current
user's wallet.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.teradata.jdbc.password
CLI Argument password
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The authentication password used by the source Teradata plugins to
connect to the Teradata system. Note that the value can include Teradata
Wallet references in order to use password information from the current
user's wallet.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.teradata.database
CLI Argument
Tool Class
Description The name of the database in the Teradata system from which the source
Teradata plugins will read data; this property gets defined by specifying
a fully qualified table name for the ‘tdch.input.teradata.table’ property.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.teradata.table
CLI Argument sourcetable
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The name of the table in the Teradata system from which the source
Teradata plugins will read data. Either specify this or the
'tdch.input.teradata.query' parameter but not both. The
‘tdch.input.teradata.table’ property can be used in conjunction with
‘tdch.input.teradata.conditions’ for all Teradata source plugins.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.teradata.conditions
CLI Argument sourceconditions
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The SQL WHERE clause (with the WHERE removed) that the source
Teradata plugins will use in conjunction with the
‘tdch.input.teradata.table’ value when reading data from the Teradata
system.
Required no
Supported Values Teradata database supported conditional SQL.
Default Value
Case Sensitive yes
Java Property tdch.input.teradata.query
CLI Argument sourcequery
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The SQL query which the split.by.partition Teradata source plugin will
use to select data from Teradata database. Either specify this or the
'tdch.input.teradata.table' parameter but not both. The use of the
‘sourcequery’ command line argument forces the use of the
split.by.partition Teradata source plugin.
Required no
Supported Values Teradata database supported select SQL.
Default Value
Case Sensitive yes
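The two extraction-scoping mechanisms above are mutually exclusive: a table (optionally with conditions) works with any Teradata source plugin, while a free-form query forces split.by.partition. A sketch of both argument fragments, with hypothetical table and column names, assembled into variables for readability:

```shell
# Option A: table plus conditions (usable with any Teradata source plugin).
OPT_A="-sourcetable sales -sourceconditions \"region_id = 5 AND sale_date > DATE '2015-01-01'\""

# Option B: a free-form query; this forces the split.by.partition source plugin
# and cannot be combined with -sourcetable.
OPT_B="-sourcequery \"SELECT s.*, r.region_name FROM sales s JOIN region r ON s.region_id = r.region_id\""

echo "$OPT_A"
echo "$OPT_B"
```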
Java Property tdch.input.teradata.field.names
CLI Argument sourcefieldnames
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The names of columns that the source Teradata plugins will read from
the source table in the Teradata system. If this property is specified via
the 'sourcefieldnames' command line argument, the value should be in
comma separated format. If this property is specified directly via the '-D'
option, or any equivalent mechanism, the value should be in JSON
format. The order of the source field names must match exactly the order
of the target field names for schema mapping. If not specified, then all
columns from the source table will be retrieved.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.teradata.data.dictionary.use.xview
CLI Argument usexviews
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description If set to true, the source Teradata plugins will use XViews to get
Teradata system information. This option allows users who have limited
access privileges to run TDCH jobs, though performance may be degraded.
Required no
Supported Values true | false
Default Value false
Java Property tdch.input.teradata.access.lock
CLI Argument accesslock
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description If set to true, the source Teradata plugins lock the source Teradata table
during the data transfer phase of the TDCH job, ensuring that there is no
concurrent access to the table during the transfer.
Required no
Supported Values true | false
Default Value false
Java Property tdch.input.teradata.query.band
CLI Argument queryband
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When specified, this string is used to set the session-level query band
for the Teradata source plugins.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.teradata.batch.size
CLI Argument batchsize
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The number of rows the Teradata source plugins will attempt to fetch
from the Teradata system, up to the 1MB buffer size limit.
Required no
Supported Values Integer greater than 0
Default Value 10000
Java Property tdch.input.teradata.num.partitions
CLI Argument numpartitions
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The number of partitions in the staging table created by the
split.by.partition Teradata source plugin. If the number of mappers is
larger than the number of partitions in the staging table, the value of
‘tdch.num.mappers’ will be overridden with the
‘tdch.input.teradata.num.partitions’ value.
Required no
Supported Values integer greater than 0
Default Value If undefined, ‘tdch.input.teradata.num.partitions’ is set to
‘tdch.num.mappers’.
Java Property tdch.input.teradata.stage.database
CLI Argument stagedatabase
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The database in the Teradata system in which the Teradata source
plugins create the staging table, if a staging table is utilized.
Required no
Supported Values the name of a database in the Teradata system
Default Value the current logon database of the JDBC connection
Case Sensitive no
Java Property tdch.input.teradata.stage.table.name
CLI Argument stagetablename
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The name of the staging table created by the Teradata source plugins, if
a staging table is utilized. The staging table should not exist in the
database.
Required no
Supported Values string
Default Value The value of ‘tdch.input.teradata.table’ appended with a numerical
time in the form ‘hhmmssSSS’, separated by an underscore.
Case Sensitive no
Java Property tdch.input.teradata.stage.table.forced
CLI Argument forcestage
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description If set to true, then staging is used by the Teradata split.by.partition
source plugin, irrespective of the source table’s definition.
Required no
Supported Values true | false
Default Value false
Java Property tdch.input.teradata.split.by.column
CLI Argument splitbycolumn
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The name of a column in the source table which the Teradata
split.by.hash and split.by.value plugins use to split the source data set. If
this parameter is not specified, the first column of the table’s primary
key or primary index will be used.
Required no
Supported Values a valid table column name
Default Value The first column of the table’s primary index
Case Sensitive no
2.5 Teradata Target Plugin Properties
Java Property tdch.output.teradata.jdbc.driver.class
CLI Argument classname
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The JDBC driver class used by the target Teradata plugins when
connecting to the Teradata system.
Required no
Supported Values string
Default Value com.teradata.jdbc.TeraDriver
Case Sensitive yes
Java Property tdch.output.teradata.jdbc.url
CLI Argument url
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The JDBC url used by the target Teradata plugins to connect to the
Teradata system.
Required yes
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.teradata.jdbc.user.name
CLI Argument username
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The authentication username used by the target Teradata plugins to
connect to the Teradata system. Note that the value can include Teradata
Wallet references in order to use user name information from the current
user's wallet.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.teradata.jdbc.password
CLI Argument password
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The authentication password used by the target Teradata plugins to
connect to the Teradata system. Note that the value can include Teradata
Wallet references in order to use password information from the current
user's wallet.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.teradata.database
CLI Argument
Tool Class
Description The name of the target database in the Teradata system where the target
Teradata plugins will write data; this property gets defined by specifying
a fully qualified table name for the ‘tdch.output.teradata.table’ property.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.teradata.table
CLI Argument targettable
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The name of the target table in the Teradata system where the target
Teradata plugins will write data.
Required yes
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.teradata.field.names
CLI Argument targetfieldnames
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The names of fields that the target Teradata plugins will write to the
table in the Teradata system. If this property is specified via the
'targetfieldnames' command line argument, the value should be in
comma separated format. If this property is specified directly via the '-D'
option, or any equivalent mechanism, the value should be in JSON
format. The order of the target field names must match the order of the
source field names for schema mapping.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.teradata.data.dictionary.use.xview
CLI Argument usexviews
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description If set to true, the target Teradata plugins will use XViews to get Teradata
system information. This option allows users who have limited access
privileges to run TDCH jobs, though performance may be degraded.
Required no
Supported Values true | false
Default Value false
Java Property tdch.output.teradata.query.band
CLI Argument queryband
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description When specified, this string is used to set the session-level query band
for the Teradata target plugins.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.teradata.batch.size
CLI Argument batchsize
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The number of rows the Teradata target plugins will attempt to batch
before submitting the rows to the Teradata system, up to the 1MB buffer
size limit.
Required no
Supported Values an integer greater than 0
Default Value 10000
Java Property tdch.output.teradata.stage.database
CLI Argument stagedatabase
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The database in the Teradata system in which the Teradata target plugins
create the staging table, if a staging table is utilized.
Required no
Supported Values the name of a database in the Teradata system
Default Value the current logon database of the JDBC connection
Case Sensitive no
Java Property tdch.output.teradata.stage.table.name
CLI Argument stagetablename
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The name of the staging table created by the Teradata target plugins, if a
staging table is utilized. The staging table should not exist in the
database.
Required no
Supported Values string
Default Value The value of ‘tdch.output.teradata.table’ appended with a numerical
time in the form ‘hhmmssSSS’, separated by an underscore.
Case Sensitive no
Java Property tdch.output.teradata.stage.table.forced
CLI Argument forcestage
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description If set to true, then staging is used by the Teradata target plugins,
irrespective of the target table’s definition.
Required no
Supported Values true | false
Default Value false
Java Property tdch.output.teradata.stage.table.kept
CLI Argument keepstagetable
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description If set to true, the staging table is not dropped by the Teradata target
plugins when a failure occurs during the insert-select operation between
the staging and target tables.
Required no
Supported Values true | false
Default Value false
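The staging-related arguments above can be combined to control where the temporary table lives, whether staging is forced, and whether a failed load’s stage table is retained for inspection. A sketch with hypothetical database and table names, assembled into a variable for readability:

```shell
# Controlling staging behavior for a Teradata target (argument fragments only;
# database and table names are assumptions).
STAGE_ARGS="-stagedatabase stage_db -stagetablename web_logs_stg \
 -forcestage true -keepstagetable true"
echo "$STAGE_ARGS"
```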
Java Property tdch.output.teradata.fastload.coordinator.socket.host
CLI Argument fastloadsockethost
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The host name or ip address of the node on the Hadoop cluster where the
TDCH job is being launched. The internal.fastload Teradata target
plugin utilizes a coordinator process on the node where the TDCH job is
launched to coordinate the TDCH mappers as is specified by the
Teradata FastLoad protocol, thus any user-defined hostname or ip
address should be reachable by all of the nodes in the Hadoop cluster
such that the TDCH mappers can communicate with this coordinator
process. If this parameter is not specified, the Teradata internal.fastload
plugin will automatically lookup the ip address of the first physical
interface on the node where the TDCH job was launched, after verifying
that the interface can reach a data node in the cluster. The values of the
'dfs.datanode.dns.interface' and 'mapred.tasktracker.dns.interface' can be
used to define which interface on the local node to select. The
‘tdch.output.teradata.fastload.coordinator.socket.host’ value then gets
advertised to the TDCH mappers during the data transfer phase.
Required no
Supported Values A resolvable host name or IP address.
Default Value The ip address of the first physical interface on the node where the
TDCH job was launched, after verifying that the interface can reach a
data node in the cluster.
Java Property tdch.output.teradata.fastload.coordinator.socket.port
CLI Argument fastloadsocketport
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The port that the Teradata internal.fastload plugin coordinator will listen
on. The TDCH mappers will communicate with the coordinator on this
port.
Required no
Supported Values integer > 0
Default Value The Teradata internal.fastload plugin will automatically select an
available port starting from 8678.
Java Property tdch.output.teradata.fastload.coordinator.socket.timeout
CLI Argument fastloadsockettimeout
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The number of milliseconds the Teradata internal.fastload coordinator
will wait for connections from TDCH mappers before timing out.
Required no
Supported Values integer > 0
Default Value 480000
Java Property tdch.output.teradata.fastload.coordinator.socket.backlog
CLI Argument
Tool Class
Description The backlog for the server socket used by the Teradata internal.fastload
coordinator. The coordinator handles one task at a time; if the
coordinator cannot keep up with incoming connections from the TDCH
mappers, the connections are queued in the socket's backlog. If the number
of queued connections exceeds the backlog size, undefined errors can occur.
Required no
Supported Values integer
Default Value 256
Java Property tdch.output.teradata.error.table.name
CLI Argument errortablename
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The prefix of the name of the error table created by the internal.fastload
Teradata target plugin. Error tables are used by the FastLoad protocol to
handle records with erroneous columns.
Required no
Supported Values string
Default Value The value of ‘tdch.output.teradata.table’ appended with the strings
‘_ERR_1’ and ‘_ERR_2’ for the first and second error tables,
respectively.
Case Sensitive no
Java Property tdch.output.teradata.error.table.database
CLI Argument errortabledatabase
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The name of the database where the error tables will be created by the
internal.fastload Teradata target plugin.
Required no
Supported Values string
Default Value the current logon database of the JDBC connection
Case Sensitive no
Java Property tdch.output.teradata.error.limit
CLI Argument errorlimit
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The maximum number of records that can be sent to an error table by
the internal.fastload Teradata target plugin. If the error row count exceeds
this value, the job fails. A value of 0 means there is no limit.
Required no
Supported Values integer
Default Value 0
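For reference, several of the staging and error-table arguments above can be combined on one command line. The sketch below assumes the connection arguments (-url, -username, -password), the -jobtype/-fileformat plugin selectors, and the -method argument described in earlier sections of this tutorial; all host, database, table, and path names are placeholders.

```shell
hadoop jar /usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar \
    com.teradata.connector.common.tool.ConnectorExportTool \
    -url jdbc:teradata://dbs.example.com/database=testdb \
    -username dbc -password dbc \
    -jobtype hdfs -fileformat textfile \
    -sourcepaths /user/example/export_data \
    -targettable example_table \
    -method internal.fastload \
    -stagetablename example_table_stage \
    -errortablename example_table \
    -errorlimit 1000
```

The job fails once more than 1000 rows land in the error tables; with the default of 0 the job would run to completion regardless of the error row count.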
2.6 HDFS Source Plugin Properties
Java Property tdch.input.hdfs.paths
CLI Argument sourcepaths
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The directory in HDFS from which the source HDFS plugins will read
files.
Required yes
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.hdfs.field.names
CLI Argument sourcefieldnames
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The names of fields that the source HDFS plugins will read from the
HDFS files, in comma separated format. The order of the source field
names must match the order of the target field names for schema
mapping.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hdfs.schema
CLI Argument sourcetableschema
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description When supplied, this schema is applied to the files in HDFS as they are
read by the source HDFS plugins; the source data is then converted to this
schema before mapping and conversion to the target plugin’s schema
occurs.
Required no
Supported Values A comma separated schema definition.
Default Value By default, all columns in the source HDFS files are treated as strings by
the source HDFS plugins.
Case Sensitive no
Java Property tdch.input.hdfs.separator
CLI Argument separator
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The field separator that the HDFS textfile source plugin uses when
parsing files from HDFS.
Required no
Supported Values string
Default Value \t (tab character)
Case Sensitive yes
Java Property tdch.input.hdfs.null.string
CLI Argument nullstring
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description When specified, the HDFS textfile source plugin compares the columns
from the source HDFS files with this value, and when the column value
matches the user-defined value the column is then treated as a null. This
logic is only applied to string columns.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.hdfs.null.non.string
CLI Argument nullnonstring
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description When specified, the HDFS textfile source plugin compares the columns
from the source HDFS files with this value, and when the column value
matches the user-defined value the column is then treated as a null. This
logic is only applied to non-string columns.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.hdfs.enclosed.by
CLI Argument enclosedby
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description When specified, the HDFS textfile source plugin assumes that the given
character encloses each field from the source HDFS file on both ends.
These bounding characters are stripped before the record is passed to the
target plugin.
Required no
Supported Values single characters
Default Value
Case Sensitive yes
Java Property tdch.input.hdfs.escaped.by
CLI Argument escapedby
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description When specified, the HDFS textfile source plugin assumes that the given
character is used to escape occurrences of the
‘tdch.input.hdfs.enclosed.by’ character in the source record. These
escape characters are stripped before the record is passed to the target
plugin.
Required no
Supported Values single characters
Default Value
Case Sensitive yes
Java Property tdch.input.hdfs.avro.schema
CLI Argument avroschema
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description A string representing an inline Avro schema. This schema is applied to
the input Avro file in HDFS by the HDFS Avro source plugin. This
value takes precedence over the value supplied for
‘tdch.input.hdfs.avro.schema.file’.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.hdfs.avro.schema.file
CLI Argument avroschemafile
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The path to an Avro schema file in HDFS. This schema is applied to the
input Avro file in HDFS by the HDFS Avro source plugin.
Required no
Supported Values string
Default Value
Case Sensitive yes
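As a sketch of how the HDFS textfile source arguments above fit together on a command line (connection arguments and the -jobtype/-fileformat selectors are covered in earlier sections of this tutorial; all names and the schema are placeholders):

```shell
hadoop jar /usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar \
    com.teradata.connector.common.tool.ConnectorExportTool \
    -url jdbc:teradata://dbs.example.com/database=testdb \
    -username dbc -password dbc \
    -jobtype hdfs -fileformat textfile \
    -sourcepaths /user/example/csv_data \
    -sourcetableschema "id int, name string, amount float" \
    -separator "," \
    -nullstring "\\N" \
    -enclosedby '"' \
    -targettable example_table
```

Here the source files are parsed as comma-separated, double-quote-enclosed records, with the literal string \N treated as null for string columns.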
2.7 HDFS Target Properties
Java Property tdch.output.hdfs.paths
CLI Argument targetpaths
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The directory in HDFS to which the target HDFS plugins will write
files.
Required yes
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.hdfs.field.names
CLI Argument targetfieldnames
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The names of fields that the target HDFS plugins will write to the target
HDFS files, in comma separated format. The order of the target field
names must match the order of the source field names for schema
mapping.
Required no
Supported Values String
Default Value
Case Sensitive no
Java Property tdch.output.hdfs.schema
CLI Argument targettableschema
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When supplied, this schema is applied to the files written to HDFS by
the target HDFS plugins; the source data is converted to this
schema during the source-to-target schema conversion.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.hdfs.separator
CLI Argument separator
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The field separator that the HDFS textfile target plugin uses when
writing files to HDFS.
Required no
Supported Values string
Default Value \t
Case Sensitive yes
Java Property tdch.output.hdfs.line.separator
CLI Argument
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The line separator that the HDFS textfile target plugin uses when writing
files to HDFS.
Required no
Supported Values string
Default Value \n
Case Sensitive yes
Java Property tdch.output.hdfs.null.string
CLI Argument nullstring
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When specified, the HDFS textfile target plugin replaces null columns in
records generated by the source plugin with this value. This logic is only
applied to string columns (by default all columns written by the HDFS
textfile plugin are string type, unless overridden by the
‘tdch.output.hdfs.schema’ property).
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.hdfs.null.non.string
CLI Argument nullnonstring
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When specified, the HDFS textfile target plugin replaces null columns in
records generated by the source plugin with this value. This logic is only
applied to non-string columns (by default all columns written by the
HDFS textfile plugin are of string type, unless overridden by the
‘tdch.output.hdfs.schema’ property).
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.hdfs.enclosed.by
CLI Argument enclosedby
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When specified, the HDFS textfile target plugin encloses each column in
the source record with the user-defined characters before writing the
records to files in HDFS.
Required no
Supported Values single characters
Default Value
Case Sensitive yes
Java Property tdch.output.hdfs.escaped.by
CLI Argument escapedby
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When specified, the HDFS textfile target plugin escapes occurrences of
the ‘tdch.output.hdfs.enclosed.by’ character in the source data with the
user-defined ‘tdch.output.hdfs.escaped.by’ character.
Required no
Supported Values single characters
Default Value
Case Sensitive yes
Java Property tdch.output.hdfs.avro.schema
CLI Argument avroschema
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description A string representing an inline Avro schema. This schema is used when
generating the output Avro file in HDFS by the HDFS Avro target
plugin. This value takes precedence over the value supplied for
‘tdch.output.hdfs.avro.schema.file’.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.hdfs.avro.schema.file
CLI Argument avroschemafile
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The path to an Avro schema file in HDFS. This schema is used when
generating the output Avro file in HDFS by the HDFS Avro target
plugin.
Required no
Supported Values string
Default Value
Case Sensitive yes
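A sketch of an import job that writes Avro files using the arguments above. The Avro jars listed in section 4.2 are assumed to be in the working directory (and on HADOOP_CLASSPATH); the -fileformat value ‘avrofile’ and the connection arguments are assumed from other sections of this tutorial, and all names are placeholders.

```shell
hadoop jar /usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -libjars avro-1.7.4.jar,avro-mapred-1.7.4-hadoop2.jar \
    -url jdbc:teradata://dbs.example.com/database=testdb \
    -username dbc -password dbc \
    -jobtype hdfs -fileformat avrofile \
    -sourcetable example_table \
    -targetpaths /user/example/avro_out \
    -avroschemafile /user/example/schemas/example.avsc
```

If -avroschema were also supplied inline, it would take precedence over the schema file, per the table above.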
2.8 Hive Source Properties
Java Property tdch.input.hive.conf.file
CLI Argument hiveconf
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The path to a Hive configuration file in HDFS. The source Hive plugins
can use this file for TDCH jobs launched through remote execution or on
data nodes.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.hive.paths
CLI Argument sourcepaths
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The directory in HDFS from which the source Hive plugins will read
files. Either specify this parameter or the 'tdch.input.hive.table' parameter,
but not both.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.hive.database
CLI Argument sourcedatabase
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The name of the database in Hive from which the source Hive plugins
will read data.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hive.table
CLI Argument sourcetable
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The name of the table in Hive from which the source Hive plugins will
read data. Either specify this or the 'tdch.input.hive.paths' parameter but
not both.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hive.field.names
CLI Argument sourcefieldnames
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The names of columns that the source Hive plugins will read from the
source table in Hive. If this property is specified via the
'sourcefieldnames' command line argument, the value should be in
comma separated format. If this property is specified directly via the '-D'
option, or any equivalent mechanism, the value should be in JSON
format. The order of the source field names must match the order of
the target field names for schema mapping.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hive.table.schema
CLI Argument sourcetableschema
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description A comma separated schema specification. If defined, the source Hive
plugins will override the schema associated with the
‘tdch.input.hive.table’ table and use the ‘tdch.input.hive.table.schema’
value instead. The ‘tdch.input.hive.table.schema’ value should not
include the partition schema associated with the table.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hive.partition.schema
CLI Argument sourcepartitionschema
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description A comma separated partition schema specification. If defined, the source
Hive plugins will override the partition schema associated with the
‘tdch.input.hive.table’ table and use the ‘tdch.input.hive.partition.schema’
value instead. When this property is specified, the
'tdch.input.hive.table.schema' property must also be specified.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hive.null.string
CLI Argument nullstring
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description When specified, the Hive source plugins compare the columns from the
source Hive tables with this value, and when a column value matches
the user-defined value the column is treated as a null. This logic is
applied to all columns.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.input.hive.fields.separator
CLI Argument separator
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The field separator that the Hive textfile source plugin uses when
reading from Hive delimited tables.
Required no
Supported Values string
Default Value \u0001
Case Sensitive yes
Java Property tdch.input.hive.line.separator
CLI Argument lineseparator
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The line separator that the Hive textfile source plugin uses when reading
from Hive delimited tables.
Required no
Supported Values string
Default Value \n
Case Sensitive yes
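A sketch of an export job reading from a Hive table with the arguments above. LIB_JARS is assumed to hold the comma-separated Hive dependency jars described in section 4.2; the connection arguments and -jobtype/-fileformat selectors are covered in other sections of this tutorial, and all names are placeholders.

```shell
hadoop jar /usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar \
    com.teradata.connector.common.tool.ConnectorExportTool \
    -libjars "$LIB_JARS" \
    -url jdbc:teradata://dbs.example.com/database=testdb \
    -username dbc -password dbc \
    -jobtype hive -fileformat textfile \
    -sourcedatabase default \
    -sourcetable example_hive_table \
    -sourcefieldnames "id,name,amount" \
    -targettable example_table
```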
2.9 Hive Target Properties
Java Property tdch.output.hive.conf.file
CLI Argument hiveconf
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The path to a Hive configuration file in HDFS. The target Hive plugins
can use this file for TDCH jobs launched through remote execution or on
data nodes.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.hive.paths
CLI Argument targetpaths
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The directory in HDFS where the target Hive plugins will write files.
Either specify this parameter or the 'tdch.output.hive.table' parameter
but not both.
Required no
Supported Values string
Default Value The directory in HDFS associated with the target Hive table.
Case Sensitive yes
Java Property tdch.output.hive.database
CLI Argument targetdatabase
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The name of the database in Hive where the target Hive plugins will
write data.
Required no
Supported Values string
Default Value default
Case Sensitive no
Java Property tdch.output.hive.table
CLI Argument targettable
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The name of the table in Hive where the target Hive plugins will write
data. Either specify this parameter or the 'tdch.output.hive.paths'
parameter but not both.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.hive.field.names
CLI Argument targetfieldnames
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The names of fields that the target Hive plugins will write to the table in
Hive. If this property is specified via the 'targetfieldnames' command
line argument, the value should be in comma separated format. If this
property is specified directly via the '-D' option, or any equivalent
mechanism, the value should be in JSON format. The order of the target
field names must match the order of the source field names for schema
mapping.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.hive.table.schema
CLI Argument targettableschema
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description A comma separated schema specification. If defined, the
‘tdch.output.hive.table’ table should not exist, as the target Hive plugins
will use the ‘tdch.output.hive.table.schema’ value when creating the Hive
table before the data transfer phase. The ‘tdch.output.hive.table.schema’
value should not include the partition schema, if a partition schema is to
be associated with the target Hive table.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.hive.partition.schema
CLI Argument targetpartitionschema
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description A comma separated partition schema specification. If defined, the
‘tdch.output.hive.table’ table should not exist, as the target Hive plugins
will use the ‘tdch.output.hive.partition.schema’ value when creating the
partitions associated with the Hive table before the data transfer phase.
The 'tdch.output.hive.table.schema' parameter must also be specified
when the ‘tdch.output.hive.partition.schema’ property is specified.
Required no
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.hive.null.string
CLI Argument nullstring
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description When specified, the Hive textfile target plugin replaces null columns in
records generated by the source plugin with this value.
Required no
Supported Values string
Default Value
Case Sensitive yes
Java Property tdch.output.hive.fields.separator
CLI Argument separator
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The field separator that the Hive textfile target plugin uses when writing
to Hive delimited tables.
Required no
Supported Values string
Default Value \u0001
Case Sensitive yes
Java Property tdch.output.hive.line.separator
CLI Argument lineseparator
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The line separator that the Hive textfile target plugin uses when writing
to Hive delimited tables.
Required no
Supported Values string
Default Value \n
Case Sensitive yes
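A sketch of an import job that creates a partitioned Hive table using the schema arguments above. LIB_JARS is assumed to hold the Hive dependency jars from section 4.2, the connection arguments are covered elsewhere in this tutorial, and all names are placeholders; per the tables above, the target table must not already exist when the table and partition schemas are supplied.

```shell
hadoop jar /usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -libjars "$LIB_JARS" \
    -url jdbc:teradata://dbs.example.com/database=testdb \
    -username dbc -password dbc \
    -jobtype hive -fileformat textfile \
    -hiveconf /user/example/conf/hive-site.xml \
    -sourcetable example_table \
    -targetdatabase default \
    -targettable example_hive_table \
    -targettableschema "id int, name string" \
    -targetpartitionschema "load_date string"
```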
2.10 HCat Source Properties
Java Property tdch.input.hcat.database
CLI Argument sourcedatabase
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The name of the database in HCat from which the source HCat plugin
will read data.
Required no
Supported Values string
Default Value default
Case Sensitive yes
Java Property tdch.input.hcat.table
CLI Argument sourcetable
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The name of the table in HCat from which the source HCat plugin will
read data.
Required yes
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.input.hcat.field.names
CLI Argument sourcefieldnames
Tool Class com.teradata.connector.common.tool.ConnectorExportTool
Description The names of fields that the source HCat plugin will read from the HCat
table, in comma separated format. The order of the source field names
must match the order of the target field names for schema mapping.
Required no
Supported Values string
Default Value
Case Sensitive no
2.11 HCat Target Properties
Java Property tdch.output.hcat.database
CLI Argument targetdatabase
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The name of the database in HCat where the target HCat plugin will
write data.
Required no
Supported Values string
Default Value default
Case Sensitive no
Java Property tdch.output.hcat.table
CLI Argument targettable
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The name of the table in HCat where the target HCat plugin will write
data.
Required yes
Supported Values string
Default Value
Case Sensitive no
Java Property tdch.output.hcat.field.names
CLI Argument targetfieldnames
Tool Class com.teradata.connector.common.tool.ConnectorImportTool
Description The names of fields that the target HCat plugin will write to the HCat
table, in comma separated format. The order of the target field names
must match the order of the source field names for schema mapping.
Required no
Supported Values string
Default Value
Case Sensitive no
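A sketch of an import into an existing HCat table using the arguments above. The -jobtype value ‘hcat’ and the connection arguments are assumed from other sections of this tutorial; LIB_JARS is assumed to hold the Hive and HCatalog dependency jars from section 4.2, and all names are placeholders.

```shell
hadoop jar /usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -libjars "$LIB_JARS" \
    -url jdbc:teradata://dbs.example.com/database=testdb \
    -username dbc -password dbc \
    -jobtype hcat \
    -sourcetable example_table \
    -targetdatabase default \
    -targettable example_hcat_table \
    -targetfieldnames "id,name,amount"
```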
3 Installing Connector
3.1 Prerequisites
Teradata Database 13.0+
Hadoop cluster running a supported Hadoop distribution
o HDP
o CDH
o IBM
o MapR
TDCH is continuously certified against the latest Hadoop distributions from the most prominent
Hadoop vendors. See the SUPPORTLIST files available with TDCH for more information about the
distributions and versions of Hadoop supported by a given TDCH release.
3.2 Software Download
Currently, the latest software release of the Teradata Connector for Hadoop is available at the
following location on the Teradata Developer Exchange:
http://downloads.teradata.com/download/connectivity/teradata-connector-for-hadoop-command-line-edition
Some Hadoop vendors will distribute Teradata-specific versions of Sqoop that are ‘powered by
Teradata’; in most cases these Sqoop packages contain one of the latest versions of TDCH. These
Sqoop implementations will forward Sqoop command line arguments to TDCH via the Java API and
then rely on TDCH for data movement between Hadoop and Teradata. In this way, Hadoop users
can utilize a common Sqoop interface to launch data movement jobs using specialized vendor-
specific tools.
3.3 RPM Installation
TDCH can be installed on any node in the Hadoop cluster, though typically it is installed on a
Hadoop edge node.
TDCH is distributed in RPM form, and can be installed in a single step:
rpm -ivh teradata-connector-<version>-<hadoop(1.x|2.x)>.noarch.rpm
After RPM installation, the following directory structure should be created
(teradata-connector-1.4.1-hadoop2.x used as an example):
/usr/lib/tdch/1.4/:
README SUPPORTLIST-hadoop2.x conf lib scripts
/usr/lib/tdch/1.4/conf:
teradata-export-properties.xml.template
teradata-import-properties.xml.template
/usr/lib/tdch/1.4/lib:
tdgssconfig.jar teradata-connector-1.4.1.jar terajdbc4.jar
/usr/lib/tdch/1.4/scripts:
configureOozie.sh
The README and SUPPORTLIST files contain information about the features and fixes included
in a given TDCH release, as well as information about what versions of relevant systems (Teradata,
Hadoop, etc) are supported by a given TDCH release.
The conf directory contains a set of xml files that can be used to define default values for common
TDCH properties. To use these files, specify default values for the desired properties in Hadoop
configuration format, remove the ‘.template’ extension, and copy the files into the Hadoop conf
directory.
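For example, a minimal defaults file can be written in Hadoop configuration format as below. The property name comes from section 2.5; the value shown is only an example, and the target conf directory path varies by distribution.

```shell
# Populate a defaults file from the template; copy it into the Hadoop
# conf directory (e.g. /etc/hadoop/conf) afterwards.
cat > teradata-export-properties.xml <<'EOF'
<configuration>
  <property>
    <name>tdch.output.teradata.error.limit</name>
    <value>1000</value>
  </property>
</configuration>
EOF
```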
The lib directory contains the TDCH jar, as well as the Teradata GSS and JDBC jars. Only the
TDCH jar is required when launching TDCH jobs via the command line interface, while all three
jars are required when launching TDCH jobs via Oozie Java actions.
The scripts directory contains the configureOozie.sh script which can be used to install TDCH into
HDFS such that TDCH jobs can be launched by other Teradata products via custom Oozie Java
actions; see the following section for more information.
3.4 ConfigureOozie Installation
Once TDCH has been installed into the Linux filesystem, the configureOozie.sh script can be used to
install TDCH, its dependencies, and a set of custom Oozie workflow files into HDFS. By installing
TDCH into HDFS in this way, TDCH jobs can be launched by users and applications outside of the
Hadoop cluster via Oozie. Currently, both the Teradata Studio and Teradata Unity Data Mover
products support launching TDCH jobs from nodes outside of the cluster when TDCH is installed
into HDFS using the configureOozie.sh script.
The configureOozie.sh script supports the following arguments in the form ‘<argument>=<value>’:
nn - The Name Node host name (required)
nnHA - If the name node is HA, specify the fs.defaultFS value found in 'core-site.xml'
rm - The Resource Manager host name (uses nn parameter value if omitted)
oozie - The Oozie host name (uses nn parameter value if omitted)
webhcat - The WebHCatalog host name (uses nn parameter if omitted)
webhdfs - The WebHDFS host name (uses nn parameter if omitted)
nnPort - The Name node port number (8020 if omitted)
rmPort - The Resource Manager port number (8050 if omitted)
ooziePort - The Oozie port number (11000 if omitted)
webhcatPort - The WebHCatalog port number (50111 if omitted)
webhdfsPort - The WebHDFS port number (50070 if omitted)
hiveClientMetastorePort - The URI port the Hive client uses to connect to the metastore
server (9083 if omitted)
kerberosRealm - The name of the Kerberos realm
hiveMetaStore - The Hive Metastore host name (uses nn parameter value if omitted)
hiveMetaStoreKerberosPrincipal - The service principal for the metastore thrift server
(hive/_HOST if omitted)
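A sketch of a typical invocation; all host names are placeholders, and any parameter omitted falls back to the defaults listed above.

```shell
cd /usr/lib/tdch/1.4/scripts
./configureOozie.sh nn=namenode.example.com nnPort=8020 \
    rm=resourcemanager.example.com rmPort=8050 \
    oozie=oozie.example.com ooziePort=11000
```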
Once the configureOozie.sh script has been run, the following directory structure should exist in
HDFS:
/teradata/hadoop/lib/<all dependent hadoop jars>
/teradata/tdch/1.3/lib/teradata-connector-<version>.jar
/teradata/tdch/1.3/lib/tdgssconfig.jar
/teradata/tdch/1.3/lib/terajdbc4.jar
/teradata/tdch/1.3/oozieworkflows/<all generated oozie workflow files>
4 Launching TDCH Jobs
4.1 TDCH’s Command Line Interface
To launch a TDCH job via the command line interface, utilize the following syntax:
hadoop jar teradata-connector-<version>.jar <path.to.tool.class>
(-libjars <comma separated list of runtime dependencies>)?
<hadoop or plugin properties specified via the –D syntax>
<tool specific command line arguments>
The tool class to-be-used will depend on whether the TDCH job is exporting data from the Hadoop
cluster to Teradata or importing data into the Hadoop cluster from Teradata.
For exports from Hadoop, reference the ConnectorExportTool main class via the path
‘com.teradata.connector.common.tool.ConnectorExportTool’
For imports to Hadoop, reference the ConnectorImportTool main class via the path
‘com.teradata.connector.common.tool.ConnectorImportTool’
When running TDCH jobs which utilize the Hive or HCatalog source or target plugins, a set of
dependent jars must be distributed with the TDCH jar to the nodes on which the TDCH job will be
run. These runtime dependencies should be defined in comma-separated format using the ‘-libjars’
command line option; see the following section for more information about runtime dependencies.
Job and plugin-specific properties can be defined via the ‘-D<property>=value’ format, or via their
associated command line interface arguments. See section 2 for a full list of the properties and
arguments supported by the plugins and tool classes, and see section 5.1 for examples which utilize
the ConnectorExportTool and the ConnectorImportTool classes to launch TDCH jobs via the
command line interface.
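As a concrete instance of the syntax above (a sketch only; section 5.1 contains full working examples). The ‘tdch.num.mappers’ property controlling the degree of parallelism is assumed from section 2, and all host, table, and path names are placeholders.

```shell
hadoop jar /usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -Dtdch.num.mappers=4 \
    -url jdbc:teradata://dbs.example.com/database=testdb \
    -username dbc -password dbc \
    -jobtype hdfs -fileformat textfile \
    -sourcetable example_table \
    -targetpaths /user/example/import_out
```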
4.2 Runtime Dependencies
In some cases, TDCH supports functionality which depends on libraries that are not encapsulated in
the TDCH jar. When utilizing TDCH via the command line interface, the absolute path to the
runtime dependencies associated with the given TDCH functionality should be included in the
HADOOP_CLASSPATH environment variable as well as specified in comma-separated format for
the ‘-libjars’ argument. Most often, these runtime dependencies can be found in the lib directories of
the Hadoop components installed on the cluster. As an example, the jars associated with Hive 1.2.1
are used below. The location and version of the dependent jars will change based on the version of
Hive installed on the local cluster, and thus the version numbers associated with the jars should be
updated accordingly.
TDCH jobs which utilize the HDFS Avro plugin as a source or target are dependent on the following
Avro jar files:
avro-1.7.4.jar
avro-mapred-1.7.4-hadoop2.jar
TDCH jobs which utilize the Hive plugins as sources or targets are dependent on the following Hive
jar files:
antlr-runtime-3.4.jar
commons-dbcp-1.4.jar
commons-pool-1.5.4.jar
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
hive-cli-1.2.1.jar
hive-exec-1.2.1.jar
hive-jdbc-1.2.1.jar
hive-metastore-1.2.1.jar
jdo-api-3.0.1.jar
libfb303-0.9.2.jar
libthrift-0.9.2.jar
TDCH jobs which utilize the HCatalog plugins as sources or targets are dependent on all of the jars
associated with the Hive plugins (defined above), as well as the following HCatalog jar files:
hive-hcatalog-core-1.2.1.jar
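Since the ‘-libjars’ list and the HADOOP_CLASSPATH entries name the same jars, one list can be derived from the other. A minimal shell sketch; the jar names below are a small illustrative subset, not the full dependency list:

```shell
# Build both dependency lists from a single newline-separated jar list.
# The jar paths are illustrative; substitute the versions installed locally.
JARS="$HIVE_HOME/lib/hive-cli-1.2.1.jar
$HIVE_HOME/lib/hive-exec-1.2.1.jar
$HIVE_HOME/lib/hive-metastore-1.2.1.jar"

LIB_JARS=$(printf '%s\n' "$JARS" | paste -sd, -)          # comma-separated, for -libjars
HADOOP_CLASSPATH=$(printf '%s\n' "$JARS" | paste -sd: -)  # colon-separated

export LIB_JARS HADOOP_CLASSPATH
echo "$LIB_JARS"
```

Maintaining a single jar list this way keeps the two forms from drifting apart when Hive versions change.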
4.3 Launching TDCH with Oozie workflows
TDCH can be launched from nodes outside of the Hadoop cluster via the Oozie web application.
Oozie executes user-defined workflows - an Oozie workflow is one or more actions arranged in a
graph, defined in an XML document in HDFS. Oozie actions are Hadoop jobs (Oozie supports
MapReduce, Hive, Pig and Sqoop jobs) which get conditionally executed on the Hadoop cluster
based on the workflow definition. TDCH’s ConnectorImportTool and ConnectorExportTool classes
can be referenced directly in Oozie workflows via Oozie Java actions.
The configureOozie.sh script discussed in section 3.4 creates a set of Oozie workflows in HDFS
which can be used directly, or can be used as examples during custom workflow development.
4.4 TDCH’s Java API
For advanced TDCH users who would like to build a Java application around TDCH, launching
TDCH jobs directly via the Java API is possible, though Teradata support for custom applications
which utilize the TDCH Java API is limited. The TDCH Java API is composed of the following sets
of classes:
Utility classes
o These classes can be used to fetch information and modify the state of a given data
source or target.
Common and plugin specific configuration classes
o These classes can be used to set common and plugin specific TDCH properties in a
Hadoop Configuration object.
Job execution classes
o These classes can be used to define the plugins in use for a given TDCH job and
submit the job for execution on the MR framework.
Below is an overview of the TDCH package structure including the locations of useful utility,
configuration, and job execution classes:
com.teradata.connector.common.tool
o ConnectorJobRunner
com.teradata.connector.common.utils
o ConnectorConfiguration
o ConnectorMapredUtils
o ConnectorPlugInUtils
com.teradata.connector.hcat.utils
o HCatPlugInConfiguration
com.teradata.connector.hdfs.utils
o HdfsPlugInConfiguration
com.teradata.connector.hive.utils
o HivePlugInConfiguration
o HiveUtils
com.teradata.connector.teradata.utils
o TeradataPlugInConfiguration
o TeradataUtils
More information about TDCH’s Java API and Javadocs detailing the above classes are available
upon request.
5 Use Case Examples
5.1 Environment Variables for Runtime Dependencies
Before launching a TDCH job via the command line interface, set up the following environment
variables on an edge node in the Hadoop cluster where the TDCH job will be run. As an example,
the following environment variables reference TDCH 1.4.1 and the Hive 1.2.1 libraries. Ensure that
the HIVE_HOME and HCAT_HOME environment variables are set and the versions of the
referenced Hive libraries are updated for the given local cluster.
export LIB_JARS=
/path/to/avro/jars/avro-1.7.4.jar,
/path/to/avro/jars/avro-mapred-1.7.4-hadoop2.jar,
$HIVE_HOME/conf,
$HIVE_HOME/lib/antlr-runtime-3.4.jar,
$HIVE_HOME/lib/commons-dbcp-1.4.jar,
$HIVE_HOME/lib/commons-pool-1.5.4.jar,
$HIVE_HOME/lib/datanucleus-api-jdo-3.2.6.jar,
$HIVE_HOME/lib/datanucleus-core-3.2.10.jar,
$HIVE_HOME/lib/datanucleus-rdbms-3.2.9.jar,
$HIVE_HOME/lib/hive-cli-1.2.1.jar,
$HIVE_HOME/lib/hive-exec-1.2.1.jar,
$HIVE_HOME/lib/hive-jdbc-1.2.1.jar,
$HIVE_HOME/lib/hive-metastore-1.2.1.jar,
$HIVE_HOME/lib/jdo-api-3.0.1.jar,
$HIVE_HOME/lib/libfb303-0.9.2.jar,
$HIVE_HOME/lib/libthrift-0.9.2.jar,
$HCAT_HOME/hive-hcatalog-core-1.2.1.jar
export HADOOP_CLASSPATH=
/path/to/avro/jars/avro-1.7.4.jar:
/path/to/avro/jars/avro-mapred-1.7.4-hadoop2.jar:
$HIVE_HOME/conf:
$HIVE_HOME/lib/antlr-runtime-3.4.jar:
$HIVE_HOME/lib/commons-dbcp-1.4.jar:
$HIVE_HOME/lib/commons-pool-1.5.4.jar:
$HIVE_HOME/lib/datanucleus-api-jdo-3.2.6.jar:
$HIVE_HOME/lib/datanucleus-core-3.2.10.jar:
$HIVE_HOME/lib/datanucleus-rdbms-3.2.9.jar:
$HIVE_HOME/lib/hive-cli-1.2.1.jar:
$HIVE_HOME/lib/hive-exec-1.2.1.jar:
$HIVE_HOME/lib/hive-jdbc-1.2.1.jar:
$HIVE_HOME/lib/hive-metastore-1.2.1.jar:
$HIVE_HOME/lib/jdo-api-3.0.1.jar:
$HIVE_HOME/lib/libfb303-0.9.2.jar:
$HIVE_HOME/lib/libthrift-0.9.2.jar:
$HCAT_HOME/hive-hcatalog-core-1.2.1.jar
export USERLIBTDCH=/usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar
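Because missing or version-mismatched jars are a common cause of launch failures, it can help to verify each entry in LIB_JARS before submitting a job. A small sketch; the fallback list is purely illustrative and is only used when LIB_JARS is unset:

```shell
# Verify that every path in the comma-separated LIB_JARS list exists.
# The fallback value is illustrative; normally LIB_JARS is set as shown above.
CHECK_LIST="${LIB_JARS:-/bin/sh,/tmp/no-such-example.jar}"

MISSING=0
for entry in $(printf '%s' "$CHECK_LIST" | tr ',' ' '); do
  if [ ! -e "$entry" ]; then
    echo "missing: $entry"
    MISSING=$((MISSING + 1))
  fi
done
echo "$MISSING missing entries"
```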
5.2 Use Case: Import to HDFS File from Teradata Table
5.2.1 Setup: Create a Teradata Table with Data
Execute the following through the Teradata BTEQ application.
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example1_td (
c1 INT
,c2 VARCHAR(100)
);
INSERT INTO example1_td VALUES (1,'foo');
.LOGOFF
5.2.2 Run: ConnectorImportTool command
Execute the following on the Hadoop edge node. Note that the directory referenced by ‘-targetpaths’
must not yet exist, and that the split.by.hash import method splits the source data set using the hash
values of the column named by ‘-splitbycolumn’.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hdfs
-sourcetable example1_td
-nummappers 1
-separator ','
-targetpaths /user/mapred/ex1_hdfs
-method split.by.hash
-splitbycolumn c1
5.3 Use Case: Export from HDFS File to Teradata Table
5.3.1 Setup: Create a Teradata Table
Execute the following through the Teradata BTEQ application.
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example2_td (
c1 INT
,c2 VARCHAR(100)
);
.LOGOFF
5.3.2 Setup: Create an HDFS File
Execute the following on the Hadoop edge node.
echo "2,acme" > /tmp/example2_hdfs_data
hadoop fs -mkdir /user/mapred/example2_hdfs
hadoop fs -put /tmp/example2_hdfs_data /user/mapred/example2_hdfs/01
rm /tmp/example2_hdfs_data
5.3.3 Run: ConnectorExportTool command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hdfs
-sourcepaths /user/mapred/example2_hdfs
-nummappers 1
-separator ','
-targettable example2_td
-forcestage true
-stagedatabase testdb
-stagetablename export_hdfs_stage
-method internal.fastload
In the command above, ‘-jobtype hdfs’ sets the job type, ‘-sourcepaths’ sets the source HDFS path,
‘-separator’ sets the field delimiter (here a comma), and ‘-targettable’ sets the target Teradata table.
The ‘-forcestage true’ argument forces the creation of a stage table, ‘-stagedatabase’ and
‘-stagetablename’ define the database in which the stage table is created and its name, and
‘-method internal.fastload’ selects the internal.fastload export method.
5.4 Use Case: Import to Existing Hive Table from Teradata Table
5.4.1 Setup: Create a Teradata Table with Data
Execute the following through the Teradata BTEQ application.
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example3_td (
c1 INT
,c2 VARCHAR(100)
);
INSERT INTO example3_td VALUES (3,'bar');
.LOGOFF
5.4.2 Setup: Create a Hive Table
Execute the following through the Hive command line interface on the Hadoop edge node
CREATE TABLE example3_hive (
h1 INT
, h2 STRING
) STORED AS RCFILE;
5.4.3 Run: ConnectorImportTool Command
Execute the following on the Hadoop edge node
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat rcfile
-sourcetable example3_td
-nummappers 1
-targettable example3_hive
5.4.4 Run: ConnectorImportTool Command
Execute the following on the Hadoop edge node. Notice this command uses the ‘-sourcequery’
command line argument to define the data that the source Teradata plugin should fetch.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat rcfile
-sourcequery "select * from example3_td"
-nummappers 1
-targettable example3_hive
5.5 Use Case: Import to New Hive Table from Teradata Table
5.5.1 Setup: Create a Teradata Table with Data
Execute the following through the Teradata BTEQ application.
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example4_td (
c1 INT ,c2 VARCHAR(100)
,c3 FLOAT
);
INSERT INTO example4_td VALUES (3,'bar',2.35);
.LOGOFF
5.5.2 Run: ConnectorImportTool Command
Execute the following on the Hadoop edge node. Notice that a new partitioned Hive table with the
specified table schema and partition schema will be created by the Hive target plugin during this job.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat rcfile
-sourcetable example4_td
-sourcefieldnames "c1,c3,c2"
-nummappers 1
-targetdatabase default
-targettable example4_hive
-targettableschema "h1 int,h2 float"
-targetpartitionschema "h3 string"
-targetfieldnames "h1,h2,h3 "
In the command above, ‘-jobtype hive’ and ‘-fileformat rcfile’ set the job type and file format,
‘-sourcetable’ sets the source Teradata table, and ‘-targetdatabase’ and ‘-targettable’ name the Hive
database and table to be created. The ‘-targettableschema’ and ‘-targetpartitionschema’ arguments
define the table schema and partition schema for the new Hive table, while ‘-targetfieldnames’ lists
all columns in the target Hive table in the same order as the target schema.
5.6 Use Case: Export from Hive Table to Teradata Table
5.6.1 Setup: Create a Teradata Table
Execute the following through the Teradata BTEQ application.
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example5_td (
c1 INT
, c2 VARCHAR(100)
);
.LOGOFF
5.6.2 Setup: Create a Hive Table with Data
Execute the following through the Hive command line interface on the Hadoop edge node.
CREATE TABLE example5_hive (
h1 INT
, h2 STRING
) row format delimited fields terminated by ',' stored as textfile;
CREATE TABLE example6_hive (
h1 INT
, h2 STRING
) stored as textfile;
Execute the following on the Hadoop edge node.
echo "4,acme">/tmp/example5_hive_data
hive -e "LOAD DATA LOCAL INPATH '/tmp/example5_hive_data' INTO TABLE
example5_hive;"
hive -e "INSERT OVERWRITE TABLE example6_hive SELECT * FROM
example5_hive;"
rm /tmp/example5_hive_data
5.6.3 Run: ConnectorExportTool Command
Execute the following on the Hadoop edge node. Note that the separator is the Unicode character
‘\u0001’, Hive’s default field delimiter for text format tables.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat textfile
-sourcetable example6_hive
-nummappers 1
-targettable example5_td
-separator '\u0001'
5.7 Use Case: Import to Hive Partitioned Table from Teradata PPI Table
5.7.1 Setup: Create a Teradata PPI Table with Data
Execute the following through the Teradata BTEQ application.
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example6_td (
c1 INT
, c2 DATE
) PRIMARY INDEX (c1)
PARTITION BY RANGE_N(c2 BETWEEN DATE '2006-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH);
INSERT INTO example6_td VALUES (5,DATE '2012-02-18');
.LOGOFF
5.7.2 Setup: Create a Hive Partitioned Table
Execute the following through the Hive command line interface on the Hadoop edge node.
CREATE TABLE example6_hive (
h1 INT
) PARTITIONED BY (h2 STRING)
STORED AS RCFILE;
5.7.3 Run: ConnectorImportTool Command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat rcfile
-sourcetable example6_td
-sourcefieldnames "c1,c2"
-nummappers 1
-targettable example6_hive
-targetfieldnames "h1,h2"
Specify both source and target field names so TDCH knows how to map the Teradata columns to
the Hive partition columns.
5.8 Use Case: Export from Hive Partitioned Table to Teradata PPI Table
5.8.1 Setup: Create a Teradata PPI Table
Execute the following through the Teradata BTEQ application.
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example7_td (
c1 INT
, c2 DATE
) PRIMARY INDEX (c1)
PARTITION BY RANGE_N(c2 BETWEEN DATE '2006-01-01' AND DATE '2012-12-31' EACH INTERVAL '1' MONTH);
.LOGOFF
5.8.2 Setup: Create a Hive Partitioned Table with Data
Execute the following on the Hadoop edge node.
echo "6,2012-02-18" > /tmp/example7_hive_data
Execute the following through the Hive command line interface on the Hadoop edge node.
CREATE TABLE example7_tmp (h1 INT, h2 STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
CREATE TABLE example7_hive (
h1 INT
) PARTITIONED BY (h2 STRING)
STORED AS RCFILE;
LOAD DATA LOCAL INPATH '/tmp/example7_hive_data' INTO TABLE
example7_tmp;
INSERT INTO TABLE example7_hive PARTITION (h2='2012-02-18') SELECT h1
FROM example7_tmp;
DROP TABLE example7_tmp;
5.8.3 Run: ConnectorExportTool command
Execute the following on the Hadoop edge node. Specify both source and target field names so
TDCH knows how to map the Hive partition column to the corresponding Teradata column.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hive
-fileformat rcfile
-sourcetable example7_hive
-sourcefieldnames "h1,h2"
-nummappers 1
-targettable example7_td
-targetfieldnames "c1,c2"
5.9 Use Case: Import to Teradata Table from HCatalog Table
5.9.1 Setup: Create a Teradata Table with Data
Execute the following through the Teradata BTEQ application.
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example8_td (
c1 INT
, c2 VARCHAR(100)
);
INSERT INTO example8_td VALUES (7,'bar');
.LOGOFF
5.9.2 Setup: Create a Hive Table
Execute the following on the Hive command line
CREATE TABLE example8_hive (
h1 INT
, h2 STRING
) STORED AS RCFILE;
5.9.3 Run: ConnectorImportTool Command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hcat
-sourcetable example8_td
-nummappers 1
-targettable example8_hive
5.10 Use Case: Export from HCatalog Table to Teradata Table
5.10.1 Setup: Create a Teradata Table
Execute the following through the Teradata BTEQ application.
.LOGON testsystem/testuser
DATABASE testdb;
CREATE MULTISET TABLE example9_td (
c1 INT
, c2 VARCHAR(100)
);
.LOGOFF
5.10.2 Setup: Create a Hive Table with Data
Execute the following through the Hive command line interface on the Hadoop edge node.
CREATE TABLE example9_hive (
h1 INT
, h2 STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
Execute the following on the Hadoop edge node.
echo "8,acme">/tmp/example9_hive_data
hive -e "LOAD DATA LOCAL INPATH '/tmp/example9_hive_data' INTO TABLE
example9_hive;"
rm /tmp/example9_hive_data
5.10.3 Run: ConnectorExportTool Command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-jobtype hcat
-sourcetable example9_hive
-nummappers 1
-targettable example9_td
5.11 Use Case: Import to Teradata Table from ORC File Hive Table
5.11.1 Run: ConnectorImportTool Command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-fileformat orcfile
-jobtype hive
-targettable import_hive_fun22
-sourcetable import_hive_fun2
-targettableschema "h1 float,h2 double,h3 int,h4 string,h5 string"
-nummappers 2
-separator ','
-usexviews false
5.12 Use Case: Export from ORC File HCat Table to Teradata Table
5.12.1 Setup: Create the Source HCatalog Table
Execute the following through the Hive command line interface on the Hadoop edge node.
CREATE TABLE export_hcat_fun1(h1 float,h2 double,h3 int,h4 string,h5
string) row format delimited fields terminated by ',' stored as orc;
5.12.2 Run: ConnectorExportTool Command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-fileformat orcfile
-jobtype hcat
-sourcedatabase default
-sourcetable export_hcat_fun1
-nummappers 2
-separator ','
-targettable export_hcat_fun1
5.13 Use Case: Import to Teradata Table from Avro File in HDFS
5.13.1 Setup: Create a Teradata Table
Execute the following through the Teradata BTEQ application.
create multiset table tdtbl(i int, b bigint, s smallint, f float,
c clob, blb blob, chr char(100), vc varchar(1000), d date, ts
timestamp);
insert into tdtbl(1, 2, 3, 4, '{"string":"ag123"}', '3132'xb,
'{"string":"char6"}', '{"string":"varchar7"}', date, current_timestamp);
insert into tdtbl(1, 2, 3, 4, '{"string":"ag1567"}', '3132'xb,
'{"string":"char3ss23"}', '{"string":"varchara2sd2"}', date,
current_timestamp);
insert into tdtbl(null, null, null, null, null, null, null, null,
null, null);
5.13.2 Setup: Prepare the Avro Schema File
Create a file named 'schema_default.avsc' on the Hadoop edge node. Copy the following Avro schema
definition into the new file.
{
"type" : "record",
"name" : "xxre",
"fields" : [ {
"name" : "col1",
"type" : "int", "default":1
}, {
"name" : "col2",
"type" : "long"
}, {
"name" : "col3",
"type" : "float"
}, {
"name" : "col4",
"type" : "double", "default":1.0
}, {
"name" : "col5",
"type" : "string", "default":"xsse"
} ]
}
5.13.3 Run: ConnectorImportTool Command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorImportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-fileformat avrofile
-jobtype hdfs
-targetpaths /user/hduser/avro_import
-nummappers 2
-sourcetable tdtbl
-usexviews false
-avroschemafile file:///home/hduser/tdch/manual/schema_default.avsc
-targetfieldnames "col2,col3"
-sourcefieldnames "i,s"
5.14 Use Case: Export from Avro to Teradata Table
5.14.1 Setup: Prepare the Source Avro File
Execute the following on the Hadoop edge node.
hadoop fs -mkdir /user/hduser/avro_export
hadoop fs -cp /user/hduser/avro_import/*.avro
/user/hduser/avro_export
5.14.2 Setup: Create a Teradata Table
Execute the following through the Teradata BTEQ application.
create multiset table tdtbl_export(i int, b bigint, s smallint);
5.14.3 Run: ConnectorExportTool Command
Execute the following on the Hadoop edge node.
hadoop jar $USERLIBTDCH
com.teradata.connector.common.tool.ConnectorExportTool
-libjars $LIB_JARS
-url jdbc:teradata://testsystem/database=testdb
-username testuser
-password testpassword
-classname com.teradata.jdbc.TeraDriver
-fileformat avrofile
-jobtype hdfs
-sourcepaths /user/hduser/avro_export
-nummappers 2
-targettable tdtbl_export
-usexviews false
-avroschemafile file:///home/hduser/tdch/manual/schema_default.avsc
-sourcefieldnames "col2,col3"
-targetfieldnames "i,s"
6 Performance Tuning
6.1 Selecting the Number of Mappers
TDCH is a highly scalable application which runs atop the MapReduce framework, and thus its
performance is directly related to the number of mappers associated with a given TDCH job. In most
cases, TDCH jobs should run with as many mappers as the given Hadoop and Teradata systems, and
their administrators, will allow. The number of mappers will depend on whether the Hadoop and
Teradata systems are used for mixed workloads or are dedicated to data movement tasks for a certain
period of time, and will also depend on the mechanisms TDCH utilizes to interface with both
systems. This section attempts to describe the factors that come into play when defining the number
of mappers associated with a given TDCH job.
6.1.1 Maximum Number of Mappers on the Hadoop Cluster
The maximum number of mappers that can be run on a Hadoop cluster can be calculated in one of
two ways. For MRv1 clusters, the maximum number of mappers is defined by the number of data
nodes running tasktracker processes multiplied by the ‘mapred.tasktracker.map.tasks.maximum’
value. For MRv2/YARN-enabled clusters, a more complex calculation based on the maximum
container memory and virtual CPUs per nodemanager process should be used to determine the
maximum number of containers (and thus mappers) that can be run on the cluster; see Hadoop
documentation for more information on this calculation. In most cases, TDCH jobs will not be able
to utilize the maximum number of mappers on the Hadoop cluster due to the constraints described in
the following sections.
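As an illustration, the MRv2/YARN calculation can be sketched as follows; every node count and container size below is an assumption for a hypothetical cluster, not a default, and should be replaced with the values configured on the local cluster:

```shell
# Rough MRv2/YARN estimate of the maximum concurrent map containers.
# All values are illustrative assumptions for a hypothetical cluster.
NODES=20          # nodes running a nodemanager process
NM_MEM_MB=49152   # memory available to containers per nodemanager
NM_VCORES=16      # virtual cores available to containers per nodemanager
MAP_MEM_MB=2048   # memory requested per map container
MAP_VCORES=1      # virtual cores requested per map container

BY_MEM=$((NM_MEM_MB / MAP_MEM_MB))      # containers per node by memory
BY_CPU=$((NM_VCORES / MAP_VCORES))      # containers per node by vcores
if [ "$BY_MEM" -lt "$BY_CPU" ]; then PER_NODE=$BY_MEM; else PER_NODE=$BY_CPU; fi
MAX_MAPPERS=$((NODES * PER_NODE))
echo "max mappers: $MAX_MAPPERS"
```

With these assumed values, the per-node limit is set by vcores (16) rather than memory (24), giving 320 containers across the cluster.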
6.1.2 Mixed Workload Hadoop Clusters and Schedulers
The increased adoption of Hadoop as a multi-purpose processing framework has yielded mixed
workload clusters. These clusters are utilized by many users or teams simultaneously, the cluster’s
resources are shared, and thus one job will most likely never be able to utilize the maximum number
of mappers supported by the cluster. In some cases, the Hadoop cluster admin will implement the
capacity or fair scheduler in YARN-enabled clusters to divvy up the cluster’s resources between the
users or groups which utilize the cluster. When submitting TDCH jobs to scheduler-enabled YARN
clusters, the maximum number of mappers that can be associated with a given TDCH job will
depend on two factors:
The scheduler’s queue definition for the queue associated with the TDCH job; the queue
definition will include information about the minimum and maximum number of containers
offered by the queue, as well as whether the scheduler supports preemption.
Whether the given TDCH job supports preemption if the associated YARN scheduler and
queue have enabled preemption.
To determine the maximum number of mappers that can be run on a given scheduler-enabled YARN
cluster, see the Hadoop documentation for the scheduler that has been implemented in YARN on the
given cluster. See the following section for more information on which TDCH jobs support
preemption.
6.1.3 TDCH Support for Preemption
In some cases, the queues associated with a given YARN scheduler will be configured to support
elastic scheduling. This means that a given queue can grow in size to utilize the resources associated
with other queues when those resources are not in use; if these inactive queues become active while
the original job is running, containers associated with the original job will be preempted, and these
containers will be restarted when resources associated with the elastic queue become available. All
of TDCH’s source plugins, and all of TDCH’s target plugins except the TDCH internal.fastload
Teradata target plugin, support preemption. This means that all TDCH jobs, with the exception of
jobs which utilize the TDCH internal.fastload target plugin, can be run with more mappers than are
defined by maximum amount of containers associated with the given queue on scheduler-enabled,
preemption-enabled YARN clusters. Jobs which utilize the TDCH internal.fastload target plugin can
also be run in this environment, but may not utilize elastically-available resources. Again, see the
Hadoop documentation for the given scheduler to determine the maximum number of mappers
supported by a given queue.
6.1.4 Maximum Number of Sessions on Teradata
The maximum number of sessions on a Teradata system is equal to the number of AMPs on the
Teradata system. Each mapper associated with a TDCH job connects to the Teradata system via a
JDBC connection, and Teradata then associates a session with that connection. Thus, the maximum
number of mappers for a given TDCH job is equal to the number of sessions supported by the given
Teradata database. It is also important to note that various Teradata workload management
applications and internal properties can further limit the number of sessions available at any given
time; this should also be taken into account when defining the number of mappers for a TDCH job.
At this point, the following equation should be used to determine the maximum number of mappers
for a TDCH job:
Max TDCH mappers = min(number of containers available on the Hadoop cluster,
number of sessions available on Teradata)
6.1.5 General Guidelines and Measuring Performance
In most cases, the best way to determine the number of mappers that should be associated with a
given TDCH job is to start with a small number of mappers (< 10) and run the TDCH job against a
small subset of the source data. This will ensure that the source and target plugins are configured
correctly, and also provide some insight into the throughput to expect for the given job. To then
determine the maximum throughput possible for the job, double the number of mappers in use until
one of the previously discussed mapper-related maximums is hit.
To measure the throughput for the data transfer stage of the job, subtract the time at which the
‘Running job’ message is displayed on the console from the time at which the ‘Job completed’
message is displayed on the console to find the elapsed time and divide the source data set size by
this time. When running jobs using the ConnectorImportTool, job progress can be viewed via the
YARN application web interface, or by periodically viewing the size of the part files created by the
TDCH job in HDFS. When running jobs using the ConnectorExportTool, job progress can be
viewed via Teradata-specific monitoring utilities.
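The elapsed-time throughput calculation above can be sketched as follows; the timestamps and data set size are illustrative assumptions:

```shell
# Compute data transfer throughput from the 'Running job' and
# 'Job completed' console timestamps. All values are illustrative.
START_EPOCH=1449960000    # epoch seconds when 'Running job' was printed
END_EPOCH=1449960600      # epoch seconds when 'Job completed' was printed
SOURCE_BYTES=12884901888  # size of the source data set (12 GB)

ELAPSED=$((END_EPOCH - START_EPOCH))
MB_PER_SEC=$((SOURCE_BYTES / ELAPSED / 1048576))
echo "transferred at ~${MB_PER_SEC} MB/s over ${ELAPSED} seconds"
```

Comparing this figure after each doubling of the mapper count shows when one of the mapper-related maximums has been reached.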
6.2 Selecting a Teradata Target Plugin
This section provides suggestions on how to select a Teradata target plugin, and provides some
information about the performance of the various plugins.
batch.insert
The Teradata batch.insert target plugin utilizes uncoordinated SQL sessions when connecting
with Teradata. This plugin should be used when loading a small amount of data, or when there
are complex data types in the target table which are not supported by the Teradata
internal.fastload target plugin. This plugin should also be used for long running jobs on YARN
clusters where preemptive scheduling is enabled or regular failures are expected.
internal.fastload
The Teradata internal.fastload target plugin utilizes coordinated FastLoad sessions when
connecting with Teradata, and thus this plugin is more performant than the Teradata batch.insert
target plugin. This plugin should be used when transferring large amounts of data from large
Hadoop systems to large Teradata systems. This plugin should not be used for long running jobs
on YARN clusters where preemptive scheduling could cause mappers to be restarted or where
regular failures are expected, as the FastLoad protocol does not support restarted sessions and the
job will fail in this scenario.
6.3 Selecting a Teradata Source Plugin
This section provides suggestions on how to select a Teradata source plugin, and provides some
information about the performance of the various plugins.
split.by.value
The Teradata split.by.value source plugin performs the best when the split-by column has more
distinct values than the TDCH job has mappers, and when the distinct values in the split-by
column evenly partition the source dataset. The plugin has each mapper submit a range-based
SELECT query to Teradata, fetching the subset of data associated with the mapper’s designated
range. Thus, when the source data set is not evenly partitioned by the values in the split-by
column, the work associated with the data transfer will be skewed between the mappers, and the
job will take longer to complete.
split.by.hash
The Teradata split.by.hash source plugin performs the best when the split-by column has more
distinct hash values than the TDCH job has mappers, and when the distinct hash values in the
split-by column evenly partition the source dataset. The plugin has each mapper submit a range-
based SELECT query to Teradata, fetching the subset of data associated with the mapper’s
designated range. Thus, when the source data set is not evenly partitioned by the hash values in
the split-by column, the work associated with the data transfer will be skewed between the
mappers, and the job will take longer to complete.
split.by.partition
The Teradata split.by.partition source plugin performs the best when the source table is evenly
partitioned, the partition column(s) are also indexed, and the number of partitions in the
source table is equal to the number of mappers used by the TDCH job. The plugin has each
mapper submit a range-based SELECT query to Teradata, fetching the subset of data associated
with one or more partitions. The plugin is the only Teradata source plugin to support defining the
source data set via an arbitrarily complex select query; in this scenario a staging table is used.
The number of partitions associated with the staging table created by the Teradata
split.by.partition source plugin can be explicitly defined by the user, so the plugin is the most
tunable of the four Teradata source plugins.
split.by.amp
The Teradata split.by.amp source plugin performs the best when the source data set is evenly
distributed on the amps in the Teradata system, and when the number of mappers used by the
TDCH job is equivalent to the number of amps in the Teradata system. The plugin has each
mapper submit a table operator-based SELECT query to Teradata, fetching the subset of data
associated with the mapper’s designated amps. The plugin’s use of the table operator makes it
the most performant of the four Teradata source plugins, but the plugin can only be used against
Teradata systems which have the table operator available (14.10+).
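As a sketch, the following command illustrates how a source plugin might be selected on the command line. The connection values, table, and paths are placeholders, and the ‘-method’ and ‘-splitbycolumn’ argument names should be verified against the TDCH command line reference for your release:

```shell
# Hypothetical import using the split.by.hash source plugin; all connection
# values and paths below are placeholders.
hadoop jar $TDCH_JAR \
  com.teradata.connector.common.tool.ConnectorImportTool \
  -url jdbc:teradata://tdhost/DATABASE=mydb \
  -username dbuser \
  -password dbpassword \
  -jobtype hdfs \
  -sourcetable npv_m \
  -targetpaths /user/hduser/npv_m \
  -method split.by.hash \
  -splitbycolumn page_name \
  -nummappers 6
```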
6.4 Increasing the Batchsize Value
The TDCH ‘tdch.input.teradata.batchsize’ and ‘tdch.output.teradata.batchsize’ properties, and
their associated ‘batchsize’ command line argument, define how many records each mapper fetches
from the database in one receive operation, or how many records are batched before they are
submitted to the database in one send operation. Increasing the batchsize value often leads to
increased throughput, as more data is sent between the TDCH mappers and Teradata per
send/receive operation.
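As a sketch, the batch size can be raised either through the Hadoop property syntax or through the command line argument; the value 50000 and the connection details below are illustrative placeholders:

```shell
# Hypothetical export with an increased batch size. The property form
# (-D...) must precede the other TDCH arguments; alternatively, the
# equivalent command line argument form is:  -batchsize 50000
hadoop jar $TDCH_JAR \
  com.teradata.connector.common.tool.ConnectorExportTool \
  -Dtdch.output.teradata.batchsize=50000 \
  -url jdbc:teradata://tdhost/DATABASE=mydb \
  -username dbuser \
  -password dbpassword \
  -sourcepaths /user/hduser/source_data \
  -targettable my_target_table
```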
6.5 Configuring the JDBC Driver
Each mapper uses a JDBC connection to interact with the Teradata database. The TDCH jar includes
the latest version of the Teradata JDBC driver and provides users with the ability to interface directly
with the driver via the ‘tdch.input.teradata.jdbc.url’ and ‘tdch.output.teradata.jdbc.url’ properties.
Using these properties, TDCH users can fine tune the low-level data transfer characteristics of the
driver by appending JDBC options to the JDBC URL properties. Internal testing has shown that
disabling the JDBC CHATTER option increases throughput when the Teradata and Hadoop systems
are localized. More information about the JDBC driver and its supported options can be found at the
online version of the Teradata JDBC Driver Reference.
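As a sketch of the URL syntax, driver options are appended to the JDBC URL as comma-separated name=value pairs. The hostname and database below are placeholders, and any option names and values (including the CHATTER option mentioned above) should be confirmed in the Teradata JDBC Driver Reference:

```shell
# Hypothetical tuned JDBC URL passed to TDCH via the output URL property.
-Dtdch.output.teradata.jdbc.url='jdbc:teradata://tdhost/DATABASE=mydb,CHARSET=UTF8'
```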
7 Troubleshooting
7.1 Troubleshooting Requirements
In order to troubleshoot functional failures or performance issues with TDCH jobs, the following information
should be available:
The version of the TDCH jar in use
The full TDCH or Sqoop command, all arguments included
The console logs from the TDCH or Sqoop command
The logs from a mapper associated with the TDCH job
Hadoop system information
o Hadoop version
o Hadoop distribution
o YARN scheduler configuration
The Hadoop job configuration
o The TDCH and Hadoop properties and their values
Teradata system information
o Teradata version
o Teradata system amp count
o Any workload management configurations
Source and target table definitions, if applicable
o Teradata table definition
o Hive table definition
o Hive configuration defaults
Example source data
NOTE: In some cases, it is also useful to enable DEBUG messages in the console and mapper logs.
To enable DEBUG messages in the console logs, update the HADOOP_ROOT_LOGGER
environment variable with the command ‘export HADOOP_ROOT_LOGGER=DEBUG,console’.
To enable DEBUG messages in the mapper logs, add the following property definition to the TDCH
command ‘-Dmapred.map.log.level=DEBUG’.
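The two settings above can be combined in a single session; a minimal sketch, where the hadoop invocation itself is a placeholder:

```shell
# Enable DEBUG output in the console logs for subsequent TDCH commands
export HADOOP_ROOT_LOGGER=DEBUG,console
echo "$HADOOP_ROOT_LOGGER"

# Enable DEBUG output in the mapper logs by adding the property definition
# to the TDCH command, e.g. (placeholder invocation):
#   hadoop jar $TDCH_JAR com.teradata.connector.common.tool.ConnectorImportTool \
#     -Dmapred.map.log.level=DEBUG ...
```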
7.2 Troubleshooting Overview
The chart below provides an overview of our suggested troubleshooting process:
Problem Area: JDBC, Database, Hadoop, Connector, or Data
Job Stage: check for issues with each stage
o Setup (user error, database configuration, hadoop configuration, database staging)
o Run (code exception, data quality, data type handling)
o Cleanup (database configuration, database staging cleanup)
Issue Type
o Functional (look for exceptions)
o Performance (go through checklist)
Customer Scenario (understand command parameters): Import or Export
7.3 Functional: Understand Exceptions
NOTE: The example console output contained in this section has not been updated to reflect the latest
version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though not
identical.
Look in the console output for
The very last error code
o 10000: runtime (look for a database error code, JDBC error code, or backtrace)
o Others: pre-defined (checked) errors by TDCH
The very first instance of exception messages
Examples:
com.teradata.hadoop.exception.TeradataHadoopSQLException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.13] [Error 5628] [SQLState HY000]
Column h3 not found in mydb.import_hive_table.
(omitted)……
13/04/03 09:52:48 INFO tool.TeradataImportTool: job completed with exit code 10000

com.teradata.hadoop.exception.TeradataHadoopSchemaException: Field data type is invalid
at com.teradata.hadoop.utils.TeradataSchemaUtils.lookupDataTypeByTypeName(TeradataSchemaUtils.java:1469)
(omitted)……
13/04/03 09:50:02 INFO tool.TeradataImportTool: job completed with exit code 14006
7.4 Functional: Data Issues
This category of issues occurs at runtime (most often with the internal.fastload method), and the
root cause is usually not obvious. We suggest checking the following:
Does the schema match the data?
Is the separator correct?
Does the table DDL have time or timestamp columns?
o Check if tnano/tsnano setting has been specified to JDBC URL
Does the table DDL have Unicode columns?
o Check if CHARSET setting has been specified to JDBC URL
Does the table DDL have decimal columns?
o Before release 1.0.6, this may cause issues
Check Fastload error tables to see what’s inside
7.5 Performance: Back of the Envelope Guide
No throughput can exceed the total I/O or network transfer capacity of the least powerful
component in the overall solution. Our methodology is therefore to understand the theoretical
maximum for the configuration and work backwards from it.
Max theoretical throughput <= MIN (
∑(Ttd-io), ∑(Ttd-transfer),
∑(Thadoop-io), ∑(Thadoop-transfer), Tnetwork-transfer
)
Therefore we should:
Watch out for node-level CPU saturation (including core saturation), because “no CPU = no
work can be done”.
If all nodes are saturated on either the Hadoop or Teradata side, consider expanding the
system footprint and/or lowering concurrency.
If one node is much busier than the other nodes within either Hadoop or Teradata, try to
balance the workload skew.
If both Hadoop and Teradata are mostly idle, look for obvious user mistakes or configuration
issues, and if possible, increase concurrency.
And here is the checklist we could go through in case of slow performance:
User Settings
Teradata JDBC URL
o Connecting to the MPP system name (and not a single node)?
o Connecting through correct (fast) network interface?
• /etc/hosts
• ifconfig
Using best-performance methods?
Using the optimal number of mappers? (too few mapper sessions can significantly
impact performance)
Is batch size too small or too large?
Database
Is database CPU or IO saturated?
o iostat, mpstat, sar, top
Is there any TDWM setting limiting # of concurrent sessions or user’s query priority?
o tdwmcmd -a
DBSControl settings
o AWT tasks: maxawttask, maxloadtask, maxloadawt
o Compression settings
Is database almost out of room?
Is there high skew on some AMPs (skew on the PI column or split-by column)?
Network
Are Hadoop network interfaces saturated?
o Could be high replication factor combined with slow network between nodes
Are Teradata network interfaces saturated?
o Could be slow network between systems
o Does the network have high latency?
Hadoop
Are Hadoop data nodes CPU or IO saturated?
o iostat, mpstat, sar, top, using ganglia or other tools
o Could be Hadoop configuration too small for the job’s size
Are there settings limiting # of concurrent mappers?
o mapred-site.xml
o scheduler configuration
Are mapper tasks skewed to a few nodes?
o use ps | grep java on multiple nodes to see if tasks have skew
o In capacity-scheduler.xml, set maxtasksperheartbeat to force even distribution
7.6 Console Output Structure
NOTE: The example console output contained in this section has not been updated to reflect the latest
version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though not
identical.
13/03/29 11:27:11 INFO tool.TeradataImportTool: TeradataImportTool starts at 1364570831203
13/03/29 11:27:16 INFO mapreduce.TeradataInputProcessor: job setup starts at 1364570836879
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: database product is Teradata
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: database version is 13.10
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: jdbc driver version is 14.0
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input method is split.by.hash
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input split column is page_name
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input query is select "page_name", "page_hour", "page_view" from "npv_m" where page_language like '%9sz6n6%' or page_name is not null
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input database name is
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input table name is npv_m
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input conditions are page_language like '%9sz6n6%' or page_name is not null
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input field names are [page_name, page_hour, page_view]
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input batch size is 10000
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: input number of mappers are 6
(Use the job setup log entries above to verify parameter settings.)
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: job setup ends at 1364570843647
13/03/29 11:27:23 INFO mapreduce.TeradataInputProcessor: job setup time is 6s
13/03/29 11:27:28 INFO mapred.JobClient: Running job: job_201303251205_0253
13/03/29 11:27:29 INFO mapred.JobClient: map 0% reduce 0%
13/03/29 11:27:54 INFO mapred.JobClient: map 100% reduce 0%
13/03/29 11:27:59 INFO mapred.JobClient: Job complete: job_201303251205_0253
13/03/29 11:27:59 INFO mapred.JobClient: Counters: 19
13/03/29 11:27:59 INFO mapred.JobClient: Job Counters
……
13/03/29 11:27:59 INFO mapred.JobClient: Map output records=4
13/03/29 11:27:59 INFO mapred.JobClient: SPLIT_RAW_BYTES=1326
13/03/29 11:27:59 INFO mapreduce.TeradataInputProcessor: job cleanup starts at 1364570879466
13/03/29 11:28:01 INFO mapreduce.TeradataInputProcessor: job cleanup ends at 1364570881367
13/03/29 11:28:01 INFO mapreduce.TeradataInputProcessor: job cleanup time is 1s
13/03/29 11:28:01 INFO tool.TeradataImportTool: TeradataImportTool ends at 1364570881367
13/03/29 11:28:01 INFO tool.TeradataImportTool: TeradataImportTool time is 50s
13/03/29 11:28:01 INFO tool.TeradataImportTool: job completed with exit code 0
7.7 Troubleshooting Examples
7.7.1 Database doesn’t exist
The error message on top of the error stack trace indicates that the “testdb” database does not exist:
com.teradata.hadoop.exception.TeradataHadoopException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.13] [Error 3802] [SQLState 42S02] Database 'testdb' does not exist.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:307)
at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:102)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:302)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:183)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:121)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:112)
at com.teradata.jdbc.jdbc_4.TDSession.executeSessionRequest(TDSession.java:624)
at com.teradata.jdbc.jdbc_4.TDSession.<init>(TDSession.java:288)
at com.teradata.jdbc.jdk6.JDK6_SQL_Connection.<init>(JDK6_SQL_Connection.java:30)
at com.teradata.jdbc.jdk6.JDK6ConnectionFactory.constructConnection(JDK6ConnectionFactory.java:22)
at com.teradata.jdbc.jdbc.ConnectionFactory.createConnection(ConnectionFactory.java:130)
at com.teradata.jdbc.jdbc.ConnectionFactory.createConnection(ConnectionFactory.java:120)
at com.teradata.jdbc.TeraDriver.doConnect(TeraDriver.java:228)
at com.teradata.jdbc.TeraDriver.connect(TeraDriver.java:154)
at java.sql.DriverManager.getConnection(DriverManager.java:582)
at java.sql.DriverManager.getConnection(DriverManager.java:185)
at com.teradata.hadoop.db.TeradataConnection.connect(TeradataConnection.java:274)
7.7.2 Internal fast load server socket time out
When running export job using the "internal.fastload" method, the following error may occur:
Internal fast load socket server time out
This error occurs because the number of map tasks currently available is less than the number of
map tasks specified on the command line via the "-nummappers" parameter. This error can occur in the
following conditions:
(1) Other map/reduce jobs are running concurrently in the Hadoop cluster, so there are not
enough resources to allocate the specified number of map tasks to the export job.
(2) The maximum number of map tasks in the Hadoop cluster is smaller than the number of
existing map tasks plus the expected map tasks of the export job.
When the above error occurs, please try to increase the maximum number of map tasks of the
Hadoop cluster, or decrease the number of map tasks for the export job.
7.7.3 Incorrect parameter name or missing parameter value in command line
All the parameter names specified in the command line should be in lower case. When a
parameter name is incorrect or a necessary parameter value is missing, the following error
will occur:
Export (Import) tool parameters is invalid
When this error occurs, please double check the input parameters and their values.
7.7.4 Hive partition column cannot appear in the Hive table schema
When running an import job with the ‘hive’ job type, the columns defined in the target partition
schema cannot appear in the target table schema. Otherwise, the following exception will be thrown:
Target table schema should not contain partition schema
In this case, please check the provided schemas for Hive table and Hive partition.
7.7.5 Strings are truncated when their length exceeds the Teradata string length (VARCHAR or CHAR) in export jobs
When running an export job, if the length of a source string exceeds the maximum length of the
target Teradata CHAR or VARCHAR column, the source string will be truncated, resulting in
data inconsistency.
To prevent that from happening, please carefully set the data schema for source data and target data.
7.7.6 The scale of the Timestamp data type should be specified correctly in the JDBC URL in the internal.fastload method
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
When loading data into Teradata using the internal.fastload method, the following error may occur:
com.teradata.hadoop.exception.TeradataHadoopException: java.io.EOFException
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
at java.io.DataInputStream.readUTF(DataInputStream.java:572)
at java.io.DataInputStream.readUTF(DataInputStream.java:547)
at com.teradata.hadoop.mapreduce.TeradataInternalFastloadOutputProcessor.beginLoading(TeradataInternalFastloadOutputProcessor.java:889)
at com.teradata.hadoop.mapreduce.TeradataInternalFastloadOutputProcessor.run(TeradataInternalFastloadOutputProcessor.java:173)
at com.teradata.hadoop.job.TeradataExportJob.runJob(TeradataExportJob.java:75)
at com.teradata.hadoop.tool.TeradataJobRunner.runExportJob(TeradataJobRunner.java:192)
at com.teradata.hadoop.tool.TeradataExportTool.run(TeradataExportTool.java:39)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at com.teradata.hadoop.tool.TeradataExportTool.main(TeradataExportTool.java:395)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Usually the error is caused by setting the wrong ‘tsnano’ value in the JDBC URL. In Teradata DDL,
the default length of a timestamp is 6, which is also the maximum allowed value, but the user can
specify a lower value.
When ‘tsnano’ is set to:
The same value as the specified timestamp length in the Teradata table: no problem.
Not set: no problem; the length specified in the Teradata table is used.
A value less than the specified length: an error table is created in Teradata, but no exception
is shown.
A value greater than the specified length: the quoted error message is received.
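A sketch of a JDBC URL pinning the timestamp scale to match a TIMESTAMP(6) column; the hostname and database are placeholders, and the TSNANO option name should be validated against the Teradata JDBC Driver Reference:

```shell
# Hypothetical export URL with tsnano matching the table's timestamp length.
-url 'jdbc:teradata://tdhost/DATABASE=mydb,TSNANO=6'
```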
7.7.7 Existing Error table error received when exporting to Teradata in internal.fastload method
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
If the following error occurs when exporting to Teradata using the internal.fastload method:
com.teradata.hadoop.exception.TeradataHadoopException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.13] [Error 2634] [SQLState HY000] Existing ERROR table(s) or Incorrect use of export_hdfs_fun1_054815 in Fast Load operation.
This is caused by the existence of an error table. If an export task is interrupted or aborted while
running, an error table is generated and remains in the Teradata database. When you then try to run
another export job, the above error occurs.
In this case, the user needs to drop the existing error table manually, and then rerun the export job.
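A sketch of the manual cleanup, assuming BTEQ access. The logon values are placeholders, and the error table names shown are hypothetical; identify the actual error table(s) left behind by the failed job before dropping anything:

```shell
# Hypothetical cleanup of leftover FastLoad error tables via BTEQ.
bteq <<'EOF'
.LOGON tdhost/dbuser,dbpassword
DROP TABLE testdb.export_hdfs_fun1_054815_ERR_1;
DROP TABLE testdb.export_hdfs_fun1_054815_ERR_2;
.LOGOFF
EOF
```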
7.7.8 No more room in database error received when exporting to Teradata
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
If the following error occurs when exporting to Teradata:
com.teradata.hadoop.exception.TeradataHadoopSQLException:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.00.00.01] [Error 2644] [SQLState HY000] No more room in database testdb.
at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:307)
at com.teradata.jdbc.jdbc_4.statemachine.ReceiveInitSubState.action(ReceiveInitSubState.java:102)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:298)
at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:179)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:120)
at com.teradata.jdbc.jdbc_4.statemachine.StatementController.run(StatementController.java:111)
at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:372)
at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:314)
at com.teradata.jdbc.jdbc_4.TDStatement.doNonPrepExecute(TDStatement.java:277)
at com.teradata.jdbc.jdbc_4.TDStatement.execute(TDStatement.java:1087)
at com.teradata.hadoop.db.TeradataConnection.executeDDL(TeradataConnection.java:364)
at com.teradata.hadoop.mapreduce.TeradataMultipleFastloadOutputProcessor.getRecordWriter(TeradataMultipleFastloadOutputProcessor.java:315)
This is caused by the perm space of the database in Teradata being set too low. Please reset it to a
higher value to resolve it.
7.7.9 “No more spool space” error received when exporting to Teradata
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
If the following error occurs when exporting to Teradata:
java.io.IOException: com.teradata.jdbc.jdbc_4.util.JDBCException:
[Teradata Database] [TeraJDBC 14.00.00.21] [Error 2646] [SQLState HY000]
No more spool space in example_db.
This is caused by the spool space of the database in Teradata being set too low. Please reset it to a
higher value to resolve it.
7.7.10 Separator is wrong or absent
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
If the ‘-separator’ parameter is not set or is wrong, you may run into the following error:
java.lang.NumberFormatException: For input string: "12,23.45,101,complex1"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222)
at java.lang.Double.valueOf(Double.java:475)
at com.teradata.hadoop.datatype.TeradataDataType$10.transform(TeradataDataType.java:194)
at com.teradata.hadoop.data.TeradataHdfsTextFileDataConverter.convert(TeradataHdfsTextFileDataConverter.java:194)
at com.teradata.hadoop.data.TeradataHdfsTextFileDataConverter.convert(TeradataHdfsTextFileDataConverter.java:167)
at com.teradata.hadoop.mapreduce.TeradataTextFileExportMapper.map(TeradataTextFileExportMapper.java:32)
at com.teradata.hadoop.mapreduce.TeradataTextFileExportMapper.map(TeradataTextFileExportMapper.java:12)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Please make sure the separator parameter’s name and value are specified correctly.
7.7.11 Date / Time / Timestamp format related errors
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
If you run into one of the following errors:
java.lang.IllegalArgumentException
at java.sql.Date.valueOf(Date.java:138)
java.lang.IllegalArgumentException
at java.sql.Time.valueOf(Time.java:89)
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-
dd hh:mm:ss[.fffffffff]
It is caused by incorrect date / time / timestamp formats:
1) When exporting data with time, date or timestamp type from HDFS text files to Teradata:
a) Value of date type in text files should follow the format of ‘yyyy-mm-dd’
b) Value of time type in text files should follow the format of ‘hh:mm:ss’
c) Value of timestamp type in text files should follow the format of ‘yyyy-mm-dd hh:mm:ss[.f...]’;
the length of the fractional seconds should be at most 9.
2) When importing data with time, date or timestamp type from Teradata to HDFS text file:
a) Value of date type in text files will follow the format of ‘yyyy-mm-dd’
b) Value of time type in text files will follow the format of ‘hh:mm:ss’
c) Value of timestamp in text files will follow the format of ‘yyyy-mm-dd hh:mm:ss.fffffffff’,
length of nano is 9.
3) When exporting data from Hive text files to Teradata:
a) Value of timestamp type in Hive text files should follow the format of ‘yyyy-mm-dd
hh:mm:ss.fffffffff’ (nano is optional, maximum length is 9)
4) When importing data from Teradata to Hive text files:
a) Value of timestamp type in Hive text files will follow the format of ‘yyyy-mm-dd
hh:mm:ss.fffffffff’ (nano is optional, maximum length is 9)
7.7.12 Japanese language problem
NOTE: The example console output contained in this section has not been updated to reflect the
latest version of TDCH; the error messages and stack traces for TDCH 1.3+ will look similar, though
not identical.
If you run into one of the following errors:
com.teradata.jdbc.jdbc_4.util.JDBCException: [Teradata Database]
[TeraJDBC 14.10.00.21] [Error 6706] [SQLState HY000] The string contains
an untranslatable character.
This error is reported by the Teradata database. One cause is the use of a database with Japanese
language support. When the connector retrieves the table schema from the database, it uses the
following statement:
SELECT TRIM(TRAILING FROM COLUMNNAME) AS COLUMNNAME, CHARTYPE FROM
DBC.COLUMNS WHERE DATABASENAME = (SELECT DATABASE) AND TABLENAME =
$TABLENAME;
The internal database process encounters an invalid character during processing, which may be a
problem with TD14. The workaround is to set the DBS Control flag “AcceptReplacementCharacters” to true.
8 FAQ
8.1 Do I need to install the Teradata JDBC driver manually?
You do not need to find and install the Teradata JDBC driver, as the latest Teradata JDBC driver (15.10)
is packaged in the TDCH jar file. If you have installed another version of the JDBC driver, please ensure it is
not included in the HADOOP_CLASSPATH or libjars environment variables, so that TDCH
utilizes the version of the driver packaged with TDCH. If you would like TDCH to utilize a different
JDBC driver, see the ‘tdch.input.teradata.jdbc.driver.class’ and
‘tdch.output.teradata.jdbc.driver.class’ properties.
8.2 What authorization is necessary for running the TDCH?
The following authorization is required regardless of whether the import or export tool is used.
o Teradata Database:
Please check section 1.6:
o Hadoop cluster
Privilege to submit & run job via MapReduce
Privileges to read/write directories in HDFS
Privileges to access tables in Hive
8.3 How do I use User Customized Text Format Parameters?
TDCH provides two parameters, enclosedby and escapeby, for dealing with data containing
separator characters and quote characters in ‘textfile’ format ‘hdfs’ jobs. The default value for
enclosedby is " (double quote) and for escapeby is \ (backslash). If the file format is not
‘textfile’ or the job type is not ‘hdfs’, these two parameters do not take effect.
When neither parameter is specified, TDCH does not enclose or escape any characters in the data
during import, nor scan for enclose-by or escape-by characters during export. If either or both
parameters are provided, then TDCH will process enclose-by and escape-by values as appropriate.
8.4 How to use Unicode character as the separator?
Using a shell to invoke TDCH:
When setting a Unicode character as the separator, specify it as -separator “\uxxxx”
or -separator \\uxxxx, where xxxx is the Unicode code point of the character. The shell
automatically removes the double quotes or the first backslash.
Using other methods to invoke TDCH:
TDCH accepts a Unicode character as the separator in the format \uxxxx; make sure the
separator value passed to TDCH has the correct format.
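The shell behavior described above can be verified directly; a small sketch using a hypothetical code point of 0001:

```shell
# Both quoting styles deliver the literal six characters \u0001 to TDCH:
printf '%s\n' "\u0001"     # double quotes: the backslash survives as-is
printf '%s\n' \\u0001      # unquoted: the shell strips the first backslash
```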
Teradata Connector for Hadoop Tutorial v1.4.docxdata Connector for Hadoop Tutorial Page 91
8.5 Why is the actual number of mappers less than the value of -nummappers?
If you specify the number of mappers using the nummappers parameter but find during execution
that the actual number of mappers is less than your specified value, this is expected behavior.
TDCH uses the getSplits() method of Hadoop's CombineFileInputFormat class to determine the
number of partitioned splits, and the number of mappers used to run the job equals the number
of splits.
8.6 Why don’t decimal values in Hadoop exactly match the value in Teradata?
When exporting data to Teradata, if the precision of a decimal value is greater than that of the target
Teradata column type, the decimal value is rounded when stored in Teradata. On the other
hand, if the precision is less than the definition of the column in the Teradata table,
zeros are appended to the fractional digits.
8.7 When should charset be specified in the JDBC URL?
If the column of the Teradata table is defined as Unicode, then you should specify the same character
set in the JDBC URL. Otherwise, it will result in wrong encoding of transmitted data, and there will
be no exception thrown. Also, if you want to display Unicode data correctly in a shell or other
client, remember to configure the client to display UTF-8 as well.
8.8 How do I configure the capacity scheduler to prevent task skew?
We can use the capacity scheduler configuration to prevent task skew.
Here are the steps to follow:
1. Make sure the scheduler in use is the capacity scheduler (check mapred-site.xml and the
scheduler configuration).
2. Configure capacity-scheduler.xml (usually in the same location as mapred-site.xml,
$HADOOP_HOME/conf):
Add the property mapred.capacity-scheduler.maximum-tasks-per-heartbeat and give
it a reasonable value, such as 4 for a 28-mapper job.
3. Copy this xml file to each node in the Hadoop cluster, then restart Hadoop.
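The property addition in step 2 looks like the following in capacity-scheduler.xml; the value 4 is the illustrative value from above and should be tuned for your job size:

```xml
<!-- capacity-scheduler.xml: force even task distribution across nodes -->
<property>
  <name>mapred.capacity-scheduler.maximum-tasks-per-heartbeat</name>
  <value>4</value>
</property>
```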
8.9 How can I build my own ConnectorDataTypeConverter?
Users also have the ability to define their own conversion routines, and reference these routines in
the value supplied to the ‘sourcerecordschema’ command line argument. In this scenario, the user
would also need to supply a value for the ‘targetrecordschema’ command line argument, providing
TDCH with information about the record generated by the user-defined converter.
As an example, here's a user-defined converter which replaces occurrences of the term 'foo' in a
source string with the term 'bar':
public class FooBarConverter extends ConnectorDataTypeConverter {
    // Single-arg constructor; the input argument is unused (no-arg
    // constructors are not supported in TDCH 1.4, see TDCH-775 below).
    public FooBarConverter(String unused) {}

    public final Object convert(Object object) {
        if (object == null)
            return null;
        return ((String) object).replaceAll("foo", "bar");
    }
}
This user-defined converter extends the ConnectorDataTypeConverter class, and thus requires an
implementation of the convert(Object) method. At the time of the 1.4 release, user-defined
converters with no-arg constructors were not supported (this bug is being tracked by TDCH-775; see
the known issues list); thus this user-defined converter defines a single-arg constructor, where the
input argument is not used. To compile this user-defined converter, use the following syntax:
javac -cp /usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar FooBarConverter.java
To run using this user-defined converter, first create a new jar which contains the user-defined
converter's class files:
jar cvf user-defined-converter.jar .
Then add the new jar onto the HADOOP_CLASSPATH and LIB_JARS environment variables:
export HADOOP_CLASSPATH=/path/to/user-defined-converter.jar:$HADOOP_CLASSPATH
export LIB_JARS=/path/to/user-defined-converter.jar,$LIB_JARS
Finally, reference the user-defined converter in your TDCH command. As an example, this TDCH
job exports 2 columns from an HDFS file into a Teradata table with one int column and one
string column. The second column in the HDFS file will have the FooBarConverter applied to it
before the record is sent to the Teradata table:
hadoop jar $TDCH_JAR \
    com.teradata.connector.common.tool.ConnectorExportTool \
    -libjars $LIB_JARS \
    -url <jdbc url> \
    -username <db username> \
    -password <db password> \
    -sourcepaths <source hdfs path> \
    -targettable <target TD table> \
    -sourcerecordschema "string, FooBarConverter(value)" \
    -targetrecordschema "int, string"
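Before wiring the converter into a job, its replacement logic can be sanity-checked with a small
standalone class. This is a sketch: the class and method names below are illustrative, and it
deliberately does not extend ConnectorDataTypeConverter, so the TDCH jar is not needed on the
classpath.

```java
public class FooBarConverterDemo {

    // Mirrors FooBarConverter.convert: null passes through unchanged,
    // and every occurrence of "foo" is replaced with "bar".
    static Object convert(Object object) {
        if (object == null) {
            return null;
        }
        return ((String) object).replaceAll("foo", "bar");
    }

    public static void main(String[] args) {
        System.out.println(convert("foo column value")); // prints "bar column value"
        System.out.println(convert(null));               // prints "null"
    }
}
```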
9 Limitations & known issues
9.1 Teradata Connector for Hadoop
a) ORDER BY without TOP N is not supported by the ‘sourcequery’ command line argument
b) RECURSIVE queries are not supported by the ‘sourcequery’ command line argument
c) Queue tables are not supported
d) Custom UDTs are not supported
e) Geospatial types are not supported
f) The Teradata GRAPHIC, VARGRAPHIC, and LONG VARGRAPHIC types are not supported
g) The Teradata ARRAY type is supported with Teradata source plugins only
h) A Teradata BLOB column's length must be 64KB or less with export
i) A Teradata CLOB column's length must be 64KB or less with export
j) Teradata BYTE, VARBYTE, and BLOB column contents are converted to HEX strings in data type
conversion to string
k) An HDFS file's field format must match the table column's format when being loaded into a
Teradata table
l) When the Teradata target plugin's batch size parameter is manually set to a large number, the JVM
heap size for mapper/reducer tasks needs to be set to an appropriate value to avoid
OutOfMemoryException
m) The Teradata internal.fastload target plugin will proceed with data transfer only after all tasks
are launched, or will otherwise time out after 8 minutes
n) Hive MAP, ARRAY, and STRUCT types are supported with export and import, and are converted
to and from VARCHAR in JSON format in the Teradata system
o) Hive UNIONTYPE is not yet supported
p) Partition values cannot be null or empty when importing to a Hive partitioned table
q) Only the UTF8 character set is supported
r) The characters '/' and '=' are not supported in the string value of a partition column of a Hive table
9.2 Teradata JDBC Driver
a) Row length (of all columns selected) must be 64KB or less
b) The number of rows in each batch.insert request must be less than 13668
c) PERIOD (TIME) with a custom format is not supported
d) PERIOD (TIMESTAMP) with a custom format is not supported
e) The JDBC batch insert max parcel size is 1MB
f) The JDBC Fastload max parcel size is 64KB
9.3 Teradata Database
a) During a Fastload export, any row containing unsupported Unicode characters will be moved to
error tables; during a batch insert export, an exception may be thrown. For a complete list of the
unsupported Unicode characters in the Teradata database, see Teradata knowledge article KAP1A269E.
Teradata 14.00
N/A
Teradata 13.10
N/A
Teradata 13.00
a) PERIOD datatype is not supported on Teradata 13.00.
9.4 Hadoop Map/Reduce
a) With the mapred.tasktracker.map.tasks.maximum property in mapred-site.xml set to a high number
exceeding the total number of tasks for a job, the Capacity Scheduler may skew task scheduling onto
a subset of nodes. The result is that only a subset of the cluster's data transfer throughput
capability is utilized, and the job's overall data transfer throughput performance is impacted. The
workaround is to set the mapred.capacity-scheduler.maximum-tasks-per-heartbeat property in
capacity-scheduler.xml to a small number (e.g. 4) to allow more nodes a fair chance at running tasks.
9.5 Hive
a) The "-hiveconf" option is used to specify the path of a Hive configuration file (see Section 3.1).
It is required for a “hive” or “hcat” job.
With version 1.0.7, the file can be located in HDFS (hdfs://) or in a local file system (file://).
Without a URI schema (hdfs:// or file://) specified, the default schema name is "hdfs". Without the
"-hiveconf" parameter specified, the "hive-site.xml" file should be located on $HADOOP_CLASSPATH,
as a local path, before running the TDCH job. For example, if the file "hive-site.xml" is in
"/home/hduser/conf/", a user should export that path using the following command before running
the TDCH job:
export HADOOP_CLASSPATH=/home/hduser/conf:$HADOOP_CLASSPATH
9.6 Avro data type conversion and encoding
a) For Avro complex types (except UNION), only data type conversion between complex types and
string data types (CLOB, VARCHAR) in Teradata is supported.
b) When importing data from Teradata to an Avro file, if a field's data type in Avro is a UNION
with null and the corresponding source column in the Teradata table is nullable:
i) A NULL value is converted to a null value within a UNION value in the corresponding target
Avro field.
ii) A non-NULL value is converted to a value of the corresponding type within a UNION value in
the corresponding target Avro field.
c) When exporting data from Avro to Teradata, if a field's data type in Avro is a UNION with null,
and:
i) the target column is nullable, then a NULL value within the UNION is converted to a NULL value
in the target Teradata table column;
ii) the target column is not nullable, then a NULL value within the UNION is converted to a
connector-defined default value of the specified data type.
d) TDCH currently supports only Avro binary encoding.
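As an illustration of the UNION-with-null cases above, an Avro field that can hold either a null
or a string value is declared in the Avro schema like this (the field name is a placeholder):

```json
{"name": "some_column", "type": ["null", "string"]}
```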