Informatica Best Practices - Error Handling


  • Table of Contents

    Error Handling
    Error Handling Process
    Error Handling Strategies - B2B Data Transformation
    Error Handling Strategies - Data Warehousing
    Error Handling Strategies - General
    Error Handling Techniques - PowerCenter Mappings
    Error Handling Techniques - PowerCenter Workflows and Data Analyzer


  • Error Handling Process

    Challenge

    For an error handling strategy to be implemented successfully, it must be integral to the load process as

    a whole. The method of implementation for the strategy will vary depending on the data integration

    requirements for each project.

    The resulting error handling process should, however, always involve the following three steps:

    1. Error identification

    2. Error retrieval

    3. Error correction

    This Best Practice describes how each of these steps can be facilitated within the PowerCenter environment.

    Description

    A typical error handling process leverages the best-of-breed error management technology available in PowerCenter, such as:

    * Relational database error logging

    * Email notification of workflow failures

    * Session error thresholds

    * The reporting capabilities of PowerCenter Data Analyzer

    * Data profiling

    These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in the flow chart

    below:

    Error Identification


    The first step in the error handling process is error identification. Error identification is often achieved through the use of the ERROR() function within mappings, the enabling of relational error logging in PowerCenter, and referential integrity constraints at the database.

    Enabling the relational error logging functionality automatically writes row-level data to a set of four error handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can be centralized in the PowerCenter repository and store information such as error messages, error data, and source row data. Row-level errors trapped in this manner include any database errors (e.g., referential integrity failures), transformation errors, and business rule exceptions for which the ERROR() function was called within the mapping.

    Error Retrieval

    The second step in the error handling process is error retrieval. After errors have been captured in the PowerCenter repository,

    it is important to make their retrieval simple and automated so that the process is as efficient as possible. Data Analyzer can be

    customized to create error retrieval reports from the information stored in the PowerCenter repository. A typical error report

    prompts a user for the folder and workflow name, and returns a report with information such as the session, error message, and

    data that caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved through a

    Data Analyzer report, or an email alert that notifies a user when a certain threshold is crossed (such as the number of errors being greater than zero).
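    As an illustration of such a retrieval query, the following minimal sketch, in Python, reads error messages for one workflow from the relational error logging tables. Beyond the documented table names (PMERR_SESS, PMERR_MSG), the join and column names are assumptions and should be verified against the actual repository schema; the connection object is any DB-API connection to the error logging database.

        # Minimal sketch: retrieve logged errors for one folder/workflow combination.
        # Column names such as SESS_INST_ID, ERROR_MSG and ERROR_TIMESTAMP are
        # assumptions for illustration only.
        from typing import Any, Sequence

        def fetch_session_errors(conn: Any, folder_name: str, workflow_name: str) -> Sequence[tuple]:
            """Return the error rows logged for the sessions of one workflow."""
            sql = (
                "SELECT s.FOLDER_NAME, s.WORKFLOW_NAME, s.SESS_INST_NAME, "
                "       m.ERROR_TIMESTAMP, m.ERROR_MSG "
                "FROM   PMERR_SESS s "
                "JOIN   PMERR_MSG  m ON m.SESS_INST_ID = s.SESS_INST_ID "
                "WHERE  s.FOLDER_NAME = ? AND s.WORKFLOW_NAME = ? "
                "ORDER BY m.ERROR_TIMESTAMP"
            )
            cursor = conn.cursor()
            cursor.execute(sql, (folder_name, workflow_name))
            return cursor.fetchall()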

    Error Correction

    The final step in the error handling process is error correction. Because PowerCenter automates error identification and Data Analyzer simplifies error retrieval, error correction is straightforward. After retrieving an error through

    Data Analyzer, the error report (which contains information such as workflow name, session name, error date, error message,

    error data, and source row data) can be exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and

    others. Upon retrieval of an error, the error report can be extracted into a supported format and emailed to a developer or DBA

    to resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface supports

    emailing a report directly through the web-based interface to make the process even easier.

    For further automation, a report broadcasting rule that emails the error report to a developer's inbox can be set up to run on a

    pre-defined schedule. After the developer or DBA identifies the condition that caused the error, a fix for the error can be

    implemented. The exact method of data correction depends on various factors such as the number of records with errors, data

    availability requirements per SLA, the level of data criticality to the business unit(s), and the type of error that occurred.

    Considerations made during error correction include:

    * The owner of the data should always fix the data errors. For example, if the source data is coming from an external

    system, then the errors should be sent back to the source system to be fixed.

    * In some situations, a simple re-execution of the session will reprocess the data.

    * Consider whether partial data that has already been loaded into the target systems needs to be backed out in order to avoid duplicate processing of rows.

    * Lastly, errors can also be corrected through a manual SQL load of the data. If the volume of errors is low, the rejected data

    can be easily exported to Microsoft Excel or CSV format and corrected in a spreadsheet from the Data Analyzer error

    reports. The corrected data can then be manually inserted into the target table using a SQL statement.

    Any approach to correct erroneous data should be precisely documented and followed as a standard.

    If data errors occur frequently, then reprocessing can be automated by designing a special mapping or session

    to correct the errors and load the corrected data into the ODS or staging area.
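    As an illustration of the manual correction path described above, the following minimal sketch, in Python, reads rows that were corrected in a spreadsheet (exported to CSV from a Data Analyzer error report) and inserts them into a target table. The table and column names are hypothetical, and the connection object is any DB-API connection to the target database.

        # Minimal sketch: insert corrected rows from a CSV export back into the target.
        # CUSTOMER_DIM and its columns are hypothetical names used for illustration.
        import csv
        from typing import Any

        def load_corrected_rows(conn: Any, csv_path: str) -> int:
            """Insert corrected rows from the CSV export and return the number inserted."""
            inserted = 0
            cursor = conn.cursor()
            with open(csv_path, newline="") as handle:
                for row in csv.DictReader(handle):
                    cursor.execute(
                        "INSERT INTO CUSTOMER_DIM (CUSTOMER_ID, CUSTOMER_NAME, STATUS) "
                        "VALUES (?, ?, ?)",
                        (row["CUSTOMER_ID"], row["CUSTOMER_NAME"], row["STATUS"]),
                    )
                    inserted += 1
            conn.commit()   # commit once, after all corrected rows are applied
            return inserted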

    Data Profiling Option

    For organizations that want to identify data irregularities post-load but do not want to reject such rows at load time, the

    PowerCenter Data Profiling option can be an important part of the error management solution. The PowerCenter Data Profiling


    option enables users to create data profiles through a wizard-driven GUI that provides profile reporting such as orphan record identification, business rule violations, and data irregularity identification (such as NULL or default

    values). The Data Profiling option comes with a license to use Data Analyzer reports that source the data profile warehouse to

    deliver data profiling information through an intuitive BI tool. This is a recommended best practice since error handling reports

    and data profile reports can be delivered to users through the same easy-to-use application.

    Integrating Error Handling, Load Management, and Metadata

    Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the load management

    process and the load metadata; it is the integration of all these approaches that ensures the system is sufficiently robust for

    successful operation and management. The flow chart below illustrates this in the end-to-end load process.


    Error handling underpins the data integration system from end to end. Each of the load components

    performs validation checks, the results of which must be reported to the operational team. These components are not just

    PowerCenter processes such as business rule and field validation, but cover the entire data integration architecture, for

    example:

    * Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity to source systems)?

    * Source File Validation. Is the source file datestamp later than the previous load?

    * File Check. Does the number of rows successfully loaded match the source rows read?
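    As a minimal sketch of how two of these checks might be expressed outside PowerCenter (the Source File Validation and File Check items above), the Python functions below assume that the previous load timestamp and the row counts are supplied by the load management metadata; the function names are illustrative.

        # Minimal sketch of two load validation checks. The calling process is assumed
        # to supply the previous load time (timezone-aware, UTC) and the row counts
        # from load management metadata.
        import os
        from datetime import datetime, timezone

        def source_file_is_new(path: str, previous_load_time: datetime) -> bool:
            """Source File Validation: is the file's datestamp later than the previous load?"""
            modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
            return modified > previous_load_time

        def file_check(source_rows_read: int, rows_loaded: int) -> bool:
            """File Check: does the number of rows successfully loaded match the rows read?"""
            return source_rows_read == rows_loaded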


  • Error Handling Strategies - B2B Data Transformation

    Challenge

    The challenge for B2B Data Transformation (B2B DT) based solutions is to create efficient, accurate

    processes for transforming data to appropriate intermediate data formats and to subsequently transform

    data from those formats to correct output formats. Error handling strategies are a core part of assuring

    the accuracy of any transformation process.

    Error handling strategies in B2B Data Transformation solutions should address the following two needs:

    1. Detection of errors in the transformation leading to successive refinement of the transformation logic during an iterative

    development cycle.

    2. Correct error detection, retrieval, and handling in production environments.

    In general, errors can be characterized as either expected or unexpected.

    An expected error is an error condition that we anticipate will occur periodically. For example, a printer running out of paper is an expected error. In a B2B scenario this may correspond to a partner company sending a file in an incorrect format; although it is an error condition, it is expected from time to time. Usually, processing of an expected error is part of normal system functionality and does not constitute a failure of the system to perform as designed.

    Unexpected errors typically occur when the designers of a system believe a particular scenario is handled, but due to logic

    flaws or some other implementation fault, the scenario is not handled. These errors might include hardware failures, out of

    memory situations or unexpected situations due to software bugs.

    Errors can also be classified by severity (e.g., warning errors and fatal errors).

    For unexpected fatal errors, the transformation process is often unable to complete and may result in a loss of data. In these

    cases, the emphasis is on prompt discovery and reporting of the error and support of any troubleshooting process.

    Often the appropriate action for fatal unexpected errors is not addressed at the individual B2B Data Transformation translation

    level but at the level of the calling process.

    This Best Practice describes various strategies for handling expected and unexpected errors both from production and

    development troubleshooting points of view, and discusses the error handling features included as part of Informatica's B2B

    Data Transformation 8.x.

    Description

    This Best Practice is intended to help designers of B2B DT solutions decide which error handling strategies to employ in their

    solutions and to familiarize them with new features in Informatica B2B DT.

    Terminology

    B2B DT is used as a generic term for the parsing, transformation and serialization technologies provided in Informatica's B2B

    DT products. These technologies have been made available through the B2B Data Transformation, Unstructured Data Option

    for PowerCenter and as standalone products known as B2B Data Transformation and PowerExchange for Complex Data.

    Note: Informatica's B2B DT was previously known as PowerExchange for Complex Data Exchange (CDE) or Itemfield Content Master (CM).

    Errors in B2B Data Transformation Solutions

    There are several types of errors possible in a B2B data transformation. The common types of errors that should be handled

    while designing and developing are:

    * Logic errors


    * Errors in structural aspects of inbound data (missing syntax, etc.)

    * Value errors

    * Errors reported by downstream components (i.e., legacy components in data hubs)

    * Data-type errors for individual fields

    * Unrealistic values (e.g., impossible dates)

    * Business rule breaches

    Production Errors vs. Flaws in the Design - Production errors are those where the source data or the environmental setup

    does not conform to the specifications for the development whereas flaws in design occur when the development does not

    conform to the specification. For example, a production error can be an incorrect source file format that does not conform to the

    specification layout given for development. A flaw in design could be as trivial as defining an element to be mandatory where

    the possibility of non-occurrence of the element cannot be ruled out completely.

    Unexpected Errors vs. Expected Errors: Expected errors are those that can be anticipated for a solution scenario based

    upon experience (e.g., the EDI message file does not conform to the latest EDI specification). Unexpected errors are most

    likely caused by environment set up issues or unknown bugs in the program (e.g., a corrupted file system).

    Severity of Errors - Not all errors in a system are equally important. Some errors may require that the process be halted until they are corrected (e.g., an incorrect format of source files); these are termed critical or fatal errors. In other cases a description field may be longer than the specified field length, but the truncation does not affect the process; such errors are termed warnings. The severity of a particular error can only be defined with respect to the impact it creates on the business process it supports. In B2B DT the severity of errors is classified into the following categories:

    Information - A normal operation performed by B2B Data Transformation.

    Warning - A warning about a possible error. For example, B2B Data Transformation generates a warning event if an operation overwrites the existing content of a data holder. The execution continues.

    Failure - A component failed. For example, an anchor fails if B2B Data Transformation cannot find it in the source document. The execution continues.

    Optional Failure - An optional component, configured with the optional property, failed. For example, an optional anchor is missing from the source document. The execution continues.

    Fatal error - A serious error occurred; for example, a parser has an illegal configuration. B2B Data Transformation halts the execution.

    Unknown - The event status cannot be determined.

    Error Handling in Data Integration Architecture

    The Error Handling Framework in the context of B2B DT defines a basic infrastructure and the mechanisms for building more

    reliable and fault-tolerant data transformations. It integrates error handling facilities into the overall data integration architecture.

    How do you integrate the necessary error handling into the data integration architecture?

    User interaction: Even in erroneous situations the data transformation should behave in a controlled way, and the user should be informed appropriately about the system's state. User interaction with the error handling should be kept controlled to avoid cyclic dependencies.

    Robustness: The error handling should be simple. All additional code for handling error situations makes the transformation

    more complex, which itself increases the probability of errors. Thus the error handling should provide some basic mechanism

    for handling internal errors. However, for the error handling code it is even more important to be correct and to avoid any

    nested error situations.


    Separation of error handling code: Without any separation, the normal code will be cluttered by a lot of error handling code. This makes the code less readable, error-prone, and more difficult to maintain.

    Specific error handling versus complexity: Errors must be classified more precisely in order to handle them effectively and

    to take measures tailored to specific errors.

    Detailed error information versus complexity: Whenever the transformation terminates due to an error, suitable information

    is needed to analyze the error. Otherwise, it is not feasible to investigate the original fault that caused the error.

    Performance: We do not want to pay very much for error handling during normal operation.

    Reusability: The services of the error handling component should be designed for reuse because it is a basic component

    useful for a number of transformations.

    Error Handling Mechanisms in B2B DT

    The common techniques that would help a B2B DT designer in designing an error handling strategy are summarized below.

    Debug: This method of error handling relies on the built-in debugging capabilities of B2B DT for the most basic errors. Debugging a B2B DT parser/serializer can be done in multiple ways:

    * Highlight the selection of an element on the example source file.

    * Use a Writeval component along with disabling automatic output in the project property.

    * Use the disable and enable feature for each of the components.

    * Run the parser/serializer and browse the event log for any failures.

    All the debug components should be removed before deploying the service in production.

    Schema Modification: This method of error handling demonstrates one of the ways to communicate the erroneous record

    once it is identified. The erroneous data can be captured at different levels (e.g., at field level or at record level). The XML

    schema methodology adds XML elements to the schema structure to hold the error data and error message. This allows the developer to validate each element against the business rules; if any element or record does not conform to the rules, that data and a corresponding error message can be stored in the XML structure.
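    The following minimal sketch, in Python, illustrates the idea: a record is validated against a business rule and, if it does not conform, the bad value and a message are stored in added XML elements. The element names (Amount, ErrorData, ErrorMessage) and the rule itself are illustrative assumptions, not part of any shipped schema.

        # Minimal sketch of the schema-modification approach: store the erroneous value
        # and a corresponding error message in extra XML elements on the record.
        import xml.etree.ElementTree as ET

        def validate_amount(record: ET.Element) -> None:
            """Flag a record whose Amount element is not a non-negative number."""
            amount = record.findtext("Amount", default="")
            try:
                valid = float(amount) >= 0
            except ValueError:
                valid = False
            if not valid:
                ET.SubElement(record, "ErrorData").text = amount
                ET.SubElement(record, "ErrorMessage").text = "Amount is not a non-negative number"

        record = ET.fromstring("<Record><Amount>-12.50</Amount></Record>")
        validate_amount(record)
        print(ET.tostring(record, encoding="unicode"))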

    Error Data in a Different File: This methodology stores the erroneous records or elements in a file separate from the output data stream. The method is useful when a business-critical timeline for the data processing cannot be compromised for a few erroneous records: processing of the correct records can continue while the erroneous records are inspected and corrected as a separate stream. In this methodology the business validations are applied to each element against the specified rules, and any element or record that fails to conform is directed to a predefined error file. The path to this file is generally passed in the output file for further investigation, or the file resides at a static path from which a script sends the error files to operations for correction.
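    A minimal sketch of this approach in Python is shown below, assuming newline-delimited input records; the five-field validation rule and the file naming are illustrative assumptions only.

        # Minimal sketch: write records that pass validation to the output stream and
        # direct the rest to a predefined error file for later inspection and correction.
        def split_records(input_path: str, output_path: str, error_path: str) -> None:
            with open(input_path) as source, \
                 open(output_path, "w") as good, \
                 open(error_path, "w") as bad:
                for line in source:
                    record = line.rstrip("\n")
                    # Illustrative rule: a valid record has exactly five pipe-delimited fields.
                    if len(record.split("|")) == 5:
                        good.write(record + "\n")
                    else:
                        bad.write(record + "\n")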

    Design Time Tools for Error Handling in B2B DT

    A failure is an event that prevents a component from processing data in the expected way. An anchor might fail if it searches for

    text that does not exist in the source document. A transformer or action might fail if its input is empty or has an inappropriate

    data type.

    A failure can be a perfectly normal occurrence. For example, a source document might contain an optional date. A parser

    contains a Content anchor that processes the date, if it exists. If the date does not exist in a particular source document, the

    Content anchor fails. By configuring the transformation appropriately, you can control the result of a failure. In the example, you

    might configure the parser to ignore the missing data and continue processing.

    B2B Data Transformation offers various mechanisms for error handling during design time:

    B2B DT event log - This is a B2B DT specific event generation mechanism where each event corresponds to an action taken by a transformation such as recognizing a particular lexical sequence. It is useful in the troubleshooting of work in progress, but event files can grow very large, hence it is not recommended for production systems. It is distinct from the event system offered by other B2B DT products and from the OS based event system. Custom events can be generated within transformation scripts. Event based failures are reported as exceptions or other errors in the calling environment.

    B2B DT Trace files - Trace files are controlled by the B2B DT configuration application. Automated strategies may be applied for the recycling of trace files.

    Custom error information - At the simplest levels, custom errors can be generated as B2B DT events (using the AddEventAction). However if the event mechanism is disabled for memory or performance reasons, these are omitted. Other alternatives include generation of custom error files, integration with OS event tracking mechanisms and integration with 3rd party management platform software. Integration with OS eventing or 3rd party platform software requires custom extensions to B2B DT.

    The event log is the main troubleshooting tool in B2B DT solutions. It captures all of the details in an event log file when an

    error occurs in the system. These files can be generated when testing in a development studio environment or running a

    service engine. These files reside in the CM_Reports directory specified in the CM_Config file under the installation directory of

    B2B DT. In the studio environment the default location is Results/events.cme in the project folder. The error messages

    appearing in the event log file are either system generated or user-defined (the latter added with an AddEventAction). The AddEventAction enables the developer to pass a user-defined error message to the event log file when a specific error condition occurs.

    Overall the B2B DT event mechanism is the simplest to implement. But for large or high volume production systems, the event

    mechanism can create very large event files, and it offers no integration with popular enterprise software administration

    platforms. Informatica recommends using B2B DT Events for troubleshooting purposes during development only.

    In some cases, performance constraints may determine the error handling strategy. For example, updating an external event system may cause performance bottlenecks, and producing a formatted error report can be time-consuming. In some cases

    operator interaction may be required that could potentially block a B2B DT transformation from completing.

    Finally, it is worth looking at whether some part of the error handling can be offloaded outside of B2B DT to avoid performance

    bottlenecks.

    When using custom error schemes, consider the following:

    * Multiple invocations of the same transformation may execute in parallel

    * Don't hardwire error file paths

    * Don't assume a single error output file

    Avoid the use of the B2B DT event log for production systems (especially when processing Excel files).
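    In line with the considerations above, a custom error scheme can derive a unique error file path for each invocation instead of hardwiring one, so that parallel runs of the same transformation do not collide. A minimal sketch in Python, with the directory layout as an assumption:

        # Minimal sketch: build a unique error file path per invocation so parallel runs
        # of the same service never write to the same file. The base directory would
        # normally come from configuration; here it is passed in as a parameter.
        import os
        import uuid
        from datetime import datetime, timezone

        def error_file_path(base_dir: str, service_name: str) -> str:
            """Return a unique error file path for one invocation of a service."""
            stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
            unique = uuid.uuid4().hex[:8]
            target_dir = os.path.join(base_dir, service_name)
            os.makedirs(target_dir, exist_ok=True)
            return os.path.join(target_dir, f"errors_{stamp}_{unique}.txt")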

    The trace files capture the state of the system along with the process ID and failure messages, and report each error with a time stamp. They capture details about the system in different category areas, including file system, environment, networking, etc., and record the process ID and thread ID that were executing, which helps in identifying system-level errors (if any). The name of the trace file can be modified in the Configuration wizard. The

    maximum size of the trace file can also be limited in the CMConfiguration editor.

    If the Data Transformation Engine runs under multiple user accounts, the user logs may overwrite each other, or it may be

    difficult to identify the logs belonging to a particular user. Prevent this by configuring users with different log locations.

    In addition to the logs of service events, there is an Engine initialization event log. This log records problems that occur when

    the Data Transformation Engine starts without reference to any service or input data. View this log to diagnose installation

    problems such as missing environment variables.

    The initialization log is located in the CMReports\Init directory.

    New Error Handling Features in B2B DT 8.x

    Using the Optional Property to Handle Failures

    If the optional property of a component is not selected, a failure of the component causes its parent to fail. If the parent is also

    non-optional, its own parent fails, and so forth. For example, suppose that a Parser contains a Group, and the Group contains a


    Marker. All the components are non-optional. If the Marker does not exist in the source document, the

    Marker fails. This causes the Group to fail, which in turn causes the Parser to fail.

    If the optional property of a component is selected, a failure of the component does not bubble up to the parent. For example,

    suppose that a Parser contains a Group, and the Group contains a Marker. In this example, suppose that the Group is optional.

    The failed Marker causes the Group to fail, but the Parser does not fail.

    Note however that certain components lack the optional property because the components never fail, regardless of their input.

    An example is the Sort action. If the Sort action finds no data to sort, it simply does nothing. It does not report a failure.

    Rollback

    If a component fails, its effects are rolled back. For example, suppose that a Group contains three non-optional Content

    anchors that store values in data holders. If the third Content anchor fails, the Group fails. Data Transformation rolls back the

    effects of the first two Content anchors. The data that the first two Content anchors already stored in data holders is removed.

    The rollback applies only to the main effects of a transformation such as a parser storing values in data holders or a serializer

    writing to its output. The rollback does not apply to side effects. In the above example, if the Group contains an ODBCAction

    that performs an INSERT query on a database, the record that the action added to the database is not deleted.

    Writing a Failure Message to the User Log

    A component can be configured to output failure events to a user-defined log. For example, if an anchor fails to find text in the

    source document, it can write a message in the user log. This can occur even if the anchor is defined as optional so that the

    failure does not terminate the transformation processing.

    The user log can contain the following types of information:

    * Failure level: Information, Warning, or Error

    * Name of the component that failed

    * Failure description

    * Location of the failed component in the IntelliScript

    * Additional information about the transformation status (such as the values of data holders)

    CustomLog

    The CustomLog component can be used as the value of the on_fail property. In the event of a failure, the CustomLog

    component runs a serializer that prepares a log message. The system writes the message to a specified location.

    run_serializer - A serializer that prepares the log message.

    output - The output location. The options include:
    * MSMQOutput. Writes to an MSMQ queue.
    * OutputDataHolder. Writes to a data holder.
    * OutputFile. Writes to a file.
    * ResultFile. Writes to the default results file of the transformation.
    * OutputCOM. Uses a custom COM component to output the data.
    Additional choices:
    * OutputPort. The name of an AdditionalOutputPort where the data is written.
    * StandardErrorLog. Writes to the user log.

    Error Handling in B2B DT with PowerCenter Integration

    In a B2B DT solution scenario, both expected and unexpected errors can occur, whether caused by a production issue or a

    flaw in the design. If the right error handling processes are not in place, then when an error occurs the processing aborts with a description of the error in the log (event file). This can also result in data loss if the erroneous records are not captured and reported correctly, and it causes the calling program to fail. For example, if the B2B Data Transformation service is used through PowerCenter UDO/B2B DT, then an error causes the PowerCenter session to fail.


    This section focuses on how to orchestrate PowerCenter and B2B DT if the B2B DT services are being

    called from a PowerCenter mapping. Below are the most common ways of orchestrating the error trapping and error handling

    mechanism.

    1. Use PowerCenter's Robustness and Reporting Function: In general, the PowerCenter engine is robust and powerful enough to handle complex error scenarios. Thus the usual practice is to perform any business

    validation or valid values comparison in PowerCenter. This enables error records to be directed to the already established

    Bad Files or Reject Tables in PowerCenter. This feature also allows the repository to store information about the number

    of records loaded and the number of records rejected and thus aids in easier reporting of errors.

    2. Output the Error in an XML Tag: When complex parsing validations are involved, B2B DT is more powerful than

    PowerCenter in handling them (e.g., String function and regular expression). In these scenarios the validations are

    performed in the B2B DT engine and the schema is redesigned to capture the error information in the associated tag of

    the XML. When this XML is parsed in a PowerCenter mapping, the error tags are directed to custom-built error reporting tables from which the errors can be reported. The design of the custom-built error tables will depend on the design of the error handling XML schema. Generally these tables correspond one-to-one with the XML structure, with a few additional metadata fields such as processing date, source system, etc.

    3. Output to the PowerCenter Log Files: If an unexpected error occurs in the B2B DT processing, then the error descriptions and details are stored in the log file directory specified in CMconfig.xml. The path to the file and the fatal errors are reported to the PowerCenter log so that the operators can quickly detect problems. This unexpected error handling can be exploited, with care, for user-defined errors in the B2B DT transformation by adding the AddEventAction and marking the error type as Failure.

    Best Practices for Handling Errors in Production

    In a production environment the turnaround time of the processes should be as short as possible, and the processes should be as automated as possible. Using B2B DT integration with PowerCenter, these requirements can be met seamlessly without intervention from IT professionals for error reporting, the correction of the data file, and the reprocessing of data.

    Example Scenario 1 - HIPAA Error Reporting


    Example Scenario 2 - Emailing Error Files to Operator

    Below is a case study for an implementation at a major financial client. The solution was implemented with total automation for

    the sequence of error trapping, error reporting, correction and reprocessing of data. The high level solution steps are:

    * Analyst receives loan tape via Email from a dealer

    * Analyst saves the file to a file system on a designated file share

    * A J2EE server monitors the file share for new files and pushes them to PowerCenter

    * PowerCenter invokes B2B DT to process (passing XML data fragment, supplying path to loan tape file and other

    parameters)

    * Upon a successful outcome, PowerCenter saves the data to the target database

    * PowerCenter notifies the Analyst via Email

    * On failure, PowerCenter Emails the XLS error file containing the original data and errors


  • Error Handling Strategies - Data Warehousing

    Challenge

    A key requirement for any successful data warehouse or data integration project is that it attains

    credibility within the user community. At the same time, it is imperative that the warehouse be as

    up-to-date as possible since the more recent the information derived from it is, the more relevant it is to

    the business operations of the organization, thereby providing the best opportunity to gain an advantage

    over the competition.

    Transactional systems can manage to function even with a certain amount of error since the impact of an individual transaction

    (in error) has a limited effect on the business figures as a whole, and corrections can be applied to erroneous data after the

    event (i.e., after the error has been identified). In data warehouse systems, however, any systematic error (e.g., for a particular

    load instance) not only affects a larger number of data items, but may potentially distort key reporting metrics. Such data

    cannot be left in the warehouse "until someone notices" because business decisions may be driven by such information.

    Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors occur, it is equally

    important either to prevent them from getting to the warehouse at all, or to remove them from the warehouse immediately (i.e.,

    before the business tries to use the information in error).

    The types of error to consider include:

    * Source data structures

    * Sources presented out-of-sequence

    * Old sources represented in error

    * Incomplete source files

    * Data-type errors for individual fields

    * Unrealistic values (e.g., impossible dates)

    * Business rule breaches

    * Missing mandatory data

    * O/S errors

    * RDBMS errors

    These cover both high-level (i.e., related to the process or a load as a whole) and low-level (i.e., field or column-related errors)

    concerns.

    Description

    In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you can be sure that

    every source element was populated correctly, with meaningful values, never missing a value, and fulfilling all relational

    constraints. At the same time, source data sets always have a fixed structure, are always available on time (and in the correct

    order), and are never corrupted during transfer to the data warehouse. In addition, the OS and RDBMS never run out of

    resources, or have permissions and privileges change.

    Realistically, however, the operational applications are rarely able to cope with every possible business scenario or

    combination of events; operational systems crash, networks fall over, and users may not use the transactional systems in quite

    the way they were designed. The operational systems also typically need some flexibility to allow non-fixed data to be stored

    (typically as free-text comments). In every case, there is a risk that the source data does not match what the data warehouse

    expects.

    Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by the business

    managers. If erroneous data does reach the warehouse, it must be identified and removed immediately (before the current

    version of the warehouse can be published). Preferably, error data should be identified during the load process and prevented

    from reaching the warehouse at all. Ideally, erroneous source data should be identified before a load even begins, so that no

    resources are wasted trying to load it.

    As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors within the


    warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point on, it

    becomes impossible to guarantee that a metric or data item came from a specific source via a specific chain of processes. As a

    by-product, adopting this principle also helps to tie both the end-users and those responsible for the source data into the

    warehouse process; source data staff understand that their professionalism directly affects the quality of the reports, and

    end-users become owners of their data.

    As a final consideration, error management (the implementation of an error handling strategy) complements and overlaps load

    management, data quality and key management, and operational processes and procedures.

    Load management processes record at a high level whether a load is unsuccessful; error management records the details of why the

    failure occurred.

    Quality management defines the criteria whereby data can be identified as in error; and error management identifies the

    specific error(s), thereby allowing the source data to be corrected.

    Operational reporting shows a picture of loads over time, and error management allows analysis to identify systematic errors,

    perhaps indicating a failure in operational procedure.

    Error management must therefore be tightly integrated within the data warehouse load process. This is shown in the high level

    flow chart below:


    Error Management

    Considerations

    High-Level Issues

    From previous discussion of load management, a number of checks can be performed before any attempt is made to load a source data set. Without load management in place, it is unlikely that the warehouse process will be robust enough to satisfy any end-user requirements, and error correction processing becomes moot (in so far as nearly all maintenance and development resources will be working full time to manually correct bad data in the warehouse). The following assumes that you have implemented load management processes similar to Informatica's best practices.

    * Process Dependency checks in the load management can identify when a source data set is missing, duplicates a previous version, or has been presented out of sequence, and where the previous load failed but has not yet been corrected.

    * Load management prevents this source data from being loaded. At the same time, error management processes should record the details of the failed load, noting the source instance, the load affected, and when and why the load was aborted.

    * Source file structures can be compared to expected structures stored as metadata, either from header information or by attempting to read the first data row.

    * Source table structures can be compared to expectations;

    typically this can be done by interrogating the RDBMS catalogue directly (and comparing to the expected structure held in


    metadata), or by simply running a describe command against the table (again comparing to a

    pre-stored version in metadata).

    * Control file totals (for file sources) and row number counts (table sources) are also used to determine if files have been

    corrupted or truncated during transfer, or if tables have no new data in them (suggesting a fault in an operational

    application).

    * In every case, information should be recorded to identify where and when an error occurred, what sort of error it was, and

    any other relevant process-level details.
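    As a minimal sketch of the structure comparison described above, the following Python function interrogates the RDBMS catalogue and compares the result to the expected structure held in metadata. It assumes an RDBMS that exposes the standard INFORMATION_SCHEMA views and a DB-API connection; the expected-structure dictionary stands in for whatever metadata store the project uses.

        # Minimal sketch: compare a source table's actual columns and data types with
        # the expected structure held in metadata, and report any differences.
        from typing import Any, Dict, List

        def structure_differences(conn: Any, table_name: str, expected: Dict[str, str]) -> List[str]:
            """Return a list of differences between the catalogue and the expected structure."""
            cursor = conn.cursor()
            cursor.execute(
                "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS "
                "WHERE TABLE_NAME = ?",
                (table_name,),
            )
            actual = {name.upper(): dtype.upper() for name, dtype in cursor.fetchall()}
            wanted = {name.upper(): dtype.upper() for name, dtype in expected.items()}
            problems = []
            for column, dtype in wanted.items():
                if column not in actual:
                    problems.append(f"missing column {column}")
                elif actual[column] != dtype:
                    problems.append(f"{column} is {actual[column]}, expected {dtype}")
            for column in actual.keys() - wanted.keys():
                problems.append(f"unexpected column {column}")
            return problems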

    Low-Level Issues

    Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load to abort), further

    error management processes need to be applied to the individual source rows and fields.

    * Individual source fields can be compared to expected data-types against standard metadata within the repository, or

    additional information added by the development. In some instances, this is enough to abort the rest of the load; if the field

    structure is incorrect, it is much more likely that the source data set as a whole either cannot be processed at all or (more

    worryingly) is likely to be processed unpredictably.

    * Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-in error handling can

    be used to spot failed date conversions, conversions of string to numbers, or missing required data. In rare cases, stored

    procedures can be called if a specific conversion fails; however this cannot be generally recommended because of the

    potentially crushing impact on performance if a particularly error-filled load occurs.

    * Business rule breaches can then be picked up. It is possible to define allowable values, or acceptable value ranges within

    PowerCenter mappings (if the rules are few, and it is clear from the mapping metadata that the business rules are included

    in the mapping itself). A more flexible approach is to use external tables to codify the business rules. In this way, only the

    rules tables need to be amended if a new business rule needs to be applied. Informatica has suggested methods to

    implement such a process.

    * Missing Key/Unknown Key issues have already been defined in their own best practice document Key Management

    in Data Warehousing Solutions with suggested management techniques for identifying and handling them. However,

    from an error handling perspective, such errors must still be identified and recorded, even when key management

    techniques do not formally fail source rows with key errors. Unless a record is kept of the frequency with which particular

    source data fails, it is difficult to realize when there is a systematic problem in the source systems.

    * Inter-row errors may also have to be considered. These may occur when a business process expects a certain hierarchy of

    events (e.g., a customer query, followed by a booking request, followed by a confirmation, followed by a payment). If the

    events arrive from the source system in the wrong order, or where key events are missing, it may indicate a major problem

    with the source system, or the way in which the source system is being used.

    * An important principle to follow is to try to identify all of the errors on a particular row before halting processing, rather

    than rejecting the row at the first instance. This seems to break the rule of not wasting resources trying to load a source

    data set if we already know it is in error; however, since the row needs to be corrected at source, then reprocessed

    subsequently, it is sensible to identify all the corrections that need to be made before reloading, rather than fixing the

    first, re-running, and then identifying a second error (which halts the load for a second time).
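    A minimal sketch of this collect-all-errors principle is shown below in Python: every check is applied to the row and all failures are reported together, instead of rejecting the row at the first problem. The field names and rules are illustrative assumptions.

        # Minimal sketch: run every validation on the row and return the full list of
        # errors, so the source system owner can fix them all before the row is reloaded.
        from datetime import datetime
        from typing import Dict, List

        def validate_row(row: Dict[str, str], valid_statuses: set) -> List[str]:
            """Return every error found on the row (an empty list means the row is clean)."""
            errors: List[str] = []
            if not row.get("CUSTOMER_ID"):
                errors.append("CUSTOMER_ID is missing")            # missing mandatory data
            try:
                datetime.strptime(row.get("ORDER_DATE", ""), "%Y-%m-%d")
            except ValueError:
                errors.append("ORDER_DATE is not a valid date")    # data conversion error
            if row.get("STATUS") not in valid_statuses:
                errors.append("STATUS breaches the business rule")  # rule held in an external table
            return errors

        print(validate_row({"ORDER_DATE": "2013-02-30", "STATUS": "XX"}, {"OPEN", "CLOSED"}))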

    OS and RDBMS Issues

    Since best practice means that referential integrity (RI) issues are proactively managed within the loads, instances where the

    RDBMS rejects data for referential reasons should be very rare (i.e., the load should already have identified that reference

    information is missing).

    However, there is little that can be done to identify the more generic RDBMS problems that are likely to occur; changes to

    schema permissions, running out of temporary disk space, dropping of tables and schemas, invalid indexes, no further table

    space extents available, missing partitions and the like.

    Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space, command syntax,

    and authentication may occur outside of the data warehouse. Often such changes are driven by Systems Administrators who,

    from an operational perspective, are not aware that there is likely to be an impact on the data warehouse, or are not aware that

    the data warehouse managers need to be kept up to speed.

    In both of the instances above, the nature of the errors may be such that not only will they cause a load to fail, but it may be


    impossible to record the nature of the error at that point in time. For example, if RDBMS user ids are

    revoked, it may be impossible to write a row to an error table if the error process depends on the revoked id; if disk space runs

    out during a write to a target table, this may affect all other tables (including the error tables); if file permissions on a UNIX host

    are amended, bad files themselves (or even the log files) may not be accessible.

    Most of these types of issues can be managed by a proper load management process, however. Since setting the status of a

    load to complete should be absolutely the last step in a given process, any failure before, or including, that point leaves the

    load in an incomplete state. Subsequent runs should note this, and enforce correction of the last load before beginning the

    new one.

    The best practice to manage such OS and RDBMS errors is, therefore, to ensure that the Operational Administrators and

    DBAs have proper and working communication with the data warehouse management to allow proactive control of changes.

    Administrators and DBAs should also be available to the data warehouse operators to rapidly explain and resolve such errors if

    they occur.

    Auto-Correction vs. Manual Correction

    Load management and key management best practices (Key Management in Data Warehousing Solutions) have already

    defined auto-correcting processes; the former to allow loads themselves to launch, rollback, and reload without manual

    intervention, and the latter to allow RI errors to be managed so that the quantitative quality of the warehouse data is preserved,

    and incorrect key values are corrected as soon as the source system provides the missing data.

    We cannot conclude from these two specific techniques, however, that the warehouse should attempt to change source data

    as a general principle. Even if this were possible (which is debatable), such functionality would mean that the absolute link

    between the source data and its eventual incorporation into the data warehouse would be lost. As soon as one of the

    warehouse metrics was identified as incorrect, unpicking the error would be impossible, potentially requiring a whole section of

    the warehouse to be reloaded entirely from scratch.

    In addition, such automatic correction of data might hide the fact that one or other of the source systems had a generic fault, or

    more importantly, had acquired a fault because of on-going development of the transactional applications, or a failure in user

    training.

    The principle to apply here is to identify the errors in the load, and then alert the source system users that data should be

    corrected in the source system itself, ready for the next load to pick up the right data. This maintains the data lineage, allows

    source system errors to be identified and ameliorated in good time, and permits extra training needs to be identified and

    managed.

    Error Management Techniques

    Simple Error Handling Structure

    The following data structure is an example of the error metadata that should be captured as a minimum within the error

    handling strategy.


    The example defines three main sets of information:

    * The ERROR_DEFINITION table, which stores descriptions for the various types of errors, including:

    - process-level (e.g., incorrect source file, load started out-of-sequence)

    - row-level (e.g., missing foreign key, incorrect data-type, conversion errors) and

    - reconciliation (e.g., incorrect row numbers, incorrect file total etc.).

    * The ERROR_HEADER table provides a high-level view on the process, allowing a quick identification of the frequency of

    error for particular loads and of the distribution of error types. It is linked to the load management processes via the

    SRC_INST_ID and PROC_INST_ID, from which other process-level information can be gathered.

    * The ERROR_DETAIL table stores information about actual rows with errors, including how to identify the specific row that

    was in error (using the source natural keys and row number) together with a string of field identifier/value pairs

    concatenated together. It is not expected that this information will be deconstructed as part of an automatic correction

    load, but if necessary this can be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for

    subsequent reporting.
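    As an illustration only, the following Python sketch creates a simplified version of these three tables in an in-memory SQLite database. Only the elements named above (the error type descriptions, SRC_INST_ID, PROC_INST_ID, the source natural key and row number, and the concatenated field/value pairs) come from the example; the remaining column names and data types are assumptions.

        # Minimal sketch of the ERROR_DEFINITION / ERROR_HEADER / ERROR_DETAIL structure,
        # using SQLite purely for illustration.
        import sqlite3

        DDL = """
        CREATE TABLE ERROR_DEFINITION (
            ERROR_TYPE_ID   INTEGER PRIMARY KEY,
            ERROR_CATEGORY  TEXT,      -- process-level, row-level, or reconciliation
            ERROR_DESC      TEXT
        );
        CREATE TABLE ERROR_HEADER (
            ERROR_HEADER_ID INTEGER PRIMARY KEY,
            SRC_INST_ID     INTEGER,   -- link to load management: source instance
            PROC_INST_ID    INTEGER,   -- link to load management: process instance
            ERROR_TYPE_ID   INTEGER REFERENCES ERROR_DEFINITION (ERROR_TYPE_ID),
            ERROR_DATE      TEXT
        );
        CREATE TABLE ERROR_DETAIL (
            ERROR_HEADER_ID INTEGER REFERENCES ERROR_HEADER (ERROR_HEADER_ID),
            SOURCE_KEY      TEXT,      -- natural key identifying the source row in error
            SOURCE_ROW_NUM  INTEGER,
            FIELD_VALUES    TEXT       -- concatenated field identifier/value pairs
        );
        """

        connection = sqlite3.connect(":memory:")
        connection.executescript(DDL)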


  • Error Handling Strategies - General

    Challenge

    The challenge is to accurately and efficiently load data into the target data architecture. This Best

    Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying

    data errors, methods for handling data errors, and alternatives for addressing the most common types of

    problems. For the most part, these strategies are relevant whether your data integration project is

    loading an operational data structure (as with data migrations, consolidations, or loading various sorts of

    operational data stores) or loading a data warehousing structure.

    Description

    Regardless of target data structure, your loading process must validate that the data conforms to known rules of the business.

    When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate

    manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it

    until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two

    conflicting goals:

    * The need for accurate information.

    * The ability to analyze or process the most complete information available with the understanding that errors can exist.

    Data Integration Process Validation

    In general, there are three methods for handling data errors detected in the loading process:

    * Reject All. This is the simplest to implement since all errors are rejected from entering the target when they are

    detected. This provides a very reliable target that the users can count on as being correct, although it may not be complete.

    Both dimensional and factual data can be rejected when any errors are encountered. Reports indicate what the errors are

    and how they affect the completeness of the data.

    Dimensional or Master Data errors can cause valid factual data to be rejected because a foreign key relationship cannot be

    created. These errors need to be fixed in the source systems and reloaded on a subsequent load. Once the corrected rows

    have been loaded, the factual data will be reprocessed and loaded, assuming that all errors have been fixed. This delay may

    cause some user dissatisfaction since the users need to take into account that the data they are looking at may not be a

    complete picture of the operational systems until the errors are fixed. For an operational system, this delay may affect

    downstream transactions.

    The development effort required to fix a Reject All scenario is minimal, since the rejected data can be processed through

    existing mappings once it has been fixed. Minimal additional code may need to be written since the data will only enter the

    target if it is correct, and it would then be loaded into the data mart using the normal process.

    * Reject None. This approach gives users a complete picture of the available data without having to consider data that

    was not available due to it being rejected during the load process. The problem is that the data may not be complete or

    accurate. All of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty

    transactions.

    With Reject None, the complete set of data is loaded, but the data may not support correct transactions or aggregations.

    Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total numbers that are correct, but

    incorrect detail numbers. After the data is fixed, reports may change, with detail information being redistributed along different

    hierarchies.

    The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct

    all of the target data structures, which can be a time-consuming effort based on the delay between an error being detected and

    fixed. The development strategy may include removing information from the target, restoring backup tapes for each night's load,

    and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or

    data marts.


    * Reject Critical. This method provides a balance between missing information and incorrect information. It involves

    examining each row of data and determining the particular data elements to be rejected. All changes that are valid are

    processed into the target to allow for the most complete picture. Rejected elements are reported as errors so that they can

    be fixed in the source systems and loaded on a subsequent run of the ETL process.

    This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts or updates.

    Key elements are required fields that maintain the data integrity of the target and allow for hierarchies to be summarized at

    various levels in the organization. Attributes provide additional descriptive information per key element.

    Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the

    dimension data row in order to load properly. Updates do not affect the data integrity as much because the factual data can

    usually be loaded with the existing dimensional data unless the update is to a key element.

    The development effort for this method is more extensive than Reject All since it involves classifying fields as critical or

    non-critical, and developing logic to update the target and flag the fields that are in error. The effort also incorporates some

    tasks from the Reject None approach, in that processes must be developed to fix incorrect data in the entire target data

    architecture.

    Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target. By providing the

    most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the target on each run of the

    ETL process, while at the same time screening out the unverifiable data fields. However, business management needs to

    understand that some information may be held out of the target, and also that some of the information in the target data

    structures may be at least temporarily allocated to the wrong hierarchies.
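
To make the Reject Critical classification concrete, the following sketch (Python, with hypothetical field names and a stubbed validation rule; it is not part of any Informatica product) shows how a row might be routed once its fields are classified as key elements or attributes.

    # Illustrative sketch only: the field classifications and the validation
    # rule are hypothetical stand-ins for project-specific business rules.
    KEY_ELEMENTS = {"location_id", "product_id"}       # required for data integrity
    ATTRIBUTES = {"color", "address", "open_hours"}    # descriptive information only

    def validate(field, value):
        """Stub business rule: treat empty or missing values as invalid."""
        return value is not None and value != ""

    def classify_row(row):
        """Apply the Reject Critical strategy to a single source row."""
        bad_keys = [f for f in KEY_ELEMENTS if not validate(f, row.get(f))]
        bad_attrs = [f for f in ATTRIBUTES if not validate(f, row.get(f))]
        if bad_keys:
            # A key element failed: reject the whole row and report it.
            return "REJECT", bad_keys + bad_attrs
        if bad_attrs:
            # Only attributes failed: load the row and flag the fields in error.
            return "LOAD_WITH_ERRORS", bad_attrs
        return "LOAD", []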

    Handling Errors in Dimension Profiles

Profiles are tables used to track historical changes to the source data. As the source systems change, profile records are created

    with date stamps that indicate when the change took place. This allows power users to review the target data using either

    current (As-Is) or past (As-Was) views of the data.

    A profile record should occur for each change in the source data. Problems occur when two fields change in the source system

    and one of those fields results in an error. The first value passes validation, which produces a new profile record, while the

    second value is rejected and is not included in the new profile. When this error is fixed, it would be desirable to update the

    existing profile rather than creating a new one, but the logic needed to perform this UPDATE instead of an INSERT is

    complicated. If a third field is changed in the source before the error is fixed, the correction process is complicated further.

    The following example represents three field values in a source system. The first row on 1/1/2000 shows the original values. On

    1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is invalid. On 1/10/2000, Field

    3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.

Date        Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    Closed Sunday    Black            Open 9-5
1/5/2000    Open Sunday      BRed             Open 9-5
1/10/2000   Open Sunday      BRed             Open 24hrs
1/15/2000   Open Sunday      Red              Open 24hrs

    Three methods exist for handling the creation and update of profiles:

    1. The first method produces a new profile record each time a change is detected in the source. If a field value was invalid,

    then the original field value is maintained.

Date        Profile Date   Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    1/1/2000       Closed Sunday    Black            Open 9-5
1/5/2000    1/5/2000       Open Sunday      Black            Open 9-5
1/10/2000   1/10/2000      Open Sunday      Black            Open 24hrs
1/15/2000   1/15/2000      Open Sunday      Red              Open 24hrs

By applying all corrections as new profiles, this method simplifies the process: every change in the source system is applied directly to the target. Each change, regardless of whether it is a fix to a previous error, is applied as a new change that creates

    a new profile. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a

    mistake was entered on the first change and should be reflected in the first profile. The second profile should not have been


created.

2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the profile record for the change to Field 3.

    If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile information. If the third field

    changes before the second field is fixed, we show the third field changed at the same time as the first. When the second field

    was fixed, it would also be added to the existing profile, which incorrectly reflects the changes in the source system.

3. The third method creates only two new profiles, but then causes an update to the profile records on 1/15/2000 to fix the Field 2 value in both.

Date        Profile Date         Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    1/1/2000             Closed Sunday    Black            Open 9-5
1/5/2000    1/5/2000             Open Sunday      Black            Open 9-5
1/10/2000   1/10/2000            Open Sunday      Black            Open 24hrs
1/15/2000   1/5/2000 (Update)    Open Sunday      Red              Open 9-5
1/15/2000   1/10/2000 (Update)   Open Sunday      Red              Open 24hrs

If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create complex algorithms that handle the process correctly. It involves determining when an error occurred, examining all profiles generated since then, and updating them appropriately. And even if we create the algorithms to handle these methods, we still face the issue of determining whether a value is a correction or a new value. If an error is never fixed in the source system, but a new value is entered, we would identify the new value as a fix to the previous error, causing an automated process to update old profile records when, in reality, a new profile record should have been created.

    Recommended Method

A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters a new, correct value, the load strategy flags it as a potential fix that should be applied to old Profile records. In this way, the

    corrected data enters the target as a new Profile record, but the process of fixing old Profile records, and potentially deleting

    the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another

    process examines the existing Profile records and corrects them as necessary. This method only delays the As-Was analysis

    of the data until the correction method is determined because the current information is reflected in the new Profile.
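
A minimal sketch of this recommended flow is shown below (Python; the error_log and profiles structures and the potential_fix flag are assumptions used only to illustrate the idea).

    # Sketch of the recommended profile-correction flow. Table and column
    # names (error_log, profiles, potential_fix) are hypothetical.
    def process_change(key, field, new_value, load_date, error_log, profiles):
        """Insert the change as a new Profile; if the field was previously
        rejected for this key, flag the change as a potential fix so that an
        analyst can decide whether older Profile records should be corrected."""
        was_rejected = any(e["key"] == key and e["field"] == field for e in error_log)
        profiles.append({
            "key": key,
            "profile_date": load_date,
            field: new_value,
            # Defer the As-Was correction: mark the record for review instead
            # of automatically updating or deleting older Profile records.
            "potential_fix": was_rejected,
        })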

    Data Quality Edits

    Quality indicators can be used to record definitive statements regarding the quality of the data received and stored in the target.

The indicators can be appended to existing data tables or stored in a separate table linked by the primary key. Quality indicators

    can be used to:

    * Show the record and field level quality associated with a given record at the time of extract.

    * Identify data sources and errors encountered in specific records.

    * Support the resolution of specific record error types via an update and resubmission process.

Quality indicators can be used to record several types of errors, e.g., fatal errors (missing primary key value), missing data in a

    required field, wrong data type/format, or invalid data value. If a record contains even one error, data quality (DQ) fields will be

    appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ

    fields corresponding to the original fields in the record where the errors were encountered. Records containing a fatal error are

stored in a Rejected Record Table and associated with the original file name and record number. These records cannot be

    loaded to the target because they lack a primary key field to be used as a unique record identifier in the target.

    The following types of errors cannot be processed:

    * A source record does not contain a valid key. This record would be sent to a reject queue. Metadata will be saved and

    used to generate a notice to the sending system indicating that x number of invalid records were received and could not be

    processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has

    been replaced or not.


* The source file or record is illegible. The file or record would be sent to a reject queue. Metadata

    indicating that x number of invalid records were received and could not be processed may or may not be available for a

    general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is possible to

    determine whether the invalid record has been replaced or not. If the file or record is illegible, it is likely that individual

    unique records within the file are not identifiable. While information can be provided to the source system site indicating

    there are file errors for x number of records, specific problems may not be identifiable on a record-by-record basis.

    In these error types, the records can be processed, but they contain errors:

    * A required (non-key) field is missing.

    * The value in a numeric or date field is non-numeric.

    * The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is

    used for this validation.

    When an error is detected during ingest and cleansing, the identified error type is recorded.

    Quality Indicators (Quality Code Table)

    The requirement to validate virtually every data element received from the source data systems mandates the development,

    implementation, capture and maintenance of quality indicators. These are used to indicate the quality of incoming data at an

    elemental level. Aggregated and analyzed over time, these indicators provide the information necessary to identify acute data

    quality problems, systemic issues, business process problems and information technology breakdowns.

The quality indicators (0-No Error, 1-Fatal Error, 2-Missing Data from a Required Field, 3-Wrong Data Type/Format, 4-Invalid Data Value, and 5-Outdated Reference Table in Use) provide a concise indication of the quality of the data within specific fields for every data type. These indicators give operations staff, data quality analysts and users the opportunity to readily

    for every data type. These indicators provide the opportunity for operations staff, data quality analysts and users to readily

    identify issues potentially impacting the quality of the data. At the same time, these indicators provide the level of detail

    necessary for acute quality problems to be remedied in a timely manner.
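
As an illustration, the sketch below (Python) appends one DQ indicator per original field using the codes listed above; the example rules for cust_id and dob are assumptions, not prescribed validations.

    # Hypothetical illustration of field-level quality indicators using the
    # codes described above (0-No Error through 5-Outdated Reference Table in Use).
    NO_ERROR, FATAL, MISSING, BAD_FORMAT, INVALID_VALUE, OUTDATED_REF = range(6)

    def append_dq_fields(record, rules):
        """Return the record plus one DQ indicator field per original field."""
        dq = {}
        for field, value in record.items():
            rule = rules.get(field, lambda v: NO_ERROR)   # default: no check defined
            dq["DQ_" + field] = rule(value)
        return {**record, **dq}

    # Example rules (assumed, for illustration only).
    rules = {
        "cust_id": lambda v: FATAL if v in (None, "") else NO_ERROR,
        "dob": lambda v: MISSING if v in (None, "") else NO_ERROR,
    }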

    Handling Data Errors

    The need to periodically correct data in the target is inevitable. But how often should these corrections be performed?

    The correction process can be as simple as updating field information to reflect actual values, or as complex as deleting data

    from the target, restoring previous loads from tape, and then reloading the information correctly. Although we try to avoid

    performing a complete database restore and reload from a previous point in time, we cannot rule this out as a possible solution.

    Reject Tables vs. Source System

    As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data and the

    related error messages indicating the causes of error. The business needs to decide whether analysts should be allowed to fix

    data in the reject tables, or whether data fixes will be restricted to source systems. If errors are fixed in the reject tables, the

    target will not be synchronized with the source systems. This can present credibility problems when trying to track the history of

    changes in the target data architecture. If all fixes occur in the source systems, then these fixes must be applied correctly to the

    target data.

    Attribute Errors and Default Values

    Attributes provide additional descriptive information about a dimension concept. Attributes include things like the color of a

    product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the

    address. These types of errors do not generally affect the aggregated facts and statistics in the target data; the attributes are

most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find specific patterns for market research).

Attribute errors can be fixed by waiting for the data to be corrected in the source system and reapplied to the target.

    When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter

the target. Some rules that have been proposed for handling defaults are as follows:

Value Types        Description                                         Default
Reference Values   Attributes that are foreign keys to other tables    Unknown
Small Value Sets   Y/N indicator fields                                No
Other              Any other type of attribute                         Null or business-provided value

    Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not

    translate into a reference table value, we use the Unknown value. (All reference tables contain a value of Unknown for this

    purpose.)

    The business should provide default values for each identified attribute. Fields that are restricted to a limited domain of values

(e.g., On/Off or Yes/No indicators) are referred to as small-value sets. When errors are encountered in translating these

    values, we use the value that represents off or No as the default. Other values, like numbers, are handled on a case-by-case

    basis. In many cases, the data integration process is set to populate Null into these fields, which means undefined in the target.

    After a source system value is corrected and passes validation, it is corrected in the target.
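
A minimal sketch of these default rules, assuming hypothetical attribute classifications, might look like the following (Python):

    # Sketch of the default-value rules in the table above; the attribute
    # classifications are assumptions for illustration only.
    REFERENCE_ATTRS = {"store_type"}       # foreign keys to reference tables
    SMALL_VALUE_SETS = {"active_flag"}     # Y/N style indicator fields

    def apply_default(field, value):
        """Substitute a default when a new dimension row fails attribute validation."""
        if value is not None:
            return value
        if field in REFERENCE_ATTRS:
            return "UNKNOWN"               # every reference table carries an Unknown row
        if field in SMALL_VALUE_SETS:
            return "N"                     # the 'off' / 'No' value
        return None                        # undefined in the target until the source is fixed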

    Primary Key Errors

    The business also needs to decide how to handle new dimensional values such as locations. Problems occur when the new

    key is actually an update to an old key in the source system. For example, a location number is assigned and the new location

    is transferred to the target using the normal process; then the location number is changed due to some source business rule

such as "all Warehouses should be in the 5000 range." The process assumes that the change in the primary key is actually a

    new warehouse and that the old warehouse was deleted. This type of error causes a separation of fact data, with some data

    being attributed to the old primary key and some to the new. An analyst would be unable to get a complete picture.

    Fixing this type of error involves integrating the two records in the target data, along with the related facts. Integrating the two

    rows involves combining the profile information, taking care to coordinate the effective dates of the profiles to sequence

    properly. If two profile records exist for the same day, then a manual decision is required as to which is correct. If facts were

    loaded using both primary keys, then the related fact rows must be added together and the originals deleted in order to correct

    the data.
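
One way to picture the fact-merge step is the following hedged sketch (Python with pandas; the column names location_key, date_key, sales_qty and sales_amt are assumptions about the fact table's grain):

    import pandas as pd

    def merge_duplicate_keys(facts: pd.DataFrame, old_key, surviving_key):
        """Re-point facts from the obsolete key to the surviving key, then add
        together rows that now share the same grain; the aggregation replaces
        the original rows, which is the combine-and-delete step described above."""
        facts = facts.copy()
        facts.loc[facts["location_key"] == old_key, "location_key"] = surviving_key
        return (facts.groupby(["location_key", "date_key"], as_index=False)
                     [["sales_qty", "sales_amt"]].sum())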

    The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same target data

    ID really represent two different IDs). In this case, it is necessary to restore the source information for both dimensions and

    facts from the point in time at which the error was introduced, deleting affected records from the target and reloading from the

    restore to correct the errors.

    DM Facts Calculated from EDW Dimensions

    If information is captured as dimensional data from the source, but used as measures residing on the fact records in the target,

    we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until the value is corrected.

    If we load the facts with the incorrect data, the process to fix the target can be time consuming and difficult to implement.

    If we let the facts enter downstream target structures, we need to create processes that update them after the dimensional data

    is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes simpler. After the errors are

    fixed, the affected rows can simply be loaded and applied to the target data.

    Fact Errors

    If there are no business rules that reject fact records except for relationship errors to dimensional data, then when we

    encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the following

    night. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic

    analyses should be performed on the errors to determine why they are not being loaded.
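
A simple sketch of the nightly reprocess loop might look like this (Python; the reject-row structure and the dimension_exists and load_fact_row helpers are hypothetical stand-ins for the real ETL objects):

    # Minimal sketch of nightly reprocessing of rejected fact rows.
    def reprocess_rejects(reject_rows, dimension_exists, load_fact_row):
        still_rejected = []
        for row in reject_rows:
            if dimension_exists(row["dim_key"]):
                load_fact_row(row)              # dimension now exists; the fact loads
            else:
                row["retry_count"] = row.get("retry_count", 0) + 1
                still_rejected.append(row)      # keep the row for the next night's run
        return still_rejected

Tracking a retry count, as in the sketch, supports the periodic analysis of rows that never manage to load.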

    Data Stewards

    Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in

    dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables

    enable the target data architecture to maintain consistent descriptions across multiple source systems, regardless of how the

    source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple


source data occurs when two source systems can contain different data for the same dimensional entity.

    Reference Tables

    The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code

    value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table

    to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source

    systems into the target structures.

    The translation tables contain one or more rows for each source value and map the value to a matching row in the reference

    table. For example, the SOURCE column in FILE X on System X can contain O, S or W. The data steward would be

    responsible for entering in the translation table the following values:

Source Value   Code Translation
O              OFFICE
S              STORE
W              WAREHSE

    These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar

    field may use a two-letter abbreviation like OF, ST and WH. The data steward would make the following entries into the

    translation table to maintain consistency across systems:

Source Value   Code Translation
OF             OFFICE
ST             STORE
WH             WAREHSE

    The data stewards are also responsible for maintaining the reference table that translates the codes into descriptions. The ETL

    process uses the reference table to populate the following values into the target:

Code Translation   Code Description
OFFICE             Office
STORE              Retail Store
WAREHSE            Distribution Warehouse
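
The two-step lookup can be pictured with the following sketch (Python; the dictionaries stand in for the physical translation and reference tables, and the source system identifiers SYSX and SYSY are hypothetical):

    # Values copied from the example tables above.
    TRANSLATION = {
        ("SYSX", "O"): "OFFICE", ("SYSX", "S"): "STORE", ("SYSX", "W"): "WAREHSE",
        ("SYSY", "OF"): "OFFICE", ("SYSY", "ST"): "STORE", ("SYSY", "WH"): "WAREHSE",
    }
    REFERENCE = {
        "OFFICE": "Office",
        "STORE": "Retail Store",
        "WAREHSE": "Distribution Warehouse",
    }

    def resolve(source_system, source_value):
        """Translate a source code to the target code and its description."""
        code = TRANSLATION.get((source_system, source_value), "UNKNOWN")
        return code, REFERENCE.get(code, "Unknown")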

Error handling is required when the data steward enters incorrect information for these mappings and needs to correct it after

    data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to

    OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload source data from the

    first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the

    entire target data architecture.

    Dimensional Data

    New entities in dimensional data present a more complex issue. New entities in the target may include Locations and Products,

    at a minimum. Dimensional data uses the same concept of translation as reference tables. These translation tables map the

    source system value to the target value. For location, this is straightforward, but over time, products may have multiple source

    system values that map to the same product in the target. (Other similar translation issues may also exist, but Products serves

    as a good example for error handling.)

    There are two possible methods for loading new dimensional entities. Either require the data steward to enter the translation

    data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the

    data steward to review it. The first option requires the data steward to create the translation for new entities, while the second

    lets the ETL process create the translation, but marks the record as Pending Verification until the data steward reviews it and

    changes the status to Verified before any facts that reference it can be loaded.

When the dimensional value is left as Pending Verification, however, facts may be rejected or allocated to dummy values. This

    requires the data stewards to review the status of new values on a daily basis. A potential solution to this issue is to generate

    an email each night if there are any translation table entries pending verification. The data steward then opens a report that lists

    them.
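
A minimal sketch of that nightly notification is shown below (Python; the mail host, addresses and the pending_translations query function are assumptions):

    import smtplib
    from email.message import EmailMessage

    def notify_pending(pending_translations, steward_email):
        """E-mail the data steward if any translation entries are pending verification."""
        rows = pending_translations()      # e.g., rows WHERE status = 'Pending Verification'
        if not rows:
            return
        msg = EmailMessage()
        msg["Subject"] = f"{len(rows)} translation table entries pending verification"
        msg["From"] = "etl@example.com"
        msg["To"] = steward_email
        msg.set_content("\n".join(str(r) for r in rows))
        with smtplib.SMTP("mailhost.example.com") as smtp:
            smtp.send_message(msg)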


A problem specific to Product occurs when an entity is created as new but is really just a changed SKU number for an existing product.

    This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is

    fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to

    be merged, requiring manual intervention.

    The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same product, but

    really represent two different products). In this case, it is necessary to restore the source information for all loads since the error

    was introduced. Affected records from the target should be deleted and then reloaded from the restore to correctly split the

    data. Facts should be split to allocate the information correctly and dimensions split to generate correct profile information.

    Manual Updates

    Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be

    established for manually entering fixed data and applying it correctly to the entire target data architecture, including beginning

    and ending effective dates. These dates are useful for both profile and date event fixes. Further, a log of these fixes should be

    maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.
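
One possible shape for such a manual-fix log entry is sketched below (Python; the structure and field names are assumptions intended only to reflect the requirements described above):

    from datetime import date

    def log_manual_fix(log, table, key, field, old_value, new_value,
                       begin_eff_date, end_eff_date, analyst):
        """Record a manual correction, including effective dates and its origin."""
        log.append({
            "table": table, "key": key, "field": field,
            "old_value": old_value, "new_value": new_value,
            "begin_eff_date": begin_eff_date, "end_eff_date": end_eff_date,
            "fixed_by": analyst, "fixed_on": date.today(),
            "source": "MANUAL",    # distinguishes manual fixes from the normal load process
        })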

    Multiple Sources

    The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain

    subsets of the required information. For example, one system may contain Warehouse and Store information while another

    contains Store and Hub information. Because they share Store information, it is difficult to decide which source contains the

    correct information.

    When this happens, both sources have the ability to update the same row in the target. If both sources are allowed to update

    the shared information, data accuracy and profile problems are likely to occur. If we update the shared information on only one

    source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a

    new profile indicating the information changed. When the second system is loaded, it compares its old unchanged value to the

    new profile, assumes a change occurred and creates another new profile with the old, unchanged value. If the two systems

    remain different, the process causes two profiles to be loaded every day until the two source systems are synchronized with

    the same information.

    To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source that should

    be considered primary for the field. Then, only if the field changes on the primary source would it be changed. While this

    sounds simple, it requires complex logic when creating Profiles, because multiple sources can provide information toward the

    one profile record created for that day.

    One solution to this problem is to develop a system of record for all sources. This allows developers to pull the information from

    the system of record, knowing that there are no conflicts for multiple sources. Another solution is to indicate, at the field level, a

    primary source where information can be shared from multiple sources. Developers can use the field level information to

    update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the

    correct source fields as primary and by the data integration team to customize the load process.
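
A hedged sketch of the field-level primary-source merge is shown below (Python; the PRIMARY_SOURCE mapping is an assumption maintained by the data stewards, not a product feature):

    # Only fields owned by the loading source system are applied to the shared row.
    PRIMARY_SOURCE = {"store_name": "SYS_A", "store_hours": "SYS_B"}

    def merge_shared_row(current, incoming, source_system):
        """Apply only the fields for which this source system is primary."""
        changed = False
        for field, value in incoming.items():
            if PRIMARY_SOURCE.get(field) == source_system and current.get(field) != value:
                current[field] = value
                changed = True
        return current, changed    # 'changed' drives whether a new profile is created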


Error Handling Techniques - PowerCenter Mappings

    Challenge

    Identifying and capturing data errors using a mapping approach, and making such errors available for

    further processing or correction.

    Description

    Identifying errors and creating an error handling strategy is an essential part of a data integration project. In the production

    environment, data must be checked and validated prior to entry into the target system. One strategy for catching data errors is

    to use PowerCenter mappings and error logging capabilities to catch specific data validation errors and unexpected

    transformation or database constraint errors.

    Data Validation Errors

    The first step in using a mapping to trap data validation errors is to understand and identify the error handling requirements.

    Consider the following questions:

* What types of data errors are likely to be encountered?
* Of these errors, which ones should be captured?
* What process can capture the possible errors?
* Should errors be captured before they have a chance to be written to the target database?
* Will any of these errors need to be reloaded or corrected?
* How will the users know if errors are encountered?
* How will the errors be stored?
* Should descriptions be assigned for individual errors?
* Can a table be designed to store captured errors and the error descriptions?

    Capturing data errors within a mapping and re-routing these errors to an error table facilitates analysis by end users and

    improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g.,

    executing a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of

    functionality in a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the

    target table; constraint errors are captured within the mapping so that the PowerCenter server does not have to write them to

    the session log and the reject/bad file, thus improving performance.
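
The lookup-before-load idea can be sketched as follows (Python; customer_key and the key set are hypothetical, and the lists stand in for the valid and error routes of the mapping):

    # Check a fact row against known dimension keys before it reaches the target.
    def route_fact(row, dimension_keys, valid_rows, error_rows):
        if row["customer_key"] in dimension_keys:
            valid_rows.append(row)    # safe to load; the foreign key cannot fail
        else:
            error_rows.append({**row, "ERROR_DESC": "customer_key not found in dimension"})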

    Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to

them. This approach can be effective for many types of data content errors, including date conversion failures, null values intended for not null target fields, and incorrect data formats or data types.

    Sample Mapping Approach for Data Validation Errors

    In the following example, customer data is to be checked to ensure that invalid null values are intercepted before being written

    to not null columns in a target CUSTOMER table. Once a null value is identified, the row containing the error is to be separated

    from the data flow and logged in an error table.

One solution is to implement a mapping similar to the one described below:


An expression transformation can be employed to validate the source data, applying rules and flagging

    records with one or more errors.

    A router transformation can then separate valid rows from those containing the errors. It is good practice to append error rows

    with a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID, for example. The MAPPING_ID would

    refer to the mapping name and the ROW_ID would be created by a sequence generator.

    The composite key is designed to allow developers to trace rows written to the error tables that store information useful for error

    reporting and investigation. In this example, two error tables are suggested, namely: CUSTOMER_ERR and ERR_DESC_TBL.

The table ERR_DESC_TBL is designed to hold information about the error, such as the mapping name, the ROW_ID, and the error

    description. This table can be used to hold all data validation error descriptions for all mappings, giving a single point of

    reference for reporting.

    The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two additional columns:

    ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The CUSTOMER_ERR table stores the

    entire row that was rejected, enabling the user to trace the error rows back to the source and potentially build mappings to

    reprocess them.

    The mapping logic must assign a unique description for each error in the rejected row. In this example, any null value intended


for a not null target field could generate an error message such as "NAME is NULL" or "DOB is NULL". This

    step can be done in an expression transformation (e.g., EXP_VALIDATION in the sample mapping).

    After the field descriptions are assigned, the error row can be split into several rows, one for each possible error using a

    normalizer transformation. After a single source row is normalized, the resulting rows can be filtered to leave only errors that

    are present (i.e., each record can have zero to many errors). For example, if a row has three errors, three error rows would be

    generated with appropriate error descriptions (ERROR_DESC) in the table ERR_DESC_TBL.

The following tables show how the error data produced may look.

Table Name: CUSTOMER_ERR
NAME   DOB    ADDRESS   ROW_ID   MAPPING_ID
NULL   NULL   NULL      1        DIM_LOAD

Table Name: ERR_DESC_TBL
FOLDER_NAME   MAPPING_ID   ROW_ID   ERROR_DESC        LOAD_DATE    SOURCE        TARGET
CUST          DIM_LOAD     1        Name is NULL      10/11/2006   CUSTOMER_FF   CUSTOMER
CUST          DIM_LOAD     1        DOB is NULL       10/11/2006   CUSTOMER_FF   CUSTOMER
CUST          DIM_LOAD     1        Address is NULL   10/11/2006   CUSTOMER_FF   CUSTOMER
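
The overall validate / route / normalize pattern can be sketched as follows (Python; the names mirror the example above, and the sequence counter stands in for the Sequence Generator transformation):

    from itertools import count

    row_id_seq = count(1)                        # stands in for the Sequence Generator
    NOT_NULL_FIELDS = ["NAME", "DOB", "ADDRESS"]

    def validate_customer(row, mapping_id="DIM_LOAD"):
        """Flag null values, route the row, and emit one error row per error."""
        errors = [f"{f} is NULL" for f in NOT_NULL_FIELDS if row.get(f) is None]
        if not errors:
            return {"target": row, "customer_err": None, "err_desc": []}
        row_id = next(row_id_seq)
        customer_err = {**row, "ROW_ID": row_id, "MAPPING_ID": mapping_id}
        err_desc = [{"MAPPING_ID": mapping_id, "ROW_ID": row_id, "ERROR_DESC": e}
                    for e in errors]             # normalized: one row per error
        return {"target": None, "customer_err": customer_err, "err_desc": err_desc}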

    The efficiency of a mapping approach can be increased by employing reusable objects. Common logic should be placed in

    mapplets, which can be shared by multiple mappings. This improves productivity in implementing and managing the capture of

    data validation errors.

    Data validation error handling can be extended by including mapping logic to grade error severity. For example, f