Informatica Best Practices - Error Handling


  • Table of Contents

    Error Handling
    Error Handling Process
    Error Handling Strategies - B2B Data Transformation
    Error Handling Strategies - Data Warehousing
    Error Handling Strategies - General
    Error Handling Techniques - PowerCenter Mappings
    Error Handling Techniques - PowerCenter Workflows and Data Analyzer


  • Error Handling Process

    Challenge

    For an error handling strategy to be implemented successfully, it must be integral to the load process as

    a whole. The method of implementation for the strategy will vary depending on the data integration

    requirements for each project.

    The resulting error handling process should, however, always involve the following three steps:

    1. Error identification

    2. Error retrieval

    3. Error correction

    This Best Practice describes how each of these steps can be facilitated within the PowerCenter environment.

    Description

    A typical error handling process leverages the best-of-breed error management technology available in PowerCenter, such as:

    * Relational database error logging

    * Email notification of workflow failures

    * Session error thresholds

    * The reporting capabilities of PowerCenter Data Analyzer

    * Data profiling

    These capabilities can be integrated to facilitate error identification, retrieval, and correction as described in the flow chart

    below:

    Error Identification


    The first step in the error handling process is error identification. Error identification is often achieved through the use of the ERROR() function within mappings, the enabling of relational error logging in PowerCenter, and referential integrity constraints at the database.

    Enabling the relational error logging functionality automatically writes row-level data to a set of four error handling tables (PMERR_MSG, PMERR_DATA, PMERR_TRANS, and PMERR_SESS). These tables can be centralized in the PowerCenter repository and store information such as error messages, error data, and source row data. Row-level errors trapped in this manner include any database errors (e.g., referential integrity failures), transformation errors, and business rule exceptions for which the ERROR() function was called within the mapping.

    Error Retrieval

    The second step in the error handling process is error retrieval. After errors have been captured in the PowerCenter repository,

    it is important to make their retrieval simple and automated so that the process is as efficient as possible. Data Analyzer can be

    customized to create error retrieval reports from the information stored in the PowerCenter repository. A typical error report

    prompts a user for the folder and workflow name, and returns a report with information such as the session, error message, and

    data that caused the error. In this way, the error is successfully captured in the repository and can be easily retrieved through a

    Data Analyzer report, or an email alert that notifies a user when a certain threshold is crossed (such as the number of errors being greater than zero).
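    As an illustration of such a retrieval query, the following minimal sketch, in Python, reads error messages for one workflow from the relational error logging tables. Beyond the documented table names (PMERR_SESS, PMERR_MSG), the join and column names are assumptions and should be verified against the actual repository schema; the connection object is any DB-API connection to the error logging database.

        # Minimal sketch: retrieve logged errors for one folder/workflow combination.
        # Column names such as SESS_INST_ID, ERROR_MSG and ERROR_TIMESTAMP are
        # assumptions for illustration only.
        from typing import Any, Sequence

        def fetch_session_errors(conn: Any, folder_name: str, workflow_name: str) -> Sequence[tuple]:
            """Return the error rows logged for the sessions of one workflow."""
            sql = (
                "SELECT s.FOLDER_NAME, s.WORKFLOW_NAME, s.SESS_INST_NAME, "
                "       m.ERROR_TIMESTAMP, m.ERROR_MSG "
                "FROM   PMERR_SESS s "
                "JOIN   PMERR_MSG  m ON m.SESS_INST_ID = s.SESS_INST_ID "
                "WHERE  s.FOLDER_NAME = ? AND s.WORKFLOW_NAME = ? "
                "ORDER BY m.ERROR_TIMESTAMP"
            )
            cursor = conn.cursor()
            cursor.execute(sql, (folder_name, workflow_name))
            return cursor.fetchall()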

    Error Correction

    The final step in the error handling process is error correction. Because PowerCenter automates error identification and Data Analyzer simplifies error retrieval, error correction is straightforward. After retrieving an error through

    Data Analyzer, the error report (which contains information such as workflow name, session name, error date, error message,

    error data, and source row data) can be exported to various file formats including Microsoft Excel, Adobe PDF, CSV, and

    others. Upon retrieval of an error, the error report can be extracted into a supported format and emailed to a developer or DBA

    to resolve the issue, or it can be entered into a defect management tracking tool. The Data Analyzer interface supports

    emailing a report directly through the web-based interface to make the process even easier.

    For further automation, a report broadcasting rule that emails the error report to a developer's inbox can be set up to run on a

    pre-defined schedule. After the developer or DBA identifies the condition that caused the error, a fix for the error can be

    implemented. The exact method of data correction depends on various factors such as the number of records with errors, data

    availability requirements per SLA, the level of data criticality to the business unit(s), and the type of error that occurred.

    Considerations made during error correction include:

    * The owner of the data should always fix the data errors. For example, if the source data is coming from an external

    system, then the errors should be sent back to the source system to be fixed.

    * In some situations, a simple re-execution of the session will reprocess the data.

    * Consider whether partial data that has already been loaded into the target systems needs to be backed out in order to avoid duplicate processing of rows.

    * Lastly, errors can also be corrected through a manual SQL load of the data. If the volume of errors is low, the rejected data

    can be easily exported to Microsoft Excel or CSV format and corrected in a spreadsheet from the Data Analyzer error

    reports. The corrected data can then be manually inserted into the target table using a SQL statement.

    Any approach to correct erroneous data should be precisely documented and followed as a standard.

    If data errors occur frequently, then reprocessing can be automated by designing a special mapping or session

    to correct the errors and load the corrected data into the ODS or staging area.
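    As an illustration of the manual correction path described above, the following minimal sketch, in Python, reads rows that were corrected in a spreadsheet (exported to CSV from a Data Analyzer error report) and inserts them into a target table. The table and column names are hypothetical, and the connection object is any DB-API connection to the target database.

        # Minimal sketch: insert corrected rows from a CSV export back into the target.
        # CUSTOMER_DIM and its columns are hypothetical names used for illustration.
        import csv
        from typing import Any

        def load_corrected_rows(conn: Any, csv_path: str) -> int:
            """Insert corrected rows from the CSV export and return the number inserted."""
            inserted = 0
            cursor = conn.cursor()
            with open(csv_path, newline="") as handle:
                for row in csv.DictReader(handle):
                    cursor.execute(
                        "INSERT INTO CUSTOMER_DIM (CUSTOMER_ID, CUSTOMER_NAME, STATUS) "
                        "VALUES (?, ?, ?)",
                        (row["CUSTOMER_ID"], row["CUSTOMER_NAME"], row["STATUS"]),
                    )
                    inserted += 1
            conn.commit()   # commit once, after all corrected rows are applied
            return inserted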

    Data Profiling Option

    For organizations that want to identify data irregularities post-load but do not want to reject such rows at load time, the

    PowerCenter Data Profiling option can be an important part of the error management solution. The PowerCenter Data Profiling


    option enables users to create data profiles through a wizard-driven GUI that provides profile reporting such as orphan record identification, business rule violations, and data irregularity identification (such as NULL or default

    values). The Data Profiling option comes with a license to use Data Analyzer reports that source the data profile warehouse to

    deliver data profiling information through an intuitive BI tool. This is a recommended best practice since error handling reports

    and data profile reports can be delivered to users through the same easy-to-use application.

    Integrating Error Handling, Load Management, and Metadata

    Error handling forms only one part of a data integration application. By necessity, it is tightly coupled to the load management

    process and the load metadata; it is the integration of all these approaches that ensures the system is sufficiently robust for

    successful operation and management. The flow chart below illustrates this in the end-to-end load process.


    Error handling underpins the data integration system from end to end. Each of the load components

    performs validation checks, the results of which must be reported to the operational team. These components are not just

    PowerCenter processes such as business rule and field validation, but cover the entire data integration architecture, for

    example:

    * Process Validation. Are all the resources in place for the processing to begin (e.g., connectivity to source systems)?

    * Source File Validation. Is the source file datestamp later than the previous load?

    * File Check. Does the number of rows successfully loaded match the source rows read?
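    As a minimal sketch of how two of these checks might be expressed outside PowerCenter (the Source File Validation and File Check items above), the Python functions below assume that the previous load timestamp and the row counts are supplied by the load management metadata; the function names are illustrative.

        # Minimal sketch of two load validation checks. The calling process is assumed
        # to supply the previous load time (timezone-aware, UTC) and the row counts
        # from load management metadata.
        import os
        from datetime import datetime, timezone

        def source_file_is_new(path: str, previous_load_time: datetime) -> bool:
            """Source File Validation: is the file's datestamp later than the previous load?"""
            modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
            return modified > previous_load_time

        def file_check(source_rows_read: int, rows_loaded: int) -> bool:
            """File Check: does the number of rows successfully loaded match the rows read?"""
            return source_rows_read == rows_loaded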


  • Error Handling Strategies - B2B Data Transformation

    Challenge

    The challenge for B2B Data Transformation (B2B DT) based solutions is to create efficient, accurate

    processes for transforming data to appropriate intermediate data formats and to subsequently transform

    data from those formats to correct output formats. Error handling strategies are a core part of assuring

    the accuracy of any transformation process.

    Error handling strategies in B2B Data Transformation solutions should address the following two needs:

    1. Detection of errors in the transformation leading to successive refinement of the transformation logic during an iterative

    development cycle.

    2. Correct error detection, retrieval, and handling in production environments.

    In general, errors can be characterized as either expected or unexpected.

    An expected error is an error condition that we anticipate will occur periodically. For example, a printer running out of paper is an expected error. In a B2B scenario this may correspond to a partner company sending a file in an incorrect format; although it is an error condition, it is expected from time to time. Usually, processing of an expected error is part of normal system functionality and does not constitute a failure of the system to perform as designed.

    Unexpected errors typically occur when the designers of a system believe a particular scenario is handled, but due to logic

    flaws or some other implementation fault, the scenario is not handled. These errors might include hardware failures, out of

    memory situations or unexpected situations due to software bugs.

    Errors can also be classified by severity (e.g., warning errors and fatal errors).

    For unexpected fatal errors, the transformation process is often unable to complete and may result in a loss of data. In these

    cases, the emphasis is on prompt discovery and reporting of the error and support of any troubleshooting process.

    Often the appropriate action for fatal unexpected errors is not addressed at the individual B2B Data Transformation translation

    level but at the level of the calling process.

    This Best Practice describes various strategies for handling expected and unexpected errors both from production and

    development troubleshooting points of view, and discusses the error handling features included as part of Informatica's B2B

    Data Transformation 8.x.

    Description

    This Best Practice is intended to help designers of B2B DT solutions decide which error handling strategies to employ in their

    solutions and to familiarize them with new features in Informatica B2B DT.

    Terminology

    B2B DT is used as a generic term for the parsing, transformation and serialization technologies provided in Informatica's B2B

    DT products. These technologies have been made available through the B2B Data Transformation, Unstructured Data Option

    for PowerCenter and as standalone products known as B2B Data Transformation and PowerExchange for Complex Data.

    Note: Informatica's B2B DT was previously known as PowerExchange for Complex Data Exchange (CDE) or Itemfield Content Master (CM).

    Errors in B2B Data Transformation Solutions

    There are several types of errors possible in a B2B data transformation. The common types of errors that should be handled

    while designing and developing are:

    * Logic errors


    * Errors in structural aspects of inbound data (missing syntax, etc.)

    * Value errors

    * Errors reported by downstream components (i.e., legacy components in data hubs)

    * Data-type errors for individual fields

    * Unrealistic values (e.g., impossible dates)

    * Business rule breaches

    Production Errors vs. Flaws in the Design - Production errors are those where the source data or the environmental setup

    does not conform to the specifications for the development whereas flaws in design occur when the development does not

    conform to the specification. For example, a production error can be an incorrect source file format that does not conform to the

    specification layout given for development. A flaw in design could be as trivial as defining an element to be mandatory where

    the possibility of non-occurrence of the element cannot be ruled out completely.

    Unexpected Errors vs. Expected Errors: Expected errors are those that can be anticipated for a solution scenario based

    upon experience (e.g., the EDI message file does not conform to the latest EDI specification). Unexpected errors are most

    likely caused by environment set up issues or unknown bugs in the program (e.g., a corrupted file system).

    Severity of Errors - Not all errors in a system are equally important. Some errors may require that the process be halted until they are corrected (e.g., an incorrect format of source files); these are termed critical or fatal errors. In other cases a description field may be longer than the specified field length, but the truncation does not affect the process; such errors are termed warnings. The severity of a particular error can only be defined with respect to the impact it creates on the business process it supports. In B2B DT the severity of errors is classified into the following categories:

    Information - A normal operation performed by B2B Data Transformation.

    Warning - A warning about a possible error. For example, B2B Data Transformation generates a warning event if an operation overwrites the existing content of a data holder. The execution continues.

    Failure - A component failed. For example, an anchor fails if B2B Data Transformation cannot find it in the source document. The execution continues.

    Optional Failure - An optional component, configured with the optional property, failed. For example, an optional anchor is missing from the source document. The execution continues.

    Fatal error - A serious error occurred; for example, a parser has an illegal configuration. B2B Data Transformation halts the execution.

    Unknown - The event status cannot be determined.

    Error Handling in Data Integration Architecture

    The Error Handling Framework in the context of B2B DT defines a basic infrastructure and the mechanisms for building more

    reliable and fault-tolerant data transformations. It integrates error handling facilities into the overall data integration architecture.

    How do you integrate the necessary error handling into the data integration architecture?

    User interaction: Even in erroneous situations the data transformation should behave in a controlled way, and the user should be informed appropriately about the system's state. User interaction with the error handling should be kept controlled to avoid cyclic dependencies.

    Robustness: The error handling should be simple. All additional code for handling error situations makes the transformation

    more complex, which itself increases the probability of errors. Thus the error handling should provide some basic mechanism

    for handling internal errors. However, for the error handling code it is even more important to be correct and to avoid any

    nested error situations.


    Separation of error handling code: Without any separation, the normal code will be cluttered by a lot of error handling code. This makes the code less readable, error-prone, and more difficult to maintain.

    Specific error handling versus complexity: Errors must be classified more precisely in order to handle them effectively and

    to take measures tailored to specific errors.

    Detailed error information versus complexity: Whenever the transformation terminates due to an error, suitable information

    is needed to analyze the error. Otherwise, it is not feasible to investigate the original fault that caused the error.

    Performance: We do not want to pay very much for error handling during normal operation.

    Reusability: The services of the error handling component should be designed for reuse because it is a basic component

    useful for a number of transformations.

    Error Handling Mechanisms in B2B DT

    The common techniques that would help a B2B DT designer in designing an error handling strategy are summarized below.

    Debug: This method of error handling relies on the built-in debugging capabilities of B2B DT for the most basic errors. Debugging a B2B DT parser/serializer can be done in multiple ways:

    * Highlight the selection of an element on the example source file.

    * Use a Writeval component along with disabling automatic output in the project property.

    * Use the disable and enable feature for each of the components.

    * Run the parser/serializer and browse the event log for any failures.

    All the debug components should be removed before deploying the service in production.

    Schema Modification: This method of error handling demonstrates one of the ways to communicate the erroneous record

    once it is identified. The erroneous data can be captured at different levels (e.g., at field level or at record level). The XML

    schema methodology adds XML elements to the schema structure to hold the error data and error message. This allows the developer to validate each element against the business rules; if any element or record does not conform to the rules, that data and a corresponding error message can be stored in the XML structure.
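    The following minimal sketch, in Python, illustrates the idea: a record is validated against a business rule and, if it does not conform, the bad value and a message are stored in added XML elements. The element names (Amount, ErrorData, ErrorMessage) and the rule itself are illustrative assumptions, not part of any shipped schema.

        # Minimal sketch of the schema-modification approach: store the erroneous value
        # and a corresponding error message in extra XML elements on the record.
        import xml.etree.ElementTree as ET

        def validate_amount(record: ET.Element) -> None:
            """Flag a record whose Amount element is not a non-negative number."""
            amount = record.findtext("Amount", default="")
            try:
                valid = float(amount) >= 0
            except ValueError:
                valid = False
            if not valid:
                ET.SubElement(record, "ErrorData").text = amount
                ET.SubElement(record, "ErrorMessage").text = "Amount is not a non-negative number"

        record = ET.fromstring("<Record><Amount>-12.50</Amount></Record>")
        validate_amount(record)
        print(ET.tostring(record, encoding="unicode"))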

    Error Data in a Different File: This methodology stores the erroneous records or elements in a file separate from the output data stream. The method is useful when a business-critical timeline for the data processing cannot be compromised for a few erroneous records: processing of the correct records can continue while the erroneous records are inspected and corrected as a separate stream. In this methodology the business validations are applied to each element against the specified rules, and any element or record that fails to conform is directed to a predefined error file. The path to this file is generally passed in the output file for further investigation, or the file resides at a static path from which a script sends the error files to operations for correction.
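    A minimal sketch of this approach in Python is shown below, assuming newline-delimited input records; the five-field validation rule and the file naming are illustrative assumptions only.

        # Minimal sketch: write records that pass validation to the output stream and
        # direct the rest to a predefined error file for later inspection and correction.
        def split_records(input_path: str, output_path: str, error_path: str) -> None:
            with open(input_path) as source, \
                 open(output_path, "w") as good, \
                 open(error_path, "w") as bad:
                for line in source:
                    record = line.rstrip("\n")
                    # Illustrative rule: a valid record has exactly five pipe-delimited fields.
                    if len(record.split("|")) == 5:
                        good.write(record + "\n")
                    else:
                        bad.write(record + "\n")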

    Design Time Tools for Error Handling in B2B DT

    A failure is an event that prevents a component from processing data in the expected way. An anchor might fail if it searches for

    text that does not exist in the source document. A transformer or action might fail if its input is empty or has an inappropriate

    data type.

    A failure can be a perfectly normal occurrence. For example, a source document might contain an optional date. A parser

    contains a Content anchor that processes the date, if it exists. If the date does not exist in a particular source document, the

    Content anchor fails. By configuring the transformation appropriately, you can control the result of a failure. In the example, you

    might configure the parser to ignore the missing data and continue processing.

    B2B Data Transformation offers various mechanisms for error handling during design time:

    B2B DT event log - This is a B2B DT specific event generation mechanism where each event corresponds to an action taken by a transformation such as recognizing a particular lexical sequence. It is useful in the troubleshooting of work in progress, but event files can grow very large, hence it is not recommended for production systems. It is distinct from the event system offered by other B2B DT products and from the OS based event system. Custom events can be generated within transformation scripts. Event based failures are reported as exceptions or other errors in the calling environment.

    B2B DT Trace files - Trace files are controlled by the B2B DT configuration application. Automated strategies may be applied for the recycling of trace files.

    Custom error information - At the simplest levels, custom errors can be generated as B2B DT events (using the AddEventAction). However if the event mechanism is disabled for memory or performance reasons, these are omitted. Other alternatives include generation of custom error files, integration with OS event tracking mechanisms and integration with 3rd party management platform software. Integration with OS eventing or 3rd party platform software requires custom extensions to B2B DT.

    The event log is the main troubleshooting tool in B2B DT solutions. It captures all of the details in an event log file when an

    error occurs in the system. These files can be generated when testing in a development studio environment or running a

    service engine. These files reside in the CM_Reports directory specified in the CM_Config file under the installation directory of

    B2B DT. In the studio environment the default location is Results/events.cme in the project folder. The error messages

    appearing in the event log file are either system generated or user-defined (the latter added with an AddEventAction). The AddEventAction enables the developer to pass a user-defined error message to the event log file when a specific error condition occurs.

    Overall the B2B DT event mechanism is the simplest to implement. But for large or high volume production systems, the event

    mechanism can create very large event files, and it offers no integration with popular enterprise software administration

    platforms. Informatica recommends using B2B DT Events for troubleshooting purposes during development only.

    In some cases, performance constraints may determine the error handling strategy. For example, updating an external event system may cause performance bottlenecks, and producing a formatted error report can be time-consuming. In some cases

    operator interaction may be required that could potentially block a B2B DT transformation from completing.

    Finally, it is worth looking at whether some part of the error handling can be offloaded outside of B2B DT to avoid performance

    bottlenecks.

    When using custom error schemes, consider the following:

    * Multiple invocations of the same transformation may execute in parallel

    * Don't hardwire error file paths

    * Don't assume a single error output file

    Avoid the use of the B2B DT event log for production systems (especially when processing Excel files).
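    In line with the considerations above, a custom error scheme can derive a unique error file path for each invocation instead of hardwiring one, so that parallel runs of the same transformation do not collide. A minimal sketch in Python, with the directory layout as an assumption:

        # Minimal sketch: build a unique error file path per invocation so parallel runs
        # of the same service never write to the same file. The base directory would
        # normally come from configuration; here it is passed in as a parameter.
        import os
        import uuid
        from datetime import datetime, timezone

        def error_file_path(base_dir: str, service_name: str) -> str:
            """Return a unique error file path for one invocation of a service."""
            stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
            unique = uuid.uuid4().hex[:8]
            target_dir = os.path.join(base_dir, service_name)
            os.makedirs(target_dir, exist_ok=True)
            return os.path.join(target_dir, f"errors_{stamp}_{unique}.txt")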

    The trace files capture the state of the system along with the process ID and failure messages, and report each error with a time stamp. They capture details about the system in different category areas, including file system, environment, networking, etc., and record the process ID and thread ID that were executing, which helps in identifying system-level errors (if any). The name of the trace file can be modified in the Configuration wizard. The

    maximum size of the trace file can also be limited in the CMConfiguration editor.

    If the Data Transformation Engine runs under multiple user accounts, the user logs may overwrite each other, or it may be

    difficult to identify the logs belonging to a particular user. Prevent this by configuring users with different log locations.

    In addition to the logs of service events, there is an Engine initialization event log. This log records problems that occur when

    the Data Transformation Engine starts without reference to any service or input data. View this log to diagnose installation

    problems such as missing environment variables.

    The initialization log is located in the CMReports\Init directory.

    New Error Handling Features in B2B DT 8.x

    Using the Optional Property to Handle Failures

    If the optional property of a component is not selected, a failure of the component causes its parent to fail. If the parent is also

    non-optional, its own parent fails, and so forth. For example, suppose that a Parser contains a Group, and the Group contains a


    Marker. All the components are non-optional. If the Marker does not exist in the source document, the

    Marker fails. This causes the Group to fail, which in turn causes the Parser to fail.

    If the optional property of a component is selected, a failure of the component does not bubble up to the parent. For example,

    suppose that a Parser contains a Group, and the Group contains a Marker. In this example, suppose that the Group is optional.

    The failed Marker causes the Group to fail, but the Parser does not fail.

    Note however that certain components lack the optional property because the components never fail, regardless of their input.

    An example is the Sort action. If the Sort action finds no data to sort, it simply does nothing. It does not report a failure.

    Rollback

    If a component fails, its effects are rolled back. For example, suppose that a Group contains three non-optional Content

    anchors that store values in data holders. If the third Content anchor fails, the Group fails. Data Transformation rolls back the

    effects of the first two Content anchors. The data that the first two Content anchors already stored in data holders is removed.

    The rollback applies only to the main effects of a transformation such as a parser storing values in data holders or a serializer

    writing to its output. The rollback does not apply to side effects. In the above example, if the Group contains an ODBCAction

    that performs an INSERT query on a database, the record that the action added to the database is not deleted.

    Writing a Failure Message to the User Log

    A component can be configured to output failure events to a user-defined log. For example, if an anchor fails to find text in the

    source document, it can write a message in the user log. This can occur even if the anchor is defined as optional so that the

    failure does not terminate the transformation processing.

    The user log can contain the following types of information:

    * Failure level: Information, Warning, or Error

    * Name of the component that failed

    * Failure description

    * Location of the failed component in the IntelliScript

    * Additional information about the transformation status (such as the values of data holders)

    CustomLog

    The CustomLog component can be used as the value of the on_fail property. In the event of a failure, the CustomLog

    component runs a serializer that prepares a log message. The system writes the message to a specified location.

    run_serializer - A serializer that prepares the log message.

    output - The output location. The options include:
    * MSMQOutput. Writes to an MSMQ queue.
    * OutputDataHolder. Writes to a data holder.
    * OutputFile. Writes to a file.
    * ResultFile. Writes to the default results file of the transformation.
    * OutputCOM. Uses a custom COM component to output the data.
    Additional choices:
    * OutputPort. The name of an AdditionalOutputPort where the data is written.
    * StandardErrorLog. Writes to the user log.

    Error Handling in B2B DT with PowerCenter Integration

    In a B2B DT solution scenario, both expected and unexpected errors can occur, whether caused by a production issue or a

    flaw in the design. If the right error handling processes are not in place, then when an error occurs the processing aborts with a description of the error in the log (event file). This can also result in data loss if the erroneous records are not captured and reported correctly, and it causes the calling program to fail. For example, if the B2B Data Transformation service is used through PowerCenter UDO/B2B DT, then an error causes the PowerCenter session to fail.


    This section focuses on how to orchestrate PowerCenter and B2B DT if the B2B DT services are being

    called from a PowerCenter mapping. Below are the most common ways of orchestrating the error trapping and error handling

    mechanism.

    1. Use PowerCenter's Robustness and Reporting Function: In general, the PowerCenter engine is robust and powerful enough to handle complex error scenarios. Thus the usual practice is to perform any business

    validation or valid values comparison in PowerCenter. This enables error records to be directed to the already established

    Bad Files or Reject Tables in PowerCenter. This feature also allows the repository to store information about the number

    of records loaded and the number of records rejected and thus aids in easier reporting of errors.

    2. Output the Error in an XML Tag: When complex parsing validations are involved, B2B DT is more powerful than

    PowerCenter in handling them (e.g., String function and regular expression). In these scenarios the validations are

    performed in the B2B DT engine and the schema is redesigned to capture the error information in the associated tag of

    the XML. When this XML is parsed in a PowerCenter mapping, the error tags are directed to custom-built error reporting tables from which the errors can be reported. The design of the custom-built error tables will depend on the design of the error handling XML schema. Generally these tables correspond one-to-one with the XML structure, with a few additional metadata fields such as processing date, source system, etc.

    3. Output to the PowerCenter Log Files: If an unexpected error occurs in the B2B DT processing, then the error descriptions and details are stored in the log file directory specified in CMconfig.xml. The path to the file and the fatal errors are reported to the PowerCenter log so that the operators can quickly detect problems. This unexpected error handling can be exploited, with care, for user-defined errors in the B2B DT transformation by adding the AddEventAction and marking the error type as Failure.

    Best Practices for Handling Errors in Production

    In a production environment the turnaround time of the processes should be as short as possible, and the processes should be as automated as possible. Using B2B DT integration with PowerCenter, these requirements can be met seamlessly without intervention from IT professionals for error reporting, the correction of the data file, and the reprocessing of data.

    Example Scenario 1 - HIPAA Error Reporting


    Example Scenario 2 - Emailing Error Files to Operator

    Below is a case study for an implementation at a major financial client. The solution was implemented with total automation for

    the sequence of error trapping, error reporting, correction and reprocessing of data. The high level solution steps are:

    * Analyst receives loan tape via Email from a dealer

    * Analyst saves the file to a file system on a designated file share

    * A J2EE server monitors the file share for new files and pushes them to PowerCenter

    * PowerCenter invokes B2B DT to process (passing XML data fragment, supplying path to loan tape file and other

    parameters)

    * Upon a successful outcome, PowerCenter saves the data to the target database

    * PowerCenter notifies the Analyst via Email

    * On failure, PowerCenter Emails the XLS error file containing the original data and errors


  • Error Handling Strategies - Data Warehousing

    Challenge

    A key requirement for any successful data warehouse or data integration project is that it attains

    credibility within the user community. At the same time, it is imperative that the warehouse be as

    up-to-date as possible since the more recent the information derived from it is, the more relevant it is to

    the business operations of the organization, thereby providing the best opportunity to gain an advantage

    over the competition.

    Transactional systems can manage to function even with a certain amount of error since the impact of an individual transaction

    (in error) has a limited effect on the business figures as a whole, and corrections can be applied to erroneous data after the

    event (i.e., after the error has been identified). In data warehouse systems, however, any systematic error (e.g., for a particular

    load instance) not only affects a larger number of data items, but may potentially distort key reporting metrics. Such data

    cannot be left in the warehouse "until someone notices" because business decisions may be driven by such information.

    Therefore, it is important to proactively manage errors, identifying them before, or as, they occur. If errors occur, it is equally

    important either to prevent them from getting to the warehouse at all, or to remove them from the warehouse immediately (i.e.,

    before the business tries to use the information in error).

    The types of error to consider include:

    * Source data structures

    * Sources presented out-of-sequence

    * Old sources represented in error

    * Incomplete source files

    * Data-type errors for individual fields

    * Unrealistic values (e.g., impossible dates)

    * Business rule breaches

    * Missing mandatory data

    * O/S errors

    * RDBMS errors

    These cover both high-level (i.e., related to the process or a load as a whole) and low-level (i.e., field or column-related errors)

    concerns.

    Description

    In an ideal world, when an analysis is complete, you have a precise definition of source and target data; you can be sure that

    every source element was populated correctly, with meaningful values, never missing a value, and fulfilling all relational

    constraints. At the same time, source data sets always have a fixed structure, are always available on time (and in the correct

    order), and are never corrupted during transfer to the data warehouse. In addition, the OS and RDBMS never run out of

    resources, or have permissions and privileges change.

    Realistically, however, the operational applications are rarely able to cope with every possible business scenario or

    combination of events; operational systems crash, networks fall over, and users may not use the transactional systems in quite

    the way they were designed. The operational systems also typically need some flexibility to allow non-fixed data to be stored

    (typically as free-text comments). In every case, there is a risk that the source data does not match what the data warehouse

    expects.

    Because of the credibility issue, in-error data must not be propagated to the metrics and measures used by the business

    managers. If erroneous data does reach the warehouse, it must be identified and removed immediately (before the current

    version of the warehouse can be published). Preferably, error data should be identified during the load process and prevented

    from reaching the warehouse at all. Ideally, erroneous source data should be identified before a load even begins, so that no

    resources are wasted trying to load it.

    As a principle, data errors should be corrected at the source. As soon as any attempt is made to correct errors within the


    warehouse, there is a risk that the lineage and provenance of the data will be lost. From that point on, it

    becomes impossible to guarantee that a metric or data item came from a specific source via a specific chain of processes. As a

    by-product, adopting this principle also helps to tie both the end-users and those responsible for the source data into the

    warehouse process; source data staff understand that their professionalism directly affects the quality of the reports, and

    end-users become owners of their data.

    As a final consideration, error management (the implementation of an error handling strategy) complements and overlaps load

    management, data quality and key management, and operational processes and procedures.

    Load management processes record at a high level whether a load is unsuccessful; error management records the details of why the

    failure occurred.

    Quality management defines the criteria whereby data can be identified as in error; and error management identifies the

    specific error(s), thereby allowing the source data to be corrected.

    Operational reporting shows a picture of loads over time, and error management allows analysis to identify systematic errors,

    perhaps indicating a failure in operational procedure.

    Error management must therefore be tightly integrated within the data warehouse load process. This is shown in the high level

    flow chart below:


    Error Management

    Considerations

    High-Level Issues

    From previous discussion of load management, a number of checks can be performed before any attempt is made to load a source data set. Without load management in place, it is unlikely that the warehouse process will be robust enough to satisfy any end-user requirements, and error correction processing becomes moot (in so far as nearly all maintenance and development resources will be working full time to manually correct bad data in the warehouse). The following assumes that you have implemented load management processes similar to Informatica's best practices.

    * Process Dependency checks in the load management can identify when a source data set is missing, duplicates a previous version, or has been presented out of sequence, and where the previous load failed but has not yet been corrected.

    * Load management prevents this source data from being loaded. At the same time, error management processes should record the details of the failed load, noting the source instance, the load affected, and when and why the load was aborted.

    * Source file structures can be compared to expected structures stored as metadata, either from header information or by attempting to read the first data row.

    * Source table structures can be compared to expectations;

    typically this can be done by interrogating the RDBMS catalogue directly (and comparing to the expected structure held in


    metadata), or by simply running a describe command against the table (again comparing to a

    pre-stored version in metadata).

    * Control file totals (for file sources) and row number counts (table sources) are also used to determine if files have been

    corrupted or truncated during transfer, or if tables have no new data in them (suggesting a fault in an operational

    application).

    * In every case, information should be recorded to identify where and when an error occurred, what sort of error it was, and

    any other relevant process-level details.
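    As a minimal sketch of the structure comparison described above, the following Python function interrogates the RDBMS catalogue and compares the result to the expected structure held in metadata. It assumes an RDBMS that exposes the standard INFORMATION_SCHEMA views and a DB-API connection; the expected-structure dictionary stands in for whatever metadata store the project uses.

        # Minimal sketch: compare a source table's actual columns and data types with
        # the expected structure held in metadata, and report any differences.
        from typing import Any, Dict, List

        def structure_differences(conn: Any, table_name: str, expected: Dict[str, str]) -> List[str]:
            """Return a list of differences between the catalogue and the expected structure."""
            cursor = conn.cursor()
            cursor.execute(
                "SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS "
                "WHERE TABLE_NAME = ?",
                (table_name,),
            )
            actual = {name.upper(): dtype.upper() for name, dtype in cursor.fetchall()}
            wanted = {name.upper(): dtype.upper() for name, dtype in expected.items()}
            problems = []
            for column, dtype in wanted.items():
                if column not in actual:
                    problems.append(f"missing column {column}")
                elif actual[column] != dtype:
                    problems.append(f"{column} is {actual[column]}, expected {dtype}")
            for column in actual.keys() - wanted.keys():
                problems.append(f"unexpected column {column}")
            return problems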

    Low-Level Issues

    Assuming that the load is to be processed normally (i.e., that the high-level checks have not caused the load to abort), further

    error management processes need to be applied to the individual source rows and fields.

    * Individual source fields can be compared to expected data-types against standard metadata within the repository, or

    additional information added by the development. In some instances, this is enough to abort the rest of the load; if the field

    structure is incorrect, it is much more likely that the source data set as a whole either cannot be processed at all or (more

    worryingly) is likely to be processed unpredictably.

    * Data conversion errors can be identified on a field-by-field basis within the body of a mapping. Built-in error handling can

    be used to spot failed date conversions, conversions of string to numbers, or missing required data. In rare cases, stored

    procedures can be called if a specific conversion fails; however this cannot be generally recommended because of the

    potentially crushing impact on performance if a particularly error-filled load occurs.

    * Business rule breaches can then be picked up. It is possible to define allowable values, or acceptable value ranges within

    PowerCenter mappings (if the rules are few, and it is clear from the mapping metadata that the business rules are included

    in the mapping itself). A more flexible approach is to use external tables to codify the business rules. In this way, only the

    rules tables need to be amended if a new business rule needs to be applied. Informatica has suggested methods to

    implement such a process.

    * Missing Key/Unknown Key issues have already been defined in their own best practice document Key Management

    in Data Warehousing Solutions with suggested management techniques for identifying and handling them. However,

    from an error handling perspective, such errors must still be identified and recorded, even when key management

    techniques do not formally fail source rows with key errors. Unless a record is kept of the frequency with which particular

    source data fails, it is difficult to realize when there is a systematic problem in the source systems.

    * Inter-row errors may also have to be considered. These may occur when a business process expects a certain hierarchy of

    events (e.g., a customer query, followed by a booking request, followed by a confirmation, followed by a payment). If the

    events arrive from the source system in the wrong order, or where key events are missing, it may indicate a major problem

    with the source system, or the way in which the source system is being used.

    * An important principle to follow is to try to identify all of the errors on a particular row before halting processing, rather

    than rejecting the row at the first instance. This seems to break the rule of not wasting resources trying to load a source

    data set if we already know it is in error; however, since the row needs to be corrected at source, then reprocessed

    subsequently, it is sensible to identify all the corrections that need to be made before reloading, rather than fixing the

    first, re-running, and then identifying a second error (which halts the load for a second time).
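    A minimal sketch of this collect-all-errors principle is shown below in Python: every check is applied to the row and all failures are reported together, instead of rejecting the row at the first problem. The field names and rules are illustrative assumptions.

        # Minimal sketch: run every validation on the row and return the full list of
        # errors, so the source system owner can fix them all before the row is reloaded.
        from datetime import datetime
        from typing import Dict, List

        def validate_row(row: Dict[str, str], valid_statuses: set) -> List[str]:
            """Return every error found on the row (an empty list means the row is clean)."""
            errors: List[str] = []
            if not row.get("CUSTOMER_ID"):
                errors.append("CUSTOMER_ID is missing")            # missing mandatory data
            try:
                datetime.strptime(row.get("ORDER_DATE", ""), "%Y-%m-%d")
            except ValueError:
                errors.append("ORDER_DATE is not a valid date")    # data conversion error
            if row.get("STATUS") not in valid_statuses:
                errors.append("STATUS breaches the business rule")  # rule held in an external table
            return errors

        print(validate_row({"ORDER_DATE": "2013-02-30", "STATUS": "XX"}, {"OPEN", "CLOSED"}))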

    OS and RDBMS Issues

    Since best practice means that referential integrity (RI) issues are proactively managed within the loads, instances where the

    RDBMS rejects data for referential reasons should be very rare (i.e., the load should already have identified that reference

    information is missing).

    However, there is little that can be done to identify the more generic RDBMS problems that are likely to occur; changes to

    schema permissions, running out of temporary disk space, dropping of tables and schemas, invalid indexes, no further table

    space extents available, missing partitions and the like.

    Similarly, interaction with the OS means that changes in directory structures, file permissions, disk space, command syntax,

    and authentication may occur outside of the data warehouse. Often such changes are driven by Systems Administrators who,

    from an operational perspective, are not aware that there is likely to be an impact on the data warehouse, or are not aware that

    the data warehouse managers need to be kept up to speed.

    In both of the instances above, the nature of the errors may be such that not only will they cause a load to fail, but it may be


    impossible to record the nature of the error at that point in time. For example, if RDBMS user ids are

    revoked, it may be impossible to write a row to an error table if the error process depends on the revoked id; if disk space runs

    out during a write to a target table, this may affect all other tables (including the error tables); if file permissions on a UNIX host

    are amended, bad files themselves (or even the log files) may not be accessible.

    Most of these types of issues can be managed by a proper load management process, however. Since setting the status of a

    load to complete should be absolutely the last step in a given process, any failure before, or including, that point leaves the

    load in an incomplete state. Subsequent runs should note this, and enforce correction of the last load before beginning the

    new one.

    The best practice to manage such OS and RDBMS errors is, therefore, to ensure that the Operational Administrators and

    DBAs have proper and working communication with the data warehouse management to allow proactive control of changes.

    Administrators and DBAs should also be available to the data warehouse operators to rapidly explain and resolve such errors if

    they occur.

    Auto-Correction vs. Manual Correction

    Load management and key management best practices (Key Management in Data Warehousing Solutions) have already

    defined auto-correcting processes; the former to allow loads themselves to launch, rollback, and reload without manual

    intervention, and the latter to allow RI errors to be managed so that the quantitative quality of the warehouse data is preserved,

    and incorrect key values are corrected as soon as the source system provides the missing data.

    We cannot conclude from these two specific techniques, however, that the warehouse should attempt to change source data

    as a general principle. Even if this were possible (which is debatable), such functionality would mean that the absolute link

    between the source data and its eventual incorporation into the data warehouse would be lost. As soon as one of the

    warehouse metrics was identified as incorrect, unpicking the error would be impossible, potentially requiring a whole section of

    the warehouse to be reloaded entirely from scratch.

    In addition, such automatic correction of data might hide the fact that one or other of the source systems had a generic fault, or

    more importantly, had acquired a fault because of on-going development of the transactional applications, or a failure in user

    training.

    The principle to apply here is to identify the errors in the load, and then alert the source system users that data should be

    corrected in the source system itself, ready for the next load to pick up the right data. This maintains the data lineage, allows

    source system errors to be identified and ameliorated in good time, and permits extra training needs to be identified and

    managed.

    Error Management Techniques

    Simple Error Handling Structure

    The following data structure is an example of the error metadata that should be captured as a minimum within the error

    handling strategy.


    The example defines three main sets of information:

    * The ERROR_DEFINITION table, which stores descriptions for the various types of errors, including:

    - process-level (e.g., incorrect source file, load started out-of-sequence)

    - row-level (e.g., missing foreign key, incorrect data-type, conversion errors) and

    - reconciliation (e.g., incorrect row numbers, incorrect file total etc.).

    * The ERROR_HEADER table provides a high-level view on the process, allowing a quick identification of the frequency of

    error for particular loads and of the distribution of error types. It is linked to the load management processes via the

    SRC_INST_ID and PROC_INST_ID, from which other process-level information can be gathered.

    * The ERROR_DETAIL table stores information about actual rows with errors, including how to identify the specific row that

    was in error (using the source natural keys and row number) together with a string of field identifier/value pairs

    concatenated together. It is not expected that this information will be deconstructed as part of an automatic correction

    load, but if necessary this can be pivoted (e.g., using simple UNIX scripts) to separate out the field/value pairs for

    subsequent reporting.
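    As an illustration only, the following Python sketch creates a simplified version of these three tables in an in-memory SQLite database. Only the elements named above (the error type descriptions, SRC_INST_ID, PROC_INST_ID, the source natural key and row number, and the concatenated field/value pairs) come from the example; the remaining column names and data types are assumptions.

        # Minimal sketch of the ERROR_DEFINITION / ERROR_HEADER / ERROR_DETAIL structure,
        # using SQLite purely for illustration.
        import sqlite3

        DDL = """
        CREATE TABLE ERROR_DEFINITION (
            ERROR_TYPE_ID   INTEGER PRIMARY KEY,
            ERROR_CATEGORY  TEXT,      -- process-level, row-level, or reconciliation
            ERROR_DESC      TEXT
        );
        CREATE TABLE ERROR_HEADER (
            ERROR_HEADER_ID INTEGER PRIMARY KEY,
            SRC_INST_ID     INTEGER,   -- link to load management: source instance
            PROC_INST_ID    INTEGER,   -- link to load management: process instance
            ERROR_TYPE_ID   INTEGER REFERENCES ERROR_DEFINITION (ERROR_TYPE_ID),
            ERROR_DATE      TEXT
        );
        CREATE TABLE ERROR_DETAIL (
            ERROR_HEADER_ID INTEGER REFERENCES ERROR_HEADER (ERROR_HEADER_ID),
            SOURCE_KEY      TEXT,      -- natural key identifying the source row in error
            SOURCE_ROW_NUM  INTEGER,
            FIELD_VALUES    TEXT       -- concatenated field identifier/value pairs
        );
        """

        connection = sqlite3.connect(":memory:")
        connection.executescript(DDL)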


  • Error Handling Strategies - General

    Challenge

    The challenge is to accurately and efficiently load data into the target data architecture. This Best

    Practice describes various loading scenarios, the use of data profiles, an alternate method for identifying

    data errors, methods for handling data errors, and alternatives for addressing the most common types of

    problems. For the most part, these strategies are relevant whether your data integration project is

    loading an operational data structure (as with data migrations, consolidations, or loading various sorts of

    operational data stores) or loading a data warehousing structure.

    Description

    Regardless of target data structure, your loading process must validate that the data conforms to known rules of the business.

    When the source system data does not meet these rules, the process needs to handle the exceptions in an appropriate

    manner. The business needs to be aware of the consequences of either permitting invalid data to enter the target or rejecting it

    until it is fixed. Both approaches present complex issues. The business must decide what is acceptable and prioritize two

    conflicting goals:

    * The need for accurate information.

    * The ability to analyze or process the most complete information available with the understanding that errors can exist.

    Data Integration Process Validation

    In general, there are three methods for handling data errors detected in the loading process:

    * Reject All. This is the simplest to implement since all errors are rejected from entering the target when they are

    detected. This provides a very reliable target that the users can count on as being correct, although it may not be complete.

    Both dimensional and factual data can be rejected when any errors are encountered. Reports indicate what the errors are

    and how they affect the completeness of the data.

    Dimensional or Master Data errors can cause valid factual data to be rejected because a foreign key relationship cannot be

    created. These errors need to be fixed in the source systems and reloaded on a subsequent load. Once the corrected rows

    have been loaded, the factual data will be reprocessed and loaded, assuming that all errors have been fixed. This delay may

    cause some user dissatisfaction since the users need to take into account that the data they are looking at may not be a

    complete picture of the operational systems until the errors are fixed. For an operational system, this delay may affect

    downstream transactions.

    The development effort required to fix a Reject All scenario is minimal, since the rejected data can be processed through

    existing mappings once it has been fixed. Minimal additional code may need to be written since the data will only enter the

    target if it is correct, and it would then be loaded into the data mart using the normal process.

    * Reject None. This approach gives users a complete picture of the available data without having to consider data that

    was not available due to it being rejected during the load process. The problem is that the data may not be complete or

    accurate. All of the target data structures may contain incorrect information that can lead to incorrect decisions or faulty

    transactions.

    With Reject None, the complete set of data is loaded, but the data may not support correct transactions or aggregations.

    Factual data can be allocated to dummy or incorrect dimension rows, resulting in grand total numbers that are correct, but

    incorrect detail numbers. After the data is fixed, reports may change, with detail information being redistributed along different

    hierarchies.

    The development effort to fix this scenario is significant. After the errors are corrected, a new loading process needs to correct

    all of the target data structures, which can be a time-consuming effort based on the delay between an error being detected and

    fixed. The development strategy may include removing information from the target, restoring backup tapes for each night's load,

    and reprocessing the data. Once the target is fixed, these changes need to be propagated to all downstream data structures or

    data marts.


    * Reject Critical. This method provides a balance between missing information and incorrect information. It involves

    examining each row of data and determining the particular data elements to be rejected. All changes that are valid are

    processed into the target to allow for the most complete picture. Rejected elements are reported as errors so that they can

    be fixed in the source systems and loaded on a subsequent run of the ETL process.

    This approach requires categorizing the data in two ways: 1) as key elements or attributes, and 2) as inserts or updates.

    Key elements are required fields that maintain the data integrity of the target and allow for hierarchies to be summarized at

    various levels in the organization. Attributes provide additional descriptive information per key element.

    Inserts are important for dimensions or master data because subsequent factual data may rely on the existence of the

    dimension data row in order to load properly. Updates do not affect the data integrity as much because the factual data can

    usually be loaded with the existing dimensional data unless the update is to a key element.

    The development effort for this method is more extensive than Reject All since it involves classifying fields as critical or

    non-critical, and developing logic to update the target and flag the fields that are in error. The effort also incorporates some

    tasks from the Reject None approach, in that processes must be developed to fix incorrect data in the entire target data

    architecture.

    Informatica generally recommends using the Reject Critical strategy to maintain the accuracy of the target. By providing the

    most fine-grained analysis of errors, this method allows the greatest amount of valid data to enter the target on each run of the

    ETL process, while at the same time screening out the unverifiable data fields. However, business management needs to

    understand that some information may be held out of the target, and also that some of the information in the target data

    structures may be at least temporarily allocated to the wrong hierarchies.
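
To make the Reject Critical classification concrete, the following sketch (Python, with hypothetical field names and a stubbed validation rule; it is not part of any Informatica product) shows how a row might be routed once its fields are classified as key elements or attributes.

    # Illustrative sketch only: the field classifications and the validation
    # rule are hypothetical stand-ins for project-specific business rules.
    KEY_ELEMENTS = {"location_id", "product_id"}       # required for data integrity
    ATTRIBUTES = {"color", "address", "open_hours"}    # descriptive information only

    def validate(field, value):
        """Stub business rule: treat empty or missing values as invalid."""
        return value is not None and value != ""

    def classify_row(row):
        """Apply the Reject Critical strategy to a single source row."""
        bad_keys = [f for f in KEY_ELEMENTS if not validate(f, row.get(f))]
        bad_attrs = [f for f in ATTRIBUTES if not validate(f, row.get(f))]
        if bad_keys:
            # A key element failed: reject the whole row and report it.
            return "REJECT", bad_keys + bad_attrs
        if bad_attrs:
            # Only attributes failed: load the row and flag the fields in error.
            return "LOAD_WITH_ERRORS", bad_attrs
        return "LOAD", []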

    Handling Errors in Dimension Profiles

Profiles are tables used to track historical changes to the source data. As the source systems change, profile records are created

    with date stamps that indicate when the change took place. This allows power users to review the target data using either

    current (As-Is) or past (As-Was) views of the data.

    A profile record should occur for each change in the source data. Problems occur when two fields change in the source system

    and one of those fields results in an error. The first value passes validation, which produces a new profile record, while the

    second value is rejected and is not included in the new profile. When this error is fixed, it would be desirable to update the

    existing profile rather than creating a new one, but the logic needed to perform this UPDATE instead of an INSERT is

    complicated. If a third field is changed in the source before the error is fixed, the correction process is complicated further.

    The following example represents three field values in a source system. The first row on 1/1/2000 shows the original values. On

    1/5/2000, Field 1 changes from Closed to Open, and Field 2 changes from Black to BRed, which is invalid. On 1/10/2000, Field

    3 changes from Open 9-5 to Open 24hrs, but Field 2 is still invalid. On 1/15/2000, Field 2 is finally fixed to Red.

Date        Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    Closed Sunday    Black            Open 9-5
1/5/2000    Open Sunday      BRed             Open 9-5
1/10/2000   Open Sunday      BRed             Open 24hrs
1/15/2000   Open Sunday      Red              Open 24hrs

    Three methods exist for handling the creation and update of profiles:

    1. The first method produces a new profile record each time a change is detected in the source. If a field value was invalid,

    then the original field value is maintained.

Date        Profile Date   Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    1/1/2000       Closed Sunday    Black            Open 9-5
1/5/2000    1/5/2000       Open Sunday      Black            Open 9-5
1/10/2000   1/10/2000      Open Sunday      Black            Open 24hrs
1/15/2000   1/15/2000      Open Sunday      Red              Open 24hrs

By applying all corrections as new profiles, this method simplifies the process: every change in the source system is applied directly to the target. Each change, regardless of whether it is a fix to a previous error, is applied as a new change that creates

    a new profile. This incorrectly shows in the target that two changes occurred to the source information when, in reality, a

    mistake was entered on the first change and should be reflected in the first profile. The second profile should not have been


created.

2. The second method updates the first profile created on 1/5/2000 until all fields are corrected on 1/15/2000, which loses the profile record for the change to Field 3.

    If we try to apply changes to the existing profile, as in this method, we run the risk of losing profile information. If the third field

    changes before the second field is fixed, we show the third field changed at the same time as the first. When the second field

    was fixed, it would also be added to the existing profile, which incorrectly reflects the changes in the source system.

3. The third method creates only two new profiles, but then causes an update to the profile records on 1/15/2000 to fix the Field 2 value in both.

Date        Profile Date         Field 1 Value    Field 2 Value    Field 3 Value
1/1/2000    1/1/2000             Closed Sunday    Black            Open 9-5
1/5/2000    1/5/2000             Open Sunday      Black            Open 9-5
1/10/2000   1/10/2000            Open Sunday      Black            Open 24hrs
1/15/2000   1/5/2000 (Update)    Open Sunday      Red              Open 9-5
1/15/2000   1/10/2000 (Update)   Open Sunday      Red              Open 24hrs

If we try to implement a method that updates old profiles when errors are fixed, as in this option, we need to create complex algorithms that handle the process correctly. It involves determining when an error occurred, examining all profiles generated since then, and updating them appropriately. And even if we create the algorithms to handle these methods, we still face the issue of determining whether a value is a correction or a new value. If an error is never fixed in the source system, but a new value is entered, we would identify the new value as a fix to the previous error, causing an automated process to update old profile records when, in reality, a new profile record should have been created.

    Recommended Method

A method exists to track old errors so that we know when a value was rejected. Then, when the process encounters a new, correct value, the load strategy flags it as a potential fix that should be applied to old Profile records. In this way, the

    corrected data enters the target as a new Profile record, but the process of fixing old Profile records, and potentially deleting

    the newly inserted record, is delayed until the data is examined and an action is decided. Once an action is decided, another

    process examines the existing Profile records and corrects them as necessary. This method only delays the As-Was analysis

    of the data until the correction method is determined because the current information is reflected in the new Profile.
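
A minimal sketch of this recommended flow is shown below (Python; the error_log and profiles structures and the potential_fix flag are assumptions used only to illustrate the idea).

    # Sketch of the recommended profile-correction flow. Table and column
    # names (error_log, profiles, potential_fix) are hypothetical.
    def process_change(key, field, new_value, load_date, error_log, profiles):
        """Insert the change as a new Profile; if the field was previously
        rejected for this key, flag the change as a potential fix so that an
        analyst can decide whether older Profile records should be corrected."""
        was_rejected = any(e["key"] == key and e["field"] == field for e in error_log)
        profiles.append({
            "key": key,
            "profile_date": load_date,
            field: new_value,
            # Defer the As-Was correction: mark the record for review instead
            # of automatically updating or deleting older Profile records.
            "potential_fix": was_rejected,
        })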

    Data Quality Edits

    Quality indicators can be used to record definitive statements regarding the quality of the data received and stored in the target.

The indicators can be appended to existing data tables or stored in a separate table linked by the primary key. Quality indicators

    can be used to:

    * Show the record and field level quality associated with a given record at the time of extract.

    * Identify data sources and errors encountered in specific records.

    * Support the resolution of specific record error types via an update and resubmission process.

Quality indicators can be used to record several types of errors, e.g., fatal errors (missing primary key value), missing data in a

    required field, wrong data type/format, or invalid data value. If a record contains even one error, data quality (DQ) fields will be

    appended to the end of the record, one field for every field in the record. A data quality indicator code is included in the DQ

    fields corresponding to the original fields in the record where the errors were encountered. Records containing a fatal error are

stored in a Rejected Record Table and associated with the original file name and record number. These records cannot be

    loaded to the target because they lack a primary key field to be used as a unique record identifier in the target.

    The following types of errors cannot be processed:

    * A source record does not contain a valid key. This record would be sent to a reject queue. Metadata will be saved and

    used to generate a notice to the sending system indicating that x number of invalid records were received and could not be

    processed. However, in the absence of a primary key, no tracking is possible to determine whether the invalid record has

    been replaced or not.


* The source file or record is illegible. The file or record would be sent to a reject queue. Metadata

    indicating that x number of invalid records were received and could not be processed may or may not be available for a

    general notice to be sent to the sending system. In this case, due to the nature of the error, no tracking is possible to

    determine whether the invalid record has been replaced or not. If the file or record is illegible, it is likely that individual

    unique records within the file are not identifiable. While information can be provided to the source system site indicating

    there are file errors for x number of records, specific problems may not be identifiable on a record-by-record basis.

    In these error types, the records can be processed, but they contain errors:

    * A required (non-key) field is missing.

    * The value in a numeric or date field is non-numeric.

    * The value in a field does not fall within the range of acceptable values identified for the field. Typically, a reference table is

    used for this validation.

    When an error is detected during ingest and cleansing, the identified error type is recorded.

    Quality Indicators (Quality Code Table)

    The requirement to validate virtually every data element received from the source data systems mandates the development,

    implementation, capture and maintenance of quality indicators. These are used to indicate the quality of incoming data at an

    elemental level. Aggregated and analyzed over time, these indicators provide the information necessary to identify acute data

    quality problems, systemic issues, business process problems and information technology breakdowns.

The quality indicators (0-No Error, 1-Fatal Error, 2-Missing Data from a Required Field, 3-Wrong Data Type/Format, 4-Invalid Data Value, and 5-Outdated Reference Table in Use) provide a concise indication of the quality of the data within specific fields for every data type. These indicators give operations staff, data quality analysts and users the opportunity to readily

    for every data type. These indicators provide the opportunity for operations staff, data quality analysts and users to readily

    identify issues potentially impacting the quality of the data. At the same time, these indicators provide the level of detail

    necessary for acute quality problems to be remedied in a timely manner.
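
As an illustration, the sketch below (Python) appends one DQ indicator per original field using the codes listed above; the example rules for cust_id and dob are assumptions, not prescribed validations.

    # Hypothetical illustration of field-level quality indicators using the
    # codes described above (0-No Error through 5-Outdated Reference Table in Use).
    NO_ERROR, FATAL, MISSING, BAD_FORMAT, INVALID_VALUE, OUTDATED_REF = range(6)

    def append_dq_fields(record, rules):
        """Return the record plus one DQ indicator field per original field."""
        dq = {}
        for field, value in record.items():
            rule = rules.get(field, lambda v: NO_ERROR)   # default: no check defined
            dq["DQ_" + field] = rule(value)
        return {**record, **dq}

    # Example rules (assumed, for illustration only).
    rules = {
        "cust_id": lambda v: FATAL if v in (None, "") else NO_ERROR,
        "dob": lambda v: MISSING if v in (None, "") else NO_ERROR,
    }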

    Handling Data Errors

    The need to periodically correct data in the target is inevitable. But how often should these corrections be performed?

    The correction process can be as simple as updating field information to reflect actual values, or as complex as deleting data

    from the target, restoring previous loads from tape, and then reloading the information correctly. Although we try to avoid

    performing a complete database restore and reload from a previous point in time, we cannot rule this out as a possible solution.

    Reject Tables vs. Source System

    As errors are encountered, they are written to a reject file so that business analysts can examine reports of the data and the

    related error messages indicating the causes of error. The business needs to decide whether analysts should be allowed to fix

    data in the reject tables, or whether data fixes will be restricted to source systems. If errors are fixed in the reject tables, the

    target will not be synchronized with the source systems. This can present credibility problems when trying to track the history of

    changes in the target data architecture. If all fixes occur in the source systems, then these fixes must be applied correctly to the

    target data.

    Attribute Errors and Default Values

    Attributes provide additional descriptive information about a dimension concept. Attributes include things like the color of a

    product or the address of a store. Attribute errors are typically things like an invalid color or inappropriate characters in the

    address. These types of errors do not generally affect the aggregated facts and statistics in the target data; the attributes are

most useful as qualifiers and filtering criteria for drilling into the data (e.g., to find specific patterns for market research).

Attribute errors can be fixed by waiting for the data to be corrected in the source system and reapplied to the target.

    When attribute errors are encountered for a new dimensional value, default values can be assigned to let the new record enter

the target. Some rules that have been proposed for handling defaults are as follows:

Value Types        Description                                         Default
Reference Values   Attributes that are foreign keys to other tables    Unknown
Small Value Sets   Y/N indicator fields                                No
Other              Any other type of attribute                         Null or business-provided value

    Reference tables are used to normalize the target model to prevent the duplication of data. When a source value does not

    translate into a reference table value, we use the Unknown value. (All reference tables contain a value of Unknown for this

    purpose.)

    The business should provide default values for each identified attribute. Fields that are restricted to a limited domain of values

(e.g., On/Off or Yes/No indicators) are referred to as small-value sets. When errors are encountered in translating these

    values, we use the value that represents off or No as the default. Other values, like numbers, are handled on a case-by-case

    basis. In many cases, the data integration process is set to populate Null into these fields, which means undefined in the target.

    After a source system value is corrected and passes validation, it is corrected in the target.
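
A minimal sketch of these default rules, assuming hypothetical attribute classifications, might look like the following (Python):

    # Sketch of the default-value rules in the table above; the attribute
    # classifications are assumptions for illustration only.
    REFERENCE_ATTRS = {"store_type"}       # foreign keys to reference tables
    SMALL_VALUE_SETS = {"active_flag"}     # Y/N style indicator fields

    def apply_default(field, value):
        """Substitute a default when a new dimension row fails attribute validation."""
        if value is not None:
            return value
        if field in REFERENCE_ATTRS:
            return "UNKNOWN"               # every reference table carries an Unknown row
        if field in SMALL_VALUE_SETS:
            return "N"                     # the 'off' / 'No' value
        return None                        # undefined in the target until the source is fixed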

    Primary Key Errors

    The business also needs to decide how to handle new dimensional values such as locations. Problems occur when the new

    key is actually an update to an old key in the source system. For example, a location number is assigned and the new location

    is transferred to the target using the normal process; then the location number is changed due to some source business rule

such as "all Warehouses should be in the 5000 range." The process assumes that the change in the primary key is actually a

    new warehouse and that the old warehouse was deleted. This type of error causes a separation of fact data, with some data

    being attributed to the old primary key and some to the new. An analyst would be unable to get a complete picture.

    Fixing this type of error involves integrating the two records in the target data, along with the related facts. Integrating the two

    rows involves combining the profile information, taking care to coordinate the effective dates of the profiles to sequence

    properly. If two profile records exist for the same day, then a manual decision is required as to which is correct. If facts were

    loaded using both primary keys, then the related fact rows must be added together and the originals deleted in order to correct

    the data.
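
One way to picture the fact-merge step is the following hedged sketch (Python with pandas; the column names location_key, date_key, sales_qty and sales_amt are assumptions about the fact table's grain):

    import pandas as pd

    def merge_duplicate_keys(facts: pd.DataFrame, old_key, surviving_key):
        """Re-point facts from the obsolete key to the surviving key, then add
        together rows that now share the same grain; the aggregation replaces
        the original rows, which is the combine-and-delete step described above."""
        facts = facts.copy()
        facts.loc[facts["location_key"] == old_key, "location_key"] = surviving_key
        return (facts.groupby(["location_key", "date_key"], as_index=False)
                     [["sales_qty", "sales_amt"]].sum())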

    The situation is more complicated when the opposite condition occurs (i.e., two primary keys mapped to the same target data

    ID really represent two different IDs). In this case, it is necessary to restore the source information for both dimensions and

    facts from the point in time at which the error was introduced, deleting affected records from the target and reloading from the

    restore to correct the errors.

    DM Facts Calculated from EDW Dimensions

    If information is captured as dimensional data from the source, but used as measures residing on the fact records in the target,

    we must decide how to handle the facts. From a data accuracy view, we would like to reject the fact until the value is corrected.

    If we load the facts with the incorrect data, the process to fix the target can be time consuming and difficult to implement.

    If we let the facts enter downstream target structures, we need to create processes that update them after the dimensional data

    is fixed. If we reject the facts when these types of errors are encountered, the fix process becomes simpler. After the errors are

    fixed, the affected rows can simply be loaded and applied to the target data.

    Fact Errors

    If there are no business rules that reject fact records except for relationship errors to dimensional data, then when we

    encounter errors that would cause a fact to be rejected, we save these rows to a reject table for reprocessing the following

    night. This nightly reprocessing continues until the data successfully enters the target data structures. Initial and periodic

    analyses should be performed on the errors to determine why they are not being loaded.
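
A simple sketch of the nightly reprocess loop might look like this (Python; the reject-row structure and the dimension_exists and load_fact_row helpers are hypothetical stand-ins for the real ETL objects):

    # Minimal sketch of nightly reprocessing of rejected fact rows.
    def reprocess_rejects(reject_rows, dimension_exists, load_fact_row):
        still_rejected = []
        for row in reject_rows:
            if dimension_exists(row["dim_key"]):
                load_fact_row(row)              # dimension now exists; the fact loads
            else:
                row["retry_count"] = row.get("retry_count", 0) + 1
                still_rejected.append(row)      # keep the row for the next night's run
        return still_rejected

Tracking a retry count, as in the sketch, supports the periodic analysis of rows that never manage to load.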

    Data Stewards

    Data Stewards are generally responsible for maintaining reference tables and translation tables, creating new entities in

    dimensional data, and designating one primary data source when multiple sources exist. Reference data and translation tables

    enable the target data architecture to maintain consistent descriptions across multiple source systems, regardless of how the

    source system stores the data. New entities in dimensional data include new locations, products, hierarchies, etc. Multiple


source data occurs when two source systems can contain different data for the same dimensional entity.

    Reference Tables

    The target data architecture may use reference tables to maintain consistent descriptions. Each table contains a short code

    value as a primary key and a long description for reporting purposes. A translation table is associated with each reference table

    to map the codes to the source system values. Using both of these tables, the ETL process can load data from the source

    systems into the target structures.

    The translation tables contain one or more rows for each source value and map the value to a matching row in the reference

    table. For example, the SOURCE column in FILE X on System X can contain O, S or W. The data steward would be

    responsible for entering in the translation table the following values:

Source Value   Code Translation
O              OFFICE
S              STORE
W              WAREHSE

    These values are used by the data integration process to correctly load the target. Other source systems that maintain a similar

    field may use a two-letter abbreviation like OF, ST and WH. The data steward would make the following entries into the

    translation table to maintain consistency across systems:

Source Value   Code Translation
OF             OFFICE
ST             STORE
WH             WAREHSE

    The data stewards are also responsible for maintaining the reference table that translates the codes into descriptions. The ETL

    process uses the reference table to populate the following values into the target:

Code Translation   Code Description
OFFICE             Office
STORE              Retail Store
WAREHSE            Distribution Warehouse
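
The two-step lookup can be pictured with the following sketch (Python; the dictionaries stand in for the physical translation and reference tables, and the source system identifiers SYSX and SYSY are hypothetical):

    # Values copied from the example tables above.
    TRANSLATION = {
        ("SYSX", "O"): "OFFICE", ("SYSX", "S"): "STORE", ("SYSX", "W"): "WAREHSE",
        ("SYSY", "OF"): "OFFICE", ("SYSY", "ST"): "STORE", ("SYSY", "WH"): "WAREHSE",
    }
    REFERENCE = {
        "OFFICE": "Office",
        "STORE": "Retail Store",
        "WAREHSE": "Distribution Warehouse",
    }

    def resolve(source_system, source_value):
        """Translate a source code to the target code and its description."""
        code = TRANSLATION.get((source_system, source_value), "UNKNOWN")
        return code, REFERENCE.get(code, "Unknown")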

Error handling is required when the data steward enters incorrect information for these mappings and needs to correct it after

    data has been loaded. Correcting the above example could be complex (e.g., if the data steward entered ST as translating to

    OFFICE by mistake). The only way to determine which rows should be changed is to restore and reload source data from the

    first time the mistake was entered. Processes should be built to handle these types of situations, including correction of the

    entire target data architecture.

    Dimensional Data

    New entities in dimensional data present a more complex issue. New entities in the target may include Locations and Products,

    at a minimum. Dimensional data uses the same concept of translation as reference tables. These translation tables map the

    source system value to the target value. For location, this is straightforward, but over time, products may have multiple source

    system values that map to the same product in the target. (Other similar translation issues may also exist, but Products serves

    as a good example for error handling.)

    There are two possible methods for loading new dimensional entities. Either require the data steward to enter the translation

    data before allowing the dimensional data into the target, or create the translation data through the ETL process and force the

    data steward to review it. The first option requires the data steward to create the translation for new entities, while the second

    lets the ETL process create the translation, but marks the record as Pending Verification until the data steward reviews it and

    changes the status to Verified before any facts that reference it can be loaded.

When the dimensional value is left as Pending Verification, however, facts may be rejected or allocated to dummy values. This

    requires the data stewards to review the status of new values on a daily basis. A potential solution to this issue is to generate

    an email each night if there are any translation table entries pending verification. The data steward then opens a report that lists

    them.
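
A minimal sketch of that nightly notification is shown below (Python; the mail host, addresses and the pending_translations query function are assumptions):

    import smtplib
    from email.message import EmailMessage

    def notify_pending(pending_translations, steward_email):
        """E-mail the data steward if any translation entries are pending verification."""
        rows = pending_translations()      # e.g., rows WHERE status = 'Pending Verification'
        if not rows:
            return
        msg = EmailMessage()
        msg["Subject"] = f"{len(rows)} translation table entries pending verification"
        msg["From"] = "etl@example.com"
        msg["To"] = steward_email
        msg.set_content("\n".join(str(r) for r in rows))
        with smtplib.SMTP("mailhost.example.com") as smtp:
            smtp.send_message(msg)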


A problem specific to Product occurs when an entity is created as new but is really just a changed SKU number for an existing product.

    This causes additional fact rows to be created, which produces an inaccurate view of the product when reporting. When this is

    fixed, the fact rows for the various SKU numbers need to be merged and the original rows deleted. Profiles would also have to

    be merged, requiring manual intervention.

    The situation is more complicated when the opposite condition occurs (i.e., two products are mapped to the same product, but

    really represent two different products). In this case, it is necessary to restore the source information for all loads since the error

    was introduced. Affected records from the target should be deleted and then reloaded from the restore to correctly split the

    data. Facts should be split to allocate the information correctly and dimensions split to generate correct profile information.

    Manual Updates

    Over time, any system is likely to encounter errors that are not correctable using source systems. A method needs to be

    established for manually entering fixed data and applying it correctly to the entire target data architecture, including beginning

    and ending effective dates. These dates are useful for both profile and date event fixes. Further, a log of these fixes should be

    maintained to enable identifying the source of the fixes as manual rather than part of the normal load process.
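
One possible shape for such a manual-fix log entry is sketched below (Python; the structure and field names are assumptions intended only to reflect the requirements described above):

    from datetime import date

    def log_manual_fix(log, table, key, field, old_value, new_value,
                       begin_eff_date, end_eff_date, analyst):
        """Record a manual correction, including effective dates and its origin."""
        log.append({
            "table": table, "key": key, "field": field,
            "old_value": old_value, "new_value": new_value,
            "begin_eff_date": begin_eff_date, "end_eff_date": end_eff_date,
            "fixed_by": analyst, "fixed_on": date.today(),
            "source": "MANUAL",    # distinguishes manual fixes from the normal load process
        })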

    Multiple Sources

    The data stewards are also involved when multiple sources exist for the same data. This occurs when two sources contain

    subsets of the required information. For example, one system may contain Warehouse and Store information while another

    contains Store and Hub information. Because they share Store information, it is difficult to decide which source contains the

    correct information.

    When this happens, both sources have the ability to update the same row in the target. If both sources are allowed to update

    the shared information, data accuracy and profile problems are likely to occur. If we update the shared information on only one

    source system, the two systems then contain different information. If the changed system is loaded into the target, it creates a

    new profile indicating the information changed. When the second system is loaded, it compares its old unchanged value to the

    new profile, assumes a change occurred and creates another new profile with the old, unchanged value. If the two systems

    remain different, the process causes two profiles to be loaded every day until the two source systems are synchronized with

    the same information.

    To avoid this type of situation, the business analysts and developers need to designate, at a field level, the source that should

    be considered primary for the field. Then, only if the field changes on the primary source would it be changed. While this

    sounds simple, it requires complex logic when creating Profiles, because multiple sources can provide information toward the

    one profile record created for that day.

    One solution to this problem is to develop a system of record for all sources. This allows developers to pull the information from

    the system of record, knowing that there are no conflicts for multiple sources. Another solution is to indicate, at the field level, a

    primary source where information can be shared from multiple sources. Developers can use the field level information to

    update only the fields that are marked as primary. However, this requires additional effort by the data stewards to mark the

    correct source fields as primary and by the data integration team to customize the load process.
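
A hedged sketch of the field-level primary-source merge is shown below (Python; the PRIMARY_SOURCE mapping is an assumption maintained by the data stewards, not a product feature):

    # Only fields owned by the loading source system are applied to the shared row.
    PRIMARY_SOURCE = {"store_name": "SYS_A", "store_hours": "SYS_B"}

    def merge_shared_row(current, incoming, source_system):
        """Apply only the fields for which this source system is primary."""
        changed = False
        for field, value in incoming.items():
            if PRIMARY_SOURCE.get(field) == source_system and current.get(field) != value:
                current[field] = value
                changed = True
        return current, changed    # 'changed' drives whether a new profile is created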


Error Handling Techniques - PowerCenter Mappings

    Challenge

    Identifying and capturing data errors using a mapping approach, and making such errors available for

    further processing or correction.

    Description

    Identifying errors and creating an error handling strategy is an essential part of a data integration project. In the production

    environment, data must be checked and validated prior to entry into the target system. One strategy for catching data errors is

    to use PowerCenter mappings and error logging capabilities to catch specific data validation errors and unexpected

    transformation or database constraint errors.

    Data Validation Errors

    The first step in using a mapping to trap data validation errors is to understand and identify the error handling requirements.

    Consider the following questions:

* What types of data errors are likely to be encountered?
* Of these errors, which ones should be captured?
* What process can capture the possible errors?
* Should errors be captured before they have a chance to be written to the target database?
* Will any of these errors need to be reloaded or corrected?
* How will the users know if errors are encountered?
* How will the errors be stored?
* Should descriptions be assigned for individual errors?
* Can a table be designed to store captured errors and the error descriptions?

    Capturing data errors within a mapping and re-routing these errors to an error table facilitates analysis by end users and

    improves performance. One practical application of the mapping approach is to capture foreign key constraint errors (e.g.,

    executing a lookup on a dimension table prior to loading a fact table). Referential integrity is assured by including this sort of

    functionality in a mapping. While the database still enforces the foreign key constraints, erroneous data is not written to the

    target table; constraint errors are captured within the mapping so that the PowerCenter server does not have to write them to

    the session log and the reject/bad file, thus improving performance.
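
The lookup-before-load idea can be sketched as follows (Python; customer_key and the key set are hypothetical, and the lists stand in for the valid and error routes of the mapping):

    # Check a fact row against known dimension keys before it reaches the target.
    def route_fact(row, dimension_keys, valid_rows, error_rows):
        if row["customer_key"] in dimension_keys:
            valid_rows.append(row)    # safe to load; the foreign key cannot fail
        else:
            error_rows.append({**row, "ERROR_DESC": "customer_key not found in dimension"})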

    Data content errors can also be captured in a mapping. Mapping logic can identify content errors and attach descriptions to

them. This approach can be effective for many types of data content errors, including date conversion failures, null values intended for not null target fields, and incorrect data formats or data types.

    Sample Mapping Approach for Data Validation Errors

    In the following example, customer data is to be checked to ensure that invalid null values are intercepted before being written

    to not null columns in a target CUSTOMER table. Once a null value is identified, the row containing the error is to be separated

    from the data flow and logged in an error table.

One solution is to implement a mapping similar to the one described below:


An expression transformation can be employed to validate the source data, applying rules and flagging

    records with one or more errors.

    A router transformation can then separate valid rows from those containing the errors. It is good practice to append error rows

    with a unique key; this can be a composite consisting of a MAPPING_ID and ROW_ID, for example. The MAPPING_ID would

    refer to the mapping name and the ROW_ID would be created by a sequence generator.

    The composite key is designed to allow developers to trace rows written to the error tables that store information useful for error

    reporting and investigation. In this example, two error tables are suggested, namely: CUSTOMER_ERR and ERR_DESC_TBL.

The table ERR_DESC_TBL is designed to hold information about the error, such as the mapping name, the ROW_ID, and the error

    description. This table can be used to hold all data validation error descriptions for all mappings, giving a single point of

    reference for reporting.

    The CUSTOMER_ERR table can be an exact copy of the target CUSTOMER table appended with two additional columns:

    ROW_ID and MAPPING_ID. These columns allow the two error tables to be joined. The CUSTOMER_ERR table stores the

    entire row that was rejected, enabling the user to trace the error rows back to the source and potentially build mappings to

    reprocess them.

    The mapping logic must assign a unique description for each error in the rejected row. In this example, any null value intended


for a not null target field could generate an error message such as "NAME is NULL" or "DOB is NULL". This

    step can be done in an expression transformation (e.g., EXP_VALIDATION in the sample mapping).

    After the field descriptions are assigned, the error row can be split into several rows, one for each possible error using a

    normalizer transformation. After a single source row is normalized, the resulting rows can be filtered to leave only errors that

    are present (i.e., each record can have zero to many errors). For example, if a row has three errors, three error rows would be

    generated with appropriate error descriptions (ERROR_DESC) in the table ERR_DESC_TBL.

The following tables show how the error data produced may look.

Table Name: CUSTOMER_ERR
NAME   DOB    ADDRESS   ROW_ID   MAPPING_ID
NULL   NULL   NULL      1        DIM_LOAD

Table Name: ERR_DESC_TBL
FOLDER_NAME   MAPPING_ID   ROW_ID   ERROR_DESC        LOAD_DATE    SOURCE        TARGET
CUST          DIM_LOAD     1        Name is NULL      10/11/2006   CUSTOMER_FF   CUSTOMER
CUST          DIM_LOAD     1        DOB is NULL       10/11/2006   CUSTOMER_FF   CUSTOMER
CUST          DIM_LOAD     1        Address is NULL   10/11/2006   CUSTOMER_FF   CUSTOMER
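
The overall validate / route / normalize pattern can be sketched as follows (Python; the names mirror the example above, and the sequence counter stands in for the Sequence Generator transformation):

    from itertools import count

    row_id_seq = count(1)                        # stands in for the Sequence Generator
    NOT_NULL_FIELDS = ["NAME", "DOB", "ADDRESS"]

    def validate_customer(row, mapping_id="DIM_LOAD"):
        """Flag null values, route the row, and emit one error row per error."""
        errors = [f"{f} is NULL" for f in NOT_NULL_FIELDS if row.get(f) is None]
        if not errors:
            return {"target": row, "customer_err": None, "err_desc": []}
        row_id = next(row_id_seq)
        customer_err = {**row, "ROW_ID": row_id, "MAPPING_ID": mapping_id}
        err_desc = [{"MAPPING_ID": mapping_id, "ROW_ID": row_id, "ERROR_DESC": e}
                    for e in errors]             # normalized: one row per error
        return {"target": None, "customer_err": customer_err, "err_desc": err_desc}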

    The efficiency of a mapping approach can be increased by employing reusable objects. Common logic should be placed in

    mapplets, which can be shared by multiple mappings. This improves productivity in implementing and managing the capture of

    data validation errors.

    Data validation error handling can be extended by including mapping logic to grade error severity. For example, f