
Siperian Hub XU

Implementer’s Guide

XU

© 2007 Siperian, Inc.

Copyright 2007 Siperian, Inc. [Unpublished - rights reserved under the Copyright Laws of the United States]

THIS DOCUMENTATION CONTAINS CONFIDENTIAL INFORMATION AND TRADE SECRETS OF SIPERIAN, INC. USE, DISCLOSURE OR REPRODUCTION IS PROHIBITED WITHOUT THE PRIOR EXPRESS WRITTEN PERMISSION OF SIPERIAN, INC.

Contents

Preface
    Intended Audience
    Contents
    Learning About Siperian Hub
    Contacting Siperian

Chapter 1: Introducing Siperian Hub Implementation
    Siperian Implementation Methodology
        Reducing Project Risk
        Core Principles
    Roles in a Siperian Hub Implementation Project
    Phases in a Siperian Hub Implementation Project
        Discover Phase
        Analyze Phase
        Design Phase
        Build Phase
        Deploy Phase

Chapter 2: Analyzing Data
    Getting Started
    Defining the Flow of Data Between Siperian Hub and Source/Target Systems
        Determine Data Source Characteristics
        Assemble a Statistically Representative Sample Data Set
        Consider Data Sizing
        Consider the Relationship Between Data and Business Processes
    Consider Data Cleansing and Standardization Rules
    Consider Trust Levels and Validation Rules
        Trust Levels
        Validation Rules
    Consider Match Rules

Chapter 3: Designing the Data Model
    About Data Modeling for MRM
        Data Model Design Deliverables
        Conceptual Model
        Logical Model
        Physical Model
    Design Principles
        Principle 1: Consider Deep Versus Wide
        Principle 2: Match Requirements Drive the Model
        Principle 3: Consolidation Counts
        Principle 4: Pass the Independence Test
        Principle 5: Mix Different Types of Customers Carefully
        Principle 6: Landing and Staging Data
    Design Patterns
        Households
        Addresses
        Populating the Address Household Object
        Communication Channel Models

Chapter 4: Using Trust Settings and Validation Rules
    Using Trust Levels
        About Trust Levels
        How Trust Works
        Ranking Source Systems According to Trustworthiness
        Trust Best Practices
        Configuring Trust Levels
        Example Stored Procedure to Calculate Decayed Trust
    Using Validation
        About Validation Rules
        How Validation Works
        Best Practices for Validation Rules
    Using Trust and Validation Together
        Scenarios Involving Trust and Validation for a Column
        What Happens When a Record Is Updated
        Example Using Trust Levels and Validation Rules Together

Chapter 5: Configuring and Tuning Match Rules
    About Matching
        Before You Start Defining Your Match Rules
        Steps in the Match Process
    Populations
    Tokens for Match Keys
        Determining When to Tokenize Your Data
        Match Key Widths
        Match Key Types and Mixed Data
    Search Strategies
    Match Purposes
        Using the Match Purposes to Match People
        Using the Match Purposes to Match Organizations
        Using the Match Purposes to Match Addresses
        Name Formats
        Field Types Used in Purposes
        Match Levels
    Defining and Testing Your Match Rules
        About Testing
    Matching Best Practices
    Exact Match Column Properties
        Null Match
        Segment Match
        Using Matching on Dependent Tables
    Setting Match Batch Sizes
    Using Dynamic Match Analysis Threshold
    Tuning Match for Performance
    About Merging

Chapter 6: Implementing Hierarchy Manager
    About Hierarchy Manager
    Before You Begin Implementing Hierarchy Manager
        Defining Your Goals
        Understanding the Data
        Assembling the Team
        Determining Resources
    About Implementing a Hierarchy Manager System
        Step 1: Analyze Your Data
        Step 2: Build the Data Model
        Step 3: Configure Your Hierarchy Manager Implementation
        Step 4: Load Data

Chapter 7: Scheduling Batch Jobs and Batch Groups
    About Scheduling Siperian Hub Batch Jobs
    Setting Up Job Execution Scripts
        Metadata in the C_REPOS_TABLE_OBJECT_V View
        Identifiers in C_REPOS_TABLE_OBJECT_V
        Determining Available Execution Scripts
        Retrieving Values from C_REPOS_TABLE_OBJECT_V at Execution Time
        Running Scripts Asynchronously
    Monitoring Job Results and Statistics
        Error Messages and Return Codes
        Job Execution Status
    Job Scheduling Reference
        Alphabetical List of Jobs
        Autolink Jobs
        Auto Match and Merge Jobs
        Automerge Jobs
        BVT Snapshot Jobs
        Generate Match Token Jobs
        Key Match Jobs
        Load Jobs
        Manual Link Jobs
        Manual Unlink Jobs
        Match Jobs
        Match Analyze Jobs
        Match for Duplicate Data Jobs
        Stage Jobs
        Unmerge Jobs
    Scheduling Batch Groups
        About Batch Groups
        Stored Procedures for Batch Groups
    Developing Custom Stored Procedures for Batch Jobs
        About Custom Stored Procedures
        Required Execution Parameters for Custom Batch Jobs
        Example Custom Stored Procedure
        Registering a Custom Stored Procedure

Chapter 8: Implementing Custom Buttons in Hub Console Tools
    About Custom Buttons in the Hub Console
        How Custom Buttons Appear in the Hub Console
        What Happens When a User Clicks a Custom Button
    Adding Custom Buttons
        Writing a Custom Function
        Controlling the Custom Button Appearance
        Deploying Custom Buttons

Preface

Welcome to the Siperian Hub Implementer’s Guide. This guide explains how to design and implement your Master Reference Manager (MRM) system.

This guide has been written for database administrators, system administrators, data stewards, application developers, and other members of an MRM implementation team who are responsible for MRM implementation and configuration tasks. To learn more, see “Intended Audience” on page x.

You must be familiar with the platform on which Siperian Hub is installed. If that platform is Windows, then you must also have knowledge of Microsoft Windows Component Services, which is required for Siperian Hub™. Database administrators must be familiar with the database environment on which they have installed MRM. Knowledge of Oracle administration is particularly important.

Other administration and configuration tasks are described in the Siperian Hub Administrator’s Guide and Siperian Hub User’s Guide.

This guide assumes that MRM and all supporting software components have been installed. To learn more about installing MRM, see the Siperian Hub Installation Guide for your platform.


Intended Audience

This guide is intended for the following audiences:

MRM Implementers
Those responsible for designing, developing, testing, and deploying MRM according to the requirements of the organization. All of the chapters in this book are recommended for implementers.

Hierarchy Manager Implementers
Those responsible for designing, developing, testing, and deploying Hierarchy Manager according to the requirements of the organization. See Chapter 6, “Implementing Hierarchy Manager”.

Data Stewards
Custodians of data quality. In Siperian terms, data stewards are the people responsible for reviewing and, where necessary, correcting and manually merging business data on a regular and ongoing basis. While the primary resource for data stewards is the Siperian Hub User’s Guide, data stewards will also find the following chapters useful:
• Chapter 1, “Introducing Siperian Hub Implementation”
• Chapter 2, “Analyzing Data”
• Chapter 3, “Designing the Data Model”
• Chapter 4, “Using Trust Settings and Validation Rules”

Siperian Administrators
IT people responsible for configuring or updating a Hub Store so that it provides the rules and functionality required by the data stewards. While the primary resource for administrators is the Siperian Hub Administrator’s Guide, administrators will also find the following chapters useful:
• Chapter 4, “Using Trust Settings and Validation Rules”
• Chapter 5, “Configuring and Tuning Match Rules”

Contents

This guide contains the following chapters:

Chapter 1, “Introducing Siperian Hub Implementation”

Introduces the overall Siperian Hub implementation process and describes key concepts you need to understand before starting a Siperian Hub implementation project.

Chapter 2, “Analyzing Data”

Describes activities involved with analyzing data for a Siperian Hub implementation project.

Chapter 3, “Designing the Data Model”

Describes what implementers need to know before building the data model for a Siperian Hub implementation project.

Chapter 4, “Using Trust Settings and Validation Rules”

Provides a brief overview of how trust settings and validation rules work together, best practice recommendations, and examples.

Chapter 5, “Configuring and Tuning Match Rules”

Describes how to use and tune match rules.

Chapter 6, “Implementing Hierarchy Manager”

Describes concepts, methodology, design patterns, and other information that implementers need to know before beginning a Hierarchy Manager™ (HM) implementation project.

Chapter 7, “Scheduling Batch Jobs and Batch Groups”

Explains how to schedule Siperian Hub batch jobs using job execution scripts.

Chapter 8, “Implementing Custom Buttons in Hub Console Tools”

Explains how to add custom buttons to tools in the Hub Console that allow users to invoke external services on demand.

Learning About Siperian Hub

Siperian Hub Documentation Navigator

The Siperian Hub Documentation Navigator directs you to the books in the Siperian Hub documentation that are most useful to you based on your role.

Siperian Hub Installation Guide

The Siperian Hub Installation Guide for your platform explains how to install Siperian Hub and Cleanse Match Server. There is a Siperian Hub Installation Guide for each supported platform.

Siperian Hub Release Notes

The Siperian Hub Release Notes contain important information about this release of Siperian Hub. Read the Siperian Hub Release Notes before installing Siperian Hub.

What’s New in Siperian Hub

What’s New in Siperian Hub provides an enhanced description of the new features for this release.

Siperian Hub Tutorial

The Siperian Hub Tutorial walks you through various Siperian Hub implementation tasks on a step-by-step basis.

Siperian Hub Administrator’s Guide

The Siperian Hub Administrator’s Guide explains how to configure, administer, and manage a Siperian Hub implementation. It provides a description of the Siperian Hub platform through a discussion of Siperian Hub concepts, services, tools, and databases. Administrators should read the Siperian Hub Administrator’s Guide first.


Siperian Hub User’s Guide

The Siperian Hub User’s Guide explains how to use Siperian Hub. It provides a description of the Siperian Hub platform through a discussion of Siperian Hub concepts and tasks. Data stewards and users who are new to Siperian Hub should read the Siperian Hub User’s Guide first.

Siperian Hub Implementer’s Guide

The Siperian Hub Implementer’s Guide explains how to design, implement, test, and deploy a Siperian Hub implementation. Implementers must be familiar with the content of the Siperian Hub Administrator’s Guide as well as the Siperian Hub Implementer’s Guide before starting a Siperian Hub implementation.

Siperian Services Integration Framework Guide

The Siperian Services Integration Framework Guide explains how to use the Siperian Hub Services Integration Framework (SIF) to integrate Siperian Hub functionality with your applications and how to create applications using the data provided by Siperian Hub. SIF allows you to integrate Siperian Hub smoothly with your organization's applications.

Siperian Training and Materials

Siperian provides live, instructor-led training to help you become a proficient user as quickly as possible. From initial installation onward, a dedicated team of qualified trainers ensures that your staff is equipped to take advantage of this powerful platform. To inquire about training classes or to find out where and when the next training session is offered, please visit our web site or contact Siperian directly.

Contacting Siperian

Technical support is available to answer your questions and to help you with any problems encountered using Siperian products. Please contact your local Siperian representative or distributor as specified in your support agreement. If you have a current Siperian Support Agreement, you can contact Siperian Technical Support:

World Wide Web: http://www.siperian.com
E-Mail: [email protected]
Voice: U.S.: 1-866-SIPERIAN (747-3742)

We are interested in hearing your comments about this book. Send your comments to:

by E-Mail: [email protected]
by Postal Service: Documentation Manager, Siperian, Inc., 1820 Gateway Dr., Suite 109, San Mateo, CA 94404

Chapter 1: Introducing Siperian Hub Implementation

This chapter introduces the overall Siperian Hub implementation process and describes key concepts you need to understand before starting a Siperian Hub implementation project. It provides a framework and methodology for implementing Siperian Hub in a Siperian customer environment. This framework is intended to help with implementation planning in conjunction with the particular requirements of your Siperian Hub implementation. Although every Siperian Hub implementation is unique in specific ways, certain principles, patterns, and best practices can apply generally across most Siperian Hub implementations.

Before you attempt to implement your Siperian Hub system, you should be intimately familiar with Siperian Hub and proficient in using the Siperian Hub tools. To learn more about using Siperian Hub, read through the following documents:

• Siperian Hub User’s Guide
• Siperian Hub Administrator’s Guide

Chapter Contents

• Siperian Implementation Methodology
• Roles in a Siperian Hub Implementation Project
• Phases in a Siperian Hub Implementation Project

Siperian Implementation Methodology

The Siperian implementation methodology provides a comprehensive set of procedures, guidelines, best practices, templates, and checklists for implementing the Siperian Hub in a customer environment. It is intended to provide project teams with the flexibility to tailor an implementation project to meet their specific needs, while still providing the structure and guidance required to successfully implement Siperian Hub.

Reducing Project Risk

The main focus of the Siperian implementation methodology is to reduce project risk by:

• Standardizing the approach to implementing Siperian solutions through the use of best practices and templates

• Applying a risk avoidance-based scheduling approach to all project plans so that high-risk components of the project plan are completed as early as possible

• Including checkpoint review processes to help keep projects on track

• Providing sufficient knowledge transfer of Siperian products and implementation methodology, along with associated skills, to customers and implementation partners

Core Principles

The Siperian implementation methodology is deliverables-based, not time-based. Deliverables are produced by specific activities that are grouped into five gated phases (described in “Phases in a Siperian Hub Implementation Project” on page 6). In a gated phase, the project must pass through a checkpoint gate (a specific review process) before the phase can be considered complete and the project moves on to the next phase.

The objective of checkpoint gate reviews is not to enforce a rigid waterfall methodology in which everything must be completed, approved, and signed off before any activities in the next phase can begin. Used on its own, the Siperian implementation methodology allows for overlap between phases, with as much concurrency as possible, without exposing the project to unacceptable risk. The checkpoint gate reviews determine whether a sufficient portion of the deliverables from the current phase have been delivered with acceptable quality before the phase can be considered complete.
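
As an illustration only, the checkpoint gate decision described above could be modeled as in the following minimal Python sketch. The Deliverable structure, the gate_passes function, the deliverable names, and the 0.8 threshold are all hypothetical and are not part of the Siperian methodology, which does not prescribe a numeric threshold; each project would agree on its own criteria.

```python
from dataclasses import dataclass


@dataclass
class Deliverable:
    """One deliverable of a phase (hypothetical structure for illustration)."""
    name: str
    delivered: bool
    quality_ok: bool  # outcome of the checkpoint review for this deliverable


def gate_passes(deliverables: list[Deliverable], required_fraction: float = 0.8) -> bool:
    """Return True when enough deliverables are delivered with acceptable quality.

    required_fraction is an assumed, project-specific threshold.
    """
    if not deliverables:
        return False  # nothing to review, so the gate cannot be passed
    accepted = sum(1 for d in deliverables if d.delivered and d.quality_ok)
    return accepted / len(deliverables) >= required_fraction


# Hypothetical deliverables for one phase; the names are examples only.
analyze_phase = [
    Deliverable("Source system inventory", delivered=True, quality_ok=True),
    Deliverable("Sample data set", delivered=True, quality_ok=True),
    Deliverable("Trust and validation analysis", delivered=True, quality_ok=True),
    Deliverable("Match rule analysis", delivered=True, quality_ok=True),
    Deliverable("Data sizing estimate", delivered=False, quality_ok=False),
]

# Four of five accepted deliverables meets the assumed 0.8 threshold.
print(gate_passes(analyze_phase))
# Requiring every deliverable (a rigid waterfall-style gate) would not pass.
print(gate_passes(analyze_phase, required_fraction=1.0))
```

The threshold below 1.0 reflects the point made above: the gate asks for a sufficient portion of the deliverables, not all of them, so that phases can overlap.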

The Siperian implementation methodology can be used on its own or it can be incorporated into many other methodologies, such as PMBOK, Prince2, Iterative, Waterfall, RAD, and others (including your own in-house methodology). If you do incorporate the Siperian implementation methodology into your enterprise project management methodology, then your approach to starting a new phase will be determined by the guidelines of your particular enterprise project management methodology.

The Siperian implementation methodology is a project-based methodology that is based on the following principles:

• A project is a temporary and unique endeavor.

• A project has a start date and an end date.

• A project has a specific scope that is constrained by time, cost, and quality.

• A project contains risk that must be managed.

The final goal of any project implemented under the guidelines of the Siperian implementation methodology is to deliver a fully configured, tested, and deployed Siperian Hub environment with the appropriate levels of project documentation.

Roles in a Siperian Hub Implementation Project

A Siperian Hub implementation project usually involves the following roles, which might be filled by the customer, Siperian, or third-party integrators.

Typical Roles in a Siperian Hub Implementation Project

Role / Responsibilities

Customer Project Manager
Manages the overall project:
• Provides day-to-day project management, planning, and tracking
• Ensures that all issues and change requests have been communicated and resolved in a timely manner
• Defines and communicates resource needs
• Provides best practices and program management guidance
• Assists in requirements definition

Technical Lead
• Primary technical representative on project team
• Participates in analysis, design, and testing activities
• Manages Master Data design and implementation, including: Data Modeling, Business Rules, Data Loads, Rules Tuning, Consolidation QA, Package/View Configuration

Database Administrator
• Configures the database for Siperian Hub
• Sets up the Hub databases
• Works with the Solution Architect during Hub database performance testing and tuning

System Administrator
Configures the required hardware and infrastructure software

Solution Architect
Provides expert advice, counsel, and technical expertise to the project team to help assure that Siperian solutions are designed and developed in the optimal manner and in accordance with industry and Siperian best practices

Hub Builder
Assists with Siperian Hub design, development, testing, and deployment

EAI Specialist
Provides the design and development of EAI programs

ETL Specialist
Provides the design and development of ETL programs/modules

Web Services Specialist
Provides the design and development of Web interface applications

Checkpoint Reviewer
Provides an independent review of designs and deliverables at key junctures in the project to help assure the quality of the end product

The distinctions here are fluid and project-dependent. For a given Siperian Hub implementation project, a single team member might be responsible for multiple roles, and a single role might be shared among multiple team members.


Phases in a Siperian Hub Implementation Project

A Siperian Hub implementation can be broken down into five distinct phases:

• Discover Phase

• Analyze Phase

• Design Phase

• Build Phase

• Deploy Phase

Each phase has specific activities and deliverables.

Note: A sixth phase, the management of steady-state processes for supporting the environment post-deployment, is outside the scope of this document.

Discover Phase

The Discover phase initiates the implementation project and includes the following activities:

• Identifying the overall vision driving the need for the project

• Analyzing the high-level requirements for the project



• Defining scope restrictions for the project

• Defining the high-level solution architecture

• Project planning and costing, along with all underlying assumptions

• Assessing project risk and defining risk mitigation strategies

• Defining service level agreements (SLAs) for key systemic qualities, such as scalability, high availability, and performance

Note: Describing the Discover phase is outside the scope of this document.

Analyze Phase

The Analyze phase involves refining the analysis of the system requirements, including:

• Detailed source data analysis

• Detailed requirements definition

• Detailed gap analysis

• Evaluation and acquisition of any third-party solutions

• Refining the solution architecture

Design Phase

The Design phase focuses on translating the requirements of the Analyze phase into concrete designs that can be implemented and tested in the Build phase. It includes:

• Data modeling

• Interface design

• Definition of business rules for cleansing, matching, merging, and maintaining data

• Codification of standards and conventions

• Definition of test cases



Build Phase

The Build phase focuses on the following activities in a development environment:

• Siperian Hub installation and setup

• Configuring Siperian Hub to implement the data model and rules defined in the Design phase

• Fine-tuning the rules

• Developing any interfaces between Siperian Hub and the source and target systems

• Security and rules configuration

• Testing the interfaces and rules

Deploy Phase

The Deploy phase involves:

• Deploying the fully built, tested, and accepted solution into a production environment

• Wrapping up the project

• Handing the system over to the appropriate system support team

• Training


2
Analyzing Data
This chapter describes activities involved with analyzing data for a Siperian Hub implementation project.

Chapter Contents

• Getting Started

• Defining the Flow of Data Between Siperian Hub and Source/Target Systems

• Consider Data Cleansing and Standardization Rules

• Consider Trust Levels and Validation Rules

• Consider Match Rules

Getting Started

A critical early step in a Siperian Hub implementation project is to gain a thorough understanding of the data that you are integrating. For example, for each data source, you must know the data’s relative accuracy, structure, size, trends, expected growth, and any other characteristics that are peculiar to the data.

Data analysis is performed in the Analyze phase. The Analyze phase follows the Discover phase, during which a high-level data analysis is performed in order to identify any data issues or gaps that could impact project scope, timeline, costs, or risks. The Analyze phase includes both data analysis and business and functional requirements analysis. Data analysis and requirements analysis tend to happen in parallel with each other. The findings from data analysis often impact the requirements specification, and vice versa. However, data analysis is not dependent on requirements analysis.

Defining the Flow of Data Between Siperian Hub and Source/Target Systems


Data analysis begins by determining the source systems that will feed data into MRM. You must know exactly what data is coming, and where it is coming from, by understanding which sources feed data into Siperian Hub, as well as which target systems are fed updates from Siperian Hub. At a high level (in the Discover phase), this data flow is just a system-level bubble diagram. By the time the technical design document is completed in the Design phase, it has evolved to the level of specific files or tables.

Determine Data Source Characteristics

For each data source, consider the following tasks:

• Determine the size, data type, data age, quality, quantity, source, and any other characteristics that are peculiar to the data set.

• Determine any data quality issues.

• Check the primary keys that are available in the data.

• Gain an understanding of the data cardinality—between entities, as well as consolidation cardinality.

• Determine total data volumes, expected delta volumes, and load frequencies per source.

• Identify any special initial data load requirements for the system.

• Analyze data for invalid conditions, and then perform frequency analysis to determine how often those conditions occur per source.

• Differentiate between invalid data conditions that can or cannot be remedied through data cleansing. The latter data conditions are the ones that should be considered in defining trust and validation rules.

• Identify which data is more correct, not just which data is more correctly formatted.



• Consider which external systems, including source systems, should be updated when data changes in a base object. For example, you might want to update the CRM system whenever a customer’s address gets changed. Message queue triggers can be configured in the Hub Console so that data changes can be published to outbound message queues for retrieval by external systems. To learn more, see the Siperian Hub Administrator’s Guide.
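The frequency analysis of invalid conditions mentioned in the list above could be approached along the following lines. This is a minimal sketch, not a Siperian tool: the record layout, the `source` field, and the three invalid conditions are all hypothetical and would be replaced by the conditions that your own data analysis uncovers.

```python
from collections import Counter, defaultdict

# Hypothetical invalid-data checks; real checks come from your own data analysis.
INVALID_CONDITIONS = {
    "missing_last_name": lambda r: not (r.get("last_name") or "").strip(),
    "short_last_name": lambda r: 0 < len((r.get("last_name") or "").strip()) < 3,
    "missing_zip": lambda r: not (r.get("zip") or "").strip(),
}

def invalid_condition_frequencies(records):
    """Return {source: {condition: fraction of that source's rows}}."""
    totals = Counter()
    hits = defaultdict(Counter)
    for rec in records:
        source = rec["source"]
        totals[source] += 1
        for name, is_invalid in INVALID_CONDITIONS.items():
            if is_invalid(rec):
                hits[source][name] += 1
    return {
        source: {name: hits[source][name] / total for name in INVALID_CONDITIONS}
        for source, total in totals.items()
    }

# Made-up sample rows from two hypothetical source systems.
sample = [
    {"source": "CRM", "last_name": "Smith", "zip": "94404"},
    {"source": "CRM", "last_name": "", "zip": "94404"},
    {"source": "ERP", "last_name": "Li", "zip": ""},
    {"source": "ERP", "last_name": "Jones", "zip": "10001"},
]
frequencies = invalid_condition_frequencies(sample)
```

Comparing the per-source fractions side by side makes it easier to see which sources are candidates for cleansing and which conditions might instead feed into trust and validation rules.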

Assemble a Statistically Representative Sample Data Set

To assist in data analysis, assemble a complete, diverse, but statistically representative sample of your production data from each source system. This sample should contain various types of non-identical duplicates. The more closely the sample data reflects the typical characteristics of the production data set, the more useful it will be. Having a sample data set is an invaluable resource for designing, configuring, and testing match rules.
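One possible way to draw such a sample is to take the same fraction of rows from every source system, so that small sources are not crowded out by large ones. The sketch below assumes records are available as dictionaries with a hypothetical `source` field; the function name and the fixed seed are illustrative choices, not part of any Siperian tooling.

```python
import random
from collections import defaultdict

def stratified_sample(records, fraction, key="source", seed=0):
    """Sample the same fraction from every source system (at least one row each)."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_source = defaultdict(list)
    for rec in records:
        by_source[rec[key]].append(rec)
    sample = []
    for rows in by_source.values():
        size = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, size))
    return sample

# Made-up data: a large source and a small source.
records = (
    [{"source": "CRM", "id": i} for i in range(100)]
    + [{"source": "ERP", "id": i} for i in range(20)]
)
sample = stratified_sample(records, fraction=0.1)
```

Random sampling alone does not guarantee that the sample contains the non-identical duplicates needed for match rule testing, so known duplicate groups may need to be added to the sample deliberately.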

Consider Data Sizing

Developing detailed knowledge about data sources provides the basis for correctly sizing your MRM implementation. Consider the following factors:

• data volume: number of rows, size of rows, large data sets, amount of raw data, ratio of raw to consolidated records, how “matchy” the data is

• data volatility: the frequency of updates to the data within the source system

• load frequency: how often this data will be brought into MRM to update the master records

• data model: number of base objects

• history retention and audit requirements

• number of source systems

• match rules

• performance requirements, if applicable


Consider the Relationship Between Data and Business Processes

It is essential to understand:

• the importance of each column’s data to the business processes and business users that produce it

• the quality of the data capture processes and data validation processes in each source system

• how closely your use of the data aligns with the purposes of the people with whom the data originates (closer alignment means more reliable data)

Consider Data Cleansing and Standardization Rules

When analyzing data, consider source attributes that would benefit from data cleansing and standardization rules. Cleanse lists are intended to facilitate data conversion during the staging process, ensuring that the data that ends up in the staging table is in a standardized, consistent format. For each source, the appropriate transformation from source-specific codes to the standard codes can be achieved with a cleanse list maintained in MRM. This also enables the base objects to contain the actual standardized code values (as opposed to the Rowid_Object pointing to the standard code value).

If cleanse lists are used to standardize codes, then a lookup table can be set up in MRM for each code to validate the code during data loading, ensuring that any record containing an erroneous code for which there is not a cleanse list entry does not get propagated into the base objects.
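The interplay between cleanse lists and lookup validation described above can be illustrated outside MRM with a small sketch. Everything in it is an assumption made for illustration: the source names, the gender codes, the `gender_code` field, and the two functions are hypothetical and do not reflect how MRM stores or applies cleanse lists.

```python
# Hypothetical cleanse lists: source-specific codes -> standard codes.
CLEANSE_LISTS = {
    "CRM": {"M": "MALE", "F": "FEMALE"},
    "ERP": {"1": "MALE", "2": "FEMALE"},
}
# Hypothetical lookup table of valid standard codes.
GENDER_LOOKUP = {"MALE", "FEMALE"}

def stage_record(source, record):
    """Standardize the gender code the way a cleanse list would during staging."""
    raw = record["gender_code"].strip().upper()
    # Codes without a cleanse list entry pass through unchanged.
    standardized = CLEANSE_LISTS[source].get(raw, raw)
    return {**record, "gender_code": standardized}

def load_records(staged):
    """Split staged records into loadable and rejected using the lookup table."""
    loaded, rejected = [], []
    for rec in staged:
        (loaded if rec["gender_code"] in GENDER_LOOKUP else rejected).append(rec)
    return loaded, rejected

staged = [
    stage_record("CRM", {"id": 1, "gender_code": "m"}),
    stage_record("ERP", {"id": 2, "gender_code": "2"}),
    stage_record("ERP", {"id": 3, "gender_code": "9"}),  # no cleanse list entry
]
loaded, rejected = load_records(staged)
```

In this sketch the third record is rejected at load time because its code has no cleanse list entry and is not a valid standard code, which mirrors the goal of keeping erroneous codes out of the base objects.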

Consider Trust Levels and Validation Rules

During the Analyze and Design phases of a project, it is important to identify the factors affecting the trust levels of your source data, and to determine what validation rules need to be implemented. Although configuring trust levels occurs later in the Siperian Hub implementation process, you should begin thinking about trust level settings and validation rules during data analysis. As you analyze the data, you learn more about its varying levels of accuracy. This knowledge contributes to the trust rules design.

The quality of the data (as defined by the relative importance of the source system and the relative quality of the data coming from that source system) is the main factor in determining trust settings. If you find during data analysis that some data is typically erroneous, then you probably want to give it a lower trust score.

To learn more about defining trust settings and validation rules, see Chapter 4, “Using Trust Settings and Validation Rules”. For more information on using the MRM tools to set trust levels, see the Siperian Hub Administrator’s Guide.

Trust Levels

In MRM, the Siperian Trust Framework ensures that its consolidated records, at the cell level, contain the most reliable information available from the data sources. Trust is a mechanism for measuring the confidence factor associated with each cell based on its source system, change history, and other business rules. Trust takes into account the validity of the data, the age of the data, and how much its reliability has decayed over time. For more information about trust settings, see “Using Trust Levels” on page 52.

Trust is assigned at the column level. It can be specified, for example, that Source System 1 is more reliable for “customer name” but Source System 2 is more reliable for “phone number”. There are several parameters that can be set to assign Trust for each source system’s column, such as:

• Maximum (initial) Trust level for a new data value

• Minimum Trust level for an “old” data value

• Decay Period or length of time that the trust level takes to decay from the Maximum Trust to the Minimum Trust

• Decay Type or the shape of the decay curve (a straight line or a curve)

For example, the “Email Address” from a Web application might be assigned Maximum Trust of 80, Minimum Trust of 20, Decay Period of 1 year, and Decay Type of SIRL (Slow Initial, Rapid Later), indicating a curve that decays gently at first and more rapidly later.
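To make the interaction of these four parameters concrete, the sketch below computes a trust score as a function of the age of a data value. The exact curve shapes that MRM uses are not given in this guide, so the formulas here are assumptions chosen only for illustration: a straight line for linear decay, and a quadratic curve as a stand-in for SIRL. The function name and its parameters are likewise hypothetical.

```python
def trust_score(age_days, max_trust, min_trust, decay_days, decay_type="LINEAR"):
    """Illustrative trust decay; the curve shapes here are assumptions."""
    # Fraction of the decay period that has elapsed, clamped to [0, 1].
    elapsed = min(max(age_days / decay_days, 0.0), 1.0)
    if decay_type == "LINEAR":
        decayed = elapsed            # straight line
    elif decay_type == "SIRL":
        decayed = elapsed ** 2       # slow initial, rapid later (assumed shape)
    else:
        raise ValueError(f"unknown decay type: {decay_type}")
    return max_trust - (max_trust - min_trust) * decayed

# The "Email Address" example: Maximum 80, Minimum 20, Decay Period 1 year.
fresh = trust_score(0, 80, 20, 365, "SIRL")
half_year_linear = trust_score(182.5, 80, 20, 365, "LINEAR")
half_year_sirl = trust_score(182.5, 80, 20, 365, "SIRL")
expired = trust_score(730, 80, 20, 365, "SIRL")
```

Under these assumed curves, a value halfway through its decay period retains more trust with SIRL than with linear decay, and any value older than the decay period sits at the Minimum Trust.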

In addition to internal data sources, consider data sources that are not controlled within your organization. For example, suppose your organization purchases data sets from a third-party provider. These data sources might be guaranteed to consist of unique records with a high level of accuracy. Accordingly, you could decide to designate a high level of trust for this data.

Validation Rules

A validation rule tells Siperian Hub the condition under which a data value is not valid. If data meets the criterion specified by the validation rule, then the trust value for that data is downgraded by the percentage specified in the validation rule. To learn more about validation rules, see “Using Validation” on page 65.

Here are some examples of validation rules:

• Downgrade trust on Last Name if length(last_name) < 3 and last_name <> ‘NG’

• Downgrade trust on middle_name if middle_name is null

• Downgrade trust on Address Line 1, City, State, Zip and Valid_address_ind if Valid_address_ind = ‘False’

If the Reserve Minimum Trust flag is enabled (checked) for a column, then the trust cannot be downgraded below the column’s minimum trust setting.
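The effect of a single validation rule, including the Reserve Minimum Trust flag, can be sketched as follows. This is an illustration under stated assumptions rather than MRM's implementation: the downgrade arithmetic (multiplying the trust by one minus the percentage), the downgrade percentage of 75, and all function and parameter names are hypothetical. Only the invalid condition itself is taken from the first example rule above.

```python
def short_last_name(record):
    """Invalid condition from the first example rule above."""
    last_name = record.get("last_name") or ""
    return len(last_name) < 3 and last_name != "NG"

def validated_trust(record, trust, is_invalid, downgrade_pct,
                    min_trust, reserve_min_trust=False):
    """Return the trust score after applying one validation rule."""
    if not is_invalid(record):
        return trust
    downgraded = trust * (1 - downgrade_pct / 100)  # assumed arithmetic
    if reserve_min_trust:
        # With the flag enabled, trust never drops below the column minimum.
        downgraded = max(downgraded, min_trust)
    return downgraded

valid_name = validated_trust({"last_name": "Smith"}, 80, short_last_name, 75, 30)
exempt_name = validated_trust({"last_name": "NG"}, 80, short_last_name, 75, 30)
short_name = validated_trust({"last_name": "Li"}, 80, short_last_name, 75, 30)
short_name_reserved = validated_trust(
    {"last_name": "Li"}, 80, short_last_name, 75, 30, reserve_min_trust=True
)
```

In the sketch, the short last name is downgraded below the column's minimum trust unless the flag is enabled, in which case the result is held at the minimum.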

Consider Match Rules

Although configuring match rules occurs later in the Siperian Hub implementation process, you should begin thinking about match rules during data analysis because the data analysis will turn up data characteristics that govern the match rules. Therefore, as you analyze data, do so with match rules in mind.

During data analysis, identify which columns are appropriate for matching. For example, if a gender column is null 80% of the time, then this column is probably not a column to use in a match rule. Similarly, investigate the distribution of data so that you can assess in advance how selective a match rule needs to be for certain columns.
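A simple column profile can support this assessment. The sketch below reports the null rate and two rough indicators of how well a column distinguishes records. The record layout, the column names, and the choice of statistics are assumptions for illustration; they are not thresholds or measures defined by Siperian Hub.

```python
from collections import Counter

def profile_column(records, column):
    """Return null rate and value distribution statistics for one column."""
    values = [rec.get(column) for rec in records]
    populated = [v for v in values if v not in (None, "")]
    counts = Counter(populated)
    total = len(values)
    return {
        "null_rate": 1 - len(populated) / total if total else 0.0,
        "distinct_values": len(counts),
        # Share of populated rows holding the most common value; a high share
        # means the column does little to distinguish records.
        "top_value_share": (counts.most_common(1)[0][1] / len(populated)
                            if populated else 0.0),
    }

# Made-up data: gender is null in 8 of 10 rows, zip is always populated.
records = (
    [{"gender": None, "zip": str(94000 + i)} for i in range(8)]
    + [{"gender": "F", "zip": "94404"}, {"gender": "M", "zip": "94404"}]
)
gender_profile = profile_column(records, "gender")
zip_profile = profile_column(records, "zip")
```

Here the gender column shows the 80% null rate from the example above, while the zip column is fully populated and spread over many distinct values, which makes it the more promising candidate for a match rule.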



To learn more about defining match rules, see Chapter 5, “Configuring and Tuning Match Rules”. For more information on using the MRM tools to configure match rules, see the Siperian Hub Administrator’s Guide.


3
Designing the Data Model
This chapter describes what implementers need to know before building the data model for a Siperian Hub implementation project. It is recommended for all implementers and anyone else who must understand the Master Reference Manager data model. To learn more about the data model, see the Siperian Hub Administrator’s Guide.

Note: This chapter assumes that the reader is familiar with conventional data modeling methodologies—it supplements conventional data modeling techniques with MRM-specific recommendations.

Chapter Contents

• About Data Modeling for MRM

• Design Principles

• Design Patterns

About Data Modeling for MRM

Data modelers and design consultants responsible for defining the data model for MRM require expertise in relational modeling at the conceptual, logical, and physical levels. The following sections introduce the various types of models necessary to develop a Siperian Hub implementation:

• Data Model Design Deliverables

• Conceptual Model

• Logical Model

• Physical Model

Data Model Design Deliverables

The process of designing the data model for consolidated reference data for a Siperian Hub implementation involves a series of deliverables. The following figure shows the major phases of the Siperian implementation methodology, along with the data model delivered in each phase.

• The design starts with a conceptual model, which identifies the main objects to be managed in MRM. It also identifies which objects will be consolidated, because match criteria ultimately drive modeling decisions for the physical model.

• The conceptual model is used as the starting point for the logical model, which provides a logical representation of the entities and attributes to be managed in MRM.

• The logical model is transformed into a physical model, which is the model that is then defined in MRM using the Schema Manager in the Hub Console. Transitioning from a logical model to an ideal MRM physical model involves design principles that are described in “Design Principles” on page 27 later in this chapter. The physical model is the final output from the data modeling design steps, and it is the model that the business and system owners need to approve.

The following figure shows the increasing level of detail and number of entities in conceptual, logical, and physical models.

Conceptual Model

The purpose of the conceptual model is to identify and describe the main objects needed to create a global business view of the data, with little detail. This step is often skipped in typical IT projects, or it might be combined with the logical model. However, for Siperian Hub implementations, it is very important to go through this step because it starts the process of thinking about match requirements, which impact the physical model design.

The conceptual model for a Siperian Hub implementation shows the business entities that will need to be managed in MRM, along with the relationships among the business entities and some high-level design properties. If you have worked with entity relationship diagrams (ERDs), the conceptual model might look similar. To facilitate logical and physical (or logical to physical) data model design, the Match and Merge and Intertable Match Parent properties are the most critical properties to identify (to learn more, see the Siperian Hub Administrator’s Guide). One approach is to begin with the worst case match scenario, determine the elements in the token match table, and then trim this down to the tables that would be realistically used for matching.



The following figure shows an example of a conceptual data model.

The conceptual model must be derived from the system requirements, with inputs from analyses of internal and external business system data sources.

Note: For some projects, a pre-existing logical data model might be available. In such cases, it is still important to create a conceptual data model to ensure that you have identified the Match and Merge requirements that can have a significant impact on the subsequent physical data model.

Logical Model

The purpose of building a logical model is to confirm that the application will satisfy the business requirements. A logical model represents the entities, relationships, and attributes that are representative of the business information needs. A logical model is usually a normalized model. Normalization is the process of determining stable attribute groupings in entities with high interdependency and affinity.

By defining entities, attributes, and their relationships, you might discover data model design flaws that could produce anomalies. Such flaws include:

• Missing entities

• Multiple entities that represent the same conceptual entity



• Many-to-many relationships, each of which needs an additional entity (an intersection table) that turns the many-to-many relationship into two one-to-many relationships

• Multivalued and redundant attributes

Example Logical Model with Design Flaws

The following figure shows an example of a logical model that has some design flaws.

This logical model is based on the previous conceptual model example shown in the figure in “Conceptual Model” on page 19. It has the following design flaws:

1. Affiliation Role probably needs a lookup table to define the different types of Affiliation Roles (missing entity).

2. Repeating attributes (phone numbers, fax numbers, email addresses) can be normalized into an Electronic Address entity.



Example Logical Model with Fixed Design Flaws

The following figure shows the logical model after it has been fully normalized and missing entities have been added.

The logical model includes the following new entities:

1. An Electronic Address entity has been added to handle the repeating phone and fax number attributes (which have therefore been removed from the Customer Address intersection table).

2. An Electronic Address Type table has been added to provide definitions for the types of electronic address represented in each record.

3. An Affiliation Role lookup table has been added.
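The normalization of repeating contact attributes into an Electronic Address entity can be pictured with a small transformation. The column names (`phone_1`, `phone_2`, `fax`, `email`), the type codes, and the function below are hypothetical and serve only to illustrate the idea; they are not taken from the example model in the figure.

```python
# Hypothetical repeating contact columns, mapped to Electronic Address types.
REPEATING_COLUMNS = {
    "phone_1": "PHONE",
    "phone_2": "PHONE",
    "fax": "FAX",
    "email": "EMAIL",
}

def normalize_electronic_addresses(customer):
    """Split repeating contact columns into one Electronic Address row each."""
    core = {k: v for k, v in customer.items() if k not in REPEATING_COLUMNS}
    addresses = [
        {"customer_id": customer["customer_id"], "type": addr_type,
         "value": customer[col]}
        for col, addr_type in REPEATING_COLUMNS.items()
        if customer.get(col)  # skip empty or missing attributes
    ]
    return core, addresses

wide_record = {
    "customer_id": 12345,
    "name": "Example Corp",
    "phone_1": "555-0100",
    "phone_2": "",
    "fax": "555-0101",
    "email": "info@example.com",
}
core, addresses = normalize_electronic_addresses(wide_record)
```

Each populated contact attribute becomes its own row, typed by a code that a lookup table such as Electronic Address Type would define.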



Pre-Existing and New Logical Models

Before considering how the logical model will transition to a physical model, it is important to get the logical model right. In some Siperian Hub implementations, a pre-defined logical model is available. In such situations, you still need to evaluate the logical model to make sure that:

• it meets the stated business needs

• it makes sense logically

• the entities and attributes in the logical model can be populated from the source systems (there is little point in modeling entities or attributes that cannot be populated from the source systems)

The pre-existing logical model might not be tuned to work particularly well in MRM. Therefore, you will need to determine how to transition that logical model to a suitable physical model.

In other Siperian Hub implementations, you will need to define the logical model from scratch. In such cases, the logical model can be defined in a way that suits the business needs and is more closely aligned with the models for which MRM is tuned.

Objects in the Logical Model

When modeling for MRM, the logical model must focus on the actual entities that will be defined in MRM as base objects or dependent objects.

Objects in the Logical Model

Type of Object / Description

Base Objects
Used to describe central business entities, such as Customer, Product, or Employee. In a base object, data from multiple sources can be consolidated or merged. Trust settings are used to determine the most reliable value for each base object cell. In addition, one-to-many relationships (foreign keys) can be defined between base objects.

Dependent Objects
Used to store detailed information about the rows in a base object (such as header-detail relationships). One row in a base object table can map to several rows in a dependent object table.



Do not model history tables, cross-references, and similar supporting structures, because MRM automatically creates and manages them for you. In addition, avoid adding landing tables or staging tables to the logical model, because they clutter the model unnecessarily. You can model landing tables as part of the physical model.

Remember that the logical model is not an enterprise-wide data model. The logical model is a model for reference data only, and it is usually only for a specific subset of the reference data (such as Customer data). Similarly, do not include transaction data in the logical model, and limit the model to the reference data that is to be managed in MRM. Finally, bear in mind that the physical model—not the logical model—is the actual model that you will implement for MRM.

Physical Model

The physical model is the actual model that you define using the Schema Manager in the Hub Console (to learn more, see the Siperian Hub Administrator’s Guide). It is thus a subset of the complete physical schema that will be generated by MRM. The physical model diagram shows the base objects, dependent objects, and landing tables to be implemented in MRM.

The rule of thumb for physical model diagrams is to show the user-defined entities and attributes, plus the primary and foreign keys, so that relationships can be modeled correctly. In the physical model, avoid showing MRM-generated entities or attributes other than primary and foreign keys. All supporting tables—such as cross-references, history tables, control tables, and staging tables—will be created by MRM and therefore are not included in the physical model diagram.

MRM is flexible enough to implement any logical model as a physical model, but it is tuned to work better with some types of models than with others. Performance is the main driver for differences between the logical model and the physical model. Before you develop a physical model for a Siperian Hub implementation, you must carefully review your logical model in light of its performance implications. An ideal physical model for MRM strikes a balance between a completely denormalized model (best performance) and a highly normalized model (best flexibility).



The following figure shows an example physical model based on the logical model described previously.

Notice that all of the entities defined in the logical model will be implemented as base objects and that ROWID_OBJECT is used for all primary keys. In addition, notice that the many-to-many relationship between the Customer and Address entities in the logical model has been changed to a one-to-many relationship in the physical model. The reasons for these changes are explained in “Design Principles” on page 27, later in this chapter.

When designing the physical model, consider the following factors:

• Required Functionality

• Performance and Scalability

• Flexibility for Future Use

• Siperian Product Roadmap



Required Functionality

Required functionality is one of the key factors affecting design decisions in the physical model. Some examples of functionality requirements include:

• If you must keep a history of changes to attributes of an object, then define that object as a base object.

Performance and Scalability

A completely denormalized model gives the fastest performance, particularly for merge and unmerge, as there are fewer child tables to be updated on merge or unmerge. However, a completely denormalized model limits both flexibility and functionality.

The more denormalized the model, the fewer levels of consolidation are available, and the more difficult it can be to add new data sources and new attributes or entities in the future. You must therefore find a balance between modeling for performance (denormalizing) and modeling for functionality and flexibility (normalizing). Do not denormalize simply for the sake of denormalizing. Some areas are better candidates for denormalization than others, because they yield the most performance benefit with the least loss of functionality and flexibility. These issues are discussed in detail in “Design Principles” on page 27, later in this chapter.

Flexibility for Future Use

When defining the physical model, it is important to keep possible future requirements in mind, but without adding entities or attributes that cannot yet be maintained or that are not yet fully understood. Sometimes building in system flexibility is as simple as naming things flexibly. For example, if you are building a Customer master for Organization data and you know that the plan is to add Person data to that Customer master within the next year, then consider using a name other than “Organization” (such as “Business Party”) for the Customer table, because the table may well end up containing both Organization and Person data.

Be wary of adding physical limitations that might later cause problems. One example of this is specifying user-defined unique keys on base objects. If you define a unique key on a base object, you cannot merge records in that base object. Although this might not be a problem in the initial implementation of a project, it is not uncommon for new sources that are later added to a system to bring their own values for the base object with the unique key, making it desirable to use match and merge functionality to consolidate the new system’s data with the original system’s data.

Siperian Product Roadmap

An optimal physical design for a Siperian Hub implementation takes into account what is known of future requirements, the Siperian product roadmap, and the intersection between them. If you have any questions about how your model relates to the Siperian product roadmap, arrange (through Siperian Support) for a data model review with Siperian Solutions Delivery and Engineering.

If you model types of objects (such as households) or types of relationships that are not discussed in this document, then you should review the data model with your Siperian Solution Architect to make sure that the model does not run contrary to any assumptions in MRM design, QA, or planned features. This review should be conducted as part of the data model checkpoint review that should already be built into your project plan.

Design Principles

This section describes some underlying design principles for transitioning from a highly normalized logical model to a physical model.

• Principle 1: Consider Deep Versus Wide

• Principle 2: Match Requirements Drive the Model

• Principle 3: Consolidation Counts

• Principle 4: Pass the Independence Test

• Principle 5: Mix Different Types of Customers Carefully

• Principle 6: Landing and Staging Data



Principle 1: Consider Deep Versus Wide

This design principle refers to the number of direct child tables linked to a parent table. The following figure shows the two different types of designs.

This principle applies when you want to merge or unmerge on the parent table. The design principle mainly affects performance of the merge and unmerge processes.

The more directly linked child tables a parent table has, the more tables must have their foreign key references updated when records merge in the parent table. Therefore, the more child tables a parent table has, the slower merges on the parent table will be.
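The relationship between the number of directly linked child tables and the work done by a merge can be shown with a deliberately simplified model. This is not how MRM implements merges: the in-memory tables, the `parent_id` column, and the function below are hypothetical, and the sketch only counts the foreign key updates that a merge of two parent records would trigger.

```python
def merge_parent(child_tables, survivor_id, merged_id):
    """Re-point foreign keys in every directly linked child table.

    Returns the number of rows updated per child table.
    """
    updates = {}
    for table_name, rows in child_tables.items():
        updated = 0
        for row in rows:
            if row["parent_id"] == merged_id:
                row["parent_id"] = survivor_id
                updated += 1
        updates[table_name] = updated
    return updates

# Hypothetical wide design: three child tables hang directly off the parent.
child_tables = {
    "address": [{"parent_id": 2}, {"parent_id": 2}, {"parent_id": 1}],
    "phone": [{"parent_id": 2}],
    "affiliation": [{"parent_id": 1}],
}
updates = merge_parent(child_tables, survivor_id=1, merged_id=2)
tables_visited = len(updates)
rows_updated = sum(updates.values())
```

Every directly linked child table has to be visited on each merge, even when it contains no rows for the merged record, which is why a wide design makes parent merges slower.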

This principle applies to the unmerge process as well. For unmerges in a deep model, consider how far you allow unmerges to cascade. Which child tables need to have cascade unmerge enabled? How many child tables deep should you choose to enable the Unmerge on Parent Unmerge flag? The more child tables you have with merged records and the Unmerge on Parent Unmerge flag enabled, the more work the unmerge needs to do, and therefore the slower the unmerge process.



Principle 2: Match Requirements Drive the Model

Match criteria also drive physical data model decisions with respect to functionality. Intertable match criteria involve the use of attributes from one table in the match rules of a related table, for example, matching customers using address information from the Address table. For more information, see “Address Example” on page 31.

Another area in which required match functionality can affect the physical model is the way in which match rules must be defined. If you need to define an AND match rule, you need to denormalize repeating attributes that are to be used in the match rule. Normalizing repeated attributes into a child table allows OR match rules on the normalized attributes, not AND match rules.

For example, if you create an Electronic Address table that contains phone numbers and e-mail addresses, you can use these in a match rule that identifies records as matching if their phone numbers are the same OR if their email addresses are the same. If you need a match rule that identifies records as matching if their phone numbers match AND their email addresses match, then you need to denormalize these into separate columns.

The following figure shows an example of a normalized Electronic Address table that supports OR match conditions only.

This Electronic Address table supports match rules in which phone numbers match OR e-mail addresses match. In the example shown, Customer IDs 12345, 45678, and 00001 would all be identified as matches for one another because of their matching phone numbers.


The following figure shows an example of denormalized attributes to support AND match conditions.

Logically, this table shows the same data as in the normalized Electronic Address table, but the physical structure has been denormalized to support match conditions that specify AND criteria. In this example, Customer IDs 12345 and 45678 would match because their phone numbers match AND their email addresses match. Customer ID 00001 would not be considered a match for the other two records because it has a different e-mail address. For more information, see “Communication Channel Models” on page 46.
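
The difference between the two structures can be sketched in Python. This is only an illustration of the match outcomes described above, not of how the match engine works. The Customer IDs come from the example, but the phone numbers and e-mail addresses are hypothetical stand-ins for the values in the figures, and the function names are invented for this sketch.

```python
# Hypothetical values; the original figures are not reproduced here.
# Normalized form: one row per (customer_id, type, value).
electronic_address = [
    ("12345", "phone", "555-0100"), ("12345", "email", "a@example.com"),
    ("45678", "phone", "555-0100"), ("45678", "email", "a@example.com"),
    ("00001", "phone", "555-0100"), ("00001", "email", "b@example.com"),
]


def or_matches(rows, cust_a, cust_b):
    """OR rule on the normalized table: any shared (type, value) row is a match."""
    values_a = {(t, v) for c, t, v in rows if c == cust_a}
    values_b = {(t, v) for c, t, v in rows if c == cust_b}
    return bool(values_a & values_b)


# Denormalized form: one row per customer, one column per channel.
customer = {
    "12345": {"phone": "555-0100", "email": "a@example.com"},
    "45678": {"phone": "555-0100", "email": "a@example.com"},
    "00001": {"phone": "555-0100", "email": "b@example.com"},
}


def and_matches(table, cust_a, cust_b):
    """AND rule on the denormalized table: both columns must agree."""
    a, b = table[cust_a], table[cust_b]
    return a["phone"] == b["phone"] and a["email"] == b["email"]


print(or_matches(electronic_address, "12345", "00001"))
print(and_matches(customer, "12345", "45678"))
print(and_matches(customer, "12345", "00001"))
```

With these stand-in values, all three customers match under the OR rule, while only 12345 and 45678 match under the AND rule.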

Principle 3: Consolidation Counts

The physical model must take into account the required results after consolidation and, particularly, the desired cardinality of base object to cross-references after consolidation (where cardinality is the ratio of the number of records in the base object to the number of records in the cross-reference). The physical model must also consider the effects of source updates on the surviving record. This section describes several examples to illustrate this principle.

Physician Specialties Example

A physician can have one or more specialties. Pharmaceutical companies are often interested in identifying only the primary specialty for a physician. However, when two physician records are merged from different sources, those sources might provide different values for the physician's primary specialty. If the required cardinality after merging the specialties is one surviving primary specialty, then you should include Primary Specialty as a column on the Physician base object. However, if the pharmaceutical company wants to keep all of the specialties for the merged physician record, then Physician Specialty must be a child table of the Physician table.


Address Example

Logically, a single address can belong to multiple customers. For example, office addresses can be shared by colleagues at the same location, or group practice addresses can be shared by partners in the same law firm. Of course, a customer can also have multiple addresses. For this reason, logical models usually have customer and address as distinct entities with a many-to-many intersection table between them.

However, in a physical model for consolidated data, this approach is not necessarily practical, especially if you are trying to reduce duplication in addresses from multiple sources. Consolidating addresses when they are not directly linked to customers means that you are consolidating addresses across customers. For example, in the following figure, N.E. One and Ann Other both have the same address. If the two address records are merged, then one surviving address record will remain and that record will be linked to both N.E. One and Ann Other through the Customer Address intersection table.

Avoid consolidating addresses across customers unless there is a real business need for an enterprise-wide unique ID per physical address location. Even if there is a real business need, there are other ways to model this instead. For more information, see “Design Patterns” on page 42.

Consolidating addresses across customers involves limiting address changes to the right customers, performance considerations, and functionality considerations.


Limiting Address Changes to the Right Customers

If one Customer changes their address, then you need to make sure that the address change is not automatically applied to the consolidated address record for all customers. For example, in the figure shown in “Address Example” on page 31, if N.E. One moves their office, it does not mean that Ann Other has also moved their office, so the consolidated address that was previously linked to both N.E. One and Ann Other now belongs only to Ann Other.

Performance Considerations

Consolidating addresses across customers means that you usually have a high degree of cardinality between the source addresses and the resultant consolidated addresses. The higher the number of duplicate records, the more work the merge must do to process them. The cardinality is reduced if Customer ID is one of the match criteria for addresses—that is, if addresses are consolidated only within customer records, not across them. The following figure shows the recommended approach for customer address relationships.

Using this approach also reduces the number of tables that must be staged and loaded. This approach does not necessarily yield a large performance gain if your implementation involves only a handful of source systems. However, the more source systems that are configured, the greater the performance impact that each additional target table has on stage and load batches. For example, a Siperian Hub implementation with five sources for the previous model (shown without consolidated addresses in “Business Party and Differentiated Customer Models” on page 36) requires 15 stage jobs and 15 load jobs. An implementation with ten sources for that same model requires 30 stage jobs and 30 load jobs. For the model with consolidated addresses, five sources require ten stage jobs and ten load jobs, and ten sources require 20 stage jobs and 20 load jobs.
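
The job counts in this example follow a simple product, which the following Python sketch reproduces. It assumes one stage job and one load job per source system per target table. The table counts of three and two are inferred from the figures quoted above (15 jobs for five sources, ten jobs for five sources) and are not stated explicitly in this guide. The function name is hypothetical.

```python
def batch_jobs(source_count, target_table_count):
    """Return (stage_jobs, load_jobs), assuming one of each per source per target table."""
    jobs = source_count * target_table_count
    return jobs, jobs


# Assumed table counts: 3 target tables versus 2 target tables.
for tables in (3, 2):
    for sources in (5, 10):
        print(tables, "tables,", sources, "sources:", batch_jobs(sources, tables))
```

Under this assumption, each additional target table adds one stage job and one load job for every configured source system.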

Functionality Considerations

Modeling customer address as a direct (one-to-many) relationship between customer and address means that customer address attributes can be stored directly on the Customer Address base object or as a child base object linked to Customer Address. As long as the attributes are part of a base object, MRM tracks their history. This approach also means that Customer can use attributes from child tables of the Customer Address table for matching.

Similarly, keeping customer address attributes in base objects means that duplicate or overlapping attribute values from multiple sources can be consolidated to get to “best of breed” values for those attributes.

Principle 4: Pass the Independence Test

Independent base objects are base objects that are not linked to the core consolidated object through a one-to-many or a many-to-one relationship, but are instead linked through many-to-many intersection tables. If a base object is modeled as an independent base object, then its records should make sense on their own, without any reference to the core base object. It should make sense to consolidate its records to a distinct set of values.

Steps for Testing Independence

The independence test for a physical model includes the following steps:

1. Identify the core base object that is being consolidated in the Hub Store: Customer in a Customer Master, Supplier in a Supplier Master, and so on.

2. Look for any many-to-many relationships (direct or indirect).

3. Inspect the base object that is on the other side of the many-to-many relationship and ask the question: “What can the business do with a distinct list of the things in this object without knowing who the Customer is?” If the answer is “Nothing,” then change the many-to-many to a one-to-many relationship.


Example Using a Highly Normalized Model

The following figure shows an example of a highly normalized model.

In this model, Specialty, Address, and Electronic Address are all linked to the core object—Customer—through many-to-many relationships. You can therefore apply the independence test by asking the following questions:

Question | Answer
What can the business do with a distinct list of Specialties without knowing who the Customer is? | The distinct list of Specialties can be used to provide a pick or lookup list of Specialty values in a capture screen for new Customer information. The business wants to standardize the list of Specialties it uses in reporting by assigning each source specialty to a consolidated enterprise specialty value.
What can the business do with a distinct list of Addresses without knowing who the Customer is? | In most cases, the answer to this question is “Nothing.” Addresses are usually meaningful only in terms of the Customer to whom the Address belongs.
What can the business do with a distinct list of Electronic Addresses (for example, telephone numbers) without knowing who the Customer is? | “Nothing.” A telephone number has no significance in its own right.


Converting relationships from many-to-many to one-to-many for the objects that failed the independence test would result in the model shown in the following figure.
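
The outcome of the independence test for this example can be sketched in Python as follows. The dictionary records the business answers from the table above, with an empty string standing for “Nothing”. The function and variable names are hypothetical, and the judgment itself still has to come from the business rather than from code.

```python
# Business answers for each object linked many-to-many to Customer.
# An empty string stands for the answer "Nothing".
independent_use = {
    "Specialty": "pick list and standardized reporting values",
    "Address": "",
    "Electronic Address": "",
}


def apply_independence_test(answers):
    """Keep many-to-many only for objects that are useful without the core object."""
    return {
        obj: "many-to-many" if use else "one-to-many"
        for obj, use in answers.items()
    }


result = apply_independence_test(independent_use)
for obj, relationship in result.items():
    print(obj, "->", relationship)
```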


Principle 5: Mix Different Types of Customers Carefully

In Siperian Hub implementations, you must be careful when mixing different types of customers.

Business Party and Differentiated Customer Models

This principle focuses on the consequences of implementing two different models—a Business Party model versus a Differentiated Customer model, which are shown in the following figure.

Model | Description
Business Party Model | All Customer records are loaded into the same Business Party table, and an attribute on that table identifies the type or classification of the Customer records. In this example of a Business Party model, the Class of Customer attribute distinguishes Organizations from Individuals.
Differentiated Customer Model | The type or classification of the Customer records is implied by where the records are stored. In this example of a Differentiated Customer model, the Organization table holds Customers classified as Organizations, and the Individual table holds Customers classified as Individuals.

Data modelers often prefer the Differentiated Customer model because it reduces null attributes on the Customer table (for example, the Organization Customer does not need to carry any attributes that apply only to an Individual Customer). However, there are definite advantages to using a Business Party model over a Differentiated Customer model, even if it does result in more null attributes on the Business Party table. Such advantages include:

• The Business Party model easily supports any number of chained relationships between different classes of customers and/or the same classes of customers.

• The Business Party model allows you to model networks, not just parent/child hierarchies.

• The Business Party model provides a single unique identifier for each Customer without any chance of overlap.

• The Business Party model allows you to search for Customers in one place without needing to know anything about the type of Customer.

• The Business Party model allows you to identify source records that have given Customers incorrect types.

Mixing Models

In your Siperian Hub implementation, you might decide to implement a Business Party model so that you get one unique Customer identifier and you can model Customer Affiliations flexibly. If you want to avoid too many redundant/null value columns on the Business Party base object, you can use child tables to carry some of the attributes that are specific to specific sub-types of Customers. However, if you do this, you must be very careful about how you mix the Business Party and Differentiated Customer models.


The following figure shows a poor mix of these models.


The following figure shows a better way to mix these models.

If the merge performance is a concern, then consider using a pure Business Party model, as shown in the figure in “Business Party and Differentiated Customer Models” on page 36.

This is a better mix than the figure showing a poor mix of models because it simplifies the relationships between the objects and reduces the number of cross-table joins required to get the match data. The preferred model is still the full business party model shown in “Business Party and Differentiated Customer Models” on page 36, as that reduces the number of child tables to be maintained on merge and unmerge.

The Customer match attributes have been denormalized so that they are attributes of the Business Party base object instead of the Organization and Individual base objects. This reduces the number of cross-joins used in populating the match token.

In the better mix, all relationships have been defined at the Business Party level, making it easier to navigate and maintain the relationships. The poor mix has an uneasy mixture of relationships, with Addresses having nullable foreign keys to either Individual or Organization.


Principle 6: Landing and Staging Data

This principle considers how you design landing and staging tables in your Siperian Hub implementation.

Landing Table

Although we have no strong design recommendations with respect to landing tables, consider the following issues for your Siperian Hub implementation:

• Some implementations have used source-specific landing tables (a landing table per source table/source file). This keeps the landing table format closer to the source format and means that the ETL process does not need to transform all sources to a standard layout, which could simplify the process of making changes for one source or adding new sources with different attributes later. However, it usually also means a very large number of landing tables, which can be tedious and cumbersome to set up.

• Other implementations have used one landing table per target table, which means that the ETL needs to transform all sources for the same target to the same standard layout. This approach does allow the ETL to be standardized, making it much faster to develop and test for the first implementation (where typically a large number of sources need to be coded). It is possible that this approach also makes it more costly to maintain after initial deployment, because changes from one source could potentially affect multiple ETL mappings.

If you use one landing table per target table in your Siperian Hub implementation, then the landing table needs to include a source identifier, which must be used in filtering the data mapped to each staging table. The landing table should also have a range partition specified in Oracle to partition it according to source system, which allows partitions to be truncated before the ETL inserts data from a source, rather than having records deleted from the landing table.


Staging Tables

Staging tables must be based on the columns provided by the source system for the target base object or dependent object for which the staging table is defined, even if the landing tables are shared across multiple source systems. If you do not make the columns on staging tables source-specific, then you create unnecessary trust and validation requirements.

Trust is a powerful mechanism, but it carries performance overhead. Use trust where it is appropriate and necessary, but not where the most recent cell value will suffice for the surviving record. For more information, see “Using Trust Levels” on page 52.

If you limit the columns in the staging tables to the columns actually provided by the source systems, then you can restrict the trust columns to those that come from two or more staging tables. Use this approach instead of treating every column as if it comes from every source, which would mean needing to add trust for every column, and then validation rules to downgrade the trust on null values for all of the sources that do not provide values for the columns.
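
As a minimal sketch of this approach, the following Python code lists the columns that are supplied by two or more staging tables, which are the only columns where sources can disagree. The staging table names and columns are hypothetical. Whether a candidate column actually needs trust still depends on the relative reliability of its sources.

```python
from collections import Counter

# Hypothetical staging tables (one per source system) and the columns each provides.
staging_columns = {
    "stg_crm_customer": ["first_name", "last_name", "birth_date"],
    "stg_billing_customer": ["first_name", "last_name", "tax_id"],
    "stg_web_customer": ["last_name", "email"],
}


def trust_candidates(staging):
    """Columns supplied by two or more staging tables; only these can conflict."""
    counts = Counter(col for cols in staging.values() for col in set(cols))
    return sorted(col for col, n in counts.items() if n >= 2)


print(trust_candidates(staging_columns))
```

In this hypothetical model, only first_name and last_name are trust candidates. The remaining columns each come from a single source, so the most recent value can simply survive.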

More trust columns and validation rules affect the load and merge processes. Also, the more trusted columns there are, the longer the update statements for the control table become. Bear in mind that Oracle and DB2 have a 32K limit on the size of the SQL buffer for SQL statements. For this reason, more than 40 trust columns result in a horizontal split in the update of the control table: MRM will try to update only 40 columns at a time.


Design Patterns

This section summarizes the following typical physical data model design scenarios and describes options for implementing them:

• Households

• Addresses

• Populating the Address Household Object

• Communication Channel Models

Households

A Household is a grouping of customer records according to geographic location. For example, all of the people living at one address could be considered a household, or a group of doctors practicing at one hospital could be considered a household.

Create Household as a base object that is the parent of Customer. The easiest type of household is one in which the household has no attributes of its own. It uses inter-table match to match on selected Customer match columns that usually include the Address match columns.


The following figure shows an example of a logical model for Households.

Addresses

The ideal model for addresses involves a one-to-many relationship from Business Party to Address, with Address match rules that include Business Party ID to prevent matches across different Business Parties. However, there are occasionally business cases for consolidating addresses across Business Parties, such as to get a single identifying key for all addresses for the same location, regardless of which Business Parties use that address. If there are business reasons for consolidating Addresses regardless of the Business Parties using the Addresses, then the following consolidated address model is recommended.

In this model, the Business Party Address base object consolidates the Addresses per Business Party. The Business_Party_ROWID is one of the match criteria for the Business Party Address base object, and Business Party Addresses should merge only if they have the same Business_Party_ROWID value.
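
The effect of including Business_Party_ROWID in the match criteria can be sketched in Python as follows. The rows are hypothetical, and exact equality on the address string stands in for the real match rules, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict

# Hypothetical Business Party Address rows: (rowid, business_party_rowid, address).
rows = [
    (1, "BP1", "1 Main St"),
    (2, "BP1", "1 Main St"),  # duplicate within BP1: merge candidate
    (3, "BP2", "1 Main St"),  # same address, different party: must not merge
]


def merge_groups(records):
    """Group records that share both Business_Party_ROWID and address."""
    groups = defaultdict(list)
    for rowid, bp_rowid, address in records:
        groups[(bp_rowid, address)].append(rowid)
    return [ids for ids in groups.values() if len(ids) > 1]


print(merge_groups(rows))
```

Rows 1 and 2 form a merge group because they belong to the same business party. Row 3 stays separate even though its address is identical.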

The Business Party Address base object gives you the distinct set of addresses for each business party, but it does not give you a distinct set of all the addresses with a unique ID for each unique address. To get a unique set of Address identifiers, the Address base object would need to be included in the data model.

At its simplest, the Address base object does not include any attributes of its own, other than a Status Indicator to indicate whether the Address ID is active or inactive. Instead, it uses intertable match to match using the attributes from the Business Party Address table. This approach assumes that tight matching rules are used for the Address base object, and that survivorship of household-specific attributes is not required. If household-specific attributes need to be survived, then those attributes must be defined and populated for the Address base object, along with the appropriate Trust rules.


Populating the Address Household Object

The Address household object is a standard base object that is populated through landing and staging tables. At the cross-reference level, there is one-to-one cardinality between the Address base object (Address cross-reference) and the Business Party Address base object (Business Party Address cross-reference).

Landing Tables

The Address object should share landing tables with the Business Party Address base object. The Address base object uses the same pkey_src_object values as the Business Party Address.

Staging Tables

The Address object must have its own staging tables. As for any other base object, the Address base object requires a separate staging table for each source system that can populate it. Each Address staging table usually only has pkey_src_object and last_update_date columns, unless there are other, household-specific attributes to be included.

If hard delete detection is being used to deactivate unused address identifiers, then the staging tables may also include a status_ind column.

Populating Address and Populating Business Party Address

When you first load Address and Business Party Address (before doing any matching and merging on either), there will be the same number of records in Address and in Business Party Address. The cross-reference tables will also have the same number of records. Each Business Party Address record has a foreign key referencing a record in the Address table.

As you merge records in the Address table, records in the Business Party Address table will be updated to reflect the surviving Address key. Once all Address records have been merged, you can merge records in the Business Party Address table.


Each new Business Party Address record will have a corresponding new Address record that will need to be matched against the existing Address records.

Communication Channel Models

Communication channel data refers to electronic and telephonic address information, such as phone numbers, fax numbers, email addresses, URLs, pagers, and so on.

Two Communication Channel Models

This section describes two communication channel models.

Generic, Normalized Communication Channel Model

In some data models, communication channel data are modeled as a completely generic structure, with a type identifier that determines the type of electronic or telephonic address contained in each row of data, as shown in the following figure.


Denormalized, Type-Specific Communication Channel Model

In other data models, communication channel data is stored in a denormalized, type-specific structure in which each communication channel is stored in its own specific column, as shown in the following figure.

Comparison Between the Models

This section compares the pros and cons of each of these models.

Generic, Highly-Normalized Communication Channel Model

The following table compares and contrasts the advantages and disadvantages of using a generic, highly-normalized communication channel model.

Generic, Highly Normalized Communication Channel Model

PROS | CONS
No restriction on the number of phone numbers, e-mail addresses, and so on that can be stored for a customer. | More problematic to use in Customer match rules. Cannot do combinations of matches like “WHERE customer phone numbers match AND customer e-mails match”.
Duplicate values can easily be identified and consolidated. | More difficult to determine new versus updated values because there is no primary key value for each record that does not include the communication channel value.
New communication channel types can easily be added without needing to change anything in the data model. | Does not support a granular level of detail for types. For example, all elements of a phone number are usually stored in one string, rather than being separated into country code, dial code, phone number, and extension.
 | Requires ETL work to normalize the data, or otherwise requires multiple loads in MRM through multiple staging tables.

The following table compares and contrasts the advantages and disadvantages of using a denormalized, type-specific communication channel model.

PROS | CONS
Can do match combinations like “WHERE customer phone numbers match AND customer emails match”. | Number of communication channels that can be stored per customer is limited by the number of columns available.
Easier to determine new versus updated values. An update to a value can be treated as a direct update because the record is keyed on customer_id and the update is to a specific column. | Can be more difficult to identify and consolidate duplicate values. For example, if Source A provides phone_1 as 555-123-4576 but Source B has that value as phone_2 and not phone_1, then the two values will not be de-duped.
Supports a more granular level of detail for types. For example, instead of storing all elements of a phone number in one string, they can be separated into country code, dial code, phone number, and extension. | New communication channel types require changes to the data model.
Straight mapping from landing to staging can be supported for the relevant communication channels, without the need for normalizing the data in ETL or through multiple stage jobs in MRM. |


Proposed Third Type of Data Model for Communication Channels

The model shown in the following figure is a hybrid of the other two models. It provides normalized structures for communication channels without the high degree of generalization described in “Generic, Normalized Communication Channel Model” on page 46.

In this hybrid model, separate communication channel objects have been created for logically similar communication types. This approach minimizes the drawbacks of the two previous models and provides the best benefits of both.


Chapter 4: Using Trust Settings and Validation Rules

Trust is a designation of confidence in the relative accuracy of a particular piece of data. Validation determines whether a particular piece of data is valid. Trust and validation work together to determine “the best version of the truth” among multiple sources of data. This chapter provides a brief overview of how trust settings and validation rules work together, best practice recommendations, and examples. This chapter is recommended for administrators and implementers.

Chapter Contents

• Using Trust Levels

• Using Validation

• Using Trust and Validation Together

Using Trust Levels

This section describes how to determine appropriate trust levels for an individual piece of data coming from a given data source. You use the Trust tool in the Hub Console to configure trust levels. To learn more, see the Siperian Hub Administrator’s Guide.

About Trust Levels

Trust is a designation of confidence in the relative accuracy of a particular piece of data. For each column from each source, you can define a trust level represented by a number between 0 and 100, with zero being the least trustworthy and 100 being the most trustworthy. By itself, this number has no meaning. It becomes meaningful only when compared with another trust number to determine which is higher.

Trust is used to determine:

• Survivorship when two or more records are merged (in case of a group merge).

• Whether updates from a source system are reliable enough to update the “best version of the truth” record.

MRM’s on-going management of the “best of breed” record is achieved using the trust rules to assess updates from source systems in terms of their trust weightings.

How Trust Works

In a merge, Siperian Hub calculates a trust score for both records being merged together (merge source and merge target). Siperian Hub compares the trust score of the merge source with the trust score of the merge target and changes the survived value in the base object only if the merge source has a higher trust score than the merge target. If the trust score of the merge target is higher, then the value of the merge target remains unchanged.

Consider the following example. When two base object records merge, MRM calculates the trust score for each trusted column in the two base object records being merged.


Cells (the intersection of a column and record) with the highest trust scores survive in the final merged record.

Calculations

When an update comes in from a source system, MRM calculates the trust score on the incoming data and compares it to the trust score of the data in the base object.

Updates are applied to the base object only for cells that have the same or higher trust score on the incoming data.
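
The two comparison rules described in this section can be sketched in Python as follows. A merge changes the surviving value only when the merge source has a strictly higher trust score, while an incoming update is applied when its trust score is the same or higher. The function names and the sample values are hypothetical, and the trust scores are assumed to have been calculated already.

```python
def merge_cell(target_value, target_trust, source_value, source_trust):
    """Merge rule: the merge source wins only with a strictly higher trust score."""
    if source_trust > target_trust:
        return source_value, source_trust
    return target_value, target_trust


def apply_update(current_value, current_trust, incoming_value, incoming_trust):
    """Update rule: incoming data wins when its trust score is the same or higher."""
    if incoming_trust >= current_trust:
        return incoming_value, incoming_trust
    return current_value, current_trust


# Equal trust scores: the merge target is kept, but an update is applied.
print(merge_cell("555-0100", 80, "555-0199", 80))
print(apply_update("555-0100", 80, "555-0199", 80))
```

Both functions work on a single cell, so a record with several trusted columns would apply the comparison column by column.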

How Decay Periods Affect Trust Levels

Depending on the configured decay period, a small difference (such as one day) in the age of two records does not affect survivorship immediately, especially if the merge date is very close to the src_lud (not much time has passed for the trust level to move down the curve). With Linear decay, the impact of age remains constant. With RISL and SIRL, the impact of age changes as the trust level moves down the curve.
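
For Linear decay, the constant impact of age can be sketched in Python as a straight-line interpolation between the maximum and minimum trust settings. This formula is an assumption rather than a documented calculation, although it is consistent with the day-based linear values in the decay period table in this section. The RISL and SIRL curves are not defined here, so the sketch does not model them. The function name is hypothetical.

```python
def linear_trust(max_trust, min_trust, age, decay_period):
    """Trust after `age` time units, falling in a straight line over `decay_period`."""
    if age >= decay_period:
        return min_trust
    return max_trust - (max_trust - min_trust) * age / decay_period


# Maximum trust 90, minimum trust 10, one day after the last update.
print(round(linear_trust(90, 10, 1, 365), 2))   # 89.78
print(round(linear_trust(90, 10, 1, 1095), 4))  # 89.9269
```

Each additional day lowers the trust level by the same amount, and that daily amount shrinks as the decay period grows.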

However, the way you specify the time units does affect trust levels. The more granularly you specify the time units, the more sensitive the graph is to small changes in age, although that sensitivity does decrease with longer decay periods. For example, the following table shows trust settings based on different ways to configure the decay period. For all of these examples, the maximum trust setting is 90 and the minimum trust setting is 10.

How Time Settings for Decay Period Affect Trust Levels

Decay Period | Graph Type | Trust Level
One year | RISL | trust = 90
12 months | RISL | trust = 89.60
365 days | RISL | trust = 89.56
1 year | linear | trust = 90
12 months | linear | trust = 89.8
365 days | linear | trust = 89.78
3 years | linear | trust = 90 after one day
36 months | linear | trust = 89.93 (actually 89.9333, but the system rounds to two decimal places)
1095 days | linear | trust = 89.93 (actually 89.9269)

Ranking Source Systems According to Trustworthiness

Before you define trust settings, you must analyze the data source systems and rank them in descending order of reliability. The goal is to define the relative (not absolute) level of reliability of data in these source systems. Ranking is by attribute. For each attribute, the ranking of source systems might differ. Levels need not be exclusive: you can have more than one system rated at the same level.

Consider doing this process on a whiteboard. List the attributes and assign a straight ranking either to each attribute or to each group of related attributes (such as address data).

When ranking the reliability of source systems, consider the following issues:

• What are the processes for updating the source data? For example, if the source system has three screens for updating all of the data, then data on the first, most frequently-used screen is likely to be updated more frequently than values on any subsequent screens.

• What information goes into each source system? How is data validated? What is the process for updating data? Do the attributes that you want to bring into Siperian Hub exist in the source system (if so, then you might encounter a lot more unwanted or incorrect data)? How clean is the data in the source system and how clean can the data be made by removing junk data? It is important to understand what the source systems and your ETL process are doing to cleanse data in the source system.

• Look at systems that are highly rated. Are there conditions that you define as part of the data analysis that result from the most reliable source? Note those conditions as part of your analysis.

• Focus on one base object at a time. Within the base object, focus on each trusted attribute. Rank the source systems for that attribute according to their relative trustworthiness.

• Ask on-site business experts and/or data specialists to provide practical knowledge about the data sources so that you can more effectively define the trust rankings. Consider conducting one or more trust workshops with these experts to help clarify the trust rankings. Make sure that you document any decisions, particularly trade-off decisions, and obtain sign-off approval from the participants.

• Analyze data for invalid conditions. Conduct a frequency analysis to determine how often such conditions occur per source. The goal is to identify what is the more correct data, not just the more correctly formatted data.

Note: Be sure to distinguish between invalid data conditions that can be remedied through data cleansing and those that cannot. Consider focusing on trust and validation rules for conditions that cannot be remedied.

• Determine which columns require trust settings and which do not. You should define trust on a column if any of the following conditions apply:

• There are two or more data sources for that column and they are not equally reliable (or equally unreliable).

• The Last Update Date must be taken into account in determining survivorship.


• A data steward must be able to select or promote the surviving trust value in the Merge Manager / Data Manager (to learn more, see the Siperian Hub User’s Guide).

• Consider the performance impact of configuring trust columns.

• The larger the number of configured trust columns and validation rules, the longer it takes to complete the load and merge processes.

• The larger the number of trusted columns, the longer it takes to complete the update statements for the control table. Oracle has a 32K limit on the size of the SQL buffer for SQL statements. For this reason, more than 60 trust columns result in a horizontal split in the update of the control table (consequently, MRM will try to update only 60 columns at a time).

• Identify logical trust groups in the data and assign them all the same trust levels, as well as validation downgrades.

• For example, address fields should all belong to the same logical trust group so that all parts of the address are always taken from the same source record. This is because the granular components of an address are dependent on each other for their meaning. Nonsensical addresses could result if parts of the address were taken from one source and other parts of the address were taken from a different source.

Note: The logical trust group for address should include a validation status indicator if it is being used to determine a downgrade percentage in a validation rule.

• Names (First Name, Middle Name and Last Name) usually do not belong to a logical trust group. This is because components of a full name are not dependent on each other for their meaning. A source system that provides, for example, good information on last names might provide only an initial letter for middle name, while another source system that provides lower-quality last names might provide full and valid middle names.

• Siperian Hub handles delete flags in two different ways:

• direct delete: Delete-flagging any cross-reference for a base object will result in the base object record being flagged for delete as well.

• consensus delete: A base object record is flagged as fully inactive only if all of its cross-reference records are flagged as deleted. In this model, the base object records that have some but not all cross-references flagged as deleted are flagged as partially deleted.

For delete flags for consensus trust:

• "P" (Partial Delete) must have a lower trust than "I" (Inactive) or "A" (Active).

• "I" and "A" should be at the same level.

Add a validation rule to downgrade the trust score if the delete flag is "P" or "I". Do not preserve minimum trust. For more information, see the Hard Delete Detection technical bulletin.

Trust Best Practices

Trust values are run-time calculations. Trust is planned in the Discover and Design phases and verified and fine-tuned in the Build phase. To learn more, see “Phases in an Siperian Hub Implementation Project” on page 6.

Choosing the correct trust levels is a complex process. It is not enough to consider one system in isolation. You must ensure that the trust settings for all of the source systems that contribute to a particular column combine to produce the behavior that you want.

1. During the Discover phase, talk with as many people as possible about the data.

2. Use the Data Quality Audit questionnaire in the Analyze phase. Question the system owners, including maintenance, data steward, and sales liaison representatives.

• For each table/file, determine the table/file name, the total number of records in the inspected set, and the total number of records in the full data population.

• For each column in the table, determine the column name, number of distinct values, number of NULL values, percentage of NULL values, text length (maximum, minimum, and average), types of non-alphanumeric characters found, number of values that indicate “unknown” or “undefined”, the top ten values (the ten values that occur most frequently), and any other notes regarding your visual inspection of the data.


3. Use the Trust Matrix to record all relevant information that goes toward determining trust settings for each source system. The Trust Matrix asks a number of questions about the source data. Each question is designed to elicit information about the probable reliability of the source system. Here are some of the questions that you should consider:

• Does the source system validate this data value? How reliably does it do this?

• How important is this data value to the users of the source system, as compared with other data values? Users are likely to put the most effort into validating the data that is central to their work.

• How frequently is the source system updated?

• How frequently is a particular attribute likely to be updated?

4. Rank the systems in relation to the source system of highest trust based on the attributes that will be used.

5. For each column in each base object table, you can enable or disable trust using the Trust tool in the Hub Console.

• If trust is disabled, Siperian Hub will always use the most recently loaded value, regardless of which source system it comes from.

• For most columns that come from multiple source systems, you will want to enable trust because some systems are more reliable sources of that particular information than others. If you enable the trust for a column, you also specify the trust settings for each source system that could update the column.

6. If you expect a data steward to override settings of sources, enable trust and use a special source system called “Admin” that represents manual updates that the data steward makes within Siperian Hub. This source system can contribute data to any column that has trust enabled in the Trust tool. You must specify trust settings for the Siperian Admin system. You will probably want to set the trust settings for this system to high values to ensure that your manual updates override any existing values from source systems.

7. Trust and validation can cause situations in which values survive in the base object even though they are no longer in any of the cross-references. Validation downgrade can mean that the source does not update a cell even if it had previously provided the cell value. The surviving value in a base object might not have the same value as the corresponding cross-reference. There might not be any cross-references with the same value or trust as the base object. This situation causes problems in the following areas:

• Delete indicators – making sure the right value is in the base object

• Removing the influence of inactivated records from base object

Configuring Trust Levels

This section covers issues associated with configuring trust levels. You use the Trust tool in the Hub Console to configure trust levels. To learn more, see the Siperian Hub Administrator’s Guide.

Guidelines for Configuring Trust Settings

Consider the following guidelines for configuring trust settings:

• If a column receives data from multiple sources, then enable trust for that column. This also requires you to specify the relative trust level for each of the source systems that update the column.

• If you are doing a lot to clean data from one source and not another, reduce trust for the data source that requires more cleaning, after receiving the appropriate approvals from the business.

• If you set a long decay period for data, you might have difficulty picking up small fluctuations in the trust level. You must balance this consequence against your reasons for setting a long decay period.

• Some groups of data form logical trust groups. For example, the components of an address form a logical trust group. All the elements of an address must have the same settings: trust codes, decay values, and so on. You do not want to pick up pieces of an address from different sources. Also, if a postal service database returns an indicator that some part of the address data is invalid, then grouping the data means that all parts of the address will be downgraded the same amount.

• With staging tables, if you have logical trust groups, enable the Allow Null Update flag for the members of that group. For example, suppose an Address Line 2 column contains the value Suite 2 and then a user corrects the record by removing the Suite 2 value. If Allow Null Update is not enabled for that column, then the Suite 2 value would remain in the cell, resulting in an inaccurate record.


• Avoid assigning numbers that are too close together. Make sure that you set trust levels far enough apart (a minimum difference of five; ten is better) to avoid rounding problems that might occur during trust calculations. In the course of calculating trust as it degrades, Siperian Hub rounds these numbers and, if the numbers are too close together, rounding errors can obscure the differences.

Defining Trust Settings

When defining trust settings, you must:

• Determine the ranking of the attributes (or groups of attributes). See “Ranking Source Systems According to Trustworthiness” on page 55 for more information.

• Assign trust values based on these rankings. See “Guidelines for Configuring Trust Settings” on page 60 for more information.

• Assign decay values based on the analysis of the continuing reliability of the data. See “Enabling Cell Update” on page 62 for more information.

The following example shows ranking source systems for customer name.

Source   Title   First   Middle   Last   Suffix
Sys1     80      90      60       90     90
Sys2     60      75      90       80     60
Sys3     95      80      80       60     80

To define trust settings:

1. Review your data source analysis and note the criteria that distinguished the highly-rated systems. The criteria that result in the most reliable sources become the validation rules (see “Using Validation” on page 65). Using these criteria, you can make sure that data from sources that conform to those rules prevail over less reliable data sources.

2. Quantify these rules by applying a numerical designation of trust to those source systems using a scale of 0 (lowest trust) to 100 (highest trust). Remember that these numbers have no meaning in themselves. They are meaningful only in the relative ranking of the source systems in relation to each other.

3. Once you have identified the validation rules, define the decay type and rate.

The most common decay type is SIRL (Slow Initial, Rapid Later). This decay type makes the most sense for most data.

Another common scenario is where data that comes from the source system (in the form of updates) must always prevail over the existing data. In this case, consider disabling trust. This will guarantee that the newest incoming data from the source system will overwrite the data already in the MRM.

4. Define maximum / minimum trust settings and decay curves.

To do so, identify the cross-over points where decay curves would intersect each other. Leave a buffer at the top and bottom of the ranges (avoid setting the maximum trust to 100 or the minimum trust to 0). Leave a buffer between source systems as well. This buffer makes it easier to tweak trust settings and to add more sources later. A suggested gap between settings is at least five, preferably ten.

Enabling Cell Update

The default behavior for when Siperian Hub receives an updated value for a column on a record from a source system is that all trust values for the trusted columns for that source are recalculated from maximum trust again, based on the last update date of the record. Because Siperian Hub does not check to see whether the actual cell values have changed, an update in one column is regarded as enforcing the values in other columns. This restarts the decay curve for all the values for the record from the beginning. If you want Siperian Hub to check whether the actual column value has changed before updating the column and recalculating its trust level from the Maximum Trust, then enable cell update using the Schema Manager in the Hub Console (to learn more, see the Siperian Hub Administrator’s Guide).

Enable cell update on your staging table if you have parts of the record coming from source systems that are regularly updated, and other parts of the record that are not regularly updated. Generally, users never look at the parts that are not regularly updated. It is a good idea to enable cell update so that these parts of the record carry on decaying, while the updated bits have their trust values reset appropriately.

For example, suppose a source system has three screens for updating all the data. Anything that is not on the first, most frequently-used screen is probably updated much less frequently. In this case, enabling cell update allows the trust value for these infrequently updated cells to continue to decay.

Example Stored Procedure to Calculate Decayed Trust

The following code shows an example stored procedure to calculate decayed trust. Use this code if you want to get the calculated trust value for a particular cell.

DECLARE
  RetVal                 NUMBER;
  IN_PREV_UPDATE_DATE    VARCHAR2(200);
  IN_LATEST_UPDATE_DATE  VARCHAR2(200);
  IN_TRUST               NUMBER;
  IN_MIN_TRUST           NUMBER;
  IN_TIME_UNITS          VARCHAR2(200);
  IN_GRAPH_TYPE          NUMBER;
  IN_X_MAX               NUMBER;

BEGIN
  IN_PREV_UPDATE_DATE    := '11 OCTOBER 2005';
  IN_LATEST_UPDATE_DATE  := '12 OCTOBER 2005';
  IN_TRUST               := 90;
  IN_MIN_TRUST           := 10;
  IN_TIME_UNITS          := 'YYYY'; -- 'YYYY' or 'M' OR 'D'
  IN_GRAPH_TYPE          := 2;
  IN_X_MAX               := 3;

  RetVal := CMX.CALC_DECAYED_TRUST (
    IN_PREV_UPDATE_DATE,
    IN_LATEST_UPDATE_DATE,
    IN_TRUST,
    IN_MIN_TRUST,
    IN_TIME_UNITS,
    IN_GRAPH_TYPE,
    IN_X_MAX
  );

  DBMS_OUTPUT.Put_Line('RetVal = ' || TO_CHAR(RetVal));

  COMMIT;
END;

In this example:

Name                    Description
IN_PREV_UPDATE_DATE     Previous update date.
IN_LATEST_UPDATE_DATE   Date to calculate the decayed trust score.
IN_TRUST                Maximum trust level.
IN_MIN_TRUST            Minimum trust level.
IN_TIME_UNITS           Time unit used for decay period.
IN_GRAPH_TYPE           Decay period type. One of the following types: 0 (slow initial, rapid later), 1 (rapid initial, slow later), 2 (linear).
IN_X_MAX                Number of decay units.
EDU_S330_COMPLETE       ORS name.

If you add SET SERVEROUTPUT ON at the beginning of the code and execute it using SQL*Plus, the trust score for the specified date is calculated and printed.
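For the linear graph type, the decayed score can be approximated by hand. The following Python sketch is only an illustration: the function name linear_decayed_trust is hypothetical, it assumes a straight line from maximum trust down to minimum trust over the decay period, and it does not reproduce CMX.CALC_DECAYED_TRUST or the non-linear decay curves. With a maximum trust of 90 and a minimum trust of 10, it agrees with the 365-day and 1095-day linear rows of the table “How Time Settings for Decay Period Affect Trust Levels”.

```python
def linear_decayed_trust(max_trust, min_trust, elapsed_units, decay_units):
    """Approximate linear decay (assumed model, not the product's code).

    elapsed_units and decay_units must be in the same time unit,
    for example days elapsed and a decay period expressed in days.
    """
    if elapsed_units >= decay_units:
        # Once the decay period is over, the score stays at minimum trust.
        return float(min_trust)
    fraction = elapsed_units / decay_units
    # Round to two decimal places, as the table describes.
    return round(max_trust - (max_trust - min_trust) * fraction, 2)


one_day_of_365 = linear_decayed_trust(90, 10, 1, 365)    # 89.78
one_day_of_1095 = linear_decayed_trust(90, 10, 1, 1095)  # 89.93
zero_of_3_years = linear_decayed_trust(90, 10, 0, 3)     # 90.0
```

The last call reflects the stored procedure example: when the decay period is expressed in whole years, one elapsed day counts as zero elapsed units, so the score is still 90.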


Using Validation

This section describes how to use validation rules to determine the validity of an individual piece of data coming from a given data source. You use the Schema Manager in the Hub Console to configure validation rules. To learn more, see the Siperian Hub Administrator’s Guide.

About Validation Rules

A validation rule tells MRM the condition under which a data value is not valid. If data meets the criterion specified by the validation rule, then the trust value for that data is downgraded by the percentage specified in the validation rule. If the Reserve Minimum Trust flag is set for the column, then the trust score cannot be downgraded below the column’s minimum trust setting.

How Validation Works

If you set validation rules with trust settings, cells that meet the condition defined in the validation rule have their trust scores downgraded by the percentage downgrade value specified for the validation rule according to the following algorithm.

Final trust = Trust - (Trust * Validation_Downgrade / 100)

For example, with a validation downgrade percentage of 50%, and a trust level calculated at 60:

Final Trust Score = 60 - (60 * 50 / 100)

Therefore:

Final Trust Score = 60 - 30 = 30
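The downgrade arithmetic, together with the Reserve Minimum Trust behavior described under “About Validation Rules”, can be sketched as follows. This is a minimal illustration in Python, not product code: the function name final_trust and its parameters are hypothetical, and the floor is modeled simply as taking the larger of the downgraded score and the column's minimum trust.

```python
def final_trust(trust, downgrade_pct, min_trust=0.0, reserve_minimum_trust=False):
    """Apply a validation downgrade to a calculated trust score.

    trust: trust score after decay
    downgrade_pct: validation downgrade percentage (0 to 100)
    min_trust: minimum trust setting for the column (assumed input)
    reserve_minimum_trust: when True, never go below min_trust
    """
    downgraded = trust - (trust * downgrade_pct / 100)
    if reserve_minimum_trust:
        return max(downgraded, min_trust)
    return downgraded


example = final_trust(60, 50)                                    # 30.0
floored = final_trust(60, 100, min_trust=10,
                      reserve_minimum_trust=True)                # 10
unfloored = final_trust(60, 100)                                 # 0.0
```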

Validation rules are evaluated in sequence, and the last validation rule that is met provides the validation downgrade that is applied. The order of the validation rules is therefore important. For example, the following two validation rule lists have different results for the same input data.


Sequence 1

1. 'Downgrade trust on First_Name by 50% if Length < 3'

2. 'Downgrade trust on First_Name by 75% if Delete_ind=Y'

Sequence 2

1. 'Downgrade trust on First_Name by 75% if Delete_ind=Y'

2. 'Downgrade trust on First_Name by 50% if Length < 3'

For a given record that is flagged as deleted and where the value in the First_Name column is 'MK', the final trust score for each of the lists given above is calculated as follows:

• Sequence 1: Final Trust Score = (Trust - (Trust * 75 / 100))

If Trust was calculated as 60, then for Sequence 1, Final Trust = (60 - 45) = 15.

• Sequence 2: Final Trust Score = (Trust - (Trust * 50 / 100))

If Trust was calculated as 60, then for Sequence 2, Final Trust = (60 - 30) = 30.

If it is more important that the trust score be downgraded for deleted records than for records with short first names, then Sequence 1 is the better approach.
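The effect of rule order can be reproduced with a short Python sketch. It models only the behavior described above, in which rules are evaluated in sequence and the last rule that is met supplies the downgrade. The function and variable names are hypothetical, and each rule is represented as a (condition, downgrade percentage) pair.

```python
def last_rule_wins(trust, rules, record):
    """Evaluate rules in order; the last rule that is met sets the downgrade."""
    applied_pct = None
    for condition, downgrade_pct in rules:
        if condition(record):
            applied_pct = downgrade_pct  # later matches overwrite earlier ones
    if applied_pct is None:
        return trust
    return trust - (trust * applied_pct / 100)


# The two rules from the example, as (condition, downgrade percentage).
short_first_name = (lambda r: len(r["First_Name"]) < 3, 50)
flagged_deleted = (lambda r: r["Delete_ind"] == "Y", 75)

record = {"First_Name": "MK", "Delete_ind": "Y"}

sequence_1 = [short_first_name, flagged_deleted]
sequence_2 = [flagged_deleted, short_first_name]

score_1 = last_rule_wins(60, sequence_1, record)  # 15.0
score_2 = last_rule_wins(60, sequence_2, record)  # 30.0
```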

Differences Between Batch and Online (API) Processing

Validation rules are evaluated differently for batch and online (API) processing, which can result in different outcomes for downgrading trust levels for a given column or piece of data.

• Batch Process: Validation rules are evaluated from top to bottom. The process steps through all the rules and only the last applicable rule is applied.

• API call (PUT): Validation rules are applied in the reverse order. The first rule that meets the validation criteria is the only one applied (evaluation stops at that point).

Ordering and Grouping Validation Rules

The order of the validation rules is critical. Validation rules should be ordered starting with the rules that have the lowest impact (rules resulting in the least downgrade), and moving to the rules that have the highest impact (rules resulting in the highest downgrade). In many cases, downgrades are mixed and matched across rules. Therefore, the goal is to determine how you order them by level of severity.

Consider the following set of example validation rules.

Rule 1 - Downgrade FName by 20%, downgrade ID by 60% WHEN fieldA = 'BAD'
Rule 2 - Downgrade FName by 40%, downgrade ID by 40%, downgrade FLAG_A by 80% WHEN FLAG_A = 'N'
Rule 3 - Downgrade FName by 10%, downgrade ID by 70% WHEN FLAG_B='N'

In this set of validation rules, note that the downgrade in rule 1 is for two columns, whereas rule 2 has three columns for downgrade and rule 3 has two columns for downgrade. If the situation arises in which all three rules are satisfied, then the final outcome of the downgrade will be based on a combination, such as:

"downgrade FName by 10%, downgrade ID by 70% and downgrade FLAG_A by 80%"

The downgrade process sequentially applies the downgrade rule that meets the condition and stores the downgraded results in a temp table. In this example, the values inserted will be for Rule 1, which includes only the FName and ID columns. Rule 2 will overwrite those values for this rowid_object with FName, ID and FLAG_A. Rule 3 will then overwrite the same record with values only for columns FName and ID. This processing results in the downgrade values that go across rules.

If all the downgrade rules are met, then only the values from one downgrade rule per column (not always the same one) will be applied. Therefore, the downgrade values are not cumulative.

The grouping and ordering of the downgrade rules should be done by grouping and defining validation rules that have the same columns. Therefore, you might end up defining multiple rules with the same WHERE clause, which would definitely increase the number of validation rules. The previous example would need to be broken down as:

Rule 1 - Downgrade FName by 10% WHEN FLAG_B='N'
Rule 2 - Downgrade FName by 20% WHEN fieldA = 'BAD'
Rule 3 - Downgrade FName by 40% WHEN FLAG_A = 'N'
Rule 4 - Downgrade ID by 40%, downgrade FLAG_A by 80% WHEN FLAG_A = 'N'
Rule 5 - Downgrade ID by 60% WHEN fieldA = 'BAD'
Rule 6 - Downgrade ID by 70% WHEN FLAG_B='N'

Compared to the previous example, if all rules were met, then this would give us a final result of "downgrade FName by 40%, downgrade ID by 70% and downgrade FLAG_A by 80%"
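The per-column overwriting described in this section can be modeled in a few lines of Python. This sketch is an assumption-laden illustration, not the product's implementation: a dictionary stands in for the temp table, each rule is a hypothetical (condition, column downgrades) pair, and every rule that is met overwrites the stored downgrade for just the columns it names.

```python
def combined_downgrades(rules, record):
    """Apply rules in order; each met rule overwrites only its own columns."""
    downgrades = {}  # stands in for the temp table row for one rowid_object
    for condition, column_downgrades in rules:
        if condition(record):
            downgrades.update(column_downgrades)
    return downgrades


def bad_field(r):
    return r["fieldA"] == "BAD"


def flag_a_n(r):
    return r["FLAG_A"] == "N"


def flag_b_n(r):
    return r["FLAG_B"] == "N"


# A record that satisfies all of the rule conditions.
record = {"fieldA": "BAD", "FLAG_A": "N", "FLAG_B": "N"}

original_rules = [
    (bad_field, {"FName": 20, "ID": 60}),
    (flag_a_n, {"FName": 40, "ID": 40, "FLAG_A": 80}),
    (flag_b_n, {"FName": 10, "ID": 70}),
]

regrouped_rules = [
    (flag_b_n, {"FName": 10}),
    (bad_field, {"FName": 20}),
    (flag_a_n, {"FName": 40}),
    (flag_a_n, {"ID": 40, "FLAG_A": 80}),
    (bad_field, {"ID": 60}),
    (flag_b_n, {"ID": 70}),
]

before = combined_downgrades(original_rules, record)
after = combined_downgrades(regrouped_rules, record)
```

With the original rules, FName ends up with the mildest downgrade (10%) because Rule 3 is evaluated last. After regrouping by column and ordering by severity, each column receives its most severe applicable downgrade.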

Best Practices for Validation Rules

This section describes best practices for validation rules.

Using Cross-Column Validation

Consider how data arrives in terms of grouping. Do all columns come in together from staging tables and PUTs? If not, then the validation rules are not valid.

Using Complex Validation Rules

You must have foreign keys when using complex validation rules.

Validation and Its Effect on Load and Merge Performance

Validation rules have an impact on the performance of Load and Merge jobs because they involve running more queries and maintaining more metadata. Therefore, you should use validation rules judiciously and only where needed. Consider the following issues:

• Use validation rules for a column only when they are truly required.

• Limit the number of validation rules per column.

• If a Load job is slow, manually create indexes in the database on the staging table for columns used as criteria.


• Joining to other tables involves a lot of overhead. If a join is necessary, join only low volume tables. It is better to have that data be part of the ETL process than the validation process.

Using SQL In Validation Rules

Make sure that any SQL used in a validation rule is well formed and well tuned. For example:

• If your validation rule contains multiple conditions, enclose the validation rule in parentheses, especially if the validation rule contains OR conditions. The SQL fragment you define in a validation rule is appended to an existing SQL fragment in MRM, and badly formed queries can produce unexpected results and long-running queries.

• Use the following syntax:

x IN (value1, value2, value3)

instead of the following syntax:

(x = value1 or x = value2 or x = value3)

as it is more efficient for the RDBMS to evaluate a subset than multiple OR conditions.
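The guideline about parentheses matters because AND binds more tightly than OR in SQL. The following Python sketch demonstrates the effect using the standard sqlite3 module rather than Oracle. The table stg, its columns, and the base query are hypothetical, and the exact SQL that MRM generates is not shown in this guide; the sketch assumes only that the rule fragment is appended after an existing AND condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg (id INTEGER, src TEXT, status TEXT, flag TEXT)")
conn.executemany(
    "INSERT INTO stg VALUES (?, ?, ?, ?)",
    [
        (1, "SYS1", "BAD", "Y"),
        (2, "SYS2", "BAD", "Y"),
        (3, "SYS2", "OK", "N"),
    ],
)

# Hypothetical existing fragment that the rule SQL is appended to.
BASE_QUERY = "SELECT id FROM stg WHERE src = 'SYS1' AND "


def matching_ids(rule_fragment):
    """Append the rule fragment to the base query and return matching ids."""
    rows = conn.execute(BASE_QUERY + rule_fragment + " ORDER BY id").fetchall()
    return [row[0] for row in rows]


# Without parentheses, the OR escapes the src = 'SYS1' restriction.
unparenthesized = matching_ids("status = 'BAD' OR flag = 'N'")

# With parentheses, the whole rule stays inside the restriction.
parenthesized = matching_ids("(status = 'BAD' OR flag = 'N')")
```

In this sketch, the unparenthesized fragment also returns row 3, which belongs to a different source system, while the parenthesized fragment returns only row 1.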


Using Trust and Validation Together

This section describes using trust levels and validation rules together.

Scenarios Involving Trust and Validation for a Column

This section describes the following scenarios:

• Column with No Configured Trust Levels or Validation Rules

• Column Configured With Validation Rules But No Trust Levels

• Column Configured With Trust Levels But No Validation Rules

• Column Configured With Trust Levels and One or More Validation Rules

Column with No Configured Trust Levels or Validation Rules

If a given column has no configured trust settings or validation rules, then the most recently loaded source value for the cell is always the winner, and the cell will be updated in the base object. In a merge, the value from the record that MRM deems to be the merge source will survive after the merge.

Column Configured With Validation Rules But No Trust Levels

If a given column is configured with one or more validation rules but no trust settings, then the following will occur:

• If the validation rule is specified as 100% downgrade without the Reserve Minimum Trust option, then a cell that meets the validation rule condition (meaning that the data is invalid) will not survive in the base object. If there exists no other source that can provide an update value for the cell in the base object, then the default value specified for the cell survives in the base object. If no default is specified, then the surviving value is NULL.

• If the validation rule is specified as something other than a 100% downgrade, and/or if the rule has the Reserve Minimum Trust option, then the most recently updated source value for the cell is always the winner.



Column Configured With Trust Levels But No Validation Rules

If a given column is configured with trust settings but no validation rules, then the decayed trust score is calculated based on the last update date of the source record, and the trust settings for the column for that system. The winning cell is the one with the highest trust score after decay.

Column Configured With Trust Levels and One or More Validation Rules

If a given column is configured with trust settings and one or more validation rules, then the validation downgrade is applied for the most severe rule (defined by the validation rule sequence in the Hub Console) that fails validation, and then the trust score for that data is downgraded by that percentage. If the new trust score is below the minimum trust for the rule, then the minimum trust setting is the final trust score. Finally, the two cell trust scores are compared and the data in the cell with the highest trust score is chosen as the winning data that updates the cell.

What Happens When a Record Is Updated

When a record is updated, the cross-reference records for the data are always updated. The base object records are updated only by data that have higher trust levels than the existing data in the target base object. Whenever the Load procedure updates the base object, it also updates the control and history tables associated with the base object.

Note: Load allows a NULL value to come in only if the initial load base object has the NULL value or you have allow_null_update enabled.

The cross-reference will always get updated for the source system. The base object will get updated only if the trust score of the latest update for the cell is higher than the trust score of the base object cell.



Example Using Trust Levels and Validation Rules Together

This section provides an example of using trust levels and validation rules together for a column based on the “Scenarios Involving Trust and Validation for a Column” on page 70.

When merging record A into record B, if no trust or validation settings are configured, then all of the data from record A will be kept. This is not always desirable when there are numerous data sources of differing levels of trustworthiness providing potential values for the consolidated record. To achieve a goal of greater data reliability, trust and validation must be implemented.

Consider the following data.

Record A               Record B                 Final Output
First_Name: Mark       First_Name: Mark         First_Name: Mark
Middle_Name: L         Middle_Name: Lawrence    Middle_Name: L
Last_Name: Hoare       Last_Name: Hoa           Last_Name: Hoa
isRegistered: N        isRegistered: Y          isRegistered: N

Suppose trust and validation were enabled on all four columns and you created the following validation rules.

Rule Name: "Downgrade trust on short Middle Name"
Rule Type: Custom
Rule Columns: Middle_Name
Rule SQL: Where Length(S.Middle_Name) < 3
Downgrade Percentage: 80

Rule Name: "Downgrade trust on short Last Name"
Rule Type: Custom
Rule Columns: Last_Name
Rule SQL: Where Length(S.Last_Name) <= 3
Downgrade Percentage: 80

To keep this example simple, assume that the source for each record is the MRM Admin System and that the Maximum Trust is set to 90 on all columns.

Consider the trust scores after these records are loaded/inserted for the Admin System.

Field         Description
Last_Name     For Record A, the trust score for Last_Name is 90 because the value “Hoare” does not result in a trust downgrade. For Record B, the trust score for Last_Name is 90 - (90 * 80 / 100) = 18 after the validation downgrade.
Middle_Name   For Record B, the trust score for Middle_Name is 90 because the value "Lawrence" does not result in a trust downgrade. For Record A, the trust score for Middle_Name is 90 - (90 * 80 / 100) = 18 after the validation downgrade.

The following results from merging the two records with these settings:

First_Name: Mark

Middle_Name: Lawrence

Last_Name: Hoare

isRegistered: N

Notice that the prevailing value for Middle Name was selected from the record with the highest final trust score for that cell (Record B). The winning value for Last Name was selected from the record with the highest final trust score for that cell (Record A).

Because validation rules were not defined for the First_Name or isRegistered columns, the surviving values were picked from the most recently updated source record.
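The merge outcome in this example can be checked with a small Python sketch. It is a simplified model under stated assumptions rather than the MRM merge logic: all names are hypothetical, every cell starts at the maximum trust of 90 with no decay, and when two cells have equal trust the value from Record A is kept, on the assumption that Record A is the most recently updated record.

```python
MAX_TRUST = 90

# Validation rules from the example: column -> (condition, downgrade percentage).
RULES = {
    "Middle_Name": (lambda value: len(value) < 3, 80),
    "Last_Name": (lambda value: len(value) <= 3, 80),
}


def cell_trust(column, value):
    """Trust score for one cell after any validation downgrade."""
    rule = RULES.get(column)
    if rule is not None and rule[0](value):
        return MAX_TRUST - (MAX_TRUST * rule[1] / 100)
    return MAX_TRUST


def merge(record_a, record_b):
    """Pick, per column, the value whose cell has the higher trust score.

    On a tie, Record A's value is kept (assumed to be the most recent).
    """
    merged = {}
    for column, a_value in record_a.items():
        b_value = record_b[column]
        if cell_trust(column, b_value) > cell_trust(column, a_value):
            merged[column] = b_value
        else:
            merged[column] = a_value
    return merged


record_a = {"First_Name": "Mark", "Middle_Name": "L",
            "Last_Name": "Hoare", "isRegistered": "N"}
record_b = {"First_Name": "Mark", "Middle_Name": "Lawrence",
            "Last_Name": "Hoa", "isRegistered": "Y"}

result = merge(record_a, record_b)
```

Under these assumptions, result holds Mark, Lawrence, Hoare, and N, which matches the merged values listed in this example.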


Chapter 5: Configuring and Tuning Match Rules
This chapter provides information on how to use and tune match rules. It is recommended for all implementers as well as Siperian Hub administrators. This section assumes you are familiar with the material in “Match and Merge Setup” in the Siperian Hub Administrator’s Guide.

Chapter Contents

• About Matching

• Tokens for Match Keys

• Search Strategies

• Match Purposes

• Defining and Testing Your Match Rules

• Matching Best Practices

• Exact Match Column Properties

• Setting Match Batch Sizes

• Using Dynamic Match Analysis Threshold

• Tuning Match for Performance

• About Merging

About Matching

Matching is how Siperian Hub identifies data duplicates.

Before You Start Defining Your Match Rules

Before you begin the process of defining and refining your match rules, it is essential that you are familiar with your data. You must know:

• how complete the data is. Are your base object records sparsely populated, with many fields that are NULL?

• how clean the data is. Are you reasonably confident of the quality of the data? Is the data that is there relatively accurate? Are there a lot of word and character transpositions?

• what proportion of the data are likely to be duplicates? Data that has many duplicates is referred to as matchy.

• in the columns you expect to use for matching, what is the expected variation in the values? This expected variation is called the cardinality.

• which data is suitable for exact matching, and which is better for fuzzy matching. Fuzzy matching takes into account variations such as word order. Exact matching doesn’t take into account any variations, but it does have performance advantages.

Steps in the Match Process

The match process consists of the following steps:

1. Generate tokens that encode the data for searching for possible match candidates. To learn more, see “Tokens for Match Keys” on page 77.

2. Search the data for possible match candidates.

3. Apply the match rules to the search results to return matches.

Siperian Hub uses the parameters you set for the match to generate a score that describes the degree to which rows match. You can select a range that defines what constitutes a match. If the score is within the range that you select, then those rows are returned as matches. You select the range by setting the match level. To learn more, see “Match Levels” in the Siperian Hub Administrator’s Guide.


Tokens for Match Keys

PopulationsSiperian Hub uses the concept of populations to encapsulate intelligence about customer name and address data for particular geographic groups. For example, different countries use different formats for addresses. These differences include such things as the placement of the street number and street name, location of the postal code, and other variations in addresses. In addition, different populations have different distributions of surnames. For example, US name data typically has Smith as 1% of the surnames. Other populations have other distributions. Siperian Hub uses this intelligence to more effectively match name and address data.

Tokens for Match Keys

A token (also called a match key) is a fixed-length compressed and encoded value built from a combination of the words and numbers in a name or address such that relevant variations have the same key value. For one name or address, multiple match keys are generated. The number of keys generated per base object record varies, depending on your data and the match key level.

Siperian Hub fuzzy matching uses tokens as a basis for searching for potential matches. Tokens allow Siperian Hub to match rows with a degree of fuzziness - the match need not be exact to be considered a match. The process of generating tokens is called tokenization. Before you can use fuzzy matching, you must generate these tokens.

For example, the following strings generate the following tokens:

Example of strings and tokens

String          Token
BETH O'BRIEN    MMU$?/$-
BETH O'BRIEN    PCOG$$$$
BETH O'BRIEN    VL/IEFLM
LIZ O'BRIEN     PCOG$$$$
LIZ O'BRIEN     SXOG$$$-
LIZ O'BRIEN     VL/IEFLM



Note: The tokens that are generated depend on your data and the parameters you set for match keys.

When searching for match candidates, LIZ O'BRIEN and BETH O'BRIEN are considered as candidates because they have some key values in common.
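The candidate search on shared key values can be pictured with a minimal Python sketch. It is an illustration only, not Siperian Hub code: the token values are copied from the example table, while the dictionary layout and the function names shared_keys and candidate_pairs are hypothetical.

```python
from itertools import combinations

# Token values copied from the "Example of strings and tokens" table.
TOKENS = {
    "BETH O'BRIEN": {"MMU$?/$-", "PCOG$$$$", "VL/IEFLM"},
    "LIZ O'BRIEN": {"PCOG$$$$", "SXOG$$$-", "VL/IEFLM"},
}


def shared_keys(a, b, tokens=TOKENS):
    """Return the match keys that two strings have in common."""
    return tokens[a] & tokens[b]


def candidate_pairs(tokens=TOKENS):
    """Yield (a, b, shared) for every pair of strings sharing at least one key."""
    for a, b in combinations(sorted(tokens), 2):
        shared = tokens[a] & tokens[b]
        if shared:
            yield a, b, shared


pairs = list(candidate_pairs())
for a, b, shared in pairs:
    print(a, "<->", b, "share", sorted(shared))
```

Sharing a key only makes two records candidates. Whether they are returned as matches is still decided by the match rules that are applied afterwards.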

With respect to tokens, Siperian Hub does several things during match.

• Siperian Hub checks for the tokenization incomplete indicator. If the last tokenization process started but didn't finish, this indicator is set. If the tokenization incomplete indicator is set, Siperian Hub re-tokenizes the data before matching.

• Siperian Hub checks for the dirty indicator. The dirty indicator indicates that an update occurred after the last time this data was tokenized. The dirty indicator can propagate from a child to a parent record. A value of 0 in the dirty indicator indicates that the record in the token table is up to date. If the record is not up to date, Siperian Hub tokenizes the data before matching.
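The two checks above can be summarized in a short Python sketch. This is an illustration of the decision logic only, not Siperian Hub code: the function name rows_to_tokenize, the (rowid, dirty indicator) pairs, and the tokenization_incomplete flag are hypothetical stand-ins for the indicators described in the text.

```python
def rows_to_tokenize(rows, tokenization_incomplete):
    """Decide which records need tokens generated before matching.

    rows: iterable of (rowid, dirty_indicator) pairs. A dirty indicator
          of 0 means the token table is up to date for that record.
    tokenization_incomplete: True if the last tokenization run started
          but did not finish.
    """
    rows = list(rows)
    if tokenization_incomplete:
        # Assumption in this sketch: an interrupted run means all of the
        # data is tokenized again.
        return [rowid for rowid, _ in rows]
    # Otherwise only records updated since the last tokenization are stale.
    return [rowid for rowid, dirty in rows if dirty != 0]


example_rows = [("1", 0), ("2", 1), ("3", 0)]
print(rows_to_tokenize(example_rows, tokenization_incomplete=False))
```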

After generating the tokens, the next step in the match process is to get match candidates from the database using the keys defined for the names or addresses. This is done using the match keys generated on the column(s) selected to form the match key.

Determining When to Tokenize Your Data

You can tokenize your data at any of these times:

• when it is loaded

• when it is put into the table (using the PUT or CLEANSE_PUT API calls)

• right before you match

The default setting is to not tokenize when you load or put your data. You may want to change this setting for either of the following reasons:

Do not use the Generate Match Tokens on Put option if you are using the API. If you have this parameter set on, your PUT and CLEANSE_PUT API calls will fail. Use the TOKENIZE verb instead. Only turn on Generate Match Tokens on Put if you’re not using the API and you want data steward updates from the console to be tokenized immediately.

To learn more, see “Modifying the Properties of Base Objects” in the Siperian Hub Administrator’s Guide.

Match Key Widths

Siperian Hub supports the following key widths:

• Standard Keys

• Extended Keys

• Limited Keys

• Preferred Keys

These widths represent tradeoffs between the match precision and the space used by the tokens. The space used is determined by the number of tokens generated.

For typical customer data, use the standard key width. The number can vary based on the data, but generally the standard key width generates approximately five or six token records per base object record.

Extended keys support more variation in the values for the key, but also generate more records in the token table, about 10 to 12 token records for every base object record.

Limited keys support less variation in the values used for the key, but the token table is also much smaller, with perhaps two to three token records per base object record. If your data has character transpositions in the data used for the key, limited keys may not be the best choice.

Preferred keys generate a single key per base record. This reduces the number of comparisons and increases performance, but can result in returning fewer matches than other key width options. Use this option if you have high volumes of high quality data.
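To get a feel for the space tradeoff, you can estimate the size of the token table from the approximate ratios quoted above. The following Python sketch is a rough planning aid under those stated ratios. The function name estimate_token_rows and the lookup table are hypothetical, and actual counts vary with your data.

```python
# Approximate token records per base object record, as quoted in the text.
TOKENS_PER_RECORD = {
    "standard": (5, 6),
    "extended": (10, 12),
    "limited": (2, 3),
    "preferred": (1, 1),
}


def estimate_token_rows(base_records, key_width):
    """Return a (low, high) estimate of token table rows for a key width."""
    low, high = TOKENS_PER_RECORD[key_width.lower()]
    return base_records * low, base_records * high


for width in TOKENS_PER_RECORD:
    low, high = estimate_token_rows(1_000_000, width)
    print(f"{width:9s} {low:>12,} to {high:>12,} token rows")
```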



Match Key Types and Mixed Data

The match key type you select has a big effect on the match results.

For Party objects that include organizations and people in the same object, the match key type must be Organization_Name, and it must be based on the full name column from the Party object. The full name field must be populated for all records and for individuals it should at least include first name, middle name and last name.

Search Strategies

The search strategy determines how many candidates are returned in the search phase of the match process. The number of candidates has a direct effect on the number of matches returned and the amount of time it takes Siperian Hub to apply your match rules.

The search strategy used to determine the set of candidates for matching must find the balance between finding all possible candidates, and not slowing the process with too many irrelevant candidates.

Applications dealing with relatively clean and complete data can use a high performance strategy, while applications dealing with less clean data or with more critical duplication issues must use more complex strategies.

To achieve this, four search strategies or search levels are supported:

• Narrow

• Typical

• Exhaustive

• Extreme

Narrow gives the best performance but supports the least complexity, as it generates the fewest candidates. Extreme supports the highest level of complexity but gives the worst performance as it generates the most candidates.



For typical customer data, a search strategy of typical is usually appropriate. You may want to change this to narrow for very large data volumes or highly matchy data. Alternatively, if you have a small data set or if it is critical that the highest possible number of matching records be identified, then use the exhaustive or extreme search levels instead.

If both performance and completeness of match are critical, then consider a 2-phase approach to the match process: in the first phase, use a narrow search level to more quickly match and then merge highly similar records. Then switch to a different rule set that uses extreme or exhaustive search levels to provide the more complex and complete searches for candidates.

Match Purposes

The match purpose describes the overall goal of a match rule. The match purpose is very important because it determines which columns are used for matching. The list of match purposes available is determined by the population you select. For a list and descriptions of the standard purposes, see the Siperian Hub Administrator’s Guide.

Each match purpose supports a combination of mandatory and optional fields and each field is weighted according to its influence in the match decision. Some fields in some purposes may be grouped. There are two types of groupings:

• Required: requires at least one of the field members to be non-null

• Best of: contributes only the best score from the fields in the group to the overall match score

For example, in the Individual match purpose:

• Person_Name is a mandatory field

• One of either ID Number or Date of Birth is required

• Other attributes are optional

The overall score returned by each purpose is calculated by adding the participating field scores multiplied by their respective weight and divided by the total of all field weights. If a field is optional and is not provided, it is not included in the weight calculation.
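The weighted calculation described above can be worked through in a few lines of Python. This is a sketch of the arithmetic only. The field weights and scores below are invented for illustration and are not the weights Siperian Hub uses for any purpose, and the function name purpose_score is hypothetical.

```python
def purpose_score(field_scores, weights):
    """Weighted average of the participating field scores.

    field_scores: field name -> score, or None if the optional field
                  was not provided.
    weights:      field name -> weight (illustrative values only).
    """
    provided = {f: s for f, s in field_scores.items() if s is not None}
    total_weight = sum(weights[f] for f in provided)
    if total_weight == 0:
        raise ValueError("no participating fields")
    weighted_sum = sum(score * weights[f] for f, score in provided.items())
    return weighted_sum / total_weight


# Hypothetical weights for three fields of a purpose.
weights = {"Person_Name": 5, "ID": 3, "Date": 2}

# Date is optional and missing, so its weight is left out of the total:
# (90*5 + 100*3) / (5 + 3)
print(purpose_score({"Person_Name": 90, "ID": 100, "Date": None}, weights))

# All three fields provided: (90*5 + 100*3 + 50*2) / (5 + 3 + 2)
print(purpose_score({"Person_Name": 90, "ID": 100, "Date": 50}, weights))
```

Note how a missing optional field does not pull the score down. It simply drops out of both the numerator and the denominator.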



Using the Match Purposes to Match People

When matching people, if the match includes the address fields, then the Resident purpose is better than the Individual purpose. However, if you want to match on person and external IDs, don't use Resident because it requires the address information. In that case, use Individual.

Using the Match Purposes to Match Organizations

When matching organizations, the Division purpose is better than the Organization purpose. Organization allows organizations without addresses to match with organizations with addresses, which may not be what you want. Division only matches records with similar addresses.

Using the Match Purposes to Match Addresses

For match purposes that use address, don't use Address_Part_2 on its own without Address_Part_1. If you must match on zip or city, then add an exact match column on zip or city. Just using Address Part 2 gives you a very loose match. Alternately, you could add a column using Postal_Area instead of exact match on city/zip.

Name Formats

Siperian Hub match has the concept of a default name format which tells it where to expect the last name. The options are:

• Left: last name is at the start of the full name, for example, Smith Jim

• Right: last name is at the end of the full name, for example, Jim Smith

The name format used by Siperian Hub depends on the purpose that you're using. If using Organization, then the default is Last name, First name, Middle name. If using Person/Resident then the default is First Middle Last.

Bear this in mind when formatting data for matching. It might not make a big difference, but there are edge cases where it helps, particularly for names that do not fall within the selected population.



Field Types Used in Purposes

Below are descriptions of the fields supported by the various Match Purposes, provided in alphabetical order.

Field Types Used in Match Purposes

Field Name: Description

Address_Part1: Address_Part1 includes the part of the address up to, but not including, the locality last line. The position of the address components should be the normal word order used in your data population. Pass this data in one field. Depending on your base object, you may concatenate these attributes into one field before matching. For example, in the US, an Address_Part1 string includes the following fields: Care-of + Building Name + Street Number + Street Name + Street Type + Apartment Details. Address_Part1 uses methods and options designed specifically for addresses.

Address_Part2: This is the locality line in an address. For example, in the US, a typical Address_Part2 includes: City + State + Zip (+ Country). Matching on Address_Part2 uses methods and options designed specifically for addresses.

Attribute1, Attribute2: Attribute 1 and Attribute 2 are two general purpose fields. They are matched using a general purpose string matching algorithm that compensates for transpositions and missing characters or digits.

Date: The Date field matches any type of date, such as: date of birth, expiry date, date of contract, date of change, creation date, etc. It expects the date to be passed in Day+Month+Year order. It supports the use or absence of delimiters between the date components. Matching on dates uses methods and options designed specifically for dates. It overcomes the typical error and variation found in this data type.

ID: The ID field matches any type of ID number, such as: Account number, Customer number, Credit Card number, Drivers License number, Passport, Policy number, SSN or other identity code, VIN, etc. It uses a string matching algorithm that compensates for transpositions and missing characters or digits.

Organization_Name: The Organization_Name field matches the names of organizations. These could be company names, business names, institution names, department names, agency names, trading names, etc. This field supports matching on a single name, or a compound name (such as a legal name and its trading style). You may also use multiple names (e.g. a legal name and a trading style) in a single Organization_Name column for the match.

Person_Name: The Person_Name field matches the names of people. Use the full person name. The position of the first name, middle names, and family names should be the normal word order used in your data population. For example, in English speaking countries, the normal order is: First Name + Middle Name(s) + Family Name(s). Depending on your base object design, you may concatenate these fields into one field before matching. This field supports matching on a single name, or an account name (such as JOHN & MARY SMITH). You may also use multiple names, such as a married name and a former name.

Postal_Area: The Postal_Area field can be used to place more emphasis on the postal code than if it were included in the Address_Part2 field. It is for all types of postal codes, including Zip codes. It uses a string matching algorithm that compensates for transpositions and missing characters or digits.

Telephone_Number: The Telephone_Number field is used to match telephone numbers. It uses a string matching algorithm that compensates for transpositions and missing digits or area codes.

Match Levels

In conjunction with the match purposes, you can choose one of three different match levels. To learn more about match levels, see “Match and Merge Setup” in the Siperian Hub Administrator’s Guide.


Defining and Testing Your Match Rules

When defining your match rules, keep the following points in mind:

• Identify records with large numbers of similar values in the match key field. This is called matchy data. Determine whether those records should be considered for matching. If they shouldn't be considered for matching, then flag those records as consolidated before running the match. To learn more, see “About Consolidation Codes” in the Siperian Hub Administrator’s Guide. If they should be considered for matching, then determine whether you can use the Match for Duplicate Data functionality to quickly match and merge the records. To learn more, see “Match for Duplicate Data” in “Match and Merge Setup” in the Siperian Hub Administrator’s Guide.

Examples of such records are the health-care customers named 'GROUP PRACTICES', 'MULTIPLE DOCTORS' and the zip-aligned customers where the names are all the same with the exception of the zip numbers at the end.

Such records all generate the same match keys, resulting in an enormous pool of records for the match to compare against each other, significantly skewing the match data set and negatively affecting performance.

• For a Party base object - i.e. one that contains both organizations and individuals: create different rules for organizations and individuals based on a customer type / customer class indicator. For each rule, use a match purpose that is appropriate for the customer type.

• If you can do an exact match on an attribute, then include that attribute as an exact match column. It makes a significant difference to performance, as it acts as a filter on the match. If all your columns are fuzzy match columns, then you're not going to get great performance.

• If suffix (Jr., Sr., II, III, etc.) is important in the match, then define it as an exact match column and switch on null matching on that column. If it is only a part of the full name used in Siperian Hub matching then you end up matching records that do not have a suffix with records that do have a suffix.

• If you do not have much variation in the values in the column you're keying on (i.e. a low cardinality column) or if there is little in the way of misspellings and character transpositions in the data, then use an exact match base object instead of a fuzzy match base object. The performance is significantly better. An example of an ideal candidate for an exact match base object is an External Identifier base object that stores identifiers such as social security numbers, license numbers, etc.
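One way to find the matchy values mentioned in the first point is to profile the match key column before you define rules. The following Python sketch counts how often each value occurs and reports the ones above a cutoff. It is a profiling aid under assumptions: the function name matchy_values, the case-insensitive comparison, and the threshold are hypothetical choices, not Siperian Hub settings.

```python
from collections import Counter


def matchy_values(key_values, threshold):
    """Return {value: count} for key values occurring at least `threshold` times.

    Values are compared case-insensitively with surrounding spaces removed,
    which is an assumption of this sketch.
    """
    counts = Counter(v.strip().upper() for v in key_values)
    return {value: n for value, n in counts.items() if n >= threshold}


names = [
    "GROUP PRACTICES", "Group Practices", "GROUP PRACTICES ",
    "MULTIPLE DOCTORS", "Annette Curtin", "A Curtin",
]
print(matchy_values(names, threshold=3))
```

Records carrying the reported values are the ones to review: either flag them as consolidated before the match, or handle them with rules designed for duplicate data.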

About Testing

For prototyping and testing your rules, use random data that is of both reasonable quality and quantity. Do not build the prototype to search on made-up names in a development database. Fabricated data will not give you an accurate picture of how your rules will behave in a production environment. Use a random sample of real data.

Understand the business and performance needs of the match. There is a natural conflict between performance and completeness of search. To balance these conflicting requirements, choose the search level with care. Test your searches using different search levels on real production data.

When making your judgement, consider measures of completeness (the percentage of known matches found) against measures of performance (how long the search transaction or batch job took). Choose the one that best conforms to your business requirements.

When measuring match completeness, it is best to have a known set of expected search results. When measuring performance, in addition to ensuring the actual production volume of data is being searched, also take into account network and machine load overhead.
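The two measures can be captured with a small test harness. The Python sketch below computes completeness as the percentage of known matches found and times an arbitrary search callable. It is a generic illustration: the function names completeness and timed are hypothetical, and the search being timed is whatever you use to invoke your match in a test environment.

```python
import time


def completeness(expected_pairs, found_pairs):
    """Percentage of known (expected) matching pairs that the search found.

    Pairs are treated as unordered, so (1, 2) and (2, 1) are the same match.
    """
    expected = {frozenset(p) for p in expected_pairs}
    found = {frozenset(p) for p in found_pairs}
    if not expected:
        raise ValueError("a known set of expected results is required")
    return 100.0 * len(expected & found) / len(expected)


def timed(search, *args):
    """Run a search callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = search(*args)
    return result, time.perf_counter() - start


expected = [(1, 2), (3, 4), (5, 6), (7, 8)]
found = [(2, 1), (3, 4), (9, 10)]
print(f"completeness: {completeness(expected, found):.1f}%")
```

Running the same harness once per search level gives you comparable completeness and elapsed-time figures to weigh against your business requirements.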

Matching Best Practices

Keep these considerations in mind as you define and tune your rules:

• The more fields and columns you can give Siperian Hub the better, as that helps it get better, and sometimes more, matches. These additional fields provide additional context for the match. This context allows Siperian Hub to make decisions about which columns in a match have higher or lower levels of importance in determining the outcome of the match.

• If you add exact match columns, these columns have a filtering effect. Exact match columns are applied before fuzzy columns are considered. They result in the set of match candidates being reduced to only those records that have the same value in the exact match column as at least one other record in the exact match set.

• If you add fuzzy match columns, these columns do not have a filtering effect on the match. They do not filter out matches on their own; the match engine evaluates all fuzzy columns before determining whether two records are a match.

• Avoid breaking composite values down into their constituent parts, as doing so removes much of the context information that Siperian Hub can derive from the way in which the elements in the composite value are defined. For example, pass a person's full name with as much detail in that field as possible - first name, middle name, last name, suffix, etc. - instead of parsing that field out into first name, middle initial, last name etc.

• Do not filter values out of the data for matching. For example, if you parse suffix (Jr., Sr., etc.) from the full name to include it in an exact match column, don't remove it from full name field.

Exact Match Column Properties

For an exact match column, you can specify properties that alter the standard exact match rule behavior.

Siperian Hub supports the following properties for exact matches:

• Null Match

• Segment Match

For information on how to use the Hub Console to set these match types, see “Match and Merge Setup” in the Siperian Hub Administrator’s Guide.

Null Match

The standard behavior of Siperian Hub matching is to treat each NULL value as a placeholder for an unknown value. So, by default, Siperian Hub treats nulls as unequal in a match. You can alter this behavior by enabling the Match NULLs property. When you enable NULL Matching, you have these options:

Property: Description

Disabled: Regardless of the other value, nothing will match (nulls are unequal values). Default setting.

NULL Matches NULL: If the other value is also NULL, it is considered a match. A null value is treated as a particular value in its own right. This means a NULL value matches another NULL value, but does not match any other value.

NULL Matches Non-NULL: If the other value is not NULL, it is considered a match. A NULL value is treated as missing data, so it matches to non-NULL data.

Use null match in cases where you don’t lose any data if you have a null value. For example, null match makes sense for middle name or suffix columns. Conversely, do not use null match in cases where null match could produce an incorrect result. For example, null match is generally inappropriate for a first name column.

When you enable null match for a rule, the rule is rarely applied as most data has relatively few nulls. Generally speaking, more match rules means more overhead, and higher overhead can have an effect on performance. Typically, if you have ten match columns, only enable null match on one or two of those columns. The best way to determine if null matching is appropriate is to know your data and test your rules.

Note: You cannot have null match enabled on the same column you are using for a segment match. See “Segment Match” on page 88 for more.

Segment Match

Segment matching is useful for cases where you have different classes of information in your base object. In this case, you may need different match rules to apply to different types of data. For example, you have a base object that contains customer information for your medical products. This base object contains information for individual doctors, group practices, HMOs, and hospitals. You can create a column that indicates the type of information the record contains: individual, group, and so on. Each of these subsets of your records is referred to as a segment.



Note: Segment matching doesn’t support recursive relationships. An example of a recursive relationship is: a group practice is part of a clinic, which in turn is part of a hospital.

This example is better suited to the Segment matches All Data situation. A more common example, and a more illustrative one, is where you have organizations and individuals in the same Customer table, and you only want to match organizations with organizations and individuals with individuals.

A common scenario where segment matching is useful is a customer base object with individual records as well as organization records. You want to match organizations to organizations and to customers, and individuals to individuals. You never want to match individuals to organizations. For example, you have these rows:

Example of segment matching

Customer Name     Customer Class
ABC, Inc          O
ABC Company       O
Annette Curtin    I
A Curtin          I

You can create specific match rules for each segment, resulting in different rules for different types of data. You can also specify the name of a segment.

Using Segment Matches All Data

Generally, segments are used to match within subsets of data. For example, suppose you have a column called MATCH_COLUMN_SEGMENT whose values are “A”, “B”, and “C”. To match within the “B” segment, create a rule that only generates matches when MATCH_COLUMN_SEGMENT = “B”. Siperian Hub only generates matches against other rows whose segment is also B. If you turn Segment Matches All Data on, it matches all the rows in the “B” segment against any other segment. To use the sales leads/customer database example, if you choose Segment Matches All Data, Siperian Hub matches sales leads against everything. If this checkbox is not selected, Siperian Hub only matches sales leads against sales leads.



A common scenario where segment matching is useful is a base object with customer records as well as sales lead records. You want to match leads to customers, but never the other way around. For example, you have these rows:

Example of segment matching

Customer Name    Sales Lead Flag
ABC, Inc         C
AB Inc           C
ABC, Inc         L

You can create specific match rules for each segment, resulting in different rules for different types of data. You can also specify the name of a segment.

If the segment matches all data option is selected, then Siperian Hub starts with the records in the specified segment and attempts to find matches for those records in the entire base object. For example, your base object contains both sales leads and customers, indicated by either a C (for customer) or an L (for lead) in the segment column. To match leads against both other leads and customers, choose segment matching and Segment Matches All Data. If you matched this data without the Segment Matches All Data option, you would match leads against customers, but you would also match customers against leads, which might result in less reliable data.

Keep in mind that the segment match, and therefore the segments matches all data option, applies to only one match rule. You can have a number of different match rules, some which use the segment and others which don't. Using the lead and customers example, the segment match with segment matches all data allows you to define a looser match rule for the leads segment that allows leads to be loosely matched to other leads as well as customers. You do not want customers to match to customers on that same loose match rule as it could result in overmatching your customer records. So the segment limits the loose match rule just to leads, but would not restrict leads from only matching to other leads.
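The effect of the Segment Matches All Data option on which records a rule compares can be sketched in Python using the lead and customer rows from the example. This is an interpretation for illustration, not Siperian Hub code. It assumes that a segment rule always starts from records in the named segment, and that the option only controls whether the other record may come from a different segment. The function name comparison_pairs is hypothetical.

```python
# Rows from the lead/customer example: (customer name, sales lead flag).
ROWS = [
    ("ABC, Inc", "C"),
    ("AB Inc", "C"),
    ("ABC, Inc", "L"),
]


def comparison_pairs(rows, segment, matches_all_data):
    """Return (source_index, other_index) pairs a segment rule would compare."""
    pairs = []
    for i, (_, seg_i) in enumerate(rows):
        if seg_i != segment:
            continue  # the rule only starts from records in the chosen segment
        for j, (_, seg_j) in enumerate(rows):
            if i == j:
                continue  # a record is not compared with itself
            if matches_all_data or seg_j == segment:
                pairs.append((i, j))
    return pairs


# Leads only against leads: the single lead has nothing to compare with.
print(comparison_pairs(ROWS, "L", matches_all_data=False))

# Leads against everything: the lead is compared with both customers,
# but no comparison starts from a customer record.
print(comparison_pairs(ROWS, "L", matches_all_data=True))
```

The pairs listed here are only the records eligible for comparison under the rule. The rule's match columns still decide which of those pairs are returned as matches.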



Using Matching on Dependent Tables

If you have parent and child objects and you wish to match the child objects, you must include the parent object’s ROWID in all match rules for the child object. If you do not do so, you will lose data. For example:

You have a parent table, COMPANY, and a child table, ADDRESS. Matching addresses within a company without including ROWID_COMPANY in all match rules causes you to lose a company's address with each merge. For example, the child table includes these rows:

Example of Segment Matching with Child Tables

COMPANY_ROWID    Address
12345            100 Main St
54321            100 Main St

If you do not include the COMPANY_ROWID in all the match rules, these two rows are merged and there is a single COMPANY_ROWID. If the remaining ROWID is 12345, the company with ROWID 54321 no longer has a record in the address child table. This data is lost.

Setting Match Batch Sizes

The match batch size is the number of records Siperian Hub attempts to match in one group. If the total number of records considered for match and merge exceeds this maximum match batch size, the match process performs the match in cycles. Each cycle is limited to matching the number of records specified by this parameter.

It may seem like a good idea to use a very large match batch size. But the correct size of the match batch depends on the cardinality of your data and the number of matches your rules return.
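The relationship between the match batch size and the number of match cycles is simple arithmetic, shown here as a Python sketch. The function name match_cycles and the sample numbers are illustrative only, and the sketch assumes every cycle except possibly the last processes a full batch.

```python
def match_cycles(records_to_match, batch_size):
    """Number of cycles needed when each cycle handles at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Integer ceiling division: a partial final batch still needs a cycle.
    return (records_to_match + batch_size - 1) // batch_size


for batch in (100_000, 300_000, 1_000_000):
    print(batch, "->", match_cycles(1_000_000, batch), "cycle(s)")
```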


Using Dynamic Match Analysis Threshold

Dynamic Match Analysis Threshold is a setting in the Match/Merge setup screen. Dynamic match analysis analyzes the match process at runtime to determine if the match process will take an unacceptably long period of time. The threshold value is how you specify the maximum acceptable number of comparisons.

The analysis is computed by multiplying the number of records in the base match group and the number of records in the token table that must be compared. If this product is less than the threshold, the match proceeds. If it is greater than that threshold, the match is not done, and a message is written to the log, noting the range for further investigation.
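The threshold check amounts to one multiplication and one comparison, as the following Python sketch shows. It is an illustration of the arithmetic described above, not the product's implementation. The function name dynamic_match_analysis and the sample values are hypothetical, and because the text does not say what happens when the product equals the threshold exactly, the sketch treats that case as acceptable.

```python
def dynamic_match_analysis(base_group_records, token_records, threshold):
    """Return (proceed, comparisons) for one range of the match.

    comparisons is the product of the records in the base match group and
    the token table records they must be compared with.
    """
    comparisons = base_group_records * token_records
    # Assumption: a product exactly equal to the threshold still proceeds.
    proceed = comparisons <= threshold
    return proceed, comparisons


for group, tokens in [(1_000, 5_000), (50_000, 50_000)]:
    proceed, comparisons = dynamic_match_analysis(group, tokens, threshold=10_000_000)
    action = "match proceeds" if proceed else "skipped and logged for investigation"
    print(f"{comparisons:,} comparisons: {action}")
```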

Tuning Match for Performance

One of the primary culprits in poor performance is an excessive number of comparisons. The match process creates a list of match candidates. It is these candidates that are then compared to determine matches. Match candidates are determined by the values in the match columns. For example, you are matching a dataset that has pharmacies and there are 50,000 instances of BigChain Pharmacy. Each of these 50,000 records may be unique, but unless you reduce the set of candidates, there are 50,000 candidates, each of which must be compared to determine matches. It is this comparison work that directly affects performance. Controlling the number of match candidates is the key to improving match performance.
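To see why the size of the candidate set dominates, consider how many record-to-record comparisons 50,000 candidates imply. The Python sketch below is a back-of-the-envelope model, not a description of the match engine: it assumes every candidate is compared with every other candidate in its group, and it models an exact match filter as a column that splits the candidates into independent groups. The function names and the 50-value filter are hypothetical.

```python
from collections import Counter


def pairwise_comparisons(n, both_directions=False):
    """Comparisons among n candidates, counting each pair once or twice."""
    pairs = n * (n - 1) // 2
    return pairs * 2 if both_directions else pairs


def comparisons_with_filter(filter_values, both_directions=False):
    """Comparisons when candidates are only compared within equal filter values."""
    groups = Counter(filter_values)
    return sum(pairwise_comparisons(size, both_directions) for size in groups.values())


n = 50_000
print(f"all candidates, both directions: {pairwise_comparisons(n, True):,}")
print(f"all candidates, each pair once:  {pairwise_comparisons(n):,}")

# Hypothetical exact match column with 50 evenly distributed values.
filter_column = [i % 50 for i in range(n)]
print(f"with exact match filter:         {comparisons_with_filter(filter_column):,}")
```

Under these assumptions, comparing each pair once instead of in both directions halves the work, while an exact match filter that splits the candidates into many smaller groups reduces it far more.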

The performance of your system is a function of many individual things. There are some basic strategies you can use to optimize the performance of your matching:

• The single most effective thing you can do to improve performance is to know your data. This knowledge enables you to apply the various strategies for performance optimization and get the best results from your Siperian Hub implementation.

• All match approaches are tradeoffs between performance and number of matches. Be biased towards undermatching. Undermatching means that some possible matches are missed. The reverse, overmatching, means that an excessive number of comparisons are done, which can consume a great deal of processing time and resources, depending on the size of the data set.



• Exact matches are much more efficient than fuzzy matches. Where possible, run exact matches to reduce the number of candidate rows before running fuzzy matches. If you have very matchy data, build a match rule set that has only exact rules. Run this rule set to get rid of a large number of matches.

• If you have high volumes of high quality data, using the Preferred key width can improve performance. However, if the quality of the data is lower, this will result in a possibly unacceptable level of undermatching.

• If your ROWID objects are monotonically increasing, then using the Match Only Previous ROWID option in the Match/Merge Setup screen can improve performance. When this option is set, match comparisons are done only downwards with respect to the ROWIDs. That is, row A is matched to row B, but row B is not matched to row A. Setting this option can reduce the number of comparisons by about half.

This option is inappropriate in the following cases:

• Records are inserted out of ROWID order

• You are using the services integration framework with this base object

• You are using user-declared ROWIDs

• If your data is appropriate for Match Only Previous ROWID Objects, use that and also select Match Only Once. This option means that once record A has been matched with another record, record A is not compared with any other record again. This dramatically reduces the number of comparisons.

• As you are testing and tuning your match rules, use the Dynamic Match Analysis Threshold option. To learn more, see “Using Dynamic Match Analysis Threshold” on page 92.

• Avoid loose manual rules. This only moves the problem to data stewards.

• If you have a high volume of data:

• Do not have any unconditional fuzzy match rules.

• Always have some exact match filters on every rule. These filters reduce the number of candidates for comparison.

• Create a number of almost identical fuzzy rules with different exact match filters to reduce the number of rows that are compared for the fuzzy match. For example, create rules such as the following:



• full name and address + exact postcode

• full name and address + exact state

• full name and address + exact first two digits of postcode

• Consider these exact matches when you are defining your cleansing process. Optimizing the results of the cleanse process to generate good data for the exact matches can significantly improve both performance and the quality of the results. It’s much easier to make these optimizations when you’re defining your cleanse processes than it is to go back after the fact after the data is in the base object and you’ve found match issues.

• Never create automerge rules that contain only a single match column.

About Merging

There are two types of merges: automerge, which performs all merges queued for automerge, and manual. Manual merges require a data steward to use the merge manager. These two types of merges are functionally the same.

For all merges, there are two records, the source and the target. When you're merging A into B, A is the source and B is the target. The only field that is guaranteed to survive the merge is the ROWID of the source.

When the records are merged, all that matters is which is the source, and which is the target. For the purposes of merge, trust on columns doesn’t apply. For non-trusted columns, the source data always survives (and the target data is subsumed). The only time the source data doesn't survive is when the validation rule is 100% downgrade, 0% minimum reserve trust. In this case, the target field prevails.


Chapter 6: Implementing Hierarchy Manager
This chapter describes information that implementers need to know before beginning a Hierarchy Manager™ (HM) implementation project. It is recommended for all implementers.

Every implementation is unique. Therefore, neither this chapter nor any other can give you exact, detailed instructions for your particular situation. This chapter:

• defines the concepts required for Hierarchy Manager

• outlines the methodology for implementing Hierarchy Manager

• describes design patterns in terms of various common requirements

• explains how to configure your Hierarchy Manager implementation

Chapter Contents

• About Hierarchy Manager

• Before You Begin Implementing Hierarchy Manager

• About Implementing a Hierarchy Manager System

• Step 1: Analyze Your Data

• Step 2: Build the Data Model

• Step 3: Configure Your Hierarchy Manager Implementation

• Step 4: Load Data


About Hierarchy Manager

Typically, customer relationship data is stored in a variety of different applications and data warehouses, depending on the business need, making it difficult to view and manage customer relationship data. Each application has a well-defined hierarchy (such as customer-to-account, sales-to-account, or product-to-sales) suited for operational purposes and often managed well by the application. Meanwhile, each data warehouse and data mart is designed to reflect relationships necessary for specific reporting purposes, such as sales by region by product over a specific period of time.

Different groups within the organization view a given customer in different and incomplete ways because the application they use has a limited view of the customer relationship and hierarchy information that is specific to that application. In addition, each of these applications may have conflicting information and semantics.

Hierarchy Manager delivers reliable and consolidated customer relationship views that enable organizations to navigate, analyze and manage relationships across multiple hierarchies from disparate applications and data sources.

Hierarchy Manager is part of Siperian Hub, and builds on the power of Siperian Master Reference Manager (MRM), leveraging MRM’s ability to provide the best version of the truth from disparate data sources and applications. Hierarchy Manager allows you to gather, visualize, and manage relationships and hierarchies within your data set. With this tool, you can visualize the relationships in your data and use this relationship information to more effectively cross-sell and up-sell into your existing customer base. In addition, these relationships allow you to:

• more strategically manage accounts

• audit prospects

• align territories more accurately

• manage compensation more precisely

• get a complete relationship view of a customer (for example, a multi-generational family)

Note: Hierarchy Manager is a part of Siperian Hub and is intended to be used with MRM. Hierarchy Manager requires certain MRM capabilities, such as match and merge.


Before You Begin Implementing Hierarchy Manager

Hierarchy Manager is part of the Siperian platform. Before you implement HM, you must install and configure MRM. See the Siperian Hub Installation Guide for your platform to learn more about installing the product. See the Siperian Hub Implementer’s Guide and the Siperian Hub Administrator’s Guide to learn more about implementing and configuring Siperian Hub.

Before you implement your HM system, you must be familiar with MRM and proficient in using the MRM tools. See the Siperian Hub Administrator’s Guide to learn more about using MRM. You are also assumed to be familiar with Hierarchy Manager and Hierarchy Manager concepts. To learn more, see

Also, in order to use Hierarchy Manager, you must have a license file from Siperian that indicates you have purchased a license for Hierarchy Manager. To learn more about this license, contact Siperian support.

Defining Your Goals

Before you begin designing your implementation, it is essential that you define the goals of the HM implementation.

You must determine:

• what systems you will be using as data sources

• the data relationships you wish to explore and manage

• the hierarchies you expect to define

Understanding the Data

Before starting your HM implementation, you must have a thorough understanding of the data you are integrating. For example, you must know the source of the data, its relative accuracy, its structure, the amount of data, trends in the data, the expected growth of the data set, the relationships between the data, and any characteristics that are peculiar to the data from each data source.


Assembling the Team

You must also determine the people who will fill the roles required for your HM implementation. Typically, these roles are:

• Implementation Specialists: people whose expertise is implementing applications.

• Data Stewards: custodians of data quality. In HM terms, data stewards are the people responsible for maintaining relationship data on a regular and ongoing basis.

• MRM Administrators: IT people responsible for configuring or updating a Hub Store so that it provides the rules and functionality required by the data stewards.

• Application Developers: developers who integrate the Hub Store into other applications, such as web or CRM applications.

• DBAs: people who will maintain the database that is the basis of HM and MRM. Since HM is database-based, DBAs contribute significantly to the design phase of the project.

Determining Resources

Lastly, you must determine the implementation resources that will be available to you. These resources may include:

• Message Queues

• Hardware and Network Resources

About Implementing a Hierarchy Manager System

Implementing a Hierarchy Manager system is an iterative process. Since it is impossible to have all the necessary information in hand at the beginning, things you learn in the process cause you to go back and modify your implementation. Implementing is a matter of designing, building, testing, modifying, and testing some more.

While every implementation is different, there is a series of steps you will perform in every implementation:



• Step 1: Analyze Your Data

• Step 2: Build the Data Model

• Step 3: Configure Your Hierarchy Manager Implementation

• Step 4: Load Data

• Step 5: Test and Tune the System. You must test the system to make sure it behaves according to your needs as you have configured it.

Note: If you are implementing an MRM system and intend to also implement a Hierarchy Manager system, it is a good idea to implement the MRM system with the Hierarchy Manager system in mind.

It is also possible to use the Siperian Services Integration Framework (SIF) to write applications that use Hierarchy Manager functionality. To learn more about SIF, see the Siperian Services Integration Framework Guide.

Step 1: Analyze Your Data

As with your MRM implementation, the first step in implementing a Hierarchy Manager system is to look closely at the data. The success of your implementation depends on how well you understand your data.

Each of these steps requires that you examine the business requirements for each part of the Hierarchy Manager system. It is especially important to dig deeply into these requirements. Asking stakeholders why they want to do things the way they do is particularly effective as often there are different ways to achieve the same functionality. The added information about the stakeholder’s reasons will help you choose the correct solution for your organization.

Analyzing the data includes the following steps:

1. Defining the Data Flow and Source Systems

2. Determining Entities and Entity Types

3. Determining Relationships and Relationship Types

4. Determining Hierarchies and Hierarchy Types



5. Creating a Sample Data Set for testing purposes

Defining the Data Flow and Source Systems

Determine the source systems that will feed data into Hierarchy Manager and your MRM systems. You must know exactly what data is coming from where.

Consider the following characteristics of each data set from each source:

• type

• quality

• quantity

• source

• relationships between data from the same source

• relationships to data from other sources

• any other characteristics that are peculiar to each data set

Determining the data flow provides the basis for correctly sizing your Hierarchy Manager implementation. When determining the correct size for your system, consider:

• the quantity of data

• the frequency of updates for that data within each source system

• how often this data will be brought into Hierarchy Manager to update the master records

Determining Entities and Entity Types

Your research into the business needs driving your Hierarchy Manager implementation and knowledge of the data set will result in a relatively small number of general types of things you wish to relate.

When determining the different entity types, consider the relationships you expect these entities to have and what the hierarchies might look like. To learn more about entities and entity types, see “Using the Hierarchy Manager” in the Siperian Hub User’s Guide.

Determining Relationships and Relationship Types

You must have a clear idea of the relationships you wish to manage and explore with Hierarchy Manager. Knowing this sort of relationship data by no means precludes discovering additional, heretofore unknown relationships via the Hierarchy Manager tool. To learn more about relationships and relationship types, see “Using the Hierarchy Manager” in the Siperian Hub User’s Guide.

Determining Hierarchies and Hierarchy Types

Once you have an idea of the entities and relationships you will need, it is important to think about the hierarchies that will be built on that foundation. To learn more about hierarchies and hierarchy types, see “Using the Hierarchy Manager” in the Siperian Hub User’s Guide.

Creating a Sample Data Set

Implementing your Hierarchy Manager system requires iteration to tune and optimize it. This is most easily done with a small, representative sample of your data. A sample of a few thousand records, containing examples of each type of entity you expect the system to support, is a good starting point. In addition, this sample data must contain the various relationships you expect your Hierarchy Manager system to include. You need sample data from each of your source systems. Naturally, the more closely the sample data reflects the characteristics of the complete data set, the more useful it will be. For example, if you have a customer database with most customers in the United States, you might use data from just a few states.

Most of your testing will be done with this small test database. However, as your system moves closer to deployment, test the system with larger databases. To use the example of a customer database with mostly US addresses, you might test with data from three or four states initially. You might then move on to testing with data from ten states, and then twenty or twenty-five.
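The state-by-state approach described above can be sketched in Python as follows. This is a minimal illustration that assumes the source records are already available as dictionaries with a `state` field; the function name `sample_by_state`, the field names, and the data are hypothetical and are not part of Siperian Hub.

```python
def sample_by_state(records, states, limit_per_state=None):
    """Keep only records whose state is in `states`, optionally capped per state."""
    wanted = set(states)
    counts = {}
    sample = []
    for rec in records:
        state = rec["state"]
        if state not in wanted:
            continue
        if limit_per_state is not None and counts.get(state, 0) >= limit_per_state:
            continue
        counts[state] = counts.get(state, 0) + 1
        sample.append(rec)
    return sample


# Hypothetical customer records from one source system.
customers = [
    {"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}, {"id": 3, "state": "TX"},
    {"id": 4, "state": "CA"}, {"id": 5, "state": "WA"}, {"id": 6, "state": "NY"},
]

# Start small, then widen the set of states as deployment approaches.
initial = sample_by_state(customers, ["CA", "NY"])
larger = sample_by_state(customers, ["CA", "NY", "TX", "WA"])
print(len(initial), len(larger))
```

When you build a sample this way, check that related records (for example, both ends of a relationship) fall inside the sample, since the sample must also contain the relationships you expect to manage.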



Step 2: Build the Data Model

Once you are very familiar with the source systems, the data, the data flow, the entities, the relationships and the hierarchies, you are ready to start designing and building the Hierarchy Manager data model. As with the entire implementation process, building the data model is iterative. You will learn things in the process of defining the model and testing that will cause you to go back and modify your data model.

Note: Hierarchy Manager uses a data model that expands the one used by MRM. It is assumed that you are familiar with MRM, its tools, and its data model. To learn more about the MRM data model, see the Siperian Hub Administrator’s Guide and the Siperian Hub Implementer’s Guide. To learn about building the data model, see the Siperian Hub Administrator’s Guide.

Step 3: Configure Your Hierarchy Manager Implementation

One of the primary features of Hierarchy Manager is its ability to allow you to visualize your Hierarchy Manager data. You can specify how that data is displayed. To configure your implementation, including how the data is visualized, see the Siperian Hub Administrator’s Guide.

Step 4: Load Data

The next step is to load data so you can test your system. Based on these tests, you can revise and tune your Hierarchy Manager implementation for your needs. To learn about loading data, see the Siperian Hub Administrator’s Guide.


Chapter 7: Scheduling Batch Jobs and Batch Groups
This chapter explains how to schedule batch jobs and batch groups in a Siperian Hub implementation. The information in this chapter is intended for implementers and system administrators.

Important: You must have the application server running for the duration of a batch job.

Chapter Contents

• About Scheduling Siperian Hub Batch Jobs

• Setting Up Job Execution Scripts

• Monitoring Job Results and Statistics

• Job Scheduling Reference

• Scheduling Batch Groups

• Developing Custom Stored Procedures for Batch Jobs


About Scheduling Siperian Hub Batch Jobs

A Siperian Hub batch job is a program that, when executed, completes a discrete unit of work (a process). All public batch jobs in Siperian Hub can be executed as database stored procedures. To learn more about batch jobs, see the Siperian Hub Administrator’s Guide.

In the Hub Console, the Siperian Hub Batch Viewer and Batch Group tools provide simple mechanisms for executing Siperian Hub batch jobs. However, they do not provide a means for executing and managing jobs on a scheduled basis. For this, you need to execute the stored procedures that do the work of batch jobs. Most organizations have job management tools that are used to control IT processes. Any such tool capable of executing Oracle PL/SQL or DB2 SQL commands can be used to schedule and manage Siperian Hub batch jobs.

Setting Up Job Execution Scripts

This section describes how to set up job execution scripts for running Siperian Hub stored procedures.

Metadata in the C_REPOS_TABLE_OBJECT_V View

Siperian Hub populates the C_REPOS_TABLE_OBJECT_V view with metadata about its stored procedures. You use this metadata to:

• determine whether a stored procedure can be run using job scheduling tools, as described in “Determining Available Execution Scripts” on page 107

• retrieve identifiers in the job execution scripts that execute Siperian Hub stored procedures, as described in “Retrieving Values from C_REPOS_TABLE_OBJECT_V at Execution Time” on page 107


C_REPOS_TABLE_OBJECT_V has the following columns:

C_REPOS_TABLE_OBJECT_V Columns

• ROWID_TABLE_OBJECT: Uniquely identifies a batch job.

• ROWID_TABLE: Depending on the type of batch job, this is the table identifier for either the table affected by the job (target table) or the table providing the data for the job (source table).
  • For Stage jobs, ROWID_TABLE refers to the target table (the staging table).
  • For Load jobs, ROWID_TABLE refers to the source table (the staging table).
  • For Match, Match Analyze, Autolink, Automerge, Auto Match and Merge, External Match, Generate Match Token, Match for Duplicate Data, and Key Match jobs, ROWID_TABLE refers to the base object table, which is both source and target for the jobs.

• OBJECT_NAME: Description of the type of batch job. Examples include:
  • Stage jobs: CMX_CLEANSE.EXE
  • Load jobs: CMXLD.LOAD_MASTER
  • Match and Match Analyze jobs: CMXMA.MATCH

• OBJECT_DESC: Description of the batch job, including the type of batch job as well as the object affected by the batch job. Examples include:
  • Stage for C_STG_CUSTOMER_CREDIT
  • Load from C_STG_CUSTOMER_CREDIT
  • Match and Merge for C_CUSTOMER

• OBJECT_TYPE_CODE: Together with OBJECT_FUNCTION_TYPE_CODE, this is a foreign key to C_REPOS_OBJ_FUNCTION_TYPE. An OBJECT_TYPE_CODE of “P” indicates a procedure that can potentially be executed by a scheduling tool.

• OBJECT_FUNCTION_TYPE_CODE: Indicates the actual procedure type (stage, load, match, and so on).

• PUBLIC_IND: Indicates whether the procedure is a procedure that can be displayed in the Batch Viewer.

• PARAMETER: Describes the parameter list for the procedure. Where specific ROWID_TABLE values are required for the procedure, these are shown in the parameter list. Otherwise, the name of the parameter is simply displayed in the parameter list. An exception to this is the parameter list for Stage jobs (where OBJECT_NAME = CMX_CLEANSE.EXE). In this case, the full parameter list is not shown. For a list of parameters, see “Stage Jobs” on page 131.

• VALID_IND: If VALID_IND is not equal to 1, do not execute the procedure. It means that some repository settings have changed that affect the procedure. This usually applies to changes that affect the Stage jobs if the mappings have not been checked and saved again. To learn more, see “Determining Available Execution Scripts” on page 107.

Identifiers in C_REPOS_TABLE_OBJECT_V

You use the following identifier values in C_REPOS_TABLE_OBJECT_V to execute stored procedures.

OBJECT_NAME | OBJECT_DESC | OBJECT_TYPE_CODE | OBJECT_FUNCTION_TYPE_CODE
CMXMM.AUTOLINK | Link data in BaseObjectName | P | I
CMXMA.MATCH_AND_MERGE | Match and Merge for BaseObjectName | P | B
CMXMM.AUTOMERGE | Merge data in BaseObjectName | P | G
CMXMA.EXTERNAL_MATCH | External Match for BaseObjectName | P | E
CMXMM.BUILD_BVT | Generate BVT snapshot for BaseObjectName | P | V
CMXMA.GENERATE_MATCH_TOKENS | Generate Match Tokens for BaseObjectName | P | N
CMXMA.KEY_MATCH | Key Match for BaseObjectName | P | K
CMXLD.LOAD_MASTER | Load from Link BaseObjectName | P | L
CMXMM.MLINK | Manual Link for BaseObjectName | P | O
CMXMM.MUNLINK | Manual Unlink for BaseObjectName | P | Q
CMXMA.MATCH | Match Analyze for BaseObjectName | P | Z
CMXMA.MATCH | Match for BaseObjectName | P | M
CMXMA.MATCH_FOR_DUPS | Match for Duplicate Data for BaseObjectName | P | D
CMXMA.Migrate_Link_Style_to_Merge_Style | Migrate Link Style to Merge Style for BaseObjectName | P | J
CMXMM.MULTI_MERGE | Multi Merge for BaseObjectName | P | P
CMXMA.RESET_LINKS | Reset Links for BaseObjectName | P | W
CMXMA.Reset_match | Reset Match table for BaseObjectName | P | R
CMX_CLEANSE.EXE | Stage for TargetStagingTableName | P | C
CMXMM.UNMERGE | Unmerge for BaseObjectName | P | X

Determining Available Execution Scripts

To determine which batch jobs are available to be executed via stored procedures, run a query using the standard Siperian Hub view called C_REPOS_TABLE_OBJECT_V, as shown in the following example:

    SELECT *
    FROM C_REPOS_TABLE_OBJECT_V
    WHERE PUBLIC_IND = 1;

Retrieving Values from C_REPOS_TABLE_OBJECT_V at Execution Time

You can use SQL statements to retrieve values from C_REPOS_TABLE_OBJECT_V when executing your scripts at run time. The following example code retrieves the STG_ROWID_TABLE and ROWID_TABLE_OBJECT for cleanse jobs.

    SELECT a.rowid_table, a.rowid_table_object
    INTO IN_STG_ROWID_TABLE, IN_ROWID_TABLE_OBJECT
    FROM c_repos_table_object_v a, c_repos_table b
    WHERE a.object_name = 'CMX_CLEANSE.EXE'
    AND b.rowid_table = a.rowid_table
    AND b.table_name = 'C_HMO_ADDRESS'
    AND a.valid_ind = 1;
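The filtering rules described for C_REPOS_TABLE_OBJECT_V (PUBLIC_IND, OBJECT_TYPE_CODE of “P”, and VALID_IND equal to 1) can be tried out with the following Python sketch. It is only a stand-in: it uses an in-memory SQLite table with the same name as the view, whereas the real view lives in the Oracle or DB2 database, and the ROWID values and the third row are invented for illustration.

```python
import sqlite3

# Stand-in for the repository view; the real one is maintained by Siperian Hub.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE C_REPOS_TABLE_OBJECT_V (
           ROWID_TABLE_OBJECT TEXT, ROWID_TABLE TEXT, OBJECT_NAME TEXT,
           OBJECT_DESC TEXT, OBJECT_TYPE_CODE TEXT,
           OBJECT_FUNCTION_TYPE_CODE TEXT, PUBLIC_IND INTEGER, VALID_IND INTEGER)"""
)
rows = [  # hypothetical identifiers; descriptions follow the examples in this chapter
    ("OBJ.1", "TBL.1", "CMX_CLEANSE.EXE", "Stage for C_STG_CUSTOMER_CREDIT", "P", "C", 1, 1),
    ("OBJ.2", "TBL.1", "CMXLD.LOAD_MASTER", "Load from C_STG_CUSTOMER_CREDIT", "P", "L", 1, 1),
    ("OBJ.3", "TBL.2", "CMXMA.MATCH", "Match for C_CUSTOMER", "P", "M", 1, 0),
]
conn.executemany("INSERT INTO C_REPOS_TABLE_OBJECT_V VALUES (?,?,?,?,?,?,?,?)", rows)

# Public procedures that a scheduler may run and whose settings are still valid.
runnable = conn.execute(
    """SELECT OBJECT_NAME, OBJECT_DESC
       FROM C_REPOS_TABLE_OBJECT_V
       WHERE PUBLIC_IND = 1 AND OBJECT_TYPE_CODE = 'P' AND VALID_IND = 1
       ORDER BY OBJECT_NAME"""
).fetchall()

for name, desc in runnable:
    print(name, "-", desc)
```

The Match row is excluded because its VALID_IND is 0, which mirrors the guidance not to execute a procedure whose VALID_IND is not equal to 1.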

Running Scripts Asynchronously

By default, the execution scripts run synchronously (IN_RUN_SYNCH = TRUE or IN_RUN_SYNCH = NULL). To run the execution scripts asynchronously, specify IN_RUN_SYNCH = FALSE. Note that these Boolean values are case-sensitive and must be specified in upper-case characters.

Monitoring Job Results and Statistics

This section describes how to monitor the results of batch jobs.

Error Messages and Return Codes

Siperian Hub stored procedures return an error message and a return code.

Returned parameters:

• OUT_ERROR_MESSAGE: Error message if an error occurred.

• OUT_RETURN_CODE: Return code. Zero (0) if no errors occurred, or one (1) if an error occurred.

Error handling code in job execution scripts can look for return codes and trap any associated error messages.

Job Execution Status

Siperian Hub stored procedures log their job execution status and statistics in the Siperian Hub repository.



The following figure shows the repository tables that can be used for monitoring job results and statistics:



The following provides more information about these repository tables.

Repository Tables Used for Monitoring Job Results and Statistics

• C_REPOS_JOB_CONTROL: As soon as a job starts to run, it registers itself in C_REPOS_JOB_CONTROL with a RUN_STATUS of 2 (Running/Processing). Once the job completes, its status is updated to one of the following values:
  • 0 (Completed Successfully): Completed without any errors or warnings.
  • 1 (Completed with Errors): Completed, but with some warnings or data rejections. See the RETURN_CODE for any error code and the STATUS_MESSAGE for a description of the error or warning.
  • 2 (Running/Processing)
  • 3 (Failed): The job did not complete. Corrective action must be taken and the job must be run again. See the RETURN_CODE for any error code and the STATUS_MESSAGE for the reason for failure.
  • 4 (Incomplete): The job failed before updating its job status and has been manually marked as incomplete. Corrective action must be taken and the job must be run again. RETURN_CODE and STATUS_MESSAGE will not provide any useful information. A job is marked as incomplete by clicking the Set Status to Incomplete button in the Batch Viewer.

• C_REPOS_JOB_METRIC: When a batch job has completed, it registers its statistics in C_REPOS_JOB_METRIC. There can be multiple statistics for each job. Join to C_REPOS_JOB_METRIC_TYPE to get a description for each statistic.

• C_REPOS_JOB_METRIC_TYPE: Stores the descriptions of the types of metrics that can be registered in C_REPOS_JOB_METRIC.

• C_REPOS_JOB_STATUS_TYPE: Stores the descriptions of the RUN_STATUS values that can be registered in C_REPOS_JOB_CONTROL.
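A scheduling script usually has to turn a RUN_STATUS value into a decision. The following Python sketch shows one way to do that, using only the status values listed above. The action names (`continue`, `review`, `wait`, `rerun`) and the function `next_action` are hypothetical choices for this illustration; how the RUN_STATUS value is read from C_REPOS_JOB_CONTROL is left out.

```python
# RUN_STATUS values as described for C_REPOS_JOB_CONTROL.
RUN_STATUS = {
    0: "Completed Successfully",
    1: "Completed with Errors",
    2: "Running / Processing",
    3: "Failed",
    4: "Incomplete",
}


def next_action(run_status):
    """Map a RUN_STATUS value to a (hypothetical) scheduler action."""
    if run_status not in RUN_STATUS:
        raise ValueError(f"unknown RUN_STATUS: {run_status!r}")
    if run_status == 0:
        return "continue"
    if run_status == 1:
        return "review"   # inspect RETURN_CODE and STATUS_MESSAGE
    if run_status == 2:
        return "wait"     # job is still running
    return "rerun"        # 3 or 4: take corrective action, then run again


for status, label in RUN_STATUS.items():
    print(status, label, "->", next_action(status))
```

Whether a status of 1 should block downstream jobs is a policy decision for each implementation; the sketch only flags it for review.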


Job Scheduling Reference

This section provides a reference for the stored procedures that represent Siperian Hub batch jobs. Siperian Hub provides these stored procedures, in compiled form, for each Operational Record Store (ORS), for Oracle and DB2 databases. You can use any job scheduling software (such as Tivoli, CA Unicenter, and so on) to execute these stored procedures.

Alphabetical List of Jobs

• Autolink Jobs: Automatically links records that have qualified for autolinking during the match process and are flagged for autolinking (Automerge_ind = 1). Used with link-style base objects only.

• Auto Match and Merge Jobs: Executes a continual cycle of a Match job, followed by an Automerge job, until there are no more records to match, or until the size of the manual merge queue exceeds the configured threshold. Used with merge-style base objects only.

• Automerge Jobs: Automatically merges records that have qualified for automerging during the match process and are flagged for automerging (Automerge_ind = 1). Used with merge-style base objects only.

• BVT Snapshot Jobs: Generates a snapshot of the best version of the truth (BVT) for a base object. Used with link-style base objects only.

• Generate Match Token Jobs: Prepares data for matching by generating match tokens according to the current match settings. Match tokens are strings that encode the columns used to identify candidates for matching.

• Key Match Jobs: Matches records from two or more sources when these sources use the same primary key. Compares new records to each other and to existing records, and identifies potential matches based on the comparison of source record keys as defined by the match rules.

• Load Jobs: Copies records from a staging table to the corresponding target table in the Hub Store (a base object or dependent object). During the load process, applies the current trust and validation rules to the records.

• Manual Link Jobs: Shows logs for records that have been manually linked in the Merge Manager tool. Used with link-style base objects only.

• Manual Unlink Jobs: Shows logs for records that have been manually unlinked in the Merge Manager tool. Used with link-style base objects only.

• Match Jobs: Compares new records to each other and to existing records, and identifies potential matches based on the current match settings.

• Match Analyze Jobs: Conducts a search to gather match statistics but does not actually perform the match process. If areas of data with the potential for huge match requirements are discovered, Siperian Hub moves the records to a hold status, which allows a data steward to review the data manually before proceeding with the match process.

• Match for Duplicate Data Jobs: For data with a high percentage of duplicate records, compares new records to each other and to existing records, and identifies exact duplicates. The maximum number of exact duplicates is based on the Duplicate Match Threshold setting for this base object.

• Stage Jobs: Copies records from a landing table into a staging table. During execution, cleanses the data according to the current cleanse settings.

• Unmerge Jobs: Updates metadata for base objects. Used after a base object has been loaded but not yet merged, and subsequent trust configuration changes (such as enabling trust) have been made to columns in that base object. This job must be run before merging data for this base object.

Autolink Jobs

Autolink jobs automatically link records that have qualified for autolinking during the match process and are flagged for autolinking (Automerge_ind = 1). Autolink jobs are used with link-style base objects only. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Autolink Jobs

To learn about the identifiers used to execute the stored procedure associated with this batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.

Dependencies for Autolink Jobs

Each Autolink job is dependent on the successful completion of the match process, and the queuing of records for merge.



Successful Completion of Autolink Jobs

Autolink jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.

Oracle Implementations

Stored Procedure Definition for Autolink Jobs (Oracle)

    PROCEDURE autolink (
        in_rowid_table     IN  cmxlb.cmx_rowid,
        in_user_name       IN  cmxlb.cmx_user_name,
        out_error_message  OUT cmxlb.cmx_message,
        out_return_code    OUT int
    )

Sample Job Execution Script for Autolink Jobs (Oracle)

    DECLARE
        IN_ROWID_TABLE   CHAR(14);
        IN_USER_NAME     VARCHAR2(200);
        OUT_ERROR_MSG    VARCHAR2(2000);
        OUT_RETURN_CODE  NUMBER;
    BEGIN
        IN_ROWID_TABLE  := NULL;  -- set to the ROWID_TABLE of the base object
        IN_USER_NAME    := NULL;  -- set to the user name for the job
        OUT_ERROR_MSG   := NULL;
        OUT_RETURN_CODE := NULL;

        CMXMM.AUTOLINK ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE );

        COMMIT;
    END;
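When a scheduling tool launches jobs from a script, it can be convenient to generate the anonymous PL/SQL block instead of maintaining one file per job. The following Python sketch renders a block modeled on the sample script above and adds a check of the return code. It is a hedged illustration: the function `render_job_block`, the template, the error number -20001, and the example values are assumptions, and the sketch only produces text; it does not connect to a database. It applies only to procedures that take the same four parameters as the Autolink procedure.

```python
import re

# Template modeled on the sample job execution script; the return code check is an addition.
TEMPLATE = """DECLARE
    IN_ROWID_TABLE  CHAR(14) := '{rowid_table}';
    IN_USER_NAME    VARCHAR2(200) := '{user_name}';
    OUT_ERROR_MSG   VARCHAR2(2000);
    OUT_RETURN_CODE NUMBER;
BEGIN
    {procedure} (IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE);
    COMMIT;
    IF OUT_RETURN_CODE <> 0 THEN
        RAISE_APPLICATION_ERROR(-20001, OUT_ERROR_MSG);
    END IF;
END;"""

PROCEDURE_RE = re.compile(r"[A-Z][A-Z_]*\.[A-Z][A-Z_]*")
VALUE_RE = re.compile(r"[A-Za-z0-9_.]+")


def render_job_block(procedure, rowid_table, user_name):
    """Return an anonymous PL/SQL block as text; reject values that would need quoting."""
    if not PROCEDURE_RE.fullmatch(procedure):
        raise ValueError(f"unexpected procedure name: {procedure!r}")
    for value in (rowid_table, user_name):
        if not VALUE_RE.fullmatch(value):
            raise ValueError(f"unexpected parameter value: {value!r}")
    return TEMPLATE.format(
        procedure=procedure, rowid_table=rowid_table, user_name=user_name
    )


# Hypothetical values for illustration.
block = render_job_block("CMXMM.AUTOLINK", "SVR1.7S3", "admin")
print(block)
```

Raising an error when OUT_RETURN_CODE is not zero lets the scheduling tool see a failed step without parsing output.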

Auto Match and Merge Jobs

Auto Match and Merge batch jobs execute a continual cycle of a Match job, followed by an Automerge job, until there are no more records to match, or until the size of the manual merge queue exceeds the configured threshold. When executing the MATCH_AND_MERGE job stored procedure, CMXMA.MATCH_AND_MERGE loops Match and Automerge jobs until there are no more records to match, or until the manual merge queue size limit is reached. Auto Match and Merge jobs are used with merge-style base objects only. To learn more, see the Siperian Hub Administrator’s Guide.
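The cycle described above can be pictured with a short Python sketch. It illustrates the control flow only and is not the implementation of CMXMA.MATCH_AND_MERGE: the callables `match`, `automerge`, and `manual_queue_size`, and the `simulate` helper with its canned numbers, are hypothetical.

```python
def auto_match_and_merge(match, automerge, manual_queue_size, queue_limit):
    """Run Match then Automerge repeatedly.

    Stops when a Match pass finds nothing, or when the manual merge queue
    grows beyond queue_limit. Returns the number of completed cycles.
    """
    cycles = 0
    while True:
        if match() == 0:
            break
        automerge()
        cycles += 1
        if manual_queue_size() > queue_limit:
            break
    return cycles


def simulate(matches_per_pass, queue_limit):
    """Drive the loop with canned numbers; each pass with matches queues one manual merge."""
    remaining = list(matches_per_pass)
    queue = {"size": 0}

    def match():
        found = remaining.pop(0) if remaining else 0
        if found:
            queue["size"] += 1
        return found

    return auto_match_and_merge(match, lambda: None, lambda: queue["size"], queue_limit)


print(simulate([5, 3, 0], queue_limit=10))  # stops when nothing is left to match
print(simulate([5, 3, 0], queue_limit=0))   # stops when the queue limit is exceeded
```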

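Identifiers for Executing Auto Match and Merge Jobs

To learn about the identifiers used to execute the stored procedure associated with this batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.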


Dependencies for Auto Match and Merge Jobs

The Auto Match and Merge jobs for a target base object can either be run on successful completion of each Load job, or on successful completion of all Load jobs for the object.

Successful Completion of Auto Match and Merge Jobs

Auto Match and Merge jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.


Stored Procedure Definition for Auto Match and Merge Jobs (Oracle)

    PROCEDURE CMXMA.MATCH_AND_MERGE (
        IN_ROWID_TABLE   CHAR(14);       --Rowid of a Table.
        IN_USER_NAME     VARCHAR2(200);  --User Name.
        OUT_ERROR_MSG    VARCHAR2(2000); --Error Message, if any.
        OUT_RETURN_CODE  NUMBER;         --Return Code. (If no errors, 0 is returned)
    )

Sample Job Execution Script for Auto Match and Merge Jobs (Oracle)

    DECLARE
        IN_ROWID_TABLE   CHAR(14);
        IN_USER_NAME     VARCHAR2(200);
        OUT_ERROR_MSG    VARCHAR2(2000);
        OUT_RETURN_CODE  NUMBER;
    BEGIN
        IN_ROWID_TABLE  := NULL;
        IN_USER_NAME    := NULL;
        OUT_ERROR_MSG   := NULL;
        OUT_RETURN_CODE := NULL;

        CMXMA.MATCH_AND_MERGE ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE );

        COMMIT;
    END;

Automerge Jobs

Automerge jobs automatically merge records that have qualified for automerging during the match process and are flagged for automerging (Automerge_ind = 1). Automerge jobs are used with merge-style base objects only. To learn more, see the Siperian Hub Administrator’s Guide.

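Identifiers for Executing Automerge Jobs

To learn about the identifiers used to execute the stored procedure associated with this batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.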


Dependencies for Automerge Jobs

Each Automerge job is dependent on the successful completion of the match process, and the queuing of records for merge.

Successful Completion of Automerge Jobs

Automerge jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.


Stored Procedure Definition for Automerge Jobs (Oracle)

    PROCEDURE CMXMM.AUTOMERGE (
        IN_ROWID_TABLE     CHAR(14);       --Rowid of a Table.
        IN_USER_NAME       VARCHAR2(200);  --User Name.
        OUT_ERROR_MESSAGE  VARCHAR2(2000); --Error Message, if any.
        OUT_RETURN_CODE    NUMBER;         --Return Code. (If no errors, 0 is returned)
    )

Sample Job Execution Script for Automerge Jobs (Oracle)

    DECLARE
        IN_ROWID_TABLE     CHAR(14);
        IN_USER_NAME       VARCHAR2(200);
        OUT_ERROR_MESSAGE  VARCHAR2(2000);
        OUT_RETURN_CODE    NUMBER;
    BEGIN
        IN_ROWID_TABLE    := NULL;
        IN_USER_NAME      := NULL;
        OUT_ERROR_MESSAGE := NULL;
        OUT_RETURN_CODE   := NULL;

        CMXMM.AUTOMERGE ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MESSAGE, OUT_RETURN_CODE );

        COMMIT;
    END;

BVT Snapshot Jobs

The BVT Snapshot stored procedure generates a snapshot of the best version of the truth (BVT) for a base object. It supports calculating BVT for one link group (the group of records that are linked to one link group). BVT Snapshot jobs are used with link-style base objects only. To learn more, see the Siperian Hub Administrator’s Guide.

When executing the BVT Snapshot stored procedure:

• IN_GROUP_ID_LIST is a list of group_ids delimited by ~ (such as ‘1~2~’).

• OUT_BVT contains the BVT values for the base object table in the following format: ‘col1~col2~|val1~val2’

• OUT_LINEAGE contains the BVT values for the CTL table in the following format: ‘col1~col2~|val1~val2’

• For delimited strings, the escape character is ‘\’.
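A script that consumes OUT_BVT or OUT_LINEAGE has to split the delimited string while honoring the escape character. The following Python sketch shows one possible reading of the format described above. It rests on assumptions the text does not confirm: that `|` separates the column names from the values, that a trailing `~` produces no extra field, and that one string carries a single set of values. The function names `split_escaped` and `parse_bvt` are hypothetical.

```python
def split_escaped(text, delimiter, escape="\\", unescape=True):
    """Split text on delimiter, treating escape + any character as a literal."""
    fields, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            if not unescape:
                current.append(ch)      # keep the escape for a later pass
            current.append(text[i + 1])
            i += 2
            continue
        if ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


def parse_bvt(delimited):
    """Turn 'col1~col2~|val1~val2' into {'col1': 'val1', 'col2': 'val2'}."""
    parts = split_escaped(delimited, "|", unescape=False)
    if len(parts) != 2:
        raise ValueError("expected exactly one unescaped '|' separator")
    header, values = parts
    cols = [c for c in split_escaped(header, "~") if c]  # drop the trailing empty field
    vals = split_escaped(values, "~")
    if len(vals) == len(cols) + 1 and vals[-1] == "":
        vals.pop()
    if len(vals) != len(cols):
        raise ValueError("column and value counts differ")
    return dict(zip(cols, vals))


print(parse_bvt("col1~col2~|val1~val2"))
print(parse_bvt("NAME~CITY~|Smith\\~Jones~New York"))
```

The first pass keeps escape sequences intact so that an escaped `~` inside a value is not mistaken for a delimiter in the second pass.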



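Identifiers for Executing BVT Snapshot Jobs

To learn about the identifiers used to execute the stored procedure associated with this batch job, see “Identifiers in C_REPOS_TABLE_OBJECT_V” on page 106.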


Dependencies for BVT Snapshot Jobs

Each BVT Snapshot job is dependent on the successful completion of the Autolink job for this base object.
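The dependencies stated so far in this reference can be turned into a run order with a topological sort. The Python sketch below is a minimal illustration: the job labels are shorthand, “Match” stands for the match process, and the dictionary combines merge-style and link-style jobs purely for demonstration, although a single base object uses only one of the two styles.

```python
from graphlib import TopologicalSorter

# job -> jobs that must complete successfully first, as stated in the sections above
DEPENDS_ON = {
    "Auto Match and Merge": {"Load"},
    "Autolink": {"Match"},
    "Automerge": {"Match"},
    "BVT Snapshot": {"Autolink"},
}

# static_order() yields every job after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
```

Jobs with no ordering constraint between them may appear in any relative order, so a scheduler should rely only on the prerequisite relationships and not on the exact sequence printed.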

Successful Completion of BVT Snapshot Jobs

BVT Snapshot jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.


Stored Procedure Definition for BVT Snapshot Jobs (Oracle)

    PROCEDURE build_bvt (
        in_rowid_table     cmxlb.cmx_rowid,
        in_group_id_list   cmxlb.cmx_big_str,
        in_user_name       cmxlb.cmx_user_name,
        out_bvt            OUT cmxlb.cmx_big_str,
        out_lineage        OUT cmxlb.cmx_big_str,
        out_bvt_count      OUT int,
        out_error_message  OUT cmxlb.cmx_message,
        out_return_code    OUT int
    )

Sample Job Execution Script for BVT Snapshot Jobs (Oracle)

    DECLARE
        IN_ROWID_TABLE     CHAR(14);
        IN_ROWID_OBJECT    CHAR(14);
        IN_GROUP_ID        CHAR(14);
        IN_INTERACTION_ID  NUMBER;
        IN_USER_NAME       VARCHAR2(200);
        OUT_BVT            VARCHAR2(2000);
        OUT_LINEAGE        VARCHAR2(2000);
        OUT_BVT_COUNT      NUMBER;
        OUT_ERROR_MESSAGE  VARCHAR2(200);
        OUT_RETURN_CODE    NUMBER;
    BEGIN
        IN_ROWID_TABLE    := 'SVR1.7S3 ';
        IN_GROUP_ID       := '1 ';
        IN_USER_NAME      := NULL;
        OUT_LINEAGE       := NULL;
        OUT_BVT_COUNT     := NULL;
        OUT_BVT           := NULL;
        OUT_ERROR_MESSAGE := NULL;
        OUT_RETURN_CODE   := NULL;

        CMXMM.build_BVT ( IN_ROWID_TABLE, IN_GROUP_ID, IN_USER_NAME, OUT_BVT,
                          OUT_LINEAGE, OUT_BVT_COUNT, OUT_ERROR_MESSAGE, OUT_RETURN_CODE );
        COMMIT;

        -- Print the results returned by the procedure.
        DBMS_OUTPUT.PUT_LINE ( 'OUT_BVT= ' || substr(OUT_BVT,1,2000) );
        DBMS_OUTPUT.PUT_LINE ( 'OUT_LINEAGE= ' || substr(OUT_LINEAGE,1,2000) );
        DBMS_OUTPUT.PUT_LINE ( 'OUT_BVT_COUNT= ' || substr(OUT_BVT_COUNT,1,200) );
        DBMS_OUTPUT.PUT_LINE ( 'OUT_ERROR_MESSAGE= ' || substr(OUT_ERROR_MESSAGE,1,200) );
        DBMS_OUTPUT.PUT_LINE ( 'OUT_RETURN_CODE= ' || substr(OUT_RETURN_CODE,1,200) );
    END;

Generate Match Token Jobs

Generate Match Tokens jobs prepare data for matching by generating match tokens according to the current match settings. Match tokens are strings that encode the columns used to identify candidates for matching. To learn more, see the Siperian Hub Administrator’s Guide.

Schedule Generate Match Tokens jobs if you run the load process without data tokenization, or if the match process failed during tokenization. The Generate Match Tokens job generates the match tokens either for the entire base object (when IN_FULL_RESTRIP_IND is set to 1) or only for the records that must be processed.



Note: Check (select) the Re-generate All Match Tokens check box in the Batch Viewer to populate the IN_FULL_RESTRIP_IND parameter.

Identifiers for Executing Generate Match Token Jobs


Dependencies for Generate Match Token Jobs

Each Generate Match Tokens job is dependent on the successful completion of the Load job responsible for loading data into the base object.

Successful Completion of Generate Match Token Jobs

Generate Match Tokens jobs must complete with a RUN_STATUS of 0 (Completed Successfully).


Stored Procedure Definition for Generate Match Token Jobs—Oracle

PROCEDURE GENERATE_MATCH_TOKENS (
    IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
    IN_USER_NAME VARCHAR2(200); --User Name.
    OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
    OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
    IN_FULL_RESTRIP_IND NUMBER; --Default 0, retokenize entire table if set to 1 (strip_truncate_insert)
)

Sample Job Execution Script for Generate Match Token Jobs—Oracle

DECLARE
    IN_ROWID_TABLE CHAR(14);
    IN_USER_NAME VARCHAR2(200);
    OUT_ERROR_MSG VARCHAR2(2000);
    OUT_RETURN_CODE NUMBER;
    IN_FULL_RESTRIP_IND NUMBER;
BEGIN
    IN_ROWID_TABLE := NULL;
    IN_USER_NAME := NULL;
    OUT_ERROR_MSG := NULL;
    OUT_RETURN_CODE := NULL;
    IN_FULL_RESTRIP_IND := NULL;

    CMXMA.GENERATE_MATCH_TOKENS ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE, IN_FULL_RESTRIP_IND );

    COMMIT;
END;

Key Match Jobs

Key Match jobs are used to match records from two or more sources when these sources use the same primary key. Key Match jobs compare new records to each other and to existing records, and identify potential matches based on the comparison of source record keys as defined by the match rules. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Key Match Jobs


Dependencies for Key Match Jobs

Key Match jobs are dependent on the successful completion of the Load job responsible for loading data into the base object. The Key Match job cannot be run after any changes have been made to the data.

Successful Completion of Key Match Jobs

Key Match jobs must complete with a RUN_STATUS of 0 (Completed Successfully).




Stored Procedure Definition for Key Match Jobs—Oracle

PROCEDURE KEY_MATCH (
    IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
    IN_USER_NAME VARCHAR2(200); --User Name.
    OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
    OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
)

Sample Job Execution Script for Key Match Jobs—Oracle

DECLARE
    IN_ROWID_TABLE VARCHAR2(200);
    IN_USER_NAME VARCHAR2(200);
    OUT_ERROR_MESSAGE VARCHAR2(200);
    OUT_RETURN_CODE NUMBER;
BEGIN
    IN_ROWID_TABLE := NULL;
    IN_USER_NAME := 'myusername';
    OUT_ERROR_MESSAGE := NULL;
    OUT_RETURN_CODE := NULL;

    select rowid_table INTO IN_ROWID_TABLE
    from c_repos_table
    where table_name = 'C_ADDRESS';

    DBMS_OUTPUT.Put_Line(' Row id table = ' || IN_ROWID_TABLE);
    CMXMA.KEY_MATCH ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MESSAGE, OUT_RETURN_CODE);
    DBMS_OUTPUT.Put_Line('OUT_ERROR_MESSAGE = ' || OUT_ERROR_MESSAGE);
    DBMS_OUTPUT.Put_Line('OUT_RETURN_CODE = ' || TO_CHAR(OUT_RETURN_CODE));
    COMMIT;
END;

Load Jobs

Load jobs move data from staging tables to the final target objects, and apply any trust and validation rules where appropriate. To learn more about Load jobs and the load process, see the Siperian Hub Administrator’s Guide.



Identifiers for Executing Load Jobs


Dependencies for Load Jobs

Each Load job is dependent on the success of the Stage job that precedes it. In addition, each Load job is governed by the demands of referential integrity constraints and is dependent on the successful completion of all other Load jobs responsible for populating tables referenced by the table that is the target of the load.
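A scheduler script can derive a valid Load job order from the referential dependencies described above. The following Python sketch (Python 3.9 or later) is one way to do this and is not a Siperian Hub feature. The function name and the table names in the usage line are hypothetical, and the dependency map is assumed to be maintained by the implementer.

```python
from graphlib import TopologicalSorter


def load_order(references):
    """Return table names ordered so that every referenced (parent) table
    comes before the tables that reference it.

    references maps each table name to the set of tables it references.
    graphlib raises CycleError if the references form a cycle.
    """
    return list(TopologicalSorter(references).static_order())


# Hypothetical example: C_ADDRESS references C_PARTY, so C_PARTY loads first.
print(load_order({"C_ADDRESS": {"C_PARTY"}, "C_PARTY": set()}))
```

Running the Load jobs in the returned order satisfies the rule that loads for referenced tables complete before loads for the tables that depend on them.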

Successful Completion of Load Jobs

A Load job must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful. The Auto Match and Merge jobs for a target base object can either be run on successful completion of each Load job, or on successful completion of all Load jobs for the base object.


Stored Procedure Definition for Load Jobs—Oracle

PROCEDURE CMXLD.LOAD_MASTER (
    IN_STG_ROWID_TABLE CHAR(14); --Rowid of Staging Table
    IN_USER_NAME VARCHAR2(200); --DataBase User Name
    OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any
    OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
    IN_FORCE_UPDATE_IND NUMBER; --Forced Update value. Default 0, 1 for Forced update.
)

Notes

For                  Run
Base Objects         Run the loads for parent tables before the loads for child tables.
Dependent Objects    Run the loads for all referenced base objects before the load for the dependent object.



Sample Job Execution Script for Load Jobs—Oracle

DECLARE
    IN_STG_ROWID_TABLE CHAR(14);
    IN_USER_NAME VARCHAR2(200);
    OUT_ERROR_MSG VARCHAR2(2000);
    OUT_RETURN_CODE NUMBER;
    IN_FORCE_UPDATE_IND NUMBER;
BEGIN
    IN_STG_ROWID_TABLE := NULL;
    IN_USER_NAME := NULL;
    OUT_ERROR_MSG := NULL;
    OUT_RETURN_CODE := NULL;
    IN_FORCE_UPDATE_IND := NULL;

    CMXLD.LOAD_MASTER ( IN_STG_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE, IN_FORCE_UPDATE_IND );

    COMMIT;
END;

Manual Link Jobs

Manual Link jobs perform the manual linking that is done in the Merge Manager tool. Manual Link jobs are used with link-style base objects only. Results are stored in a _LINK table. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Manual Link Jobs


Dependencies for Manual Link Jobs

Each Manual Link job is dependent on the successful completion of the match process for this base object.



Successful Completion of Manual Link Jobs

Manual Link jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.

When executing the Manual Link stored procedure:

• IN_MEMBER_ROWID_LIST contains a list of rowid_objects in the following format, which uses the ~ delimiter: rowid_object1~rowid_object2~rowid_object~

• Link records for the rowid_objects in IN_MEMBER_ROWID_LIST are inserted into the group identified by IN_GROUP_ID in the base object’s link table.

• Only one active link record (UNLINK_IND=1) is allowed for each rowid_object.
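When the member list is assembled by a calling script, a small helper can produce the ~-terminated format described above. The following Python sketch is illustrative only. The function name is hypothetical, and rejecting values that contain the delimiter is an assumption, because this section does not state how such values would be escaped.

```python
def build_rowid_list(rowid_objects, delimiter="~"):
    """Join rowid_object values as 'id1~id2~id3~' (each value followed by the delimiter)."""
    items = [str(value) for value in rowid_objects]
    for item in items:
        if item == "":
            raise ValueError("empty rowid_object value")
        if delimiter in item:
            raise ValueError("rowid_object contains the delimiter: " + item)
    return "".join(item + delimiter for item in items)


print(build_rowid_list(["11", "12", "13"]))
```

An empty input yields an empty string, so the caller can decide whether to pass NULL instead.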


Stored Procedure Definition for Manual Link Jobs—Oracle

PROCEDURE mlink (
    in_rowid_table cmxlb.cmx_rowid
    ,in_member_rowid_list cmxlb.cmx_big_str
    ,in_group_id cmxlb.cmx_rowid
    ,in_rowid_match_rule cmxlb.cmx_rowid
    ,in_automerge_ind int
    ,in_interaction_id int
    ,in_user_name cmxlb.cmx_user_name
    ,out_error_message OUT cmxlb.cmx_message
    ,out_return_code OUT int
)

Sample Job Execution Script for Manual Link Jobs—Oracle

DECLARE
    IN_ROWID_TABLE CHAR(14);
    IN_ROWID_OBJECT CHAR(14);
    IN_GROUP_ID CHAR(14);
    IN_ROWID_MATCH_RULE CHAR(14);
    IN_INTERACTION_ID NUMBER;
    IN_USER_NAME VARCHAR2(200);
    OUT_ERROR_MESSAGE VARCHAR2(200);
    OUT_RETURN_CODE NUMBER;
BEGIN
    IN_ROWID_TABLE := 'SVR1.ELV ';
    IN_ROWID_OBJECT := '11 ';
    IN_GROUP_ID := '1 ';
    IN_ROWID_MATCH_RULE := NULL;
    IN_INTERACTION_ID := NULL;
    IN_USER_NAME := 'JW';
    OUT_ERROR_MESSAGE := NULL;
    OUT_RETURN_CODE := NULL;

    CMXMM.MLINK ( IN_ROWID_TABLE, IN_ROWID_OBJECT, IN_GROUP_ID, IN_ROWID_MATCH_RULE, IN_INTERACTION_ID, IN_USER_NAME, OUT_ERROR_MESSAGE, OUT_RETURN_CODE );
    COMMIT;
    DBMS_OUTPUT.PUT_LINE ( 'OUT_ERROR_MESSAGE= ' || substr(OUT_ERROR_MESSAGE,1,200) );
    DBMS_OUTPUT.PUT_LINE ( 'OUT_RETURN_CODE= ' || substr(OUT_RETURN_CODE,1,200) );
END;

Manual Unlink Jobs

Manual Unlink jobs perform manual unlinking of records that were previously linked manually in the Merge Manager tool. Manual Unlink jobs are used with link-style base objects only. A Manual Unlink job ungroups the selected base object records (group members) from the target group and updates the corresponding linkage information in the LINK table (updating the unlink_ind value to 0). If the incoming in_member_rowid_list parameter is NULL, the job unlinks all the group members of the target group. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Manual Unlink Jobs


When executing the Manual Unlink stored procedure:

• Active link records of rowid_objects in IN_MEMBER_ROWID_LIST are set to be inactive (UNLINK_IND=1).

• Active link records of rowid_objects belonging to group IN_GROUP_ID are set to be inactive if the IN_MEMBER_ROWID_LIST is not passed in.

• IN_MEMBER_ROWID_LIST contains a list of rowid_objects in the following format, which uses the ~ delimiter: rowid_object1~rowid_object2~rowid_object~

Dependencies for Manual Unlink Jobs

Each Manual Unlink job is dependent on the successful completion of a previously-run Manual Link job.

Successful Completion of Manual Unlink Jobs

Manual Unlink jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.

Stored Procedure Definition for Manual Unlink Jobs—Oracle

PROCEDURE munlink (
    in_rowid_table cmxlb.cmx_rowid
    ,in_member_rowid_list cmxlb.cmx_big_str -- delimited by '~'
    ,in_group_id cmxlb.cmx_rowid
    ,in_interaction_id int
    ,in_user_name cmxlb.cmx_user_name
    ,out_error_message OUT cmxlb.cmx_message
    ,out_return_code OUT int
)

Sample Job Execution Script for Manual Unlink Jobs—Oracle

DECLARE
    IN_ROWID_TABLE CHAR(14);
    IN_ROWID_OBJECT CHAR(14);
    IN_GROUP_ID CHAR(14);
    IN_INTERACTION_ID NUMBER;
    IN_USER_NAME VARCHAR2(200);
    OUT_ERROR_MESSAGE VARCHAR2(200);
    OUT_RETURN_CODE NUMBER;
BEGIN
    IN_ROWID_TABLE := 'SVR1.ELV ';
    IN_ROWID_OBJECT := '11 ';
    IN_GROUP_ID := '1 ';
    IN_INTERACTION_ID := NULL;
    IN_USER_NAME := NULL;
    OUT_ERROR_MESSAGE := NULL;
    OUT_RETURN_CODE := NULL;

    CMXMM.MUNLINK ( IN_ROWID_TABLE, IN_ROWID_OBJECT, IN_GROUP_ID, IN_INTERACTION_ID, IN_USER_NAME, OUT_ERROR_MESSAGE, OUT_RETURN_CODE );
    COMMIT;
    DBMS_OUTPUT.PUT_LINE ( 'OUT_ERROR_MESSAGE= ' || substr(OUT_ERROR_MESSAGE,1,200) );
    DBMS_OUTPUT.PUT_LINE ( 'OUT_RETURN_CODE= ' || substr(OUT_RETURN_CODE,1,200) );
END;

Match Jobs

Match jobs check the specified match condition for the rows of a base object table and then queue the matched rows for either automerge or manual merge. To learn more about Match jobs and the match process, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Match Jobs


Dependencies for Match Jobs

Each Match job is dependent on new / updated records in the base object that have been tokenized and are thus queued for matching. For parent base objects that have children, the Match job is also dependent on the successful completion of the data tokenization jobs for all child tables, which in turn is dependent on successful Load jobs for the child tables.

Successful Completion of Match Jobs

Match jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.




Stored Procedure for Match Jobs—Oracle

PROCEDURE CMXMA.MATCH (
    IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
    IN_USER_NAME VARCHAR2(200); --User Name.
    OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
    OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
    IN_VALIDATE_TABLE_NAME VARCHAR2(200); --Validate Table Name
    IN_MATCH_ANALYZE_IND NUMBER; --Match Analyze to Check for Matchy Data.
)

Sample Job Execution Script for Match Jobs—Oracle

DECLARE
    IN_ROWID_TABLE CHAR(14);
    IN_USER_NAME VARCHAR2(200);
    OUT_ERROR_MSG VARCHAR2(2000);
    OUT_RETURN_CODE NUMBER;
    IN_VALIDATE_TABLE_NAME VARCHAR2(200);
    IN_MATCH_ANALYZE_IND NUMBER;
BEGIN
    IN_ROWID_TABLE := NULL;
    IN_USER_NAME := NULL;
    OUT_ERROR_MSG := NULL;
    OUT_RETURN_CODE := NULL;
    IN_VALIDATE_TABLE_NAME := NULL;
    IN_MATCH_ANALYZE_IND := 0;

    CMXMA.MATCH ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE, IN_VALIDATE_TABLE_NAME, IN_MATCH_ANALYZE_IND );
    COMMIT;
END;

Match Analyze Jobs

Match Analyze jobs perform a search to gather metrics about matching without conducting any actual matching. Match Analyze jobs are typically used to tune match rules, which is described in the Siperian Hub Implementer’s Guide. To learn more, see the Siperian Hub Administrator’s Guide.



Identifiers for Executing Match Analyze Jobs


Dependencies for Match Analyze Jobs

Each Match Analyze job is dependent on new / updated records in the base object that have been tokenized and are thus queued for matching. For parent base objects, the Match Analyze job is also dependent on the successful completion of the data tokenization jobs for all child tables, which in turn is dependent on successful Load jobs for the child tables.

Successful Completion of Match Analyze Jobs

Match Analyze jobs must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.


Stored Procedure for Match Analyze Jobs—Oracle

PROCEDURE CMXMA.MATCH (
    IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
    IN_USER_NAME VARCHAR2(200); --User Name.
    OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
    OUT_RETURN_CODE NUMBER; --Return Code. (If no errors, 0 is returned)
    IN_VALIDATE_TABLE_NAME VARCHAR2(200); --Validate Table Name
    IN_MATCH_ANALYZE_IND NUMBER; --Match Analyze to Check for Matchy Data.
)

Sample Job Execution Script for Match Analyze Jobs—Oracle

DECLARE
    IN_ROWID_TABLE CHAR(14);
    IN_USER_NAME VARCHAR2(200);
    OUT_ERROR_MSG VARCHAR2(2000);
    OUT_RETURN_CODE NUMBER;
    IN_VALIDATE_TABLE_NAME VARCHAR2(200);
    IN_MATCH_ANALYZE_IND NUMBER;
BEGIN
    IN_ROWID_TABLE := NULL;
    IN_USER_NAME := NULL;
    OUT_ERROR_MSG := NULL;
    OUT_RETURN_CODE := NULL;
    IN_VALIDATE_TABLE_NAME := NULL;
    IN_MATCH_ANALYZE_IND := 1;

    CMXMA.MATCH ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE, IN_VALIDATE_TABLE_NAME, IN_MATCH_ANALYZE_IND );
    COMMIT;
END;

Match for Duplicate Data Jobs

A Match for Duplicate Data job searches for exact duplicates to consider them matched. Use it to manually run the Match for Duplicate Data process when you want to use your own rule as the match for duplicates criteria instead of all the columns in the base object. The maximum number of exact duplicates is based on the base object columns defined in the Duplicate Match Threshold property in the Schema Manager for each base object. To learn more, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Match for Duplicate Data Jobs


Dependencies for Match for Duplicate Data Jobs

Match for Duplicate Data jobs require the existence of unconsolidated data in the base object.

Successful Completion of Match for Duplicate Data Jobs

Match for Duplicate Data jobs must complete with a RUN_STATUS of 0 (Completed Successfully).




Stored Procedure Definition for Match for Duplicate Data Jobs—Oracle

PROCEDURE MATCH_FOR_DUPS (
    IN_ROWID_TABLE CHAR(14); --Rowid of a Table.
    IN_USER_NAME VARCHAR2(200); --User Name.
    OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any.
    OUT_RETURN_CODE INT; --Return Code. (If no errors, 0 is returned)
)

Sample Job Execution Script for Match for Duplicate Data Jobs—Oracle

DECLARE
    IN_ROWID_TABLE CHAR(14);
    IN_USER_NAME VARCHAR2(200);
    OUT_ERROR_MSG VARCHAR2(2000);
    OUT_RETURN_CODE INT;
BEGIN
    IN_ROWID_TABLE := NULL;
    IN_USER_NAME := NULL;
    OUT_ERROR_MSG := NULL;
    OUT_RETURN_CODE := NULL;

    CMXMA.MATCH_FOR_DUPS ( IN_ROWID_TABLE, IN_USER_NAME, OUT_ERROR_MSG, OUT_RETURN_CODE);

    COMMIT;
END;

Stage Jobs

Stage jobs copy records from a landing table to a staging table. During execution, Stage jobs optionally cleanse data according to the current cleanse settings. To learn more about Stage jobs and the stage process, see the Siperian Hub Administrator’s Guide.

Identifiers for Executing Stage Jobs




Dependencies for Stage Jobs

Each Stage job is dependent on the successful completion of the ETL process responsible for loading the Landing table used by the Stage job. There are no dependencies between Stage jobs.

Successful Completion of Stage Jobs

A Stage job must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful. On successful completion of a Stage job, the Load job for the target staging table can be run, provided that all other dependencies for the Load job have been met.


Stored Procedure Definition for Stage Jobs—Oracle

PROCEDURE CMXCL.START_CLEANSE(
    IN_DB_TYPE_STR VARCHAR2(200); --DataBase type (Oracle/DB2)
    IN_HOST_NAME VARCHAR2(200); --Database Host Name
    IN_SCHEMA_NAME VARCHAR2(200); --Schema Name
    IN_PORT VARCHAR2(200); --DataBase Port
    IN_CONNECT_PROPS VARCHAR2(200); --Connection Properties
    IN_DB_USER_NAME VARCHAR2(200); --DataBase User Name
    IN_DB_PASSWORD VARCHAR2(200); --DataBase User’s Password
    IN_STG_ROWID_TABLE VARCHAR2(200); --Rowid of Staging Table
    IN_ROWID_TABLE_OBJECT VARCHAR2(200); --Rowid of Table Object
    IN_RUN_SYNCH VARCHAR2(200); --Run Synchronize, Boolean value (TRUE/FALSE)
    OUT_ERROR_MSG VARCHAR2(2000); --Error Message, if any
    OUT_ERROR_CODE NUMBER; --Error Code, if any
)

Sample Job Execution Script for Stage Jobs—Oracle

DECLARE
    IN_DB_TYPE_STR VARCHAR2(200);
    IN_HOST_NAME VARCHAR2(200);
    IN_SCHEMA_NAME VARCHAR2(200);
    IN_PORT VARCHAR2(200);
    IN_CONNECT_PROPS VARCHAR2(200);
    IN_DB_USER_NAME VARCHAR2(200);
    IN_DB_PASSWORD VARCHAR2(200);
    IN_STG_ROWID_TABLE VARCHAR2(200);
    IN_ROWID_TABLE_OBJECT VARCHAR2(200);
    IN_RUN_SYNCH VARCHAR2(200);
    OUT_ERROR_MSG VARCHAR2(2000);
    OUT_ERROR_CODE NUMBER;
BEGIN
    IN_DB_TYPE_STR := NULL;
    IN_HOST_NAME := 'dbhostmachine';
    IN_SCHEMA_NAME := 'enterpriseschmeaname';
    IN_PORT := '7001';
    IN_CONNECT_PROPS := NULL;
    IN_DB_USER_NAME := 'admin';
    IN_DB_PASSWORD := 'mydbpassword';
    IN_STG_ROWID_TABLE := NULL;
    IN_ROWID_TABLE_OBJECT := NULL;
    IN_RUN_SYNCH := NULL;
    OUT_ERROR_MSG := NULL;
    OUT_ERROR_CODE := NULL;

    SELECT a.rowid_table, a.rowid_table_object
    INTO IN_STG_ROWID_TABLE, IN_ROWID_TABLE_OBJECT
    FROM c_repos_table_object_v a, c_repos_table b
    WHERE a.object_name = 'CMX_CLEANSE.EXE'
    AND b.rowid_table = a.rowid_table
    AND b.table_name = 'C_HMO_ADDRESS'
    AND a.valid_ind = 1;

    CMXCL.START_CLEANSE ( IN_DB_TYPE_STR, IN_HOST_NAME, IN_SCHEMA_NAME, IN_PORT, IN_CONNECT_PROPS, IN_DB_USER_NAME, IN_DB_PASSWORD, IN_STG_ROWID_TABLE, IN_ROWID_TABLE_OBJECT, IN_RUN_SYNCH, OUT_ERROR_MSG, OUT_ERROR_CODE );
    dbms_output.put_line(' Message is = ' || out_error_msg);
    COMMIT;
END;

Unmerge Jobs

For merge-style base objects only, the Unmerge job can unmerge already-consolidated records, whether those records were consolidated using Automerge, Manual Merge, manual edit, Load by Rowid_Object, or Put Xref. The Unmerge job succeeds or fails as a single transaction: if the server fails while the Unmerge job is executing, the unmerge process is rolled back.



Cascade Unmerge

The Unmerge job performs a cascade unmerge if this feature is enabled for this base object in the Schema Manager in the Hub Console. With cascade unmerge, when records in the parent object are unmerged, Siperian Hub also unmerges affected records in the child base object. To learn more, see the Siperian Hub Administrator’s Guide.

Unmerging All Records or One Record

In your job execution script, you can specify the scope of records to unmerge by setting IN_UNMERGE_ALL_XREFS_IND.

• IN_UNMERGE_ALL_XREFS_IND=0: Default setting. Unmerges the single record identified in the specified XREF to its state prior to the merge.

• IN_UNMERGE_ALL_XREFS_IND=1: Unmerges all XREFs to their state prior to the merge. Use this option to quickly unmerge all XREFs for a single consolidated record in a single operation.

Linear and Tree Unmerge

In your job execution script, you can specify the type of unmerge (linear or tree unmerge) by setting IN_TREE_UNMERGE_IND:

• IN_TREE_UNMERGE_IND=0: Default setting. Linear Unmerge

• IN_TREE_UNMERGE_IND=1: Tree Unmerge

The rest of this section describes these two types of unmerges.

Linear Unmerge

Linear unmerge is the default behavior. During a linear unmerge, a base object record is unmerged and taken out of the existing merge tree structure. Only the unmerged base object record itself comes out of the merge tree structure; all base object records below it in the merge tree stay in the original merge tree.



Tree Unmerge

Tree unmerge is an optional alternative. A tree of merged base object records is a hierarchical structure of the merge history, reflecting the sequence of merge operations that have occurred. The merge history is kept during the merge process in the following tables:

• The HMXR table provides the current state view of merges.

• The HMRG table provides a hierarchical view of the merge history (a tree of merged base object records), as well as an interactive unmerge history.

During a tree unmerge, you unmerge a tree of merged base object records as an intact sub-structure. A sub-tree with the unmerged base object record as its root comes out of the original merge tree structure.
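The difference between the two unmerge types can be pictured with a toy model of a merge tree. The following Python sketch is a conceptual illustration only and does not reflect how Siperian Hub stores or processes merge history. The function names and record labels are hypothetical, and the sketch assumes that after a linear unmerge the records below the unmerged record attach to that record's former parent.

```python
def linear_unmerge(parent_of, record):
    """Remove only `record`; records below it stay in the original tree.

    parent_of maps each record to its parent (None for the root).
    Returns (remaining_tree, removed_tree).
    """
    new_parent = parent_of[record]
    remaining = {}
    for child, parent in parent_of.items():
        if child == record:
            continue
        # Assumption: children of the removed record reattach to its parent.
        remaining[child] = new_parent if parent == record else parent
    return remaining, {record: None}


def tree_unmerge(parent_of, record):
    """Remove `record` together with every record below it, as an intact sub-tree."""
    subtree = {record}
    changed = True
    while changed:
        changed = False
        for child, parent in parent_of.items():
            if parent in subtree and child not in subtree:
                subtree.add(child)
                changed = True
    remaining = {c: p for c, p in parent_of.items() if c not in subtree}
    removed = {c: (None if c == record else parent_of[c]) for c in subtree}
    return remaining, removed


# Hypothetical merge tree: B was merged into A; C and D were merged into B.
tree = {"A": None, "B": "A", "C": "B", "D": "B"}
print(linear_unmerge(tree, "B"))
print(tree_unmerge(tree, "B"))
```

In this model, a linear unmerge of B leaves C and D in the tree rooted at A, while a tree unmerge of B removes B, C, and D together with B as the root of the removed sub-tree.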

Identifiers for Executing Manual Unmerge Jobs


Dependencies for Manual Unmerge Jobs

Each Manual Unmerge job is dependent on data having already been merged.

Successful Completion of Manual Unmerge Jobs

A Manual Unmerge job must complete with a RUN_STATUS of 0 (Completed Successfully) or 1 (Completed with Errors) to be considered successful.
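The RUN_STATUS rules stated for the batch jobs in the preceding sections can be collected into a lookup that a scheduler script consults before starting a dependent job. The following Python sketch is illustrative. The dictionary and function names are hypothetical, and the accepted values are the ones listed under "Successful Completion" for each job type above.

```python
# RUN_STATUS values treated as successful, per job type
# (0 = Completed Successfully, 1 = Completed with Errors).
ACCEPTED_RUN_STATUS = {
    "BVT Snapshot": {0, 1},
    "Generate Match Tokens": {0},
    "Key Match": {0},
    "Load": {0, 1},
    "Manual Link": {0, 1},
    "Manual Unlink": {0, 1},
    "Match": {0, 1},
    "Match Analyze": {0, 1},
    "Match for Duplicate Data": {0},
    "Stage": {0, 1},
    "Manual Unmerge": {0, 1},
}


def job_succeeded(job_type, run_status):
    """Return True if run_status counts as successful for the given job type."""
    return run_status in ACCEPTED_RUN_STATUS[job_type]


print(job_succeeded("Load", 1))       # accepted: Load allows 0 or 1
print(job_succeeded("Key Match", 1))  # not accepted: Key Match requires 0
```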


Stored Procedure Definition for Manual Unmerge Jobs—Oracle

PROCEDURE unmerge (
    in_rowid_table cmxlb.cmx_rowid
    ,in_rowid_system cmxlb.cmx_rowid
    ,in_pkey_src_object cmxlb.cmx_pkey_src_object
    ,in_tree_unmerge_ind INT
    ,in_rowid_job_ctl cmxlb.cmx_rowid
    ,in_interaction_id INT
    ,in_user_name cmxlb.cmx_user_name
    ,out_unmerged_rowid OUT cmxlb.cmx_rowid
    ,out_tmp_table_list OUT cmxlb.cmx_big_str
    ,out_error_message OUT cmxlb.cmx_message
    ,rc OUT INT
    ,in_unmerge_all_xrefs_ind IN INT default 0
)

Sample Job Execution Script for Manual Unmerge Jobs—Oracle

DECLARE
    in_rowid_table CHAR (14);
    in_rowid_system CHAR (14);
    in_pkey_src_object VARCHAR2 (255);
    in_tree_unmerge_ind NUMBER;
    in_rowid_job_ctl CHAR (14);
    in_interaction_id NUMBER;
    in_user_name VARCHAR2 (50);
    out_unmerged_rowid CHAR (14);
    out_tmp_table_list VARCHAR2 (32000);
    out_error_message VARCHAR2 (1024);
    rc NUMBER;
    in_unmerge_all_xrefs_ind NUMBER;
BEGIN
    in_rowid_table := 'SVR1.8ZC ';
    in_rowid_system := 'SVR1.7NJ ';
    in_pkey_src_object := '6';
    in_tree_unmerge_ind := 0; -- default 0, 1 for tree unmerge
    in_rowid_job_ctl := NULL;
    in_interaction_id := NULL;
    in_user_name := 'xhe';
    out_unmerged_rowid := NULL;
    out_tmp_table_list := NULL;
    out_error_message := NULL;
    rc := NULL;
    in_unmerge_all_xrefs_ind := 0; -- default 0, 1 for unmerge_all

    cmxmm.unmerge ( in_rowid_table, in_rowid_system, in_pkey_src_object, in_tree_unmerge_ind, in_rowid_job_ctl, in_interaction_id, in_user_name, out_unmerged_rowid, out_tmp_table_list, out_error_message, rc, in_unmerge_all_xrefs_ind );
    DBMS_OUTPUT.put_line (' Return Code = ' || rc);
    DBMS_OUTPUT.put_line (' Message is = ' || out_error_message);
END;

Scheduling Batch Groups

This section describes how to schedule batch groups for your Siperian Hub implementation.

About Batch Groups

A batch group is a collection of individual batch jobs (for example, Stage, Load, and Match jobs) that can be executed with a single command. Each batch job in a batch group can be executed singly or in parallel with other jobs. To learn important background information about batch groups, see the Siperian Hub Administrator’s Guide.

This section describes how to execute batch groups via stored procedures using job scheduling software (such as Tivoli, CA Unicenter, and so on). Siperian Hub provides stored procedures for managing batch groups, as described in “Stored Procedures for Batch Groups” on page 138. Siperian Hub also allows you to create and run custom stored procedures for batch groups, as described in “Developing Custom Stored Procedures for Batch Jobs” on page 145.

You can also use the Batch Group tool in the Hub Console to configure and run batch groups. However, to schedule batch groups, you need to do so via stored procedures, as described in this section. To learn more about the Batch Group tool, see the Siperian Hub Administrator’s Guide.



Stored Procedures for Batch Groups

Siperian Hub provides the following stored procedures for managing batch groups:

Stored Procedure               Description
cmxbg.execute_batchgroup       Performs an HTTP POST to the ExecuteBatchGroupRequest operation via the Services Integration Framework (SIF). To learn more, see “cmxbg.execute_batchgroup” on page 138.
cmxbg.reset_batchgroup         Performs an HTTP POST to the ResetBatchGroupRequest operation via SIF. To learn more, see “cmxbg.reset_batchgroup” on page 140.
cmxbg.get_batchgroup_status    Performs an HTTP POST to the GetBatchGroupStatusRequest operation via SIF. To learn more, see “cmxbg.get_batchgroup_status” on page 142.

In addition to using parameters that are associated with the corresponding SIF operation, these stored procedures require the following parameters:

• URL of the Hub Server (for example, http://localhost:7001/cmx/request)

• username and password

• target ORS

These stored procedures construct an XML message, perform an HTTP POST to a server URL via SIF, and return the results.

cmxbg.execute_batchgroup

Performs an HTTP POST to the ExecuteBatchGroupRequest operation via SIF.

Note: This stored procedure has an option to execute asynchronously, but not to receive a JMS response for asynchronous execution. If you need to use asynchronous execution and need to know when execution is finished, then poll with the cmxbg.get_batchgroup_status stored procedure. Alternatively, if you need to receive a JMS response for asynchronous execution, then execute the batch group directly in an external application (instead of a job execution script) by invoking the Siperian Hub ExecuteBatchGroupRequest operation, which is described in the Siperian Services Integration Framework Guide.
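The polling approach mentioned in the note can be sketched as a simple loop in the calling script. The following Python example is a minimal sketch, not a Siperian-supplied utility. The function name is hypothetical, get_status stands for whatever code the implementer writes to call cmxbg.get_batchgroup_status and return out_run_status, and the set of statuses that mean "finished" must be supplied by the caller because this section does not list the batch group status codes.

```python
import time


def wait_for_batch_group(get_status, terminal_statuses, poll_seconds=30.0,
                         timeout_seconds=3600.0, sleep=time.sleep,
                         clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or the timeout expires.

    get_status: callable returning the current run status (hypothetical wrapper
        around a cmxbg.get_batchgroup_status call).
    terminal_statuses: set of status values that mean execution has finished.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_statuses:
            return status
        if clock() >= deadline:
            raise TimeoutError("batch group did not finish before the timeout")
        sleep(poll_seconds)


# Usage with a stand-in status source; the status values here are placeholders.
fake_statuses = iter(["RUNNING", "RUNNING", "DONE"])
print(wait_for_batch_group(lambda: next(fake_statuses), {"DONE"},
                           poll_seconds=0, sleep=lambda seconds: None))
```

Passing sleep and clock as parameters keeps the loop testable without real delays. In a scheduler script, the defaults apply.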

Signature

FUNCTION execute_batchgroup(
    in_mrm_server_url IN cmxlb.cmx_small_str
    , in_username IN cmxlb.cmx_small_str
    , in_password IN cmxlb.cmx_small_str
    , in_orsid IN cmxlb.cmx_small_str
    , in_rowid_batchgroup IN cmxlb.cmx_small_str
    , in_resume IN cmxlb.cmx_small_str
    , in_asyncronous IN cmxlb.cmx_small_str
    , out_rowid_batchgroup_log OUT cmxlb.cmx_small_str
    , out_error_msg OUT cmxlb.cmx_small_str
) RETURN NUMBER -- return the error code

Parameters

Parameter              Description
in_mrm_server_url      Hub Server SIF URL.
in_username            User account with role-based permissions to execute batch groups.
in_password            Password for the user account with role-based permissions to execute batch groups.
in_orsid               ORS ID as specified in the Database tool in the Hub Console. To learn more, see the Siperian Hub Administrator’s Guide.
in_rowid_batchgroup    c_repos_job_group.rowid_job_group
in_resume              One of the following values:
                       • true: if previous execution failed, resume at that point
                       • false: regardless of previous execution, start from the beginning
in_asyncronous         Specifies whether to execute asynchronously or synchronously. One of the following values:
                       • true: start execution and return immediately (asynchronous execution).
                       • false: return when group execution is complete (synchronous execution).



Returns

Parameter                   Description
out_rowid_batchgroup_log    c_repos_job_group_control.rowid_job_group_control
out_error_msg               Error message text.
NUMBER                      Error code. If zero (0), then the stored procedure completed successfully. If one (1), then the stored procedure returns an explanation in out_error_msg.

Example

DECLARE
    out_rowid_batchgroup_log cmxlb.cmx_small_str;
    out_error_msg cmxlb.cmx_small_str;
    ret_val int;
BEGIN
    ret_val := cmxbg.execute_batchgroup(
        'http://localhost:7001/cmx/request/process/'
        , 'admin'
        , 'admin'
        , 'localhost-mrm-XU_3009'
        , 'SVR1.1VHDH'
        , 'true' -- or 'false'
        , 'true' -- or 'false'
        , out_rowid_batchgroup_log
        , out_error_msg
    );
    cmxlb.debug_print('execute_batchgroup: ' || ' code='|| ret_val || ' message='|| out_error_msg || ' | out_rowid_batchgroup_log='|| out_rowid_batchgroup_log);
END;
/

cmxbg.reset_batchgroup

Performs an HTTP POST to the ResetBatchGroupRequest operation via SIF.

Signature

FUNCTION reset_batchgroup(
    in_mrm_server_url IN cmxlb.cmx_small_str
    , in_username IN cmxlb.cmx_small_str
    , in_password IN cmxlb.cmx_small_str
    , in_orsid IN cmxlb.cmx_small_str
    , in_rowid_batchgroup IN cmxlb.cmx_small_str
    , out_rowid_batchgroup_log OUT cmxlb.cmx_small_str
    , out_error_msg OUT cmxlb.cmx_small_str
) RETURN NUMBER -- return the error code

Parameters

Parameter              Description
in_rowid_batchgroup    c_repos_job_group.rowid_job_group

Returns

Parameter        Description
out_error_msg    Error message text.

Example

DECLARE
    out_rowid_batchgroup_log cmxlb.cmx_small_str;
    out_error_msg cmxlb.cmx_small_str;
    ret_val int;
BEGIN
    ret_val := cmxbg.reset_batchgroup(
        'http://localhost:7001/cmx/request/process/'
        , 'admin'
        , 'admin'
        , 'localhost-mrm-XU_3009'
        , 'SVR1.1VHDH'
        , out_rowid_batchgroup_log
        , out_error_msg
    );
    cmxlb.debug_print('reset_batchgroup: ' || ' code='|| ret_val || ' message='|| out_error_msg || ' | out_rowid_batchgroup_log='|| out_rowid_batchgroup_log);
END;
/

cmxbg.get_batchgroup_status

Performs an HTTP POST to the GetBatchGroupStatusRequest operation via SIF.

Signature

FUNCTION get_batchgroup_status(
    in_mrm_server_url IN cmxlb.cmx_small_str
    , in_username IN cmxlb.cmx_small_str
    , in_password IN cmxlb.cmx_small_str
    , in_orsid IN cmxlb.cmx_small_str
    , in_rowid_batchgroup IN cmxlb.cmx_small_str
    , in_rowid_batchgroup_log IN cmxlb.cmx_small_str
    , out_rowid_batchgroup OUT cmxlb.cmx_small_str
    , out_rowid_batchgroup_log OUT cmxlb.cmx_small_str
    , out_start_rundate OUT cmxlb.cmx_small_str
    , out_end_rundate OUT cmxlb.cmx_small_str
    , out_run_status OUT cmxlb.cmx_small_str
    , out_status_message OUT cmxlb.cmx_small_str
    , out_error_msg OUT cmxlb.cmx_small_str
) RETURN NUMBER -- return the error code

Parameters

Parameter                  Description
in_rowid_batchgroup        c_repos_job_group.rowid_job_group. If in_rowid_batchgroup_log is null, the most recent log for this group will be used.
in_rowid_batchgroup_log    c_repos_job_group_control.rowid_job_group_control. Either in_rowid_batchgroup or in_rowid_batchgroup_log is required.

Returns

Parameter               Description
out_rowid_batchgroup    c_repos_job_group.rowid_job_group
out_start_rundate       Date / time when this batch job started.
out_end_rundate         Date / time when this batch job ended.
out_run_status          Job execution status code that is displayed in the Batch Group tool. To learn more, see the Siperian Hub Administrator’s Guide.
out_status_message      Job execution status message that is displayed in the Batch Group tool. To learn more, see the Siperian Hub Administrator’s Guide.
out_error_msg           Error message text for this stored procedure call, if applicable.

Example

DECLARE
    out_rowid_batchgroup cmxlb.cmx_small_str;
    out_rowid_batchgroup_log cmxlb.cmx_small_str;
    out_start_rundate cmxlb.cmx_small_str;
    out_end_rundate cmxlb.cmx_small_str;
    out_run_status cmxlb.cmx_small_str;
    out_status_message cmxlb.cmx_small_str;
    out_error_msg cmxlb.cmx_small_str;
    out_returncode int;
    ret_val int;
BEGIN
    ret_val := cmxbg.get_batchgroup_status(
        'http://localhost:7001/cmx/request/process/'
        , 'admin'
        , 'admin'
        , 'localhost-mrm-XU_3009'
        , 'SVR1.1VHDH'
        , null
        , out_rowid_batchgroup
        , out_rowid_batchgroup_log
        , out_start_rundate
        , out_end_rundate
        , out_run_status
        , out_status_message
        , out_error_msg
    );
    cmxlb.debug_print('get_batchgroup_status: ' || ' code='|| ret_val || ' message='|| out_error_msg || ' | status=' || out_status_message || ' | out_rowid_batchgroup_log='|| out_rowid_batchgroup_log);
END;
/


Developing Custom Stored Procedures for Batch Jobs


This section describes how to create and register custom stored procedures for batch jobs that can be added to batch groups for your Siperian Hub implementation.

About Custom Stored Procedures

Siperian Hub also allows you to create and run custom stored procedures for batch groups. After developing the custom stored procedure, you must register it in order to make it available to users as batch jobs in the Batch Viewer and Batch Group tools in the Hub Console. To learn more about these tools, see the Siperian Hub Administrator’s Guide.

Required Execution Parameters for Custom Batch Jobs

The following parameters are required for custom batch jobs. During its execution, a custom batch job can call other MRM procedures, such as cmxut.set_metric_value, to register metrics.

Signature

PROCEDURE example_job(
    in_rowid_table_object IN  cmxlb.cmx_rowid      -- c_repos_table_object.rowid_table_object,
                                                   -- result of cmxut.REGISTER_CUSTOM_TABLE_OBJECT
  , in_user_name          IN  cmxlb.cmx_user_name  -- username calling the function
  , in_rowid_job          IN  cmxlb.cmx_rowid      -- c_repos_job_control.rowid_job, for reference,
                                                   -- do not update status
  , out_err_msg           OUT varchar              -- message about success or error
  , out_err_code          OUT int                  -- >=0: Completed successfully
                                                   -- <0: error
);



Parameters

Parameter  Description

in_rowid_table_object IN cmxlb.cmx_rowid  c_repos_table_object.rowid_table_object. Result of cmxut.REGISTER_CUSTOM_TABLE_OBJECT.

in_user_name IN cmxlb.cmx_user_name  User name calling the function.

in_rowid_job IN cmxlb.cmx_rowid  c_repos_job_control.rowid_job. For reference only; do not update status.

Returns

Parameter  Description

out_err_msg  Error message text.

out_err_code  Error code.

Example Custom Stored Procedure

/*<TOAD_FILE_CHUNK>*/
CREATE OR REPLACE PACKAGE cmxbg_example
AS
  PROCEDURE update_table(
      in_rowid_table_object IN  cmxlb.cmx_rowid      -- c_repos_table_object.rowid_table_object,
                                                     -- result of cmxut.REGISTER_CUSTOM_TABLE_OBJECT
    , in_user_name          IN  cmxlb.cmx_user_name  -- username calling the function
    , in_rowid_job          IN  cmxlb.cmx_rowid      -- c_repos_job_control.rowid_job, for reference,
                                                     -- do not update status
    , out_err_msg           OUT varchar              -- message about success or error
    , out_err_code          OUT int                  -- >=0: Completed successfully
                                                     -- <0: error
  );
END cmxbg_example;
/
/*<TOAD_FILE_CHUNK>*/
CREATE OR REPLACE PACKAGE BODY cmxbg_example
AS
  PROCEDURE update_table(
      in_rowid_table_object IN  cmxlb.cmx_rowid      -- c_repos_table_object.rowid_table_object,
                                                     -- result of cmxut.REGISTER_CUSTOM_TABLE_OBJECT
    , in_user_name          IN  cmxlb.cmx_user_name  -- username calling the function
    , in_rowid_job          IN  cmxlb.cmx_rowid      -- c_repos_job_control.rowid_job, for reference,
                                                     -- do not update status
    , out_err_msg           OUT varchar              -- message about success or error
    , out_err_code          OUT int                  -- >=0: Completed successfully
                                                     -- <0: error
  )
  AS
  BEGIN
    DECLARE
      cutoff_date      DATE;
      record_count     INT;
      run_status       INT;
      status_message   VARCHAR2 (2000);
      start_date       DATE := SYSDATE;
      mrm_rowid_table  cmxlb.cmx_rowid;
      obj_func_type    CHAR (1);
      job_id           CHAR (14);
      sql_stmt         VARCHAR2 (2000);
      table_name       VARCHAR2 (30);
      ret_code         INT;
      register_job_err EXCEPTION;
    BEGIN
      sql_stmt := 'alter session set nls_date_format=''dd mon yyyy hh24:mi:ss''';
      EXECUTE IMMEDIATE sql_stmt;

      cmxut.debug_print ('Start of custom batch job...');
      obj_func_type := 'A';

      SELECT rowid_table INTO mrm_rowid_table
        FROM c_repos_table_object
       WHERE rowid_table_object = in_rowid_table_object;

      SELECT start_run_date INTO cutoff_date
        FROM c_repos_job_control
       WHERE rowid_job = in_rowid_job;

      IF cutoff_date IS NULL THEN
        cutoff_date := SYSDATE - 7;
      END IF;

      -- procedure can be registered on different tables, so get the table_name
      SELECT table_name INTO table_name
        FROM c_repos_table rt, c_repos_table_object rto
       WHERE rto.ROWID_TABLE_OBJECT = in_rowid_table_object
         AND rto.ROWID_TABLE = rt.ROWID_TABLE;

      -- The real work!
      sql_stmt := 'update ' || table_name
               || ' set zip4 = ''0000'', last_update_date = ''' || cutoff_date || ''''
               || ' where zip4 is null';
      cmxut.debug_print (sql_stmt);
      EXECUTE IMMEDIATE sql_stmt;
      record_count := SQL%ROWCOUNT;
      COMMIT;

      -- for testing, sleep to make the procedure take longer
      -- dbms_lock.sleep(5);

      -- Set zero or many metrics about the job
      cmxut.set_metric_value (in_rowid_job,
                              1, -- c_repos_job_metric_type.metric_type_code
                              record_count,
                              out_err_code,
                              out_err_msg);
      COMMIT;

      IF record_count <= 0 THEN
        out_err_msg := 'Failed to update records.';
        out_err_code := -1;
      ELSE
        IF out_err_code >= 0 THEN
          out_err_msg := 'Completed successfully.';
        END IF; -- else keep success code and msg from set_metric_value
      END IF;
    EXCEPTION
      WHEN OTHERS THEN
        out_err_code := SQLCODE;
        out_err_msg := SUBSTR (SQLERRM, 1, 200);
    END;
  END;
END cmxbg_example;
/

Registering a Custom Stored Procedure

You must register a custom stored procedure with Siperian Hub in order to make it available to users in the Batch Group tool in the Hub Console. To register a custom stored procedure, call the following procedure, which records the custom job in c_repos_table_object:

cmxut.REGISTER_CUSTOM_TABLE_OBJECT

The same custom job can be registered multiple times for different tables (in_rowid_table).

Signature

PROCEDURE register_custom_table_object(
    in_rowid_table         cmxlb.cmx_rowid
  , in_obj_func_type_code  VARCHAR
  , in_obj_func_type_desc  VARCHAR
  , in_object_name         VARCHAR
);



Parameters

Parameter  Description

in_rowid_table cmxlb.cmx_rowid  Foreign key to c_repos_table.rowid_table. When the Hub Server calls the custom job in a batch group, this value is passed in.

in_obj_func_type_code  Job type code. Must be 'A' for batch group custom jobs.

in_obj_func_type_desc  Display name for the custom batch job in the Batch Groups tool in the Hub Console.

in_object_name  package.procedure name of the custom job.

Example

BEGIN
  cmxut.REGISTER_CUSTOM_TABLE_OBJECT (
      'SVR1.RS1B '                          -- c_repos_table.rowid_table
    , 'A'                                   -- job type, must be 'A' for batch group
    , 'cmxbg_example.update_table example'  -- display name
    , 'cmxbg_example.update_table'          -- package.procedure
  );
END;

8
Implementing Custom Buttons in Hub Console Tools

This chapter explains how, in your Siperian Hub implementation, you can add custom buttons to tools in the Hub Console that allow users to invoke external services on demand.

Chapter Contents

• About Custom Buttons in the Hub Console

• Adding Custom Buttons

About Custom Buttons in the Hub Console

In your Siperian Hub implementation, you can provide Hub Console users with custom buttons that can be used to extend your Siperian Hub implementation. Custom buttons can provide users with on-demand, real-time access to specialized data services. Custom buttons can be added to any of the following tools in the Hub Console: Merge Manager, Data Manager, and Hierarchy Manager.

Custom buttons can give users the ability to invoke a particular external service (such as retrieving data or computing results), perform a specialized operation (such as launching a workflow), or carry out other tasks. Custom buttons can be designed to access data services from a wide range of service providers, including, but not limited to, enterprise applications (such as CRM or ERP applications), external service providers (such as foreign exchange calculators, publishers of financial market indexes, or government agencies), and even Siperian Hub itself (see Siperian Services Integration Framework Guide).

For example, you could add a custom button that invokes a specialized cleanse function, offered as a Web service by a vendor, that cleanses data in the customer record that is currently selected in the Data Manager screen. When the user clicks the button, the underlying code would capture the relevant data from the selected record, create a request (possibly including authentication information) in the format expected by the Web service, and then submit that request to the Web service for processing. When the results are returned, the Data Manager displays the information in a separate Swing dialog (if you created one and if you implemented this as a client custom function) with the customer rowid_object from Siperian Hub.
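The request-building step in this scenario can be sketched as follows. This is an illustrative sketch only: the class CleanseRequestBuilder, the query parameter names (ors, baseObject, rowid), and the base object name C_CUSTOMER are hypothetical, and a real Web service would define its own request format. The inputs mirror the values that the custom function methods shown later in this chapter receive (ORS ID, base object UID, and record IDs).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper that turns the selected record keys into a URL query string. */
final class CleanseRequestBuilder {

    /** URL-encodes one value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java runtime must support.
            throw new IllegalStateException(e);
        }
    }

    /** Builds the query string; record IDs are trimmed in case they are space-padded. */
    static String buildQuery(String orsId, String baseObjectUid, String[] recordIds) {
        StringBuilder sb = new StringBuilder();
        sb.append("ors=").append(encode(orsId));
        sb.append("&baseObject=").append(encode(baseObjectUid));
        for (String id : recordIds) {
            sb.append("&rowid=").append(encode(id.trim()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("localhost-mrm-XU_3009", "C_CUSTOMER",
                new String[] {"1001 ", "1002"}));
    }
}
```

Encoding each value separately keeps characters such as spaces or ampersands in a key from corrupting the request.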

Custom buttons are not installed by default, nor are they required for every Siperian Hub implementation. For each custom button that you want to add, you need to implement a Java interface, package the implementation in a JAR file, and deploy it by running a command-line utility. To control the appearance of the custom button in the Hub Console, you can supply either text or an icon graphic in any Swing-compatible graphic format (such as JPG, PNG, or GIF).
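One way to supply an icon graphic is to load it from the JAR that contains the custom function. The following is a minimal sketch under assumptions: the class ButtonIconLoader and the resource path are hypothetical, and returning null when no image is found follows the sample functions in this chapter, which return null for the icon and provide button text instead.

```java
import java.net.URL;

import javax.swing.Icon;
import javax.swing.ImageIcon;

/** Hypothetical helper for loading a button icon bundled in the custom function JAR. */
final class ButtonIconLoader {

    /**
     * Looks up the image on the classpath of the given class.
     * Returns null if the resource is missing, so the caller can fall back to text.
     */
    static Icon loadOrNull(Class<?> owner, String resourcePath) {
        URL url = owner.getResource(resourcePath);
        return (url == null) ? null : new ImageIcon(url);
    }

    public static void main(String[] args) {
        // "/icons/cleanse.png" is a made-up path used only for illustration.
        Icon icon = loadOrNull(ButtonIconLoader.class, "/icons/cleanse.png");
        System.out.println(icon == null ? "no icon bundled, use text" : "icon loaded");
    }
}
```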

How Custom Buttons Appear in the Hub Console

This section shows how custom buttons, once implemented, will appear in the Merge Manager and Data Manager tools of the Hub Console.



Custom Buttons in the Merge Manager

Custom buttons are displayed in the top panel of the Merge Manager screen, as shown in the following example.



Custom Buttons in the Data Manager

Custom buttons are displayed in the top panel of the Data Manager screen, as shown in the following example.

What Happens When a User Clicks a Custom Button

When a user clicks a custom button in the Hub Console, the Hub Console invokes the request, passing content and context to the external service. Examples include record keys and other data from a base object, package information, and so on. Execution is asynchronous—the user can continue to work in the Hub Console while the request is processed.

The custom code can process the service response as appropriate—log the results, display the data to the user in a separate Swing dialog (if custom-coded and the custom function is client-side), allow users to copy and paste the results into a data entry field, execute real-time PUTs of the data back into the correct business objects, and so on.


Adding Custom Buttons

To add a custom button to the Hub Console in your Siperian Hub implementation, complete the following tasks:

1. Determine the details of the external service that you want to invoke, such as the format and parameters for request and response messages.

2. Write and package the business logic that the custom button will execute, as described in “Writing a Custom Function” on page 155.

3. Deploy the package so that it appears in the applicable tool(s) in the Hub Console, as described in “Deploying Custom Buttons” on page 159.

Once an external service button is visible in the Hub Console, users can click the button to invoke the service.

Writing a Custom Function

To build an external service invocation, you write a custom function that executes the application logic when a user clicks the custom button in the Hub Console. The application logic implements the following Java interface:

com.siperian.mrm.customfunctions.api.CustomFunction

To learn more about this interface, see the Javadoc that accompanies your Siperian Hub distribution.

Server-Based and Client-Based Custom Functions

Execution of the application logic occurs on either the client or the server:

Environment  Description

Client  UI-based custom function. Recommended when you want to display elements in the user interface, such as a separate dialog that displays response information. To learn more, see “Example Client-Based Custom Function” on page 156.

Server  Server-based custom button. Recommended when it is preferable to call the external service from the server for network or performance reasons. To learn more, see “Example Server-Based Function” on page 157.

Example Custom Functions

This section provides the Java code for two example custom functions that implement the com.siperian.mrm.customfunctions.api.CustomFunction interface. The code simply prints (on standard error) information to the server log or the Hub Console log.

Example Client-Based Custom Function

The name of the client function class for the following sample code is com.siperian.mrm.customfunctions.test.TestFunctionClient.

//=====================================================================
//project: Siperian Master Reference Manager, Hierarchy Manager
//---------------------------------------------------------------------
//copyright: Siperian Inc. (c) 2003-2006. All rights reserved.
//=====================================================================

package com.siperian.mrm.customfunctions.test;

import java.awt.Frame;
import java.util.Properties;

import javax.swing.Icon;

import com.siperian.mrm.customfunctions.api.CustomFunction;

public class TestFunctionClient implements CustomFunction {

    public void executeClient(Properties properties, Frame frame, String username,
            String password, String orsId, String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds) {
        System.err.println("Called custom test function on the client with the following parameters:");
        System.err.println("Username/Password: '" + username + "'/'" + password + "'");
        System.err.println(" ORS Database ID: '" + orsId + "'");
        System.err.println("Base Object Rowid: '" + baseObjectRowid + "'");
        System.err.println(" Base Object UID: '" + baseObjectUid + "'");
        System.err.println(" Package Rowid: '" + packageRowid + "'");
        System.err.println(" Package UID: '" + packageUid + "'");
        System.err.println(" Record Ids: ");
        for (int i = 0; i < recordIds.length; i++) {
            System.err.println(" '" + recordIds[i] + "'");
        }
        System.err.println(" Properties: " + properties.toString());
    }

    public void executeServer(Properties properties, String username, String password,
            String orsId, String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds) {
        System.err.println("This method will never be called because getExecutionType() returns CLIENT_FUNCTION");
    }

    public String getActionText() {
        return "Test Client";
    }

    public int getExecutionType() {
        return CLIENT_FUNCTION;
    }

    public Icon getGuiIcon() {
        return null;
    }
}

Example Server-Based Function

The name of the server function class for the following code is com.siperian.mrm.customfunctions.test.TestFunction.

//=====================================================================
//project: Siperian Master Reference Manager, Hierarchy Manager
//---------------------------------------------------------------------
//copyright: Siperian Inc. (c) 2003-2006. All rights reserved.
//=====================================================================

package com.siperian.mrm.customfunctions.test;

import java.awt.Frame;
import java.util.Properties;

import javax.swing.Icon;

import com.siperian.mrm.customfunctions.api.CustomFunction;

/**
 * This is a sample custom function that is executed on the Server.
 * To deploy this function, put it in a jar file and upload the jar file
 * to the DB using DeployCustomFunction.
 */
public class TestFunction implements CustomFunction {

    public String getActionText() {
        return "Test Server";
    }

    public Icon getGuiIcon() {
        return null;
    }

    public void executeClient(Properties properties, Frame frame, String username,
            String password, String orsId, String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds) {
        System.err.println("This method will never be called because getExecutionType() returns SERVER_FUNCTION");
    }

    public void executeServer(Properties properties, String username, String password,
            String orsId, String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds) {
        System.err.println("Called custom test function on the server with the following parameters:");
        System.err.println("Username/Password: '" + username + "'/'" + password + "'");
        System.err.println(" ORS Database ID: '" + orsId + "'");
        System.err.println("Base Object Rowid: '" + baseObjectRowid + "'");
        System.err.println(" Base Object UID: '" + baseObjectUid + "'");
        System.err.println(" Package Rowid: '" + packageRowid + "'");
        System.err.println(" Package UID: '" + packageUid + "'");
        System.err.println(" Record Ids: ");
        for (int i = 0; i < recordIds.length; i++) {
            System.err.println(" '" + recordIds[i] + "'");
        }
        System.err.println(" Properties: " + properties.toString());
    }

    public int getExecutionType() {
        return SERVER_FUNCTION;
    }
}
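Both samples state that one of the two execute methods is never called, because getExecutionType() selects the side on which the function runs. The sketch below illustrates that contract with a local stand-in interface. It is not the Hub Console's actual dispatch code, which this guide does not show. The names CustomFunctionSketch, RecordingFunction, and DispatchSketch are hypothetical, and the constant values are assumptions. Only the method signatures are taken from the samples above.

```java
import java.awt.Frame;
import java.util.Properties;

/** Local stand-in for the execution part of CustomFunction; constant values are assumed. */
interface CustomFunctionSketch {
    int CLIENT_FUNCTION = 0; // assumed value
    int SERVER_FUNCTION = 1; // assumed value

    int getExecutionType();

    void executeClient(Properties properties, Frame frame, String username,
            String password, String orsId, String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds);

    void executeServer(Properties properties, String username, String password,
            String orsId, String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds);
}

/** Minimal implementation that records which method was invoked. */
final class RecordingFunction implements CustomFunctionSketch {
    private final int type;
    final StringBuilder log = new StringBuilder();

    RecordingFunction(int type) {
        this.type = type;
    }

    public int getExecutionType() {
        return type;
    }

    public void executeClient(Properties properties, Frame frame, String username,
            String password, String orsId, String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds) {
        log.append("client:").append(recordIds.length);
    }

    public void executeServer(Properties properties, String username, String password,
            String orsId, String baseObjectRowid, String baseObjectUid,
            String packageRowid, String packageUid, String[] recordIds) {
        log.append("server:").append(recordIds.length);
    }
}

final class DispatchSketch {
    /** Calls exactly one execute method, chosen by getExecutionType(). */
    static void dispatch(CustomFunctionSketch fn, Frame frame, Properties props,
            String[] recordIds) {
        // The literal arguments below are placeholders for the real context values.
        if (fn.getExecutionType() == CustomFunctionSketch.CLIENT_FUNCTION) {
            fn.executeClient(props, frame, "user", "secret", "orsId",
                    "boRowid", "boUid", "pkgRowid", "pkgUid", recordIds);
        } else {
            fn.executeServer(props, "user", "secret", "orsId",
                    "boRowid", "boUid", "pkgRowid", "pkgUid", recordIds);
        }
    }

    public static void main(String[] args) {
        RecordingFunction fn = new RecordingFunction(CustomFunctionSketch.SERVER_FUNCTION);
        dispatch(fn, null, new Properties(), new String[] {"1001", "1002"});
        System.out.println(fn.log); // prints "server:2"
    }
}
```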

Controlling the Custom Button Appearance

To control the appearance of the custom button in the Hub Console, you implement one of the following methods in the com.siperian.mrm.customfunctions.api.CustomFunction interface:

Method  Description

getActionText  Specify the text for the button label. Uses the default visual appearance for custom buttons.

getGuiIcon  Specify the icon graphic in any Swing-compatible graphic format (such as JPG, PNG, or GIF). This image file can be bundled with the JAR for this custom function.

Custom buttons are displayed alphabetically by name in the Hub Console.

Deploying Custom Buttons

Before users can see the custom buttons in the Hub Console, you need to explicitly add them using the DeployCustomFunction utility from the command line.

To deploy custom buttons:

1. Open a command prompt.

2. Run the DeployCustomFunction utility by specifying the following command at the command prompt:

3. When prompted, specify the database type.

4. When prompted, specify database connection information:

• Oracle: database host, port, service, login username, and password.

• DB2: host, port, database, schema, username, and password.



5. The DeployCustomFunction tool displays a menu of the following options.

Label  Description

(L)ist  Displays a list of currently-defined custom buttons.

(A)dd  Adds a new custom button. The DeployCustomFunction tool prompts you to specify:
• the JAR file for your custom button
• the name of the custom function class that implements the com.siperian.mrm.customfunctions.api.CustomFunction interface
• the type of the custom button: d (Data Manager), m (Merge Manager), and/or h (Hierarchy Manager). You can specify one, two, or three letters.

(U)pdate  Updates the JAR file for an existing custom button. The DeployCustomFunction tool prompts you to specify:
• the rowID of the custom button to update
• the JAR file for your custom button
• the name of the custom function class that implements the com.siperian.mrm.customfunctions.api.CustomFunction interface
• the type of the custom button: d (Data Manager), m (Merge Manager), and/or h (Hierarchy Manager). You can specify one, two, or three letters.

(C)hange Type  Changes the type of an existing custom button. The DeployCustomFunction tool prompts you to specify:
• the rowID of the custom button to update
• the type of the custom button: d (Data Manager), m (Merge Manager), and/or h (Hierarchy Manager). You can specify one, two, or three letters.

(S)et Properties  Specifies a properties file, which defines name/value pairs that the custom function requires at execution time (name=value). The DeployCustomFunction tool prompts you to specify the properties file to use.

(D)elete  Deletes an existing custom button. The DeployCustomFunction tool prompts you to specify the rowID of the custom button to delete.

(Q)uit  Exits the DeployCustomFunction tool.

6. When you have finished choosing your actions, choose (Q)uit.

7. Refresh the browser window to display the custom button you just added.

8. Test your custom button to ensure that it works properly.
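The name=value pairs supplied through (S)et Properties use the standard Java properties format, and the execute methods of a custom function receive a java.util.Properties argument. The following sketch shows one way a custom function might read such settings with defaults. It is an illustration under assumptions: the class FunctionSettings and the property names service.url and service.timeout.seconds are hypothetical, and this guide does not spell out how the deployed properties file is handed to the function.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Hypothetical helper for reading custom function settings from name=value pairs. */
final class FunctionSettings {

    /** Parses properties-file text (name=value lines) into a Properties object. */
    static Properties parse(String text) throws IOException {
        Properties p = new Properties();
        p.load(new StringReader(text));
        return p;
    }

    /** Reads the timeout, falling back to 30 seconds when the property is absent. */
    static int timeoutSeconds(Properties p) {
        return Integer.parseInt(p.getProperty("service.timeout.seconds", "30").trim());
    }

    public static void main(String[] args) throws IOException {
        Properties p = parse("service.url=http://example.invalid/cleanse\n"
                + "service.timeout.seconds=15\n");
        System.out.println(p.getProperty("service.url") + " " + timeoutSeconds(p));
    }
}
```

Supplying a default for each optional setting keeps the button usable even when the properties file omits it.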


Glossary

accept limit

A number that determines the acceptability of a match. The accept limit is defined by Siperian within a population in accordance with its match purpose.

Admin source system

Default source system. Used for manual trust overrides and data edits from the Data Manager or Merge Manager tools. See source system.

administrator

Siperian Hub user who has the primary responsibility for configuring the Siperian Hub system. Administrators access Siperian Hub through the Hub Console, and use Siperian Hub tools to configure the objects in the Hub Store, and create and modify Siperian Hub security.

authentication

Process of verifying the identity of a user to ensure that they are who they claim to be. In Siperian Hub, users are authenticated based on their supplied credentials—user name / password, security payload, or a combination of both. Siperian Hub provides an internal authentication mechanism and also supports user authentication via third-party authentication providers. See credentials, security payload.


authorization

Process of determining whether a user has sufficient privileges to access a requested Siperian Hub resource. In Siperian Hub, resource privileges are allocated to roles. Users and user groups are assigned to roles. A user’s resource privileges are determined by the roles to which they are assigned, as well as by the roles assigned to the user group(s) to which the user belongs. See user, user group, role, resource, and privilege.

autolink

Process of linking records automatically. For link-style base objects only. Match rules can result in automatic linking or manual linking. A match rule that instructs Siperian Hub to perform an autolink will link two or more records of a base object table automatically, without manual intervention. See manual link, link-style base object.

automerge

Process of merging records automatically. For merge-style base objects only. Match rules can result in automatic merging or manual merging. A match rule that instructs Siperian Hub to perform an automerge will combine two or more records of a base object table automatically, without manual intervention. See manual merge, merge-style base object.

base object

A table that contains information about an entity that is relevant to your business, such as customer or account.

batch group

A collection of individual batch jobs (for example, Stage, Load, and Match jobs) that can be executed with a single command. Each batch job in a group can be executed sequentially or in parallel to other jobs. See also batch job.


batch job

A program that, when executed, completes a discrete unit of work (a process). For example, the Match job carries out the match process, checking the specified match condition for the records of a base object table and then queueing the matched records for either automerge (Automerge job) or manual merge (Manual Merge job). See also batch group.

batch mode

Way of interacting with Siperian Hub via batch jobs, which can be executed in the Hub Console or using third-party management tools to schedule and execute batch jobs (in the form of stored procedures) on the database server. See also real-time mode, batch job, batch group, stored procedure.

best version of the truth

A record that has been consolidated with the best cells of data from the source records. Sometimes abbreviated as BVT. The precise definition depends on the base object style:

• For merge-style base objects, the base object record is the BVT record, and is built by consolidating with the most-trustworthy cell values from the corresponding source records.

• For link-style base objects, the BVT Snapshot job will build the BVT record(s) by consolidating with the most-trustworthy cell values from the corresponding linked base object records and return to the requestor a snapshot for consumption.

bulk merge

See automerge.

bulk unmerge

See unmerge.


BVT

See best version of the truth.

cascade unmerge

During the unmerge process, if this feature is enabled, when records in the parent object are unmerged, Siperian Hub also unmerges affected records in the child base object. See unmerge.

cell

Intersection of a column and a record in a table. A cell contains a data value or null.

cleanse

See data cleansing.

cleanse engine

A cleanse engine is a third-party product used to perform data cleansing with Siperian Hub.

cleanse function

Code that changes the incoming data during Stage jobs, converting each input string to an output string. Typically, these functions are used to standardize data and thereby optimize the match process. By combining multiple cleanse functions, you can perform complex filtering and standardization.

cleanse list

A logical grouping of cleanse functions that are executed at run time in a predefined order. See cleanse function, data cleansing.


column

In a table, a set of data values of a particular type, one for each row of the table.

conditional mapping

A mapping between a column in a landing table and a staging table that uses a SQL WHERE clause to conditionally select only those records in the landing table that meet the filter condition. See mapping, distinct mapping.

consolidation process

Process of merging or linking duplicate records into a single record. The goal in Siperian Hub is to identify and eliminate all duplicate data and to merge or link them together into a single, consolidated record while maintaining full traceability.

consolidation indicator

Represents the state of a record in a base object. Stored in the CONSOLIDATION_IND column. The consolidation indicator is one of the following values:

Indicator Value  Meaning  Purpose

1  Consolidated  Indicates that the record has been determined to be unique.

2  Queued for Merge  Indicates that the record has gone through the match process.

3  Queued for Match  Indicates that the record is ready to be put through the match process against the rest of the records in the base object.

4  New  Indicates that the record has been newly loaded into the base object and has not gone through the match process.

9  On hold  Indicates that the Data Steward has put the record on hold, to deal with later.


control table

A type of system table in an ORS that Siperian Hub automatically creates for a base object. Control tables are used in support of the stage and load processes. For each trust-enabled column in a base object, Siperian Hub maintains a record (the last update date and an identifier of the source system) in a corresponding control table.

credentials

What a user supplies at login time to gain access to Siperian Hub resources. Credentials are used during the authentication process to determine whether a user is who they claim to be. Login credentials might be a user name and password, a security payload (such as a security token or some other binary data), or a combination of user name/password and security payload. See authentication, security payload.

cross-reference table

A type of system table in an ORS that Siperian Hub automatically creates for a base object. For each record of the base object, the cross-reference table contains one record per source system. This record contains the primary key from the source system and the most recent value that the source system has provided for each cell in the base object table.

Customer Data Integration (CDI)

A discipline within Master Data Management (MDM) that focuses on customer master data and its related attributes. See master data.



data steward

Siperian Hub user who has the primary responsibility for data quality. Data stewards access Siperian Hub through the Hub Console, and use Siperian Hub tools to configure the objects in the Hub Store.

data type

Defines the characteristics of permitted values in a table column, such as characters, numbers, dates, binary data, and so on. Siperian Hub uses a common set of data types for columns that map directly to data types for the database platform (Oracle or DB2) used in your Siperian Hub implementation.

database

Organized collection of data in the Hub Store. Siperian Hub supports two types of databases: a Master Database and an Operational Record Store (ORS). See Master Database, Operational Record Store (ORS), and Hub Store.

data cleansing

The process of standardizing data content and layout, decomposing and parsing text values into identifiable elements, verifying identifiable values (such as zip codes) against data libraries, and replacing incorrect values with correct values from data libraries. See cleanse function.

Data Manager

Tool used to review the results of all merges—including automatic merges—and to correct data content if necessary. It provides you with a view of the data lineage for each base object record. The Data Manager also allows you to unmerge previously merged records, and to view different types of history on each consolidated record.


datasource

In the application server environment, a datasource is a JDBC resource that identifies information about a database, such as the location of the database server, the database name, the database user ID and password, and so on. Siperian Hub needs this information to communicate with an ORS.

decay curve

Visually shows the way that trust decays over time. Its shape is determined by the configured decay type and decay period. See decay period, decay type.

decay period

The amount of time (days, weeks, months, quarters, and years) that it takes for the trust level to decay from the maximum trust level to the minimum trust level. See decay curve, decay type.

decay type

The way that the trust level decreases during the decay period. See linear decay, RISL decay, SIRL decay, decay curve, decay period.

delta detection

During the stage process, Siperian Hub only processes new or changed records when this feature is enabled. Delta detection can be done either by comparing entire records or via a date column.

dependent object

A table that is used to store detailed information about the records in a base object (for example, supplemental notes). One record in a base object table can map to multiple records in a dependent object table.


distinct mapping

A mapping between a column in a landing table and a staging table that selects only the distinct records from the landing table. Using distinct mapping is useful in situations in which you have a single landing table feeding multiple staging tables and the landing table is denormalized (for example, it contains both customer and address data). See mapping, conditional mapping.

distinct source system

A source system that provides data that gets inserted into the base object without being consolidated. See source system.

downgrade

Operation that occurs during the load process when a validation rule reduces the trust for a record by a percentage.

duplicate

One or more records in which the data in certain columns (such as name, address, or organization data) is identical or nearly identical. Match rules executed during the match process determine whether two records are sufficiently similar to be considered duplicates for consolidation purposes.

entity

In Hierarchy Manager, an entity is any object, person, organization, place or thing that has meaning and can be acted upon in your database. Examples include a specific person’s name, a specific checking account number, a specific company, a specific address, and so on. See entity type.

entity base object

An entity base object is a base object used to store information about Hierarchy Manager entities. See entity type and entity.


entity type

In Hierarchy Manager, an entity type is a logical classification of one or more entities. Examples include doctors, checking accounts, banks, and so on. All entities with the same entity type are stored in the same entity object. In the HM Configuration tool, entity types are displayed in the navigation tree under the Entity Object with which the Type is associated. See entity.

exact match

A match / search strategy that matches only records that are identical. If you specify an exact match, you can define only exact match columns for this base object (exact-match base objects cannot have fuzzy match columns). A base object that uses the exact match / search strategy is called an exact-match base object. See also match / search strategy, fuzzy match.

external application user

Siperian Hub user who accesses Siperian Hub data indirectly via third-party applications.

extract-transform-load (ETL) tool

A software tool (external to Siperian Hub) that extracts data from a source system, transforms the data (using rules, lookup tables, and other functionality) to convert it to the desired state, and then loads (writes) the data to a target database. For Siperian Hub implementations, ETL tools are used to extract data from source systems and populate the landing tables.

foreign key

In a relational database, a column (or set of columns) whose value corresponds to a primary key value in another table (or, in rare cases, the same table). The foreign key acts as a pointer to the other table. For example, the Department_Number column in the Employee table would be a foreign key that points to the primary key of the Department table.
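The Department/Employee example can be sketched with Python's built-in sqlite3 module. This is only an illustration of the general relational concept, not of how Siperian Hub itself defines keys; the table layouts, the Employee_Number and Name columns, and the sample rows are assumptions made up for the sketch.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE Department ("
    " Department_Number INTEGER PRIMARY KEY,"
    " Name TEXT)"
)
conn.execute(
    "CREATE TABLE Employee ("
    " Employee_Number INTEGER PRIMARY KEY,"
    " Name TEXT,"
    " Department_Number INTEGER REFERENCES Department(Department_Number))"
)

conn.execute("INSERT INTO Department VALUES (10, 'Finance')")
# Valid: Department_Number 10 exists in the parent table.
conn.execute("INSERT INTO Employee VALUES (1, 'Pat', 10)")


def try_insert_orphan():
    """Try to insert an employee that points at a missing department."""
    try:
        conn.execute("INSERT INTO Employee VALUES (2, 'Lee', 99)")
        return True
    except sqlite3.IntegrityError:
        # The database rejects the row because department 99 does not exist.
        return False


orphan_inserted = try_insert_orphan()
employee_count = conn.execute("SELECT COUNT(*) FROM Employee").fetchone()[0]
```

The rejected insert is the same kind of parent-child enforcement that the referential integrity entry describes.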


fuzzy match

A match / search strategy that uses probabilistic matching, which takes into account spelling variations, possible misspellings, and other differences that can make matching records non-identical. If selected, Siperian Hub adds a special column (Fuzzy Match Key) to the base object. This column is the primary field used during searching and matching to generate match candidates for this base object. All fuzzy base objects have one and only one Fuzzy Match Key. A base object that uses the fuzzy match / search strategy is called a fuzzy-match base object. Using fuzzy match requires a selected population. See also match / search strategy, exact match, and population.

GET

A Siperian Hub operation that gets the specified (by key) record and, optionally, content metadata from a base object. See PUT.

global business identifier (GBID)

A column that contains common identifiers (key values) that allow you to uniquely and globally identify a record based on your business needs. Examples include:

• Identifiers defined by applications external to Siperian Hub, such as ERP or CRM systems.

• Identifiers defined by external organizations, such as industry-specific codes (AMA numbers, DEA numbers, and so on), or government-issued identifiers (social security number, tax ID number, driver’s license number, and so on).

global role

A role (list of assigned verbs) that applies to the entire Operational Record Store (ORS).

hierarchy

In Hierarchy Manager, a set of relationship types. These relationship types are not ranked based on the place of the entities of the hierarchy, nor are they necessarily related to each other. They are merely relationship types that are grouped together for ease of classification and identification. See hierarchy type, relationship, relationship type.

hierarchy type

In Hierarchy Manager, a logical classification of hierarchies. The hierarchy type is the general class of hierarchy under which a particular relationship falls. See hierarchy.

history table

A type of table in an ORS that contains historical information about changes to an associated table. History tables provide detailed change-tracking options, including merge and unmerge history, history of the pre-cleansed data, history of the base object, and history of the cross-reference.

HM package

A Hierarchy Manager package represents a subset of an MRM package and contains the metadata needed by Hierarchy Manager.

hotspot

In business data, a group of records representing overmatched data—a large intersection of matches.

Hub Store

In a Siperian Hub implementation, the database that contains the Master Database and one or more Operational Record Stores (ORSs). See Master Database, Operational Record Store (ORS).

immutable source

A data source that always provides the best, final version of the truth for a base object. Records from an immutable source will be accepted as unique and, once a record from that source has been fully consolidated, it will not be changed, even in the event of a merge. Immutable sources are also distinct systems. For all source records from an immutable source system, the consolidation indicator for Load and PUT is always 1 (consolidated record).

implementer

Siperian Hub user who has the primary responsibility for designing, developing, testing, and deploying Siperian Hub according to the requirements of an organization. Tasks include (but are not limited to) creating design objects, building the schema, defining match rules, performance tuning, and other activities.

incremental load

Any load process that occurs after a base object has undergone its initial data load. Called incremental loading because only new or updated data is loaded into the base object. Duplicate data is ignored. See initial data load.

initial data load

The very first time that data is loaded into an empty base object. During the initial data load, all records in the staging table are inserted into the base object as new records.

intertable matching

Process of matching on the match columns of a child base object. Match columns can be used to match on a match column from a child base object, which in turn can be based on any text column or combination of text columns in the child base object. See match column, match process.

job execution log

In the Batch Viewer and Batch Group tools, a log that shows job completion status with any associated messages, such as success, failure, or warning.


job execution script

For Siperian Hub implementations, a script that is used in job scheduling software (such as Tivoli or CA Unicenter) that executes Siperian Hub batch jobs via stored procedures.

key match job

A Siperian Hub batch job that matches records from two or more sources when these sources use the same primary key. Key Match jobs compare new records to each other and to existing records, and then identify potential matches based on the comparison of source record keys as defined by the primary key match rules. See primary key match rule, match process.

key type

Identifies important characteristics about the match key to help Siperian Hub generate keys correctly and conduct better searches. Siperian Hub provides the following match key types: Person_Name, Organization_Name, and Address_Part1. See match process.

key width

Determines how fast searches are during match, the number of possible match candidates returned, and how much disk space the keys consume. Key width options are Standard, Extended, Limited, and Preferred. Key widths apply to fuzzy match objects only. See match process.

land process

Process of populating landing tables from a source system. See source system, landing table.

landing table

A table where a source system puts data that will be processed by Siperian Hub.


linear decay

The trust level decreases in a straight line from the maximum trust to the minimum trust. See decay type, trust.
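A straight-line decay from maximum to minimum trust can be written as a simple interpolation. The sketch below is an assumption-based illustration, not Siperian Hub's actual calculation: the function name, the parameter names, and the idea of measuring age and decay period in the same arbitrary unit are all hypothetical.

```python
def linear_trust(max_trust, min_trust, decay_period, age):
    """Illustrative linear decay: trust falls in a straight line from
    max_trust (age 0) to min_trust (age >= decay_period).

    All names are hypothetical; age and decay_period share one unit.
    """
    if min_trust > max_trust:
        raise ValueError("minimum trust must not exceed maximum trust")
    if decay_period <= 0:
        raise ValueError("decay period must be positive")
    # Fraction of the decay period that has elapsed, clamped to [0, 1].
    fraction = min(max(age / decay_period, 0.0), 1.0)
    return max_trust - (max_trust - min_trust) * fraction


# Halfway through the decay period the value sits halfway between the bounds.
halfway = linear_trust(max_trust=90, min_trust=30, decay_period=100, age=50)
```

Once the decay period has elapsed, the value stays at the minimum trust, which matches the description of minimum trust as the level a value has when it is "old".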

link process

Process of linking two or more records in a base object table because they have the same value (or very similar values) in the specified match columns. Contrast with merge process. See consolidation process, autolink, manual link, manual unlink.

link-style base object

Type of base object that is used with Siperian Hub’s match and link capabilities. Link-style base objects have an associated LINK table. See link process.

load insert

When records are inserted into the target table (base object or dependent object). During the load process, if a record in the staging table does not already exist in the target table, then Siperian Hub inserts the record into the target table. See load process, load update.

load process

Process of loading data from a staging table into the corresponding base object or dependent object in the Hub Store. If the new data overlaps with existing data in the Hub Store, Siperian Hub uses trust settings and validation rules to determine which value is more reliable. See trust, validation rule, load insert, load update.

load update

When records are updated in the target table (base object or dependent object). During the load process, if a record in the staging table already exists in the target table, then Siperian Hub updates the record in the target table. See load process, load insert.
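The difference between a load insert and a load update comes down to whether the incoming record already exists in the target table. The following sketch only illustrates that decision; the function, the "key" field, and the sample records are hypothetical, and it deliberately ignores the trust settings and validation rules that the load process applies to overlapping data.

```python
def classify_load(staging_records, target_keys):
    """Split staging records into would-be inserts and would-be updates.

    staging_records: list of dicts, each with a hypothetical "key" field.
    target_keys: set of keys already present in the target table.
    """
    inserts, updates = [], []
    for record in staging_records:
        if record["key"] in target_keys:
            updates.append(record)   # record already exists: load update
        else:
            inserts.append(record)   # record is new: load insert
    return inserts, updates


staging = [
    {"key": "A1", "name": "Acme"},
    {"key": "B2", "name": "Birch"},
    {"key": "C3", "name": "Cedar"},
]
existing = {"A1", "C3"}
inserts, updates = classify_load(staging, existing)
```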


lookup

Process of retrieving a data value from a parent table during Load jobs. In Siperian Hub, when configuring a staging table associated with a base object, if a foreign key column in the staging table (as the child table) is related to the primary key in a parent table, you can configure a lookup to retrieve data from that parent table.

manual link

Process of linking records manually. For link-style base objects only. Match rules can result in automatic linking or manual linking. A match rule that instructs Siperian Hub to perform a manual link identifies records that have enough points of similarity to warrant attention from a data steward, but not enough points of similarity to allow the system to automatically link the records. See autolink, link-style base object.

manual merge

Process of merging records manually. For merge-style base objects only. Match rules can result in automatic merging or manual merging. A match rule that instructs Siperian Hub to perform a manual merge identifies records that have enough points of similarity to warrant attention from a data steward, but not enough points of similarity to allow the system to automatically merge the records. See automerge, merge-style base object.

manual unmerge

Process of unmerging records manually. For merge-style base objects only. See manual merge, merge-style base object.

manual unlink

Process of unlinking records manually. For link-style base objects only. See manual link, link-style base object.


mapping

Defines a set of transformations that are applied to source data. Mappings are used during the stage process (or via a Siperian Hub operation) to transfer data from a landing table to a staging table. A mapping identifies the source column in the landing table and the target column to populate in the staging table, along with any intermediate cleanse functions used to clean the data. See conditional mapping, distinct mapping.

master data

A collection of common, core entities—along with their attributes and their values—that are considered critical to a company's business, and that are required for use in two or more systems or business processes. Examples of master data include customer, product, employee, supplier, and location data. See Master Data Management (MDM), Customer Data Integration (CDI).

Master Data Management (MDM)

The controlled process by which the master data is created and maintained as the system of record for the enterprise. MDM is implemented in order to ensure that the master data is validated as correct, consistent, and complete, and—optionally—circulated in context for consumption by internal or external business processes, applications, or users. See master data, Customer Data Integration (CDI).

Master Database

Database that contains all the Siperian Hub metadata, including configuration settings and other information that Siperian Hub requires to run properly. The default name of the Master Database is CMX_SYSTEM. See also Operational Record Store (ORS).


match

The process of determining whether two records should be automatically merged or should be candidates for manual merge because the two records have identical or similar values in the specified columns. See match process.

match codes

Strings of characters representing the contents of the data to be compared. During the match process, the more complex match types result in the generation of sophisticated match codes based on the degree of similarity required. See also tokenizing, match process.

match column

A column that is used in a match rule for comparison purposes. Each match column is based on one or more columns from the base object. See match process.

match column rule

Match rule that is used to match records based on the values in columns you have defined as match columns, such as last name, first name, address1, and address2. See primary key match rule, match process.

match list

Custom-built standardization lists that you define. Functions, by contrast, are pre-defined and provide access to specialized cleansing functionality such as address verification or address decomposition. See match process.

match path

Allows you to traverse the hierarchy between records—whether that hierarchy exists between base objects (inter-table paths) or within a single base object (intra-table paths). Match paths are used for configuring match column rules involving related records in either separate tables or in the same table.


match process

Process of comparing two records for points of similarity. If sufficient points of similarity are found to indicate that two records probably are duplicates of each other, Siperian Hub flags those records for merging.

match purpose

For fuzzy-match base objects, defines the primary goal behind a match rule. For example, if you're trying to identify matches for people where address is an important part of determining whether two records are for the same person, then you would use the Match Purpose called Resident. Each match purpose contains knowledge about how best to compare two records to achieve the purpose of the match. Siperian Hub uses the selected match purpose as a basis for applying the match rules to determine matched records. The behavior of the rules is dependent on the selected purpose. See match process.

match rule

Defines the criteria by which Siperian Hub determines whether records might be duplicates. Match columns are combined into match rules to determine the conditions under which two records are regarded as being similar enough to merge. Each match rule tells Siperian Hub the combination of match columns it needs to examine for points of similarity. See match process.

match rule set

A logical collection of match rules that allows users to execute different sets of rules at different stages in the match process. Match rule sets include a search level that dictates the search strategy, any number of automatic and manual match rules, and, optionally, a filter that allows you to selectively include or exclude records during the match process. Match rule sets are used to execute match column rules but not primary key match rules. See match process.


match subtype

Used with base objects that contain different types of data, such as an Organization base object containing customer, vendor, and partner records. Using match subtyping, you can apply match rules to specific types of data within the same base object. For each match rule, you specify an exact match column that will serve as the “subtyping” column to filter out the records that you want to ignore for that match rule. See match process.

match table

Type of system table, associated with a base object, that supports the match process. During the execution of a Match job for a base object, Siperian Hub populates its associated match table with the ROWID_OBJECT values for each pair of matched records, as well as the identifier for the match rule that resulted in the match, and an automerge indicator. See match process.

match token

Strings that encode the columns used to identify candidates for matching. See match process.

match type

Each match column has a match type that determines how the match column will be tokenized in preparation for the match comparison. See match process.

match / search strategy

Specifies the reliability of the match versus the performance you require: fuzzy or exact. An exact match / search strategy is faster, but an exact match will miss some matches if the data is imperfect. See fuzzy match, exact match, match process.


maximum trust

The trust level that a data value will have if it has just been changed. For example, if source system A changes a phone number field from 555-1234 to 555-4321, the new value will be given system A’s maximum trust level for the phone number field. By setting the maximum trust level relatively high, you can ensure that changes in the source systems will usually be applied to the base object.

merge process

Process of combining two or more records of a base object table because they have the same value (or very similar values) in the specified match columns. Contrast with link process. See consolidation process, automerge, manual merge, manual unmerge.

merge-style base object

Type of base object that is used with Siperian Hub’s match and merge capabilities. See merge process.

Merge Manager

Tool used to review and take action on the records that are queued for manual merging.

message

In Siperian Hub, refers to a Java Message Service (JMS) message. A message queue server handles two types of JMS messages:

• Inbound messages are used for the asynchronous processing of Siperian Hub service invocations.

• Outbound messages provide a communication channel to distribute data changes via JMS to source systems or other systems.


message queue

A mechanism for transmitting data from one process to another (for example, from Siperian Hub to an external application).

message queue rule

A mechanism for identifying base object events and transferring the affected records to the internal system for update. Message queue rules are supported for updates, merges, and records accepted as unique.

message queue server

In Siperian Hub, a Java Message Service (JMS) server, defined in your application server environment, that Siperian Hub uses to manage incoming and outgoing JMS messages.

message trigger

A rule that is fired when a particular action occurs within Siperian Hub. When an action occurs for which a rule is defined, a JMS message is placed in the outbound message queue. A message trigger also specifies the queue in which messages are placed.

metadata

Data that is used to describe other data. In Siperian Hub, metadata is used to describe the schema (data model) that is used in your Siperian Hub implementation. Metadata describes the various schema definition components—tables, columns, indexes, key relationships, and so on—in the Hub Store. See also schema, metadata validation.

metadata validation

Process of verifying the completeness and integrity of the metadata that describes a repository (ORS). The Metadata Manager tool runs this process. See also metadata, Operational Record Store (ORS).


minimum trust

The trust level that a data value will have when it is “old” (after the decay period has elapsed). This value must be less than or equal to the maximum trust. If the maximum and minimum trust are equal, the decay curve is a flat line and the decay period and decay type have no effect. See also decay period.

non-equal matching

When configuring match rules, prevents equal values in a column from matching each other. Non-equal matching applies only to exact match columns.

null value

The absence of a value in a column of a record. Null is not the same as blank or zero.
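The distinction between null, blank, and zero shows up directly in how a relational database filters rows. The sketch below uses Python's built-in sqlite3 module with a made-up Contact table; the table, its columns, and the sample rows are assumptions for illustration only and are not part of any Siperian Hub schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Contact (Name TEXT, Phone TEXT, Visits INTEGER)")
conn.executemany(
    "INSERT INTO Contact VALUES (?, ?, ?)",
    [
        ("A", None, None),      # no phone and no visit count recorded (NULL)
        ("B", "", 0),           # blank phone and a visit count of zero
        ("C", "555-1234", 3),   # populated values
    ],
)


def names_where(condition):
    """Return the names of contacts matching a fixed SQL condition."""
    sql = "SELECT Name FROM Contact WHERE " + condition + " ORDER BY Name"
    return [row[0] for row in conn.execute(sql)]


null_phones = names_where("Phone IS NULL")   # only the missing value
blank_phones = names_where("Phone = ''")     # only the empty string
zero_visits = names_where("Visits = 0")      # only the numeric zero
```

Each condition selects a different row, which is why a null cannot be treated as interchangeable with an empty string or a zero.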

operation

Siperian Hub operation (API) that allows external applications to access specific Siperian Hub functionality via the Services Integration Framework (SIF) using a request/response model. See also GET, PUT.

Operational Record Store (ORS)

Database that contains all of the data you load or create within the Siperian Hub system, including all master record data. A Siperian Hub configuration can have one or more ORS databases. The default name of an ORS is CMX_ORS. See also Master Database.

overmatching

For fuzzy-match base objects only, a match that results in too many matches, including matches that are not relevant. When configuring match, the goal is to find the optimal number of matches for your data. See undermatching.


package

A package is a public view of one or more underlying tables in Siperian Hub. Packages represent subsets of the columns in those tables, along with any other tables that are joined to the tables. A package is based on a query. The underlying query can select a subset of records from the table or from another package.

password policy

Specifies password characteristics for Siperian Hub user accounts, such as the password length, expiration, login settings, password re-use, and other requirements. You can define a global password policy for all user accounts in a Siperian Hub implementation, and you can override these settings for individual users.

path

See match path.

policy decision points (PDPs)

In Siperian Hub implementations, specific security check points that determine, at run time, the validity of a user’s identity (authentication), along with that user’s access to Siperian Hub resources (authorization).

policy enforcement points (PEPs)

In Siperian Hub implementations, specific security check points that enforce, at run time, security policies for authentication and authorization requests.

population

Defines certain characteristics about data in the records that you are matching. By default, Siperian Hub comes with the US population, but Siperian provides a standard population per country. Populations account for the inevitable variations and errors that are likely to exist in name, address, and other identification data; specify how Siperian Hub builds match tokens; and specify how search strategies and match purposes operate on the population of data to be matched. Used only with the Fuzzy match/search strategy.

primary key

In a relational database table, a column (or set of columns) whose value uniquely identifies a record. For example, the Department_Number column would be the primary key of the Department table.

primary key match rule

Match rule that is used to match records from two systems that use the same primary keys for records. See also match column rule.

private resource

A protected Siperian Hub resource that is hidden from the Roles tool, preventing its access via Services Integration Framework (SIF) operations. When you add a new resource in Hub Console (such as a new base object), it is designated a PRIVATE resource by default. See also secure resource, resource.

privilege

Permission to access a Siperian Hub resource. With Siperian Hub internal authorization, each role is assigned one of the following privileges.

Privilege    Allows the User To

READ         View but not change data.

CREATE       Create data records in the Hub Store.

UPDATE       Update data records in the Hub Store.

MERGE        Merge and unmerge data.

EXECUTE      Execute cleanse functions and batch groups.


Privileges determine the access that external application users have to Siperian Hub resources. For example, a role might be configured to have READ, CREATE, UPDATE, and MERGE privileges on particular packages and package columns. These privileges are not enforced when using the Hub Console, although the settings still affect the use of Hub Console to some degree. See secure resource, role.

profile

In Hierarchy Manager, describes what fields and records an HM user may display, edit, or add. For example, one profile can allow full read/write access to all entities and relationships, while another profile can be read-only (no add or edit operations allowed).

provider

See security provider.

provider property

A name-value pair that a security provider might require in order to access the service(s) that it provides.

PUT

A Siperian Hub operation that inserts or updates a record in the base object. See GET.

query

A request to retrieve data from the Hub Store. Siperian Hub allows administrators to specify the criteria used to retrieve that data. Queries can be configured to return selected columns, filter the result set with a WHERE clause, use complex query syntax (such as GROUP BY, SORT BY, and HAVING clauses), and use aggregate functions (such as SUM, COUNT, and AVG).
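As a rough illustration of the clauses listed above, the sketch below runs one SQL query that selects columns, filters with WHERE, groups with GROUP BY, filters groups with HAVING, and uses the COUNT and SUM aggregate functions. It uses Python's built-in sqlite3 module against a made-up Customer table, so the table, columns, and data are assumptions rather than anything defined in the Hub Store. Standard SQL spells the sorting clause ORDER BY, which is what the sketch uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (Name TEXT, State TEXT, Balance REAL)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [
        ("Ann", "CA", 100.0),
        ("Bob", "CA", 250.0),
        ("Cy", "NY", 75.0),
        ("Di", "NY", 0.0),
        ("Ed", "TX", 40.0),
    ],
)

rows = conn.execute(
    "SELECT State, COUNT(*), SUM(Balance) "   # selected columns and aggregates
    "FROM Customer "
    "WHERE Balance > 0 "                      # row-level filter
    "GROUP BY State "                         # one result row per state
    "HAVING COUNT(*) >= 2 "                   # keep only states with 2+ rows
    "ORDER BY State"                          # sort the result set
).fetchall()
```

Only California has at least two customers with a positive balance, so the result holds a single row for that state.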


query group

A logical group of queries. A query group is simply a mechanism for organizing queries. See query.

raw table

A table that archives data from a landing table.

real-time mode

Way of interacting with Siperian Hub via third-party applications, which invoke Siperian Hub operations via the Services Integration Framework (SIF) interface. SIF provides operations for various services, such as reading, cleansing, matching, inserting, and updating records. See also batch mode, Services Integration Framework (SIF).

record

A row in a table that represents an instance of an object. For example, in an Address table, a record contains a single address.

referential integrity

Enforcement of parent-child relationship rules among tables based on configured foreign key relationship.

regular expression

A computational expression that is used to match and manipulate text data according to commonly-used syntactic conventions and symbolic patterns. In Siperian Hub, a regular expression function allows you to use regular expressions for cleanse operations. To learn more about regular expressions, including syntax and patterns, refer to the Javadoc for java.util.regex.Pattern.


reject table

A table that contains records that Siperian Hub could not insert into a target table, such as:

• staging table (stage process) after performing the specified cleansing on a record of the specified landing table

• Hub Store table (load process)

A record could be rejected because the value of a cell is too long, or because the record’s update date is later than the current date.

relationship

In Hierarchy Manager, describes the affiliation between two specific entities. Hierarchy Manager relationships are defined by specifying the relationship type, hierarchy type, attributes of the relationship, and dates for when the relationship is active. See relationship type, hierarchy.

relationship base object

A relationship base object is a base object used to store information about Hierarchy Manager relationships.

relationship type

Describes general classes of relationships. The relationship type defines:

• the types of entities that a relationship of this type can include

• the direction of the relationship (if any)

• how the relationship is displayed in the Hub Console

See relationship, hierarchy.

repository

See Operational Record Store (ORS).


resource

Any Siperian Hub component that is used in your Siperian Hub implementation. Certain resources can be configured as secure resources: base objects, dependent objects, mappings, packages, remote packages, cleanse functions, HM profiles, the audit table, and the users table. In addition, you can configure secure resources that are accessible by SIF operations, including content metadata, match rule sets, metadata, batch groups, the audit table, and the users table. See private resource, secure resource, resource group.

resource group

A logical collection of secure resources that simplifies privilege assignment by allowing you to assign privileges to multiple resources at once, for example by assigning a resource group to a role. See resource, privilege.

RISL decay

Rapid Initial Slow Later decay puts most of the decrease at the beginning of the decay period. The trust level follows a concave parabolic curve. If a source system has this decay type, a new value from the system will probably be trusted but this value will soon become much more likely to be overridden.

role

Defines a set of privileges to access secure Siperian Hub resources. See user, user group, privilege.

row

See record.

rule

See match rule.


rule set

See match rule set.

rule set filtering

Ability to exclude records from being processed by a match rule set. For example, if you had an Organization base object that contained multiple types of organizations (customers, vendors, prospects, partners, and so on), you could define a match rule set that selectively processed only vendors. See match process.

sandbox

In Hierarchy Manager, a virtual playground where users with the appropriate privileges (as defined in their HM Profile) can manipulate relationship criteria without impacting the original master record. In this way, users can predict the results of their manipulations and determine whether they will be meaningful in the context for which they were created.

schema

The data model that is used in a customer’s Siperian Hub implementation. Siperian Hub does not impose or require any particular schema. The schema is independent of the source systems.

search levels

Defines how stringently Siperian Hub searches for matches: narrow, typical, exhaustive, or extreme. The goal is to find the optimal number of matches for your data—not too few (undermatching), which misses significant matches, or too many (overmatching), which generates too many matches, including insignificant ones. See overmatching, undermatching.


secure resource

A protected Siperian Hub resource that is exposed to the Roles tool, allowing the resource to be added to roles with specific privileges. When a user account is assigned to a specific role, then that user account is authorized to access the secure resources via SIF according to the privileges associated with that role. In order for external applications to access a Siperian Hub resource via SIF operations, that resource must be configured as SECURE. Because all Siperian Hub resources are PRIVATE by default, you must explicitly make a resource SECURE after the resource has been added. See also private resource, resource.

security

The ability to protect information privacy, confidentiality, and data integrity by guarding against unauthorized access to, or tampering with, data and other resources in your Siperian Hub implementation. See also authentication, authorization, privilege, resource.

security provider

A third-party organization that provides security services (authentication, authorization, and user profile services) for users accessing Siperian Hub.

security payload

Raw binary data returned by a Siperian Hub operation request that can contain supplemental data required for further authentication and/or authorization.

Status Setting    Description

SECURE            Exposes this Siperian Hub resource to the Roles tool, allowing the resource to be added to roles with specific privileges.

PRIVATE           Hides this Siperian Hub resource from the Roles tool. Default. Prevents its access via Services Integration Framework (SIF) operations. When you add a new resource in Hub Console (such as a new base object), it is designated a PRIVATE resource by default.


segment matching

Way of limiting match rules to specific subsets of data. For example, you could define different match rules for customers in different countries by using segment matching to limit certain rules to specific country codes. Segment matching is configured on a per-rule basis and applies to both exact-match and fuzzy-match base objects.

Services Integration Framework (SIF)

The part of Siperian Hub that interfaces with client programs. Logically, it serves as a middle tier in the client/server model. It enables you to implement the request/response interactions using any of the following architectural variations:

• Loosely coupled Web services using the SOAP protocol.

• Tightly coupled Java remote procedure calls based on Enterprise JavaBeans (EJBs) or XML.

• Asynchronous Java Message Service (JMS)-based messages.

• XML documents going back and forth via Hypertext Transfer Protocol (HTTP).

Each of the above SIF protocols sits on top of the native Siperian Hub protocol, which accepts requests in the form of XML documents or EJBs and returns responses the same way.

SIRL decay

Slow Initial Rapid Later decay puts most of the decrease at the end of the decay period. The trust level follows a convex parabolic curve. If a source system has this decay type, it will be relatively unlikely for any other system to override the value that it sets until the value is near the end of its decay period.
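The contrast between SIRL decay and RISL decay is easiest to see by evaluating both curves at the same point in the decay period. The sketch below is only an approximation: this glossary says the curves are parabolic but does not give their equations, so the two quadratic formulas, the function names, and the parameters are assumptions chosen to reproduce the described shapes, not Siperian Hub's actual calculation.

```python
def _elapsed_fraction(decay_period, age):
    """Fraction of the decay period that has passed, clamped to [0, 1]."""
    if decay_period <= 0:
        raise ValueError("decay period must be positive")
    return min(max(age / decay_period, 0.0), 1.0)


def sirl_trust(max_trust, min_trust, decay_period, age):
    """Assumed SIRL shape: little loss early, most of the loss near the end."""
    x = _elapsed_fraction(decay_period, age)
    return max_trust - (max_trust - min_trust) * x ** 2


def risl_trust(max_trust, min_trust, decay_period, age):
    """Assumed RISL shape: most of the loss early, little loss near the end."""
    x = _elapsed_fraction(decay_period, age)
    return min_trust + (max_trust - min_trust) * (1.0 - x) ** 2


# Halfway through a 100-unit decay period, with trust falling from 90 to 30:
sirl_mid = sirl_trust(90, 30, 100, 50)   # still close to the maximum
risl_mid = risl_trust(90, 30, 100, 50)   # already close to the minimum
```

Under these assumed formulas both curves start at the maximum trust and end at the minimum trust; they differ only in where within the decay period the drop is concentrated.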

source system

A system that provides data to Siperian Hub. See distinct source system.


stage process

Process of reading the data from the landing table, performing any configured cleansing, and moving the cleansed data into the corresponding staging table. If you enable delta detection, Siperian Hub only processes new or changed records. See staging table, landing table.

staging table

A table where cleansed data is temporarily stored before being loaded into base objects and dependent objects via load jobs. See stage process, load process.

stored procedure

A named set of Structured Query Language (SQL) statements that are compiled and stored on the database server. Siperian Hub batch jobs are encoded in stored procedures so that they can be run using job execution scripts in job scheduling software (such as Tivoli or CA Unicenter).

stripping

Deprecated term. See tokenizing.

strip table

Deprecated term. See token table.

system column

A column in a table that contains Siperian Hub metadata. For each type of table in an ORS, Siperian Hub automatically creates system columns. Typical system columns for a base object include ROWID_OBJECT, CONSOLIDATION_IND, and LAST_UPDATE_DATE. See column.

table

In a database, a collection of data that is organized in rows (records) and columns. A table can be seen as a two-dimensional set of values corresponding to an object. The columns of a table represent characteristics of the object, and the rows represent instances of the object. In the Hub Store, the Master Database and each Operational Record Store (ORS) represents a collection of tables. Base objects and dependent objects are stored as tables in an ORS.

target database

In the Hub Console, the Master Database or an Operational Record Store (ORS) that is the target of the current tool. Tools that manage data stored in the Master Database, such as the Users tool, require that your target database is the Master Database. Tools that manage data stored in an ORS require that you specify which ORS to

token table

When you specify a match column, Siperian Hub creates a special key called a match key (also known as a token string) on a special table called the token table (formerly referred to as the strip table). Before the Siperian Hub Match batch job runs, it first ensures that the correct match keys have been generated in the token table. The match job compares the match keys according to the match rules that have been defined to determine which records are duplicates. See also tokenizing.

tokenizing

Specialized form of data standardization that is performed before the match comparisons are done. For the most basic match types, tokenizing simply removes “noise” characters like spaces and punctuation. The more complex match types result in the generation of sophisticated match codes—strings of characters representing the contents of the data to be compared—based on the degree of similarity required. See also token table, match codes.
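For the most basic case described above, removing noise characters, a minimal Python sketch might look like the following. The function name and the exact set of characters treated as noise are assumptions. The more complex match types generate match codes through logic that is not shown here.

```python
import re

# Assumption: "noise" means anything that is not a letter or a digit.
_NOISE = re.compile(r"[\W_]+")


def basic_token(value):
    """Strip spaces and punctuation so that formatting differences do not block a match."""
    return _NOISE.sub("", value)
```

With this sketch, "O'Brien, J." and "OBrien J" both reduce to the same string, so a simple comparison would treat them as equal.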


traceability

The maintenance of data so that you can determine which systems—and which records from those systems—contributed to consolidated records.

transactional data

Represents the actions performed by an application, typically captured or generated by an application as part of its normal operation. It is usually maintained by only one system of record, and tends to be accurate and reliable in that context. For example, your bank probably has only one application for managing transactional data resulting from withdrawals, deposits, and transfers made on your checking account.

trust

Mechanism for measuring the confidence factor associated with each cell based on its source system, change history, and other business rules. Trust takes into account the age of data, how much its reliability has decayed over time, and the validity of the data.

trust level

For a source system that provides records to Siperian Hub, a number between 0 and 100 that assigns a level of confidence and reliability to that source system, relative to other source systems. The trust level has meaning only when compared with the trust level of another source system.

trust score

The current level of confidence in a given record. During load jobs, Siperian Hub calculates the trust score for each record. If validation rules are defined for the base object, then the Load job applies these validation rules to the data, which might further downgrade trust scores. During the consolidation process, when two records are candidates for merge or link, the values in the record with the higher trust score win. Data stewards can manually override trust scores in the Merge Manager tool.
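The "higher trust score wins" behavior can be sketched as follows. The data layout (each record as a dictionary mapping a column name to a value and trust score pair), the per-column comparison, and the tie-breaking rule are assumptions made for illustration only.

```python
def surviving_cells(record_a, record_b):
    """Pick, column by column, the value that carries the higher trust score.

    Each record maps column name -> (value, trust_score).
    Assumption: on a tie, the value from record_a is kept.
    """
    merged = {}
    for column in record_a.keys() | record_b.keys():
        candidates = [rec[column] for rec in (record_a, record_b) if column in rec]
        # max() returns the first maximal item, so record_a wins ties.
        merged[column] = max(candidates, key=lambda cell: cell[1])[0]
    return merged
```

A manual override by a data steward would correspond to replacing one of the trust scores before the comparison runs.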

undermatching

For fuzzy-match base objects only, a match configuration that returns too few matches and therefore misses relevant ones. When configuring match, the goal is to find the optimal number of matches for your data. See overmatching.

unlink

Process of unlinking previously-linked records. For link-style base objects only. See manual unlink, link-style base object.

unmerge

Process of unmerging previously-merged records. For merge-style base objects only. See manual unmerge, merge-style base object, cascade unmerge.

user

An individual (person or application) who can access Siperian Hub resources. Users are represented in Siperian Hub by user accounts, which are defined in the Master Database. See user group, Master Database.

user group

A logical collection of user accounts. See user.

validation rule

Rule that tells Siperian Hub the condition under which a data value is not valid. When data meets the criteria specified by the validation rule, the trust value for that data is downgraded by the percentage specified in the validation rule. If the Reserve Minimum Trust flag is set for the column, then the trust cannot be downgraded below the column’s minimum trust.
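The downgrade arithmetic can be sketched in Python as shown below. The sketch assumes that "downgraded by the percentage" means a multiplicative reduction of the current trust value. The function and parameter names are hypothetical and are not part of any Siperian Hub API.

```python
def downgrade_trust(trust, downgrade_pct, min_trust=0.0, reserve_minimum_trust=False):
    """Apply a validation-rule downgrade to a trust value.

    Assumption: the downgrade reduces the trust value by downgrade_pct percent.
    When reserve_minimum_trust is set, the result never drops below min_trust.
    """
    if not 0 <= downgrade_pct <= 100:
        raise ValueError("downgrade_pct must be between 0 and 100")
    downgraded = trust * (1 - downgrade_pct / 100)
    if reserve_minimum_trust:
        downgraded = max(downgraded, min_trust)
    return downgraded
```

Under these assumptions, a trust value of 80 with a 75 percent downgrade falls to 20.0, unless the column's minimum trust of 30 is reserved, in which case it stops at 30.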


workbench

In the Hub Console, a mechanism for grouping similar tools. A workbench is a logical collection of related tools. For example, the Cleanse workbench contains cleanse-related tools: Cleanse Match Server, Cleanse Functions, and Mappings.

write lock

In the Hub Console, a lock that is required in order to make changes to the underlying schema. All non-data steward tools (except the ORS security tools) are in read-only mode unless you acquire the write lock. Write locks prevent multiple users from making changes to the same data at the same time.

