

Informatica Data Quality (Version 8.6.1)

User Guide


Informatica Data Quality User Guide
Version 8.6.1
September 2008

Copyright (c) 2001–2008 Informatica Corporation.

All rights reserved.

This software and documentation contain proprietary information of Informatica Corporation and are provided under a license agreement containing restrictions on use and disclosure and are also protected by copyright law. Reverse engineering of the software is prohibited. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. This Software may be protected by U.S. and/or international Patents and other Patents Pending.

Use, duplication, or disclosure of the Software by the U.S. Government is subject to the restrictions set forth in the applicable software license agreement and as provided in DFARS 227.7202-1(a) and 227.7202-3(a) (1995), DFARS 252.227-7013(c)(1)(ii) (OCT 1988), FAR 12.212(a) (1995), FAR 52.227-19, or FAR 52.227-14 (ALT III), as applicable.

The information in this product or documentation is subject to change without notice. If you find any problems in this product or documentation, please report them to us in writing.

Informatica, PowerCenter, PowerCenterRT, PowerCenter Connect, PowerCenter Data Analyzer, PowerExchange, PowerMart, Metadata Manager, Informatica Data Quality, Informatica Data Explorer, Informatica B2B Data Exchange and Informatica On Demand are trademarks or registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties, including without limitation: Copyright © Sun Microsystems. All rights reserved. Copyright © Platon Data Technology GmbH. All rights reserved. Copyright © Melissa Data Corporation. All rights reserved. Copyright © 1995-2006 MySQL AB. All rights reserved

This product includes software developed by the Apache Software Foundation (http://www.apache.org/). The Apache Software is Copyright © 1999-2006 The Apache Software Foundation. All rights reserved.

ICU is Copyright (c) 1995-2003 International Business Machines Corporation and others. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of the ICU software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so.

ACE(TM) and TAO(TM) are copyrighted by Douglas C. Schmidt and his research group at Washington University, University of California, Irvine, and Vanderbilt University, Copyright (c) 1993-2006, all rights reserved.

Tcl is copyrighted by the Regents of the University of California, Sun Microsystems, Inc., Scriptics Corporation and other parties. The authors hereby grant permission to use, copy, modify, distribute, and license this software and its documentation for any purpose.

InstallAnywhere is Copyright © Macrovision (Copyright ©2005 Zero G Software, Inc.) All Rights Reserved.

Portions of this software use the Swede product developed by Seaview Software (www.seaviewsoft.com).

This product includes software developed by the JDOM Project (http://www.jdom.org/). Copyright © 2000-2004 Jason Hunter and Brett McLaughlin. All rights reserved.

This product includes software developed by the JFreeChart project (http://www.jfree.org/freechart/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/copyleft/lgpl.html. These materials are provided free of charge by Informatica, “as is”, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

This product includes software developed by the JDIC project (https://jdic.dev.java.net/). Your right to use such materials is set forth in the GNU Lesser General Public License Agreement, which may be found at http://www.gnu.org/copyleft/lgpl.html. These materials are provided free of charge by Informatica, “as is”, without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and fitness for a particular purpose.

This product includes software developed by L2FProd.com (http://common.l2fprod.com/). Your right to use such materials is set forth in the Apache License Agreement, which may be found at http://www.apache.org/licenses/LICENSE-2.0.html.

DISCLAIMER: Informatica Corporation provides this documentation “as is” without warranty of any kind, either express or implied, including, but not limited to, the implied warranties of non-infringement, merchantability, or use for a particular purpose. Informatica Corporation does not warrant that this software or documentation is error free. The information provided in this software or documentation may include technical inaccuracies or typographical errors. The information in this software and documentation is subject to change at any time without notice.

Part Number: IDQ-USG-86100-0002


Table of Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Informatica Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

Chapter 1: Informatica Data Quality Features and Functionality . . . . . . . . . . . . . . . . . 1

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Data Quality Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Project Manager and File Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Publishing Plans to Data Quality Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Exporting and Importing Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Running Plans: Local and Remote Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

Plan Resources and Plan Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

Version Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Working with Multiple Instances of a Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Organizing the Workbench User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Chapter 2: Data Source Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

CSV Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Database Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

Fixed Width Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

Realtime Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

SAP Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

CSV Match Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

CSV Dual Match Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

Database Match Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Group Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

Dual Group Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

CSV Identity Group Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

DB Identity Group Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Chapter 3: Data Target Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

CSV Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Fixed Width Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Report Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

CSV Merge Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

CSV Match Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

Match Key Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

Group Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

Database Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

Database Report Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

SAP Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38


Realtime Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Identity Group Target . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Chapter 4: Frequency Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

Count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

MinAvgMax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

Range Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

Missing Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Chapter 5: Analysis Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Character Labeller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

Token Labeller . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Chapter 6: Transformation Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Search Replace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

Word Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

To Upper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Rule Based Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

Scripting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

Chapter 7: Parsing Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Splitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

Token Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

Profile Standardizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

Context Parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

Chapter 8: Key Field Generator Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Soundex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

Nysiis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

Chapter 9: Matching Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

Identity Match . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88


Edit Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

Jaro Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

Hamming Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

Bigram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

Mixed Field Matcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Weight Based Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Chapter 10: Address Validation Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

Global AV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

Chapter 11: Dictionary Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

Dictionary Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

Updating Dictionary Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

Creating a Dictionary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

Chapter 12: Report Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Viewing Data in the Report Viewer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Standard View and Dashboard View . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

Viewing Plan Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

Report Viewer Parameters and Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

Tracking Changes in Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

Importing Report Files and Working with Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

Chapter 13: Deploying Plans for Runtime Execution . . . . . . . . . . . . . . . . . . . . . . . . 119

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

Deploying Runtime Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

Running a Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

Command Line Arguments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

Multi-Threading and Multi-Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

Appendix A: Rule Based Analyzer Rule Statements . . . . . . . . . . . . . . . . . . . . . . . . 127

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Functional Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

Appendix B: Global AV: Output Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . 131

Global AV Output Field Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

Appendix C: Search/Replace Operations and Noise Removal . . . . . . . . . . . . . . . . 135

Noise Removal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135


Appendix D: Matching Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

Matching Formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

Appendix E: SQL Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

Creating a MySQL Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

Use of MAX Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

Nested Groups and Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

Appendix F: ODBC Data Source Administrator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

Using the ODBC Data Source Administrator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

Appendix G: Character Encodings and Unicode . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

Character Encodings and Unicode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

Appendix H: Data Quality Workbench Toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

Data Quality Workbench Toolbar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

Appendix I: Output Options in the CSV Match Target . . . . . . . . . . . . . . . . . . . . . . . 147

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

Configuring the Outputs for Identified Matches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

Appendix J: Informatica Data Quality Naming Conventions . . . . . . . . . . . . . . . . . . 149

Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153


Preface

Welcome to Informatica Data Quality, the latest-generation data quality management system from Informatica Corporation. Informatica Data Quality empowers your organization to solve its data quality problems and achieve real, sustainable data quality improvements.

The high-level objectives for this guide are to describe the functionality of Informatica Data Quality in the following areas:

♦ How to build data quality plans using the data sources, data targets, and operational components available in the Workbench user interface.

♦ How to manage your data quality projects, plans, and associated resource files through Informatica Data Quality Workbench.

♦ How to use dictionaries and reference data content.

This document builds on the Getting Started Guide. Before reading this document, Data Quality users should read the Getting Started Guide to familiarize themselves with data quality concepts and product capabilities.

Note: The Informatica Data Quality Integration for PowerCenter is not documented in this guide. For more information on the Data Quality Integration, see the Data Quality Integration for PowerCenter Guide.

Informatica Resources

Informatica Customer Portal

As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica Knowledge Base, Informatica Documentation Center, and access to the Informatica user community.

Informatica Documentation

The Informatica Documentation team makes every effort to create accurate, usable documentation. If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation team through email at [email protected]. We will use your feedback to improve our documentation. Let us know if we can contact you regarding your comments.


Informatica Web Site

You can access the Informatica corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and sales offices. You will also find product and partner information. The services area of the site includes important information about technical support, training and education, and implementation services.

Informatica Knowledge Base

As an Informatica customer, you can access the Informatica Knowledge Base at http://my.informatica.com. Use the Knowledge Base to search for documented solutions to known technical issues about Informatica products. You can also find answers to frequently asked questions, technical white papers, and technical tips.

Informatica Global Customer Support

There are many ways to access Informatica Global Customer Support. You can contact a Customer Support Center by telephone, email, or the WebSupport Service.

Use the following email addresses to contact Informatica Global Customer Support:

[email protected] for technical inquiries

[email protected] for general customer service requests

WebSupport requires a user name and password. You can request a user name and password at http://my.informatica.com.

Use the following telephone numbers to contact Informatica Global Customer Support:

North America / South America
Informatica Corporation Headquarters, 100 Cardinal Way, Redwood City, California 94063, United States
Toll Free: +1 877 463 2435
Standard Rate: Brazil +55 11 3523 7761; Mexico +52 55 1168 9763; United States +1 650 385 5800

Europe / Middle East / Africa
Informatica Software Ltd., 6 Waltham Park, Waltham Road, White Waltham, Maidenhead, Berkshire SL6 3TN, United Kingdom
Toll Free: 00 800 4632 4357
Standard Rate: Belgium +32 15 281 702; France +33 1 41 38 92 26; Germany +49 1805 702 702; Netherlands +31 306 022 797; United Kingdom +44 1628 511 445

Asia / Australia
Informatica Business Solutions Pvt. Ltd., Diamond District, Tower B, 3rd Floor, 150 Airport Road, Bangalore 560 008, India
Toll Free: Australia 1 800 151 830; Singapore 001 800 4632 4357
Standard Rate: India +91 80 4112 5738


Chapter 1: Informatica Data Quality Features and Functionality

This chapter includes the following topics:

♦ Overview, 1

♦ Data Quality Plans, 2

♦ Project Manager and File Manager, 2

♦ Publishing Plans to Data Quality Server, 4

♦ Running Plans: Local and Remote Execution, 6

♦ Plan Resources and Plan Execution, 7

♦ Version Control, 8

♦ Working with Multiple Instances of a Plan, 11

♦ Organizing the Workbench User Interface, 11

Overview

This chapter discusses the project management, file management, and plan management options available through Data Quality, including the capabilities of Data Quality Workbench in conjunction with Data Quality Server. If you are running Data Quality Workbench in stand-alone or client-only mode, some functionality might not be available to you.

Note: For more information on the components that make up the Informatica Data Quality suite, see the Informatica Data Quality Installation Guide and the Getting Started with Data Quality Guide.


Data Quality Plans

Informatica Data Quality analyzes and enhances your source data through processes called plans that you create in its Workbench application. A data quality plan is a self-contained and executable set of data analysis or data enhancement steps built from one or more of the component types listed in Table 1-1.

A plan must contain at least one data source and data target. It can use any number of operational components. A plan that writes data directly from one file or database to another does not require operational components.

Figure 1-1 shows the components in a plan arranged in the Data Quality Workbench user interface:

The arrows indicate the direction of the data flow through the plan, from data source, through operational components, to data target.

Note: You can move components freely in the workspace. The arrows do not reliably indicate the precise order in which data progresses through the plan.

Each operational component in Workbench performs a different type of analysis or enhancement task on your data. You can configure an operational component to operate on a subset of the data it receives or to filter the data it makes available to other components in the component chain.

Many plans make use of text- or table-based reference dictionaries. Informatica provides a set of reference dictionary files with its Content Installer. You can add dictionaries to several components in Workbench, and you can define dictionaries in live tables within a database, ensuring that reference tables stay current.

You can edit and define your own dictionary files through the Dictionary Manager. Dictionary files are stored as text files (.DIC files) in a Dictionaries folder in the Informatica Data Quality directory.

Note: Data Quality dictionaries install through the Content Installer, a separate installer within the Informatica Data Quality installation. The Content Installer also installs any reference data and processing engine updates that you receive from Informatica.

Project Manager and File Manager

Workbench stores plans in the Data Quality repository and reads reference data from the file system. It provides separate browsers to view the contents of the repository and the file system.

Table 1-1. Data Quality Plan Components

Component     Required/Optional   Description
Data source   Required            Provides input data for the plan.
Data target   Required            Collects data output from the plan.
Operational   Optional            Performs the data analysis or data enhancement actions on the data they receive. Most plans contain multiple operational components.

Figure 1-1. Plan Components in the Data Quality Workspace


♦ Project Manager. Lists the plans and project folders in the local Data Quality repository and any available repositories on a Data Quality service domain. Allows you to organize plans in folders, publish plans from the local repository to a service domain repository, export plans to PowerCenter repositories, and run plans.

♦ File Manager. Allows you to access and move files within the local file system and across the service domain file system. With the File Manager, you can access any file type stored on a server.

In stand-alone installations of Data Quality Workbench, the File Manager and Project Manager provide access to the local system and local repository only.

To view the Project Manager:

In Informatica Data Quality Workbench, click the Projects tab.

To view the File Manager:

In Informatica Data Quality Workbench, click the Files tab.

Working with the File Manager

The File Manager provides visibility to a Data Quality service domain in the following ways:

♦ The names of the servers configured in the domain appear under the service domain name.

♦ The servers are host to the client user spaces and a shared file space for all users. These user spaces contain the dictionary files and other resource files for plans stored in the service domain repository.

♦ The server hosts a Dictionaries folder that all service domain repository plans can read from. This folder is created by the Data Quality installer and populated by the Content Installer.

♦ The local computer structure also appears.

To work with files within the File Manager, right-click a file or folder and select the required operation from the shortcut menu that appears. The permitted operations are as follows:

♦ (Create) New Folder

♦ Rename

♦ Delete

♦ Cut

♦ Copy

♦ Paste

♦ Refresh

♦ Open Externally

♦ Security

The following procedure illustrates how to use the File Manager.

Note: You cannot copy files from another system, such as Windows Explorer, into File Manager folders.

To copy local files to the service domain with the File Manager:

1. Under the File Manager tab, browse the local folder structure and locate the required file.

2. Right-click the file name and select Copy from the context menu that appears.

3. On the service domain, expand the folders of the server to which you’ll copy the file and locate the destination folder.

4. Right-click the folder name and select Paste from the context menu that appears.


Publishing Plans to Data Quality Server

Publishing is the process of copying plans from a Workbench repository to a Data Quality Server repository. Publishing deploys plans in a networked environment, allowing domain users with appropriate permissions to access and execute the plans. Administrators set user permissions in the Data Quality Administration Console.

A published plan contains version control information that references the owner of the original plan, allowing the genealogy of plans to be traced across repositories.

To publish a plan from the local repository to a domain repository:

1. Right-click the plan(s) you want to publish.

2. Select Copy from the context menu.

3. Browse the domain repository and locate the folder where you would like to publish the plan(s).

4. Right-click the folder and select Paste from the context menu.

5. Copy all necessary plan resources to the server file system, ensuring that you recreate the folder path structures used in the source Workbench plan. For more information on placing resources in the correct locations, see “Implications for Plan Design” on page 8.

Note: When plans are published, the latest base version of the plan is used. Any changes saved since this version are not published. For more information about plan version control, see “Version Control and Plan Publication” on page 10.

Exporting and Importing Plans

Use Data Quality Workbench to export and import plans to and from your local repository. Export plans directly into the PowerCenter repository as mapplets, or export them as files that can be imported by other Data Quality users.

The following export and import options are available:

♦ Export plans directly into the PowerCenter repository as mapplets. Use this option to run Data Quality plans natively within PowerCenter.

♦ Export plans in XML format. XML plans can be used by the runtime version of Data Quality as part of command batch jobs or scheduled processes.

♦ Back up plans to Data Quality PLN files for storage.

♦ Import plans from PLN or XML formats. Informatica recommends importing from PLN files in order to preserve the layout of the original plan.

Exported and imported plans do not contain plan version histories.

Exporting Plans to PowerCenter

Use Workbench to export Data Quality plan metadata directly to a PowerCenter repository.

To export plans into a PowerCenter repository, perform the following steps:

1. Right-click the plan(s) you want to export.

2. Select Export > PowerCenter Mapplet > To PowerCenter Repository.

3. Enter your connection details in the ‘Connect to PowerCenter Repository’ dialog box. Ensure you select the correct PowerCenter repository version.

4. Choose a destination repository folder for the exported plans.


PowerCenter users can also import plan metadata to the PowerCenter repository if they have installed the Data Quality Integration transformation. PowerCenter runs plans saved through the Data Quality Integration transformation in mappings and sessions by loading an instance of the Data Quality engine. When you export a plan as a mapplet, PowerCenter runs its parent mapping and session within the PowerCenter engine.

Note: You can also export plans as PowerCenter mapplet files in XML form. To access this option, right-click and select Export > PowerCenter Mapplet > To XML File.

Exporting Plans for Runtime Use

Export plans as XML files for use during runtime execution. Runtime execution uses a command-line version of the Data Quality engine to run plans as part of a scheduled or batch process. For more information on runtime execution, see “Deploying Plans for Runtime Execution” on page 119.

To export a plan for runtime use:

1. Right-click on the plan(s) you want to export.

2. Select Export > IDQ Runtime Plan(s) (.xml).

3. Choose a destination folder for the XML plans, and click Select.

4. In the Export a Plan to XML dialog box, choose the operating system on which the plan will run and select OK. If the exported plans contain file-based sources or targets, you can perform the following actions in this dialog box:

♦ Change the paths for the sources or targets.

♦ Select OK to All to use the same paths for all file-based sources or targets.

5. Copy the exported XML file to the computer that will run the plans.

6. Copy all necessary source and reference files to the computer that will run the plans, ensuring that they are placed in the proper locations. For more information, see “Plan Resources and Plan Execution” on page 7.

Backing Up Plans

Create backup copies of your plans in PLN format. Do not create XML copies of plans for backup purposes. PLN files retain the original onscreen appearance of the plans.

To back up your plans:

1. Right-click on the plan(s) you want to export.

2. Select Export > Workbench Plan(s) (.pln).

3. Choose a destination folder, and click Select.

4. If reference files are required for the exported plans, back up these files to ensure that the backup plan is fully functional.

Importing Plans

Informatica recommends using PLN files as the source for your plan imports. While you can import XML plans, these plans separate all component instances into individual components. This greatly increases the visual complexity of many plans in the Workbench user interface. Export plans as XML files for runtime execution.

To import plans:

1. Right-click the destination project or folder for the imported plan.

2. Select Import > Workbench Plan(s) (.pln).

3. Choose a file, and click Select.


4. If source and reference files are required for the imported plans, verify that these files are available to Data Quality Workbench.

Running Plans: Local and Remote Execution

The plan execution process in Data Quality Workbench differs slightly for client-only license users and for users in client-server environments. Client-only license users define and run plans locally. Full Informatica Data Quality users can select any available plan in the service domain and run the plan on any available server. Any machine on the service domain can run a plan if it hosts an Execution service, the Informatica Data Quality service that executes the plan.

Before you run a plan, make sure all necessary resources, such as the data source files and any required reference data, are present on the computer that runs the plan and in locations recognized by Data Quality.

When you run a plan locally through your local Workbench, this is automatically the case unless you have moved resources between design time and execution. When you run a plan on a remote server, you must ensure that the necessary resources are present in the correct locations on the server that runs the plan.

In remote execution scenarios, it is possible for the Execution service and domain repository to reside on separate servers. The server that runs the plan is the server on which the Execution service is present.

Running a Data Quality Plan

Use the following procedure to run data quality plans in Workbench.

To run a data quality plan in Workbench:

1. Ensure the required plan is selected in the workspace.

2. Click the Run Plan toolbar button.

A dialog box opens with the plan name in its title bar.

3. Click Run.

The plan executes.

If you are connected to a Data Quality service domain, you can also select a remote Data Quality computer on which to run the plan. That is, you can specify the Execution service that will run the plan. You can run a plan from any repository available on the service domain. For example, you can open a plan from the domain repository on Server 1 and run the plan on Server 2.

The Run Plan dialog box features a progress bar that shows the percentage of the data processed as the plan executes. You can click the Stop button at any time to end plan execution and view the results so far.

This dialog box also has a menu that allows you to select the percentage of data to use in the plan. The default setting is 100 percent. You can select a smaller percentage if you want to test that a plan will run as anticipated. This can be useful if you have designed a complex plan that will take time to execute.

Reporting Options

As well as generating file-based and table-based output, Data Quality Workbench offers graphical reporting options. These include a proprietary format that lets you view high-level and fine-grained plan results, create scorecards, and export data to file. For more information, see “Report Viewer” on page 109.


Plan Resources and Plan Execution

Before you run a plan, check that all relevant files are available to the computer that runs it.

When you run a plan locally, the source data and reference data files are set when you configure the components. Unless you move the data between designing and running the plan, the locations are understood when you run the plan.

When you run a plan on a remote computer, the Data Quality Server reads the plan, identifies the original path to each resource, and replaces each path with a corresponding path on the server. The server replaces the Windows drive letter with your Files folder in the server host folder structure. Therefore, you must ensure that the source data and reference data files are available to the server in locations that the server expects.

Note: If you have used third-party data in the plan, ensure that the third-party data is installed in a location accessible to the Execution service that runs the plan.

The following sections describe how Data Quality handles resource files in cases of remote plan execution.

Data Source Files

Data Quality Server recognizes a specific set of folders as valid resource file locations. If a plan refers to a source file stored in the following location on the Workbench computer:

C:\Myfiles\File.txt

A Data Quality Server on Windows looks for the file here:

C:\Program Files\Informatica Data Quality\users\user.name\Files\Myfiles

A Data Quality Server on UNIX installed at /home/Informatica/DataQuality/ looks for the file here:

/home/Informatica/DataQuality/users/user.name/Files/Myfiles

For further information, see “Implications for Plan Design” on page 8.
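The following sketch illustrates the kind of path substitution described above. It is not Informatica code: the function name, server base path, and user name are assumptions chosen for illustration only.

import ntpath
import posixpath

def map_client_path_to_server(client_path, server_base, user_name):
    # Illustrative only: map a Windows client path such as C:\Myfiles\File.txt
    # to the per-user Files area that Data Quality Server searches.
    # server_base and user_name are assumed values, not fixed product settings.
    drive, remainder = ntpath.splitdrive(client_path)       # 'C:', '\Myfiles\File.txt'
    parts = [p for p in remainder.split("\\") if p]         # ['Myfiles', 'File.txt']
    return posixpath.join(server_base, "users", user_name, "Files", *parts)

# Reproduces the UNIX location shown above:
# map_client_path_to_server(r"C:\Myfiles\File.txt", "/home/Informatica/DataQuality", "user.name")
# -> /home/Informatica/DataQuality/users/user.name/Files/Myfiles/File.txt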

Note: If you have deployed a plan for runtime execution and your source file is located in a non-standard location, you can provide a parameter file with the runtime command that maps the original location to the required location.

Dictionary Files

Data Quality looks for dictionary files in a different way than it looks for source files.

The installation processes for Data Quality Workbench and Server create an empty Dictionaries folder under the top-level Informatica Data Quality folder. This folder is populated with dictionary files by the Content Installer.

By default, the Dictionaries folder is created at the following location on Windows systems:

C:\Program Files\Informatica Data Quality\Dictionaries

and at the following location on UNIX systems:

/home/Informatica/DataQuality/Dictionaries

Data Quality Server also creates a separate dictionary folder for each Data Quality user that connects into the service domain. The folder is created when the client user first opens the File Manager or first attempts to run a plan remotely.

A remotely-run plan first looks for dictionaries in the client user’s Dictionaries folder. If this folder does not contain the required dictionaries, the plan looks in the Dictionaries folder created during installation. Therefore, when you run a plan on the server, you do not need to copy dictionary files to your user dictionary folder on the server if those dictionaries already exist in the server’s dictionary folder.

By default, user dictionary folders are created in the following server locations:

♦ UNIX: /home/Informatica/DataQuality/users/user.name/Dictionaries

♦ Windows: C:\Program Files\Informatica Data Quality\users\user.name\Dictionaries
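The lookup order described above can be summarized in a short sketch. This is illustrative logic only, assuming the default locations listed; the function name, variable names, and the dictionary file name in the usage note are hypothetical.

import os

def find_dictionary(dic_name, install_dir, user_name):
    # Illustrative only: a remotely run plan checks the user's Dictionaries
    # folder first, then falls back to the shared Dictionaries folder created
    # by the Data Quality installer and populated by the Content Installer.
    candidates = [
        os.path.join(install_dir, "users", user_name, "Dictionaries", dic_name),
        os.path.join(install_dir, "Dictionaries", dic_name),
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None  # dictionary not available to the server

# Example with the UNIX defaults shown above and a hypothetical dictionary name:
# find_dictionary("Firstnames.DIC", "/home/Informatica/DataQuality", "user.name")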


Cross-Platform Plan File Conventions

Data Quality Server handles the translation of client-to-server file paths and Windows-to-UNIX file paths seamlessly. When a plan is opened on a Windows system, Data Quality ensures that all paths are in a Windows format, with folders separated by back slashes. When a plan is opened on a UNIX system, Data Quality renders all paths in UNIX format with folders separated by forward slashes. The transformations and file paths are case-sensitive and case-preserving.
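As a rough illustration of this separator translation (a sketch only; Data Quality Server performs the conversion internally, and the path in the usage note is hypothetical):

def render_path_for_host(path, host_os):
    # Illustrative only: re-render folder separators for the host platform
    # while preserving the case of every folder and file name.
    if host_os == "windows":
        return path.replace("/", "\\")
    return path.replace("\\", "/")

# render_path_for_host(r"Dictionaries\Address.DIC", "unix") -> "Dictionaries/Address.DIC"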

Implications for Plan Design

When you design a plan in Data Quality Workbench, you should ensure that the folders you create for file resources can map efficiently to the server folder structure.

For example, a plan runs in Workbench and reads a source file from the following location:

C:\Program Files\Informatica Data Quality\Sources

When this plan runs on a remote Windows machine, Data Quality Server looks for the source file in the following location:

C:\Program Files\Informatica Data Quality\users\user.name\Files\Program Files\Informatica Data Quality\Sources

The folder path Program Files\Informatica Data Quality is repeated here. In this case, good plan design suggests the creation of folders under C:\ that can be recreated efficiently on the server.

Version Control

Data Quality’s version control features enable you to save multiple versions of a plan, to view the plan version history, and to edit and run historical versions of the plan.

As well as the most recently-saved version of a plan, Data Quality stores any earlier versions that have been flagged for retention in the repository. This allows you to save versions of a plan at meaningful points in its development and to revert to earlier versions of the plan if necessary.

For the purposes of version control, each Data Quality plan has a latest version and one or more base versions.

♦ Latest version. The most recently-saved state of a plan.

♦ Base versions. Earlier versions that have been preserved in the repository.

When you save a plan for the first time, you automatically create a base version. If you do not create another base version, the plan version history shows details for that base version and the latest version only.

Note the following:

♦ A base version cannot be overwritten. If you are working in a base version and save your changes, the newly-saved state becomes the latest version.

♦ Version control does not keep every saved state of a plan. It is possible to open, edit, and save a plan multiple times without adding base versions to the version history.

♦ Version control applies to plans only. Version control does not apply to projects or to the external resources that a plan may require to run successfully.

♦ Version history is reset when you copy or publish a plan. Version information does not move with a plan when it is copied within a repository, as this operation effectively creates a new plan. When a plan is published, it retains the version details of the base version published from the Workbench repository – the base version number on the client computer, the creation date and time of that base version, the user who created it, and the comment added by that user. For more information, see “Version Control and Plan Publication” on page 10.


Version Control Commands

You can perform all plan activities in Data Quality without interacting with the version control features. However, all plans in the repository are assigned a version history that you can access through a shortcut menu.

When you right-click a plan name and select Version Control, a submenu opens.

The Version Control submenu displays the following options:

♦ History. Opens the History Viewer dialog box, which provides file properties for the latest and base versions of the plan.

♦ Get Latest Version. Opens the last-saved version of the plan or, if the plan is open, restores the onscreen plan to its last-saved version.

♦ Save Plan as Base Version. Saves the current state of the plan as a new base version. You must enter a comment describing your changes when you save a new version of the plan.

Viewing Version History

The History Viewer dialog box lists the plan versions maintained in the repository, with the latest version at the top of the list.

It lists the latest and base versions of the plan, showing the version number, creation date and time, author (the user who saved the plan), and the comment provided by the author when the version was created.

The Comment for Version pane shows the full text of the comment entered for the version.

Figure 1-2 shows the History Viewer dialog box:

Tracking Plans Across the Service Domain

The History Viewer can be useful to service domain users who want to track the progress of a plan through the enterprise. As a plan retains the version details of its meaningful iterations, the History Viewer facilitates an audit trail that can assist collaboration between plan designers and the users who deploy the plans.

Opening Plans with Version Control

When you double-click a plan in the Project Manager, you retrieve its latest saved version. You can also open the latest version of a plan through the version control menus by right-clicking a plan name and selecting Version Control > Get Latest Version.

Figure 1-2. History Viewer


The Get Latest Version option also allows you to revert to the latest saved version while working with a plan. If your plan has unsaved changes when you select Get Latest Version, Data Quality prompts you to confirm the command, since reverting to the latest version will undo your changes.

Use the following procedure to open a base version of the plan.

To open a base version of a plan:

1. In the Project Manager, right-click a plan name and select Version Control > History.

2. In the History Viewer dialog box, select the required base version and click Open Selected Version.

Saving, Deleting, and Renaming Plans

Version control is sensitive to general plan operations. By default, any save command will update the latest plan version.

When you save a plan for the first time, you automatically create a base version. When you create a subsequent base version, the latest version is automatically updated.

When you rename a plan, the name change is propagated through all base versions of the plan.

When you delete a plan, you delete all versions. It is not possible to delete a specific base revision of a plan.

To create a base version:

1. In the Project Manager, right-click the name of the plan and select Version Control > Save Plan as Base Version.

2. In the Confirm Base Version Creation dialog box, type a comment explaining the operation.

You will not be allowed to proceed without typing a comment in this dialog box.

3. Click Set As Base Version.

Version Control and Plan Publication

Data Quality treats version control differently for publication and local repository copy/move operations. Publication preserves a plan’s most recent base version information. Local repository copy/move operations do not.

Consider a plan published from the local repository to the domain. Publishing the plan sends its most recent base version, with that version information, to the domain repository. Version information copied with the published version includes the version number of the published base version on the client, the user who created the base version on the client, a date-time stamp for the creation of that version, and the comments added when the version was created. In this way, a plan on the domain is traceable back to its point of origin.

The domain repository also initiates its own version history for the plan. When a plan is first published, the domain repository assigns it a base version number of 1 while retaining also the client-side version data for the published version. If a client user subsequently publishes the plan a second time, the domain repository increments its base version number while again retaining the client-side version data.

For example, you have published base version 5 of a plan from your Workbench repository to the domain repository. The domain repository creates base version number 1. After working locally on the plan, you publish base version number 8 from your Workbench repository to the domain, creating a new base version number in the domain repository.

Table 1-2 illustrates the changes in version details:

Table 1-2. Version Data Updated During Plan Publication

                 Client Repository   Domain Repository
Version Number   5                   1
Version Number   8                   2


Note:

♦ Publication copies/moves the most recent base version, which may not be the latest saved version.

♦ When a plan is copied within the client repository, only the latest saved version is copied/moved. All base versions are discarded.

Working with Multiple Instances of a Plan

Data Quality is designed to be flexible. To enable teamwork between plan designers, it does not apply any locks to an open plan. Though it is possible for users on different systems to work on a plan concurrently, this is not recommended.

The following section describes plan behavior in the event that different instances of Data Quality Workbench are working with the same plan.

♦ When you save a plan, Data Quality checks the repository to determine if there have been any updates to the plan since its last “save” event. If it finds such an update, the system prompts you to confirm that you want to overwrite the saved plan. This updates the latest version in the repository. Any changes made by the other user will be lost.

♦ When you save a plan as a base version, Data Quality checks for any updates to the list of base versions for that plan. If it finds such an update, the system notifies you that a new base version will be created with a version number incremented from the version most recently created by the other user.

♦ Updating a base version also overwrites the latest saved version in the repository. Data Quality performs two checks in this case: to establish if the latest version has been updated and to establish if a more recent base version has been created. When you create a base version in this case, you are asked to accept the changes to both versions of the plan. If you click No in either case, the plan will not be saved and the base version not created.

Organizing the Workbench User Interface

You can arrange the components in the plan workspace in any manner you choose. The Data Quality Workbench user interface provides menu options that allow you to organize your plan components in a meaningful way:

♦ The component icons are connected by directional lines in the workspace. These lines indicate the directions in which data flows within the plan. However, the directional lines do not provide a foolproof indicator of whether one component precedes another in plan operations. The relative positions of the icons in the workspace do not affect the running of the plan.

♦ Another method of keeping track of the component dependencies in a plan is to assign components to one or more layers. Layers let you show or hide component icons onscreen. You can create a layer through the Plan Layer Manager, available from the Tools menu.

To assign a component to a layer, right-click it and select Assign To Layer from the context menu. To view only the components in a single layer, select View > Plan Layers.

♦ To view a snapshot of the current source data in the plan, open the Source Viewer (F6). This window appears in the workspace and displays the first 250 rows of the source data currently in use.

♦ The plan components can make use of reference dictionary files to determine the validity of data values. These dictionaries are visible through the Workbench Dictionary Manager (F8).

♦ You can read or add notes to a plan by opening the Plan Notes window (F11). This window is a free-text tool that allows you to comment on any aspect of the plan.


Workbench Naming Conventions

When you design or edit plans that will be shared with other users, it is good practice to name your Workbench elements in an agreed and consistent manner.

You and your team should agree on a clear and consistent set of naming conventions for projects, folders, plans, configurable components, component elements, and dictionaries.

For a comprehensive guide to developing a naming system for these elements, see “Informatica Data Quality Naming Conventions” on page 149.


C H A P T E R 2

Data Source Components

This chapter includes the following topics:

♦ Overview, 13

♦ CSV Source, 13

♦ Database Source, 14

♦ Fixed Width Source, 16

♦ Realtime Source, 16

♦ SAP Source, 17

♦ CSV Match Source, 19

♦ CSV Dual Match Source, 19

♦ Database Match Source, 20

♦ Group Source, 21

♦ Dual Group Source, 21

♦ CSV Identity Group Source, 22

♦ DB Identity Group Source, 23

Overview

Source components specify the location and format of the input data for a plan.

CSV Source

The CSV Source component connects to files with data organized in a delimited format, such as comma delimited (CSV), to provide source data for a plan. When configuring this component you specify the location of the delimited file, the type of delimiter used, and other options as described below.

Configuration

The CSV Source configuration dialog box contains the following editable fields:


♦ Source File. Displays the name of the file to which the component connects.

♦ Select. Click this button to browse to the source file.

When you click Select, the Select a CSV File as a Source dialog box opens. This dialog box provides an option to identify the character encoding associated with the dataset. For more information, see “Character Encodings and Unicode” on page 143.

♦ Field Delimiter. Select a field delimiter appropriate to the source data from this menu. The default option is comma. If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.

♦ Text Qualifier. Select a qualifier appropriate to the source data from this menu.

The application in which the source file was last edited may have saved information with a text qualifier. The default option is the double quotation mark (“). For an example of delimiter and qualifier usage, see the sample after this list.

♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as a header and thus distinguish it from the rest of the dataset.
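To illustrate the Field Delimiter and Text Qualifier options, a comma-delimited source file might contain lines like the following. The data is illustrative only; the name values are enclosed in the double quote qualifier because they contain the comma delimiter:

id,name,city
1,"Smith, John",Chicago
2,"Murphy, Mary",Boston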

Database Source

The Database Source component connects directly to a database to provide source data for a plan. When configuring a Database Source, you identify the required database type, connect to a database available to Data Quality, and configure the tables and columns on the database to produce a source dataset for your plan.

Configuration

The component dialog box displays configuration options across four tabs: Connect To Database, Before, During, and After.

The connection is defined on the Connect To Database tab. The Before tab settings create the database table that will be populated with the source data for the plan. The During options define the data that is used in the plan: you select and join columns from the available databases and add the data to the table defined on the Before tab. The After tab updates the table configured on the previous tabs and determines the state of the data as it will be used by other plan components.

Note: The Before, During, and After tabs work in the same fashion for all database types.

Connect To Database Tab

When connecting to a database source, first identify the database type.

The Database Type menu provides five options: Staging, IBM DB2, Oracle, Microsoft SQL Server, and ODBC (connection to an ODBC-compliant database).

Staging is the default option. It refers to the local database used by Data Quality. The remaining Database Information and Login Information fields are disabled for this option. That is, you can connect to the local repository without setting any other options on this page.

When you connect to IBM DB2, Microsoft SQL Server, or ODBC-compliant databases, you must provide a Data Source Name (DSN) for the database and you might be prompted to provide a valid username and password combination. The DSN field identifies the database on the network.

When you connect to an Oracle database, you must provide the System Identifier (SID) that refers to the Oracle instance.

The Encoding menu lists the available character encodings that can be applied to the data as it is used in the plan. For more information, see “Character Encodings and Unicode” on page 143.


The Login Information area contains Username and Password fields. Use these fields when access permissions have been applied to the database in question. Data Quality does not require this information by default.

Click Connect to establish the connection.

Before Tab

The Before tab has a Database pane and SQL Script pane.

The Database pane displays the available databases and tables in a folder hierarchy. Browse the hierarchy to locate the data source tables and columns and write the SQL script that defines the table in the SQL Script pane. Clicking on a folder or column in the left pane transposes its name to the right pane to aid accuracy in scripting.

The following sample script creates an elementary table called Names:

drop table if exists names;    # overwrites any existing names table
create table names
(
    id int,                    # id field populated by integers
    name varchar(255)          # name field entries up to 255 chars
);

Click Execute to run the script and create the table. You must click Execute before proceeding to the During tab.

Click Stop On Error if you want the system to stop the script operation and display an error message if the execution encounters a problem.

During Tab

The During tab allows you to browse database tables and filter the columns to provide source data for your plan. You can also apply conditions to tables and join columns from multiple tables. The tab shows five columns:

♦ Database. Like the Before tab, the Database column displays the database structure as a folder hierarchy of tables and columns.

♦ Select. Provides check boxes for the columns on the explored tables. Check a column's check box under Select to add its data to the dataset.

♦ Join. Lets you select columns from multiple tables for “join” operations so their data is added to the dataset.

♦ Where and Text. These columns allow you to specify the conditions for data inclusion, both for the columns identified in the Select column and the columns to be joined. Note the following:

− To activate the editable fields in the Where and Text columns, click in the column. Use the fields in the Where column to access conditional statements. You can enter text in the Text column for each database column.

− You can use the Where statement builder to specify the condition for joining two databases when the plan uses two Database Source components. Select a database table in the Join column by checking its check box. A new Join column, such as Join1, appears to its right.

The During tab also contains the following options:

♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces from the dataset. They are cleared by default.

♦ Expert mode. Use to view and edit the underlying SQL query statements, and to create advanced select statements. This option is cleared by default. A sketch of the kind of query involved appears after this list.

♦ Preview. Use the Preview option to view the dataset as defined by the configured settings in this dialog box. The Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.

♦ Validate. Use the Validate option to verify that the SQL query is valid. This option allows you to periodically test validity as you are constructing an SQL query.
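The During tab selections correspond to underlying SQL query statements of the kind that Expert mode exposes. The exact statement that Data Quality builds may differ; the following is a minimal sketch, using hypothetical table and column names, of a query that selects two columns, joins a second table, and applies a Where condition:

select c.id,                            # columns checked under Select
       c.name,
       o.order_total
from customers c
join orders o on o.customer_id = c.id   # columns paired through the Join column
where o.order_total > 100;              # condition built in the Where and Text columns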


After Tab

The After tab completes the process of generating the plan dataset. The Before tab runs SQL scripts on the database prior to its configuration; the After tab permits SQL scripts to run on the configured dataset. Like the Before tab, the After tab displays Database and SQL Script panes.

You can browse the configured tables and columns in the left pane and write the SQL script to run on data in the right pane.
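For example, an After tab script might apply a final clean-up to the table created on the Before tab. The following is a minimal sketch only, reusing the names table from the Before tab example; the statements you need depend on your own data:

update names set name = trim(name);                  # remove leading and trailing spaces
delete from names where name is null or name = '';   # drop rows with no name value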

For more information and examples, see “SQL Scripts” on page 139.

Fixed Width Source

Use this component to specify a fixed-width file as the data source for your plan. This component allows you to edit column names, widths, and data types.

Configuration

The Fixed Width Source configuration dialog box contains the following features:

♦ Source File. Displays the name of the file to which the source component connects.

♦ Select. Click this button to browse to the source file.

When you click Select, the Select a Fixed Width File as a Source dialog box opens. You can create a new file by typing a name in the File Name field of this dialog. In this dialog box, you can identify the character encoding associated with the dataset. For more information, see “Character Encodings and Unicode” on page 143.

♦ Fixed Width columns. The columns in this group allow you to enter the name, width, and datatype for each field in the file. See the example after this list.

♦ Remove Trailing Spaces. Use this option to remove trailing spaces (extra spaces at the end of data values) from the dataset used in the plan.

♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.
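To illustrate the Fixed Width columns settings, the following hypothetical layout defines three fields and shows one record that conforms to it. Field names and widths are illustrative only; each value is padded with spaces to its defined width:

Name (width 20)     City (width 10)     Zip (width 5)

John Smith          Chicago   60601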

Realtime Source

The Realtime Source allows you to develop plans that accept input in real time from live data entry or other applications. To configure this component, define the input fields that will supply data to the plan.

Configuration

The Realtime Source configuration dialog box includes an Inputs column and an Input Type column and, when first added to a plan, a single, undefined row.

To add or delete rows in the table, right-click in the dialog box and use the Add and Delete options on the context menu. The Delete option deletes the highlighted row.

The following columns display:

♦ Inputs. Double-click a field in this column to edit the input name. Click OK to apply your changes before moving from the field.

♦ Input Type. Click a field in this column to view options for defining the input data type. The options are String or Float.


For example, you may want to design a simple real-time plan to test the validity of a data code. The data code is valid within an organization if it contains the correct year (for example, 2005 in Figure 2-1). You can write a rule in the Rule Based Analyzer to check if any given input string contains this value. When you test the plan in Workbench, an input dialog box like the following appears:

Figure 2-1. Realtime Source: Data Setup Dialog Box

Type the year (or any value) in the Value field and click OK to return a result. In a real-time scenario, data inputs are checked without any direct user activity.

SAP Source

The SAP Source component allows you to use an SAP database as the data source in a plan. To obtain the data, the SAP Source connects to a SAP system and uses a BAPI (Business API) function to read data from the SAP database.

In the SAP Source component configuration dialog box you can identify the SAP system and set the input and output parameters of the function. Set the input parameters to filter the database for the data relevant to your plan. Set the output parameters to specify the data to be used in the plan.

Data Quality SAP connectivity is licensed separately from other Workbench components. If your license does not include SAP connectivity, contact Informatica Global Customer Support. Similarly, the SAP Source requires a valid connection to the SAP System and a corresponding SAP license for the SAP System.

Configuration

The configuration dialog box for the SAP Source displays its options on two tabs:

♦ Connection

♦ SAP System

Connection Tab

The Connection tab displays the following options:

♦ Host. The name or IP address of the SAP host computer.

♦ Client Number. Identifies a SAP client that you are authorized to use.

A SAP system can have multiple clients, each identified by a three-digit client number.

♦ System Number. A two-digit number that identifies the application server to which you want to connect. SAP allows multiple application server instances to run against a database.

♦ Encoding. Character encodings that can be applied to the data as it is used in the plan. For more information, see “Character Encodings and Unicode” on page 143.

♦ Username and Password. SAP username and password to identify you to the SAP system.

SAP System Tab

After entering the required information on the Connection tab, click Connect to open the SAP System tab.


The SAP application areas available on the connected system are listed on the left. On the right appear options for defining the input and output parameters to be used in the function call to the SAP database.

You can explore the SAP application areas to reveal the business objects defined for each area and the functions that can be configured for each business object. The icons associated with each level are color-coded: application area icons are yellow, business object icons are green, and function icons are red.

Your first task is to explore the available objects and select the function you want to run. Then, you can define the function using the Import and Export tab options.

Import Tab

On the Import tab, you can set the input parameters of the function that retrieves data from the SAP database. With this tab selected, two columns display:

♦ Name. Lists the input parameters available for the function.

♦ Value. Use to filter parameter output. To enter a filter, click in the Value column for the parameter and enter a filter string.

Note that there are three types of parameters. Configure the values on the Import tab based on the parameter type:

♦ Scalar parameter. A single name-value pair of the type described above, such as “Town – Chicago.”

♦ Structure parameter. A group of one or more scalar parameters, such as a multi-line address group. A structure can have multiple rows but has a single column of values, for example:

ADDRESS
AddressLine1    781 Fifth Avenue
AddressLine2    New York
AddressLine3    NY
AddressLine4    10022

♦ Table parameter. Contains one or more rows of data described by one or more columns. For example, each name below has multiple values:

CUSTOMERS
Name      AddressLine1        AddressLine2    AddressLine3
Smith     Fifth Avenue        New York        NY 10022
Jones     Park Avenue         New York        NY 10128
Wilson    Columbus Avenue     New York        NY 10025

Export Tab

The Export tab displays output parameters that correspond to the settings on the Import tab. The export parameters determine the data values that are “exported” from the SAP database for use as source data in your data quality plan.

The export parameters that appear are specific to the function being used:

♦ Value. To select a parameter for data export to your plan, use the Value check box of the parameter. Depending on the parameter type, you might need to select individual data elements for export.

♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces from the dataset. They are cleared by default.

♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.


Click OK in the configuration dialog box to save your changes.

CSV Match Source

The CSV Match Source compares the records in a single source file to identify duplicates. The source file must be delimited. This component makes use of a CSV file in a similar manner to the CSV Source component, then selects data for a matching operation. To match between two delimited source files, use the CSV Dual Match Source component. For more information, see “CSV Dual Match Source” on page 19.

When the CSV Match Source has been configured, two versions of each field in the source dataset will be visible to the matching components. To distinguish between them, “_1” and “_2” are appended to the field names.

The CSV Match Source is one of two components that enable the generation of match cluster information by the CSV Match Target. The other source component is the Group Source. If you want to use the CSV Match Target Identified Matches option to generate match cluster information, you must use CSV Match Source or Group Source in the plan.

Configuration

The configuration dialog box contains the following fields:

♦ Source File. Displays the name of the file to which the source component connects.

♦ Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source dialog box opens. You can identify the character encoding associated with the dataset. For more information, see “Character Encodings and Unicode” on page 143.

♦ Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.

♦ Text Qualifier. Select the text qualifier used in the source file. The default option is the quotation mark (“).

♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading text and distinguish it from the dataset.

CSV Dual Match Source

This component allows you to match data from two delimited source files. The functionality of the component is similar to that of the CSV Match Source, except the Dual Match Source compares data across two files.

Configuration

The CSV Dual Match Source configuration dialog box displays a set of options in two areas: Source 1 and Source 2. Each area provides identical settings for selecting and configuring a dataset; these settings are the same as those in the CSV Match Source configuration dialog box:

♦ Source File. Displays the name of the file to which the source component connects.

♦ Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source dialog box opens. You can identify the character encoding associated with the dataset. For more information, see “Character Encodings and Unicode” on page 143.


♦ Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.

♦ Text Qualifier. Select the text qualifier used in the source file. The default option is the quotation mark (“).

♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading text and distinguish it from the dataset.

Note: If the CSV Dual Match Source component is being used for Match-and-Append operations, the reference file appears in the Source 2 area.

Database Match Source

The Database Match Source component lets you explore the Data Quality repository to select tables and columns for use in a matching plan. To configure this component you connect to the Data Quality repository and configure the dataset.

The Database Match Source provides a single-component alternative for plans that use two Database Source components to match data across a single table.

Configuration

The Database Match Source configuration dialog box includes two tabs: Connect to Database and Match Selection. The Connect To Database tab options are identical to the Connect to Database tab on the Database Source configuration dialog box, as described in “Database Source” on page 14.

Connect to Database Tab

The Database Match Source connects to the Data Quality repository. This option may be named Staging in the configuration dialog box.

Click Connect to effect the connection and open the Match Selection tab. The remaining options on this tab are disabled.

Match Selection Tab

The options on this tab allow you to explore the database tables defined in the repository and select the columns to provide data for the matching plan:

♦ Database. Displays the repository structure as a folder hierarchy of tables and columns.

♦ Select. Provides check boxes for the columns on the explored tables. Check Select for a column to add its data to the dataset.

♦ Unique ID. Use to identify the data column to provide the unique ID for the dataset. The dataset can have one unique ID only.

♦ Group Key. The fields that the matching plan searches for common values. Select one or more group keys.

♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces from the dataset. They are cleared by default.

♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.

Note: Configuring a column for UniqueID or GroupKey automatically checks the Select option to add the column to the dataset. However, clearing either option does not automatically remove the column from the dataset. Clear the Select option to remove a column from the dataset.


Group Source

The Group Source component defines the input data for a plan by reading the set of group files created by a Group Target in another plan. When you configure the Group Source to connect to the set of group files, the Group Source uses the dataset underlying these files as the source for the plan, providing the data to the operational components on a group-by-group basis.

Grouped data is chiefly used in matching plans, although it can be used in other types of plans.

Groups are produced by the Group Target component. The Group Target creates a set of delimited text files in a proprietary format and saves the files in a user-defined directory. The files use the extension SSG. When configuring the Group Source, you need to specify the host directory for the grouped files.

Groups are created in the Group Target component by defining one or more key grouping fields for the dataset. All records with common values in the key grouping fields will be associated with a single group.
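For example, if a City field is defined as the key grouping field, records that share a city value are assigned to the same group. The field name and data below are illustrative only:

ID   Name           City        Group
1    John Smith     Chicago     1
3    Jon Smith      Chicago     1
2    Mary Murphy    Boston      2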

The Group Source is one of two components that enable the generation of match cluster information by a CSV Match Target. The other source component is the CSV Match Source. If you want to use the CSV Match Target Identified Matches option to generate match cluster information, you must use Group Source or CSV Match Source in the plan.

You can use the Dual Group Source to group data from two data sources. For more information, see “Dual Group Source” on page 21.

Configuration

The Group Source configuration dialog box contains the following features:

♦ Select Directories pane. Identifies the directory or directories containing the grouped data you want to use. To add a directory, right-click in the pane and click Add from the menu.

♦ Select a Source Group Directory dialog box. Appears after you add a directory. Use to select a folder to act as the source directory. Be sure to select a folder, not a file.

♦ Column Headers pane. Displays the headings for each data column in the group highlighted in the Select Directories pane. This pane has no editable options.

Note the following:

♦ Group files do not contain data from the underlying dataset, and group creation does not edit the underlying dataset in any way. Groups are a way to identify data records with common values so that these records can be processed together in matching operations. Matching operations can be performed on grouped data at significantly higher speeds than on non-grouped data.

♦ The column names in the Column Headers pane are appended with “_1” or “_2.” The columns are derived from the source dataset in the plan that generated the SSG files. Each column in the dataset is duplicated so that its data values can be matched.

Dual Group Source

The Dual Group Source allows you to perform matching operations on grouped data from two different data sources. It uses the SSG files defined for two datasets as input.

Configuration

The Dual Group Source configuration dialog box contains the same elements as the Group Source component. However, the Dual Group Source dialog box displays two instances of each pane.

The Dual Group Source configuration dialog box contains the following features:


♦ Select Directories pane. Identifies the directory or directories containing the grouped data you want to use. To add a directory, right-click in the pane and click Add from the menu.

♦ Select a Source Group Directory dialog box. Appears after you add a directory. Use to select a folder to act as the source directory. Be sure to select a folder, not a file.

♦ Column Headers pane. Displays the headings for each data column in the group highlighted in the Select Directories pane. This pane has no editable options.

For more information about using grouped data in plans, see “Group Source” on page 21.

CSV Identity Group Source

The CSV Identity Group Source performs identity matching on CSV sources using keys created by the Identity Group Target. To use the CSV Identity Group Source, you must first run a plan containing an Identity Group Target. The Identity Group Target stores keys in an identity index within Informatica Data Quality. The CSV Identity Group Source matches input data against the keys in this identity index.

In both the CSV Identity Group Source and the Identity Group Target, you must select the same Population and Key Type, and ensure that the Input Column in both components contains the same type of data. Additionally, the data sources used in both components must contain the same number of columns.

Note: Identity Group components require population files that install through the Content Installer. You must contact Informatica to purchase and download population files separately. For information on installing population files, consult the Informatica Data Quality Installation Guide.

Configuration

The configuration dialog box contains the following fields:

♦ Source File. Displays the name of the file to which the source component connects.

♦ Select. Click this button to browse to the source file. When you click Select, the Select a CSV file as a Source dialog box opens. You can identify the character encoding associated with the dataset. For more information, see “Character Encodings and Unicode” on page 143.

♦ Field Delimiter. Select a field delimiter used in the source file. The default option is comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.

♦ Text Qualifier. Select the text qualifier used in the source file. The default option is the quotation mark (“).

♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading text and distinguish it from the dataset.

♦ Population. Populations contain key-building algorithms that are customized for specific countries and languages. Select the population that most closely matches the origin of the input data.

♦ Key Type. The standard populations provided by Informatica can generate keys for three types of index data: person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you wish to use in key generation.

♦ Search Level. Select the Search Level that fits your matching needs. Each level uses a different balance of search quality and search speed. The search speed is inversely related to the number of matches returned, so that faster searches return fewer matches. The following list describes the search speed and matching criteria for each Search Level:

− Narrow. Search speed: fastest. Matching criteria: nearly exact. This Search Level performs the fastest and most exact matches. For example, using a Narrow Search Level for person name matching returns exact matches and name abbreviation matches (initials).

− Typical. Search speed: fast. Matching criteria: strict. This Search Level performs fast searches with strict matching criteria. For example, using a Typical Search Level for person name matching returns data with name abbreviation matches and some potential errors (e.g., incorrect initials).

− Exhaustive. Search speed: average. Matching criteria: loose. This Search Level performs average speed searches with loose matching criteria. For example, using an Exhaustive Search Level for person name matching returns matches that may represent substantial spelling errors.

− Extreme. Search speed: slow. Matching criteria: very loose. This Search Level performs slow searches with very loose matching criteria. For example, using an Extreme Search Level for person name matching may return matches with a very wide variety of spelling errors.

♦ Input Column. The input column specifies the source data that the CSV Identity Group Source uses for matching. Choose an input column that contains the type of data specified in the Key Type field.

The order of individual strings in the selected input column should match the normal string order used in the population Key Type you selected. For example, in English-speaking countries the normal string order for person names is as follows:

First Name + Middle Name(s) + Family Name(s)

♦ Key Index Location. The Key Index Location specifies the Data Quality subdirectory that contains the key index. Enter the Key Index Location specified in the Identity Group Target. The following string displays an example of a Key Index Location with multiple subdirectories:

UK/Person/Name

DB Identity Group Source

The DB Identity Group Source performs identity matching on database sources using keys created by the Identity Group Target. To use the DB Identity Group Source, you must first run a plan containing an Identity Group Target. The Identity Group Target stores keys in an identity index within Informatica Data Quality. The DB Identity Group Source matches input data against the keys in this identity index.

In both the DB Identity Group Source and the Identity Group Target, you must select the same Population and Key Type, and ensure that the Input Column in both components contains the same type of data. Additionally, the data sources used in both components must contain the same number of columns.

Note: Identity Group components require population files that install through the Content Installer. Informatica provides these files separately from Data Quality. You must contact Informatica to purchase and download population files. For information on installing population files, consult the Informatica Data Quality Installation Guide.


Configuration

The DB Identity Group Source configuration dialog box includes two tabs: Connect to Database and Match Selection.

Connect to Database Tab

The Connect To Database tab options are identical to the Connect to Database tab on the Database Source configuration dialog box. For more information about the Connect to Database tab options, see “Database Source” on page 14.

Click Connect to effect the connection and open the Match Selection tab.

Match Selection Tab

The options on this tab allow you to explore database tables and select the columns to provide data for the matching plan:

♦ Database. Displays the database structure as a folder hierarchy of tables and columns.

♦ Select. Provides check boxes for the columns on the explored tables. Check Select for a column to add its data to the dataset.

♦ Input Column. The input column specifies the source data that the DB Identity Group Source uses for matching. You can only select one input column. Choose an input column that contains the type of data specified in the Key Type field.

The order of individual strings in the selected input column should match the normal string order used in the population Key Type you selected. For example, in English-speaking countries the normal string order for person names is as follows:

First Name + Middle Name(s) + Family Name(s)

♦ Group Key. The fields that the matching plan searches for common values. Select one or more group keys.

Note: Do not select the same column as the Input Column and Group Key. The selections must be different. Both are mandatory.

♦ Population. Populations contain key-building algorithms that are customized for specific countries and languages. Select the population that most closely matches the origin of the input data.

♦ Key Type. The standard populations provided by Informatica can generate keys for three types of index data: person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you wish to use in key generation.

♦ Search Level. Select the Search Level that fits your matching needs. Each level uses a different balance of search quality and search speed. The search speed is inversely related to the number of matches returned, so that faster searches return fewer matches. The following list describes the search speed and matching criteria for each Search Level:

− Narrow. Search speed: fastest. Matching criteria: nearly exact. This Search Level performs the fastest and most exact matches. For example, using a Narrow Search Level for person name matching returns exact matches and name abbreviation matches (initials).

− Typical. Search speed: fast. Matching criteria: strict. This Search Level performs fast searches with strict matching criteria. For example, using a Typical Search Level for person name matching returns data with name abbreviation matches and some potential errors (e.g., incorrect initials).

− Exhaustive. Search speed: average. Matching criteria: loose. This Search Level performs average speed searches with loose matching criteria. For example, using an Exhaustive Search Level for person name matching returns matches that may represent substantial spelling errors.

− Extreme. Search speed: slow. Matching criteria: very loose. This Search Level performs slow searches with very loose matching criteria. For example, using an Extreme Search Level for person name matching may return matches that contain a very wide variety of spelling errors.


♦ Key Index Location. The Key Index Location specifies the Data Quality subdirectory that contains the key index. Enter the Key Index Location specified in the Identity Group Target. The following string displays an example of a Key Index Location with multiple subdirectories:

UK/Person/Name

♦ Trim Leading Spaces and Trim Trailing Spaces. Use these options to remove leading spaces or trailing spaces from the dataset. They are cleared by default.

♦ Stop on Error. Select this option if you want to stop script operation and display an error message if the execution encounters a problem.

♦ Preview. Use this option to view the dataset as defined by the configured settings in this dialog box. The Preview option runs the entire plan to generate the preview and displays the first 250 rows of data.

Note: Configuring a column for InputColumn or GroupKey automatically checks the Select option to add the column to the dataset. However, clearing either option does not automatically remove the column from the dataset. Clear the Select option to remove a column from the dataset.


C H A P T E R 3

Data Target Components

This chapter includes the following topics:

♦ Overview, 27

♦ CSV Target, 27

♦ Fixed Width Target, 28

♦ Report Target, 29

♦ CSV Merge Target, 30

♦ CSV Match Target, 31

♦ Match Key Target, 33

♦ Group Target, 35

♦ Database Target, 36

♦ Database Report Target, 38

♦ SAP Target, 38

♦ Realtime Target, 40

♦ Identity Group Target, 40

Overview

Just as you configure source components to specify input data for your data quality plan, you configure target components to specify plan output. Targets are designed to accept data derived from the source and operational components of a plan.

CSV Target

The CSV Target component defines a delimited file, such as a comma-separated file, as the output format for your data quality plan.

The component allows you to do the following:

♦ Specify the fields included in the output file, including any combination of data source fields and fields generated within the plan.

♦ Specify the position of each field in the output file.


♦ Enter a condition to filter data written to the output file.

♦ Configure the plan to create new output files or append data to an existing file.

Configuration

The CSV Target configuration dialog box contains the following options:

♦ Target File. Identifies the output file for the data target.

♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a Target dialog box opens. You can create a new file by typing a name in the File Name field. You can also identify the character encoding associated with the dataset. For more information, see “Character Encodings and Unicode” on page 143.

♦ Overwrite file? When checked, this option specifies that the plan overwrites the target file every time it runs (in cases where the target file name and path are unchanged for successive executions of the plan). When cleared, this option specifies that the plan writes its output to the end of the existing target file each time it runs. In this case, the target file grows in size each time the plan is run. This box is checked by default.

♦ Condition. Use to create a condition-based filter, in the form of an IF statement, for the data processed by the target. Use the filter to limit the records written to the output file.

Specify a condition by selecting a single input data field, an operator, and a condition value. An example appears after this list.

♦ Inputs. This pane lists the field types available to the target, typically, the data derived from the operational components of the plan and the source dataset. Beside each field type is a check box. Use the check box to add a field to the target output.

♦ Outputs. This pane shows the fields that have been selected from Inputs for inclusion in the data output. To change the order of the output fields, use the Up and Down arrows.

♦ Launch Viewer. If there is a program associated with the file type, use this option to launch a database table view of the target output automatically when the plan is executed.

♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading text and distinguish it from the rest of the dataset.

♦ Field Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is a comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.

♦ Text Qualifier. Select a qualifier appropriate to the data from this menu. The default option is a quotation mark (“).
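As an example of the Condition option described in this list, the following settings would write only records with a match score of at least 0.9 to the target file. The field name is hypothetical, and the exact on-screen form of the condition may differ:

Field: Match_Score    Operator: >=    Value: 0.9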

Fixed Width Target

The Fixed Width Target component generates plan output in a fixed-width file format.

The component allows you to do the following:

♦ Specify the fields included in the output file, including any combination of data source fields and fields generated within the plan.

♦ Specify the position of each field in the output file.

♦ Specify the length of each fixed width column.

♦ Enter a condition to filter data written to the output file.

♦ Configure the plan to create new output files or append data to an existing file.

Configuration

The Fixed Width Target configuration dialog box contains the following features:


♦ Target File. Identifies the output file for the data target.

♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a Target dialog box opens. You can create a new file by typing a name in the File Name field. You can also identify the character encoding associated with the dataset. For more information, see “Character Encodings and Unicode” on page 143.

♦ Condition. Use to create a condition-based filter, in the form of an IF statement, for the data processed by the target. Use the filter to limit the records written to the output file.

Specify a condition by selecting a single input data field, an operator, and a condition value.

♦ Overwrite File. Use to overwrite the target file with successive executions of the plan. This option is checked by default. Clearing this option keeps the selected target file from being overwritten, making it read-only.

♦ Inputs. This pane lists the field types available to the target, typically, the data derived from the operational components of the plan and the source dataset. Beside each field type is a check box. Use the check box to add a field to the target output.

♦ Outputs. Lists the name, width, and type of each selected input. The values in the cells of the Width column determine the width as a number of characters for the associated columns of output data.

If the data values are longer than the width specified, the data will be truncated in the output file.

The default data type is String. Valid types are String, Number, and Date.

♦ Launch Viewer. If there is a program associated with the file type, use this option to launch a database table view of the target output automatically when the plan is executed.

♦ First Line of File is the Header. Use this option to designate the first line of data in the source file as heading text and distinguish it from the rest of the dataset.

Note that the Fixed Width Source does not use a header record. Clear this option if you intend to use the fixed-width target output file as a source in another plan.

♦ Launch Specification Viewer. Use this option to open the fixed-width specification file, which specifies the field names and widths defined for the target output file.

Report Target

The Report Target generates an easy-to-read report file that displays plan output data. The report files can be opened in other applications, including web browsers and spreadsheets.

You can create three types of report files: HTML, CSV (delimited flat file), and SSR (a proprietary Informatica Data Quality format). SSR reports can be viewed as dashboards in the Data Quality Report Viewer. For more information, see “Report Viewer” on page 109.

When you use the Report Target, you must place a frequency component, such as Count, before it in the plan. The data fields counted in the Report Target are determined by the frequency component that precedes it in the plan.

Note: The Report Target does not read outputs from the Aggregation component.

Configuration

The Report Target configuration dialog box contains the following features:

♦ Report File. Identifies the output file for the data target.

♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a Report as a Target dialog box opens. You can create a new file by typing a name in the File Name field of this dialog. By default, files of the type specified by the Report Transform options display.

♦ Report Transform. Determines the output file type.


− Check the Standard option to enable the file type selection menu. The options are HTML, CSV, and SSR. The HTML option activates the Include Chart menu, which allows you to add a pie chart, bar chart, or line chart to the report.

− Check the Custom option to write the target output to a customized HTML report template and to generate graphical reports. Click Select beside the Custom text field to browse to a template file.

♦ Launch Report on Completion. Use to launch the report file automatically when the plan is executed.

CSV Merge Target

The CSV Merge Target merges columns from two sources to a single target file. It can be used in matching plans that compare a dataset against a reference dataset. The component operates as follows:

♦ The target lists data fields available from the other components in the plan as inputs. Select the input fields to write as outputs to the target.

♦ The inputs defined as Source 1 are automatically written to the resulting merged target.

♦ The inputs defined as Source 2 constitute reference data. Data values from Source 2 are appended to the merged target where good matches are found with Source 1 data, as determined by the Match Input Field and Match Threshold settings.

Note: When more than one positive match is identified, the match with the highest score is appended.

Configuration

The CSV Merge Target configuration dialog box contains the following features:

♦ Target File. Identifies the output file for the merged data.

♦ Select. Use to browse to the output file for the data target.

When you click Select, the Select a CSV file as a Target dialog box opens. You can create a new file by typing a name in the File Name field of this dialog.

♦ Inputs. Lists the potential input fields for the target. Input fields can be added to the Source 1 or Source 2 output panes so their data can be considered for inclusion in plan output. Add an input column to either pane by right-clicking a field name in the Inputs pane and selecting Add to Source 1 List or Add to Source 2 List.

♦ Launch Match File. Use to open the output file automatically when the plan is run.

♦ Match Threshold. Filters the columns in the Source 2 Outputs pane according to their scores in the key matching field, as defined for the target on the Match Input Field. Records in these columns with match scores below this value are not included in the merged output. The default value is 0.9.

♦ Match Input Field. Lists the key matching fields defined by the plan components. Use this menu to select the field on which to base the matching calculation. The Match Threshold applies to this calculation.

♦ Use First Line as Header. Use this option to designate the first line of data in the source file as heading text and distinguish it from the rest of the dataset.

♦ CSV Separator: Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.

♦ CSV Separator: Qualifier. Select a qualifier appropriate to the data from this menu. The default is quotation mark (“).


CSV Match Target

The CSV Match Target creates a delimited output file containing data generated by a matching plan.

The component can generate two types of output: a HTML match report displaying match clusters and corresponding match scores, and a CSV file containing data values that meet or exceed the match threshold score. This match file can be used as input for the consolidation process.

The principal steps in configuring the CSV Match Target are:

♦ Select the data fields whose data matches you want to include in the target output. Include at least one matching component output field.

♦ Select the match input field to which you want to apply the match threshold. This field and the match threshold value constitute a filter for the plan output data.

♦ Select the types of output you want the target to generate. The target can generate a HTML report or a CSV file in one of two formats.

For more information about formatting CSV outputs, see “Output Options in the CSV Match Target” on page 147.

The input fields listed in the CSV Match Target configuration dialog box are numbered by appending “_1” and “_2” to the field names. When you match data fields from a single source file, “_1” and “_2” are appended to the names of the same fields. When you match data fields in two data sources, “_1” is appended to the fields in one source and “_2” is appended to the fields in the other source.

Configuration

The CSV Match Target configuration dialog box contains the following options:

♦ Target File. Identifies the CSV output file for the data target.

♦ Select. Use to browse to the output file for the data target. When you click Select, the Select a CSV File as a Target dialog box opens. You can create a new file by typing a name in the File Name field.

♦ Inputs. Lists the data fields that can be included in the target output. Check a field to include it in the plan output calculations. You must select at least one output from a matching component.

♦ Outputs. Lists the fields selected in the Inputs field. Use the Up and Down arrows to change the order of the output fields, that is, the order in which you want them to appear in the plan output.

♦ Use First Line as Header. Check to designate the first line of data in the source file as heading text and so distinguish it from the dataset.

♦ Launch Viewer. Use to open the output files automatically when the plan executes.

♦ Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.

♦ Qualifier. Select a qualifier appropriate to the data from this menu. The default is quotation mark (“).

♦ Create HTML Match Report. Use to generate a HTML report displaying the match clusters found by the plan. This option is checked by default.

Note: An HTML match report can only be generated for plans that use a Group Source or CSV Match Source. If your plan does not include one of these two sources, an error message appears. If you are running a CSV Match target plan created in an earlier version of Workbench, check the source configuration to make sure that the plan continues to run successfully.

♦ Match Output Type (Matched Pairs/Identified Matches). These options determine how the CSV report file displays the matches found by the plan.


Use the Matched Pairs option to list matching values together in the file output. For example, if the strings “John Smith” and “John Smyth” are identified as a matched pair, both these strings will be written to a single row along with the match score:

John Smith    John Smyth    0.9

Use the Identified Matches option to append the match cluster ID and the number of records per cluster to records identified as matches by the plan. For example, in a plan that matches the four input records “John Smith,” “Bill Brown,” “Mary Murphy,” and “John Smyth,” the Identified Matches option appends the following columns to the target file and populates them as follows:

Name           Cluster ID    Records Per Cluster
John Smith     1             2
Bill Brown     3             1
Mary Murphy    2             1
John Smyth     1             2

Here, “John Smith” and “John Smyth” share a common Cluster ID, indicating that they satisfy the plan’s matching criteria.

Also note the following points about the Identified Matches option:

− The Identified Matches option requires inputs from a CSV Match Source or a Group Source. If you add inputs from other sources to the CSV Match Target and select the Identified Matches option, the plan registers an error.

− Clustering does not group matching records in the output file. The data input order corresponds to the data output order.

− The columns listed in the Outputs pane must be organized by data source, with an equal number of columns for records from each data source. The match score column must appear after the record columns. Figure 3-1 illustrates the correct order.

Figure 3-1. CSV Match Target Outputs Pane, Showing Column Order for Identified Matches

− If you select the Identified Matches option, match score values do not appear in the file output for this Target, even if you select a match score in the Outputs pane. This is because Identified Matches causes data to be written one by one, and any given data row can have multiple rows associated with it.

For more information about formatting outputs, see “Output Options in the CSV Match Target” on page 147.

♦ Field. Lists the output fields defined by the matching components in the plan. Use this menu to select the field from which the CSV Match Target reads the match score. The match threshold values set in this dialog box apply to the match scores achieved in this field.


♦ Thresholds fields (Lower and Upper). Filter the data record values written as plan output according to the record scores in the match input field (see Field menu above).

Enter a lower and upper limit for the match scores in these fields, between 0 and 1. Data from records whose scores fall outside this range will not be included in the output. The default values are 0.9 for Lower and 1.0 for Upper. The Lower field is not designed to calculate matches with a value of 1.

Match Key Target

The Match Key Target component is commonly used in consolidation plans. It allows you to append match plan output data directly to the source database. This eliminates the need to write match data to a new target table. With the Match Key Target, matching and consolidation information is written and held in database tables. The outputs of this component are CSV and HTML reports.

Data may be written by the Match Key Target if the following criteria are met in the source table structure:

♦ The source table contains a column that the Match Key Target can use to uniquely identify a record. This column serves as a primary key: unique, non-null, and auto-incrementing.

♦ The source table contains a column in which the system stores the match score for each matching record. This field must be of datatype Float.

♦ The source table contains a column in which the match key is recorded. This key identifies the consolidated records within a cluster.
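If a source table already has a suitable primary key column but lacks the match key and match score columns, they can be added with SQL statements of the following kind. This is a minimal sketch only; the table name is hypothetical, and the match key datatype should match the primary key of your source table:

alter table customers add column match_key int;       # records the primary key of the cluster's master record
alter table customers add column match_score float;   # stores the match score for each matching record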

Configuration

The configuration options in the Match Key Target configuration dialog box are arranged on three tabs: Database, Match Details, and Outputs.

Database Tab

The Database Type menu lists a static option, Staging, representing the Data Quality repository. The remaining fields are disabled.

Click the Connect button to access the database data. This opens the Match Details tab.

Match Details Tab

The options on this tab are arranged in three areas:

♦ Table Details. The Table Details area contains the Table Names menu. This menu lists the database tables available to the target. Use this menu to select the table to which the target will write the output data.

♦ Column Details. These menu options relate to the table identified under Table Details, whereas the Inputs menu options list all columns in the database tables available according to the Database tab settings.

The Column Details area contains three fields:

− UniqueID. Select the column that contains the unique ID (primary key) of this table.

− Match Key. Select the column to record the match key. The match key is the primary key of the master record in a match cluster.

− Match Score. Select the column to store the match score between each record and its master.

If the table does not already have a column created to hold the match key and match score, the table structure must be altered to generate these fields. The match key and match score are populated when the matching plan is run.

♦ Inputs. This area contains two fields: Unique ID - Input 1 and Unique ID - Input 2. Select the columns on which to base the matching operations.


Outputs Tab

The options on this tab let you configure an HTML match report and a CSV match file to display the data output from the target. The match report presents the matches in clusters, and the match file presents a single row for each matched pair.

The creation of a report or file is optional. Also, fields selected under Match Table Column Selection and Ordering appear in the match report and match file.

The Outputs tab displays the following areas:

♦ Match Report. This area contains the following options.

− Create Report. Check to create a match report when the plan is executed.

− Select. Click to browse to the report file. When you click Select, the Select a HTML file for the Report dialog box opens. You can create a new file by typing a name in the File name field.

− Launch Viewer. Enabled when the Create Report box is checked. When selected, the report opens automatically when the plan runs.

− Clusters Per Page. Determines how many match clusters appear on each page in the report.

♦ Match Table Column Selection and Ordering. This area shows two panes. The left pane lists the columns available on the table selected on the Match Details tab. The right pane lists the columns to appear in the report or match file. To add a column to the right pane, click its check box in the left pane.

♦ Match Input. The match report presents each match cluster along with the selected input fields from related match sources and the field selected from the Match Input menu. The Match Input selection and the primary key of the source data appear as default fields on this report.

The Match Input menu lists the key fields defined by the matching components in the plan. The field you select, in conjunction with its match threshold score, determines the records to be included in the target output.

Likewise, the range of values you set in the Match Threshold fields is applied to the Match Input key field. Matching records whose scores fall outside this range are not included in the output. You can set lower and upper values between 0 and 1. The default values are 0.75 and 1.0.

♦ Match File. Like the match report, the match file contains records that contain matches within the match threshold for the field selected from the Match Input menu. The file contains the columns selected in the Match Table Column Selection and Ordering area. Match File has the following options:

− Create File. Check to create a match file when the plan is executed.

− Select. Click to browse to the report file. When you click Select, the Select a CSV File as a Target dialog box opens. You can create a new file by typing a name in the File name field of this dialog.

− Launch Viewer. Enabled when the Create File box is checked. When selected, the file opens automatically when the plan runs.

− Delimiter. Select a field delimiter appropriate to the data from this menu. The default option is comma (,). If headings for the column source data contain this delimiter, you must use a text qualifier to preserve the data structure.

− Qualifier. Select a qualifier appropriate to the data from this menu. The default option is the quotation mark (“).

Note: It is good practice to run a plan that populates an audit trail table with the unique IDs of the matching records for every match created, because duplicate records are removed from the source table when the data is consolidated.


Group Target

The Group Target component creates groups, a series of files in a Data Quality-proprietary format that organizes plan data according to key data fields that you configure.

Grouping assigns records to groups based on similar or identical values in one or more fields; matching operations are then performed on the records assigned to each group.

Group Target output files can be used by a Group Source or Dual Group Source to organize the data inputs to a matching plan.

Grouping large datasets is a useful precursor to running a matching plan. Performing matching operations on grouped data improves performance with minimal loss of matching accuracy.

Grouped data is stored in local directories as a set of delimited files with the extension SSG. Set up groups by defining one or more group key fields for the dataset. All records with a common value in the defined key fields are written to a single group file.
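As a rough sketch of this idea, the following Python example assigns records to groups by a key field, skips records with an empty key, and splits a group across several files when it grows beyond a maximum size, similar to the Maximum Group Size option described below. The field names, file names, and CSV output are invented for illustration; this is not the proprietary SSG format or the Group Target's implementation.

# Illustrative sketch only: group records by a key field and write each group
# to its own delimited file, splitting a group when it exceeds a maximum size.
import csv
from collections import defaultdict

MAX_GROUP_SIZE = 2   # analogous to the Maximum Group Size option (0 = no limit)

records = [
    {"name": "John Smith", "zip": "94583"},
    {"name": "John Smyth", "zip": "94583"},
    {"name": "Jane Smith", "zip": "94520"},
    {"name": "J. Smith",   "zip": "94583"},
    {"name": "No Zip",     "zip": ""},       # skipped: Ignore Empty Group Field Values
]

groups = defaultdict(list)
for rec in records:
    key = rec["zip"]               # the group key chosen under Grouping Fields
    if key:
        groups[key].append(rec)

for key, rows in groups.items():
    size = MAX_GROUP_SIZE or len(rows)
    for n, start in enumerate(range(0, len(rows), size)):
        with open(f"group_{key}_{n}.csv", "w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=["name", "zip"])
            writer.writeheader()
            writer.writerows(rows[start:start + size])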

Note: Group files are organized separately from the original dataset and do not modify the original dataset in any way. A large number of SSG files can be created in the group directories, depending on the number of records with common data in the key fields.

Configuration

The Group Target configuration dialog box contains the following options:

♦ Directory. The location and name of the directory in which the groups are created. This field is not editable.

♦ Select. Click to open the Select the Group Directory dialog box and browse to the required directory. To select a directory, highlight it in the main window and click Select. Select a directory, not a file.

♦ Outputs. This pane lists the columns available in the dataset. Check the column name to include its data in the plan output. The columns you select are added to the Grouping Fields pane.

Tip: Right-click in this pane to display a Select All option.

♦ Grouping Fields. Select a group key. The group files created in the group directory are based on the key you select.

♦ Maximum Group Size. The maximum number of records assigned to a group file. If the Group Target reaches this limit when writing to a group file, it creates another file for the group. The default value is zero, no limit.

Note: Matching operations are performed within group files. This is standard behavior for matching operations on grouped records. Although a reduction in group size can lead to faster processing times, it can also impact the accuracy of match results.

♦ Maximum Files Per Group. The maximum number of group files written to a given folder on disk. The default value is 5000. When this number is exceeded, the Group Target creates one or more sub-folders to house the remaining files. If this value is set to zero, no limit is imposed and files are written to a single folder.

♦ Ignore Empty Group Field Values. Use to avoid the creation of a group based on records with null values in a group key field.

Note: The group files you create are overwritten if you run a plan again without changing the target configuration details. To preserve a set of group files, select a new group directory before you run the plan again.


Database Target

The Database Target (or DB Target) component allows you to write plan output to a database. Data produced by the plan can update selected tables in the database or can be inserted in new or existing tables.

In addition to its own repository, Data Quality connects to Oracle, IBM DB2, and Microsoft SQL Server databases and also supports ODBC connections. A single plan can write to multiple databases using multiple Database Targets.

The Database Target can write the data records processed by the plan to the database, or it can write data from the Aggregation component detailing the frequency of occurrence of data values.

Configuration

The Database Target configuration dialog box contains four tabs:

♦ Connect To Database

♦ Before

♦ During

♦ After

The connection is defined on the Connect To Database tab.

Connect To Database Tab

This tab contains two areas: Database Information and Target Format.

You must identify the target database in the Database Information fields.

♦ Database Type. This menu provides five options: Staging (the local repository), IBM DB2, Oracle, Microsoft SQL Server, and ODBC (as a connection to an ODBC-compliant database).

Note: When you select Oracle, you are prompted for an Oracle database system identifier. If you select another database type, you are prompted for a data source name.

♦ DSN. Data Source Name. Identifies the database on the network. This is required for all database connection types except Oracle.

♦ SID. System Identifier. Identifies the instance of the Oracle database.

♦ Encoding. Lists the available character encodings that can be applied to the data output. For more information, see “Character Encodings and Unicode” on page 143.

♦ Login Information. Contains username and password text fields. You must provide your login when access permissions have been applied to the database.

♦ Connect. Click to establish the connection.

You must also set the target format.

♦ Select Normal Mode to write the plan data to the database.

♦ Select Aggregation Mode to write data summarizing the frequency of occurrence of data values, as tabulated by the Aggregation component, to the database. When you select this option, also select the Aggregation component from which the target will read the data.

Note: When you select Normal Mode, the outputs from all components except the Aggregation component are available to the target. When you select Aggregation Mode, only the outputs from the Aggregation component are available.

Before Tab

The Before tab contains a Database pane and a SQL Script pane. This tab is typically used in the Database Target to create new tables in the selected database. You can also create Pre-INSERT and Pre-UPDATE statements.


♦ Click Execute to implement the SQL script. Click Execute before proceeding to the During tab.

♦ Check the Stop On Error check box to stop the script operation and open a message box if execution encounters invalid SQL.

During Tab

The During tab enables you to browse the database tables and filter the columns that will constitute the data written to the database. Use this tab to create INSERT and UPDATE statements. You can also apply conditions to tables and join columns from multiple tables. The During tab includes five columns: Database, Insert, Update, Where, and Text.

Figure 3-2 displays the Database Target During tab:

Note:

♦ Like the Before tab, the Database column displays the database structure as a hierarchy of tables and columns.

♦ To write to a column in a database table, select the required Data Quality output from the corresponding list in the Insert or Update column.

♦ Use Stop On Error to stop the script operation and open a message box if execution encounters invalid SQL.

♦ Use Roll Back on Error to commit data to the database at the end of the batch operation. If this box is cleared, data is committed to the database at the end of each transaction.

♦ Use Expert Mode to view and edit the underlying SQL query. Expert Mode is typically used to create more advanced statements.

Any changes made in Expert Mode are lost if you clear this box and return to standard mode.

♦ Click the Condition option to create a condition-based filter, in the form of an IF statement, for the data processed by the target. Use the filter to limit the records written to the output file.

♦ In Aggregation Mode, only outputs from the Aggregation component are available. You can use Expert Mode to perform additional calculations on aggregates.

After Tab

Use the After tab options to write post-insert or update SQL statements for a table. Use this tab to configure primary keys and indexes for tables.

The After tab completes the process of defining the target output. The Before tab runs SQL scripts on the data prior to its configuration; the After tab runs SQL scripts on the configured dataset. Its Database and SQL Script panes are identical to those of the Before tab. You can browse configured tables and columns in the database and write the SQL script to run on selected data.

For more information about SQL scripts, see “SQL Scripts” on page 139.

Database Report Target

The Database Report Target component generates report data for a plan and inserts this data to the Data Quality repository. Like the Report Target, Database Report Target accepts input from frequency components.

The Database Report Target also makes Data Quality report data accessible to external applications through an ODBC connection. You can analyze and present the results of a data quality plan through a range of analytical software tools, including Microsoft Excel and Crystal Reports.

Note: Unlike the Report Target component, the Database Report Target does not produce a formatted report on the data. Instead, it writes report data to local Data Quality MySQL database tables. The tables can then be made available to other applications through ODBC.

The MySQL database tables that store the Data Quality report data are located in the Data Quality repository, named repository.t_athanor_report (master record) and repository.t_athanor_report_detail (detail record).

Configuration

The Database Report Target configuration dialog box contains the following:

♦ Connection Details Area. Because the Database Report Target always writes data to the Data Quality repository, the connection options shown in this area are static.

♦ Parameters Area. This area contains the following fields:

− Report Name. Enter a report name. The report data is saved in the repository under this name.

− Maintain Reports. When this box is checked, a new record containing the report data is inserted in the MySQL database tables each time the plan executes. Each instance of the report is identified on the MySQL table by a unique report ID and timestamp. When this box is cleared, the record containing the report data is updated with the latest report data each time the plan is executed.

Technical Requirements

A MySQL ODBC Driver is required when importing data from the MySQL database to an external application. This is available to download from http://www.mysql.com.

Maintenance

To ensure reasonable table size, it might be necessary to remove historical data from the database tables that store report data. When deleting a record from these tables, ensure that the record in question is deleted from both the Master and Detail records to avoid creating orphaned records.

SAP Target

The SAP Target allows you to write plan output to a SAP database. This component complements the SAP Source component, which allows you to obtain data from the SAP database for use as source data in a plan.


There are three basic steps to configuring the target to write data to the SAP database:

1. Define a connection between Data Quality and the target SAP system.

2. Browse the list of BAPI functions on the SAP system and select the function associated with the data.

3. Configure one or more parameters on the function to be populated with data from the Data Quality plan.

Perform these steps using options on the SAP Target configuration dialog box.

Configuration

The SAP Target configuration dialog box displays its options on two tabs:

♦ Connection. Use the Connection tab options to establish the connection to the SAP system.

♦ SAP System. When connected, use the SAP System tab options to locate the appropriate BAPI and link its parameters to the output columns in your plan.

Connection Tab

The Connection tab contains the following options:

♦ Host. The name or IP address of the SAP host computer.

♦ Client Number. Identifies the SAP client that you are authorized to use. A SAP system can have multiple clients, each of which is identifiable by the three-digit client number.

♦ System Number. SAP allows multiple application server instances to run against a database. The system number is a two-digit number that identifies the application server to which you want to connect.

♦ Encoding. This menu lists the available character encodings that can be applied to the data as it is used in the plan. For more information, see “Character Encodings and Unicode” on page 143.

♦ Username and Password. These fields identify you to the SAP system.

Clicking Connect opens the SAP System tab.

SAP System Tab

This tab is divided into two panes. The left pane lists the SAP application areas and functions available on the connected system, and the right pane lists the parameters defined on the highlighted function.

You can explore the application area pane as an alphabetical list or as a hierarchy that groups areas together according to user-defined criteria. The areas can be expanded to reveal the business objects defined for each area and the functions configured for each business object. Application areas are read from the SAP system.

The icons associated with each level in the left pane are color-coded: application area icons are yellow, business object icons are green, and function icons are red.

Explore the available objects and select the function you want to use to write to the SAP database. Then, configure one or more of the function parameters to receive data from one or more plan output columns.

As demonstrated for the SAP Source configuration dialog, there are three parameter types:

♦ Scalar. A single name-value pair, such as Town – Chicago.

♦ Structure. A group of one or more scalar parameters, like a multi-line address group. A structure may have multiple rows but has a single column of values.

♦ Table. Contains one or more rows of data described by one or more columns.

Note: The SAP Target treats each field in a parameter as a scalar parameter, regardless of whether it is a single-field scalar parameter or a multi-field table.

To configure a parameter:

1. Examine the parameter and identify the fields to which you want to add data.

2. Double-click the Value field of the parameter:


If you select a scalar parameter, this opens the Edit Scalar Parameter dialog box.

If you select a structure or table parameter, this opens the Edit Structure Parameter or Edit Table Parameter dialog box, in which constituent scalar values can be configured. Double-clicking a value in these dialog boxes opens the Edit Scalar Parameter dialog box.

3. In the Edit Scalar Parameter dialog box, click the Down arrow by the Value field to see a list of available output columns.

You can also enter a column name.

4. Select a column, and click OK.

5. Repeat these steps for all required parameters.

Realtime Target

The Realtime Target enables you to develop plans to process output data in real time and deliver data to another application. With this component, you can define a set of columns that determine the data sources for a plan executed by the Data Quality engine in a real-time environment.

You can develop, run, and test the plan using the Workbench user interface.

When the Data Quality engine executes a real-time plan, the records passed to the application contain all fields selected as outputs from the Realtime Target. When configuring the Realtime Target, select only the data fields that your application needs.

Configuration

The Realtime Target configuration dialog box displays a single pane that lists all available data fields. Select the required fields individually, or right-click within the selection pane to Select All.

Identity Group Target

The Identity Group Target component generates keys for groups of input data. It stores these keys and the input data in an identity index within Informatica Data Quality. The CSV Identity Group Source and the DB Identity Group Source require the key values in this index to perform identity matching on plan data.

All identity matching operations require two plans that must be run consecutively. The first plan must contain an Identity Group Target. The second plan must contain either a CSV Identity Group Source or a DB Identity Group Source component. These components search the data for the keys defined by the Identity Group Target in the first plan.

Note: Do not use the Identity Group Target in the same plan as any Data Quality match source component.

Identity Group components require population files that install through the Content Installer. You must contact Informatica to purchase and download population files separately. For information on installing population files, consult the Informatica Data Quality Installation Guide.

Configuration

The Identity Group Target configuration dialog box contains the following options:


♦ Input. This pane lists the potential input columns available to the target. Use the check box next to each column to add that column to the target output. At least one input column should contain person name, organization, or address data, as the Identity Group Target uses these data types for key generation.

Tip: Right-click in the input pane to display a Select All option.

♦ Outputs. This pane contains outputs for each selected input column. The outputs are automatically generated when you add input columns.

♦ Population. Populations contain key-building algorithms that are customized for specific countries and languages. Select the population that most closely matches the origin of the input data.

♦ Key Type. The standard populations provided by Informatica can generate keys for three types of index data: person names, organizations, and addresses. Select the Key Type corresponding to the type of data that you wish to use in key generation.

♦ Key Level. The Key Level determines the number and variety of keys generated by the Identity Group Target. The three key levels are Limited, Standard, and Extended. The following table describes the features of each Key Level:

Key Level   Disk Space Usage   Matching Success                                                                     Intended Use
Limited     Low                Finds likely matches; does not find all probable matches                            Non-critical searches on systems with limited disk space
Standard    High               Overcomes most variations in word order, missing words, and extra words             Most search applications
Extended    Very high          Finds most possible matches, regardless of word order variation and concatenation   High-risk or mission-critical search applications

♦ Input Column. The input column specifies the source data that the Identity Group Target uses for key generation. Choose an input column that contains the type of data specified in the Key Type field.

The order of individual strings in the selected input column should match the normal string order used in the population Key Type you selected. For example, in English-speaking countries the normal string order for person names is as follows:

First Name + Middle Name(s) + Family Name(s)

♦ Key Index Location. The Key Index Location specifies the Data Quality subdirectory where the key index will be generated. Set a unique Key Index Location for each plan to avoid overwriting other key indexes.

You can specify a Key Index Location with multiple subdirectories to help organize your Identity Key Indexes. The following string displays an example of a Key Index Location with multiple subdirectories:

UK/Person/Name


C H A P T E R 4

Frequency Components

This chapter includes the following topics:

♦ Overview, 43

♦ Count, 43

♦ Sum, 46

♦ Aggregation, 47

♦ MinAvgMax, 49

♦ Range Counter, 50

♦ Missing Values, 51

Overview

Data Quality provides six components that determine the frequencies of values within selected data fields. These components allow you to determine the frequencies of all values, specific values, and defined ranges of values within data fields.

Frequency Analyzer components are essential in plans that use the Report Target or Database Report Target to create plan output. Report Target and Database Report Target can only accept inputs from frequency components.

Data Quality provides the following frequency components:

♦ Count

♦ Sum

♦ Aggregation

♦ MinAvgMax

♦ Range Counter

♦ Missing Values

Count

The Count component determines the number of unique values in a column and calculates the frequency of occurrence of each value. Count is a frequency component and therefore can provide data input to the Report Target and Database Report Target.


For example, consider the addresses listed in Table 4-1:

Table 4-1. Count Component: Sample Address List

Address1                        Address2       Address3       State   Zip
2440 Camino Ramon               San Ramon      Contra Costa   CA      94583-4296
2306 Shoreline Loop # 132       San Ramon      Contra Costa   CA      94583
2050 Shoreline Loop             San Ramon      Contra Costa   CA      94583-5502
1200 Concord Ave                Concord        Contra Costa   CA      94520-4915
1350 Montego                    Walnut Creek   Contra Costa   CA      94598-2822
1200 Montego                    Walnut Creek   Contra Costa   CA      94598-2820
108 Summerwood Pl               Concord        Contra Costa   CA      94518-2718
305 Reflections Cir Apt 27      San Ramon      Contra Costa   CA      94583-5204
101 Ygnacio Valley Rd Ste 300   Walnut Creek   Contra Costa   CA      94596-4061
2245 Via De Mercados            Concord        Contra Costa   CA      94520-4919
2000 Crow Canyon Pl Ste 206     San Ramon      Contra Costa   CA      94583-4633
2000 Crow Canyon Pl Ste 420     San Ramon      Contra Costa   CA      94583-1367
2000 Crow Canyon Pl Ste 260     San Ramon      Contra Costa   CA      94583-1384
2400 Camino Ramon Ste 100       San Ramon      Contra Costa   CA      94583-4287

Applying Count to the Address2 column results in the following data:

San Ramon      8
Concord        3
Walnut Creek   3

When the Count component output is read by a Report Target, and the plan output viewed in the Report Viewer, you can drill-down on any item heading to view underlying data values.

Configuration

The Count configuration dialog box displays its settings on two tabs:

♦ Inputs

♦ Parameters

Inputs Tab

The Inputs tab lists the data columns available to the Count component from other components in the plan. Select a column to add it to the Report Target.

Parameters Tab

The Parameters tab allows you to select and filter the data values that are counted by the component and passed to the Report Target. It also lets you edit the output names for each counted column. The tab lists the columns selected on the Inputs tab. For each column, three fields are displayed: Min Count, Max Cases, and Output Name.

♦ Min Count. Specifies the minimum number of times a value must occur in a column before being listed in the report output. For example, if a SURNAME column is selected on the Inputs tab, and the Min Count value for SURNAME is 5, then a given surname must appear at least five times in the column to appear on the list of surnames in the generated report. If the surname appears fewer than five times, its occurrences are added to the Filtered total on the report.

♦ Max Cases. The Max Cases field specifies a stopping point for the count operation by setting an upper limit on the number of different values the component lists in the report. When this limit is reached, the number of uncounted records is included in the Others column of the report.

♦ Output Name. The name of each column sent to the target component. You can edit the name in each field.

Example

The following data sample contains eight different surnames in eleven records. A Min Count value of 2 returns all surnames that occur more than once, Smith and Jones. A Max Cases of 7 continues counting until finding seven different names, so the eighth name, Yeung, is added to the Others figure on the report.

SURNAME
1  Smith
2  Jones
3  Adams
4  Jones
5  Smith
6  Brady
7  Baldwin
8  Smith
9  Chase
10 Powell
11 Yeung

The Max Cases setting takes precedence over the Min Count setting. Max Cases determines the number of data “buckets” available in the output. The Max Cases limit can be reached without identifying all the values that meet or exceed the Min Count setting. For this reason, note the percentage of values represented by the Others total.

For example, with the same settings but data ordered differently, as shown below, the most common name would not be listed on the report:

SURNAME
1  Powell
2  Jones
3  Adams
4  Jones
5  Chase
6  Brady
7  Baldwin
8  Yeung
9  Smith
10 Smith
11 Smith

In this case, the Max Cases setting of 7 does not reach the eighth surname, Smith, which in fact is the most common name in the dataset.
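The interaction of Min Count and Max Cases can be sketched in a few lines of Python. This is a simplified model for illustration only, not the Count component's implementation; the function and variable names are invented.

# Illustrative sketch only of the Min Count / Max Cases behavior described above.
def count_values(values, min_count, max_cases):
    counts, others = {}, 0
    for value in values:
        if value in counts:
            counts[value] += 1
        elif len(counts) < max_cases:
            counts[value] = 1          # open a new bucket, up to Max Cases buckets
        else:
            others += 1                # values with no bucket are added to Others
    reported = {v: n for v, n in counts.items() if n >= min_count}
    filtered = sum(n for n in counts.values() if n < min_count)
    return reported, filtered, others

surnames = ["Powell", "Jones", "Adams", "Jones", "Chase", "Brady",
            "Baldwin", "Yeung", "Smith", "Smith", "Smith"]
print(count_values(surnames, min_count=2, max_cases=7))
# ({'Jones': 2}, 6, 3) -- Smith never gets a bucket, so its 3 occurrences fall into Others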

The Parameters options allow you to tune the performance of the plan in a number of ways.



For example, you require the fifty most common surnames in a dataset of one million records. Assuming the surnames are spread randomly throughout the dataset, applying a Max Cases figure in excess of fifty should return the most common surnames without counting all rows.

There is no limit to the number that can be applied for Max Cases. However, when the total number of different counts is greater than 20,000, plan performance may slow. When the number of counts is below 20,000, all values being counted are held in memory. If the number exceeds 20,000, all counts above this number are held in the database as the count operations are carried out.

The following examples demonstrate how the two parameters can be used:

♦ To check for non-unique values in a field that should contain only unique values. Set the Min Count value to 2. The report identifies all non-unique values, those that occur more than once.

The Max Cases field should be set to the number of records in the dataset. This ensures that sufficient counts are performed so that even if the last two rows in the table are the only two with duplicate values, they are identified.

♦ To count the frequency of values in a column where a finite number of different values are possible. In this case, set Min Count to 1 and Max Cases to any value greater than the maximum number of possible values.

Sum

The Sum component calculates sums for the numeric values in each selected column. This component classifies numeric values as positive, negative, invalid, or filtered, and provides count and sum totals for each of these classes.

Use outputs from the Sum component as inputs for the Report Target and DB Report Target.

Note: The Sum component processes positive and negative numbers, for example 10 and -10. Do not prefix a positive number with a + symbol. The Sum component will treat numbers entered in other formats (for example, (10) or “10”) as invalid values.

Configuration

The Sum configuration dialog box contains the following:

♦ Inputs tab

♦ Parameters tab

Inputs Tab

The Inputs tab lists the data columns available to the component from other components in the plan. Check the column name to assign it as an input.

Parameters Tab

Use the options on the Parameters tab to set a minimum value for inclusion in the “Positive” category for each input column.

Positive numeric values that are less than or equal to the Min value for a column are classified as filtered. The default Min value is 0.

Use the Parameters tab to rename the column outputs for the Sum component.
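The value classes described above (positive, negative, invalid, and filtered) can be sketched as follows. This Python example is an illustrative simplification under the stated rules, not the Sum component's implementation.

# Illustrative sketch only of the Sum component's value classes.
def classify_and_sum(values, min_value=0):
    totals = {key: {"count": 0, "sum": 0.0}
              for key in ("positive", "negative", "filtered", "invalid")}
    for raw in values:
        text = str(raw).strip()
        if text.startswith("+") or text.startswith("("):   # formats treated as invalid
            totals["invalid"]["count"] += 1
            continue
        try:
            number = float(text)
        except ValueError:
            totals["invalid"]["count"] += 1
            continue
        if number < 0:
            bucket = "negative"
        elif number <= min_value:
            bucket = "filtered"    # values at or below the Min value are filtered
        else:
            bucket = "positive"
        totals[bucket]["count"] += 1
        totals[bucket]["sum"] += number
    return totals

print(classify_and_sum(["10", "-10", "(10)", "0", "abc"]))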


Aggregation

The Aggregation component provides a number of methods to calculate the frequency of occurrence of data values both in a single column and across multiple columns. It can create detailed metrics that demonstrate value frequencies across a dataset without writing the data in a temporary staging area or using SQL.

The Aggregation’s capabilities include the following:

♦ It tabulates the quantities of records that contain common values in a selected field. The Count component also performs this operation.

♦ It can tabulate the quantities of records that share a set of common values across multiple fields.

♦ It can calculate a sum of the numerical values in a given column.

♦ It can apply conditional rules to the data in selected columns so that additional counts are performed for values that satisfy the conditions. Sum calculations do not use conditions.

The Aggregation component delivers outputs directly to a Database Target. Its outputs are not compatible with other components.

Note: Set the Database Target to Aggregation Mode to enable it to read the Aggregation outputs.

Configuration

The Aggregation’s configuration dialog box displays its settings on three tabs:

♦ Inputs

♦ Parameters

♦ Outputs

Inputs Tab

The Inputs tab lists the data columns available to the component from other components in the plan. Select one or more columns for configuration on the Parameters tab.

Note: When you select one or more columns on this tab, the Aggregation performs an aggregate count operation on all data from these columns. This output appears as the Count field on the Outputs tab. You do not need to configure other parameters to create this output, and you cannot deselect this output in the Aggregation component.

Parameters Tab

The Parameters tab allows you to select and filter the data values that are counted by the component and passed to the Database Target. The tab contains an upper area that lists the columns selected on the Inputs tab and a lower area that lets you define conditions to apply to the inputs.

Beside the input names in the upper area are two columns: Group and Sum.

♦ Check the Group option for one or more input columns to generate totals for each pattern of values that occurs across those columns. See “Calculating in Groups” on page 48.

♦ Check the Sum option for one or more input columns to calculate a total for the numerical values in those columns. See “Calculating Sums” on page 48.

The Parameters tab also contains a Conditional Counts area. This allows you to filter the data to which a count calculation is applied.

♦ Define a conditional count by selecting an input field and operators from the Conditional Count area and clicking Add. To delete a condition, select it in the lower area and click Delete.

You can define conditional counts for individual columns, and you can add multiple conditional counts on this tab.


Calculating in Groups

Table 4-2 provides sample bank account data that illustrates how group calculations work.

Table 4-2. Sample Input Data for Aggregation Component

NAME               CITY       STATE   BALANCE
John Smith         Brooklyn   NY      36541.64
Mary Jones         Brooklyn   NY      6345.87
Estelle Franklin   Brooklyn   NY      354.12
Brian Franklin     New York   NY      -650.01
Tina Brooks        New York   NY      3515.21
Charles Cowell     New York   NY      216.87
Marian Hodges      New York   NY      32.81
Kate Lee           Albany     NY      354.21
Albert Chung       Albany     NY      15498.32
Gillian Ross       Buffalo    NY      244.66

Figure 4-1 illustrates a sample configuration for the Aggregation component based on this data:

Figure 4-1. Aggregation Component Dialog Box, Parameters Tab

In Figure 4-1, the Group options for CITY and STATE are checked. Thus the component will aggregate data patterns across both columns and send the following totals to a Database Target:

Brooklyn   NY   3
New York   NY   4
Albany     NY   2
Buffalo    NY   1

Calculating Sums

In Figure 4-1, the Sum option is checked for the BALANCE column. Thus the component will calculate the sum of all values in this column, which is $62,453.70.

Sum calculations ignore all non-numeric data.



Conditional Counts

The Conditional Counts area lets you define a condition with Argument, Operator, and Value variables. A condition acts as a filter for count calculations in the selected column.

Argument. The input column whose data will be filtered.

Operator. A mathematical operator applied to the argument data.

Value. The filter value.

Figure 4-1 contains a condition that will count the quantity of negative values in the BALANCE column, which equates to the quantity of overdrawn accounts. You cannot define conditions for Sum calculations.
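To illustrate the three kinds of calculation described in this section (group counts, sums, and conditional counts), here is a short Python sketch using the Table 4-2 data. It is a simplified illustration, not the Aggregation component's implementation.

# Illustrative sketch only: group counts, a sum, and a conditional count.
from collections import Counter

rows = [
    ("Brooklyn", "NY", 36541.64), ("Brooklyn", "NY", 6345.87),
    ("Brooklyn", "NY", 354.12),   ("New York", "NY", -650.01),
    ("New York", "NY", 3515.21),  ("New York", "NY", 216.87),
    ("New York", "NY", 32.81),    ("Albany", "NY", 354.21),
    ("Albany", "NY", 15498.32),   ("Buffalo", "NY", 244.66),
]

# Group: totals for each pattern of values across CITY and STATE.
group_counts = Counter((city, state) for city, state, _ in rows)

# Sum: total of the numeric BALANCE column.
balance_sum = sum(balance for _, _, balance in rows)

# Conditional count: BALANCE < 0 (overdrawn accounts).
overdrawn = sum(1 for _, _, balance in rows if balance < 0)

print(group_counts)            # Brooklyn/NY: 3, New York/NY: 4, Albany/NY: 2, Buffalo/NY: 1
print(round(balance_sum, 2))   # 62453.7
print(overdrawn)               # 1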

Outputs Tab

This tab lists the outputs that are written to the Database Target. You can edit the output names.

Figure 4-2 shows the outputs for the Parameters set in the previous example.

CITY and STATE. The quantities of common values in these fields will be calculated in group fashion. Group calculations are not prefixed.

Count. This output is created when a column is selected on the Inputs tab. It sends a count of all value quantities in all columns selected on the Inputs tab to the Database Target.

(Sum)BALANCE. All numbers in the BALANCE column will be added together and the sum sent to the Database Target.

(Where)BALANCE<0. The quantity of negative balances will be sent to the Database Target.

MinAvgMax

This component returns the minimum, maximum, and average data values for selected columns.

The MinAvgMax only recognizes data in the Float datatype that originates as output from the Rule Based Analyzer.

Configuration

The MinAvgMax configuration dialog box displays an Inputs tab with a single pane beneath listing the columns you can use. Only numeric fields appear in the Inputs tab.

The calculations for the selected columns are sent to the Report Target.



Range Counter

The Range Counter calculates the frequency and distribution of numerical data in selected fields. It does so by counting the number of values that fall within user-defined intervals in the data.

To configure the Range Counter, select a data column and an interval, or a series of custom intervals, to apply to the data. You can define multiple such instances within the component.

Configuration

The Range Counter configuration dialog box contains the following:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance by working with the options on the Inputs and Parameters tabs.

To add an instance, right-click in this pane and select Add from the context menu. You can also remove an instance by selecting Delete from the context menu.

Inputs Tab

The Inputs tab lists the data columns available to the component from the other components in the plan. Check the column name to assign it to the highlighted instance in the Components pane.

Parameters Tab

The options on the Parameters tab determine how the range of data is represented in the report. The parameters divide the data into meaningful subsets. While the Count component counts the overall number of data values in a given column, the Range Counter divides the column data into subsets and counts the data values in each subset.

The parameters are organized in two areas, Select Range Type and Select Intervals. The Select Range Type area provides two options:

♦ Linear Numeric Range. Select to apply a uniform interval to the data column associated with the highlighted instance.

When you select this option, the Select Intervals area displays a single Interval Value field. The value you enter determines the size of the subsets in which the reported data is organized.

♦ Variable Numeric Range. Select to apply custom intervals to the data column associated with the highlighted instance. When you select this option, the Select Intervals area displays. When you first configure the component, this area shows a single row with three fields: Label, Start, and End. It also shows an All check box. You can add as many rows as you need. Each row defines an interval, and each interval can be a different size.

Label field. Allows you to enter a descriptive label for the data row that appears in the report.

Start and End fields. Allow you to set the interval boundaries for the ranges displayed in the report.

Add button. Adds a row beneath the existing rows.

Remove button. Deletes the selected row. To delete a row from the report, check its box and click Remove. To delete all rows, check the All option and click Remove.
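The two range types can be sketched in Python as follows. This is an illustrative simplification (the interval boundary handling shown here is an assumption), not the Range Counter's implementation.

# Illustrative sketch only of linear and variable range counting.
from collections import Counter

values = [3, 7, 12, 18, 25, 31, 44, 52]

# Linear Numeric Range: a uniform Interval Value of 10 buckets values as 0-9, 10-19, ...
interval = 10
linear_counts = Counter((v // interval) * interval for v in values)
print({f"{start}-{start + interval - 1}": n for start, n in sorted(linear_counts.items())})

# Variable Numeric Range: custom (Label, Start, End) rows, each interval its own size.
ranges = [("low", 0, 10), ("medium", 10, 30), ("high", 30, 100)]
variable_counts = {label: sum(1 for v in values if start <= v < end)
                   for label, start, end in ranges}
print(variable_counts)   # {'low': 2, 'medium': 3, 'high': 3}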


Missing Values

The Missing Values component searches for specific values in an input field and determines the frequency of the values within the field. Use it to search for known bad or absent data values.

The Report Target creates a table listing the searched-for values and the number of times they occur in the related column.

Configuration

The Missing Values configuration dialog box contains an upper pane that lists the data columns available to the component, and a Missing Values pane to specify the data values you want to find.

To configure the component, highlight and select a data column in the upper pane. Next, right-click in the Missing Values pane and select Add Value or Add Null Value from the context menu.

When you select Add Value, a message appears. Double-click the text as prompted and type a value on the edit line. The value you provide will be assigned to the highlighted column. To save your changes, press Enter before moving from the edit line. You can assign multiple values to a single column.

Note: You can select all columns in the upper pane with a context menu option. However, values are assigned only to the highlighted column. You can also add multiple values for a single column.

Selecting Add Null Value adds the text “Null Value” to the pane and instructs Data Quality to search for null values in the selected column.

To delete a value from the Missing Values pane, select Delete Value from the context menu.
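A minimal Python sketch of this kind of search is shown below; it counts occurrences of the specified values and of nulls in a column. It is illustrative only (treating empty strings as nulls is an assumption), not the Missing Values component's implementation.

# Illustrative sketch only: count specified values and nulls in a column.
def missing_value_counts(column_values, search_values, include_null=True):
    counts = {value: 0 for value in search_values}
    if include_null:
        counts["Null Value"] = 0
    for value in column_values:
        if value is None or value == "":
            if include_null:
                counts["Null Value"] += 1
        elif value in counts:
            counts[value] += 1
    return counts

status = ["OK", "N/A", None, "Missing", "OK", "", "Other"]
print(missing_value_counts(status, ["N/A", "Missing", "Other"]))
# {'N/A': 1, 'Missing': 1, 'Other': 1, 'Null Value': 2}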


C H A P T E R 5

Analysis Components

This chapter includes the following topics:

♦ Overview, 53

♦ Character Labeller, 53

♦ Token Labeller, 56

Overview

Analysis components are used to identify data quality problems within individual fields in a dataset. The analysis components identify features within free-text or non-numeric fields. The frequency of these features can then be counted using the Count component and included in the plan report. The features can also be used directly in cleansing and standardization routines.

Data Quality provides the following analysis components:

♦ Character Labeller

♦ Token Labeller

Character Labeller

The Character Labeller creates a character-by-character profile of data values in a data field. The component categorizes some or all characters in the input fields according to character type. The character types recognized by the component are:

♦ Alpha. An alphabetic character. The default label is c.

♦ Digit. A numeric character. The default label is n.

♦ Symbol. A symbol, such as a period. The default label is s.

♦ Space. Any space between data elements. The default label is _.

You can configure the component to identify all instances of one or more of these types in the input data. The Character Labeller searches each field in the dataset for the character types you specify and writes a new column containing codified representations of where your selections occur.

For example, the Character Labeller labels the string “01/01/2008” as “nn/nn/nnnn” with the Digit type selected. It labels the same string as “nnsnnsnnnn” with the Digit and Symbol types selected.
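The labelling behavior in this example can be sketched in a few lines of Python. This is an illustrative simplification, not the Character Labeller's implementation.

# Illustrative sketch only of character labelling by type.
LABELS = {"alpha": "c", "digit": "n", "symbol": "s", "space": "_"}

def label_characters(text, selected):
    out = []
    for ch in text:
        if ch.isalpha():
            kind = "alpha"
        elif ch.isdigit():
            kind = "digit"
        elif ch.isspace():
            kind = "space"
        else:
            kind = "symbol"
        # Only the selected character types are replaced by their label;
        # all other characters pass through unchanged.
        out.append(LABELS[kind] if kind in selected else ch)
    return "".join(out)

print(label_characters("01/01/2008", {"digit"}))            # nn/nn/nnnn
print(label_characters("01/01/2008", {"digit", "symbol"}))  # nnsnnsnnnn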


You can change the labels assigned to the character types. You can also define custom labels that represent a single character value or a set of character values.

Configuration

The Character Labeller configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Filters tab

♦ Dictionaries tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. Use the Components pane to define an instance of the component for use in the plan.

When first opened, this pane lists a single unconfigured instance. Configure this instance by working with the options on the tabs.

To add an instance, right-click in this pane and select Add from the context menu. You can remove an existing instance by highlighting it and selecting Delete from the context menu.

Inputs Tab

This tab lists the data columns available to the component from the other components in the plan. Check a column name to assign the column to the instance highlighted in the Components pane. You can assign a single input to each instance.

Parameters Tab

The Parameters tab options are organized in two areas:

♦ Standard Symbols. This area lists the standard symbols that can be applied to input data. To filter the input fields for a character type, check its check box. If you clear a box, the underlying data for that character type is returned.

You can select multiple character types for each instance of the component. You can also edit the symbols returned for the character types. Table 5-1 lists the default symbols for each character type:

♦ Substring. This area provides options for returning the underlying data characters instead of the character symbols for data in a field. It returns underlying characters based on their positions in the field.

For the data fields on the selected component instance, you can determine how many underlying characters to return and where in the field to locate them.

Check Use Position to activate these settings.

Table 5-1. Character Type Default Symbols

Character Type   Default Symbol
Alpha            c
Digit            n
Space            _ (underscore)
Symbol           s


− Start Position. Determines the starting location in the field for this operation. For example, with a setting of 3, the Character Labeller returns underlying data starting at the third character in the string.

− Length. Determines the number of underlying characters to be returned, starting with the character identified by the Start Position setting. For example, in a Date field with values in the mm/dd/yyyy format, a Start Position of 7 and a Length of 4 returns the underlying year values for this field. You must enter a value in this field to activate the substring settings.

Filters Tab

The Filters options allow you to define filters for the input data on a component instance. You can use one or more characters to define a filter. When the Character Labeller encounters the filter string in the input data, it returns the underlying data characters rather than the character type symbol.

For example, in a numeric field containing quantities, such as the number of transactions in an account, you might define a filter of 0 (zero) as it is impossible that a customer would have zero transactions. In such a case, non-zero values will be reported by the Digit symbol while values of zero will be reported by the zero digit.

♦ To create a filter, right-click in the Filters pane and select Add from the context menu. This opens the Filter Setup dialog box. Type the required string in the Filter Text field and set the Enable Substring options if required. If you do not select Enable Substring, the filter will apply to all characters in the field.

♦ Check Use Position to activate the substring settings.

− The Start Position option determines the starting location in the field for the filter operation.

− The Length option determines the number of underlying characters to be returned, starting with the character identified by the Start Position setting. You must enter a value in this field to activate the substring settings.

− The Case Sensitive option applies the filter text in a case-sensitive manner, that is, the filter will only recognize alphabet characters in the same case (upper or lower) as the characters in the Filter Text field.

♦ The Transform all filtered text to upper case option changes the case of filtered characters to upper case. This option does not affect the operation of the Case Sensitive option. Transform all filtered text to upper case operates on text that has already passed the Case Sensitive option, if the latter option is selected.

Dictionaries Tab

This tab allows you to apply dictionaries to the input data for the highlighted component instance. A dictionary acts as another type of filter for the input data. Any character strings that appear in the dictionary are filtered, and a user-defined character is returned for them.

For example, you can apply a dictionary of state names to a customer address file, having first removed the name of your home state. Using this dictionary, you can set the Character Labeller to replace any values in the state field with an easily recognizable value such as X. This may assist a business that charges different postal rates for out of state customers.

To add a dictionary, right-click in the Dictionaries pane and select Add from the context menu. The Dictionary Setup dialog box opens. In this dialog, click the Select button to browse to a dictionary, and type a single filter character in the Format Text field. The Character Labeller uses one character only.

Note: You must set the Enable Substring options on this tab if you select a dictionary. You cannot apply a dictionary to all characters in a field.

♦ Check the Use Position option to activate the substring settings.

− The Start Position field determines the starting location in the field for the dictionary filter operation.

− The Length field determines the number of underlying characters to be filtered, starting with the character identified by the Start Position setting.

Note: The Character Labeller applies dictionaries to the dataset in the order they are listed under the Dictionaries tab for a highlighted component. You can adjust the dictionary order using the Up/Down arrows.


Outputs Tab

This tab lists the names of the data outputs for the highlighted component instance as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

Token Labeller

The Token Labeller analyzes the format of the data values within a field and categorizes each value according to a list of standard or user-defined tokens.

The Token Labeller component defines nine standard tokens:

♦ Word (alphabetic)

♦ Number (numeric)

♦ Code (alphanumeric mix)

♦ Initial (single alphabetic character)

♦ Init Set (multiple alphabetic characters)

♦ Symbol (punctuation or other symbols)

♦ Dictionary

♦ Word Symbol (mix of alphabet and symbols)

♦ Code Symbol (mix of alpha-numeric tokens and symbols)

The Token Labeller searches the dataset for the tokens you specify and returns a profile detailing how these tokens occur in the dataset.

Table 5-2 shows a sample Customer_Name data extract:

Table 5-3 displays a data profile itemizing the occurrences of tokens in the data extract:

You can define additional token types for the Token Labeller. Customized tokens are called filters in the Token Labeller configuration dialog box.

Table 5-2. Sample Customer_Name Data Extract

Customer_Name         Customer_Name
Mr Matthew Evans      Robert Chad Griffin
Jason R Taylor        Ms Megan Adams
Amanda Parker         Antonio Reed
Heather Gray          D M Jenkins
Scott Campbell        Mrs L Perry

Table 5-3. Profile of Tokens

Data Values                     Quantity   Percent
firstname surname               4          40
nameprefix firstname surname    2          20
nameprefix initial surname      1          10
initial initial surname         1          10
firstname initial surname       1          10
firstname firstname surname     1          10
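A profile like Table 5-3 can be sketched with a short Python example that tokenizes each name against small reference sets. The reference sets below are tiny stand-ins invented for illustration; this is not the Token Labeller's implementation.

# Illustrative sketch only: build a token profile from sample names.
from collections import Counter

NAME_PREFIXES = {"Mr", "Mrs", "Ms"}
FIRST_NAMES = {"Matthew", "Jason", "Amanda", "Heather", "Scott",
               "Robert", "Chad", "Megan", "Antonio"}
SURNAMES = {"Evans", "Taylor", "Parker", "Gray", "Campbell",
            "Griffin", "Adams", "Reed", "Jenkins", "Perry"}

def tokenize(name):
    tokens = []
    for word in name.split():
        if word in NAME_PREFIXES:
            tokens.append("nameprefix")
        elif len(word) == 1 and word.isalpha():
            tokens.append("initial")
        elif word in FIRST_NAMES:
            tokens.append("firstname")
        elif word in SURNAMES:
            tokens.append("surname")
        else:
            tokens.append("word")
    return " ".join(tokens)

names = ["Mr Matthew Evans", "Jason R Taylor", "Amanda Parker", "Heather Gray",
         "Scott Campbell", "Robert Chad Griffin", "Ms Megan Adams",
         "Antonio Reed", "D M Jenkins", "Mrs L Perry"]

profile = Counter(tokenize(n) for n in names)
for pattern, quantity in profile.most_common():
    print(pattern, quantity, 100 * quantity // len(names))   # Data Values, Quantity, Percent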


Configuration

The Token Labeller configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Filters tab

♦ Dictionaries tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component that are available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance by working with the options on the tabs.

To add an instance, right-click in this pane and select Add from the context menu. You can also remove an instance from this pane by selecting Delete from the context menu.

Inputs Tab

This tab lists the data columns available to the component from the other components in the plan. Check a column name to assign the column to the instance highlighted in the Components pane. You can assign a single input to each instance.

Parameters Tab

The Parameters tab options are organized in the following areas:

♦ Tokens. Lists the standard tokens that can be applied to input fields. To filter the input fields for a token type, select the token. You can select multiple tokens for each instance of the component. If you clear a selected token, the underlying data for that token type is returned.

♦ Case Sensitive. Lists the standard tokens that can be rendered in upper or lower case, except Number and Symbol. To generate case-sensitive output for a token type, select the token.

Case-sensitive output means that the token appearance in the analysis output will mirror the case of the related characters in the source data. For example, with case sensitivity applied, the name Lyndon B Johnson is rendered, “Word INIT Word.” With case sensitivity inactive, the name is rendered “word init word.”

♦ Lookup. Check to apply case sensitivity to any dictionaries specified on the Dictionaries tab.

♦ Delimiters area. Provides a list of the punctuation symbols used to delimit data entries in a flat file. As with the Tokens area, select a symbol if you want to use it as a delimiter between data fields. Any punctuation marks or symbols not selected are considered part of the dataset.

Filters Tab

The Filters options allow you to define and edit custom token types for a component instance and to specify the data values to correspond to those types.

For example, data might contain fields of null or system-default data with their null status represented in multiple ways, such as Null, Missing, N/A, or Other. The Filters tab allows you to create a token type, such as “Null” and assign one or more data values to it. When the Token Labeller encounters that value, it identifies it as the token you have created. In effect, a filter type with multiple values assigned to it is a form of reference dictionary.

To create a filter:

1. Right-click in the Filters pane and select Add from the context menu.

This opens the Filter Setup dialog box.


2. In the Format Text field, enter a filter type, that is, a token type.

3. Type a data value in the Filter Text field.

When the Token Labeller encounters the Filter Text value, it generates the Format Text custom token type. You can add multiple filters with different Filter Text entries and a common Format Text entry.

The context menu also provides options to edit and delete filters from a component instance.

Note: Filters defined on this tab are not governed by the Parameters tab options. They are always applied to the input data for the component instance with which they were created.

Dictionaries Tab

This tab allows you to use one or more reference dictionaries as token identifiers. The Token Labeller assigns dictionary entries to a single token type.

For example, you add a US_CITY dictionary to an instance of the component and assign the token type CITY to it. Now any value in the dataset that matches a dictionary value will be recognized as the token type CITY by the Token Labeller.

To add a dictionary:

1. Right-click in the Dictionaries pane and select Add from the context menu.

This opens the Dictionary Setup dialog box.

2. In this dialog, click Select and browse to a dictionary.

3. In the Format Text field, type a name for the dictionary value type, that is, a token type.

In the Dictionary Setup dialog box, the Inclusive and Priority options determine how the Token Labeller treats the data values it recognizes in a dictionary:

♦ Inclusive. When selected, the Token Labeller assigns the Format Text label to every data value it finds in the dictionary for the highlighted instance. If this box is cleared, the Token Labeller assigns the Format Text label to all data values that are not listed in the dictionary for the highlighted instance. This option is useful for identifying invalid or non-dictionary matches.

♦ Priority. Determines how the Token Labeller treats strings that match a dictionary entry. If this box is checked, the Token Labeller treats the entire contents of a field as a single entity and labels it as a dictionary match. If this box is cleared, the Token Labeller treats the matching string as a dictionary match and labels the rest of the field separately.

For example, a company name column contains a field with the string “Informatica Corporation.” A Corporate Suffix dictionary is applied to this column, so the Token Labeller identifies any string containing Ltd, Inc, Corp, LLP, or any other standard corporate suffix.

When you check Priority for the Corporate Suffix dictionary, the Token Labeller treats the string “Informatica Corporation” as a single entity and returns a corresponding value: companyname. If you clear this option, the Token Labeller returns two values for this string: word companyname.

Note: The Token Labeller applies dictionaries to the dataset in the order they are listed under the Dictionaries tab. You can adjust the dictionary order using the Up/Down arrows.

When multiple dictionaries have been assigned to a component instance and a data value appears in more than one such dictionary, the Token Labeller applies the token defined for the first dictionary in which it finds the value.

Outputs Tab

This tab lists the names of the data outputs for the highlighted component instance as they appear in other components in the plan. Double-click a name to edit it. To save your edits, press Enter before removing focus from the field.

You can save the data output from a Token Labeller instance as metadata with the following procedure.


To save data output from a Token Labeller:

1. In the Meta Data area of the output pane, check Store.

This activates the Metadata and Profile menu fields.

2. Type the metadata and profile names in these two fields or select from existing names.

3. Click OK.

There is no need to create metadata more than once. After metadata has been created for a component instance, you can clear the Store option so metadata is not recreated each time the plan runs. Recreate metadata only when the plan input dataset changes.


C H A P T E R 6

Transformation Components

This chapter includes the following topics:

♦ Overview, 61

♦ Search Replace, 61

♦ Word Manager, 63

♦ Merge, 64

♦ To Upper, 65

♦ Rule Based Analyzer, 67

♦ Scripting, 69

Overview

Data Quality transformation components allow you to adjust source data. They are typically used in standardization plans.

Data Quality provides the following transformation components:

♦ Search Replace

♦ Word Manager

♦ Merge

♦ To Upper

♦ Rule Based Analyzer

♦ Scripting

Note: Transformation components create new fields for altered data. The original data remains untouched.

Search Replace

Use this component to standardize data. Like the Word Manager, the Search Replace component can be used to remove unwanted values from the data. While the Word Manager uses dictionaries, the Search Replace component uses user-defined values.

You can use the Search Replace component in the following ways:


♦ Search for a user-defined data string and remove it from the dataset.

♦ Search for a user-defined data string and replace it with another string.

♦ Insert a user-defined data string at the start or end of a field.

Configuration

The Search Replace configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Actions tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

To add an instance, right-click in this pane and select Add from the context menu. You can also remove an instance by selecting Delete from the context menu.

Inputs Tab

The Inputs tab lists the data columns available to the component instance highlighted in the top pane. Select a field by highlighting it and clicking its check box. You can select a single column for each highlighted instance.

Actions Tab

The Actions tab lists the search and replace operations defined for the highlighted component instance. To add an action, right-click in the pane and select Add from the context menu. This opens the Action Setup dialog box:

The dialog box provides three options — Replace, Remove, and Insert — and a grid of text fields where you can type one or more strings to be replaced or removed. Below this grid is a field where you can type any values that you want to add to data. At the bottom of the dialog box are three buttons that determine where in each input field the search and replace operation should be conducted.

The settings in this dialog box depend on the type of action you require. If you select Replace, all fields remain available, so you can search for one or more strings and replace them with another string. If you select Remove, the With field is disabled. If you select Insert, the search grid and the Anywhere option are disabled.

The search grid has twelve input fields by default. To add more fields, right-click in the grid and select Add from the context menu. Likewise you can right-click and select Delete from the context menu to remove a row from the grid. The highlighted row will be removed.



When you have finished working in this dialog box, click OK to save your action. To edit a previously created action, right-click it and choose Edit from the context menu.

If your Search Replace component contains multiple actions, you can change the order in which these actions are performed. Select an action and click the arrows to move it up or down in the list.

Outputs Tab

The Output tab lists the names of the data outputs for the highlighted component instance as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

Word Manager

The Word Manager applies one or more reference sources (data dictionaries) to an input dataset and can therefore be used to assess and improve the usability of the dataset.

The Word Manager is used for three main tasks:

♦ Determining the accuracy or inaccuracy of data in a column based on a reference source.

♦ Removing terms from a data column.

♦ Replacing terms in a data column.

Principally the Word Manager is used for data enhancement operations.

For example, by comparing an address data column containing European city names with a reference dictionary of city names, you can evaluate the accuracy of data in this column.

If the dictionary includes variant spellings of city names, you can use the Word Manager to standardize spelling by creating a new output column based on the dictionary entries.

You can check for original data entries that are not recognized by the dictionary. The Word Manager provides an option to return only those values that are not recognized by the dictionary. The output column contains only non-standard data. You can then subject that data to further evaluation.

Configuration

The Word Manager configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Dictionaries tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

To add an instance, right-click in this pane and select Add from the context menu. You can also remove an instance by selecting Delete from the context menu.


Inputs Tab

This tab lists the data columns available to the component from the other components in the plan. Check a column to assign that column to the instance highlighted in the Components pane. You can assign a single input to each instance.

Parameters Tab

The Parameters tab displays two groups of editable options:

♦ Dictionary Lookup (Case Sensitive). Applies to any dictionaries you specify for the data on the Dictionaries tab. Check this option if the parsing operation should apply dictionaries to the input data in a case-sensitive manner.

♦ Delimiters. Displays a list of delimiting characters. Check the delimiters applicable to your source dataset.

If your input data includes multi-domain fields, you must indicate the delimiters in use in the dataset so that the Word Manager can distinguish between the words in the field and apply the transformation rules you define.

Dictionaries Tab

This tab allows you to use one or more reference dictionaries to analyze or improve input data.

To add a dictionary, right-click in the Dictionaries pane and select Add from the context menu. This opens the Dictionary Setup dialog box. In this dialog, click Select to browse to a dictionary.

The Remove Dictionary Matches option ensures that only input data values that are not recognized by the dictionary are returned in the output column.

Dictionaries are applied to the input data in the order listed in the Dictionaries pane. You can change this order with the Up and Down arrows.

Outputs Tab

This tab lists the names of the data outputs for the highlighted component instance as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

Merge

The Merge component combines the data values from multiple input fields to form a single output field. This component is common in standardization and analysis plans. For example, you can combine Customer_Firstname and Customer_Surname fields to create a new field called Customer_Name. You set the order in which the input values are merged. For example, you can create a Customer_Name field in which the surname precedes the first name or the first name precedes the surname.
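
In conceptual terms, the Merge operation is an ordered join. The following Python sketch is a rough illustration only, not the component's implementation, and the field values shown are hypothetical:

def merge_fields(values, join_char=" "):
    # Combine the input values in the configured order, separated by the
    # selected join character.
    return join_char.join(values)

# Surname field checked first, joined with a space:
# merge_fields(["Smith", "John"]) returns "Smith John"
# First name checked first, joined with a comma:
# merge_fields(["John", "Smith"], ",") returns "John,Smith"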

Configuration

The Merge configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Outputs tab


Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

To add an instance, right-click in this pane and select Add from the context menu. You can also remove an instance by selecting Delete from the context menu.

Inputs Tab

The Inputs tab lists the data fields available for assignment to the highlighted component. Select a field by highlighting it and clicking its check box. Select at least two fields on this tab.

Note: The order in which you check the boxes determines the order in which the columns are merged. If, in the example above, you check the Customer_Surname field before the Customer_Firstname field, the merged output lists the surname before the first name. The default name given to the output for the instance lists the field whose box was checked first.

Parameters Tab

This tab displays the output order of the selected inputs and the join character used to merge them. To change the output order, select an input and click the arrows to move it up or down in the list.

In the Select Join Character dropdown, choose the character to place between the merged items. Table 6-1 lists the available characters:

Table 6-1. Available Join Characters for the Merge Component

Space          Double Quote    Comma            Full Stop
Semi-Colon     Single Quote    Underscore       Tab
Dash           Pipe            Forward Slash    At Symbol (@)
NONE

Outputs Tab

This tab lists the names of the configured outputs as they appear in any other components connected to the Merge component. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

To Upper

The To Upper component provides several ways to alter the case of a dataset. The component provides pre-set methods to transform case and also allows you to use dictionaries when determining which strings to transform.

To Upper is often used to create data uniformity before matching, standardization, or analysis operations.

Configuration

The To Upper configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab


♦ Parameters tab

♦ Delimiters tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

To add an instance, right-click in this pane and select Add from the context menu. You can also remove an instance by selecting Delete from the context menu.

Inputs Tab

The Inputs tab lists the fields available for assignment to the highlighted instance. Select a field by highlighting it and clicking its check box. You can add multiple fields to a single component instance. Each input field has its own output field.

Parameters Tab

On this tab, the Case Transform area allows you to select the transformation method for the case of the data, and the Options area provides additional options for dictionary use and for preserving data that is already in uppercase form.

The methods for transforming case are as follows:

♦ Uppercase. Converts all letters to uppercase.

♦ Lowercase. Converts all letters to lowercase.

♦ Toggle Case. Converts each lowercase letter to uppercase and vice versa.

♦ Title Case. Capitalizes the first letter in each sub-string.

♦ Sentence Case. Capitalizes the first letter of the field data string.

♦ No transform. No case transformation is applied. This option is generally used with the Capitalize option.
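
As a rough illustration only, and not the component's implementation, the pre-set methods correspond approximately to the following Python string operations:

# Approximate Python equivalents of the case transform methods:
case_transforms = {
    "Uppercase": str.upper,
    "Lowercase": str.lower,
    "Toggle Case": str.swapcase,
    "Title Case": str.title,          # first letter of each sub-string
    "Sentence Case": str.capitalize,  # first letter of the field string only
}

# case_transforms["Title Case"]("informatica data quality")
# returns "Informatica Data Quality"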

The Options area provides the following options:

♦ Capitalize Using Dictionary Entries. Use this option if you want to use a reference dictionary to identify data strings for capitalization. Click Select to browse to a dictionary. Data strings recognized in the dictionary are returned in the case style of their respective dictionary entries.

♦ Leave UPPERCASE Words as Found. Use this option to override the Capitalize option if the input data string is already in upper case.

Delimiters Tab

When the input dataset consists of multi-domain fields, you might need to specify the delimiting symbol used in the fields. The Delimiters tab lists the delimiters recognized by the component:

Table 6-2. Available Delimiters for the To Upper Component

Space          Double Quote    Comma            Full Stop
Semi-Colon     Single Quote    Underscore       Tab
Dash           Pipe            Forward Slash    At Symbol (@)

Check the delimiters you want the component to recognize. You can use multiple delimiters.


Outputs Tab

This tab lists the names of the configured outputs as they appear in other components connected to the To Upper component. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

Rule Based Analyzer

The Rule Based Analyzer allows you to define and apply one or more business rules to selected input data. It requires no previous knowledge of scripting or coding.

You can define two types of rules in this component: Condition and Assignment. Define a conditional rule using IF-THEN-ELSE logic. Define an assignment by assigning a value to an output.

Configuration

When opened, the Rule Based Analyzer configuration dialog box displays any rules defined for the component. Rule names appear in the Description column. The Status field indicates whether the plan can run the rule as currently defined. A red icon in this field indicates that the rule has not been properly configured.

To add a rule, right-click in this pane and select Add Condition or Add Assignment from the context menu.

When you add a rule, default text appears in the Description field. Double-click in the field to edit the default text. To configure the rule, right-click in this field and select Edit from the context menu.

Selecting Edit for a condition rule opens the Standard Rule dialog box. Selecting Edit for an assignment rule opens the Set Rule dialog box.

Defining a Conditional Rule

The Standard Rule dialog box lists the IF, THEN, and ELSE statements defined for the component. You can add multiple sets of statements. To edit a statement, right-click it and select Edit from the context menu. Editing a statement involves working with a Rule Wizard to define the criteria for the statement.

When you enter multiple statements in the IF pane, those statements have an AND relationship.

The condition outputs are identified in the lower half of the Standard Rule dialog box. You can define multiple outputs and assign a THEN or ELSE statement to any one of them.

Defining an Assignment Rule

The Set Rule dialog box provides fewer options than the Standard Rule dialog box. In place of the If, Then, and Else panes, it has a single SET pane that lists the assignment settings defined for the rule. To edit a SET statement, right-click its name and select Edit from the context menu.

As with conditional rules, editing a SET statement involves working with a Rule Wizard to define the criteria for the statement. Similarly, you can define multiple potential outputs in the lower half of the dialog box and assign the SET statement to any one of them.

The conditional rule logic is essentially a superset of assignment rule logic. If you add another THEN or ELSE statement to a conditional rule, the Standard Rule dialog box displays the text “Assignment statement, right click and select Edit.”

Expert Mode

The rule wizards allow you to write condition and assignment rules even if you have no knowledge of programming. However, these rules retain their underlying code and syntax. To view and edit the underlying code, use the Expert Mode option in the Standard and Set Rule dialog boxes. The code below is taken from a conditional rule defined to check the validity of data values in Input1 by comparing them with a reference dataset, Input2:

IF (Input1 = Input2) THEN
    Output1 := "INVALID"
ELSE
    Output1 := "VALID"
ENDIF

Use Expert Mode to construct more complex rules than are possible in the rule wizard, such as nested IF statements.

Click the Validate button to validate the syntax of a rule.

Click OK to save your work. Informatica Data Quality displays an error message if the rule is invalid.

You can save an invalid or incomplete rule in Expert Mode. Complete or repair the rule before running the plan.

Clearing the Expert Mode option before saving your work restores the dialog box defaults and discards any changes you have made in the Scripts window.

For a list of keywords and expressions usable in Expert Mode, see “Rule Based Analyzer Rule Statements” on page 127.

Example: CONTAINS Function

Use the CONTAINS function to create a rule that determines if a given string contains a user-defined value. This function is useful when checking if data entry strings contain predicted data, for example, checking the validity of a product code at the point of data entry.

The syntax for creating such a CONTAINS rule in Expert Mode is as follows:

Output1 := CONTAINS (Input2, Input1)

Where Input1 is the input string and Input2 is the string to be located.

The function returns an integer indicating the position of the first character of the value within the string. If the value is present in multiple positions in the string, the function returns the first position in which it occurs. If the value is not present, the function returns 0.

The CONTAINS function is case-sensitive.
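
The following Python sketch illustrates the behavior described above. It assumes the returned position is counted from 1, so that 0 can signal that the value is absent; it is an illustration of the semantics, not the Rule Based Analyzer's implementation:

def contains(value, input_string):
    # Mirrors CONTAINS(Input2, Input1): case-sensitive search for value in
    # input_string, returning the 1-based position of its first occurrence,
    # or 0 when the value is not present.
    position = input_string.find(value)
    return position + 1 if position >= 0 else 0

# contains("Corp", "Informatica Corp") returns 13
# contains("corp", "Informatica Corp") returns 0 (the search is case-sensitive)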

Example: DATECONVERT Function

Use the DATECONVERT function to create a rule that converts a date to a different format. For example, a plan might use a rule that converts a date from typical UK format (DD/MM/YYYY) to U.S. format (MM/DD/YYYY). The syntax for such a rule is:

Output1 := DATECONVERT(Input1,"DD/MM/YYYY","MM/DD/YYYY")
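
In conceptual terms, the rule behaves like the following Python sketch, which illustrates the conversion but is not the product's implementation:

from datetime import datetime

def date_convert(value, in_format="%d/%m/%Y", out_format="%m/%d/%Y"):
    # Parse the date using the input pattern (DD/MM/YYYY) and re-emit it
    # in the output pattern (MM/DD/YYYY).
    return datetime.strptime(value, in_format).strftime(out_format)

# date_convert("25/12/2008") returns "12/25/2008"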

Date Functions

Date functions only accept numerical dates and do not accept leading or trailing spaces. Use a slash to separate date elements in input strings. The Rule Based Analyzer processes all Gregorian dates.

When a two-digit year value is entered, Data Quality uses the following rules to determine the century:

♦ If the two-digit year value is less than ten, the year is treated as twenty-first century. Therefore, the Rule Based Analyzer handles the year digits 00-09 as 2000-2009.

♦ If the two-digit year value is ten or more, the year is treated as twentieth century. Therefore, the Rule Based Analyzer handles the year digits 10-99 as 1910-1999.
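
A minimal sketch of the two-digit year rules, for illustration only:

def expand_two_digit_year(yy):
    # yy is the two-digit year value as an integer (0-99).
    # 00-09 are treated as 2000-2009; 10-99 are treated as 1910-1999.
    return 2000 + yy if yy < 10 else 1900 + yy

# expand_two_digit_year(5) returns 2005
# expand_two_digit_year(85) returns 1985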


Treatment of Locale Numbers

All numerical inputs and outputs in the Rule Based Analyzer are interpreted in a locale-specific format. For example, when using a French locale setting, the Rule Based Analyzer accepts inputs and generates outputs using the comma as a decimal separator.

If you want to use numbers in a format that differs from the default setting, place them in quotation marks, as shown in the second point below:

♦ Generic format: 1.65

♦ Locale format: “1,65”

Error Handling

When invalid parameters are passed into Rule Based Analyzer functions, the error is logged and the plan continues execution. For example, if a numeric value is incorrectly passed to a Date Compare function, Data Quality executes the plan, but the Rule Based Analyzer output appears in the output file as “Invalid Value.”

When conditional statements contain incorrect syntax, Data Quality produces an error message and the plan fails.

Scripting

The Scripting component provides greater flexibility than the Rule Based Analyzer to build customized rules and processes into a data quality plan.

Note: The Scripting component allows you to write scripts using Tool Command Language (TCL). As such, the component requires some knowledge of this language.

For a standard dataset and for standard rules, the Rule Based Analyzer is typically adequate. Informatica recommends the Scripting component only for rules of a complexity that the Rule Based Analyzer cannot handle.

Configuration

The Scripting configuration dialog box contains the following areas:

♦ Inputs

♦ Script

♦ Outputs

It does not have a Components pane and does not permit multiple instances to be defined for a single component.

♦ Inputs. Allows you to identify the data columns that constitute the input data for the component. These fields list the input fields available to the component. Click a field to access a menu and choose a column.

The columns you select are numbered in the Input Index fields.

♦ Script. Provides a workspace for writing the TCL script that can make use of the inputs defined above.

The Save and Load options allow you to save the script to a file and to load a pre-saved script from file. These options act on the TCL script written in the Script pane only — they do not save or load other settings in the dialog box.

♦ Outputs. Displays the output name for the generated data as it appears to other components. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

The Output Type field allows you to change the output data type. Two types are available: String and Float.


For more information about the range of functionality within the Scripting component, contact Informatica Global Customer Support.


C H A P T E R 7

Parsing Components

This chapter includes the following topics:

♦ Overview, 71

♦ Parser, 71

♦ Splitter, 72

♦ Token Parser, 73

♦ Profile Standardizer, 76

♦ Context Parser, 78

Overview

The parsing components allow you to extract relevant data from a field and separate extracted data into a standardized format.

Data Quality provides the following parsing components:

♦ Parser

♦ Splitter

♦ Token Parser

♦ Profile Standardizer

♦ Context Parser

Parser

Informatica partners use the Parser component to implement customized parsing plug-ins. Parsing plug-ins read specified input strings and create one or more new custom values from the words or characters in the string.

Developers implement this component using the Global Component SDK. For more information, see the Global Component SDK Guide.


Splitter

The Splitter component parses data values in a text field into new fields by comparing source data with one or more reference datasets. Each instance of the Splitter parses a single data column.

Configure the Splitter by:

♦ Selecting data input, that is, a column on the dataset already configured in the plan.

♦ Identifying another data column to use as a reference dataset.

♦ Optionally, defining output field variables or identifying a dictionary for use as a filter on parsed data.

You can use the Splitter with or without a dictionary. The method you choose depends on the composition of your dataset and the available dictionaries.

Parsing Data Without a Dictionary

You want to parse a column of names by gender and your dataset already contains a Gender column, so you do not need a dictionary. First, select the source data column, such as the First_name field and then select the Gender column for reference purposes.

Next, identify the variables you want the Splitter to match against the reference data. The variables should match the possible values in the reference field, in this case MALE and FEMALE.

The Splitter component creates output fields based on the defined variables. Each value in the First_name field identified as MALE in the reference data is written to a corresponding new MALE data field, and each source value defined as FEMALE is written to a new FEMALE field. By default, the Splitter also creates an Overflow field to capture any source data that cannot be identified by the reference column.
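
The following Python sketch illustrates the routing logic described above. It is a conceptual illustration rather than the Splitter's implementation, and the column and variable names are hypothetical:

def split_by_reference(source_values, reference_values, variables):
    # Route each source value to the output named for its reference value;
    # rows whose reference value matches no defined variable go to Overflow.
    outputs = {name: [] for name in variables}
    outputs["Overflow"] = []
    for src, ref in zip(source_values, reference_values):
        outputs.get(ref, outputs["Overflow"]).append(src)
    return outputs

# split_by_reference(["John", "Mary", "Pat"],
#                    ["MALE", "FEMALE", "UNKNOWN"],
#                    ["MALE", "FEMALE"])
# returns {"MALE": ["John"], "FEMALE": ["Mary"], "Overflow": ["Pat"]}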

Parsing Data with a Dictionary

You want to parse a column of account names based on their residence in the United States. Instead of adding variables for the names and possible abbreviations of every state, you can use a dictionary.

First, select a source data column, such as the Surname field, then select an appropriate address column, such as State or Zip, for reference purposes.

Next identify an appropriate dictionary, in this case, all valid U.S. zip codes. The entries in this dictionary are compared with the reference column data. By default, the Splitter creates an output field for source data recognized by the dictionary and an overflow field for values not recognized. In this way, the Splitter produces two columns, one each for U.S. and non-U.S. account names.

Note the following:

♦ You can use multiple dictionaries and multiple variables.

♦ Dictionaries and variables are not mutually exclusive. You can use either or both with an instance of the Splitter. Each has its own output column.

♦ The variables or the dictionaries you select are compared with the reference dataset, not the source dataset.

Configuration

The Splitter configuration dialog box contains two menus for identifying the input and reference data fields, and two panes that you can populate using context menus:

♦ Source Input menu. Use to identify the data column to be parsed.

♦ Reference Input menu. Use to identify the data column with which the defined variables or dictionaries will be compared.

♦ Lookup (Case Sensitive) option. Use if you want the Splitter to apply case sensitivity when comparing a dictionary with the reference data.


To add a dictionary or variable, right-click in the pane beneath the Lookup option and select Add Dictionary or Add Value from the context menu.

The Splitter creates an output column for each entry in the upper pane and lists them in the Outputs pane. Edit an output column name or overflow output field name by double-clicking it.

Token Parser

The Token Parser is designed to parse free-text fields that contain multiple tokens. It parses each token to a separate field. The component identifies each value in the field by data type and writes each value to a user-defined output field.

For example, a single free-text address field such as “3 Trebovir Rd, London, SW1” can be parsed to the following output fields:

House Number    Street Name    Address Suffix    City      Postcode
3               Trebovir       Road              London    SW1

The Token Parser searches an input field for the data types defined on the Outputs tab of the configuration dialog box. When it finds a type specified for the first defined output, it writes that data to the associated output field. It then searches the field for the type defined in the second output. When a specified data type is not found, the corresponding output is left blank.

The parsing operation passes through each field only once. The parsing operation does not reset to the start of the field when a data value is recognized.

The Token Parser uses the same set of generic data types as in the Token Labeller component:

♦ Word

♦ Code

♦ Number

It also allows you to define data types by dictionary.
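
The following Python sketch illustrates the generic token types and the single-pass assignment described in this section. It is a simplified illustration that ignores delimiters and dictionary lookups, not the component's implementation:

def token_type(token):
    # Generic types: purely alphabetical values are text, purely numerical
    # values are numbers, and mixed alphanumeric values are codes.
    if token.isalpha():
        return "text"
    if token.isdigit():
        return "number"
    return "code"

def parse_tokens(tokens, output_types):
    # For each output type, in order, scan forward through the field for the
    # next matching token; the scan never resets to the start of the field.
    # Tokens claimed by no output are written to the overflow column.
    outputs, used, position = [], set(), 0
    for wanted in output_types:
        found = ""
        for i in range(position, len(tokens)):
            if token_type(tokens[i]) == wanted:
                found, position = tokens[i], i + 1
                used.add(i)
                break
        outputs.append(found)
    overflow = [t for i, t in enumerate(tokens) if i not in used]
    return outputs, overflow

# parse_tokens("3 Trebovir Rd London SW1".split(),
#              ["number", "text", "text", "text", "code"])
# returns (["3", "Trebovir", "Rd", "London", "SW1"], [])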

Configuration

The Token Parser configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Dictionaries tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

To add an instance, right-click in this pane and select Add from the context menu. You can also remove an instance by selecting Delete from the context menu.


Inputs Tab

The Inputs tab lists the data fields available to the highlighted component instance. Select a field by highlighting it and clicking its check box. You can select a single field for each component instance.

Parameters Tab

The Parameters tab displays the following editable options:

♦ Delimiters. The Delimiters area displays a list of delimiting characters. Select the delimiters applicable to your source dataset.

♦ Reverse Enabled. Use to read data inputs associated with the highlighted instance from right to left, instead of the default direction of left to right. This option enables you to parse data based on the final values in a field, such as postcode.

♦ Overflow Reverse Enabled. When selected, overflow data from a reverse-enabled parsing operation is written to the Overflow output in reverse, right to left. Enabled when you use the Reverse Enabled option, this option is selected by default. If you clear this option, overflow output for the parsed data is written left to right.

♦ Dictionary Lookup (Case Sensitive). Applies to any dictionaries specified for the data on the Dictionaries tab. Use this option if the parsing operation should apply dictionaries to the input data in a case-sensitive manner. When this option is checked, the dictionary will only recognize tokens in the same case as the dictionary labels.

Note: This option does not enable or disable dictionary lookup. It only determines the case sensitivity of the lookup.

♦ Multiple Dictionary Outputs. Determines whether the component creates a single output column for the dictionary or dictionaries applied to the instance, or whether a separate output column is created for each dictionary. This option is selected by default.

Multiple Dictionary Operations

When you enable the Multiple Dictionary Outputs option, an output column is created for each dictionary applied to the instance. The input is parsed by the first selected dictionary, and the first match found is written to the dictionary output field.

If a match is found, the next dictionary is invoked, and this dictionary searches for a match within the remaining non-parsed tokens. It does not search the tokens already searched by the former dictionary. If no match is found, the dictionary output field is left blank and the process begins again by invoking the next dictionary. This process continues for all dictionaries applied to the instance.

When the Multiple Dictionary Outputs option is cleared, a single output field is created. All dictionaries are searched in the order in which they are listed on the Dictionaries tab, but only the first term identified is written to the output column. The remaining non-parsed terms are passed to the text, number, and code outputs, or alternatively to the overflow column.

Dictionaries Tab

The options on this tab allow you to apply a Data Quality dictionary to the input strings so that any input data that matches a dictionary entry will be returned as a dictionary output. You can configure each dictionary to write the input token unchanged to the dictionary output column or to standardize the input token to the dictionary version of the token.

To add a dictionary to the instance highlighted in the Components pane, right-click in the pane beneath the Dictionaries tab and select Add from the context menu. This opens the Dictionary Setup dialog box. Click the Select button in this dialog to browse to the required dictionary.

The Dictionary Setup dialog box contains a Dictionary Standardization option. Check this option to return the dictionary version of the token. Unchecked, this option returns the token as it appears in the input string.


Outputs Tab

The Outputs tab options define the output columns into which the input data values are parsed. Figure 7-1 shows the Outputs tab of the Token Parser:

The Token Parser can create up to five types of output column:

♦ Code. Any value that mixes alphabetical and numerical data. Right-click in the Add Code Outputs field to create a code output column.

♦ Number. Any purely numerical value identified in the input data. Right-click in the Add Number Outputs field to create a number output column.

♦ Text. Any purely alphabetical value identified in the input data. Right-click in the Add Text Outputs field to create a text output column.

♦ Dictionary. Lists the columns defined on the Dictionaries tab. You cannot add or delete dictionary outputs from the Outputs tab.

♦ Overflow. A single column to which any non-parsed data is written. This field is created by the component and cannot be deleted from the component.

The Token Parser creates its outputs as follows:

♦ First, the component applies any user-set dictionaries to the input data. Any tokens recognized by the dictionaries are written to the columns specified in the Dictionary Outputs field.

♦ Next, the component looks for output columns defined for code, number, and text tokens, in that order. If it finds such columns, it writes any recognized tokens to the respective columns.

♦ You can create multiple output columns for a Token type. For example, if your input data is composed of records containing three address fields, create three text outputs. If your input data contains a telephone number and a five-digit zip code, create two code outputs.

♦ The component attempts to populate the first output column of each token type and then moves down the columns listed for that type. If the component cannot find an appropriate column for a token, it writes that token to the overflow column.


Note: The parsing operation passes through each input record once only. The parsing operation does not reset to the start of the record when a data value is recognized.

Profile Standardizer

The Profile Standardizer uses the output data from a Token Labeller as input data in a parsing operation. The Profile Standardizer parses input data to a number of output fields based on a data structure that you define.

A Profile Standardizer parses one or more inputs from a single Token Labeller. To parse output from another Token Labeller, use another Profile Standardizer.

Configuration

The Profile Standardizer configuration dialog box enables you to define a multi-field data structure for the tokens recognized by the Token Labeller. Figure 7-2 displays the Profile Standardizer configuration dialog box:

Using the Profile Standardizer, you can create new data columns into which one or more tokens are parsed. You can create a rule for each combination of tokens, so that each underlying value is written to a new field.

For example, a Customer Account dataset includes a single Name field for customer names, including first and middle names, surnames, and initials. The Token Labeller recognizes the types of tokens present in the Name field data. The Profile Standardizer accepts the Token Labeller output and lists the various combinations of tokens in the Name field. The Profile Standardizer can create new columns for first names, middle names, and surnames.

Figure 7-2 shows a Profile Standardizer in mid-configuration. You do not have to create rules for every combination of tokens.

In Figure 7-2, the rule applied to line 3, word word, sends the first token to a new first name field and the second token to a surname field. Similarly, the combination word word word on line 5 corresponds to a customer first name, middle name, and surname, and the rule is defined accordingly. Depending on the dataset, there can be an element of trial and error in maximizing the output of the Profile Standardizer. The rules might require tuning to reach your target level of parsing quality.

When you define a rule for a token combination, its row changes appearance.

♦ Components pane. Lists the instances defined for the Profile Standardizer. When first opened, this pane lists a single instance. You can add multiple instances as long as they are linked to the same Token Labeller.

♦ Inputs pane. Lists the Token Labeller outputs available to the highlighted component instance. Select an input by highlighting it and clicking its check box. You can select a single input.

The Metadata and Profile menus let you identify the metadata associated with the Token Labeller output. A single Token Labeller can store multiple metadata and profile combinations. Selecting a new metadata-profile combination in the Profile Standardizer can provide a new range of input options.

Save any changes you have made in the component before changing the current metadata or profile.

When the input, metadata, and profile are selected for the current instance, the Profiles column is populated with the profiles created by the Token Labeller. You can now define the target columns for each set of tokens.

Right-click anywhere in the Profiles pane to add, insert, delete or rename columns from a context menu. When you add a column, it appears to the right of existing columns.

Applying Rules to Profiles

After you have created the new columns that you need, you can define the rules that determine how input data values are parsed to new fields.

You do not have to define rules for every token profile. Defining a small number of rules can often parse a large percentage of the input data. You can subsequently add or edit rules to reach your target levels of parsing quality.

As with other parsing components, the Profile Standardizer creates an Overflow column automatically for all data that is not parsed by the defined rules.

To apply rules to profiles:

1. Click a field in a user-defined column to open the Edit Profile Rule dialog box.

This displays the tokens available for insertion to that field, that is, the tokens in the Name input field for that record. Tokens are listed in order of their occurrence in the source field, from top to bottom.

2. Select a token to send all values corresponding to that token to the new field.

3. Define a rule for a field and click Apply.

The Edit Profile Rule dialog box automatically moves to the next field in the row and displays its token options.

Reusing Profile Data

Configured Profile Standardizer instances are saved with the metadata and profile from which the Profile Standardizer drew the input token information. The metadata and profile appear in menus in the dialog box. Any rules you save with a Profile Standardizer can be accessed by other instances of the component in the plan, or in any other plans that access the same metadata repository.

Changing or deleting the Token Labeller can affect the input to the Profile Standardizer, but does not affect the rules already created for a profile. Changing the inputs selected in the Inputs window of the Profile Standardizer does not affect the rules already saved in the component. These rules remain in the table for any other inputs selected in the component.

When a component is saved with a particular profile and rules, and a new profile is introduced and assigned parsing rules, the rules from the previously selected profile are appended to the end of the new table. The rules from the previous profile are displayed in a light grey font on a dark grey background.


Changing the Number of Displayed Profiles

The number of profiles displayed within the Profile Standardizer is limited by default to 500 rows. You can change the maximum number of rows by editing the config.xml file located in your Data Quality installation folder, by default: C:\Program Files\Informatica Data Quality\config.xml.

The value is set in the MetadataProfiles element:

<MetadataProfiles>500</MetadataProfiles>

Note: Restart Data Quality Workbench for the changes to take effect.

Context Parser

Like the Token Parser, the Context Parser is designed to parse free-text fields containing multiple tokens into multiple single-token fields. Context Parser operations are based on the values and the relative positions of the tokens.

The high-level steps in configuring the Context Parser are as follows:

1. Select an input data column for each instance.

2. Specify the delimiters to use when parsing input data.

3. Configure the output columns where individual tokens will be parsed:

♦ Determine the number of tokens you expect in the output data.

♦ Add an output field for each of these tokens.

♦ Define a token type for each output you add.

The output columns can contain one or more data values, which can be of the following types:

♦ Word

♦ Number

♦ Code

♦ Symbol

♦ Init

♦ Dictionary (listed in a specified dictionary)

By using a combination of positional hierarchy, generic token types, and dictionary-determined data, you can achieve highly-effective parsing results even in very “noisy” datasets.

Configuration

The Context Parser configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog box.


To add an instance, right-click in this pane and select Add from the context menu. You can also remove an instance by selecting Delete from the context menu.

Inputs Tab

The Inputs tab lists the data fields available to the highlighted component instance. Select a field by highlighting it and clicking its check box. You can select a single field for each component instance.

Parameters Tab

The Parameters tab displays the following editable options:

♦ Delimiters. The Delimiters area displays a list of delimiting characters. Select the delimiters applicable to your source dataset.

♦ Reverse Enabled. Use to read data inputs associated with the highlighted instance from right to left, instead of the default direction of left to right. This option enables you to parse data based on the final values in a field, such as postcode.

♦ Dictionary Lookup (Case Sensitive). Applies to any dictionaries specified for the data on the Outputs tab. Use this option if the parsing operation should apply dictionaries to the input data in a case-sensitive manner.

This option does not enable or disable dictionary lookup. It only determines the case sensitivity of the lookup.

Outputs Tab

This tab displays the user-defined output columns for the highlighted component instance. With no outputs defined, this area is empty. Right-click below the tab and select Add Output to add an output column.

Each output is defined by two fields. The output name appears in an editable upper field. The lower field lists the types of data values to be parsed to the field. You can set the output field to accept any of six data value types, and you can organize these types in any order.

The input data is parsed according to the order in which the outputs are listed on this tab, and within each output column, by the order in which the data types are listed. You can change the order of the output columns by right-clicking an output name and selecting Move Up or Move Down from the context menu.

Note the following:

♦ The Context Parser performs a single sweep of each input field. As a result, the Context Parser works best for structured data. For less structured data, the Profile Standardizer may be more appropriate.

For example, you add an output of type NUMBER, and below it add an output of type WORD. When parsing “12 Main Street,” the Context Parser locates “12,” then “Main.” If you reverse the output types, the Context Parser locates “Main” but skips the number “12.”

♦ You can configure an output to accept more than one token by adding multiple token types to the output or by selecting the Toggle Merge option.

Right-click a data type and select Toggle Merge from the context menu to place multiple values of that type in a single output field if they occur consecutively within the input field. For example, right-clicking a WORD data type and selecting Toggle Merge returns consecutive words, starting with the first word in the field.

♦ An overflow output is created automatically for any input values that have not been handled by the component.


C H A P T E R 8

Key Field Generator Components

This chapter includes the following topics:

♦ Overview, 81

♦ Normalization, 81

♦ Soundex, 81

♦ Nysiis, 83

Overview

Key Field Generator components group data in preparation for the matching process. With these components, you can create the keys by which the data is grouped. When you group data, you enhance the efficiency of the matching process.

Data Quality provides the following key field generator components:

♦ Normalization

♦ Soundex

♦ Nysiis

Normalization

Informatica partners use the normalization component to implement customized normalization plug-ins. Normalization plug-ins read input values and write standardized versions of those values.

Developers implement this component using the Global Component SDK. For more information, see the Global Component SDK Guide.

Soundex

The Soundex component recognizes phonetic matches between alphabetic strings. It analyzes the phonetic components of a word and assigns a value to the string based on the phonetic characteristics of the initial characters in the string. Because it can identify matches between words based on an analysis of how the words sound rather than how they are spelled, Soundex allows for spelling errors at the point of data entry.

Use Soundex to generate a phonetic key for grouping similar records before matching. Soundex can be applied to any free-text field.

For every field analyzed, Soundex generates a code beginning with the first letter in the word and followed by a series of numbers representing successive consonants. Generally, similar-sounding consonants are assigned the same code. The Soundex depth, the number of alphanumeric characters returned, is set to 3 by default. This means the Soundex code consists of the first letter in the string and two numbers representing the next two distinct-sounding consonants. You can change the Soundex depth.

Configuration

The Soundex configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

You can add multiple instances of the component. To add an instance, right-click in this pane and select Add from the context menu. You can remove an instance by selecting Delete from the context menu.

Cut and Copy options are also available on the context menu. These options allow you to paste instances within the component and from one Soundex component to another.

Inputs Tab

The Inputs tab lists the data fields available to the highlighted component instance. Select a field by highlighting it and clicking its check box. You can select multiple inputs for each instance in the Components pane, but all inputs share a common Soundex depth.

Parameters Tab

The Parameters tab allows you to set the number of alphanumeric characters Soundex returns, called the depth. The default depth is 3, with an alphabetic character representing the first letter in the word, and two numbers representing the next two letters.

Increasing the depth means increasing the number of digits generated to represent additional letters in the word. The depth setting applies to the highlighted instance in the upper pane.

The following table illustrates different Soundex depth codes:

Surname Value    Soundex Value - Depth 3    Soundex Value - Depth 4
Broderick        B63                        B636
Smith            S53                        S530
Ford             F63                        F630
Burton           B63                        B635


Outputs Tab

This tab lists the names of the data outputs for the highlighted component instance as they appear in other components in the plan. Double-click a name to edit it. To save your edits, press Enter before removing focus from the field.

Deriving Soundex Depth Codes

The Soundex depth code consists of the first letter of the string in a given field, followed by a series of numbers that represent some or all of the remaining letters in the string. The component skips all vowels and similar letters:

a, e, i, o, u, h, w, y

It adds numbers for other letters as shown in the following table:

Table 8-1. Soundex Depth Codes

Code    Letters
1       B, F, P, V
2       C, G, J, K, Q, S, X, Z
3       D, T
4       L
5       M, N
6       R

The following general rules apply:

♦ If two or more consecutive letters have the same code number, they are coded together, allowing Soundex to skip to the next distinct consonant sound. This rule applies in all cases, including the first and second letters of the word.

For example:

Gutierrez is coded G362: G, 3 = T, 6 = both Rs, 2 = Z.
Pfister is coded P236: P (F skipped because it has the same code as P), 2 = S, 3 = T, 6 = R.

♦ If there are insufficient letters for the Soundex depth, the remaining numbers in the code appear as zero. For example, if the depth is set to 5 and the word in question has three letters, Soundex completes the code with zeros.

♦ Letters are counted as consecutive when they are separated by a vowel or consonant skipped by Soundex.

If a vowel separates two consonants that have the same Soundex code, the consonant to the right of the vowel is coded.

For example:

Tymczak is coded as T522: T, 5 = M, 2 = C, Z skipped, 2 = K. Because “A” separates Z and K, the K is coded.

If “H” or “W” separates two consonants that have the same Soundex code, the consonant to the left of the “H” or “W” is coded and the consonant to the right is ignored.

For example:

Ashcraft is coded A261 (A, 2 = S, C ignored, 6 = R, 1 = F). It is not coded A226.
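
The rules above can be expressed as a short Python sketch. This illustrates the coding scheme described in this section; it is not the Soundex component's implementation:

def soundex(word, depth=3):
    # depth is the total number of characters returned (the first letter plus
    # digits), matching the component's default depth of 3.
    digits = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for letter in letters:
            digits[letter] = digit

    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""

    code = word[0]
    prev = digits.get(word[0], "")  # the adjacency rule covers the first letter too
    for ch in word[1:]:
        digit = digits.get(ch)
        if digit is None:
            # Vowels and Y break the adjacency rule; H and W do not.
            if ch not in "HW":
                prev = ""
            continue
        if digit != prev:
            code += digit
        prev = digit

    return code[:depth].ljust(depth, "0")  # pad short words with zeros

# soundex("Gutierrez", 4) returns "G362"
# soundex("Ashcraft", 4) returns "A261"
# soundex("Smith", 4) returns "S530"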

Nysiis

The Nysiis component converts the values of an input field to their phonetic equivalent.


Unlike the Soundex component, Nysiis does not create a code to represent the string. Instead, it reconstitutes the spelling of the string based on its phonetic characteristics. While Soundex focuses on similarities in spelling at the start of matched strings, Nysiis looks for overall similarities between strings.

Nysiis uses a phonetic encoding algorithm created for the New York State Identification and Intelligence System.

Configuration

The Nysiis configuration dialog box consists of the following areas:

♦ Inputs tab

♦ Outputs tab

Inputs Tab

The Inputs tab lists the input columns available to the component. To select an input, check its check box. You can access a Select All option in the context menu by right-clicking in the dialog box. You can create a single instance of Nysiis for each component.

Outputs Tab

This tab lists the names of the data outputs as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

The following table shows examples of Name-to-Nysiis value conversions:

Surname Value    Nysiis Value
Adams            Adan
Adames           Adan
Adems            Adan
Barnes           Barn
Barns            Barn
Bearns           Barn


C H A P T E R 9

Matching Components

This chapter includes the following topics:

♦ Overview, 85

♦ Identity Match, 86

♦ Similarity, 88

♦ Edit Distance, 88

♦ Jaro Distance, 89

♦ Hamming Distance, 90

♦ Bigram, 91

♦ Mixed Field Matcher, 92

♦ Weight Based Analyzer, 94

Overview

Data Quality provides matching components that are explicitly designed to determine the degrees of similarity between given data values. Each matching component applies a different algorithm to its data input, and each is suited to a different type of data quality problem:

♦ Identity Match. Performs matching operations on input data at an identity level.

♦ Similarity. Implements custom plug-ins to calculate the type and degree of similarity between two strings.

♦ Edit Distance. Calculates the edit distance between two strings.

♦ Jaro Distance. Calculates the difference between two strings using a variation of the Jaro-Winkler algorithm.

♦ Hamming Distance. Calculates the number of positions in which characters differ between two strings.

♦ Bigram. Calculates the occurrence of matching pairs between two strings.

♦ Mixed Field Matcher. Compares multiple fields between two strings based on selected match calculations.

♦ Weight Based Analyzer. Calculates an aggregate match score based on the output scores from other matching components using user-defined weights for each score.

Note: Distance components are case-sensitive.

Matching components calculate numerical scores representing the similarity or dissimilarity between pairs of data values, generating a match score between 0 and 1. The higher the score, the greater the degree of similarity between the two strings based on the match component criteria.


For information about the formulas used to calculate match scores, see “Matching Formulas” on page 137.
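
To illustrate the idea of a score between 0 and 1, the following Python sketch computes an edit-distance-based similarity. It is a generic illustration; the exact formulas the components use are documented in “Matching Formulas” on page 137:

def edit_distance(a, b):
    # Standard Levenshtein distance: the minimum number of insertions,
    # deletions, and substitutions needed to turn string a into string b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def match_score(a, b):
    # Scale the distance to a score between 0 and 1; higher means more similar.
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

# match_score("Informatica", "Informatika") returns about 0.91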

Identity Match

The Identity Match component performs matching operations on input data at an identity level. An identity is a set of fields providing name and address information for a person or organization. The component treats one or more input fields as a defined identity and performs matching analysis between the identities it locates in the input data.

The component analyzes records regardless of the character sets in which they are stored. Use this component to identify similar or duplicate identities across datasets that may use several different language locales or character encodings.

Informatica uses population files to describe key-building algorithms, search strategies, and matching schemes that are customized for specific countries and languages. These customized settings improve match accuracy for data sourced from those countries and languages.

There are three main steps to configuring the Identity Match component:

♦ Select a population in the upper menu in the configuration dialog box.

♦ Select the type of identity to analyze in the lower menu of this dialog box. Table 9-1 lists the type of identity you can analyze. The fields available will depend on the population selected.

♦ Select the data fields you want to analyze and apply them to the template fields for your chosen identity type. The fields available will depend on the population selected.

Configuration

The Identity Match configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options below this pane and on the Inputs, Parameters, and Outputs tabs.

Below the Components pane are two drop-down menus:

♦ Use the upper drop-down menu to select the population that you will apply to the data. Select the Identity Match Country option for a single locale or region, or select the Identity Match - Multiple Populations option.

♦ Use the lower menu to specify the type of identity data that the component will match. For example, the Contact option relates to the names and addresses of members of organizations. The option you select here determines the fields that are displayed on the Inputs tab. Each population selected in the upper menu has its own set of information types. Table 9-1 lists the types of identity you can analyze.

Table 9-1. Identity Type

Options         Description
Wide_Contact    Matches person name at organization name
Contact         Matches person name at organization name and address
Individual      Matches person with either name id or birth date
Resident        Matches person name at address
Address         Matches address
Organization    Matches organization name
Division        Matches organization name at address
Household       Matches family name at address
Person_Name     Matches person name
Fields          For general use for any one or combination of fields
Corp_Entity     Matches company name
Family          Matches family name at either address or phone number
Wide_Household  Matches family name or phone number at address


Inputs Tab

The Inputs tab allows you to configure the data input fields. The Input Fields Mapping Area contains two columns:

♦ The left-hand column lists the field names. The names displayed depend on the population selected in the Components pane. Mandatory input fields are highlighted in the column.

♦ The right-hand column lists the available inputs for the selected input field. Select an option from each drop-down list to map an available input to the selected input field.

Note: If you have selected the Identity Match - Multiple Populations option in the upper drop-down menu beneath the Components pane, the Population field name is displayed and highlighted as mandatory in the left-hand column. Select a population field on the right-hand column.

Note: For all field names (except for the Population field name) you must select values for the field name in pairs. For example, when using field names PERSON_NAME1 and PERSON_NAME2 you must select values for both field names in the right-hand column. This enables the component to match input fields against each other.

Parameters Tab

The Parameters tab contains the following options:

♦ Default Population. Sets the default population if the multiple populations option has been selected in the Components pane.

When you opt to match data from several populations, the Identity Match component looks to the specified population first, and then to the other populations configured, when determining what population to apply to the data.

♦ Match Level. Sets the match level to one of the following:

− Typical. Accepts reasonable matches. This is the default selection if no other match level is specified. The Accept Limit is 89 and the Reject Limit is 70.

− Conservative. Accepts only close matches. The Accept Limit is 90 and the Reject Limit is 80.

− Loose. Accepts matches with a high degree of variation. The Accept Limit is 75 and the Reject Limit is 50.

♦ Stop on Error. Check this option if you want the plan to stop running when the plan cannot locate up-to-date population data. When this option is checked, the plan will stop running if it finds that the population data is absent. When this option is unchecked, the plan will run as normal and write a status code to the output column.



♦ Advanced Matching. The Overriding Match Control field opens a dialog box in which you can enter a query that overrides the population settings. The query syntax specifies the Identity Match options to be used.

Note: For more information on the query syntax, refer to the Informatica Identity Systems Naming Server documentation.

Outputs tab

This tab lists the possible output fields for the data associated with the instance highlighted in the Components pane. The tab shows two output fields:

♦ Identity Match Score. The score can range between zero (no similarity) and 1 (perfect match) and is correct to two decimal places.

♦ Identity Match Decision. Accept, Reject, Undecided, or Processed. The decisions returned are based on a combination of the Match Score and the Match Level specified on the Parameters tab (Typical, Conservative, or Loose).

Double-click a field name to render it editable. To save your edits, press Enter before removing focus from the field.
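The Identity Match Decision is derived from the Identity Match Score and the Accept and Reject limits of the selected Match Level. As a rough illustration only, the following sketch (an assumption about the mechanics, not the engine's actual implementation) applies the documented limits to a score rescaled to 0-100:

```python
# Hedged sketch: maps an identity match score to a decision using the
# Accept/Reject limits documented for each match level. The engine's real
# scale, tie-breaking, and "Processed" status handling may differ.
MATCH_LEVELS = {
    "Typical":      {"accept": 89, "reject": 70},
    "Conservative": {"accept": 90, "reject": 80},
    "Loose":        {"accept": 75, "reject": 50},
}

def match_decision(score, level="Typical"):
    """Return Accept, Undecided, or Reject for a 0-1 identity match score."""
    limits = MATCH_LEVELS[level]
    scaled = score * 100                 # compare on the same scale as the limits
    if scaled >= limits["accept"]:
        return "Accept"
    if scaled < limits["reject"]:
        return "Reject"
    return "Undecided"

print(match_decision(0.92))                  # Accept under the Typical level
print(match_decision(0.75, "Conservative"))  # Reject under the Conservative level
```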

Similarity

Informatica partners use the Similarity component to implement customized similarity plug-ins. Similarity plug-ins read a pair of input values and compute the type and degree of identity between the two values, expressing this identity as a numerical value.

Developers implement this component using the Global Component SDK. For more information, see the Global Component SDK Guide.

Edit Distance

The Edit Distance component derives a match score for two data values by calculating the minimum “cost” of transforming one string to another by inserting, deleting, or replacing characters.

The result of this calculation is the edit distance, which the component converts into a match score. The higher the match score, the greater the similarity between the two strings.

This component is ideal for matching fields containing a single word or a short text string such as a name or short address field. You can use it to compare corresponding fields across two records or to compare different fields within the same record.

For example, an edit distance calculation is performed on two street names, Collage St and College St. The component calculates the cost of transforming the "a" in Collage to an "e" and inserting a period after "St."
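The following sketch shows one common way to derive a 0-1 match score from an edit distance calculation. It is illustrative only; the component's actual formula is documented in "Matching Formulas" on page 137.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete a character
                             dist[i][j - 1] + 1,         # insert a character
                             dist[i - 1][j - 1] + cost)  # replace a character
    return dist[rows - 1][cols - 1]

def edit_match_score(a, b):
    """Normalize the distance to a 0-1 score, where 1 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

print(edit_distance("Collage St", "College St."))                # 2: a->e plus the period
print(round(edit_match_score("Collage St", "College St."), 2))   # 0.82
```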

Configuration

The Edit Distance configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab



♦ Parameters tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

You can add multiple instances of the component. To add an instance, right-click in this pane and select Add from the context menu. You can remove an instance by selecting Delete from the context menu.

Cut and Copy options are also available on the context menu. These options allow you to paste instances within the component and from one Edit Distance component to another.

Inputs Tab

The Inputs tab lists the data fields available to the highlighted component instance. Select a field by highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.

Parameters Tab

The Parameters tab allows you to set the output score assigned to a matched pair when one or both fields are empty or contain null values.

The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null Match Value setting applies when both fields are null. Possible values range between 0 and 1.

Outputs Tab

This tab lists the names of the configured output as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

Jaro Distance

Like the Edit Distance component, the Jaro Distance component calculates the general similarity between two data values. However, the Jaro Distance component reduces the match score when a pair of values does not share a common prefix.

Like other Data Quality matching components, the higher the match score, the greater the similarity between the strings.

The component uses a variation of the Jaro-Winkler algorithm. The algorithm penalizes the match if the first four characters in each string are not identical. The default penalty is 0.2.
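The sketch below outlines the general idea. It assumes the standard Jaro similarity with a flat penalty subtracted when the four-character prefixes differ; the component's actual variation may compute and apply the penalty differently (see "Matching Formulas" on page 137).

```python
def jaro(a, b):
    """Standard Jaro similarity between two strings, from 0 to 1."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(len(a), len(b)) // 2 - 1
    a_flags, b_flags = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):                       # count matching characters
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_flags[j] and b[j] == ch:
                a_flags[i] = b_flags[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, j = 0, 0                         # count out-of-order matches
    for i in range(len(a)):
        if a_flags[i]:
            while not b_flags[j]:
                j += 1
            if a[i] != b[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    m = matches
    return (m / len(a) + m / len(b) + (m - transpositions) / m) / 3

def jaro_match_score(a, b, penalty=0.2):
    """Subtract a flat penalty when the first four characters are not identical."""
    score = jaro(a, b)
    if a[:4] != b[:4]:
        score = max(0.0, score - penalty)
    return score

print(round(jaro_match_score("Jones", "Johnson"), 2))   # prefixes differ, so the penalty applies
```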

Configuration

The Jaro Distance configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Outputs tab


Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

You can add multiple instances of the component. To add an instance, right-click in this pane and select Add from the context menu. You can remove an instance by selecting Delete from the context menu.

Cut and Copy options are also available on the context menu. These options allow you to paste instances within the component and from one Jaro Distance component to another.

Inputs Tab

The Inputs tab lists the data fields available to the highlighted component instance. Select a field by highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.

Parameters Tab

The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain null values.

The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null Match Value setting applies when both fields are null. Possible values range between 0 and 1.

The Penalty field determines the value subtracted from the match score if the first four characters of both strings are not identical. The default setting is 0.2.

The Case Sensitive check box, when checked, specifies that the matching calculation will consider the case of the characters when determining the identity between them. This box is unchecked by default.

Outputs Tab

This tab lists the names of the configured output as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

Hamming Distance

The Hamming Distance component derives a match score by calculating the number of positions in which characters differ for a pair of data strings. Use the Hamming Distance component when the position of the data characters is a critical factor, as in numeric or code fields such as telephone numbers, zip codes, dates, and product codes.

By default, the Hamming Distance component reads data from left to right. You can reverse this setting.
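A minimal sketch of the underlying idea follows. It assumes that values of different lengths are compared position by position up to the longer length, with the missing positions counted as differences, and that the reverse option simply reads both values from right to left; the component's exact handling of unequal lengths may differ.

```python
def hamming_match_score(a, b, reverse=False):
    """Compare two values position by position and return a 0-1 match score.
    With reverse=True, both values are read from right to left."""
    if not a and not b:
        return 1.0
    if reverse:
        a, b = a[::-1], b[::-1]
    length = max(len(a), len(b))
    a, b = a.ljust(length), b.ljust(length)   # pad the shorter value; padding counts as differences
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return 1.0 - differing / length

print(hamming_match_score("94025", "94024"))   # 0.8: the values differ in one of five positions
```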

Configuration

The Hamming Distance configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Outputs tab


Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

You can add multiple instances of the component. To add an instance, right-click in this pane and select Add from the context menu. You can remove an instance by selecting Delete from the context menu.

Cut and Copy options are also available on the context menu. These options allow you to paste instances within the component and from one Hamming Distance component to another.

Inputs Tab

The Inputs tab lists the data fields available to the highlighted component instance. Select a field by highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.

Parameters Tab

The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain null values.

The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null Match Value setting applies when both fields are null. Possible values range between 0 and 1.

This tab also displays the Reverse Hamming option. Use this option to configure the Hamming Distance component to read data from right to left instead of the default, left to right.

Outputs Tab

This tab lists the names of the configured output as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

Bigram

The Bigram component matches data based on the occurrence of consecutive characters in both data strings in a matching pair, looking for pairs of consecutive characters that are common to both strings. The greater the number of common identical pairs between the strings, the higher the match score.

This component is useful in the comparison of long text strings, such as free format address lines or lines of user comments.

For example, consider the following two names analyzed by the Bigram component: Damien and Darren.

The bigram pairs for the two inputs are as follows:

Damien: Da, am, mi, ie, en
Darren: Da, ar, rr, re, en

There are ten pairs in this example, yielding four matches, or two matched pairs. Therefore, the Bigram Distance between these strings is 0.4.
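A minimal sketch of a bigram comparison in this spirit follows. It reproduces the Damien/Darren example above (two matched pairs counted in both strings, out of ten pairs in total, giving 0.4); the component's exact treatment of repeated pairs and character case may differ.

```python
def bigrams(value):
    """Return the list of consecutive character pairs in a string."""
    return [value[i:i + 2] for i in range(len(value) - 1)]

def bigram_match_score(a, b):
    """Twice the number of common pairs divided by the total number of pairs."""
    pairs_a, pairs_b = bigrams(a), bigrams(b)
    if not pairs_a and not pairs_b:
        return 1.0
    remaining = list(pairs_b)
    common = 0
    for pair in pairs_a:                 # count each shared pair at most once per occurrence
        if pair in remaining:
            remaining.remove(pair)
            common += 1
    return 2.0 * common / (len(pairs_a) + len(pairs_b))

print(bigrams("Damien"))                        # ['Da', 'am', 'mi', 'ie', 'en']
print(bigram_match_score("Damien", "Darren"))   # 0.4
```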

Configuration

The Bigram configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab



♦ Parameters tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

You can add multiple instances of the component. To add an instance, right-click in this pane and select Add from the context menu. You can remove an instance by selecting Delete from the context menu.

Cut and Copy options are also available on the context menu. These options allow you to paste instances within the component and from one Bigram component to another.

Inputs Tab

The Inputs tab lists the data fields available to the highlighted component instance. Select a field by highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.

Parameters Tab

The Parameters tab allows you to define the output score assigned when one or both fields are empty or contain null values.

The Single Null Match Value setting applies when one field in the pair of matched values is null. The Both Null Match Value setting applies when both fields are null. Possible values range between 0 and 1.

Outputs Tab

This tab lists the names of the configured output as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

Mixed Field Matcher

While the distance matching components compare one pair of data values at a time, the Mixed Field Matcher compares multiple fields using different match calculations.

The Mixed Field Matcher component identifies matches in a dataset where data values of the same or similar types appear across multiple fields, such as freeform address fields where address elements like the apartment number, city, or zip code can exist in different fields for different records.

The component provides several mechanisms for fine-tuning the match score computation, so you can give different priorities to matches or near-matches of different types and levels of approximation.

To configure this component, select two groups of data fields to be matched and identify the matching algorithm to apply to the data. You can also activate and tune priority levels for incorrect or approximate matches. However, Informatica recommends using the default settings for these parameters.

Note: Matching operations in this component can incur a significant performance overhead and may take longer to execute than operations in other matching components.

Configuration

The Mixed Field Matcher configuration dialog box contains the following areas:

♦ Inputs tab


♦ Parameters tab

♦ Outputs tab

Inputs Tab

The Inputs tab allows you to view available data fields and select the sets of input fields to be compared. To compare data, assign fields to Input Group A and Input Group B.

Note: Groups A and B must contain the same number of fields.

The Inputs pane lists the data fields available to the component. To add a data field to either input group, right-click it and select Add to Group A or Add to Group B from the context menu. The data fields you select display in the input group panes.

To remove a field from either pane, right-click it and select the Remove context menu option.

Use Ctrl-A to select all fields in these panes. Select multiple fields using Shift-click or Ctrl-click.

Parameters Tab

The Parameters tab options allow you to fine-tune the component matching operations. The tab organizes its parameters in three areas:

♦ General. This area contains the following options:

− Relative Position Factor. When the Mixed Field Matcher compares two fields from different record sets, the relative position within each record of each field affects the strength of the match. For example, when the Mixed Field Matcher matches a pair of fields in two records, it considers the match stronger when the two fields are in the same column. If the same two fields appear in different columns, it considers them a relatively inferior match.

You can set Relative Position Factor to Off, Low, Medium, and High. Medium is the default.

− Matching Order Factor. This setting is concerned with the relative order of the best matches between the input record sets. For example, when matching two fields in the record sets representing Firstname and Surname, the Mixed Field Matcher matches John Smith with Joan Smith better than with Smith Joan even though the individual fields match with the same score.

You can set Matching Order Factor to Off, Low, Medium, and High. Medium is the default.

− Empty Input Fields Factor. This setting calculates the number of empty fields in a record as a proportion of the total number of input fields. A high proportion of empty fields lowers the match score for fields in the record.

You can set Empty Input Fields Factor to Off, Low, Medium, and High. Medium is the default.

− Different Input Sizes Factor. This property compares the numbers of empty or null fields found in a pair of records. When two records have different numbers of empty or null fields, this difference is incorporated into the final matching score.

You can set Different Input Sizes Factor to Off, Low, Medium, and High. Medium is the default.

♦ Field Match. This area contains the following options:

− Match Method. This menu identifies the overall key for the matching operations. The default setting is LCS (Longest Common Subsequence). This setting considers the length of any common character strings in a pair of input fields and adds a factor based on the longest such string to the final score.

The default setting does not require input from another matching component in the plan. The other settings in this menu accept scores from other matching components. A minimal sketch of a longest-common-subsequence calculation appears after this list.

− Single Null Match Value. This setting applies if one of the two compared fields is empty. The default setting is 0.5.

− Both Null Match Value. This setting applies if both fields are empty. The default setting is 0.5.

♦ Advanced Area. In most situations there is no need to change the advanced settings for this component. For more information about these settings, consult Informatica Global Customer Support
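The following sketch shows a standard longest-common-subsequence length calculation of the kind the default Match Method refers to. How the component weights this length into the final match score is internal to the product, so the normalization shown here is only an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def lcs_factor(a, b):
    """One possible normalization of the LCS length into a 0-1 factor."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

print(lcs_length("Main St Apt 4", "Apt 4 Main St"))            # characters shared in order
print(round(lcs_factor("Main St Apt 4", "Apt 4 Main St"), 2))
```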


Outputs Tab

This tab lists the names of the configured output as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.

Weight Based Analyzer

The Weight Based Analyzer takes the results from two or more matching operations and calculates a single match score. The component accepts data from any matching component and allows you to assign weights to their match scores so the overall score for a field pair can reflect the priorities of the data.

You can define more than one instance in the Weight Based Analyzer. This allows you to configure each instance with different combinations of input fields and different weights as required.

You can use the Weight Based Analyzer to calculate overall matching scores for the plan. For effective matching, assign higher weightings to the more important fields.
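For illustration, a weighted combination of this kind might look like the following sketch. The field names and weights are hypothetical, and the component's actual aggregation formula is described in "Matching Formulas" on page 137.

```python
def weighted_match_score(scores, weights):
    """Combine individual 0-1 match scores into one overall score using
    user-defined weights. Higher weights give a score more influence."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[name] * weights[name] for name in scores) / total_weight

# Hypothetical scores produced by other matching components in the plan.
scores = {"surname_edit_distance": 0.92, "zip_hamming": 1.0, "address_bigram": 0.65}

# Give the surname comparison the most influence on the overall score.
weights = {"surname_edit_distance": 3.0, "zip_hamming": 2.0, "address_bigram": 1.0}

print(round(weighted_match_score(scores, weights), 2))   # 0.90
```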

Configuration

The Weight Based Analyzer configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single, unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

You can add multiple instances of the component. To add an instance, right-click in this pane and select Add from the context menu. You can remove an instance by selecting Delete from the context menu.

Inputs Tab

The Inputs tab lists the data fields available to the highlighted component instance. Select a field by highlighting it and clicking its check box. You must select two fields for each instance in the Components pane.

You must select at least two matching components on this tab.

Parameters Tab

This tab displays the matching components selected on the Inputs tab. Each matching component has a text field in which you can edit the weight defined for it. The higher the value in a text field, the greater the weight given to that component's score in the overall match calculation.

Outputs Tab

This tab lists the names of the configured output as they appear in other components in the plan. Double-click a name to render it editable. To save your edits, press Enter before removing focus from the field.


C H A P T E R 1 0

Address Validation Components

This chapter includes the following topics:

♦ Overview, 95

♦ Global AV, 96

Overview

Data Quality installs with address validation engines that process address data within a plan while the Data Quality engine processes other aspects of the plan. It also accepts address validation engines developed as plug-ins in accordance with the requirements of Data Quality Global Component SDK. Data Quality installs a single address validation component to handle these validation engines, called the Global AV. It also supports plans that contain deprecated address validation components from earlier versions of Data Quality.

Note: The Global AV matches input address data against reference datasets of postal addresses. Before you can use the Global AV, you must install reference data for the countries you are interested in. Data Quality does not install these datasets by default. You can purchase reference datasets for the default-installed validation engines from Informatica.

The Global AV and the installed validation engines deliver the following functionality:

♦ They validate the accuracy and deliverability of addresses according to the best reference data available for the country in question. For some countries, the reference data provides complete address information, down to premise level, and the engines can also enrich the address with new information, for example providing a nine-digit zip code in place of a five-digit zip. For other countries, the reference data provides last-line address information only, that is, information on city, province, or post code (information commonly found on the “last line” of the envelope).

♦ Where possible, they correct errors in addresses and complete partial address records. An address engine may find a match for an input address in its reference dataset that is more complete or formally correct than the input address. The component can return the reference address as an enhanced version of the input address.

♦ They add postally-relevant information to the address that may not appear in the data source or “on the envelope.” For example, they can report on whether an address is a physical location or a commercial mailbox location.

♦ They provide detailed status reports on the validity of each input address, describing its deliverable status and the nature of any errors or ambiguities it contains.

♦ In addition to returning individual fields that contain postal address and other value-added information, they can provide output addresses in an envelope-ready format.

The Global AV provides the user interface to all address validation engines, including engines that users add to Data Quality through the Global Component SDK. Data Quality no longer installs a separate operational component for each installed address validation engine.


This installation of Data Quality supports plans that contain address validation components installed with earlier product versions. The supported components are the Address Validator, the International AV, and the North America AV. You cannot create new instances of these components.

Installing Validation Components and Reference Data

Data Quality installs three address validation engines: Melissa Data, QAS, and Address Doctor. The Data Quality Content Installer installs the reference datasets for these engines. You purchase and download address reference datasets on a country-by-country basis from Informatica.

You can also use the Content Installer to install updates to these validation engines. For more information, consult the Informatica Data Quality Installation Guide.

Note: Data Quality also permits approved third parties to add address validation engines to the Data Quality system. These engines and their functionality must meet the requirements of the Data Quality Global Component SDK. The Global AV component acts as a shell for all address validation engines.

Global AV

The Global AV component provides access to address data functionality and processing capabilities in Data Quality. It provides a means of validating addresses from anywhere in the world through a single component.

The Global AV compares your input data records to reference databases of postally valid address information to quantify, verify, and enhance the quality and deliverability of your address records. It provides access to all address validation engines installed with or linked to Data Quality.

Configuration

The Global AV configuration dialog box contains the following areas:

♦ Components pane

♦ Inputs tab

♦ Parameters tab

♦ Outputs tab

Components Pane

The Components pane shows the instances of the component available to the plan. When first opened, this pane lists a single unconfigured instance. Configure this instance using the options on the tabs in the dialog box.

Inputs Tab

The Inputs tab lists all available data columns. Select a column to add it to the instance highlighted in the Components pane.

You can select multiple address columns for the component instance. In general, the more columns you provide, the greater the opportunity for the Global AV to locate the correct address in its reference data. However, incorrect input data does not enhance the matching operation.

Parameters Tab

The Parameters tab options allow you to perform the following operations:

♦ Set the principal country database to use when validating input data.


♦ Verify or change the address structure for the input address strings.

♦ Add CASS/DPV or Geocoding information to the outputs (country data permitting).

The options displayed on this tab change according to the country database option (single or multiple) that you select:

♦ Validating data from one country. Check the Select Single Country option, and then select the required country from the Select Country menu.

♦ Validating data from several countries. Check the Select Multiple Countries option, and then select a country from the Select Default Country menu.

The process for validating addresses from several countries works as follows:

− The Global AV first looks for a populated country code field in the address. This must be a three-letter ISO country code.

− If it finds a country code for a country on its menu, the component sends the address to the database for that country.

− If it does not find a country code, or the code does not correspond to an installed country database, the component applies the address to the default country database.

− You do not have to set a country for the Select Multiple Countries option. If you select the NONE option in the country database menu, the component will search all input addresses for a country code and attempt to validate the addresses accordingly. If it does not find a country code for the address, the component will not perform a validation check for that address.

Note: When you opt to validate data from several countries, the Global AV looks to the country code first, and then to the database selected, when determining what country database to apply to the data.

Note: Do not select a Single Country database in the Global AV unless the input data relates exclusively to the country you specify.
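The following sketch illustrates that routing behavior. The function, the installed-database list, and the record layout are hypothetical; the real engine selection logic is internal to the Global AV.

```python
# Hypothetical illustration of multi-country routing in the Global AV.
INSTALLED_DATABASES = {"USA", "GBR", "FRA"}   # countries with reference data installed

def select_country_database(record, default_country=None):
    """Pick the country database for one address record. The Country field is
    expected to hold a three-letter ISO code when it is populated."""
    code = (record.get("Country") or "").strip().upper()
    if code in INSTALLED_DATABASES:
        return code                  # route the address to the database for that country
    if default_country:
        return default_country       # otherwise fall back to the default country database
    return None                      # NONE selected and no usable code: skip validation

print(select_country_database({"Addressline1": "100 Main St", "Country": "USA"}))       # USA
print(select_country_database({"Addressline1": "12 Rue Haute", "Country": ""}, "FRA"))  # FRA
print(select_country_database({"Addressline1": "Unit 5"}))                              # None
```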

The Services Required area contains options relating to the enrichment of address information with Geocoding and DPV information and to the handling of the plan in cases of critical reference data errors.

♦ Geocoding. Check this option to return latitude and longitude coordinates for each input address. This option is available for the United States, United Kingdom, and Australia. This option is also available when you choose the Select Multiple Countries option, but it only returns data from country databases containing Geocoding data.

♦ CASS/DPV (Delivery Point Values). Check this option to return a two-digit Delivery Point Value for the address. This option is available for the United States only. This option is also available when you choose the Select Multiple Countries option, but it only returns data from a country database containing DPV data.

A delivery point value is a two-digit code that can uniquely represent, along with the nine-digit zip code, any mailbox address. The full delivery code, including the zip and DPV information, is typically added to sorted mail as a bar code. CASS (Coding Accuracy Support System) is a United States Postal Service means of certifying the accuracy of address validation by software.

♦ Stop on Error. Check this option if you want the plan to cease execution if the plan cannot locate up-to-date country reference data. When this option is checked, the plan will stop running if it finds that the reference data is absent, or expired, or lacks a current license. When this option is unchecked, the plan will run as normal and write a status code to the output columns.

The Input Fields Mapping area contains a Parameters column and an Input Fields column.

♦ The Parameters column lists the field names selected on the Inputs tab. The component will validate the address fields in the order in which they appear in this column.

♦ The Input Fields column contains a set of menus for every field in the Parameters column. Each menu contains an address element.

Use these menus to build the address that the component will send to the validation engines. Map each field name you require from the Parameters column to a unique field name under Input Fields.

Note: You must map an input field to the Addressline1 parameter. You must also map an input field to the Country parameter if you choose the Select Multiple Countries option.


Outputs Tab

This tab lists the possible output fields for the data associated with the instance highlighted in the Components pane. The tab shows the following options:

♦ All address field options associated with the country database selected on the Parameters tab.

♦ Formatted address fields that provide envelope-ready address lines in the manner expected by the postal carrier of the country in question.

♦ Options providing postally-relevant information in areas such as CASS/DPV certification and Geocoding.

The CASS/DPV options are enabled if a current set of United States reference data is installed on your system. Geocoding options are enabled if current reference data for the United States, United Kingdom, or Australia is installed.

Check the fields you want to use as outputs from the component.

The two outputs at the top of this pane provide information on the quality of the match found between the input address and the reference data. These outputs do not provide address data. You cannot clear these options:

♦ Match Status. Describes the type of match found for each input address.

♦ Match Code. Describes the success of the match found for each address.

For more information about the meanings of these variables, see “Global AV: Output Field Descriptions” on page 131.

Understanding Match Status and Match Code Outputs

The Global AV provides access to the processing capabilities of the address validation engines installed with Data Quality and also to any third-party address validation engines installed as plug-ins. The output fields for the Global AV are based on the output fields of these engines. The output fields for the address validation engines that are installed with the product are described here.

The Global AV reads its Match Status and Match Code values directly from the underlying component engines. Table 10-1 lists the code values returned for the engines installed by default with Data Quality. The names in these tables correspond to the names of the current and deprecated address validation components.

Table 10-2 lists the status values returned for each engine.

Table 10-1. Match Code Comparison Across All Validation Components

Global AV Address Validator International AV North America AV

Match Code Match Code Match Type Status Code (if successful match) or Error Code, Error String

Table 10-2. Match Status Comparison Across All Validation Components

Global AV Address Validator International AV North America AV

Validated Verified Correct Validated V
Unmatched Unmatched Poor/Fair deliverability X and S
Validated Verified Correct Validated 6
Multiple Matches Multiple Matches 7
Validated Verified Correct Validated 9
Good Match Good Match
Partial Match Partial Match
Tentative Match Tentative Match
Foreign Address Foreign Address
Poor Match Poor Match
Corrected Corrected
Good Deliverability Good Deliverability
Not Processed Not Processed
Engine not installed
Engine not licensed
Reference Data Missing
Reference Data Not Licensed
Reference Data Expired E
Reference Data License Expired
Unsupported Country
Incorrect Postal Code F


Use these tables to compare the values across components. These codes are also listed in appendixes for the four validation components.

Formatted Address Outputs

In addition to analyzing and enhancing input address elements, the Global AV can assemble validated address outputs in a standardized envelope-ready format. The component uses the validated input data to build the formatted addresses, eliminating the need to manually parse address values from multiple fields into standardized formats. The Global AV engines create formatted addresses on a record-by-record basis, so that each address is created in the envelope format expected by the postal carrier in its country.

Because standard address formats differ from country to country, the formatted address lines are named generically in the Global AV. The component provides ten lines for formatted addresses. Select as many lines as your address may need. The address validation engines ignore any address lines that are unused.

The outputs are named as follows:

Formatted_Address_Line_1, Formatted_Address_Line_2... Formatted_Address_Line_10

Example: United States

The Global AV uses up to four lines to create an address in the standard USPS format. Table 10-3 shows how the Global AV builds the formatted address:

Table 10-3. Standard United States Business Address Format

Global AV Output          Description
Formatted_Address_Line_1  Company or organization name
Formatted_Address_Line_2  Urbanization (where applicable, for example in Puerto Rican addresses)
Formatted_Address_Line_3  Street address, including Suite/Suite Range fields
Formatted_Address_Line_4  City, State, Zip code

The address format shown in Table 10-3 is a business address. It does not include personal name information. You can select this information separately when configuring the plan outputs.

Note: You cannot change the output values that the component writes to the formatted address fields. The selections are determined in the underlying validation engines.
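As an illustration of what an envelope-ready result looks like, the sketch below assembles the four-line United States format from Table 10-3 out of hypothetical validated fields and drops unused lines. The real formatting is performed by the underlying validation engines, and the values they write cannot be altered.

```python
def us_formatted_address(company, urbanization, street, city, state, zip_code):
    """Build the four USPS-style lines shown in Table 10-3, skipping empty lines.
    Hypothetical sketch only: the Global AV populates these fields itself."""
    lines = [
        company,                                    # Formatted_Address_Line_1
        urbanization,                               # Formatted_Address_Line_2 (Puerto Rico, etc.)
        street,                                     # Formatted_Address_Line_3
        f"{city}, {state} {zip_code}".strip(", "),  # Formatted_Address_Line_4
    ]
    return [line for line in lines if line]

for line in us_formatted_address("Acme Corp", "", "100 Main St Ste 200",
                                 "Palo Alto", "CA", "94301-1234"):
    print(line)
```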





Writing Formatted Addresses To Target Components

Formatted addresses answer a particular business need. If you do not need envelope-ready address information, you need not select the formatted address options in the Global AV or in your plan target components. If you select these options, you must have a strategy for using the information when it leaves the data quality plan. You should consider the structure of the file or database table that will contain the formatted addresses.

When defining or editing a plan to create formatted addresses, consider the following strategies:

♦ Add an additional target to an address validation plan, and select only the formatted address outputs in that target.

♦ Create a copy of an address validation plan and replace the target components with new targets that use the formatted outputs only.

Address Formatting And Invalid Addresses

When your dataset contains only validated addresses, you can follow the strategies above with no difficulty. When your dataset produces mixed validation results, you must decide how to handle the addresses that Data Quality identifies as invalid or partially valid.

How the Global AV formats a poor-quality address depends on the engine that processes reference data for that address. Informatica provides reference data on a country-by-country basis. For example:

♦ If the Global AV cannot validate an input address from the United States data, the Global AV does not write any values to formatted address fields. The Global AV calls the Melissa Data processing engine to process United States address data.

♦ If the Global AV cannot validate an input address from France, it writes the original input values to the formatted address fields. The Global AV calls the QAS processing engine to process French address data.

You must test your plan output to verify that you receive the formatting results you expect. If your plan writes both valid and invalid addresses to the formatted address fields, you can use a Rule Based Analyzer to create new outputs from formatted addresses where the address records meet one or more validation status criteria.

Reference Data Engines And Supported Countries

Use Table 10-4 to determine how the Global AV handles non-validated addresses from different countries when populating the formatted address fields:

Table 10-4. Address Formatting By Country (Invalid Data)

Country Processing Engine Formatted Address Handling When Data Is Invalid

Brazil Address Doctor Global AV writes the best available values to the formatted address fields.

Argentina Address Doctor Global AV writes the best available values to the formatted address fields.

Australia QAS Global AV writes original input values to formatted address fields.

Canada Melissa Data Global AV does not write data to formatted address fields.

Czech Republic Address Doctor Global AV writes the best available values to the formatted address fields.

Denmark QAS Global AV writes original input values to formatted address fields.

France QAS Global AV writes original input values to formatted address fields.

India Address Doctor Global AV writes the best available values to the formatted address fields.

Luxembourg QAS Global AV writes original input values to formatted address fields.

Mexico Address Doctor Global AV writes the best available values to the formatted address fields.

Netherlands QAS Global AV writes original input values to formatted address fields.

Poland Address Doctor Global AV writes the best available values to the formatted address fields.

Russia Address Doctor Global AV writes the best available values to the formatted address fields.

Singapore QAS Global AV writes original input values to formatted address fields.

South Africa Address Doctor Global AV writes the best available values to the formatted address fields.

Turkey Address Doctor Global AV writes the best available values to the formatted address fields.

United Kingdom QAS Global AV writes original input values to formatted address fields.

United States Melissa Data Global AV does not write data to formatted address fields.


Enhancing Address Validation Engine Performance

You can edit the configuration files associated with the Melissa Data and Address Doctor engines to improve data processing speed and to log messages warning of data expiry. For more information, see the Data Quality Installation Guide.




C H A P T E R 1 1

Dictionary Management

This chapter includes the following topics:

♦ Overview, 103

♦ Dictionary Manager, 104

♦ Updating Dictionary Files, 104

♦ Creating a Dictionary, 106

Overview

Informatica Data Quality plans can use the following types of reference data:

♦ Dictionary files. Plain-text files provided by Informatica and saved in the DIC file format. These files are usable in many Workbench components and are installed by the Content Installer.

♦ Database dictionaries. User-created reference datasets stored in database tables. These tables can be updated dynamically when the underlying data is updated. Informatica does not provide these dictionaries.

Database dictionaries are a convenient way to use data that has been created for other purposes. By making use of a dynamic connection, data quality plans can always point to the current version of a database dictionary.

♦ Third-party reference data. File-based and database reference datasets originating from third party sources and offered by Data Quality as additional product options. Required for address validation components. The Content Installer installs these datasets.

This chapter describes the DIC files provided by Informatica and the process to create a dictionary. For more information about third-party reference data, contact Informatica Global Support.

Dictionary Files

Dictionary files provide an authoritative reference source for many areas in which common terminology is used, including postal address terms, city names, units of measurement, personal salutations, telephone area codes, and company names. Many Data Quality components provide options for comparing or updating input data against dictionary data. These dictionaries are editable, and you can also define your own dictionaries.

A dictionary file is essentially a text file saved in a proprietary (.DIC) format. Each file contains one or more label entries with one or more item entries for each label. The label represents the correct or standard form of a word or term. The item values for each label represent a range of variant or alternative spellings. Any operation that updates your dataset from a dictionary does so by locating an item entry and returning its corresponding label.
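To make the label and item relationship concrete, the sketch below standardizes values against a dictionary loaded into memory. It assumes a simple tab-delimited layout (a label followed by its item variants on each row); the actual .DIC format is proprietary, so treat this purely as an illustration of the lookup behavior.

```python
import csv

def load_dictionary(path):
    """Build an item -> label lookup from a delimited dictionary file.
    Assumed layout per row: Label, Item1, Item2, ... where Item1 repeats the label."""
    lookup = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue
            label, items = row[0], row[1:]
            for item in items:
                if item:
                    lookup[item] = label
    return lookup

def standardize(value, lookup):
    """Return the dictionary label for a recognized item, or the value unchanged."""
    return lookup.get(value, value)

# Hypothetical usage with a street-suffix dictionary:
# suffixes = load_dictionary("Dictionaries/Address/street_suffixes.dic")
# standardize("Avnue", suffixes)   # -> "Avenue" if "Avnue" is listed as an item
```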


Data Quality reads dictionary files from the Dictionaries folder created at install time. The Data Quality installer does not add dictionaries to this folder. Dictionaries are added by the Content Installer.

When you run a local plan, Data Quality Workbench looks for any dictionaries cited in the plan in the Dictionaries folder of your Workbench installation. When you run a plan across the service domain, Data Quality Server looks in the local Dictionaries folder and also in your Dictionaries folder on the service domain. For more information, see “Dictionary Files” on page 7.

Note: The dictionary folders read by Data Quality are set during product installation. Their locations can be changed later if necessary. For information on changing these locations, contact Informatica Global Customer Support.

Dictionary Manager

The Dictionary Manager is an applet within Workbench that allows you to view and manage the contents of the local Dictionaries folder. To open the Dictionary Manager in Workbench, press F8.

When you use the Dictionary Manager for the first time following the Content Install, it appears populated with multiple folders. Figure 11-1 displays the Dictionary Manager window:

Figure 11-1. Dictionary Manager

Note: The Content Installer overwrites any files with the same names that it finds in the Dictionaries folders. If you have created, renamed, or moved any dictionaries since install and wish to rerun the Content Installer, back up these files first.

Updating Dictionary Files

A dictionary file is organized as a table with a column of definitive spellings for the terms in the dictionary and one or more columns for matching or acceptable variant spellings. Each dictionary term has entries in at least two fields:

♦ Label field. Represents the spelling that will be written back to the plan.



♦ Item fields. Represent the forms of spelling that are recognized as a match for the Label in the input data. The first item field always contains the same spelling as the Label field, that is, it matches the formally correct or approved spelling of the term.

You can create or update a dictionary in the following ways:

♦ Add or delete an item. Add or delete variant spellings for an existing dictionary term.

♦ Add or delete a label and its related items. Add or delete a definition from the dictionary.

♦ Create a new dictionary file. See page 106.

Before deleting data from a dictionary, be sure that doing so is appropriate for all plans that reference the dictionary.

Note: You should back up or rename any dictionary you edit. If you rename a dictionary that is used by a plan, you must edit the plan components to recognize the new dictionary name. If you edit a dictionary but do not change its name, you do not need to update the plan configuration.

Adding New Items

You can add new spellings to existing definitions. For example, the Numeric Patterns dictionary contains character patterns for many types of personal data, such as Social Security numbers, telephone numbers, and zip codes. You can add a variant pattern for one of these data types.

In Figure 11-2, a pattern for a U.S. area code and telephone number has been added to the Item4 field. This pattern divides the numbers with blank spaces, indicated by an underscore:

Figure 11-2. Numeric Patterns Dictionary

To add new spellings to a term in the dictionary:

1. Open the dictionary in the Dictionary Manager and locate the row containing the term.

2. Type the new spelling in the first empty cell on the row.

Adding New Labels

You can add new terms to a dictionary and define the related spellings. Dictionary labels do not need to be in alphabetical order.

The decision to add terms to a dictionary depends on the purposes of the plans that will use it. You might not want to recognize all possible variations in a data value.

To add a new term to a dictionary:

1. Open the dictionary and type the formal spelling in the first empty Label field and the Item1 field. These two fields must be identical. You might need to scroll the dictionary contents to reach an empty row.

2. In the adjacent Item fields, type any variant spellings you want to include in the dictionary. Start in the Item2 column.



Creating a Dictionary

You can create text dictionaries or database dictionaries.

To create a text dictionary:

1. Open the Dictionary Manager and select the folder where you want to create the new dictionary.

2. Right-click in the right pane of the Dictionary Manager and click New Dictionary > Text.

An empty dictionary worksheet displays.

3. Type or copy a list of values into the Label and Item columns of the dictionary.

4. Close the dictionary and click Yes to save the dictionary.

The dictionary appears in the folder with the name New Dictionary.

5. To rename the dictionary, right-click the dictionary name and select Rename.

6. Type a new name for the dictionary.

The newly-created dictionary can be viewed in the Dictionary Manager and can be found in the Dictionaries folder of your Data Quality installation.

Note: You can add a correctly-formatted text file with the extension DIC to folders in the Dictionaries folder structure. The file will be visible in the Dictionary Manager.

To create a database dictionary:

1. Open the Dictionary Manager and select the folder where you want to create the new dictionary.

2. Right-click in the right pane of the Dictionary Manager and click New Dictionary > Database.

The Select Two Columns for Dictionary dialog box opens.

3. Complete the enabled fields under the Connect To Database tab.

Fields differ based on the database type you select.

The default database setting is Staging. It refers to the local database used by Data Quality. You can select any valid connection.

♦ When you connect to IBM DB2, Microsoft SQL Server, or ODBC-compliant databases, you must provide a DSN (Data Source Name) for the database. You might be prompted to provide a valid login. The DSN field identifies the database on the network.

♦ When you connect to an Oracle database, you must provide the SID (System Identifier) for the Oracle instance.

♦ You might be prompted for login information if you select a non-default database type.

♦ You can identify the character encoding associated with the data in the dictionary. For more information, see “Character Encodings and Unicode” on page 143.

4. Click Connect.

The During tab displays.

5. Under this tab, select the two columns to use for the Label and Item1 values in the dictionary, and click OK.

Creating Dictionary Files with the Report Viewer

The Data Quality Report Viewer allows you to create dictionary files from the output of a data quality plan.

To create or append to a dictionary file using the Report Viewer, your plan should write its output to a Report Target. A Report Target creates output files in a proprietary SSR file format that allows plan data to display graphically and in Data Quality dashboards.


The Report Target accepts data only from a frequency component, such as a Count component. The Count component counts the occurrences of data values in a selected column. You can drill down into the summary calculations for each column in the Report Viewer to locate the raw data for a dictionary file. When you drill down into data, you can select a data column and add it to an existing dictionary or create a new dictionary.

For more information about the Report Target, see page 29. For more information about the Report Viewer, see page 109.

To create or append to a dictionary file using the Report Viewer:

1. Open the Report Viewer. Open the SSR file that references the plan data to be added to the dictionary.

You can open an SSR file in two ways:

♦ In Workbench, run a Data Quality plan with a Report Target, ensuring that the Report Target has been configured to launch the Report Viewer on plan execution.

♦ In the Report Viewer, click File > Open and browse to the SSR file for the report in question.

2. With the report open in standard view, right-click the row for the relevant data instance and select Open.

A spreadsheet opens, showing all data rows for the instance you have selected.

3. If you want to save the full contents of a column to a dictionary file, right-click in the column and click Edit > Select Column.

The entire column is highlighted.

-or-

If you want to save a selection from a column to a dictionary file, Shift-click to select the required values.

4. Right-click the highlighted values and select Export To > Dictionary File.

The Select Dictionary Name dialog box opens.

5. Browse to a location in the Informatica Data Quality Dictionaries folder structure.

6. If you want to create a new dictionary, type a new dictionary name.

-or-

If you want to append to or replace a dictionary, select a dictionary name.

You will be prompted to append to or overwrite the current data for the dictionary.

7. Click OK.


Figure 11-3 illustrates how you can drill down through report data, right-click a column, and save the column data as a dictionary file. This file becomes populated with Label and Item1 entries corresponding to the column data:

Figure 11-3. Creating a Dictionary File with the Report Viewer

In this case, the dictionary will contain a list of serial numbers from customer records that include invalid zip codes. You can now create plans to check customer databases against these serial numbers.



C H A P T E R 1 2

Report Viewer

This chapter includes the following topics:

♦ Overview, 109

♦ Viewing Data in the Report Viewer, 109

♦ Standard View and Dashboard View, 111

♦ Viewing Plan Data, 114

♦ Report Viewer Parameters and Settings, 115

♦ Tracking Changes in Data Quality, 116

♦ Importing Report Files and Working with Groups, 117

Overview

This chapter describes the Data Quality Workbench Report Viewer. The Report Viewer allows you to perform the following tasks:

♦ Display plan results, both in graphical and numerical formats and in a dedicated viewing application.

♦ View drill-down analysis of the raw data underlying the plan results.

♦ Create data quality dashboards that can be exported in spreadsheet and HTML form for business users and other interested parties.

♦ Save key subsets of plan data to file for use as reference dictionaries.

The Report Viewer is particularly suited to displaying data quality dashboards, which explore the quality of a dataset according to criteria set by the business.

You can use the Report Viewer to view the SSR report files that are created by plans containing a Report Target.

Viewing Data in the Report Viewer

You can open and read data in the Report Viewer.

Opening the Report Viewer

The Report Viewer can be activated in three ways:


♦ Configure the Report Target to generate a report in Standard/SSR report format, check Launch Report on Completion, and then execute the plan.

♦ Open the Report Viewer from the Data Quality Workbench program group via the Windows Start menu. You can use the Report Viewer’s File menu to open a report file.

♦ Click the Report Viewer toolbar button in the Data Quality Workbench user interface.

Reading Report Data

The Report Viewer can display data for all items selected in the frequency components of the plan. Data items typically have many kinds of data associated with them.

When you select a data item in the Count component, you add the number of times each value occurs to the report.

For example, a plan might contain a business rule defined in a Rule Based Analyzer that tests the accuracy of the currency type associated with data records. In this case, the Rule Based Analyzer creates a new data column whose fields may read Valid Currency or Invalid Currency.

The Report Viewer might also show the number of empty fields and values excluded from calculations depending on the parameters of the preceding operational component, such as the number of values classified as Others by the Count component. For this reason, it is important to understand how frequency components are configured. A large number of Others values can indicate that the Count component needs to be reconfigured.

Types of Graph

In standard mode, you can choose from two graphing options for a data item from the View menu:

♦ Pie Chart

♦ Bar Chart

Beneath each chart type, the data for the item is tabulated. The No Graph option omits both chart types.

When you open the Report Viewer, the right pane displays data for one item at a time. You can select an All Reports option through the View menu that displays all items in scrollable form in the right pane.

The View menu also lets you set the orientation of the bars in the chart to horizontal or vertical. The legend for the charted item appears below the chart, providing precise metrics for the quantity and percentage of the charted data.

Figure 12-1. Report Viewer, Standard View


Standard View and Dashboard View

You can view data in the report viewer in two modes:

♦ Standard view

♦ Dashboard view

Standard View

The Report Viewer opens in Standard view by default, presenting its information in two panes. The left pane lists the source fields selected in the frequency components in the plan. The right pane displays the following information:

♦ A bar chart or pie chart for each item in the left pane.

♦ The numbers of records that satisfy or do not satisfy the quality criterion for each item and the percentage of data in the item that each number represents.

Any changes you make to the view settings for the report are stored in a master settings file for the Report Viewer. For example, if you leave standard mode by selecting Dashboard view, the report data displays in dashboard mode the next time the SSR file is opened.

Dashboard View

Dashboards illustrate the ongoing progress of the dataset towards data quality business targets. When you activate the dashboard, the standard view is collapsed, and the items are presented in a series of bar charts that can be arranged in data quality categories.

Dashboards can display the following information:

♦ The percentage of records that satisfy the data quality criterion underlying each item.

♦ The data quality target set by the business for each item.

♦ Horizontal bars charting the percentage of good quality records in each item with each bar color-coded to indicate whether the data meets or misses its target.

♦ An icon that indicates whether the data quality in the item is improving over time.

♦ The percentage of records in each item that satisfied the respective data quality criteria in previous executions of the plan.

Select View > Dashboard from the main menu to toggle between standard and dashboard modes.

Setting Data Quality Targets in the Dashboard

The fields in the Target column for each data item are editable. You can activate the cursor in each field and type a percentage target value for it.

♦ When a data item meets its target, that is, when the percentage in the Passed field meets or exceeds the percentage in the Target field, the horizontal bar for that item turns green.

♦ When the Passed percentage is lower than the Target percentage, the horizontal bar turns red, except in cases where the shortfall is within the threshold set in the Settings dialog box.

Modifying Dashboard Calculation Parameters

In addition to setting the weight associated with an item and its target percentage, you can add or remove data elements from the data quality percentage calculation for that item. This allows you to display the data quality compliance percentages for constituent elements within the data item.


To view and edit the list of data elements for a data item, right-click the item and select Configure Items. This opens a configuration dialog box that lists the data elements associated with the item and shows which ones are applied to the passed percentage calculation.

Check an element to add it to the calculation. To remove an element, clear its checkbox. Select at least one element.

Note: Item configuration changes made in the dashboard are not applicable to the charts and statistics in standard mode.

Dashboard Categories

In dashboard mode, you can create categories and assign data items to them. You typically create categories to display items with common data quality criteria. Figure 12-2 on page 112 shows categories for Accuracy, Completeness, Conformity, and Consistency and also the default New Items category.

Categories are managed through the Dashboard Categories dialog box. This dialog box provides options to add new categories, edit category names, and move categories higher or lower in the dashboard report.

To open this dialog box, right-click any data item on the dashboard and select Configure Categories:

Creating a Category

Use the following procedure to create categories.

To create a category:

1. Open the Dashboard Categories dialog box and click Add.

The Category Name dialog box opens.

Figure 12-2. Report Viewer, Showing Dashboard Categories


2. Type a name in this dialog and click OK.

3. Click Close in the Dashboard Categories dialog box.

Assigning Items

All dashboards contain a single category when first created, named after the plan. All data items reside in this category before you assign them to other categories.

Data Quality Workbench creates a new category for each new plan/group added to the report.

To assign a data item to a category:

1. On the dashboard, highlight the item name.

2. Right-click the item and select Move To from the context menu.

This displays a list of available categories.

3. Without leaving the context menu, select a new category for the item.

Note: A dashboard displays all items available to the Report Target. Items cannot be hidden or deleted from the dashboard.

Moving Rows within Categories

You can move a row of data within a dashboard category.

To move a data row within a dashboard category:

Hold the Alt key and drag the row to a different location in the category.

Deleting a Category

You can delete categories from a dashboard. A category that contains a data item cannot be deleted from the dashboard. Assign the data item to a different category before deleting the category.

To remove a category from the dashboard:

Highlight the category in the Dashboard Categories dialog box and click Remove.

Assigning Weights to Data Items

Each category on the dashboard has a weighted average: the average pass percentage across all items in the category, calculated based on the weight assigned to each item.

By default, all items have an equal weight of 1.0. You might change this value based on the business importance of the item within the category or the relative number of data records represented by the category. A higher number reflects higher relevance for that item. A lower number reflects lower importance. Setting the number to 0 removes the item from the calculation of the average pass rate for the category.

To review and edit the weight assigned to an item:

1. Highlight the first row in its category, right-click, and select Configure Items.

This opens the Weighted Average Configuration dialog box, which lists the items in the category and the current weight for each one.

Note: The first row in each category is named Weighted Average by default. This name can be changed in the Weighted Average Configuration dialog box. However, the first row always provides the weighted average pass rate for the category and appears in bold type. The configuration dialog box name is static regardless of the item name displayed in the first row.

2. Enter new weights as necessary.


Viewing Plan Data

You can use the Report Viewer to drill down into the underlying plan data, including the source data, in tabular form. From the drill-down table, you can filter the data to pinpoint different data values and copy all or part of the dataset to a CSV file or the clipboard.

In standard mode, you can double-click any chart element in the right pane to open a new window that displays data records matching the properties of that element. You can also right-click any highlighted element in the legend and select Open.

Dashboards provide another means to view the underlying data.

To view the records that do not satisfy the quality criteria for an item:

Right-click a highlighted data item in dashboard mode and select View Exceptions.

Note: When you drill down to data within the Report Viewer, you refresh the view of the underlying plan data, displaying the current state of the dataset. If the data has changed since the plan was last run in Workbench, these changes are visible in the Report Viewer. This does not alter the SSR file or the plan.

Drill-down mode can display either the columns in plan source data or all columns used in the plan. The latter includes both source data columns and columns created in the plan. Configure this setting in the Report Viewer Settings dialog box.

Exporting and Filtering Data in Drill-Down Mode

In drill-down mode, you can export data to a CSV file or to a dictionary (.DIC) file.

To export data to a dictionary file:

1. Right-click the data values you want to export and click Export To > Dictionary.

The Select Dictionary Name dialog box displays.

2. You can append the data to the dictionary or overwrite existing data by selecting an existing dictionary file.

-or-

You can enter a new name in the File name field to create a new Data Quality Workbench dictionary with values for Label and Item1.

3. Save the dictionary in a location recognized by the Dictionary Manager.

To export data to a CSV file:

1. Right-click the data values you want to export and click Export To > CSV File.

The Select CSV File Name dialog box displays.

2. You can overwrite data in an existing file.

-or-

You can enter a new name in the File name field to create a new CSV file.

You can use the context menu to filter the data that displays and focus on a subset of data. The drill-down context menu provides the following options:

♦ Edit > Select Column. Selects all values in the column.

♦ Edit > Select All. Selects all values in the table.

♦ Edit > Copy. Copies the highlighted cells to the Windows clipboard. You can Ctrl-click or Shift-click to highlight cells across multiple rows and columns, and then copy their contents to the clipboard.

♦ Export to > Dictionary. Copies the highlighted cells to a reference dictionary (.DIC) file.

For more information about creating dictionaries using the Report Viewer, see “Creating Dictionary Files with the Report Viewer” on page 106.


♦ Export to > CSV File. Copies the highlighted cells to a CSV File.

♦ Filter > Filter by Selection. Hides all records that do not contain the value in the highlighted cell.

♦ Filter > Remove Filters. Removes the filter applied and restores the data table.

♦ Filter > Auto Filter. Adds a new cell at the top of every column in the table. Each cell provides a menu of every data value in the column. You can select a value from any cell to filter the table for records containing the same value in the same column.

You can use multiple cells in a filter, resulting in data that fulfills all filter requirements. Select Unfilter to clear these filters.

♦ Find. Opens a dialog box that permits searches of selected columns or the entire table.

Report Viewer Parameters and Settings

Bear the following points in mind:

♦ The Report Viewer displays report files. The SSR files displayed in the Report Viewer are written or updated only when the plan is executed using the Workbench Run Plan command. You cannot edit or save report files using the Report Viewer.

♦ The Report Viewer stores settings in a master report settings file. Some display settings are stored automatically, such as the display mode and report charts display. Other settings can be set as properties. The Report Viewer does not store report settings in the SSR file.

♦ Some key report settings cannot be restored if they are changed in the Report Viewer. If you delete the dashboard history, for example, you cannot restore it, even if you run the plan again or have a backup SSR file. There is no Undo function in the Report Viewer.

Editing Report Viewer Settings

Several settings and display parameters relating to all viewed reports can be set manually.

The following settings are available in the Report Viewer Settings dialog box. Click File > Preferences to access this dialog box.

♦ Limit pages to [n] records. Sets the number of records displayed when you drill down to the data records underlying the plan. The default value is 500.

♦ Limit record retrieval to [n] records. Sets the number of records retrieved in a drill-down operation. This setting is useful when you want a snapshot of the plan data and do not need to run the entire plan. The default value is 2000.

♦ Limit column autosizing to [n] characters. This value sets the default column width. Any field that is not wide enough to display all characters in a string displays an arrow indicator. The default value is 30 characters.

♦ Limit Pie chart to [n] slices. This value sets the number of slices that display in report pie charts. Any data values that do not fall into the number of slices set by this field are aggregated into a single slice.

The default value is 10 slices, displaying a maximum of nine slices that refer to data elements and a tenth slice for the remaining elements.

Use this setting to keep pie charts easy to read. It is also a useful method of grouping data elements for drill-down purposes.

♦ Limit Bar chart to [n] bars. This value sets the number of bars that display in report bar charts. Any data values that do not fall into the number of bars set by this field are aggregated into a single bar.

The default value is 10 bars, displaying a maximum of nine bars that refer to data elements and a tenth bar for the remaining elements.

As is the case with pie charts, you can use this setting to group data elements for drill-down purposes.


♦ Show orange bar when within [n] percent of target. This setting relates to dashboards. It provides a visual cue to indicate when a data quality level approaches its data quality target. The default setting is 5 percent.

♦ Show component columns. Use this option to show all data columns available in the plan in drill-down view. This option is cleared by default, displaying only source data columns for drill-down.

♦ Report template. Displays the path to the XSL template on which the standard report view is based.

♦ Dashboard template. Displays the path to the XSL template on which the dashboard view is based.

♦ Dashboard history template. Displays the path to the template for the dashboard history graph.

Hiding Data Elements in Standard View

In addition to limiting the bar chart and pie chart segments displayed through the Settings dialog box, you can hide data elements through the legend displayed in standard mode.

To hide data elements:

Right-click the element and click Hide.

The item is removed from the legend and from any chart above it.

To restore hidden data elements:

Right-click the legend and click Unhide.

The resulting dialog box will list all hidden items. You can choose one or more of these to restore.

Note: In dashboard view, the Report Viewer stores drill-down settings across successive Report Viewer sessions and successive plan executions. However in standard view, hidden data settings are not stored.

Tracking Changes in Data Quality

A dashboard is particularly useful for tracking changes in the data quality levels of the dataset, data item by data item. It provides two means to do so:

♦ Historical percentages

♦ Historical trend graphs

Historical Percentages

A dashboard can show the changes in the percentage data quality achieved by a data item over time. The Report Viewer remembers the data quality percentages from the most recent dashboard view on each day that the report is opened. That is, the Report Viewer remembers one set of percentages a day. These percentages appear on the right of the dashboard.

Historical Trend Graphs

At a high level, an arrow in the left-most column on the dashboard indicates whether the data quality for an item has improved or declined since the base point date. (No arrow means there has been no change.)

For a more detailed view, highlight the item name, right-click on the dashboard, and select View History... from the context menu. This opens a line graph plotting the progress in data quality for the item over time.

Viewing the Line Graph

The line graph displays percentage values on its vertical axis and date values on its horizontal axis. Right-clicking in the graph area provides access to the following options:


♦ Copy. Use to copy the chart image to the clipboard.

♦ Set as base point. Use to set the selected percentage as the baseline for the graph. In a graph with multiple data points, a pair of dotted X-Y lines identifies the selected percentage.

♦ Clear history before point. Use to clear all history before this date. When you select this option, you are asked if you want to clear the history for all other items on the dashboard. The default option is Yes. Click No if you want to clear the history for this item only. Click Cancel to cancel the operation.

Note: The Clear command deletes the earlier graph history and the associated historical data on the dashboard itself. Once deleted, this information cannot be restored.

Importing Report Files and Working with Groups

You can combine data from multiple report files into a single view in the Report Viewer by using the Import command. This command identifies an SSR file and imports its data into an open report.

When you import a report, you create a group comprising data from the imported report and the report previously open in the Report Viewer. A group is a collection of settings saved to the master report settings file that points to multiple SSR files and defines how they display.

The group does not store report data or edit the SSR files.

Creating a Group

Use the following procedure to create groups.

To import data from a report file and create a group:

1. Select File > Import... from the main menu.

The Import Report dialog box opens.

2. Browse to the location of the SSR file and click OK.

When you identify the relevant file, a new dialog prompts you to type a group name for the combined report data.

Managing Groups

Use the following procedure to view or delete groups.

To view the groups available to the Report Viewer:

1. Click File > Groups to open the Manage Groups dialog box.

2. To view a group, highlight its name and click Open.

3. To delete a group, highlight it and click Delete.

Clicking the Close button closes this dialog box.

You cannot delete the currently open group.

Groups and Dashboards

Groups are useful for aggregating and displaying the data analyses of several plans. This can provide a wide-angle view of the quality of the business data, particularly when scorecards are built for the group.

You can define a dashboard for a group as you do for a single report. With group dashboards, you can define one or more categories containing key items from multiple reports.


Note: You cannot toggle between a dashboard for a single report file and a dashboard for a group. When you view the dashboard for a group, the Report Viewer drops the dashboard for the originally opened report file and displays dashboards for the available groups for the remainder of the Report Viewer session. To return to the earlier report file, open the file again.


C H A P T E R 1 3

Deploying Plans for Runtime Execution

This chapter includes the following topics:

♦ Overview, 119

♦ Deploying Runtime Plans, 119

♦ Running a Plan, 120

♦ Command Line Arguments, 122

♦ Performance, 123

♦ Multi-Threading and Multi-Processing, 124

♦ Security, 125

Overview

Data Quality supports the deployment of plans for runtime execution — that is, for execution as part of a scheduled or batch process. Plans created in Data Quality Workbench can be published from one Data Quality repository to another. The execution of the plans is then managed from the command line. You can deploy plans on Windows and UNIX platforms.

Note: In earlier versions of Informatica Data Quality, the capability to deploy plans for scheduled or batch execution was delivered through a separate application called Data Quality Runtime. In this version, Runtime functionality has been incorporated into Data Quality Server. This chapter describes how to deploy and run runtime plans.

For information about the prerequisites and system requirements for runtime functionality, see the Informatica Data Quality Installation Guide.

Deploying Runtime Plans

Plans deployed for batch or scheduled execution can be run from one of two locations:

♦ Directly from the Data Quality repository (enterprise installs only).

♦ As an XML file from the local file system.


The local or remote Data Quality repository is identified in the config.xml file on the machine that runs the plan.

Data Quality Workbench users in a service domain can use the Project Manager and File Manager to publish plans and move file resources to a remote Data Quality repository for deployment. All plans published to the repository are available for execution by Informatica Data Quality as long as the paths to all relevant data and dictionary files are valid for the plan. You can identify the paths and filenames using parameter files. For more information, see “The -c Option” on page 122.

You can convert plans to XML files from the Workbench interface and deploy the plan files and other resource files. For example, you can transfer files to another computer using FTP.

Note: When executing a runtime plan, Data Quality looks in the default Dictionaries folder for plan dictionaries. However, you can specify data source files located anywhere on the Runtime host as long as their locations are specified in a parameter file associated with the plan. For this reason, Data Quality Workbench allows you to specify the source and target file locations when you save a plan as XML.

Use runtime plans in environments where the data repository is updated periodically from one or more low-quality source systems and you need to cleanse the data and run reports on it on a regular basis.

On Windows, the executable file for implementing runtime functionality is Athanor-RT.exe, located in the bin folder of the Data Quality Server installation.

On UNIX and Linux, the executable file is a script named athanor-rt, located in the bin folder of the Data Quality Server installation. This script calls the Athanor-RT executable file using a suitable environment.

Note: Do not run the Athanor-RT executable directly on non-Windows platforms.

Running a Plan

Data Quality can execute a plan as an XML file from the file system or from the Data Quality repository.

The -f flag specifies that athanor-rt should read a plan from an XML file in the local file system. The -p flag specifies that the plan should be read from the repository identified in the local config.xml file. For example, the following command runs myplan.xml from the home/Informatica/DataQuality/plans folder:

athanor-rt -f home/Informatica/DataQuality/plans/myplan.xml

The following command runs myplan from the Folder1 folder in the Project1 project in the repository:

athanor-rt -p project1/folder1/myplan

Note the following:

♦ You can use the -c option to have Data Quality read plan variables and source file locations from a parameter file. This allows you to reuse a plan without having to edit the plan for each scenario. For more information, see “Command Line Arguments” on page 122.

♦ Parameter files are also important elements in plan execution. Use a parameter file to identify the locations of the data source files.

♦ As Data Quality executes plans, it logs messages to the screen, to the local log file, and to the Event Log on Windows platforms or syslog on UNIX platforms, as configured in the config.xml file.

Version Control

Data Quality Server provides version control for plans stored in the repository. The -p option allows you to identify a base version of a plan for runtime execution.

For example, the following command runs base version 3 of myplan:

athanor-rt -p project1/folder1/myplan:3


Scheduling Operations

Data Quality can run plans in batch mode automatically, by means of a scheduling application, or manually, by an operator. For example, when an overnight batch schedule updates a database from a series of data feeds, you can call the Data Quality engine to check the feeds for data quality problems. You can call the command line application with a scheduler such as Windows Task Scheduler or UNIX Cron.

Windows Scheduling

The following steps describe how to schedule a plan on a Windows computer:

1. Create a batch file QualityReport.bat and add the desired command, for example:

"C:\Program Files\IDQ\bin\Athanor-RT.exe" -f C:\Plans\QualityReport.xml

2. Run the batch file to ensure that it works as expected.

Run the file with the user profile of its intended user.

3. Add a new task.

Open the Scheduled Tasks window from the Windows Control Panel. Right-click in the window and click New > Scheduled Task from the shortcut menu, and name the task.

4. Open the property sheet for this task and edit its settings as follows:

On the Task tab:

♦ Type the local path to the batch file in the Run field, such as C:\Plans\QualityReport.bat.

♦ Type the path to the Data Quality installation in the Start In field, such as C:\Program Files\IDQ.

♦ Select the user profile that will run the plan. Remember to confirm that the file will run correctly for that user.

On the Schedule tab, specify when you want to run the task.

Review the Settings tab fields. The default settings on this tab are sufficient for most tasks.

5. Click OK and, if prompted, enter a username and password.

The task is now under the control of the Windows Task Scheduler.

6. To add pre- or post-task operations, add steps to the batch file or add new tasks to the Scheduler.

You can use any scheduler with the ability to run command line tools.

Note: If the Windows Scheduler cannot find the specified file, check for spaces in the paths provided in step 4 above. Check the path by running the file from the command line. If spaces are present, surround the path with quotation marks, as follows:

"C:\My Tasks\QualityReport.bat"

The batch file returns the error code of the last command executed.
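As an alternative to the Scheduled Tasks window, you can create the task from the command line. The following is a minimal sketch using the Windows schtasks utility; the task name, schedule, and start time are assumptions, and the exact switches can vary by Windows version:

schtasks /create /tn "DQ Quality Report" /tr "C:\Plans\QualityReport.bat" /sc daily /st 02:00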

UNIX Scheduling

The following steps illustrate the scheduling of plan Profile.xml on a Solaris machine using the cron scheduler:

1. Create a shell script called QualityReport.sh and add the run command, for example:

$HOME/athanor/bin/athanor-rt -f $HOME/Plans/Profile.xml

2. Run the shell script to make sure it performs as expected.

3. Create a new scheduled task using the crontab -e shell command.

The following task runs QualityReport.sh and logs standard and error messages to /tmp/QualityReport.log:

0 02 * * * sh -f /export/home/athanor/QualityReport.sh > /tmp/QualityReport.log 2>&1

You can use any scheduler that has the ability to run command line tools. For more information about using cron and crontab, see the “man crontab” and “man cron” commands or contact your system administrator.


Command Line Arguments

Typing athanor-rt -? at the command prompt displays the following output:

Usage: .\Athanor-RT.exe [ -f <XML plan filename> | -p <project name>[/<folder name> ... ]/<plan name>[:<version id>] ] [ Options ]

Specify a plan:
  -f <XML plan filename>   Run the plan contained in the runtime plan XML file
  -p <Repository plan>     Run the plan from the repository specified by the path

Options:
  -c f   Use the parameter file f to override values in the XML plan
  -i n   Display progress information every n records
  -?     Display this usage screen
  -h     Display this usage screen

For more information about options -f and -p, see “Running a Plan” on page 120.

The -c Option

Data Quality supports the use of parameter files that can facilitate the deployment of a plan in one or more environments. The parameter file is passed to the Data Quality engine using the -c option.

The parameter file defines the environment-specific values to be used when the plan is executed. For example, a mapping between the original location of a source file and its new location can be mapped in the parameter file:

C:\Program Files\IDQ\DevData\Source.csv=C:\Program Files\IDQ\users\user.name\Files\ProdData\Source.csv

Such mappings are platform-independent, that is, a Windows path can be mapped to a UNIX path, and vice versa.

You can export or publish a plan and notify an administrator who applies the parameter file. Alternatively, you can prepare the parameter file before exporting or publishing the plan.

To make best use of the -c option, establish a standard convention to indicate the kind of information files contain. Take care when defining mappings in the parameter file. For example, the mapping “word=book” will replace all instances of “word” in the XML file, including tags such as <password>, which can result in an invalid plan.
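The following is a minimal sketch of a parameter file; the file names and paths are assumptions. Each line maps a value in the plan XML to the value to be used at execution time:

C:\Plans\DevData\customers.csv=C:\ProdData\customers.csv
C:\Plans\DevData\orders.csv=C:\ProdData\orders.csv

You might then run the plan with a command such as:

athanor-rt -f C:\Plans\QualityReport.xml -c C:\Plans\prod_params.txt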

Encryption

The details in a parameter file, such as passwords and database connection details, often need to be secured. To maintain security, an administrator can encrypt the parameter file by passing it to the Athanor-Encode utility. This generates an encrypted file, with the extension .enc appended to the original parameter file name.

This file can only be read by Data Quality or by Informatica Global Customer Support. You can edit the parameter file in a secure environment and place the encrypted version in the production environment.
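For example, a sketch of encrypting a parameter file; the file name is an assumption, and the exact invocation may differ in your installation:

Athanor-Encode C:\Plans\prod_params.txt

This would produce C:\Plans\prod_params.txt.enc, which you can then reference with the -c option in the production environment.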

Passwords

You can apply the parameter file in encrypted or plain text mode. In plain text mode, when you edit the password tag, the parameter will be applied each time the plan is run.

When you want to replace encrypted passwords at execution time, you must edit the XML plan and replace the encrypted password with a placeholder. For example, the following line:

<Password EncryptionLevel='1'>W3uC+PY/kzcAUw==</Password>

should be replaced with a non-encrypted placeholder that can be easily communicated and defined in production parameter files, for example:

<Password>PasswordHolder</Password>

In a parameter file, the password can now be substituted using the following mapping:

PasswordHolder=user.name


Shared Database Details

A plan might be designed against two databases that share common connection details and then run in production against two different databases. In such a case, Data Quality cannot distinguish between the two. You must edit the original plan so that it refers to the production databases, or add placeholders for the production databases before moving plans to a different domain. Alternatively, as a best practice, consider using distinct database details and accounts for each database when a plan is in design.
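For example, a hedged sketch of such a parameter file, using hypothetical placeholder names that you would add to the plan in place of the shared connection details:

OrdersConnectionPlaceholder=prod_orders_db
CustomersConnectionPlaceholder=prod_customers_db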

The -i Option

Use the -i option to check system performance and to establish why a plan is behaving in a certain way.

For example, consider a plan that reads a CSV source, changes two fields within the dataset to uppercase, and writes the data to a CSV target. Its input fields are as follows:

CUSTOMER_KEY, FIRST_NAME, LAST_NAME, ADDRESS_LINE_1, ADDRESS_LINE_2... ADDRESS_LINE_6

Running the plan and specifying -i x at the command line, where x is a positive integer, produces the output shown below whenever x records (plus 1 for the initial record) are processed:

Time in long seconds 1063104892
Local time Tue Sep 09 11:54:52 2003
[0] DataSource Progress = 0
[1] DataSource Num Records = 9975
[2] DataSource Num Comparisons = 4
[3] Similarity Record ID = 4
[4] CUSTOMER_KEY = 12321
[5] FIRST_NAME = Edward
[6] LAST_NAME = Oconnell
[7] ADDRESS_LINE_1 = Clorane
[8] ADDRESS_LINE_2 = Kiloimo
[9] ADDRESS_LINE_3 = Co Limerick
[10] ADDRESS_LINE_4 =
[11] ADDRESS_LINE_5 =
[12] ADDRESS_LINE_6 =
[13] To Upper 2(FIRST_NAME) = EDWARD
[14] To Upper 2(LAST_NAME) = OCONNELL

Each row corresponds to a memory location in the engine. The time in long seconds is useful for checking the performance of the engine. For most tasks, every set of x records should be processed in the same amount of time. If this is not the case, a performance bottleneck exists.
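For example, the following command (the plan path is an assumption) prints this progress information every 10,000 records:

athanor-rt -f /home/dq/plans/uppercase_plan.xml -i 10000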

Performance

The time it takes for a plan to execute depends on several factors. Some are related to Data Quality, and some are related to the environment in which the plan is executed.

In general, plan execution time includes time for the following:

1. Reading data from a data source.

2. Executing the business rules defined in the plan.

3. Writing data to a data target or report.

Reading and writing data depends on the speeds at which the Data Quality engine can read from and write to a data source or data target. With a slow-performing database source, the engine may spend more time waiting for data than processing it. Similarly, a slow-performing file target means that Data Quality may spend more time waiting for data to be written.


As a rule, database sources should be as close as possible to the Data Quality instance that executes the plan. For example, a plan using a database source runs much faster if the database is located on the same local network than if the database is located at a remote site.

Similarly, when the Data Quality process is constrained by system resources such as CPU or available memory, it spends more time processing. When a plan consumes a large percentage of the CPU, it will probably execute faster on a higher-performance CPU.

Reading and Writing

Tuning database or file system access to reduce the time spent accessing data sources and targets allows Data Quality to concentrate on processing records.

Processing

Increasing the CPU speed means that records can be processed more quickly.

The MySQL database underlying the Data Quality repository or staging area can also be tuned.

Maintenance and Housekeeping

Note the following failure scenarios:

♦ Plan failure. athanor-rt reports an error code of 1 if a plan fails to execute. The calling process can opt to fail or run again depending on the error code returned.

♦ Product failure. In the unlikely event that Data Quality crashes, you can facilitate crash diagnosis by performing a stack traceback and sending the results to Informatica Global Customer Support. For information about this operation, contact your systems administrator.

Multi-Threading and Multi-Processing

Data Quality applications are multi-threaded and therefore suited to multiple-CPU environments. Multi-threading allows an application to make use of multiple CPUs to improve throughput. On a single CPU, multi-threading also allows an application to make use of the CPU while a slow input or output operation takes place. However, multi-threading is not the only way to improve throughput. Multi-processing can split a problem between multiple computing devices or multiple CPUs on a single device.

With multi-processing, you can decide how the best possible throughput can be achieved by dividing a problem into several different “jobs.” Each job then executes and solves a part of the overall problem. There are two major differences between this approach and multi-threading:

♦ Jobs can run on multiple devices and can provide greater computational power than any single device can offer.

♦ You might be able to accelerate processing beyond speeds possible with a generic threading approach.

Multi-processing and multi-threading provide complementary approaches to increasing throughput.

With Data Quality installed on a single machine, you can execute multiple processes concurrently, each process applying the same Data Quality plan to different parts of an overall dataset, and thus achieve greater throughput efficiency.

For example, when matching large datasets, you might have six processes running on a four-CPU system, with each process tackling a different cluster of records. Each Data Quality process executes against only those clusters assigned to it.


The processing requirements of each cluster increase quadratically with the number of records in the cluster. Typically, one process is assigned only a few very large clusters while other processes are assigned a large number of small clusters. Each process performs approximately the same amount of work and each contributes to the overall operation.

A similar approach applies to the standardization of records. In this case, each Data Quality process executes on a subset of the data. As the time taken to process the overall dataset increases linearly with the number of records, it is a simple task to distribute the processing load across multiple Data Quality processes executing on one or more CPUs within one or more computing hosts.
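For example, the following is a minimal sketch of launching several concurrent processes on a UNIX host; the plan and parameter file names are assumptions, and each parameter file would point the plan at a different subset of the data:

athanor-rt -f /home/dq/plans/standardize.xml -c /home/dq/params/subset1.txt &
athanor-rt -f /home/dq/plans/standardize.xml -c /home/dq/params/subset2.txt &
athanor-rt -f /home/dq/plans/standardize.xml -c /home/dq/params/subset3.txt &
wait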

Security

Note the following security-related details:

♦ To avoid storing potentially sensitive passwords in plain text, Data Quality can encrypt plan and parameter file passwords.

♦ The Data Quality installer on UNIX prevents the product from being installed by any user with root privileges. On UNIX, Data Quality requires no special user privileges, other than write access to /tmp. Consequently, a system administrator can restrict and control access to the product in the same manner as access to any other user-level application.

♦ The Data Quality staging area is configured by default to permit access to the underlying MySQL database to local users only. Extending access privileges requires the explicit granting of access to other users.


A P P E N D I X A

Rule Based Analyzer Rule Statements

This appendix includes the following topics:

♦ Overview, 127

♦ Functional Operators, 128

Overview

When working with the Rule Based Analyzer, note the following points:

1. The rules are defined in a rule block.

2. Rule blocks contain a sequence of IF statements and assignment statements.

3. IF statements have the following form:

// Primary condition
IF <boolean expression> THEN
  <Rule Block>
// Optional arbitrary number of elseifs
ELSEIF <boolean expression> THEN
  <Rule Block>
// Optional else
ELSE
  <Rule Block>
ENDIF

The definition of a rule block allows for IF statements to be nested. Each IF statement must be closed by the ENDIF keyword.

Examples of IF statements:

IF input1 = ""   // Testing if input 1 is empty
THEN output1:= "Empty Input"
ENDIF

IF (input1 < 100) and (input2 < 100)
THEN output1:= 0
ELSEIF input1 > 100
THEN output1:= input1
ELSEIF input2 > 100
THEN output1:= input2
ELSE output1:= 100
ENDIF


4. You can add single-line text comments, starting with two forward slashes (//), to logical expressions.

5. Assignment statements have the following form:

OUTPUTX:= <expression>
(where X ranges from 1 to the maximum output number)

For example:

output1:= input1 * 123.5

6. Every expression has a type that is a Boolean, an integer, a floating point value, or a string. Expressions can be simple constant values, inputs, outputs, or operations. For example:

123      // Integer
"123"    // String
123.5    // Float
Input1   // Input 1 type and value
Output3  // Output 3 type and value
100 + 2  // Integer addition operation

7. Operations are composed of operators and their arguments.

Table A-1 lists operators you can use when building a rule:

Functional Operators

The Rule Based Analyzer accepts several functional operators in rules. You can apply them in the Rule wizard and in Expert Mode. The operators ISNUMBER and ISDATE appear as options in IF statements only.

Use the following rules and guidelines when you use functional operators:

♦ Operators that expect float arguments attempt to convert string arguments to floating point numbers where possible.

♦ The string concatenate operator [&] converts arguments to strings.

♦ Operators display an error message if an automatic conversion between types fails.

♦ The Rule Based Analyzer accepts all Gregorian dates.

Table A-1. Operators

♦ Prefix operators that take Boolean arguments: NOT
♦ Infix operators that take Boolean arguments: AND, OR, XOR (exclusive or)
♦ Prefix operators that take numerical arguments (integer or float): - (negative)
♦ Infix operators that take numerical arguments (integer or float): = (equal), <> (not equal), < (less than), <= (less than or equal to), > (greater than), >= (greater than or equal to), - (minus), + (plus), * (multiply), / (divide), % (modulo), ^ (power)
♦ Operators that take string arguments: = (equal), <> (not equal), & (concatenate)


♦ Date functions do not accept leading or trailing spaces.

Table A-2 describes the functional operators you can use when building a rule:

Table A-2. Functional Operators

Functional Operator Returns Description

ISNUMBER (expression e) Boolean Returns true if the expression can be evaluated as a number.

ISDATE (expression e) Boolean Returns true if the expression can be evaluated as a date. Dates must be in the DD/MM/YYYY format.

TOINT (expression e) Integer Converts an expression to an integer.

TOFLOAT (expression e) Float Converts an expression to a floating point value.

TOSTRING (expression e) String Converts an expression to a string.

STRLEN (string s) Integer Returns the number of characters in s.

LEFTSTR (string s, integer n) String Returns the leftmost n characters of the input string, s. If n is greater than the length of s then s is returned.

RIGHTSTR (string s, integer n) String Returns the rightmost n characters of the input string s. If n is greater than the length of s, then s is returned.

SUBSTR (string s, integer startPos, integer size) String Returns a substring of s, starting at the position specified by startPos and with length specified by size.

DATECOMPARE (string s1, string s2, dateformat) Integer Returns the number of days between s1 and s2. You must define the date format, such as DD/MM/YYYY. For example, DateCompare (“2003/03/04”, “2002/03/04”, “YYYY/MM/DD”) returns the number of days between the 4th March 2003 and 4th March 2002.

DATECONVERT (string s, dateformat1, dateformat2) String Converts the date from one specified format to another. You must define the date formats, such as DD/MM/YYYY. See also Example, page 68.

MONTHCOMPARE (string s1, string s2, dateformat) Integer Returns the number of months between s1 and s2. You must define the date format, such as DD/MM/YYYY. For example, MonthCompare (“2003/03/04”, “2002/03/04”, “YYYY/MM/DD”) returns the number of months between the 4th March 2003 and 4th March 2002.

TIMECOMPARE (string s1, string s2) Integer Returns the number of seconds between s1 and s2. Both s1 and s2 must be in hh:mm:ss format. For example, TimeCompare(“13:35:27”, “13:34:28”) returns the integer value 59.

CHAR (integer i) String Returns a string containing the character with the specified ASCII code value.

CODE (string s) Integer Returns the ASCII code value for the first character of the specified string.

MAX (integer i1, integer i2) Integer Returns the maximum value of the two arguments.

MAX (float f1, float f2) Float Returns the maximum value of the two arguments.

MIN (integer i1, integer i2) Integer Returns the minimum value of the two arguments.

MIN (float f1, float f2) Float Returns the minimum value of the two arguments.

ABS (integer i1) Integer Returns the absolute value of the argument.

ABS (float f1) Float Returns the absolute value of the argument.

CURDATE (“DD/MM/YYYY”) String Returns the current date in DD/MM/YYYY format. Can also delimit date by [-], such as DD-MM-YYYY.

CURTIME () String Returns the current time in the hh:mm:ss format.

LTRIM (string s) String Returns the string created by trimming any white spaces from the start of string s.


RTRIM (string s) String Returns the string created by trimming any blank spaces from the end of string s.

TRIM (string s) String Returns the string that is created by trimming any white spaces from the start and end of string s.

CONTAINS (string s2, string s1) Integer Searches for string s2 in string s1 and returns the position of the first character of s2 in s1. Case-sensitive. For more information, see “Example: CONTAINS Function” on page 68.
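For example, the following is a hedged sketch of a rule block that combines several of these operators; the input and output names are assumptions:

// Trim the input, then standardize valid dates and flag invalid ones
IF ISDATE(TRIM(input1))
THEN output1:= DATECONVERT(TRIM(input1), "DD/MM/YYYY", "YYYY/MM/DD")
ELSE output1:= "Invalid Date"
ENDIF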


A P P E N D I X B

Global AV: Output Field Descriptions

This appendix includes the following topics:

♦ Global AV Output Field Map, 131

Global AV Output Field Map

This appendix contains information about the codes and values returned by the Global AV component. The table below lists the Global AV output field names and maps these names to the outputs that can be created by the underlying validation engines.

Table B-1. Global AV Outputs and Corresponding Validation Engine Outputs

Global AV Output Name

Global AV Selection Status

Corresponding Address Validator Output

Corresponding International AV Output

Corresponding North America AV Output

Match Status Required Match Type Match Status Status Code

Match Code Required Match Code Match Code (previously Match Score)

Error Code and Error String

Address1 Default On Address1

Address2 Default On Address2

Organization Default On Organization Organization,

Building Default On Building Name Building,

Sub Building Default On Sub-Building Name

Sub Building,

House Number Default On Building Number House Number, Parsed Address Range

Street Name Default On Street Name, Parsed Street Name

City Abbreviation Default Off City Abbreviation

Locality/City Default On Post Town, Locality/City City


Additional Locality Default Off Additional Locality

Dependent Locality Default Off Dependent Locality

Dependent Locality

Dependant Thoroughfare

Default Off Dependant Thoroughfare

Thoroughfare Default Off Thoroughfare,

Double Dependant Locality

Default Off Double Dependant Locality

Province/State Default On County Name Province/State State

Postal Code/Zipcode Default On Postcode Postal Code/Zipcode

Zip

Zip Plus 4 Default On Zip Plus 4

PO Box Default Off Post-Office Box PO Box

Country Name Default Off Country Name Country Country Name

Country Code ISO 3 Digit

Default Off Three Character Country Code

Country Code

Carrier Route Default Off Carrier Route

Delivery Point Code Default Off Delivery Point Code

Delivery Point Check Digit

Default Off Delivery Point Check Digit

County FIPS Default Off County FIPS

Address Type Code Default Off Address Type Code

Address Type String Default Off Address Type String

Urbanization Default Off Urbanization

Congressional District Default Off Congressional District

Private Mailbox Default Off Private Mailbox

Time Zone Code Default Off Time Zone Code

Time Zone Default Off Time Zone

MSA Default Off MSA

PMSA Default Off PMSA

Suite Status Code Default Off Suite Status Code

EWS Flag Default Off EWS Flag

Zip Type Default Off Zip Type

Parsed Pre-Direction Default Off Parsed Pre-Direction

Parsed Suffix Default Off Parsed Suffix

Parsed Post-Direction Default Off Parsed Post-Direction


Parsed Suite Name Default Off Parsed Suite Name

Parsed Suite Range Default Off Parsed Suite Range

Parsed Private Mailbox Name

Default Off Parsed Private Mailbox Name

Parsed Private Mailbox Number

Default Off Parsed Private Mailbox Number

LACS Default Off LACS

LACS Link Indicator Default Off LACS Link Indicator

LACS Link Return Code Default Off LACS Link Return Code

Element Match Status Default Off Element Match Status

Element Result Status Default Off Element Result Status

CMRA Required DPV CMRA

DPV Footnotes Required DPV DPV Footnotes

Delivery Point Suffix (DPS)

Default Off Postally Not Required (PNR) Locality. Data Quality does not populate this field.

GEO_StatusCode Required Geocoder Option

GEO_StatusCode

GEO_ErrorCode Required Geocoder Option

GEO_ErrorCode

GEO_CensusBlock Required Geocoder Option

GEO_CensusBlock

GEO_CensusTrack Required Geocoder Option

GEO_CensusTrack

GEO_CountyFips Required Geocoder Option

GEO_CountyFips

GEO_CountyName Required Geocoder Option

GEO_CountyName

GEO_Latitude Required Geocoder Option

Latitude for UK and AUS

GEO_Latitude

GEO_Longitude Required Geocoder Option

Longitude for UK and AUS

GEO_Longitude

Formatted_Address_n Default Off Outputs are not engine-specific



A P P E N D I X C

Search/Replace Operations and Noise Removal

This appendix includes the following topic:

♦ Noise Removal, 135

Noise Removal

This appendix contains information about noise removal, that is, removing extraneous characters from data strings. Noise removal can make data records more legible and facilitate matching operations.

When you run an analysis plan, identify any symbols, spaces, and unexpected characters in the source data fields so you can remove or replace them with a Search Replace component. This is known as noise removal.

Table C-1 lists some typical removal and replacement selections in the Search Replace component:

Table C-1. Standard Noise Removal and Replacement Operations

Data Element Action

. Replace with a single space.

, Replace with a single space.

- Replace with a single space.

/ Replace with a single space.

\ Replace with a single space.

; Replace with a single space.

Double Spaces Replace with a single space.

Blank space Remove at start.

ATTN: Remove at start.

C/O Remove at start.

C\O Remove at start.

Blank space Remove at end.


“ Remove.

“ Remove.

' Remove.

' Remove.

( Remove.

! Remove.

` Remove.

# Remove.

: Remove.

{ Remove.

} Remove.

[ Remove.

] Remove.


A P P E N D I X D

Matching Formulas

This appendix includes the following topic:

♦ Matching Formulas, 137

Matching Formulas

Given an input set of N records, the following number of comparisons is required without grouping:

N(N-1)/2

If the records are grouped into m groups (G1…Gm being the number of records in groups 1…m) and comparisons only occur within records in the same group, the following number of comparisons is required:

G1(G1-1)/2 + G2(G2-1)/2 + … + Gm(Gm-1)/2

In the worst case, this means that grouping reduces the number of comparisons by a factor of approximately N/Gmax, where Gmax is the size of the biggest group.

In practice, a greater reduction is expected since it is unlikely that every group is the same size.
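For example, 10,000 records require 10,000 × 9,999 / 2 = 49,995,000 comparisons without grouping. If the same records are grouped into 100 groups of 100 records each, only 100 × (100 × 99 / 2) = 495,000 comparisons are required, a reduction by a factor of roughly 100 (N/Gmax = 10,000/100).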


A P P E N D I X E

SQL Scripts

This appendix includes the following topics:

♦ Overview, 139

♦ Creating a MySQL Table, 139

♦ Use of MAX Function, 140

♦ Nested Groups and Counts, 140

Overview

Data Quality is installed with a MySQL database system to which data files can be migrated and in which queries can be developed. Although SQL scripts are not required in the majority of cases when designing and running plans, there are cases in which SQL scripts can provide efficient solutions to particular data problems.

The Database Source and Database Target component configuration dialog boxes allow you to develop SQL scripts. The sections below describe some useful SQL scripts and the particular issues that they address.

Creating a MySQL Table

Use the following steps to create a MySQL table:

1. Using a Database Target component, create the database table to which you want to migrate a data file. In the Before pane, type the following:

drop table if exists table_name;   # delete table if it already exists
create table table_name            # create table with following fields
(
  TableID int primary key,
  FieldA varchar(20),              # use descriptive names for fields
  FieldB varchar(20),
  FieldC varchar(20),
  FieldD float,
  FieldE int
);

2. In the During pane, insert the data from the source file into the new table.

Select Expert Mode to see the SQL scripting equivalent of the tab settings.


3. In the After pane, you should create an index, especially when dealing with large datasets. Use the following script:

Create index index_name on table_name(FieldE);

Use of MAX Function

The MAX function works best on numeric data.

You can use the following steps to use the MAX function to identify the most recent transaction for each customer:

1. Convert each date to YYYYMMDD format and store it as a numeric type data field.

With this step in place, you can add the following SQL scripts to the Database Source configuration dialog box to identify the most recent transaction for each customer.

2. Type the following in the Before tab:

Drop table if exists tmp;   # create a temporary table
CREATE table tmp (cust_ref varchar(20), numdate bigint);
INSERT INTO tmp
  SELECT transtable.cust_ref, MAX(transtable.numdate)
  FROM transtable
  GROUP BY transtable.cust_ref;
CREATE index tmp_trans_index on tmp(cust_ref, numdate);

3. Type the following in the During tab:

SELECT transtable.cust_ref, transtable.numdate, <any other fields>
FROM transtable, tmp
WHERE transtable.cust_ref = tmp.cust_ref
AND transtable.numdate = tmp.numdate

4. Type the following in the After tab:

Drop table tmp;

Nested Groups and Counts

You might use the following steps to count the numbers of customers in your dataset by town and country:

1. In the During pane, select the data fields required for the report.

For this example, assume each unique record represents a single customer and that each record contains the following fields of information: Country and Town.

2. Check the Expert Mode option.

3. Edit the resulting script so that it reads as follows:

SELECT table_name.country, COUNT(table_name.country), table_name.town, COUNT(table_name.town)
FROM table_name
GROUP BY table_name.country, table_name.town


A P P E N D I X F

ODBC Data Source Administrator

This appendix includes the following topic:

♦ Using the ODBC Data Source Administrator, 141

Using the ODBC Data Source Administrator

Use the Microsoft ODBC Data Source Administrator when connecting to databases with ODBC. When the Database Source is configured to connect using ODBC, it requires a Data Source Name.

Note: The following procedure is written for Windows XP users. Details may differ slightly for other versions of Windows.

To create a Data Source Name that is recognized by ODBC:

1. Open the Administrative Tools window.

2. Double-click Data Sources (ODBC).

The ODBC Data Source Administrator dialog box opens.

3. In this dialog box, select the System DSN tab and click Add.

The Create New Data Source dialog box prompts you to select the driver for which you want to set up a data source.

4. Select the appropriate driver for the database that you want to connect to.

You might need to install a driver if you cannot locate one in the list.

When you have successfully identified the driver, a setup dialog box opens for the database driver you have selected.

5. Type a name for the data source in the Data Source Name field.

6. Click Select and browse to select the appropriate database for the new data source.

7. Click OK to exit the dialog boxes and return to Data Quality Workbench.

8. Under the Connect to Database tab of the Database Source configuration dialog box, type the newly-created Data Source Name in the relevant field and click Connect.


You should now see the data tables of the database that you associated with the data source name. You can drill down into the tables and select fields as required.

Note the following:

♦ You can apply Data Quality components directly to data retrieved by ODBC and write the results to local files. Alternatively, you can migrate the data retrieved by ODBC into a local Data Quality MySQL data table. This approach may prove useful if you retrieve a large dataset across a network that is prone to heavy traffic.

♦ When you connect to a Microsoft Access database, you might find that no tables or data fields are available for viewing after you establish an ODBC connection. This can occur if the Access table names or field names include spaces. Most database vendors do not accept spaces in table or field names; this naming convention is an accepted industry standard. To view the data in this case, remove all spaces from the Microsoft Access table names and field names.


A P P E N D I X G

Character Encodings and Unicode

This appendix includes the following topic:

♦ Character Encodings and Unicode, 143

Character Encodings and Unicode

Informatica Data Quality is Unicode-compliant. Several components allow you to specify the character encodings to be applied to the data on which they operate. The character encoding options are generally available in the Encodings menu on the configuration dialog box for the component.

Entries on this menu include the default encoding for the current system based on the current locale, the standard UTF encodings (UTF-8 and UTF-16 little endian and big endian), and an option to choose other encodings not listed in the menu by default.

Encodings that you select but that are not among the default menu entries are added to a history of previously selected encodings. The history is limited to three entries.

Choosing a Non-Default Encoding

Click Choose on the menu to open a new dialog box that lists the available encodings, as defined in the localeEncoding.csv file.

This dialog box lists the following:

♦ Base languages

♦ Encodings available for versions of the base language

♦ Countries associated with each version

♦ ISO number of each version

The list can be expanded and collapsed to aid list navigation. Highlight a language or dialect and click OK to select it for any data on which the component will operate.

Note that you select an encoding of the language rather than the base language, and that in some cases the versions are distinguished by operating system rather than region.


A P P E N D I X H

Data Quality Workbench Toolbar

This appendix includes the following topic:

♦ Data Quality Workbench Toolbar, 145

Data Quality Workbench Toolbar

Figure H-1 lists the names of Data Quality Workbench toolbar icons:

Figure H-1. Data Quality Workbench Toolbar

The toolbar includes the following icons: New Project, New Plan, Save Plan, Run Plan, Refresh, Undo, Redo, Cut Component, Copy Component, Paste Component, Configure Component, Delete Component, Show Source Viewer, Show Project Manager, Show Plan Notes, Import Workbench Plan, Export Workbench Plan, Import Realtime Plan, Export Realtime Plan, Import Runtime Plan, Export Runtime Plan, Open Report Viewer, Open Dictionary Manager, View Plan Layers, Tile Windows, Cascade Windows, and Open Help Topics.


A P P E N D I X I

Output Options in the CSV Match Target

This appendix includes the following topics:

♦ Overview, 147

♦ Configuring the Outputs for Identified Matches, 148

Overview

Significant changes have been made to the CSV Match Target component in this version of Data Quality. The CSV Match Target component:

♦ Can generate a CSV file in two formats.

♦ Provides improved HTML reporting.

♦ Employs a new algorithm to generate match clusters.

New Output Formats

The CSV Match Target provides two output formats:

♦ Identified Matches. Provides results similar to the HTML report output. In this format, the target reconstructs the original source file and appends a cluster ID and the number of records in the cluster to each record. As a result, the number of rows in the target output file should equal the number of input rows. Any record for which no match was found has its own unique cluster ID and a cluster size of 1.

♦ Matched Pairs. Delivers each matching pair that meets or exceeds the match threshold set in the target. (This corresponds to the target output in version 3.0 of the product.)

HTML Report

The HTML Report format displays the unique records in each cluster, identifying the best match for each record and the score against that match.


Usage

The CSV Match Target calculates clusters only when configured to do so. Select the Identified Matches or HTML Report option to activate cluster generation.

You can also disable HTML report generation.

Clustering

The clustering algorithm assigns all records identified as matches to a cluster. The algorithm runs while the plan runs and stores temporary data in memory.

In larger datasets, a high number of matches can consume a large amount of memory. Grouping the data keeps group sizes within recommended limits and avoids unnecessary matching operations. Informatica recommends a maximum of 5,000 records per group.

Sources

The CSV Match Target can calculate record clusters when used with the CSV Match Source or Group Source. When you use the CSV Match Target with other sources and select the Identified Matches option, the plan does not run. If you select HTML Report, the plan runs, but the HTML page indicates that the report cannot be created.

Configuring the Outputs for Identified Matches

When you select the Identified Matches output format, you must review the order of the output columns in the Outputs pane.

The columns in the Outputs pane must be organized by data source, with an equal number of columns for records from each data source. The match score column must appear after the record columns. The logic is as follows:

♦ Data reaches the CSV Match Target as two input records side by side. For example, records with Name and Address fields reach the target in the following format, followed by the match score:

Name_1,Address_1,Name_2,Address_2

♦ When you select the Identified Matches format, the Target reconstructs the original input records. The previous example would be reconstructed as follows:

Name_1,Address_1
Name_2,Address_2

♦ You must order the output columns in the Outputs pane so that the columns from the first record are listed in order, followed by the columns from the second record, followed by the columns for the match scores. The Outputs pane for the previous example should look like this:

Name_1
Address_1
Name_2
Address_2
MatchScore

♦ Figure 3-1 on page 32 illustrates a well-ordered Outputs pane for the Identified Matches option.

Use the Up and Down arrows to order columns.


A P P E N D I X J

Informatica Data Quality Naming Conventions

This appendix includes the following topics:

♦ Overview, 149

Overview

This appendix describes a recommended naming system for Data Quality project elements. You and your team should agree on a clear and consistent set of naming conventions for the elements you create in Workbench. Your exact approach to naming conventions will depend on your organization’s needs.

The elements to consider are:

♦ Projects. Create a project under the local repository (My Repository) in Workbench Project Manager. You cannot rename a Data Quality repository.

♦ Folders. Create a folder under a project in Workbench Project Manager. Folders can be nested in projects.

♦ Plans. Create a plan at folder or project level in Workbench Project Manager.

♦ Configurable components. Select a component from the Component Palette and add it to an open plan.

♦ Component instances. Open a component onscreen to view or edit an instance. A component comprises one or more instances.

♦ Component outputs. Open a component onscreen to view or edit its outputs. A component creates one or more output columns based on the rules applied to its inputs.

♦ Dictionaries. Open Workbench Dictionary Manager or the local file system to view dictionary (.DIC) files.

No element can share a name with another element at the same node in the Project Manager. For example, you cannot define two folders named MyFolder in the same project.

You can copy an element at its current location. In such cases, Workbench prefixes its name with “Copy of.” For example, you can make a copy of MyFolder and, by default, create a new folder named Copy of MyFolder in the same project. If the new element’s name is longer than permitted, Workbench truncates it.


Naming Projects

Workbench creates a project with the default name “New Project”.

Project naming should be clear and consistent within a repository. Follow these guidelines:

♦ Limit project names to 22 characters. The repository imposes a limit of 30 characters. Limiting project names to 22 characters allows Workbench to prefix “Copy of ” to a copied project without truncating characters.

♦ Include enough descriptive information in the project name for an unfamiliar user to grasp the general purpose of the plans in the project.

♦ If plans within the project will operate on a single data source, incorporate the data source name in the project name.

♦ Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions. They allow the PowerCenter repository to import the project without changing its name.

♦ If you use company codes or abbreviations in the project name, ensure they are consistent and well documented.

Naming Folders

Workbench creates four folders by default beneath a new project. The folders are named Consolidation, Matching, Profiling, and Standardization and are listed alphabetically. These names relate to four common types of data quality plan. You can rename, delete, and create folders to suit your business and project objectives.

Naming guidelines for folders:

♦ Limit folder names to 42 characters. The repository imposes a limit of 50 characters. Limiting folder names to 42 characters allows Workbench to prefix “Copy of ” to a copied folder without truncating characters.

♦ Include enough descriptive information in the folder name for an unfamiliar user to grasp the purpose of the plans in the folder.

♦ Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions. They allow the PowerCenter repository to import the folder without changing its name.

♦ If you use company codes or abbreviations in the folder name, ensure they are consistent and well documented.

Naming Plans

When you create a new plan, Workbench prompts you to select one of four generic plan types as the plan name: Analysis, Consolidation, Matching, or Standardization. These names relate to the default folder names. Workbench provides them as an aid to project design.

These default names in no way determine or constrain plan functionality. You can add a new plan to any folder regardless of the folder name.

Note: Take particular care when naming plans, especially if you will export the plan to a PowerCenter repository. Be as clear and descriptive as possible. Data quality operations are defined and implemented at plan level. Although you can see a plan’s folder and project parentage in Workbench, these elements may not be evident in the PowerCenter repository.

Naming guidelines for plans:

♦ Include the plan’s purpose or primary functionality in the plan name.

♦ If you will use the plan in a PowerCenter mapping or mapplet, prefix the plan name with dq_. This conforms to PowerCenter naming conventions. PowerCenter applies a lowercase prefix to all elements in its repository. For data quality plans, this is an optional but recommended step.

♦ Limit plan names to 42 characters. The repository imposes a limit of 50 characters. Limiting plan names to 42 characters allows Workbench to prefix “Copy of ” to a copied plan without truncating characters.


♦ Include enough descriptive information in the plan name for an unfamiliar user to grasp the purpose of the plan.

♦ Use letters, numbers, and underscores in your name. Do not use spaces. These are PowerCenter conventions. They allow the PowerCenter repository to import the plan without changing its name.

♦ If you use company codes or abbreviations in the plan name, ensure they are consistent and well documented.

Naming Components

When you add a component to a plan, its default name appears underneath its icon in the plan workspace. Edit this name to provide a description of the component’s role in the plan. Prefix the new name with an abbreviation of the component’s original name to make the plan more legible onscreen.

If the component type abbreviation itself is not sufficient to identify what the component does, include an identifier for the function of the component in its name.

Table J-1 lists prefixes you can use when renaming your components:

Table J-1. Component Names and Prefixes

Component                     Prefix        Component                     Prefix
Address Validator             av_           Soundex                       sx_
Aggregation                   ag_           Splitter                      spL_
Bigram                        bg_           To Upper                      tu_
Character Labeller            cl_           Token Labeller                tl_
Context Parser                cp_           Token Parser                  tp_
Count                         co_           Weight Based Analyzer         wba_
Edit Distance                 ed_           Word Manager                  wm_
Global AV                     av_           SOURCES/TARGETS
Hamming Distance              hd_           CSV Dual Source               csv_m_
International AV              iav_          CSV Match Source              csv_d_
Jaro Distance                 jd_           CSV Merge Target              csv_merge_
Merge                         MG_           CSV Source/Target             csv_
MinAvgMax                     mam_          DB Match Source               db_m_
Missing Values                mv_           DB Report Target              db_r_
Mixed Field Matcher           mfm_          DB Source/Target              db_
North America AV              nav_          Dual Group Source             dgs_
Nysiis                        nys_          Fixed Width Source/Target     fws_
Profile Standardizer          ps_           Group Source/Target           grp_
Range Counter                 rc_           Match Key Target              mks_
Rule Based Analyzer           rba_          Realtime Source/Target        rs_
Scripting                     sc_           Report Target                 rep_
Search Replace                sr_           SAP Source/Target             sap_

In addition, consider these naming guidelines for components:

♦ Keep component names short where possible. You may wish to reuse component names in field names, and your database may impose a limit on field length.

♦ Include the name of the input field or the field type.


♦ Use letters, numbers, and underscores in your name. Do not use spaces.

♦ If you use company codes or abbreviations in the component name, ensure they are consistent and well documented.

Naming Fields

Careful field naming is essential when designing data quality plans. The flexibility of Data Quality can lead to complex plans with many components.

Data Quality requires that every component output field name be unique within the plan. Output field names do not persist from component to component.

Data Quality does not have the data lineage feature of PowerCenter, so the field name is the clearest indicator of the source of a data element when a plan is examined by a third party.

Naming guidelines for fields:

♦ Prefix each output field name with an abbreviation of its component name. For a list of usable abbreviations, see Table J-1.

♦ Use upper and lower case consistently.

♦ Do not rename output fields in target components unless necessary, as there is no convenient way to determine the origin of a renamed output field.

♦ If you use company codes or abbreviations in the field name, ensure they are consistent and well documented.

Naming Dictionary Files

Dictionaries may be given any name suitable for the operating system on which they will be used.

Naming guidelines for dictionary files:

♦ Limit dictionary names to characters permitted by the operating system. If a dictionary is to be used on both Windows and UNIX, do not use spaces.

♦ If you modify a dictionary file from Informatica, rename or move it to a new folder before using it in a plan. In this way, you will not overwrite your modifications if you perform a Content update.

♦ If you use company codes or abbreviations in the dictionary name, ensure they are consistent and well documented.


I N D E X

A
Aggregation component: configuring 47

B
Bigram component: configuring 91

C
-c option: command line argument 122; shared database details 123
categories: creating dashboard 112; dashboard 112; deleting 113; moving rows 113
character encoding: configuring 143
Character Labeller component: configuring 53
characters: removing extraneous 135
clustering: CSV Match Source algorithm 148
command line arguments: -c option 122; encrypting parameter files 122; -i option 123; overview 122
Components
    Address Validation Components: Global AV 96
    Analysis Components: Character Labeller 53; Token Labeller 56
    Frequency Components: Aggregation 47; Count 43; MinAvgMax 49; Missing Values 51; Range Counter 50; Sum 46
    Key Field Generator Components: Normalization 81; Nysiis 83; Soundex 81
    Matching Components: Bigram 91; Edit Distance 88; Hamming Distance 90; Identity Match 86; Jaro Distance 89; Mixed Field Matcher 92; Similarity 88; Weight Based Analyzer 94
    Parsing Components: Context Parser 78; Parser 71; Profile Standardizer 76; Splitter 72; Token Parser 73
    Source Components: CSV 13; CSV Dual Match 19; CSV Identity Group 22; CSV Match 19; Database 14; Database Match 20; DB Identity Group 23; Dual Group 21; Fixed Width 16; Group 21; Realtime 16; SAP 17
    Target Components: CSV 27; CSV Match 31; CSV Merge 30; Database 36; Database Report 38; Fixed Width 28; Group 35; Identity Group 40; Match Key 33; Realtime 40; Report 29; SAP 38
    Transformation Components: Merge 64; Rule Based Analyzer 67; Scripting 69; Search Replace 61; To Upper 65; Word Manager 63


Context Parser component: configuring 78
Count component: configuring 43
CSV Dual Match Source component: configuring 19
CSV Identity Group Source component: configuring 22
CSV Match Source component: configuring 19
CSV Match Target component: configuring 31; Identified Matches option 31, 148; Matched Pairs option 31; output options 147; sources for calculating clusters 148
CSV Merge Target component: configuring 30
CSV Source component: configuring 13
CSV Target component: configuring 27

D
dashboard view: Report Viewer 111
dashboards: categories 112; creating categories 112; creating groups 117; modifying calculation parameters 111; setting Data Quality targets 111; tracking changes 116; tracking historical percentages 116; tracking historical trends 116
data: viewing plan 114
data elements: hiding 116
data matching: formulae 137
Data Quality staging area: default permissions 125
data sources: creating ODBC 141
database dictionaries: creating 106; description 103
Database Match Source component: configuring 20
Database Report Target component: configuring 38
Database Source component: configuring 14
Database Target component: configuring 36
databases: shared details 123
DB Identity Group Source component: configuring 23
deploying: runtime plans 119
deploying plans: using the command line 122
dictionaries: adding spellings 105; creating 106; overview 103; updating files 104
Dictionary Manager: overview 104
Dual Group Source component: configuring 21

E
Edit Distance component: configuring 88
encodings: configuring 143
encrypting: parameter files 122
encryption: for password protection 125
executing: plans 6

F
File Manager: description 2
Fixed Width Source component: configuring 16
Fixed Width Target component: configuring 28
functional operators: in rules 128

G
Global AV component: configuring 96
Group Source component: configuring 21
Group Target component: configuring 35
groups: creating 117; creating dashboards 117; managing 117; nested in scripts 140

H
Hamming Distance components: configuring 90
hiding: data elements 116
HTML: CSV Match Target component report format 147


I
-i option: command line argument 123
Identified Matches option: configuring output 148; CSV Match Target component 147
Identity Group Target component: configuring 40
Identity Match: populations 86
Identity Match component: configuring 86
International AV component: return codes and values 131
items: assigning 113

J
Jaro Distance component: configuring 89

L
line graphs: viewing 116

M
Match Key Target component: configuring 33
Matched Pairs option: CSV Match Target component 147
MAX function: in scripts 140
Merge component: configuring 64
MinAvgMax component: configuring 49
Missing Values component: configuring 51
Mixed Field Matcher component: configuring 92
multi-processing: overview 124
multi-threading: overview 124
MySQL tables: creating 139

N
nested groups: in scripts 140
noise: removal 135
Normalization component: configuring 81
Nysiis component: configuring 83

O
ODBC: creating data sources 141
ODBC Data Source Administrator: creating a DSN 141

P
parameter files: encrypting 122; passwords 122
Parser component: configuring 71
passwords: parameter files 122
percentages: tracking historical 116
performance: checking with command line argument 123; tuning 123
plans: executing 6; overview 2; performance tuning 123; version control 8
Profile Standardizer component: configuring 76
Project Manager: description 2

R
Range Counter component: configuring 50
Realtime Source component: configuring 16
Realtime Target component: configuring 40
removing: extra characters 135
Report Target component: configuring 29
Report Viewer: assigning weights to data items 113; creating dictionary files 106; creating groups 117; dashboard view 111; Data Quality targets on the dashboards 111; editing settings 115; exporting data 114; filtering data 114; importing report files 117; managing groups 117; parameters and settings 115; standard view 111; tracking changes 116; viewing plan data 114; working with groups 117
Rule Based Analyzer: rule statements 127


Rule Based Analyzer component: configuring 67
rules: functional operators 128
runtime execution: plans 119
runtime plans: deploying 119

S
SAP Source component: configuring 17
SAP Target component: configuring 38
scheduling: operations 121
Scripting component: configuring 69
Search Replace component: configuring 61
security: encrypting parameter files 122; tips 125
Similarity component: configuring 88
Soundex component: configuring 81
sources: calculating clusters with CSV Match Target 148
Splitter component: configuring 72
SQL scripts: samples 139
standard dictionaries: creating text 106; description 103
standard view: Report Viewer 111
Sum component: configuring 46
system performance: checking with command line argument 123

T
tables: creating MySQL 139
terms: adding new to dictionaries 105; adding spellings to dictionaries 105
third-party reference data: description 103
To Upper component: configuring 65
Token Labeller component: configuring 56
Token Parser component: configuring 73, 74; multiple dictionary operations 74
toolbar: icons 145
trends: tracking historical 116

U
Unicode: compliance 143
UNIX installation: root privileges 125

V
version control: plan publication 10; plans 8; tracking plans 9
views: Report Viewer 111

W
Weight Based Analyzer component: configuring 94
weights: assigning to data items 113
Word Manager component: configuring 63


NOTICES

This Informatica product (the “Software”) includes certain drivers (the “DataDirect Drivers”) from DataDirect Technologies, an operating company of Progress Software Corporation (“DataDirect”) which are subject to the following terms and conditions:

1. THE DATADIRECT DRIVERS ARE PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.

2. IN NO EVENT WILL DATADIRECT OR ITS THIRD PARTY SUPPLIERS BE LIABLE TO THE END-USER CUSTOMER FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR OTHER DAMAGES ARISING OUT OF THE USE OF THE ODBC DRIVERS, WHETHER OR NOT INFORMED OF THE POSSIBILITIES OF DAMAGES IN ADVANCE. THESE LIMITATIONS APPLY TO ALL CAUSES OF ACTION, INCLUDING, WITHOUT LIMITATION, BREACH OF CONTRACT, BREACH OF WARRANTY, NEGLIGENCE, STRICT LIABILITY, MISREPRESENTATION AND OTHER TORTS.
