Chp 12 - Data Cleansing Additional Functionality

Upload: kaven

Post on 05-Apr-2018


  • 8/2/2019 Chp 12 - Data Cleansing Additional Functionality

    1/91

Chapter 12: Data Cleansing Additional Functionality


    12.1 Additional Data Quality/Cleansing Techniques

Objectives

Discuss some additional data quality/cleansing techniques.

Data Quality/Cleansing

The following are additional techniques that can be used to further enhance data quality:

Identification analysis
Gender analysis
Parsing
Concatenating
Casing


Data Quality/Cleansing

Casing: Control whether a text string is represented as all capital letters or in mixed case.

Concatenating: Given two (or more) text strings, concatenate the values into one string.

Parsing: Given a text string, parse the string into its individual elements.

Gender Analysis: Based on a person's name, determine the gender.

Identification Analysis: Based on a given name string, determine whether the name represents an individual or an organization.

Identification Analysis

Identification analysis enables you to compare information from the QKB with undetermined fields in your data to determine whether each field contains the following:

For name information: an individual's name, an organization's name, or empty.

For address information: a street address, city/state/ZIP information, or empty.


Identification Analysis

For data fields containing name data, identification analysis returns INDIVIDUAL, ORGANIZATION, or UNKNOWN.

For data fields containing address data, identification analysis returns one of the following:

ACCT (account number type information)
ADDR (address line 1)
ADDR2 (address line 2)
ATTN (attention line)
BLANK (blank or null values)
CSZ (city/state/ZIP)
IND (an individual's name)
ORG (organization type information)
UNK (unknown)
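The classification idea can be sketched outside the product. The following Python sketch is illustrative only: the keyword list is invented, and a real identification analysis definition draws on QKB vocabularies and patterns rather than this simple heuristic.

```python
# Minimal sketch of name-field identification analysis.
# A real QKB definition uses rich vocabularies and grammar rules;
# this small keyword heuristic merely stands in for that logic.

ORG_KEYWORDS = {"inc", "inc.", "llc", "ltd", "corp", "corp.",
                "company", "co.", "associates", "university"}

def identify_name(value: str) -> str:
    """Classify a name field as INDIVIDUAL, ORGANIZATION, or UNKNOWN."""
    if not value or not value.strip():
        return "UNKNOWN"          # blank values cannot be classified
    words = value.lower().split()
    if any(w in ORG_KEYWORDS for w in words):
        return "ORGANIZATION"
    if 2 <= len(words) <= 5:      # e.g. "Igor Bela Bonski"
        return "INDIVIDUAL"
    return "UNKNOWN"

print(identify_name("Acme Tools Inc."))   # ORGANIZATION
print(identify_name("Igor Bela Bonski"))  # INDIVIDUAL
print(identify_name(""))                  # UNKNOWN
```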

Gender Analysis

Gender analysis determines whether a particular name is most likely feminine, masculine, or unknown.

The results are placed in a new field and have three possible values:

"M" for male
"F" for female
"U" for unknown
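A minimal sketch of the same idea, assuming a small invented lookup table of given names (the product derives this from QKB gender definitions instead):

```python
# Minimal sketch of gender analysis keyed on the given name.
# The lookup table below is an invented sample, not product data.

KNOWN_GENDERS = {"igor": "M", "linwood": "M", "maria": "F", "susan": "F"}

def gender_of(name: str) -> str:
    """Return "M", "F", or "U" for a full name, keyed on the first word."""
    words = name.strip().lower().split()
    if not words:
        return "U"
    return KNOWN_GENDERS.get(words[0], "U")

print(gender_of("Maria Gomez"))  # F
print(gender_of("Pat Smith"))    # U (given name not in the lookup)
```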


Parsing Data

Parsing is a simple but intelligent tool for separating a multi-part field value into multiple, single-part fields (tokens).

Each token is identified based on its individual contribution to the overall field.

Name: Mr. Linwood Leroy Bubar, III, M.D.

NamePrefix: Mr.
GivenName: Linwood
MiddleName: Leroy
FamilyName: Bubar
NameSuffix: III
NameAppendage: M.D.
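The token assignment above can be sketched with a toy parser. The prefix, suffix, and appendage sets below are invented samples; real parse definitions use full vocabularies and grammar rules.

```python
# Minimal sketch of parsing a name into tokens.
# The vocabularies here are tiny illustrative samples.

PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr.", "sr.", "ii", "iii", "iv"}
APPENDAGES = {"m.d.", "ph.d.", "esq."}

def parse_name(value: str) -> dict:
    words = [w for w in value.replace(",", " ").split() if w]
    tokens = {"NamePrefix": "", "GivenName": "", "MiddleName": "",
              "FamilyName": "", "NameSuffix": "", "NameAppendage": ""}
    if words and words[0].lower() in PREFIXES:
        tokens["NamePrefix"] = words.pop(0)
    while words and words[-1].lower() in APPENDAGES:
        tokens["NameAppendage"] = words.pop()
    while words and words[-1].lower() in SUFFIXES:
        tokens["NameSuffix"] = words.pop()
    if words:
        tokens["GivenName"] = words.pop(0)
    if words:
        tokens["FamilyName"] = words.pop()
    tokens["MiddleName"] = " ".join(words)  # whatever remains in the middle
    return tokens

print(parse_name("Mr. Linwood Leroy Bubar, III, M.D."))
```

Running it on the slide's example yields the same six tokens shown above.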

Concatenating Data

Concatenating is essentially the opposite of the parse step. Rather than separating a single field into multiple fields, concatenating combines one or more fields into a single field.

Given Name: Igor
Middle Name: Bela
Family Name: Bonski

Concatenated Name: Igor Bela Bonski
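A minimal sketch of concatenation, with empty parts skipped so that a missing token does not leave a doubled separator:

```python
# Minimal sketch of concatenation: joining field values back into
# one field, skipping any empty parts.

def concatenate(*parts: str, sep: str = " ") -> str:
    return sep.join(p for p in parts if p)

print(concatenate("Igor", "Bela", "Bonski"))  # Igor Bela Bonski
print(concatenate("Igor", "", "Bonski"))      # Igor Bonski (no middle name)
```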


Casing

Changing case enables you to make all alphabetical values in a field UPPERCASE, lowercase, or Proper Case.

Proper case treats a field value as a proper name; that is, the first letter of each word is capitalized, with the remaining characters in lowercase.

As with standardization, changing case can make field values more consistent.
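In Python terms, the three options map onto the standard string methods. Note that str.title() only approximates a true proper-case definition: it capitalizes after every non-letter, so "o'neill" becomes "O'Neill" but "mcdonald" becomes "Mcdonald"; product casing definitions handle such exceptions explicitly.

```python
# Sketch of the three casing options on a sample field value.

value = "bonski, IGOR bela"
print(value.upper())  # BONSKI, IGOR BELA
print(value.lower())  # bonski, igor bela
print(value.title())  # Bonski, Igor Bela
```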

Applying Techniques

These data quality/cleansing techniques can be applied using the following:

dfPower Studio's dfPower Architect
the SAS Data Quality Server functions as column-level transformations with SAS Data Integration Studio
the SAS Data Quality Server functions within a SAS programming environment

Because the SAS Data Quality Server functions are the same whether surfaced in SAS Data Integration Studio or in a SAS session, this chapter looks at these functions only in a SAS session.


    12.2 Data Quality/Cleansing Using dfPower Architect

Objectives

Describe the functionality of dfPower Architect.
Explore various job flow steps that are available to use.
Discuss the sequence of steps for building a job.

dfPower Architect: Introduction

dfPower Architect brings much of the functionality of the other dfPower Studio applications, as well as some unique functionality, into a single, intuitive user interface.

To use dfPower Architect, you specify operations by selecting job flow steps and then configuring those steps to meet your specific data needs. The steps you choose are displayed as job flow icons, which together form a visual job flow.


dfPower Architect

With dfPower Architect, you can perform the following tasks:

identify and connect to multiple data sources, whether those sources are local, over a network on a different platform, or at a remote location
choose and configure job flow nodes for processing your data
reconfigure existing job flow nodes as needed
view sample processed data at each job flow node
specify a variety of output options, including reports and new data sources
run a job flow with a single click

Accessing dfPower Architect

dfPower Architect is invoked from the toolbar of dfPower Studio by selecting Base → Architect.

...


    dfPower Architect Interface

(Screenshot: the Nodes List and the Job Flow Area.)

    ...

Job Flow Steps

dfPower Architect's available job flow steps are grouped into nine categories:


Job Flow Steps: Data Inputs

Job flow steps in the Data Inputs category:

...

Data Source: identifies existing data sets to process.
SQL Query: identifies existing data sets to process using SQL.
Text File Input: accesses data in a plain-text file.
Fixed Width File Input: accesses data in a text file where the input is separated into fixed-width columns.
External Data Provider: enables services for applications or processes that want to pass data into dfPower Architect one record at a time; can also be used to call other Architect job flows within a job when used in conjunction with the Embedded Job node.
Table Metadata: extracts meta information from a specific table within a database.
SAS Data Set: identifies existing SAS data sets to process on the Microsoft Windows platform.
SAS SQL Query: identifies existing data sets to process, as with the SAS Data Set node; this step, however, enables you to use SQL to select data.

Job Flow Steps: Data Outputs

Job flow steps in the Data Outputs category:

...

Data Target (Update): updates existing data rather than creating a new data source or replacing an existing source.
Data Target (Insert): outputs data in a variety of data formats to a new data source, leaving your existing data as is or overwriting your existing data.
Delete Record: eliminates records from a data source using the unique key of those records.
HTML Report: creates an HTML-formatted report from the results of your job flow.
Text File Output: creates a plain-text file with the results of your job flow.
Fixed Width File Output: outputs your data to well-defined fixed-width columns in your output file.
Frequency Distribution Chart: creates a chart that shows how selected values are distributed throughout your data.
Match Report: generates a match report that can then be displayed with the Match Report Viewer.
dfPower Merge File Output: writes clustered data to a dfPower Merge file for use in dfPower Merge.


Job Flow Steps: Utilities

Job flow steps in the Utilities category:

...

COM Plugin: adds COM (Component Object Model) plug-ins to your job flows.
Data Sorting: reorders your data set at any point in a job flow.
Expression: runs a Visual Basic-like language to process your data sets in ways that are not built into dfPower Studio.
Data Joining: joins two tables using a unique key.
Data Joining (Non-Key): is used when you have two tables, each with the same number of records, and you want to join them by location in the file rather than by a unique key.
Data Union: uses Data Joining to combine two data sets in an intelligent way so that the records of one, the other, or both data sets are used as the basis for the resulting data set.
Concatenate: performs essentially the opposite of the Parse node; rather than separating a single field into multiple fields, Concatenate combines one or more fields into a single field.
Embedded Job: embeds another dfPower Architect job in your current job flow.
Sequencer (Autonumber): creates a sequence of numbers given a starting number and a specified interval.
SQL Lookup: finds rows in a database table that have one or more fields matching those in the job flow.
SQL Execute: enables you to construct and execute any valid SQL statement (or series of statements); generally used to perform database-specific tasks before, after, or between Architect job flows; a stand-alone node (no parents or children).
Field Layout: enables you to rename and reorder field names as they pass out of this node.
Parameterized SQL Query: provides a way to write an SQL query that contains variable inputs, also known as parameters.


Job Flow Steps: Profiling

Job flow steps in the Profiling category:

...

Data Validation: analyzes the content of data by setting validation conditions.
Pattern Analysis: performs pattern analysis.
Basic Statistics: calculates basic statistics.
Frequency Distribution: creates a frequency distribution.
Basic Pattern Analysis: provides the ability to run pattern analysis in a manner very similar to the way it is run in dfPower Profile. (In contrast to advanced pattern analysis, the simplified version does not employ Blue Fusion pattern identification definitions.)

Job Flow Steps: Quality

Job flow steps in the Quality category:

...

Gender Analysis: performs gender analysis.
Gender Analysis (Parsed): performs gender analysis on parsed information.
Identification Analysis: performs identification analysis.
Parsing: parses a field.
Standardization: performs standardization of fields of data.
Standardization (Parsed): performs standardization of fields of parsed information.
Change Case: enables the case of field values to be set.
Locale Guessing: attempts to guess the appropriate locale based on field information.
Right Fielding: identifies the contents of fields and copies the data to fields with more descriptive names.


Job Flow Steps: Integration

Job flow steps in the Integration category:

...

Match Code: generates match codes.
Match Codes (Parsed): generates match codes on parsed information.
Clustering: generates clusters.
Cluster Update: enables new records to be integrated with existing clusters.
Surviving Record Identification: examines clustered data and determines a surviving record for each cluster.
Cluster Diff: compares sets of clustered records.
Exclusive Real Time Clustering (ERTC): facilitates the near real-time addition of new rows to previously clustered data.
Concurrent Real Time Clustering (CRTC): is similar to the ERTC node in its outcomes; the difference is that the ERTC node interacts directly with the cluster state file, while the CRTC node interacts with a server that interacts with the cluster state file.


Job Flow Steps: Enrichment

Job flow steps in the Enrichment category:

...

Address Verification (US/Canada): verifies, corrects, and enhances U.S. and Canadian addresses in your existing data.
Address Verification (QAS): performs address verification on addresses from outside the U.S. and Canada.
Address Verification (World): performs address verification on addresses from outside the U.S. and Canada. (This step is similar to Address Verification (QAS) but supports verification and correction for addresses from more countries.)
Geocoding: matches geographic information from the geocode reference database with ZIP codes in your data to determine latitude, longitude, census tract, FIPS (Federal Information Processing Standard), and block information.
County: matches information from the phone and geocode reference databases with FIPS codes in your data to calculate several values.
Phone: matches information from the phone reference database with telephone numbers in your data.
Area Code: matches information from the phone reference database with ZIP codes in your data to calculate several values, primarily area code, but also Overlay1, Overlay2, Overlay3, and Result.


Job Flow Steps: Enrichment (Distributed)

Job flow steps in the Enrichment (Distributed) category:

...

Distributed Geocoding: offloads geocode processing to a machine other than the one running the current dfPower Architect job.
Distributed Address Verification: offloads address verification processing to a machine other than the one running the current dfPower Architect job.
Distributed Phone: offloads phone data processing to a machine other than the one running the current dfPower Architect job.
Distributed Area Code: offloads area code data processing to a machine other than the one running the current dfPower Architect job.
Distributed County: offloads county data processing to a machine other than the one running the current dfPower Architect job.


Job Flow Steps: Monitoring

Job flow steps in the Monitoring category:

...

Data Monitoring: enables you to analyze data according to business rules that you create using the Business Rule Manager. The business rules that you create in Rule Manager can analyze the structure of the data and trigger an event, such as logging a message or sending an e-mail alert, when a condition is detected.

Getting Started with dfPower Architect

A typical dfPower Architect session consists of the following:

1. Plan the job flow.
2. Select the input data.
3. Build the job flow.
4. Specify the output.
5. Process the job flow.

Getting Started with dfPower Architect

A typical dfPower Architect session consists of the following:

1. Plan the job flow: identify how the data is to be processed.
2. Select the input data: select input data source(s) and/or manipulate with SQL.
3. Build the job flow: select and configure job flow nodes.
4. Specify the output: identify the type of output, and where the output is to be saved.
5. Process the job flow: select to begin processing.


Case Study Tasks

Analyze and Profile the Data:
Access and view the data.
Create and execute profiling job(s).

Improve the Data:
Standardize data.
Augment and validate data.
Create match codes.

This demonstration illustrates the use of dfPower Architect to perform identification analysis, gender analysis, parsing, concatenation, and casing. In addition, other nodes are investigated (frequency distribution, frequency distribution chart, and HTML report).

These case study tasks are performed using dfPower Studio 7.1 from DataFlux.


Augmenting and Validating Data Using dfPower Architect

In this demonstration, first establish a data source to work with. Then run an identification analysis on a name field from this data source, with the results used to generate frequency counts of the identified types of data. After you decide that the majority of data in the name field are individual names, run a gender analysis, with the results of this also used to generate frequency counts. As a last step, use the results from the identification and gender analyses to generate a pie chart.

1. If necessary, invoke dfPower Studio by selecting Start → All Programs → DataFlux dfPower Studio 7.1 → dfPower Studio.

2. Select Base from the toolbar, and then select Architect.


    Identification and Gender Analysis

    1. Add a data source to the job flow.

    a. Expand the Data Inputs grouping of nodes.

    b. Double-click the Data Source node.


    The Data Source node is added to the job flow, and the Data Source Properties window opens.

To add a node to the job flow diagram, you can do the following:

double-click
drag and drop
right-click and select Insert on Page


    c. Specify properties for the Data Source node.

    1) Enter Contacts as the name.

    2) Select next to Input table.

    3) Expand the DataFlux Sample database and select the Contacts table.

    4) Select to close the Select Table window.


    The Data Source Properties window shows available fields from the Contacts table.

    5) Select (double-arrow) to move all fields from the Available area to the Selected area.


    6) Select to close the Data Source Properties window.

    The job flow diagram is updated to a display that resembles what is shown below:

2. With the data source node selected, select the Preview tab from the Details area (at the bottom of the dfPower Architect interface). The data from this node is displayed:


    3. Perform an Identification Analysis using the Contact field.

    a. Expand the Quality grouping of nodes.

    b. Double-click the Identification Analysis node.


    The Identification Analysis Properties window opens.

    c. Move the CONTACT field from the Available area to the Selected area by double-clicking.

    d. Double-click on the Definition column for the selected CONTACT field.

    e. From the menu, select Individual/Organization.

    f. Scroll in the Selected area to reveal that the results of the identification analysis are placed in the

    field CONTACT_Identity.

    g. Select below the Available area.


    h. Select (double-arrow) to move all fields from the Available area to the Selected area.

    i. Select to close the Additional Outputs window.

    j. Select to close the Identification Analysis Properties window.

    4. Preview the results of the Identification Analysis.

    a. Verify that the Identification Analysis node is selected.

b. Select the Preview tab at the bottom of the dfPower Architect interface.

    c. Scroll to the right to view the information populated for the CONTACT_Identity:

    Although this preview is a good indication of the overall data values, it would be desirable to

    verify that there are no odd data values.


    5. Add a Frequency Distribution task to the job flow.

    a. Expand the Profiling grouping of nodes.

    b. Double-click the Frequency Distribution node.

    The Frequency Distribution Properties window opens.


    c. Move CONTACT_Identity from the Available area to the Selected area.

    d. Select to close the Frequency Distribution Properties window. The Preview tab is

    populated with the frequency report.

If you are satisfied that the majority (99%) of the observations represent individuals, you can proceed with a gender analysis.
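Conceptually, the frequency report produced here simply counts the distinct values of CONTACT_Identity. The following sketch illustrates the idea with invented sample values (the real counts come from the Contacts table):

```python
# Sketch of a frequency distribution over an identity field.
# The sample values below are invented for illustration only.
from collections import Counter

contact_identity = ["INDIVIDUAL"] * 97 + ["ORGANIZATION"] * 2 + ["UNKNOWN"]

for value, count in Counter(contact_identity).most_common():
    print(f"{value:12} {count:4} {count / len(contact_identity):6.1%}")
```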


    6. Perform a gender analysis using the Contact field.

    a. Verify that the Frequency Distribution 1 node is selected in the job flow diagram.

    b. Expand the Quality grouping of nodes.

    c. Right-click on the Gender Analysis node and select Insert Before Selected.


    The Gender Analysis Properties window opens.

    d. Move the CONTACT field from the Available area to the Selected area by double-clicking.

    e. Double-click on the Definition column for the selected CONTACT field.

f. Select Gender.

g. Scroll in the Selected area to reveal that the results of the gender analysis are placed in the field CONTACT_Gender.

    h. Select below the Available area.

    i. Select (double-arrow) to move all fields from the Available area to the Selected area.

    j. Select to close the Additional Outputs window.

k. Select to close the Gender Analysis Properties window.


    7. Update the properties of the Frequency Distribution to include the CONTACT_Gender field.

a. Right-click Frequency Distribution 1 in the job flow and select Properties.

    b. Move the CONTACT_Gender field from the Available area to the Selected area.

    c. Select to close the Frequency Distribution Properties window. The Preview tab is

    populated with the frequency report.

    A more visual approach for viewing the results uses a graphic representation of the information.


    8. Add a Frequency Distribution Chart task to the job flow.

    a. Expand the Data Outputs grouping of nodes.

    b. Double-click the Frequency Distribution Chart node.


    The Frequency Distribution Chart Properties window opens.

    c. Select next to Chart name to choose a location for the output.

    1) Navigate to S:\Workshop\winsas\didq.

2) Enter Contacts Gender Identity Chart as the value for File name.

    3) Select to close the Save As window.


    d. Enter Gender & Identity Distribution from Contacts as the title for the chart.

    e. Move both CONTACT_Identity and CONTACT_Gender from the Available area to the

    Selected area.

    f. Select to close the Frequency Distribution Chart Properties window. The Preview tab

    is populated with the frequency report.


    9. Run the entire job.

    a. Select from the toolbar. The job processes, and the Run Job window opens with a status

    indicator:

    b. Select to close the Run Job window.


    The Chart Viewer window opens.


c. Select to scroll to the next chart for CONTACT_Gender.

d. Select File → Exit to close the Chart Viewer window.


    10. Save the job.

a. From the dfPower Architect menu, select File → Save As.

b. Enter DIDQ Contact Gender/Identity Analysis as the name.

    c. Enter Gender & Identity Analysis for Contacts table as the description.

    d. Select to close the Save As window.


    Parsing, Concatenation, and Casing

Name fields are often populated in a variety of ways: sometimes as FIRST MIDDLE LAST, and other times as LAST, FIRST. Parsing enables you to break a name field into portions. Concatenation can rejoin the name field in a consistent fashion. After the field values are available in a consistent pattern, it is useful to put the data in the correct case.

1. Start a new job by selecting File → New.

    2. Add a data source to the job flow:

    a. Expand the Data Inputs grouping of nodes.

    b. Double-click the Data Source node. The Data Source Properties window opens.

    c. Specify properties for the Data Source node.

    1) Enter Contacts as the name.

    2) Select next to Input table.

    3) Expand the DataFlux Sample database and then select the Contacts table.

    4) Select to close the Select Table window.

    The Data Source Properties window shows available fields from the Contacts table.

    5) Select (double-arrow) to move all fields from the Available area to the Selected area.

    6) Select to close the Data Source Properties window.


    3. Parse the Contact field.

    a. Expand the Quality grouping of nodes.

    b. Double-click the Parsing node.


    The Parse Properties window opens.


    c. Select CONTACT as the field to parse.

    d. Select Name as the definition.

    e. Select to move all tokens from the Available area to the Selected area.

    f. Select below the Available area.

    g. Select (double-arrow) to move all fields from the Available area to the Selected area.

    h. Select to close the Additional Outputs window.

    i. Select to close the Parse Properties window.


    j. Select the Preview tab to view the results of the parse.


    4. Concatenate the parsed fields.

    a. Expand the Utilities grouping of nodes.

    b. Double-click the Concatenate node.


    The Concatenation Properties window opens.


    c. Specify LastFirst as the output field.

d. Enter , (a comma and a space) as the value for Literal text.

    e. Select Family Name, and then select to move it to the Concatenation list area.

    f. Select next to Literal text to move the text to the Concatenation list area after

    Family Name.

    g. Select Given Name, and then select to move it to the Concatenation list area.

    h. Select below the Available fields area.

    i. Select (double-arrow) to move all fields from the Available area to the Selected area.

    j. Select to close the Additional Outputs window.

    k. Select to close the Concatenation Properties window.


    The Preview tab is populated. Scroll to find the new LastFirst column.

    A more complete picture of the concatenation might be gained by viewing an HTML report.


    5. Add an HTML Report task to the job flow.

    a. Expand the Data Outputs grouping of nodes.

    b. Double-click the HTML Report node.


    The HTML Report Properties window opens.


c. Enter Concatenation Results as the value for Report title.

d. Enter NewName as the value for Report name.

e. Select the check box for Display report in browser after job runs.

    f. Deselect all columns from Selected. (Select .)

    g. Move CONTACT, Given Name, Family Name, and LastFirst from the Available area to the

    Selected area.

    h. Select to close the HTML Report Properties window.


    6. Run the entire job.

    a. Select from the toolbar. The job processes, and the Run Job window opens with a status

    indicator.

    b. Select to close the Run Job window.

    The appropriate browser opens and displays the HTML report.

c. Select File → Close to close the browser when you are finished viewing the report.


    7. Change the case of the LastFirst field.

    a. Select the HTML Report 1 node in the job flow.

    b. Expand the Quality grouping of nodes.

c. Right-click Change Case and select Insert Before Selected.

    The Case Properties window opens.


    d. Move LastFirst from the Available area to the Selected area.

    e. Select Proper as the type of casing to use.

    f. Select Proper (Name) as the definition to use.

    g. Select below the Available area.

    h. Select (double-arrow) to move all fields from the Available area to the Selected area.

    i. Select to close the Additional Outputs window.

j. Select to close the Case Properties window.


k. Select the Preview tab to view the results of the casing.

    8. Update the HTML Report 1 node.

a. Double-click on the HTML Report 1 node in the job flow to open the HTML Report Properties window.

b. Verify that the check box for Display report in browser after job runs is selected.

    c. Deselect all columns from the Selected area. (Select .)

    d. Move CONTACT, Given Name, Family Name, LastFirst, and LastFirst_Cased from the

    Available area to the Selected area.

    e. Select to close the HTML Report Properties window.


    9. Run the entire job.

    a. Select from the toolbar. The job processes, and the Run Job window opens with a status

    indicator.

    b. Select to close the Run Job window.

    The appropriate browser opens and displays the HTML report.

c. Select File → Close to close the browser when you are finished viewing the report.


    10. Save the job.

a. From the dfPower Architect menu, select File → Save As.

b. Enter DIDQ Contact Parse/Concatenation Job as the name.

c. Enter Parse then concatenation of Contact field as the description.

    d. Select to close the Save As window.

11. Select File → Exit to close dfPower Architect.

12. Select Studio → Exit to close dfPower Studio.


    12.3 Data Quality/Cleansing Using SAS

38

    Objectives Describe some SAS Data Quality Server functions.

    List some basic examples using these functions.

39

SAS Data Quality Server Functions

The SAS Data Quality Server provides a set of functions that can be
used to ensure quality data. Of these, several can be used to
enhance the data:

    DQIDENTIFY

    DQGENDER

    DQPARSE

    DQPARSEINFOGET

    DQPARSETOKENGET

    DQCASE


40

%DQPUTLOC Macro

Each of these functions requires the specification of a definition
as part of the syntax.

The %DQPUTLOC AUTOCALL macro provides a quick means of displaying
current information in the SAS log for the specified locale that is
loaded into memory at that time.

The available locale information includes a list of all definitions,
parse tokens, related functions, and the names of the parse
definitions that are related to each match definition.

%DQPUTLOC(locale <, SHORT=0|1> <, PARSEDEFN=0|1>);

    where

    locale specifies the locale of interest.

    SHORT=0|1 optionally shortens the length of the entry in the SAS log. SHORT=1 removes the

    descriptions of how the definitions are used. The default value is SHORT=0,

    which displays the descriptions of how the definitions are used.

PARSEDEFN=0|1 optionally lists the related parse definition, if such a parse definition exists, with each gender analysis definition and each match definition. The default value

    PARSEDEFN=1 lists the related parse definition. PARSEDEFN=0 does not list

    the related parse definition.
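Putting this syntax together, a minimal sketch of a %DQPUTLOC call might look like the following. The %DQLOAD call and the dqsetup.txt path are borrowed from the demonstrations later in this chapter; the keyword settings shown are assumptions based on the syntax description above.

```sas
/* Load the ENUSA locale, then write a shortened locale summary   */
/* (no usage descriptions, no related parse definitions) to the   */
/* SAS log.                                                       */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt');
%DQPUTLOC(ENUSA, SHORT=1, PARSEDEFN=0);
```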


41

%DQPUTLOC Macro Example

If the ENUSA locale is loaded, the %DQPUTLOC macro returns
information for the ENUSA definitions, such as the following:

/*----------------------------------------------------------*/
/* GENDER DEFINITION(S)                                     */
/*                                                          */
/* Gender definitions are used by the following:            */
/*    dqGender function                                     */
/*    dqGenderParsed function                               */
/*----------------------------------------------------------*/
Gender

/*----------------------------------------------------------*/
/* IDENTIFICATION DEFINITION(S)                             */
/*                                                          */
/* Identification definitions are used by the following:    */
/*    dqIdentify function                                   */
/*----------------------------------------------------------*/
Contact Info
Individual/Organization
Offensive


42

Identification Analysis in SAS

The DQIDENTIFY function returns a value that indicates the category
of the content in an input character value. The available categories
and return values depend on your choice of identification definition
and locale.

DQIDENTIFY(char, 'identification-definition' <, 'locale'>)

    where

char is the value that is transformed, according to the specified identification definition. The value can be the name of a character variable, a character

    value in quotation marks, or an expression that evaluates to a variable name

    or a quoted value.

identification-definition specifies the name of the identification definition, which must exist in the specified locale.

locale optionally specifies the name of the locale that contains the specified identification definition. The value can be a name in quotation marks, the

    name of a variable whose value is a locale name, or an expression that

    evaluates to a variable name or to a quoted locale name.

    The specified locale must be loaded into memory as part of the locale list. If

    no value is specified, the default locale is used. The default locale is the first

    locale in the locale list.


43

Example of DQIDENTIFY Function

The following example determines whether a character value
represents an individual or an organization.

The value returned for ID in the SAS log would be ORGANIZATION.

data _null_;
   id=dqidentify('LL Bean', 'Individual/Organization', 'ENUSA');
   put id=;
run;


44

Gender Analysis in SAS

The DQGENDER function evaluates the name of an individual to
determine the gender of that individual. If the evaluation finds
substantial clues that indicate gender, the function returns a value
that indicates that the gender is female or male. If the evaluation
is inconclusive, the function returns a value that indicates that
the gender is unknown. The exact return value is determined by the
specified gender analysis definition and locale.

DQGENDER(char, 'gender-analysis-definition' <, 'locale'>)

    where

char is the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.

    gender-analysis-definition specifies the name of the gender analysis definition, which must exist in

    the specified locale.

    locale optionally specifies the name of the locale that contains the specified

    gender-analysis definition.


45

Example of DQGENDER Function

The following example determines the gender of an individual based
on the name:

The value returned for Gender in the SAS log would be M.

data _null_;
   Gender=DQGENDER('Mr. Malcolm A. Lackey', 'gender', 'ENUSA');
   put Gender=;
run;


46

Parsing in SAS

The DQPARSE function returns a parsed character value. The return
value contains delimiters that identify the elements in the value
that correspond to the tokens that are enabled by the parse
definition.

DQPARSE(char, 'parse-definition' <, 'locale'>)

    where

char is the value that is parsed according to the parse definition. The value can be the name of a character variable, a character value in quotation marks, or an expression

    that evaluates to a variable name or a quoted value.

    parse-definition specifies the name of the parse definition, which must exist in the specified locale.

locale optionally specifies the name of the locale that contains the specified parse definition.


47

Parsing in SAS

The DQPARSEINFOGET function returns the token names in a parse
definition.

The DQPARSETOKENGET function returns a token from a parsed
character value.

DQPARSEINFOGET('parse-definition' <, 'locale'>)

    where

    parse-definition specifies the name of the parse definition, which must exist in the specified locale.

locale optionally specifies the name of the locale that contains the specified parse definition.

DQPARSETOKENGET(parsed-char, 'token', 'parse-definition' <, 'locale'>)

    where

parsed-char is the parsed character value from which the value of the specified token is returned.

    token specifies the name of the token that is returned from the parsed value.

    parse-definition specifies the name of the parse definition, which must exist in the specified locale.

locale optionally specifies the name of the locale that contains the specified parse definition.
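Because DQPARSETOKENGET expects token names that match the parse definition, DQPARSEINFOGET is a convenient way to check what those names are. The following minimal sketch (added for illustration — it assumes the ENUSA locale has already been loaded) writes the token names of the NAME parse definition to the SAS log:

```sas
/* Write the delimited list of token names defined by the ENUSA   */
/* NAME parse definition (for example, Name Prefix and Given      */
/* Name) to the SAS log.                                          */
data _null_;
   tokens=DQPARSEINFOGET('NAME', 'ENUSA');
   put tokens=;
run;
```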


48

Example of Parsing Functions

The following example parses a name and then retrieves individual
tokens from the parsed value:

The returned values in the SAS log would be as follows:

    parsedValue=Mrs./=/Sallie/=/Mae/=/Pravlik/=//=/

    prefix=Mrs.

given=Sallie

data _null_;
   parsedValue=DQPARSE('Mrs. Sallie Mae Pravlik', 'NAME', 'ENUSA');
   prefix=DQPARSETOKENGET(parsedValue, 'Name Prefix', 'NAME', 'ENUSA');
   given=DQPARSETOKENGET(parsedValue, 'Given Name', 'NAME', 'ENUSA');
   put parsedValue= prefix= given=;
run;


49

Changing Case in SAS

The DQCASE function returns a character value with standardized
capitalization. The DQCASE function operates on any character
content, such as names, organizations, and addresses. All instances
of adjacent blank spaces are replaced with single blank spaces.

DQCASE(char, 'case-definition' <, 'locale'>)

    where


    char is the value that is transformed, according to the specified case definition.

    case-definition specifies the name of the case definition that will be referenced during the

    transformation.

locale optionally specifies the name of the locale that contains the specified case definition.

50

Example of DQCASE Function

The following example applies proper casing to an organization name:

The value returned for orgname in the SAS log would be
Bill's Plumbing & Heating.

data _null_;
   orgname=DQCASE("BILL's PLUMBING & HEATING", 'Proper', 'ENUSA');
   put orgname=;
run;


51

Case Study Tasks

Analyze and Profile the Data
   Access and view the data.
   Create and execute profiling job(s).

Improve the Data
   Standardize data.
   Augment and validate data.
   Create match codes.

This demonstration illustrates the use of the SAS Data Quality
Server functions to perform identification analysis, gender
analysis, parsing, concatenation, and casing.



    Augmenting and Validating Data Using SAS

In this demonstration, investigate four separate SAS programs. These programs explore the use and
results of the DQIDENTIFY, DQGENDER, DQPARSE, and DQCASE functions. To examine the
results from the programs, short FREQ or PRINT procedure steps will be added.

1. Start a SAS session by selecting Start → All Programs → SAS BIArchitecture → Start SAS.

    2. If the Getting Started with SAS window opens, do the following:

a. Select Don't show this dialog box again.

    b. Select .

    The SAS Display Manager session opens.


    Using the DQIDENTIFY Function

    1. Verify that the Enhanced Editor window is active.

2. Select File → Open Program.

3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQIdentityFunctions.sas.

    4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   length Identity $1;
   label Identity='Customer Identity Type';
   Identity = dqidentify(contact, 'Individual/Organization');
run;

    In this program, the following occurs:

    The %DQLOAD macro loads the ENUSA locale into memory.

The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.

    The DATA step uses the DQIDENTIFY function to identify whether the value for the CONTACT

    field is an individual, an organization, or not known.

5. Select Run → Submit to execute the SAS program.


6. Select View → Log to activate the Log window.


7. To view the resulting data set, do the following:

    a. Select SAS Explorer.

    b. Double-click on the Libraries icon.

c. Double-click on the Work library icon.

    d. Double-click on the Std_prospects table to open it into a VIEWTABLE window.

    e. Scroll to view the Customer Identity Type column.

f. Select File → Close to close the VIEWTABLE window.

8. Select Window → DQIdentityFunctions.sas.


    9. Run a frequency report on the new identity column.

a. At the bottom of the program, after the RUN statement for the DATA step, uncomment the PROC
FREQ step (that is, remove the /* before the step and the */ after the step). The PROC FREQ
step is as shown:

proc freq;
   tables identity/nocum;
run;

b. Highlight only these three new lines and then select Run → Submit.

    The following report surfaces:


    Using the DQGENDER Function

1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.

2. Select File → Open Program.

3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQGenderFunctions.sas.

    4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set Prospects;
   /* use the GENDER function to determine gender based on name */
   length custgender $1;
   label custgender='Customer Gender';
   custgender = dqgender(contact, 'gender');
run;

PROC FREQ;
   Tables custgender/nocum;
RUN;

    In this program, the following occurs:

    The %DQLOAD macro loads the ENUSA locale into memory.

The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.

The DATA step uses the DQGENDER function to identify whether the value for the CONTACT
field is M (male), F (female), or U (unknown).

    The PROC FREQ step generates a report of frequency counts on the custgender column.

5. Select Run → Submit to execute the SAS program.


6. Select View → Log to activate the Log window. A portion of the DATA step and PROC FREQ step is

    shown below:

7. Select View → Output to activate the Output window. The report shows the following:


    Using the DQPARSE Function

1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.

2. Select File → Open Program.

3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQParseFunctions.sas.

    4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\biarchitecture\Lev1\sasmain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "newcustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\dqdata\newcustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   Parsedname=dqparse(contact, 'NAME');
   Prefix=dqparsetokenget(parsedname, 'Name Prefix', 'NAME');
   First_name=dqparsetokenget(parsedname, 'Given Name', 'NAME');
   Last_name=dqparsetokenget(parsedname, 'Family Name', 'NAME');
run;

proc print;
   var prefix first_name last_name;
run;

    In this program, the following occurs:

    The %DQLOAD macro loads the ENUSA locale into memory.

The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft
Access database.

The DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the
CONTACT field.

    The PROC PRINT step produces a listing report of the results of the DQPARSETOKENGET

    function usage.

5. Select Run → Submit to execute the SAS program.


6. Select View → Log to activate the Log window. The portion for the DATA step and PROC PRINT

    step is shown below:

7. Select View → Output to activate the Output window. The partial output is as follows:


    Using the DQCASE Function

1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.

2. Select File → Open Program.

3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQPropercaseFunctions.sas.

    4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   ParsedName=dqParse(contact, 'NAME');
   Prefix=dqParseTokenGet(parsedName, 'Name Prefix', 'NAME');
   First_name=dqParseTokenGet(parsedName, 'Given Name', 'NAME');
   Last_name=dqParseTokenGet(parsedName, 'Family Name', 'NAME');
run;

data std_prospects;
   set std_prospects;
   length Contact2 $50;
   label Contact2='Re-formatted Prospect Name';
   Contact2 = trim(Last_Name) || ', ' || First_Name;
   length Contact3 $50;
   label Contact3='Proper Cased Re-formatted Prospect Name';
   Contact3 = dqcase(contact2, 'PROPER');
run;

proc print;
   var Contact Contact2 Contact3;
run;


    In this program, the following occurs:

    The %DQLOAD macro loads the ENUSA locale into memory.

The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.

    The first DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the value

    for the CONTACT field.

    The second DATA step uses the concatenation operator (||) to rebuild a Name field (Contact2).

    The DQCASE function is then applied to resolve the Contact2 field to proper casing.

    The PROC PRINT step produces a listing report of some parsed and concatenated information.
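As a side note on the concatenation step: the trim-and-concatenate expression in the second DATA step can also be written with the Base SAS CATX function, which strips leading and trailing blanks from each argument and inserts the delimiter in a single call. The following alternative is a sketch for comparison, not part of the demonstration program:

```sas
/* CATX trims each argument and inserts the ', ' delimiter, so it */
/* replaces the trim(Last_Name) || ', ' || First_Name expression. */
data std_prospects_alt;
   set std_prospects;
   length Contact2b $50;
   Contact2b = catx(', ', Last_Name, First_Name);
run;
```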

5. Select Run → Submit to execute the SAS program.

6. Select View → Log to activate the Log window. The portion for the DATA steps and PROC PRINT

    step is shown below:


7. Select View → Output to activate the Output window. Partial output is shown below:

    8. Close the SAS session and do not save any changes.


    12.4 Exercises

    1. Analyzing the NewCustomers Table

Use the NewCustomers table from the New Customers database to do the following:

Verify the type of information found for each record. (Identify records as individual or organization.)

    Calculate gender information for each record.

    Create a frequency report and a frequency report chart on both the identity and gender

    information.

    Parse the Contact field.

Add a field that contains a name string of the form Name_Prefix Given_Name Family_Name.

    Save the job as DIDQ Ch5Ex1 NewCustomers Analysis.


    12.5 Solutions to Exercises

    1. Analyzing the NewCustomers Table

a. If necessary, invoke dfPower Studio by selecting Start → All Programs →
DataFlux dfPower Studio 7.1 → dfPower Studio.

    b. Select Base from the toolbar, and then select Architect.

    c. Expand the Data Inputs grouping of nodes.

    d. Double-click the Data Source node.

1) Enter New Customers as the name.

    2) Select next to Input table.

    3) Expand the New Customers database and select the NewCustomers table.

    4) Select to close the Select Table window.

    5) Select (double-arrow) to move all fields from the Available area to the Selected area.

    6) Select to close the Data Source Properties window.

e. With the data source node selected, select the Preview tab from the Details area (at the bottom of
the dfPower Architect interface). The data from this node is displayed.

    f. Expand the Quality grouping of nodes.

    g. Double-click the Identification Analysis node. The Identification Analysis Properties window

    opens.

    1) Move the CONTACT field from the Available area to the Selected area by double-clicking.

    2) Double-click on the Definition column for the selected CONTACT field.

    3) From the menu, select Individual/Organization.

    4) Scroll in the Selected area to reveal that the results of the identification analysis will be placed

    in the field CONTACT_Identity.

    5) Select below the Available area.

    6) Select (double-arrow) to move all fields from the Available area to the Selected area.

    7) Select to close the Additional Outputs window.

    8) Select to close the Identification Analysis Properties window.


    h. Preview the results of the Identification Analysis.

    1) Verify that Identification Analysis is selected.

2) Select the Preview tab at the bottom of the dfPower Architect interface.

    3) Scroll to the right to view the information populated for the CONTACT_Identity field.

    i. Expand the Quality grouping of nodes.

    j. Double-click on the Gender Analysis node. The Gender Analysis Properties window opens.

    1) Move the CONTACT field from the Available area to the Selected area by double-clicking.

    2) Double-click on the Definition column for the selected CONTACT field.

    3) Select Gender.

4) Scroll in the Selected area to reveal that the results of the gender analysis will be placed
in the field CONTACT_Gender.

    5) Select below the Available area.

    6) Select (double-arrow) to move all fields from the Available area to the Selected area.

    7) Select to close the Additional Outputs window.

8) Select to close the Gender Analysis Properties window.

    k. Expand the Profiling grouping of nodes.

    l. Double-click the Frequency Distribution node.

    1) The Frequency Distribution Properties window opens.

    2) Move CONTACT_Identity and CONTACT_Gender from the Available area to the Selected

    area.

    3) Select to close the Frequency Distribution Properties window. The Preview tab is

populated with the frequency report.

    m. Expand the Data Outputs grouping of nodes.

    n. Double-click the Frequency Distribution Chart node.

    1) Select next to Chart name to choose a location for the output.

    2) Navigate to S:\Workshop\winsas\didq.

3) Enter New Customers Chart as the value for File name.

    4) Select to close the Save As window.

5) Enter Gender & Identity Distribution from New Customers as the title for the chart.


    6) Move both CONTACT_Identity and CONTACT_Gender from the Available area to the

    Selected area.

    7) Select to close the Frequency Distribution Chart Properties window. The Preview

    tab is populated with the frequency report.

o. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator.

    1) Select to close the Run Job window. The Chart Viewer window opens.

2) Select to scroll to the next chart for CONTACT_Gender.

3) Select File → Exit to close the Chart Viewer window.

    p. Save the job.

1) From the dfPower Architect menu, select File → Save As.

2) Enter DIDQ Ch5Ex1 NewCustomers Analysis as the name.

3) Enter New Customer Analysis as the description.

    4) Select to close the Save As window.

q. Select the Frequency Distribution 1 node in the job flow.

    r. Expand the Quality grouping of nodes.

    s. Right-click the Parsing node and select Insert Before Selected.

    1) Select CONTACT as the field to parse.

    2) Select Name as the definition.

    3) Select to move all tokens from the Available area to the Selected area.

    4) Select below the Available area.

    5) Select to move all fields from the Available area to the Selected area.

    6) Select to close the Additional Outputs window.

    7) Select to close the Parse Properties window.

    t. Expand the Utilities grouping of nodes.

    u. Double-click the Concatenate node. The Concatenation Properties window opens.

    1) Specify PreFirstLast as the output field.

2) Enter (a space) as the value for Literal text.

    3) Select Name Prefix, and then select to move it to the Concatenation list area.


    4) Select next to Literal text to move the text to the Concatenation list area after

    Name Prefix.

    5) Select Given Name, and then select to move it to the Concatenation list area.

6) Select next to Literal text to move the text to the Concatenation list area after Given Name.

    7) Select Family Name, and then select to move it to the Concatenation list area.

    8) Select below the Available fields area.

    9) Select to move all fields from the Available area to the Selected area.