Chp 12 - Data Cleansing Additional Functionality
TRANSCRIPT
-
8/2/2019 Chp 12 - Data Cleansing Additional Functionality
1/91
Chapter 12 Data Cleansing Additional Functionality
12.1 Additional Data Quality/Cleansing Techniques
Objectives
Discuss some additional data quality/cleansing techniques.
Data Quality/Cleansing
The following are additional techniques that can be used to further enhance data quality:
Identification analysis
Gender analysis
Parsing
Concatenating
Casing
Data Quality/Cleansing
Identification Analysis: Based on a given name string, determine whether the name represents an individual or an organization.
Gender Analysis: Based on a person's name, determine the gender.
Parsing: Given a text string, parse the string into its individual elements.
Concatenating: Given two (or more) text strings, concatenate the values into one string.
Casing: Control whether a text string is represented as all capital letters or in mixed case.
Identification Analysis
Identification analysis enables you to compare information from the QKB with undetermined fields in your data to determine whether each field contains the following:
For name information:
an individual's name
an organization's name
empty
For address information:
a street address
city/state/ZIP information
empty
Identification Analysis
For data fields containing name data, identification analysis returns INDIVIDUAL, ORGANIZATION, or UNKNOWN.
For data fields containing address data, identification analysis returns one of the following:
ACCT (account number type information)
ADDR (address line 1)
ADDR2 (address line 2)
ATTN (attention line)
BLANK (blank or null values)
CSZ (city/state/ZIP)
IND (an individual's name)
ORG (organization type information)
UNK (unknown)
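As a rough illustration of the idea, a sketch in Python. The real classification is driven by QKB definitions; the keyword rules, the `identify_name` function, and the `ORG_HINTS` list below are invented for this sketch and are not the DataFlux implementation.

```python
# Toy identification analysis: classify a name field as INDIVIDUAL,
# ORGANIZATION, or UNKNOWN using simple keyword rules. A real QKB
# definition uses far richer vocabularies and patterns.
ORG_HINTS = {"inc", "inc.", "corp", "corp.", "llc", "ltd", "company", "co."}

def identify_name(value: str) -> str:
    """Return a coarse identity guess for a name string."""
    words = value.lower().split()
    if not words:                      # blank or empty field
        return "UNKNOWN"
    if any(w in ORG_HINTS for w in words):
        return "ORGANIZATION"
    if 2 <= len(words) <= 5:           # plausible person-name length
        return "INDIVIDUAL"
    return "UNKNOWN"

print(identify_name("Acme Tools Inc."))   # ORGANIZATION
print(identify_name("Mary J. Smith"))     # INDIVIDUAL
print(identify_name(""))                  # UNKNOWN
```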
Gender Analysis
Gender analysis determines whether a particular name is most likely feminine, masculine, or unknown.
The results are placed in a new field and have three
possible values:
"M" for male
"F" for female
"U" for unknown
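Conceptually, this is a lookup of the given name against a large name vocabulary. A minimal sketch, assuming a tiny made-up lookup table (`GENDER_TABLE` and `gender_of` are hypothetical, not the QKB gender definition):

```python
# Toy gender analysis: look up the given name in a small sample table
# and return "M", "F", or "U". DataFlux/SAS use QKB gender definitions
# with much larger name vocabularies.
GENDER_TABLE = {"igor": "M", "james": "M", "mary": "F", "olga": "F"}

def gender_of(name: str) -> str:
    """Guess gender from the first word of a full name."""
    parts = name.strip().lower().split()
    if not parts:
        return "U"
    return GENDER_TABLE.get(parts[0], "U")

print(gender_of("Mary J. Smith"))  # F
print(gender_of("Pat Jones"))      # U (name not in the table)
```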
Parsing Data
Parsing is a simple but intelligent tool for separating a multi-part field value into multiple, single-part fields (tokens).
Each token is identified based on its individual
contribution to the overall field.
Name: Mr. Linwood Leroy Bubar, III, M.D.
NamePrefix: Mr.
GivenName: Linwood
MiddleName: Leroy
FamilyName: Bubar
NameSuffix: III
NameAppendage: M.D.
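The tokenization above can be sketched in Python. Real parsing uses QKB parse definitions that handle many name layouts; this toy `parse_name` function is an assumption-laden illustration that only handles a rigid "Prefix Given Middle Family, Suffix, Appendage" layout.

```python
# Toy name parser: split a name string into the tokens shown above.
PREFIXES = {"mr.", "mrs.", "ms.", "dr."}

def parse_name(name: str) -> dict:
    parts = [p.strip() for p in name.split(",")]
    tokens = dict.fromkeys(
        ["NamePrefix", "GivenName", "MiddleName",
         "FamilyName", "NameSuffix", "NameAppendage"], "")
    words = parts[0].split()
    if words and words[0].lower() in PREFIXES:   # peel off the prefix
        tokens["NamePrefix"] = words.pop(0)
    if len(words) == 3:
        tokens["GivenName"], tokens["MiddleName"], tokens["FamilyName"] = words
    elif len(words) == 2:
        tokens["GivenName"], tokens["FamilyName"] = words
    if len(parts) > 1:                           # comma-separated extras
        tokens["NameSuffix"] = parts[1]
    if len(parts) > 2:
        tokens["NameAppendage"] = parts[2]
    return tokens

tokens = parse_name("Mr. Linwood Leroy Bubar, III, M.D.")
print(tokens["FamilyName"], tokens["NameSuffix"])  # Bubar III
```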
Concatenating Data
Concatenating is essentially the opposite of the parse step. Rather than separating a single field into multiple fields, concatenating combines one or more fields into a single field.
Given Name: Igor
Middle Name: Bela
Family Name: Bonski
Concatenated Name: Igor Bela Bonski
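Concatenation is essentially a join of field values with a separator. A minimal sketch of the example above (the `concatenate` helper is hypothetical, not a DataFlux API):

```python
# Combine parsed name fields back into one field, skipping blanks so
# a missing middle name does not leave a double space.
def concatenate(*fields, sep=" "):
    """Join non-empty field values with a separator."""
    return sep.join(f for f in fields if f)

given, middle, family = "Igor", "Bela", "Bonski"
print(concatenate(given, middle, family))  # Igor Bela Bonski
```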
Casing
Changing case enables you to make all alphabetical values in a field UPPERCASE, lowercase, or Proper Case.
Proper case treats a field value as a proper name; that is,
the first letter of each word is capitalized, with the
remaining characters in lowercase.
As with standardization, changing case can make field
values more consistent.
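A naive proper-casing rule is easy to sketch. Note that QKB-aware "Proper (Name)" definitions also handle names such as McDonald or van der Berg, which the simple word-by-word rule below (a hypothetical `proper_case` helper) gets wrong.

```python
# Naive proper case: capitalize the first letter of each word and
# lowercase the rest. Name-aware casing definitions do better.
def proper_case(value: str) -> str:
    return " ".join(w[:1].upper() + w[1:].lower() for w in value.split())

print(proper_case("IGOR BELA BONSKI"))  # Igor Bela Bonski
print(proper_case("mcdonald"))          # Mcdonald (name-aware casing would give McDonald)
```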
Applying Techniques
These data quality/cleansing techniques can be applied using the following:
dfPower Studio's dfPower Architect
the SAS Data Quality Server functions as column-level transformations with SAS Data Integration Studio
the SAS Data Quality Server functions within a SAS programming environment
Because the SAS Data Quality Server functions are the same whether surfaced in SAS Data Integration Studio or in a SAS session, this chapter examines the functions only in a SAS session.
12.2 Data Quality/Cleansing Using dfPower Architect
Objectives
Describe the functionality of dfPower Architect.
Explore various job flow steps that are available to use.
Discuss the sequence of steps for building a job.
dfPower Architect: Introduction
dfPower Architect brings much of the functionality of the other dfPower Studio applications, as well as some unique functionality, into a single, intuitive user interface.
To use dfPower Architect, you specify operations by selecting job flow steps and then configuring those steps to meet your specific data needs. The steps you choose are displayed as job flow icons, which together form a visual job flow.
dfPower Architect
With dfPower Architect, you can perform the following tasks:
identify and connect to multiple data sources, whether
those sources are local, over a network on a different
platform, or at a remote location
choose and configure job flow nodes for processing
your data
reconfigure existing job flow nodes as needed
view sample processed data at each job flow node
specify a variety of output options, including reports
and new data sources
run a job flow with a single click
Accessing dfPower Architect
dfPower Architect is invoked from the toolbar of dfPower Studio by selecting Base → Architect.
...
dfPower Architect Interface
(Interface areas: Nodes List and Job Flow Area)
...
Job Flow Steps
dfPower Architect's available job flow steps are grouped into nine categories:
Job Flow Steps: Data Inputs
Job flow steps in the Data Inputs category:
...
Node Description
Data Source identifies existing data sets to process.
SQL Query identifies existing data sets to process using SQL.
Text File Input accesses data in a plain-text file.
Fixed Width File Input accesses data in a text file where the input is separated into fixed-width columns.
External Data Provider enables services for applications or processes that want to pass data into dfPower Architect one record at a time; can also be used to call other Architect job flows within a job when used in conjunction with the Embedded Job node.
Table Metadata extracts metadata from a specific table within a database.
SAS Data Set identifies existing SAS data sets to process on the Microsoft Windows platform.
SAS SQL Query identifies existing data sets to process as with the SAS Data Set node. This step, however, enables you to use SQL to select data.
Job Flow Steps: Data Outputs
Job flow steps in the Data Outputs category:
...
Node Description
Data Target (Update) updates existing data rather than creating a new data source or replacing an existing source.
Data Target (Insert) outputs data in a variety of data formats to a new data source, leaving your existing data as is or overwriting your existing data.
Delete Record eliminates records from a data source using the unique key of those records.
HTML Report creates an HTML-formatted report from the results of your job flow.
Text File Output creates a plain-text file with the results of your job flow.
Fixed Width File Output outputs your data to well-defined fixed-width columns in your output file.
Frequency Distribution Chart creates a chart that shows how selected values are distributed throughout your data.
Match Report generates a match report that can then be displayed with the Match Report Viewer.
dfPower Merge File Output writes clustered data to a dfPower Merge file for use in dfPower Merge.
Job Flow Steps: Utilities
Job flow steps in the Utilities category:
...
Node Description
COM Plugin adds COM (Component Object Model) functionality to your job flows.
Data Sorting re-orders your data set at any point in a job flow.
Expression runs a Visual Basic-like language to process your data sets in ways that are not built into dfPower Studio.
Data Joining joins two data sets based on the values of one or more common (key) fields.
Data Joining (Non-Key) is used when you have two tables, each with the same number of records, and you want to join them by location in the file rather than by a unique key.
Data Union uses Data Joining to combine two data sets in an intelligent way so that the records of one, the other, or both data sets are used as the basis for the resulting data set.
Concatenate performs essentially the opposite of the Parse node; rather than separating a single field into multiple fields, Concatenate combines one or more fields into a single field.
Embedded Job embeds another dfPower Architect job in your current job flow.
Sequencer (Autonumber) creates a sequence of numbers given a starting number and a specified interval.
SQL Lookup finds rows in a database table that have one or more fields matching those in the job flow.
SQL Execute enables you to construct and execute any valid SQL statement (or series of statements); generally used to perform database-specific tasks before, after, or in between Architect job flows; stand-alone node (no parents or children).
Field Layout enables you to rename and reorder field names as they pass out of this node.
Parameterized SQL Query provides a way to write an SQL query that contains variable inputs, also known as parameters.
Job Flow Steps: Profiling
Job flow steps in the Profiling category:
...
Node Description
Data Validation analyzes the content of data by setting validation conditions.
Pattern Analysis performs pattern analysis.
Basic Statistics calculates basic statistics.
Frequency Distribution creates a frequency distribution.
Basic Pattern Analysis provides the ability to run pattern analysis in a manner very similar to how it is run in dfPower Profile. (In contrast to advanced Pattern Analysis, the simplified version does not employ Blue Fusion pattern identification definitions.)
Job Flow Steps: Quality
Job flow steps in the Quality category:
...
Node Description
Gender Analysis performs gender analysis.
Gender Analysis (Parsed) performs gender analysis on parsed information.
Identification Analysis performs identification analysis.
Parsing parses a field.
Standardization performs standardization of fields of data.
Standardization (Parsed) performs standardization of fields of parsed information.
Change Case enables the case of field values to be set.
Locale Guessing attempts to guess the appropriate locale based on field information.
Right Fielding identifies the contents of fields and copies the data to fields with more descriptive names.
Job Flow Steps: Integration
Job flow steps in the Integration category:
...
Node Description
Match Code generates match codes.
Match Codes (Parsed) generates match codes on parsed information.
Clustering generates clusters.
Cluster Update enables new records to be integrated with existing clusters.
Surviving Record Identification examines clustered data and determines a surviving record for each cluster.
Cluster Diff compares sets of clustered records.
Exclusive Real Time Clustering (ERTC) facilitates the near real-time addition of new rows to previously clustered data.
Concurrent Real Time Clustering (CRTC) is similar to the ERTC node in its outcomes; the difference is that the ERTC node interacts directly with the cluster state file, while the CRTC node interacts with a server that interacts with the cluster state file.
Job Flow Steps: Enrichment
Job flow steps in the Enrichment category:
...
Node Description
Address Verification (US/Canada) verifies, corrects, and enhances U.S. and Canadian addresses in your existing data.
Address Verification (QAS) performs address verification on addresses from outside of the U.S. and Canada.
Address Verification (World) performs address verification on addresses from outside of the U.S. and Canada. (This step is similar to Address Verification (QAS) but supports verification and correction for addresses from more countries.)
Geocoding matches geographic information from the geocode reference database with ZIP codes in your data to determine latitude, longitude, census tract, FIPS (Federal Information Processing Standard), and block information.
County matches information from the phone and geocode reference databases with FIPS codes in your data to calculate several values.
Phone matches information from the phone reference database with telephone numbers in your data.
Area Code matches information from the phone reference database with ZIP codes in your data to calculate several values, primarily area code, but also Overlay1, Overlay2, Overlay3, and Result.
Job Flow Steps: Enrichment (Distributed)
Job flow steps in the Enrichment (Distributed) category:
...
Node Description
Distributed Geocoding offloads geocode processing to a machine other than the one running the current dfPower Architect job.
Distributed Address Verification offloads address verification processing to a machine other than the one running the current dfPower Architect job.
Distributed Phone offloads phone data processing to a machine other than the one running the current dfPower Architect job.
Distributed Area Code offloads area code data processing to a machine other than the one running the current dfPower Architect job.
Distributed County offloads county data processing to a machine other than the one running the current dfPower Architect job.
Job Flow Steps: Monitoring
Job flow steps in the Monitoring category:
...
Node Description
Data Monitoring enables you to analyze data according to business rules that you create using the Business Rule Manager. The business rules that you create in Rule Manager can analyze the structure of the data and trigger an event, such as logging a message or sending an e-mail alert, when a condition is detected.
Getting Started with dfPower Architect
A typical dfPower Architect session consists of the following:
1. Plan the job flow.
2. Select the input data.
3. Build the job flow.
4. Specify the output.
5. Process the job flow.
Getting Started with dfPower Architect
A typical dfPower Architect session consists of the following:
1. Plan the job flow: identify how the data is to be processed.
2. Select the input data: select input data source(s) and/or manipulate with SQL.
3. Build the job flow: select and configure job flow nodes.
4. Specify the output: identify the type of output and where the output is to be saved.
5. Process the job flow: select to begin processing.
Case Study Tasks
This demonstration illustrates the use of dfPower Architect to perform identification analysis, gender analysis, parsing, concatenation, and casing. In addition, other nodes are investigated (frequency distribution, frequency distribution chart, and HTML report).
Analyze and Profile the Data
Access and view the data.
Create and execute profiling job(s).
Improve the Data
Standardize data.
Augment and validate data.
Create match codes.
Case Study Tasks
Task performed using: dfPower Studio 7.1 from DataFlux
Augmenting and Validating Data Using dfPower Architect
In this demonstration, first establish a data source to work with. Then run an identification analysis on a name field from this data source, with the results used to generate frequency counts of the identified types of data. After you decide that the majority of data in the name field are individual names, run a gender analysis, with the results also used to generate frequency counts. As a last step, use the results from the identification and gender analyses to generate a pie chart.
1. If necessary, invoke dfPower Studio by selecting Start → All Programs → DataFlux dfPower Studio 7.1 → dfPower Studio.
2. Select Base from the toolbar, and then select Architect.
Identification and Gender Analysis
1. Add a data source to the job flow.
a. Expand the Data Inputs grouping of nodes.
b. Double-click the Data Source node.
The Data Source node is added to the job flow, and the Data Source Properties window opens.
To add a node to the job flow diagram, you can do the following:
double-click
drag and drop
right-click and select Insert on Page
c. Specify properties for the Data Source node.
1) Enter Contacts as the name.
2) Select next to Input table.
3) Expand the DataFlux Sample database and select the Contacts table.
4) Select to close the Select Table window.
The Data Source Properties window shows available fields from the Contacts table.
5) Select (double-arrow) to move all fields from the Available area to the Selected area.
6) Select to close the Data Source Properties window.
The job flow diagram is updated to a display that resembles what is shown below:
2. With the Data Source node selected, select the Preview tab from the Details area (at the bottom of the dfPower Architect interface). The data from this node is displayed:
3. Perform an Identification Analysis using the Contact field.
a. Expand the Quality grouping of nodes.
b. Double-click the Identification Analysis node.
The Identification Analysis Properties window opens.
c. Move the CONTACT field from the Available area to the Selected area by double-clicking.
d. Double-click on the Definition column for the selected CONTACT field.
e. From the menu, select Individual/Organization.
f. Scroll in the Selected area to reveal that the results of the identification analysis are placed in the
field CONTACT_Identity.
g. Select below the Available area.
h. Select (double-arrow) to move all fields from the Available area to the Selected area.
i. Select to close the Additional Outputs window.
j. Select to close the Identification Analysis Properties window.
4. Preview the results of the Identification Analysis.
a. Verify that the Identification Analysis node is selected.
b. Select the Preview tab at the bottom of the dfPower Architect interface.
c. Scroll to the right to view the information populated for the CONTACT_Identity field:
Although this preview is a good indication of the overall data values, it would be desirable to
verify that there are no odd data values.
5. Add a Frequency Distribution task to the job flow.
a. Expand the Profiling grouping of nodes.
b. Double-click the Frequency Distribution node.
The Frequency Distribution Properties window opens.
c. Move CONTACT_Identity from the Available area to the Selected area.
d. Select to close the Frequency Distribution Properties window. The Preview tab is
populated with the frequency report.
If you are satisfied that the majority (99%) of the observations represents individuals, you can
proceed with a gender analysis.
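The frequency report produced by the Frequency Distribution node is conceptually a value count. A sketch with made-up identity results (the 99/1 split mirrors the proportions described above):

```python
# Count how often each identity value occurs and report percentages,
# as a Frequency Distribution node does for a selected field.
from collections import Counter

identities = ["INDIVIDUAL"] * 99 + ["ORGANIZATION"]  # hypothetical data
counts = Counter(identities)
for value, n in counts.most_common():
    print(value, n, f"{100 * n / len(identities):.0f}%")
# INDIVIDUAL 99 99%
# ORGANIZATION 1 1%
```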
6. Perform a gender analysis using the Contact field.
a. Verify that the Frequency Distribution 1 node is selected in the job flow diagram.
b. Expand the Quality grouping of nodes.
c. Right-click on the Gender Analysis node and select Insert Before Selected.
The Gender Analysis Properties window opens.
d. Move the CONTACT field from the Available area to the Selected area by double-clicking.
e. Double-click on the Definition column for the selected CONTACT field.
f. Select Gender.
g. Scroll in the Selected area to reveal that the results of the gender analysis are placed in the field CONTACT_Gender.
h. Select below the Available area.
i. Select (double-arrow) to move all fields from the Available area to the Selected area.
j. Select to close the Additional Outputs window.
k. Select to close the Gender Analysis Properties window.
7. Update the properties of the Frequency Distribution to include the CONTACT_Gender field.
a. Right-click Frequency Distribution 1 in the job flow and select Properties.
b. Move the CONTACT_Gender field from the Available area to the Selected area.
c. Select to close the Frequency Distribution Properties window. The Preview tab is
populated with the frequency report.
A more visual approach for viewing the results uses a graphic representation of the information.
8. Add a Frequency Distribution Chart task to the job flow.
a. Expand the Data Outputs grouping of nodes.
b. Double-click the Frequency Distribution Chart node.
The Frequency Distribution Chart Properties window opens.
c. Select next to Chart name to choose a location for the output.
1) Navigate to S:\Workshop\winsas\didq.
2) Enter Contacts Gender Identity Chart as the value for File name.
3) Select to close the Save As window.
d. Enter Gender & Identity Distribution from Contacts as the title for the chart.
e. Move both CONTACT_Identity and CONTACT_Gender from the Available area to the
Selected area.
f. Select to close the Frequency Distribution Chart Properties window. The Preview tab
is populated with the frequency report.
9. Run the entire job.
a. Select from the toolbar. The job processes, and the Run Job window opens with a status
indicator:
b. Select to close the Run Job window.
The Chart Viewer window opens.
c. Select to scroll to the next chart for CONTACT_Gender.
d. Select File → Exit to close the Chart Viewer window.
10. Save the job.
a. From the dfPower Architect menu, select File → Save As.
b. Enter DIDQ Contact Gender/Identity Analysis as the name.
c. Enter Gender & Identity Analysis for Contacts table as the description.
d. Select to close the Save As window.
Parsing, Concatenation, and Casing
Name fields are often populated in a variety of ways: sometimes as FIRST MIDDLE LAST, and other times as LAST, FIRST. Parsing enables you to break a name field into portions. Concatenation can rejoin the name field in a consistent fashion. After the field values are available in a consistent pattern, it is useful to put the data in the correct case.
1. Start a new job by selecting File → New.
2. Add a data source to the job flow:
a. Expand the Data Inputs grouping of nodes.
b. Double-click the Data Source node. The Data Source Properties window opens.
c. Specify properties for the Data Source node.
1) Enter Contacts as the name.
2) Select next to Input table.
3) Expand the DataFlux Sample database and then select the Contacts table.
4) Select to close the Select Table window.
The Data Source Properties window shows available fields from the Contacts table.
5) Select (double-arrow) to move all fields from the Available area to the Selected area.
6) Select to close the Data Source Properties window.
3. Parse the Contact field.
a. Expand the Quality grouping of nodes.
b. Double-click the Parsing node.
The Parse Properties window opens.
c. Select CONTACT as the field to parse.
d. Select Name as the definition.
e. Select to move all tokens from the Available area to the Selected area.
f. Select below the Available area.
g. Select (double-arrow) to move all fields from the Available area to the Selected area.
h. Select to close the Additional Outputs window.
i. Select to close the Parse Properties window.
j. Select the Preview tab to view the results of the parse.
4. Concatenate the parsed fields.
a. Expand the Utilities grouping of nodes.
b. Double-click the Concatenate node.
The Concatenation Properties window opens.
c. Specify LastFirst as the output field.
d. Enter , (a comma and a space) as the value for Literal text.
e. Select Family Name, and then select to move it to the Concatenation list area.
f. Select next to Literal text to move the text to the Concatenation list area after
Family Name.
g. Select Given Name, and then select to move it to the Concatenation list area.
h. Select below the Available fields area.
i. Select (double-arrow) to move all fields from the Available area to the Selected area.
j. Select to close the Additional Outputs window.
k. Select to close the Concatenation Properties window.
The Preview tab is populated. Scroll to find the new LastFirst column.
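The Concatenation list built in the steps above amounts to Family Name, then the literal ", ", then Given Name. A minimal sketch of the output field (the `last_first` helper is hypothetical, not a DataFlux API):

```python
# Build the LastFirst output field: family name, a literal separator,
# then the given name, as configured in the Concatenation list.
def last_first(family: str, given: str, literal: str = ", ") -> str:
    return f"{family}{literal}{given}"

print(last_first("Bonski", "Igor"))  # Bonski, Igor
```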
A more complete picture of the concatenation might be gained by viewing an HTML report.
5. Add an HTML Report task to the job flow.
a. Expand the Data Outputs grouping of nodes.
b. Double-click the HTML Report node.
The HTML Report Properties window opens.
c. Enter Concatenation Results as the value for Report title.
d. Enter NewName as the value for Report name.
e. Select the check box for Display report in browser after job runs.
f. Deselect all columns from Selected. (Select .)
g. Move CONTACT, Given Name, Family Name, and LastFirst from the Available area to the
Selected area.
h. Select to close the HTML Report Properties window.
6. Run the entire job.
a. Select from the toolbar. The job processes, and the Run Job window opens with a status
indicator.
b. Select to close the Run Job window.
The appropriate browser opens and displays the HTML report.
c. Select File → Close to close the browser when you are finished viewing the report.
7. Change the case of the LastFirst field.
a. Select the HTML Report 1 node in the job flow.
b. Expand the Quality grouping of nodes.
c. Right-click Change Case and select Insert Before Selected.
The Case Properties window opens.
d. Move LastFirst from the Available area to the Selected area.
e. Select Proper as the type of casing to use.
f. Select Proper (Name) as the definition to use.
g. Select below the Available area.
h. Select (double-arrow) to move all fields from the Available area to the Selected area.
i. Select to close the Additional Outputs window.
j. Select to close the Case Properties window.
k. Select the Preview tab to view the results of the case change.
8. Update the HTML Report 1 node.
a. Double-click the HTML Report 1 node in the job flow to open the HTML Report Properties window.
b. Verify that the check box for Display report in browser after job runs is selected.
c. Deselect all columns from the Selected area. (Select .)
d. Move CONTACT, Given Name, Family Name, LastFirst, and LastFirst_Cased from the
Available area to the Selected area.
e. Select to close the HTML Report Properties window.
9. Run the entire job.
a. Select from the toolbar. The job processes, and the Run Job window opens with a status
indicator.
b. Select to close the Run Job window.
The appropriate browser opens and displays the HTML report.
c. Select File → Close to close the browser when you are finished viewing the report.
10. Save the job.
a. From the dfPower Architect menu, select File → Save As.
b. Enter DIDQ Contact Parse/Concatenation Job as the name.
c. Enter Parse then concatenation of Contact field as the description.
d. Select to close the Save As window.
11. Select File → Exit to close dfPower Architect.
12. Select Studio → Exit to close dfPower Studio.
12.3 Data Quality/Cleansing Using SAS
Objectives
Describe some SAS Data Quality Server functions.
List some basic examples using these functions.
SAS Data Quality Server Functions
The SAS Data Quality Server provides a set of functions that can be used to ensure quality data. Of these, several can be used to enhance the data:
DQIDENTIFY
DQGENDER
DQPARSE
DQPARSEINFOGET
DQPARSETOKENGET
DQCASE
%DQPUTLOC Macro
Each of these functions requires the specification of a definition as part of the syntax.
The %DQPUTLOC AUTOCALL macro provides a quick means of displaying current information in the SAS log for the specified locale that is loaded into memory at that time.
The available locale information includes a list of all definitions, parse tokens, related functions, and the names of the parse definitions that are related to each match definition.
%DQPUTLOC(locale <, SHORT=0|1> <, PARSEDEFN=0|1>);
where
locale specifies the locale of interest.
SHORT=0|1 optionally shortens the length of the entry in the SAS log. SHORT=1 removes the
descriptions of how the definitions are used. The default value is SHORT=0,
which displays the descriptions of how the definitions are used.
PARSEDEFN=0|1 optionally lists the related parse definition, if such a parse definition exists, with each gender analysis definition and each match definition. The default value PARSEDEFN=1 lists the related parse definition. PARSEDEFN=0 does not list the related parse definition.
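Putting the syntax above together, a minimal invocation might look like the following sketch. The DQSETUPLOC path is the one used elsewhere in this chapter and may differ on your installation.

```sas
/* Load the ENUSA locale into memory, then print its definitions    */
/* to the SAS log. SHORT=1 suppresses the usage descriptions, and   */
/* PARSEDEFN=0 omits the related parse definitions.                 */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt');
%DQPUTLOC(ENUSA, SHORT=1, PARSEDEFN=0);
```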
%DQPUTLOC Macro Example
If the ENUSA locale is loaded, the %DQPUTLOC macro returns information for the ENUSA definitions, such as the following:
/*----------------------------------------------------------*/
/* GENDER DEFINITION(S)                                     */
/*                                                          */
/* Gender definitions are used by the following:            */
/*    dqGender function                                     */
/*    dqGenderParsed function                               */
/*----------------------------------------------------------*/
Gender
/*----------------------------------------------------------*/
/* IDENTIFICATION DEFINITION(S)                             */
/*                                                          */
/* Identification definitions are used by the following:    */
/*    dqIdentify function                                   */
/*----------------------------------------------------------*/
Contact Info
Individual/Organization
Offensive
Identification Analysis in SAS
The DQIDENTIFY function returns a value that indicates the category of the content in an input character value. The available categories and return values depend on your choice of identification definition and locale.
DQIDENTIFY(char, 'identification-definition' <, 'locale'>)
where
char is the value that is transformed, according to the specified identification definition. The value can be the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.
identification-definition specifies the name of the identification definition, which must exist in the specified locale.
locale optionally specifies the name of the locale that contains the specified identification definition. The value can be a name in quotation marks, the name of a variable whose value is a locale name, or an expression that evaluates to a variable name or to a quoted locale name.
The specified locale must be loaded into memory as part of the locale list. If
no value is specified, the default locale is used. The default locale is the first
locale in the locale list.
Example of DQIDENTIFY Function
The following example determines whether a character value represents an individual or an organization.
data _null_;
   id=dqidentify('LL Bean', 'Individual/Organization', 'ENUSA');
   put id=;
run;
The value returned for ID in the SAS log would be ORGANIZATION.
Gender Analysis in SAS
The DQGENDER function evaluates the name of an individual to determine the gender of that individual. If the evaluation finds substantial clues that indicate gender, the function returns a value that indicates that the gender is female or male. If the evaluation is inconclusive, the function returns a value that indicates that the gender is unknown. The exact return value is determined by the specified gender analysis definition and locale.
DQGENDER(char, 'gender-analysis-definition' <, 'locale'>)
where
char is the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.
gender-analysis-definition specifies the name of the gender analysis definition, which must exist in
the specified locale.
locale optionally specifies the name of the locale that contains the specified
gender-analysis definition.
Example of DQGENDER Function
The following example determines the gender of an individual based on the name:
data _null_;
   Gender=DQGENDER('Mr. Malcolm A. Lackey', 'gender', 'ENUSA');
   put Gender=;
run;
The value returned for Gender in the SAS log would be M.
Parsing in SAS
The DQPARSE function returns a parsed character value. The return value contains delimiters that identify the elements in the value that correspond to the tokens that are enabled by the parse definition.
DQPARSE(char, 'parse-definition' <, 'locale'>)
where
char is the value that is parsed according to the parse definition. The value can be the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.
parse-definition specifies the name of the parse definition, which must exist in the specified locale.
locale optionally specifies the name of the locale that contains the specified parse definition.
Parsing in SAS
The DQPARSEINFOGET function returns the token names in a parse definition.
The DQPARSETOKENGET function returns a token from a parsed character value.
DQPARSEINFOGET('parse-definition' <, 'locale'>)
DQPARSETOKENGET(parsed-char, 'token', 'parse-definition' <, 'locale'>)
where
parsed-char is the parsed character value from which the value of the specified token is returned.
token specifies the name of the token that is returned from the parsed value.
parse-definition specifies the name of the parse definition, which must exist in the specified locale.
locale optionally specifies the name of the locale that contains the specified parse definition.
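The example that follows uses DQPARSE and DQPARSETOKENGET but never shows DQPARSEINFOGET itself. As an illustrative sketch (assuming the ENUSA locale is loaded), the token names of the NAME parse definition could be listed like this:

```sas
data _null_;
   /* Returns a delimited list of the token names (such as Name     */
   /* Prefix, Given Name, and Family Name) that the NAME parse      */
   /* definition provides. These names are what DQPARSETOKENGET     */
   /* expects in its 'token' argument.                              */
   tokens=dqparseinfoget('NAME', 'ENUSA');
   put tokens=;
run;
```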
Example of Parsing Functions
The following example parses a name into tokens and then retrieves two of the tokens:
data _null_;
   parsedValue=DQPARSE('Mrs. Sallie Mae Pravlik', 'NAME', 'ENUSA');
   prefix=DQPARSETOKENGET(parsedValue, 'Name Prefix', 'NAME', 'ENUSA');
   given=DQPARSETOKENGET(parsedValue, 'Given Name', 'NAME', 'ENUSA');
   put parsedValue= prefix= given=;
run;
The returned values in the SAS log would be as follows:
parsedValue=Mrs./=/Sallie/=/Mae/=/Pravlik/=//=/
prefix=Mrs.
given=Sallie
Changing Case in SAS
The DQCASE function returns a character value with standardized capitalization. The DQCASE function operates on any character content, such as names, organizations, and addresses. All instances of adjacent blank spaces are replaced with single blank spaces.
DQCASE(char, 'case-definition' <, 'locale'>)
where
char is the value that is transformed, according to the specified case definition.
case-definition specifies the name of the case definition that will be referenced during the
transformation.
locale optionally specifies the name of the locale that contains the specified case definition.
Example of DQCASE Function
The following example applies proper casing to an organization name:
data _null_;
   orgname=DQCASE("BILL's PLUMBING & HEATING", 'Proper', 'ENUSA');
   put orgname=;
run;
The value returned for orgname in the SAS log would be Bill's Plumbing & Heating.
Case Study Tasks
Analyze and Profile the Data
   Access and view the data.
   Create and execute profiling job(s).
Improve the Data
   Standardize data.
   Augment and validate data.
   Create match codes.
This demonstration illustrates the use of the SAS Data Quality Server functions to perform identification analysis, gender analysis, parsing, concatenation, and casing.
Case Study Tasks
Analyze and Profile the Data
   Access and view the data.
   Create and execute profiling job(s).
Improve the Data
   Standardize data.
   Augment and validate data.
   Create match codes.
Task performed using
Augmenting and Validating Data Using SAS
In this demonstration, you investigate four separate SAS programs. These programs investigate the use and results of the DQIDENTIFY, DQGENDER, DQPARSE, and DQCASE functions. To investigate the results from the programs, short FREQ or PRINT procedure steps are added.
1. Start a SAS session by selecting Start → All Programs → SAS BI Architecture → Start SAS.
2. If the Getting Started with SAS window opens, do the following:
a. Select Don't show this dialog box again.
b. Select .
The SAS Display Manager session opens.
Using the DQIDENTIFY Function
1. Verify that the Enhanced Editor window is active.
2. Select File → Open Program.
3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQIdentityFunctions.sas.
4. Select . The following program opens in the Enhanced Editor:
/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT=WORK.Prospects
            DATATABLE="NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   length Identity $1;
   label Identity='Customer Identity Type';
   Identity = dqidentify(contact, 'Individual/Organization');
run;
In this program, the following occurs:
The %DQLOAD macro loads the ENUSA locale into memory.
The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
The DATA step uses the DQIDENTIFY function to identify whether the value for the CONTACT
field is an individual, an organization, or not known.
5. Select Run → Submit to execute the SAS program.
6. Select View → Log to activate the Log window.
7. To view the scheme data set, do the following:
a. Select SAS Explorer.
b. Double-click on the Libraries icon.
c. Double-click on the Work library icon.
d. Double-click on the Std_prospects table to open it into a VIEWTABLE window.
e. Scroll to view the Customer Identity Type column.
f. Select File → Close to close the VIEWTABLE window.
8. Select Window → DQIdentityFunctions.sas.
9. Run a frequency report on the new identity column.
a. At the bottom of the program, after the RUN statement for the DATA step, uncomment the PROC FREQ step (that is, remove the /* before the step and the */ after the step). The PROC FREQ step is as shown:
proc freq;
   tables identity/nocum;
run;
b. Highlight only these three new lines and then select Run → Submit.
The following report surfaces:
Using the DQGENDER Function
1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.
2. Select File → Open Program.
3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQGenderFunctions.sas.
4. Select . The following program opens in the Enhanced Editor:
/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT=WORK.Prospects
            DATATABLE="NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set Prospects;
   /* use the GENDER function to determine gender based on name */
   length custgender $1;
   label custgender='Customer Gender';
   custgender = dqgender(contact, 'gender');
run;

PROC FREQ;
   tables custgender/nocum;
RUN;
In this program, the following occurs:
The %DQLOAD macro loads the ENUSA locale into memory.
The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
The DATA step uses the DQGENDER function to identify whether the value for the CONTACT field is M (male), F (female), or U (unknown).
The PROC FREQ step generates a report of frequency counts on the custgender column.
5. Select Run → Submit to execute the SAS program.
6. Select View → Log to activate the Log window. A portion of the DATA step and PROC FREQ step is
shown below:
7. Select View → Output to activate the Output window. The report shows the following:
Using the DQPARSE Function
1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.
2. Select File → Open Program.
3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQParseFunctions.sas.
4. Select . The following program opens in the Enhanced Editor:
/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\biarchitecture\Lev1\sasmain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT=WORK.Prospects
            DATATABLE="newcustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\dqdata\newcustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   Parsedname=dqparse(contact, 'NAME');
   Prefix=dqparsetokenget(parsedname, 'Name Prefix', 'NAME');
   First_name=dqparsetokenget(parsedname, 'Given Name', 'NAME');
   Last_name=dqparsetokenget(parsedname, 'Family Name', 'NAME');
run;

proc print;
   var prefix first_name last_name;
run;
In this program, the following occurs:
The %DQLOAD macro loads the ENUSA locale into memory.
The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft
Access database.
The DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the
CONTACT field.
The PROC PRINT step produces a listing report of the results of the DQPARSETOKENGET
function usage.
5. Select Run → Submit to execute the SAS program.
6. Select View → Log to activate the Log window. The portion for the DATA step and PROC PRINT
step is shown below:
7. Select View → Output to activate the Output window. The partial output is as follows:
Using the DQCASE Function
1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.
2. Select File → Open Program.
3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQPropercaseFunctions.sas.
4. Select . The following program opens in the Enhanced Editor:
/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT=WORK.Prospects
            DATATABLE="NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   ParsedName=dqParse(contact, 'NAME');
   Prefix=dqParseTokenGet(parsedName, 'Name Prefix', 'NAME');
   First_name=dqParseTokenGet(parsedName, 'Given Name', 'NAME');
   Last_name=dqParseTokenGet(parsedName, 'Family Name', 'NAME');
run;

data std_prospects;
   set std_prospects;
   length Contact2 $50;
   label Contact2='Re-formatted Prospect Name';
   Contact2 = trim(Last_Name) || ', ' || First_Name;
   length Contact3 $50;
   label Contact3='Proper Cased Re-formatted Prospect Name';
   Contact3 = dqcase(contact2, 'PROPER');
run;

proc print;
   var Contact Contact2 Contact3;
run;
In this program, the following occurs:
The %DQLOAD macro loads the ENUSA locale into memory.
The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
The first DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the value
for the CONTACT field.
The second DATA step uses the concatenation operator (||) to rebuild a Name field (Contact2).
The DQCASE function is then applied to resolve the Contact2 field to proper casing.
The PROC PRINT step produces a listing report of some parsed and concatenated information.
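As a side note, the || concatenation with TRIM in the second DATA step could also be written with the CATX function, which trims its arguments and joins them with the given delimiter. This is an alternative sketch, not part of the course program:

```sas
data std_prospects;
   set std_prospects;
   length Contact2 $50;
   /* CATX(', ', a, b) strips leading and trailing blanks from each */
   /* argument and joins them with ", ", equivalent here to         */
   /* trim(Last_Name) || ', ' || First_Name.                        */
   Contact2 = catx(', ', Last_Name, First_Name);
   length Contact3 $50;
   Contact3 = dqcase(Contact2, 'PROPER');
run;
```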
5. Select Run → Submit to execute the SAS program.
6. Select View → Log to activate the Log window. The portion for the DATA steps and PROC PRINT
step is shown below:
7. Select View → Output to activate the Output window. Partial output is shown below:
8. Close the SAS session and do not save any changes.
12.4 Exercises
1. Analyzing the NewCustomers Table
Use the NewCustomers table from the New Customers database to do the following:
Verify the type of information found for each record. (Identify records as individual or organization.)
Calculate gender information for each record.
Create a frequency report and a frequency report chart on both the identity and gender
information.
Parse the Contact field.
Add a field that contains a name string of the form Name_Prefix Given_Name Family_Name.
Save the job as DIDQ Ch5Ex1 NewCustomers Analysis.
12.5 Solutions to Exercises
1. Analyzing the NewCustomers Table
a. If necessary, invoke dfPower Studio by selecting Start → All Programs →
DataFlux dfPower Studio 7.1 → dfPower Studio.
b. Select Base from the toolbar, and then select Architect.
c. Expand the Data Inputs grouping of nodes.
d. Double-click the Data Source node.
1) Enter New Customers as the name.
2) Select next to Input table.
3) Expand the New Customers database and select the NewCustomers table.
4) Select to close the Select Table window.
5) Select (double-arrow) to move all fields from the Available area to the Selected area.
6) Select to close the Data Source Properties window.
e. With the data source node selected, select the Preview tab from the Details area (at the bottom of
the dfPower Architect interface). The data from this node is displayed.
f. Expand the Quality grouping of nodes.
g. Double-click the Identification Analysis node. The Identification Analysis Properties window
opens.
1) Move the CONTACT field from the Available area to the Selected area by double-clicking.
2) Double-click on the Definition column for the selected CONTACT field.
3) From the menu, select Individual/Organization.
4) Scroll in the Selected area to reveal that the results of the identification analysis will be placed
in the field CONTACT_Identity.
5) Select below the Available area.
6) Select (double-arrow) to move all fields from the Available area to the Selected area.
7) Select to close the Additional Outputs window.
8) Select to close the Identification Analysis Properties window.
h. Preview the results of the Identification Analysis.
1) Verify that Identification Analysis is selected.
2) Select the Preview tab at the bottom of dfPower Architect interface.
3) Scroll to the right to view the information populated for the CONTACT_Identity field.
i. Expand the Quality grouping of nodes.
j. Double-click on the Gender Analysis node. The Gender Analysis Properties window opens.
1) Move the CONTACT field from the Available area to the Selected area by double-clicking.
2) Double-click on the Definition column for the selected CONTACT field.
3) Select Gender.
4) Scroll in the Selected area to reveal that the results of the gender analysis will be placed
in the field CONTACT_Gender.
5) Select below the Available area.
6) Select (double-arrow) to move all fields from the Available area to the Selected area.
7) Select to close the Additional Outputs window.
8) Select to close the Gender Analysis Properties window.
k. Expand the Profiling grouping of nodes.
l. Double-click the Frequency Distribution node.
1) The Frequency Distribution Properties window opens.
2) Move CONTACT_Identity and CONTACT_Gender from the Available area to the Selected
area.
3) Select to close the Frequency Distribution Properties window. The Preview tab is populated with the frequency report.
m. Expand the Data Outputs grouping of nodes.
n. Double-click the Frequency Distribution Chart node.
1) Select next to Chart name to choose a location for the output.
2) Navigate to S:\Workshop\winsas\didq.
3) Enter New Customers Chart as the value for File name.
4) Select to close the Save As window.
5) Enter Gender & Identity Distribution from New Customers as the title for the chart.
6) Move both CONTACT_Identity and CONTACT_Gender from the Available area to the
Selected area.
7) Select to close the Frequency Distribution Chart Properties window. The Preview
tab is populated with the frequency report.
o. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator.
1) Select to close the Run Job window. The Chart Viewer window opens.
2) Select to scroll to the next chart for CONTACT_Gender.
3) Select File → Exit to close the Chart Viewer window.
p. Save the job.
1) From the dfPower Architect menu, select File → Save As.
2) Enter DIDQ Ch5Ex1 NewCustomers Analysis as the name.
3) Enter New Customer Analysis as the description.
4) Select to close the Save As window.
q. Select the Frequency Distribution 1 node in the job flow.
r. Expand the Quality grouping of nodes.
s. Right-click the Parsing node and select Insert Before Selected.
1) Select CONTACT as the field to parse.
2) Select Name as the definition.
3) Select to move all tokens from the Available area to the Selected area.
4) Select below the Available area.
5) Select to move all fields from the Available area to the Selected area.
6) Select to close the Additional Outputs window.
7) Select to close the Parse Properties window.
t. Expand the Utilities grouping of nodes.
u. Double-click the Concatenate node. The Concatenation Properties window opens.
1) Specify PreFirstLast as the output field.
2) Enter (a space) as the value for Literal text.
3) Select Name Prefix, and then select to move it to the Concatenation list area.
4) Select next to Literal text to move the text to the Concatenation list area after
Name Prefix.
5) Select Given Name, and then select to move it to the Concatenation list area.
6) Select next to Literal text to move the text to the Concatenation list area after Given Name.
7) Select Family Name, and then select to move it to the Concatenation list area.
8) Select below the Available fields area.
9) Select to move all fields from the Available area to the Selected area.