Chp 12 - Data Cleansing Additional Functionality

Upload: kaven

Post on 05-Apr-2018


  • 8/2/2019 Chp 12 - Data Cleansing Additional Functionality

    1/91

Chapter 12: Data Cleansing Additional Functionality


    12.1 Additional Data Quality/Cleansing Techniques

Objectives

Discuss some additional data quality/cleansing techniques.

Data Quality/Cleansing

The following are additional techniques that can be used to further enhance data quality:

Identification analysis
Gender analysis
Parsing
Concatenating
Casing


Data Quality/Cleansing

Casing: Control whether a text string is represented as all capital letters or in mixed case.

Concatenating: Given two (or more) text strings, concatenate the values into one string.

Parsing: Given a text string, parse the string into its individual elements.

Gender Analysis: Based on a person's name, determine the gender.

Identification Analysis: Based on a given name string, determine whether the name represents an individual or an organization.

Identification Analysis

Identification analysis enables you to compare information from the QKB with undetermined fields in your data to determine whether each field contains the following:

For name information: an individual's name, an organization's name, or empty.

For address information: a street address, city/state/ZIP information, or empty.


Identification Analysis

For data fields containing name data, identification analysis returns INDIVIDUAL, ORGANIZATION, or UNKNOWN.

For data fields containing address data, identification analysis returns one of the following:

ACCT (account number type information)
ADDR (address line 1)
ADDR2 (address line 2)
ATTN (attention line)
BLANK (blank or null values)
CSZ (city/state/ZIP)
IND (an individual's name)
ORG (organization type information)
UNK (unknown)
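The classification idea can be sketched outside the product. The following Python sketch is illustrative only: the keyword list is invented, and a real identification analysis definition draws on QKB vocabularies and patterns rather than this simple heuristic.

```python
# Minimal sketch of name-field identification analysis.
# A real QKB definition uses rich vocabularies and grammar rules;
# this small keyword heuristic merely stands in for that logic.

ORG_KEYWORDS = {"inc", "inc.", "llc", "ltd", "corp", "corp.",
                "company", "co.", "associates", "university"}

def identify_name(value: str) -> str:
    """Classify a name field as INDIVIDUAL, ORGANIZATION, or UNKNOWN."""
    if not value or not value.strip():
        return "UNKNOWN"          # blank values cannot be classified
    words = value.lower().split()
    if any(w in ORG_KEYWORDS for w in words):
        return "ORGANIZATION"
    if 2 <= len(words) <= 5:      # e.g. "Igor Bela Bonski"
        return "INDIVIDUAL"
    return "UNKNOWN"

print(identify_name("Acme Tools Inc."))   # ORGANIZATION
print(identify_name("Igor Bela Bonski"))  # INDIVIDUAL
print(identify_name(""))                  # UNKNOWN
```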

Gender Analysis

Gender analysis determines whether a particular name is most likely feminine, masculine, or unknown.

The results are placed in a new field and have three possible values:

"M" for male
"F" for female
"U" for unknown
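A minimal sketch of the same idea, assuming a small invented lookup table of given names (the product derives this from QKB gender definitions instead):

```python
# Minimal sketch of gender analysis keyed on the given name.
# The lookup table below is an invented sample, not product data.

KNOWN_GENDERS = {"igor": "M", "linwood": "M", "maria": "F", "susan": "F"}

def gender_of(name: str) -> str:
    """Return "M", "F", or "U" for a full name, keyed on the first word."""
    words = name.strip().lower().split()
    if not words:
        return "U"
    return KNOWN_GENDERS.get(words[0], "U")

print(gender_of("Maria Gomez"))  # F
print(gender_of("Pat Smith"))    # U (given name not in the lookup)
```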


Parsing Data

Parsing is a simple but intelligent tool for separating a multi-part field value into multiple, single-part fields (tokens).

Each token is identified based on its individual contribution to the overall field.

Name: Mr. Linwood Leroy Bubar, III, M.D.

NamePrefix: Mr.
GivenName: Linwood
MiddleName: Leroy
FamilyName: Bubar
NameSuffix: III
NameAppendage: M.D.
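The token assignment above can be sketched with a toy parser. The prefix, suffix, and appendage sets below are invented samples; real parse definitions use full vocabularies and grammar rules.

```python
# Minimal sketch of parsing a name into tokens.
# The vocabularies here are tiny illustrative samples.

PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr.", "sr.", "ii", "iii", "iv"}
APPENDAGES = {"m.d.", "ph.d.", "esq."}

def parse_name(value: str) -> dict:
    words = [w for w in value.replace(",", " ").split() if w]
    tokens = {"NamePrefix": "", "GivenName": "", "MiddleName": "",
              "FamilyName": "", "NameSuffix": "", "NameAppendage": ""}
    if words and words[0].lower() in PREFIXES:
        tokens["NamePrefix"] = words.pop(0)
    while words and words[-1].lower() in APPENDAGES:
        tokens["NameAppendage"] = words.pop()
    while words and words[-1].lower() in SUFFIXES:
        tokens["NameSuffix"] = words.pop()
    if words:
        tokens["GivenName"] = words.pop(0)
    if words:
        tokens["FamilyName"] = words.pop()
    tokens["MiddleName"] = " ".join(words)  # whatever remains in the middle
    return tokens

print(parse_name("Mr. Linwood Leroy Bubar, III, M.D."))
```

Running it on the slide's example yields the same six tokens shown above.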

Concatenating Data

Concatenating is essentially the opposite of the parse step. Rather than separating a single field into multiple fields, concatenating combines one or more fields into a single field.

Given Name: Igor
Middle Name: Bela
Family Name: Bonski

Concatenated Name: Igor Bela Bonski
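A minimal sketch of concatenation, with empty parts skipped so that a missing token does not leave a doubled separator:

```python
# Minimal sketch of concatenation: joining field values back into
# one field, skipping any empty parts.

def concatenate(*parts: str, sep: str = " ") -> str:
    return sep.join(p for p in parts if p)

print(concatenate("Igor", "Bela", "Bonski"))  # Igor Bela Bonski
print(concatenate("Igor", "", "Bonski"))      # Igor Bonski (no middle name)
```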


Casing

Changing case enables you to make all alphabetical values in a field UPPERCASE, lowercase, or Proper Case.

Proper case treats a field value as a proper name; that is, the first letter of each word is capitalized, with the remaining characters in lowercase.

As with standardization, changing case can make field values more consistent.
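In Python terms, the three options map onto the standard string methods. Note that str.title() only approximates a true proper-case definition: it capitalizes after every non-letter, so "o'neill" becomes "O'Neill" but "mcdonald" becomes "Mcdonald"; product casing definitions handle such exceptions explicitly.

```python
# Sketch of the three casing options on a sample field value.

value = "bonski, IGOR bela"
print(value.upper())  # BONSKI, IGOR BELA
print(value.lower())  # bonski, igor bela
print(value.title())  # Bonski, Igor Bela
```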

Applying Techniques

These data quality/cleansing techniques can be applied using the following:

dfPower Studio's dfPower Architect
the SAS Data Quality Server functions as column-level transformations with SAS Data Integration Studio
the SAS Data Quality Server functions within a SAS programming environment

Because the SAS Data Quality Server functions are the same whether surfaced in SAS Data Integration Studio or in a SAS session, this chapter looks at these functions only in a SAS session.


    12.2 Data Quality/Cleansing Using dfPower Architect

Objectives

Describe the functionality of dfPower Architect.
Explore various job flow steps that are available to use.
Discuss the sequence of steps for building a job.

dfPower Architect: Introduction

dfPower Architect brings much of the functionality of the other dfPower Studio applications, as well as some unique functionality, into a single, intuitive user interface.

To use dfPower Architect, you specify operations by selecting job flow steps and then configuring those steps to meet your specific data needs. The steps you choose are displayed as job flow icons, which together form a visual job flow.


dfPower Architect

With dfPower Architect, you can perform the following tasks:

identify and connect to multiple data sources, whether those sources are local, over a network on a different platform, or at a remote location
choose and configure job flow nodes for processing your data
reconfigure existing job flow nodes as needed
view sample processed data at each job flow node
specify a variety of output options, including reports and new data sources
run a job flow with a single click

Accessing dfPower Architect

dfPower Architect is invoked from the toolbar of dfPower Studio by selecting Base → Architect.

...


    dfPower Architect Interface

(Screenshot: the Nodes List and the Job Flow Area.)

    ...

Job Flow Steps

dfPower Architect's available job flow steps are grouped into nine categories:


Job Flow Steps: Data Inputs

Job flow steps in the Data Inputs category:

...

Data Source: identifies existing data sets to process.
SQL Query: identifies existing data sets to process using SQL.
Text File Input: accesses data in a plain-text file.
Fixed Width File Input: accesses data in a text file where the input is separated into fixed-width columns.
External Data Provider: enables services for applications or processes that want to pass data into dfPower Architect one record at a time; can also be used to call other Architect job flows within a job when used in conjunction with the Embedded Job node.
Table Metadata: extracts meta information from a specific table within a database.
SAS Data Set: identifies existing SAS data sets to process on the Microsoft Windows platform.
SAS SQL Query: identifies existing data sets to process, as with the SAS Data Set node; this step, however, enables you to use SQL to select data.

Job Flow Steps: Data Outputs

Job flow steps in the Data Outputs category:

...

Data Target (Update): updates existing data rather than creating a new data source or replacing an existing source.
Data Target (Insert): outputs data in a variety of data formats to a new data source, leaving your existing data as is or overwriting your existing data.
Delete Record: eliminates records from a data source using the unique key of those records.
HTML Report: creates an HTML-formatted report from the results of your job flow.
Text File Output: creates a plain-text file with the results of your job flow.
Fixed Width File Output: outputs your data to well-defined fixed-width columns in your output file.
Frequency Distribution Chart: creates a chart that shows how selected values are distributed throughout your data.
Match Report: generates a match report that can then be displayed with the Match Report Viewer.
dfPower Merge File Output: writes clustered data to a dfPower Merge file for use in dfPower Merge.


Job Flow Steps: Utilities

Job flow steps in the Utilities category:

...

COM Plugin: adds COM (Component Object Model) plug-ins to your job flows.
Data Sorting: reorders your data set at any point in a job flow.
Expression: runs a Visual Basic-like language to process your data sets in ways that are not built into dfPower Studio.
Data Joining: joins two tables using a unique key.
Data Joining (Non-Key): is used when you have two tables, each with the same number of records, and you want to join them by location in the file rather than by a unique key.
Data Union: uses Data Joining to combine two data sets in an intelligent way so that the records of one, the other, or both data sets are used as the basis for the resulting data set.
Concatenate: performs essentially the opposite of the Parse node; rather than separating a single field into multiple fields, Concatenate combines one or more fields into a single field.
Embedded Job: embeds another dfPower Architect job in your current job flow.
Sequencer (Autonumber): creates a sequence of numbers given a starting number and a specified interval.
SQL Lookup: finds rows in a database table that have one or more fields matching those in the job flow.
SQL Execute: enables you to construct and execute any valid SQL statement (or series of statements); generally used to perform database-specific tasks before, after, or between Architect job flows; a stand-alone node (no parents or children).
Field Layout: enables you to rename and reorder field names as they pass out of this node.
Parameterized SQL Query: provides a way to write an SQL query that contains variable inputs, also known as parameters.


Job Flow Steps: Profiling

Job flow steps in the Profiling category:

...

Data Validation: analyzes the content of data by setting validation conditions.
Pattern Analysis: performs pattern analysis.
Basic Statistics: calculates basic statistics.
Frequency Distribution: creates a frequency distribution.
Basic Pattern Analysis: provides the ability to run pattern analysis in a manner very similar to the way it is run in dfPower Profile. (In contrast to advanced pattern analysis, the simplified version does not employ Blue Fusion pattern identification definitions.)

Job Flow Steps: Quality

Job flow steps in the Quality category:

...

Gender Analysis: performs gender analysis.
Gender Analysis (Parsed): performs gender analysis on parsed information.
Identification Analysis: performs identification analysis.
Parsing: parses a field.
Standardization: performs standardization of fields of data.
Standardization (Parsed): performs standardization of fields of parsed information.
Change Case: enables the case of field values to be set.
Locale Guessing: attempts to guess the appropriate locale based on field information.
Right Fielding: identifies the contents of fields and copies the data to fields with more descriptive names.


Job Flow Steps: Integration

Job flow steps in the Integration category:

...

Match Code: generates match codes.
Match Codes (Parsed): generates match codes on parsed information.
Clustering: generates clusters.
Cluster Update: enables new records to be integrated with existing clusters.
Surviving Record Identification: examines clustered data and determines a surviving record for each cluster.
Cluster Diff: compares sets of clustered records.
Exclusive Real Time Clustering (ERTC): facilitates the near real-time addition of new rows to previously clustered data.
Concurrent Real Time Clustering (CRTC): is similar to the ERTC node in its outcomes; the difference is that the ERTC node interacts directly with the cluster state file, while the CRTC node interacts with a server that interacts with the cluster state file.


Job Flow Steps: Enrichment

Job flow steps in the Enrichment category:

...

Address Verification (US/Canada): verifies, corrects, and enhances U.S. and Canadian addresses in your existing data.
Address Verification (QAS): performs address verification on addresses from outside the U.S. and Canada.
Address Verification (World): performs address verification on addresses from outside the U.S. and Canada. (This step is similar to Address Verification (QAS) but supports verification and correction for addresses from more countries.)
Geocoding: matches geographic information from the geocode reference database with ZIP codes in your data to determine latitude, longitude, census tract, FIPS (Federal Information Processing Standard), and block information.
County: matches information from the phone and geocode reference databases with FIPS codes in your data to calculate several values.
Phone: matches information from the phone reference database with telephone numbers in your data.
Area Code: matches information from the phone reference database with ZIP codes in your data to calculate several values, primarily area code, but also Overlay1, Overlay2, Overlay3, and Result.


Job Flow Steps: Enrichment (Distributed)

Job flow steps in the Enrichment (Distributed) category:

...

Distributed Geocoding: offloads geocode processing to a machine other than the one running the current dfPower Architect job.
Distributed Address Verification: offloads address verification processing to a machine other than the one running the current dfPower Architect job.
Distributed Phone: offloads phone data processing to a machine other than the one running the current dfPower Architect job.
Distributed Area Code: offloads area code data processing to a machine other than the one running the current dfPower Architect job.
Distributed County: offloads county data processing to a machine other than the one running the current dfPower Architect job.


Job Flow Steps: Monitoring

Job flow steps in the Monitoring category:

...

Data Monitoring: enables you to analyze data according to business rules that you create using the Business Rule Manager. The business rules that you create in Rule Manager can analyze the structure of the data and trigger an event, such as logging a message or sending an e-mail alert, when a condition is detected.

Getting Started with dfPower Architect

A typical dfPower Architect session consists of the following:

1. Plan the job flow.
2. Select the input data.
3. Build the job flow.
4. Specify the output.
5. Process the job flow.

Getting Started with dfPower Architect

A typical dfPower Architect session consists of the following:

1. Plan the job flow: identify how the data is to be processed.
2. Select the input data: select input data source(s) and/or manipulate with SQL.
3. Build the job flow: select and configure job flow nodes.
4. Specify the output: identify the type of output, and where the output is to be saved.
5. Process the job flow: select to begin processing.


Case Study Tasks

Analyze and Profile the Data:
Access and view the data.
Create and execute profiling job(s).

Improve the Data:
Standardize data.
Augment and validate data.
Create match codes.

This demonstration illustrates the use of dfPower Architect to perform identification analysis, gender analysis, parsing, concatenation, and casing. In addition, other nodes are investigated (frequency distribution, frequency distribution chart, and HTML report).

These case study tasks are performed using dfPower Studio 7.1 from DataFlux.


Augmenting and Validating Data Using dfPower Architect

In this demonstration, first establish a data source to work with. Then run an identification analysis on a name field from this data source, with the results used to generate frequency counts of the identified types of data. After you decide that the majority of data in the name field are individual names, run a gender analysis, with the results of this also used to generate frequency counts. As a last step, use the results from the identification and gender analyses to generate a pie chart.

1. If necessary, invoke dfPower Studio by selecting Start → All Programs → DataFlux dfPower Studio 7.1 → dfPower Studio.

2. Select Base from the toolbar, and then select Architect.


    Identification and Gender Analysis

    1. Add a data source to the job flow.

    a. Expand the Data Inputs grouping of nodes.

    b. Double-click the Data Source node.


    The Data Source node is added to the job flow, and the Data Source Properties window opens.

To add a node to the job flow diagram, you can do the following:

double-click
drag and drop
right-click and select Insert on Page


    c. Specify properties for the Data Source node.

    1) Enter Contacts as the name.

    2) Select next to Input table.

    3) Expand the DataFlux Sample database and select the Contacts table.

    4) Select to close the Select Table window.


    The Data Source Properties window shows available fields from the Contacts table.

    5) Select (double-arrow) to move all fields from the Available area to the Selected area.


    6) Select to close the Data Source Properties window.

    The job flow diagram is updated to a display that resembles what is shown below:

2. With the data source node selected, select the Preview tab from the Details area (at the bottom of the dfPower Architect interface). The data from this node is displayed:


    3. Perform an Identification Analysis using the Contact field.

    a. Expand the Quality grouping of nodes.

    b. Double-click the Identification Analysis node.


    The Identification Analysis Properties window opens.

    c. Move the CONTACT field from the Available area to the Selected area by double-clicking.

    d. Double-click on the Definition column for the selected CONTACT field.

    e. From the menu, select Individual/Organization.

    f. Scroll in the Selected area to reveal that the results of the identification analysis are placed in the

    field CONTACT_Identity.

    g. Select below the Available area.


    h. Select (double-arrow) to move all fields from the Available area to the Selected area.

    i. Select to close the Additional Outputs window.

    j. Select to close the Identification Analysis Properties window.

    4. Preview the results of the Identification Analysis.

    a. Verify that the Identification Analysis node is selected.

b. Select the Preview tab at the bottom of the dfPower Architect interface.

    c. Scroll to the right to view the information populated for the CONTACT_Identity:

    Although this preview is a good indication of the overall data values, it would be desirable to

    verify that there are no odd data values.


    5. Add a Frequency Distribution task to the job flow.

    a. Expand the Profiling grouping of nodes.

    b. Double-click the Frequency Distribution node.

    The Frequency Distribution Properties window opens.


    c. Move CONTACT_Identity from the Available area to the Selected area.

    d. Select to close the Frequency Distribution Properties window. The Preview tab is

    populated with the frequency report.

If you are satisfied that the majority (99%) of the observations represent individuals, you can proceed with a gender analysis.
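Conceptually, the frequency report produced here simply counts the distinct values of CONTACT_Identity. The following sketch illustrates the idea with invented sample values (the real counts come from the Contacts table):

```python
# Sketch of a frequency distribution over an identity field.
# The sample values below are invented for illustration only.
from collections import Counter

contact_identity = ["INDIVIDUAL"] * 97 + ["ORGANIZATION"] * 2 + ["UNKNOWN"]

for value, count in Counter(contact_identity).most_common():
    print(f"{value:12} {count:4} {count / len(contact_identity):6.1%}")
```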


    6. Perform a gender analysis using the Contact field.

    a. Verify that the Frequency Distribution 1 node is selected in the job flow diagram.

    b. Expand the Quality grouping of nodes.

    c. Right-click on the Gender Analysis node and select Insert Before Selected.


    The Gender Analysis Properties window opens.

    d. Move the CONTACT field from the Available area to the Selected area by double-clicking.

    e. Double-click on the Definition column for the selected CONTACT field.

f. Select Gender.

g. Scroll in the Selected area to reveal that the results of the gender analysis are placed in the field CONTACT_Gender.

    h. Select below the Available area.

    i. Select (double-arrow) to move all fields from the Available area to the Selected area.

    j. Select to close the Additional Outputs window.

k. Select to close the Gender Analysis Properties window.


    7. Update the properties of the Frequency Distribution to include the CONTACT_Gender field.

a. Right-click Frequency Distribution 1 in the job flow and select Properties.

    b. Move the CONTACT_Gender field from the Available area to the Selected area.

    c. Select to close the Frequency Distribution Properties window. The Preview tab is

    populated with the frequency report.

    A more visual approach for viewing the results uses a graphic representation of the information.


    8. Add a Frequency Distribution Chart task to the job flow.

    a. Expand the Data Outputs grouping of nodes.

    b. Double-click the Frequency Distribution Chart node.


    The Frequency Distribution Chart Properties window opens.

    c. Select next to Chart name to choose a location for the output.

    1) Navigate to S:\Workshop\winsas\didq.

2) Enter Contacts Gender Identity Chart as the value for File name.

    3) Select to close the Save As window.


    d. Enter Gender & Identity Distribution from Contacts as the title for the chart.

    e. Move both CONTACT_Identity and CONTACT_Gender from the Available area to the

    Selected area.

    f. Select to close the Frequency Distribution Chart Properties window. The Preview tab

    is populated with the frequency report.


    9. Run the entire job.

    a. Select from the toolbar. The job processes, and the Run Job window opens with a status

    indicator:

    b. Select to close the Run Job window.


    The Chart Viewer window opens.


c. Select to scroll to the next chart for CONTACT_Gender.

d. Select File → Exit to close the Chart Viewer window.


    10. Save the job.

a. From the dfPower Architect menu, select File → Save As.

b. Enter DIDQ Contact Gender/Identity Analysis as the name.

    c. Enter Gender & Identity Analysis for Contacts table as the description.

    d. Select to close the Save As window.


    Parsing, Concatenation, and Casing

Name fields are often populated in a variety of ways: sometimes as FIRST MIDDLE LAST, and other times as LAST, FIRST. Parsing enables you to break a name field into portions. Concatenation can rejoin the name field in a consistent fashion. After the field values are available in a consistent pattern, it is useful to put the data in the correct case.

1. Start a new job by selecting File → New.

    2. Add a data source to the job flow:

    a. Expand the Data Inputs grouping of nodes.

    b. Double-click the Data Source node. The Data Source Properties window opens.

    c. Specify properties for the Data Source node.

    1) Enter Contacts as the name.

    2) Select next to Input table.

    3) Expand the DataFlux Sample database and then select the Contacts table.

    4) Select to close the Select Table window.

    The Data Source Properties window shows available fields from the Contacts table.

    5) Select (double-arrow) to move all fields from the Available area to the Selected area.

    6) Select to close the Data Source Properties window.


    3. Parse the Contact field.

    a. Expand the Quality grouping of nodes.

    b. Double-click the Parsing node.


    The Parse Properties window opens.


    c. Select CONTACT as the field to parse.

    d. Select Name as the definition.

    e. Select to move all tokens from the Available area to the Selected area.

    f. Select below the Available area.

    g. Select (double-arrow) to move all fields from the Available area to the Selected area.

    h. Select to close the Additional Outputs window.

    i. Select to close the Parse Properties window.


    j. Select the Preview tab to view the results of the parse.


    4. Concatenate the parsed fields.

    a. Expand the Utilities grouping of nodes.

    b. Double-click the Concatenate node.


    The Concatenation Properties window opens.


    c. Specify LastFirst as the output field.

d. Enter , (a comma and a space) as the value for Literal text.

    e. Select Family Name, and then select to move it to the Concatenation list area.

    f. Select next to Literal text to move the text to the Concatenation list area after

    Family Name.

    g. Select Given Name, and then select to move it to the Concatenation list area.

    h. Select below the Available fields area.

    i. Select (double-arrow) to move all fields from the Available area to the Selected area.

    j. Select to close the Additional Outputs window.

    k. Select to close the Concatenation Properties window.


    The Preview tab is populated. Scroll to find the new LastFirst column.

    A more complete picture of the concatenation might be gained by viewing an HTML report.


    5. Add an HTML Report task to the job flow.

    a. Expand the Data Outputs grouping of nodes.

    b. Double-click the HTML Report node.


    The HTML Report Properties window opens.


c. Enter Concatenation Results as the value for Report title.

d. Enter NewName as the value for Report name.

e. Select the check box for Display report in browser after job runs.

    f. Deselect all columns from Selected. (Select .)

    g. Move CONTACT, Given Name, Family Name, and LastFirst from the Available area to the

    Selected area.

    h. Select to close the HTML Report Properties window.


    6. Run the entire job.

    a. Select from the toolbar. The job processes, and the Run Job window opens with a status

    indicator.

    b. Select to close the Run Job window.

    The appropriate browser opens and displays the HTML report.

c. Select File → Close to close the browser when you are finished viewing the report.


    7. Change the case of the LastFirst field.

    a. Select the HTML Report 1 node in the job flow.

    b. Expand the Quality grouping of nodes.

c. Right-click Change Case and select Insert Before Selected.

    The Case Properties window opens.


    d. Move LastFirst from the Available area to the Selected area.

    e. Select Proper as the type of casing to use.

    f. Select Proper (Name) as the definition to use.

    g. Select below the Available area.

    h. Select (double-arrow) to move all fields from the Available area to the Selected area.

    i. Select to close the Additional Outputs window.

j. Select to close the Case Properties window.


k. Select the Preview tab to view the results of the casing.

    8. Update the HTML Report 1 node.

a. Double-click on the HTML Report 1 node in the job flow to open the HTML Report Properties window.

b. Verify that the check box for Display report in browser after job runs is selected.

    c. Deselect all columns from the Selected area. (Select .)

    d. Move CONTACT, Given Name, Family Name, LastFirst, and LastFirst_Cased from the

    Available area to the Selected area.

    e. Select to close the HTML Report Properties window.


    9. Run the entire job.

    a. Select from the toolbar. The job processes, and the Run Job window opens with a status

    indicator.

    b. Select to close the Run Job window.

    The appropriate browser opens and displays the HTML report.

c. Select File → Close to close the browser when you are finished viewing the report.


    10. Save the job.

a. From the dfPower Architect menu, select File → Save As.

b. Enter DIDQ Contact Parse/Concatenation Job as the name.

c. Enter Parse then concatenation of Contact field as the description.

    d. Select to close the Save As window.

11. Select File → Exit to close dfPower Architect.

12. Select Studio → Exit to close dfPower Studio.


    12.3 Data Quality/Cleansing Using SAS

38

    Objectives Describe some SAS Data Quality Server functions.

    List some basic examples using these functions.

39

SAS Data Quality Server Functions

The SAS Data Quality Server provides a set of functions that can be
used to ensure quality data. Of these, several can be used to
enhance the data:

    DQIDENTIFY

    DQGENDER

    DQPARSE

    DQPARSEINFOGET

    DQPARSETOKENGET

    DQCASE


40

%DQPUTLOC Macro

Each of these functions requires the specification of a definition
as part of the syntax.

The %DQPUTLOC AUTOCALL macro provides a quick means of displaying
current information in the SAS log for the specified locale that is
loaded into memory at that time.

The available locale information includes a list of all definitions,
parse tokens, related functions, and the names of the parse
definitions that are related to each match definition.

%DQPUTLOC(locale <, SHORT=0|1> <, PARSEDEFN=0|1>);

    where

    locale specifies the locale of interest.

    SHORT=0|1 optionally shortens the length of the entry in the SAS log. SHORT=1 removes the

    descriptions of how the definitions are used. The default value is SHORT=0,

    which displays the descriptions of how the definitions are used.

PARSEDEFN=0|1 optionally lists the related parse definition, if such a parse definition exists, with each gender analysis definition and each match definition. The default value

    PARSEDEFN=1 lists the related parse definition. PARSEDEFN=0 does not list

    the related parse definition.
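Putting this syntax together, a minimal sketch of a %DQPUTLOC call might look like the following. The %DQLOAD call and the dqsetup.txt path are borrowed from the demonstrations later in this chapter; the keyword settings shown are assumptions based on the syntax description above.

```sas
/* Load the ENUSA locale, then write a shortened locale summary   */
/* (no usage descriptions, no related parse definitions) to the   */
/* SAS log.                                                       */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt');
%DQPUTLOC(ENUSA, SHORT=1, PARSEDEFN=0);
```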


41

%DQPUTLOC Macro Example

If the ENUSA locale is loaded, the %DQPUTLOC macro returns
information for the ENUSA definitions, such as the following:

/*----------------------------------------------------------*/
/* GENDER DEFINITION(S)                                     */
/*                                                          */
/* Gender definitions are used by the following:            */
/*    dqGender function                                     */
/*    dqGenderParsed function                               */
/*----------------------------------------------------------*/
Gender

/*----------------------------------------------------------*/
/* IDENTIFICATION DEFINITION(S)                             */
/*                                                          */
/* Identification definitions are used by the following:    */
/*    dqIdentify function                                   */
/*----------------------------------------------------------*/
Contact Info
Individual/Organization
Offensive


42

Identification Analysis in SAS

The DQIDENTIFY function returns a value that indicates the category
of the content in an input character value. The available categories
and return values depend on your choice of identification definition
and locale.

DQIDENTIFY(char, 'identification-definition' <, 'locale'>)

    where

char is the value that is transformed, according to the specified identification definition. The value can be the name of a character variable, a character

    value in quotation marks, or an expression that evaluates to a variable name

    or a quoted value.

identification-definition specifies the name of the identification definition, which must exist in the specified locale.

locale optionally specifies the name of the locale that contains the specified identification definition. The value can be a name in quotation marks, the

    name of a variable whose value is a locale name, or an expression that

    evaluates to a variable name or to a quoted locale name.

    The specified locale must be loaded into memory as part of the locale list. If

    no value is specified, the default locale is used. The default locale is the first

    locale in the locale list.


43

Example of DQIDENTIFY Function

The following example determines whether a character value
represents an individual or an organization.

The value returned for ID in the SAS log would be ORGANIZATION.

data _null_;
   id=dqidentify('LL Bean', 'Individual/Organization', 'ENUSA');
   put id=;
run;


44

Gender Analysis in SAS

The DQGENDER function evaluates the name of an individual to
determine the gender of that individual. If the evaluation finds
substantial clues that indicate gender, the function returns a value
that indicates that the gender is female or male. If the evaluation
is inconclusive, the function returns a value that indicates that
the gender is unknown. The exact return value is determined by the
specified gender analysis definition and locale.

DQGENDER(char, 'gender-analysis-definition' <, 'locale'>)

    where

char is the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.

    gender-analysis-definition specifies the name of the gender analysis definition, which must exist in

    the specified locale.

    locale optionally specifies the name of the locale that contains the specified

    gender-analysis definition.


45

Example of DQGENDER Function

The following example determines the gender of an individual based
on the name:

The value returned for Gender in the SAS log would be M.

data _null_;
   Gender=DQGENDER('Mr. Malcolm A. Lackey', 'gender', 'ENUSA');
   put Gender=;
run;


46

Parsing in SAS

The DQPARSE function returns a parsed character value. The return
value contains delimiters that identify the elements in the value
that correspond to the tokens that are enabled by the parse
definition.

DQPARSE(char, 'parse-definition' <, 'locale'>)

    where

char is the value that is parsed according to the parse definition. The value can be the name of a character variable, a character value in quotation marks, or an expression

    that evaluates to a variable name or a quoted value.

    parse-definition specifies the name of the parse definition, which must exist in the specified locale.

locale optionally specifies the name of the locale that contains the specified parse definition.


47

Parsing in SAS

The DQPARSEINFOGET function returns the token names in a parse
definition.

The DQPARSETOKENGET function returns a token from a parsed
character value.

DQPARSEINFOGET('parse-definition' <, 'locale'>)

    where

    parse-definition specifies the name of the parse definition, which must exist in the specified locale.

locale optionally specifies the name of the locale that contains the specified parse definition.

DQPARSETOKENGET(parsed-char, 'token', 'parse-definition' <, 'locale'>)

    where

parsed-char is the parsed character value from which the value of the specified token is returned.

    token specifies the name of the token that is returned from the parsed value.

    parse-definition specifies the name of the parse definition, which must exist in the specified locale.

locale optionally specifies the name of the locale that contains the specified parse definition.
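Because DQPARSETOKENGET expects token names that match the parse definition, DQPARSEINFOGET is a convenient way to check what those names are. The following minimal sketch (added for illustration — it assumes the ENUSA locale has already been loaded) writes the token names of the NAME parse definition to the SAS log:

```sas
/* Write the delimited list of token names defined by the ENUSA   */
/* NAME parse definition (for example, Name Prefix and Given      */
/* Name) to the SAS log.                                          */
data _null_;
   tokens=DQPARSEINFOGET('NAME', 'ENUSA');
   put tokens=;
run;
```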


48

Example of Parsing Functions

The following example parses a name and then retrieves individual
tokens from the parsed value:

The returned values in the SAS log would be as follows:

    parsedValue=Mrs./=/Sallie/=/Mae/=/Pravlik/=//=/

    prefix=Mrs.

given=Sallie

data _null_;
   parsedValue=DQPARSE('Mrs. Sallie Mae Pravlik', 'NAME', 'ENUSA');
   prefix=DQPARSETOKENGET(parsedValue, 'Name Prefix', 'NAME', 'ENUSA');
   given=DQPARSETOKENGET(parsedValue, 'Given Name', 'NAME', 'ENUSA');
   put parsedValue= prefix= given=;
run;


49

Changing Case in SAS

The DQCASE function returns a character value with standardized
capitalization. The DQCASE function operates on any character
content, such as names, organizations, and addresses. All instances
of adjacent blank spaces are replaced with single blank spaces.

DQCASE(char, 'case-definition' <, 'locale'>)

    where


    char is the value that is transformed, according to the specified case definition.

    case-definition specifies the name of the case definition that will be referenced during the

    transformation.

locale optionally specifies the name of the locale that contains the specified case definition.

50

Example of DQCASE Function

The following example applies proper casing to an organization name:

The value returned for orgname in the SAS log would be
Bill's Plumbing & Heating.

data _null_;
   orgname=DQCASE("BILL's PLUMBING & HEATING", 'Proper', 'ENUSA');
   put orgname=;
run;


51

Case Study Tasks

Analyze and Profile the Data
   Access and view the data.
   Create and execute profiling job(s).

Improve the Data
   Standardize data.
   Augment and validate data.
   Create match codes.

This demonstration illustrates the use of the SAS Data Quality
Server functions to perform identification analysis, gender
analysis, parsing, concatenation, and casing.



    Augmenting and Validating Data Using SAS

In this demonstration, investigate four separate SAS programs. These programs explore the use and
results of the DQIDENTIFY, DQGENDER, DQPARSE, and DQCASE functions. To examine the
results from the programs, short FREQ or PRINT procedure steps will be added.

1. Start a SAS session by selecting Start → All Programs → SAS BIArchitecture → Start SAS.

    2. If the Getting Started with SAS window opens, do the following:

a. Select Don't show this dialog box again.

    b. Select .

    The SAS Display Manager session opens.


    Using the DQIDENTIFY Function

    1. Verify that the Enhanced Editor window is active.

2. Select File → Open Program.

3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQIdentityFunctions.sas.

    4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   length Identity $1;
   label Identity='Customer Identity Type';
   Identity = dqidentify(contact, 'Individual/Organization');
run;

    In this program, the following occurs:

    The %DQLOAD macro loads the ENUSA locale into memory.

The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.

    The DATA step uses the DQIDENTIFY function to identify whether the value for the CONTACT

    field is an individual, an organization, or not known.

5. Select Run → Submit to execute the SAS program.


6. Select View → Log to activate the Log window.


7. To view the resulting data set, do the following:

    a. Select SAS Explorer.

    b. Double-click on the Libraries icon.

c. Double-click on the Work library icon.

    d. Double-click on the Std_prospects table to open it into a VIEWTABLE window.

    e. Scroll to view the Customer Identity Type column.

f. Select File → Close to close the VIEWTABLE window.

8. Select Window → DQIdentityFunctions.sas.


    9. Run a frequency report on the new identity column.

a. At the bottom of the program, after the RUN statement for the DATA step, uncomment the PROC
FREQ step (that is, remove the /* before the step and the */ after the step). The PROC FREQ
step is as shown:

proc freq;
   tables identity/nocum;
run;

b. Highlight only these three new lines and then select Run → Submit.

    The following report surfaces:


    Using the DQGENDER Function

1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.

2. Select File → Open Program.

3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQGenderFunctions.sas.

    4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set Prospects;
   /* use the GENDER function to determine gender based on name */
   length custgender $1;
   label custgender='Customer Gender';
   custgender = dqgender(contact, 'gender');
run;

PROC FREQ;
   Tables custgender/nocum;
RUN;

    In this program, the following occurs:

    The %DQLOAD macro loads the ENUSA locale into memory.

The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.

The DATA step uses the DQGENDER function to identify whether the value for the CONTACT
field is M (male), F (female), or U (unknown).

    The PROC FREQ step generates a report of frequency counts on the custgender column.

5. Select Run → Submit to execute the SAS program.


6. Select View → Log to activate the Log window. A portion of the DATA step and PROC FREQ step is

    shown below:

7. Select View → Output to activate the Output window. The report shows the following:


    Using the DQPARSE Function

1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.

2. Select File → Open Program.

3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQParseFunctions.sas.

    4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\biarchitecture\Lev1\sasmain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "newcustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\dqdata\newcustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   Parsedname=dqparse(contact, 'NAME');
   Prefix=dqparsetokenget(parsedname, 'Name Prefix', 'NAME');
   First_name=dqparsetokenget(parsedname, 'Given Name', 'NAME');
   Last_name=dqparsetokenget(parsedname, 'Family Name', 'NAME');
run;

proc print;
   var prefix first_name last_name;
run;

    In this program, the following occurs:

    The %DQLOAD macro loads the ENUSA locale into memory.

The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft
Access database.

The DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the
CONTACT field.

    The PROC PRINT step produces a listing report of the results of the DQPARSETOKENGET

    function usage.

5. Select Run → Submit to execute the SAS program.


6. Select View → Log to activate the Log window. The portion for the DATA step and PROC PRINT

    step is shown below:

7. Select View → Output to activate the Output window. The partial output is as follows:


    Using the DQCASE Function

1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.

2. Select File → Open Program.

3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQPropercaseFunctions.sas.

    4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   ParsedName=dqParse(contact, 'NAME');
   Prefix=dqParseTokenGet(parsedName, 'Name Prefix', 'NAME');
   First_name=dqParseTokenGet(parsedName, 'Given Name', 'NAME');
   Last_name=dqParseTokenGet(parsedName, 'Family Name', 'NAME');
run;

data std_prospects;
   set std_prospects;
   length Contact2 $50;
   label Contact2='Re-formatted Prospect Name';
   Contact2 = trim(Last_Name) || ', ' || First_Name;
   length Contact3 $50;
   label Contact3='Proper Cased Re-formatted Prospect Name';
   Contact3 = dqcase(contact2, 'PROPER');
run;

proc print;
   var Contact Contact2 Contact3;
run;


    In this program, the following occurs:

    The %DQLOAD macro loads the ENUSA locale into memory.

The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.

    The first DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the value

    for the CONTACT field.

    The second DATA step uses the concatenation operator (||) to rebuild a Name field (Contact2).

    The DQCASE function is then applied to resolve the Contact2 field to proper casing.

    The PROC PRINT step produces a listing report of some parsed and concatenated information.
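As a side note on the concatenation step: the trim-and-concatenate expression in the second DATA step can also be written with the Base SAS CATX function, which strips leading and trailing blanks from each argument and inserts the delimiter in a single call. The following alternative is a sketch for comparison, not part of the demonstration program:

```sas
/* CATX trims each argument and inserts the ', ' delimiter, so it */
/* replaces the trim(Last_Name) || ', ' || First_Name expression. */
data std_prospects_alt;
   set std_prospects;
   length Contact2b $50;
   Contact2b = catx(', ', Last_Name, First_Name);
run;
```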

5. Select Run → Submit to execute the SAS program.

6. Select View → Log to activate the Log window. The portion for the DATA steps and PROC PRINT

    step is shown below:


7. Select View → Output to activate the Output window. Partial output is shown below:

    8. Close the SAS session and do not save any changes.


    12.4 Exercises

    1. Analyzing the NewCustomers Table

Use the NewCustomers table from the New Customers database to do the following:

Verify the type of information found for each record. (Identify records as individual or organization.)

    Calculate gender information for each record.

    Create a frequency report and a frequency report chart on both the identity and gender

    information.

    Parse the Contact field.

Add a field that contains a name string of the form Name_Prefix Given_Name Family_Name.

    Save the job as DIDQ Ch5Ex1 NewCustomers Analysis.


    12.5 Solutions to Exercises

    1. Analyzing the NewCustomers Table

a. If necessary, invoke dfPower Studio by selecting Start → All Programs →
DataFlux dfPower Studio 7.1 → dfPower Studio.

    b. Select Base from the toolbar, and then select Architect.

    c. Expand the Data Inputs grouping of nodes.

    d. Double-click the Data Source node.

1) Enter New Customers as the name.

    2) Select next to Input table.

    3) Expand the New Customers database and select the NewCustomers table.

    4) Select to close the Select Table window.

    5) Select (double-arrow) to move all fields from the Available area to the Selected area.

    6) Select to close the Data Source Properties window.

e. With the data source node selected, select the Preview tab from the Details area (at the bottom of
the dfPower Architect interface). The data from this node is displayed.

    f. Expand the Quality grouping of nodes.

    g. Double-click the Identification Analysis node. The Identification Analysis Properties window

    opens.

    1) Move the CONTACT field from the Available area to the Selected area by double-clicking.

    2) Double-click on the Definition column for the selected CONTACT field.

    3) From the menu, select Individual/Organization.

    4) Scroll in the Selected area to reveal that the results of the identification analysis will be placed

    in the field CONTACT_Identity.

    5) Select below the Available area.

    6) Select (double-arrow) to move all fields from the Available area to the Selected area.

    7) Select to close the Additional Outputs window.

    8) Select to close the Identification Analysis Properties window.


    h. Preview the results of the Identification Analysis.

    1) Verify that Identification Analysis is selected.

2) Select the Preview tab at the bottom of the dfPower Architect interface.

    3) Scroll to the right to view the information populated for the CONTACT_Identity field.

    i. Expand the Quality grouping of nodes.

    j. Double-click on the Gender Analysis node. The Gender Analysis Properties window opens.

    1) Move the CONTACT field from the Available area to the Selected area by double-clicking.

    2) Double-click on the Definition column for the selected CONTACT field.

    3) Select Gender.

4) Scroll in the Selected area to reveal that the results of the gender analysis will be placed
in the field CONTACT_Gender.

    5) Select below the Available area.

    6) Select (double-arrow) to move all fields from the Available area to the Selected area.

    7) Select to close the Additional Outputs window.

8) Select to close the Gender Analysis Properties window.

    k. Expand the Profiling grouping of nodes.

    l. Double-click the Frequency Distribution node.

    1) The Frequency Distribution Properties window opens.

    2) Move CONTACT_Identity and CONTACT_Gender from the Available area to the Selected

    area.

    3) Select to close the Frequency Distribution Properties window. The Preview tab is

populated with the frequency report.

    m. Expand the Data Outputs grouping of nodes.

    n. Double-click the Frequency Distribution Chart node.

    1) Select next to Chart name to choose a location for the output.

    2) Navigate to S:\Workshop\winsas\didq.

3) Enter New Customers Chart as the value for File name.

    4) Select to close the Save As window.

5) Enter Gender & Identity Distribution from New Customers as the title for the chart.


    6) Move both CONTACT_Identity and CONTACT_Gender from the Available area to the

    Selected area.

    7) Select to close the Frequency Distribution Chart Properties window. The Preview

    tab is populated with the frequency report.

o. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator.

    1) Select to close the Run Job window. The Chart Viewer window opens.

2) Select to scroll to the next chart for CONTACT_Gender.

3) Select File → Exit to close the Chart Viewer window.

    p. Save the job.

1) From the dfPower Architect menu, select File → Save As.

2) Enter DIDQ Ch5Ex1 NewCustomers Analysis as the name.

3) Enter New Customer Analysis as the description.

    4) Select to close the Save As window.

q. Select the Frequency Distribution 1 node in the job flow.

    r. Expand the Quality grouping of nodes.

    s. Right-click the Parsing node and select Insert Before Selected.

    1) Select CONTACT as the field to parse.

    2) Select Name as the definition.

    3) Select to move all tokens from the Available area to the Selected area.

    4) Select below the Available area.

    5) Select to move all fields from the Available area to the Selected area.

    6) Select to close the Additional Outputs window.

    7) Select to close the Parse Properties window.

    t. Expand the Utilities grouping of nodes.

    u. Double-click the Concatenate node. The Concatenation Properties window opens.

    1) Specify PreFirstLast as the output field.

2) Enter (a space) as the value for Literal text.

    3) Select Name Prefix, and then select to move it to the Concatenation list area.


    4) Select next to Literal text to move the text to the Concatenation list area after

    Name Prefix.

    5) Select Given Name, and then select to move it to the Concatenation list area.

6) Select next to Literal text to move the text to the Concatenation list area after Given Name.

    7) Select Family Name, and then select to move it to the Concatenation list area.

    8) Select below the Available fields area.

    9) Select to move all fields from the Available area to the Selected area.