data mining as an audit tool

Upload: lucia-flores

Post on 07-Apr-2018

  • 8/4/2019 Data Mining as an Audit Tool

    1/140

    Data Mining As A Financial Auditing Tool

    M.Sc. Thesis in Accounting

    Swedish School of Economics and Business Administration

    2002

    The Swedish School of Economics and Business Administration

    Department: Accounting

    Type of Document: Thesis

    Title: Data Mining As A Financial Auditing Tool

    Author: Supatcharee Sirikulvadhana

    Abstract

    In recent years, the volume and complexity of accounting transactions in major

    organizations have increased dramatically. To audit such organizations, auditors

    frequently must deal with voluminous data with rather complicated data structure.

    Consequently, auditors can no longer rely only on reporting or summarizing tools in the

    audit process. Rather, additional tools such as data mining techniques that can

    automatically extract information from a large amount of data might be very useful.

    Although adopting data mining techniques in the audit processes is a relatively new

    field, data mining has been shown to be cost effective in many business applications

    related to auditing such as fraud detection, forensic accounting and security evaluation.

    The objective of this thesis is to determine if data mining tools can directly

    improve audit performance. The selected test area was the sample selection step of the

    test of control process. The research data was based on accounting transactions

    provided by AVH PricewaterhouseCoopers Oy. Various samples were extracted from

    the test data set using data mining software and generalized audit software and the

    results evaluated. IBM's DB2 Intelligent Miner for Data Version 6 was selected to

    represent the data mining software and ACL for Windows Workbook Version 5 was

    chosen for generalized audit software.

    Based on the results of the test and the opinions solicited from experienced

    auditors, the conclusion is that, within the scope of this research, the results of data

    mining software are more interesting than the results of generalized audit software.

    However, there is no evidence that the data mining technique brings out material

    matters or presents a significant enhancement over the generalized audit software. Further

    study in a different audit area or with a more complete data set might yield a different

    conclusion.

    Search Words: Data Mining, Artificial Intelligence, Auditing, Computerized Audit

    Assisted Tools, Generalized Audit Software

    Table of Contents

    1. Introduction 1

    1.1. Background 1

    1.2. Research Objective 2

    1.3. Thesis Structure 2

    2. Auditing 4

    2.1. Objective and Structure 4

    2.2. What Is Auditing? 4

    2.3. Audit Engagement Processes 5

    2.3.1. Client Acceptance or Client Continuance 5

    2.3.2. Planning 6

    2.3.2.1. Team Mobilization 6

    2.3.2.2. Client's Information Gathering 7

    2.3.2.3. Risk Assessment 7

    2.3.2.4. Audit Program Preparation 9

    2.3.3. Execution and Documentation 10

    2.3.4. Completion 11

    2.4. Audit Approaches 12

    2.4.1. Tests of Controls 12

    2.4.2. Substantive Tests 13

    2.4.2.1. Analytical Procedures 13

    2.4.2.2. Detailed Tests of Transactions 13

    2.4.2.3. Detailed Tests of Balances 14

    2.5. Summary 14

    3. Computer Assisted Auditing Tools 17

    3.1. Objective and Structure 17

    3.2. Why Computer Assisted Auditing Tools? 17

    3.3. Generalized Audit Software 18

    3.4. Other Computerized Tools and Techniques 22

    3.5. Summary 23

    4. Data Mining 24

    4.1. Objective and Structure 24

    4.2. What Is Data Mining? 24

    4.3. Data Mining Process 25

    4.3.1. Business Understanding 26

    4.3.2. Data Understanding 27

    4.3.3. Data Preparation 27

    4.3.4. Modeling 27

    4.3.5. Evaluation 28

    4.3.6. Deployment 28

    4.4. Data Mining Tools and Techniques 29

    4.4.1. Database Algorithms 29

    4.4.2. Statistical Algorithms 30

    4.4.3. Artificial Intelligence 30

    4.4.4. Visualization 30

    4.5. Methods of Data Mining Algorithms 32

    4.5.1. Data Description 32

    4.5.2. Dependency Analysis 33

    4.5.3. Classification and Prediction 33

    4.5.4. Cluster Analysis 34

    4.5.5. Outlier Analysis 34

    4.5.6. Evolution Analysis 35

    4.6. Examples of Data Mining Algorithms 36

    4.6.1. Apriori Algorithms 36

    4.6.2. Decision Trees 37

    4.6.3. Neural Networks 39

    4.7. Summary 40

    5. Integration of Data Mining and Auditing 43

    5.1. Objective and Structure 43

    5.2. Why Integrate Data Mining with Auditing? 43

    5.3. Comparison between Currently Used Generalized Auditing Software

    and Data Mining Packages 44

    5.3.1. Characteristics of Generalized Audit Software 45

    5.3.2. Characteristics of Data Mining Packages 46

    5.4. Possible Areas of Integration 48

    5.5. Examples of Tests 58

    5.6. Summary 66

    6. Research Methodology 68

    6.1. Objective and Structure 68

    6.2. Research Period 68

    6.3. Data Available 68

    6.4. Research Methods 69

    6.5. Software Selection 70

    6.5.1. Data Mining Software 70

    6.5.2. Generalized Audit Software 71

    6.6. Analysis Methods 71

    6.7. Summary 72

    7. The Research 73

    7.1. Objective and Structure 73

    7.2. Hypothesis 73

    7.3. Research Processes 73

    7.3.1. Business Understanding 73

    7.3.2. Data Understanding 74

    7.3.3. Data Preparation 75

    7.3.3.1. Data Transformation 75

    7.3.3.2. Attribute Selection 76

    7.3.3.3. Choice of Tests 80

    7.3.4. Software Deployment 82

    7.3.4.1. IBM's DB2 Intelligent Miner for Data 82

    7.3.4.2. ACL 91

    7.4. Result Interpretations 94

    7.4.1. IBM's DB2 Intelligent Miner for Data 94

    7.4.2. ACL 95

    7.5. Summary 99

    8. Conclusion 101

    8.1. Objective and Structure 101

    8.2. Research Perspective 101

    8.3. Implications of the Results 102

    8.4. Restrictions and Constraints 103

    8.4.1. Data Limitation 103

    8.4.1.1. Incomplete Data 103

    8.4.1.2. Missing Information 103

    8.4.1.3. Limited Understanding 104

    8.4.2. Limited Knowledge of Software Packages 104

    8.4.3. Time Constraint 105

    8.5. Suggestions for Further Researches 105

    8.6. Summary 105

    List of Figures 105

    List of Tables 105

    References 105

    a) Books and Journals 105

    b) Web Pages 105

    Appendix A: List of Columns of Data Available 109

    Appendix B: Results of IBM's Intelligent Miner for Data 105

    a) Preliminary Neural Clustering (with Six Attributes) 105

    b) Demographic Clustering: First Run 105

    c) Demographic Clustering: Second Run 105

    d) Neural Clustering: First Run 105

    e) Neural Clustering: Second Run 105

    f) Neural Clustering: Third Run 105

    g) Tree Classification: First Run 105

    h) Tree Classification: Second Run 105

    i) Tree Classification: Third Run 105

    Appendix C: Sample Selection Result of ACL 105

    1. Introduction

    1.1. Background

    Auditing is a relatively archaic field and the auditors are frequently viewed as

    stuffily fussy people. That is no longer true. In recent years, auditors have recognized

    the dramatic increase in the transaction volume and complexity of their clients'

    accounting and non-accounting records. Consequently, computerized tools such as

    general-purpose and generalized audit software (GAS) have increasingly been used to

    supplement the traditional manual audit process.

    The emergence of enterprise resource planning (ERP) systems, with the concept

    of integrating all operating functions together in order to increase the profitability of an

    organization as a whole, means the accounting system is no longer a simple debit-and-credit

    system. Instead, it is the central registrar of all operating activities. Though it can be

    argued which records are, or are not, accounting transactions, the system still contains valuable

    information. It is the auditors' responsibility to audit a sufficient amount of the transactions

    recorded in the client's databases in order to gain enough evidence on which an audit

    opinion may be based and to ensure that there is no risk left unaddressed.

    The amount and complexity of the accounting transactions have increased

    tremendously due to the innovation of electronic commerce, online payment and other

    high-technology devices. Electronic records have become more common; therefore, online

    auditing is increasingly challenging, let alone manual access. Although those

    complicated accounting transactions can now be presented in a more comprehensive

    format using today's improved generalized audit software (GAS), they still require

    auditors to make assumptions, perform analysis and interpret the results.

    The GAS and other computerized tools currently in use only allow auditors to

    examine a company's data in certain predefined formats by running varied query

    commands; they do not extract information from that data, especially when such

    information is unknown and hidden. Auditors need something more than presentation

    tools to enhance their investigation of facts or, simply put, material matters.

    On the other side, data mining techniques have improved with the advancement

    of database technology. In the past two decades, databases have become commonplace in

    business. However, the database itself does not directly benefit the company; in order

    to reap the benefit of a database, the abundance of data has to be turned into useful

    information. Thus, data mining tools that facilitate data extraction and data analysis

    have received greater attention.

    There seem to be opportunities for auditing and data mining to converge.

    Auditing needs a means to uncover unusual transaction patterns and data mining can

    fulfill that need. This thesis attempts to explore the opportunities of using data mining

    as a tool to improve audit performance. The effectiveness of various data mining tools

    in reaching that goal will also be evaluated.

    1.2. Research Objective

    The research objective of this thesis is to preliminarily evaluate the usefulness

    of data mining techniques in supporting auditing by applying selected techniques with

    available data sets. However, it is worth noting that it remains an open

    question whether results from the available data sets can be generalized.

    Given the data available, the focus of this research is the sample selection

    step of the test of control process. The relationship patterns discovered by data mining

    techniques will be used as a basis for sample selection, and the sample selected will be

    compared with the sample drawn by generalized audit software.

    1.3. Thesis Structure

    The remainder of this thesis is structured as follows:

    Chapter 2 is a brief introduction to auditing. It introduces some essential

    auditing terms as a basic background. The audit objectives, audit engagement processes

    and audit approaches are also described here.

    Chapter 3 discusses some computer assisted auditing tools and techniques

    currently used in assisting auditors in their audit work. The main focus will be on the

    generalized audit software (GAS), particularly Audit Command Language (ACL),

    the most popular GAS package in recent years.

    Chapter 4 provides an introduction to data mining. The data mining process, tools

    and techniques are reviewed. Also, the discussions will attempt to explore the concept,

    methods and appropriate techniques of each type of data mining pattern in greater

    detail. Additionally, some examples of the most frequently used data mining algorithms

    will be demonstrated.

    Chapter 5 explores many areas where data mining techniques may be utilized

    to support the auditors' performance. It also compares GAS packages and data mining

    packages from the auditing profession's perspective. The characteristics of these

    techniques and their roles as a substitution of manual processes are also briefly

    discussed. For each of those areas, audit steps, potential mining methods, and required

    data sets are identified.

    Chapter 6 describes the selected research methodology, the reasons for

    selection, and relevant material to be used. The research method and the analysis

    technique of the results are identified as well.

    Chapter 7 illustrates the actual study. The hypothesis, relevant facts of the

    research processes and the study results are presented. An interpretation of the

    study results is then attempted.

    Finally, chapter 8 provides a summary of the entire study. The assumptions,

    restrictions and constraints of the research will be reviewed, followed by suggestions for

    further research.

    2. Auditing

    2.1. Objective and Structure

    The objective of this chapter is to introduce the background information on

    auditing. In section 2.2, definitions of essential terms as well as main objectives and

    tasks of the auditing profession are covered. Four principal audit procedures are discussed

    in section 2.3. Audit approaches including tests of controls and substantive tests are

    discussed in greater detail in section 2.4. Finally, section 2.5 provides a brief summary

    of the auditing perspective.

    Note that most of the content covered in this chapter is based on the notable

    textbook Auditing: An Integrated Approach (Arens & Loebbecke, 2000) and my own

    experience.

    2.2. What Is Auditing?

    Auditing is the accumulation and evaluation of evidence about information to

    determine and report on the degree of correspondence between the information and

    established criteria (Arens & Loebbecke, 2000, 16). Normally, independent auditors,

    also known as certified public accountants (CPAs), conduct audit work to ascertain

    whether the overall financial statements of a company are, in all material respects, in

    conformity with the generally accepted accounting principles (GAAP). Financial

    statements include Balance Sheets, Profit and Loss Statements, Statements of Cash

    Flow and Statements of Retained Earnings. Generally speaking, what auditors do is to

    apply relevant audit procedures, in accordance with GAAP, in the examination of the

    underlying records of a business, in order to provide a basis for issuing a report as an

    attestation of that company's financial statements. Such a written report is called the auditors'

    opinion or auditors' report.

    The auditors' report expresses the opinion of an independent expert regarding the

    degree of reliability of the information presented in the financial statements. In

    other words, the auditors' report assures the financial statement users, which normally are

    external parties such as shareholders, investors, creditors and financial institutions, of

    the reliability of financial statements, which are prepared by the management of the

    company.

    Due to time and cost constraints, auditors cannot examine every detailed

    record behind the financial statements. The concepts of materiality and fairly stated

    financial statements were introduced to solve this problem. Materiality is the magnitude

    of an omission or misstatement of information that misleads the financial statement

    users. The materiality standard applied to each account balance varies and

    depends on the auditors' judgement. It is the responsibility of the auditors to ensure that

    all material misstatements are indicated in the auditors' opinion.

    In business practice, it is more common to find an auditor as a staff member of an

    auditing firm. Generally, several CPAs join together to practice as partners of the

    auditing firm, offering auditing and other related services, including other

    reviews, to interested parties. The partners normally hire professional staff and form an

    audit team to assist them in the audit engagement. In this thesis, auditors, auditing firm

    and audit team are used as synonyms.

    2.3. Audit Engagement Processes

    The audit engagement processes of each auditing firm may be different.

    However, they generally involve four major steps: client acceptance or client

    continuance, planning, execution and documentation, and completion.

    2.3.1. Client Acceptance or Client Continuance

    Client acceptance, or client continuance in case of a continued

    engagement, is a process through which the auditing firm decides whether or not the

    firm should be engaged by this client. Major considerations are:

    - Assessment of engagement risks: Each client presents a different level

    of risk to the firm. The important risk that an auditing firm must evaluate carefully in

    accepting an audit client is that of accepting a company with a bad reputation or questionable

    ethics that is involved in illegal business activities or material misrepresentation of

    business and accounting records. Some auditing firms have basic requirements for

    favorable clients. On the other hand, some have a list of criteria to identify the

    unfavorable ones. Unfavorable clients, for example, are in dubious businesses or have

    too complex a financial structure.

    - Relationship conflicts: Independence is a key requirement of the

    audit profession; of equal importance are the auditor's objectivity and integrity. These

    factors help to ensure a quality audit and to earn people's trust in the audit report.

    - Requirements of the clients: The requirements include, for example,

    the qualification of the auditor, time constraint, extra reports and estimated budget.

    - Sufficient competent personnel available

    - Cost-benefit analysis: This compares the potential costs of the

    engagement with the audit fee offered by the client. The major portion of the cost of an

    audit engagement is professional staff charges.

    If the client is accepted, a written confirmation, generally on an annual

    basis, of the terms of engagement is established between the client and the firm.
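
The cost-benefit consideration listed above can be illustrated with a small calculation. This is only a sketch: the function name, the hourly rates, the hours and the fee are hypothetical values chosen for illustration and do not come from any firm's actual acceptance procedure.

```python
def engagement_margin(audit_fee, staff_plan, other_costs=0.0):
    """Compare the offered audit fee with the estimated engagement cost.

    staff_plan is a list of (hours, hourly_rate) pairs, one per team member.
    Returns the fee minus the total estimated cost (negative means a loss).
    """
    staff_cost = sum(hours * rate for hours, rate in staff_plan)
    return audit_fee - (staff_cost + other_costs)


# Hypothetical engagement: partner, senior auditor and assistant auditor.
plan = [(40, 300.0), (120, 150.0), (200, 80.0)]
margin = engagement_margin(audit_fee=50_000.0, staff_plan=plan, other_costs=2_000.0)
print(f"Estimated margin: {margin:.2f}")
```

Because professional staff charges dominate the cost, the staff plan is the input that most affects the result; the remaining considerations in the list (risk, independence, client requirements) are qualitative and are not captured by this calculation.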

    2.3.2. Planning

    The objective of the planning step is to develop an audit plan. It includes

    team mobilization, client's information gathering, risk assessment and audit program

    preparation.

    2.3.2.1. Team Mobilization

    This step is to form the engagement team and to communicate

    among team members. First, key team members have to be identified. Team members

    include the engagement partner or partners who will sign the audit report, staff auditors

    who will conduct most of the necessary audit work and any specialists that are deemed

    necessary for the engagement. The mobilization meeting, or pre-planning meeting,

    should be conducted to communicate all engagement matters including client

    requirements and deliverables, level of involvement, tentative roles and responsibilities

    of each team member and other relevant matters. The meeting should also cover the

    determination of the most efficient and effective process of information gathering.

    In case of client continuance, the prior year audit should be reviewed to

    assess the scope for improving efficiency or effectiveness.

    2.3.2.2. Client's Information Gathering

    In order to perform this step, the most important thing is the

    cooperation between the client and the audit team. A meeting is arranged to update the

    client's needs and expectations as well as management's perception of their business

    and the control environment.

    Next, the audit team members need to perform the preliminary

    analytical procedures which could involve the following tasks:

    - Obtaining background information: It includes the

    understanding of the client's business and industry, the business objectives, legal

    obligations and related risks.

    - Understanding system structures: System structures include the

    system and computer environments, operating procedures and the controls embedded in

    those procedures.

    - Control assessment: Based upon information about controls

    identified from the meeting with the client and the understanding of system structures

    and processes, all internal controls are updated, assessed and documented. The subjects

    include control environment, general computerized (or system) controls, monitoring

    controls and application controls. More details about internal control, such as

    definitions, nature, purpose and means of achieving effective internal control, can be

    found in Internal Control - Integrated Framework (COSO, 1992).

    Audit team members' knowledge, expertise and experience are

    considered the most valuable tools in performing this step.

    2.3.2.3. Risk Assessment

    Risk, in this case, is some level of uncertainty in performing audit

    work. Risks identified in the first two steps are gathered and assessed. The level of

    risk assessed in this step directly determines the audit strategy to be used. In short, the

    level of audit work is based on the level of risk. Therefore, the auditor must be careful not to

    understate or overstate the level of these risks.

    The level of risk differs from one audit area to another. In

    planning the extent of audit evidence for each audit area, auditors primarily use an

    audit risk model such as the one shown below:

    $$\text{Planned Detection Risk} = \frac{\text{Acceptable Audit Risk}}{\text{Inherent Risk} \times \text{Control Risk}}$$

    - Planned detection risk: Planned detection risk is the highest

    level of misstatement risk that the audit evidence cannot detect in each audit area. The

    auditors need to accumulate audit evidence until the level of misstatement risk is

    reduced to the planned detection risk level. For example, if the planned detection risk is

    0.05, then audit testing needs to be expanded until the audit evidence obtained supports the

    assessment that there is only a five percent misstatement risk left.

    - Acceptable audit risk: Audit risk is the probability that the auditor

    will unintentionally render an inappropriate opinion on the client's financial statements.

    Acceptable audit risk, therefore, is a measure of how willing the auditor is to accept that

    the financial statements may be materially misstated after the audit is completed (Arens

    & Loebbecke, 2000, 261).

    - Inherent risk: Inherent risk is the probability that there are

    material misstatements in financial statements. There are many risk factors that affect

    inherent risk including errors, fraud, business risk, industry risk, and change risk. The

    first two are preventable and detectable but the others are not. Auditors have to ensure that

    all risks are taken into account when considering the probability of inherent risk.

    - Control risk: Control risk is the probability that a client's

    control system cannot prevent or detect errors. Normally, after defining inherent risks,

    controls that are able to detect or prevent such risks are identified. Then, auditors will

    assess whether the client's system has such controls and, if it has, how much they can

    rely on those controls. The more reliable the controls, the lower the control risk. In other

    words, control risk represents the auditors' reliance on the client's control structure.

    It is the responsibility of the auditors to ensure that no risk factors

    of each audit area are left unaddressed and the evidence obtained is sufficient to reduce

    all risks to an acceptable audit risk level. More information about audit risk can be

    found in Statement on Auditing Standards (SAS) No. 47: Audit Risk and Materiality in

    Conducting an Audit (AICPA, 1983).
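
To make the audit risk model in this section concrete, the following minimal sketch computes planned detection risk from the other three components. The function name and all numeric inputs are illustrative assumptions, not figures from the thesis or from SAS No. 47.

```python
def planned_detection_risk(acceptable_audit_risk, inherent_risk, control_risk):
    """Audit risk model: PDR = AAR / (IR * CR), all expressed as probabilities."""
    for name, value in (
        ("acceptable_audit_risk", acceptable_audit_risk),
        ("inherent_risk", inherent_risk),
        ("control_risk", control_risk),
    ):
        if not 0 < value <= 1:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    return acceptable_audit_risk / (inherent_risk * control_risk)


# Hypothetical assessments for one audit area (illustrative values only).
reliable_controls = planned_detection_risk(0.05, inherent_risk=0.5, control_risk=0.4)
unreliable_controls = planned_detection_risk(0.05, inherent_risk=0.5, control_risk=1.0)

print(f"reliable controls:   {reliable_controls:.2f}")
print(f"unreliable controls: {unreliable_controls:.2f}")
```

In this illustration the planned detection risk falls from 0.25 to 0.10 when the controls cannot be relied on. A lower planned detection risk means that more audit evidence has to be accumulated for that area, which is how the assessed risks drive the extent of testing.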

    2.3.2.4. Audit Program Preparation

    The purpose of this step is to determine the most appropriate audit

    strategy and tasks for each audit objective within each audit area, based on the client's

    background information and the related audit risks and controls identified in the

    previous steps.

    Firstly, the audit objectives, both transaction-related and balance-

    related, of each audit area have to be identified. These two types of objectives share

    one thing in common -- that they must be met before auditors can conclude that the

    information presented in the financial statements is fairly stated. The difference is that

    while transaction-related audit objectives are to ensure the correctness of the total

    transactions for any given class, balance-related audit objectives are to ensure the

    correctness of any given account balance. A primary purpose of audit strategy and tasks

    is to ensure that those objectives are materially met. Such objectives include the

    following.

    Transaction-Related and Balance-Related Audit Objectives

    - Existence or occurrence: To ensure that all balances in the

    balance sheet really exist and the transactions in the

    income statement really occurred.

    - Completeness: To ensure that all balances and transactions are

    included in the financial statements.

    - Accuracy: To ensure that the balances and transactions are

    recorded accurately.

    - Classification: To ensure that all transactions are classified in

    the suitable categories.

    - Cut-off (timing): To ensure that the transactions are recorded in

    the proper period.

    Other Balance-Related Audit Objectives

    - Valuation: To ensure that the balances and transactions are

    stated at the appropriate value.

    - Rights and obligations: To ensure that the assets belong to

    and the liabilities are the obligations of the company.

    - Presentation and disclosure: To ensure that the presentation of

    the financial statements does not mislead the users and the

    disclosures are enough for users to understand the financial

    statements clearly.
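
Two of the objectives listed above, cut-off and completeness, lend themselves to simple automated checks. The sketch below is an illustration under stated assumptions: the record layout (`doc_no`, `date`, `amount`), the sample ledger and the use of a gap in a document number sequence as a completeness indicator are all hypothetical and are not taken from the thesis data.

```python
from datetime import date


def cutoff_exceptions(transactions, period_start, period_end):
    """Return transactions whose date falls outside the audited period."""
    return [t for t in transactions if not period_start <= t["date"] <= period_end]


def missing_document_numbers(transactions):
    """Return gaps in the document number sequence (a simple completeness check)."""
    numbers = sorted(t["doc_no"] for t in transactions)
    if not numbers:
        return []
    present = set(numbers)
    return [n for n in range(numbers[0], numbers[-1] + 1) if n not in present]


# Hypothetical ledger extract for the financial year 2001.
ledger = [
    {"doc_no": 1001, "date": date(2001, 12, 30), "amount": 250.0},
    {"doc_no": 1002, "date": date(2001, 12, 31), "amount": 980.0},
    {"doc_no": 1004, "date": date(2002, 1, 2), "amount": 410.0},
]

late = cutoff_exceptions(ledger, date(2001, 1, 1), date(2001, 12, 31))
gaps = missing_document_numbers(ledger)
print("Cut-off exceptions:", [t["doc_no"] for t in late])
print("Missing document numbers:", gaps)
```

Checks of this kind only flag items for follow-up; whether a flagged item is a misstatement still depends on the auditor's judgement and supporting evidence.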

    After addressing audit objectives, it is time to develop an overall audit

    plan. The audit plan should cover the audit strategy of each area and all details related to

    the engagement including the client's needs and expectations, reporting requirements and

    timetable. Then, the planning at the detail level has to be performed. This detailed plan

    is known as a tailored audit program. It should cover task identification and scheduling,

    types of tests to be used, materiality thresholds, acceptable audit risk and the person

    responsible. Notice that related risks and controls of each area are taken into account

    for prescribing audit strategy and tasks.

    The finalized general plan should be communicated to the client in order

    to agree upon significant matters such as deliverables and timetable. Both overall audit

    plan and detailed audit programs need to be clarified to the team as well.

    2.3.3. Execution and Documentation

    In short, this step is to perform the audit examinations by following the

    audit program. It includes audit tests execution, which will be described in more detail

    in the next subsection, and documentation. Documentation includes summarizing the

    results of audit tests, level of satisfaction, matters found during the tests and

    recommendations. If there is an involvement of specialists, the process performed and

    the outcome have to be documented as well.

    Communication is considered the most important skill for

    performing this step. Besides communicating with the client or the staff working for the client, it is also

    crucial to communicate among the team. Normally, it is the responsibility of the more

    senior auditor to coach the less senior ones. Techniques used are briefing, coaching,

    discussing, and reviewing.

    A meeting with the client to discuss the issues found during the

    execution process and the recommendations arising from those findings can be arranged either

    formally or informally. It is a good idea to inform and resolve those issues with the

    responsible client personnel such as the accounting manager before the completion step

    and leave only the critical matters to the top management.

    2.3.4. Completion

    This step is similar to the final step of every other kind of project. The

    results of aforementioned steps are summarized, recorded, assessed and reported.

    Normally, the assistant auditors report their work results to the senior, or in-charge,

    auditors. The auditor-in-charge should perform the final review to ensure that all

    necessary tasks are performed and that the audit evidence gathered for each audit area is

    sufficient. Also, the critical matters left from the execution process have to be resolved.

    Those matters might be resolved either by the client's management

    (adjusting their financial statements or adequately disclosing them in their financial

    statements) or by the auditors (disclosing them in the auditors' opinion).

    The last field work for auditors is the review of subsequent events.

    Subsequent events are events that occurred after the balance sheet date but before

    the auditors' report date and that require recognition in the financial statements.

    Based on the accumulated audit evidence and audit findings, the auditors'

    opinion can be issued. Types of auditors' opinion are unqualified, unqualified with

    explanatory paragraph or modified wording, qualified, adverse and disclaimer.

    After everything is done, it is time to arrange the clearance meeting with

    the client. Generally, auditors are required to report results and all conditions to the

    audit committee or senior management. Although not required, auditors often make

    suggestions to management to improve their business performance through the

    Management Letter. In turn, auditors can also get feedback from the client

    regarding its needs and expectations.

    Also, auditors should consider evaluating their own performances in

    order to improve their efficiency and effectiveness. The evaluation includes

    summarizing the client's comments, top-down evaluation (more senior auditors evaluate

    the work of assistant auditors) and bottom-up evaluation (feedback gathered from field work

    auditors).

    2.4. Audit Approaches

    In order to determine whether financial statements are fairly stated, auditors

    have to perform audit tests to obtain competent evidence. The audit approaches used in

    each audit area, as well as the level of testing, depend on the auditor's professional

    judgement. Generally, audit approaches fall into one of two categories:

    2.4.1. Tests of Controls

    There are nearly as many sets of control objectives as there are textbooks on system

    security. Generally, however, control objectives can be grouped into four

    broad categories -- validity, completeness, accuracy and restricted access. With these

    objectives in mind, auditors can distinguish control activities from normal operating

    ones.

    When assessing controls during the planning phase, auditors are able to

    identify the level of control reliance -- the extent to which controls help reduce risks. The

    effectiveness of such controls during the period can be assessed by performing tests

    of controls. However, only key controls are tested, and the level of testing depends

    solely on the control reliance level. The higher the control reliance, the more tests are

    performed.

    The scope of tests should be sufficiently thorough to allow the auditor to

    draw a conclusion as to whether controls have operated effectively in a consistent

    manner and by the properly authorized person. In other words, the level of testing should be

    adequate to provide assurance on the relevant control objectives. The assurance

    evidence can be obtained from observation, inquiry, inspection of supporting

    documents, re-performance or a combination of these.


    2.4.2. Substantive Tests

    Substantive testing is an approach designed to test for monetary

    misstatements or irregularities directly affecting the correctness of the financial

    statement balances. Normally, the level of testing depends on the level of assurance obtained from

    the tests of controls. When tests of controls cannot be performed, either because

    there is little or no control reliance or because the amount and extent of the

    evidence obtained is not sufficient, substantive tests are performed. Substantive tests

    include analytical procedures, detailed tests of transactions and detailed tests of

    balances. Details of each test are as follows:

    2.4.2.1. Analytical Procedures

    The objective of this approach is to ensure that overall audit results,

    account balances or other data presented in the financial statements are stated

    reasonably. Statement on Auditing Standards (SAS) No. 56 also requires auditors to use

    analytical procedures during the planning and final reporting phases of an audit engagement

    (AICPA, 1988).

    Analytical procedures can be performed in many different ways.

    Generally, the most accepted one is to develop an expectation for each account balance

    together with an acceptable variation or threshold. Then, the expectation is compared with the

    actual figure. Further investigation is required only when the difference between the actual

    and expected balances falls outside the prescribed acceptable variation range. Further

    investigation includes extending analytical procedures, detailed examination of supporting

    documents, conducting additional inquiries and performing other substantive tests.

    Note that the reliability of the data, the predictive method and the

    size of the balance or transactions can strongly affect the reliability of the assurance obtained.

    Moreover, this type of test requires significant professional judgement and experience.
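
    The comparison of expectation and actual figure described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the account names, the figures and the 10 percent threshold are assumptions made for the example, and in practice both the expectation and the threshold come from the auditor's judgement.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    expected: float   # auditor's expectation for the balance (assumed figure)
    actual: float     # balance recorded by the client (assumed figure)


def needs_investigation(account: Account, threshold_pct: float) -> bool:
    """Return True when actual deviates from expected by more than threshold_pct."""
    if account.expected == 0:
        # No meaningful percentage exists; flag any non-zero actual balance.
        return account.actual != 0
    deviation = abs(account.actual - account.expected) / abs(account.expected)
    return deviation > threshold_pct


accounts = [
    Account("Sales", expected=1_000_000.0, actual=1_030_000.0),
    Account("Travel expenses", expected=50_000.0, actual=72_000.0),
    Account("Rent", expected=120_000.0, actual=120_000.0),
]

# Only accounts outside the acceptable variation range are passed on
# for further investigation.
flagged = [a.name for a in accounts if needs_investigation(a, threshold_pct=0.10)]
print(flagged)
```

    A relative threshold is used here for simplicity; an absolute amount tied to materiality would be an equally plausible choice.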

    2.4.2.2. Detailed Tests of Transactions

    The purpose of detailed tests of transactions (also known as

    substantive testing of transactions) is to ensure that the transaction-related audit

    objectives are met for each accounting transaction. Confidence in the transactions will


    lead to confidence in the account total in the general ledger. Testing techniques

    include examination of relevant documents and re-performance.

    The extent of testing remains a matter of professional judgement. It

    can vary from a sufficient number of samples to all transactions, depending on the

    level of assurance that the auditors want to obtain. Generally, samples are drawn

    from items with particular characteristics, selected at random, or chosen by a combination of

    both. Examples of such characteristics are size (materiality consideration) and

    unusualness (risk consideration).
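
    The combination of targeted and random selection described above might look as follows. This is a minimal sketch under stated assumptions: the ledger, the materiality cut-off, the "posted on a weekend" flag used as a stand-in for unusualness, and the random sample size are all hypothetical.

```python
import random

# Hypothetical ledger: (transaction id, amount, posted on a weekend?)
transactions = [
    (1, 120.0, False),
    (2, 98_500.0, False),
    (3, 430.0, True),
    (4, 75.0, False),
    (5, 12_000.0, False),
    (6, 310.0, False),
    (7, 55_000.0, True),
    (8, 940.0, False),
]

MATERIALITY = 50_000.0   # assumed size cut-off
RANDOM_SAMPLE_SIZE = 2   # assumed number of additional random picks


def select_samples(items, materiality, n_random, seed=0):
    """Pick all large or unusual items, then add a random sample of the rest."""
    targeted = [t for t in items if t[1] >= materiality or t[2]]
    remainder = [t for t in items if t not in targeted]
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    randomly_chosen = rng.sample(remainder, min(n_random, len(remainder)))
    return targeted, randomly_chosen


targeted, randomly_chosen = select_samples(
    transactions, MATERIALITY, RANDOM_SAMPLE_SIZE
)
print("Targeted ids:", [t[0] for t in targeted])
print("Random ids:  ", sorted(t[0] for t in randomly_chosen))
```

    Fixing the random seed is a design choice for the sketch: it lets the same selection be reproduced later when the working papers are reviewed.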

    This approach is time-consuming. Therefore, it is a good idea to

    reduce the sample size by considering whether analytical procedures or tests of

    controls can be performed to obtain assurance in relation to the items not tested.

    2.4.2.3. Detailed Tests of Balances

    Detailed tests of balances (also called substantive tests of balances)

    focus on the ending balance of each general ledger account. They are performed

    after the balance sheet date to gather sufficient competent evidence as a reasonable basis

    for expressing an opinion on fair presentation of financial statements (Rezaee, Elam &

    Sharbatoghlie, 2001, 155). The extent of testing depends on the results of tests of controls,

    analytical procedures and detailed tests of transactions relating to each account. As with

    detailed tests of transactions, the sample size can vary and remains a matter of

    professional judgement.

    Techniques applied in this kind of test include account

    reconciliation, third party confirmation, observation of the items comprising an account

    balance and agreement of account details to supporting documents.

    2.5. Summary

    Auditing is the accumulation and evaluation of evidence about information to

    determine and report on the degree of correspondence between the information and

    established criteria. As seen in figure 2.1, the main audit engagement processes are

    client acceptance, planning, execution and completion.

  • 8/4/2019 Data Mining as an Audit Tool

    21/140

    - 15 -

    Figure 2.1: Summary of audit engagement processes

    [Flowchart. Phases: Client Acceptance; Planning; Execution & Documentation; Completion.

    Steps shown: gather information; evaluate client; mobilize; gather information in detail; perform preliminary analytical procedures; assess risk and control; set materiality; develop audit plan and detailed audit program; perform tests of controls; perform substantive tests (detailed tests of transactions, analytical procedures, detailed tests of balances); document testing results; gather audit evidence and audit findings; review subsequent events; evaluate overall results; issue auditor's report; arrange clearance meeting with client; evaluate team performance.

    Detail boxes, shown beside a control reliance scale (High to Low):

    Tests of Controls: identify controls; assess control reliance; select samples; test controls; further investigate for unusual items; evaluate results.

    Analytical Review: develop expectations; compare expectations with actual figures; further investigate for major differences; evaluate results.

    Detailed Tests: select samples; test samples; further investigate for unusual items; evaluate results.]

    Planning includes mobilization, information gathering, risk assessment and

    audit program preparation. Two basic types of audit approaches the auditors can use

    during execution phase are tests of controls and substantive tests. Substantive tests

    include analytical procedures, detailed tests of transactions and detailed tests of

    balances. The extent of testing is based on the professional judgement of the auditors.

    However, materiality, control reliance and risks are also major concerns.

    The final output of audit work is the auditor's report. The type of audit report --

    unqualified, unqualified with explanatory paragraph or modified wording, qualified,

    adverse or disclaimer -- depends on the combination of evidence obtained from the

    field work and the audit findings.

    At the end of each working period, the accumulated evidence and performance

    evaluation should be reviewed to assess scope for improving efficiency or effectiveness

    for the next auditing period.

    It is generally accepted that auditing is not a very profitable line of business for auditing firms.

    Instead, value-added services, also known as assurance services, such as consulting

    and legal services, are more profitable. The reason is that while the costs of all services are

    relatively similar, clients are willing to pay only a limited amount for auditing services

    compared with other services. However, auditing has to be trustworthy and standardized,

    and all the above-mentioned auditing tasks are, more or less, time-consuming and require

    professional staff involvement. Thus, the main cost of an audit engagement is the

    salary of professional staff, and it is considerably high. This cost pressure is a major

    problem facing the auditing profession today.

    To improve the profitability of the auditing business, efficient utilization of

    professional staff seems to be the only practical method. The question is how. Some

    computerized tools and techniques have been introduced into the auditing profession in order to

    assist and enhance auditing tasks. However, the level of automation is still

    questionable. As long as these tools still require professional staff involvement, auditing cost

    will remain unavoidably high.


    3. Current Auditing Computerized Tools

    3.1. Objective and Structure

    The objective of this chapter is to provide information about technological

    tools and techniques currently used by auditors. Section 3.2 discusses why computer

    assisted auditing tools (CAATs) have become a necessity in the auditing profession at

    present. In section 3.3, generalized audit software (GAS) is reviewed in detail. The discussion

    focuses on the most popular package, Audit Command Language (ACL). Other

    computerized tools and techniques are briefly identified in section 3.4. Finally, a brief

    summary of some currently used CAATs is provided in section 3.5.

    Before proceeding, it is worth noting that this chapter was mainly based on two

    textbooks and one journal, which are Accounting Information Systems (Bonar &

    Hopwood, 2001), Core Concept of Accounting Information System (Moscove,

    Simkin & Bagranoff, 2000) and Audit Tools (Needleman, 2001).

    3.2. Why Computer Assisted Auditing Tools?

    It is accepted that advances in technology have affected the audit process.

    With the ever increasing system complexity, especially the computer-based accounting

    information systems, including enterprise resource planning (ERP), and the vast amount

    of transactions, it is impractical for auditors to conduct the overall audit manually. It is

    even less feasible in an e-commerce intensive environment because all the accounting

    data auditors need to access is computerized.

    Over the past ten years, auditors have frequently obtained technical assistance in some

    auditing areas from information system (IS) auditors, also called electronic data

    processing (EDP) auditors. However, as computer-based accounting information

    systems become commonplace, such technical skill is even more important. The rate of

    growth of the information system practices within the big audit firms (known as the

    Big Five) was estimated at 40 to 100 percent over the period 1990 to 2005

    (Bagranoff & Vendrzyk, 2000, 35).

    Nowadays, the term "auditing with the computer" is extensively used. It

    describes the use of technology by auditors to perform some audit work


    that would otherwise be done manually or outsourced. Such technologies are commonly

    referred to as computer assisted auditing tools (CAATs) and they now play an

    important role in audit work.

    In auditing with the computer, auditors employ CAATs with other auditing

    techniques to perform their work. As its name suggests, CAAT is a tool to assist

    auditors in performing their work faster, better, and at lower cost. As CAATs become

    more common, this technical skill is as important to the auditing profession as auditing

    knowledge, experience and professional judgement.

    A variety of software is available to assist auditors. Some packages are

    general-purpose software and some are specially designed and customized to

    support the entire audit engagement process. Many auditors consider a simple

    general ledger, automated working paper software or even a spreadsheet to be audit

    software. In this thesis, however, the term audit software refers to software that allows

    auditors to perform the overall auditing process, generally known as generalized

    audit software.

    3.3. Generalized Audit Software

    Generalized audit software (GAS) is an automated package originally

    developed in-house by professional auditing firms. It assists auditors in performing

    necessary tasks during most audit procedures, but mostly in the execution and

    documentation phase.

    Basic features of a GAS are data manipulation (including importing, querying

    and sorting), mathematical computation, cross-footing, stratifying, summarizing and file

    merging. It also supports extracting data according to specification, statistical sampling

    for detailed tests, generating confirmations, identifying exceptions and unusual

    transactions and generating reports. In short, GAS packages give auditors the ability to access,

    manipulate, manage, analyze and report data in a variety of formats.
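
    To make a few of these basic features concrete, the sketch below imitates stratifying, summarizing, duplicate identification and cross-footing in plain Python. It is not how ACL or any other GAS package is implemented; the invoice records, the strata boundaries and the function names are assumptions chosen for the illustration.

```python
from collections import Counter, defaultdict

# Hypothetical invoice records: (invoice number, vendor, amount)
invoices = [
    ("INV-001", "Acme", 250.0),
    ("INV-002", "Borealis", 4_800.0),
    ("INV-003", "Acme", 12_500.0),
    ("INV-002", "Borealis", 4_800.0),   # same number entered twice
    ("INV-004", "Cobalt", 90.0),
]

# Assumed strata boundaries (lower bound inclusive, upper bound exclusive).
STRATA = [(0.0, 1_000.0), (1_000.0, 10_000.0), (10_000.0, float("inf"))]


def stratify(records, strata):
    """Count and total the records that fall into each amount band."""
    summary = {band: {"count": 0, "total": 0.0} for band in strata}
    for _, _, amount in records:
        for low, high in strata:
            if low <= amount < high:
                summary[(low, high)]["count"] += 1
                summary[(low, high)]["total"] += amount
                break
    return summary


def find_duplicates(records):
    """Return invoice numbers that appear more than once."""
    counts = Counter(number for number, _, _ in records)
    return sorted(number for number, n in counts.items() if n > 1)


def total_by_vendor(records):
    """Summarize amounts per vendor."""
    totals = defaultdict(float)
    for _, vendor, amount in records:
        totals[vendor] += amount
    return dict(totals)


strata_summary = stratify(invoices, STRATA)
vendor_totals = total_by_vendor(invoices)
duplicates = find_duplicates(invoices)

# Cross-footing: the strata totals and the vendor totals must both
# add up to the same grand total.
grand_total = sum(amount for _, _, amount in invoices)
strata_total = sum(band["total"] for band in strata_summary.values())
print(duplicates, grand_total == strata_total == sum(vendor_totals.values()))
```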

    Some packages also provide more specialized features such as risk assessment,

    continuous monitoring of high-risk transactions and unusual items, fraud detection, key

    performance indicator tracking and standardized audit program generation. With the

    standardized audit program, these packages help users adopt some of the

    profession's best practices.


    Most auditing firms nowadays have either developed their own GAS or

    purchased a commercially available one. Among the commercial

    packages, the most popular is the Audit Command Language (ACL). ACL is

    widely accepted as the leading software for data access, analysis and reporting. Some

    in-house GAS systems of large auditing firms even

    interface with ACL for data extraction and analysis.

    Figure 3.1: ACL software screenshot (version 5.0 Workbook)

    ACL software (figure 3.1) is developed by ACL Services Ltd. (www.acl.com).

    It allows auditors to connect personal laptops to the client's system and then download

    the client's data into their laptops for further processing. It is capable of working on large

    data sets, which makes testing at hundred-percent coverage possible. Moreover, it provides

    a comprehensive audit trail by allowing auditors to view their files, steps and results at

    any time. The popularity of ACL results from its convenience, its flexibility and

    its reliability. Table 3.1 illustrates the features of ACL and how they are used in each

    step of the audit process.


    Audit Processes | ACL Features

    Planning
      Risk assessment | Statistics menu; Evaluation menu

    Execution and Documentation
      Tests of Controls (sample selection; controls testing; results evaluation) | Sampling menu with the ability to specify sampling size and selection criteria; Filter menu; Analyze menu including Count, Total, Statistics, Age, Duplicate, Verify and Search; Expression builder; Evaluation menu
      Analytical Review (expectations development; expected versus actual figures comparison; results evaluation) | Statistics menu; Merge command; Analyze menu including Statistics, Age, Verify and Search; Expression builder; Evaluation menu
      Detailed Tests (sample selection; sample testing; results evaluation) | Sampling menu with the ability to specify sampling size and selection criteria; Filter menu; Analyze menu including Count, Total, Statistics, Age, Duplicate, Verify and Search; Expression builder; Evaluation menu
      Documentation | Document note; Automatic command log; File history

    Completion
      Lesson learned record | Document Notes menu; Reports menu

    Other Possibilities
      Fraud detection | Analyze menu including Count, Total, Statistics, Age, Duplicate, Verify and Search; Expression builder; Filter menu

    Table 3.1: ACL features used in assisting each step of audit processes

    With ACL's capacity and speed, auditors can shorten the audit cycle while investigating more

    thoroughly. There are three beneficial features that make ACL a promising

    tool for auditors. First, the interactive capability allows auditors to test, investigate, analyze and get the results at the appropriate time. Second, the audit trail capability


    records the history of the files, the commands used by auditors and the results of such

    commands. This includes command log files, which in a way serve as a record of

    the work done. Finally, the reporting capability produces various kinds of reports,

    both predefined and customized.

    However, there are some shortcomings. The most critical one is that, like other

    GAS, it is not able to deal with files that have a complex data structure. Although ACL's

    Open Database Connectivity (ODBC) interface was introduced to reduce this problem,

    some intricate files still require flattening, which presents control and security

    problems.

    3.4. Other Computerized Tools and Techniques

    As mentioned above, there are many computerized tools other than audit

    software that are capable of assisting some part of the audit process. Those tools

    include the following:

    - Planning tools: project management software, personal information

    manager, and audit best practice database, etc.

    - Analysis tools: database management software, and artificial intelligence.

    - Calculation tools: spreadsheet software, database management software,

    and automated working paper software, etc.

    - Sample selection tools: spreadsheet software.

    - Data manipulation tools: database management software.

    - Documents preparation tools: word processing software and automated

    working paper software.

    Instead of using these tools as a substitute for GAS, auditors can combine

    some of them with GAS to improve the efficiency of the audit process. Planning

    tools are a good example.

    Together with the computerized tools, computerized auditing techniques that used to be performed by EDP auditors have now become part of an auditor's

    repertoire. At the least, financial auditors are required to understand what techniques to use,


    how to apply those techniques, and how to interpret the results to support their audit

    findings.

    Such techniques should be employed appropriately to accomplish the audit

    objectives. Some examples are as follows:

    - Test data: test how the system detects invalid data,

    - Integrated test facility: observe how fictitious transactions are processed,

    - Parallel simulation: simulate the original transactions and compare the

    results,

    - System testing: test controls of the client's accounting system, and

    - Continuous auditing: embed an audit program into the client's system.
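
    Of the techniques listed above, parallel simulation lends itself to a short illustration: the auditor independently re-computes a figure and compares it with what the client's system produced. The sketch below assumes a hypothetical extract of invoice lines and a simple "quantity times unit price" rule; the record layout and function names are invented for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical extract from the client's system:
# (invoice id, quantity, unit price, total computed by the client's system)
client_output = [
    ("A-100", 10, Decimal("19.99"), Decimal("199.90")),
    ("A-101", 3, Decimal("250.00"), Decimal("750.00")),
    ("A-102", 7, Decimal("14.50"), Decimal("105.10")),  # deliberately wrong
]


def simulate_total(quantity, unit_price):
    """Auditor's independent re-computation of the invoice total."""
    total = Decimal(quantity) * unit_price
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def parallel_simulation(records):
    """Return (invoice id, client total, simulated total) for every mismatch."""
    exceptions = []
    for invoice_id, quantity, unit_price, client_total in records:
        simulated = simulate_total(quantity, unit_price)
        if simulated != client_total:
            exceptions.append((invoice_id, client_total, simulated))
    return exceptions


exceptions = parallel_simulation(client_output)
for invoice_id, client_total, simulated in exceptions:
    print(f"{invoice_id}: client {client_total}, simulated {simulated}")
```

    Decimal arithmetic is used instead of floating point so that the comparison is not disturbed by binary rounding of monetary amounts.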

    3.5. Summary

    Nowadays, technology affects the way auditors perform their work. To

    conduct the audit, auditors can no longer rely solely on their traditional auditing

    techniques. Instead, they have to combine such knowledge and experience with

    technical skills. In short, the boundary between the financial auditor and the

    information system auditor has become blurred. Therefore, it is important for

    auditors to keep pace with technological development so that they can decide which

    tools and techniques to use and how to use them effectively.

    Computer assisted auditing tools (CAATs) are used to complement the manual

    audit procedures. There are many CAATs available in the market. The challenge for

    auditors is to choose the most appropriate ones for their work. Both generalized

    audit software (GAS), which integrates overall audit functions, and other similar software

    are available to support their work. However, GAS packages tend to be more widely

    used due to their low cost, high capabilities and high reliability.


    4. Data mining

    4.1. Objective and Structure

    The objective of this chapter is to describe the basic concept of data mining.

    Section 4.2 provides some background on data mining and explains its basic elements.

    Section 4.3 describes data mining processes in greater detail. Data mining tools and

    techniques are discussed in section 4.4 and methods of data mining algorithms are

    discussed in section 4.5. Examples of the most frequently used data mining algorithms are

    provided in section 4.6. Finally, a brief summary of data mining is given in

    section 4.7.

    Note that the main content of this chapter is based on CRISP-DM 1.0

    Step-by-Step Data Mining Guide (CRISP-DM, 2000), Data Mining: Concepts and

    Techniques (Han & Kamber, 2000) and Principles of Data Mining (Hand, Heikki &

    Smyth 2001).

    4.2. What Is Data Mining?

    Data mining is a set of computer-assisted techniques designed to automatically

    mine large volumes of integrated data for new, hidden or unexpected information, or

    patterns. Data mining is sometimes known as knowledge discovery in databases

    (KDD).

    In recent years, database technology has advanced rapidly. Vast amounts of

    data have been stored in databases and business people have realized the wealth of

    information hidden in those data sets. Data mining has therefore become the focus of attention

    as it promises to turn those raw data into valuable information that businesses can use to

    increase their profitability.

    Data mining can be used in different kinds of databases (e.g. relational

    database, transactional database, object-oriented database and data warehouse) or other

    kinds of information repositories (e.g. spatial database, time-series database, text or

    multimedia database, legacy database and the World Wide Web) (Han, 2000, 33).

    Therefore, data to be mined can be numerical data, textual data or even graphics and

    audio.


    The capability to deal with voluminous data sets does not mean that data mining

    requires a huge amount of data as input. In fact, the quality of the data to be mined is more

    important. Aside from being a good representative of the whole population, the data sets

    should contain the least possible amount of noise -- errors that might affect mining results.

    Many data mining goals have been recognized; these goals may be

    grouped into two categories -- verification and discovery. Both categories share one

    thing in common -- the final products of the mining process are discovered patterns that

    may be used to predict future trends.

    In the verification category, data mining is used to confirm or disprove

    identified hypotheses or to explain events or conditions observed. However, the

    limitation is that such hypotheses, events or conditions are restricted by the knowledge

    and understanding of the analyst. This category is also called the top-down approach.

    The other category, discovery, is also known as the bottom-up approach. This

    approach is simply the automated exploration of hitherto unknown patterns. Since data

    mining is not limited by the inadequacy of the human brain and does not require a

    stated objective, unexpected patterns might be recognized. However, analysts are still

    required to interpret the mining results to determine whether they are interesting.

    In recent years, data mining has been studied extensively especially on

    supporting customer relationship management (CRM) and fraud detection. Moreover,

    many areas have begun to realize the usefulness of data mining. Those areas include

    biomedicine, DNA analysis, the financial industry and e-commerce. However, there are

    also some criticisms of data mining's shortcomings, such as its complexity, the required

    technical expertise, the low degree of automation, its lack of user friendliness, the lack

    of flexibility and presentation limitations. Data mining software developers are now

    trying to address those criticisms by adopting an interactive development approach. It

    is expected that with the advancement in this new approach, data mining will continue

    to improve and attract more attention from other application areas as well.

    4.3. Data Mining Process

    According to CRISP-DM, a consortium that attempted to standardize the data

    mining process, data mining methodology is described in terms of a hierarchical process

    model with four levels, as shown in Figure 4.1. The first level is data mining phases,


    or processes describing how to deploy data mining to solve business problems. Each phase

    consists of several generic tasks, which are intended to cover all possible data mining situations.

    The next level contains specialized tasks, the actions to be taken to carry out the generic

    tasks in specific situations. To make them unambiguous, the generic tasks of the second level

    have to be described in greater detail. The questions of how, when, where and by

    whom have to be answered in order to develop a detailed execution plan. Finally, the

    fourth level, process instances, is a record of the actions, decisions and results of an

    actual data mining engagement or, in short, the final output of each phase.

    Figure 4.1: Four level breakdown of the CRISP-DM data mining methodology

    (CRISP-DM, 2000, 9)

    The top level, data mining process, consists of six phases which are business

    understanding, data understanding, data preparation, modeling, evaluation and

    deployment. Each phase is described below.

    4.3.1. Business Understanding

    The first step is to map business issues to data mining problems.

    Generic tasks of this step include business objective determination, situation

    assessment, data mining feasibility evaluation and project plan preparation. At the end

    of the phase, a project plan is produced as a guideline for the whole project. The

    plan should include business background, business objectives and deliverables, data

    mining goals and requirements, available and required resources and capabilities,

    assumptions and constraints, as well as an assessment of risks and

    contingencies.

    [Figure 4.1 diagram, levels from top to bottom: Processes / Phases; Generic Tasks; Specialized Tasks; Process Instances]


    This project plan should be dynamic. This means that at the end of

    each phase or at each prescribed review point, the plan should be reviewed and updated

    in order to keep up with the situation of the project.

    4.3.2. Data Understanding

    The objective of this phase is to gain insight into the data set to be

    mined. It includes capturing and understanding the data. The nature of data should be

    reviewed in order to identify appropriate techniques to be used and the expected

    patterns.

    Generic tasks of this phase include data organization, data collection,

    data description, data analysis, data exploration and data quality verification. At the end

    of the phase, the results of all above-mentioned tasks have to be reported.

    4.3.3. Data Preparation

    As mentioned above, one of the major concerns in using data mining

    techniques is the quality of data. The objective of this phase is to ensure that the data sets

    are ready to be mined. The process includes data selection (deciding which data is

    relevant), data cleaning (removing all, or most, incomplete, noisy and

    inconsistent data), data scrubbing (a closely related term for detecting and correcting erroneous records), data integration

    (combining data from multiple sources into a standardized format), data transformation

    (converting standardized data into a ready-to-be-mined format) and data

    reduction (removing redundancies and merging data into an aggregated format).

    The end product of this phase includes the prepared data sets and the

    reports describing the whole process. The characteristics of the data sets could be

    different from the prescribed ones. Therefore, the project plan has to be

    reviewed.
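
    The preparation steps named in this section can be illustrated with a toy pipeline. The sketch below is an assumption-laden example, not a CRISP-DM prescription: the two source layouts, the field names and the choice of min-max scaling as the transformation are all invented for illustration. Note also that dropping exact duplicates is only safe here because the second source is assumed to overlap the first; in real accounting data, identical records can be legitimate.

```python
from collections import defaultdict

# Two hypothetical sources with different field names and formats.
ledger_a = [
    {"account": "4000", "amount": "1200.50", "period": "2001-01"},
    {"account": "4000", "amount": "", "period": "2001-01"},         # incomplete
    {"account": "5100", "amount": "310.00", "period": "2001-01"},
]
ledger_b = [
    {"acct_no": 4000, "value": 799.5, "month": "2001-01"},
    {"acct_no": 5100, "value": 310.0, "month": "2001-01"},          # overlaps A
    {"acct_no": 5100, "value": 310.0, "month": "2001-01"},          # redundant
]


def clean(rows):
    """Data cleaning: drop rows with a missing amount."""
    return [r for r in rows if r["amount"] not in ("", None)]


def integrate(rows_a, rows_b):
    """Data integration: map both sources onto one standard record layout."""
    unified = [(r["account"], float(r["amount"]), r["period"]) for r in rows_a]
    unified += [(str(r["acct_no"]), float(r["value"]), r["month"]) for r in rows_b]
    return unified


def reduce_rows(rows):
    """Data reduction: remove exact duplicates, then aggregate per account."""
    totals = defaultdict(float)
    for account, amount, _ in dict.fromkeys(rows):  # keeps first occurrence
        totals[account] += amount
    return dict(totals)


def transform(totals):
    """Data transformation: min-max scale the totals into the range 0..1."""
    low, high = min(totals.values()), max(totals.values())
    span = (high - low) or 1.0   # avoid division by zero when all values match
    return {account: (value - low) / span for account, value in totals.items()}


reduced = reduce_rows(integrate(clean(ledger_a), ledger_b))
prepared = transform(reduced)
print(reduced, prepared)
```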

    4.3.4. Modeling

    Although the terms model and pattern are often used interchangeably,

    there are some differences between them. A model is a global summary of data sets that

    can describe the population from which the data were drawn, while a pattern describes a

    structure relating to a relatively small local part of the data (Hand, Heikki & Smyth, 2001,

    165). Put simply, a model can be viewed as a set of patterns.


    In this phase, a set of data mining techniques is applied to the

    preprocessed data set. The objective is to build a model that most satisfactorily

    describes the global data set. Steps include data mining technique selection, model

    design, model construction, model testing, model validation and model assessment.

    Note that, typically, several techniques can be applied in parallel to the

    same data mining problem. The modeling can either focus on the most promising

    technique or use many techniques simultaneously. However, the latter approach

    requires cross-validation capabilities and evaluation criteria.
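
    One way to compare several techniques on the same problem is k-fold cross-validation with a common evaluation criterion. The sketch below is a hedged illustration only: the labelled data, the two deliberately simple techniques (a majority-class baseline and a single amount threshold) and the use of accuracy as the criterion are assumptions for the example, not methods taken from the sources cited in this chapter.

```python
from statistics import mean

# Hypothetical labelled data: (transaction amount, 1 if flagged as exception).
data = [
    (120, 0), (9500, 1), (300, 0), (8700, 1), (450, 0), (150, 0),
    (7600, 1), (220, 0), (9100, 1), (380, 0), (510, 0), (8800, 1),
]


def fit_majority(train):
    """Baseline technique: always predict the most frequent label."""
    labels = [label for _, label in train]
    majority = 1 if sum(labels) * 2 > len(labels) else 0
    return lambda amount: majority


def fit_threshold(train):
    """Second technique: split at the midpoint between the two class means."""
    mean_0 = mean(a for a, label in train if label == 0)
    mean_1 = mean(a for a, label in train if label == 1)
    cut = (mean_0 + mean_1) / 2
    return lambda amount: 1 if amount > cut else 0


def cross_validate(fit, rows, k=4):
    """Average accuracy over k folds; fold i holds out every k-th row."""
    scores = []
    for fold in range(k):
        test = [r for i, r in enumerate(rows) if i % k == fold]
        train = [r for i, r in enumerate(rows) if i % k != fold]
        model = fit(train)
        hits = sum(model(amount) == label for amount, label in test)
        scores.append(hits / len(test))
    return mean(scores)


baseline_score = cross_validate(fit_majority, data)
threshold_score = cross_validate(fit_threshold, data)
print(f"baseline {baseline_score:.2f}, threshold {threshold_score:.2f}")
```

    Because every technique is scored on the same held-out folds with the same criterion, the scores are directly comparable, which is the point of the evaluation criteria mentioned above.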

    4.3.5. Evaluation

    After applying data mining techniques in a model with data sets, the

    result of the model will be interpreted. However, it does not mean data mining

    engagement is over once the results are obtained. Such results have to be evaluated in

    conjunction with business objectives and context. If the results are satisfactory, the

    engagement can move on to the next phase. Otherwise, another iteration or a

    return to a previous phase is required. The expertise of analysts is required in this

    phase.

    Besides the result of the model, some evaluation criteria should be

    taken into account. Such criteria include benefits the business would get from the

    model, accuracy and speed of the model, the actual costs, degree of automation, and

    scalability.

    Generic tasks of this phase include evaluating the mining results, reviewing

    the process and determining the next steps. At the end of the phase, the satisfactory

    model is approved and a list of further actions is identified.

    4.3.6. Deployment

    Data mining results are deployed into the business process in this phase.

    The phase begins with the preparation of a deployment plan. In addition, a plan for monitoring

    and maintenance has to be developed. Finally, the success of the data mining engagement

    should be evaluated, including areas to be improved and explored.

    It is also important to accept the possibility of failure.

    No matter how well the model is designed and tested, it is just a model that

    was built from a set of sample data. Therefore, the ability to adapt to business

    change, together with prompt management decisions to correct the model, is required. Moreover, the

    performance of the model needs to be evaluated on a regular basis.

    The sequence of these phases is not rigid, so moving back and forth between

    phases is allowed. Moreover, dependencies can exist between any two phases. At each

    review point, the next step has to be specified -- a step that can be either forward or

    backward.

    The lessons learned during and at the end of each phase should be documented

    as a guideline for the next phase. In addition, all phases as well as

    the result of deployment should be documented for the next engagement. Details

    should include the results of each phase, matters arising, problem-solving options and

    the method selected.

    Besides the CRISP-DM guideline, there are textbooks dedicated to

    integrating data mining into business problems. For the sake of simplicity, I will not

    go into more detail than given above. However, more information may be

    found in Building Data Mining Applications for CRM (Berson, Smith & Kurt, 2000)

    and Data Mining Cookbook (Rud, 2001).

    4.4. Data Mining Tools and Techniques

    Data mining has developed from many fields, including database technology,

    artificial intelligence, traditional statistics, high-performance computing, computer

    graphics and data visualization. Hence, there is an abundance of data mining tools and

    techniques available. However, those tools and techniques can be classified into four

    broad categories: database algorithms, statistical algorithms, artificial

    intelligence and visualization. Details of each category are as follows:

    4.4.1. Database algorithms

    Although data mining does not require a large volume of data as input, it is

    more practical to deploy data mining techniques on large data sets. Data mining is most

    useful for information that the human brain could not capture unaided. Therefore, it can be

    said that the objective of data mining is to mine databases for useful information.

    Thus, many database algorithms can be employed to assist the

    mining process, especially in the data understanding and data preparation phases.

    Examples of those algorithms are data generalization, data normalization, missing data

    detection and correction, data aggregation, data transformation, attribute-oriented

    induction, and fractal and online analytical processing (OLAP).
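    As an illustration only, three of the preparation steps named above (missing data correction, data normalization and data aggregation) might be sketched as follows. The function names, the mean-imputation rule and the account/amount fields are assumptions made for this sketch and do not come from any particular tool.

```python
from collections import defaultdict

def fill_missing(values):
    """Missing data correction: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Data normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def aggregate_by(records, key, field):
    """Data aggregation: sum `field` over all records sharing the same `key`."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[field]
    return dict(totals)

# Hypothetical transaction amounts with one missing value.
amounts = fill_missing([100.0, None, 300.0])
scaled = min_max_normalize(amounts)
```

    Mean imputation is only one possible correction rule; in an audit setting a missing amount may itself be a finding and should not be silently replaced.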

    4.4.2. Statistical algorithms

    The boundary between statistics and data mining is blurred, as almost

    all data mining techniques are derived from the field of statistics. This means statistics can be

    used in almost all data mining processes, including data selection, problem solving,

    result presentation and result evaluation.

    Statistical techniques that can be deployed in data mining processes

    include mean, median, variance, standard deviation, probability, confidence interval,

    correlation coefficient, non-linear regression, chi-square, Bayes' theorem and Fourier

    transforms.
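    A few of the measures listed above can be computed with the Python standard library, as in the hedged sketch below. The sample amounts are invented for illustration, and the population (rather than sample) variance is an arbitrary choice for the example.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

amounts = [2, 4, 4, 4, 5, 5, 7, 9]               # hypothetical amounts
summary = {
    "mean": statistics.mean(amounts),
    "median": statistics.median(amounts),
    "variance": statistics.pvariance(amounts),   # population variance
    "std_dev": statistics.pstdev(amounts),       # population standard deviation
}
```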

    4.4.3. Artificial Intelligence

    Artificial intelligence (AI) is the scientific field that seeks ways to

    produce intelligent behavior in a machine. Artificial intelligence

    techniques are arguably the most widely used in the mining process. Some statisticians even think of

    a data mining tool as artificial statistical intelligence. The capability to learn is the

    benefit of artificial intelligence that is most appreciated in the data mining field.

    Artificial intelligence techniques used in data mining processes include

    neural networks, pattern recognition, rule discovery, machine learning, case-based

    reasoning, intelligent agents, decision tree induction, fuzzy logic, genetic algorithms,

    brute-force algorithms and expert systems.

    4.4.4. Visualization

    Visualization techniques are commonly used to display

    multidimensional data sets in various formats for analysis purposes. They can be viewed as

    advanced presentation techniques that allow users to explore complex multidimensional

    data in a simpler way. Generally, they require human effort to analyze

    and assess the results from their interactive displays. Techniques include audio, tabular,

    scatter-plot matrices, clustered and stacked charts, 3-D charts, hierarchical projection,

    graph-based techniques and dynamic presentation.

    Separating data mining from data warehousing, online analytical processing

    (OLAP) and statistics is not straightforward. What is certain is that data mining is none of

    them. The difference between data warehousing and data mining is quite clear. Though

    some textbooks on data warehousing devote a few pages to the topic of data

    mining, this does not mean that they treat data mining as a part of data warehousing.

    Instead, they all agree that while a data warehouse is a place to store data, data mining is

    a tool to distil the value of such data. Examples of those textbooks are Data

    Management (McFadden, Hoffer & Prescott, 1999) and Database Systems: A

    Practical Approach to Design, Implementation, and Management (Connolly, Begg &

    Strachan, 1999).

    One might argue that the value of data could be realized by using OLAP, as

    claimed in many data warehouse textbooks. OLAP, however, can be thought of as

    another presentation tool that reshapes and recompiles the same set of data in order to help

    users find such value more easily. It requires human intervention both in stating the presentation

    requirements and in interpreting the results. Data mining, on the other hand, uses

    automated techniques to do those jobs.

    As mentioned above, the distinction between data mining and statistics is

    much more complicated. It is accepted that the algorithms underlying data mining tools

    and techniques are, more or less, derived from statistics. In general, however, statistical

    tools are not designed for dealing with enormous amounts of data, whereas data mining tools

    are. Moreover, the target users of statistical tools are statisticians, while data mining is

    designed for business people. This simply means that data mining tools are

    enhancements of statistical tools that blend many statistical algorithms together and

    offer the capability of handling more data in an automated manner as well as a

    user-friendly interface.

    The choice of an appropriate technique and its timing depends on the nature of the

    data to be analyzed, the size of the data sets and the type of methods to be mined. A range

    of techniques can be applied to a problem, either alone or in combination. However,

    when deploying a sophisticated blend of data mining techniques, there are at least two

    requirements that need to be met -- the ability to cross-validate results and a set of

    measurement criteria.

    4.5. Methods of Data Mining Algorithms

    Though data mining software packages are nowadays claimed to be more

    automated, they still require some direction from users. The expected method of the data

    mining algorithm is one such requirement. Therefore, in employing data mining

    tools, users should have a basic knowledge of these methods. The types of data mining

    methods can be categorized in different ways. In general, however, they fall into six broad

    categories: data description, dependency analysis, classification and

    prediction, cluster analysis, outlier analysis and evolution analysis. Details of each method are as follows:

    4.5.1. Data Description

    The objective of data description is to provide an overall description of the

    data, either as a whole or for each class or concept, typically in a summarized, concise and

    precise form. There are two main approaches to obtaining a data description -- data

    characterization and data discrimination. Data characterization summarizes the general

    characteristics of the data, and data discrimination, also called data comparison,

    compares the characteristics of data between contrasting groups or classes. Normally, these

    two approaches are used together.

    Though data description is one among many types of data mining

    algorithm methods, it is usually not the real finding target. Often data description is

    the analyst's first requirement, as it helps to gain insight into the nature of the data and to

    find potential hypotheses, or the last one, in order to present data mining results. An

    example of using data description as a presentation tool is describing the

    characteristics of each cluster, which a neural network algorithm cannot identify by itself.

    Appropriate data mining techniques for this method are attribute-oriented

    induction, data generalization and aggregation, relevance analysis, distance analysis,

    rule induction and conceptual clustering.

    4.5.2. Dependency Analysis

    The purpose of dependency analysis, also called association analysis, is

    to search for the most significant relationships across a large number of variables or

    attributes. Sometimes, association is viewed as one type of dependency in which

    affinities of data items are described (e.g., describing data items or events that

    frequently occur together or in sequence).

    This type of method is very common in the marketing research field. The

    most prevalent one is market-basket analysis. It analyzes which products customers

    frequently buy together and presents them as [Support, Confidence] association rules. The

    support measure states the percentage of events occurring together relative to

    the whole population. The confidence measure states the percentage of

    occurrences of the following event relative to the leading one. For example, the

    association rule in figure 4.2 means that milk and bread were bought together in 6% of all

    transactions under analysis and that 75% of customers who bought milk also bought bread.

    Milk => bread [support = 6%, confidence = 75%]

    Figure 4.2: Example of association rule
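    The two measures in the rule above can be computed directly from a list of transactions. The following is a minimal sketch; the function names and the 100 invented transactions are assumptions chosen so that the numbers match the milk and bread example.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))

def support(transactions, items):
    """Share of all transactions in which the items occur together."""
    return count_containing(transactions, items) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    return both / count_containing(transactions, antecedent)

# Hypothetical data: 6 baskets with milk and bread, 2 with milk only,
# and 92 baskets containing neither item.
baskets = [["milk", "bread"]] * 6 + [["milk"]] * 2 + [["eggs"]] * 92

rule_support = support(baskets, ["milk", "bread"])          # 6 / 100
rule_confidence = confidence(baskets, ["milk"], ["bread"])  # 6 / 8
```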

    Some techniques for dependency analysis are nonlinear regression, rule

    induction, statistical sampling, data normalization, the Apriori algorithm, Bayesian networks

    and data visualization.

    4.5.3. Classification and Prediction

    Classification is the process of finding models, also known as classifiers,

    or functions that map records into one of several discrete prescribed classes. It is

    mostly used for predictive purposes.

    Typically, the model construction begins with two types of data sets --

    training and testing. The training data sets, with prescribed class labels, are fed into the

    model so that the model is able to find the parameters or characteristics that distinguish one

    class from another. This step is called the learning process. Then, the testing data sets,

    without their pre-classified labels, are fed into the model. The model will, ideally,

    automatically assign the correct class labels to those testing items. If the results of

    testing are unsatisfactory, then more training iterations are required. On the other hand,

    if the results are satisfactory, the model can be used to predict the classes of target items

    whose class labels are unknown.
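    The train-then-test cycle described above can be illustrated with a deliberately simple classifier. The sketch below uses a nearest-centroid rule, which is an assumption chosen for brevity and is not one of the techniques named in this thesis; the two features and the "normal" and "suspect" labels are likewise hypothetical.

```python
def train_centroids(rows, labels):
    """Learning step: compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

def accuracy(centroids, rows, labels):
    """Testing step: share of held-out items whose prediction matches the known label."""
    correct = sum(predict(centroids, r) == y for r, y in zip(rows, labels))
    return correct / len(rows)

# Hypothetical labeled training data and held-out testing data.
train_rows = [(1, 1), (2, 1), (8, 9), (9, 8)]
train_labels = ["normal", "normal", "suspect", "suspect"]
test_rows = [(1.5, 2), (8.5, 8)]
test_labels = ["normal", "suspect"]

model = train_centroids(train_rows, train_labels)
score = accuracy(model, test_rows, test_labels)
```

    If the score on the testing data is unsatisfactory, the model would be retrained or redesigned before being applied to unlabeled items with predict.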

    This method is most effective when the underlying reasons for the labeling

    are subtle. The advantage of this method is that the pre-classified labels can be used to

    measure the performance of the model. This gives the model

    developer confidence in how well the model performs.

    Appropriate techniques include neural network, relevance analysis,

    discriminant analysis, rule induction, decision tree, case-based reasoning, genetic

    algorithms, linear and non-linear regression, and Bayesian classification.

    4.5.4. Cluster analysis

    Cluster analysis addresses segmentation problems. The objective of this

    analysis is to separate data with similar characteristics from the dissimilar ones. The

    difference between clustering and classification is that while clustering does not require

    pre-identified class labels, classification does. That is why classification is also called

    supervised learning while clustering is called unsupervised learning.

    As mentioned above, it is sometimes more convenient to analyze data in

    aggregated form and to break it down into details only if needed. For data

    management purposes, cluster analysis is frequently the first required task of the mining

    process. The most interesting cluster can then be selected for further investigation.

    In addition, description techniques may be integrated in order to identify the characteristics

    that provide the best clustering.

    Examples of appropriate techniques for cluster analysis are neural

    networks, data partitioning, discriminant analysis and data visualization.
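    As a hedged illustration of unsupervised segmentation, the sketch below implements k-means, one common data partitioning technique. The thesis does not name k-means specifically; the choice of algorithm, the naive initialization from the first k points and the sample points are all assumptions.

```python
def kmeans(points, k, iterations=20):
    """Partition `points` into k clusters; no class labels are required."""
    centroids = [list(p) for p in points[:k]]   # naive deterministic start
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, clusters

# Hypothetical two-attribute records forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

    Real implementations usually choose the starting centroids at random and run several restarts, since the result depends on the initialization.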

    4.5.5. Outlier Analysis

    Some data items that are distinctly dissimilar to others, or outliers, can be

    viewed as noise or errors, which ordinarily need to be removed before the data sets are fed

    into a data mining model. However, such noise can be useful in cases where

    unusual items or exceptions are the major concern. Examples are fraud detection, unusual

    usage patterns and remarkable response patterns.

    The challenge is to distinguish the outliers from the errors. During the

    data understanding phase, data cleaning and scrubbing is required. This

    step includes finding erroneous data and trying to fix it. As a result, the chance of

    detecting interesting deviations might be diminished. On the other hand, if the

    incorrect data remained in the data sets, the accuracy of the model would be

    compromised.

    Appropriate techniques for outlier analysis include data cube,

    discriminant analysis, rule induction, deviation analysis and non-linear regression.
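    One simple form of deviation analysis flags values that lie far from the median. The sketch below uses the modified z-score based on the median absolute deviation (MAD) with the conventional cut-off of 3.5; this particular score, the threshold and the sample amounts are assumptions for illustration rather than the method used in this thesis.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []   # no spread: the score is undefined, so flag nothing
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

# Hypothetical payment amounts; one is far from the rest.
flagged = mad_outliers([100, 102, 98, 101, 99, 500])
```

    Whether a flagged item is an error to be cleaned or an exception worth investigating still has to be decided by the analyst.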

    4.5.6. Evolution Analysis

    This method is the newest one. Evolution analysis was created to

    exploit a promising capability of data warehouses, namely the collection of data or events

    over a period of time. Now that business people have come to realize the value of capturing

    trends in the time-related data held in a data warehouse, this method is attracting

    increasing attention.

    The objective of evolution analysis is to determine the most significant

    changes in data sets over time. In other words, it combines the other types of algorithm methods

    (i.e., data description, dependency analysis, classification or clustering) with

    time-related and sequence-related characteristics. Therefore, the tools and techniques available for

    this type of method include all the tools and techniques of the other types as well as

    time-related and sequential data analysis tools.

    Examples of evolution analysis are sequential pattern discovery and

    time-dependent analysis. Sequential pattern discovery detects patterns between events

    such that the presence of one set of items is followed by another (Connolly, 1999, 965).

    Time-dependent analysis determines the relationship between events that correlate within a

    definite period of time.

    Different types of methods can be mined in parallel to discover hidden or

    unexpected patterns, but not all patterns found are interesting. A pattern is interesting if

    it is easily understood, valid, potentially useful and novel (Han & Kamber, 2000, 27).

    Therefore, analysts are still needed in order to evaluate whether the mining results are

    interesting.

    To distinguish interesting patterns, users of data mining tools have to solve at

    least three problems. First, the correctness of patterns has to be measured. For

    example, the measure used in dependency analysis is the [Confidence, Support] value. This is

    easier for methods that have historical or training data sets against which the

    correctness of the patterns can be compared, i.e., the classification and prediction method.

    For methods where training data sets are not available, the professional

    judgement of the users of data mining tools is required.

    Second, an optimization model for the patterns found has to be created. For

    example, the relative significance of confidence versus support has to be formulated. To put

    it in simpler terms, the question is which is better: higher confidence with

    lower support, or lower confidence with higher support.

    Finally, the right point at which to stop searching for patterns has to be specified. This is

    probably the most challenging problem. It leads to two further problems -- how to tell

    whether the current optimized pattern is the most satisfactory one and how to know whether it can be

    used as a generalized pattern on other data sets. In short, while trying to optimize the

    patterns, the over-fitting problem has to be taken into account as well.

    4.6. Examples of Data Mining Algorithms

    As mentioned above, there are plenty of algorithms used to mine data. Due

    to limited space, this section focuses on the most frequently used and

    widely recognized algorithms that can indisputably be regarded as data mining

    algorithms, rather than purely statistical or database algorithms. The examples include

    Apriori algorithms, decision trees and neural networks. Details of each algorithm are

    as follows:

    4.6.1. Apriori Algorithms

    The Apriori algorithm is the one most frequently used in the dependency analysis

    method. It attempts to discover frequent item sets using candidate generation for

    Boolean association rules. A Boolean association rule is a rule that concerns associations

    between the presence or absence of items (Han & Kamber, 2000, 229).

    The steps of Apriori algorithms are as follows:

    (a) The analysis data is first partitioned according to the item sets.

    (b) The support count of each item set (1-itemsets), also called a

    candidate, is computed.

    (c) The item sets that do not satisfy the required minimum support

    count are pruned, thus creating the frequent 1-itemsets (a list of item

    sets that have at least the minimum support count).

    (d) Item sets are joined together (2-itemsets) to create the second-level

    candidates.

    (e) The support count of each candidate is accumulated.

    (f) After pruning unsatisfactory item sets according to the minimum support

    count, the frequent 2-itemsets are created.

    (g) Steps (d), (e) and (f) are repeated until no more frequent

    k-itemsets can be found or, in other words, the next set of frequent k-itemsets

    is empty.

    (h) At the final level, the candidate with the maximum support count

    wins.

    By using Apriori algorithms, the group of items that most frequently

    occur together is identified. However, dealing with a large number of transactions means that

    the candidate generation, counting and pruning steps need to be repeated numerous

    times. Thus, to make the process more efficient, techniques such as hashing

    (reducing the candidate size) and transaction reduction can be used (Han & Kamber,

    2000, 237).
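    The candidate generation, counting and pruning steps listed above can be sketched in a few lines. This is a minimal, unoptimized illustration (no hashing or transaction reduction); the function name, the sample baskets and the decision to return every frequent item set rather than a single winner are assumptions of the sketch.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset(item set): support count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Support count = number of transactions containing the candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Steps (a)-(c): count single items and prune the infrequent ones.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= min_support_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Step (d): join frequent (k-1)-itemsets into k-item candidates.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Apriori property: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Steps (e)-(f): count the candidates and prune by minimum support.
        frequent = {c: n for c, n in count(candidates).items()
                    if n >= min_support_count}
        result.update(frequent)
        k += 1   # Step (g): repeat until no frequent k-itemsets remain.
    return result

# Hypothetical baskets.
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
frequent_sets = apriori(baskets, min_support_count=3)
```

    With these baskets, the pair milk and butter occurs only twice, so it is pruned and the three-item candidate is never counted, which is exactly the saving the pruning step is meant to provide.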

    4.6.2. Decision Trees

    A decision tree is a predictive model with a tree, or hierarchical, structure. It

    is used mostly in classification and prediction methods. It consists of nodes, which

    contain classification questions, and branches, which represent the answers to those questions. At the

    lowest level of the tree -- the leaf nodes -- the label of each class is identified.

    The structure of a decision tree is illustrated in figure 4.3.

    Typically, like other classification and prediction techniques, a decision

    tree begins with an exploratory phase. It requires labeled training data sets to be fed in.

    The underlying algorithm tries to find the best-fit criteria to distinguish one class

    from another. This is also called tree growing. The major concerns are the quality of

    the classification questions as well as the appropriate number of levels of the tree. Some

    leaves and branches need to be removed in order to improve the performance of the

    decision tree. This step is called tree pruning.

    At the next stage, the resulting model can be used as a prediction

    tool. Before that, the testing data sets should be fed into the model to evaluate the

    model performance. Scalability of the model is the major concern in this phase.
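    To make tree growing more concrete, the sketch below searches one numeric attribute for the threshold that best separates the classes. It scores splits by Gini impurity, the criterion commonly associated with CART; the thesis does not state which criterion its tools use, so the criterion, the function names and the sample data are assumptions.

```python
def gini(labels):
    """Gini impurity: 0.0 when all labels are equal, larger when mixed."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Find the threshold t on one numeric attribute that minimizes the
    weighted Gini impurity of the two branches (value <= t, value > t)."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(values))[:-1]:   # the largest value cannot split
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical attribute values with known group labels.
threshold, impurity = best_split(
    [10, 20, 30, 40, 50, 60],
    ["A", "A", "A", "B", "B", "B"],
)
```

    A full tree grower would apply best_split recursively to each branch and over every attribute, then prune branches that do not improve performance on held-out data.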

    Figure 4.3: A decision tree classifying transactions into five groups

    [Figure 4.3 content: the root node (50 transactions) asks "x > 35?" and splits into

    nodes of 15 and 35 transactions. The 15-transaction node asks "y > 52?" and splits into

    Group E (9) and Group D (6). The 35-transaction node asks "y > 25?" and splits into a

    25-transaction node and Group C (10). The 25-transaction node asks "x > 65?" and splits

    into Group A (15) and Group B (10). Every split has one "No" and one "Yes" branch.]

    The underlying algorithms can differ between models. Probably

    the most popular ones are Classification and Regression Trees (CART) and Chi-Square

    Automatic Interaction Detector (CHAID). For the sake of simplicity, I will not go into

    the details of these algorithms; only an overview of them is provided.

    CART is an algorithm developed by Leo Breiman, Jerome Friedman,

    Richard Olshen and Charles Stone. The advantage of CART is that it automates the

    pruning process by cross-validation and other optimizers. It is capable of handling

    missing data and it sets the unqualified records apart from the training data sets.

    CHAID is another decision tree algorithm that uses contingency tables

    and the chi-square test to create the tree. The disadvantage of CHAID compared with

    CART is that it requires more data preparation.

    4.6.3. Neural Networks

    Nowadays, neural networks, or more correctly the arti