QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVAL By MARK CECCHINI A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2005


Page 1: QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVALufdcimages.uflib.ufl.edu/UF/E0/01/14/30/00001/cecchini_m.pdf · 2010-05-05 · Requirements

QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS

AND INFORMATION RETRIEVAL

By

MARK CECCHINI

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2005

Copyright 2005

by

Mark Cecchini

This document is dedicated to Tara, Julian and Campbell, who were my inspiration in pursuing and finishing a PhD.

ACKNOWLEDGMENTS

I would like to thank Tara and the rest of the Cecchinis for putting up with me

throughout this process. I would also like to thank our families for their support

throughout this endeavor. Without my committee there would be no dissertation. So, I

would like to acknowledge my advisor, Gary Koehler, who came up with the initial

research idea and has seen this research through from the beginning; Haldun Aytug, who

has been working on this project for three years; Praveen Pathak, for his information

retrieval expertise; and Gary McGill, for helping me to understand the accounting

relevance of the work. Finally, I’d like to thank Karl Hackenbrack for his guidance in the

early stages of this work.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF OBJECTS
ABSTRACT

CHAPTER

1 INTRODUCTION AND MOTIVATION

2 FINANCIAL EVENTS
    2.1 Fraud Detection
    2.2 Bankruptcy Detection
    2.3 Restatement Detection

3 INFORMATION RETRIEVAL METHODOLOGIES
    3.1 Overview
    3.2 Vector Space Model
    3.3 WordNet
    3.4 Ontology Creation

4 MACHINE LEARNING METHODOLOGIES
    4.1 Statistical Learning Theory
    4.2 Support Vector Machines
    4.3 Kernel Methods
        4.3.1 General Kernel Methods
        4.3.2 Domain Specific Kernels

5 THE FINANCIAL KERNEL

6 THE ACCOUNTING ONTOLOGY AND CONVERSION OF DOCUMENTS TO TEXT VECTORS
    6.1 The Accounting Ontology
        6.1.1 Step 1: Determine Concepts and Novel Terms that are specific to the accounting domain
        6.1.2 Step 2: Merge Novel Terms with Concepts
        6.1.3 Step 3: Add multi-word domain concepts to WordNet
    6.2 Converting Text to a Vector via the Accounting Ontology

7 COMBINING QUANTITATIVE AND TEXT DATA

8 RESEARCH QUESTIONS, METHODOLOGY AND DATA
    8.1 Hypotheses
    8.2 Research Model
    8.3 Datasets
        8.3.1 Fraud Data
        8.3.2 Bankruptcy Data
        8.3.3 Restatement Data
    8.4 The Ontology
    8.5 Data Gathering and Preprocessing
        8.5.1 Preprocessing-Quantitative Data
        8.5.2 Preprocessing-Text Data

9 RESULTS
    9.1 Fraud Results
    9.2 Discussion of Fraud Results
    9.3 Bankruptcy Results
    9.4 Discussion of Bankruptcy Results
    9.5 Restatement Results
    9.6 Discussion of Restatement Results
    9.7 Support of Hypotheses

10 SUMMARY, CONCLUSION AND FUTURE RESEARCH
    10.1 Summary
    10.2 Conclusion
    10.3 Future Research

APPENDIX

A ONTOLOGIES AND STOPLIST
    A.1 Ontologies
        A.1.1 GAAP, 300 Dimensions, 100 concepts, 100 2-grams, 100 3-grams
        A.1.2 GAAP, 60 Dimensions, 40 concepts, 10 2-grams, 10 3-grams
        A.1.3 GAAP, 10 Dimensions, 10 concepts
        A.1.4 10K, Bankruptcy, 100 Dimensions
        A.1.5 10K, Bankruptcy, 50 Dimensions, 50 Concepts
        A.1.6 10K, Bankruptcy, 25 Dimensions, 25 concepts
        A.1.7 10K, Fraud, 150 Dimensions, 50 concepts, 50 2-grams, 50 3-grams
        A.1.8 10K, Fraud, 50 Dimensions, 50 concepts
        A.1.9 10K, Fraud, 25 Dimensions, 25 concepts
    A.2 Stoplist

B QUANTITATIVE AND TEXT DATA
    B.1 Quantitative Data
    B.2 Text Data

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

1 – Financial Kernel Validation
2 – Fraud Detection Results using Financial Kernel
3 – Fraud Detection Results using Text Kernel, 300 Dim GAAP Ont.
4 – Fraud Detection Results using Comb. Kernel, 300 Dim GAAP Ont.
5 – Fraud Detection Results using Text Kernel, 60 Dim GAAP Ont.
6 – Fraud Detection Results using Comb. Kernel, 60 Dim GAAP Ont.
7 – Fraud Detection Results using Text Kernel, 10 Dim GAAP Ont.
8 – Fraud Detection Results using Comb. Kernel, 10 Dim GAAP Ont.
9 – Fraud Detection Results using Text Kernel, 150 Dim 10K Ont.
10 – Fraud Detection Results using Comb. Kernel, 150 Dim 10K Ont.
11 – Fraud Detection Results using Text Kernel, 50 Dim 10K Ont.
12 – Fraud Detection Results using Comb. Kernel, 50 Dim 10K Ont.
13 – Fraud Detection Results using Text Kernel, 25 Dim 10K Ont.
14 – Fraud Detection Results using Comb. Kernel, 25 Dim 10K Ont.
15 – Bankruptcy Prediction Results using Financial Kernel
16 – Bankruptcy Prediction Results using Text Kernel, 300 Dim GAAP Ont.
17 – Bankruptcy Prediction Results using Comb. Kernel, 300 Dim GAAP Ont.
18 – Bankruptcy Prediction Results using Text Kernel, 60 Dim GAAP Ont.
19 – Bankruptcy Prediction Results using Combination Kernel, 60 Dim GAAP Ont.
20 – Bankruptcy Prediction Results using Text Kernel, 10 Dim GAAP Ont.
21 – Bankruptcy Prediction Results using Combination Kernel, 10 Dim GAAP Ont.
22 – Bankruptcy Prediction Results using Text Kernel, 100 Dim 10K Ont.
23 – Bankruptcy Prediction Results using Combination Kernel, 100 Dim 10K Ont.
24 – Bankruptcy Prediction Results using Text Kernel, 50 Dim 10K Ont.
25 – Bankruptcy Prediction Results using Combination Kernel, 50 Dim 10K Ont.
26 – Bankruptcy Prediction Results using Text Kernel, 25 Dim 10K Ont.
27 – Bankruptcy Prediction Results using Text Kernel combined with Financial Attributes, 25 Dim 10K Ont.
28 – Restatement (1,379 cases) Prediction Results using Financial Kernel
29 – Restatement Prediction Results using Financial Kernel
30 – Restatement Prediction Results using Text Kernel, 300 Dim GAAP Ont.
31 – Restatement Prediction Results using Comb. Kernel, 300 Dim GAAP Ont.
32 – Restatement Prediction Results using Text Kernel, 60 Dim GAAP Ont.
33 – Restatement Prediction Results using Combination Kernel, 60 Dim GAAP Ont.
34 – Restatement Prediction Results using Text Kernel, 10 Dim GAAP Ont.
35 – Restatement Prediction Results using Combination Kernel, 10 Dim GAAP Ont.
36 – Restatement Prediction Results using Text Kernel, 150 Dim 10K Ont.
37 – Restatement Prediction Results using Combination Kernel, 150 Dim 10K Ont.
38 – Restatement Prediction Results using Text Kernel, 50 Dim 10K Ont.
39 – Restatement Prediction Results using Combination Kernel, 50 Dim 10K Ont.
40 – Restatement Prediction Results using Text Kernel, 25 Dim 10K Ont.
41 – Restatement Prediction Results using Text Kernel combined with Financial Attributes, 25 Dim 10K Ont.

LIST OF FIGURES

Figure

1 – Ontology Hierarchy
2 – Basic Graph Kernel
3 – Graph Kernel
4 – The Financial Kernel 1
6 – Updated Financial Kernel
7 – Accounting Ontology Creation Process
8 – WordNet Noun hierarchy with Domain Concepts
9 – WordNet Noun hierarchy with Domain Concepts enriched with Novel Terms
10 – WordNet Noun Hierarchy with Domain Concepts, Novel Terms and Multi-Word Concepts
11 – Text Kernel
12 – Combined Kernel
13 – The Discovery Process
14 – Fraud Features
15 – Bankruptcy Features
16 – Fraud Dataset
17 – Bankruptcy Dataset
18 – Restatement Dataset

LIST OF OBJECTS

Object

1. Text Data

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVAL

By

Mark Cecchini

August, 2005

Chair: Gary Koehler

Major Department: Decision and Information Sciences

A financial event is any happening which dramatically changes the value of a firm.

Examples of financial events are management fraud, bankruptcy, exceptional earnings

announcements, restatements, and changes in corporate structure. This dissertation

creates a method for timely detection of financial events using machine learning methods

to create a discriminant function. Because any financial event has a myriad of possible

causes, the method must be powerful. To increase the power of current detection

methods, text related to the company is analyzed together with quantitative

information on the company. The text variables are chosen based on an

automatically created accounting ontology. The quantitative variables are mapped to a

higher dimension which takes into account ratios and year-over-year changes. The

mapping is achieved via a kernel. Support vector machines use the kernel to perform the

learning task. The methodology is tested empirically on three datasets: management

fraud, bankruptcy, and financial restatements. The results show that the methodology is

competitive with the leading management fraud detection methods. The bankruptcy and

restatement results show promise.

CHAPTER 1 INTRODUCTION AND MOTIVATION

SAS 99, Consideration of Fraud in a Financial Statement Audit, establishes

external auditors’ responsibility to plan and perform audits to provide a reasonable

assurance that the audited financial statements are free of material fraud. Recent events

highlight that failing to detect fraudulent financial reporting not only exposes the audit

firm to adverse legal consequences (e.g., the demise of Arthur Andersen LLP), but

exposes the audit profession to increased public and governmental scrutiny that can lead

to fundamental changes in the structure of the public accounting industry, accounting

firm conduct, and government oversight of the accounting profession (consider, for

example, the Sarbanes-Oxley Act of 2002 [89] and subsequent actions of the SEC [92] and

NYSE [71]). Research that helps auditors better assess the risk of material misstatement

during the planning phase of an audit can help reduce instances of fraudulent reporting. Such

research is of interest to academics, standard setters, regulators, and audit firms.

Current research in accounting has examined methods to assess the risk of

fraudulent financial reporting. The methodologies are varied and usually combine some

behavioral and quantitative factors. For example, Loebbecke, Eining and Willingham [55]

compiled an extensive list of company characteristics associated with fraudulent

reporting (called "red flags"). This list contains financial ratios and behavioral

characteristics of company management. Other methods scrutinize accounting entries that

are not easily verified by outside sources; these entries are called discretionary accruals.

Board composition and executive compensation are also used to model the type of

environment that is ripe for fraud.

This dissertation proposes a methodology that can estimate the likelihood of

fraudulent financial reporting. The resulting decision-aid has the potential to complement

the unaided auditor risk assessments envisioned in SAS 99. Our approach combines

novel aspects of the fraud assessment research in accounting with computational methods

and theory used in Information Retrieval (IR) and machine learning/data mining.

Machine learning uses computational techniques to automate the discovery of

patterns that may be difficult to find by normal analytic techniques. Machine learning

methodologies have been used to determine financial statement validity and, on

related problems, the likelihood of bankruptcy and creditworthiness. Many

models are commonly used in machine learning; neural networks [66], linear discriminant

functions [34], logit functions [3], and decision trees [80] are popular choices. Attempts

have been made to recognize patterns in fraudulent companies using neural networks,

linear discriminant functions, logit functions, and decision trees. These studies utilized

quantitative data from financial statements and surveys from auditors. Unlike these

earlier studies, recent advances in machine learning theory consider generalization ability

and domain knowledge while the learning task is undertaken.

Existing work on fraud detection has left out a key source of information about the

company: text documents. In most public documents, the preponderance of information

is textual but most automated methods for detecting fraud are based on quantitative

information alone. So, either an expert has to distill the text to numbers, which is a

monumental task, or the text-based information is largely ignored. We hypothesize that

there is information hidden “between the lines” that is overlooked. Our approach can

incorporate textual materials like management discussion and analysis, news articles, and

so on.

An area of research called Information Retrieval (IR) can help us to make use of

the text. IR is often employed in library science and, more recently, in powerful Internet

search engines (such as Google [39]). IR is used for varied purposes, including question

answering, document sorting, knowledge engineering, query expansion, and inferencing.

We use IR methodologies to cull the financial text down to numbers, which can be used

in conjunction with numerical attributes obtained from the financial statements to

automatically predict the likelihood of fraud.
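As a rough illustration of what turning text into numbers can look like, the sketch below counts how often each term from a small concept list appears in a passage. The concept list and the sample sentence are hypothetical placeholders; the dissertation derives its text variables from an automatically created accounting ontology, which this sketch does not reproduce.

```python
import re
from collections import Counter

# Hypothetical concept list, used only for illustration. The actual text
# variables come from an accounting ontology that is not shown here.
CONCEPTS = ["revenue", "inventory", "litigation", "restatement"]

def text_to_vector(text, concepts=CONCEPTS):
    """Return one term count per concept, in the order of `concepts`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[concept] for concept in concepts]

sample = ("Revenue grew, but inventory write-downs and pending "
          "litigation reduced revenue quality.")
print(text_to_vector(sample))
```

A vector of this kind can sit alongside numerical attributes from the financial statements as input to a learning method.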

What distinguishes the proposed approach from prior attempts to understand and

aid fraud-risk assessments are advances in machine learning theory, both through a

theory that addresses generalization errors and methods incorporating domain knowledge

while the learning task is undertaken, and in IR that enable computer programs to analyze

textual materials.

The methodologies we create can be generalized to other accounting issues, such as

the early detection of bankruptcy, detection of earnings management, early detection of

increased market value, and general industry stability. Each of these issues has the

potential to impact a company’s value significantly shortly after a related first press

release or news item is made public. As a result of the speedy impact, these issues can be

called financial events. In this dissertation, we look at the early detection of bankruptcy,

together with the detection of management fraud and financial restatements.

The remainder of this dissertation is organized as follows. In the following chapter we

review financial events detection literature and summarize key concepts and results.

Chapter 3 summarizes relevant Information Retrieval literature. Chapter 4 summarizes

relevant machine learning literature. In Chapter 5 we develop a machine learning

methodology that handles quantitative financial data. In Chapter 6 we develop the IR

methodologies that enable us to utilize text for financial event detection. In Chapter 7 we

explain how we put the text data together with the quantitative data. We also extend the

methodology we create in Chapter 5 to include text. These methods are applied to

actual data to answer a number of research questions. The research model and

hypotheses are developed in Chapter 8 and tested in Chapter 9, which presents the

results. Chapter 10 gives a summary, a conclusion, and directions for future research.

CHAPTER 2 FINANCIAL EVENTS

As explained in the Introduction, a financial event is any event that significantly

alters the value of a company. One can think of such an event as one that raises or lowers

the value of the company. A partial list of possible events that lower the value of the

firm is as follows: civil or criminal litigation, bankruptcy, management fraud,

defalcations, restatements, earnings management, and poor press. We focus on three such

events in particular: management fraud, bankruptcy, and restatements. In Section 2.1 we

look at the fraud detection literature from accounting and machine learning. In Section 2.2

we look at bankruptcy detection literature from those perspectives as well. In Section 2.3

we look at the restatement literature.

2.1 Fraud Detection

A key result in audit research was given by Loebbecke, Eining and Willingham [55].

They partitioned a large set of indicators into three main components: conditions,

motivation, and attitude. They found that in 86% of the fraud cases at least one factor from

each component was present, indicating that it is extremely rare for fraud to exist without all

three components existing simultaneously. Hackenbrack [41] found that the relative influence

of such components on auditor fraud-risk assessments varies systematically with auditor

experiences. This research influenced standard setting and much of the fraud assessment

research that has followed.

Bell and Carcello [9] developed a logistic regression model to estimate the likelihood

of fraudulent financial reporting. The significant risk factors considered were as follows:

weak internal control environment, rapid company growth, inadequate or inconsistent

relative profitability, undue management emphasis on meeting earnings

projections, management that lied to the auditors or was overly evasive, the ownership status

(public vs. private) of the entity, and an interaction term between a weak control

environment and an aggressive management attitude toward financial reporting. The

logistic regression model was tested on a sample of 77 fraud engagements and 305

non-fraud engagements. The model scored better than auditing professionals in the detection

of fraud and performed as well as audit professionals on the non-fraud

portion. The authors suggest that this model might be used to satisfy the SAS

82 requirements for assessing the risk of material misstatement due to fraud.
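To make the role of an interaction term concrete, here is a minimal sketch of a logistic model with two main effects and one interaction. All feature names, weights, and the intercept are invented placeholders chosen for illustration; they are not the coefficients or the full variable set of the Bell and Carcello model.

```python
import math

# Hypothetical weights for illustration only; not estimates from any study.
INTERCEPT = -2.0
MAIN_WEIGHTS = {"weak_controls": 1.0, "rapid_growth": 0.5}
INTERACTION_WEIGHT = 1.5  # weak controls combined with aggressive reporting

def fraud_probability(x):
    """Logistic model: P(fraud) = 1 / (1 + exp(-z)) for binary risk factors."""
    z = INTERCEPT
    for name, weight in MAIN_WEIGHTS.items():
        z += weight * x[name]
    # The interaction term is nonzero only when both factors are present.
    z += INTERACTION_WEIGHT * x["weak_controls"] * x["aggressive_reporting"]
    return 1.0 / (1.0 + math.exp(-z))

no_flags = {"weak_controls": 0, "rapid_growth": 0, "aggressive_reporting": 0}
weak_only = {"weak_controls": 1, "rapid_growth": 0, "aggressive_reporting": 0}
weak_and_aggressive = {"weak_controls": 1, "rapid_growth": 0, "aggressive_reporting": 1}

for case in (no_flags, weak_only, weak_and_aggressive):
    print(round(fraud_probability(case), 3))
```

With these placeholder weights, the estimated probability rises more when both factors occur together than the main effect alone would suggest, which is the behavior an interaction term is meant to capture.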

Hansen, McDonald, Messier, and Bell [42] developed a generalized

qualitative-response model to analyze management fraud. They used the same dataset of 77 fraud and

305 nonfraud cases collected by Loebbecke, Eining, and Willingham. They first tested

the model with symmetric costs between type I and type II errors. Over 20 trials the model achieved

an 89.3% predictive accuracy. They then adjusted the model to allow for asymmetric costs

and the accuracy dropped to 85.5%; however, the type II errors decreased markedly. The

consideration of type I and II errors is important in fraud detection research. A type II

error occurs when the model misses an actual fraud company; a type I error occurs when it

flags a non-fraud company. Reducing type II error generally increases type I error. In the case of

fraud detection, type I error is much less important than type II error.
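The trade-off between the two error types can be sketched with a simple decision threshold on a model score. The scores, labels, and cost weights below are made-up toy values, and the thresholding rule is an assumption for illustration rather than the procedure used in the cited study.

```python
# Toy data for illustration only: model scores and true labels (1 = fraud).
SCORES = [0.9, 0.7, 0.4, 0.6, 0.45, 0.2, 0.1, 0.3]
LABELS = [1, 1, 1, 0, 0, 0, 0, 0]

def error_counts(scores, labels, threshold):
    """Return (type I, type II) error counts when flagging scores >= threshold."""
    type_1 = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    type_2 = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return type_1, type_2

def weighted_cost(type_1, type_2, cost_type_1, cost_type_2):
    """Total misclassification cost under possibly asymmetric error costs."""
    return cost_type_1 * type_1 + cost_type_2 * type_2

for threshold in (0.5, 0.35):
    t1, t2 = error_counts(SCORES, LABELS, threshold)
    # Hypothetical 1:10 cost ratio: a missed fraud costs ten false alarms.
    print(threshold, t1, t2, weighted_cost(t1, t2, 1, 10))
```

On this toy data, lowering the threshold removes the missed fraud at the price of one additional false alarm, so the lower threshold is cheaper whenever a missed fraud is weighted much more heavily than a false alarm.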

In fraud detection, discretionary accruals are a cause for concern because they

have been used to help “smooth” fluctuations in periodic income.

Accounts that are used in discretionary accruals, such as Bad Debts Expense, Inventory

and Accounts Receivable, are susceptible to “engineering” on the part of management.

Year-over-year changes in ratios that include these accounts give a clearer

picture of the company.

McNichols and Wilson [56] looked at the provision for bad debts and considered how it

should be reported in the absence of earnings management. Earnings management is a

term that describes a spectrum of “cheating” that at a minimum is aggressive and not in

strict compliance with GAAP, and at a maximum is management fraud. The research

found that firms use the provision for bad debts as an income smoothing method; in other

words, it is raised in times of high earnings and lowered in times of low earnings.

Ragothaman, Carpenter, and Buttars 82 developed an expert system to help

auditors in the planning stage of an audit. The system is designed to detect error potential

in order to determine if additional substantive testing is necessary for the auditors. The

expert system rules were developed using financial statement data. The expert system

methodology was rule induction. The system decides whether the firm is an "error" firm

or a "non-error" firm. If the firm is an "error" firm then the auditor should consider

additional substantive testing. A training sample of 55 firms (22 error firms and 33

non-error firms) and a holdout sample of 37 firms were used. On the training sample, the

system classified 86.4% of error firms and 100% of non-error firms correctly. On the

holdout sample, it classified 83.3% of error firms and 92% of non-error firms correctly.

This study was limited by the available data. The accounting literature on fraud detection

is covered at great length in Davia 25 and Rezaee 84.

Beneish 10 developed a probit model and considered several quantitative financial

variables for fraud detection. Five of the eight variables involved year-over-year changes.

The study considered differing levels of relative error cost. At the 40:1 error cost ratio

(Type I:Type II) 76% of the manipulators were correctly identified. Also, descriptive

statistics showed that the Days Receivable Index and the Sales Growth Index were most

effective in separating the manipulators from the non-manipulators.

2.2 Bankruptcy Detection

Bankruptcy detection is a well-studied area. Many methodologies have been used

to solve this problem, including discriminant analysis, neural networks, fuzzy networks,

ID3 (a decision tree classification algorithm), logistic regression and genetic algorithms.

In this section we describe some major contributions to the literature.

In 1966 Beaver 8 showed the efficacy of financial ratios for detecting bankruptcy.

The study was performed on a dataset of 79 bankrupt and 79 nonbankrupt firms. Beaver

computed the mean values of fourteen financial ratios for all companies in the study for a

five year period prior to bankruptcy. Many of the ratios proved to be valuable to the

detection, because the mean values of the bankrupt companies were significantly

different from the mean values for the nonbankrupt companies.

Altman’s paper in 1968 4 was a seminal work in bankruptcy detection. He

developed a discriminant analysis model using financial ratios. Using a paired-sample

approach, Altman compared twenty-two ratios for efficacy in bankruptcy prediction.

Five ratios stood out as they were able to accurately predict bankruptcy one year

preceding the event. The model predicted bankruptcy correctly 95% of the time and

nonbankruptcy correctly 80% of the time. The resulting function, dubbed the Altman Z-

Score, has been the benchmark for bankruptcy detection work ever since. The specific

ratios of the Altman Z-Score are as follows:

Working Capital/Total Assets (WC/TA)

Retained Earnings/Total Assets (RE/TA)

Earnings Before Interest and Taxes/Retained Earnings (EBIT/RE)

Book Value of Equity/Total Liabilities (BVE/TL).

The function has had several incarnations and its weights differ based on industry.

For the manufacturing industry it is

$6.56\,(WC/TA) + 3.26\,(RE/TA) + 6.72\,(EBIT/RE) + 1.05\,(BVE/TL) = Z\text{-score}$.

The weights on each ratio indicate the ratio’s relative importance for classification

of healthy and unhealthy companies. A score which is less than some threshold means the

company is likely in financial distress while a score greater than or equal to this threshold

means the company is likely safe from bankruptcy, at least for the short term. There is a

gray area around the threshold that can be construed as an area of concern. The

predictive accuracy of this discriminant analysis function is still competitive today for

sorting out healthy companies from unhealthy ones.
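
As a minimal sketch of how such a score is applied, the function below uses the weights exactly as printed above. The cutoff and the width of the gray area are not given in the text, so `threshold` and `gray_width` are hypothetical parameters, and the ratio values are made up for illustration.

```python
# Weights as printed in the text for the manufacturing-industry function.
WEIGHTS = {"WC/TA": 6.56, "RE/TA": 3.26, "EBIT/RE": 6.72, "BVE/TL": 1.05}


def z_score(ratios):
    """Weighted sum of the four financial ratios."""
    return sum(WEIGHTS[name] * ratios[name] for name in WEIGHTS)


def classify(score, threshold, gray_width):
    """Label a score; threshold and gray_width are assumed, not from the text."""
    if abs(score - threshold) < gray_width:
        return "gray area"
    return "likely distressed" if score < threshold else "likely safe"


# Hypothetical ratios for a single firm.
firm = {"WC/TA": 0.2, "RE/TA": 0.1, "EBIT/RE": 0.5, "BVE/TL": 1.0}
score = z_score(firm)
label = classify(score, threshold=2.0, gray_width=0.5)
```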

Altman et al. noted that the discriminant analysis technique had limitations, one

being its inability to handle time series 5. Bankruptcy is the sum product of many events.

A company which goes bankrupt is likely to have been in a deteriorating state for more

than one period. Year-over-year changes can capture this deterioration better than single

year measures.

Ohlson 72 was the first to utilize a logistic regression approach to bankruptcy

prediction. He identified four factors as statistically significant in affecting the

probability of failure within one year. The factors are as follows: the size of the

company, a measure of financial structure, a measure of performance, and a measure of

current liquidity. Another finding of the research was that the predictive powers of linear

transforms of a vector of ratios appear to be robust for estimating the probability of

bankruptcy.

Abdel-khalik and El-Sheshai 1 designed an experiment testing human judgment.

Decision makers (loan officers) were allowed to choose the information cues they use to

make their judgments. The information cues were used to determine whether a loan

would end in default. In comparison to mechanical models (discriminant analysis), loan

officers performed worse. The finding was that the choice of information cues is more

responsible for the lack of correct prediction than the processing of the cues.

Frydman, Altman and Kao 36 developed a recursive partitioning algorithm (RPA)

for bankruptcy classification. The RPA is a Bayesian procedure, with classification rules

derived in order to minimize the expected cost of misclassification. In most cases, the

RPA outperformed Altman’s previous results via discriminant analysis.

Messier and Hansen 62 use inductive inference to analyze examples of bankrupt

companies and loan defaults to infer a set of general rules in the form of if-then-else

statements. The set of output statements is called a production system. The bankrupt

study used only the following ratios: current ratio, earnings to total tangible assets and

retained earnings to total tangible asset. The production system was 100% accurate on a

very small holdout set (12 bankrupt and 4 nonbankrupt). The study also used the

production system to detect potential loan defaults. The method was 100% accurate on

the training sample and 87.5% accurate on a validation sample. In both studies, the

production system used fewer ratios and was more accurate than the discriminant models

it was compared against.

It should be noted that Tsai and Koehler 104 tested the robustness of the results of

several papers using inductive learning, including the results of Messier and Hansen 62.

The authors determined the accuracy of the induced concepts when tested on the same or

similar domains. In the case of Messier and Hansen, their findings included a probability

of error on the learned concept of the bankruptcy sample. The probability that the error

of the learned concept exceeds 20% is 30.96%. This is due, in part, to the small sample

size. The study throws up a caution flag, warning readers that the true accuracy of

concepts learned by induction may not be revealed in studies of small sample size.

Tam and Kiang 101 use a back propagation neural network to predict bank

defaults. They compare their results with k nearest neighbor, discriminant analysis 34,

logistic regression 79 and ID3 80. When considering the year prior to bankruptcy, a

multilayer neural network got the best results. When considering two years prior to

bankruptcy, logistic regression performed the best.

Charalambous, Charitou, and Kaourou 17 compare the performance of three neural

network methods, namely Learning Vector Quantization, Radial Basis Function, and the

Feedforward network. They test their results on 139 matched pairs of bankrupt and

nonbankrupt U.S. firms. Their results indicate that Learning Vector Quantization gave

superior results over feedforward networks and logit analysis.

Piramuthu, Raghavan, and Shaw 77 develop a method of feature construction. The

method finds the features that are most pertinent to the classification problem and

discards the ones that are not. The “constructed” features are fed into a back propagation

neural network. The method was tested on Tam and Kiang’s 101 bankruptcy data. The

network showed a significant improvement in both classification results and computation

time.

2.3 Restatement Detection

The literature covering restatements is found in conjunction with earnings

management literature as well as fraud literature. Each company for which the SEC

discovers fraud is forced to restate. Not all restatements, however, are fraudulent.

Restatements can be made for various reasons, including stock splits, errors, accounting

irregularities, and fraud. Restatements may be voluntary or involuntary. For the

purposes of this research, restatements are defined as in General Accounting Office

report GAO-03-138 37. These restatements may be voluntary or involuntary and only

arise as a result of accounting irregularities. An accounting irregularity is fraudulent if

committed with intention and nonfraudulent if committed by mistake. Restatements can

be seen as a superset of fraud. The restatement literature specifically related to detection

is limited as compared to fraud and bankruptcy. The literature reviewed in this section

gives an overview of the research problems related to restatements.

Dechow et al. 26 evaluate the performance of competing models of earnings

management detection. The models tested are the Jones model, the Modified Jones

model, the DeAngelo model and the Industry model. These models are based on the

amount of discretionary accruals made by a company in a particular year. Discretionary

accruals are not readily observable based on publicly available reports. The models infer

the amount of discretionary accruals based on other inputs and total accruals. The results

show that all methods are accurate for detecting earnings management for extreme cases.

However, all methods gave poor results when faced with discretionary accruals which

were a small percentage of total assets (1% - 5%). Earnings management is more likely

to occur at the 1% - 5% levels, so the practical value of the methods used is brought into

question.

Koch and Wall develop an economic model of earnings management, which

elucidates the situations in which earnings management is most likely to occur, based on

executive compensation packages. The authors determine how accruals can be used to

manage reported earnings. In the paper the authors explain several earnings management

tactics. A partial list follows:

(1) The “Live for Today” strategy - Managers minimize accrued expenses in order to maximize profit.

(2) The “Occasional Big Bath” strategy - Managers attempt to meet earnings targets whenever possible. If meeting the targets looks impossible, they accrue a high amount of expenses in that period to make it easier to meet earnings targets in the next.

(3) The “Miscellaneous Cookie Jar Reserves” strategy - Managers use unrealistic assumptions when estimating accruals.

These methods can be readily detected after the fact. It is much more difficult to detect

these tactics as they are happening.

Abbot et al. 1 study the impact of the audit committee on the likelihood of

restatement. The authors find that an independent and active audit committee

significantly reduces the likelihood of restatement. An audit committee which contains at

least one member with financial expertise further reduces the likelihood of restatement.

This empirical study gives weight to the arguments for having an audit committee.

Feroz et al. 33 study the effect of Accounting and Auditing Enforcement Releases

on company valuation. A reporting violation leads to a 13% decline over a two-day

period, on average. The study also finds that the companies which are in violation

substantially underperform the market in the years prior to the release, indicating that the

incentive to cheat is at least in part due to economic pressures on the executives of the

company.

Hribar and Jenkins 43 study the effect of restatements on a firm’s cost of equity

capital. The authors find that restatements lead to a decrease in expected future earnings

and increases in the firm’s cost of equity capital. The increases were found to be between

7% and 12%. Over the long-term the rates remain higher than before the restatement by

at least 6%. Another finding of the work is that firms with greater leverage are

associated with larger increases in the cost of capital.

Kinney et al. 50 approach the problem from the auditor’s perspective. They study

the correlation between restatements and the amount of non-audit services performed by

the auditor. This topic became especially interesting when the Sarbanes-Oxley Act of

2002 specifically forbade auditors to provide certain non-audit services to their clients.

The study found no significant positive correlation between financial information systems

services and restatements. There was a significant positive correlation between

unspecified non-audit service fees and restatements. This study supports the notion that

auditor independence can be compromised by non-audit consulting engagements with

audit clients.

Peasnell et al. 75 focus on the factors associated with low earnings quality by

looking at a sample of 47 firms which have been identified as having defective financial

statements. A positive correlation between defective financial statements and losses or

significant earnings decreases was found. Restating firms were less likely to increase

dividends or provide optimistic forecasts, and more likely to be involved in corporate

restructuring. Restating firms were also less likely to employ a Big 4 auditor and often

carried higher debt as a percentage of total assets as compared to nonrestating firms. The

study also found that firms which employed active audit committees were less likely to

have defective financial statements.

This chapter reviewed the literature on financial events, covering fraud, bankruptcy,

and restatement research. The next two chapters explain the

methodologies used for this research project. Chapter 3 reviews Information Retrieval

Methodologies and Chapter 4 reviews Machine Learning methodologies.

CHAPTER 3 INFORMATION RETRIEVAL METHODOLOGIES

This chapter presents a general overview of research in the area of IR. As this is an

enormous area of research, the main focus is on contributions to the field as they relate to

this dissertation. Specifically, we focus on methods of ontology creation and WordNet.

The sections are as follows: Section 3.1 provides a brief overview of general IR research.

Section 3.2 explains the Vector Space Model. Section 3.3 explains the basic concepts of

lexical databases with specific details about WordNet. Section 3.4 explains the

fundamentals of ontology creation.

3.1 Overview

“An Information Retrieval system does not inform (i.e., change the knowledge of)

the user on the subject of his inquiry. It merely informs on the existence (or non-

existence) and whereabouts of documents relating to his request” 105. This field of study

has exploded with the reality of massive amounts of text in an online environment – the

Internet. The need to correctly choose the documents that are relevant to a keyword

search has become important to industry (in the form of search engines), decision

scientists, and computer scientists. There is much more to the field of IR than merely

document retrieval. Some of the other areas are as follows.

Question answering systems take natural language questions as input, allowing the

user to avoid learning tedious query structures. In response, the system outputs a number

of short responses, designed to answer the specific question of the user. The goal of

question-answering is to give a more precise response to the user. Whereas normal

document retrieval outputs a list of documents, question-answering outputs small

passages from documents 74.

Query expansion is a research area which has grown tremendously as a result of the

Internet. Query expansion is most commonly used by search engines as a means to

improve the accuracy of results of user queries. A user types a few words as a query, and

the system expands that query by adding words which will presumably give better results.

There are many methods of query expansion. Automatic query expansion uses machine

learning techniques to choose the best expanded query 64.
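
A minimal sketch of the simplest form of query expansion is given below. It does not use machine learning; it looks up each query word in a hand-made table of related words. The table `RELATED` and its entries are hypothetical.

```python
# Hypothetical table of related words; a real system would learn or look these up.
RELATED = {
    "bankruptcy": ["insolvency", "default"],
    "fraud": ["misstatement"],
}


def expand_query(query, related=RELATED):
    """Append related words to the user's query terms, without duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


expanded = expand_query("bankruptcy risk")
```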

Inferencing systems are a generalization of query expansion. They can be used at

all levels of the IR process. They attempt to “infer” the meaning of a query and add

further detail. The inference is usually based on semantic relatedness of the words in the

query. Semantic relatedness can be determined by parsing a particular corpus as in the

case of latent semantic analysis 52, which uses statistical techniques to find co-

occurrences between words in a corpus, or it can be determined by using a lexical

reference system, such as WordNet.

Literature-based discovery uses IR techniques to discover hidden truths from a

particular domain. The basic idea is: parse a set of documents A related to a particular

subject and find a list of subjects that A refers to. Parse a second set of documents B

related to the subjects A refers to in order to find the subjects B refers to. The subjects B

refers to are called C. If some subjects in C are unexplored in relation to A, then they

may be worth looking at. The seminal work in this area is by Swanson 99. Using

Medline (a medical document repository) he was able to find previously unknown

connections between Raynaud’s disease and fish oil. Those connections were tested

empirically by medical researchers. The results showed that fish oil can actually reduce

the symptoms of Raynaud’s disease.
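
The A, B, C procedure described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: each document is assumed to be already parsed into a set of subjects, and the topic names and the `docs_by_topic` index are placeholders rather than Medline data.

```python
def subjects_of(documents):
    """Union of the subjects mentioned across a list of parsed documents."""
    found = set()
    for doc in documents:
        found |= doc["subjects"]
    return found


def discover(topic, docs_by_topic):
    """Return subjects C reached through B that the A literature never mentions."""
    a_docs = docs_by_topic.get(topic, [])
    b_subjects = subjects_of(a_docs) - {topic}
    c_subjects = set()
    for b in b_subjects:
        c_subjects |= subjects_of(docs_by_topic.get(b, []))
    return c_subjects - b_subjects - {topic}


# Placeholder index: topic -> documents about that topic.
docs_by_topic = {
    "topic_a": [{"subjects": {"topic_a", "b1"}}, {"subjects": {"topic_a", "b2"}}],
    "b1": [{"subjects": {"b1", "c1"}}],
    "b2": [{"subjects": {"b2", "c1", "b1"}}],
}
candidates = discover("topic_a", docs_by_topic)
```

Here `c1` is returned because it appears in the B literature but never in the documents about `topic_a`.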

3.2 Vector Space Model

A primary goal of IR research is to relate relevant documents to user queries.

Using IR methods, one seeks to separate relevant textual documents from non-relevant

ones.

A powerful method in IR research is called the vector space model 18, 54, 88. This

approach begins by truncating all words in the document into word stems. Word stems

are the base form of words, without suffixes. Stemming is important because a computer

cannot see that “stymies” and “stymied” are basically the same thing. If we stem the two

words, they both become “stymie.” This allows the computer to see the two as one word,

thus adding to the word's importance (via word count) in the document. Then it

transforms the document into a vector by counting the frequency of each word in the

document. Various ways of normalizing these vectors are available. A key observation

is that these vectors are now quantitative representations of the textual parts of

documents.
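
The stemming and counting steps can be sketched as follows. The function `toy_stem` is a deliberately crude stand-in written for this example and is not a real stemming algorithm; it only handles the “stymies”/“stymied” case discussed above. The sample sentence is also made up.

```python
import re
from collections import Counter


def toy_stem(word):
    """Toy rule only: drop a final 's' or 'd' that follows an 'e'."""
    if len(word) > 3 and word.endswith(("es", "ed")):
        return word[:-1]
    return word


def term_counts(text):
    """Lowercase the text, split it into words, stem them, and count the stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(token) for token in tokens)


counts = term_counts("Fog stymies pilots. Fog stymied drivers.")
```

Both inflected forms are counted under the single stem “stymie”, which is what gives the stem its weight in the document vector.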

Here is a more formal explanation of the vector space model 48. Each document in

the vector space model is represented by a vector of keywords as follows:

$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})'$

where $n$ is the number of keywords and $w_{i,j}$ is the weight of keyword $i$ in document $j$.

This characterization allows us to view the whole document collection as a matrix of

weights and is called the term-by-document matrix. The columns of this matrix are the

documents and the terms are the rows. A document is translated into a point in an

n-dimensional vector space. For this method to be useful, the vectors must be normalized.

The dot product between normalized vectors gives the cosine of the angle between the

two vectors. When the vectors representing two documents are identical, they will have a

cosine of 1; when they are orthogonal, they will receive a cosine of 0. The similarity

measure between documents j and k is as follows:

$$sim(d_j, d_k) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^{2}}}$$
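
The similarity measure can be sketched directly from its definition: the dot product of the two weight vectors divided by the product of their lengths. The zero-vector guard is an added assumption, since the text does not say how an empty document should be treated.

```python
import math


def cosine_similarity(d_j, d_k):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(d_j, d_k))
    norm_j = math.sqrt(sum(a * a for a in d_j))
    norm_k = math.sqrt(sum(b * b for b in d_k))
    if norm_j == 0 or norm_k == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_j * norm_k)


same = cosine_similarity([1, 2, 0], [1, 2, 0])
orthogonal = cosine_similarity([1, 0], [0, 3])
```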

Finding a w that most accurately depicts the importance of the keywords in the

collection is very important to document classification. Sparck Jones 97 made a seminal

breakthrough on this problem with the TF-IDF function. The name stands for Term

Frequency, Inverse Document Frequency. The basic TF-IDF function is as follows:

$w_{i,j} = tf_{i,j} \cdot \log(N/n)$, where $tf_{i,j}$ is the frequency of term $t_i$ in document $d_j$, $N$ is the number of

documents in the collection and $n$ here is the number of documents in which the term $t_i$ occurs

at least once. The logic is as follows: for $tf_{i,j}$, or term frequency, a word that occurs

more often in a document is more likely to be important to the classification of that

document. A measure of inverse document frequency, idf, is defined by $idf_{i,j} = \log(N/n)$.

The logic is that a word that occurs in all documents is not helpful in the classification of

the document (hence the inverse) and therefore gets a 0 value. A word that appears in

only one document is likely to be helpful in classifying that document and gets the largest

value, $\log N$ (which equals 1 when the logarithm is taken to base $N$).
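
A minimal sketch of the basic TF-IDF weighting follows. It assumes each document has already been reduced to a dictionary of term counts, uses the natural logarithm, and works on a made-up three-document collection.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term count by log(N / number of documents containing the term)."""
    total_docs = len(docs)
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())  # each document counts once per term
    return [
        {term: tf * math.log(total_docs / doc_freq[term]) for term, tf in counts.items()}
        for counts in docs
    ]


# Hypothetical term counts for three documents.
docs = [{"fraud": 3, "audit": 1}, {"audit": 2, "loan": 1}, {"audit": 1}]
weights = tf_idf(docs)
```

The term “audit” occurs in every document and so receives a weight of 0, while “fraud” occurs in only one document and keeps a positive weight.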

Many researchers have attempted to improve upon the basic vector space model.

Improvements take the form of making the document vector more accurately depict the

document itself. Part-of-speech tagging is one such improvement. A part-of-speech

tagger reads a document in natural language and tags every word with a part-of-speech,

such as noun, verb, adjective and adverb. The tags are created using sentence structure.

All part-of-speech taggers are heuristics with no guaranteed accuracy. However, the

recent taggers have become so accurate that they make only a few mistakes on entire

corpora 11.

Another improvement is word sense disambiguation (WSD). WSD is the attempt

to understand the actual definition of a word, in the context of a sentence. Often words

that are spelled identically have several meanings. In the basic vector space model, the

document vector would take all instances of the word “crane” and add them up. What if

one sentence read, “The crane is part of the animal kingdom” and another sentence read,

“The crane was the only thing that could move the 2 ton truck to safety”? Crane in the

first sense is referring to a bird whereas crane in the second sentence is referring to a

mechanical device. A word sense disambiguated vector would have two versions of the

word crane if both showed up in the corpus. This avoids some confusion that might arise

were we comparing the similarity between two documents, one which was about the bird

called a crane, and the other which was about the piece of equipment. How is WSD

accomplished? One method is to look at a previously hand-tagged corpus. One such

corpus is called SemCor 22. It is a corpus of documents, which are all tagged with

particular word meanings. Researchers use SemCor as a tool to learn WSD. For

example, take all sets of word pairs from a corpus and compare with SemCor, looking for

pairs that appear together often enough to be considered statistically significant. The

phrase “crane lifts beams” may show up in the corpus. It is possible to determine if the

noun “crane” and the verb “lifts” are found together often enough in SemCor to be

considered significant. If this co-occurrence pair is considered significant, then “crane”

will be given the particular sense number for which it was tagged in SemCor.
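
The co-occurrence idea can be sketched as follows. The tagged pairs below are a hypothetical stand-in for a sense-tagged corpus and are not taken from SemCor, the sense labels are invented, and a simple minimum count replaces a proper test of statistical significance.

```python
from collections import Counter

# Hypothetical sense-tagged pairs: (noun, sense label, co-occurring verb stem).
TAGGED_PAIRS = [
    ("crane", "crane#machine", "lift"),
    ("crane", "crane#machine", "lift"),
    ("crane", "crane#machine", "lift"),
    ("crane", "crane#bird", "lift"),
    ("crane", "crane#bird", "fly"),
]


def disambiguate(noun, verb, tagged_pairs=TAGGED_PAIRS, min_count=2):
    """Return the sense most often tagged for this noun-verb pair, if seen often enough."""
    senses = Counter(s for n, s, v in tagged_pairs if n == noun and v == verb)
    if not senses:
        return None
    sense, count = senses.most_common(1)[0]
    return sense if count >= min_count else None


sense = disambiguate("crane", "lift")
```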

3.3 WordNet

A lexical reference system is one which allows a user to type a word in and get in

return that word’s relationships with other words. “WordNet is an online lexical

reference system whose design is inspired by current psycholinguistic theories of human

lexical memory. English nouns, verbs, adjectives and adverbs are organized into

synonym sets, each representing one underlying lexical concept 22.” The current version

of WordNet has 114,648 nouns, 11,306 verbs, 21,436 adjectives and 4,669 adverbs

22. WordNet is hand-crafted by linguists. The basic relation in WordNet is

called synonymy. Sets of synonyms (called synsets) form its basic building blocks. For

example, the word “history” is in the same block as the words past, past times,

yesteryear, and yore. Due to synonymy, WordNet is much closer to a thesaurus

than a dictionary. Nouns are organized into a separate lexical hierarchy as are verbs,

adjectives and adverbs.

There are two main types of relations in WordNet, lexical relations and semantic

relations. Lexical relations are between words and semantic relations are between

concepts. A concept is another word for a synset. A relationship between concepts can

be hierarchical, as is the case of hyponyms and hypernyms. The hyponym/hypernym is a

relation on nouns. Nouns are separated from other parts of speech because their

relationships are considered different than the relationships between verbs and adjectives.

A hyponym/hypernym relationship is an “is a” relation. WordNet can be represented as a

tree. Starting at the top or root node, the concept is very general (as in the case of

“entity” or “psychological feature”). As you go down the tree, you encounter more

fine-grained concepts. For example, a robin is a subordinate of the noun bird and bird is a

superordinate of robin. The subordinates are called hyponyms (“___ is a kind of bird”) and the

superordinates are called hypernyms (“robin is a kind of ___”). Modifiers, which are adverbs

and adjectives, are connected similarly, as are verbs. Hyponymy is only one of many

relations in WordNet. Below is a list of other WordNet relations with examples 94:

Relation      Example                Applicable POS
Has-Member    Faculty – Professor    Noun
Member-of     Copilot – Crew         Noun
Has-part      Table – Leg            Noun
Part-of       Course – Meal          Noun
Antonym       Leader – Follower      Noun
Antonym       Increase – Decrease    Verb
Troponym      Walk – Stroll          Verb
Entails       Snore – Sleep          Verb
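
The “is a” hierarchy can be sketched with a small dictionary that maps each concept to its hypernym. This is a toy tree, not the WordNet database or its API; the intermediate concept “animal” is an assumption added to give the chain more than one step.

```python
# Toy hypernym links (child -> parent); "animal" is an assumed intermediate node.
HYPERNYM_OF = {"robin": "bird", "bird": "animal", "animal": "entity"}


def hypernym_chain(concept, hypernym_of=HYPERNYM_OF):
    """Walk from a concept up to the root, collecting each superordinate."""
    chain = [concept]
    while chain[-1] in hypernym_of:
        chain.append(hypernym_of[chain[-1]])
    return chain


def is_a(concept, ancestor, hypernym_of=HYPERNYM_OF):
    """True when ancestor appears above concept in the hierarchy."""
    return ancestor in hypernym_chain(concept, hypernym_of)[1:]


chain = hypernym_chain("robin")
```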

Traditional vector space model retrieval techniques focus on the number of times a

word stem appears in a document without considering the context of the word. Consider

the following two sentences, "What are you eating?" "What's eating you?" The words

“what,” “are” and “you” would most likely be stop words. (A stop word is any word that

is thought to have little impact on the classification of any document. Common stop

words are “the”, “and”, “but”, “what”, “are” and “you”. The list of stop words is usually

determined by taking statistics on the document set. If a word appears too often it is said

to carry little weight. This word becomes a stop word. Stop words do not appear in the

document vector.) The two sentences above would have identical representations in the vector

space model. The meanings of the two sentences are, however, completely different.

Using concepts and contexts it is possible to create a lexical reference system that

interprets data specific to a particular area of interest.
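
The two example sentences can be run through a small sketch of stop word removal. The stop word list is the one suggested in the text plus the token “s”, which is an assumption added here because splitting “What's” on the apostrophe leaves a stray “s”.

```python
import re
from collections import Counter

# Stop words from the example; "s" is an assumed leftover from "What's".
STOP_WORDS = {"what", "are", "you", "s"}


def document_vector(text):
    """Count the words that remain after stop words are dropped."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


first = document_vector("What are you eating?")
second = document_vector("What's eating you?")
```

Both sentences reduce to a vector containing only “eating”, so the model cannot tell them apart.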

3.4 Ontology Creation

Figure 1 67 shows that there are three types of ontologies. There are top

ontologies, upper domain ontologies, and specific domain ontologies. Top ontologies are

populated with general, abstract concepts. Upper domain ontologies are more

specialized, but still very general. Specific domain ontologies are populated with

concepts that are specific to a particular subject. Top ontologies for the English language

are relatively complete. Upper domain ontologies and specific domain ontologies are

still under construction 67.

Figure 1 – Ontology Hierarchy

WordNet is a top ontology. Many domain engineers attempt to make domain

specific ontologies using the backbone of top ontologies. Often a problem arises in that

there is a gap between the top ontology and the specific domain ontology. In this case, an

upper domain ontology is necessary. An upper domain ontology connects the top

ontology to the specific domain ontology. The upper domain ontology forms the root

nodes for the Domain Specific Ontologies.

Domain specific ontologies are usually created for a specific purpose and these are

very difficult to obtain. Navigli and Velardi explain “A domain ontology seeks to reduce

or eliminate conceptual and terminological confusion among the members of a user

community who need to share various kinds of electronic documents and information

68.” Domain ontology creation is a new and active research area in IR. Here are some

papers which highlight the current state of the research.

Khan and Luo 49 construct ontologies using domain corpora and clustering

algorithms. The hierarchy is created using a self-organizing tree. WordNet is used to

find domain concepts. The concept hyponyms are added to the tree, under the concept.

This is a novel usage of WordNet and a completely automated method of ontology

construction. The method is tested on the Reuters 21578 text document corpus.

Navigli and Velardi 68 give a step-by-step method explaining the process of

obtaining an ontology. Candidate terminology is extracted from a domain corpus and

filtered by contrastive corpora. The contrastive corpora are used to ignore candidate

terms which are in actuality part of the general domain. The word senses of domain

terminology are discovered via SemCor and WordNet. New domain specific

relationships are determined based on rule-based machine learning techniques. These

relationships are used to determine multi-word terms which are domain specific. Finally,

the domain ontology is trimmed and pruned. This methodology was used to create a

tourism domain ontology.

Vossen 108 describes a methodology of extending WordNet to the technical

domain. The domain corpus is parsed into header and modifier structures. A header is a

noun or verb and a modifier is an adjective or adverb respectively. A header may have

more than one modifier, as in the example “inkjet printer technology”. Here

“technology” is the header and “inkjet” and “printer” are modifiers. Salient multiword

terms are hierarchically organized creating a domain concept forest. A domain concept

forest is a set of concepts related to a specific domain together with relationships between

the concepts. The root node of each of the domain concepts is attached to a WordNet

concept. In the above example “technology” would be the root node. The result is a

domain concept forest attached to WordNet.

Buitelaar and Sacaleanu 15 create a method of ranking synsets by domain

relevance. The relevance of a synset is determined by its importance to the domain

corpus. The importance is determined by the number of times the concept appears in the

corpus. A contrastive corpus is used to filter out concepts that are general, as in Navigli

and Velardi 68. A unique contribution of this research is the usage of hyponyms to

determine domain relevance. A hyponym is lower on the tree; therefore, it is a

specialization of the concept. The authors look at how often a hyponym of a concept

appears in the document as part of the relevance measure. The result is an ordered list of

domain terms. The authors tested the methodology on the medical domain by parsing

medical journal abstracts.

Buitelaar and Sacaleanu 14 extend their work by adding words to domain concepts

based on lexico-syntactic patterns. The domain corpus is parsed to look at the syntax

patterns of seven word combinations. Each pattern is separately considered for

relevance. For all salient patterns, mutual information scores are given to co-occurrences

within the pattern. Novel terms from the domain which are not in WordNet are added to

WordNet concepts if it is determined that they are statistically significant. This

methodology is tested on the medical domain.

In this Chapter, Information Retrieval Methodologies were explained. The specific

areas reviewed were the Vector Space Model, WordNet and Ontology creation. These

areas were chosen because of their relevance to the contributions of this work. In

Chapter 4, Machine Learning Methodologies are reviewed.

CHAPTER 4 MACHINE LEARNING METHODOLOGIES

Most machine learning/data mining methods 66 start with a training set of data from

past cases illustrating positive and negative examples of the concept to be learned. This

is called supervised learning. For example, if we are trying to learn how to discriminate

between companies likely to default on loans in the coming year from those unlikely to

default, we would collect past cases of defaulting and non-defaulting companies as done

in studies such as 1, 62. Such a training set consists of $\ell$ observations and a classification for each. That is, there are $\ell$ pairs of the form $z_i \equiv (\mathbf{u}^i, y_i)$ where $\mathbf{u}^i \in X \subseteq \Re^n$ represent the $n$ input attributes (the independent variables) with $X$ called the instance space of all possible companies, $y_i \in \{-1, +1\}$ the classification (+1 means a positive example and -1 a negative example of the concept) for $i = 1, \ldots, \ell$, and the sample $S$ is $((\mathbf{u}^1, y_1), \ldots, (\mathbf{u}^\ell, y_\ell)) \subseteq (X \times Y)^\ell$ 20. Unless otherwise stated, a vector is denoted by a

bold, lowercase letter. The superscript on the vector is reserved for the observation

number. An unbolded, subscripted, lowercase letter refers to the components of the

vector. The subscript represents the index of the component. In Chapter 5 we add a

second subscript to denote the year (or period). Typical approaches, such as neural

networks, logit, etc. start with a training set and try to fit the data as well as possible using

the concept structure chosen (i.e., a neural network, a logit function, etc. respectively).

This invariably leads to over-fitting. To ameliorate this, the training set is often broken

into two (or more) sets where part of the cases are used to fit a function and part to test

its ability to predict on a data set not used for fitting. These approaches do help with

over-fitting but are largely ad hoc.

4.1 Statistical Learning Theory

Statistical learning theory 106 formally develops the goal of learning a function

from examples as that of minimizing a risk functional

$$R(\alpha) = \int L(z, g(z, \alpha)) \, dF(z)$$

over $\alpha \in \Lambda$ where $L(\cdot)$ is a loss function, and $g(z, \alpha)$ is a set of target functions

parametrically defined by $\alpha \in \Lambda$ (the family of functions we are investigating). In this

approach it is assumed that observations, $z$, are drawn randomly and independently

according to an unknown probability distribution $F(z)$. Since $F(z)$ is unknown, an

induction principle must be invoked. One common induction principle is to minimize the

number of misclassifications. Minimizing the number of misclassifications is directly

equivalent to minimizing the empirical risk with the loss function as a simple indicator

function. Other loss functions give different risk functions. For example, the classical

method for linear discriminant functions, developed by Fisher 34, is equivalent to

minimizing the probability of misclassification.

As is well known, empirical risk minimization often results in over-fitting. That is,

for small sample sizes, a small empirical risk does not guarantee a small overall risk.

This has been observed in many studies. For example, Eisenbeis 28 critiques studies

based on such over-fitting.

Statistical learning theory approaches this problem by using a structural risk

minimization principle 106. For an indicator loss function, it has been shown 106 that for

any $\alpha \in \Lambda$ with a probability at least $1 - \eta$ the bound

$$R(\alpha) \le R_{emp}(\alpha) + \frac{R_{struct}(h, \ell, \eta)}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}(\alpha)}{R_{struct}(h, \ell, \eta)}} \right) \equiv R_{bound}(\alpha)$$

holds where the structural risk $R_{struct}$ depends on the sample size, $\ell$, the confidence level, $\eta$, and the capacity, $h$, of the target function. The $R_{struct}$ expression is as follows 20:

$$R_{struct}(h, \ell, \eta) = \frac{h \left( 4 \ln(2\ell / h) + 4 \right) - \ln(\eta / 4)}{\ell}$$

The capacity, h, measures the expressiveness of the target class of functions. In

particular, for binary classification, h is the maximal number of points (k) that can be

separated into two classes in all possible $2^k$ ways using functions in the target class of

functions. This measure is called the VC-dimension. For linear discriminant functions,

without additional assumptions, the VC-dimension is $h = n + 1$ 107, 20. The empirical

risk is measured by a loss function on the set of $\ell$ examples as follows 91:

$$R_{emp}(\alpha) = \frac{1}{\ell} \sum_{i=1}^{\ell} L(x_i, g(x_i, \alpha)).$$

Since we cannot directly minimize $R(\alpha)$ the structural risk minimization principle

instead tries to minimize $R_{bound}(\alpha)$. It is almost always the case that the smaller the VC-dimension, the lower this bound.
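To make the behavior of the bound concrete, the following minimal sketch evaluates the structural risk term and the resulting bound for two capacities. The function names and the numerical settings (1,000 observations, 5% empirical risk, $\eta = 0.05$) are illustrative assumptions and do not come from the studies reviewed here.

```python
import math

def r_struct(h: int, l: int, eta: float) -> float:
    """Structural risk term: (h * (4 ln(2l/h) + 4) - ln(eta/4)) / l."""
    return (h * (4.0 * math.log(2.0 * l / h) + 4.0) - math.log(eta / 4.0)) / l

def r_bound(r_emp: float, h: int, l: int, eta: float) -> float:
    """Bound on the true risk that holds with probability at least 1 - eta."""
    s = r_struct(h, l, eta)
    return r_emp + (s / 2.0) * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / s))

# Hypothetical setting: same sample and empirical risk, two different capacities.
low_capacity = r_bound(r_emp=0.05, h=5, l=1000, eta=0.05)
high_capacity = r_bound(r_emp=0.05, h=50, l=1000, eta=0.05)
print(low_capacity, high_capacity)
```

With the empirical risk held fixed, the function class with the smaller VC-dimension receives the smaller bound, which is the pattern structural risk minimization exploits.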

4.2 Support Vector Machines

Support Vector Machines (SVM) are growing in popularity rapidly in part because

both theoreticians and applied scientists find them useful. SVMs incorporate ideas from

many fields of study including applied mathematics, operations research, machine

learning, and more. Based on Statistical Learning Theory, early research suggests that

SVMs have had good success with supervised learning. They have compared well with

other learning algorithms such as Neural Networks, k-Means, and Decision Trees 20.

Joachims 46 used SVMs to categorize news stories. Pontil and Verri 78 used SVMs for

object recognition (independent of aspect). Cortes and Vapnik 23 tested SVMs on handwritten

zip code identification, getting accuracy just shy of human error. Brown et al. 12

applied SVMs to the problem of classifying unseen genes with success.

Support vector machines determine a hyperplane in the feature space that best

separates positives from negative examples. Features are mappings of original attributes

(we discuss this shortly). The margin of an example $(\mathbf{u}^i, y_i)$ with respect to a hyperplane $(\mathbf{w}, b)$ is $\Delta_i = y_i(\langle \mathbf{w}, \mathbf{u}^i \rangle + b)$ where $\mathbf{w}$ is a weight vector and $b$ is a bias term. The margin about the hyperplane $\Delta$ is the minimum of the margin distribution with respect to a training sample $S$. The VC-dimension is bounded by

$$h \le 1 + \min\left( n, \left\lceil \frac{R^2}{\Delta^2} \right\rceil \right)$$

where $R$ is the radius of a ball large enough to contain the input attribute space. If a

margin is large enough, the VC-dimension may be much smaller than n + 1. SVMs learn

by maximizing the margin which, in turn, minimizes the VC-dimension and, usually, the

bound of the risk functional.
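The effect of the margin on this bound can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the toy data, the choice of a ball centered at the origin, and the helper names `functional_margins` and `vc_bound` are all hypothetical.

```python
import math
import numpy as np

def functional_margins(w, b, U, y):
    """Margin of each example: y_i * (<w, u^i> + b)."""
    return y * (U @ w + b)

def vc_bound(U, margin):
    """Bound 1 + min(n, ceil(R^2 / margin^2)) on the VC-dimension."""
    n = U.shape[1]
    # Assumption: the enclosing ball is centered at the origin.
    R = np.max(np.linalg.norm(U, axis=1))
    return 1 + min(n, math.ceil(R ** 2 / margin ** 2))

# Hypothetical data: four points in ten dimensions, separable along the first axis.
U = np.zeros((4, 10))
U[:, 0] = [2.0, 3.0, -2.0, -3.0]
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.zeros(10)
w[0] = 1.0          # unit weight vector, so functional and geometric margins agree
b = 0.0

delta = functional_margins(w, b, U, y).min()
print(delta, vc_bound(U, delta), U.shape[1] + 1)
```

Here the margin is 2 and the radius is 3, so the margin-based bound is 4, well below the value $n + 1 = 11$ that applies to linear discriminants without a margin assumption.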

This distinguishes them from other popular methods such as neural networks which

use heuristic methods to help find parameters that best generalize. In addition, and unlike

most methods, SVM learning is theoretically guaranteed to find the best such linear

concept, if the data are separable. Neural networks, decision trees, etc. do not carry this

guarantee leading to a plethora of heuristic approaches to find acceptable results. For

example, most decision tree induction methods use pruning algorithms that try to create the

smallest tree that produces an acceptable training error in the hopes that smaller trees

generalize better (this is the so-called Occam’s razor or minimum description length

principle) 80. Unfortunately, there is no guarantee that the tree produced minimizes

generalization error. SVMs also scale-up to very large data sets and have been applied to

problems involving text data, pictures, etc.

The SVM is formulated as a quadratic optimization problem with linear inequality

constraints. Below is the primal formulation assuming the data is separable.

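$$\min \; \langle \mathbf{w}, \mathbf{w} \rangle$$

$$\text{s.t.} \quad y_i \left( \langle \mathbf{w}, \mathbf{u}^i \rangle + b \right) \ge 1, \quad i = 1, \ldots, \ell$$

$\langle \mathbf{w}, \mathbf{w} \rangle$ is minimized in the objective function in order to maximize $\Delta$, thus potentially minimizing the bound on the VC-dimension which was expressed above. This can be explained as follows. We replace the functional margin with the geometric margin. The geometric margin will equal the functional margin if the weight vector is a unit vector. Thus we normalize the linear function $y_i \left[ \frac{1}{\|\mathbf{w}\|} \left( \langle \mathbf{w}, \mathbf{u}^i \rangle + b \right) \right] \ge \frac{1}{\|\mathbf{w}\|}$ and $\Delta = \frac{1}{\|\mathbf{w}\|}$ because the inequality will be tight at a support vector. In order to maximize $\Delta$ we merely minimize $\|\mathbf{w}\|$.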

This problem has a dual formulation. The dual solution is useful as w is no longer

explicitly computed and the explicit usage of the data points is collapsed into a matrix of

inner products, allowing for higher, possibly infinite dimensional feature spaces. These

feature spaces are implicitly calculated by a kernel which we explain in great detail

below. The dual formulation is:

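$$\max \; W(\lambda) = \sum_{i=1}^{\ell} \lambda_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \lambda_i \lambda_j \langle \mathbf{u}^i, \mathbf{u}^j \rangle$$

$$\text{s.t.} \quad \sum_{i=1}^{\ell} y_i \lambda_i = 0, \qquad \lambda_i \ge 0, \quad i = 1, \ldots, \ell$$

where $\lambda$ are the dual variables. $\mathbf{w}$ is no longer in the formulation and all data appears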

inside the dot product, which is key to using kernels in the SVM.
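A two-point example shows the dual at work. The sketch below is a hand-sized illustration, not a general solver: the data are hypothetical, the grid search replaces a proper quadratic programming routine, and the recovery $\mathbf{w} = \sum_i \lambda_i y_i \mathbf{u}^i$ is the standard relation between primal and dual solutions rather than something derived in this chapter.

```python
import numpy as np

def dual_objective(lam, U, y):
    """W(lambda) = sum_i lambda_i - 1/2 sum_ij y_i y_j lambda_i lambda_j <u^i, u^j>."""
    G = U @ U.T                      # the data enter only through inner products
    return lam.sum() - 0.5 * (lam * y) @ G @ (lam * y)

# Hypothetical data: one positive and one negative example on the first axis.
U = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

# The constraint sum_i y_i lambda_i = 0 forces lambda_1 = lambda_2 = t,
# so W reduces to 2t - 2t^2; scan t >= 0 on a grid.
grid = np.linspace(0.0, 1.0, 101)
values = [dual_objective(np.array([t, t]), U, y) for t in grid]
t_best = grid[int(np.argmax(values))]

lam = np.array([t_best, t_best])
w = (lam * y) @ U                    # recover the weight vector from the dual variables
print(t_best, w, 1.0 / np.linalg.norm(w))
```

The maximizer is $\lambda_1 = \lambda_2 = 1/2$, which gives $\mathbf{w} = (1, 0)'$ and a margin of 1, matching the geometry of the two points.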

A kernel is an implicit mapping φ of an input attribute space X onto a potentially

higher dimensional feature space F . The kernel improves the computational power of

the learning machine by implicitly allowing combinations and functions of the original

input variables. For example, if only price and earnings are inputs, a PE ratio would not

be explicitly considered by a linear learning mechanism. A kernel, properly chosen,

would allow many different relationships between variables to be simultaneously

examined, presumably including price divided by earnings. The PE measure is termed a

“feature” of the input variables. There are many powerful, generic kernels 20, 38 but

kernels can also be made to suit a specific application area as we do later in this study.

Some application areas are sensitive to periodic changes, making correct pattern

recognition more likely with the usage of time series analysis. Ruping 87 shows how to

extend a number of kernels to handle time series data. Jin, Lu, and Shi 45 show that the

right subset of attributes for a particular domain is important to time series classification

for knowledge discovery applications. Their methodology trimmed the attributes to

include only data pertinent to the domain’s time series. Preliminary research suggests

that kernels which are constructed with the help of application specific information tend

to have better results 20.

4.3 Kernel Methods

A kernel is a central component of the SVM. Shawe-Taylor and Cristianini call it

the information bottleneck of the SVM 95. This is because all data input into an SVM goes

through the kernel function and ends up in the kernel matrix. The kernel matrix is a

matrix with entries $K_{ij} = \langle \phi(\mathbf{u}^i), \phi(\mathbf{u}^j) \rangle$, where $\phi$ is a mapping $\phi : X \to \Re^m$, and

$\mathbf{u}^i, \mathbf{u}^j \in X$. Often the dimension of the feature space is much larger than the attribute

space, and may even be infinite (ref. the Gaussian kernel in Section 4.3.1). Key to the

value of kernel methods is the ability to implicitly capture this feature space via a

mapping φ . The dual formulation expressed in Section 4.2 can be generalized to allow

the usage of kernels as follows:

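$$\max \; W(\lambda) = \sum_{i=1}^{\ell} \lambda_i - \frac{1}{2} \sum_{i,j=1}^{\ell} y_i y_j \lambda_i \lambda_j K(\mathbf{u}^i, \mathbf{u}^j)$$

The kernel function is an inner product between feature vectors and is denoted as

$K(\mathbf{u}, \mathbf{v}) = \langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle$ where $\mathbf{u}, \mathbf{v} \in X$. The feature vectors may not have to be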

explicitly calculated if the kernel function can create a mapping implicitly. In Section

4.3.1 we show how a kernel can increase the dimension of the attribute space, thus

allowing for more unique features, without significantly increasing computational cost.

An alternative to using a kernel is to explicitly create all features deemed necessary for

classification as direct input to the SVM as attributes. However, this is both time

consuming and computationally costly. Creating a kernel unleashes the potentially

nonlinear power of the learning machine, allowing it to find patterns on the attributes that

were previously unknown. In Section 4.3.1 we explain the properties of general kernels.

In Section 4.3.2 we extend our explanation of kernels by considering domain specific

kernels. These kernels are designed with the structure of a particular domain in mind.

4.3.1 General Kernel Methods

As explained above, a kernel is evaluated within an inner product between

mappings of examples $\mathbf{u}^i$, where examples are vectors of attributes from the instance

space $X$. There are many known kernels and the list is growing 21, 46, 87. Two specific

kernels can be used to illustrate the nature and expressive power of these functions. The

polynomial kernel is:

$$\hat{K}(\mathbf{u}, \mathbf{v}) = (K(\mathbf{u}, \mathbf{v}) + R)^d$$

where $K(\mathbf{u}, \mathbf{v})$ is the normal inner product $\langle \mathbf{u}, \mathbf{v} \rangle$, $d$ is a positive integer and $R$ is fixed. Consider a set of examples $S = ((\mathbf{u}^1, y_1), \ldots, (\mathbf{u}^\ell, y_\ell)) \subseteq (X \times Y)^\ell$ each with four attributes, $\mathbf{u}^i = (u_1^i, u_2^i, u_3^i, u_4^i)'$ and $\mathbf{v}^i = (v_1^i, v_2^i, v_3^i, v_4^i)'$. With $d = 1$ and $R = 0$,

$$K(\mathbf{u}, \mathbf{v}) = u_1 v_1 + u_2 v_2 + u_3 v_3 + u_4 v_4,$$

and with $R = 0$ and $d = 2$,

$$\hat{K}(\mathbf{u}, \mathbf{v}) = (u_1 v_1 + u_2 v_2 + u_3 v_3 + u_4 v_4)^2.$$

While $K(\mathbf{u}, \mathbf{v})$ has four features, namely $(u_1, u_2, u_3, u_4)'$, $\hat{K}(\mathbf{u}, \mathbf{v})$ has 10 features, namely all monomials of degree 2, or $(u_1^2, u_2^2, u_3^2, u_4^2, \sqrt{2} u_1 u_2, \sqrt{2} u_1 u_3, \sqrt{2} u_1 u_4, \sqrt{2} u_2 u_3, \sqrt{2} u_2 u_4, \sqrt{2} u_3 u_4)'$.

For a polynomial of arbitrary degree $d$ with $n$ attributes, the number of features is $\binom{n + d - 1}{d}$. The computational complexity of creating these features explicitly becomes unreasonable as $n$ and $d$ grow.

Due to the implicit mapping in the polynomial kernel (between examples via the inner

product), the monomials of degree d can be features of an SVM without their explicit

creation.
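The equivalence between the implicit and explicit computations can be checked directly for the four-attribute case with $d = 2$ and $R = 0$. In the sketch below the attribute values are hypothetical and the function names are ours; the explicit map lists the squares followed by the $\sqrt{2}$-scaled cross terms.

```python
import itertools
import math
import numpy as np

def poly_kernel(u, v, d=2, R=0.0):
    """Polynomial kernel (<u, v> + R)^d, computed in the attribute space."""
    return (np.dot(u, v) + R) ** d

def degree2_features(u):
    """Explicit degree-2 monomials: squares, then sqrt(2)-scaled cross terms."""
    squares = [ui * ui for ui in u]
    cross = [math.sqrt(2.0) * u[i] * u[j]
             for i, j in itertools.combinations(range(len(u)), 2)]
    return np.array(squares + cross)

u = np.array([1.0, 2.0, 3.0, 4.0])    # hypothetical attribute values
v = np.array([0.5, -1.0, 2.0, 1.0])

implicit = poly_kernel(u, v)                                  # one inner product in 4 dimensions
explicit = np.dot(degree2_features(u), degree2_features(v))   # inner product in 10 dimensions
n_features = len(degree2_features(u))
print(n_features, math.comb(4 + 2 - 1, 2), implicit, explicit)
```

Both routes give the same value, but the kernel route never builds the ten-dimensional vectors, which is what keeps the cost manageable as $n$ and $d$ grow.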

An even more powerful kernel is the Gaussian, which is defined as:

$$K(\mathbf{u}, \mathbf{v}) = \exp\left( -\|\mathbf{u} - \mathbf{v}\|^2 / (2\sigma^2) \right),$$

where $\|\cdot\|$ is the 2-norm 91 and $\sigma$ is a positive parameter.

An exponential function can be approximated by polynomials with positive

coefficients, making the Gaussian kernel a limit of the sum of polynomial kernels 95.

The features of the Gaussian can be best illustrated by considering the Taylor expansion of the exponential function $\exp(x) = \sum_{i=0}^{\infty} \frac{1}{i!} x^i$ 95. The features are all possible monomials with no restriction on the degree. This feature space has infinitely many dimensions.

Having seen that kernels are a powerful tool, we now look at their properties. To be useful in SVM work, a kernel function must have the following minimum characteristics (Cristianini and Shawe-Taylor 20):

(1) the function must be symmetric (i.e., $K(\mathbf{u}, \mathbf{v}) = K(\mathbf{v}, \mathbf{u})$),

(2) the function must be positive semidefinite, and

(3) the function must obey the Cauchy-Schwarz inequality.

#1 is easy to check. #2 is a little more complicated and it is usually determined by studying a related square, symmetric matrix, $A$, and its eigendecomposition. Let $K(\mathbf{u}, \mathbf{v})$ be a symmetric function on $X$. $K(\mathbf{u}, \mathbf{v})$ is a kernel function if and only if the matrix $A = (K(\mathbf{u}^i, \mathbf{u}^j))_{i,j=1}^{\ell}$ is positive semi-definite (has non-negative eigenvalues) 20. For #3, the Cauchy-Schwarz inequality as applied to kernels is defined by Cristianini and Shawe-Taylor 20 as:

$$K(\mathbf{u}, \mathbf{v})^2 = \langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle^2 \le \|\phi(\mathbf{u})\|^2 \|\phi(\mathbf{v})\|^2 = \langle \phi(\mathbf{u}), \phi(\mathbf{u}) \rangle \langle \phi(\mathbf{v}), \phi(\mathbf{v}) \rangle = K(\mathbf{u}, \mathbf{u}) K(\mathbf{v}, \mathbf{v})$$
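The three characteristics can be checked numerically on a sample. The sketch below does so for the Gaussian kernel on randomly generated points; the data, the random seed, and the tolerances are assumptions made for illustration, and a check on one sample is evidence rather than a proof that a function is a kernel.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel exp(-||u - v||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(kernel, U):
    """Matrix A with entries A[i, j] = K(u^i, u^j)."""
    l = len(U)
    return np.array([[kernel(U[i], U[j]) for j in range(l)] for i in range(l)])

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 3))           # six hypothetical examples with three attributes
A = gram_matrix(gaussian_kernel, U)

symmetric = np.allclose(A, A.T)                        # characteristic (1)
min_eig = np.linalg.eigvalsh(A).min()                  # characteristic (2), up to round-off
diag = np.diag(A)
cauchy_schwarz = np.all(A ** 2 <= np.outer(diag, diag) + 1e-12)   # characteristic (3)
print(symmetric, min_eig >= -1e-10, cauchy_schwarz)
```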

A kernel function often alters the dimensionality of the data, mapping it into feature

space. The inner product between all feature vectors is carried out using a kernel matrix.

A matrix formed by such inner products is called a Gram matrix. The Gram matrix has

some useful properties, for example it is positive semidefinite. Since all of the entries in

the Gram matrix are in the form of an inner product, we must be concerned with their

proper existence. An inner product space is a vector space endowed with an inner product. The inner product induces the norm, and hence the metric, used to determine the distance between two points. The inner product space is enough structure to properly define each element of the Gram matrix when considering the finite dimensional case. However, if we want to take advantage of an infinite dimensional feature space (as in the Gaussian case) we need the inner product to define a complete metric (defined below). If the inner product defines a complete metric, then the space is a Hilbert space 59. A complete metric is one in which every Cauchy sequence is convergent. Consider all countable sequences of real numbers. The Hilbert space is a subset of all countable sequences $\mathbf{x} = \{x_1, x_2, \ldots, x_i, \ldots\}$ such that $\|\mathbf{x}\|_2^2 = \sum_{i=1}^{\infty} x_i^2 < \infty$. The inner product of sequences can be defined as $\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{\infty} x_i y_i$. This infinite space is also called $L_2$ 59. An important characteristic of a Hilbert space is that it is isomorphic to $\Re^n$ in the finite case and $L_2$ in the infinite case.

A compelling property of kernel methods is the ability to form new kernels from

existing kernels. For example, one could take a polynomial and a Gaussian kernel and

add them up to get the features from each. Kernels are also multiplicative. Cristianini

and Shawe-Taylor 20 show that the following functions of kernels are in fact kernels:

$$K(\mathbf{u}, \mathbf{v}) = K_1(\mathbf{u}, \mathbf{v}) + K_2(\mathbf{u}, \mathbf{v})$$

$$K(\mathbf{u}, \mathbf{v}) = \alpha K_1(\mathbf{u}, \mathbf{v})$$

$$K(\mathbf{u}, \mathbf{v}) = K_1(\mathbf{u}, \mathbf{v}) K_2(\mathbf{u}, \mathbf{v})$$

$$K(\mathbf{u}, \mathbf{v}) = K_3(\phi(\mathbf{u}), \phi(\mathbf{v}))$$

where $K_1$, $K_2$, and $K_3$ are kernels, $\phi : X \to \Re^m$ and $\alpha > 0$.
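A small numerical check of the first three closure rules is sketched below. The choice of base kernels, the parameter values, and the random sample are illustrative assumptions; the check confirms positive semidefiniteness of the combined Gram matrices on one sample only.

```python
import numpy as np

def gram(kernel, U):
    """Gram matrix of a kernel on the rows of U."""
    return np.array([[kernel(a, b) for b in U] for a in U])

def poly(u, v, d=2, R=1.0):
    return (np.dot(u, v) + R) ** d

def gauss(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2))

def k_sum(u, v):                 # K1 + K2
    return poly(u, v) + gauss(u, v)

def k_scaled(u, v, alpha=0.5):   # alpha * K1 with alpha > 0
    return alpha * poly(u, v)

def k_prod(u, v):                # K1 * K2
    return poly(u, v) * gauss(u, v)

rng = np.random.default_rng(1)
U = rng.normal(size=(8, 4))      # eight hypothetical examples with four attributes
min_eigs = {name: np.linalg.eigvalsh(gram(k, U)).min()
            for name, k in [("sum", k_sum), ("scaled", k_scaled), ("product", k_prod)]}
print(min_eigs)
```

In each case the smallest eigenvalue is non-negative up to round-off, as the closure properties require.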

Until this point, we have looked at two kernels, the Polynomial and the Gaussian.

These kernels are very powerful, but offer little opportunity for crafting kernels that are

specific to a domain. The graph kernel is a general kernel which can be made domain

specific, as long as certain rules are followed. A powerful characteristic of the graph

kernel is its intuitive nature. The graphic representation allows us to better understand

how a kernel works. Before formulating this kernel, a simple example is useful.

Consider a graph $G(A, E)$ with nodes $a$ and edges $e$. See Figure 2 below:

Figure 2 – Basic Graph Kernel (graph with nodes $a_1, \ldots, a_5$, edges $e_1, \ldots, e_5$, source $s$ and sink $t$)

All $e_i$ in this graph are base kernels (for example, a polynomial kernel on a

component of the attribute space). To differentiate base kernels from general kernels, the

base kernels are denoted as $K(u_i, v_i)$. Any path from $s$ to $t$ is a feature. This feature is

arrived at via the product of all edges in the path between s and t . In general, all paths

from s to t create features. This allows the researcher to create his own kernel, by

choosing the structure of the graph.

Here is a more formal explanation of the graph kernel. It is based on a directed

graph $G$ with a source vertex $s$ of in-degree 0 and a sink vertex $t$ of out-degree 0. A

directed graph is one where the flow on each edge is in a single direction. Each edge is

labeled with a base kernel. It is assumed that this is a simple graph, meaning that there

are no directed loops. In general, loops are allowed but that makes proving that the

resulting mapping is, indeed, a kernel extremely complicated. Takimoto and Warmuth

100 proved that a directed, acyclic graph with base kernels on the edges indeed defines a

kernel.

Shawe-Taylor and Cristianini 95 describe the kernel as follows: Let $P_{st}$ be the set of directed paths from $s$ to $t$. For a path $p = (a_0 a_1 \ldots a_d)$, the product of the kernels associated with the edges of $p$ can be seen as follows:

$$K_p(\mathbf{u}, \mathbf{v}) = \prod_{i=1}^{d} K_{(a_{i-1} \to a_i)}(\mathbf{u}, \mathbf{v}).$$

The graph kernel is the aggregation of all $K_p(\mathbf{u}, \mathbf{v})$ and can be seen as follows:

$$K_G(\mathbf{u}, \mathbf{v}) = \sum_{p \in P_{st}} K_p(\mathbf{u}, \mathbf{v}) = \sum_{p \in P_{st}} \prod_{i=1}^{d} K_{(a_{i-1} \to a_i)}(\mathbf{u}, \mathbf{v})$$

Here is another example for clarification. Look at Figure 3 below. It is a slightly more complex version of the one above. The nodes are labeled for explanatory purposes and the edges are labeled with the base kernel $K(u_i, v_i) = \langle u_i, v_i \rangle$. If $s = 1$ and $t = 2$, then there would be a single feature, $u_1$. If $s = 1$ and $t = 3$, there would also be a single feature $u_1 u_2$, but the feature would be the product of the two base kernels on the path $p = (a_1 a_2 a_3)$. Three paths converge at node 5, specifically $p_1 = (a_1 a_2 a_3 a_5)$ and $p_2 = (a_1 a_2 a_5)$ and $p_3 = (a_1 a_2 a_4 a_5)$. Node 5 can be seen as a kernel which sums the products of the base kernels on each path. If Node 5 were $t$, the output would be the sum of all paths into node 5, $p_1 + p_2 + p_3$ or $u_1 u_2 u_5 + u_1 u_4 + u_1 u_3 u_6$. In general, at each node $a$ (except $s$), all paths from $s$ to $a$ are summed. The contribution of a path to the kernel is based on the product of its edges.
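The path-sum computation in this example can be sketched with a single pass over the nodes. The edge list below is inferred from the three paths and their features as described in the example, so it is an assumption about the figure rather than a transcription of it, and the attribute values are hypothetical.

```python
import numpy as np

# Edges (tail, head) -> index i of the base kernel u_i * v_i (1-based).
# Assumption: this layout is inferred from the paths p1, p2, p3 in the example.
EDGES = {(1, 2): 1, (2, 3): 2, (2, 4): 3, (2, 5): 4, (3, 5): 5, (4, 5): 6}

def graph_kernel(u, v, s, t, edges=EDGES):
    """Sum over all directed s-t paths of the product of base kernels on the path."""
    total = {s: 1.0}                       # the empty path at the source has product 1
    # Node labels increase along every edge, so sorted order is a topological order.
    for node in sorted({head for _, head in edges} | {s}):
        if node == s:
            continue
        total[node] = sum(total.get(tail, 0.0) * u[i - 1] * v[i - 1]
                          for (tail, head), i in edges.items() if head == node)
    return total.get(t, 0.0)

def explicit_features(x):
    """Features for s = 1, t = 5: one product of attributes per path."""
    return np.array([x[0] * x[1] * x[4], x[0] * x[3], x[0] * x[2] * x[5]])

u = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical attribute values
v = np.array([2.0, 1.0, 0.5, 1.0, 2.0, 1.0])

k_implicit = graph_kernel(u, v, s=1, t=5)
k_explicit = explicit_features(u) @ explicit_features(v)
print(k_implicit, k_explicit)
```

Accumulating the sums node by node avoids enumerating paths, and the result agrees with the inner product of the explicit path features $u_1 u_2 u_5$, $u_1 u_4$ and $u_1 u_3 u_6$.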

Figure 3 – Graph Kernel (directed graph with nodes $a_1, \ldots, a_6$, source $s$ and sink $t$, and edges labeled with the base kernels $u_1 v_1, \ldots, u_7 v_7$)

4.3.2 Domain Specific Kernels

A kernel should have two properties for a particular application. First, it should

capture a similarity measure appropriate to the domain. The features that offer the most

information content for a particular domain need to be represented by the kernel. Second,

its evaluation should require significantly less computation than would be needed by

using the explicit feature mapping 95. The first point is key to the contribution of this

dissertation. General kernels are building blocks, but the goal of a kernel method is to

determine patterns correctly, and tuning a kernel to a specific domain is the best way to do this.

Much empirical research has been done where a dataset is tested using several kernels

and results are given as to which kernel performs better. This is ad hoc. It seems likely

that a kernel which is tuned to a domain will better capture the features necessary to

correctly classify instances in that domain.

Ultimately we will combine kernels that deal with quantitative financial

information and textual information. Below is a brief summary of some text-based

kernels.

Joachims 46 uses the polynomial and Gaussian kernels for text categorization. He

compares several parameters for each kernel ($d$ for polynomial and $\sigma$ for Gaussian).

He shows that the parameter that elicited the lowest estimated VC dimension was the

one with the best performance on the empirical tests. Thus, he has tailored general

kernels to the text domain. This is an early and simplistic example of domain specificity.

Cristianini et al. 21 develop a Latent Semantic Kernel designed to sort documents

into categories by keywords, which are automatically derived from the text corpus. The

kernel implicitly maps keywords into a “semantic” space, which allows documents which

share no keywords to be related. This is accomplished by analyzing co-occurrence

patterns. A co-occurrence pattern is where two terms which are often found in the same

document are considered related. The co-occurrence information is extracted using a

singular value decomposition of the term by document matrix. This paper illustrates the

usage of domain knowledge in the development of a kernel.
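The idea can be sketched on a toy corpus. The code below uses one common formulation of a latent semantic kernel, in which documents are projected onto the leading left singular vectors of the term by document matrix before the inner product is taken; the corpus, the number of retained dimensions $k$, and the function names are assumptions for illustration and should not be read as the exact procedure of Cristianini et al.

```python
import numpy as np

# Hypothetical term-by-document matrix (rows: terms, columns: documents d0..d5).
terms = ["loan", "default", "profit", "growth"]
D = np.array([
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0],   # loan
    [0.0, 1.0, 1.0, 1.0, 0.0, 0.0],   # default
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],   # profit
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],   # growth
])

# Singular value decomposition of the term-by-document matrix.
U_svd, S, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
P = U_svd[:, :k]                      # top-k left singular vectors span the "semantic" space

def latent_semantic_kernel(d1, d2):
    """Inner product of two documents after projection onto the semantic space."""
    return (P.T @ d1) @ (P.T @ d2)

raw = D[:, 0] @ D[:, 3]               # d0 = {loan} and d3 = {default} share no keywords
latent = latent_semantic_kernel(D[:, 0], D[:, 3])
print(raw, latent)
```

Documents d0 and d3 have a raw inner product of zero, yet they become related in the semantic space because "loan" and "default" co-occur in d1 and d2.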

Another kernel adaptable to text problems is the string subsequence kernel 20. A

string is a finite sequence of characters from a set $T$. In the case of a subsequence kernel, $T$ is

the alphabet. The goal of this kernel is to define the similarity between two documents

by calculating the number of subsequences these documents have in common. The

subsequences do not have to be contiguous. However, there is a penalty incorporated

into the function based on the distance between words of a subsequence.
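A brute-force sketch makes the penalty concrete. It works at the character level, enumerates every subsequence of a fixed length $n$, and weights each occurrence by a decay factor raised to the span the occurrence covers; the decay symbol `lam`, the default values, and the function names are assumptions, and practical implementations use a dynamic programming recursion instead of enumeration.

```python
from collections import defaultdict
from itertools import combinations

def subsequence_features(s, n, lam):
    """Map each length-n subsequence of s to the sum of lam**span over its occurrences."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        span = idx[-1] - idx[0] + 1          # length covered in s, gaps included
        phi["".join(s[i] for i in idx)] += lam ** span
    return phi

def subsequence_kernel(s, t, n=2, lam=0.5):
    """Inner product of the subsequence feature vectors of s and t."""
    phi_s = subsequence_features(s, n, lam)
    phi_t = subsequence_features(t, n, lam)
    return sum(weight * phi_t[sub] for sub, weight in phi_s.items() if sub in phi_t)

k_cat_car = subsequence_kernel("cat", "car")   # only "ca" is shared
k_cat_cat = subsequence_kernel("cat", "cat")   # "ca", "ct" and "at" are shared
print(k_cat_car, k_cat_cat)
```

With a decay of 0.5, "cat" and "car" share only the contiguous subsequence "ca", giving $0.5^4$, while the non-contiguous "ct" in "cat" contributes the smaller weight $0.5^3$ per occurrence because of its gap.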

Early researchers in kernel methods have given us several general forms with

which to work. Recent applications of kernel methods to domains include protein

folding, handwriting recognition, face recognition, image retrieval, and text retrieval.

Finding the right kernel for a particular problem has proven to be an ad hoc, yet

extremely important step. The real power of kernels is harnessing those general forms to

create kernels that are specific to these domains. This work has just begun. A domain

may be defined by more than one type of data, thus complicating matters. In the case of

the accounting domain, both quantitative and text attributes contain information on a

firm. In order to utilize the text data, we must first understand how to narrow down our

potential attributes, by looking at text specific to the domain of accounting.

The next Chapter utilizes the methodologies reviewed in this Chapter to create a

Financial Kernel. A review of Chapter 2 as well as this Chapter should give the reader

an understanding of the reasons for the particular design of the Financial Kernel in

Chapter 5.

CHAPTER 5 THE FINANCIAL KERNEL

Defining a domain specific kernel for finance entails looking to the finance and

accounting literature to see what attributes and features are often utilized for

classification. It also requires us to consider the kernels available and which ones would

fit our work the best. As this work focuses specifically on financial events, the main

publications reviewed were in the realm of management fraud and bankruptcy, as seen in

Chapter 2. Most financial analyses look to ratios of items on the financial

statements. Models for earnings quality in accounting utilize ratios, such as the study by

Francis, LaFond, Olsson and Schipper 35. Loebbecke, Eining and Willingham 55 use

financial ratios as part of their management fraud model as well. All of the studies

detailed in Section 2.2 on Bankruptcy Detection use financial ratios.

McNichols and Wilson 56 used year-over-year changes in key account values to

help determine earnings management. Francis, LaFond, Olsson and Schipper 35 utilized

year-over-year changes extensively in their study on earnings quality. Beneish 10

utilized year-over-year changes to help determine management fraud. The majority of

the bankruptcy prediction methods which were reviewed in Section 2.2 show the

accuracy of their methodologies for the year of bankruptcy, the year prior to bankruptcy,

and sometimes further back. As the number of years before bankruptcy increases, the predictive accuracy of the models decreases. Further from the event the picture is less clear, but a trend may already be emerging, and this trend can be captured by year-over-year changes in key ratios. As explained in Section 2.2, Altman 5 notes that a limitation of his discriminant analysis


function for bankruptcy detection was its lack of year-over-year changes. Year-over-year

changes in ratios are captured by this function:

$$\frac{\dfrac{u_{i1}}{u_{j1}} - \dfrac{u_{i2}}{u_{j2}}}{\dfrac{u_{i2}}{u_{j2}}},$$

where $i, j = 1 \ldots n$ are the attribute numbers and the second subscript is the year (or period).
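As a minimal sketch of the year-over-year change, the Python below computes the relative change of one ratio between the two periods. The dictionary layout keyed by (attribute number, year) and the function names are illustrative assumptions, not part of the original work.

```python
def ratio(u, i, j, year):
    """Ratio of attribute i to attribute j within a single year."""
    return u[(i, year)] / u[(j, year)]


def yoy_change(u, i, j):
    """Relative change of the ratio u_i/u_j between year 1 and year 2."""
    r1 = ratio(u, i, j, 1)
    r2 = ratio(u, i, j, 2)
    return (r1 - r2) / r2


# Hypothetical attribute values; keys are (attribute number, year).
u = {(1, 1): 4.0, (2, 1): 2.0, (1, 2): 3.0, (2, 2): 3.0}
change = yoy_change(u, 1, 2)
```

The same value can be obtained as the product of the year 1 ratio and the inverse of the year 2 ratio, minus one, which is the form the kernels in this Chapter aim to represent.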

We created two kernels to handle ratios and year-over-year changes. The first

kernel utilizes the polynomial kernel structure on a mapping of the data to produce

inverses. Recall, the general polynomial kernel is $\hat{K}(\mathbf{u}, \mathbf{v}) = (K(\mathbf{u}, \mathbf{v}) + R)^d$, where $R$ is a constant and $d$ is the degree of the polynomial. We apply the polynomial kernel to a mapping of the input attributes $\phi: \mathbf{u} \to \tilde{\mathbf{u}}$, where $\mathbf{u} = (u_1, u_2, \ldots, u_n)'$ and

$$\tilde{\mathbf{u}} = \left(u_1, u_2, \ldots, u_n, \frac{1}{u_1}, \frac{1}{u_2}, \ldots, \frac{1}{u_n}\right)'.$$

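Setting $R$ to zero and $d = 2$, with base kernel $K(\mathbf{u}, \mathbf{v}) = \langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle$, gives all possible ratios of individual attributes $\frac{u_i}{u_j}$. In addition, we get the following attributes: $u_i^2$ and $\frac{1}{u_i u_j}$. This can be seen in a simple example. Consider $\mathbf{u} = (u_1, u_2, u_3)'$ and $\mathbf{v} = (v_1, v_2, v_3)'$, for all $\mathbf{u}, \mathbf{v} \in X$. Then

$$\phi(\mathbf{u}) = \left(u_1, u_2, u_3, \frac{1}{u_1}, \frac{1}{u_2}, \frac{1}{u_3}\right)' \quad \text{and} \quad \phi(\mathbf{v}) = \left(v_1, v_2, v_3, \frac{1}{v_1}, \frac{1}{v_2}, \frac{1}{v_3}\right)'.$$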


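The function result is:

$$K(\mathbf{u}, \mathbf{v}) = \left(\langle \phi(\mathbf{u}), \phi(\mathbf{v}) \rangle\right)^2$$

$$= u_1^2 v_1^2 + u_2^2 v_2^2 + u_3^2 v_3^2 + 2u_1u_2v_1v_2 + 2u_1u_3v_1v_3 + 2u_2u_3v_2v_3$$

$$+ \frac{2u_1v_1}{u_2v_2} + \frac{2u_1v_1}{u_3v_3} + \frac{2u_2v_2}{u_1v_1} + \frac{2u_2v_2}{u_3v_3} + \frac{2u_3v_3}{u_1v_1} + \frac{2u_3v_3}{u_2v_2}$$

$$+ \frac{2}{u_1u_2v_1v_2} + \frac{2}{u_1u_3v_1v_3} + \frac{2}{u_2u_3v_2v_3} + \frac{1}{u_1^2v_1^2} + \frac{1}{u_2^2v_2^2} + \frac{1}{u_3^2v_3^2} + 6,$$

where the constant 6 arises from the three products $u_i v_i \cdot \frac{1}{u_i v_i}$, each of which appears twice. Apart from this constant, the expansion gives the following feature vector:

$$\left(u_1^2, u_2^2, u_3^2, \sqrt{2}u_1u_2, \sqrt{2}u_1u_3, \sqrt{2}u_2u_3, \sqrt{2}\frac{u_1}{u_2}, \sqrt{2}\frac{u_1}{u_3}, \sqrt{2}\frac{u_2}{u_1}, \sqrt{2}\frac{u_2}{u_3}, \sqrt{2}\frac{u_3}{u_1}, \sqrt{2}\frac{u_3}{u_2}, \frac{\sqrt{2}}{u_1u_2}, \frac{\sqrt{2}}{u_1u_3}, \frac{\sqrt{2}}{u_2u_3}, \frac{1}{u_1^2}, \frac{1}{u_2^2}, \frac{1}{u_3^2}\right)'$$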
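The expansion can be checked numerically. The sketch below, written under the assumption that all attributes are nonzero, evaluates the degree-2 kernel on the inverse mapping and compares it with an explicit inner product over all ordered pairwise products of mapped components. The function names are illustrative only.

```python
import math
from itertools import product


def phi(u):
    """Map (u_1, ..., u_n) to (u_1, ..., u_n, 1/u_1, ..., 1/u_n)."""
    return list(u) + [1.0 / x for x in u]


def inverse_poly_kernel(u, v, R=0.0, d=2):
    """Polynomial kernel (<phi(u), phi(v)> + R)^d on the mapped inputs."""
    base = sum(a * b for a, b in zip(phi(u), phi(v)))
    return (base + R) ** d


def explicit_features(u):
    """All ordered degree-2 products of mapped components: (2n)^2 features."""
    mapped = phi(u)
    return [a * b for a, b in product(mapped, repeat=2)]


u = (1.0, 2.0, 4.0)
v = (2.0, 1.0, 0.5)
k_implicit = inverse_poly_kernel(u, v)
k_explicit = sum(a * b for a, b in zip(explicit_features(u), explicit_features(v)))
```

The explicit map here lists each cross product twice (once per ordering) instead of collapsing the two copies into a single feature scaled by the square root of two; both conventions yield the same kernel value.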

We validated this kernel on simulated data. We used the Altman Z-Score with

weights for the manufacturing industry (see Chapter 2). We created attributes for each

variable in the Z-score. The attributes were TA, EBIT, RE, B.V.E., TL, WC, as defined

in Section 2.2. The attribute values were created using a normal distribution with means

and variances appropriate to the domain. When we created the variables we preserved

the structure of the balance sheet (i.e. TA = TL + B.V.E. + RE). Each example was input

into the Altman Z-Score function to obtain its score. The examples with scores were

sorted by score. The top 50% of scores were labeled with a +1 and the bottom 50% of

scores were labeled -1. We only input the attributes and labels into the SVM. We were

able to separate perfectly on the Altman Z-Score, but had problems rediscovering weights

from the actual function. We determined this is due to the fact that many extra features

are created by this kernel and are highly correlated with each other. This correlation is

due in part to the structure of the Altman Z-Score. Total Assets and Retained Earnings

are two of the six attributes used in creating the ratios of the Altman Z-score. Both of


these attributes are used in two different ratios. Our kernel creates $(2n)^d$ features, some of which did not add a significant amount of information to the learning algorithm (i.e., $u_i^2$ and $\frac{1}{u_i u_j}$).

To add a time series representation for this kernel, we would have to represent the following relationship:

$$\frac{\dfrac{u_{i1}}{u_{j1}} - \dfrac{u_{i2}}{u_{j2}}}{\dfrac{u_{i2}}{u_{j2}}} = \frac{u_{i1}u_{j2}}{u_{j1}u_{i2}} - 1.$$

The left-hand side of this function is the year-over-year change as explained above. The right-hand side is a representation that can be constructed by our kernel by dropping the constant. The attribute vector would double in size as the second year would be concatenated onto the end. In order to get year-over-year changes in the format $\frac{u_{i1}u_{j2}}{u_{j1}u_{i2}}$ we need $d \geq 3$. The number of features in the year-over-year case would be at least $(2(2n))^3$. Even a modest number of attributes causes a huge explosion in features. The $n = 4$ case would generate 4,096 features.

The second kernel we created was built as a response to the problems we had with

the first, namely dimensionality explosion and unnecessary features. We design this

kernel with the goal of getting all the important features, including all possible intra-year

ratios and year-over-year ratios. However, we want to avoid the problem of unwanted

features.

For this we chose the graph kernel. As discussed already, the graph kernel is

extremely flexible, which makes it a natural choice when trying to construct specific

features. We exploit the research of Takimoto and Warmuth 100 to build this kernel. We


call this kernel the Financial Kernel, and denote it as $K_F(\mathbf{u}, \mathbf{v})$. $K_F(\mathbf{u}, \mathbf{v})$ is defined on a directed graph $G(A, E)$ with base kernels on all edges $e \in E$ and $K(\mathbf{u}, \mathbf{v})$ on $a$. The Financial Kernel has as input $n$ attributes per year for 2 years. The attributes vector is $\mathbf{u} = (u_{11}, \ldots, u_{n1}, u_{12}, \ldots, u_{n2})'$, where the first index is the attribute number and the second index is the time period. See Figures 4 and 5 for an illustration of the Financial Kernel. Figure 4 illustrates one of the $n - 1$ graphs that make up the Financial Kernel. Each of the $n - 1$ graphs has a source node $s_i$ and a sink node $t_i$. The graphs decrease in size as $i$ increases, because each graph carries information for attributes $i$ through $n$. Each path from source to sink is a feature. The number of features is equal to the number of paths. All $n - 1$ graphs from Figure 4 are brought together by the graph in Figure 5. The paths from $s$ to $t$ make up all of the features in $K_F(\mathbf{u}, \mathbf{v})$.

The kernels on e are base kernels. As defined in Chapter 4, a base kernel is a

kernel function on a vector component. We can have as many different kernels as there

are edges. For the creation of a financial kernel, we limited the base kernels to two forms: one is the standard inner product kernel $K(u_i, v_i) = \langle u_i, v_i \rangle$; the second is $\tilde{K}(u_i, v_i) = \frac{1}{u_i v_i}$.


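Figure 4 – The Financial Kernel 1 [directed graph from source node $s_i$ to sink node $t_i$; edges are labeled 1, $u_{i1}v_{i1}$, $\frac{1}{u_{(i+1)1}v_{(i+1)1}}$, $\frac{1}{u_{(i+2)1}v_{(i+2)1}}$, ..., $\frac{1}{u_{n1}v_{n1}}$, $\frac{1}{u_{i2}v_{i2}}$, $u_{(i+1)2}v_{(i+1)2}$, $u_{(i+2)2}v_{(i+2)2}$, ..., $u_{n2}v_{n2}$]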


Figure 5 – The Financial Kernel 2

According to Takimoto and Warmuth 100, in order to prove that $K_F(\mathbf{u}, \mathbf{v})$ is a kernel, we need only have a directed graph without cycles and show that each edge $e$ is a valid kernel. (For details of their proof, see pg. 33 of Takimoto and Warmuth 100.) Examination of Figures 4 and 5 clearly shows that the graph is directed and free of cycles. We need to show that both $K(u_i, v_i)$ and $\tilde{K}(u_i, v_i)$ are kernels. $K(u_i, v_i)$ is simply the standard inner product kernel. $\tilde{K}(u_i, v_i)$ can be shown to be a kernel as follows:

(1) $f(u_i) = u_i^{-1}, \quad i = 1 \ldots n$

(2) $f(v_i) = v_i^{-1}, \quad i = 1 \ldots n$

(3) $f(u_i) f(v_i) = u_i^{-1} v_i^{-1} = \frac{1}{u_i v_i}$

(4) By Cristianini and Shawe-Taylor [2000] (pg. 42) 20, $\tilde{K}(u_i, v_i) \equiv f(u_i) f(v_i)$ is a kernel.

The features of the Financial Kernel are:

$$\phi(\mathbf{u}) = \left(\frac{u_{i1}}{u_{j1}}, \frac{u_{j2}}{u_{i2}}, \frac{u_{i1}u_{j2}}{u_{j1}u_{i2}}\right)', \quad i, j = 1 \ldots n, \; i < j.$$

Here is a small example. In this example $\mathbf{u} = (u_{11}, u_{21}, u_{12}, u_{22})'$ and $\mathbf{v} = (v_{11}, v_{21}, v_{12}, v_{22})'$.


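$$K_F(\mathbf{u}, \mathbf{v}) = \frac{u_{11}v_{11}}{u_{21}v_{21}} + \frac{u_{22}v_{22}}{u_{12}v_{12}} + \frac{u_{11}v_{11}u_{22}v_{22}}{u_{21}v_{21}u_{12}v_{12}}$$

which gives the following features:

$$\left(\frac{u_{11}}{u_{21}}, \frac{u_{22}}{u_{12}}, \frac{u_{11}u_{22}}{u_{21}u_{12}}\right)'.$$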
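The small example can be reproduced with a short sketch. The code below builds the three features for every attribute pair explicitly and takes their inner product; it does not implement the path-based graph computation of Takimoto and Warmuth. Representing a firm as a pair of per-year lists, and the function names, are assumptions made for illustration.

```python
from itertools import combinations


def financial_features(year1, year2):
    """For each pair i < j: u_i1/u_j1, u_j2/u_i2, and their product."""
    feats = []
    for i, j in combinations(range(len(year1)), 2):
        r1 = year1[i] / year1[j]   # year 1 ratio
        r2 = year2[j] / year2[i]   # year 2 ratio, inverted
        feats.extend([r1, r2, r1 * r2])
    return feats


def financial_kernel(u, v):
    """Inner product of the explicit Financial Kernel features."""
    fu = financial_features(*u)
    fv = financial_features(*v)
    return sum(a * b for a, b in zip(fu, fv))


# Hypothetical two-attribute firms: (year 1 values, year 2 values).
u = ([4.0, 2.0], [1.0, 2.0])
v = ([1.0, 2.0], [4.0, 2.0])
k_uv = financial_kernel(u, v)
```

With two attributes there is a single pair, so each firm yields exactly the three features listed in the example.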

In general, for year 1 we get all ratios in the form of $\frac{u_i}{u_j}$. In year two we get all ratios in the form of $\frac{u_j}{u_i}$, which is the inverse of year 1. We structure the ratios in this form in order to get year-over-year changes of the form $\frac{u_{i1}u_{j2}}{u_{j1}u_{i2}}$.

The feature space we have constructed so far with intra-year ratios has the structure $\frac{u_{i1}}{u_{j1}}$ and $\frac{u_{j2}}{u_{i2}}$. It is evident that with this kernel we get either a given ratio or its inverse, but not both. In other words, if the true feature is $\frac{u_{j1}}{u_{i1}}$, this mapping only gives the inverse. By constructing the features in this manner, we reduce the dimensionality necessary to get year-over-year changes, but we lose a potentially important set of features in the process. For the year-over-year changes all we need to do is get the product of the intra-year ratios $\frac{u_{i1}}{u_{j1}}$ and $\frac{u_{j2}}{u_{i2}}$. The computational complexity of the Financial Kernel is $3\left(\frac{n(n-1)}{2}\right)$, for $n$ attributes and 2 periods. This is easy to see as each pair of attributes $i, j$ is represented three times, $\left(\frac{u_{i1}}{u_{j1}}, \frac{u_{j2}}{u_{i2}}, \frac{u_{i1}u_{j2}}{u_{j1}u_{i2}}\right)'$, and the number of attribute pairs is $\frac{n(n-1)}{2}$.
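To make the difference in growth concrete, the sketch below compares the Financial Kernel feature count with the lower bound quoted earlier for the polynomial kernel on two years of mapped attributes. The function names and default arguments are illustrative assumptions.

```python
def financial_kernel_feature_count(n):
    """Three features per attribute pair: 3 * n(n-1)/2."""
    return 3 * n * (n - 1) // 2


def polynomial_feature_count(n, d=3, periods=2):
    """Lower bound (periods * 2n)^d for the inverse-mapped polynomial kernel."""
    return (periods * 2 * n) ** d


for n in (4, 6, 10):
    print(n, financial_kernel_feature_count(n), polynomial_feature_count(n))
```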


We validate the financial kernel on simulated data to test the kernel’s ability on

inputs of a known function. We take the Altman Z-Score and modify it slightly, to add a

time series component.

The function we create is:

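$$\begin{aligned}
& 6.56\left(\frac{CA - CL}{TA}\right)_1 + 3.26\left(\frac{RE}{TA}\right)_1 + 6.72\left(\frac{EBIT}{RE}\right)_1 + 1.05\left(\frac{BE}{TL}\right)_1 + \\
& 6.56\left(\frac{CA - CL}{TA}\right)_2 + 3.26\left(\frac{RE}{TA}\right)_2 + 6.72\left(\frac{EBIT}{RE}\right)_2 + 1.05\left(\frac{BE}{TL}\right)_2 + \\
& 2\left(\frac{CA - CL}{TA}\right)_1\left(\frac{TA}{CA - CL}\right)_2 + 3\left(\frac{RE}{TA}\right)_1\left(\frac{TA}{RE}\right)_2 + 3\left(\frac{EBIT}{RE}\right)_1\left(\frac{RE}{EBIT}\right)_2 + \left(\frac{BE}{TL}\right)_1\left(\frac{TL}{BE}\right)_2 = \text{score}
\end{aligned}$$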

The first and second rows of this function are year 1 and year 2 individual Altman

Z Scores. The third row is year-over-year changes in the ratios of the Altman Z-Score.

The weights on the year-over-year changes were chosen arbitrarily.
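A minimal sketch of this scoring function and of the median-split labeling described next is given below. The per-year dictionaries keyed by account abbreviation, and the function names, are assumptions made for illustration; the weights are those of the function above.

```python
LEVEL_WEIGHTS = (6.56, 3.26, 6.72, 1.05)   # weights on each year's ratios
YOY_WEIGHTS = (2.0, 3.0, 3.0, 1.0)         # weights on year-over-year terms


def ratios(year):
    """The four ratios of the modified function for one year."""
    return (
        (year["CA"] - year["CL"]) / year["TA"],
        year["RE"] / year["TA"],
        year["EBIT"] / year["RE"],
        year["BE"] / year["TL"],
    )


def modified_z(year1, year2):
    """Year 1 score + year 2 score + weighted year-over-year products."""
    r1, r2 = ratios(year1), ratios(year2)
    level = sum(w * (a + b) for w, a, b in zip(LEVEL_WEIGHTS, r1, r2))
    yoy = sum(w * a / b for w, a, b in zip(YOY_WEIGHTS, r1, r2))
    return level + yoy


def label_by_median(scores):
    """Label the top half +1 and the bottom half -1 (even-length input)."""
    ordered = sorted(scores)
    mid = len(ordered) // 2
    threshold = (ordered[mid - 1] + ordered[mid]) / 2
    return [1 if s > threshold else -1 for s in scores]


# Hypothetical firm with identical balances in both years.
firm = {"CA": 3.0, "CL": 1.0, "TA": 4.0, "RE": 2.0, "EBIT": 1.0, "BE": 2.0, "TL": 2.0}
score = modified_z(firm, firm)
```

When both years are identical every year-over-year product equals one, so the third row contributes only the sum of its weights.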

Our dataset contains 2,000 randomly generated examples labeled with the modified

Altman Z-score function. We divide the examples up by sorting the data on the score.

The threshold value for our modified Altman function is chosen as the midpoint between

the scores of sorted items 1,000 and 1,001. Thus all examples in the top half are labeled +1 and all

examples in the bottom half -1. We run experiments using the Financial Kernel, a polynomial

kernel of degree 2, a Gaussian kernel, and a linear kernel. The results are as follows:

Table 1 – Financial Kernel Validation

Kernel               SV     Test on Train   10 fold cross validation
Linear               877    85%             84%
Polynomial (deg 2)   1998   75%             55%
Gaussian             1056   86%             86%
Financial Kernel     707    92%             91%

The results show that the Financial Kernel achieves superior results when using 10-

fold cross validation. The first column is the number of support vectors. A bound on


generalization error is $\frac{\#SV}{l}$. The generalization error of the Financial Kernel is the lowest of the listed kernels. The result is not quite as expected though. One would expect the Financial Kernel to achieve perfect separation. The reason for the error is an assumption we made when developing the kernel. The assumption was that we could represent both of the ratios $\frac{u_i}{u_j}$ and $\frac{u_j}{u_i}$ by only one of them. In order to get perfect separation, we hypothesize that we need to have both $\frac{u_i}{u_j}$ and $\frac{u_j}{u_i}$ as features.

This is easily achieved by adding a mirror image of Figure 4, with the components

being inverses of the components of Figure 4. Figure 6 shows the updated Financial

Kernel.

This Chapter detailed the development of the Financial Kernel, one of the two main

methodological contributions of this research. In Chapter 6 the development of the

Accounting Ontology is explained.


Figure 6 – Updated Financial Kernel


CHAPTER 6 THE ACCOUNTING ONTOLOGY AND CONVERSION OF DOCUMENTS TO TEXT VECTORS

We describe a methodology for creating an accounting ontology in Section 6.1.

Section 6.2 describes how the ontology is used in conjunction with the vector space

model to turn accounting documents into text vectors.

6.1 The Accounting Ontology

The accounting ontology is built using an accounting corpus to represent the

accounting domain and general corpora to represent the general domain. The accounting

corpus is the US Master GAAP Guide 16. We chose this because it explains generally

accepted accounting principles in a fairly non-technical manner. It uses all the

terminology, but in more regular language than a legal publication. We get our general

corpora from the Text Research Collection, which is syndicated by TREC 103. This

collection includes material from the Associated Press, The Wall Street Journal, the

Congressional Record, and various newspapers. The Text Research Collection has been

used in many natural language processing applications and is often used to test IR

methodologies.

A domain specific ontology is created by a series of major steps, each with its own

series of minor steps. Figure 7 shows how the ontology is created. There are two classes

of corpora, the domain corpora and the general corpora. Both are part-of-speech tagged

and fed into the function that determines which concepts are germane to the accounting

domain, as described in Step 1 below. A set of concepts and other domain specific terms


called Novel Concepts are put through a process that uses the syntactic structure of the

accounting corpora, as described in Step 2 below. The result of this step is a WordNet

enriched with novel terms from the accounting domain. The final step in ontology

creation is to add new multi-word concepts to WordNet based on an algorithm that uses

the syntactic structure of domain concepts, as described in Step 3 below. The details of

each step are explained in the remainder of the section.

6.1.1 Step 1: Determine Concepts and Novel Terms that are specific to the accounting domain

We start with a part-of-speech (POS) tagger used to tag the natural language text.

This puts additional structure on the individual words. The POS tagger used is a

derivative of the Brill tagger, called MontyTagger 110. The tagger is run on both the

accounting corpus and the general corpora. The POS tagged data is culled down to the

following form:

<Accounting Corpus>

Word1#POS#WordCount

Word2#POS#WordCount

WordN#POS#WordCount

</Accounting Corpus>

<General Corpus 1>

Word1#POS#WordCount

Word2#POS#WordCount


Figure 7 – Accounting Ontology Creation Process [flowchart boxes: Accounting Domain; General Domain; POS Tagger (one per domain); Modified TF-IDF; Domain Concepts and Novel Terms (Step 1); Lexico-Syntactic Patterns; k-Nearest Neighbor; Domain Concepts enriched with Novel Terms (Step 2); Header Modifier Algorithm; Add Multi-word Domain Concepts (Step 3)]


WordN#POS#WordCount

</General Corpus 1>

<General Corpus 2>

Word1#POS#WordCount

Word2#POS#WordCount

WordN#POS#WordCount

</General Corpus 2>

<General Corpus m>

Word1#POS#WordCount

Word2#POS#WordCount

WordN#POS#WordCount

</General Corpus m>

where Word#POS#WordCount is as follows:

Word – Stemmed Word

POS – Part of Speech of Word

WordCount – Number of times a word appears in a document.
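As a small illustration of this record format, the sketch below parses one Word#POS#WordCount line into its three fields. The `Entry` type and the function name are hypothetical and introduced only for this example.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "word pos count")


def parse_entry(line):
    """Split a 'Word#POS#WordCount' record into its three fields."""
    # rsplit keeps any '#' inside the word itself intact.
    word, pos, count = line.strip().rsplit("#", 2)
    return Entry(word, pos, int(count))


entry = parse_entry("defeasance#NN#12\n")
```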

The word counts are run through a function in order to detect words that have the

highest amount of information for that particular domain. For example, when

considering the accounting domain versus a general domain, the word “defeasance” will

have a higher score for the accounting domain because it is specific to accounting, while

the word “balance” will have a lower score as it can be found equally in the accounting


and the general domain. The function used is a modification of the basic TF-IDF

function which includes WordNet Concepts.

Recall the basic TF-IDF function: $w_{ij} = (tf)_{ij} \cdot \log\frac{N}{n}$, where $w_{ij}$ is the weight of term $t_j$ in document $d_i$, $(tf)_{ij}$ is the frequency of term $t_j$ in document $d_i$, $N$ is the number of documents in the collection and $n$ is the number of documents where the term $t_j$ occurs at least once 44. The inverse document frequency is $(idf)_{ij} = \log\frac{N}{n}$.
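The basic weighting can be sketched as follows, assuming each document is represented as a dictionary of term counts. The representation and the function name are assumptions for illustration.

```python
import math


def tf_idf(term, doc, collection):
    """Weight of `term` in `doc`: term frequency times log(N / n)."""
    tf = doc.get(term, 0)
    # n: number of documents in which the term occurs at least once.
    n = sum(1 for d in collection if d.get(term, 0) > 0)
    if tf == 0 or n == 0:
        return 0.0
    return tf * math.log(len(collection) / n)


# Hypothetical collection of four documents.
docs = [
    {"defeasance": 3, "balance": 2},
    {"balance": 5},
    {"weather": 1},
    {"balance": 1},
]
w_rare = tf_idf("defeasance", docs[0], docs)
w_common = tf_idf("balance", docs[0], docs)
```

A term that appears in few documents, such as "defeasance" here, receives a higher weight than one spread across most of the collection.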

We modify the TF-IDF as follows:

(1) $rlv(t|d) = \log(tf_{t,d}) \log\left(\frac{N}{df_t}\right)$

(2) $rlv(c|d) = \sum_{t \in c} rlv(t|d)$

(3) $rlv(c|d) = \frac{T}{|c|} \sum_{t \in c^+} rlv(t|d)$

(4) $rlv(c|d) = \max_{c_i} \left(\frac{T}{|c_i|} \sum_{t \in c_i^+} rlv(t|d)\right)$

Function (1) mimics the basic TF-IDF function, only the $d$ stands for domain instead of document in our research. $rlv$ is the domain relevance of a term $t$ on domain $d$. $N$ is the number of domains. In function (2), we introduce $c$ for concept, where $c = \{t_1, t_2, \ldots, t_n\}$. This introduces the notion of a synset. By considering the relevance of terms and their synonyms, we get a clearer understanding of the domain. Function (2) sums up the relevance $rlv$ for all terms $t$ in the synset $c$. This is a concept relevance score. Function (3) sharpens this, considering hyponyms. The $c^+$ is the concept,


unioned with all of its direct hyponym sets. Recall, the hyponym of a concept $i$ is a concept $j$ which is a specific instance of concept $i$. For example, a bank account is an account. Function (3) sums up the relevance $rlv$ for all terms $t$ in $c$ and all of $c$'s direct hyponyms. Looking at the direct hyponyms gives us one more measure of a concept's relevance. Function (3) adds an additional term $\frac{T}{|c|}$, where $T$ is the total number of terms $t_i$ within a concept $c$ which are found in the domain corpus. The $|c|$ is the cardinality of the concept. This set of functions was developed by Sacaleanu and Buitelaar 15.

We add a measure of word sense disambiguation in Function (4) by comparing the domain frequency of various senses of a term $t$. In other words, consider concept $c_1$ with terms $(t_1, t_2, t_3)$ and concept $c_2$ with terms $(t_3, t_4, t_5)$. Notice that $t_3$ is in both $c_1$ and $c_2$. We determine which concept $t_3$ actually belongs to by comparing the Function (3) scores of the two concepts. We choose the concept ($c_1$ or $c_2$) which achieves the maximum value in Function (3).
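The four functions can be sketched in a few lines of Python. The nested dictionary of per-domain term counts, the representation of a sense as a (synset, hyponym terms) pair, and all names below are assumptions made for illustration; they are not the data structures used in this research.

```python
import math


def term_relevance(term, domain, counts):
    """Function (1): log(tf) * log(N / df), with domains in place of documents."""
    tf = counts[domain].get(term, 0)
    if tf == 0:
        return 0.0
    df = sum(1 for c in counts.values() if c.get(term, 0) > 0)
    return math.log(tf) * math.log(len(counts) / df)


def concept_relevance(synset, hyponym_terms, domain, counts):
    """Function (3): (T / |c|) * sum of relevance over c and its direct hyponyms."""
    expanded = set(synset) | set(hyponym_terms)
    found = sum(1 for t in synset if counts[domain].get(t, 0) > 0)  # T
    total = sum(term_relevance(t, domain, counts) for t in expanded)
    return found / len(synset) * total


def best_sense(senses, domain, counts):
    """Function (4): pick the sense with the highest Function (3) score."""
    return max(senses, key=lambda s: concept_relevance(s[0], s[1], domain, counts))


# Hypothetical term counts for two domains.
counts = {
    "accounting": {"stock": 50, "inventory": 20, "merchandise": 5},
    "general": {"stock": 50, "broth": 30, "soup": 10},
}
senses = [
    (("broth", "stock"), ("soup",)),
    (("stock", "inventory"), ("merchandise", "wares", "product")),
]
chosen = best_sense(senses, "accounting", counts)
```

Because "stock" occurs in both hypothetical domains, its own relevance is zero, and the choice between senses is driven by the other synset members and hyponyms.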

Here is an illustrative example. The noun “stock” in WordNet has 17 different

senses (definitions). Listed below are 4 of the 17 senses.

1. stock -- (the capital raised by a corporation through the issue of shares entitling

holders to an ownership interest (equity); "he owns a controlling share of the company's

stock")

=> capital, working capital -- (assets available for use in the production of

further assets)

2. broth, stock -- (liquid in which meat and vegetables are simmered; used as a basis

for e.g. soups or sauces; "she made gravy with a base of beef stock")


=> soup -- (liquid food especially of meat or fish or vegetable stock often

containing pieces of solid food)

3. stock, inventory -- (the merchandise that a shop has on hand; "they carried a vast

inventory of hardware")

=> merchandise, wares, product -- (commodities offered for sale; "good business

depends on having good merchandise"; "that store offers a variety of products")

4. livestock, stock, farm animal -- (not used technically; any animals kept for use or

profit)

=> placental, placental mammal, eutherian, eutherian mammal -- (mammals

having a placenta; all mammals except monotremes and marsupials)

Senses 1 and 3 are much more likely to come up in an accounting context than

senses 2 and 4. In order to test which sense is the most likely sense in the context of a

document or corpus, we compare relevance scores, which include for Sense 1 “stock” and

its hyponyms “capital” and “working capital”, with Sense 3 “stock”, “inventory”, and its

hyponyms “merchandise”, “wares”, and “product”. The Sense with the highest relevance

becomes the candidate to be a domain specific concept.

All word sense disambiguated concepts are sorted based on score, and the highest

scoring concepts become domain specific concepts. Novel terms are those terms that

have high scores but do not fit into a WordNet category. These terms are very important

as they give us an opportunity to enrich WordNet with domain knowledge.

WordNet can be viewed as a hierarchical tree where the nodes are concepts and the

edges are relationships. Figure 8 shows a simplified WordNet tree after Step 1. In this

tree accounting domain concepts are filled in with the color gray. We also show that


there is a listing of novel terms which are highly important to the accounting domain but

cannot be matched to current WordNet concepts. These terms are the subject of Step 2.

Figure 8 – WordNet Noun hierarchy with Domain Concepts [tree whose Domain Concept nodes are shaded gray, shown beside a Novel Term List: Term 1, Term 2, …, Term n]

6.1.2 Step 2: Merge Novel Terms with Concepts

In this step, we take the novel terms that were not matched to concepts and we

attempt to fit them into a domain concept. We use the methodology of Buitelaar and

Sacaleanu 14. This process is done using lexico-syntactic patterns. Consider the natural

text, before preprocessing. In such a text there are certain syntactic patterns that arise,

such as [determiner, adjective, noun, verb, noun]. A sentence with this structure would be

“The large crane eats breakfast.” The “the” is a determiner, large is an adjective (ADJ),

crane is a noun (NN), eats is a verb (V) and breakfast is a noun (NN). We consider the

syntactic patterns that arise in 7-grams, that is, contiguous 7 word structures. We look for

patterns with three words to the left and three words to the right of a central word. This

central word will always be either a domain concept (which includes all constituent

terms) or a novel term. The basic idea is as follows: look for patterns where novel terms


and concepts appear together, often interchangeably. A visual representation of a 7-gram

is below:

[null, null, ADJ, Term/Concept, null, V, NN]

This 7-gram has words in the following sequence: two words on the left that are

not important to us (signified by null), an adjective, the term or concept, another

unimportant word, and then a verb and a noun. The parts-of-speech we are concerned

with are nouns, verbs and adjectives. All other parts-of-speech are considered “null”.
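The extraction of such patterns can be sketched as follows. The simplified tag names (NN, V, ADJ), the use of `None` for "null", and the function name are assumptions for illustration; a real tagger such as a Brill-style tagger emits a finer-grained tag set that would first have to be mapped onto these three classes.

```python
KEPT_TAGS = {"NN", "V", "ADJ"}  # parts of speech retained in a pattern


def seven_gram_patterns(tagged, targets):
    """Return (word, pattern) for each target word with 3 neighbors per side."""
    patterns = []
    for k in range(3, len(tagged) - 3):
        word, _ = tagged[k]
        if word not in targets:
            continue
        window = tagged[k - 3:k + 4]
        # Keep only nouns, verbs and adjectives; everything else is null.
        pattern = [pos if pos in KEPT_TAGS else None for _, pos in window]
        pattern[3] = "Term/Concept"
        patterns.append((word, pattern))
    return patterns


# Hypothetical tagged fragment with "tax" as the central term.
tagged = [
    ("in", "IN"), ("the", "DT"), ("deferred", "ADJ"), ("tax", "NN"),
    ("quickly", "RB"), ("reduces", "V"), ("income", "NN"),
]
found = seven_gram_patterns(tagged, {"tax"})
```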

In order to determine patterns which are populated with words that are related, we

use a mutual information score based on co-occurrence. The score is used to determine

the semantic similarity of two-word pairs based on how often pairs of words are found

together relative to chance. The mutual information score MI is the following function:

$$MI(c, w) = \log_2\left(\frac{N \sum_{t_i \in c} f(t_i, w)}{f(w) \sum_{t_i \in c} f(t_i)}\right)$$

where $c$ is a concept, $w$ is a word in the pattern, and $N$ is the total number of words in the pattern. $MI$ is an approximation of the probabilistic mutual information function:

$$\log_2 \frac{P(x, y)}{P(x) P(y)}.$$

The details of the derivation of $MI(c, w)$ can be found in 14.
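A count-based version of this score might look like the sketch below. It assumes the usual frequency estimate of pointwise mutual information, with joint counts summed over the terms of the concept; the exact form used in 14 may differ, and the dictionaries and function name are illustrative.

```python
import math


def mutual_information(concept_terms, word, pair_freq, freq, n_words):
    """log2 of (N * joint count) / (concept count * word count)."""
    joint = sum(pair_freq.get((t, word), 0) for t in concept_terms)
    marginal = sum(freq.get(t, 0) for t in concept_terms)
    word_count = freq.get(word, 0)
    if joint == 0 or marginal == 0 or word_count == 0:
        return float("-inf")  # never observed together
    return math.log2(n_words * joint / (marginal * word_count))


# Hypothetical co-occurrence counts.
pair_freq = {("liability", "accrue"): 4}
freq = {"liability": 8, "accrue": 8}
mi = mutual_information(("liability", "indebtedness"), "accrue", pair_freq, freq, 64)
```

A positive score indicates that the concept and the word co-occur more often than chance would predict.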

In order to determine if a novel term belongs inside a particular concept we have to

first decide whether the pattern is reliable. We assume a pattern is reliable if all the terms

of a concept are assigned back to the concept by the k-nearest neighbor algorithm. Below is the data structure for the example pattern above:

[null, null, ADJ, Term/Concept, null, V, NN]


[c,MI, ADJ, V, NN].

As “null” attributes are unimportant, we simply leave them out. For reliability

testing we expect the concept $c$ in its representation above to be clustered together with all

term instances, represented as [$t_i$, MI, ADJ, V, NN]. If this is not the case, the pattern is

considered unreliable.

For all reliable patterns, we use the k-nearest neighbor to cluster the concepts as

seen above together with the novel terms (NT) in the following representation:

[NT, MI, ADJ, V, NN].

If a NT is clustered with a Concept, then we add the NT to the concept, thus enriching

WordNet. Figure 9 shows the WordNet tree after Step 2. This Figure updates Figure 8.

The domain concepts which were found in Step 1 are shaded gray. Figure 9 illustrates

that after Step 2 some of the domain concepts include novel terms, thus enriching

WordNet.

Figure 9 – WordNet Noun hierarchy with Domain Concepts enriched with Novel Terms [tree in which some Domain Concept nodes are replaced by {Concept ∪ NovelTerms}]


6.1.3 Step 3: Add multi-word domain concepts to WordNet

At this point, we have domain concepts enriched with novel terms. We would like

to extend WordNet, by adding new nodes. We do this using a slightly altered form of the

method described by Vossen 108. Vossen utilized the header-modifier relationship to

determine multi-word concepts. For our purposes, a header is a noun and a modifier is

one or more adjectives describing it. For example, “bank account” is a two-word

structure with account as the header and bank as the modifier. Vossen considers all

header-modifier structures, limiting the final set to the ones above a statistical threshold

for a particular domain. We already have our domain concepts from Steps 1 and 2, so we

consider only header-modifiers where the header is one of the domain concepts.

If an instance of the header-modifier structure is considered statistically significant,

then it is added as a node below the header in the tree. This means it becomes a hyponym

of the domain concept. The potential for more than one layer exists. Consider the

following phrases, “federal tax expense” and “state tax expense.” Both of these multiword

phrases are actually line items on an income statement. “Expense” is a term specific to

the accounting domain. A “tax expense” is a term that belongs below “expense” in a

hierarchy. There can be an additional set of nodes below “tax expense” called “federal

tax expense” and “state tax expense.” There can be any number of modifiers for any

noun, although it is likely that the number of modifiers will be between one and three.

The WordNet tree takes on new nodes underneath domain concepts. The new nodes are

the header modifiers deemed significant to the domain. Figure 10 shows a simplified

representation of the tree after Step 3. The figure shows (as do Figures 8 and 9) the domain

concepts as shaded light gray. Additional nodes are added below some domain concepts.


This is represented by nodes which are shaded dark gray. These new nodes are the

domain specific multi-word concepts, added in this step.

Figure 10 – WordNet Noun Hierarchy with Domain Concepts, Novel Terms and Multi-Word Concepts

6.2 Converting Text to a Vector via the Accounting Ontology

Above we developed an accounting ontology methodology. Now we use this

ontology to aid in detection of financial events by using it as a domain filter to get rid of

unwanted noise. Recall that the output of the Accounting Ontology is a set of concepts

specific to the accounting domain, as well as relationships between those concepts. The

process of getting a quantitative form of a text vector is as follows: We input the

company reports in natural language and use a Part-of-Speech tagger 110 as a


preprocessor. The preprocessed document is then parsed down to word counts labeled

with a Part-of-speech (e.g. Word#POS#WordCount). Each word is compared to all

domain concepts. If a word fits into one of the domain concepts, its word count is added

to the vector. The power of concepts rests in the fact that all words inside a concept will

count the same. For example, the word “liability” has the following synonyms:

“indebtedness” and “financial obligation”. All three of these terms are part of the same

concept. If one document has the word “liability”, a word count is placed in the index

reserved for the “liability” concept. If another document has the word “indebtedness” or

“financial obligation”, a word count is placed in the index reserved for the “liability”

concept. Below is a class of concepts:

$$\{c_1, c_2, \ldots, c_n\}, \qquad w_i \in c_j, \qquad i = 1, \ldots, |c_j|, \qquad j = 1, \ldots, n$$

where $c_j$ are concepts, $w_i$ are words and $|c_j|$ is the size of the concept set. The filtering

process leaves us with only concepts $c_j$ that are specifically related to the accounting

domain.
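As a rough illustration of the concept filter, the sketch below maps word counts onto concept indices so that every synonym inside a concept adds to the same vector slot. Only the "liability" concept is taken from the text; the `asset` concept, the sample documents and the function name `concept_vector` are assumptions made for the example, and part-of-speech tags are ignored for brevity.

```python
from collections import Counter

# Hypothetical concept table: each concept c_j is a set of synonymous words w_i.
# Only the "liability" concept comes from the text; "asset" is illustrative.
CONCEPTS = [
    {"liability", "indebtedness", "financial obligation"},
    {"asset"},
]


def concept_vector(word_counts, concepts=CONCEPTS):
    """Add the count of every word that belongs to concept j to slot j."""
    vector = [0] * len(concepts)
    for word, count in word_counts.items():
        for j, concept in enumerate(concepts):
            if word in concept:
                vector[j] += count
    return vector


# Words outside every concept (here "weather") are filtered out as noise.
doc_a = Counter({"liability": 3, "asset": 1, "weather": 5})
doc_b = Counter({"indebtedness": 2, "financial obligation": 1})
print(concept_vector(doc_a))
print(concept_vector(doc_b))
```

Both documents end up with the same count in the "liability" slot even though they share no surface words, which is the behavior the paragraph above describes.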

We take this reduction a step further by considering the relationships between the

concepts in the Accounting Ontology. We do this by utilizing the tree structure of

WordNet. We need a measure to determine the similarity between nodes (or concepts).

There is a vast literature on similarity measures, so we choose an off-the-shelf measure

that has proven to be among the best. Based on the work of Budanitsky and Hirst [13], we

choose the Jiang and Conrath measure, which has been shown to be more accurate on the

Miller-Charles [63] set than competing similarity measures. We create a similarity matrix,

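comparing each of the concepts, where $s_{ij}$ is the similarity between concepts $c_i$ and $c_j$

for $i, j \in \{1, 2, \ldots, n\}$. Here is a view of this matrix:

$$\begin{bmatrix}
s_{11} & s_{12} & \cdots & s_{1n} \\
s_{21} & s_{22} & \cdots & s_{2n} \\
\vdots & \vdots & \cdots & \vdots \\
s_{n1} & s_{n2} & \cdots & s_{nn}
\end{bmatrix}$$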
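The Jiang and Conrath measure is not restated in this section. For reference, its usual form in the literature is a distance built from information content, where $p(c)$ is the corpus probability of concept $c$ and $\mathrm{lcs}(c_i, c_j)$ is the lowest common subsumer of the two concepts in the WordNet tree. How that distance is converted into the similarity $s_{ij}$ is not specified here, so the inverse shown in the last line is only one common convention and should be read as an assumption.

```latex
\begin{align*}
IC(c) &= -\log p(c) \\
d_{JC}(c_i, c_j) &= IC(c_i) + IC(c_j) - 2\, IC\bigl(\mathrm{lcs}(c_i, c_j)\bigr) \\
s_{ij} &= \frac{1}{d_{JC}(c_i, c_j)} \qquad \text{(assumed conversion, for } d_{JC} > 0\text{)}
\end{align*}
```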

We input the similarity matrix into an agglomerative clustering algorithm [47]. This

algorithm clusters the most similar items and shrinks the matrix. The algorithm is

iterative: in each run, concepts which are less similar are added to existing clusters, so we

choose a parameter $k$, the minimum level of similarity with which two

concepts can be clustered. The clustered concepts $c$ are called super-concepts $sc$, with

$|sc| \le |c|$, where $|\cdot|$ is the size. That is, the total number of super-concepts $sc$ is less than

or equal to the total number of concepts $c$.
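The clustering step can be sketched as follows. This is a minimal illustration, not the algorithm of [47]: it assumes single linkage (the similarity of two clusters is that of their most similar members) and stops when no pair of clusters reaches the threshold `k`. The function name `cluster_concepts` and the sample matrix are hypothetical.

```python
def cluster_concepts(sim, k):
    """Greedy agglomerative clustering on a similarity matrix.

    Repeatedly merges the two most similar clusters and stops once the best
    remaining pair has similarity below k. Single linkage is an assumption.
    Returns a list of clusters, each a list of concept indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        return max(sim[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, (x, y) = max(
            (linkage(a, b), (x, y))
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        if best < k:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Illustrative 4-concept similarity matrix (values are made up).
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
super_concepts = cluster_concepts(sim, k=0.5)
print(super_concepts)
```

Raising `k` leaves more concepts unmerged, so the number of super-concepts can never exceed the number of concepts.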

There are two goals to creating super-concepts:

(1) The super-concepts are designed to cluster concepts that are similar, therefore

financial documents which share accounting super-concepts are more likely to be similar.

(2) The super-concepts allow us to shrink the size of an undoubtedly large vector.

This can help us avoid overfitting on the empirical data, which is possible due to the small

datasets available for fraud and bankruptcy. Below is a class of super-concepts:

$$\{sc_1, sc_2, \ldots, sc_s\}, \qquad c_j \in sc_k, \qquad j = 1, \ldots, m_k, \qquad k = 1, \ldots, s$$

where $m_k$ is the number of such concepts in super-concept $k$.

Chapter 6 explained the methods used to develop the Accounting Ontology. The

procedure for converting text to a vector of numbers was also explained. In the next

Chapter the method of combining the text data with the quantitative data is detailed.

CHAPTER 7 COMBINING QUANTITATIVE AND TEXT DATA

In this chapter we combine quantitative and textual financial data for subsequent

analyses. We turn the text into a numeric vector as discussed in Chapter 6; here we

concatenate the quantitative form of text to the vector of quantitative financial data.

Since we will be applying a kernel to this concatenated vector, we need to expand the

financial kernel developed in Chapter 5.

We concatenate the text and quantitative attribute vectors as a single, partitioned

vector $\mathbf{u} = (u_{11}, u_{21}, \ldots, u_{n1}, u_{12}, u_{22}, \ldots, u_{n2} \mid u_{2n+1}, u_{2n+2}, \ldots, u_m)'$. The Financial Kernel is

applied to $u_{11}, \ldots, u_{n1}, u_{12}, \ldots, u_{n2}$ and the text kernel is applied to $u_{2n+1}, \ldots, u_m$, where

these $m - 2n$ values are the quantitative representation of text. This is a two step process.

(1) We create a graph kernel $K_T(\mathbf{u}, \mathbf{v})$ for the text. (2) We add the text graph to the

Financial Kernel graph.

(1) Text Graph: The text kernel is a linear kernel $K_T(\mathbf{u}, \mathbf{v}) = \langle \mathbf{u}, \mathbf{v} \rangle$. We

show $K_T(\mathbf{u}, \mathbf{v})$ in graph form (Figure 11):

Figure 11 – Text Kernel

This is a directed graph with no cycles and each edge e is a base kernel. This

graph is a kernel by Takimoto and Warmuth [2004] [100] (pg. 33).

(2) Add $K_T(\mathbf{u}, \mathbf{v})$ to $K_F(\mathbf{u}, \mathbf{v})$ to create a combined kernel $K_C(\mathbf{u}, \mathbf{v})$. See Figure

12 below:

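[Figure: graph with nodes $s$, $t$, $s_1, t_1, s_2, t_2, \ldots, s_{n-1}, t_{n-1}$, $s_T$, $t_T$ and subgraphs $G_{F_1}, G_{F_2}, \ldots, G_{F_k}$ and $G_T$]

Figure 12 – Combined Kernel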

The text graph $G_T$ is added to the Financial Kernel. The addition of $G_T$ does not

alter the fundamental structure of the Financial Kernel graph. The graph is still directed

and still contains no cycles. Thus $K_C(\mathbf{u}, \mathbf{v})$ is a kernel.

A simple example illustrates the Combined Kernel. There is an input of 2

quantitative attributes for both years, $u_{11}, u_{21}, u_{12}, u_{22}$, and 4 text attributes,

$u_5, u_6, u_7, u_8$. The input vectors are $\mathbf{u} = (u_{11}, u_{21}, u_{12}, u_{22} \mid u_5, u_6, u_7, u_8)'$ and

$\mathbf{v} = (v_{11}, v_{21}, v_{12}, v_{22} \mid v_5, v_6, v_7, v_8)'$.

$$K_C(\mathbf{u}, \mathbf{v}) = \frac{u_{11} v_{11}}{u_{21} v_{21}} + \frac{u_{22} v_{22}}{u_{12} v_{12}} + \frac{u_{11} v_{11} u_{22} v_{22}}{u_{21} v_{21} u_{12} v_{12}} + u_5 v_5 + u_6 v_6 + u_7 v_7 + u_8 v_8$$

with the following features:

$$\left( \frac{u_{11}}{u_{21}}, \frac{u_{22}}{u_{12}}, \frac{u_{11} u_{22}}{u_{21} u_{12}}, u_5, u_6, u_7, u_8 \right)'.$$

Other kernels could be used in place of the linear kernel, giving additional features on the

text. For this study it is not necessary due to the extensive preprocessing steps used

during the creation of the text vector.
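The worked example can be checked numerically with a short sketch. It assumes the three financial features of the example are $u_{11}/u_{21}$, $u_{22}/u_{12}$ and their product, and that the text kernel is the plain inner product of the last four components. The function names and the input values are made up for illustration.

```python
def financial_features(q):
    """Ratio features of the example for q = (u11, u21, u12, u22)."""
    u11, u21, u12, u22 = q
    return [u11 / u21, u22 / u12, (u11 * u22) / (u21 * u12)]


def combined_kernel(u, v):
    """K_C(u, v): financial part on components 0..3 plus linear text part on 4..7."""
    fu, fv = financial_features(u[:4]), financial_features(v[:4])
    k_financial = sum(a * b for a, b in zip(fu, fv))
    k_text = sum(a * b for a, b in zip(u[4:], v[4:]))
    return k_financial + k_text


# Hypothetical inputs laid out as (u11, u21, u12, u22 | u5, u6, u7, u8).
u = [2.0, 4.0, 3.0, 6.0, 1.0, 0.0, 2.0, 1.0]
v = [1.0, 2.0, 5.0, 10.0, 0.0, 3.0, 1.0, 2.0]
print(combined_kernel(u, v))
```

Because the value is an inner product of explicit feature vectors, swapping `u` and `v` gives the same result, as a kernel requires.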

This Chapter explained the method used to combine text and quantitative data.

Chapter 7 is the final chapter in the methodology creation. The following three Chapters

delve into the empirical research, testing, results and a conclusion. Specifically, Chapter

8 details the research questions, the three datasets used for testing (management fraud,

bankruptcy, and restatements), and the ontologies created. Chapter 9 gives the results

from the tests on the datasets. Chapter 10 gives a summary, conclusion and explanation

of future research.

CHAPTER 8 RESEARCH QUESTIONS, METHODOLOGY AND DATA

This chapter explains the research hypotheses, the data gathering methodology,

accounting ontology creation and data preprocessing. In Section 8.1 the Hypotheses and

test mechanisms are articulated. Section 8.2 outlines the Research Model. Section 8.3

explains the methods used for gathering data for the Fraud, Bankruptcy and Restatement

datasets. Section 8.4 details the ontologies created and Section 8.5 explains data

preprocessing.

8.1 Hypotheses

The main contributions of this research are threefold. (1) We have developed a

financial kernel that operates on quantitative financial attributes. (2) We have developed

an accounting ontology to aid in using textual data in learning tasks. (3) We have

combined these two kernels to simultaneously analyze quantitative and text information.

These methods will be tested for their effectiveness in early detection of financial

events. Our first testable hypothesis is as follows.

Hypothesis 1: A support vector machine using the Combined Kernel, which

includes the Financial Kernel for quantitative data and the Text Kernel for text data,

detects financial events with greater accuracy than quantitative methods alone,

including the Financial Kernel.

A series of tests is run on the financial events data, using the Combination kernel.

All available data, both quantitative and text, is used. We use 10-fold cross validation as

a method for estimating generalization error. We compare the classification accuracy of

our method with the other methods explained in Chapter 2, including linear

discriminant functions and logit functions.
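A minimal sketch of 10-fold cross validation is given below, assuming a random split into ten nearly equal folds. The learner is passed in as two callables, `fit` and `predict`, which are hypothetical placeholders for whatever classifier is being evaluated (an SVM with a chosen kernel, a logit model, and so on). The fold assignment and the seed are illustrative choices, not details taken from this study.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k folds over n examples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def cross_validated_accuracy(examples, labels, fit, predict, k=10):
    """Fraction of examples classified correctly while held out."""
    correct = 0
    for train, test in k_fold_indices(len(examples), k):
        model = fit([examples[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, examples[i]) == labels[i] for i in test)
    return correct / len(examples)


# Toy usage with a majority-class "learner" (placeholder for a real classifier).
def fit_majority(xs, ys):
    return max(set(ys), key=ys.count)


def predict_majority(model, x):
    return model


toy_x = list(range(20))
toy_y = [1] * 20
print(cross_validated_accuracy(toy_x, toy_y, fit_majority, predict_majority))
```

Every example is held out exactly once, so the reported accuracy is an estimate of generalization error rather than training fit.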

The concepts in WordNet include semantic relationships between individual words.

Developing an ontology specific to the domain of accounting allows us to utilize these

relationships when creating the text vector. The basic vector space model does not take

these relationships into account. The expectation is that the ontology driven text vector

will provide a better representation of accounting-related documents than the basic vector

space model.

Hypothesis 2: A Support Vector Machine using data from a text vector filtered

through the accounting ontology will detect financial events with greater accuracy

than a Support Vector Machine using only the vector space model.

Two tests are run on the financial events data, using the combination kernel. One

test uses a vector created by filtering the text through the accounting ontology. The other

is run using a vector of word counts. The results of the tests’ 10-fold cross validation are

reported and compared.

Comparing the classification accuracy of the text and quantitative data allows us to

effectively compare the “information content” in the numbers against that of the text.

Hypothesis 3: Text filtered through an accounting ontology will detect financial

events at least as accurately as pure quantitative methods.

Two tests are run on the financial events data. One uses only the quantitative data,

which is fed into the Financial Kernel and the SVM. The other uses only the text data in

the form of the concept vector, which is preprocessed using the accounting ontology. The

concept vector is fed into the Text Kernel and the SVM. The results of the tests’ 10-fold

cross validation are compared.

8.2 Research Model

In this section, the Research Model is explained. Figure 13 shows the process we

use to study the efficacy of our approach. The empirical analysis is carried out to test our

methodology. Starting on the left of the figure, we gather our dataset, which consists of

companies that were shown to be fraudulent and/or bankrupt. We match the fraud and

bankrupt companies with nonfraud and nonbankrupt companies based on year, sector,

and total assets. Once we have chosen the companies in our dataset, we gather

quantitative data from financial statements and text data from the 10Ks. The financial

data is converted into a vector of attribute values. The text data is filtered through the

accounting ontology and turned into a numerical vector using the counts of the concepts

in the ontology. The text and financial vectors are concatenated and run through the

combination kernel. An SVM using the combination kernel is used to determine a

classifier to distinguish the companies as fraud/nonfraud and bankrupt/nonbankrupt. The

financial vector is similarly processed using the financial kernel to get classification

results for the quantitative data alone. We compare the quantitative results against the

results for the text-only case by feeding the text vector into the text kernel SVM.

Figure 13 – The Discovery Process

[Figure: flowchart with the following boxes.

Data Sets: Fraud (SEC AAERs, AAA Monograph); Bankrupt (Compustat Research); Nonfraud (match to fraud companies on year, industry and total assets); Nonbankrupt (match to bankrupt companies on year, industry, and total assets).

Financial Data: financial statement figures. Text Data: SEC filings, press releases, exogenous press, chat room data.

Other boxes: Accounting Ontology, Vector Space Model, Financial Vector, Text Vector, Financial Kernel, Text Kernel, Combination Kernel, SVM.

Decision: Fraudulent Company, Nonfraudulent Company, Bankrupt Company, Nonbankrupt Company.]

8.3 Datasets

The data gathering methods are described in this section for the Fraud, Bankruptcy

and Restatement datasets. Text and quantitative data are gathered for all companies in

the datasets.

8.3.1 Fraud Data

Gathering fraud data is a task which requires considerable time and effort. The

main data sources are the SEC Accounting and Auditing Enforcement Releases [93] as

well as the Accounting and Auditing Association Monograph by Palmrose [73]. The set

was limited to fraud which occurred no earlier than 1993. The extracted financial data

consists of financial statement figures for two years. The text data set consists of the text

portion of annual reports (10Ks). As the fraud dataset required both text and quantitative

attributes, any company which was missing either the text or quantitative attributes was

deleted from the dataset. The quantitative dataset is shown in Figure 16 of Appendix B.

The attribute definitions are as follows:

Ticker – Company ticker for stock market
Label – fraudulent (-1), nonfraudulent (1)
Ind – Industry Number
Year – 1st year of data collection
Salesyr[1,2] – Sales
ARyr[1,2] – Accounts Receivable
INVyr[1,2] – Inventory
TAyr[1,2] – Total Assets
OAyr[1,2] – Other Assets
CEyr[1,2] – Capital Expenditures

The attributes were chosen based on their reported occurrence in cases of fraud. A

secondary reason for choosing these particular attributes was the likelihood of getting

reported data. This is in contrast to other highly reported fraud attributes, such as

Advertising Expense, Research and Development Expense and Allowance for Bad Debts.

The dimension of the feature space for the Financial Kernel in this experiment is

90. The features are listed in Figure 14. A “YOY” in front of a ratio means the year-

over-year change for that ratio. Here is a listing of the features:

Salesyr1/ARyr1 ARyr1/Salesyr1 Salesyr2/ARyr2 ARyr2/Salesyr2 YOYSalesyr1/ARyr1 YOYARyr1/Salesyr1
Salesyr1/INVyr1 INVyr1/Salesyr1 Salesyr2/INVyr2 INVyr2/Salesyr2 YOYSalesyr1/INVyr1 YOYINVyr1/Salesyr1
Salesyr1/TAyr1 TAyr1/Salesyr1 Salesyr2/TAyr2 TAyr2/Salesyr2 YOYSalesyr1/TAyr1 YOYTAyr1/Salesyr1
Salesyr1/OAyr1 OAyr1/Salesyr1 Salesyr2/OAyr2 OAyr2/Salesyr2 YOYSalesyr1/OAyr1 YOYOAyr1/Salesyr1
Salesyr1/CEyr1 CEyr1/Salesyr1 Salesyr2/CEyr2 CEyr2/Salesyr2 YOYSalesyr1/CEyr1 YOYCEyr1/Salesyr1
ARyr1/INVyr1 INVyr1/ARyr1 ARyr2/INVyr2 INVyr2/ARyr2 YOYARyr1/INVyr1 YOYINVyr1/ARyr1
ARyr1/TAyr1 TAyr1/ARyr1 ARyr2/TAyr2 TAyr2/ARyr2 YOYARyr1/TAyr1 YOYTAyr1/ARyr1
ARyr1/OAyr1 OAyr1/ARyr1 ARyr2/OAyr2 OAyr2/ARyr2 YOYARyr1/OAyr1 YOYOAyr1/ARyr1
ARyr1/CEyr1 CEyr1/ARyr1 ARyr2/CEyr2 CEyr2/ARyr2 YOYARyr1/CEyr1 YOYCEyr1/ARyr1
INVyr1/TAyr1 TAyr1/INVyr1 INVyr2/TAyr2 TAyr2/INVyr2 YOYINVyr1/TAyr1 YOYTAyr1/INVyr1
INVyr1/OAyr1 OAyr1/INVyr1 INVyr2/OAyr2 OAyr2/INVyr2 YOYINVyr1/OAyr1 YOYOAyr1/INVyr1
INVyr1/CEyr1 CEyr1/INVyr1 INVyr2/CEyr2 CEyr2/INVyr2 YOYINVyr1/CEyr1 YOYCEyr1/INVyr1
TAyr1/OAyr1 OAyr1/TAyr1 TAyr2/OAyr2 OAyr2/TAyr2 YOYTAyr1/OAyr1 YOYOAyr1/TAyr1
TAyr1/CEyr1 CEyr1/TAyr1 TAyr2/CEyr2 CEyr2/TAyr2 YOYTAyr1/CEyr1 YOYCEyr1/TAyr1
OAyr1/CEyr1 CEyr1/OAyr1 OAyr2/CEyr2 CEyr2/OAyr2 YOYOAyr1/CEyr1 YOYCEyr1/OAyr1

Figure 14 – Fraud Features

8.3.2 Bankruptcy Data

The bankrupt companies were chosen using the Compustat Research database [19].

All chosen companies are from the Manufacturing sector (Industry codes 2000 – 3999).

The companies chosen were delisted between 1993 and 2002. A company is delisted

when it does not meet the minimal requirements of financial stability according to the

market (NYSE, NASDAQ, AMEX). The analysis is limited to post-1992 company years

due to the fact that the model requires the text of the 10Ks to be in electronic form.

Electronic 10Ks were not available until 1993.

Figure 17 in Appendix B shows the entire quantitative dataset. The attribute

definitions are as follows:

Label – bankrupt (-1), nonbankrupt (1)
Ticker – company ticker for stock market
Ind – Industry Number
Year – 1st year of data collection
TAyr[1,2] – Total Assets
REyr[1,2] – Retained Earnings
WCyr[1,2] – Working Capital
EBITyr[1,2] – Earnings before Interest and Taxes
SEyr[1,2] – Stockholder’s Equity
TLyr[1,2] – Total Liabilities

The attributes chosen are the components of the Altman Z-score for

manufacturing [4]. The dimension of the feature space for the Financial Kernel in this

experiment is 90. The features are listed in Figure 15. A “YOY” in front of a ratio

means the year-over-year change for that ratio. Here is a listing of the features:

TAyr1/REyr1 REyr1/TAyr1 TAyr2/REyr2 REyr2/TAyr2 YOYTAyr1/REyr1 YOYREyr1/TAyr1
TAyr1/WCyr1 WCyr1/TAyr1 TAyr2/WCyr2 WCyr2/TAyr2 YOYTAyr1/WCyr1 YOYWCyr1/TAyr1
TAyr1/EBITyr1 EBITyr1/TAyr1 TAyr2/EBITyr2 EBITyr2/TAyr2 YOYTAyr1/EBITyr1 YOYEBITyr1/TAyr1
TAyr1/SEyr1 SEyr1/TAyr1 TAyr2/SEyr2 SEyr2/TAyr2 YOYTAyr1/SEyr1 YOYSEyr1/TAyr1
TAyr1/TLyr1 TLyr1/TAyr1 TAyr2/TLyr2 TLyr2/TAyr2 YOYTAyr1/TLyr1 YOYTLyr1/TAyr1
REyr1/WCyr1 WCyr1/REyr1 REyr2/WCyr2 WCyr2/REyr2 YOYREyr1/WCyr1 YOYWCyr1/REyr1
REyr1/EBITyr1 EBITyr1/REyr1 REyr2/EBITyr2 EBITyr2/REyr2 YOYREyr1/EBITyr1 YOYEBITyr1/REyr1
REyr1/SEyr1 SEyr1/REyr1 REyr2/SEyr2 SEyr2/REyr2 YOYREyr1/SEyr1 YOYSEyr1/REyr1
REyr1/TLyr1 TLyr1/REyr1 REyr2/TLyr2 TLyr2/REyr2 YOYREyr1/TLyr1 YOYTLyr1/REyr1
WCyr1/EBITyr1 EBITyr1/WCyr1 WCyr2/EBITyr2 EBITyr2/WCyr2 YOYWCyr1/EBITyr1 YOYEBITyr1/WCyr1
WCyr1/SEyr1 SEyr1/WCyr1 WCyr2/SEyr2 SEyr2/WCyr2 YOYWCyr1/SEyr1 YOYSEyr1/WCyr1
WCyr1/TLyr1 TLyr1/WCyr1 WCyr2/TLyr2 TLyr2/WCyr2 YOYWCyr1/TLyr1 YOYTLyr1/WCyr1
EBITyr1/SEyr1 SEyr1/EBITyr1 EBITyr2/SEyr2 SEyr2/EBITyr2 YOYEBITyr1/SEyr1 YOYSEyr1/EBITyr1
EBITyr1/TLyr1 TLyr1/EBITyr1 EBITyr2/TLyr2 TLyr2/EBITyr2 YOYEBITyr1/TLyr1 YOYTLyr1/EBITyr1
SEyr1/TLyr1 TLyr1/SEyr1 SEyr2/TLyr2 TLyr2/SEyr2 YOYSEyr1/TLyr1 YOYTLyr1/SEyr1

Figure 15 – Bankruptcy Features

8.3.3 Restatement Data

Restatements as defined in this research are annual reports by publicly traded

companies, which have been restated either voluntarily or involuntarily. Restatements

are a much more loosely defined dataset than that of bankruptcy or fraud. There is a

strong interest as to the underlying causes of restatements, which was a primary

motivation for the addition of this dataset. The restatements analyzed in this study were a

subset of all restatements of publicly traded companies for the years of 1997 – 2002

(details are explained below). The Restatement dataset was gathered using report code

GAO-03-138 [37] by the General Accounting Office. The restatements in this report

involve accounting irregularities resulting in material misstatements of financial results.

Restatements can be seen as a superset which includes fraud and earnings management as

subsets. When a company is deemed to have committed fraudulent activity or managed

earnings, the SEC requires that the company restate its financials. The GAO report

includes an appendix which lists all restatements for the years between 1997 and 2002.

The restatement dataset is the largest of the datasets tested (800/1,379 cases); by comparison, the

fraud dataset had 122 cases and the bankruptcy dataset had 156 cases. There were 919

restatements for publicly traded companies between the years of 1997 and 2002 [37]. The

quantitative dataset was 1,379 companies, 690 of which were restatements and 689 of

which were nonrestatements. The smaller, 800 case dataset is a subset of the 1,379 case

dataset which includes text and quantitative attributes. The size 800 dataset is split

evenly between restatements and nonrestatements. The reduction from 919 to 690 was

due completely to the lack of quantitative data available for some of the companies in the

GAO report. The drop from 690 to 400 restatements for the combined dataset was due to

the inability to get 10K data for some of the GAO companies. This was due in part to the

GAO's inclusion of foreign companies and companies traded on Over The Counter

markets, neither of which is required to submit the same type of 10K. The quantitative

attributes for this dataset are as follows:

Ticker – Company ticker for stock market
Label – restatement (-1), nonrestatement (1)
Ind – Industry Number
Year – 1st year of data collection
Salesyr[1,2] – Sales
ARyr[1,2] – Accounts Receivable
INVyr[1,2] – Inventory
TAyr[1,2] – Total Assets
OAyr[1,2] – Other Assets
CEyr[1,2] – Capital Expenditures

The entire quantitative dataset is in Appendix B under the title of Figure 18. The

features for the Restatement Dataset are the same as the features in the Fraud Dataset,

under Figure 14.

8.4 The Ontology

The ontology is a three-level ontology composed of concepts, two-grams and three-

grams. The concepts may be one word or two word concepts. The two-grams and three-

grams are built on top of the concepts. The size of the ontology is determined at three

levels, the concept, two-gram, and three-gram level. A concept can have many children

at the two-gram and three-gram levels. A two-gram can have many children at the three-

gram level. A two-gram is always a direct child of a concept. A three-gram may be a

direct child of a concept or a two-gram.

Appendix A shows a 300 dimension ontology. This ontology was built using the

entire GAAP text [28] as the accounting corpus. The 300 dimensions include 100

concepts, 100 two-grams and 100 three-grams. Given the small number of examples in

the fraud and bankruptcy datasets, 300 dimensions was the largest ontology created. The

concepts are determined by the functions described in Chapter 6. The concepts chosen

for this ontology are the ones that had the highest scores as described in Chapter 6. The

two-grams and three-grams are chosen based on mutual information scores, using

respectively the Dice Coefficient and ll3 [5]. Commonly accepted Mutual Information

scores are available for two and three-grams. Higher order n-grams do not have accepted

Mutual Information scores, therefore this analysis is limited to two and three-grams. An

ontology of 100 two-grams and 100 three-grams makes it feasible to have some concepts

with both children and grandchildren. The deeper the tree the more specific the ontology

gets. The effect is a more precise ontology. The prediction accuracy on the test datasets

ultimately determines which ontologies are the best for this particular project. The two-

grams and three-grams are preceded by their part-of-speech (n-noun, a-adjective, v-verb).

As seen in Appendix A, there are two numbers after the two and three-grams. The first is

the mutual information score and the second is the overall ranking of the n-gram’s

importance as compared to all n-grams. The ranking is used to determine which two and

three-grams are used in the ontology. A two or three-gram is eligible for the ontology if

at least one of its component words is a concept in the ontology.
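To make the two-gram selection concrete, the sketch below scores adjacent word pairs with the Dice coefficient in its standard form, $2 f(x, y) / (f(x) + f(y))$, ranks them, and keeps only pairs that contain an ontology concept. The exact scoring implementation behind [5] may differ in detail, and the three-gram measure is not covered. The token list, concept set and function names are assumptions for illustration.

```python
from collections import Counter


def dice_scores(tokens):
    """Dice coefficient for every adjacent word pair in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: 2 * n / (unigrams[pair[0]] + unigrams[pair[1]])
        for pair, n in bigrams.items()
    }


def eligible_two_grams(scores, concepts, top=100):
    """Rank pairs by score and keep those with a concept as a component word."""
    ranked = sorted(scores.items(), key=lambda item: -item[1])
    kept = [(pair, s) for pair, s in ranked
            if pair[0] in concepts or pair[1] in concepts]
    return kept[:top]


# Made-up corpus fragment and concept set.
tokens = "deferred tax asset deferred tax liability net income".split()
scores = dice_scores(tokens)
print(eligible_two_grams(scores, concepts={"tax"}, top=3))
```

Here "net income" scores as high as "deferred tax" but is dropped because neither word is a concept, mirroring the eligibility rule stated above.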

Ontology creation is an iterative process. The process must be refined based on the

actual results achieved. The 300 dimension GAAP ontology (Appendix A) was used in

conjunction with 10Ks of bankrupt and nonbankrupt companies (see Section 9.x for

further details). Due to the small size of the dataset the 300 dimension ontology appears

to be overfitting. Two additional GAAP ontologies were created having 60 and 10

dimensions, respectively. These ontologies are available in Appendix A.

Choosing an accounting text as the basis of the ontology has a major impact on the

results. GAAP was chosen because it is a general purpose text that covers all major

accounting topics and is written in natural language. A drawback of GAAP is its indirect

relationship to the MDNAs. A more direct accounting text would be the MDNAs. A set

of ontologies was created using the MDNAs from the bankrupt and nonbankrupt

companies as the accounting text. These ontologies are of the following dimensions: 150

(including 50 concepts, 50 2-grams, 50 3-grams), 50 (including 50 concepts), and 25

(including 25 concepts). All ontologies are available in Appendix A.

8.5 Data Gathering and Preprocessing

The financial information for bankrupt firms was gathered for two consecutive

years prior to delisting. In the event that the financial information was not available for

the two years directly prior to delisting, the latest two years of pre-delisted data were

taken instead. In the case of fraud the financial data was gathered for the first year of

fraud and the year prior to fraud, as reported by the SEC. For example, if the first year of

fraudulent activity was 2000, then data from 1999 and 2000 is gathered. In the case of

restatements, the data were gathered for the year of the restatement and the year

prior to the restatement. The fraudulent/bankrupt/restatement companies were matched

with nonfraudulent/nonbankrupt/nonrestatement companies based on industry, year and

total assets. A match was accepted if total assets of a

nonfraudulent/nonbankrupt/nonrestatement company were within 10% of the

fraudulent/bankrupt company for year one. If no company met this requirement, then the

company with the nearest total asset value was chosen. The Compustat Industrial

Annual database was used in conjunction with a script created using Perl to download the

quantitative financial data for all three datasets.
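The matching rule can be sketched as follows. Picking the candidate with the nearest total assets within the same industry and year satisfies both cases in the text: if any candidate lies within 10%, the nearest one does too, and otherwise the nearest is the stated fallback. How ties among several candidates inside the 10% band were broken is not stated, so choosing the nearest is an assumption. The record layout and tickers are hypothetical.

```python
def match_company(target, candidates, tolerance=0.10):
    """Return (match, within_tolerance) for an event company.

    Candidates must share the target's industry and year; among those, the
    one with the closest year-one total assets is chosen. The flag reports
    whether it falls within the tolerance of the target's total assets.
    """
    pool = [c for c in candidates
            if c["ind"] == target["ind"] and c["year"] == target["year"]]
    if not pool:
        return None, False
    best = min(pool, key=lambda c: abs(c["ta"] - target["ta"]))
    within = abs(best["ta"] - target["ta"]) <= tolerance * target["ta"]
    return best, within


# Hypothetical control firms (tickers and figures are made up).
controls = [
    {"ticker": "AAA", "ind": 3570, "year": 1999, "ta": 95.0},
    {"ticker": "BBB", "ind": 3570, "year": 1999, "ta": 150.0},
    {"ticker": "CCC", "ind": 2000, "year": 1999, "ta": 100.0},
]
event_firm = {"ticker": "XYZ", "ind": 3570, "year": 1999, "ta": 100.0}
print(match_company(event_firm, controls))
```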

The 10Ks were gathered directly from www.sec.gov. There is one 10K per

company and the year of the 10K matches (in most instances) the last year of the

financial information. If the 10K was not available for the last year, then the 10K was

chosen as follows:

(1) The 10K for the year prior to the final year

(2) If (1) was not available, the year after the final year (as long as it is not past the

delist year in the bankruptcy case, the restatement year in the restatement case or

the fraud year in the fraud case).

If (1) and (2) were not available, both the company and its match company were deleted

from the analysis.
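The 10K selection order can be written as a small function. This sketch treats "not past" the delist, restatement or fraud year as less than or equal to that year, which is an interpretation of the wording; the function and parameter names are hypothetical.

```python
def choose_10k_year(final_year, available_years, event_year):
    """Pick the 10K year for a company, or None if the pair must be dropped.

    Preference order: the final year of financial data, then the prior year,
    then the following year provided it is not past the event year.
    """
    if final_year in available_years:
        return final_year
    if final_year - 1 in available_years:
        return final_year - 1
    following = final_year + 1
    if following in available_years and following <= event_year:
        return following
    return None


print(choose_10k_year(1999, {1998, 2000}, event_year=2001))
```

A return value of `None` corresponds to deleting both the company and its match from the analysis.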

The text analysis was limited to the section entitled “Management’s Discussion and

Analysis of Financial Condition and Results of Operations (MDNA).” The MDNA

section is a natural choice as it is the portion of the 10K which allows management to

explain the underlying causes of the company’s financial condition. It also is a section

where forward-looking statements are allowed.

Using the Financial Kernel, the attributes are mapped to features, as explained in

Chapter 5. The total size of the attribute space is 12 for the fraud, bankruptcy and

restatement datasets. The attributes in the fraud and restatement datasets are described in

Sections 8.3.1 and 8.3.3. The attributes in the bankruptcy dataset are described in Section 8.3.2.

The feature space is determined by the function

$$6 \left( A/Y \cdot \frac{A/Y - 1}{2} \right)$$

where $A/Y$ = the number of attributes per year. With $A/Y = 6$ this gives the 90 features listed in Figures 14 and 15.
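The feature count can be verified by enumerating the features directly. The sketch below assumes, following the pattern in Figures 14 and 15, that each unordered pair of attributes contributes six features: the ratio and its inverse in each of the two years, plus the two year-over-year ("YOY") changes. The function names are hypothetical.

```python
from itertools import combinations


def financial_feature_names(attributes):
    """Six feature names per attribute pair, in the order used in Figure 14."""
    names = []
    for a, b in combinations(attributes, 2):
        names += [
            f"{a}yr1/{b}yr1", f"{b}yr1/{a}yr1",
            f"{a}yr2/{b}yr2", f"{b}yr2/{a}yr2",
            f"YOY{a}yr1/{b}yr1", f"YOY{b}yr1/{a}yr1",
        ]
    return names


def feature_space_size(attrs_per_year):
    """6 * (A/Y * (A/Y - 1) / 2): six features for each attribute pair."""
    return 6 * (attrs_per_year * (attrs_per_year - 1) // 2)


fraud_attributes = ["Sales", "AR", "INV", "TA", "OA", "CE"]
fraud_features = financial_feature_names(fraud_attributes)
print(len(fraud_features), feature_space_size(len(fraud_attributes)))
```

Passing the bankruptcy attributes (`TA`, `RE`, `WC`, `EBIT`, `SE`, `TL`) yields the same count, since both datasets have six attributes per year.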

8.5.1 Preprocessing-Quantitative Data

There are three issues to consider for quantitative attributes: missing data, “0-

valued” data, and scaling. Missing data is a common problem with publicly available

financial information. The method used to fill in the missing data for this paper is called

multiple imputation [81]. This method takes into account not only the statistics of the

missing variable over the entire dataset, but also the relationship between the missing

variable and the other variables in the example. The data is put through a multiple

imputation routine in the statistics package R [81]. Quantitative attributes with a value of 0

are a problem in this analysis because of the extensive usage of ratios in the Financial

Kernel. A ratio of the form $x/0$, for any $x$, is undefined. In order to avoid this problem, 0

data are given a value of 0.0001 and the entire dataset is scaled between -1 and 1.
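The zero replacement and scaling can be sketched as below. The text does not say how the scaling is computed, so per-attribute min-max scaling onto [-1, 1] is an assumption, as are the function names. Multiple imputation is not shown.

```python
def replace_zeros(column, epsilon=0.0001):
    """Replace exact zeros so that ratios with this attribute stay defined."""
    return [epsilon if x == 0 else x for x in column]


def scale_to_unit_range(column):
    """Min-max scale one attribute column onto [-1, 1] (assumed method)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [2 * (x - lo) / (hi - lo) - 1 for x in column]


total_assets = [0, 5.0, 10.0]
print(replace_zeros(total_assets))
print(scale_to_unit_range([0.0, 5.0, 10.0]))
```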

8.5.2 Preprocessing-Text Data

The preprocessing of the text data involved the following steps:

(1) Making all text lowercase. This is done to avoid the problem that a computer

will see the same words as different if they are different cases. For example, the words

Asset, asset and ASSET would be considered three separate words. Making all letters

lowercase avoids this problem.

(2) Deleting stopwords. Stopwords are common words that add noise to text

analysis. Deleting these stopwords is a method of cleaning the text. The stopword list

used for preprocessing the ontology is the same stoplist used for preprocessing the

MDNAs. The stoplist is available in Appendix A.

(3) Part-of-speech tagging and stemming. Part-of-speech tagging assures that

matches between the MDNAs and the ontology will occur only for words with the same

spelling and part-of-speech. Stemming removes the suffixes from the words to facilitate

matching of concepts that are the same but used in different tenses.

(4) Concept-naming. For this step, all synsets from each concept from the ontology

are given a single, representative word. For example, the concept liability has three

synonyms: liability, indebtedness and financial obligation. The MDNA is searched for

all three words and each instance is replaced with a single representative word. This

allows for correct matches between the ontology (which was preprocessed with concept-

naming as well) and the MDNAs.

Simple counts of each component of the ontology are placed in vector form for

each company MDNA. The size of the ontology is a user-defined parameter. The size of

an ontology is limited to the top scoring concepts, two-grams and three-grams. The user

decides how many of each should be in the ontology. The main limitation is that only

two and three-grams that have an ontology concept as one of their component words can

be in the ontology.

Below is an example of one MDNA (company name Fifth Dimension Inc., year

1996).

This is a sample of the raw text.

The Corporation spent $122,128 on capital additions during 1996 while recording

$124,611 of depreciation expense. A reduction in capital spending is projected

for 1997 while depreciation reserves are projected at slightly lower levels than in

1996.

This is the text after Steps (1) and (2).

corporation spent 122,128 capital additions 1996 recording 124,611 depreciation

expense. reduction capital spending projected 1997 depreciation reserves

projected slightly lower levels 1996.

This is the text after Steps (3) and (4).

null/JJ/null corporation/NN/corporation spent/VBD/spend 122/CD/122 ,/,/,

128/CD/128 capital/NN/capital additions/NNS/addition 1996/CD/1996

recording/NN/recording 124/CD/124 ,/,/, 611/CD/611

depreciation/NN/depreciation expense/NN/expense ././.

reduction/NN/reduction capital/NN/capital spending/NN/spending

projected/VBN/project 1997/CD/1997 depreciation/NN/depreciation

reserves/NNS/reserve projected/VBN/project slightly/RB/slightly lower/JJR/lower

levels/NNS/level 1996/CD/1996 ././.

The complete MDNA is available via a link in Appendix B.

The text vectors are created by totaling the number of times each ontology

component is encountered in the text of a company’s MDNA. The text vectors are


normalized by dividing each vector component by the total word count of the company’s

MDNA text. This normalization procedure ensures that the importance of concepts to a

particular document is not diminished due to the difference in sizes between documents.
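
The counting and normalization just described can be sketched as follows. This is an illustration only: the function name and the toy ontology are hypothetical, and the total word count is taken here to be the number of tokens left after preprocessing, which is an assumption.

```python
from collections import Counter

def text_vector(tokens, ontology):
    """Count each ontology component in a token list, then normalize.

    `ontology` is an ordered list of concepts, two-grams and three-grams
    (multi-word components are written with single spaces).
    """
    total = len(tokens)
    if total == 0:
        return [0.0] * len(ontology)
    counts = Counter()
    # Count every one-, two- and three-word sequence in the document.
    for n in (1, 2, 3):
        for i in range(total - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # Divide each count by the document's total word count.
    return [counts[term] / total for term in ontology]
```

Because each component is divided by the document length, a concept mentioned twice in an 8-word passage receives the same weight as one mentioned 200 times in an 800-word passage.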

This Chapter gave the research questions along with detailed explanations of the

bankruptcy, fraud and restatement datasets. Data preprocessing was explained as well as

ontology creation. In the next Chapter test results are given on the three datasets along

with discussions on each.


CHAPTER 9 RESULTS

This chapter gives the results of the empirical tests. Each dataset is tested

individually and the results are listed in table format. Following the results for each

dataset is a discussion of the results. The format of the results is explained below.

The experiments are set up so that the hypotheses in Chapter 8 can be either

supported or negated. There are three main categories of tests. The quantitative data is

tested using an SVM with the Financial Kernel. The Text Kernel is tested using various

sizes and types of ontologies. The Combination Kernel is tested using various sizes and

types of ontologies as well. The results are given in Tables 2 – 41. The table headings are

described as follows:

“SV” is the number of support vectors.

“#SV/l” is a rough measure of the generalizability of the “Test on Training”

results. Here l is the number of examples in the dataset.

“C” is a user-defined parameter that defines the penalty for a mistake in the

quadratic optimization problem. Deciding on the right C is more of an art than a science.

After raising C to a certain point, the results will level off or decline. Results are given

for various values of C.

“Test on Training” gives the results of testing the SVM on the same examples that were

used to train it. The number shown is the prediction accuracy of the model on those examples.

“10-fold Cross Validation” results are the average prediction accuracy of 10 SVM

runs where 10% of the examples are left out from training on each run and used for


testing. This method is often used to test the generalizability of a function on small

datasets.
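
The 10-fold procedure can be sketched as follows. This is a minimal illustration, not the code used in this study: the function names are hypothetical, folds are assigned by interleaving indices rather than at random, and a simple midpoint-threshold classifier on a single feature stands in for the SVM.

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train, test) index lists; each example is tested exactly once."""
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_examples) if i not in held]
        yield train, held_out

def cross_val_accuracy(X, y, fit, predict, k=10):
    """Average the held-out accuracy over k train/test runs."""
    accuracies = []
    for train, test in k_fold_indices(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / k

# Stand-in classifier (an assumption): threshold halfway between class means.
def fit_midpoint(X, y):
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_midpoint(threshold, x):
    return 1 if x > threshold else -1
```

Any classifier can be substituted by passing a different `fit` and `predict` pair; the sketch assumes at least k examples so that no fold is empty.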

A few of the tests were left running without completing for over a week. In these

instances it was decided to cancel the runs. The cancelled runs are shown as blanks and

are highlighted in gray.

9.1 Fraud Results

In this section we report the results of the experiments using the Financial Kernel,

the Text Kernel, and the Combination Kernel. Text Kernels of various dimensions were

tested. One set of text kernels is based on the GAAP text and the other is based on

MDNAs. For the fraud experiments, all MDNAs in both the fraud and bankruptcy

datasets are used to create the ontology. Table 2 shows (starting from left) the Test

number, the number of Support Vectors for that Test number, the #SV/l ratio used to

gauge generalizability based on the training set, the error penalty (C) used for that

Test number, the Test on Training Results and the 10-fold Cross Validation results.

Results tables are in this form for the Bankruptcy and Restatement datasets as well.

Tables 2 – 14 illustrate the results on the Fraud datasets.

Table 2 – Fraud Detection Results using Financial Kernel

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        23    18.85%   1        98.36%             94.26%
2        22    18.03%   100      100.00%            95.90%
3        22    18.03%   1000     100.00%            95.90%
4        22    18.03%   10000    100.00%            95.90%


Table 3 – Fraud Detection Results using Text Kernel, 300 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        70    57.38%   1        100.00%            46.49%
2        70    57.38%   100      100.00%            46.49%
3        70    57.38%   1000     100.00%            46.49%
4        70    57.38%   10000    100.00%            46.49%

Table 4 – Fraud Detection Results using Comb. Kernel, 300 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        48    39.34%   1        100.00%            92.11%
2        48    39.34%   100      100.00%            92.11%
3        48    39.34%   1000     100.00%            92.11%
4        48    39.34%   10000    100.00%            92.11%

Table 5 – Fraud Detection Results using Text Kernel, 60 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        74    60.66%   1        80.70%             54.39%
2        76    62.30%   100      85.09%             53.51%
3        76    62.30%   1000     85.09%             53.51%
4        76    62.30%   10000    85.09%             53.51%

Table 6 – Fraud Detection Results using Comb. Kernel, 60 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        31    25.41%   1        100.00%            92.98%
2        28    22.95%   100      100.00%            92.98%
3        28    22.95%   1000     100.00%            92.98%
4        28    22.95%   10000    100.00%            92.98%

Table 7 – Fraud Detection Results using Text Kernel, 10 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        104   85.25%   1        62.28%             44.74%
2        104   85.25%   100      62.28%             44.74%
3        104   85.25%   1000     62.28%             44.74%
4        104   85.25%   10000    62.28%             44.74%


Table 8 – Fraud Detection Results using Comb. Kernel, 10 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        30    24.59%   1        98.25%             92.98%
2        25    20.49%   100      100.00%            91.23%
3        25    20.49%   1000     100.00%            91.23%
4        25    20.49%   10000    100.00%            91.23%

Table 9 – Fraud Detection Results using Text Kernel, 150 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        71    58.20%   1        94.74%             39.47%
2        64    52.46%   100      100.00%            42.98%
3        64    52.46%   1000     100.00%            42.98%
4        64    52.46%   10000    100.00%            42.98%

Table 10 – Fraud Detection Results using Comb. Kernel, 150 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        39    31.97%   1        100.00%            92.98%
2        39    31.97%   100      100.00%            92.98%
3        39    31.97%   1000     100.00%            92.98%
4        39    31.97%   10000    100.00%            92.98%

Table 11 – Fraud Detection Results using Text Kernel, 50 Dim 10K Ont.

Test # SV SV/l C Test on Training 10-fold Cross Validation
1 81 66.39% 1 84.21% 50.00%
2 77 63.11% 100 78.95%
3 73 59.84% 1000 46.77% 48.25%
4 73 59.84% 10000 50.00%

Table 12 – Fraud Detection Results using Comb. Kernel, 50 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        35    28.69%   1        99.12%             92.98%
2        36    29.51%   100      100.00%            92.98%
3        36    29.51%   1000     100.00%            92.98%
4        36    29.51%   10000    100.00%            92.98%


Table 13 – Fraud Detection Results using Text Kernel, 25 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        94    77.05%   1        68.42%             43.86%
2        92    75.41%   100      70.18%             49.12%
3        92    75.41%   1000     70.18%             47.36%
4        92    75.41%   10000    70.18%             47.36%

Table 14 – Fraud Detection Results using Comb. Kernel, 25 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        27    22.13%   1        99.12%             92.11%
2        23    18.85%   100      100.00%            93.86%
3        23    18.85%   1000     100.00%            93.86%
4        23    18.85%   10000    100.00%            93.86%

9.2 Discussion of Fraud Results

Tables 3, 4, 6, 8, 9, 10 and 11 show under “Test on Training” that the SVM was

able to perfectly separate the Fraudulent from the Nonfraudulent companies. The 10-fold

Cross Validation results degrade quite a bit, though. An explanation for this is overfitting,

as the number of features often exceeds the number of examples.

The exception was the Financial Kernel results as shown in Table 2. The 10-fold

cross validation results shown in Table 2 are 95.9% accurate. Also, the #SV/l is

18.03%, which can be interpreted as the risk of incorrect categorization on unseen data.

This is the lowest #SV/l for the Fraud and Bankruptcy experiments.
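
The support-vector ratio quoted above can be reproduced directly from Table 2. The sketch below is illustrative: the function name is hypothetical, and the dataset size of 122 examples is an assumption inferred from the table (23 support vectors at 18.85% implies roughly 122 examples). Reading the ratio as a risk estimate is commonly justified by the leave-one-out bound for SVMs, under which the expected error is at most the expected fraction of examples that are support vectors.

```python
def sv_ratio(num_support_vectors, num_examples):
    """Fraction of training examples that are support vectors."""
    return num_support_vectors / num_examples

# l = 122 is an assumption inferred from Table 2, not stated in the text.
print(f"{sv_ratio(22, 122):.2%}")  # Tests 2 to 4 in Table 2
print(f"{sv_ratio(23, 122):.2%}")  # Test 1 in Table 2
```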

The strong results from the Financial Kernel are contrasted against the results from

the Text Kernel. No Text Kernel gave 10-fold cross validation results more accurate than

54.39%. Given that the sample is split 50/50 between the positive and the negative

classes this result is not very encouraging. The Combination Kernel got strong results,

which can be attributed to the Financial Kernel portion. The results are slightly worse

with the Combination Kernel than with the Financial Kernel alone which again signals


that the large number of features introduced by the Text Kernel causes overfitting. These

results do not support Hypothesis 4, which states that the text methods will do at least as

good a job predicting financial events as the quantitative methods. The fraud 10K

ontology was created using MDNAs from the bankruptcy and the fraud datasets. The

possibility exists that the MDNAs from the bankruptcy dataset added noise to the fraud

ontology. Another possibility is that the better ontology is based on the MDNAs of all

publicly traded companies. The small samples used in these experiments may not give

enough information to create the true ontology.

An interpretation of the results is that more information regarding fraud is given in

the quantitative financial values than the text or that the ontology is not strong enough for

the task. The patterns in the financial ratios of fraudulent companies are different than

those of nonfraudulent companies. The word patterns are not as strong or are not

detectable by the ontology created in this work.

An experiment was run using the Financial Kernel to test for per-class error. The

results of the cross-validation show that the Financial Kernel correctly classified

fraudulent companies 95.1% of the time and nonfraudulent companies 93.4% of the time.

It is clearly more important to err on the side of detection, and that is what the Financial

Kernel did.
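
Per-class accuracy of the kind reported above can be computed from the true and predicted labels. The following is a minimal sketch with a hypothetical function name and toy labels; it does not reproduce the experiment.

```python
def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that class predicted correctly}."""
    totals, hits = {}, {}
    for truth, guess in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == guess)
    return {label: hits[label] / totals[label] for label in totals}
```

Reporting the two classes separately shows whether errors fall mainly on missed detections or on false alarms, which overall accuracy alone hides.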

For comparison purposes, ratios of the attributes used in the fraud dataset were fed

into a Linear Discriminant Analysis (LDA) function and a Logit function. Ratios of all

features of the second-year data were used. The ratios for this experiment are one-way.

That is, x/y but not y/x. The LDA, using ratios of the attributes, predicted fraud with 65%

accuracy. The 10-fold cross validation results for Logit were 54.10%. These results


validate the power of the Financial Kernel together with SVM. The Financial Kernel

gives two years of ratios which are two-way, that is, x/y and y/x, as well as year-over-year

changes. In the case of a complex dataset, such as fraud, the additional ratios were

necessary for correct classification.
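
The difference between one-way and two-way ratios can be made concrete with a short sketch. It is illustrative only: the function names and attribute names are hypothetical, the year-over-year change is written as the current value divided by the prior value (an assumption), and the explicit feature lists below are not the Financial Kernel itself, which is defined earlier in this work. Zero-valued attributes are not handled.

```python
from itertools import combinations, permutations

def one_way_ratios(attrs):
    """x/y for each unordered pair of attributes (x/y but not y/x)."""
    names = sorted(attrs)
    return {f"{a}/{b}": attrs[a] / attrs[b] for a, b in combinations(names, 2)}

def two_way_ratios(attrs):
    """Both x/y and y/x for every pair of distinct attributes."""
    names = sorted(attrs)
    return {f"{a}/{b}": attrs[a] / attrs[b] for a, b in permutations(names, 2)}

def year_over_year(current, prior):
    """Assumed form of the change: current-year value over prior-year value."""
    return {name: current[name] / prior[name] for name in current}
```

With n attributes the one-way set has n(n-1)/2 ratios and the two-way set has n(n-1), so the two-way features for two years plus the year-over-year changes give the classifier a much larger feature space than the LDA and Logit inputs.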

9.3 Bankruptcy Results

The results section consists of experiments using the Financial Kernel, the Text

Kernel, and the Combination Kernel. Text Kernels of various dimensions were tested.

One set of text kernels is based on the GAAP text and the other is based on MDNAs. For

the bankruptcy experiments, the MDNAs in the bankruptcy dataset are used to create the

ontology. Tables 15 – 27 illustrate the results.

Table 15 – Bankruptcy Prediction Results using Financial Kernel

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        119   76.28%   1        76.92%             64.74%
2        84    53.85%   100      87.82%             64.10%
3        81    51.92%   1000     94.23%             56.41%
4        74    47.44%   10000    94.87%             58.97%
5        72    46.15%   100000   96.15%             62.18%

Table 16 – Bankruptcy Prediction Results using Text Kernel, 300 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        96    61.54%   1        95.51%             50.00%
2        82    52.56%   100      100.00%            52.56%
3        82    52.56%   1000     100.00%            52.56%
4        82    52.56%   10000    100.00%            52.56%

Table 17 – Bankruptcy Prediction Results using Comb. Kernel, 300 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        84    53.85%   1        100.00%            62.18%
2        84    53.85%   100      100.00%            62.18%
3        84    53.85%   1000     100.00%            62.18%
4        84    53.85%   10000    100.00%            62.18%


Table 18 – Bankruptcy Prediction Results using Text Kernel, 60 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        114   73.08%   1        75.64%             50.46%
2        101   64.74%   100      79.49%             51.92%
3        101   64.74%   1000     79.49%             52.56%
4        102   65.38%   10000    79.49%             52.56%

Table 19 – Bankruptcy Prediction Results using Combination Kernel, 60 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        106   67.95%   1        96.79%             55.13%
2        87    55.77%   100      100.00%            58.33%
3        87    55.77%   1000     100.00%            58.33%
4        87    55.77%   10000    100.00%            58.33%

Table 20 – Bankruptcy Prediction Results using Text Kernel, 10 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        145   92.95%   1        58.33%             39.74%
2        144   92.31%   500      58.33%
3                       1000
4                       10000

Table 21 – Bankruptcy Prediction Results using Combination Kernel, 10 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        121   77.56%   1        75.00%             58.97%
2        102   65.38%   10       78.21%             62.18%
3        85    54.49%   100      78.21%             64.74%
4        82    52.56%   1000     83.33%             62.82%

Table 22 – Bankruptcy Prediction Results using Text Kernel, 100 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        82    52.56%   1        95.51%             65.81%
2        69    44.23%   5        97.44%             67.10%
3        61    39.10%   500      99.36%             63.87%
4        61    39.10%   1000     99.36%             63.87%
5        61    39.10%   10000    99.36%             63.87%


Table 23 – Bankruptcy Prediction Results using Combination Kernel, 100 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        96    61.54%   1        92.31%             66.03%
2        73    46.79%   0.5      100.00%            67.31%
3        76    48.72%   100      100.00%            60.26%
4        76    48.72%   1000     100.00%            60.26%
5        76    48.72%   10000    100.00%            60.26%

Table 24 – Bankruptcy Prediction Results using Text Kernel, 50 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        83    53.21%   1        84.62%             66.03%
2        76    48.72%   100      87.82%             62.82%
3        76    48.72%   1000     87.82%             62.82%
4        76    48.72%   10000    87.82%             62.82%

Table 25 – Bankruptcy Prediction Results using Combination Kernel, 50 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        94    60.26%   1        89.74%             67.95%
2        83    53.21%   50       85.26%             71.15%
3        80    51.28%   100      85.26%             70.51%
4        78    50.00%   1000     87.82%             68.59%
5        77    49.36%   10000    87.82%             69.23%

Table 26 – Bankruptcy Prediction Results using Text Kernel, 25 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        93    59.62%   1        82.69%             71.15%
2        92    58.97%   100      82.69%             67.95%
3        92    58.97%   1000     82.69%             67.95%
4        92    58.97%   10000    82.69%             67.31%


Table 27 – Bankruptcy Prediction Results using Text Kernel combined with Financial Attributes, 25 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        94    60.26%   1        89.74%             67.95%
2        73    46.79%   10       92.31%             67.31%
3        63    40.38%   100      98.72%             70.51%
4        62    39.74%   1000     99.36%             68.59%
5        62    39.74%   10000    100.00%            69.23%

9.4 Discussion of Bankruptcy Results

Tables 16, 17, 19 and 23 show under “Test on Training” that the SVM was able

to perfectly separate the Bankrupt from the Nonbankrupt companies. The 10-fold Cross

Validation results degrade quite a bit, though. An explanation for this is overfitting, as the

number of features often exceeds the number of examples.

The highest prediction accuracy using 10-fold cross validation was 71.15% and

was achieved by the Combination Kernel using a 50 dimension 10K ontology (Table 25)

and the Text Kernel using the 25 dimension 10K ontology (Table 26). The best results

for the Financial Kernel were 64.74% (Table 15). One explanation for these results is

that there is more discriminatory information in the 10K ontology than the features for

the Financial Kernel. The features are a mapping of the attributes and the attributes were

chosen based on the Altman Z Score. Perhaps there are other attributes which, when

mapped to feature space, would provide more discriminatory power. An explanation for

the Combination Kernel’s inability to improve results is that it is likely to be overfitting

due to the small training set size compared to the number of total features. The Text

Kernel based on 10Ks performed markedly better than the Text Kernel based on GAAP.

The best results for the GAAP ontology are shown in Table 16 and are 52.56% which is

significantly less than the 71.15% achieved by the 25 dimension 10K ontology. The


results lend support to Hypothesis 4, which states that the text is as good or better a

discriminator than the quantitative data.

The Text Kernel with 25 dimension ontology was used to get per-class accuracies

of the bankruptcy dataset. The cross-validation results show that classification accuracy

for bankruptcy was 67.9% and nonbankruptcy was 72%. It is more important to correctly

classify bankrupt companies, so this is a weakness of the model.

For comparison purposes, two tests were performed using this dataset with the

Altman ratios as inputs into an LDA. The LDA was tested using the training set and

predicted with 65% accuracy, which is significantly lower than the highest prediction

accuracy achieved using these methodologies. The second test was with Logit. The 10-

fold cross validation results were 66.03%. These results were also lower than the highest

predicted accuracy using this methodology. The LDA and Logit performed slightly better

than the Financial Kernel alone. Given the small dataset and the large number of features

in the Financial Kernel, overfitting is the likely culprit. Another issue could be the

chosen attributes. The Altman ratios did not perform particularly well on the dataset

using LDA, which was Altman’s method. Perhaps the optimal set of features for

bankruptcy prediction has changed since the Altman publication.

9.5 Restatement Results

The Restatement Data is broken into two sets, one which is quantitative data alone,

and has 1,379 cases, and the other which is text and quantitative data and has 800 cases.

The set of experiments listed in Table 28 is related to the Financial Kernel and the

dataset of 1,379 cases.


Table 28 – Restatement (1,379 cases) Prediction Results using Financial Kernel

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        625   45.32%   1        93.40%             92.89%
2        138   10.01%   100      96.76%             95.07%
3        111   8.05%    1000     97.03%             94.49%

The following results consist of experiments using the Financial Kernel, the Text

Kernel, and the Combination Kernel on the dataset of 800 cases. Text Kernels of various

dimensions were tested. One set of text kernels is based on the GAAP text and the other is

based on MDNAs. For the restatements experiments, the MDNAs in the restatements

dataset are used to create the ontology. Tables 29 – 41 illustrate the results.

Table 29 – Restatement Prediction Results using Financial Kernel

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        774   96.75%   1        52.67%             51.02%
2        745   93.13%   100      55.33%             51.02%
3        729   91.13%   1000     57.36%             51.52%
4        693   86.63%   10000    60.46%             53.56%

Table 30 – Restatement Prediction Results using Text Kernel, 300 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        676   84.50%   1        76.14%             55.58%
2        601   75.13%   100      76.14%             54.70%
3        598   74.75%   1000     76.14%             55.33%
4        577   72.13%   10000    75.35%             55.58%

Table 31 – Restatement Prediction Results using Comb. Kernel, 300 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        669   83.63%   1        70.18%             56.73%
2        576   72.00%   100      75.00%             54.44%
3        552   69.00%   1000     76.14%             54.95%
4        508   63.50%   10000    75.25%             55.84%


Table 32 – Restatement Prediction Results using Text Kernel, 60 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        718   89.75%   1        68.27%             54.31%
2        697   87.13%   100      68.27%             52.54%
3        693   86.63%   1000     68.27%             52.92%
4        691   86.38%   10000    66.62%             52.92%

Table 33 – Restatement Prediction Results using Combination Kernel, 60 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        710   88.75%   1        61.55%             54.06%
2        662   82.75%   100      66.50%             55.20%
3        642   80.25%   1000     68.27%             54.94%
4        576   72.00%   10000    66.62%             54.57%

Table 34 – Restatement Prediction Results using Text Kernel, 10 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        755   94.38%   1        64.59%             52.67%
2        746   93.25%   100      64.59%             52.79%
3        746   93.25%   1000     64.59%             52.92%
4        746   93.25%   10000    64.59%             52.54%

Table 35 – Restatement Prediction Results using Combination Kernel, 10 Dim GAAP Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        739   92.38%   1        56.73%             53.55%
2        701   87.63%   100      60.28%             54.70%
3        686   85.75%   1000     62.31%             54.82%
4        647   80.88%   10000    64.59%             54.44%

Table 36 – Restatement Prediction Results using Text Kernel, 150 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        706   88.25%   1        61.29%             54.19%
2        675   84.38%   100      63.20%             54.42%
3        667   83.38%   1000     63.20%             54.19%
4        633   79.13%   10000    64.34%


Table 37 – Restatement Prediction Results using Combination Kernel, 150 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        694   86.75%   1        64.72%             53.05%
2        634   79.25%   100      67.51%             52.67%
3        622   77.75%   1000     69.29%             52.16%
4        553   69.13%   10000    68.65%             55.58%

Table 38 – Restatement Prediction Results using Text Kernel, 50 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        725   90.63%   1        58.63%             54.06%
2        700   87.50%   100      59.90%             54.06%
3        693   86.63%   1000     60.15%             53.93%
4        682   85.25%   10000    60.28%

Table 39 – Restatement Prediction Results using Combination Kernel, 50 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        711   88.88%   1        60.79%             54.44%
2        661   82.63%   100      64.47%             54.44%
3        636   79.50%   1000     67.26%             54.70%
4        564   70.50%   10000    67.89%             55.33%

Table 40 – Restatement Prediction Results using Text Kernel, 25 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        736   92.00%   1        56.98%             52.54%
2        720   90.00%   100      58.76%             52.79%
3        720   90.00%   1000     58.50%             52.16%
4        716   89.50%   10000    58.25%

Table 41 – Restatement Prediction Results using Text Kernel combined with Financial Attributes, 25 Dim 10K Ont.

Test #   SV    SV/l     C        Test on Training   10-fold Cross Validation
1        722   90.25%   1        57.99%             52.92%
2        680   85.00%   100      63.58%             55.46%
3        664   83.00%   1000     64.21%             53.93%
4        630   78.75%   10000    63.96%             55.70%


9.6 Discussion of Restatement Results

The obvious standout results on this dataset are the Financial Kernel results on the

large restatement dataset. The 95% results shown in Table 28 validate that the Financial

Kernel, together with simple attributes, can accurately predict restatements. The results

for the Financial Kernel on the 800 case dataset are shown in Table 29. The best result

was 53.56%, much lower than the 95% results from the larger dataset. It seems likely

that the 579 cases which are in the large dataset but not the small dataset provide the

SVM more separation information. The per-class accuracies were obtained for the results

of the Financial Kernel. Restatements were correctly classified 95% of the time and

nonrestatements were correctly classified 95.1% of the time. The distinction in per-class

accuracy is very minor.

The results on the size 800 dataset are much lower than the results achieved on the

larger dataset. The results in Tables 29 – 41 show that the more complex ontologies

achieve the best results. The 300 dimension GAAP ontology is the largest in this study.

This ontology combined with the Financial Kernel achieved 56.73% accuracy, as seen in

Table 31. Although this is the highest accuracy for this dataset the accuracy does not

compare to the accuracies achieved in the management fraud and bankruptcy datasets and

is not much better than mere chance. The results in Tables 29 – 41 show that

improvement is achieved with increases in the dimension and complexity of the

ontologies. This indicates that overfitting is not as extreme as was the case with the

other, smaller datasets. It also indicates that the additional features are important to the

discovery of a separating function. In Section 10.3 it is suggested that larger, more

complex ontologies should be created as a means to achieve better separation between


restatements and nonrestatements, although then overfitting becomes a potentially more

important issue.

9.7 Support of Hypotheses

Hypothesis 1 states that the Combination Kernel classifies more accurately than the

Text Kernel and Financial Kernel. Hypothesis 1 was supported for the 800 case

restatement dataset, as shown in Tables 29 - 41. The test results show that prediction

results improve as the dimension of the training vector increases. The best results come

from the 390 dimension Combination Kernel as shown in Table 31. As previously

discussed, the restatement dataset was the only set large enough to avoid overfitting the

data when using the Combination Kernel. The fraud, bankruptcy and 1,379 restatement

datasets did not support the hypothesis. Hypothesis 2 states that the Text Kernel

preprocessed with the Accounting Ontology classifies more accurately than the Text

Kernel using basic word counts. This Hypothesis is yet to be tested.

Hypothesis 3 states that the Text Kernel preprocessed with the Accounting

Ontology classifies as accurately or more accurately than quantitative methods, including

the Financial Kernel. Hypothesis 3 was supported by the bankruptcy dataset. The fraud and

restatement datasets did not support Hypothesis 3 as the classification accuracy for text

was much lower than the classification accuracy for the Financial Kernel in both cases.

This Chapter gave the results of the tests on the datasets along with an analysis for

each. Section 9.7 explained how the results supported or refuted the research hypotheses

given in Chapter 8. The next and final Chapter is Chapter 10. In Chapter 10 a summary

of the paper is given along with a conclusion and an explanation of future research.


CHAPTER 10 SUMMARY, CONCLUSION AND FUTURE RESEARCH

This chapter gives a summary of the research (Section 10.1), a conclusion

(Section 10.2) and an explanation of future research (Section 10.3).

10.1 Summary

A methodology was created to combine text and quantitative variables in domains

where the combination of the two can provide insight into underlying structure. The

quantitative variables were mapped into a higher dimensional space via a kernel function

which was constructed using domain knowledge from finance. This is the first, to our

knowledge, domain specific kernel designed for financial problems. Accounting

ontologies were created as a means of finding concepts which were salient in an

accounting context. The methodology was created with help from the literature. The

ultimate process, however, was quite new. Utilizing text as a discriminator for predicting

financial events is unprecedented.

The methodology was tested on fraud, bankruptcy and restatements datasets. The

datasets used for empirical testing were chosen based on their complexity. The

expectation is that complex financial datasets would benefit from using text together with

quantitative attributes. Interest in mechanisms of detecting the likelihood of management

fraud and restatement is strong. Bankruptcy was chosen because it is a well-studied subject

with many models to benchmark against.


10.2 Conclusion

The results attained on management fraud detection were very strong. These

results were achieved by the Financial Kernel using simple quantitative attributes as

inputs. Using the same attributes and some reasonable assumptions about mappings, the

data was tested with LDA and Logit. In both cases, the results were much lower (65/66%

vs. 95%). These results were achieved using only publicly available data. The best

results in past research were obtained by surveying audit partners, who filled out

checklists of both quantitative and qualitative company attributes. A positive feature of

this research is that it can be applied using a computer with internet access, without the

high costs of surveys or personal interactions with top management.

Bankruptcy is a well-studied subject. Successful models of bankruptcy detection

can be found in the literature (see Section 2.2 for details). The results achieved with this

methodology did not match the best of those models. A possible reason is that the wrong

attributes were chosen for the Financial Kernel. However, the text results in bankruptcy

were encouraging. Achieving 71.15% accuracy using a 25-dimension ontology

showed promise for using the text from company reports as attributes in a machine

learning context. An intuition here is that there are clear differences between the text

in the MDNAs of healthy companies and those of bankrupt companies.

The Financial Kernel achieved surprisingly strong results (95% cross-validation accuracy) on the

full (1379-case) restatement dataset. The results were much lower for the Financial

Kernel on the smaller, 800-case dataset. The difference in results might be explained by a

much higher discriminatory power in the cases that appear in the large set and not the

small set. The text results achieved on the restatements dataset were not much better than

chance. The text data as represented in the research is inadequate. The ontologies are not

powerful enough to detect differences between the restatements and nonrestatements.

The interesting result of the restatement tests was the continued improvement in

prediction accuracy with higher dimensional ontologies.

10.3 Future Research

The work on this dissertation opened up many possibilities for future research.

Improvements can be made to the ontology development process. The relationships

between the concepts can be more fully explored. Ontologies for more granular issues

can be hand made. For example, one area of future research which is currently underway

is an analysis of the types of risk listed in the risk statements that are required as part of a

proxy statement. Prior research in this area has shown that the types of risk listed have a

positive correlation with actual future risk. The risk types and their relationships with

other keywords are being handcrafted by a person who has expertise in this area. This

handcrafted ontology can be tested against an automatically created ontology, using the

methodology from this dissertation.

Improving the classification accuracy of the text is an important area of future

research. Currently, the base texts (GAAP and 10Ks) are used to build the ontology.

There is no preprocessing which gives higher weights to the ontology components which

differ the most between datasets. A feature selection mechanism, such as the 1-Norm

SVM, can be used to choose features which are valuable to the separation of the two

classes in the datasets. A technique for tying the ontologies back to the data might be to

use the vector space model on the text vectors. The weights on the vector components

would boost the importance of components which aid in separation and give 0 weight to

components which do not aid in separation.
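
As an illustration of the feature selection idea above, the following is a minimal sketch and not the procedure used in this dissertation. It minimizes the hinge loss with a 1-norm penalty, which is the objective of the 1-Norm SVM, but it uses a simple proximal subgradient loop in place of the linear program normally used, and it omits the bias term. The toy vectors, the penalty `lam`, the step size and all function names are hypothetical.

```python
def soft_threshold(value, amount):
    """Shrink value toward zero; anything within `amount` of zero becomes 0.0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def fit_l1_hinge(rows, labels, lam=0.1, step=0.1, epochs=200):
    """Minimize mean hinge loss + lam * ||w||_1 (no bias term) by proximal subgradient."""
    n, dim = len(rows), len(rows[0])
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for x, y in zip(rows, labels):
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin < 1.0:  # the hinge loss is active for this case
                for j in range(dim):
                    grad[j] -= y * x[j] / n
        # Subgradient step on the hinge term, then shrinkage for the 1-norm penalty.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad)]
    return w


# Hypothetical toy data: component 0 separates the classes, component 1 is close to noise.
rows = [(2.0, 0.2), (1.5, -0.1), (-2.0, 0.1), (-1.5, -0.1)]
labels = [1, 1, -1, -1]
weights = fit_l1_hinge(rows, labels)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(weights, selected)
```

Components whose weight is driven exactly to zero would be dropped from the text vector, and the remaining weights could serve as the component weights in the vector space model.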

The text and quantitative results reported in Chapter 9 varied widely based on

dataset, financial attributes and kernel. In future work, many different base texts will be

tested in ontology creation and many financial attribute combinations as well. The

methodology can be applied to other financial events such as stock price changes after

earnings announcements. Abnormal changes, such as better-than-expected returns or

steep drops, can be analyzed. In the fraud context, the speech transcripts of the executive

team can be analyzed textually. Speech differs from written text in that it allows for more

human judgment, which may betray clues of future problems.

The current analyses on bankruptcy, fraud and restatements can be extended by

thoroughly analyzing the results for underlying causes. The function created by the

support vector machine can be analyzed and the features with the highest weights can be

extracted as potential true features that separate the good from the bad. This is especially

interesting in this research since a feature can be text or financial ratios. The companies

can be ranked based on their margin (i.e., the distance from the separating hyperplane).

Those which are close on either side may be in a gray area, while those that are furthest

from the plane are likely to be prototypes for fraud/nonfraud, bankrupt/nonbankrupt,

restatement/nonrestatement. As a tool for managers and regulatory agencies, a function

which helps determine firms in the gray area for fraud and bankruptcy can be created. It

is possible to determine thresholds based on the output function for fraud and

restatements, like the Altman Z-score does for bankruptcy.
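
The ranking and gray-area ideas above can be sketched as follows. This is a minimal illustration that assumes a linear decision function; for a kernel SVM the decision value would instead come from the kernel expansion over the support vectors. The weight vector, the bias, the firm names, the feature vectors and the `gray_width` threshold are all hypothetical.

```python
import math


def signed_distance(weights, bias, x):
    """Signed distance of x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * xi for w, xi in zip(weights, x)) + bias) / norm


def rank_firms(weights, bias, firms, gray_width):
    """Return (ticker, distance, zone) tuples sorted from most negative to most positive."""
    ranked = []
    for ticker, x in firms.items():
        d = signed_distance(weights, bias, x)
        if abs(d) < gray_width:
            zone = "gray"
        else:
            zone = "positive" if d > 0 else "negative"
        ranked.append((ticker, d, zone))
    return sorted(ranked, key=lambda item: item[1])


# Hypothetical linear decision function and firm feature vectors.
weights, bias = [3.0, 4.0], -1.0
firms = {"AAA": [2.0, 1.0], "BBB": [0.2, 0.2], "CCC": [-1.0, -1.0]}
for ticker, d, zone in rank_firms(weights, bias, firms, gray_width=0.5):
    print(f"{ticker}: distance={d:+.2f} zone={zone}")
```

Firms at the two ends of the sorted list are the candidate prototypes, while those labeled `gray` lie close to the hyperplane on either side. The width of the gray zone would have to be calibrated on held-out data.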

Future work related specifically to solidifying the results on the datasets tested is

as follows. For the fraud dataset, comparative testing against other methods was difficult

because the dataset used in fraud detection was created by this author. Other

research using this particular dataset is not available. In order to compare the research

fully with other quantitative methods, the dataset will be tested using the precise methods

and attributes presented in previous research. Future work should be done to extract the

optimal size of an ontology for bankruptcy. The base text should also be explored, as

there may be a more suitable text with which to build an ontology. The Financial

Kernel failed to predict bankruptcy as accurately as the Text Kernel. Additional work

should be done to test this dataset using the Financial Kernel with combinations of

different quantitative attributes. Future work on the restatement dataset is to build more

complex ontologies for testing. Other quantitative attributes should be tested as well.

Obtaining large datasets for fraudulent and/or bankrupt firms is problematic. The

number of features tends to grow rapidly when using the Financial Kernel and the

Accounting Ontology. As a result, there was a prevalence of overfitting in the datasets

tested in the dissertation. In the context of a larger dataset, such as the restatement set, a

more complex ontology allowed for improvements in prediction. For this reason, future

research should include larger datasets with both text and quantitative attributes possibly

from other areas of interest. For example, a potentially large dataset could be the hourly

stock market data and newswire reports for a set of companies in an industry. After

factoring out the market bias, the goal would be to find the keywords that determine

short-term market moves in a large percentage of cases.
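
One simple way to factor out the market bias is to subtract the market return from each stock return before looking for large moves. The sketch below assumes this market-adjusted form, which implicitly gives every stock a beta of one; a fitted market model would be an alternative. The return series and the threshold are hypothetical.

```python
def excess_returns(stock_returns, market_returns):
    """Subtract the market return from the stock return, period by period."""
    return [s - m for s, m in zip(stock_returns, market_returns)]


def flag_moves(excess, threshold):
    """Indices of periods whose market-adjusted return exceeds the threshold in size."""
    return [i for i, r in enumerate(excess) if abs(r) > threshold]


# Hypothetical hourly returns (in percent) for one stock and for a market index.
stock = [0.5, -2.0, 0.25, 3.0]
market = [0.5, -0.5, 0.0, 0.5]
adjusted = excess_returns(stock, market)
flagged = flag_moves(adjusted, 1.0)
print(adjusted, flagged)
```

The flagged hours would then be paired with the newswire text released shortly before them to form the labeled cases.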

APPENDIX A ONTOLOGIES AND STOPLIST

A.1 Ontologies

A.1.1 GAAP, 300 Dimensions, 100 concepts, 100 2-grams, 100 3-grams

receivable n/describe a/receivable,0.1197,448 n/note a/receivable,0.2089,263 n/receivable n/payable,0.3651,132 derivative v/embed n/derivative,0.1538,353 v/embed a/derivative,0.2095,260 a/derivative n/instrument,0.2569,214 option incur hedge n/flow n/hedge,0.1350,403 a/ineffective a/hedge,0.1538,353 goodwill convertible a/convertible a/preferred,0.1111,486 submit v/submit n/tax,0.3687,131 report a/reportable n/section,0.1260,429 folder a/last n/folder,0.6556,40 n/folder v/add,0.6735,34 receive a/receive a/pay,0.3333,152 cease evaluate resolution segment serve measure n/measure n/plan,0.1072,503 a/fair n/measure n/plan,10614.3481,1 v/carry n/measure,0.1115,485 a/present n/measure,0.3084,168 a/fair n/measure,0.5445,64

v/deduction a/fair n/measure,9755.2099,138 a/fair n/measure v/take,9931.4677,5 v/describe a/fair n/measure,9813.3973,27 a/fair n/measure v/find,9753.4331,164 n/service a/fair n/measure,9753.7499,158 a/fair n/measure n/service,9754.0683,154 n/alternative a/fair n/measure,9756.3290,125 a/fair n/measure n/alternative,9763.4930,84 n/collateral a/fair n/measure,9762.1074,90 n/security a/fair n/measure,9755.3780,136 v/finish a/fair n/measure,9757.7996,117 v/give a/fair n/measure,9753.2997,167 a/fair n/measure n/model,9753.3943,165 a/fair n/measure n/change,9753.4471,163 a/fair n/measure n/sale,9753.5225,161 v/write a/fair n/measure,9753.6607,159 a/end a/fair n/measure,9753.7720,157 n/swap a/fair n/measure,9753.9024,155 n/situation a/fair n/measure,9754.2699,153 n/example a/fair n/measure,9754.4194,151 a/fair n/measure n/loan,9754.4308,149 a/fair n/measure v/quote,9754.4576,147 n/earnings a/fair n/measure,9754.5644,145 n/determining a/fair n/measure,9754.6199,143 n/estimating a/fair n/measure,9754.6199,143 a/fair n/measure a/undelivered,9754.6199,143 a/fair n/measure a/fixed,9754.6199,143 n/section a/fair n/measure,9754.7761,141 r/far a/fair n/measure,9755.0124,139 a/fair n/measure n/interest,9755.2144,137 n/shares a/fair n/measure,9755.5109,135 PRP/I a/fair n/measure,9755.7400,133 v/impair a/fair n/measure,9755.8796,131 a/fair n/measure v/underlie,9755.9758,128 a/current a/fair n/measure,9756.2491,126 a/fair n/measure v/be,9756.4682,124 a/fair n/measure n/loss,9756.9329,122 a/fair n/measure r/therefore,9757.0797,120 a/fair n/measure a/uncommitted,9757.7408,118 a/fair n/measure v/estimate,9757.9922,116 a/fair n/measure v/carry,9758.5970,114 n/year a/fair n/measure,9758.9432,112 a/fair n/measure n/indebtedness,9758.9662,110 a/fair n/measure a/common,9759.6877,107 a/fair n/measure a/long,9759.7499,105 a/fair n/measure v/describe,9760.1308,103 a/fair n/measure n/finish,9760.3855,101

n/comparison a/fair n/measure,9760.5235,99 v/carry a/fair n/measure,9760.8054,97 n/case a/fair n/measure,9761.3039,95 v/be a/fair n/measure,9761.5544,93 n/guarantee a/fair n/measure,9762.1038,91 a/fair n/measure n/amount,9762.2297,89 a/fair n/measure n/building,9762.9635,87 a/fair n/measure a/fair,9763.3867,85 n/benefit a/fair n/measure,9763.5560,83 a/fair n/measure n/investment,9763.6185,81 a/fair n/measure n/deaccessed,9765.1990,79 a/fair n/measure n/stock,9765.7866,77 n/land a/fair n/measure,9766.6677,75 a/fair n/measure v/require,9767.0052,73 v/use a/fair n/measure,9767.4356,71 v/make a/fair n/measure,9767.9011,69 n/payment a/fair n/measure,9768.7069,67 n/company a/fair n/measure,9769.9191,65 n/accounting a/fair n/measure,9770.8518,63 n/compare a/fair n/measure,9772.2532,61 a/fair n/measure n/debt,9772.9091,59 a/fair n/measure n/computing,9773.3937,57 n/computing a/fair n/measure,9774.1220,55 n/gain a/fair n/measure,9775.2631,53 n/note a/fair n/measure,9775.8582,51 a/fair n/measure n/accounting,9777.7385,49 n/measure a/fair n/measure,9780.6804,47 v/measure a/fair n/measure,9783.0919,45 a/fair n/measure n/grace,9785.1442,43 n/determination a/fair n/measure,9788.4779,41 n/statement a/fair n/measure,9789.0426,39 a/fair n/measure n/cash,9790.6759,37 v/record a/fair n/measure,9793.4155,35 a/total a/fair n/measure,9803.2856,32 a/fair n/measure n/item,9803.9466,30 a/fair n/measure n/5,9809.7595,28 v/compare a/fair n/measure,9814.0377,26 v/exceed a/fair n/measure,9817.1247,24 v/imply a/fair n/measure,9827.0485,22 n/estimate a/fair n/measure,9830.4516,20 r/less a/fair n/measure,9860.9167,18 a/fair n/measure n/collateral,9867.5056,16 a/relative a/fair n/measure,9871.5648,14 v/determine a/fair n/measure,9879.2428,12 n/percent a/fair n/measure,9884.1552,10 a/fair n/measure n/future,9907.9137,8 n/change a/fair n/measure,9918.8661,6

a/fair n/measure n/asset,9936.6050,4 a/fair n/measure n/hedge,10023.6594,2 a/fair n/measure n/derivative,9755.9489,129 a/fair n/measure a/derivative,9802.5892,33 a/receivable a/fair n/measure,9759.1733,108 withdrawal insurance n/insurance n/contract,0.1436,377 capitalization liability n/litigation a/liability,0.1250,430 finish v/finish n/inventory,0.1126,480 v/finish v/december,0.1806,306 journal n/journal n/entry,0.6516,42 liquidation amortize a/unamortized n/deduction,0.1730,315 warrant security a/favorite n/security,0.1781,309 n/equity n/security,0.1846,302 n/security a/sec,0.3160,161 package remit get curtailment n/closure n/curtailment,0.2094,261 calculation collateral v/support n/collateral,0.1553,351 n/collateral a/dependent,0.2000,278 apportion calculate alternative n/alternative n/guarantee,0.1060,508 n/stock n/alternative,0.1074,502 understanding long-term rental a/rental n/payment,0.1271,424 a/minimum a/rental,0.1288,420 a/contingent n/rental,0.1495,366 obtain closure a/partial n/closure,0.1438,376

equity v/stockholder n/equity,0.1541,352 n/stockholder n/equity,0.1963,283 warranty express v/walter v/express,0.1250,430 n/link n/express,0.6600,39 n/express a/favorite,0.9900,3 postpone run n/run v/art,0.1961,284 v/run n/section,0.2744,197 revenue weighted a/weighted a/average,0.5259,71 engage service v/service n/cost,0.1330,408 n/customer n/service,0.2538,216 n/service a/835,0.2955,177 r/prior v/service,0.4715,92 year-end taxation n/taxation n/recognition,0.1419,380 a/taxation n/profit,0.1809,304 v/taxation a/pretax,0.2000,278 section n/operating n/section,0.1051,511 r/later v/section,0.1127,479 creditor indebtedness n/asset n/indebtedness,0.1380,393 module n/module PRP/s,0.6911,31 n/accountant n/module,0.8319,17 amortization n/subject n/amortization,0.1317,414 intangible a/intangible n/asset,0.1103,489 guidebook rent selection n/site n/selection,0.1081,497 lease a/lease a/tax,0.1053,510 find account

a/sec n/accountant,0.7645,25 v/proceed a/account,0.2857,188 sec a/sec v/sab,0.1404,386 hire incremental a/incremental n/shares,0.1209,442 a/incremental n/borrowing,0.3019,172 describe MD/should v/describe,0.1435,378 gross table settlement v/force n/settlement,0.1818,303 computation lessor n/lessor a/implicit,0.1235,435 n/5b a/lessor,0.1538,353 n/5a a/lessor,0.1538,353 lessee computing n/defer n/computing,0.1593,346 damage n/grace n/damage,0.1081,497 v/test n/damage,0.1722,317 contingent defer v/defer n/counterparty,0.1053,510 n/defer v/show,0.1204,446 n/stone v/defer,0.1290,419 r/indefinitely v/defer,0.1429,379 impairment guarantee v/guarantee a/residual,0.1887,292 grace v/cure v/grace,0.1818,303 allocate payment gaap n/gaap n/guide,0.4800,87 n/guide n/gaap,0.4824,84 n/master n/gaap,0.6326,44 take n/take n/term,0.1128,478 n/take n/agreement,0.1148,467 a/take n/throughput,0.1176,457 n/capital n/take,0.2313,240

v/take n/property,0.2804,191 agreement allocation n/allocation n/process,0.1142,471 discount ending deduction n/deduction n/premium,0.1876,295 growth increase v/increase v/decrease,0.1161,463 n/increase n/decrease,0.1921,289 unearned subsidiary a/majority n/subsidiary,0.1317,414 depreciation a/straight n/depreciation,0.1316,415 v/accumulate n/depreciation,0.1523,357 guide a/equities n/guide,0.1145,469 n/industry n/guide,0.1684,325 n/guide a/equities,0.1938,287 n/guide a/specialized,0.2097,258 long-run debtor software n/software n/product,0.1176,457 n/computer n/software,0.4370,107 transaction

A.1.2 GAAP, 60 Dimensions, 40 concepts, 10 2-grams, 10 3-grams

receivable n/receivable n/payable,0.3651,132 derivative a/fair n/value a/derivative,9827.7976,33 option report serve withdrawal liability security n/security a/sec,0.3160,161 package calculation apportion calculate

alternative a/fair n/value n/alternative,9788.4971,83 understanding equity run weighted a/weighted a/average,0.5259,71 engage service r/prior v/service,0.4715,92 creditor indebtedness a/fair n/value n/indebtedness,9785.6007,96 module n/accountant n/module,0.8319,17 amortization intangible guidebook rent selection lease account hire describe a/fair n/value v/describe,9784.7731,104 v/describe a/fair n/value,9838.5665,27 computation computing a/fair n/value n/computing,9798.7456,56 n/computing a/fair n/value,9799.5029,54 allocate payment n/payment a/fair n/value,9794.0008,67 gaap n/gaap n/guide,0.4800,87 n/guide n/gaap,0.4824,84 n/master n/gaap,0.6326,44 take a/fair n/value v/take,9956.6650,5 agreement ending a/fair n/value n/ending,9784.7560,105 guide software n/computer n/software,0.4370,107

A.1.3 GAAP, 10 Dimensions, 10 concepts

receivable option report liability run account describe payment ending calculate

A.1.4 10K, Bankruptcy, 100 Dimensions

liquidity receivable n/describe a/receivable,0.5066,88 wear adversely r/adversely v/impact,0.2025,301 r/materially r/adversely,0.3049,187 r/adversely v/affect,0.4801,93 title housing n/student n/housing,0.6667,46 modern write-off variation forward-looking indebtedness n/indebtedness n/year v/end,9205.2412,97 order n/order n/year v/end,9205.5218,94 option convertible report write-down fund selection a/common n/stock n/selection,6910.4783,211 v/selection n/year v/end,9202.6130,115 subordinate a/exchangeable v/subordinate,0.2507,247 v/subordinate n/debenture,0.3138,182 account fluctuation

describe v/end v/december v/describe,6783.5351,340 v/end v/december n/describe,6787.2828,286 v/describe n/year v/end,9206.0437,88 sale n/sale a/common n/stock,6597.9698,445 a/common n/stock n/sale,6601.1560,431 v/end v/december n/sale,6788.2856,272 n/sale n/year v/end,9309.7086,20 n/net v/sale,0.2105,289 a/net n/sale,0.2910,201 n/percentage a/net n/sale,7012.7677,207 impact r/negatively v/impact,0.3333,165 legend v/end v/december n/legend,6784.9553,321 n/legend n/year v/end,9453.2650,13 n/table n/legend,0.4671,100 v/section v/legend,0.5000,89 debenture n/debenture a/common n/stock,6608.1459,401 a/exchangeable n/debenture,0.2780,216 exchangeable a/common n/stock a/exchangeable,6595.3678,474 a/exchangeable a/common n/stock,6883.0620,214 rotate v/rotate n/fleet,0.3077,185 caption package mix clear revolve v/revolve n/credit,0.2811,210 property n/property n/plant,0.2085,294 a/intellectual n/property,0.2727,222 calculate subsidiary alternative funding merchandising advantage n/take n/advantage,0.2051,299 due a/common n/stock a/due,6595.8806,467 v/end v/december a/due,6786.1335,298 income

v/end v/december n/income,6792.3332,248 n/income n/year v/end,9250.5221,32 n/income n/tax,0.3764,142 covenant liquid n/liquid n/year v/end,9199.4622,178 n/liquid n/capital,0.2007,304 a/liquid a/nutritional,0.2105,289 a/investible a/liquid,0.2105,289 software mixture n/mixture n/year v/end,9197.7610,197 innovative a/innovative n/statement,0.3274,170 run v/end v/december v/run,6789.0392,263 v/run n/capital,0.2536,245 choice financing a/common n/stock n/financing,6595.1399,478 modify

A.1.5 10K, Bankruptcy, 50 Dimensions, 50 Concepts

liquidity receivable wear adversely title housing modern write-off variation forward-looking indebtedness order option convertible report write-down fund selection subordinate account fluctuation describe sale

impact legend debenture exchangeable rotate caption package mix clear revolve property calculate subsidiary alternative funding merchandising advantage due income covenant liquid software mixture innovative run choice financing modify

A.1.6 10K, Bankruptcy, 25 Dimensions, 25 concepts

liquidity receivable wear adversely title modern forward-looking order convertible roll impact debenture exchangeable rotate package clear

revolve property due income covenant liquid software innovative run modify

A.1.7 10K, Fraud, 150 Dimensions, 50 concepts, 50 2-grams, 50 3-grams

receivable n/receivable n/year v/end,22793.5777,323 n/account a/receivable,0.5381,89 wear a/perfect n/wear,0.1250,631 proceeding a/legal n/proceeding,0.1923,405 pick option a/black a/option,0.2222,342 furnish v/end v/december v/furnish,22028.3176,418 v/furnish n/year v/end,22808.1907,123 n/furnish n/chain,0.1645,481 convertible shareowner consolidated n/c n/c a/consolidated,16521.5030,725 v/end v/december a/consolidated,22020.1943,535 n/note a/consolidated,0.2302,325 a/consolidated a/financial,0.2911,250 a/consolidated a/financial n/statement,16554.2831,672 transportation n/c n/c n/transportation,16517.5027,799 roll n/roll a/roll,0.1429,561 a/roll n/roll,0.1429,561 a/roll a/flatracks,0.1818,426 n/roll n/lottoworld,0.5000,102 subtitle impact r/adversely v/impact,0.1547,518 r/negatively v/impact,0.3562,197

legend security n/security v/exchange,0.1674,468 package contract v/contract n/year v/end,22792.7000,341 n/contract n/year v/end,22848.2800,48 r/forward v/contract,0.2640,280 land v/end v/december n/land,22026.2369,428 n/land n/year v/end,22818.6300,88 a/land n/radio,0.1250,631 v/unit n/land,0.4816,112 revolve n/c n/c v/revolve,16517.5492,798 v/revolve n/credit,0.2125,360 alternative n/alternative a/common n/stock,14689.9426,901 a/common n/stock n/alternative,15957.3718,809 v/alternative n/year v/end,22795.6095,257 advantage liquid v/end v/december n/liquid,22017.9340,646 n/liquid n/year v/end,22798.4667,204 n/liquid n/capital,0.1739,449 opening run a/common n/stock v/run,14686.9042,919 n/c n/c v/run,16522.7855,713 v/end v/december v/run,22025.2302,438 v/run n/capital,0.1861,418 n/sort a/run,1.0000,1 offset v/end v/december v/offset,22025.6727,436 r/partially v/offset,0.5690,79 choice revenue n/revenue n/services,0.1333,598 n/services n/revenue,0.1333,598 n/licenses n/revenue,0.1667,472 n/revenue n/licenses,0.1667,472 liquidity taxation a/common n/stock n/taxation,14705.0936,861 n/c n/c a/taxation,16571.2286,670 n/c n/c n/taxation,16711.7954,667 v/end v/december a/taxation,22022.2295,471

v/end v/december n/taxation,22055.5955,376 n/year v/end a/taxation,22805.6324,131 n/taxation n/year v/end,22956.1653,31 a/total n/taxation,0.1271,623 a/taxation n/margin,0.4467,135 a/taxation n/profit,0.5455,86 adversely v/end v/december r/adversely,22019.5484,565 r/materially r/adversely,0.2655,278 r/adversely v/affect,0.4768,115 title order n/order n/year v/end,22809.2141,116 n/interest n/order,0.2461,299 shareholder vantage n/take n/vantage,0.2500,292 county n/martin n/county,0.1231,640 selection account n/c n/c n/account,16528.6923,689 v/end v/december v/account,22017.7061,653 v/end v/december n/account,22019.0151,587 n/account n/year v/end,22810.5497,111 a/doubtful n/account,0.1553,515 n/account a/payable,0.1715,454 gross stockholder n/stockholder n/year v/end,22799.4411,185 n/stockholder n/equity,0.1479,543 exchangeable a/common n/stock a/exchangeable,14699.7800,872 a/exchangeable a/common n/stock,15327.4995,811 a/exchangeable n/debenture,0.1669,471 a/exchangeable a/preferred,0.1857,420 a/exchangeable v/subordinate,0.2746,265 associate n/castle n/associate,0.2412,312 caption v/end v/december n/caption,22019.7923,551 n/caption n/year v/end,23758.4579,10 v/section v/caption,0.2500,292 v/caption r/hereby,0.2857,257 n/table n/caption,0.4600,125 payment n/payment v/end v/december,22018.9151,592

v/end v/december n/payment,22021.8242,480 n/payment n/year v/end,22808.3087,122 alteration clear property n/property n/year v/end,22804.5746,139 n/property n/plant,0.1583,501 a/intellectual n/property,0.1977,384 a/property n/insurer,0.2500,292 subsidiary n/subsidiary n/year v/end,22799.9517,180 due n/c n/c a/due,16532.7101,683 v/end v/december n/due,22019.1944,580 v/end v/december a/due,22020.4701,528 v/end v/december r/due,22020.7710,511 n/year v/end r/due,22800.0229,179 r/due a/higher,0.1344,593 r/primarily a/due,0.1999,381 software v/end v/december n/software,22023.5838,459 r/retail v/software,0.1905,409 modify

A.1.8 10K, Fraud, 50 Dimensions, 50 concepts

receivable wear proceeding pick option furnish convertible shareowner consolidated transportation roll subtitle impact legend security package contract land revolve alternative advantage

liquid opening run offset choice revenue liquidity taxation adversely title order shareholder vantage county selection account gross stockholder exchangeable associate caption payment alteration clear property subsidiary due software modify

A.1.9 10K, Fraud, 25 Dimensions, 25 concepts

receivable wear proceeding furnish security package contract land advantage run revenue liquidity taxation title order

vantage county gross associate payment alteration clear property software modify
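
The ontology listings above interleave bare concept words with n-gram entries of the form `tag/word tag/word,number,number`. The following minimal sketch shows one way such a listing could be read back into a data structure. It rests on assumptions that this appendix does not state: that each n-gram belongs to the concept word preceding it, that the prefix before the slash is a part-of-speech tag, and that the two numbers are a weight and a rank. The function name is hypothetical.

```python
def parse_ontology(text):
    """Group flattened ontology tokens into {concept: [(terms, weight, rank), ...]}."""
    ontology, concept, pending = {}, None, []
    for token in text.split():
        if "/" not in token:  # bare word: start of a new concept
            concept = token
            ontology.setdefault(concept, [])
            continue
        if "," in token:  # the last term of an n-gram carries the two numbers
            term, weight, rank = token.split(",")
            pending.append(tuple(term.split("/", 1)))
            # Entries seen before any concept word are stored under the key None.
            ontology.setdefault(concept, []).append((pending, float(weight), int(rank)))
            pending = []
        else:
            pending.append(tuple(token.split("/", 1)))
    return ontology


# Sample copied from the start of listing A.1.1.
sample = (
    "receivable n/describe a/receivable,0.1197,448 n/note a/receivable,0.2089,263 "
    "n/receivable n/payable,0.3651,132 derivative v/embed n/derivative,0.1538,353 option"
)
parsed = parse_ontology(sample)
for name, entries in parsed.items():
    print(name, entries)
```

A concept with no n-grams, such as `option` in the sample, maps to an empty list.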

A.2 Stoplist

a aboard about above across after against all along alongside although amid amidst among amongst an and another anti any anybody anyone anything around as astride at aught bar barring because before behind below beneath beside besides between beyond both but by circa concerning considering despite

down during each either enough everybody everyone except excepting excluding few fewer following for from he her hers herself him himself his hisself i idem if ilk in including inside into it its itself like many me mine minus more most myself naught near neither nobody

none nor nothing notwithstanding of off on oneself onto opposite or other otherwise our ourself ourselves outside over own past pending per plus regarding round save self several she since so some somebody someone something somewhat such suchlike sundry than that the thee theirs them themselves

there they thine this thou though through throughout thyself till to tother toward towards twain under underneath unless unlike until up upon us various versus via vis-a-vis we what whatall whatever whatsoever when whereas wherewith wherewithal which whichever whichsoever while who whoever whom whomever whomso whomsoever

whose whosoever with within without worth ye yet yon yonder you you-all yours yourself yourselves be is are were sfas sfa 20x1 3 500 0 100 400 20x4 6 d 2 21 109 22 8 4 025 920 525 607 150 10 983 145 070 600

570 419 40 911 250 750 40 425 45 000 x9 3 2 800 a b c d e f g h i j k l m n o p q r s t u v w x y z 2003sec mrpa 0000 sop inc 1977
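
A stoplist such as the one above is typically applied by discarding the listed tokens before concepts and n-grams are counted. The following minimal sketch assumes lowercase, whitespace-delimited tokens and uses only a small excerpt of the list; the function name and the sample phrase are hypothetical.

```python
# Small excerpt of the stoplist above; the full list would be loaded the same way.
STOPLIST = {"a", "about", "after", "all", "and", "for", "in", "of", "the", "to", "with"}


def remove_stopwords(text, stoplist=STOPLIST):
    """Lowercase the text, split on whitespace, and drop tokens found in the stoplist."""
    return [token for token in text.lower().split() if token not in stoplist]


kept = remove_stopwords("The fair value of the derivative")
print(kept)
```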

APPENDIX B QUANTITATIVE AND TEXT DATA

B.1 Quantitative Data

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
1 NJR 4924 1993 -1 454.746 26.072 44.373 738.662 0.001 65.505 797.347 12.993 -11.807 96.377 250.163 525.114
2 PNT. 4924 1993 1 719.486 206.688 37.771 30.243 719.486 0.001 735.17 68.465 35.791 69.17 172.012 527.783
3 BOL 2834 1994 -1 1850.552 271.99 312.781 2457.731 140.065 84.807 2550.066 985.217 70.87 342.705 929.3 1620.766
4 DNA 2834 1994 1 1745.124 752.642 146.267 103.2 1745.124 60.989 2010.995 318.022 811.97 209.672 1602.047 408.948
5 3SSFT 7372 1994 -1 27.909 10.199 4.975 48.332 29.598 1.235 23.912 -45.565 -3.091 -5.757 4.11 15.802
6 LEAF 7372 1994 1 50.793 87.856 22.766 0.001 50.793 4.406 48.916 -57.188 1.114 6.57 15.327 33.497
7 3CECN 1531 1995 -1 2.463 0.544 0.181 11.116 1.748 0.001 7.662 -2.426 2.474 1.065 7.033 0.618
8 3DOVRA 1531 1995 1 28.12 4.615 0.001 24.008 28.12 1.173 26.725 0.884 42.862 -0.351 19.096 7.629
9 3DNKY 2330 1995 -1 210.27 64.379 43.072 161.647 0.001 0.962 139.433 8.794 16.917 -2.979 55.278 84.155
10 3CYDS 2330 1995 1 117.145 540.063 35.117 29.999 117.145 2.803 47.142 -94.083 18.983 2.321 26.959 20.183
11 PRST 3555 1995 -1 27.611 7.888 5.862 26.669 0.586 2.388 68.823 8.101 29.384 12.371 57.443 11.38
12 TRDT 3555 1995 1 23.133 17.33 2.929 1.049 23.133 14.911 38.331 -4.167 21.056 10.459 35.962 2.369
13 3CREGE 3651 1995 -1 88.676 13.19 14.557 32.53 1.722 0.15 39.169 -5.187 3.609 -3.385 3.917 35.252
14 PKAU 3651 1995 1 30.191 52.171 15.661 7.928 30.191 1.017 24.997 15.762 13.568 3.905 16.544 8.453
15 MWHS 5961 1995 -1 1308.009 116.399 114.395 429.664 0.527 11.631 607.842 112.642 271.53 77.594 384.168 223.674
16 3HOSN 5961 1995 1 436.295 1018.625 23.634 101.564 436.295 31.365 446.499 69.56 23.073 55.945 206.443 240.056
17 JDN 6798 1995 -1 31.487 2.628 0.001 295.868 0.542 0.001 371.986 -7.089 55.156 -118.931 226.539 145.447
18 MGI 6798 1995 1 274.651 45.389 3.354 0.001 274.651 7.158 339.664 15.687 55.156 3.319 194.435 145.229
19 IBM 7370 1995 -1 71940 23402 6323 80292 19594 4744 81132 13758 6695 12707 21375 59504
20 LU 7370 1995 1 19722 21413 5354 3222 19722 2183 22626 191 2068 1186 2686 19940
21 3ITEX 7389 1995 -1 23.631 1.117 0.001 15.578 5.733 0.227 23.406 3.958 0.261 2.467 20.383 3.023
22 MTY 7389 1995 1 16.608 27.672 4.444 2.422 16.608 5.368 22.191 -7.796 5.783 2.983 13.576 8.615
23 KMB 2621 1996 -1 13149.1 1660.9 1348.3 11845.7 680.1 883.7 11266 3918.3 -217.3 2512.1 4125.3 7140.7
24 CHA.2 2621 1996 1 9819.992 5880.443 579.393 458.043 9819.992 485.933 9110.598 2115.2 428.5 708.1 3210 5900.6
25 IG 2834 1996 -1 35.302 9.709 9.357 34.794 4.067 0.913 34.044 -8.649 -4.469 0.874 8.328 25.716
26 EPMN 2834 1996 1 34.29 10.892 1.156 0.001 34.29 1.325 28.147 -107.218 17.779 -14.38 24.392 3.755
27 SRM 3669 1996 -1 994.6 295.9 157.8 1630.3 639.2 86.6 1643.6 42.4 280.8 117.2 772.9 870.7
28 PRY.A 3669 1996 1 839.093 1111.575 208.182 203.254 839.093 12.084 971.45 441.62 244.039 138.998 487.134 484.316
29 THO 3790 1996 -1 602.078 49.774 63.494 175.884 5.695 4.722 170.969 113.181 79.159 31.764 119.82 51.149
30 PMSI 3790 1996 1 197.753 70.638 18.188 0.001 197.753 1.948 225.826 7.821 10.159 60.355 92.064 133.762
31 MTST 3825 1996 -1 50.442 17.835 6.163 43.313 1.209 1.056 41.94 -0.186 25.641 1.959 32.094 9.846
32 DAIO 3825 1996 1 39.319 60.423 10.27 8.26 39.319 3.115 57.736 18.202 33.226 3.545 34.614 23.122
33 SMD.1 3842 1996 -1 667.13 146.771 80.937 620.416 7.002 19.041 610.549 57.737 102.231 59.653 279.42 331.129
34 BMET 3842 1996 1 628.356 580.347 162.135 151.523 628.356 28.677 848.739 572.497 472.733 206.707 667.418 181.321
35 CRNS 4700 1996 -1 161.471 82.451 1.485 399.301 14.095 143.301 327.145 7.011 -900.651 35.548 73.713 253.432

Figure 16 – Fraud Dataset

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
36 3CFAC 4700 1996 1 145.887 36.427 120.391 0.001 145.887 3.536 135.149 17.767 54.23 13.158 65.743 69.406
37 3ADIN 5080 1996 -1 24.871 7.906 3.705 17.108 0.247 0.197 24.024 -1.915 4.802 3.108 3.775 20.249
38 VENT 5080 1996 1 18.053 28.398 4.254 3.668 18.053 4.155 23.888 -7.362 4.951 -1.443 9.361 14.527
39 MSFT 7372 1996 -1 8671 639 0.001 10093 253 494 14387 5288 6763 5483 9797 3610
40 CA 7372 1996 1 6084 4040 1514 0.001 6084 507 6706 2762 379 2351 2481 4225
41 GYMM 7990 1996 -1 5.598 1.176 0.001 24.515 0.556 1.432 22.89 -7.401 -3.977 -1.308 17.234 5.656
42 3BLRGZ 7990 1996 1 23.803 16.038 0.454 0.25 23.803 0.037 23.58 9.63 1.709 3.485 10.414 13.166
43 CO 8711 1996 -1 139.604 32.622 20.529 106.918 1.47 2.123 136.275 9.146 61.74 17.276 57.961 78.314
44 FDGT 8711 1996 1 100.393 39.508 65.355 0.001 100.393 6.929 103.136 1.192 59.779 3.859 83.363 19.773
45 PSTI 8741 1996 -1 608.313 193.88 0.001 815.624 448.389 51.135 874.027 -177.949 93.497 53.89 501.781 372.246
46 MGLN 8741 1996 1 1140.137 1345.279 191.201 4.753 1140.137 306.597 895.62 -108.025 287.662 170.919 158.25 737.37
47 CQB 100 1997 -1 2433.726 272.214 349.948 2401.613 215.173 76.248 2509.133 -215.809 308.805 244.687 540.505 1715.153
48 0491B 100 1997 1 2463.895 4336.12 534.844 468.692 2463.895 223.742 2915.053 157.38 365.777 308.306 621.832 2293.221
49 3SOCNQ 2390 1997 -1 1073.09 228.46 304.9 1058.928 2.26 60.544 3405.517 -864.027 488.507 -308.66 260.437 3145.08
50 3WSPT 2390 1997 1 1286.106 1657.511 92.99 340.818 1286.106 66.973 1391.211 -588.045 178.178 327.823 -487.452 1878.663
51 MSC 3470 1997 -1 320.163 56.307 60.892 418.074 0.643 19.108 395.321 104.424 40.559 59.031 148.932 246.389
52 BMMI 3470 1997 1 319.407 312.538 35.452 70.111 319.407 15.231 399.465 85.543 94.971 37.482 133.257 266.208
53 CENL 3577 1997 -1 28.263 3.146 2.309 17.078 0.417 1.265 18.804 -72.709 7.445 2.84 11.696 7.108
54 LINK 3577 1997 1 17.555 19.153 5.684 5.461 17.555 0.566 19.577 -10.029 14.139 1.422 14.665 4.912
55 PCTL 3661 1997 -1 466.425 108.729 44.901 355.051 27.889 21.645 352.994 -32.388 112.322 17.716 190.242 162.752
56 ASPT 3661 1997 1 370.343 390.642 86.896 12.306 370.343 45.808 560.659 156.025 258.177 101.752 298.157 262.502
57 LSR 3812 1997 -1 10.162 3.75 2.799 11.145 0.441 0.454 12.516 1.517 8.434 1.736 11.045 1.471
58 BTHS 3812 1997 1 10.076 17.048 2.17 2.063 10.076 0.234 10.632 7.609 6.003 1.246 8.443 2.189
59 STCO 3825 1997 -1 102.279 15.901 22.707 64.382 0.846 5.404 48.983 14.365 19.701 -3.153 26.487 22.496
60 CRPB 3825 1997 1 69.455 77.11 11.169 8.483 69.455 2.854 63.686 3.318 30.519 13.952 53.474 10.212
61 IMDC 3842 1997 -1 106.728 14.58 23.117 58.842 0.34 5.106 80.707 -53.34 -0.988 16.438 -15.625 93.332
62 ATSI 3842 1997 1 54.386 14.516 4.447 22.686 54.386 0.371 58.431 -15.608 54.23 1.822 55.82 2.611
63 INSG 7372 1997 -1 38.869 6.754 0.185 25.457 0.41 0.753 21.011 -27.471 9.712 -12.899 11.418 9.593
64 GAEX 7372 1997 1 24.263 36.04 5.548 4.999 24.263 1.994 14.114 -56.302 -13.751 -4.116 -9.927 24.041
65 CYLK 3577 1998 -1 42.76 11.503 10.289 94.318 9.05 2.06 81.289 -63.426 42.862 -13.337 61.979 19.31
66 KTCC 3577 1998 1 97.085 170.05 23.103 29.578 97.085 5.968 100.947 13.631 41.79 12.034 51.904 49.043
67 XRX 3577 1998 -1 19449 7891 2498 30024 2903 566 28814 2705 4035 3815 4911 23533
68 SBL 3577 1998 1 977.901 205.416 196.986 838.399 183.928 89.334 1139.29 252.14 216.709 1047.944 239.864 70.041
69 DIGL 3669 1998 -1 24.191 7.152 5.476 27.558 1.06 4.381 39.998 -40.657 12.913 7.399 22.153 17.845
70 AMXC 3669 1998 1 31.509 69.273 9.796 10.99 31.509 0.733 37.126 9.407 14.981 0.172 22.958 14.168

Figure 16 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
71 BSX 3841 1998 -1 2233.576 537.786 461.981 3892.711 705.084 174.039 3572 636 0.001 941 1724 1848
72 BDX 3841 1998 1 3846.038 3116.873 726.558 536.791 3846.038 451.32 4436.958 2356.377 354.403 754.671 1742.281 2668.27
73 BLS 4813 1998 -1 23123 4629 431 39410 1028 5212 43453 11098 -6008 11428 14815 28638
74 GTE 4813 1998 1 25473 4785 668 43615 1687 5609 25336 5058 702 50832 1659 4940
75 OHP 6324 1998 -1 4719.411 146.794 0.001 1637.75 119.379 40.045 1686.888 -390.095 442.693 287.817 98.755 1243.817
76 SIE 6324 1998 1 1045.12 1037.203 145.728 0.001 1045.12 63.792 1130.112 125.144 114.74 51.301 278.412 851.7
77 3PTUS 7372 1998 -1 31.532 4.671 0.001 13.723 0.51 2.739 4.293 -103.079 1.64 -1.001 2.372 1.921
78 GAEX 7372 1998 1 14.114 30.666 4.165 1.986 14.114 1.918 11.166 -55.492 -6.79 0.473 -7.53 18.696
79 EAII 7372 1998 -1 106.976 39.508 0.001 116.78 15.266 10.519 80.564 -39.888 20.265 -24.683 55.191 25.373
80 AMSWA 7372 1998 1 107.358 109.177 21.073 0.001 107.358 17.826 113.047 19.412 23.204 3.23 69.706 43.341
81 TLXN 3578 1999 -1 365.751 75.376 81.923 636.29 8.429 13.159 348.844 -67.441 34.275 -81.611 16.966 331.878
82 HYC 3578 1999 1 276.28 261.515 52.589 57.482 276.28 22.565 369.237 59.923 75.218 -14.743 208.147 161.09
83 3SCEPE 3841 1999 -1 2.717 0.273 0.396 2.106 0.036 0.109 1.18 -16.389 -1.036 -5.685 -1.334 2.514
84 3ANTR 3841 1999 1 2.01 2.699 0.167 0.429 2.01 0.001 2.972 -29.084 -5.103 -3.064 -3.841 6.563
85 3TSRG 4922 1999 -1 1.093 0.127 0.001 5.291 0.265 0.185 4.3 -23.766 -4.55 -2.373 -0.986 5.286
86 APL.1 4922 1999 1 17.743 3.541 0.374 0.001 17.743 0.001 22.092 3.958 1.844 7.62 20.107 1.985
87 RAD 5912 1999 -1 13338.947 299.634 2472.437 9909.847 2133.643 573.287 7913.911 -3121.547 1955.877 95.286 -688.409 8248.889
88 CVS 5912 1999 1 7275.4 18098.3 699.3 3445.5 7275.4 359.5 7949.5 2944.1 1972.5 1600.1 4037.1 3644.9
89 ASFD 5961 1999 -1 39.931 4.527 24.205 177.608 3.287 6.853 56.266 -210.007 23.861 -40.609 43.268 12.998
90 HNV 5961 1999 1 191.419 549.852 29.287 54.816 191.419 13.904 203.019 -471.651 16.835 -40.288 -24.452 155.843
91 BAC 6020 1999 -1 51526 363834 163 632574 23043.566 335.967 642191 38943 1972.5 18279 47556 594563
92 3614B 6020 1999 1 42848 238774 4867 388570 8.429 494 64503 357627 13907 551607 55409 85.66
93 MSTR 7372 1999 -1 151.258 37.586 0.001 203.368 4.214 23.733 259.087 -297.816 42.616 -118.931 -145.538 285.04
94 MNS 7372 1999 1 187.22 149.235 37.995 0.001 187.22 96.146 208.654 5.061 17.355 38.558 39.982 168.672
95 EDSN 8200 1999 -1 132.762 21.747 0.001 106.87 5.311 25.534 251.03 -115.518 61.031 -17.039 182.513 68.517
96 KIDS 8200 1999 1 98.631 115.477 24.692 0.001 98.631 1.003 90.569 4.073 23.225 14.208 54.647 35.922
97 SERO 2836 2000 -1 147.76 17.064 21.186 131.495 0.78 9.56 175.338 57.711 55.156 31.633 152.475 22.863
98 ENZN 2836 2000 1 130.252 17.018 5.442 0.947 130.252 0.427 549.676 -117.604 446.111 3.319 138.989 410.687
99 AVP 2844 2000 -1 5681.7 594.2 610.6 2811.3 499.9 193.5 3192.6 899.9 428.1 911.4 -75.1 3267.7
100 EL 2844 2000 1 3043.3 4366.8 550.2 546.3 3043.3 143.9 3218.8 1122.2 882.2 712.4 1352.1 1506.7
101 BHI 3533 2000 -1 5233.8 1310.4 898.5 6452.7 220 599.2 6676.2 -127.5 1484.8 1076.5 3327.8 3348.4
102 WFT 3533 2000 1 3461.579 1814.261 498.663 443.588 3461.579 1246.967 4296.362 90.846 471.736 596.075 1838.24 2458.122
103 ADELQ 4841 2000 -1 2909.351 251.653 0.001 21499.48 476.801 2208.001 17267.5 -1994.314 -6008 366.099 3721.157 13397.95
104 CHTR 4841 2000 1 23043.566 3249.222 224.147 0.001 23043.566 207.888 24961.824 -2091.135 -1007.815 1765.242 2861.792 22049.466
105 WMI 4953 2000 -1 12492 1575 75 18565 789 1313 19490 909 -597 3034 5392 14098

Figure 16 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
106 AW 4953 2000 1 14513.634 5707.485 823.259 0.001 14513.634 423.955 14347.093 -471.536 -235.049 1931.307 585.779 12592.27
107 3BIGTQ 6798 2000 -1 175.648 18.056 0.001 1469.607 7.813 0.001 1034.333 -689.481 -900.651 596.075 83.798 950.535
108 WRI 6798 2000 1 1517.581 252.245 63.121 0.001 1517.581 17.025 2095.747 -146.977 8.434 209.672 920.809 1174.675
109 3CAWC 7372 2000 -1 1.772 1.237 0.001 8.355 0.07 0.105 1.262 -12.025 -1.73 -0.898 -2.654 3.916
110 3AUGRE 7372 2000 1 8.153 8.323 1.281 0.251 8.153 0.115 5.801 -2.101 -2.632 -1.467 2.101 3.7
111 IINT 7372 2000 -1 145.689 52.857 0.001 140.732 3.329 12.346 137.737 -60.367 42.193 3.973 60.946 76.791
112 NETE 7372 2000 1 138.379 54.036 15.758 0.001 138.379 1.041 206.179 -20.476 92.485 3.854 176.141 30.038
113 LGTO 7372 2000 -1 231.395 47.655 0.001 414.864 5.099 24.998 355.261 -69.992 167.281 -15.679 259.959 95.302
114 0485B 7372 2000 1 399.752 487.707 106.201 83.824 399.752 20.359 315.049 -353.663 -8.97 11.276 -219.049 514.415
115 DJT 7990 2000 -1 1351.372 55.402 12.324 2199.313 61.19 20.742 2219.666 -346.224 -15.578 253.364 89.463 2130.203
116 IGT 7990 2000 1 1623.716 898.404 296.268 146.989 1623.716 335.967 1923.439 1247.226 596.775 314.897 296.113 1627.326
117 WCOM.CM 4813 2001 -1 35179 5308 0.001 103914 3349 7886 98903 2688 -7918 13809 55409 43494
118 SBC 4813 2001 1 96322 45908 9376 0.001 96322 16024 95057 21351 -594 18198 33199 61858
119 CPTH 7370 2001 -1 104.173 26.692 0.001 199.952 55.228 11.189 104.006 -2174.412 19.961 -31.421 -8.554 85.66
120 3ANCPA 7370 2001 1 207.818 306.348 44.755 4.937 207.818 2.325 180.083 -5.148 -7.135 26.794 91.834 88.249
121 PNC 6020 2002 -1 6356 34777 1613 66377 1517.581 335.967 68168 7702 -15.578 2714 6645 61523
122 BK 6020 2002 1 5756 30508 1 77564 8.429 4744 6336 34615 0.001 92397 6645 85.66

Figure 16 Continued

Obs# Ticker Ind Year Label TAyr1 REyr1 WCyr1 EBITyr1 SEyr1 TLyr1 TAyr2 REyr2 WCyr2 EBITyr2 SEyr2 TLyr2
1 CRYYQ 3571 1993 -1 48.599 -202.725 17.718 -39.223 42.652 5.947 26.166 -240.511 -6.702 -31.354 9.095 17.071
2 ETRC 3571 1993 1 22.848 9.889 11.873 3.498 20.273 2.575 25.436 11.098 12.343 3.644 21.491 3.945
3 FPNQ 3661 1993 -1 35.873 -82.148 21.22 -26.356 19.755 16.118 19.651 -109.229 2.603 -24.52 2.463 17.188
4 VTEK 3661 1994 1 17.947 1.04 8.806 2.609 4.057 13.89 27.47 1.498 11.944 2.566 13.81 13.66
5 3PRCAE 2033 1994 -1 10.831 -24.269 6.81 0.71 7.77 3.061 21.861 -2.426 -1.093 -0.729 10.304 11.557
6 ODWA 2033 1994 1 12.072 -0.869 2.516 1.184 8.719 3.353 35.481 0.128 17.918 2.454 28.499 6.982
7 3KNITE 2253 1994 -1 12.587 3.227 4.556 1.299 7.658 4.929 13.198 3.257 4.84 1.093 7.603 5.595
8 MRSA 2253 1994 1 40.71 6.906 25.147 14.389 36.072 4.638 54.009 17.054 35.788 16.049 46.223 7.786
9 LTTO10 2721 1994 -1 1.242 -2.094 0.894 -1.732 0.003 0.239 2.722 -7.79 0.056 -5.545 0.597 1.124
10 IXDP 2721 1994 1 4.655 -8.11 0.068 -1.017 1.675 2.98 16.366 -5.181 5.466 0.8 10.469 5.897
11 3MLLEQ 2741 1994 -1 20.307 -4.921 10.382 -4.48 15.086 5.221 18.766 -9.397 9 -3.832 10.61 6.135
12 AMEP 2741 1994 1 19.331 1.725 1.112 1.54 7.671 11.66 11.24 -1.251 0.778 1.104 4.839 6.401
13 3CFNEE 2741 1994 -1 24.229 -17.323 -23.125 -3.005 -13.271 37.5 21.411 -21.854 -16.537 -3.347 -6.939 28.35
14 TUTR 2741 1994 1 26.931 -5.411 6.788 0.077 15.941 10.99 33.66 -1.721 6.352 6.962 19.502 14.158
15 3PWLUA 2750 1994 -1 7.713 -5.023 0.165 -1.425 6.444 1.269 7.032 -5.651 -0.012 -0.743 5.881 1.151
16 GGIT 2750 1994 1 10.615 -0.862 1.836 2.087 2.803 7.812 24.738 0.37 5.831 3.65 9.99 14.748
17 BISYQ 2870 1994 -1 12.737 -50.976 3.079 -11.384 7.259 5.478 16.357 -128.645 -5.689 -11.48 -2.324 18.681
18 ALCD 2870 1994 1 11.911 -6.423 5.876 2.373 10.329 1.186 13.769 -4.098 8.58 3.257 11.926 1.458
19 3521B 3572 1994 -1 233.915 -34.499 121.022 -3.758 89.63 144.285 180.393 -118.787 65.957 -60.428 7.173 173.22
20 3EXBT 3572 1994 1 242.765 139.686 157.978 62.169 196.907 45.858 250.336 127.251 137.143 27.087 186.366 63.97
21 3MRESE 3621 1994 -1 2.638 -0.989 0.337 -0.262 0.422 2.216 1.773 -1.907 0.067 -0.321 -0.474 2.247
22 3DEWY 3621 1994 1 6.417 -1.456 2.274 0.868 0.876 5.541 5.555 -1.349 1.769 0.621 0.983 4.572
23 3CPTX. 3672 1994 -1 50.476 -56.926 17.182 1.209 -26.942 64.467 47.711 -61.641 18.391 5.381 -31.759 60.191
24 BHE 3672 1994 1 48.333 20.326 30.89 10.667 40.131 8.202 57.037 26.474 37.285 11.309 46.624 10.413
25 BYDSQ 3714 1994 -1 6.326 -0.806 -0.175 1.677 1.881 4.264 11.782 -0.928 2.113 2.374 5.856 5.926
26 BOWE 3714 1994 1 8.478 0 3.127 3.09 3.975 3.57 9.292 1.709 4.296 3.534 5.697 2.662
27 CNMWQ 3812 1994 -1 32.839 26.921 -1.701 -5.508 6.902 25.937 54.196 13.887 8.368 -8.489 24.378 29.818
28 DBAS 3812 1994 1 29.061 19.267 14.199 2.449 24.632 4.429 32.209 20.548 15.998 2.854 26.424 5.785
29 YESS 3944 1994 -1 29.657 -43.563 5.082 -17.783 -43.164 22.889 48.87 -40.927 25.671 7.172 28.584 20.286
30 EDIN 3944 1994 1 28.282 4.184 19.432 6.12 22.828 5.454 28.254 3.94 17.982 0.548 22.584 5.67
31 3UMMF 3960 1994 -1 83.562 -113.108 28.086 -8.431 -34.589 117.701 58.428 -140.003 7.73 -4.652 -61.484 119.462
32 3VITC. 3960 1994 1 50.673 -23.711 23.897 0.573 9.345 41.328 47.951 -25.024 19.9 3.054 8.032 39.919
33 ATREQ 2030 1995 -1 10.786 -29.448 -4.242 -5.763 -2.346 13.132 7.462 -39.81 -6.917 -8.365 -3.921 11.383
34 ARMF 2030 1995 1 9.054 -1.377 5.517 1.926 8.037 1.017 11.926 -0.93 7.131 1.942 10.622 1.304
35 UNCB 2050 1995 -1 11.535 -3.036 0.009 0.808 4.738 6.797 15.883 -6.613 -0.944 -0.782 1.161 14.722

Figure 17 – Bankruptcy Dataset

Obs# Ticker Ind Year Label TAyr1 REyr1 WCyr1 EBITyr1 SEyr1 TLyr1 TAyr2 REyr2 WCyr2 EBITyr2 SEyr2 TLyr2
36 PIFI. 2050 1995 1 12.361 3.623 3.497 0.443 6.438 5.923 9.397 1.679 1.205 -1.592 4.563 4.834
37 SIRNQ 2330 1995 -1 14.738 -13.29 6.12 4.304 10.768 3.97 18.678 -17.701 9.9 -3.214 14.77 3.908
38 JLTA 2330 1995 1 9.855 0.026 4.108 -2.096 5.337 4.518 9.173 0.053 3.738 0.692 5.364 3.809
39 NCCDQ 2340 1995 -1 75.369 31.57 31.672 7.205 38.832 36.537 47.334 18.638 19.905 -7.187 25.9 21.434
40 MSI 2340 1995 1 57.204 8.427 22.648 3.713 8.7 48.504 34.61 3.149 19.406 0.547 3.422 31.188
41 9691B 2810 1995 -1 305.932 50.662 29.269 58.706 51.296 249.865 312.365 50.272 6.832 28.213 50.906 257.282
42 CCC 2810 1995 1 338.001 168.115 84.584 51.891 218.187 119.814 397.251 174.445 68.67 52.801 216.895 180.356
43 3MRCFQ 2820 1995 -1 94.966 5.046 27.969 13.888 25.632 69.334 102.616 9.48 30.479 14.698 30.173 72.443
44 3GLMA 2820 1995 1 12.587 1.933 1.296 1.429 3.428 9.159 12.227 2.015 1.497 1.551 3.529 8.698
45 3QQQQQ 2835 1995 -1 2.736 -8.085 -4.404 -1.605 -1.709 4.445 5.514 -10.483 2.791 -2.059 5.283 0.231
46 ICCC 2835 1995 1 3.234 -5.882 1.849 0.18 1.905 1.329 3.131 -5.948 1.405 0.077 1.877 1.254
47 9524B 2911 1995 -1 472.208 -150.785 5.965 -37.326 93.357 378.851 564.241 -167.45 -407.018 -41.27 81.363 482.878
48 TSO 2911 1995 1 519.153 35.785 77.529 99.932 216.514 302.639 582.587 110.295 99.475 109.217 304.065 278.522
49 LAGR 3021 1995 -1 159.575 -168.72 103.999 -34.888 -40.627 92.456 100.956 -239.062 46.467 -35.048 -110.969 96.452
50 VANS 3021 1995 1 90.461 -23.401 41.404 13.314 72.728 17.733 105.824 -12.84 53.398 19.843 88.282 17.542
51 PHTA 3231 1995 -1 14.193 -3.437 -3.217 0.632 3.234 10.959 19.282 -9.402 0.621 -1.784 15.22 4.062
52 CVTL 3231 1995 1 12.04 -12.626 -4.058 -4.935 -2.204 14.241 21.947 -17.179 3.573 -2.344 6.728 15.219
53 CCSTQ 3312 1995 -1 42.263 4.181 17.924 7.273 10.524 31.739 51.252 1.432 16.733 2.398 9.218 42.034
54 3CIIIE 3312 1995 1 40.612 -2.048 -1.025 1.78 3.579 37.029 43.001 0.402 3.128 5.641 8.739 34.258
55 3NATT 3443 1995 -1 2.07 -6.715 -4.516 -1.39 -3.667 5.737 3.053 -0.53 -1.753 0.013 -0.508 3.561
56 3BSTM 3443 1995 1 15.091 11.427 9.794 1.605 12.208 2.8833 16.178 12.308 8.876 2.182 13.089 3.089
57 NEXR 3571 1995 -1 1.469 -2.261 0.582 -2.259 -2.261 3.73 19.589 -9.771 10.425 -7.477 -9.771 29.36
58 XYBR 3571 1995 1 1.394 -3.534 -0.039 -2.086 -1.093 2.487 8.015 -8.773 6.412 -5.281 6.89 1.125
59 GANDF 3576 1995 -1 79.375 -5.612 29.361 7.427 48.586 30.789 45.159 -52.03 -6.254 -30.22 3.71 41.449
60 MYLX 3576 1995 1 80.458 17.21 63.576 21.094 65.201 15.257 116.586 40.641 97.931 28.941 104.172 12.414
61 3SCRH. 3577 1995 -1 1.461 -50.558 -10.404 -8.33 -10.063 11.524 0.733 -62.505 -11.238 -9.622 -11.016 11.749
62 3MITK 3577 1995 1 2.864 -2.088 0.602 0.218 1.343 1.521 3.762 -0.859 1.884 1.877 2.652 1.11
63 MTTRQ 3713 1995 -1 29.667 3.585 11.214 6.599 13.663 16.004 36.564 6.904 13.508 6.766 17.096 19.468
64 COLL 3713 1995 1 46.881 -11.517 13.452 5.403 8.805 38.076 45.744 -6.505 14.205 9.412 13.891 31.853
65 EXCE 3751 1995 -1 1.883 -1.624 1.551 -1.248 1.729 0.154 10.023 -4.136 9.039 -2.633 9.6 0.392
66 3RSHX 3751 1995 1 26.932 -40.115 2.327 15.834 -39.615 59.19 45.875 12.019 23.722 25.326 31.561 14.314
67 3AWCIQ 3842 1995 -1 78.416 -7.023 12.371 0.4 27.034 51.382 54.405 -36.931 10.972 -2.966 -2.874 57.279
68 MNMD 3842 1995 1 51.643 -1.664 31.809 2.983 42.362 9.281 59.503 3.008 37.209 7.186 49.626 9.877
69 VOXQ 3845 1995 -1 4.98 -15.541 3.822 -5.527 0.742 0.638 7.625 -21.981 6.825 -6.479 7.406 0.219
70 SPSIQ 3845 1995 1 4.644 -40.947 4.231 -1.31 2.922 0.255 3.551 -42.493 2.95 -1.65 2.299 0.393

Figure 17 Continued

Obs# Ticker Ind Year Label TAyr1 REyr1 WCyr1 EBITyr1 SEyr1 TLyr1 TAyr2 REyr2 WCyr2 EBITyr2 SEyr2 TLyr2
71 AMCL 3851 1995 -1 3.842 -5.815 -0.89 -1.639 0.07 3.772 1.951 -9.199 -3.112 -1.204 -3.076 5.027
72 SEYE 3851 1995 1 7.26 1.825 1.859 2.191 1.946 5.314 10.293 2.508 1.782 2.642 2.929 7.364
73 LAZR 3944 1995 -1 2.023 -0.164 -0.284 0.592 0.413 1.61 5.401 -2.91 0.756 -1.448 3.35 2.051
74 3KIDZE 3944 1995 1 3.589 -4.7 -2.654 -1.398 0.413 3.171 1.958 -5.763 -2.363 -0.387 -0.626 2.579
75 TAVI 2011 1996 -1 302.786 65.969 56.178 27.357 77.08 225.706 253.913 15.681 39.987 21.01 27.095 226.818
76 SFD 2011 1996 1 995.254 191.87 164.312 129.713 307.486 687.768 1083.645 245.27 259.188 167.753 361.01 722.635
77 BEANQ 2090 1996 -1 109.303 -80.713 11.39 7.027 65.03 44.273 99.013 -93.484 10.195 2.554 57.76 40.953
78 WFDS 2090 1996 1 80.738 27.561 20.13 16.973 48.73 32.008 95.486 34.67 25.113 18.862 56.416 39.07
79 3SSMKQ 2300 1996 -1 13.673 -8.9 -4.129 -4.49 -2.823 12.396 15.727 -8.385 -0.586 1.576 0.883 10.653
80 INNO 2300 1996 1 9.433 -20.64 -0.61 -1.352 2.275 7.158 9.168 -21.955 -0.179 -1.051 3.791 5.377
81 STAZQ 2300 1996 -1 188.895 21.319 73.183 11.86 103.253 85.642 160.521 -15.618 33.957 -21.224 67.435 93.086
82 GOSHA 2300 1996 1 196.033 137.959 104.641 33.214 138.077 57.956 174.788 113.058 82.762 41.243 113.157 61.631
83 NDRE 2330 1996 -1 17.806 -46.191 -2.208 -0.368 -5.958 23.764 29.379 -0.078 -0.514 1.517 2.909 26.47
84 NICH 2330 1996 1 18.179 9.065 12.209 1.858 11.724 6.455 15.079 9.261 11.51 0.624 11.682 3.397
85 APAR 2330 1996 -1 20.408 -69.388 1.407 1.405 -4.457 20.297 22.722 -76.045 -4.386 -3.072 -11.114 29.226
86 VARS 2330 1996 1 37.791 18.253 18.006 9.766 29.897 7.894 29.243 13.777 15.212 7.318 24.794 3.397
87 3ABNKQ 2750 1996 -1 480.378 -21.281 35.533 59.029 46.277 434.101 503.536 -26.956 56.12 60.885 46.715 448.495
88 CVO 2750 1996 1 470.946 27.406 22.125 81.931 121.207 349.739 586.201 48.587 92.722 96.595 146.401 439.8
89 3SFFP 2834 1996 -1 2.774 -26.628 1.434 -7.091 1.696 1.078 0.69 -36.157 -0.838 -7.772 -4.717 2.939
90 3SQES 2834 1996 1 2.88 -7.857 0.64 -0.934 0.373 2.132 1.409 -10.298 0.045 -1.798 -0.252 1.311
91 CMTR 2835 1996 -1 8.841 -36.308 4.656 -6.178 5.03 3.811 4.285 -41.503 1.104 -7.124 2.208 2.077
92 3OXIS 2835 1996 1 7.997 -33.099 -1.405 -5.247 4.502 3.472 12.575 -38.428 0.958 -4.559 6.738 5.82
93 3MABAE 2835 1996 -1 16.473 -42.268 13.697 -5.748 5.542 10.931 9.388 -49.415 6.961 -6.251 6.683 2.705
94 AVAN 2835 1996 1 17.224 -57.129 11.672 -9.536 15.619 1.605 9.827 -70.237 4.629 -7.205 6.316 3.511
95 3LOCKE 3420 1996 -1 2.094 -8.092 -0.317 -2.746 1.132 0.962 5.522 -13.357 1.8 -5.031 3.399 2.123
96 QEPC 3420 1996 1 16.434 4.738 12.695 3.029 13.116 2.981 43.026 6.605 14.212 4.1 15.296 27.393
97 MRSIQ 3559 1996 -1 13.428 -26.581 10.299 -4.304 9.87 3.558 7.884 -31.723 4.625 -4.815 4.832 3.052
98 JMAR 3559 1996 1 15.396 -26.074 5.744 1.172 9.369 6.027 17.269 -24.278 9.635 1.638 12.488 4.781
99 SUBM 3559 1996 -1 125.934 -13.006 25.618 -16.85 28.676 97.258 59.708 -60.559 9.025 -27.209 -7.49 62.298
100 PRIA 3559 1996 1 123.786 25.043 80.846 20.298 96.922 26.864 156.984 42.121 105.17 28.447 119.384 37.6
101 TNNYB 3569 1996 -1 3.935 -0.971 1.059 0.521 1.324 2.611 3.24 -2.239 -0.296 -1.172 0.058 3.182
102 3QPDC 3569 1996 1 1.437 -26.822 -1.456 -0.372 -1.929 3.366 2.023 -25.812 -0.46 1.04 -0.815 2.838
103 PNLEQ 3572 1996 -1 40.238 -20.058 12.71 -12.419 8.503 31.735 12.544 -50.462 -15.928 -25.344 -15.47 28.014
104 ADIC 3572 1996 1 36.71 6.057 24.595 6.29 26.387 10.323 75.194 14.302 53.359 12.029 60.11 15.084
105 SYQTQ 3572 1996 -1 75.181 -120.11 -37.351 -91.498 -30.371 105.534 82.649 -204.304 1.436 -56.097 5.613 62.951

Figure 17 Continued

Obs# Ticker Ind Year Label TAyr1 REyr1 WCyr1 EBITyr1 SEyr1 TLyr1 TAyr2 REyr2 WCyr2 EBITyr2 SEyr2 TLyr2
106 NTAP 3572 1996 1 68.941 -0.624 41.919 20.683 54.029 14.912 115.736 20.341 69.631 38.206 86.265 29.471
107 ADPVQ 3576 1996 -1 7.212 -27.059 2.828 -4.999 4.045 3.167 4.838 -28.213 2.361 -0.031 3.249 1.589
108 SBEI 3576 1996 1 7.894 -5.196 2.049 -6.447 3.231 3.913 11.269 -1.863 7.492 3.297 7.966 3.303
109 3APSP 3661 1996 -1 0.308 -16.148 0.155 0.016 0.156 0.152 0.324 -16.096 0.208 -0.022 0.208 0.116
110 TLKP 3661 1996 1 1.682 -5.165 -1.329 -2.131 -1.775 2.608 23.083 -17.145 16.423 -10.762 18.872 4.211
111 CODDQ 3663 1996 -1 5.521 -30.239 -0.468 0.712 0.447 5.073 5.309 -30.201 0.514 0.955 0.602 4.706
112 AMCM 3663 1996 1 4.969 -34.765 1.32 -1.482 -0.397 2.783 5.45 -33.828 2.417 1.543 0.548 2.319
113 EAIN 3670 1996 -1 50.971 -73.245 -9.166 -7.038 7.086 43.885 47.862 -91.307 -16.787 -4.274 -1.101 48.963
114 VIDE 3670 1996 1 40.887 14.122 13.784 5.274 17.743 23.144 40.582 17.589 16.441 8.358 21.146 19.436
115 3FIVDE 3679 1996 -1 2.103 0.937 1.099 0.025 1.705 0.398 1.255 -1.224 0.13 -1.987 -0.456 1.711
116 3POWDQ 3679 1996 1 0.427 -2.026 -1.424 -0.324 -1.205 1.632 6.501 -7.152 -9.595 -4.02 -5.769 12.267
117 AI. 3690 1996 -1 73.112 24.019 38.125 1.217 31.296 41.816 59.333 12.767 26.25 -0.681 20.043 39.29
118 MPAA 3690 1996 1 75.51 11.086 51.8 10.87 40.108 35.402 98.245 17.136 75.333 13.608 68.127 30.118
119 BDTTZ 3714 1996 -1 503.802 199.054 90.171 114.88 275.08 228.722 877.153 189.121 -47.204 99.21 266.419 610.734
120 STNT 3714 1996 1 581.571 45.051 78.568 84.02 200.562 381.009 573.536 22.585 95.718 73.372 178.096 395.44
121 IMTI10 3841 1996 -1 137.159 -165.003 1.714 -7.804 -14.511 151.67 202.786 -278.252 -55.656 -43.517 -49.923 252.709
122 BMP 3841 1996 1 142.465 94.218 75.675 41.393 135.924 6.541 186.449 121.75 92.362 51.987 179.689 6.76
123 URMD 3842 1996 -1 110.488 -71.325 98.013 -19.192 35.952 74.536 76.593 -105.861 60.907 -29.628 1.785 74.808
124 RSND 3842 1996 1 114.752 -38.309 25.377 8.674 52.371 57.156 89.775 -59.766 19.883 10.129 37.019 52.756
125 PLSIQ 3845 1996 -1 19.321 -20.228 8.018 -3.302 16.632 2.689 47.708 -61.63 19.017 -18.056 31.456 16.252
126 3BCHM 3845 1996 1 18.429 3.523 11.782 8.411 15.485 2.944 22.76 8.163 16.347 6.86 19.945 2.815
127 ONTAQ 2253 1997 -1 86.977 -86.449 -42.595 -10.537 -8.742 95.719 51.851 -126.822 -62.395 -13.472 -49.115 100.966
128 HAMP 2253 1997 1 80.585 30.949 36.303 16.594 57.71 22.875 100.848 35.271 51.283 14.165 63.403 37.445
129 TLTXQ 2253 1997 -1 538.226 150.005 295.721 52.758 186.081 344.114 447.328 107.994 236.152 5.582 144.74 300.89
130 PLUAQ 2253 1997 1 165.987 26.819 32.974 9.384 63.668 102.319 158.543 -9.24 -44.664 -18.763 27.609 130.934
131 BISD 2330 1997 -1 34.817 -19.06 13.944 1.596 7.658 27.159 14.384 -37.905 -11.185 -10.756 -11.185 25.569
132 NICH2 2330 1997 1 15.079 9.261 11.51 0.624 11.682 3.397 14.058 9.263 9.384 -0.972 10.571 3.487
133 3NMPCQ 2844 1997 -1 132.759 24.602 38.602 12.37 23.479 109.28 138.751 21.631 29.899 16.931 21.194 117.557
134 DLI 2844 1997 1 149.314 72.867 53.576 34.658 54.53 94.784 177.474 79.738 62.728 30.788 59.097 118.377
135 DVLGQ 3540 1997 -1 121.444 -8.506 18.273 13.907 25.713 95.731 123.915 -14.947 13.634 4.172 19.406 104.509
136 FNSTQ 3540 1997 1 88.832 25.423 35.126 9.869 46.92 41.912 88.717 26.05 27.589 10.577 47.547 41.17
137 RACE 3576 1997 -1 9.47 -24.779 4.888 -5.902 3.784 2.607 4.009 -34.265 1.056 -8.634 0.952 1.452
138 FSCXQ 3576 1997 1 9.226 -17.511 3.057 -6.411 3.246 5.98 5.581 -23.06 1.922 -5.161 1.036 4.545
139 AXHM10 3577 1997 -1 204.044 -41.933 32.249 30.031 -18.081 222.125 171.726 -72.365 31.096 33.294 -47.998 219.724
140 3CLCP 3577 1997 1 209.457 -211.595 19.154 -64.084 75.733 133.724 21.939 -35.684 -65 86.939

Figure 17 Continued

Obs# Ticker Ind Year Label TAyr1 REyr1 WCyr1 EBITyr1 SEyr1 TLyr1 TAyr2 REyr2 WCyr2 EBITyr2 SEyr2 TLyr2
141 GECMQ 3577 1997 -1 250.049 18.323 66.19 32.415 45.396 204.653 229.977 -4.715 75.406 17.423 24.617 205.36
142 DIMD 3577 1997 1 337.554 -70.615 158.287 -49.557 180.521 157.033 306.91 -110.104 92.848 -54.291 146.37 160.54
143 3SHELQ 3672 1997 -1 139.367 13.736 22.943 -0.121 82.917 56.435 136.306 -23.45 9.219 0.39 78.716 57.549
144 3IECE 3672 1997 1 152.07 37.367 34.622 22.047 75.461 76.609 98.665 31.34 31.764 8.809 69.568 29.097
145 AURLQ 3674 1997 -1 6.35 -138.543 -0.645 -10.056 -24.149 30.499 13.638 -172.567 -3.586 -11.217 -1.308 14.946
146 3SODI 3674 1997 1 6.835 -1.672 1.168 0.702 0.966 5.869 5.425 -1.196 1.34 0.826 1.442 3.983
147 3ECTHE 3841 1997 -1 3.373 -11.456 0.958 -0.514 -0.136 3.509 3.183 -12.534 0.286 -0.637 -1.19 4.373
148 3SPSGE 3841 1997 1 3.328 -9.406 1.122 -0.214 -1.147 3.873 2.997 -9.792 1.313 0.029 -0.516 2.775
149 SNRS 3845 1997 -1 2.949 -37.062 1.382 -6.998 0.849 2.1 11.479 -54.883 6.773 -14.084 0.2 11.279
150 EVMD 3845 1997 1 3.083 -19.612 2.121 -0.182 -3.501 1.278 4.097 -18.952 2.544 0.891 -2.457 1.248
151 STUA10 3944 1997 -1 137.824 -11.241 37.578 11.165 16.372 121.452 136.699 -32.539 29.179 8.04 -4.891 141.59
152 GAL. 3944 1997 1 207.783 -9.896 82.8 -20.813 162.03 45.753 196.905 -20.679 134.394 24.415 149.791 47.114
153 NWSW 3312 1998 -1 383.199 -31.948 122.795 86.514 86.7 296.499 318.409 -71.822 64.503 0.982 46.826 271.583
154 KESNQ 3312 1998 1 405.857 -9.243 0.555 35.622 53.077 352.78 410.918 -16.727 -13.92 24.413 46.315 364.603
155 APMPQ 3679 1998 -1 299.518 -106.065 13.399 -91.691 85.96 213.558 226.903 -342.055 -72.298 -4.342 231.245
156 LGL 3679 1998 1 480 26.83 18.768 49.473 39.793 440.207 211.192 3.803 23.214 7.105 15.991 195.201

Figure 17 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
1 CVGYQ 4991 1995 -1 46.421 12.014 0 45.566 1.266 4.348 58.156 12.176 0 39.901 0.012 2.52
2 WTI.1 4991 1995 1 40.916 31.216 9.946 1.328 40.916 6.125 33.816 12.488 0.772 40.218 6.334 0.521
3 IMTC.2 5110 1995 -1 15.021 2.254 0.601 50.531 0.091 0.705 71.862 35.269 4.301 360.503 232.975 4.568
4 CLPI 5110 1995 -1 199.692 0 0 1239.157 59.028 0 238.373 0 0 1238.805 65.369 0
5 3PCNIQ 7370 1995 -1 12.918 4.304 8.459 18.762 0.095 0.408 5.451 76.713 14 32.433 0.166 0.154
6 CTG 7370 1995 1 14.114 30.666 4.165 1.986 14.114 1.918 28.868 4.135 1.77 11.166 1.903 0.237
7 EQMD 8093 1995 -1 831.704 74.959 60.259 2021.46 0 100.446 1932.813 218.912 106.505 2896.871 0 123.776
8 RTEL 8093 1995 1 1726 1172.1 355.9 208.7 1726 116 5793 164.923 2472.437 8916.705 702.5 493.5
9 3WLGN 2511 1996 1 443.092 456.043 72.337 4.918 443.092 19.064 425.026 63.835 1.851 487.441 8.407 19.28
10 PLSSQ 2511 1996 1 35.878 72.39 3.277 13.641 35.878 5.233 16.537 9.552 4.209 17.308 0.883 0
11 3ABNKQ 2750 1996 1 145.694 2.143 0 0 145.694 0.304 1.413 0.282 0 371.491 3.392 37.761
12 CVO 2750 1996 1 217.291 207.865 117.506 46.467 217.291 37.218 385.758 189.996 134.599 429.182 23.844 46.827
13 UMED 2834 1996 -1 2553.7 285.3 120.3 1758.8 221.6 34.3 2518.2 305.1 161.6 1675.1 321.2 43.5
14 CARN 2834 1996 1 2619.533 3647.03 292.638 385.799 2619.533 0 4184.498 386.353 397.048 2984.383 0 306.749
15 SYC 2840 1996 1 5554.472 472.691 2889.108 16.566 5554.472 936.8 387.271 2830.381 10.427 6099.402 7.631 1.682
16 USAD 2840 1996 1 1501.8 1962.3 220.7 340.3 1501.8 42 1917 243.8 359.5 1636.7 52.7 78.4
17 3NSTLQ 3312 1996 -1 45.605 6.925 10.686 29.386 1.208 2.777 21.989 1.662 5.612 16.29 1.392 0.129
18 NUE 3312 1996 1 102.625 93.333 23.658 0 102.625 20.364 89.428 17.551 0 111.879 23.04 3.009
19 FSCXQ 3576 1996 -1 14.427 0.088 1.45 6.91 0.269 0.816 16.851 0.08 1.904 6.068 1.383 0.544
20 MRVC 3576 1996 -1 15.644 3.026 0 35.128 1.017 3.489 24.636 2.636 0 21.522 0.963 1.836
21 FCSE 3576 1996 -1 33.545 14.582 0 66.855 0 2.005 44.361 20.319 0 127.52 11.701 4.488
22 ANCR 3576 1996 -1 63.511 8.897 0 272.745 3.519 55.126 116.13 229.063 0 50.224 58.916 19.39
23 SBEI 3576 1996 1 8195.059 1760.07 6771.682 0 8195.059 227.907 1831.265 7239.684 0 8809.898 216.721 38.109
24 ANET 3576 1996 1 145.042 154.676 31.295 53.702 145.042 3.456 262.833 50.963 107.358 320.205 2.9 15.155
25 MDEA 3577 1996 -1 340.8 168 0 384.4 89.9 11.4 430.5 199.8 0 475.4 107.1 4.9
26 CENL 3577 1996 1 514 806.27 144.292 66.865 514 2.695 826.746 122.26 44.538 575.695 2.234 38.411
27 ASD 3585 1996 -1 10513 1684 767 7408 146 653 11026 1752 842 7906 146 525
28 ASD1 3585 1996 1 1208.246 188.62 3.175 0 1208.246 2.941 217.961 1.661 0 1427.881 4.085 0
29 RAM 3630 1996 1 1517.581 252.245 63.121 0 1517.581 17.025 320.439 38.45 0 2095.747 26.931 0
30 HMII 3630 1996 1 447.998 960.377 138.347 114.127 447.998 21.2 896.572 109.522 81.401 470.762 37.504 44.583
31 PCTL 3661 1996 -1 204.042 2.299 28.637 95.397 2.488 1.279 214.148 2.136 34.652 87.164 2.232 1.306
32 CMVT 3661 1996 1 93.198 119.72 26.155 27.48 93.198 3.03 95.105 21.948 24.932 88.012 3.324 0.984
33 CAMD 3670 1996 -1 25.185 10.823 67.244 19.35 1.253 1.213 1441.587 144.119 119.5 658.551 0 135.848
34 VIDE 3670 1996 1 2642.832 2087.112 585.216 0 2642.832 47.275 1786.594 444.398 0 1811.599 33.537 144.155
35 SVRI 3674 1996 1 30.747 8.213 1.955 4.185 30.747 0 10.617 1.778 4.387 24.09 0 0.564

Figure 18 – Restatement Dataset

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
36 3MPAD 3674 1996 1 30.204 57.295 7.958 13.457 30.204 0.386 60.674 6.276 19.563 38.205 1.214 2.858
37 3DGIX 3960 1996 -1 125.086 19.76 0 82.683 1.121 1.494 110.139 14.125 0 50.224 0.544 0.097
38 3LEVC 3960 1996 1 557.288 40.785 397.286 0.328 557.288 1.151 41.146 347.465 0.606 552.079 0.963 0
39 OMPT 4812 1996 -1 268.836 42.994 7.306 367.947 25.911 20.271 350.954 77.989 8.843 580.466 62.867 22.077
40 PAGE 4812 1996 1 117.064 174.346 32.863 22.125 117.064 13.069 188.814 37.367 28.205 150.233 20.686 9.365
41 TCAT 4841 1996 -1 25.599 8.047 1.734 32.146 0.21 2.856 32.795 5.58 0.702 32.248 0.228 7.135
42 GOAL 4841 1996 -1 121.232 32.964 3.747 290.221 8.471 21.333 186.143 65.465 6.646 631 8.612 24.994
43 AMXI 4899 1996 -1 254.603 41.652 35.146 235.445 16.201 6.225 278.907 36.629 34.824 246.069 7.709 7.909
44 CCIX.1 4899 1996 -1 1330.296 386.48 19.643 964.879 89.044 52.638 1222.58 296.619 25.697 851.372 9.12 43.915
45 PHX.2 4899 1996 -1 22936 3815 2703 31691 3367 1121 22859 2865 2531 28355 1746 1177
46 SMTKQ 4899 1996 1 132.866 196.48 62.962 0 132.866 16.171 266.964 40.716 4.442 315.767 7.313 7.564
47 MLTNQ 4955 1996 -1 102.711 610.699 20.662 1404.116 2729.445 617.436 116.808 712.875 25.046 1660.649 1881.009 52.605
48 HDSN 4955 1996 1 126.141 286.123 39.761 34.052 126.141 5.056 325.417 47.045 36.195 134.947 3.049 14.548
49 BERT 5812 1996 -1 278.017 37.96 30.952 254.089 0.728 55.599 320.163 56.307 60.892 418.074 0.643 19.108
50 3FINE 5812 1996 1 95.276 132.239 35.149 26.4 95.276 0.395 109.744 27.701 28.473 90.187 3.584 3.437
51 STFR 6035 1996 -1 27.053 12.691 5.941 40.677 0.78 6.587 33.799 15.441 7.284 50.225 0.195 7.502
52 EGFC 6035 1996 1 304.085 28.048 209.661 0 304.085 4.074 36.094 232.4 0 464.309 4.1 0.315
53 FFBA 6035 1996 1 29.036 31.037 6.094 0.107 29.036 0.385 28.413 6.827 0.119 28.57 0.425 1.504
54 3PHPC 6324 1996 -1 178.818 116.447 0 376.498 4.718 1.594 267.296 116.816 0 380.722 10.611 7.548
55 HPLX10 6324 1996 1 7.894 13.35 2.053 2.742 7.894 0.619 24.97 2.78 0.851 11.269 0.316 0.291
56 SUHI 6324 1996 1 965.795 1189.103 145.364 22.019 965.795 34.25 1441.587 180.252 28.214 1085.349 23.26 170.23
57 UDCI 6324 1996 1 549.396 606.45 106.261 234.257 549.396 16.031 608.073 117.965 177.291 509.429 26.056 13.74
58 ENVY 6411 1996 1 580.945 1014.913 155.796 0 580.945 1.573 1220.852 191.192 0 685.146 2.783 11.709
59 KAYE 6411 1996 1 1973.424 148.449 1067.636 2.051 1973.424 18.311 251.123 2151.547 11.812 3488.306 1.174 17.171
60 TMBS 7372 1996 -1 172.424 31.578 28.83 98.476 3.075 18.846 227.269 31.469 17.258 102.808 2.821 23.69
61 MTCI 7372 1996 1 7.499 5.081 0.22 0 7.499 0.479 4.729 0.229 0 6.371 0.476 0.183
62 3AUGRE 7372 1996 1 27.107 38.541 16.331 4.464 27.107 2.847 52.39 18.678 5.295 32.279 3.008 1.35
63 SSAXQ 7372 1996 1 3510.704 1554.934 17.199 333.88 3510.704 517.121 1666.108 19.088 452.114 4141.688 639.08 389.964
64 IDNX 7373 1996 -1 4.203 0.723 0.16 2.58 0.034 0.215 5.465 1.258 0.118 2.579 0.037 0.133
65 3FNIX 7373 1996 -1 45.605 6.925 10.686 29.386 1.208 2.777 21.989 1.662 5.612 16.29 1.392 0.129
66 SUMC 8051 1996 -1 488.118 3326.82 10.819 5499.443 1995.727 112.681 435.521 2742.282 10.057 4899.717 3514.095 0
67 3UNHC 8051 1996 1 30.204 57.295 7.958 13.457 30.204 0.386 60.674 6.276 19.563 38.205 1.214 2.858
68 AHCI 8082 1996 -1 40.848 8.823 0 56.279 2.938 2.678 46.421 12.014 0 45.566 1.266 4.348
69 IHHI 8082 1996 1 142.854 229.186 29.957 13.452 142.854 1.025 236.923 28.713 12.222 139.034 3.094 2.254
70 3VLFIQ 2030 1997 -1 388.294 88.515 129.049 348.844 12.237 36.024 365.751 75.376 81.923 636.29 8.429 13.159

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
71 IHF 2030 1997 1 27.08 45.457 9.564 6.801 27.08 1.85 54.631 10.813 8.566 47.877 12.746 2.454
72 RTRO 2330 1997 -1 273.834 68.827 96.844 275.804 7.319 38.837 143.775 20.46 46.665 256.611 35.316 7.085
73 JLNY 2330 1997 -1 36151 0 5822 14298 253 1456 30762 473 4825 11238 244 252
74 3SOCNQ 2390 1997 -1 118.228 46.346 0 186.91 98.259 4.871 114.644 18.31 0 146.921 91.081 4.134
75 3WSPT 2390 1997 -1 368.646 161.736 118.995 591.837 5.828 19.513 626.84 194.716 229.05 1005.88 8.653 30.199
76 WILK 2842 1997 -1 260.581 53.261 11.412 163.743 0 23.829 246.827 46.666 17.098 140.159 0 10.359
77 3KYZN 2842 1997 1 46.51 114.72 23.211 0.534 46.51 2.583 107.359 1.168 14.81 227.027 23.272 0.169
78 ORXR 3060 1997 1 107.358 109.177 21.073 0 107.358 17.826 105.51 20.376 0 113.047 25.738 2.178
79 3FHCO 3060 1997 1 17.321 19.302 3.293 2.359 17.321 0.811 18.737 3.568 2.352 16.19 0.949 0.15
80 STMT 3350 1997 -1 71.227 19.851 3.59 60.103 0.284 3.901 259.31 476 49.068 905.984 1.173 44.7
81 WIRE 3350 1997 1 318.409 349.345 34.391 51.485 318.409 66.27 365.269 33.511 44.975 259.846 20.878 48.188
82 ACU 3420 1997 1 9.707 21.11 3.487 0 9.707 3.339 15.793 2.037 0 5.208 1.115 0.09
83 3LOCKE 3420 1997 1 714.195 82.594 26.475 0 714.195 28.654 141.171 22.23 0 731.885 31.004 0
84 NEXR 3571 1997 -1 553.496 524.572 0 1347.702 54.771 5.659 398.496 317.936 0 915.809 33.003 2.294
85 NWRE 3571 1997 1 6.093 7.119 1.934 0.676 6.093 1.628 6.448 1.974 0.67 6.605 2.161 0.2
86 COMS 3576 1997 1 52.953 43.504 7.806 7.941 52.953 19.57 31.256 5.352 1.945 36.896 10.059 0.132
87 CSCO 3576 1997 1 191.191 198.199 43.348 47.181 191.191 6.684 81.937 15.635 34.677 201.646 4.723 4.364
88 MEDP 3577 1997 -1 14.303 2.815 0 9.344 0.133 0.605 13.6 3.279 0 7.717 0.078 0.038
89 CYLK 3577 1997 -1 26 5.404 4.935 19.447 1.534 0.559 26.391 6.461 2.402 16.568 0.823 0.797
90 3RGFX 3577 1997 -1 99.662 15.703 0 687.242 8.098 0 107.359 13.667 0 673.467 8.379 0
91 3AEXCA 3577 1997 1 608.504 335.784 19.967 0 608.504 18.594 362.927 18.839 0 676.116 36.12 27.824
92 3GRDC 3579 1997 1 177.474 274.862 47.116 55.619 177.474 8.018 267.346 47.904 59.155 180.561 7.897 7.288
93 SORT 3579 1997 1 259.575 21.041 86.689 60.561 259.575 8.932 22.241 88.128 66.507 256.527 0 6.709
94 4360B 3669 1997 -1 1830.778 365.463 391.58 1924.27 95.928 220.097 2233.576 537.786 461.981 3892.711 705.084 174.039
95 DETC 3669 1997 1 2363.142 1112.711 156.734 173.61 2363.142 44.173 1128.683 137.695 175.972 2121.357 45.346 62.062
96 ISNR 3674 1997 1 141.998 125.826 30.105 0 141.998 16.102 144.385 39.868 0 198.76 21.545 4.829
97 LOGC 3674 1997 1 2627.368 1405.305 438.28 63.76 2627.368 44.337 1866.426 370.472 63.369 2429.914 30.066 77.943
98 MODI 3714 1997 -1 282.331 99.495 27.534 206.995 53.168 6.166 300.528 55.316 78.311 243.809 56.748 4.93
99 BDTTZ 3714 1997 1 47.579 45.908 8.546 9.31 47.579 22.87 22.284 1.726 2.47 12.695 5.659 2.748
100 DAIO 3825 1997 -1 1676.838 184.312 196.733 2446.5 69.676 247.651 1808.77 187.819 223.206 2542.445 78.939 166.422
101 TVL.1 3825 1997 -1 58518 10362 7543 62624 3466 3744 57108.199 9365.682 5074.883 61686.473 4182.163 2927.762
102 3STRN 3829 1997 -1 10.478 1.794 0.161 6.499 0.956 0.261 9.089 3.119 0.201 11.775 4.196 0.485
103 MDLG 3829 1997 1 1619.355 1798.456 325.039 278.948 1619.355 294.572 1800 307.732 218.03 1083.963 336 0
104 CTU 3842 1997 -1 4816.657 3968.555 78.349 1689.56 129.255 90.406 6574.986 124.67 54.844 1394.329 108.953 90.945
105 CRSS 3842 1997 1 336.186 206.179 4.154 2.3 336.186 5.303 264.419 4.291 3.614 351.737 3.924 32.312

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
106 3KTTEQ 4522 1997 1 409.557 28.093 317.02 0 409.557 25.281 32.205 342.687 0 449.337 229.907 7.244
107 OLG 4522 1997 1 317.945 510.146 67.011 112.451 317.945 6.355 438.224 75.816 89.748 310.393 7.17 15.323
108 ITSW 4581 1997 -1 18.226 4.523 2.273 49.915 0.375 0.881 9.953 11.432 14.639 57.228 0 69.369
109 HGC. 4581 1997 -1 340.449 59.515 42.498 8701.016 44.684 412.025 837.453 7239.684 0 5896.245 1852 92.284
110 PTEK 4899 1997 -1 372.963 57.692 63.574 352.129 1.111 13.071 372.728 51.232 76.449 339.679 1.254 16.315
111 SMTKQ 4899 1997 1 14.827 3.048 1.517 0 14.827 0 5.559 2.016 0 10.781 0.231 0.094
112 DANKY 5040 1997 -1 102.144 25.337 6.891 56.584 0.195 2.179 65.349 7.887 0.056 88.067 0.054 4.134
113 IKN 5040 1997 1 30.4 44.552 0.25 50.451 30.4 0.652 44.088 0.26 10.17 26.173 0.612 0.118
114 MTLM 5093 1997 1 86.67 36.049 9.593 15.833 86.67 3.684 51.484 8.046 21.691 60.165 0.003 4.05
115 NR 5093 1997 1 9285 8995 1402 1190 9285 76 7618 1557 585 9622 158 274
116 MCD 5812 1997 -1 126.343 23.505 38.154 94.044 4.475 2.832 138.045 20.916 37.762 92.812 4.67 3.354
117 YUM 5812 1997 1 261.749 310.602 67.741 47.27 261.749 12.432 207.877 40.197 25.6 169.786 11.372 3.42
118 VTA 6331 1997 1 148.419 360.742 53.555 0 148.419 0.988 371.115 53.714 0 154.186 1.174 1.75
119 HSB. 6331 1997 1 1093.331 654.342 56.138 69.869 1093.331 1.278 836.623 90.101 64.027 1439.599 3.797 29.821
120 RST 7011 1997 -1 48.876 1.698 12.641 18.464 0.182 0.215 98.725 0.404 7.346 44.604 1.09 0.73
121 MCS 7011 1997 1 5451.984 6440.171 1170.401 254.677 5451.984 617.192 8458.777 1297.867 361.986 8916.705 1096.153 414.843
122 ISLI 7372 1997 -1 1.462 0.2 0.209 4.256 0.112 1.492 0.357 0.028 0.056 4.232 0.504 0.735
123 FDPC 7372 1997 -1 910.272 94.021 103.716 292.183 35.003 5.854 1084.633 110.584 120.705 334.009 40.221 8.624
124 CYBR. 7372 1997 -1 54.262 16.865 5.763 600.392 13.7 1.494 296.189 32.663 6.499 1128.207 21.203 47.806
125 3SOFT 7372 1997 1 1989.32 1247.448 1089.227 0 1989.32 111.808 1217.013 1028.218 0 1899.806 102.176 57.978
126 PEGA 7372 1997 1 1246.659 460.986 76.622 0 1246.659 4.932 482.261 73.283 0 1335.347 7.308 0
127 3PTUS 7372 1997 1 27.538 43.791 5.343 7.165 27.538 6.691 30.427 3.792 5.73 24.736 6.586 1.447
128 3QDEK 7372 1997 1 1773.529 543.151 85.062 7.602 1773.529 0 940.388 151.102 11.812 2636.881 0 451.116
129 FILE 7372 1997 1 97.04 415.874 14.902 15.439 97.04 2.352 504.379 20.311 23.825 118.406 3.161 13.032
130 TEALQ 7372 1997 1 81.671 92.861 13.246 16.38 81.671 0.045 73.91 11.789 19.766 116.001 3.35 3.554
131 CORL 7372 1997 1 188.576 139.897 33.203 19.947 188.576 10.224 275.068 4.841 0.176 151.395 0.736 0.028
132 HYBR 7373 1997 1 20.954 29.94 7.133 1.388 20.954 1.435 42.427 13.69 3.531 41.641 6.082 0.284
133 3DSYS 7373 1997 1 20.544 32.679 4.975 4.156 20.544 0.385 27.351 6.266 5.119 17.978 1.172 0.701
134 QSII 7373 1997 1 119.006 377.41 27.333 44.477 119.006 5.636 318.451 31.738 41.657 135.839 7.015 0.896
135 PSCDQ 7600 1997 -1 494.351 0 0 597.394 20.515 22.051 592.686 20.607 0 766.311 52.903 34.276
136 ISER 7600 1997 -1 28931.9 5630.4 4202.4 40404.3 17605.4 1814.9 36047.8 38759 5101.3 111287.3 46023.5 4042
137 UTLV 7812 1997 -1 46.284 5.678 8.158 57.736 1.643 1.197 35.338 8.718 4.442 40.089 4.405 0.482
138 DCPI 7812 1997 1 163.333 7.089 3.373 1.648 163.333 3.278 11.451 2.576 1.833 122.717 3.401 2.177
139 3JACK 7990 1997 -1 603.606 135.609 0 357.954 22.683 72.334 496.722 93.522 0 309.78 7.05 39.058
140 ELSO 7990 1997 1 759.024 1040.418 162.177 152.674 759.024 61.126 1111.447 182.91 178.949 915.739 82.552 90.86

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
141 VSR 8711 1997 -1 157.957 50.182 0.575 140.3 3.003 6.473 131.3 28.761 0 66.614 8.779 2.103
142 QTEC 8711 1997 -1 661.983 73.409 52.036 1070.248 125.775 40.922 1020.993 135.048 131.307 2348.639 43.344 143.746
143 MRSA 2253 1998 1 308.082 70.543 9.021 0 308.082 17.756 61.482 2.85 0 278.251 15.815 2.199
144 SIAYQ 2253 1998 1 31.164 39.325 12.11 5.798 31.164 1.121 39.455 12.614 5.465 30.885 1.387 0.854
145 ZQK 2320 1998 -1 224.58 23.54 1.212 270.583 10.25 42.835 246.827 18.839 0 176.33 10.186 4.935
146 TNFI 2320 1998 1 59.392 201.745 37.986 0 59.392 1.639 158.693 34.769 0 52.73 2.399 0.572
147 JH 2780 1998 -1 67.725 11.357 8.016 49.149 5.487 2.343 55.082 8.231 4.725 29.339 0.331 1.583
148 3DAYR 2780 1998 -1 894.709 167.347 141.898 729.796 72.656 57.629 894.534 164.555 153.006 789.457 100.607 84.015
149 DLI 2844 1998 -1 19.902 1.651 0.241 29.426 0.205 0.125 23.12 1.896 0.149 22.419 1.654 0.192
150 STYL 2844 1998 1 16624.874 11205.091 1200.638 74.6 16624.874 0 8252.523 745.13 158.583 17356.82 0 1258.965
151 HIPC 2860 1998 1 18.038 19.137 1.807 1.061 18.038 2.78 2.886 258.895 7.314 791.291 0 46.827
152 NZYM 2860 1998 1 5.777 5.661 0.673 4.011 5.777 0.142 5.721 0.553 3.587 5.099 0.11 0.012
153 SCNYB 3021 1998 -1 979.5 153.2 176.9 1179 129.5 36.3 1025.7 153.5 174.4 1082.1 239.2 34.4
154 DECK 3021 1998 1 10800 9673 2329 0 10800 679 9475 2221 0 11064 542 291
155 HAVA 3060 1998 1 19.909 33.608 8.832 4.973 19.909 1.288 13.873 24.199 0 69.538 0.553 1.573
156 SFSK 3060 1998 1 5665 6625 885 707 5665 1780 8402 1267 965 8437 1318 496
157 RCKY 3140 1998 -1 171.399 39.07 5.597 140.904 0 3.269 196.566 5.538 5.168 125.735 0 4.625
158 CAND 3140 1998 1 16850.816 6168.432 899.314 0 16850.816 12315.726 6853.652 976.638 0 17889.09 13159.653 185.406
159 ACRN 3420 1998 1 49.083 53.627 19.124 0 49.083 1.25 61.542 29.14 0 60.781 0.989 6.882
160 LCUT 3420 1998 1 1900.39 1119.584 243.297 21.283 1900.39 57.551 2089.444 115.958 114.128 2429.914 58.798 319.234
161 ADPC 3540 1998 -1 129.744 29.559 33.361 156.898 5.685 16.291 147.76 17.064 21.186 131.495 0.78 9.56
162 DVLGQ 3540 1998 -1 2505.1 218.4 11.5 570.3 16 24.3 2809 248 9.6 619.9 28 14.5
163 BEHP 3567 1998 1 80.758 4.542 0.812 0 80.758 0.121 278.907 242.159 0 752.653 58.526 18.115
164 GNCI 3567 1998 1 76.557 73.725 10.412 17.751 76.557 34.644 34.872 5.695 8.869 57.361 0.421 0.619
165 XIRC 3576 1998 -1 575.04 131.904 0 427.586 99.365 66.657 603.606 135.609 0 357.954 22.683 72.334
166 DGII 3576 1998 1 856.247 1426.288 282.423 312.13 856.247 34.288 1169.511 264.233 240.488 823.98 62.203 29.39
167 TLXN 3578 1998 -1 656.527 167.592 99.438 487.707 37.634 32.116 539.852 154.036 91.799 495.205 36.245 19.355
168 HYC 3578 1998 1 31.259 9.344 0.973 0 31.259 25.909 7.681 0.575 0 17.983 0.549 0.025
169 SORT 3579 1998 -1 365.751 75.376 81.923 636.29 8.429 13.159 479.594 65.296 184.387 208.293 108.953 22.077
170 IDN 3579 1998 1 2178.941 3348.986 628.052 482.656 2178.941 615.902 2897.22 571.47 356.139 1905.142 96.667 191.054
171 TNB 3640 1998 -1 39.709 15.174 0 39.87 5.787 2.018 31.532 4.671 0 13.723 0.51 2.739
172 RYC 3640 1998 1 96.598 88.699 17.25 47.11 96.598 1.01 98.099 23.941 32.573 89.333 1.76 9.675
173 CUBE 3663 1998 1 4.692 9.941 2.207 1.528 4.692 0.039 9.888 2.898 1.828 5.744 0.037 0.438
174 3PCOM 3663 1998 1 12615.333 5060.605 332.553 103.078 12615.333 1181.473 4794.01 707.731 126.938 18016.455 1268.569 1508.085
175 ALTR 3674 1998 -1 41.805 19.467 4.599 100.26 53.958 1.482 95.797 21.103 5.798 150.165 74.815 1.59

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
176 WFR 3674 1998 -1 56.06 41.183 0 248.758 3.279 1.092 135.521 96.484 0 874.642 7.325 1.747
177 XLNX 3674 1998 -1 1512 135.8 437.5 1124.1 112.3 45.4 2171.1 222.4 773.1 3544.9 11.5 197.2
178 ADI 3674 1998 1 3.907 3.459 0.533 1.037 3.907 0 3.846 0.437 1.228 6.011 0 2.497
179 3SAFY 3714 1998 -1 667.65 57.915 87.031 285.889 43.05 11.577 743.323 64.741 85.707 308.686 49.713 14.548
180 NER 3714 1998 -1 2964 1825 0 12226 2046 25 3116 1854 0 11054 1852 30
181 ETN 3714 1998 1 317.249 168.913 31.322 1.112 317.249 89.448 164.553 38.12 1.093 354.504 102.995 9.031
182 TRW.1 3714 1998 1 5546.556 5877.859 1680.683 662.606 5546.556 153.064 9020.929 1887.926 736.019 6025.218 199.964 114.534
183 WNC 3715 1998 -1 1019.174 263.016 204.168 817.197 7.698 83.542 1049.495 243.642 191.905 857.732 8.735 72.065
184 FTHR 3715 1998 1 89.733 74.331 27.239 27.104 89.733 2.742 439.131 146.295 64.764 417.787 1.464 4.671
185 IFRS 3825 1998 -1 81.812 27.456 3.949 38.735 0.487 0.651 53.53 13.558 4.39 27.808 0.472 0.694
186 GEN 3825 1998 -1 5128.433 1436.444 442.207 5323.886 159.622 193.238 5628.663 1616.503 431.837 5748.796 149.4 212.169
187 TLGD 3825 1998 -1 26347 10878.4 0 112839 55025.9 123.5 26818.9 1834.7 0 47445.7 14776.2 36.9
188 STCO 3825 1998 1 49.223 39.423 9.653 12.926 49.223 1.219 30.66 7.887 14.433 31.982 5.117 0.642
189 AH 3842 1998 -1 1565.238 294.353 108.133 1249.309 95.485 53.945 1692.792 283.326 100.922 1166.665 73.553 41.31
190 SLS 3842 1998 -1 1163.928 10198.57 237.523 17020.231 679.8 2215 1625.883 12381.731 213.739 20280.9 13778 60.809
191 ASE 3844 1998 -1 78.851 12.684 8.887 41.248 0.636 2.968 7.172 19.961 7.284 221.559 26.931 1.447
192 3SCHK 3844 1998 1 5525 392 111 0 5525 41 515 389 0 8904 147 3436
193 CMED 3845 1998 -1 244.498 39.768 0 392.417 12.617 55.697 115.898 22.917 0 240.136 19.578 1.091
194 DYNT 3845 1998 1 27.493 12.519 7.304 12.399 27.493 1.558 9.563 4.643 12.536 23.599 0.969 0.925
195 NMTX.1 3845 1998 1 336.039 425.761 33.551 69.722 336.039 13.994 424.354 30.586 82.977 344.06 11.13 26.971
196 MGCC 3845 1998 1 3235.565 573.223 119.683 97.031 3235.565 68.984 909.388 192.914 73.925 4559.538 71.024 460.251
197 3IGTI 3845 1998 1 6260.363 5492.915 636.404 126.052 6260.363 119.905 3842.3 668.6 123.1 6631.4 2.468 692.4
198 AVID 3861 1998 -1 2249.936 131.964 483.111 2343.062 183.1 110.062 1671.256 301.887 418.902 1985.455 11.612 24.727
199 IMAX 3861 1998 -1 5055.4 2403 2416 20202.1 679.8 1708.2 5384 1556 1112 15149 1141 277
200 GMTC 3944 1998 -1 42.248 12.442 8.135 84.461 3.878 3.626 51.419 2.091 10.401 63.538 0.13 5.44
201 JUST 3944 1998 -1 307.987 40.116 0 1596.911 5.384 0 419.418 42.609 0 1907.563 14.624 0
202 OAR 3944 1998 1 6.698 32.266 1.196 0.376 6.698 0.077 35.338 2.583 0.615 8.909 0.012 0.433
203 3DSIT 3944 1998 1 45130.668 27362.142 6150.954 256.626 45130.668 0 30183.335 7943.485 358.246 59847.614 0 5680.97
204 3TRGP 4210 1998 -1 193.206 33.517 22.843 162.459 8.887 8.224 190.186 31.931 23.743 166.868 6.505 10.019
205 DDN 4210 1998 1 48.471 120.999 29.93 1.608 48.471 3.807 81.817 17.389 1.368 43.902 5.956 0.23
206 ITSW 4581 1998 -1 59.754 17.897 0 35.877 1.368 2.088 85.77 17.119 0 36.8 1.223 2.739
207 3LYNGD 4581 1998 -1 102.227 19.13 1.798 98.008 0.724 25.489 119.667 27.975 3.187 118.443 0.354 10.521
208 BTY 4813 1998 1 18.258 0.829 0.399 0 18.258 0.628 0.043 0.065 0 10.132 0.029 0.032
209 BLS 4813 1998 1 95333 18781 22559 0 95333 41466 19994 23330 0 95088 38690 0
210 MCN 4932 1998 -1 5.46 0.802 0.34 3.226 0.147 0.15 5.878 0.832 0.346 3.002 0.177 0.207

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
211 UGI 4932 1998 1 319.105 80.984 15.088 11.062 319.105 37.546 102.693 9.81 10.17 155.083 1.258 7.673
212 3PDGE 4955 1998 -1 23.408 2.358 29.495 41.128 0.29 0.412 12.845 0.008 85.707 7.748 8.779 11.189
213 BIKO 4955 1998 -1 9597 1364 813 17102 1262 679 8687 757 617 13915 1881.009 747
214 CDSC 5045 1998 -1 4.729 0.229 0 6.371 0.476 0.183 8.847 0.016 0 8.65 0.138 0.242
215 CHSWQ 5045 1998 -1 92.083 15.213 11.358 48.983 0.005 1.608 82.449 13.937 10.446 57.601 0.028 1.928
216 TECD 5045 1998 -1 1957.58 149.135 84.546 615.574 60.527 22.044 1953.392 127.767 78.905 507.652 13.877 50.683
217 3TRNT 5045 1998 1 49.823 53.788 0.623 0.382 49.823 1.655 55.9 0.473 0.445 49.748 1.505 2.117
218 GFIHQ 5047 1998 -1 9600.6 544.755 315.069 4152.544 822.156 314.804 5937.896 501.445 120.27 2986.857 131.03 199.493
219 PDCO 5047 1998 1 139.587 13.689 1.32 0 139.587 15.649 11.287 1.948 0 78.775 14.13 0.954
220 MI.1 5065 1998 1 17.013 41.74 8.696 0.426 17.013 0.507 33.302 2.426 0.608 7.668 0.772 0.748
221 CLST 5065 1998 1 1001.107 76.513 648.036 0.672 1001.107 227.2 84.988 727.042 72.464 1081.354 147.536 0
222 MCK 5122 1998 -1 1490.701 245.538 178.107 2799.997 167.602 329.092 2089.444 275.62 243.896 3206.605 301.189 205.172
223 CAH 5122 1998 1 24.63 56.346 7.189 0.625 24.63 8.08 37.587 4.338 0.325 16.445 5.374 0.134
224 POCC 5171 1998 -1 16.851 0.08 1.904 6.068 1.383 0.544 19.223 0.164 1.94 4.489 0.334 0.079
225 3EVSI 5171 1998 -1 567.815 165.868 0 649.494 16.426 20.136 526.867 124.67 0 523.408 17.354 12.926
226 BMHC 5211 1998 -1 9172.205 1750.827 1887.28 5244.355 224.367 86.927 12814.01 1629.566 1917.044 5864.148 294.637 125.421
227 WIKSQ 5211 1998 1 18.713 36.889 3.821 4.909 18.713 0.041 56.754 20.212 2.101 88.805 2.048 4.93
228 FFPM 5500 1998 -1 172.19 37.866 14.567 109.453 4.448 0.969 104.22 23.041 7.066 76.983 1.94 0.351
229 SCHA 5500 1998 -1 44.361 20.319 0 127.52 11.701 4.488 61.769 43.478 0 139.26 9.505 7.52
230 RINO 5900 1998 1 156.102 53.989 65.645 0 156.102 4.948 58.549 45.676 0 141.025 3.657 1.481
231 OFLD 5900 1998 1 161.697 178.569 54.948 22.241 161.697 5.785 191.32 50.607 20.355 148.563 3.854 10.083
232 FOOT 6020 1998 -1 54.598 11.129 7.221 70.142 2.467 0 85.348 13.317 10.174 80.462 2.575 1.096
233 STL 6020 1998 -1 64.945 11.071 5.813 396.926 16.282 10.7 100.164 17.343 12.76 672.583 167.397 8.656
234 CBSS 6020 1998 -1 203.27 24.831 0 621.884 3.014 11.911 588.608 54.426 0 1469.821 16.729 49.548
235 ZION 6020 1998 -1 167.549 1463.835 11.221 1982.831 1060.576 468.381 171.131 1532.19 46.246 2087.094 146 17.138
236 NCBM 6020 1998 1 260.144 197.097 91.881 0 260.144 24.973 218.236 87.813 0 309.926 30.022 5.615
237 BLMT 6020 1998 1 7.614 16.449 1.853 1.424 7.614 1.619 12.438 1.479 1.331 6.201 1.4 0.254
238 UBMT 6020 1998 1 493.348 555.86 87.84 34.199 493.348 9.591 712.9 65.496 35.251 537.367 7.397 104.052
239 CWBC 6020 1998 1 1583.185 1808.639 390.506 383.724 1583.185 0 1696.944 370.286 341.273 1448.691 0 95.562
240 CAT1 6159 1998 -1 14.27 8.87 3.368 43.119 1.563 0.643 12.418 1.433 5.224 15.42 0.495 3.907
241 3FNVG 6159 1998 1 2.762 0.898 0.268 0.487 2.762 0.096 10.647 0.864 0.303 6.963 0.086 0.321
242 NDB 6211 1998 1 104.766 339.407 58.546 0 104.766 2.49 365.076 55.948 0 121.281 3.186 3.584
243 ZCOI 6211 1998 1 7604.541 12959.25 2635.595 2972.661 7604.541 250.888 10127.604 1458.553 1403.075 5358.984 326.024 64.355
244 3SFGD 6324 1998 1 7.634 8.716 2.023 0 7.634 0.115 19.572 4.894 0 58.519 0 0.608
245 AMIC 6324 1998 1 556.887 786.434 76.733 20.736 556.887 35.291 864.116 15.822 19.577 704.816 50.684 11.877

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
246 ALFA 6331 1998 1 36.06 19.979 11.096 0.338 36.06 14.307 30.557 9.126 0.335 32.669 13.26 1.055
247 VTA 6331 1998 1 504.641 760.553 200.283 103.751 504.641 31.704 653.098 162.603 114.773 472.908 6.303 27.908
248 FNF 6361 1998 -1 15273.6 650.3 3190.2 6686.2 261.2 502.3 18098.3 699.3 3445.5 7275.4 359.5 493.5
249 STC 6361 1998 1 26.502 27.521 2.983 5.059 26.502 0.152 21.671 2.455 5.438 15.436 0.458 0.193
250 3HDSGE 6794 1998 -1 10.418 4.953 4.483 47.708 0.434 0.888 14.037 1.343 6.977 22.564 0.022 0.384
251 BLM 6794 1998 -1 76.584 20.435 2.586 133.814 2.436 4.784 137.605 33.51 2.585 166.625 4.431 8.744
252 3POMH 6794 1998 -1 2249.936 131.964 483.111 2343.062 183.1 110.062 1671.256 301.887 418.902 1985.455 11.612 24.727
253 CD 6794 1998 -1 8388.339 1074.94 1561.863 7074.559 234.789 270.595 8169.639 1338.08 1594.308 6675.932 32.605 233.268
254 OLS 7363 1998 1 141.619 283.287 9.557 62.136 141.619 0.192 404.729 14.663 85.967 235.038 1.263 20.96
255 MAN 7363 1998 1 11716.489 7249.689 2438.161 3729.91 11716.489 66.941 6937.727 2525.959 3263.284 11183.641 0 309.955
256 YHOO 7370 1998 -1 5.928 61.135 0.011 81.733 49.748 10.519 6.724 76.713 0 92.497 6.909 3.107
257 IMRS 7370 1998 -1 84.864 5.391 6.329 83.25 0.33 8.359 96.73 6.999 5.038 80.613 0.026 2.183
258 BBOX 7370 1998 -1 2103.7 377.7 386.1 2580.5 976.1 76.2 2289.4 343.7 392.1 2430.8 95.3 80.7
259 DTLN 7370 1998 -1 2549.808 470.532 537.401 2652.686 55.942 133.083 1756.083 360.671 298.852 2087.763 34.578 69.369
260 3ATHMQ 7370 1998 -1 6710.735 2602.815 2202.295 6649.043 246.447 52.442 7748.43 4062.826 2704.444 8985.455 333.143 60.38
261 LCOS. 7370 1998 1 67.952 54.888 12.162 3.966 67.952 11.643 53.356 12.262 6.407 68.437 9.723 3.377
262 CMGI 7370 1998 1 530.815 631.115 182.074 27.575 530.815 75.851 5793 164.923 2472.437 5748.796 149.4 70.2
263 EDGW 7370 1998 1 1279.407 569.237 129.933 116.266 1279.407 28.348 631.801 134.176 109.291 1311.395 10.991 34.095
264 PQE 7370 1998 1 4473.763 486.816 2193.838 4473.763 168.197 19.064 5957.8 419.318 2675.571 5335.56 191.114 0
265 CSRE 7372 1998 -1 21.051 5.313 0 21.792 0.048 0.564 14.303 2.815 0 9.344 0.133 0.605
266 MTMS 7372 1998 -1 31.209 13.816 1.135 23.291 4.085 0.58 54.726 0 34.371 79.948 26.931 1.146
267 ADSK 7372 1998 -1 20.97 3.101 0 58.016 3.835 1.235 18.637 1.792 0 31.056 2.796 0.331
268 3VOXW 7372 1998 -1 92.955 19.026 27.528 81.658 14.442 0.677 98.132 19.252 20.956 77.697 15.455 1.15
269 3ASFT 7372 1998 -1 194.546 1.693 39.334 81.867 2.093 5.909 167.959 1.576 30.941 65.58 1.992 0.847
270 PRGN 7372 1998 -1 570.035 122.404 61.942 497.234 182.473 7.569 805.274 103.76 60.443 667.736 20.216 19.022
271 CGFW 7372 1998 -1 48.045 10.658 0 780.631 298.063 16.793 336.955 70.532 0 9104.279 8005.276 60.809
272 EAII 7372 1998 -1 10064.646 132.401 1046.366 2531.623 36.156 165.698 12494.023 189.301 1183.681 2995.342 58.798 361.024
273 3OBJS 7372 1998 -1 8902 1507 1373 9536 266 211 8995 1402 1190 9285 76 201
274 PEGA 7372 1998 -1 8460 1343 596 11250 467 1031 9468 1843 845 15028 934 1373
275 SEGU 7372 1998 -1 12703.469 2385.911 256.229 22715.198 824.741 1651.489 13126.92 1907.287 116.207 22681.424 757.538 1326.684
276 OMKT 7372 1998 -1 19653 4514 0 25510 5729 3828 17154 893.029 0 30299 2201 2954
277 INFM 7372 1998 -1 6096.295 45587.361 0 62624 96322 2785.6 6577.241 56039.179 4825 90222.377 14443 1184
278 3AUGRE 7372 1998 1 372.037 278.31 56.502 0 372.037 5.707 247.977 42.647 0 376.974 8.89 11.337
279 AMSWA 7372 1998 1 1709.453 234.253 0 0 1709.453 17.576 244.712 6.97 0 1718.129 5.745 0
280 GAEX 7372 1998 1 4175.538 763.294 115.559 0 4175.538 189.623 999.143 155.937 0 4880.443 324.075 0

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
281 INGR 7372 1998 1 46.768 63.462 22.415 0 46.768 9.328 72.515 21.753 0 68.03 29.838 3.022
282 SSNC 7372 1998 1 109.599 130.609 37.607 0 109.599 3.627 67.069 25.595 0 96.723 1.464 2.854
283 3SOFT 7372 1998 1 40.916 31.216 9.946 1.328 40.916 6.125 33.816 12.488 0.772 40.218 6.334 0.521
284 INLQ 7372 1998 1 81025 7160 35131 2 81025 16024 5756 30508 1 77564 14443 1206.128
285 MNS 7372 1998 1 46.26 11.01 15.725 13.933 46.26 11.914 3.098 6.847 8.766 45.874 14.342 4.18
286 3TRDXQ 7372 1998 1 4999.481 4745.748 619.536 21.035 4999.481 73.089 2529.039 1402 678.609 6025.218 54.566 0
287 WALL 7372 1998 1 247.72 361.048 31.628 144.579 247.72 8.406 155.344 25.884 78.64 220.463 57.95 8.373
288 FILE 7372 1998 1 16996.111 9633.962 738.076 220.699 16996.111 0 8481.86 906.301 195.451 18820.31 0 1206.128
289 3FLXI 7372 1998 1 4902 15307 373 2027 4902 135 17838.8 486.5 2462.6 5906.7 91.1 696.3
290 FFTI 7373 1998 -1 15.085 0.366 2.118 8.037 0.104 0.199 20.744 1.52 2.372 8.807 0.074 0.231
291 BI 7373 1998 -1 2.916 0.611 0.947 8.339 0.244 0.025 5.451 1.138 0.926 7.559 0.166 0.059
292 LVLT 7373 1998 -1 605.033 45.458 69.497 285.789 7.634 23.964 669.135 46.972 79.161 310.742 9.434 27.23
293 UIS 7373 1998 -1 537.159 78.488 0 442.916 40.813 30.902 543.107 59.311 0 459.418 32.731 20.861
294 DSLGF 7373 1998 -1 23123 4629 431 39410 1028 5212 25224 5177 451 43453 1564 6200
295 CSPI 7373 1998 1 636.854 542.627 87.566 0 636.854 13.226 612.346 77.577 70.358 640.21 10.676 14.159
296 AZTC 7373 1998 1 0.741 0.643 0.172 0 0.741 0.016 267.346 4.536 0 732.03 16.729 0
297 HX 7373 1998 1 42.562 81.19 12.177 5.435 42.562 0.408 53.529 9.43 2.855 20.207 0.024 0.673
298 IUSA 7374 1998 -1 1222.434 108.132 220.565 1262.05 340.976 13.563 1699.6 141.422 235.73 1446.197 332.843 30.968
299 INOC 7374 1998 1 68.188 5.064 0.54 0 68.188 0 5.783 2.063 0 73.466 0 0.178
300 LTBG 7374 1998 1 42.847 64.025 0.066 0 42.847 1.942 24501.067 2392 4825 22058.981 985 1206.128
301 CSGS 7374 1998 1 1399.2 1451.8 218.7 278.5 1399.2 79 1427.6 209.8 298.6 1454.5 74.3 48.6
302 EQUUS 7948 1998 -1 5.974 0.318 0.087 27.449 0.484 0.081 5.711 0.205 0.133 22.484 0.548 0.037
303 FGRD 7948 1998 1 2956.906 5076.909 215.7 687.312 2956.906 3.056 6052.684 317.58 797.45 4346.842 4.201 311.978
304 CCRIQ 7990 1998 -1 33.686 13.398 0.726 45.947 23.511 1.276 10.674 1.669 0 5.884 1.757 -0.043
305 ELSO 7990 1998 1 49.865 46.277 8.189 13.202 49.865 0.339 61.111 11.2 17.336 66.202 0.368 2.273
306 CTEN 8051 1998 -1 5.711 1.487 0 18.08 0 0.082 2.39 3.087 0 66.15 0 0.201
307 NHC 8051 1998 1 4595.521 6070.568 239.42 1295.878 4595.521 218.194 5911.122 267.062 1306.667 4579.356 201.296 142.355
308 LABS 8071 1998 -1 3.403 0.758 1.525 15.876 0 0.279 2.928 0.303 1.874 10.699 0 0.116
309 LABS. 8071 1998 -1 920.582 249.792 240.162 1218.448 86.03 43.067 1030.663 324.848 277.601 1489.311 159.751 35.261
310 AMS 8090 1998 -1 9.081 4.78 8.12 44.361 0.164 5.994 24.191 7.152 5.476 27.558 1.06 4.381
311 ASGR 8090 1998 1 261.749 310.602 67.741 47.27 261.749 12.432 207.877 40.197 25.6 169.786 11.372 3.42
312 NRES 8711 1998 -1 1013.183 1589.73 0 3539.344 84.249 29.487 1025.056 1431.092 0 3048.034 32.949 17.138
313 TTEK 8711 1998 1 1153.421 148.136 236.849 0 1153.421 92.288 284.091 243.145 0 1189.864 85.701 0
314 IBP 2011 1999 -1 704.44 168.513 105.45 512.319 5.76 21.61 749.201 158.28 119.498 559.257 4.172 27.62
315 1970B 2011 1999 1 51.477 45.993 4.086 6.974 51.477 1.938 28.638 2.626 8.036 42.134 3.226 3.287

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
316 SEB 2011 1999 1 1014.304 1066.451 256.477 18.45 1014.304 33.923 1671.925 338.328 38.207 1356.042 40.722 29.486
317 HRL 2011 1999 1 1176.256 232.761 91.016 120.939 1176.256 24.429 679.989 114.839 102.748 1277.165 5.796 188.019
318 WEAR 2320 1999 -1 329.807 204.076 0 2398.703 12.88 0 332.46 197.226 0 2431.153 9.213 0
319 SPOR 2320 1999 -1 4584.976 39383.705 22.5 57982.699 3751.337 261.2 7392.073 64144.644 34.9 93169.932 2201 2954
320 PLANQ 2522 1999 -1 619.143 167.966 0 602.738 78.091 10.343 1030.663 275.62 0 1511.485 3.329 5.135
321 MITY 2522 1999 1 178.945 145.659 5.688 0 178.945 3.141 181.322 16.385 0 333.335 1.139 1.4
322 BOTX 2670 1999 1 11.425 12.268 4.493 0 11.425 1.248 11.028 3.368 0 9.147 2.156 0.252
323 WYNT 2670 1999 1 661.207 1804.102 295.572 246.671 661.207 8.718 2007.102 299.108 195.791 643.687 7.631 12.542
324 BPIE 2673 1999 -1 3.749 11528.999 70.5 30.13 0.045 261.2 17154 226.927 3445.5 8916.705 0 209.874
325 EPTG 2673 1999 1 33.088 32.817 7.638 0.774 33.088 0.291 40.934 10.154 0.787 39.667 0.406 3.287
326 SG 2750 1999 1 6669.572 5649.56 138.07 80.564 6669.572 18.311 5555.174 71.283 70.561 8997.141 31.435 997.843
327 3MAGRQ 2750 1999 1 2746.7 3961.5 533.7 513 2746.7 148.9 4366.8 550.2 546.3 3043.3 143.9 180.9
328 CIMA 2834 1999 -1 96.949 27.379 8.786 90.658 9.741 7.719 109.675 34.801 14.945 109.897 9.645 4.893
329 NSTK 2834 1999 -1 8.489 97.804 3.468 115.997 0.21 6.412 9.463 109.611 3.122 131.034 1.496 19.777
330 DEX 2834 1999 -1 1252.904 173.461 89.43 766.962 32.414 22.482 1328.094 191.546 98.01 813.577 130.979 15.582
331 ZONE 2834 1999 1 345.376 464.82 77.387 12.796 345.376 13.213 393.629 88.567 13.557 336.466 11.832 7.587
332 ALO 2834 1999 1 19121.118 29308.227 432.044 874.964 19121.118 1893.682 26755.233 439.871 905.508 19326.121 1631.725 1210.845
333 IG 2834 1999 1 2060.599 9807.363 612.52 1243.153 2060.599 76.209 11645.021 623.961 1570.504 2458.567 72.986 16.619
334 IMCL 2836 1999 1 11.661 4.141 1.275 0.174 11.661 0.201 2.603 0.313 0.085 15.44 0.165 0.143
335 GERN 2836 1999 1 144.007 50 7.22 1.59 144.007 0.271 41.864 4.789 1.96 151.395 0.237 3.565
336 ATISZ 2836 1999 1 173.408 215.874 50.359 59.608 173.408 2.053 246.047 57.768 64.044 185.921 4.555 7.074
337 SERO 2836 1999 1 3799.374 676.734 200.062 98.674 3799.374 1280.541 1007.795 164.923 131.004 4674.249 0 76.746
338 BNET 2870 1999 1 511.3 396.9 67.5 34.3 511.3 40.8 419.7 69.8 41.3 536.8 36.7 37.7
339 3IGNE 2870 1999 1 170.959 421.496 5.112 56.021 170.959 9.701 587.385 9.118 68.105 231.696 10.748 53.12
340 GSE 3081 1999 -1 13.392 3.058 2.772 19.27 0.319 2.82 23.919 8.651 1.381 189.462 0.041 7.47
341 3SWTX 3081 1999 1 469.077 41.323 289.007 3.379 469.077 77.286 40.072 339.533 2.101 458.676 10.989 0.826
342 NWSW 3312 1999 -1 29.178 270.148 0.922 420.119 8.116 5.055 30.228 283.437 0.476 424.5 91.1 2.294
343 RESC 3312 1999 1 270.759 212.468 31.34 18.461 270.759 6.775 240.42 33.98 19.588 75.951 3.728 2.086
344 CAST 3320 1999 -1 27.531 5.671 13.283 32.467 0.67 0.788 23.139 4.795 9.659 26.622 0.379 0.679
345 AHNCQ 3320 1999 1 1685.585 3357.757 266.059 270.239 1685.585 138.395 3675.132 307.732 281.404 1641.94 145.267 100.125
346 CMI.3 3531 1999 -1 1645.948 261.682 265.83 6711.813 2904.595 112.681 2081.733 303.298 356.946 7134.847 1288.884 213.351
347 GEHL 3531 1999 1 8807.039 568.98 5947 0 8807.039 8775.291 590.816 7943.485 0 8610.497 8588.245 0
348 HIRI 3559 1999 -1 1073.09 228.46 304.9 1058.928 2.26 60.544 1836.871 361.774 519.189 3405.517 1859.377 53.686
349 TDSC 3559 1999 1 143.87 105.541 22.65 0 143.87 8.317 91.432 20.686 0 134.713 59.58 1.739
350 NXWXQ 3576 1999 -1 1343.817 262.484 194.926 779.39 27.024 39.934 1359.676 271.677 234.661 835.674 21.704 35.662

Figure 18 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
351 ZOOXQ 3576 1999 1 1540.2 603.4 150.2 0 1540.2 192 579.2 172.8 0 2144 702.6 20.8
352 3COPY 3577 1999 -1 0 0.074 0 69.894 0.52 0.799 0.192 0.05 0 66.894 0.52 0.263
353 DMTI 3577 1999 -1 132.619 21.06 22.969 89.612 2.405 2.366 107.888 10.549 14.275 46.168 0.675 1.518
354 MTLG 3577 1999 -1 2362.8 526.3 440.7 3491.7 930.7 201.5 1756.1 438 239.6 3273.5 764.6 155.9
355 3SOCR 3577 1999 1 830.949 62.98 574.754 0.492 830.949 0.369 63.014 608.434 0.176 936.332 20.686 1.172
356 MXIP 3577 1999 1 126.353 85.536 15.521 23.392 126.353 0 64454.565 27681.2 5101.3 61686.473 1881.009 1184
357 3MITK 3577 1999 1 3085.9 1772.4 417.2 247.7 3085.9 936.8 1711.9 380.7 253.4 2993.5 210.7 96.4
358 TLXN 3578 1999 -1 134.64 57.157 1.033 124.308 3.183 7.742 385.758 20.376 0.007 220.116 11.249 102.453
359 CMIV 3578 1999 -1 465.87 148.688 108.178 390.539 17.106 26.124 388.294 88.515 129.049 348.844 12.237 36.024
360 HYC 3578 1999 -1 5681.7 594.2 610.6 2811.3 499.9 193.5 5957.8 519.5 612.5 3192.6 530.8 155.3
361 ASPE 3578 1999 1 767.65 1252.904 173.461 89.43 767.65 33.025 1328.094 191.546 98.01 695.283 34.62 15.582
362 SATC 3621 1999 -1 73.175 622.059 1.312 859.143 62.07 0.736 56.364 543.141 1.284 747.911 10.489 104.052
363 EXX.A 3621 1999 1 19.785 0.589 3.477 6.371 19.785 0.862 0 1.024 0 18.2 0.015 0
364 TNB 3640 1999 -1 1052.628 254.177 0 4302.806 251.116 469.789 1806.6 498.708 418.902 3492.061 324.075 224
365 HUB.B 3640 1999 1 131.002 316.37 11.685 0 131.002 1.522 347.524 12.414 0 150.896 1.672 29.393
366 ACRO 3663 1999 -1 2236.2 212.6 55.9 996.5 99.4 105 1468.7 142.1 39.6 788 79.3 52.4
367 SNR 3663 1999 1 1286.106 1657.511 92.99 340.818 1286.106 66.973 1778.991 70.086 381.022 1391.211 73.359 147.463
368 RMTR 3674 1999 -1 536.721 113.729 105.488 383.697 0.929 4.467 62.508 178.862 79.161 25.516 499.9 16.619
369 3HDTC 3674 1999 1 75.264 9.998 0.53 1.372 75.264 1.956 140.032 1.851 2.828 113.466 6.897 12.876
370 MFCO 3679 1999 1 429.642 26.468 383.526 0 429.642 2.049 91.103 1036.676 0 1160.605 5.279 3.906
371 EMA 3679 1999 1 23.969 78.476 12.135 0.069 23.969 3.436 117.489 503.424 87.1 2404.594 152.6 59.038
372 WGO 3716 1999 -1 5.559 2.016 0 10.781 0.231 0.094 117.489 127.644 18.107 187.539 0.166 0.758
373 COA 3716 1999 1 20.716 15.608 1.326 1.814 20.716 0.136 20.73 1.427 2.779 24.89 0.099 1.567
374 ATK 3760 1999 -1 525.203 43.959 17.378 548.719 13.264 45.804 384.448 33.404 14.81 406.43 42.527 37.052
375 ORB 3760 1999 1 349.646 146.918 46.997 0 349.646 25.636 184.929 37.981 0 439.621 84.214 4.527
376 FLIR 3812 1999 -1 27.586 5.466 3.585 74.014 3.575 4.428 54.983 15.141 5.29 305.42 151.636 9.351
377 NOC 3812 1999 -1 184.929 37.981 0 439.621 84.214 4.527 19.902 52.589 0 413.09 3.329 193.5
378 EDO 3812 1999 1 79.222 72.925 13.741 6.404 79.222 4.062 67.388 12.434 8.384 77.372 3.014 0.383
379 RTN 3812 1999 1 490.091 190.355 54.52 18.747 490.091 2.168 203.835 76.537 31.141 538.237 8.814 38.91
380 EW 3842 1999 -1 2.139 0.509 0.057 5.682 0.089 0.407 6.832 2.537 0.051 6.605 1.293 0.6
381 BMET 3842 1999 -1 1311.8 1510.9 375.4 6263.7 917.3 1095.3 1488.6 1475.7 13.3 6109.7 164.1 1141.1
382 INVN 3844 1999 -1 19.278 7.903 0 79.25 20.343 3.335 228.6 89.038 0 403.174 183.487 9.661
383 DGTC 3844 1999 1 271.693 316.447 154.412 0.666 271.693 13.28 843.661 294.683 1.901 532.375 3.729 7.849
384 SHFL 3990 1999 -1 34.753 200.977 1.734 438.283 0.089 0.732 27.543 155.432 1.845 315.767 35.316 1.464
385 MIKN 3990 1999 -1 209.336 98.857 342.853 713.671 22.559 9.456 335.456 195.598 456.009 1063.487 40.074 14.728

Figure 18 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
386 YRKG 3990 1999 1 174.721 184.497 52.631 0 174.721 4.103 225.3 77.418 0 321.082 6.821 32.619
387 ZSCO 3990 1999 1 107.901 117.352 10.314 1.491 107.901 0.856 143.644 17.808 1.764 132.583 1.182 9.395
388 3PERF 3990 1999 1 139.453 130.405 16.161 16.792 139.453 1.53 101.536 10.518 12.491 180.828 1.301 1.317
389 IGCA 3990 1999 1 5098 9681 149 73 5098 199 8468 155 68 4531 200 460
390 CEXP 4213 1999 1 186.91 118.228 46.346 0 186.91 98.259 114.644 18.31 0 146.921 91.081 4.134
391 IRNE 4213 1999 1 24627.174 24185.243 4172.087 4266.896 24627.174 3383.757 22027.068 3459.356 3396.212 21551.182 2979.826 1573.288
392 PAA 4220 1999 -1 18.526 2.516 2.697 23.253 1.446 1.878 10.689 134.412 104.431 1083.963 22.854 143.746
393 IRM 4220 1999 1 114.316 64.017 19.902 40.056 114.316 9.158 94.509 20.352 22.942 114.559 20.26 1.146
394 HPAC 4581 1999 -1 107.758 24.553 30.123 112.633 0 2.882 111.414 20.212 30.891 108.867 0 4.935
395 ITSW 4581 1999 1 73.525 97.934 20.05 18.238 73.525 12.85 98.407 27.561 23.896 82.251 13.584 1.977
396 CMLS 4832 1999 -1 319.969 66.526 67.244 364.921 5.603 21.602 336.956 80.594 65.805 359.208 6.649 13.535
397 CXR 4832 1999 1 55.386 163.573 15.072 15.07 55.386 1.05 174.063 21.378 8.66 62.149 2.627 0.511
398 3POWR 4991 1999 1 326.233 108.355 10.896 0.314 326.233 42.141 39.162 11.955 0.361 370.345 38.137 4.549
399 USEY 4991 1999 1 229.995 139.772 24.332 12.635 229.995 3.629 245.307 40.239 38.616 918.014 593.194 11.657
400 ANIC 5063 1999 -1 1255.304 171.931 200.382 1285.326 162.611 67.713 1583.696 243.643 218.03 1312.848 27.485 116.933
401 HMSI 5063 1999 1 525.293 292.642 15.756 3.125 525.293 11.179 255.744 12.964 2.801 520.01 11.503 3.979
402 PBSI 5110 1999 -1 12.086 1.692 4.738 28.979 0.507 0.505 11.312 2.241 3.727 22.916 0.401 0.908
403 3BCTI 5110 1999 -1 98.752 7.636 37.67 165.227 7.576 1.921 117.008 13.032 41.939 181.131 4.737 7.765
404 MDRX 5122 1999 -1 903.937 204.411 0 781.625 92.32 36.362 867.469 199.303 0 696.604 77.176 38.91
405 9956B 5122 1999 1 723.651 1200.945 180.929 117.058 723.651 61.921 1261.171 150.015 117.681 678.984 55.545 46.778
406 DBRN 5621 1999 -1 1376 120 0 8202 410 0 1473 276 0 8396 336 0
407 CTR 5621 1999 1 3865.576 648.817 417.2 0 3865.576 3689.773 1437.8 1878.1 59.8 6388 92.8 167.1
408 GES 5651 1999 -1 2165.453 565.604 736.019 2747.217 19.098 382 3483.83 1014.22 1533.89 10161.04 0 81.64
409 SMRT 5651 1999 1 97.627 82.242 28.077 0 97.627 63.592 128.607 42.332 4.574 245.184 11.958 1.942
410 LECH 5700 1999 -1 563.684 773.462 0 1675.685 29.874 4.182 553.496 524.572 0 1347.702 54.771 5.659
411 FRAE 5700 1999 1 47.692 108.914 10.553 18.082 47.692 0.208 116.329 2.759 21.185 59.509 11.968 0.966
412 GGUY 5731 1999 -1 341.928 38.133 52.828 236.96 8.799 15.692 332.88 28.024 48.835 256.727 8.855 20.239
413 TWTR 5731 1999 -1 264.508 32.684 2.585 1543.616 58.087 115.062 462.368 13.894 16.132 2005.127 77.062 44.743
414 ITN 5731 1999 1 3652.671 2195.293 597.644 10.781 3652.671 286.354 2017.918 503.424 56.566 3597.532 268.827 64.191
415 ULTE 5731 1999 1 361.082 316.666 56.816 44.548 361.082 2.402 287.362 56.245 41.313 359.415 3.447 4.754
416 WHENQ 5735 1999 -1 212.05 47.512 68.283 383.171 48.354 9.202 118.583 24.882 50.01 248.445 3.301 8.23
417 HAST 5735 1999 1 688.903 593.044 35.614 53.194 688.903 60.947 310.754 17.248 104.649 708.495 26.96 96.636
418 AUBN 6020 1999 -1 219.472 120.758 69.91 492.1 50.378 28.782 118.659 69.94 42.723 261.512 10.911 1.091
419 NBAK 6020 1999 1 80.589 64.68 20.192 0 80.589 12.697 21.696 9.168 0 80.002 2.135 1.015
420 CEBK 6020 1999 1 167.788 132.854 16.985 9.838 167.788 1.491 101.649 16.138 4.355 147.228 1.544 4.699

Figure 18 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
421 SRCE 6020 1999 1 59961 5697 34486 15 59961 2111.2 5793 37750 14 63503 2201 2954
422 FFWC 6035 1999 -1 43.959 7.856 0.952 37.784 0.024 3.305 48.934 7.216 1.073 40.299 0.234 0.568
423 FFBZ 6035 1999 -1 1319.331 149.765 0 416.024 4.008 10.319 1335.66 140.451 0 399.262 1.474 6.538
424 3AMMBQ 6162 1999 -1 75.263 8.605 9.545 152.147 0.636 8.747 38.362 3.715 3.874 122.081 0.377 1.302
425 ICII 6162 1999 -1 11408.8 483.5 70.5 18241.5 608.5 2111.2 12421.4 609.4 77.3 19784.4 538.3 1879.3
426 GCAP 6211 1999 -1 1369.471 147.348 213.877 1503.174 67.818 96.966 1869.082 154.501 145.046 1617.015 53.79 63.123
427 PLCC 6211 1999 1 349.646 146.918 46.997 0 349.646 25.636 184.929 37.981 0 439.621 84.214 4.527
428 LGAM 6282 1999 -1 415.263 82.698 118.214 716.722 155.486 20.998 484.797 76.099 44.552 578.04 7.85 24.608
429 AMPH 6282 1999 1 78.376 5.546 55.772 0.967 78.376 0.739 6.167 58.791 0.022 80.448 2.255 10.386
430 CI 6324 1999 -1 214.525 31.257 121.902 215.475 8.116 5.717 228.248 27.972 105.195 192.9 10.045 3.473
431 AET 6324 1999 1 1839.211 2331.134 347.515 181.594 1839.211 299.534 1854.461 354.026 205.152 1283.029 147.536 139.01
432 ISNS 6794 1999 1 419.395 475.292 73.81 51.81 419.395 4.662 466.602 81.33 61.144 441.656 8.629 7.157
433 3NVIC 6794 1999 1 646.278 725.884 182.242 132.519 646.278 47.799 817.509 194.4 139.179 684.028 44.353 43.228
434 TCI 6798 1999 -1 14.56 3.282 0 18.488 11.901 0.063 15.777 2.084 0 7.06 3.961 0.005
435 CMO 6798 1999 -1 28.062 5.546 11.44 34.704 1.784 1.788 34.823 6.02 4.005 32.433 3.361 4.585
436 TCO 6798 1999 -1 3237.248 153.032 176.245 1277.791 187.619 30.136 4438.383 220.676 239.846 1617.717 13.616 57.656
437 HMT 6798 1999 1 19.22 4.248 1.004 1.524 19.22 0.037 1.774 0.388 1.009 8.88 0.042 0.091
438 BRE 6798 1999 1 224.487 228.783 38.315 32.602 224.487 6.823 234.829 36.039 18.138 310.309 2.521 13.052
439 RFS 6798 1999 1 4892.116 5400.717 903.177 783.2 4892.116 83.08 4459.695 675.233 660.601 4602.202 66.296 166.891
440 LGN 7011 1999 1 189.696 60.172 13.805 12.385 189.696 0.572 84.825 12.863 18.567 209.055 0.263 7.976
441 PDQ 7011 1999 1 872 1505 138 343 872 0 1462.1 109.2 425.3 1040.7 79.5 67.8
442 AVSV 7359 1999 -1 84.688 17.245 260.519 2.196 232.975 13.563 361.628 65.481 15.415 172.697 4.384 5.055
443 RWY 7359 1999 1 346.188 117.489 4.069 0 346.188 10.186 139.423 2.448 0 336.458 15.056 3.543
444 HWCR 7363 1999 -1 441.195 97.231 0 473.411 25.03 25.007 230.04 41.719 0 348.816 10.489 0.753
445 3SCBI 7363 1999 1 404.172 26.941 194.717 0.564 404.172 18.899 30.161 218.36 0.077 413.09 1357.244 768
446 CPLXQ 7370 1999 -1 0.22 0.353 0 1.317 0.457 0.007 0.158 0.026 0 2.935 0.012 0.095
447 BBOX 7370 1999 -1 32.242 23.288 4.264 248.739 5.924 7.949 16.487 69.94 18.567 31.056 1.385 12.926
448 ONHN 7370 1999 -1 89.384 5.127 0 414.669 12.746 117.972 117.489 4.069 0 346.188 10.186 27.824
449 3GEEK 7370 1999 1 450.855 135.653 38.938 0 450.855 86.292 104.173 26.692 0 199.952 55.228 11.189
450 ETAD 7370 1999 1 16.829 16.931 5.004 3.413 16.829 0.242 26.829 3.565 4.505 16.688 0.358 0.278
451 3SRCM 7370 1999 1 458.01 532.061 158.873 24.125 458.01 58.889 501.177 152.187 19.397 835.64 34.253 10.713
452 EDS 7370 1999 1 26.919 35.162 1.332 30.597 26.919 0 7.156 1.226 0 36.986 0.235 0.131
453 IFXCV 7370 1999 1 105.072 116.746 13.306 44.938 105.072 1.108 106.761 22.443 54.046 116.384 1.173 2.552
454 3TSCN. 7370 1999 1 970.245 4107.434 166.808 274.995 970.245 34.399 3874.672 165.527 245.477 947.922 15.478 52.605
455 LU 7370 1999 1 56504.398 51029.04 8479.335 4633.691 56504.398 1994.216 64454.565 9156.365 4911.444 63147.546 1881.009 2585.845

Figure 18 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
456 EVIS 7372 1999 -1 11.028 3.368 0 9.147 2.156 0.252 7.963 2.899 0 5.141 1.257 0.023
457 MDSI 7372 1999 -1 3.219 0 0 11.042 0 0.401 174.063 25.132 0 234.976 37.504 8.177
458 EPIC 7372 1999 -1 19.153 5.978 0 15.546 0.458 13.54 72.44 299.634 0 274.968 0.104 26.855
459 TMBS 7372 1999 -1 33.57 17.525 0 25.448 0.569 0.638 34.895 14.631 0 21.564 0.775 0.482
460 YND 7372 1999 -1 113.287 13.761 0 28.376 1.69 0.739 272.926 39.738 0 98.727 0.836 1.91
461 3COVR 7372 1999 -1 36.954 10.963 9.546 43.054 7.177 1.151 477.7 36.039 10.004 207.818 186.37 10.083
462 IINT 7372 1999 -1 72.759 16 14.566 75.808 0.274 2.051 81.045 15.57 20.666 91.784 0.778 3.529
463 DLVAZ 7372 1999 -1 275.76 5.656 31.887 81.842 2.231 0.369 269.739 4.883 39.169 63.816 1.503 2.439
464 3VETX 7372 1999 -1 103.254 19.159 30.898 89.242 1.127 3.295 60.085 15.985 18.018 50.262 0.432 1.697
465 SEGU 7372 1999 -1 102.701 21.595 0 114.982 2.26 9.503 106.023 11.864 0 95.386 2.4 3.304
466 MCTR 7372 1999 -1 361.628 65.481 15.415 172.697 4.384 5.055 290.856 38.455 7.314 79.907 0.702 2.629
467 3IPLYE 7372 1999 -1 108.885 13.357 0 217.668 1.529 76.376 132.488 1.318 2.779 631 8.839 3.5
468 3TCSI 7372 1999 -1 475.559 83.235 68.777 375.766 15.153 20.038 461.137 87.401 56.123 342.423 10.728 20.531
469 SDRC.1 7372 1999 -1 995.418 96.332 123.967 1851.116 38.85 25.281 1000.131 82.628 104.319 1794.286 75.136 12.78
470 MSTR 7372 1999 -1 988.5 169 65 2921.6 453.4 42.6 332.463 38.45 0 2636.881 3.926 168.706
471 3GLOB 7372 1999 -1 1451.675 215.08 143.335 3220.193 299.586 37.805 952.312 123.164 41.632 3046.783 429.63 70.602
472 3LHSPQ 7372 1999 -1 33875 3131 3026 55798 1051 1101 29723 3116 3382 57100 851 1184
473 THQI 7372 1999 1 14.827 3.048 1.517 0 14.827 0 5.559 2.016 0 10.781 0.231 0.094
474 3TSSW 7372 1999 1 28.503 9.648 3.837 0.381 28.503 2.369 32.627 6.871 0.514 69.538 0.098 12.477
475 3UNFY 7372 1999 1 1070.978 668.242 168.653 0.804 1070.978 16.59 700.07 199.267 2.745 1247.297 64.563 275.081
476 QNTSQ 7372 1999 1 27.872 10.24 0.501 1.377 27.872 0.445 10.882 1.612 2.129 27.37 0.502 10.219
477 PTEC 7372 1999 1 159.079 255.974 23.719 7.698 159.079 31.894 262.236 24.099 7.006 162.912 25.4 13.433
478 DOCC 7372 1999 1 13.443 2.708 0.619 8.205 13.443 0 7.404 1.48 6.688 13.448 0 0.086
479 MNS 7372 1999 1 73.088 63.618 11.539 28.737 73.088 2.322 65.049 12.585 21.801 68.126 1.12 4.729
480 MLOG 7372 1999 1 78.859 159.377 18.275 35.552 78.859 0 182.344 33.025 34.816 88.067 0.742 2.874
481 LGTO 7372 1999 1 123.915 116.767 24.895 45.459 123.915 15.442 266.964 20.376 3.531 274.968 16.563 9.031
482 SSAXQ 7372 1999 1 470.946 778.524 51.859 68.275 470.946 8.723 897.56 70.406 78.143 586.201 13.71 24.699
483 JKHY 7373 1999 1 206.995 282.331 99.495 27.534 206.995 53.168 300.528 55.316 78.311 243.809 56.748 4.93
484 AZTC 7373 1999 1 3091.162 8972.161 565.525 393.38 3091.162 617.436 9958.956 503.399 449.989 3142.151 586.809 63.014
485 THLC 7389 1999 -1 27.938 8.441 2.377 30.47 0.431 2.283 53.82 12.864 0.106 59.899 0.553 14.149
486 MSGI 7389 1999 -1 348.182 12.213 3.603 577.9 22.132 128.842 458.056 18.194 3.305 553.397 13.385 55.71
487 AAC 7389 1999 1 231.068 34.699 3.988 0 231.068 0.201 14.307 3.885 0 250.806 5.872 48.828
488 ANLT 7389 1999 1 720.989 1124.305 324.654 105.032 720.989 53.566 1032.79 312.123 38.001 695.974 61.24 17.264
489 RENT 7822 1999 -1 9564.412 2351.379 1503.751 10545.73 239.466 383.255 7343.248 776.451 1412.997 9828.51 228.546 710.62
490 3PNEC 7822 1999 1 48.524 32.122 3.603 0 48.524 3.644 28.151 3.24 0 17.935 1.129 1.334

Figure 18 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
491 ESI 8200 1999 -1 32.795 5.58 0.702 32.248 0.228 7.135 296.189 18.31 87.1 160.585 0.554 4.699
492 POSO 8200 1999 1 9.853 6.739 1.214 0.199 9.853 3.132 3.963 0.144 0.007 2.708 0.35 0.007
493 CLCXQ 8200 1999 1 94.714 59.734 14.62 10.519 94.714 0.777 63.299 12.125 8.109 85.058 0.451 1.102
494 3RUSS 8200 1999 1 1693.433 2542.302 31.745 31.543 1693.433 37.385 2333.805 325.509 519.189 2135.9 71.5 148
495 CMDCQ 8300 1999 -1 263.689 2022.979 1.167 2872.945 78.091 55.972 309.306 2264.418 1.698 3182.181 4501 60.809
496 3ASLC 8300 1999 1 963.982 213.472 523.01 0 963.982 41.737 132.176 611.846 0 960.577 54.441 0
497 AERS 8711 1999 -1 62.898 387.338 0 436.886 9.37 0 81.53 564.483 0 614.715 7.657 0
498 EACO 8711 1999 1 16249.22 8054.528 5036.194 3290.118 16249.22 0 9139.547 6718.654 3341.791 17224.681 434.369 476.738
499 MGLN 8741 1999 -1 48.972 9.134 13.56 116.861 4.911 4.632 57.646 7.824 10.743 103.812 2.647 2.75
500 PMCOQ 8741 1999 -1 6190.833 606.212 268.761 6482.935 858.967 391.47 6034.747 479.716 215.998 5588.938 406.251 112.394
501 3CASL 8741 1999 1 3053.505 287.811 1628.997 67.318 3053.505 502.3 804 2264.418 0 3282.187 449 100.865
502 ADPI 8741 1999 1 63.298 66.129 4.221 895.192 63.298 2.869 86.251 6.673 189.866 73.215 1.896 2.227
503 3EBTI 9995 1999 -1 622.08 18.641 20.742 104.007 7.983 6.365 51.419 1.427 1.462 387.79 0 64.207
504 ITRNA 9995 1999 -1 3896.302 252.067 429.362 960.539 122.913 50.656 4258.425 408.052 267.775 1103.539 250.458 40.174
505 KCS 1311 2000 -1 224.789 69.49 32.989 208.225 16.616 15.157 301.948 81.276 49.068 248.639 21.236 16.241
506 CHOHQ 1311 2000 1 9.22 1.025 1.241 0.498 9.22 0 4.882 0.844 0.175 4.45 0 0.169
507 PLLL 1311 2000 1 25.508 24.793 3.092 0.917 25.508 1.567 22.304 2.333 1.214 22.558 1.163 1.142
508 EQTY 1311 2000 1 75.968 95.213 10.859 35.206 75.968 2.835 91.779 20.4 52.535 103.889 2.542 32.113
509 DF.1 2020 2000 1 752.858 116.734 39.47 0 752.858 8.704 117.461 37.303 0 932.932 9.558 0
510 DF 2020 2000 1 84.026 47.115 14.534 0.217 84.026 0.125 853.658 275.294 38.616 762.285 119.273 7.909
511 PENX 2040 2000 1 8.065 14.399 2.373 3.771 8.065 0 14.722 2.412 2.708 8.458 0 0.218
512 RVFD 2040 2000 1 9647.716 7656.328 1473.876 757.927 9647.716 0 7757.401 1396.34 680.105 9995.602 0 306.74
513 LNCE 2052 2000 -1 23.225 0.126 0.096 46.435 0.063 0.313 6.832 41.845 4.162 31.982 2.161 0.094
514 KBL 2052 2000 1 9035.15 9430.422 1383.55 1407.961 9035.15 984.064 9431 1449.147 1527.554 10278.354 1164.639 213.387
515 WRNC 2320 2000 -1 38303 10438 6151 38775 3822 2215 33813 9558 7558 48792 13778 2701
516 TOM 2320 2000 1 11.642 12.283 2.244 2.723 11.642 0.83 15.957 2.863 4.493 13.854 0.826 0.104
517 RDA 2731 2000 1 11.645 19.725 3.384 6.389 11.645 0.039 27.535 4.037 8.471 12.911 0.06 0.039
518 SCHL 2731 2000 1 70.664 89.041 4.222 13.474 70.664 13.625 74.248 7.488 5.768 28.902 0 0.692
519 WCS 2761 2000 -1 20.449 5.263 4.917 11.426 0.573 0.067 495.131 46.031 0 676.116 2.261 55.648
520 NEB 2761 2000 -1 228.678 61.344 0 270.773 112.206 20.582 266.073 82.546 0 473.344 320.383 9.048
521 SR 2761 2000 -1 80.87 602.119 0.468 887.712 124.9 0.507 108.427 829.961 1.462 1274.166 19.578 256.135
522 EBF 2761 2000 -1 1170.318 496.129 390.301 1848.16 34.179 68.03 891.395 387.436 274.64 1416.927 44.376 31.743
523 ARNX 2834 2000 -1 1607.079 0 259.127 852.767 34.009 57.921 1993.843 11.633 300.407 1132.677 27.869 63.046
524 3NRDC 2834 2000 1 64.039 56.116 2.869 0 64.039 0.737 71.325 3.085 0 70.943 0.701 10.113
525 3DSCI 2834 2000 1 65.305 13.933 2.699 0.375 65.305 2.79 33.253 3.615 0 49.056 1.244 2.717

Figure 18 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
526 NSTK 2834 2000 1 185.034 302.357 97.606 5.455 185.034 7.479 347.003 135.048 10.707 229.942 11.249 7.877
527 CVM 2836 2000 -1 16.177 1.677 0 42.477 1.194 5.683 41.663 3.652 0 58.803 20.334 8.341
528 BCRX 2836 2000 -1 2.498 0.078 2.508 65.82 3.806 1.105 2.981 0.462 2.706 45.63 4.045 0.701
529 CRXA 2836 2000 -1 106.976 39.508 0 116.78 15.266 10.519 70.736 21.65 0 80.564 13.55 10.948
530 VICL 2836 2000 -1 83.339 27.862 0 462.546 0 39.581 147.595 32.346 0 189.392 0 17.307
531 SUPG 2836 2000 -1 592.42 26.952 9.19 1423.977 13.068 120.854 580.897 20.624 7.805 1163.947 14.988 79.241
532 TKTX 2836 2000 -1 2032.7 231.3 417 1455.2 124.9 54.2 1806.6 207.6 277 1802.1 156.8 49.7
533 CRGN 2836 2000 1 21.053 19.587 2.397 0 21.053 0.634 21.624 2.186 0 24.12 1.043 0.088
534 ANIK 2836 2000 1 14.354 4.715 1.553 0 14.354 0.027 8.327 2.86 0 27.025 1.423 0.647
535 NPSP 2836 2000 1 702.099 846.898 0 109.464 702.099 33.32 1060.417 0 112.104 668.534 33.369 45.459
536 CSON 2836 2000 1 1714.011 2264.313 326.937 168.22 1714.011 105.105 2354.723 514.074 279.785 2525 175.218 157.937
537 3ORGG 2836 2000 1 691.659 1977.011 208.116 226.785 691.659 15.828 1825.169 242.159 137.549 609.42 15.34 27.442
538 HEB 2836 2000 1 3021.761 530.555 36.456 1369.351 3021.761 0 635.08 39.631 431.837 3362.96 0 209.874
539 GRKA 2911 2000 -1 15140 2159 471 34778 0 2041 14955 2392 524 37478 0 2150
540 HWY 2911 2000 1 32.669 30.557 9.126 0.335 32.669 13.26 19.865 5.108 0.054 26.017 13.127 1.464
541 LAF 3270 2000 1 5.39 15.671 2.481 2.117 5.39 0.084 17.151 2.04 1.179 3.993 0.055 0.137
542 USG 3270 2000 1 185.034 302.357 97.606 5.455 185.034 7.479 347.003 135.048 10.707 229.942 11.249 7.877
543 ATI 3312 2000 -1 51.926 15.165 0 52.918 8.842 1.916 52.331 13.893 0 50.276 8.407 4.194
544 3NSTLQ 3312 2000 1 49.748 55.9 0.473 0.445 49.748 1.505 56.754 0.694 0.594 48.793 1.643 3.046
545 NCS 3448 2000 1 432.679 865.2 121.461 40.434 432.679 2.769 802.315 115.159 36.022 409.116 3.944 10.466
546 BBR 3448 2000 1 188.26 325.149 33.309 58.36 188.26 1.112 277.63 32.926 71.368 188.703 2.436 8.608
547 TTC 3523 2000 -1 6.821 1.271 2.095 21.765 0.186 2.001 214.373 64.033 3.129 410.608 5.972 16.315
548 ALG 3523 2000 1 20.599 0.802 0.015 0.998 20.599 4.916 0.247 0.096 0.349 30.628 3.889 0.134
549 CMCO 3530 2000 1 18.862 0.725 0.458 0 18.862 0.655 10.689 2.928 0 62.531 3.694 4.647
550 MTW 3530 2000 1 2.486 5.461 1.128 0.434 2.486 0.074 5.061 1.318 0.475 5.654 0.043 0.202
551 LUFK 3533 2000 -1 18.119 1.123 0 18.913 9.195 3.396 29.345 2.161 0 39.287 32.885 0.887
552 HYDL 3533 2000 1 11.434 2.383 0.781 0.557 11.434 2.471 14.065 3.233 0 24.253 1.56 1.949
553 ATU 3540 2000 1 15.397 70.425 6.791 1.408 15.397 0.489 44.306 6.736 0.887 17.119 0.429 0.256
554 3THMD 3540 2000 1 35.7 41.795 17.234 8.913 35.7 0.052 58.815 19.268 8.438 52.708 0.271 1.233
555 ASTX 3559 2000 -1 63.351 18.832 0 57.777 7.655 6.929 89.716 16.786 0 75.857 4.256 10.386
556 TRKN 3559 2000 1 388.549 442.223 99.067 0 388.549 68.591 451.937 112.267 0 376.96 64.971 12.918
557 IVAC 3559 2000 1 214.254 206.822 69.023 24.914 214.254 18.994 259.961 83.407 22.937 285.63 10.611 14.298
558 SMTL 3559 2000 1 736.011 426.893 85.543 76.139 736.011 10.024 466.44 89.077 82.853 732.03 9.522 19.219
559 WDC 3572 2000 -1 741.412 34.383 144.22 419.683 13.566 79.134 640.7 40.5 96.105 362.463 16.981 22.866
560 RDRTQ 3572 2000 -1 1871.636 202.098 0 1881.615 1364.062 48.119 1873.554 141.64 0 1803.787 1357.244 36.924

Figure 18 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
561 ELX 3576 2000 -1 39.823 8.679 5.176 22.664 4.762 0.487 41.469 10.829 6.028 25.516 4.794 0.738
562 PROX. 3576 2000 -1 34.583 107.1 59.077 252.034 232.561 5.994 59.516 293.391 158.62 523.847 11.021 29.39
563 VNWK 3577 2000 -1 157.487 29.393 3.117 218.587 0.965 19.511 134.32 19.961 0.799 235.699 0 7.817
564 ENCD 3577 2000 1 50681.7 4620 31960.5 2014.9 50681.7 2785.6 14955 42.787 0.083 3142.151 4318 40.227
565 MMAN 3580 2000 -1 1.978 0.808 2.445 25.88 0.054 1.014 6.32 0.555 9.563 12.891 0 1.657
566 ILI 3580 2000 -1 114.38 27.383 0 106.415 14.467 3.528 122.005 16.51 0 93.881 11.079 6.473
567 ANEN 3663 2000 -1 15.552 6.595 0.916 13.076 0.279 0.911 13.873 2.021 0.21 8.277 0.105 0.182
568 CAMP 3663 2000 -1 151.558 34.067 12.657 136.411 30.898 9.502 253.3 60.901 41.686 676.116 10.989 55.648
569 ADAPQ 3663 2000 1 796.952 477.302 145.778 10.062 796.952 33.16 455.683 144.119 16.096 780.781 28.36 256.135
570 AND 3663 2000 1 124.491 97.936 32.818 12.188 124.491 23.432 206.822 69.023 24.914 214.254 18.994 3.861
571 ILXI 3669 2000 -1 207.746 27.171 0 128.554 0.67 3.318 239.631 25.945 0 130.422 1.841 3.432
572 CKP 3669 2000 1 1271.786 523.899 57.092 4.747 1271.786 33.207 571.677 53.312 5.07 1516.681 107.83 135.848
573 UTSI 3669 2000 1 4646.299 5355.337 840.04 974.196 4646.299 581.531 5979.604 922.325 1008.864 5337.661 638.963 334.748
574 ROV 3690 2000 -1 12297.8 1465.2 1283.8 11687.8 901.4 669.5 13006.8 1600.6 1239.9 12815.5 702.5 786.4
575 SMP 3690 2000 1 1.329 2.898 0.352 0.876 1.329 0 3.392 0.537 2.027 2.665 0 0.004
576 HAYZ 3714 2000 -1 679.738 175.794 102.034 879.237 99.698 62.448 301.718 80.862 123.219 757.419 91.659 37.052
577 TEN 3714 2000 -1 696.649 229.714 1.734 3437.11 86.957 482.775 793.071 226.749 461.981 3814.474 149.896 70.2
578 3WMCO 3714 2000 1 419.313 324.494 87.053 0 419.313 6.283 71.862 38.455 0 187.22 2.4 4.549
579 UVSL 3714 2000 1 436.627 1250.604 53.555 207.72 436.627 10.051 1192.546 68.56 179.971 396.234 9.798 15.258
580 TWAV 3823 2000 -1 66.511 11.175 12.084 33.701 1.936 2.288 70.684 11.22 11.929 32.924 1.371 0.331
581 3RVSI 3823 2000 -1 249.843 122.191 37.158 836.746 13.97 113.46 714.938 140.014 50.135 982.585 15.628 217.179
582 ARXX 3825 2000 -1 27.255 6.08 5.301 22.367 0.627 0.444 36.602 7.344 6.634 27.085 0.198 1.104
583 LCRY 3825 2000 -1 214.64 30.695 0 71.212 0.105 0.826 186.198 25.387 0 31.543 0.118 0.941
584 COHU 3825 2000 -1 13.689 1.32 0 139.587 15.649 8.006 11.287 1.948 0 78.775 14.13 0.954
585 DATM 3825 2000 -1 54.428 14.403 0 282.226 2.713 18.196 94.635 19.133 0 236.628 0.554 12.69
586 LTXX 3825 2000 1 107.358 109.177 21.073 0 107.358 17.826 105.51 20.376 0 113.047 25.738 2.178
587 CMOS 3825 2000 1 59.386 42.803 2.185 4.295 59.386 8.536 25.271 1.474 5.773 59.106 5.313 0.873
588 BIO 3826 2000 -1 63.813 2.214 2.32 14.324 0.014 0.432 18.63 0.878 0.699 7.755 0.011 0.178
589 VARI 3826 2000 -1 5286.041 214.084 520.762 1160.51 24.046 138.098 6248.518 214.96 667.514 1481.715 42.55 85.28
590 CPWY 3841 2000 -1 887.87 162.592 0 3163.544 14.299 130.188 1091.293 238.587 0 3402.396 36.869 92.284
591 IJX 3841 2000 1 4.902 5.395 1.178 0 4.902 0.114 19.865 7.722 0 4.746 0.823 0.482
592 POSS 3841 2000 1 498.481 968.761 46.732 0 498.481 26.923 1072.554 48.58 0 535.741 30.514 25.307
593 EMBX 3841 2000 1 352.2 1034.561 4.472 245.186 352.2 4.895 1206.624 4.758 277.453 389.989 5.543 20.914
594 VFOX 3843 2000 -1 1880.194 488.008 75.347 4392.898 168.645 482.775 2481.068 647.128 431.837 4239.054 108.288 294.258
595 3BLLI 3843 2000 1 156.172 480.355 77.405 0 156.172 5.464 519.704 76.974 0 150.401 9.254 1.023

Figure 18 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
596 ABCX 3861 2000 -1 23.12 1.896 0.149 22.419 1.654 0.192 26.392 2.355 0.133 24.981 2.597 0.154
597 LENS 3861 2000 1 125.63 197.568 49.074 4.043 125.63 6.71 268.688 68.345 9.634 182.895 2.703 5.082
598 GMT3 4700 2000 -1 1.488 1.175 0 6.91 0.723 1.02 4.606 1.258 0.24 9.099 0.534 0.498
599 HOLL 4700 2000 1 11994.109 3170.328 2359.204 0 11994.109 1217.623 3569.301 4088.229 0 15072.063 1834.496 52.516
600 CTIX 4700 2000 1 16.095 12.362 1.896 0.196 16.095 1.089 3.227 0.722 0.129 34.568 0.515 0.87
601 GMT 4700 2000 1 11167.552 10378.931 514.963 187.355 11167.552 52.267 2988.487 546.37 186.035 10600.279 2.468 403.142
602 3SNHDE 4813 2000 -1 915.508 19.021 110.276 226.09 7.646 21.086 860.543 15.106 121.458 220.116 0.448 1.802
603 3DTIX 4813 2000 1 17850 13985 687 444 17850 330 7338 1079 491 16548 308 1236
604 EWST 4924 2000 -1 184.075 35.421 35.31 185.693 1.062 9.331 263.046 67.726 80.191 425.944 89.525 29.877
605 RGCO 4924 2000 -1 4459.695 675.233 660.601 4602.202 66.296 166.891 4183.664 669.885 595.071 4437.385 216.743 102.305
606 CMN 5047 2000 -1 225.887 48.126 6.111 83.06 11.082 1.229 297.925 46.692 8.833 98.993 1.199 2.85
607 ABIX 5047 2000 -1 762.808 59.786 47.48 342.549 8.976 25.899 492.108 470.532 0 1610.435 0 13.651
608 ARW 5065 2000 -1 20.045 1.575 0 40.958 0.059 0.123 21.006 2.142 0 40.894 0.175 0.41
609 AVT 5065 2000 1 1217.293 924.381 547.902 15.287 1217.293 124.959 1004.012 546.324 10.846 1142.35 133.35 51.383
610 FLMIQ 5141 2000 -1 1113.408 8944.318 220.1 10441.236 293.115 261.2 1346.743 12096.516 167.983 14050.293 279.225 573.287
611 3089B 5141 2000 1 186.476 156.996 33.98 22.339 186.476 42.081 28.704 16.312 14.87 198.848 65.021 9.268
612 AE 5172 2000 -1 124.979 12.37 10.373 49.812 0.501 4.337 100.715 8.219 9.472 56.688 0.287 1.534
613 ENRNQ 5172 2000 -1 640.493 386.229 0 3395.688 349.477 0 512.886 437.844 0 3440.01 357.321 0
614 DG 5331 2000 1 1289.571 91.716 353.799 0 1289.571 0 96.837 427.989 0 1395.313 60.899 44.743
615 AMESQ 5331 2000 1 3609.207 1917.898 116.433 178.532 3609.207 0 1922.21 166.641 98.851 3492.061 0 242.078
616 KKD 5400 2000 -1 20.744 1.52 2.372 8.807 0.074 0.231 21.586 3.316 1.999 9.617 0.065 0.564
617 0295B 5400 2000 -1 178.505 49.167 0 168.901 6.551 6.402 145.689 52.857 0 140.732 3.329 12.346
618 KR 5411 2000 -1 1659.39 64.134 55.682 8364.059 5624.913 49.783 1646.984 50.818 41.839 9690.528 6467.152 40.227
619 ABS 5411 2000 1 0.451 0.086 0 0.017 0.451 0.006 0.029 0.014 0.187 7.208 0.009 0.094
620 CHRS 5621 2000 -1 543.78 26.691 26.891 1188.128 29.089 96.986 615.26 31.837 32.326 1447.71 34.159 144.939
621 ANN 5621 2000 1 8.603 19.347 0.365 2.712 8.603 0.131 18.071 0.037 1.964 7.141 0.102 0.613
622 GES 5651 2000 1 499.163 20.838 1.4 0 499.163 5.191 23.475 0.856 0 538.701 0.131 11.382
623 URBN 5651 2000 1 706.288 652.419 153.578 0 706.288 206.895 274.982 158.1 277.453 359.957 219.005 286.91
624 GDYS 5651 2000 1 413.683 765.891 10.142 247.8 413.683 11.857 706.411 11.168 197.147 341.691 20.585 13.267
625 FTUSQ 5651 2000 1 3961.077 15918.138 901.649 1895.525 3961.077 80.915 25033.6 1742.8 2931.4 8289 205.5 319.9
626 FRS 5812 2000 -1 72.196 7.73 2.5 50.553 0 4.757 119.94 10.332 5.399 61.261 0 3.276
627 3GRLL 5812 2000 -1 10.685 19.263 0 70.77 7.924 0.941 52.92 24.199 0 133.581 21.946 0.353
628 CHMD 5912 2000 -1 47.931 14.131 6.665 103.496 0.292 2.771 33.29 2.922 10.626 34.487 0.795 5.286
629 CURE 5912 2000 -1 98.836 22.427 37.67 271.677 8.435 1.972 117.008 13.166 41.939 284.977 3.246 7.803
630 JILL 5961 2000 -1 22959 3091 462 35862 2665 1814.9 12495 2845 347 33696 19.052 3032

Figure 18 Continued

Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
631 LVC 5961 2000 1 4.772 5.508 1.61 0 4.772 0.312 4.166 1.546 0 4.752 0 0.049
632 RIGS 6020 2000 -1 42.562 4.44 5.888 11.537 0.278 0.362 59.216 1.03 10.592 14.817 0.785 0.127
633 PBKS 6020 2000 1 65.083 46.926 6.588 6.206 65.083 5.963 12.438 69.579 456.009 104.228 1.693 14.728
634 LEDG 6036 2000 -1 17008 5818 1364 27689 4453 219 15849 5724 1222 25458 4318 146
635 ASBI 6036 2000 1 4.226 5.416 0.369 0.159 4.226 0.052 5.928 0.503 0.106 4.258 0.034 0.093
636 AMPH 6282 2000 -1 413.916 104.811 0 1143.024 619.022 67.515 244.498 39.768 0 392.417 12.617 55.697
637 ECMN 6282 2000 1 110.974 59.703 10.504 25.591 110.974 4.513 77.13 13.865 6.673 71.089 0.741 2.219
638 PHLY 6331 2000 -1 93.558 16.738 14.608 66.728 3.296 1.946 37.587 55.151 18.138 65.58 1.12 49.6
639 PXT 6331 2000 1 70.742 91.108 0 0 70.742 4.179 91.904 0 0 75.267 3.397 3.393
640 CBG.2 6531 2000 -1 51.424 16.059 1.977 27.093 3.397 0.226 40.533 9.457 1.983 16.455 0.031 0.47
641 JLL 6531 2000 1 30680.544 46225.837 2149.136 579.673 30680.544 1101 7922.498 917.474 351.816 19634.279 985 854.376
642 3BIGTQ 6798 2000 -1 0 1.158 0 25.331 0.031 1.311 0 0.615 0 22.895 0.04 0.672
643 WRI 6798 2000 -1 226.872 23.811 0 70.634 0.104 2.373 214.64 30.695 0 71.212 0.105 0.826
644 HCP 6798 2000 -1 91.484 23.96 8.25 237.46 2.533 8.189 175.666 41.794 8.367 2404.594 128.102 16.211
645 CDX 6798 2000 1 295.845 280.976 3.022 39.443 295.845 20.868 237.684 3.645 29.506 260.804 55.795 22.817
646 GREY 7311 2000 -1 148.674 28.608 0.938 230.921 148.189 3.587 224.366 32.855 1.819 192.167 120.661 1.535
647 LEAP 7311 2000 -1 439.547 92.172 85.765 549.048 23.493 14.416 456.666 80.479 89.685 569.457 29.323 15.694
648 TNO 7311 2000 -1 493.106 95.69 14.406 772.832 97.736 14.513 659.219 78.373 13.7 734.824 89.85 16.524
649 BULL 7311 2000 -1 690.811 176.479 122.267 872.005 32.378 13.172 658.535 145.768 91.51 752.653 89.578 9.572
650 HC 7359 2000 -1 31129.357 1798.724 174.035 12291.706 839.908 6171.737 11557.998 1692.488 318.893 17636.82 0 660.526
651 UCO 7359 2000 1 6.426 18.023 1.155 0.375 6.426 0.153 14.527 1.628 0.278 4.718 0.196 0.224
652 KEYN 7370 2000 -1 5.39 0.769 0 3.588 1.631 0.218 4.379 1.021 0 4.348 1.45 0.045
653 EDS 7370 2000 -1 7.154 1.71 0.921 10.871 5.406 0.165 6.432 0.508 0.832 2.831 0.213 0.238
654 CPTH 7370 2000 -1 6.821 1.271 2.095 21.765 0.186 2.001 107.359 11.432 0 236.628 142.9 11.657
655 3CMEE 7370 2000 -1 36.785 1.219 0.118 49.144 0.283 3.478 41.795 1.261 0.125 48.952 0.054 0.758
656 BGNK 7370 2000 -1 19.115 5.822 0 77.28 2.389 19.102 15.318 1.19 0 9.545 0.581 3.265
657 AVCS 7370 2000 -1 167.841 35.148 37.61 101.179 1.158 7.175 196.212 43.649 42.361 216.311 1.793 10.495
658 3ARIS 7370 2000 -1 258.176 75.263 0 170.177 38.905 11.392 219.768 60.424 0 134.787 10.503 5.135
659 EDGW 7370 2000 -1 101.086 720.368 6.922 1337.224 104.766 0.067 122.807 893.029 1.84 1505.796 59.569 90.86
660 3HGRD 7370 2000 -1 1068.859 272.45 275.504 2272.966 53.617 661.391 1009.999 211.722 223.35 2154.029 61.289 250.17
661 DGIN 7370 2000 -1 6009.847 388.083 21.776 4037.223 0 119.905 980.446 321.091 22.047 3614.133 0 64.207
662 3WEBB 7370 2000 1 12522.3 18534.2 4423.4 0 12522.3 2960.9 19226.8 4837.1 0 12700.3 3268.3 768
663 AVRT 7370 2000 1 29.38 24.871 1.703 4.174 29.38 0 26.079 1.71 7.19 38.362 16.464 0.635
664 3ZMBA 7370 2000 1 21.551 23.27 5.902 5.894 21.551 1.484 22.84 5.574 5.859 22.236 1.47 0.813
665 LU 7370 2000 1 21551.182 22027.068 3459.356 3396.212 21551.182 2979.826 24501.067 4154.892 3602.092 24522.55 3395.233 1655.85

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
666 MGEN 7372 2000 -1 0.672 0.315 0 4.252 0.103 0.092 2.981 0.83 0 4.654 0.259 0.127
667 ONXS 7372 2000 -1 18.346 3.4 0.039 26.77 4.532 0.572 24.431 3.763 0 25.797 5.248 1.644
668 GSOF 7372 2000 -1 130.639 12.065 3.26 95.993 5.517 3.534 275.068 29.712 6.241 290.178 11.11 8.187
669 3UNFY 7372 2000 -1 5491.8 952.1 1079.2 5764.1 305.1 227.2 5318.3 860.6 856.6 5300.9 383.3 207.3
670 3EGAM 7372 2000 -1 4626 31960.5 2014.9 50681.7 2785.6 218.194 5275.1 27681.2 1019.5 42710.5 3371.7 1877.2
671 3LTWO 7372 2000 1 96322 45908 9376 0 96322 16024 43138 8540 0 95057 14443 6808
672 FRTL 7372 2000 1 6.561 5.812 0.951 4.363 6.561 0.133 5.661 0.673 4.011 5.777 0.142 0.047
673 EBIX 7372 2000 1 143.497 199.998 33.123 26.371 143.497 0.714 161.707 25.132 26.255 122.26 2.095 1.203
674 REY 7373 2000 -1 186.448 57.777 63.299 195.06 21.469 7.47 186.357 39.663 55.844 166.991 12.611 7.279
675 BVSN 7373 2000 1 209.86 16.613 178.949 0 209.86 4.518 18.06 207.049 0 236.877 1357.244 768
676 FDC 7374 2000 -1 31129.357 1798.724 174.035 12291.706 839.908 6171.737 11557.998 1692.488 318.893 17636.82 0 660.526
677 ADP 7374 2000 1 3621.284 474.414 518.955 0 3621.284 404.442 472.907 692.849 0 3646.261 344.73 0
678 SRCP 7389 2000 -1 447.155 6.272 152.065 247.933 3.704 47.31 458.203 7.759 130.676 213.484 0.012 30.482
679 FMKT 7389 2000 -1 36.974 16.437 0 504.88 6.945 3.019 58.065 6.689 0 367.382 6.925 18.115
680 FNIS 7389 2000 1 7.132 9.218 1.882 0.123 7.132 0.148 10.036 2.366 0.083 8.852 0.459 1.172
681 MTY 7389 2000 1 5050.611 6581.236 193.258 1522.203 5050.611 232.19 6070.568 239.42 1295.878 4595.521 218.194 219.838
682 TMTV 7812 2000 -1 186.357 39.663 55.844 166.991 12.611 7.279 214.373 57.965 47.605 185.038 17 4.242
683 LGF 7812 2000 1 254.7 31.542 21.996 0 254.7 40.677 26.574 4.045 0 112.847 22.088 0.717
684 ROMN 7812 2000 1 153.304 232.307 51.124 1.46 153.304 6.71 400.13 73.589 1.487 253.58 18.449 10.329
685 WWE 7812 2000 1 103.163 82.318 18.506 24.68 103.163 5.158 77.059 14.954 27.398 97.999 1.727 5.885
686 9136B 7830 2000 1 40.255 64.683 0.146 0.669 40.255 0.736 75.705 0.691 0.78 77.849 0.999 16.098
687 AEN 7830 2000 1 14.114 30.666 4.165 1.986 14.114 1.918 28.868 4.135 1.77 11.166 1.903 0.237
688 SLOT 7990 2000 -1 2475.682 345.996 265.644 858.824 25.656 5.461 2433.803 216.002 218.927 646.07 25.918 5.558
689 5530B 7990 2000 -1 6267 443 571 5196 18 200 6664 442 597 5927 25 200
690 HWD 7990 2000 1 246.275 329.974 62.841 32.258 246.275 0.66 499.816 115.958 52.535 452.289 1.944 12.566
691 OCA 8000 2000 -1 284.8 38.078 0 76.841 3.167 1.559 2.844 82.628 0 69.284 0 24.998
692 TLCV 8000 2000 1 6.562 11.764 1.992 0 6.562 0.089 12.887 2.447 0 10.426 0.24 0.193
693 3MHCA 8051 2000 -1 9.566 1.562 1.282 5.581 0 0.023 8.947 0.914 1.787 4.72 0 0.039
694 KIND 8051 2000 1 1944.426 429.869 21.088 295.041 1944.426 257.291 170.657 4.084 97.981 715.035 30.832 361.024
695 3CRHEQ 8082 2000 -1 48.268 3.704 0.533 58.782 1.272 0.829 54.303 7.337 0.631 69.284 1.113 0.003
696 3AHOM 8082 2000 1 691.659 1977.011 208.116 226.785 691.659 15.828 1825.169 242.159 137.549 609.42 15.34 27.442
697 VSIH 8700 2000 -1 10.891 3.016 3.815 8.664 0.115 0.366 9.28 2.712 2.714 6.977 0.105 0.32
698 METG 8700 2000 1 1995.727 3968.555 17.039 744.132 1995.727 468.381 3264.565 12.631 532.407 1083.731 52.283 25.245
699 CBIZ 8721 2000 1 44.577 58.12 15.462 0 44.577 0.365 38.735 12.053 0 37.705 0.535 2.178
700 0131B 8721 2000 1 62.987 58.436 10.633 17.46 62.987 0.957 79.146 22.547 20.207 69.332 0.54 2.652

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
701 OXY 1311 2001 -1 23.841 7.648 9.458 30.246 0.855 4.16 24.098 4.313 6.764 25.345 0.056 1.934
702 APC 1311 2001 -1 12731.9 749.606 2893.143 10421.741 3751.337 776.327 13338.947 299.634 2472.437 9909.847 2133.643 573.287
703 VTS 1382 2001 -1 300.715 24.733 12.031 171.493 2.97 25.655 394.354 41.216 16.159 255.376 8.309 37.31
704 3SEIEQ 1382 2001 1 217.718 332.002 4.849 81.991 217.718 2.11 438.911 4.668 130.15 255.978 2.176 29.555
705 FLR 1600 2001 -1 15.806 3.988 0.024 7.712 0.64 0.088 15.554 43.478 0 10.353 0.471 1.055
706 3FWLRF 1600 2001 1 4854 7158 520 965 4854 200 10751 1199 1911 10632 501 261
707 HAIN 2000 2001 1 13.809 0 0.04 0 13.809 0.14 0 0.04 0 4.509 0.14 0.169
708 DLM 2000 2001 1 824.36 52.781 646.385 0.632 824.36 2.203 112.834 1661.381 104.953 2145.406 451.942 64.207
709 OME 2070 2001 -1 537.418 38.417 15.998 429.852 7.193 9.533 535.338 32.89 8.402 571.767 13.445 17.309
710 DAR 2070 2001 1 26.721 48.136 13.587 0 26.721 0.069 79.061 26.913 1.705 124.305 0.23 32.375
711 TGX 2834 2001 1 5.911 1.926 1.277 0 5.911 0.241 6.178 3.361 0 8.446 0.813 0.228
712 IMGN 2834 2001 1 2381.1 8814.272 1674.729 0 2381.1 110.934 9770.1 1897.6 0 2718.7 152.6 74.7
713 FMXI 3086 2001 -1 40.119 19.503 0 30.867 0.39 0.547 37.715 13.312 2.779 25.524 0.606 1.417
714 8412B 3086 2001 1 134.215 40.091 4.339 6.825 134.215 23.02 36.255 4.445 4.374 45.978 13.44 5.134
715 DEVC 3270 2001 1 42.23 50.309 9.329 9.541 42.23 2.311 82.399 11.439 12.617 42.499 1.792 1.578
716 EGBT 3270 2001 1 232.969 280.487 24.979 50.451 232.969 17.424 312.538 35.452 70.111 319.407 15.231 75.11
717 DOV 3559 2001 -1 258.12 61.626 0 736.602 25.331 109.909 428.514 70.784 8.628 738.984 31.709 140.899
718 GRB 3559 2001 -1 1263.274 123.97 64.449 2842.586 82.482 260.506 1425.277 149.232 80.94 3236.888 95.171 269.371
719 GSLI 3559 2001 1 4.426 1.246 3.746 0 4.426 0.008 2.024 5.759 0.114 7.911 0 0.004
720 NVLS 3559 2001 1 315.496 21.321 233.216 0 315.496 0.4 24.898 244.503 0 345.239 2.703 0.564
721 ETS 3576 2001 -1 671.252 38.848 87.762 1357.943 81.332 69.887 1825.169 1.576 254 2121.357 186.37 155.3
722 JNIC 3576 2001 1 88.692 89.753 21.354 0 88.692 8.876 63.972 14.723 0 63.455 7.313 1.578
723 EFII 3576 2001 1 8.102 0.047 0.862 4.539 8.102 0.027 1.472 0.595 1.769 6.895 2.97 0.03
724 DGII 3576 2001 1 390.901 197.181 63.54 31.494 390.901 15.062 280.281 79.783 35.22 457.563 20.1 12.58
725 MMAN 3580 2001 -1 730.662 215.11 58.141 895.192 25.046 36.294 874.911 295.315 54.483 1092.912 59.569 55.648
726 ILI 3580 2001 1 72.721 193.318 6.156 0.59 72.721 4.366 203.304 8.286 1.025 115.537 2.359 1.858
727 NMSS 3661 2001 -1 10.147 3.065 0.758 4.838 0 0.372 238.373 26.801 1.174 84.939 1.47 0.134
728 AVNX 3661 2001 -1 2171.3 270.1 201.2 2811.1 160.1 157.1 2039.1 266.2 155.2 2358.1 55.2 133.8
729 3CARD 3690 2001 -1 19.571 3.406 9.062 28.775 2.108 2.026 23.005 1.904 3.755 12.504 0.095 1.126
730 IGOC 3690 2001 1 802.784 782.023 114.777 0 802.784 122.337 891.963 138.049 0 936.351 212.627 49.532
731 TLGD 3825 2001 -1 36.828 5.365 1.356 9.564 0.564 0.513 28.48 6.233 1.346 10.353 0.554 0.67
732 PHTN 3825 2001 1 366.791 90.182 11.517 0 366.791 29.421 76.956 8.198 0 377.143 23.272 35.811
733 IO 3829 2001 -1 21.14 10.558 4.439 479.16 0.691 196.817 87.127 1.778 121.458 2127.577 139.186 696.3
734 OYOG 3829 2001 1 21.947 2.68 0.343 0.338 21.947 1.379 1.429 0.146 0.521 9.437 0.872 1.051
735 TRMB 3829 2001 1 139.749 287.094 19.483 31.944 139.749 5.66 259.634 19.118 26.618 120.757 6.172 1.096

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
736 MSS 3829 2001 1 26508.253 29222.364 10842.254 4652.505 26508.253 0 22369.491 7695.059 2382.114 23912.557 0 794.293
737 ARRO 3841 2001 -1 37.848 16.273 18.229 55.782 0.292 3.899 28.263 3.146 2.309 17.078 0.417 1.265
738 HAE 3841 2001 1 2451.8 1615.2 359.8 56.585 2451.8 31.424 2298.1 660 0 2885.4 1.423 158.4
739 SYD 3843 2001 -1 49000 687 4066 18190 315 1623 50098 679 4178 19087 324 2139
740 ALGN 3843 2001 1 12.622 11.163 3.149 2.897 12.622 0.905 8.907 3.126 3.118 9.226 0.034 0.318
741 ESMC 3845 2001 1 34.007 47.3 8.564 4.226 34.007 0.799 65.349 10.726 4.599 45.688 0.767 1.175
742 POCI 3845 2001 1 443.986 793.092 28.511 268.214 443.986 19.888 845.815 37.566 367.366 563.251 28.125 20.132
743 AYE 4911 2001 -1 15.076 3.214 1.975 7.907 0.11 0.306 21.026 5.538 3.989 13.921 0.289 0.654
744 RRI 4911 2001 1 2333.294 1271.27 294 597.643 2333.294 86.286 1282.173 275.294 522.609 2143.895 93.444 44.797
745 ED 4931 2001 1 14.578 17.446 3.402 0 14.578 0.625 190.884 41.216 16.159 6.963 149.552 19.69
746 CMS 4931 2001 1 12.627 8.56 2.391 3.403 12.627 3.759 2.844 0.073 2.199 7.84 3.203 0.362
747 SVU 5411 2001 -1 348.486 150.52 0 333.565 8.374 18.91 378.15 161.447 0 350.919 14.946 20.067
748 3PUSH 5411 2001 -1 421.095 67.148 7.895 2511.319 74.23 32.711 605.176 102.772 12.908 2558.235 70.089 32.644
749 3HCAR 5500 2001 -1 349.736 60.702 32.835 810.905 13.777 37.769 392.748 73.269 37.349 981.964 26.236 59.528
750 MAJR 5500 2001 -1 4616.79 1005.685 90.383 2058.807 18.241 92.826 5046.875 72.622 158.62 2144 0 191.054
751 TRVS 5500 2001 -1 16949.608 673.485 873.544 4426.256 168.548 428.217 6853.652 1458.553 10.687 7.748 0 0
752 FFPM 5500 2001 1 936.2 582.2 10.9 77.8 936.2 0 477.7 10.8 73.5 890.8 0 49.6
753 BBA 5712 2001 -1 110.126 1061.524 2.148 1514.088 62.07 2.816 116.13 1164.602 0.748 1556.026 3.301 20.132
754 RSTO 5712 2001 1 1646.491 125.943 1149.708 1.366 1646.491 41 137.82 1268.578 1.792 1816.918 33.003 16.619
755 PMORQ 5912 2001 -1 114.956 30.501 18.365 91.71 0.237 6.485 50.49 14.456 4.153 55.39 0.133 1.463
756 ECMV 5912 2001 -1 221.279 51.867 21.445 222.852 2.194 13.854 228.266 41.845 14.826 175.56 0.857 7.741
757 PAB 6020 2001 1 1402.809 114.509 814.488 3.814 1402.809 0 139.28 1128.381 6.232 2091.096 1.258 2.177
758 IFCJ 6020 2001 1 448.044 7022.16 346.152 35.453 448.044 0.178 4717.242 167.556 10.004 227.027 3.823 3.591
759 FBC 6035 2001 -1 128.434 20.286 36.879 138.751 30.669 7.025 44.306 72.622 4.162 206.822 3.924 7.564
760 HTHR 6035 2001 -1 21.718 151.263 0.472 317.874 89.9 0.152 24.845 182.869 0 380.964 20.1 18.444
761 SUFI 6035 2001 1 402.429 55.066 15.299 0 402.429 0.745 141.862 27.97 0 1172.46 6.683 286.91
762 WFSL 6035 2001 1 23.619 35.058 7.332 7.862 23.619 1.219 42.675 7.834 6.278 23.332 0.271 0.983
763 CIT 6172 2001 -1 99.79 16.018 0 79.944 0.779 2.449 96.225 3.458 0 28.577 3.111 1.22
764 CIT. 6172 2001 -1 209.55 31.395 14.501 133.252 45.017 5.338 190.884 18.554 14.639 120.122 1.472 6.961
765 HC 7359 2001 -1 231.784 32.511 37.057 248.961 40.691 65.36 244.044 476 38.001 905.984 3.924 32.644
766 WSC 7359 2001 -1 1462.283 19.808 188.211 507.629 0 44.553 1569.062 13.297 227.22 599.12 2.668 40.514
767 3ALRC 7363 2001 -1 5.838 3.408 81.923 14.371 0.065 0.418 3.582 0.791 179.971 7.748 0 0.015
768 BBSI 7363 2001 1 2886 3549 487 422 2886 285 3364 395 326 2681 292 127
769 ONES 7370 2001 -1 730.662 215.11 58.141 895.192 25.046 36.294 874.911 295.315 54.483 1092.912 59.569 55.648
770 IFXCV 7370 2001 1 33.862 34.594 6.493 8.762 33.862 4.819 19.252 2.895 2.585 12.387 0.223 0.237

Figure 18 Continued


Obs# Ticker Ind Year Label Salesyr1 ARyr1 INVyr1 TAyr1 OAyr1 CEyr1 Salesyr2 ARyr2 INVyr2 TAyr2 OAyr2 CEyr2
771 CA 7372 2001 -1 3.287 0.594 0.695 2.806 0.066 0.206 3.646 0.558 0.521 1.554 0.007 0.092
772 ORCL 7372 2001 -1 98.625 38.612 0 226.815 0.364 4.563 138.29 38.92 0 160.891 1.832 6.18
773 RDTA 7372 2001 1 9.836 1.146 0.056 0.075 9.836 0.215 0.112 0.23 0.01 15.686 0.037 0.641
774 IATV 7372 2001 1 38.27 32.936 3.938 8.843 38.27 3.324 33.043 5.086 8.092 35.994 3.314 1.132
775 CSRE 7372 2001 1 203.633 228.573 48.655 15.281 203.633 20.711 87.127 75.376 5.805 69.284 4.431 7.849
776 AVGO 7372 2001 1 73.112 103.603 16.778 37.313 73.112 1.97 87.501 13.679 30.92 59.333 1.567 0.644
777 MDSI 7372 2001 1 905.984 1077.52 244.881 53.629 905.984 81.462 1141.949 214.724 54.136 879.504 117.717 24.755
778 AKLM 7372 2001 1 3214 3781 305 271 3214 508 3296 350 254 3464 355 109
779 BVSN 7373 2001 -1 0.398 0.169 1.287 10.666 0.858 0.12 0.548 0.049 6.482 11.122 1.284 0.413
780 SONE 7373 2001 -1 16.279 73.696 0 127.292 2.659 0.833 7.476 32.97 0 103.962 8.982 1.091
781 DV 8200 2001 1 238.289 383.197 55.323 8.297 238.289 2.059 306.348 44.755 4.937 207.818 2.325 4.825
782 EDSN 8200 2001 1 213.651 198.042 35.01 33.98 213.651 9.835 192.493 29.596 10.531 189.744 11.896 3.685
783 AFAM 8300 2001 -1 261.541 52.334 17.144 187.439 3.35 12.05 231.395 2.751 82.977 450.119 75.136 49.548
784 CRN 8300 2001 -1 94.867 752.324 0 1025.682 270.759 20.582 97.86 824.702 0 1140.18 37.84 76.059
785 CSU 8300 2001 1 52.852 63.98 11.602 0 52.852 1.466 126.55 18.554 0 367.382 1.387 2.297
786 RGNT 8300 2001 1 1380.2 2212.7 332.5 115.9 1380.2 133.7 2349 385.5 124.5 1447.1 56.5 96.9
787 MDTL 8731 2001 -1 14.753 0 0 68.412 5.909 0 145.609 7.763 0 92.5 11.285 2.302
788 SONO 8731 2001 -1 909.13 256.118 0 914.045 18.386 45.804 873.116 235.99 0 835.727 34.477 35.792

Figure 18 Continued


B.2 Text Data

Object 1. Text Data


LIST OF REFERENCES

1. Abbot, L.J, Parker, S., and Peters, G. (2004). Audit Committee Characteristics and Restatements. Auditing, March, Vol 23, pp. 69–77.

2. Abdel-khalik, A. R. and K. M. El-Sheshai. (1980). Information Choice and Utilization in an Experiment on Default Prediction. Journal of Accounting Research, autumn, Vol. 18, pp. 325-342.

3. Agresti, A. (1990), Categorical Data Analysis, John Wiley & Sons, Inc., New York, NY.

4. Altman, E. (1968) Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy,Journal of Finance, Vol. 23, pp. 193-194.

5. Altman, E. Eisenbeis, R.A., and Sinkey, J. (1981). Applications of Classification Techniques in Business, Banking, and Finance, JAI Press, Greenwich, CT.

6. Banerjee, S. and Pedersen, T (2004). The Design, Implementation and Use of the Ngram Statistics Package. Working Paper, University of Minnesota, Duluth.

7. Baeza-Yates, R. and Ribeiro-Neto B. (1999) Modern Information Retrieval, Addison-Wesley Pub Co; Boston, MA.

8. Beaver (1966) Financial Ratios as Predictors of Failure. Journal of Accounting Research, Vol. 4, pp. 71–111.

9. Bell, T.B. and Carcello, J.V. (2000). Research Notes, A Decision Aid for Assessing the Likelihood of Fraudulent Financial Reporting. Auditing: A Journal of Practice and Theory, Vol. 19(1), pp. 169 – 175.

10. Beneish, M. (1999) The Detection of Earnings Manipulation. Financial Analysts Journal, Vol. 55, pp. 24-36.

11. Brill, E. (1992) A Simple Rule-Based Part of Speech Tagger. Proceedings of the Third Conference on Applied Natural Language Processing, pp. 152-155, Trento, Italy.

12. Brown, M., Grundy, W., Lin, D., Cristianini, N., Sugnet, C., Furey, T., Ares, M. and Haussler, D. (1999). Knowledge-Based Analysis of Microarray Gene Expression Data using Support Vector Machines. Technical report, University of California in Santa Cruz. (submitted for publication).

Page 176: QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVALufdcimages.uflib.ufl.edu/UF/E0/01/14/30/00001/cecchini_m.pdf · 2010-05-05 · Requirements

164

13. Budanitsky, A. and Hirst, G. (2001), Semantic Distance in WordNet: An Experimental, Application-Oriented Evaluation of Five Measures, http://ftp.cs.toronto.edu/pub/gh/Budanitsky+Hirst-2001.pdf, June, 2003.

14. Buitelaar, P. and Sacaleanu, B. (2002) Extending Synsets with Medical Terms. Proceedings of the First International WordNet Conference, Mysore, India, January 21 - 25.

15. Buitelaar, P. and Sacaleanu, B. (2001) Ranking and Selecting Synsets by Domain Relevance. Proceedings of WordNet and other Lexical Resources: Applications, Extensions and Customizations, NAACL 2001 Workshop, Carnegie Mellon University, Pittsburgh.

16. CCH, Generally Accepted Accounting Principles, 2005, http://business.cch.com/primesrc/bin/highwire.dll, March, 2005.

17. Charalambous, C., Charitou, A., and Kaourou, F. (1999) Comparative Analysis of Artificial Neural Network Models: Application in Bankruptcy Prediction. IEEE.

18. Choo, C. W. (1995). Information Management for the Intelligent Organization: The Art of Scanning the Environment.: Information Today, Inc. Medford, NJ.

19. Standard & Poor’s, Compustat Data, 2005, http://www.compustat.com/, June, 2005.

20. Cristianini N. and Shawe-Taylor J. (2000) An Introduction to Support Vector Machines and other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, United Kingdom.

21. Cristianini, N., Shawe-Taylor J, and Lodhi, H. Latent Semantic Kernels. Kluwer Academic Publishers, Forthcoming.

22. Princeton University, WordNet, 2005, http://www.cogsci.princeton.edu/~wn/index.shtml, June, 2005.

23. Cortes, C. and Vapnik, V. (1995). Support Vector Networks. Machine Learning, Vol. 20, pp. 273-297.

24. Michael I. Jordan, Advanced Topics in Learning & Decision Making, 2004, http://www.cs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec10.pdf, September, 2004.

25. Davia, H.R. (2000). Accountant's Guide to Fraud Detection and Control. 2nd ed., Wiley New York, NY.

26. Dechow, P.M., Sloan, R.G. and Sweeney, A.P. (1995) Detecting Earnings Management. The Accounting Review, Vol. 70(2), pp. 193-225.

Page 177: QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVALufdcimages.uflib.ufl.edu/UF/E0/01/14/30/00001/cecchini_m.pdf · 2010-05-05 · Requirements

165

27. DeFond, M.L. and Jiambalyo, J. (Jul. 1991). Incidence and Circumstances of Accounting Errors. The Accounting Review, Volume 66(3), pp. 643-655.

28. Eisenbeis, R. (July 1987) Discussion, Supplement to Srinivasan, V. and Kim, Y. H. (1987) Credit Granting: A Comparative Analysis of Classification Procedures. J. Fin. Vol. XLII (3), pp. 665-680.

29. Fanning, K., Cogger, K.O., and Srivastava R. (1995). Detection of Management Fraud: A Neural Network Approach. Proceedings, 11th Conference on Artificial Intelligence for Applications, pp. 220-223.

30. Federal Accounting Standards Advisory Board, Generally Accepted Accounting Principles, 2005, http://www.fasab.gov/accepted.html, November, 2004.

31. Fayyad, U.M., Uthurusamy R., Piatetsky-Shapiro G., and Smyth P. (1996). Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA.

32. Felbaum, Christiane. (1999) WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

33. Feroz, E.H., Park, K., and Pastena, V. (1991). The Financial and Market Effects of the SEC’s Accounting and Auditing Enforcement Releases, Journal of Accounting Research, Vol. 29, pp. 107-142.

34. Fisher, R. A., (1936). The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, Vol. 7, pp. 179-188.

35. Francis, J., LaFond, R., Olsson, P. and Schipper, K. (2005). The Market Pricing of Accruals Quality. Journal of Accounting and Economics, Vol. 39(2), pp. 295-327.

36. Frydman, H., Altman, E., and Duen-Li, K. (1985) Introducing Recursive Partitioning for Financial Classification: The Case of Financial Distress. J. Fin Vol. XL(1), pp. 269–291.

37. General Accounting Office, Financial Statement Restatements: Trends, Market Impacts, Regulatory Responses, and Remaining Challenges, 2002, http://www.gao.gov/htext/d03138.html, June, 2004.

38. Genton M.G. (2001). Classes of Kernels for Machine-Learning: A Statistics Perspective. Journal of Machine Learning Research, Vol. 2, pp. 299-312.

39. Google Corporation, Homepage, 2005, http://www.google.com, June, 2005.

40. Gur-Ali, O. and Wallace W.A. (1995). Classifying Delinquent Customers for Credit Collections: an Application of Probabilistic Inductive Learning. International Journal of Human-Computer Studies, Vol. 42, pp. 633-646.

Page 178: QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVALufdcimages.uflib.ufl.edu/UF/E0/01/14/30/00001/cecchini_m.pdf · 2010-05-05 · Requirements

166

41. Hackenbrack, Karl.(1993) The Effect Of Experience With Different Sized Clients On Auditor Evaluations Of Fraudulent Financial Reporting Indicators, Auditing: A Journal of Practice and Theory, Vol. 12(1), pp. 99-100.

42. Hansen, J.V., McDonald, J.B., Messier, W.F., and Bell, T.B.(1996) A Generalized Qualitative-response Model and the Analysis of Management Fraud. Management Science, Vol 42(7), pp. 1022-1033.

43. Hribar, P. and Jenkins, N.T.,(2004). The Effect of Accounting Restatements on Earnings Revisions and the Estimated Cost of Capital: Accounting, Disclosure, and the Cost of Capital. Review of Accounting Studies, Vol. 9(2-3), pp. 356–375.

44. Unknown Author, Term Frequency, Inverse Document Frequency, 2003, http://instruct.uwo.ca/gplis/601/week3/tfidf.html, November, 2004.

45. Jin, X., Lu, Y., and Shi, C. (2002) Similarity Measure Based on Partial Information of Time Series. Conference on Knowledge Discovery in Data Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Ming, pp. 544-549.

46. Joachims, T. (1999). Text Categorization with Support Vector Machines. Proceedings of the 1999 Conference on AI and Statistics, pp. 137-142.

47. Johnson, S. (1967). Hierarchical Clustering Schemes, Psychometrika, Vol. 2, pp. 241-254.

48. Jurafsky, D. and Martin J. (2000) Speech and Language Processing, Prentice-Hall, Inc., Upper Saddle River, NJ.

49. Khan, L. and Luo, F. (2002). Automatic Ontology Derivation from Documents. Submitted to IEEE Transactions on Knowledge and Data Engineering,

50. Kinney, W.R., Palmrose, Z.V., and Scholz, S. (2004) Auditor Independence, Non-Audit Services, and Restatements: Was the U.S. Government Right? Journal of Accounting Research, Vol. 42(3), pp. 561-588.

51. Koch, T.W. and Wall, L.D. (2000) The Use of Accruals to Manage Reported Earnings: Theory and Evidence, Working Paper 2000-23, Federal Reserve Bank of Atlanta.

52. Landauer, T.K.,Foltz, P.W., and Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, Vol. 25, pp. 259-284.

53. Langlois, Shawn, WorldCom's $3.8 Billion Scandal, 2002, CBSMarketWatch.com., June, 2004.

Page 179: QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVALufdcimages.uflib.ufl.edu/UF/E0/01/14/30/00001/cecchini_m.pdf · 2010-05-05 · Requirements

167

54. Lewis, D. D., Schapire, R. E., Callan, J. P., and Papka, R. (1996). Training Algorithms for Linear Text Classifiers. Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 298-306.

55. Loebbecke, J.K., Eining, M.M., and Willingham J.J. (Fall 1989) Auditors’ Experience with Material Irregularities: Frequency, Nature, and Detectability. Auditing: A Journal of Practice and Theory, Vol. 9(1), pp. 1-28.

56. McNichols, M. and Wilson, P. (1988). Evidence of Earnings Management from the Provision for Bad Debts. Journal of Accounting Research, Vol. 26, pp. 1-31.

57. Mangasarian, O.L., Mathematical Programming in Data Mining. (1997). Data Mining and Knowledge Discovery, Vol. 1(2): pp. 183-201.

58. Weisstein, E., Hilbert Space, 1999, http://mathworld.wolfram.com/HilbertSpace.html, October, 2004.

59. Weisstein, E. Inner Product Space, 1999, http://mathworld.wolfram.com/InnerProductSpace.html, October, 2004.

60. Matsumura, E.M. and Tucker R.R.. (1992). Fraud Detection: A Theoretical Foundation. The Accounting Review, Vol. 67(4), pp. 753-782.

61. McKendall, MA. and Wagner, J.A., III. (1997). Motive, Opportunity, Choice, and Corporate Illegality. Organization Science, Vol. 8(6), pp. 624-647.

62. Messier, W.F. and Hansen J.V. (1988) Inducing Rules for Expert System Development: An Example using Default Bankruptcy Data. Management Science, Vol. 34(12), pp. 1403-1416.

63. Miller, G. and Charles, W. (1991) Contextual Correlates of Semantic Similarity. Language and Cognitive Processes, Vol. 6(1), pp. 1-28.

64. Mitra, M. and Singhal, A. (2004). Improving Automatic Query Expansion. Working Paper, Cornell University.

65. Moler C. and Van Loan C., (2003) Nineteen Dubious Ways to Compute the Exponential of a Matrix, Twenty-Five Years Later,” SIAM Rev., Vol. 45(1), pp. 3–49.

66. Natarajan, B. (1991). Machine Learning: A Theoretical Approach. Morgan Kaufmann, Palo Alto, CA.

67. Navigli, R.(2002) Extending, Reducing and Trimming General Purpose Ontologies. Proc. of 2nd IEEE International Conference on Systems, Man and Cybernetics, Tunisy, Italy.

Page 180: QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVALufdcimages.uflib.ufl.edu/UF/E0/01/14/30/00001/cecchini_m.pdf · 2010-05-05 · Requirements

168

68. Navigli, R. and Velardi. P. (2002) Automatic Adaptation of WordNet to Domains, Proc. of 3rd International Conference on Language Resources and Evaluation, Las Palmas, Canary Island, Spain.

69. Navigli, R., and Velardi, P. (2003) Ontology Learning and Its Application to Automated Terminology Translation. IEEE Intelligent Systems, January/February, pp. 22-31.

70. Nelson, K. and Kogan, A, Srivastava, R., Vasarhelyi, M., and Lu, Hai. (2000). Virtual Auditing Agents: The EDGAR Agent Challenge. Decision Support Systems Vol. 28(3), pp 241-255.

71. New York Stock Exchange, Homepage, 2004, http://www.nyse.com/, June, 2004.

72. Ohlson, J. (1980). Financial Ratios and the Probabilistic Prediction of Bankruptcy. Journal of Accounting Research. Vol. 18(1), pp. 109-131.

73. Palmrose, Z.V. (1999) Studies in Accounting Research #33: Empirical Research in Adutior Litigation: Considerations and Data. American Accounting Association, Sarasota, FL.

74. Pasca, M. (2003). Question-Driven Semantic Filters for Answer Retrieval. International Journal of Pattern Recognition and Artificial Intelligence. Vol. 17(5), pp. 741-756.

75. Peasnell, K.V., Pope, P.F., and Young, S. (1999) What Factors Drive Low Accounting Quality? An Analysis of Firms Subject to Adverse Rulings by the Financial Reporting Review Panel, Working Paper, Lancaster University.

76. Pincus, K, (1989) The Efficacy of a Red Flags Questionnaire for Assessing the Possibility of Fraud, Accounting, Organizations and Society Vol. 14, pp. 153-63.

77. Piramuthu, S., Raghavan, H. and Shaw. M (1998). Using Feature Construction to Improve the Performance of Neural Networks. Management Science, Vol. 44, pp. 416–430.

78. Pontil, M. and Verri, A. (1998). Object Recognition with Support Vector Machines. IEEE Trans. on PAMI, 20, Vol. 20(6), pp. 637-646.

79. Press, S.J. and Wilson, S. (1978). Choosing between Logistic Regression and Discriminant Analysis. J. Amer. Statist. Assoc., Vol. 73, pp. 699-705.

80. Quinlan, J. R. (1996). Decision Trees and Instance-Based Classifiers. In CRC Handbook of Computer Science and Engineering. A. B. Tucker, Ed., CRC Press, Boca Raton, FL.

81. R Project Foundation, The R Project for Statistical Computing, 2002, http://www.r-project.org/, June, 2005.

Page 181: QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVALufdcimages.uflib.ufl.edu/UF/E0/01/14/30/00001/cecchini_m.pdf · 2010-05-05 · Requirements

169

82. Ragothaman, S., Carpenter, J. and Buttars, T. (1995). Using Rule Induction for Knowledge Acquisition: An Expert Systems Approach to Evaluating Material Errors and Irregularities. Expert Systems with Applications, Vol. 9(4), pp. 483-490.

83. Rodriguez, M., and Gomez-Hidalgo, J. (1997). Using WordNet to Complement Training Information in Text Categorization. Working Paper, Universidad Complutense de Madrid.

84. Rezaee, Z. (2002). Financial Statement Fraud: Prevention and Detection. Chichester: Wiley, New York, NY.

85. Roberts, J. and Thomas, E., Enron's Dirty Laundry, www.Newsweek.com, January, 2005.

86. Rudorfer, G. (1995) Early Bankruptcy Detection Using Neural Networks. APL Quote Quad, ACM New York, Vol. 25(4), pp. 171-178.

87. Ruping, S. (2004) SVM Kernels for Time Series Analysis. Working Paper, University of Dortmund.

88. Salton, G. and Buckley, C. (1988). Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, Vol. 24(5), pp. 513-523.

89. Sarbanes-Oxley, Financial and Accounting Disclosure Information, 2002, http://www.sarbanes-oxley.com/, August, 2003.

90. Schafer, J. and Olsen, M. (1998) Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst’s Perspective. Multivariate Behavioral Research, Vol. 33(4), pp. 545-571.

91. Scholkopf, B. and Smola, A. (2002) Learning with Kernels. MIT Press, Cambridge, MA.

92. Securities and Exchange Commission, Homepage, 1995, http://www.sec.gov, July, 2005.

93. Securities and Exchange Commission, Litigation Releases, 1995, http://www.sec.gov/litigation/litreleases.shtml, August, 2003.

94. Sarkar, A., Computational Linguistics (Course notes), 2004, http://www.sfu.ca/~anoop/courses/CMPT-413-Spring-2004/lexicalsem.pdf, January, 2005.

95. Shawe-Taylor, J., and Cristianini, N. (2004) Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, United Kingdom.

Page 182: QUANTIFYING THE RISK OF FINANCIAL EVENTS USING KERNEL METHODS AND INFORMATION RETRIEVALufdcimages.uflib.ufl.edu/UF/E0/01/14/30/00001/cecchini_m.pdf · 2010-05-05 · Requirements

170

96. Siolas, G. and d’Alche-Buc, F. (2000) Support Vector Machines Based on a Semantic Kernel for Text Categorization. Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, Vol. 5, pp. 5205 - 5218.

97. Sparck Jones, K. (1972) A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, Vol. 28(1), pp. 11-21.

98. Srinivasan, V. and Kim, Y. H. (July 1987) Credit Granting: A Comparative Analysis of Classification Procedures. J.Fin., Vol. XLII(3), pp. 665-681.

99. Swanson, D.R. (1986). Fish-oil, Raynaud’s Syndrome, and Undiscovered Public Knowledge. Perspectives in Biology and Medicine Vol. 30(1), pp. 7-18.

100. Takimoto, E., and Warmuth, M. Path Kernels and Multiplicative Updates. Journal of Machine Learning Research, forthcoming.

101. Tam, K. and Kiang, M. (1992) Managerial Applications of Neural Networks: The Case of Bank Failure Predictions. Management Science, Vol. 38(7), pp. 926-947.

102. Tay, F., Shen, L., and Cao, L. (2003) Ordinary Shares, Exotic Methods. World Scientific, Singapore.

103. Information Technology Laboratory’s Retrieval Group, Text Retrieval Conference, 2000, http://trec.nist.gov/data/docs_eng.html, February, 2005.

104. Tsai, L. and Koehler, G. (1993). The Accuracy of Concepts Learned from Induction. Decision Support Systems. Vol. 10, pp. 161-172.

105. van Rijsbergen, C.J. (1979) Information Retrieval. Butterworths, London, United Kingdom, p. 1.

106. Vapnik, V. (1995) Statistical Learning Theory. Springer Verlag, New York NY.

107. Vapnik V. and Chervonenkis A. (1981). The Necessary and Sufficient Conditions for Uniform Convergence of Means to their Expectations. Theory of Probability and its Applications. Vol. 26(3), pp. 532-553.

108. Vossen, P. (2001) Extending, Trimming and Fusing WordNet for Technical Documents. Proceedings of the NAACL-2001 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, USA.

109. Wallin, J. and Sundgren, S. (1995) Using Linear Programming to Predict Business Failure: An Empirical Study. Liiketaloudellinen Aikakauskirja, http://www.shh.fi/depts/redovis/research/jwss95/jwssma95.htm

110. Liu, Hugo, MontyTagger v 1.2, 2002, http://web.media.mit.edu/~hugo/montytagger/, March, 2005.

111. Zimbelman, M.F. (1997). The Effects of SAS No. 82 on Auditors' Attention to Fraud Risk Factors and Audit Planning Decisions. Journal of Accounting Research, Volume 35, Issue Studies on Experts and the Application of Expertise in Accounting, Auditing, and Tax, pp. 75-97.

BIOGRAPHICAL SKETCH

Mark Cecchini received Bachelor of Science degrees in accounting and finance at The Florida State University in 1992. After six years of professional experience, he earned an MBA from the Crummer Graduate School at Rollins College in 2000. This experience inspired him to continue his education and pursue a PhD in a business school. Mark chose decision and information sciences because it looked to be the most challenging and thus the most rewarding field. He chose the University of Florida because of its excellent reputation. Mark matriculated in 2001 and graduates in 2005. He will subsequently join the faculty at the University of South Carolina as an assistant professor in the accounting department of the Darla Moore School of Business. Mark’s wife and inspiration is Tara Cecchini, and he has two very cool boys named Julian and Campbell.