© 2015 The MathWorks, Inc.
Fraud Detection with MATLAB
Ian McKenna, Ph.D.
2
Agenda
Introduction: Background on Fraud Detection
Challenges: Knowing your Risk
Overview of the MATLAB Solution
– Connect to financial data sources
– Calculate fraud indicators
– Classify funds with machine learning
– Generate reports & deploy applications
Questions & Answers
4
Fraud Detection
Detecting when people intentionally act secretly to deprive another of something of value
Types
– Returns Forensics
– Linguistic Based Cues
http://nakedshorts.typepad.com/files/madoff_fairfieldsentry3x.pdf
5
Types of Fraud
Corporate
– Financial statement falsification
Securities and commodities
– Hedge Fund returns manipulation
– Stock market manipulation, regulatory compliance
Healthcare
Mortgage
Identity theft (credit card)
Insurance
Mass marketing
Asset forfeiture/money laundering
6
Hedge Fund Returns Manipulation
More prone to fraud due to decreased regulation
– SEC stats indicate 1% misbehave
Scenarios
– Misbehavior: hedge fund managers have some discretion in valuing illiquid investments. Academics have devised methods to analyze and flag potentially “manipulated” fund returns.
– Outright fraud: quantitative screening and the use of dedicated algorithms can save a lot of time
7
Return-Based Analysis
The number of negative monthly returns is used to judge a manager’s performance
Attract investors by misreporting returns
Distortion possible for returns at manager’s discretion
– Illiquid assets, complex assets
E.g. a discontinuity in the distribution of returns exists at zero but disappears if returns are computed bimonthly
“Suspicious Patterns in Hedge Fund Returns and the Risk of Fraud”. Bollen, Nicolas P.B. and Veronika
K. Pool (2012) Review of Financial Studies 25, 2673-2702.
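The discontinuity idea can be illustrated with a small sketch. It is written in Python rather than MATLAB, and it is a simplification, not the test statistic from the cited paper: it assumes that, for a smooth distribution and a narrow bin width, a return close to zero is about equally likely to fall just below or just above zero. The function name `zero_discontinuity` and the default `width` are hypothetical.

```python
import math


def zero_discontinuity(returns, width=0.005):
    """Compare the count of returns just below zero with the count just above.

    Returns (below, above, z). A strongly negative z means far fewer small
    losses than small gains, which is the pattern the slide describes.
    """
    below = sum(1 for r in returns if -width <= r < 0)
    above = sum(1 for r in returns if 0 <= r < width)
    n = below + above
    if n == 0:
        return below, above, 0.0
    # Assumption: under a smooth distribution, below ~ Binomial(n, 0.5),
    # so its mean is n/2 and its variance is n/4.
    z = (below - n / 2) / math.sqrt(n / 4)
    return below, above, z


if __name__ == "__main__":
    sample = [0.001] * 8 + [-0.001] * 2
    print(zero_discontinuity(sample))
```

A fund whose z-score is strongly negative at monthly frequency but not at bimonthly frequency would match the pattern described above.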
9
Returns Distribution Discontinuity
10
Benford’s Law
Frequency distribution of digits in many real-life sources
of data:
– Electricity bills
– Street addresses
– Stock prices
– Population numbers
– Death rates
– Physical and mathematical constants
– Processes described by power laws
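Benford’s Law states that the first significant digit d (1 to 9) occurs with probability log10(1 + 1/d), so a leading 1 is expected about 30% of the time. The sketch below, in Python rather than MATLAB, computes these expected probabilities and the observed first-digit frequencies of a series. The function names are hypothetical, and reading the digit from the decimal string of each value is one possible choice.

```python
import math
from decimal import Decimal


def benford_probability(d):
    """Expected share of values whose first significant digit is d (1..9)."""
    return math.log10(1 + 1 / d)


def first_digit(x):
    """First non-zero digit of x, read from its decimal representation."""
    for d in Decimal(str(abs(x))).as_tuple().digits:
        if d != 0:
            return d
    raise ValueError("zero has no first significant digit")


def first_digit_frequencies(values):
    """Observed frequency of first digits 1..9, ignoring zero values."""
    counts = [0] * 9
    for v in values:
        if v != 0:
            counts[first_digit(v) - 1] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no non-zero values")
    return [c / total for c in counts]


if __name__ == "__main__":
    for d in range(1, 10):
        print(d, round(benford_probability(d), 4))
```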
11
Stock Market Returns First Digit Frequency
Source: Checking Financial markets via Benford's law, Marco Corazza, Andrea Ellero, and Alberto
Zorzi
13
Challenges in Fraud Detection
Cost/Economics
– Most cases not fraud
– Manual analysis
Data
– Huge data sets
– Complex data types
– Data integration
Change
– Evolutionary
– Secrecy in detection methods
15
Challenges Faced During Model Development
Traditional approach, followed by its challenge:
Off-the-shelf software
– Inability to work with custom and complex data
In-house development with traditional languages
– Adapting requires long development times
Spreadsheets, Excel
– Limited data size
Combination of the above
– Inefficiencies in integration & automation
16
Computational Finance Workflow
Access: Files, Databases, Datafeeds
Research and Quantify: Data Analysis & Visualization, Financial Modeling, Application Development
Share: Reporting, Applications, Production
Automate
17
The Desired Report
Three funds to analyze and report:
– Gateway Fund
– American Funds Growth Fund
– Fairfield Sentry (known fraudulent Madoff fund)
20
Implemented Methods – Returns Based
Returns distribution and discontinuity at 0
– Check for a discontinuity at 0 in the distribution of monthly returns.
Low correlation with other assets
– Regress fund returns on a combination of style factors that maximizes the explanatory power of the analysis.
Unconditional serial correlation
– Check whether monthly returns are serially correlated, i.e. correlated with the previous month’s value. Managers investing in illiquid securities with no end-of-month quoted price may smooth their returns relative to all available market information.
Conditional serial correlation
– Using the optimal factor model constructed in “Low correlation with other assets”, check for serial correlation occurring especially after a down month (i.e. when a suspicious manager has the highest incentive to “catch up”).
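The two serial-correlation checks can be sketched as follows, in Python rather than MATLAB. This is a simplified illustration: the conditional version here works on raw returns and conditions on the previous month being negative, whereas the slide describes using the fitted factor model. All function names are hypothetical.

```python
import math


def lag1_autocorrelation(returns):
    """Unconditional check: correlation of each return with the previous one."""
    n = len(returns)
    mean = sum(returns) / n
    num = sum((returns[t] - mean) * (returns[t - 1] - mean) for t in range(1, n))
    den = sum((r - mean) ** 2 for r in returns)
    return num / den


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def conditional_serial_correlation(returns):
    """Correlation of r[t] with r[t-1], using only months that follow a loss.

    Assumption: raw returns stand in for the factor-model quantities
    described on the slide. Returns None when fewer than 3 such months exist.
    """
    pairs = [(returns[t - 1], returns[t])
             for t in range(1, len(returns)) if returns[t - 1] < 0]
    if len(pairs) < 3:
        return None
    prev, curr = zip(*pairs)
    return pearson(prev, curr)


if __name__ == "__main__":
    series = [0.01, 0.012, 0.011, -0.004, 0.009, 0.013, -0.002, 0.010]
    print(lag1_autocorrelation(series))
    print(conditional_serial_correlation(series))
```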
21
Implemented Methods – Returns Based
Number of returns equal to 0
– Calculate the theoretical number of returns equal to 0, using the cumulative distribution function and binomial coefficients, for a time series with the same characteristics (average return and variance) as the fund. Then compare that number with the actual count.
Number of negative returns
– Calculate the theoretical number of negative returns as above. Then compare that number with the actual count.
Number of unique returns / length of identical recurring series
– Calculate the theoretical value of each pattern. Unique returns is the number of distinct values in the time series; length of identical series is the number of consecutive observations that are identical. Then compare these theoretical values with the actual counts.
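One way to read the count-based checks above is sketched here in Python rather than MATLAB. It assumes the reference series is normally distributed with the fund’s sample mean and standard deviation, that a return counts as “zero” when it lies within half a reporting step (`half_width`, a hypothetical parameter) of 0, and that the suspicious directions are too few negatives and too many zeros. None of these choices is confirmed by the slides.

```python
import math
from statistics import NormalDist, mean, stdev


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), built from binomial coefficients."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))


def count_checks(returns, half_width=0.00005):
    """Return (observed, expected, tail probability) for negatives and zeros."""
    n = len(returns)
    dist = NormalDist(mean(returns), stdev(returns))
    p_neg = dist.cdf(0.0)
    p_zero = dist.cdf(half_width) - dist.cdf(-half_width)
    n_neg = sum(1 for r in returns if r < 0)
    n_zero = sum(1 for r in returns if abs(r) < half_width)
    return {
        # Probability of seeing this few negative months or fewer.
        "negative": (n_neg, n * p_neg, binomial_cdf(n_neg, n, p_neg)),
        # Probability of seeing this many zero months or more.
        "zero": (n_zero, n * p_zero, 1 - binomial_cdf(n_zero - 1, n, p_zero)),
    }


def unique_count(returns):
    """Number of distinct values in the series."""
    return len(set(returns))


def longest_identical_run(returns):
    """Length of the longest stretch of consecutive identical observations."""
    best = run = 0
    previous = object()  # sentinel that never equals a real return
    for r in returns:
        run = run + 1 if r == previous else 1
        best = max(best, run)
        previous = r
    return best
```

A small tail probability for either count suggests the observed series is unlikely under the fitted reference distribution.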
22
Implemented Methods – Returns Based
Sample distribution of the last digit
– Check with a goodness-of-fit test whether the last digit of the returns is uniformly distributed.
Sample distribution of the first digit
– Check with a goodness-of-fit test whether the first digit of the returns follows Benford’s Law.
Supervised classification methods
– Using machine learning tools (such as neural networks and other classification methods), train a model to identify potential fraudsters. The input variables consist of all the indicators described so far, computed for funds previously identified as fraudulent or non-fraudulent. Apply the fitted model to a new fund to obtain its classification.
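The digit checks above can be illustrated with a Pearson chi-square goodness-of-fit statistic. The slides do not say which goodness-of-fit test was used, so this choice is an assumption, as are the function names and the `decimals` parameter (the number of reported decimal places). The sketch is in Python rather than MATLAB.

```python
import math

# 5% critical values of the chi-square distribution.
CHI2_CRIT_DF9 = 16.919  # 10 last-digit categories -> 9 degrees of freedom
CHI2_CRIT_DF8 = 15.507  # 9 first-digit categories -> 8 degrees of freedom

UNIFORM_LAST_DIGIT = [0.1] * 10
BENFORD_FIRST_DIGIT = [math.log10(1 + 1 / d) for d in range(1, 10)]


def chi_square_statistic(observed_counts, expected_probs):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    n = sum(observed_counts)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed_counts, expected_probs))


def last_digit_counts(returns, decimals=4):
    """Count the last reported digit (0..9) of each return.

    Assumption: returns are reported to `decimals` decimal places.
    """
    counts = [0] * 10
    for r in returns:
        counts[round(abs(r) * 10 ** decimals) % 10] += 1
    return counts


def last_digit_is_suspicious(returns, decimals=4):
    """True when uniformity of the last digit is rejected at the 5% level."""
    stat = chi_square_statistic(last_digit_counts(returns, decimals),
                                UNIFORM_LAST_DIGIT)
    return stat > CHI2_CRIT_DF9
```

The same statistic applies to the first digit by passing first-digit counts with `BENFORD_FIRST_DIGIT` and comparing against `CHI2_CRIT_DF8`. The chi-square approximation is only reliable when the expected count in each category is not too small.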
24
Text Based Indicators
Idea from published research in criminal investigation
Hypothesis - deceptive senders display:
– Higher quantity
– Higher expressivity
– Higher informality
– Higher uncertainty
– Higher nonimmediacy
– Lower complexity
– Lower diversity
– Lower specificity
“Automating Linguistics-Based Cues for Detecting Deception in Text-based Asynchronous Computer-Mediated Communication”. Lina Zhou (University of Maryland, Baltimore County), Judee K. Burgoon, Jay F. Nunamaker, Jr. and Doug Twitchell (University of Arizona) (2004) Group Decision and Negotiation 13, 81–106.
25
Implemented Methods – Text Based
Measure of Complexity
– Average number of statements (average concepts per sentence)
– Average sentence length (average complexity of structures)
– Vocabulary complexity (average word length)
Measure of Uncertainty
– Average use of modifiers (number of adjectives/adverbs per sentence)
– Average reference to others (number of he, they, …)
Measure of Expressivity
– Emotiveness (number of adjectives compared to nouns)
Measure of Diversity
– Lexical diversity (number of unique words)
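A few of the text measures above need no part-of-speech tagging and can be sketched directly, here in Python rather than MATLAB. The tokenization rules, the pronoun list `OTHER_REFERENCES`, and the normalization of lexical diversity by the total word count are assumptions made for illustration. The measures that rely on adjectives, adverbs and nouns are omitted because they need a tagger.

```python
import re

# Hypothetical pronoun list for the "reference to others" measure.
OTHER_REFERENCES = {"he", "she", "they", "him", "her", "them", "his", "their"}


def text_indicators(text):
    """Compute simple complexity, uncertainty and diversity indicators."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text contains no sentences or words")
    return {
        # Complexity: average sentence length, in words.
        "avg_sentence_length": len(words) / len(sentences),
        # Complexity: vocabulary complexity as average word length.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Uncertainty: references to others per sentence.
        "other_references_per_sentence":
            sum(w in OTHER_REFERENCES for w in words) / len(sentences),
        # Diversity: unique words divided by total words (assumed scaling).
        "lexical_diversity": len(set(words)) / len(words),
    }


if __name__ == "__main__":
    print(text_indicators("They said it was fine. It was not fine!"))
```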
26
Classifying Words
Java POS Tagger
Reference online dictionary
Only a few lines of code
28
Comparison: American Growth Fund
29
Comparison: Madoff
31
Next Steps: Machine Learning with MATLAB
To learn more, visit: www.mathworks.com/machine-learning
– Basket Selection using Stepwise Regression
– Classification in the presence of missing data
– Regression with Boosted Decision Trees
– Hierarchical Clustering
32
MATLAB Solutions
Traditional approach, followed by its challenge and the MATLAB solution:
Off-the-shelf software
– Challenge: Inability to work with custom and complex data
– Solution: Flexible Modeling (Work with structured/unstructured)
In-house development with traditional languages
– Challenge: Adapting requires long development times
– Solution: Rapid Prototyping (Advanced)
Spreadsheets, Excel
– Challenge: Limited data size
– Solution: Work with Big Data Sets (Database/Hadoop)
Combination of the above
– Challenge: Inefficiencies in integration & automation
– Solution: Easy to Integrate & Deploy (Automated reports, encrypted models)
33
Financial Modeling Workflow
Access: Files, Databases, Datafeeds
– Datafeed, Database, Spreadsheet Link EX, Trading
Research and Quantify: Data Analysis and Visualization, Financial Modeling, Application Development
– Financial, Statistics & Machine Learning, Optimization, Financial Instruments, Econometrics
– MATLAB, Parallel Computing, MATLAB Distributed Computing Server
Share: Reporting, Applications, Production
– MATLAB Compiler SDK, MATLAB Compiler, Report Generator, Production Server
34
Q&A