Impossibility Mining

Upload: joan-bridges

Post on 25-Dec-2015


TRANSCRIPT

Page 1: Impossibility Mining. Traditional Data Mining Using multidimensional data to find previously unknown hidden relationships Not just simple query/joins

Impossibility Mining

Page 2

Traditional Data Mining

- Using multidimensional data to find previously unknown hidden relationships
- Not just simple query/joins
- Canonical: Diapers and Beer at Walmart
  - Urban legend - comes from a 1992 Teradata study of Osco.
- Correlation != Causation
- Terminology currently has negative connotations in the press

Page 3

Il buono, il brutto, il cattivo

3 categories of "data mining" for fraud:
- Profiling (il brutto)
- Probability Mining (il cattivo)
- Anomaly Detection (il buono)

Page 4

Profiling

Looking for a series of characteristics which identify a likely problem

Demographic Profiling:
- Looking for a series of personal identifiers to determine likely suspects
- Example: Corporate data thieves tend to be males between 30 and 40 years of age

Behavior Profiling:
- Looking for a series of behaviors which indicate likely suspects
- Example: Corporate data thieves are more likely to work weekends, not take vacations, and be generally highly rated

Page 5

Profiling - Issues

- Demographic profiling, no matter how good, will likely end up with you on CNN
- Base Rate Fallacy: When the behavior being profiled is rare, the profile's accuracy needs to be extraordinarily close to 100% for a population of any size; otherwise false positives swamp the true hits.
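The base rate point can be made concrete with a little arithmetic. The sketch below is illustrative only: the population size, the base rate, and the profile's hit and false-alarm rates are all assumed numbers, not figures from the presentation.

```python
def flagged_precision(population, base_rate, true_positive_rate, false_positive_rate):
    """Return (true_hits, false_alarms, precision) for a profile applied to a population."""
    offenders = population * base_rate
    innocents = population - offenders
    true_hits = offenders * true_positive_rate      # offenders the profile catches
    false_alarms = innocents * false_positive_rate  # innocents the profile flags
    precision = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, precision


# Hypothetical scenario: 100,000 employees, 1 in 10,000 is a data thief,
# and the profile is "99% accurate" in both directions.
hits, alarms, precision = flagged_precision(
    population=100_000,
    base_rate=0.0001,
    true_positive_rate=0.99,
    false_positive_rate=0.01,
)
print(f"true hits: {hits:.1f}, false alarms: {alarms:.1f}, precision: {precision:.2%}")
```

Under these assumed numbers, roughly a thousand innocent people are flagged for every ten real offenders, so fewer than 1 in 100 flagged people is an actual thief.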

Page 6

Probability Mining

- Identifying high probability issues to target
- Can be applied to profiling or anomaly detection
- Good for sliding thresholds with competing business drivers
- Example: Stolen credit cards are more likely to be used at electronics stores for high ticket items. Applied to a particular profile, a plasma TV purchase may have a 10% chance of being fraudulent.

Page 7

Probability Mining - Issues

- Business drivers need to be considered
  - Is it worth it to bother 10 legitimate credit card holders to find 1 stolen card? What about 100? 1000?
- Probability generation requires a lot of data and a pre-labeled dataset to be useful
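One way to frame the "10? 100? 1000?" question is as a break-even calculation. The sketch below assumes a simple model that is not in the slides: each false alarm has a fixed cost (customer annoyance, call-center time) and each caught fraud avoids a fixed loss. The dollar figures are hypothetical.

```python
def legit_bothered_per_fraud(p_fraud):
    """Expected number of legitimate holders bothered for each stolen card found."""
    return (1 - p_fraud) / p_fraud


def break_even_probability(cost_per_false_alarm, loss_avoided_per_fraud):
    """Fraud probability above which flagging a transaction pays for itself.

    Flagging is worthwhile when
        p * loss_avoided_per_fraud > (1 - p) * cost_per_false_alarm.
    """
    return cost_per_false_alarm / (cost_per_false_alarm + loss_avoided_per_fraud)


# The 10% plasma TV example: how many legitimate buyers are bothered per fraud?
bothered = legit_bothered_per_fraud(0.10)

# Hypothetical costs: $25 per needless customer call, $2,000 saved per fraud caught.
threshold = break_even_probability(cost_per_false_alarm=25, loss_avoided_per_fraud=2000)

print(f"legitimate holders bothered per fraud at 10%: {bothered:.0f}")
print(f"break-even fraud probability: {threshold:.2%}")
```

The threshold slides as the business drivers change: a higher cost per false alarm pushes the break-even probability up, and a higher loss per fraud pulls it down.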

Page 8

Anomaly Detection

- Sesame Street analysis
- Relies on finding outliers in data
- Does not require a priori expert knowledge of the data
- Does require après-analysis expert knowledge to interpret outliers

Page 9

Case Example: Anomaly Detection

- Product launch event - $1.5 Million budget
- Launch directors had authority for procurements up to $10,000
- Report received of a "person directing the launch event gave a lot of vendor work to his brother-in-law"
- There were ~25 recent launch events that this could refer to, 10 of which were male-directed
- Looked at the financials for each launch event

Page 10

Data

Event Launch Purchases               Amount
Consulting - Marketing Support       $9,512.00
Supplies - General                   $250.12
Consulting - Advertising             $9,832.00
Supplies - Plasma TV Rental          $9,814.22
Supplies - Catering                  $1,233.22
Consulting - Launch Support          $9,763.00
Supplies - Secondary Plasma TV       $9,814.22
Mileage - Reimbursement              $252.84

Page 11

Benford

Page 12

Anomaly Detection - How we Found 'em

Benford's Law
- Take a look at both the last and first digits
- Distribution is well off predictions

Nearness-to-threshold
- Distribution should not be a logarithmic decline from the approval threshold
- Nothing was over threshold...

Common Sense
- Plasma TV Rentals - $10K to rent? Why 2?
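The first two checks can be sketched against the eight purchase amounts from the Data slide. This is a minimal illustration, not the investigators' actual tooling: the 5% margin under the approval threshold is an assumed parameter, only first digits are examined, and eight values are far too few for a meaningful statistical test of Benford's Law.

```python
import math
from collections import Counter

# Purchase amounts from the Data slide, in the order listed.
amounts = [9512.00, 250.12, 9832.00, 9814.22, 1233.22, 9763.00, 9814.22, 252.84]
APPROVAL_THRESHOLD = 10_000  # launch directors' procurement authority


def first_digit(amount):
    """Leading significant digit, read from scientific notation (e.g. '9.512000e+03')."""
    return int(f"{abs(amount):e}"[0])


def benford_expected(digit):
    """Benford's Law: expected share of values whose first digit is `digit`."""
    return math.log10(1 + 1 / digit)


def near_threshold(values, threshold, margin=0.05):
    """Values that fall within `margin` (assumed 5%) below the threshold."""
    return [v for v in values if threshold * (1 - margin) <= v < threshold]


first_digit_counts = Counter(first_digit(a) for a in amounts)
near = near_threshold(amounts, APPROVAL_THRESHOLD)
duplicates = [a for a, n in Counter(amounts).items() if n > 1]

for digit in range(1, 10):
    observed = first_digit_counts[digit] / len(amounts)
    print(f"digit {digit}: observed {observed:.0%}, Benford expects {benford_expected(digit):.1%}")
print(f"{len(near)} of {len(amounts)} purchases sit just under the threshold")
print(f"repeated amounts: {duplicates}")
```

On this data, most amounts start with 9, a digit Benford's Law predicts for under 5% of values, and the same purchases cluster just below the $10,000 approval limit.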

Page 13

Results

- Subject hired their brother-in-law to do phantom consulting
- Subject rented plasma TVs with a $1 buyout option

Page 14

Case Example: Geospatial Anomalies

- Problem: Identify web activity that is spurious in nature
- Application: Successfully applied to internal user data (activity logs) as well as external data (attacks)

Page 15

User Data

Page 16

User Data - Plotted as Anomalies

Page 17

Outliers - What Were They?

[Pie chart: Outlier Categorization. Categories as extracted: Foreign Users, Gambling, False Positives, Pornography, Dating Websites, Spyware. Slice values as extracted: 63%, 10%, 14%, 7%, 3%, 3%.]

Page 18

Impossibility Mining

- Is NOT data mining
- IS an application of control testing
- Looks for patterns that cannot exist in any model of reasonable likelihood
- Can be single or multifactor
- Only identifies real outliers

Page 19

Impossibility Mining Example - Single Factor

Asset Management
- IT Asset Management software installed on all machines in a company
- Cataloged installed hardware and software at different points in time

Proactive Look
- Identify any computers where installed memory at time T is less than at time T-1
- Identified several hundred laptops from remote office users that met the criteria
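The single-factor check can be sketched as a comparison of consecutive inventory snapshots per machine. The record layout, machine names, and memory sizes below are hypothetical; the sketch flags only strict decreases, on the assumption that unchanged memory is not suspicious.

```python
from collections import defaultdict


def find_memory_drops(snapshots):
    """Flag machines whose installed memory decreased between consecutive snapshots.

    `snapshots` is an iterable of (machine_id, snapshot_date, memory_mb) tuples,
    with dates as ISO strings so they sort chronologically.
    """
    by_machine = defaultdict(list)
    for machine_id, snapshot_date, memory_mb in snapshots:
        by_machine[machine_id].append((snapshot_date, memory_mb))

    drops = []
    for machine_id, readings in by_machine.items():
        readings.sort()  # chronological order
        for (prev_date, prev_mb), (cur_date, cur_mb) in zip(readings, readings[1:]):
            if cur_mb < prev_mb:  # memory should never shrink on its own
                drops.append((machine_id, prev_date, cur_date, prev_mb, cur_mb))
    return drops


# Hypothetical inventory data.
inventory = [
    ("laptop-001", "2005-01-15", 1024),
    ("laptop-001", "2005-04-15", 512),   # lost half its memory
    ("laptop-002", "2005-01-15", 1024),
    ("laptop-002", "2005-04-15", 1024),  # unchanged
    ("desktop-07", "2005-01-15", 512),
    ("desktop-07", "2005-04-15", 2048),  # legitimate upgrade
]

memory_drops = find_memory_drops(inventory)
for machine_id, prev_date, cur_date, prev_mb, cur_mb in memory_drops:
    print(f"{machine_id}: {prev_mb} MB on {prev_date} -> {cur_mb} MB on {cur_date}")
```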

Page 20

Impossibility Mining Example - Single Factor, cont'd

Identified commonality in laptops
- All laptops were serviced by the same IT support location
- Found the drop in memory was consistent with the last "upgrade"
- Reviewed eBay activity of the local IT support personnel
- Found the thief, who was removing half of the memory from laptops of non-power users and selling it!

Page 21

Impossibility Mining - Dual Factor

Electronic Funds Transfer Investigation

Payment Process
- Manager takes in payment request and assigns to a clerk
- Clerk enters payment information and selects a payee
- Manager enters EFT information for the payee and confirms transaction (cannot change amount)
- Division Head confirms name on account, amount, and releases funds

Question: Does fraud require collusion?

Page 22

Impossibility Mining - Dual Factor, cont'd

EFT Audit
- Compared actual EFTs for internal consistency
- Looked for EFTs where the customer ID was the same, but the bank routing number was different
- Identified a manager who was manually changing routing information to funnel payments to her husband's account
- 3rd set of eyes (Division Head) did not help - ineffective control

Two process changes
- Only Division Head can add EFT information
- Automated check implemented to identify cases where bank name != routing number
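The dual-factor audit amounts to grouping transfers by customer ID and flagging any customer paid through more than one routing number. The sketch below assumes a simple dictionary record layout; the field names, customer IDs, and routing numbers are hypothetical.

```python
from collections import defaultdict


def customers_with_multiple_routing_numbers(transfers):
    """Return {customer_id: sorted routing numbers} for customers with more than one."""
    routing_by_customer = defaultdict(set)
    for transfer in transfers:
        routing_by_customer[transfer["customer_id"]].add(transfer["routing_number"])
    return {
        customer_id: sorted(routing_numbers)
        for customer_id, routing_numbers in routing_by_customer.items()
        if len(routing_numbers) > 1
    }


# Hypothetical EFT records.
eft_log = [
    {"customer_id": "C100", "routing_number": "111000025", "amount": 1200.00},
    {"customer_id": "C100", "routing_number": "111000025", "amount": 1350.00},
    {"customer_id": "C200", "routing_number": "222000111", "amount": 800.00},
    {"customer_id": "C200", "routing_number": "333000999", "amount": 810.00},  # rerouted
]

suspect_customers = customers_with_multiple_routing_numbers(eft_log)
for customer_id, routing_numbers in suspect_customers.items():
    print(f"{customer_id} was paid via {len(routing_numbers)} routing numbers: {routing_numbers}")
```

A customer who legitimately changed banks would also be flagged, so each hit still needs a reviewer to confirm whether the change was authorized.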

Page 23

Impossibility Mining - Data Joining

Unauthorized Computer Access
- Created a table of physical sites
- Calculated the minimum travel time between sites
- Identified anyone logging in to a machine at 2 sites where time between logins < minimum travel time
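The join between the login log and the travel-time table can be sketched as follows. Site names, users, timestamps, and travel times are all hypothetical, and the sketch compares only consecutive logins per user, which is one reasonable reading of the check rather than a confirmed description of the original query.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical minimum travel times between pairs of physical sites.
MIN_TRAVEL = {
    frozenset({"Headquarters", "Branch Office"}): timedelta(hours=3),
    frozenset({"Headquarters", "Data Center"}): timedelta(minutes=45),
}


def impossible_logins(logins, min_travel):
    """Flag consecutive logins by one user at two sites, closer in time than travel allows.

    `logins` is an iterable of (user, site, timestamp) tuples.
    """
    by_user = defaultdict(list)
    for user, site, when in logins:
        by_user[user].append((when, site))

    flagged = []
    for user, events in by_user.items():
        events.sort()  # chronological order
        for (t1, site1), (t2, site2) in zip(events, events[1:]):
            if site1 == site2:
                continue
            needed = min_travel.get(frozenset({site1, site2}))
            if needed is not None and (t2 - t1) < needed:
                flagged.append((user, site1, site2, t2 - t1))
    return flagged


login_log = [
    ("asmith", "Headquarters", datetime(2006, 3, 1, 9, 0)),
    ("asmith", "Branch Office", datetime(2006, 3, 1, 9, 20)),  # 20 minutes later
    ("bjones", "Headquarters", datetime(2006, 3, 1, 8, 0)),
    ("bjones", "Data Center", datetime(2006, 3, 1, 10, 0)),    # plausible trip
]

flagged_logins = impossible_logins(login_log, MIN_TRAVEL)
for user, site1, site2, gap in flagged_logins:
    print(f"{user}: {site1} -> {site2} in {gap}")
```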

Page 24

Impossibility Mining - Data Joining, cont'd

- Identified several stolen passwords
- Also highlighted password sharing
- ... as well as user passwords hard-coded in applications

Page 25

Impossibility Mining - Conclusions

- The less likely something is to occur, the better a candidate it is for impossibility mining
- Can always implement controls to prevent the "impossibilities", but they are not always implemented correctly
- Best example in the media: Insurance fraud case - men were claiming hysterectomies, ovarian cyst removal, Pap tests...

Page 26

Questions

...Other than "can we go yet?"