Data Mining in Health Insurance
Introduction
• Rob Konijn, [email protected]
  – VU University Amsterdam
  – Leiden Institute of Advanced Computer Science (LIACS)
  – Achmea Health Insurance
    • Currently working here
    • Delivering leads for other departments to follow up
      – Fraud, abuse
• Research topic keywords: data mining / unsupervised learning / fraud detection
Outline
• Intro Application
  – Health Insurance
  – Fraud detection
• Part 1: Subgroup discovery
• Part 2: Anomaly detection (slides partly by Z. Slavik, VU)
Intro Application
• Health Insurance Data
• Health Insurance in NL
  – Obligatory
  – Only private insurance companies
  – About 100 euro/month (everyone) + 170 euro (income)
  – Premium increase of 5-12% each year
• Achmea: about 6 million customers
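Funding of Health Insurance Costs in the Netherlands
[Figure: money flows between the insured (verzekerde), the risk equalization fund (vereveningsfonds), and the health insurer (zorgverzekeraar)]
• Government contribution for insured persons under 18 (rijksbijdrage): 2 bln euro
• Income-dependent contribution via employers (inkomensafhankelijke bijdrage): 17 bln euro
• Equalization payment (vereveningsbijdrage): 18 bln euro
• Nominal premium, insured persons aged 18 and over:
  – Calculated premium (rekenpremie, ~€ 947 per insured): 12 bln euro
  – Surcharge (opslag, ~€ 150 per insured): 2 bln euro
• Healthcare expenditure (zorguitgaven): 30 bln euro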
Risk equalization model (vereveningsmodel)
• By population characteristics
  – Age
  – Gender
  – Income, social class
  – Type of work
• Calculation afterwards
  – High costs compensation (> 15,000 euro)
[Figure: amounts per age group, from 0-4 years to 90 years and older, shown separately for men (Mannen) and women (Vrouwen); the plotted values range from 870 to 3,464.]
Introduction Application: The Data
• Transactional data
  – Records of an event
  – Visit to a medical practitioner
    • Charged directly by the medical practitioner
    • Patient is not involved
    • Risk of fraud
Transactional Data
• Transactions: facts
  – Achmea: about 200 mln transactions per year
• Info of customers and practitioners: dimensions
Different levels of hierarchy
• Records represent events
• However, for fraud detection, for example, we are interested in customers or medical practitioners
• See examples on the next pages
• Groups of records: subgroup discovery
• Individual patients/practitioners: outlier detection
Handling different hierarchy
• Creating profiles from transactional data
• Aggregating costs over a time period
  – Each record: patient
    • Each attribute i = 1 to n: cost spent on treatment i
• Feature construction, for example
  – The ratio of long/short consults (G.P.)
  – The ratio of 3-way and 2-way fillings (Dentist)
  – Usually used for one-way analysis
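The profile construction described above can be sketched in a few lines of plain Python. This is only an illustration: the record layout `(patient_id, treatment_code, cost)`, the codes `long_consult` and `short_consult`, and the function names are assumptions made for the example, not the actual Achmea schema.

```python
from collections import defaultdict

# Hypothetical transactions: (patient_id, treatment_code, cost_in_euro).
transactions = [
    ("p1", "long_consult", 18.0),
    ("p1", "short_consult", 9.0),
    ("p1", "long_consult", 18.0),
    ("p2", "short_consult", 9.0),
    ("p2", "short_consult", 9.0),
]

def build_profiles(rows):
    """Aggregate cost per patient and treatment: one profile per patient."""
    profiles = defaultdict(lambda: defaultdict(float))
    for patient_id, treatment, cost in rows:
        profiles[patient_id][treatment] += cost
    return {pid: dict(costs) for pid, costs in profiles.items()}

def count_ratio(rows, numerator, denominator):
    """Constructed feature: number of `numerator` records per `denominator` record."""
    counts = defaultdict(int)
    for _, treatment, _ in rows:
        counts[treatment] += 1
    if counts[denominator] == 0:
        return None  # ratio is undefined without denominator records
    return counts[numerator] / counts[denominator]

profiles = build_profiles(transactions)
long_short_ratio = count_ratio(transactions, "long_consult", "short_consult")
```

In practice `count_ratio` would be applied to the records of one practitioner at a time, so that each practitioner gets its own ratio to compare against peers.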
Different types of fraud detection
• Supervised
  – A labeled fraud set
  – A labeled non-fraud set
  – Credit cards, debit cards
• Unsupervised
  – No labels
  – Health insurance, cargo, telecom, tax, etc.
Unsupervised learning in Health Insurance Data
• Anomaly Detection (outlier detection)
  – Finding individual deviating points
• Subgroup Discovery
  – Finding (descriptions of) deviating groups
• Focus on differences and uncommon behavior
  – In contrast to other unsupervised learning methods
    • Clustering
    • Frequent pattern mining
Subgroup Discovery
• Goal: find differences in claim behavior of medical practitioners
• To detect inefficient claim behavior
  – Actions:
    • A visit from the account manager
    • To include in contract negotiations
  – In the extreme case: fraud
    • Investigation by the fraud detection department
• By describing deviations of a practitioner from its peers
  – Subgroups
Patient-level, Subgroup Discovery
• Subgroup (orange): group of patients
• Target (red)
  – Indicates whether a patient visited a practitioner (1), or not (0)
Subgroup Discovery: Quality Measures
• Target dentist: 1672 patients
  – Compare with peer group, 100,000 patients in total
• Subgroup V11 >= 42 euro: 10347 patients
  – V11: one-sided filling
• Cross table:

               target dentist    rest     total
  V11 >= 42    871               9476     10347
  rest         801               88852    89653
  total        1672              98328    100000
The cross table
• Cross table in data:

               target dentist    rest     total
  V11 >= 42    871               9476     10347
  rest         801               88852    89653
  total        1672              98328    100000

• Cross table expected, assuming independence:

               target dentist    rest     total
  V11 >= 42    173               10174    10347
  rest         1499              88154    89653
  total        1672              98328    100000
Calculating WRAcc and Lift
• Size subgroup = P(S) = 0.10347, size target dentist = P(T) = 0.01672
• Weighted Relative ACCuracy (WRAcc) = P(ST) - P(S)P(T) = (871 - 173)/100000 = 698/100000 = 0.00698
• Lift = P(ST) / (P(S)P(T)) = 871/173 = 5.03
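The two quality measures can be recomputed directly from the cross table counts. The sketch below uses the counts of the dentist example; the function and parameter names are chosen for illustration only.

```python
def quality_measures(n_subgroup_and_target, n_subgroup, n_target, n_total):
    """Return (expected count, WRAcc, lift) for a 2x2 cross table."""
    p_s = n_subgroup / n_total              # P(S): size of the subgroup
    p_t = n_target / n_total                # P(T): size of the target
    p_st = n_subgroup_and_target / n_total  # P(ST): in subgroup and target
    expected = p_s * p_t * n_total          # count expected under independence
    wracc = p_st - p_s * p_t
    lift = p_st / (p_s * p_t)
    return expected, wracc, lift

# Counts from the cross table: 871 patients in both, 10347 in the subgroup,
# 1672 patients of the target dentist, 100000 patients in total.
expected, wracc, lift = quality_measures(871, 10347, 1672, 100000)
print(round(expected), round(wracc, 5), round(lift, 2))  # 173 0.00698 5.03
```

WRAcc rewards subgroups that are both large and unusual, while lift only measures how unusual the subgroup is, so a tiny subgroup can have a high lift and a low WRAcc.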
Making SD more useful: adding prior knowledge
• Adding prior knowledge
  – Background variables patient (age, gender, etc.)
  – Specialism practitioner
  – For dentistry: choice of insurance
• Adding already known differences
  – Already detected by domain experts themselves
  – Already detected during a previous data mining run
Example, iterative approach
• Idea: add subgroup to prior knowledge iteratively
• Target = single pharmacy
• Patients that visited the hospital in the last 3 years removed from data
• Compare with peer group (400,000 patients), 2929 patients of target pharmacy
• Top subgroup: “B03XA01 (Erythropoietin) > 0 euro”

                 target pharmacy    rest
  B03XA01 > 0    1297               224
  rest           1632               396,847

Next iteration
• Add “B03XA01 (EPO) > 0 euro” to prior knowledge
• Next best subgroup: “N05AX08 (Risperdal) >= 500 euro”
Addition: adding costs to quality measure
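  – M55: dental cleaning
  – V11: 1-way filling
  – V21: polishing
• Cost of treatments in subgroup: 370 euro (average)
• 791 more patients than expected
• Total quality: 791 * 370 ≈ 292,469 euro (the rounded factors multiply to 292,670, so the total was presumably computed from unrounded values)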
Iterative approach, top 3 subgroups
• V12: 2-sided filling
• V21: polishing
• V60: indirect pulp capping
• V21 and V60 are not allowed on the same day
• Claim back (from all dentists): 1.3 million euro
Other target types: double binary target
• Target 1: year: 2009 or 2008
• Target 2: target practitioner
• Pattern:
  – M59: extensive (expensive) dental cleaning
  – C12: second consult in one year
• Cross table:
Other target types: Multiclass target
• Subgroup (orange): group of patients
• Target (red) is now a multi-value column, one value per dentist
Outline Anomaly Detection
• Anomalies
  – Definition
  – Types
  – Technique categories
  – Examples
• Lecture based on
  – Chandola et al. (2009). Anomaly Detection: A Survey
  – Paper in BB
Definition
• “Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior”
• Anomalies, aka.
  – Outliers
  – Discordant observations
  – Exceptions
  – Aberrations
  – Surprises
  – Peculiarities
  – Contaminants
Not covered today
• Other types of anomalies:
  – Collective anomalies
  – Contextual anomalies
• Other detection approaches:
  – Supervised learning
  – Semi-supervised
    • Assume training data is from normal class
    • Use to detect anomalies in the future
We focus on outlier scores
• Scores
  – You get a ranked list of anomalies
  – “We investigate the top 10”
  – “An anomaly has a score of at least 134”
  – Leads followed by fraud investigators
• Labels
Detection method categorisation
1. Model based
2. Depth based
3. Distance based
4. Information theory related (not covered)
5. Spectral theory related (not covered)
Model based
• Build a (statistical) model of the data
• Normal data instances occur in high probability regions of a stochastic model, while anomalies occur in low probability regions
• Or: data instances that have a large distance to the model are outliers
• Or: data instances that have a high influence on the model are outliers
Example: one way outlier detection
• Pharmacy records
• Records represent patients
• One attribute at a time:
  – This example: attribute describing the costs spent on fertility medication (gonadotropin) in a year
• We could use such one-way detection for each attribute in the data
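A one-way detector of this kind can be sketched as follows. The slides do not say which statistic is used, so the choice here (distance from the median, measured in units of the median absolute deviation) is an assumption, and the cost values are made up for the example.

```python
from statistics import median

def robust_scores(values):
    """Score each value by its distance from the median, in MAD units."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Degenerate case: at least half of the values equal the median.
        return [0.0 if v == center else float("inf") for v in values]
    return [abs(v - center) / mad for v in values]

# Hypothetical yearly costs (euro) per patient for one medication attribute.
costs = [120.0, 150.0, 130.0, 140.0, 160.0, 4500.0]
scores = robust_scores(costs)

# Ranked list of (score, cost), highest score first.
ranked = sorted(zip(scores, costs), reverse=True)
```

Running the same function once per attribute gives one ranked list per attribute, which matches the "one attribute at a time" use described above.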
Other models possible
• Probabilistic
  – Bayesian networks
• Regression models
  – Regression trees / random forests
  – Neural networks
• Outlier score = prediction error (residual)
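To illustrate the residual as an outlier score, the sketch below fits a simple least-squares line. This stands in for the regression trees, random forests, or neural networks named above; the attributes (number of visits, yearly cost) and the values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residual_scores(xs, ys):
    """Outlier score = absolute prediction error of the fitted model."""
    intercept, slope = fit_line(xs, ys)
    return [abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)]

# Hypothetical data: x = number of visits, y = yearly cost in euro.
visits = [1, 2, 3, 4, 5, 6]
costs = [50.0, 100.0, 150.0, 200.0, 600.0, 300.0]
scores = residual_scores(visits, costs)

# Index of the record the model predicts worst.
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
```

Note that the outlier itself pulls the fitted line towards it, which is one reason the slides also mention influence on the model as an outlier criterion.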
Depth based methods
• Applied on 1-4 dimensional datasets
  – Or 1-4 attributes at a time
• Objects that have a high distance to the “center of the data” are considered outliers
• Example pharmacy:
  – Records represent patients
  – 2 attributes:
    • Costs spent on diabetes medication
    • Costs spent on diabetes testing material
Distance based (nearest neighbor based)
• Assumption:
  – Normal data instances occur in dense neighbourhoods, while anomalies occur far from their closest neighbours
Similarity/distance
• You need a similarity measure between two data points
  – Numeric attributes: Euclidean, etc.
  – Nominal: simple match often enough
  – Multivariate:
    • Distance using all attributes
    • Distance between attribute values, then combine
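One possible way to combine the per-attribute distances is sketched below: Euclidean distance over the numeric attributes plus a simple-match count over the nominal ones. The additive combination, the attribute names, and the assumption that numeric attributes are already scaled to comparable ranges are all choices made for this example.

```python
import math

def mixed_distance(a, b, numeric_keys, nominal_keys):
    """Combine per-attribute distances into one distance between two records."""
    # Numeric part: Euclidean distance (attributes assumed pre-scaled).
    squared = sum((a[k] - b[k]) ** 2 for k in numeric_keys)
    # Nominal part: simple match, 0 if equal and 1 if different.
    mismatches = sum(1 for k in nominal_keys if a[k] != b[k])
    return math.sqrt(squared) + mismatches

# Hypothetical records with two scaled numeric attributes and one nominal one.
first = {"cost": 0.2, "visits": 0.5, "gender": "F"}
second = {"cost": 0.5, "visits": 0.9, "gender": "M"}
distance = mixed_distance(first, second, ["cost", "visits"], ["gender"])
```

Here the numeric part contributes about 0.5 and the nominal mismatch contributes 1, so the weighting between the two parts is a design decision that would need tuning on real data.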
Example, dentistry data
• Records represent dentists
• Attributes are 14 cost categories
  – Denote the percentage of patients that received a claim from the category
Option 2: Use relative densities of neighbourhoods
• Density of neighbourhood estimated for each instance
• Instances in low density neighbourhoods are anomalous, others normal
• Note:
  – Distance to the kth neighbour is an estimate for the inverse of density (large distance means low density)
  – But this identifies outliers poorly when neighbourhoods vary in density
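The distance to the kth neighbour mentioned in the note can be computed with a brute-force sketch like the one below. The points and the choice k = 2 are illustrative; `math.dist` requires Python 3.8 or later.

```python
import math

def kth_neighbour_distance(points, k):
    """Outlier score per point: Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D data: four points in a unit square and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = kth_neighbour_distance(points, k=2)
```

This compares every pair of points, so it is only practical for small data sets; it also uses one global scale, which is exactly why it struggles when dense and sparse regions are mixed.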
LOF
• Local Outlier Factor
• Local density:
  – k divided by the volume of the smallest hyper-sphere centred around the instance, containing k neighbours
• Anomalous instance:
  – Local density will be lower than that of the k nearest neighbours
LOF score = (average local density of k nearest neighbours) / (local density of instance)
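The density ratio can be sketched directly from the definition given on the slide. This is a simplified version: it uses k divided by r^d as the local density (the constant factor of the hyper-sphere volume cancels in the ratio) and does not use the reachability distances of the full LOF algorithm. It also assumes there are no duplicate points, since a radius of zero would cause a division by zero.

```python
import math

def local_density(points, i, k):
    """Return (density, neighbour indices) for point i.

    Density is k / r**d, with r the distance to the k-th nearest neighbour
    and d the number of dimensions.
    """
    dists = sorted((math.dist(points[i], q), j) for j, q in enumerate(points) if j != i)
    radius = dists[k - 1][0]
    neighbours = [j for _, j in dists[:k]]
    return k / radius ** len(points[i]), neighbours

def density_ratio_scores(points, k):
    """Average local density of the k nearest neighbours / own local density."""
    densities, neighbours = [], []
    for i in range(len(points)):
        density, nbrs = local_density(points, i, k)
        densities.append(density)
        neighbours.append(nbrs)
    return [
        (sum(densities[j] for j in neighbours[i]) / k) / densities[i]
        for i in range(len(points))
    ]

# Hypothetical 2-D data: four points in a unit square and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = density_ratio_scores(points, k=2)
```

Points inside the square score about 1 because their density matches that of their neighbours, while the isolated point scores far above 1.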
3. Clustering based anomaly detection techniques
• 3 possibilities:
  1. Normal data instances belong to a cluster in the data, while anomalies do not belong to any cluster
    – Use clustering methods that do not force all instances to belong to a cluster
      • DBSCAN, ROCK, SNN
  2. Distance to the cluster center = outlier score
  3. Clusters with too few points are outlying clusters
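The second possibility, distance to the cluster center as outlier score, can be sketched with a small k-means implementation. The initial centers are passed in explicitly to keep the example deterministic; the points and the number of clusters are illustrative assumptions.

```python
import math

def nearest_center(point, centers):
    """Return (index, distance) of the closest cluster center."""
    return min(
        ((c, math.dist(point, center)) for c, center in enumerate(centers)),
        key=lambda pair: pair[1],
    )

def kmeans(points, centers, iterations=10):
    """Plain Lloyd iterations starting from the given initial centers."""
    for _ in range(iterations):
        members = [[] for _ in centers]
        for p in points:
            members[nearest_center(p, centers)[0]].append(p)
        # New center = mean of the members; an empty cluster keeps its center.
        centers = [
            tuple(sum(coord) / len(group) for coord in zip(*group)) if group else center
            for group, center in zip(members, centers)
        ]
    return centers

# Hypothetical 2-D data: two tight groups and one point in between.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (5.0, 3.0)]
centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])

# Outlier score = distance to the nearest cluster center.
scores = [nearest_center(p, centers)[1] for p in points]
```

Because k-means forces every point into a cluster, the in-between point also shifts the center of the cluster it joins; the clustering methods listed under the first possibility avoid this.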
K-means with 6 clusters, centers of the dentistry data set
• Attributes: percentage of patients that received a claim from the cost category
• Clusters correspond to specialism
  1. Dentist
  2. Orthodontist
  3. Orthodontist (charged by dentist)
  4. Dentist
  5. Dentist
  6. Dental hygienist
Combining Subgroup Discovery and Outlier Detection
• Describe regions with outliers using SD
• Identify suspicious medical practitioners
• 2 or 3 step approach to describe outliers:
  1. Calculate outlier score
  2. Use subgroup discovery to describe regions with outliers
  3. (optional) Identify the involved medical practitioners
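Steps 1 and 2 can be sketched as follows. The sketch assumes the outlier scores are already available, marks the top-scoring records as the target, and searches single-attribute conditions of the form `attribute >= threshold` for the highest WRAcc. The search space, the records, and the scores are assumptions made for the example; the slides do not specify the subgroup discovery algorithm that was used.

```python
def wracc(in_subgroup, is_target):
    """Weighted relative accuracy: P(ST) - P(S)P(T)."""
    n = len(is_target)
    p_s = sum(in_subgroup) / n
    p_t = sum(is_target) / n
    p_st = sum(s and t for s, t in zip(in_subgroup, is_target)) / n
    return p_st - p_s * p_t

def describe_outliers(records, scores, top_n):
    """Find the condition 'attribute >= threshold' that best covers the top outliers."""
    ranked = sorted(range(len(records)), key=lambda i: scores[i], reverse=True)
    outliers = set(ranked[:top_n])
    is_target = [i in outliers for i in range(len(records))]
    best = None
    for attribute in records[0]:
        for threshold in sorted({r[attribute] for r in records}):
            in_subgroup = [r[attribute] >= threshold for r in records]
            quality = wracc(in_subgroup, is_target)
            if best is None or quality > best[0]:
                best = (quality, attribute, threshold)
    return best

# Hypothetical patient profiles (cost in euro per category) and outlier scores.
records = [
    {"P30": 100, "M55": 40}, {"P30": 150, "M55": 45}, {"P30": 120, "M55": 38},
    {"P30": 1100, "M55": 42}, {"P30": 1200, "M55": 39}, {"P30": 130, "M55": 44},
]
scores = [0.9, 1.1, 1.0, 3.8, 4.1, 1.2]
quality, attribute, threshold = describe_outliers(records, scores, top_n=2)
```

The optional third step would then count, per practitioner, how many patients fall inside the returned subgroup.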
Example output:
• Look at patients with ‘P30>1050 euro’ for practitioner number 221
• Left: all data, right: practitioner 221
Descriptions of outliers: LOCI outlier score
• 1. Calculate outlier score
  – LOCI is a density-based outlier score
• 2. Describe outlying regions
• Result top subgroup:
  – Orthodontics (dentist) 0.044 ^ Orthodontics 0.78
  – Group of 9 dentists with an average score of 3.9