anomaly (outlier) detection - new mexico state...

Anomaly (outlier) detection Huiping Cao, Anomaly 1


Post on 23-May-2020


TRANSCRIPT

Page 1: Anomaly (outlier) detection - New Mexico State Universityhcao/teaching/cs488508/note/8_anomaly.pdfAnomaly (outlier) detection Huiping Cao, Anomaly 1. Outline • General concepts –

Anomaly (outlier) detection

Huiping Cao, Anomaly 1

Outline

• General concepts
  – What are outliers
  – Types of outliers
  – Causes of anomalies
• Challenges of outlier detection
• Outlier detection approaches

What are outliers

• The set of data points that are significantly different from the rest of the objects
• Assumption
  – There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
• Applications
  – Fraud detection (credit card usage)
  – Intrusion detection (computer systems, computer networks)
  – Ecosystem disturbances
  – Public health
  – Medicine
• Related
  – Novelty detection

Types of outliers

• Global: deviate significantly from the rest of the data set
  – Also called point anomalies
  – Most outlier detection methods are designed to find such outliers
• Example
  – Intrusion detection in network traffic

Types of outliers

• Contextual (conditional) outliers
  – An object is an outlier in one context, but may be normal in another context
  – Contextual attributes: define the object’s context.
    • date, location
  – Behavior attributes: define the object’s characteristics, and are used to evaluate whether the object is an outlier in the context.
    • temperature
  – A generalization of the local outlier, which is defined in density-based analysis.
  – Background information is needed to determine the contextual attributes, etc.
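To make the split between contextual and behavior attributes concrete, here is a minimal sketch that scores a temperature (behavior attribute) only against readings that share its month (contextual attribute). The function name `contextual_outliers`, the sample readings, and the threshold of 2 standard deviations are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict
from statistics import fmean, pstdev

def contextual_outliers(records, threshold=2.0):
    """Flag (context, value) records whose value is unusual within its own context."""
    # Group the behavior attribute (value) by the contextual attribute.
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        values = by_context[context]
        mu, sigma = fmean(values), pstdev(values)
        # Compare each value only with values from the same context.
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical readings: 28 degrees is ordinary in July but unusual in January.
records = [("Jan", t) for t in (-2, 0, 1, -1, 0, 28)] + \
          [("Jul", t) for t in (27, 28, 29, 28, 27, 29)]
flagged = contextual_outliers(records)
print(flagged)
```

The same value, 28, appears in both months, but only the January reading is flagged, because it is judged against January readings alone.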

Types of outliers

• Collective: a subset of data objects forms a collective outlier if the objects as a whole deviate significantly from the entire data set
  – The individual data objects may not be outliers
  – Applications: supply-chain, web visiting, network (denial-of-service)
  – Needs background information to model the relationships among objects

Causes of anomalies

• Data from different classes
  – Hawkins’ definition of an outlier: an outlier is an observation that differs so much from other observations as to arouse suspicion that it was generated by a different mechanism.
• Natural variation
  – Anomalies that represent extreme or unlikely variations (e.g., an extremely tall person)
• Data measurement and collection errors
  – Removing such anomalies is the focus of data preprocessing (data cleaning)
• Others: several sources


Challenges of outlier detection

• Modeling normal and outlier objects
  – Hard to model all normal behavior completely
  – Some methods assign “normal” or “abnormal”
  – Some methods assign a score measuring the “outlier-ness” of the object.
• Universal outlier detection: hard to develop
  – Similarity and distance definition is application-dependent
• Common issues: noise
• Understandability
  – Understand why the detected objects are outliers
  – Provide justification of the detection

Outline

• General concepts
  – What are outliers
  – Types of outliers
• Challenges of outlier detection
• Outlier detection approaches
  – Statistical methods
  – Proximity-based methods
  – Clustering-based methods

Outlier detection methods

• Data for analysis are labeled with “normal” or “abnormal” by domain experts.
• Supervised methods
  – Can be modeled as a classification problem
  – Special aspects to consider: imbalanced numbers of normal data points and abnormal points
  – Measures: recall (the fraction of true outliers that are detected) is more meaningful than accuracy
• Unsupervised methods
  – Largely utilize clustering methods
• Semi-supervised
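The point about imbalance and recall can be illustrated with a small sketch. The label counts (990 normal, 10 abnormal) and the helper name `accuracy_and_recall` are assumptions chosen for illustration. A trivial classifier that always answers "normal" looks excellent by accuracy while detecting no outliers at all.

```python
def accuracy_and_recall(y_true, y_pred, positive="abnormal"):
    """Return (accuracy, recall of the positive class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    return correct / len(y_true), true_pos / actual_pos

# Hypothetical imbalanced data set: 1% of the objects are outliers.
y_true = ["normal"] * 990 + ["abnormal"] * 10
always_normal = ["normal"] * 1000  # a classifier that never reports an outlier

acc, rec = accuracy_and_recall(y_true, always_normal)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```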

Outlier detection methods

• Outlier detection algorithms make assumptions about outliers versus the rest of the data.
• Categories according to the assumptions made
  – Statistical methods (or model based)
    • Normal data follow a statistical (stochastic) model
    • Outliers do not follow the model
  – Proximity-based methods
    • The proximity of outliers to their neighbors is different from the proximity of most other objects to their neighbors
    • Distance-based, density-based
  – Clustering-based methods
    • Normal objects belong to large and dense clusters
    • Outliers belong to small or sparse clusters, or belong to no cluster

Statistical approaches

• Probabilistic definition of an outlier: an outlier is an object that has a low probability with respect to a probability distribution model of the data.
  – Normal objects are generated by a stochastic process and occur in regions of high probability for the stochastic model
  – Outliers occur in regions of low probability
• Approach steps
  – Learn a generative model fitting the given data
  – Identify the objects in low-probability regions of the model
• Categories
  – Parametric method (univariate, multivariate)
  – Nonparametric method

Parametric: univariate Normal Distribution

• Normal distribution, maximum likelihood estimation (MLE)
  – Standard normal distribution, N(0,1)
  – Non-standard normal distribution, N(μ, σ²), z-score
  – Use MLE to estimate μ and σ²

Parametric: univariate Normal Distribution

• prob(|x| ≥ c) = α for N(0,1)
  – Mark an object as an outlier if it is more than 3σ away from the estimated mean μ, where σ is the standard deviation (the μ ± 3σ region contains 99.73% of the data)
• (c, α) pair for N(0,1)

c      α for N(0,1)
1.0    0.3173
1.5    0.1336
2.0    0.0455
2.5    0.0124
3.0    0.0027
3.5    0.0005
4.0    0.0001
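The (c, α) pairs can be recomputed from the standard normal distribution: for x ~ N(0,1), prob(|x| ≥ c) equals the complementary error function evaluated at c/√2. The sketch below uses only `math.erfc` from the Python standard library; the function name `two_sided_tail` is an illustrative choice.

```python
import math

def two_sided_tail(c):
    """Return prob(|x| >= c) for x drawn from the standard normal N(0, 1)."""
    return math.erfc(c / math.sqrt(2))

# Recompute alpha, rounded to four decimals, for the thresholds in the table.
table = {c: round(two_sided_tail(c), 4) for c in (1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)}
for c, alpha in table.items():
    print(f"{c:.1f}  {alpha:.4f}")
```

For c = 3 the tail probability is 0.0027, which is the complement of the 99.73% covered by the μ ± 3σ region.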

Parametric: univariate Normal Distribution

• Example
• A city’s average temperature values in 10 years: 24, 28.9, 28.9, 29, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4
  – μ = 28.61
  – σ² ≅ 2.38 (MLE, dividing by n = 10), σ = sqrt(2.38) = 1.54
  – Is 24 an outlier?
    • z-score = |24 − 28.61| / 1.54 = 2.99
    • ≈ 3: 24 lies about 3σ from the mean, right at the threshold of the 3σ rule
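The temperature example can be checked with a few lines of Python. This is a sketch that assumes the maximum likelihood estimates (mean and population standard deviation, dividing by n); the helper name `z_score` is illustrative. Recomputing directly gives σ² ≈ 2.38 and a z-score of about 2.99, so small rounding differences in σ decide on which side of 3 the score falls.

```python
import statistics

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

def z_score(value, values):
    """Absolute z-score of value under the MLE normal fit to values."""
    mu = statistics.fmean(values)      # MLE of the mean
    sigma = statistics.pstdev(values)  # MLE of sigma (divides by n, not n - 1)
    return abs(value - mu) / sigma

mu = statistics.fmean(temps)
variance = statistics.pvariance(temps)
z = z_score(24.0, temps)
print(f"mu={mu:.2f}, variance={variance:.2f}, z={z:.2f}")
```

Using the sample standard deviation (dividing by n − 1) instead would give a somewhat smaller z-score, so the choice of estimator matters for borderline values like this one.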

Parametric: other univariate outlier detection approaches (S.S.)

• Boxplot method
• Grubbs’ test (maximum normed residual test)
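Both methods are only named on the slide, so the following is a hedged sketch rather than the course's own procedure. For the boxplot method it assumes the conventional fences of 1.5 × IQR beyond the quartiles. For Grubbs' test it computes only the test statistic G (largest absolute deviation from the mean divided by the sample standard deviation); comparing G with a critical value from the t-distribution is omitted. The function names are illustrative.

```python
import statistics

def boxplot_outliers(values, whisker=1.5):
    """Values outside [Q1 - whisker*IQR, Q3 + whisker*IQR] (assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

def grubbs_statistic(values):
    """Maximum normed residual: max |x - mean| / sample standard deviation."""
    mean = statistics.fmean(values)
    return max(abs(v - mean) for v in values) / statistics.stdev(values)

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(boxplot_outliers(temps))
print(round(grubbs_statistic(temps), 2))
```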

Parametric: multivariate

• Multivariate
  – Convert the problem to a univariate outlier detection problem
  – Use the Mahalanobis distance from object o to the mean μ
  – Use the χ² statistic, $\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}$
    • o_i: the value of o on the i-th dimension
    • E_i: the mean of the i-th dimension of all objects
    • n: the number of dimensions
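As a sketch of how the χ² statistic turns a multivariate object into a single score, the code below sums (o_i − E_i)² / E_i over the dimensions, with E_i the mean of dimension i. It assumes every dimension has a nonzero (in practice positive) mean; the function name `chi_square_scores` and the four sample objects are made up for illustration.

```python
def chi_square_scores(objects):
    """Chi-square score of each object: sum over dimensions of (o_i - E_i)^2 / E_i."""
    n_dims = len(objects[0])
    # E_i: mean of the i-th dimension over all objects (assumed nonzero).
    means = [sum(o[i] for o in objects) / len(objects) for i in range(n_dims)]
    return [
        sum((o[i] - means[i]) ** 2 / means[i] for i in range(n_dims))
        for o in objects
    ]

# Hypothetical two-dimensional objects; the last one sits far from the others.
objects = [(1, 2), (2, 3), (3, 4), (10, 1)]
scores = chi_square_scores(objects)
print([round(s, 2) for s in scores])
```

A large score marks an object as a candidate outlier, and the univariate techniques above can then be applied to the scores.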

Nonparametric

• Nonparametric methods make fewer assumptions about the data distribution, and thus can be applied in more scenarios
• Histogram approach
  – Construct histograms (types: equal width or equal depth; choose the number of bins or the size of each bin)
  – Outliers: objects that fall in no bin, or in bins that contain few objects
  – Drawback: hard to decide the bin size
• Others: kernel functions (discussed in more detail in machine learning)
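The histogram approach can be sketched with equal-width bins: count how many values fall in each bin and flag values whose bin holds fewer than some minimum count. The bin width, the minimum count, and the function name `histogram_outliers` are assumptions for illustration. The result changes with the bin width, which is the drawback noted above.

```python
from collections import Counter

def histogram_outliers(values, bin_width, min_count):
    """Flag values that fall in equal-width bins holding fewer than min_count values."""
    lo = min(values)

    def bin_of(v):
        # Index of the equal-width bin containing v, counted from the minimum value.
        return int((v - lo) // bin_width)

    counts = Counter(bin_of(v) for v in values)
    return [v for v in values if counts[bin_of(v)] < min_count]

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
print(histogram_outliers(temps, bin_width=1.0, min_count=2))
```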


Proximity-based Approaches

• Data is represented as a vector of features
• Based on the neighborhood
• Major approaches
  – Distance based
  – Density based

Distance-based approach

• Anomaly: an object that is distant from most points.
• Distance to k-nearest neighbor: the outlier score of an object is given by the distance to its k-th nearest neighbor.
• Outliers: objects whose score exceeds a threshold
• Problem: hard to decide k (see next slides)
• Improvement: average of the distances to the first k nearest neighbors
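A minimal sketch of both scores, using a brute-force search over all points. The function names and the sample points (a small cluster plus a tight pair of far-away points) are illustrative assumptions. The pair shows why k is hard to choose: with k = 1 the two remote points are each other's nearest neighbor and get a low score, while with k = 2 they score high.

```python
import math

def knn_distances(points, i, k):
    """Sorted distances from points[i] to its k nearest neighbors (brute force)."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[:k]

def kth_nn_score(points, i, k):
    """Outlier score: distance to the k-th nearest neighbor."""
    return knn_distances(points, i, k)[-1]

def avg_knn_score(points, i, k):
    """Outlier score: average distance to the first k nearest neighbors."""
    dists = knn_distances(points, i, k)
    return sum(dists) / len(dists)

# Hypothetical data: five clustered points and a remote pair (indices 5 and 6).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5), (5, 5.2)]
for k in (1, 2):
    print(k, [round(kth_nn_score(points, i, k), 2) for i in range(len(points))])
```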

[Figure: scatter plots of the same data with both axes running from 0 to 2.5. Panel captions: “k=1, outlier is O”; “k=5, all points at the right upper corner are outliers”.]

Distance-based outlier detection

• Given a data set D with n data points, a distance threshold r, and a fraction threshold π
• r-neighborhood of an object o: the objects o’ with dist(o, o’) ≤ r
• Object o is a DB(r, π)-outlier if $\frac{|\{o' \mid dist(o, o') \le r\}|}{|D|} \le \pi$
• Approach:
  – Compute the distance between every pair of data points
  – Worst case: O(n²)
  – Practically, close to O(n), since the check for an object can stop as soon as more than πn neighbors are found
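The pairwise approach can be sketched as a nested loop with early termination. The sketch assumes that an object is not counted among its own neighbors and that o is a DB(r, π)-outlier when at most πn other objects lie within distance r. The function name `db_outliers` and the sample points are illustrative.

```python
import math

def db_outliers(points, r, pi):
    """Return the points with at most pi * n other points within distance r."""
    limit = pi * len(points)
    outliers = []
    for i, o in enumerate(points):
        count = 0
        is_outlier = True
        for j, other in enumerate(points):
            if i != j and math.dist(o, other) <= r:
                count += 1
                if count > limit:
                    # Enough neighbors found: o cannot be an outlier, stop early.
                    is_outlier = False
                    break
        if is_outlier:
            outliers.append(o)
    return outliers

# Hypothetical data: a four-point cluster and one remote point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(db_outliers(points, r=1.5, pi=0.2))
```

The early `break` is what keeps the running time close to linear when most objects have many neighbors; the worst case remains quadratic.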

A grid-based method implementation

• Cell diagonal length: r/2
• Cell edge length: $\frac{r}{2\sqrt{d}}$, where d is the number of dimensions
• Level-1 cell
  – Direct neighbor cells of a cell C
  – Any point o’ in such cells has dist(o, o’) ≤ r for every point o in C
• Level-2 cell
  – Cells that are one or two cells away from C, beyond its level-1 cells (in 2-D)
  – Any point o’ with dist(o, o’) ≤ r must be in C, a level-1 cell, or a level-2 cell; points beyond the level-2 cells are farther than r from o
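The cell geometry can be checked numerically. The sketch below computes the edge length r / (2√d), maps a point to the index of its grid cell, and confirms that the farthest two points in a cell and one of its direct neighbors can be is exactly r (the diagonal of a block two cells wide). The function names and the assumption that the grid is anchored at the origin are illustrative.

```python
import math

def cell_edge(r, d):
    """Edge length of a grid cell whose diagonal is r / 2 in d dimensions."""
    return r / (2 * math.sqrt(d))

def cell_index(point, r):
    """Integer grid coordinates of the cell containing point (grid anchored at 0)."""
    edge = cell_edge(r, len(point))
    return tuple(math.floor(x / edge) for x in point)

def max_level1_distance(r, d):
    """Largest possible distance between a point in C and a point in a level-1 cell."""
    # Worst case: opposite corners of a block two cells wide in every dimension.
    return 2 * cell_edge(r, d) * math.sqrt(d)

print(cell_edge(1.0, 2), cell_index((0.9, 0.1), 1.0), max_level1_distance(1.0, 2))
```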

A grid-based method implementation

• Pruning
  – n0: total number of objects in a cell C
  – n1: total number of objects in C’s level-1 cells
  – n2: total number of objects in C’s level-2 cells
• Level-1 cell pruning:
  – If (n0 + n1) > πn, no object o in C is an outlier
• Level-2 cell pruning:
  – If (n0 + n1 + n2) < πn + 1, all the points in C are outliers
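The two pruning rules amount to a three-way decision per cell, sketched below exactly as stated on the slide. The function name `classify_cell` and the returned labels are illustrative; cells that satisfy neither rule still need their objects checked one by one.

```python
def classify_cell(n0, n1, n2, pi, n):
    """Apply the level-1 and level-2 pruning rules to one grid cell C.

    n0, n1, n2: object counts in C, in its level-1 cells, and in its level-2 cells.
    pi, n: fraction threshold and total number of objects in the data set.
    """
    if n0 + n1 > pi * n:
        return "no outliers"        # level-1 rule: every object in C has enough neighbors
    if n0 + n1 + n2 < pi * n + 1:
        return "all outliers"       # level-2 rule: no object in C can have enough neighbors
    return "check individually"     # neither rule applies

# Hypothetical cells in a data set of n = 100 objects with pi = 0.05 (pi * n = 5).
print(classify_cell(4, 3, 0, pi=0.05, n=100))
print(classify_cell(1, 0, 2, pi=0.05, n=100))
print(classify_cell(2, 1, 10, pi=0.05, n=100))
```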

Distance-based outlier detection

• Global outliers: cannot handle data sets with regions of different densities

[Figure: data set with two marked points, p1 and p2]


Density-based outlier detection

• Local proximity-based outlier
• Compare the density around one object with the density around its local neighbors

[Figure: data set with two marked points, p1 and p2]

Density based

• D: a set of objects
• Distance from o to its nearest neighbor
  – d(o, D) = min{d(o, o’) | o’ in D, o’ ≠ o}
• Local outliers: relative to their local neighborhoods, particularly with respect to the densities of the neighborhoods.
• Density based outlier: the outlier score of an object is the inverse of the density around the object.

Concepts

• k-distance of an object o, d_k(o): the distance from o to its k-th nearest neighbor, used to measure the relative density around o.
• Formally, d_k(o) = d(o, p) for an object p in D such that
  – at least k objects o’ in D \ {o} have d(o, o’) ≤ d(o, p)
  – at most k−1 objects o’ in D \ {o} have d(o, o’) < d(o, p)
• k-distance neighborhood of an object o
  – N_k(o) = {o’ | o’ in D \ {o}, d(o, o’) ≤ d_k(o)}
  – N_k(o) may contain more than k objects
• Measure local density: average distance from o to N_k(o)
  – Problem: fluctuations

Concepts

• Reachability distance
  – reachdist(o’ → o) = max{d_k(o), d(o, o’)}
  – Alleviates fluctuations
  – Not symmetric: in general, reachdist(o’ → o) ≠ reachdist(o → o’)
• Local density of o: the inverse of the average reachability distance from o to the objects in N_k(o)
• Different from the density definition in density-based clustering
  – Global/local

$$\mathrm{density}_k(o) = \frac{|N_k(o)|}{\sum_{o' \in N_k(o)} \mathrm{reachdist}(o \to o')} = \frac{|N_k(o)|}{\sum_{o' \in N_k(o)} \max\{d_k(o'),\, d(o, o')\}}$$

Example

• k = 2, use Euclidean distance
• Distance from o to o’s 2NN is 1
• d_k(o) = 1
• N_k(o) = {p1, p2, p3}
  – d_k(p1) = sqrt(0.64 + 1.0) = 1.28, dist(o, p1) = 0.8
  – d_k(p2) = sqrt(2) = 1.41, dist(o, p2) = 1
  – d_k(p3) = sqrt(0.32) = 0.57, dist(o, p3) = 1
  – reachdist(o → p1) = 1.28
  – reachdist(o → p2) = 1.41
  – reachdist(o → p3) = 1
• density_k(o) = 3 / (1.28 + 1.41 + 1) = 0.813

[Figure: scatter plot (x from 0 to 3, y from 0 to 5) showing O and its neighbors p1, p2, p3]
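The arithmetic of this example can be replayed in a few lines. Because the figure with the point coordinates is not available, the sketch takes the rounded distances and k-distances quoted in the example as given inputs; the function name `local_density` and the dictionary layout are illustrative assumptions.

```python
def local_density(dist_to_neighbor, k_dist_of_neighbor):
    """Local density of o from its neighbors' distances and k-distances.

    reachdist(o -> p) = max(d_k(p), d(o, p)); the density is |N_k(o)| divided
    by the sum of these reachability distances.
    """
    reach = {p: max(k_dist_of_neighbor[p], d) for p, d in dist_to_neighbor.items()}
    return len(reach) / sum(reach.values()), reach

# Rounded values quoted in the example (k = 2).
dist_o = {"p1": 0.8, "p2": 1.0, "p3": 1.0}   # d(o, p)
k_dist = {"p1": 1.28, "p2": 1.41, "p3": 0.57}  # d_k(p)

density, reach = local_density(dist_o, k_dist)
print(reach, round(density, 3))
```

Note how p3's small k-distance (0.57) is overridden by its actual distance to o (1), while for p1 and p2 the neighbor's k-distance dominates; this smoothing is what dampens fluctuations.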

• Local outlier factor (LOF) (or average relative density of o)
  – Average ratio of the local reachability density of each of the k-nearest neighbors of o to the local reachability density of o
  – The lower density_k(o) and the higher density_k(o’), the higher the LOF → higher probability of being an outlier
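Putting the definitions together, LOF is usually written as LOF_k(o) = (1 / |N_k(o)|) · Σ_{o’ ∈ N_k(o)} density_k(o’) / density_k(o); the slide's own formula is not preserved, so this form is an assumption consistent with the description above. The brute-force sketch below is meant only to show the computation on a handful of points; the function names and sample data are illustrative, and it is far too slow for large data sets.

```python
import math

def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def neighborhood(points, i, k):
    """Indices of points within the k-distance of points[i]; ties may give more than k."""
    dk = k_distance(points, i, k)
    return [j for j, p in enumerate(points)
            if j != i and math.dist(points[i], p) <= dk]

def density(points, i, k):
    """|N_k(o)| divided by the sum of reachability distances from o to its neighbors."""
    nbrs = neighborhood(points, i, k)
    total = sum(max(k_distance(points, j, k), math.dist(points[i], points[j]))
                for j in nbrs)
    return len(nbrs) / total

def lof(points, i, k):
    """Average ratio of the neighbors' densities to the density of points[i]."""
    nbrs = neighborhood(points, i, k)
    own = density(points, i, k)
    return sum(density(points, j, k) for j in nbrs) / (len(nbrs) * own)

# Hypothetical data: a four-point cluster and one remote point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4)]
print([round(lof(points, i, 2), 2) for i in range(len(points))])
```

Points inside the cluster get an LOF of about 1 (their density matches their neighbors'), while the remote point gets a value well above 1.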

Example (continued)

• Same setting as the previous example: k = 2, Euclidean distance, N_k(o) = {p1, p2, p3}, density_k(o) = 3 / (1.28 + 1.41 + 1) = 0.813
• Then, calculate density_k(p1), density_k(p2), density_k(p3)


Clustering-Based

• Clustering-based outlier: an object is a cluster-based outlier if the object does not strongly belong to any cluster.
• An outlier
  – an object belonging to a small and remote cluster
  – or not belonging to any cluster

Clustering-Based

• Basic steps:
  – Cluster the data into groups of different density
• Three general approaches
  – An object does not belong to any cluster → outlier object
  – There is a large distance between an object and the cluster to which it is closest → outlier
  – The object is part of a small and sparse cluster → all the objects in that cluster are outliers

Approach 2

• There is a large distance between an object and the cluster to which it is closest → outlier
• Calculate the ratio below; the larger the ratio, the farther away o is from its closest cluster C_o (c_o denotes the center of C_o)

$$\mathrm{ratio} = \frac{d(o, c_o)}{\frac{1}{|C_o|} \sum_{o' \in C_o} d(o', c_o)}$$
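A short sketch of the ratio, assuming the clusters have already been produced by some clustering algorithm and that a cluster's center is the mean of its members. The function names and the two sample clusters are illustrative.

```python
import math

def centroid(cluster):
    """Mean of the cluster's points, taken dimension by dimension."""
    dims = len(cluster[0])
    return tuple(sum(p[i] for p in cluster) / len(cluster) for i in range(dims))

def distance_ratio(o, clusters):
    """d(o, c_o) divided by the average distance of C_o's members to c_o."""
    centers = [centroid(c) for c in clusters]
    # C_o is the cluster whose center is closest to o.
    idx = min(range(len(clusters)), key=lambda i: math.dist(o, centers[i]))
    c_o, members = centers[idx], clusters[idx]
    avg_member_dist = sum(math.dist(p, c_o) for p in members) / len(members)
    return math.dist(o, c_o) / avg_member_dist

# Hypothetical clusters, each a square of four points.
clusters = [
    [(0, 0), (0, 2), (2, 0), (2, 2)],
    [(10, 10), (10, 12), (12, 10), (12, 12)],
]
print(round(distance_ratio((5, 1), clusters), 2))  # object outside both clusters
print(round(distance_ratio((0, 0), clusters), 2))  # a member of the first cluster
```

A ratio near 1 means the object is about as far from the center as a typical member; larger values point to an outlier, with the cut-off left to the application.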

Outliers in Lower Dimensional Projection

• In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
  – Every point is an almost equally good outlier from the perspective of proximity-based definitions
• Lower-dimensional projection methods
  – A point is an outlier if, in some lower dimensional projection, it is present in a local region of abnormally low density

R packages

• https://cran.r-project.org/web/packages/outliers/outliers.pdf
• R parallel implementation of Local Outlier Factor (LOF) which uses multiple CPUs to significantly speed up the LOF computation for large data sets. https://cran.r-project.org/web/packages/Rlof/Rlof.pdf
• Python LOF implementation: http://shahramabyari.com/2015/12/30/my-first-attempt-with-local-outlier-factorlof-identifying-density-based-local-outliers/