
ECONOMIC MACHINE LEARNING FOR FRAUD DETECTION

Maytal Saar-Tsechansky

2015 UT CID Report #1511

This UT CID research was supported in part by the following organizations:

identity.utexas.edu


Introduction and Motivation

The daunting risks of healthcare, identity, and cyber fraud result in billions of dollars in losses each year and pose serious threats to both individuals and nations. In recent years, predictive machine learning models have emerged as critical for effective detection of fraud by automatically learning patterns of fraud from data and by adapting to new patterns of fraud as these emerge. Such models use data on the relevant domain to estimate the likelihood (and/or amount) of fraud and to effectively allocate auditing and enforcement resources. However, learning effective predictive models from available data in these domains poses difficult challenges. Consider for example the application of machine learning to detect healthcare fraud. In this case, supervised machine learning requires data in the form of prior claims that have been previously audited, such that it is known whether or not these cases are fraudulent. However, previously audited cases are not necessarily the most informative for machine learning to learn distinct characteristics of fraud or illegitimate activities. Consequently, intelligent acquisition of particularly informative audits can cost-effectively improve fraud detection and minimize losses through better allocation of auditing and enforcement resources. Towards that, it is imperative to develop methods that would enable machine learning techniques to reason intelligently about opportunities to acquire particularly beneficial information for learning, and that in turn cost-effectively enhance key detection and enforcement goals, such as maximizing compliance or minimizing losses.

The cyber medium has generated major benefits to modern society in recent decades. However, over time it has become evident that cyber technologies are also the source of significant new risks and vulnerabilities. In effect, malicious conduct has become prevalent in cyberspace in multiple forms, rendering effective defensive solutions highly desirable. The research proposed here focuses on two critical aspects of such cyber defensive solutions: data-driven, autonomous detection technology and cost effectiveness. Specifically, we propose a framework and specific methods for economically efficient allocation of labeling tasks required in supervised learning-based defensive solutions.

Automated technologies have emerged as a critical element of cyber defense applications. Because it is impossible for security experts to continuously monitor emerging patterns within data packets, transactions, or online interactions, cyber protection relies on automated solutions to continuously detect and prevent potential threats. One key technology that supports automated solutions is data-driven supervised learning. Supervised learning addresses security threats by capturing patterns in historical data that are characteristic of threats, and then detecting these patterns in the future. Unlike rule-based approaches that require human experts to specify rules that are indicative of security threats, supervised methods “learn” patterns that characterize such threats in a data-driven manner. Supervised learning has been successfully applied to a myriad of important security applications including inappropriate content filtering, intrusion detection, video-surveillance-based intention detection, internet bullying detection, and online fraud detection, among other tasks.

Another critical aspect of cyber defense is cost effectiveness. Because it is practically impossible to purchase perfect defense, companies struggle to provide the best possible defense given tightening security budget constraints (Lawrence and Loeb, 2002). In effect, driving defense costs higher to render adequate defense economically prohibitive is a likely strategy employed by attackers. Consequently, cost-effective defense becomes a key goal for sustainable cyber defense. This objective is also shared by off-the-shelf security solution providers, which aim to remain competitive by reducing development costs. The rising costs of effective defense trigger the fundamental cyber challenge faced by companies, organizations and security solution providers: how to provide effective security given a limited budget. The research proposed here aims to address both the technological and economic aspects of this challenge in the context of supervised learning-based security defense technology. In what follows we describe the problem in more detail and outline our proposed approach.

Supervised learning is particularly suitable for efficiently achieving adequate cyber security. Supervised learning often surpasses human ability to detect patterns among vast amounts of data and requires no significant guidance from humans. Perhaps more importantly, because many cyber security challenges emerge in an adversarial setting, where attackers continuously adapt their attacks to the target defense strategies, the ability of defense systems to efficiently “learn” new characteristics of attacks directly from the data eliminates the need for security domain experts to continuously update and maintain an ever growing number of predefined rules.

However, supervised learning requires labeled training data, which are inherently costly to acquire in many important cyber settings. In particular, for supervised learning, the dependent variable value of each training example (e.g., whether or not a certain transaction is fraudulent, or whether a certain website is a “spoof”) must be known so as to induce predictive patterns from data. More specifically, security applications commonly require high levels of performance in order to detect most actual threats while minimizing undesirable false detections. To achieve such high levels of performance, data-driven security applications require large volumes of highly expensive labeled training instances to learn accurate models. For example, inappropriate content filtering requires costly labeling of large volumes of textual, image, and video content as appropriate or inappropriate. Similarly, video-surveillance-based intention detection requires labeling of numerous videos of body gestures and movements as threatening/non-threatening. Additionally, a variety of online fraud detection tasks require hiring expensive fraud specialists to determine whether a large number of transactions are in fact fraudulent or not, before these transactions can be used for training data-driven models.

Furthermore, because of the adversarial nature of many security domains, hackers and fraudsters often adapt their attacks based on current patterns of detection so as to decrease the likelihood of detection of future attacks. In such settings in particular, re-learning patterns of cyber attackers must occur continuously, requiring a constant flow of costly labeled training instances to maintain a high level of detection performance. Similarly, for effective detection of offensive Internet content (such as for online bullying detection), the rapid evolution of Internet content, such as emerging themes, expressions or slang, can render a supervised learning model obsolete unless a continuous flow of labeled training examples is available to adapt the models. Overall, the challenge to achieve the great promise of automated, data-driven learning lies in the ability to effectively manage mounting labeling costs.

Recently, online marketplaces for human workforce intelligence tasks, such as Amazon’s Mechanical Turk, or freelancers' websites (e.g., Freelancer.com) have presented exciting opportunities for bringing to bear human intelligence to support data-driven learning (Brynjolfsson et al., 2014). Particularly relevant for this research, online marketplaces such as Amazon Mechanical Turk present new opportunities for “automating” the labeling procedure by programmatically allocating data instances for labeling. These abilities also provide opportunities for cost-effective labeling. However, achieving these benefits is non-trivial. For this promise to materialize, it is imperative to characterize and address several key challenges. First, given the nature of the work as well as the incentive structure, key challenges include inaccurate (noisy) labeling (Imperators et al., 2014), timely completion of tasks, and meeting strict budget constraints. This is in addition to the challenge of carefully selecting informative data items so as to improve the accuracy of the model learned from the acquired labeled instances. Addressing all these challenges simultaneously is a novel and difficult problem that has not been addressed in prior work.

Accuracy-centric algorithms for cost-effective economic labeling

We begin by considering labeling markets in which the cost of labels can reflect arbitrary relationships. In particular, research thus far has documented conflicting evidence of relationships between the quality of labels, namely the proportion of correct labels obtained by outsourcing platforms, and the cost of label acquisitions. These include, for example, evidence of no apparent change in quality for different pay levels, or a hyperbolic relationship such that increasing costs first yield increasingly higher quality, followed by dropping quality as cost increases further. Indeed, this evidence may suggest that the relationship between cost and label quality varies across different domains.

We consider a setting in which labels can be acquired at different levels of cost, each level potentially yielding a different quality of labeling. Specifically, the quality of labeling is reflected by the probability of the label being correct. We aimed to develop an acquisition policy that is entirely data-driven and agnostic to the prevalent (though unknown) relationship between the cost of labels and the ensuing label quality. In addition, label acquisition is done sequentially, such that at each phase, N labels are acquired. All the algorithms we evaluated begin with an initial labeled set.

Our first policy, MaxRatio, aims to evaluate the expected improvement in AUC per unit cost from acquiring labels at each possible cost, chi, and to select the cost with the highest expected ratio. In particular, key to this evaluation is that we begin with an initial labeled set composed by drawing a random sample from UL and acquiring the corresponding instances' labels at each of the cost levels chi. Subsequently, the expected improvement in performance per unit cost is estimated for each labeling cost. Specifically, we draw a random subset of instances acquired at cost chi and omit it from the training set. The difference between the performance of the model induced from the labeled set L and the performance of a model induced from the “reduced” set is used as a proxy for the expected improvement in performance if an additional set of instances is acquired at cost chi. This expected improvement is then divided by the cost chi to estimate the expected improvement per unit cost. Furthermore, we repeat the evaluation of the expected improvement for different random draws of a subset of instances acquired at cost chi to accommodate the model variance. The pseudocode for the MaxRatio procedure is shown below.
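A minimal Python sketch of the leave-out estimate described above, offered as an illustration rather than the report's own pseudocode. The names `max_ratio`, `evaluate`, `holdout_size`, and `repeats`, and the `(instance, label, cost_paid)` tuple layout, are assumptions introduced here; `evaluate` stands in for whatever routine trains a model on a labeled list and returns its AUC.

```python
import random
from statistics import mean

def max_ratio(labeled, costs, evaluate, holdout_size, repeats=10, rng=None):
    """Pick the labeling cost with the highest estimated gain per unit cost.

    labeled:  list of (instance, label, cost_paid) tuples, i.e. the labeled set L
    costs:    candidate cost levels
    evaluate: hypothetical callable mapping a training list to a score such as AUC
    """
    rng = rng or random.Random(0)
    full_score = evaluate(labeled)
    ratios = {}
    for cost in costs:
        # Indices of the instances whose labels were bought at this cost.
        at_cost = [i for i, ex in enumerate(labeled) if ex[2] == cost]
        k = min(holdout_size, len(at_cost))
        if k == 0:
            continue  # nothing acquired at this cost yet, so no estimate
        gains = []
        for _ in range(repeats):
            # Omit a random subset bought at this cost and re-evaluate.
            dropped = set(rng.sample(at_cost, k))
            reduced = [ex for i, ex in enumerate(labeled) if i not in dropped]
            gains.append(full_score - evaluate(reduced))
        # Average over draws to damp model variance, then normalize by cost.
        ratios[cost] = mean(gains) / cost
    best = max(ratios, key=ratios.get)
    return best, ratios
```

Passing the scoring routine in as a callable keeps the sketch independent of any particular learner; in practice `evaluate` would likely wrap cross-validated AUC.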

MaxRatio may suffer from several potential weaknesses, most of which stem from the difficulty in correctly estimating the improvement in performance. First, due to estimation variance this estimate is likely to be imprecise, particularly early on the learning curve, during the initial acquisition phases, and when the training set L is small. A key challenge arises when all expected improvements are estimated to be negative. This is likely to be an artifact of the estimation. In such cases we employ a heuristic and use the least costly labels to minimize the risk.
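The fallback heuristic can be sketched in a few lines of Python. The function name `choose_cost` and the dictionary layout (cost level mapped to its estimated improvement) are assumptions made for this illustration.

```python
def choose_cost(estimated_gain):
    """Select a cost level from a dict mapping cost -> estimated improvement.

    If every estimate is negative (likely an estimation artifact), fall back
    to the least costly labels; otherwise take the highest estimate.
    """
    if all(gain < 0 for gain in estimated_gain.values()):
        return min(estimated_gain)  # cheapest cost level limits the risk
    return max(estimated_gain, key=estimated_gain.get)
```

For example, `choose_cost({1: -0.02, 2: -0.01})` returns `1`, while `choose_cost({1: 0.01, 2: 0.05})` returns `2`.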

MAXRATIO VARIANTS

We examined several variations of the MaxRatio algorithm. First we explore a policy which does not consider the improvement per unit cost, but only aims to estimate the expected improvement in performance from acquiring labels at different costs.

MaxQuality: If the differences in cost are small, then given the variance in estimating the expected improvement, dividing the expected change in performance by the corresponding cost has the potential to augment the error in estimation. We consider several variants of this policy:

MaxRobustRatio: Instead of entirely ignoring the cost, two other variants aim to reduce the effect of cost on selecting the best labeling cost. In particular, as before, one variant considers the marginal improvement in predictive performance. However, rather than dividing this improvement by the marginal cost per label, we divide it by the cumulative cost.

MaxPredictedQuality: This variant does not aim to estimate the expected improvement in performance by removing previously acquired instances, but rather by estimating the trend of the learning curve and projecting the expected performance.

RANDOM: As a benchmark, we consider acquisition of labels at a randomly selected cost.

The figure below presents the area under the ROC curve (AUC) obtained after each acquisition phase as a function of the cost incurred for label acquisition.
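The scoring rules behind these variants can be sketched as follows. This is a hedged illustration, not the report's implementation: the function names are hypothetical, "cumulative cost" is assumed here to mean the spend so far plus the cost of the new batch, and the learning-curve trend is assumed to be a least-squares straight line through past (training-set size, score) points.

```python
import random

def robust_ratio(gain, cost, batch_size, spent_so_far):
    """MaxRobustRatio-style score: gain divided by cumulative cost.

    Assumption: cumulative cost = prior spend + cost of the new batch.
    """
    return gain / (spent_so_far + batch_size * cost)

def projected_score(sizes, scores, next_size):
    """MaxPredictedQuality-style projection of the learning curve.

    Fits a least-squares line through (size, score) points observed for one
    cost level and extrapolates it to next_size. Needs two distinct sizes.
    """
    n = len(sizes)
    mean_x, mean_y = sum(sizes) / n, sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, scores))
    return mean_y + (sxy / sxx) * (next_size - mean_x)

def random_cost(costs, rng=None):
    """RANDOM benchmark: pick a cost level uniformly at random."""
    return (rng or random).choice(list(costs))
```

Under the cumulative-cost assumption the denominators for different cost levels share the large `spent_so_far` term, so a small difference in per-label cost shifts the score far less than it would in the plain ratio.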

Conclusions and Discussion

Our results demonstrate that evaluating the expected improvement in performance yields an ability to select generally good acquisitions in a cost-effective manner. Several policies yield comparable performance. These results suggest that our policies are able to identify acquisition costs that yield labeling quality sufficient to produce the desired improvement in performance. Yet, it would be desirable to further improve the estimation of these improvements so as to identify the best cost-effective acquisitions. One possible explanation for these results is that the differences in costs are too small to yield meaningful differences in benefits. Towards that, we plan to experiment with settings in which the differences in costs are more significant.

Another direction for improvement is to further improve the estimation of expected predictive performance. Towards that we aim to explore two directions. The first is to conduct the cross validation multiple times so as to reduce the estimation variance further. A second strategy we aim to explore is to consider adding new instances to the training set rather than evaluating the loss from omitting instances acquired at a given cost. Specifically, we aim to draw instances from the training set L acquired at a given cost chi and create copies of these instances that will be added to the training set. Evaluating the difference in performance between the augmented set and the current one may yield a better estimate of the expected change in model performance.
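The second strategy, estimating the gain from an augmented training set, might look like the following Python sketch. The names `gain_from_copies`, `evaluate`, `n_copies`, and `repeats` are hypothetical, the `(instance, label, cost_paid)` tuple layout is an assumption, and drawing the copies with replacement is a choice made for this illustration.

```python
import random
from statistics import mean

def gain_from_copies(labeled, cost, evaluate, n_copies, repeats=10, rng=None):
    """Estimate the gain from more labels at `cost` by duplicating existing ones.

    labeled:  list of (instance, label, cost_paid) tuples, the training set L
    evaluate: hypothetical callable mapping a training list to a score such as AUC
    """
    rng = rng or random.Random(0)
    at_cost = [ex for ex in labeled if ex[2] == cost]
    if not at_cost:
        return 0.0  # no instances at this cost to copy
    base_score = evaluate(labeled)
    gains = []
    for _ in range(repeats):
        # Draw copies (with replacement) of instances bought at this cost.
        copies = [rng.choice(at_cost) for _ in range(n_copies)]
        gains.append(evaluate(labeled + copies) - base_score)
    # Average over several draws to reduce the variance of the estimate.
    return mean(gains)
```

Averaging over `repeats` draws mirrors the first direction mentioned above, repeating the evaluation to reduce estimation variance.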

© 2015 Proprietary, The University of Texas at Austin, All Rights Reserved.

For more information on Center for Identity research, resources and information, visit identity.utexas.edu.

