Practical Machine Learning for Cloud Intrusion Detection: Challenges and the Way Forward

Ram Shankar Siva Kumar
Microsoft
ABSTRACT

Operationalizing machine learning based security detections is extremely challenging, especially in a continuously evolving cloud environment. Conventional anomaly detection does not produce satisfactory results for analysts that are investigating security incidents in the cloud. Model evaluation alone presents its own set of problems due to a lack of benchmark datasets. When deploying these detections, we must deal with model compliance, localization, and data silo issues, among many others. We pose the problem of attack disruption as a way forward in the security data science space. In this paper, we describe the framework, challenges, and open questions surrounding the successful operationalization of machine learning based security detections in a cloud environment and provide some insights on how we have addressed them.
KEYWORDS

machine learning, security, intrusion detection, cloud

ACM Reference format:
Ram Shankar Siva Kumar, Andrew Wicker, and Matt Swann. 2017. Practical Machine Learning for Cloud Intrusion Detection. In Proceedings of AISec'17, Dallas, TX, USA, November 3, 2017, 10 pages.
DOI: 10.1145/3128572.3140445
1 INTRODUCTION

The increasing prevalence of cybersecurity attacks has created an imperative for companies to invest in effective tools and techniques for detecting such attacks. The intrusion detection system market is expected to grow to USD 5.93 billion by 2021 at a compound annual growth rate of 12%.
Academia [8, 17, 29] and industry have long focused on building security detection systems (shortened hereafter as detection) for traditional, static, on-premise networks (also called bare metal), while research in employing machine learning in the cloud setting is more nascent [20, 24, 26]. Whether the detection systems are for bare metal or for the cloud, the emphasis is almost always on the algorithmic machinery. This paper takes a different approach: instead of detailing a single algorithm or technique that may or may not be applicable depending on factors like volume of data, velocity of operation (batch, near real time, real time), and availability of labels, we document the challenges and open questions in building machine learning based detection systems for the cloud. In this
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
AISec'17, Dallas, TX, USA © 2017 Copyright held by the owner/author(s). 978-x-xxxx-xxxx-x/YY/MM...$15.00
DOI: 10.1145/3128572.3140445
spirit, this paper is more closely related to [?] but is very specific to building monitoring systems for the cloud's backend infrastructure.
We report the lessons learned in securing Microsoft Azure, which depends on more than 300 different backend infrastructure services to ensure correct functionality. These 300+ services support all flavors of cloud offerings: public cloud (accessible by all customers) and private cloud (an implementation of cloud technology within an organization). Within these cloud offerings, the backend services also support different customer needs like Infrastructure as a Service (IaaS) and Platform as a Service (PaaS). The Azure backend infrastructure generates tens of petabytes of log data per year, which has a direct impact on building machine learning based intrusion detection systems. In this setting, seemingly simple tasks such as detecting login anomalies can be difficult when one has to wrestle with 450 billion login events yearly.
There are other problems besides scalability. Firstly, the cloud environment is constantly shifting: virtual machines are constantly deployed and decommissioned based on demand and usage, and developers continuously push out code to support new features, which inherently changes the data distributions and the assumptions made during model building. Secondly, each backend service functions differently. For instance, the backend service that orchestrates Azure's storage solution is architected differently from the backend service that allocates computation power. Hence, to continue with the login anomaly example, one must account for different architectures and data distributions and analyze each service separately. Furthermore, the cloud, unlike traditional systems, is geo-distributed. For instance, Azure has 36 data centers across the world, including China, Europe, and the Americas, and hence must respect the privacy and compliance laws of the individual regions. This poses novel challenges in operationalizing security data science solutions. For instance, compliance restrictions that dictate data cannot be exported from specific geographic locations (a security constraint) have a downstream effect on model design, deployment, evaluation, and management strategies (a data science constraint).
This paper focuses on the practical hurdles in building machine learning systems for intrusion detection in a cloud environment for securing the backend infrastructure, as opposed to offering frontend security solutions to external customers. Hence, the alerts produced by the detection systems discussed in this paper are consumed by in-house Microsoft security analysts as opposed to paying customers who buy Azure services. Though not discussed in this paper, we would like to highlight that the frontend monitoring solutions built for external customers are considerably different from backend solutions, as the threat landscape differs based on the customer's cloud offering selection. For instance, if a customer chooses IaaS, important security tasks such as firewall configuration, patching, and management are the customer's responsibility, as opposed to PaaS, where most of the security tasks are the cloud provider's responsibility. In practice, the difference between PaaS and IaaS dictates different security monitoring solutions.
This paper is not about fraud, malware, spam, or specific algorithms or techniques. Instead, we share several open questions related to model compliance, generating attack data for model training, siloed detections, and automation for attack disruption, all in the context of monitoring internal cloud infrastructure.
We begin with a discussion about building models (or systems) that distinguish between statistical anomalies and security-interesting events using domain knowledge. This is followed by a discussion of techniques for evaluating security detections. We then describe issues surrounding model deployment, such as privacy and localization, and present some approaches to address these issues. We move on to discuss issues with siloed data and models. We conclude with some ways to move from attack detection to attack disruption.
2 EVOLUTION TO SECURITY INTERESTING ALERTS
Here is a typical industry scenario: an organization invests in log collection and monitoring systems, then hires data scientists to build advanced security detections, only to find that the team of security analysts is unhappy with the results. Disgruntled analysts are not the only thing at stake here: a recent study by the Ponemon Institute showed that organizations spend, on average, nearly 21,000 hours each year analyzing false positive security alerts, wasting roughly $1.3 million yearly. To address this issue, it can be appealing to invest in a more complex algorithm that presumably can reduce the false positive rate and surface better anomalies. However, as we describe below, blind adherence to this strategy tends not to yield the desired results.
As mentioned earlier, Azure has hundreds of backend services that are all architected differently. On the one hand, it is impossible to have a single generic anomaly detection that captures the nuances of each service. On the other hand, it is cumbersome to build bespoke machine learning detections for each service. In this section, we describe strategies to combine the regular anomaly detection setting with domain knowledge from the service and security experts, in the form of rules, to lower false positive rates.
We have established the following criteria for security alerts to help maximize their usefulness to security analysts: Explainable, Credible, and Actionable. Unfortunately, anomaly detection in an industry setting rarely satisfies these criteria. This is because anomalous events are present in any organization, but not all of these anomalies are security interesting, which is what the security analysts care about.
As an example, we encountered the following issue when building an anomalous executable detection. We collaborated with our security investigation team to better understand how attackers masquerade their tools to match common executables. For instance, attackers would name their tool ccalc.exe to be deceptively similar to the Microsoft Windows Calculator program calc.exe. We sought to develop an anomaly detection for finding abnormal executables based on the executable name and metadata.
When we ran this new detection, security experts found most of the alerts were false positives despite conforming to their definition
Figure 1: A Venn diagram depicting the intersection of security interesting alerts
of attacker activity. For instance, the detection system found an executable named psping.exe that closely resembles ping.exe, but the investigation team found that the service engineers were using a popular system utility tool. This soon became a recurring theme: the alert appeared worthy of investigation at first glance, but after spending considerable resources on the investigation, we would conclude that the alert was a false positive.
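The name-similarity core of such a detection can be sketched with a string-similarity measure. The whitelist, threshold, and helper names below are illustrative assumptions, not the paper's actual implementation, which also used executable metadata:

```python
from difflib import SequenceMatcher

# Illustrative whitelist of known-good executable names (assumption).
KNOWN_EXECUTABLES = {"calc.exe", "ping.exe", "notepad.exe"}

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def masquerade_candidates(name: str, threshold: float = 0.8):
    """Flag names close to, but not exactly equal to, a known executable."""
    return [known for known in KNOWN_EXECUTABLES
            if name.lower() != known and similarity(name, known) >= threshold]

# "ccalc.exe" is suspiciously close to "calc.exe" ...
print(masquerade_candidates("ccalc.exe"))   # -> ['calc.exe']
# ... but so is the legitimate utility "psping.exe" to "ping.exe",
# which is exactly the false-positive problem described above.
print(masquerade_candidates("psping.exe"))  # -> ['ping.exe']
```

Note that the sketch reproduces the failure mode from the text: a purely statistical notion of "abnormal name" cannot distinguish ccalc.exe (malicious) from psping.exe (benign).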
In order to generate useful results, we moved away from simple anomaly detection and focused our efforts on systems that produce security interesting alerts. We define such a system as one that captures an adversary's tools, tactics, and procedures from the gathered event data while ignoring expected activity. We show later in the section how rules and domain knowledge can help in these aspects.
As a first step, we recommend that machine learning engineers consult with security domain experts to see if there is any overlap between the attacker activity that we seek to detect and expected activity. If there is some overlap, then this is a hygiene issue and must be addressed. For instance, attackers often elevate privileges using the Run as Administrator functionality when compromising infrastructure machines, which can be tracked easily in security event logs. It is standard operating procedure that service engineers must never elevate to admin privileges without requesting elevated privileges through a just-in-time access system. This way, the service engineer's high-privileged activity is monitored and, more importantly, is scoped to a short period of time. However, service engineers often disregard this rule when they are debugging. This creates a problem in which regular service engineer activity is almost indistinguishable from attacker activity, which we refer to as poor hygiene (see Figure 1). Specifying and strictly enforcing operating procedures to correct poor hygiene is the first step in reducing the false positive rate of the system.
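The just-in-time (JIT) hygiene check described above can be sketched as a predicate over elevation events: an elevation is suspicious unless it falls inside an active, matching JIT grant. The field names and the grant window below are illustrative assumptions, not Azure's actual schema:

```python
from datetime import datetime, timedelta

def is_hygiene_violation(elevation, jit_grants, max_grant_hours=4):
    """True if an admin elevation has no matching, active JIT grant.

    `elevation` and each grant are dicts; the field names and the
    4-hour grant window are hypothetical, for illustration only.
    """
    for grant in jit_grants:
        same_scope = (grant["user"] == elevation["user"]
                      and grant["machine"] == elevation["machine"])
        expiry = grant["start"] + timedelta(hours=max_grant_hours)
        if same_scope and grant["start"] <= elevation["time"] <= expiry:
            return False  # elevation is covered by a JIT grant
    return True

grants = [{"user": "alice", "machine": "web-01",
           "start": datetime(2017, 5, 1, 9, 0)}]

# Covered: alice elevates on web-01 one hour into her grant.
ok = {"user": "alice", "machine": "web-01",
      "time": datetime(2017, 5, 1, 10, 0)}
# Not covered: bob elevates with no grant at all (poor hygiene).
bad = {"user": "bob", "machine": "web-01",
       "time": datetime(2017, 5, 1, 10, 0)}

print(is_hygiene_violation(ok, grants))   # -> False
print(is_hygiene_violation(bad, grants))  # -> True
```

Enforcing such a check turns "Run as Administrator" from an ambiguous signal into a high-precision one: only elevations outside a grant are worth an analyst's attention.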
Once the hygiene issues are resolved and a well-defined security scenario is in place, the stage is set for incorporating domain knowledge.
Figure 2: Sophistication of anomaly detection techniques
2.1 Strategies to Incorporate Domain Knowledge
Domain knowledge is critical when developing security detections, and how it is leveraged goes well beyond simple feature engineering. In this section, we discuss the different strategies that we have successfully employed to utilize domain knowledge in the form of rules. Other ways to incorporate domain knowledge, not discussed in this paper, are feedback on alerts from security analysts and consuming threat models.
2.1.1 Incorporating Rules (end consumer + security experts). Rules are an attractive means to incorporate domain knowledge for the following reasons:
- They are a direct embodiment of domain knowledge. Most organizations have a corpus of firewall rules (e.g., limiting traffic from Remote Desktop Protocol ports), web attack detection rules (e.g., detecting xp_cmdshell in SQL logs is strong evidence of compromise), or even direct embodiments of goodness (like whitelists) and maliciousness (such as blacklists). Security analysts embrace rules because they allow them to easily express their domain knowledge in simple conditionals. If we define rules as atomic first-order logic statements, then we can expand to a wider set:
  - Indicators of Compromise (file hashes, network connections, registry key values, specific user agent strings) that are commonly sourced from commercial vendors;
  - Threat intelligence feeds (domain reputation, IP reputation, file reputation, application reputation);
  - Evidence/telemetry generated by adversary tools, tactics, and procedures that have been observed beforehand.
- Rules have the highest precision. Every time a scoped rule fires, it is malicious by construction.
Figure 3: Rules can be applied as filters after the machine learning system. The machine learning system produces anomalies, and the business heuristics help to winnow the security interesting alerts.
- Rules have the highest recall. Whenever a scoped rule fires, it detects all known instances of maliciousness that are observed for that rule.
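The view of rules as atomic first-order predicates can be made concrete as a small rule set evaluated over event records. All field names and indicator values here are illustrative assumptions, not real intelligence or the paper's production schema:

```python
# Each rule is an atomic predicate over an event dict; matching any
# rule marks the event as known-malicious by construction.
IOC_FILE_HASHES = {"e3b0c44298fc1c14"}  # e.g., vendor-sourced IoCs
BAD_REPUTATION_IPS = {"203.0.113.7"}    # e.g., a threat-intel feed

RULES = [
    ("ioc_file_hash", lambda e: e.get("file_hash") in IOC_FILE_HASHES),
    ("bad_ip_reputation", lambda e: e.get("remote_ip") in BAD_REPUTATION_IPS),
    ("sql_xp_cmdshell", lambda e: "xp_cmdshell" in e.get("sql_text", "").lower()),
]

def matched_rules(event):
    """Names of every rule whose predicate fires on this event."""
    return [name for name, predicate in RULES if predicate(event)]

event = {"remote_ip": "203.0.113.7",
         "sql_text": "EXEC xp_cmdshell 'whoami'"}
print(matched_rules(event))  # -> ['bad_ip_reputation', 'sql_xp_cmdshell']
```

Because each predicate is a simple conditional, analysts can read, audit, and extend the rule corpus directly, which is precisely why rules remain favored despite the maintenance burden discussed next.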
We also acknowledge the biggest disadvantage of rules: care must be taken to maintain the corpus of rules, since stale ones can spike the false positive rate. However, even machine learning models require babysitting and have their own complications. For instance, if we use a model that has been trained on data that no longer reflects the state of the environment, the model can drift and produce unexpected results. Given that rules encode domain knowledge, are readily available, and are favored by security analysts, we present three strategies to incorporate them alongside a machine learning system.
As filters. Rules not only catch known malicious activity, but can also be applied as filters on the output of the machine learning system to sift out the expected activity (see Figure 3). In this architecture, the machine learning system produces anomalies, and the rules/business heuristics help to pick out the security interesting alerts. We used this framework to detect logins from unusual geographic locations. In this scenario, if a user who always logs in from New York attempts to log in from Sydney, then the user must be prompted for multifactor authentication. Our initial implementation of the detection logic had a false positive rate of 28%, and at cloud scale, that translated to 280 million suspicious logins. To improve our false positive rate, we supplemented the system with custom rules to identify company proxi...