(SEC403) Diving into AWS CloudTrail Events with Apache Spark on Amazon EMR
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Will Kruse, AWS IAM Senior Security Engineer
October 2015
SEC403
Timely Security Alerts and Analytics: Diving into AWS CloudTrail Events by Using Apache Spark
on Amazon EMR
What to expect from this session
Why are we here? To learn how to:
• Audit AWS activity across multiple AWS accounts for compliance
and security.
• Analyze AWS CloudTrail events as they arrive (in your Amazon
S3 bucket).
• Build profiles of AWS activity for users, origins, etc.
• Send alerts when an unexpected or interesting event, or series
of events, occurs.
• Use Apache Spark, a cutting-edge big data platform, on AWS for
security and compliance auditing.
Expected technical background
• You are generally familiar with “big data” processing
frameworks (e.g., Hadoop).
• You are familiar with CloudTrail.
• You can read object-oriented code (e.g., Java, Scala, Python, or
Ruby).
• You are comfortable with a command line.
CloudTrail schema
Demo: SQL queries over CloudTrail
Agenda
• SQL queries over CloudTrail logs
• Demo using spark-sql + Hive tables
• Architecture
• Code
• Demo of code using Scala
• Processing CloudTrail logs as they arrive
• Architecture
• Demo
• Code
• Wrap-up
You are here
Our architecture
[Diagram: CloudTrail objects in S3 → Amazon EMR cluster running Apache Spark → security or compliance analyst.]
Recipe for SQL queries over CloudTrail logs
Write a Spark application that:
1. “Discovers” CloudTrail logs by calling CloudTrail.
• Alternatively, put all your CloudTrail logs in one or more buckets
known ahead of time.
2. Creates a list of CloudTrail trails + S3 objects.
3. Loads the data from each S3 object into an RDD.
4. Splits into individual CloudTrail event JSON objects.
5. Loads this RDD into a Spark DataFrame.
6. Registers this DataFrame as a table (for querying).
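The heart of steps 3–5 is unpacking each log file's top-level "Records" array into individual event JSON strings. The session's actual code is Scala on Spark; as a minimal plain-Python sketch of the same transformation (the sample events are hypothetical):

```python
import json

# A CloudTrail log object is a JSON document with a top-level "Records"
# array of event objects (sample events here are hypothetical).
log_file = json.dumps({"Records": [
    {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
    {"eventSource": "s3.amazonaws.com", "eventName": "PutObject"},
]})

def split_events(log_text):
    """Step 4: split one log file into individual event JSON strings."""
    return [json.dumps(event) for event in json.loads(log_text)["Records"]]

events = split_events(log_file)
print(len(events))  # one string per CloudTrail event in the log file
```

On the cluster, the same per-file split runs inside a flatMap over the RDD of log-file strings, so the result is a single RDD of events across all files.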
Introduction to Apache Spark
• Big data processing framework
• Supported languages: Scala, Python, and Java
• Cluster management: Hadoop YARN, Apache Mesos, or
standalone
• Distributed storage: HDFS, Apache Cassandra,
OpenStack Swift, and S3
Why Spark?
• Fast
• Only does the work it needs to do
• Stores final and intermediate results in memory
• Supports batch and streaming processing
• Supports SQL queries, machine learning (ML), graph data
processing, and an R interface.
• Provides 20+ high-level operators that would otherwise be left
as an exercise to the coder
• Compatible with much of your existing Hadoop ecosystem
RDDs = Resilient Distributed Datasets
[Diagram: parallelize turns the list of CloudTrail objects in S3 (Log #1, Log #2, … Log #N) into an RDD of log-file strings, where each string is a JSON array of CloudTrail events. flatMap then splits that RDD into an RDD of the individual CloudTrail events as JSON strings (Event #1, Event #2, … Event #M).]
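The flatMap step maps each log-file string to a list of event strings and flattens the results into one collection. Spark's Scala API does this across the cluster; the semantics can be sketched in plain Python (the log strings below are hypothetical stand-ins):

```python
from itertools import chain

# Hypothetical log-file strings, each holding several events;
# simplified here to comma-separated event names.
logs = ["Event1,Event2", "Event3", "Event4,Event5,Event6"]

def flat_map(f, xs):
    """flatMap: apply f to each element, then flatten one level."""
    return list(chain.from_iterable(f(x) for x in xs))

events = flat_map(lambda log: log.split(","), logs)
print(events)  # one flat list of events, regardless of per-log counts
```

This is why the event RDD has M elements while the log RDD has N: each log file contributes however many events it contains.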
DataFrames = Relational table abstraction
[Diagram: SQLContext.read.json converts the RDD of CloudTrail events (Event #1, Event #2, … Event #M) into a DataFrame with relational columns:]

           Service     API   Timestamp        Source IP  Principal
Event #1   ec2         D…    2015/08/31 1:10  1.2.3.4    AIDA1…
Event #2   s3          P…    2015/08/31 1:11  1.2.3.5    AIDA2…
Event #3   swf         S…    2015/08/31 1:12  1.2.3.6    AROA1…
Event #4   iam         C…    2015/08/31 1:13  1.2.3.7    AROA2…
…          …           …     …                …          …
Event #M   cloudtrail  D…    2015/08/31 2:43  1.2.3.8    AIDA3…
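SQLContext.read.json infers columns like these from the JSON fields of each event. A plain-Python sketch of that projection (the field names are real CloudTrail keys; the sample event is hypothetical):

```python
import json

# Hypothetical CloudTrail event using real CloudTrail field names.
event_json = json.dumps({
    "eventSource": "ec2.amazonaws.com",
    "eventName": "DescribeInstances",
    "eventTime": "2015-08-31T01:10:00Z",
    "sourceIPAddress": "1.2.3.4",
    "userIdentity": {"principalId": "AIDA1EXAMPLE"},
})

def to_row(text):
    """Project one event JSON string onto (service, API, time, IP, principal)."""
    e = json.loads(text)
    return (e["eventSource"], e["eventName"], e["eventTime"],
            e["sourceIPAddress"], e["userIdentity"]["principalId"])

print(to_row(event_json))
```

In Spark the schema inference and projection happen for you; registering the resulting DataFrame as a table is what makes SQL queries over these columns possible.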
Spark cluster components
[Diagram: The master node runs the application driver, which sends tasks (serialized Java/Scala) to executors running on the core nodes; each core node's executors hold the RDD partitions they operate on.]
Recommended CloudTrail configuration
• Turn on CloudTrail logging in all regions.
• Enable S3 bucket logging for all buckets as well.
• Get all your CloudTrail logs for all your accounts in one
bucket (per region).
• Either have CloudTrail deliver them or copy them.
• Disallow deletes from CloudTrail buckets.
Needed AWS IAM permissions
• Getting started recommendation
• Launch an EMR cluster with default roles.
• Attach the CloudTrailReadOnly policy to the
EMR_EC2_DefaultRole.
• Least privilege improvements
• Restrict s3:GetObject and s3:ListBucket to your CloudTrail
buckets.
• Remove EMR's Amazon DynamoDB, Amazon Kinesis, Amazon RDS,
Amazon SimpleDB, Amazon SNS, and Amazon SQS
permissions.
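A least-privilege S3 statement along these lines might look like the following (the bucket name is hypothetical; scope it to wherever your CloudTrail logs actually land):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-cloudtrail-bucket",
        "arn:aws:s3:::my-cloudtrail-bucket/*"
      ]
    }
  ]
}
```

Note that s3:ListBucket applies to the bucket ARN itself while s3:GetObject applies to the object ARNs, which is why both Resource entries are needed.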
Tour through code to query
CloudTrail logs with SQL
Discover CloudTrail data
Transform CloudTrail data
Register CloudTrail data as a table
Demo: Querying CloudTrail
logs with Scala prompt
Agenda
• SQL Queries over CloudTrail logs
• Demo using spark-sql
• Architecture
• Code
• Demo of code using Scala
• Processing CloudTrail logs as they arrive
• Architecture
• Demo
• Code
• Wrap-up
You are here
Analytics as soon as
CloudTrail data arrives in S3
Introduction to Spark Streaming
[Diagram: CloudTrail publishes new-log notifications to an SNS topic; a Spark CloudTrail receiver running in the application's executors store()s the new activity. Each batch interval yields a micro-batch RDD; the previous profile plus the Batch N RDD produces an updated profile, and alerts go to an alert topic. The sequence of micro-batch RDDs (RDD for micro-batch #1, #2, #3, …) forms a discretized stream (DStream).]
Spark Streaming and micro-batches
[Timeline: Events 1–8 arrive over time and are grouped into consecutive 3-second micro-batch intervals.]
Recipe
Write a Spark Streaming application that:
1. Uses a CloudTrail log receiver to learn about new logs
from CloudTrail’s SNS feed.
• Logs are delivered to S3, usually in less than 15 minutes.
2. Store()s each event from CloudTrail logs.
3. Analyzes events in micro-batches.
• Size based on the “batch interval.”
4. Generates alarms on suspicious behavior.
Scenarios we want to know about ASAP
• Connections from unusual geographies
• Connections from anonymizing proxies
• Use of dormant AWS access keys
• Use of dormant AWS principals (users, roles, root)
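One of these checks, dormant-key detection, reduces to comparing each access key's last-seen time against a dormancy threshold and updating the profile as events stream in. A plain-Python sketch (the threshold, key IDs, and day numbers are hypothetical; the real application keeps this state per micro-batch):

```python
DORMANT_DAYS = 90  # hypothetical threshold for "dormant"

# Profile: last day (as an ordinal day number) each access key was seen.
profile = {"AKIA1EXAMPLE": 10, "AKIA2EXAMPLE": 200}

def check_event(profile, key, today):
    """Alert if a key quiet for more than DORMANT_DAYS is used again."""
    last_seen = profile.get(key)
    alert = last_seen is not None and today - last_seen > DORMANT_DAYS
    profile[key] = today  # update the profile with the new activity
    return alert

print(check_event(profile, "AKIA1EXAMPLE", 205))  # dormant key used -> True
print(check_event(profile, "AKIA2EXAMPLE", 205))  # recently active  -> False
```

The same profile-plus-threshold pattern covers dormant principals; the geography and anonymizing-proxy checks swap the last-seen map for a map of expected origins per principal.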
Demo: Streaming analysis of
CloudTrail logs
Creating stream of CloudTrail events
Build profiles and send alerts
Agenda
• SQL Queries over CloudTrail logs
• Demo using spark-sql
• Architecture
• Code
• Demo of code using Scala
• Processing CloudTrail logs as they arrive
• Architecture
• Demo
• Code
• Wrap-up
You are here
How to use these tools
1. Build your threat model.
2. Configure and customize this streaming application.
3. Use Spark-on-EMR for ad hoc log analysis.
4. Use Spark Streaming for regular analysis and alerts.
How I use these tools
1. Keep my engineering teams honest.
2. Identify noncompliant usage.
3. Review actors and their actions in my accounts.
4. Craft least privilege policies by analyzing historical
usage.
Take action
• See who is active in your AWS accounts and when.
• Run queries over your logs in EMR.
• Configure and extend the sample application to meet
your specific needs.
• Find the demo code here:
https://github.com/awslabs/timely-security-analytics
Remember to complete
your evaluations!
Thank you!