(SEC403) Diving into AWS CloudTrail Events with Apache Spark on Amazon EMR
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Will Kruse, AWS IAM Senior Security Engineer
October 2015
SEC403
Timely Security Alerts and Analytics: Diving into AWS CloudTrail Events by Using Apache Spark
on Amazon EMR
What to expect from this session
Why are we here? To learn how to:
• Audit AWS activity across multiple AWS accounts for compliance
and security.
• Analyze AWS CloudTrail events as they arrive (in your Amazon
S3 bucket).
• Build profiles of AWS activity for users, origins, etc.
• Send alerts when an unexpected or interesting event, or series
of events, occurs.
• Use Apache Spark, a cutting-edge big data platform, on AWS for
security and compliance auditing.
Expected technical background
• You are generally familiar with “big data” processing
frameworks (e.g., Hadoop).
• You are familiar with CloudTrail.
• You can read object-oriented code (e.g., Java, Scala, Python, or
Ruby).
• You are comfortable with a command line.
CloudTrail schema
Demo: SQL queries over CloudTrail
Agenda
• SQL queries over CloudTrail logs
• Demo using spark-sql + Hive tables
• Architecture
• Code
• Demo of code using Scala
• Processing CloudTrail logs as they arrive
• Architecture
• Demo
• Code
• Wrap-up
You are here
Our architecture
[Diagram: CloudTrail objects in S3 → Amazon EMR cluster running Apache Spark → security or compliance analyst.]
Recipe for SQL queries over CloudTrail logs
Write a Spark application that:
1. “Discovers” CloudTrail logs by calling CloudTrail.
• Alternatively, put all your CloudTrail logs in one or more buckets
known ahead of time.
2. Creates a list of CloudTrail trails + S3 objects.
3. Loads the data from each S3 object into an RDD.
4. Splits into individual CloudTrail event JSON objects.
5. Loads this RDD into a Spark DataFrame.
6. Registers this DataFrame as a table (for querying).
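The heart of steps 3–5 is unpacking each log file's top-level "Records" array into individual event JSON strings. The session's actual code is Scala on Spark; as a minimal plain-Python sketch of the same transformation (the sample events are hypothetical):

```python
import json

# A CloudTrail log object is a JSON document with a top-level "Records"
# array of event objects (sample events here are hypothetical).
log_file = json.dumps({"Records": [
    {"eventSource": "ec2.amazonaws.com", "eventName": "DescribeInstances"},
    {"eventSource": "s3.amazonaws.com", "eventName": "PutObject"},
]})

def split_events(log_text):
    """Step 4: split one log file into individual event JSON strings."""
    return [json.dumps(event) for event in json.loads(log_text)["Records"]]

events = split_events(log_file)
print(len(events))  # one string per CloudTrail event in the log file
```

On the cluster, the same per-file split runs inside a flatMap over the RDD of log-file strings, so the result is a single RDD of events across all files.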
Introduction to Apache Spark
• Big data processing framework
• Supported languages: Scala, Python, and Java
• Cluster management: Hadoop YARN, Apache Mesos, or
standalone
• Distributed storage: HDFS, Apache Cassandra,
OpenStack Swift, and S3
Why Spark?
• Fast
• Only does the work it needs to do
• Stores final and intermediate results in memory
• Supports batch and streaming processing
• Supports SQL queries, machine learning (ML), graph data
processing, and an R interface.
• Provides 20+ high-level operators that would otherwise be left
as an exercise to the coder
• Compatible with much of your existing Hadoop ecosystem
RDDs = Resilient Distributed Datasets
[Diagram: parallelize turns the list of CloudTrail objects in S3 (Log #1, Log #2, … Log #N) into an RDD of log-file strings, where each string is a JSON array of CloudTrail events. flatMap then splits that RDD into an RDD of the individual CloudTrail events as JSON strings (Event #1, Event #2, … Event #M).]
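The flatMap step maps each log-file string to a list of event strings and flattens the results into one collection. Spark's Scala API does this across the cluster; the semantics can be sketched in plain Python (the log strings below are hypothetical stand-ins):

```python
from itertools import chain

# Hypothetical log-file strings, each holding several events;
# simplified here to comma-separated event names.
logs = ["Event1,Event2", "Event3", "Event4,Event5,Event6"]

def flat_map(f, xs):
    """flatMap: apply f to each element, then flatten one level."""
    return list(chain.from_iterable(f(x) for x in xs))

events = flat_map(lambda log: log.split(","), logs)
print(events)  # one flat list of events, regardless of per-log counts
```

This is why the event RDD has M elements while the log RDD has N: each log file contributes however many events it contains.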
DataFrames = Relational table abstraction
[Diagram: SQLContext.read.json converts the RDD of CloudTrail events (Event #1, Event #2, … Event #M) into a DataFrame with relational columns:]

           Service     API   Timestamp        Source IP  Principal
Event #1   ec2         D…    2015/08/31 1:10  1.2.3.4    AIDA1…
Event #2   s3          P…    2015/08/31 1:11  1.2.3.5    AIDA2…
Event #3   swf         S…    2015/08/31 1:12  1.2.3.6    AROA1…
Event #4   iam         C…    2015/08/31 1:13  1.2.3.7    AROA2…
…          …           …     …                …          …
Event #M   cloudtrail  D…    2015/08/31 2:43  1.2.3.8    AIDA3…
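SQLContext.read.json infers columns like these from the JSON fields of each event. A plain-Python sketch of that projection (the field names are real CloudTrail keys; the sample event is hypothetical):

```python
import json

# Hypothetical CloudTrail event using real CloudTrail field names.
event_json = json.dumps({
    "eventSource": "ec2.amazonaws.com",
    "eventName": "DescribeInstances",
    "eventTime": "2015-08-31T01:10:00Z",
    "sourceIPAddress": "1.2.3.4",
    "userIdentity": {"principalId": "AIDA1EXAMPLE"},
})

def to_row(text):
    """Project one event JSON string onto (service, API, time, IP, principal)."""
    e = json.loads(text)
    return (e["eventSource"], e["eventName"], e["eventTime"],
            e["sourceIPAddress"], e["userIdentity"]["principalId"])

print(to_row(event_json))
```

In Spark the schema inference and projection happen for you; registering the resulting DataFrame as a table is what makes SQL queries over these columns possible.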
Spark cluster components
[Diagram: The master node runs the application driver, which sends tasks (serialized Java/Scala) to executors running on the core nodes; each core node's executors hold the RDD partitions they operate on.]
Recommended CloudTrail configuration
• Turn on CloudTrail logging in all regions.
• Enable S3 bucket logging for all buckets as well.
• Get all your CloudTrail logs for all your accounts in one
bucket (per region).
• Either have CloudTrail deliver them or copy them.
• Disallow deletes from CloudTrail buckets.
Needed AWS IAM permissions
• Getting started recommendation
• Launch an EMR cluster with default roles.
• Attach the CloudTrailReadOnly policy to the
EMR_EC2_DefaultRole.
• Least privilege improvements
• Restrict s3:GetObject and s3:ListBucket to your CloudTrail
buckets.
• Remove EMR's Amazon DynamoDB, Amazon Kinesis, Amazon RDS,
Amazon SimpleDB, Amazon SNS, and Amazon SQS
permissions.
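A least-privilege S3 statement along these lines might look like the following (the bucket name is hypothetical; scope it to wherever your CloudTrail logs actually land):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::my-cloudtrail-bucket",
        "arn:aws:s3:::my-cloudtrail-bucket/*"
      ]
    }
  ]
}
```

Note that s3:ListBucket applies to the bucket ARN itself while s3:GetObject applies to the object ARNs, which is why both Resource entries are needed.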
Tour through code to query
CloudTrail logs with SQL
Discover CloudTrail data
Transform CloudTrail data
Register CloudTrail data as a table
Demo: Querying CloudTrail
logs with Scala prompt
Agenda
• SQL Queries over CloudTrail logs
• Demo using spark-sql
• Architecture
• Code
• Demo of code using Scala
• Processing CloudTrail logs as they arrive
• Architecture
• Demo
• Code
• Wrap-up
You are here
Analytics as soon as
CloudTrail data arrives in S3
Introduction to Spark Streaming
[Diagram: CloudTrail publishes new-log notifications to an SNS topic; a Spark CloudTrail receiver running in the application's executors store()s the new activity. Each batch interval yields a micro-batch RDD; the previous profile plus the Batch N RDD produces an updated profile, and alerts go to an alert topic. The sequence of micro-batch RDDs (RDD for micro-batch #1, #2, #3, …) forms a discretized stream (DStream).]
Spark Streaming and micro-batches
[Timeline: Events 1–8 arrive over time and are grouped into consecutive 3-second micro-batch intervals.]
Recipe
Write a Spark Streaming application that:
1. Uses a CloudTrail log receiver to learn about new logs
from CloudTrail’s SNS feed.
• Logs are delivered to S3, usually in less than 15 minutes.
2. Store()s each event from CloudTrail logs.
3. Analyzes events in micro-batches.
• Size based on the “batch interval.”
4. Generates alarms on suspicious behavior.
Scenarios we want to know about ASAP
• Connections from unusual geographies
• Connections from anonymizing proxies
• Use of dormant AWS access keys
• Use of dormant AWS principals (users, roles, root)
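One of these checks, dormant-key detection, reduces to comparing each access key's last-seen time against a dormancy threshold and updating the profile as events stream in. A plain-Python sketch (the threshold, key IDs, and day numbers are hypothetical; the real application keeps this state per micro-batch):

```python
DORMANT_DAYS = 90  # hypothetical threshold for "dormant"

# Profile: last day (as an ordinal day number) each access key was seen.
profile = {"AKIA1EXAMPLE": 10, "AKIA2EXAMPLE": 200}

def check_event(profile, key, today):
    """Alert if a key quiet for more than DORMANT_DAYS is used again."""
    last_seen = profile.get(key)
    alert = last_seen is not None and today - last_seen > DORMANT_DAYS
    profile[key] = today  # update the profile with the new activity
    return alert

print(check_event(profile, "AKIA1EXAMPLE", 205))  # dormant key used -> True
print(check_event(profile, "AKIA2EXAMPLE", 205))  # recently active  -> False
```

The same profile-plus-threshold pattern covers dormant principals; the geography and anonymizing-proxy checks swap the last-seen map for a map of expected origins per principal.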
Demo: Streaming analysis of
CloudTrail logs
Creating stream of CloudTrail events
Build profiles and send alerts
Agenda
• SQL Queries over CloudTrail logs
• Demo using spark-sql
• Architecture
• Code
• Demo of code using Scala
• Processing CloudTrail logs as they arrive
• Architecture
• Demo
• Code
• Wrap-up
You are here
How to use these tools
1. Build your threat model.
2. Configure and customize this streaming application.
3. Use Spark-on-EMR for ad hoc log analysis.
4. Use Spark Streaming for regular analysis and alerts.
How I use these tools
1. Keep my engineering teams honest.
2. Identify noncompliant usage.
3. Review actors and their actions in my accounts.
4. Craft least privilege policies by analyzing historical
usage.
Take action
• See who is active in your AWS accounts and when.
• Run queries over your logs in EMR.
• Configure and extend the sample application to meet
your specific needs.
• Find the demo code here:
https://github.com/awslabs/timely-security-analytics
Remember to complete
your evaluations!
Thank you!