(SEC313) Security & Compliance at the Petabyte Scale
TRANSCRIPT
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Security and Compliance at the Petabyte Scale
Lessons from the National Cancer Institute’s Cancer Genomics Cloud Pilot
Igor Bogicevic, CTO
Angel Pizarro, AWS Scientific Computing
October 2015
What to expect from this session
• Background: Unique challenges for securing genomics
information
• Case study: Democratizing access to The Cancer
Genome Atlas (TCGA) through the Seven Bridges
Cancer Genomics Cloud
• Deep dives: How we’ve leveraged AWS to support
secure and compliant genomics research
Why is securing genomics
information hard?
i) Genomics data is big…and getting bigger
NGS: Next Generation Sequencing
NGS sequencers include machines from Illumina, Life Technologies, and Pacific Biosciences. Human genome data based on estimates of whole human genomes sequenced
Sources: Financial reports of Illumina, Life Technologies, Pacific Biosciences; revenue guidances; JP Morgan; The Economist; Seven Bridges Analysis.
Between 2014–2018, production of new NGS data to exceed 2 exabytes
[Chart: number of sequencers and volume of genomic data (TB) over time]
ii) Genomes are inherently sensitive
Very personal (including your relatives…)
Can’t fully anonymize information
Can’t take it back once it’s out there
iii) Research is highly collaborative and
diverse
It occurs in large teams... ...with numerous analytical tools
The Challenge
Enable thousands of researchers
using hundreds of (custom) tools
to analyze petabytes of highly sensitive data
in a secure and compliant environment
Case study:
Bringing the Cancer Genome
Atlas (TCGA) to the Cloud
This project has been funded in whole or in part with Federal funds from the
National Cancer Institute, National Institutes of Health, Department of Health
and Human Services, under Contract No. HHSN261201400008C.
TCGA is one of the richest and most complete
genomics data sets in the world
34 tumor types
from thousands
of patients…
…analyzed across
multiple
dimensions…
…by researchers
across the US…
…at a cost of
$375 million.
1.5+ petabytes, growing to 3.5 petabytes in the next year
But learning from this data is challenging
The Cancer Genomics Cloud Pilots seek to
directly address these difficulties
• Initiated by Dr. Harold Varmus in 2013
• BAA issued in January 2014
• 3 pilots awarded September 2014
o Broad Institute
o Institute for Systems Biology
o Seven Bridges Genomics
Early access: November 2015
Open release: January 2016
www.CancerGenomicsCloud.org
Our approach to democratizing
access to TCGA data
The components of democratized access –
Data
● Immediately and securely access
petabytes of open-access and
controlled-access cancer genomics
data.
● Analyze data from your private
cohorts alongside public data.
● Data access governed by the NIH
Genomic Data Sharing Policy.
● As an NIH trusted partner, Seven
Bridges is able to authorize approved
researchers.
● First controlled access genomic
dataset on AWS.
● Coming soon: http://aws.amazon.com/public-data-sets/tcga/
The components of democratized access –
Reproducibility
[Figure: example tool versions 1.1.2, 2.0a, 2.3Lite]
● Execute workflows from primary
analysis through visualization.
● Each result is always associated with
a complete snapshot of the tool
versions, parameters, and input files.
The components of democratized access –
Open standards
● Native execution of Docker-based Common
Workflow Language (CWL) pipelines allows
portability and sharing of custom tools.
● APIs support workflow automation and
enhance interoperability.
...implemented through our genomics platform
How we’ve leveraged AWS to
support secure and compliant
genomics research
Security and compliance―connected, but separate.
Security
• Network and data security overview
• Parallel file access at scale
• Enabling secure computation using researcher-
contributed tools
• Enabling secure user access and collaboration
Simplified system architecture
[Diagram: encrypted Amazon S3 buckets; two virtual private clouds within AWS (development environment and production environment), each with dynamic worker instances and an infrastructure server; the Seven Bridges website as the secure access point where users access the platform and download data; Seven Bridges offices connected over IPsec VPN (AWS IPsec); remote workforce connected through an OpenVPN gateway; arrows showing data flow.]
Securing the network
• Extensive use of virtual private clouds (VPCs)
• Separate dev and production environments
• Built-in IPsec allows easy network integration
• OpenVPN secures remote user access
• Each instance and VPC is individually firewalled
Securing data
• At-rest encryption
• Amazon S3 SSE, SSE-KMS
• Amazon EBS encryption
• Ephemeral storage
• In transit
• TLS used exclusively for data moving to and from Amazon S3
• From other users
• AWS IAM controls access to other users’ buckets
Controls to support secure data
• Atomic data access
• Data locality
• Dedicated tenancy on
computation instances
• Using only encrypted storage
• Strict data purging
Amazon S3 Amazon EBS Amazon EC2
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "112",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::examplebucket/*",
      "Condition": {
        "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
      }
    }
  ]
}
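The bucket policy above denies any s3:PutObject request that does not declare AES256 server-side encryption. The sketch below is a simplified model of how its StringNotEquals condition behaves, not AWS's policy evaluation engine. The function name put_object_denied is hypothetical and used only for illustration.

```python
# Simplified model of the Deny statement above; not AWS's real evaluator.
SSE_HEADER = "x-amz-server-side-encryption"
REQUIRED_VALUE = "AES256"


def put_object_denied(request_headers):
    """Return True if the Deny statement would match this PutObject request.

    A negated operator such as StringNotEquals also matches when the key is
    absent, so an upload that omits the encryption header is denied as well.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in request_headers.items()}
    return headers.get(SSE_HEADER) != REQUIRED_VALUE


if __name__ == "__main__":
    print(put_object_denied({"x-amz-server-side-encryption": "AES256"}))
    print(put_object_denied({}))
```

Note that under this policy an upload declaring any other value, including SSE-KMS ("aws:kms"), is denied too, so a bucket that should accept both modes would need a different condition.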
dm-crypt
Parallel file access at scale
The Challenge:
Many bioinformatics tasks require sharing of
intermediary results between multiple instances.
Parallel file access at scale – NFS
Observed network
saturation at ~8 NFS clients.
Hypothesis
• Amazon S3 would remove single NFS server bandwidth
bottleneck.
• Presenting user’s S3 objects as a local filesystem could provide
an elegant abstraction that any application could use.
• Cumulative S3 read/write speed should scale mostly linearly
with number of workers.
• Total read/write speed on shared S3 objects should significantly
exceed NFS server solution speed on >10 workers.
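The hypothesis can be illustrated with a toy throughput model. Every number below is an assumption chosen for illustration (picked so that the single NFS server saturates at 8 clients, in line with the observation above); none of it is measured data, and real S3 scaling is only expected to be "mostly" linear.

```python
# Toy model of the hypothesis; all bandwidth figures are illustrative assumptions.
NFS_SERVER_MBPS = 1000.0   # assumed total bandwidth of the single NFS server
PER_WORKER_MBPS = 125.0    # assumed bandwidth one worker can drive by itself


def nfs_total(workers):
    """All clients share one server, so the total is capped by that server."""
    return min(workers * PER_WORKER_MBPS, NFS_SERVER_MBPS)


def s3_total(workers):
    """Each worker talks to S3 directly, so the total grows with worker count."""
    return workers * PER_WORKER_MBPS


if __name__ == "__main__":
    for n in (1, 8, 16, 100):
        print(n, nfs_total(n), s3_total(n))
```

In this model the two curves are identical up to 8 workers and diverge afterwards, which is the regime (more than 10 workers) where the hypothesis predicts a clear advantage for S3.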
Parallel access at scale – SBG-FS/Amazon S3
Amazon S3
SBG-FS single worker performance
[Chart: single-worker throughput (MB/s) versus number of compute instances (50 to 300) for three series: 1st read (SBG-FS Prefetch), Write (SBG-FS Upload), 2nd read (SBG-FS Cache). Data labels: 90, 215, 894.]
SBG-FS cumulative worker performance
[Chart: cumulative throughput (GB/s) versus number of compute instances (50 to 300) for three series: 1st read (SBG-FS Prefetch), Write (SBG-FS Upload), 2nd read (SBG-FS Cache).]
SBG-FS auditing capabilities
Amazon S3
Enabling secure computation using
researcher-contributed tools
The Challenge:
10,000+ bioinformatics tools
50+ tools used in a single TCGA marker paper
Our Approach:
Common Workflow Language (CWL) wrapper
Seven Bridges Platform
Benefits of using Docker to deploy user-
contributed tools
• Enables solid resource
isolation at the container
level
• Simplifies deploying and
managing tools at scale
Security risks posed by use of Docker
• Docker daemon runs under
root privileges
• User can intentionally or
unintentionally add malicious
apps
• If resource management is not
configured properly, apps can do
damage outside their containers
Enabling secure use of Docker containers
● Know your private vs. public
resources
● Isolate network resources for
each container (firewalling)
• Be careful with linking
containers
• Aggregate logs (forensics)
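As one possible illustration of these precautions, the sketch below builds (but does not run) a docker run command with restrictive defaults. It is an assumption about what a hardened invocation could look like, not the Seven Bridges configuration. The image name, UID/GID, and resource limits are hypothetical, and --network none is a stricter stand-in for the per-container firewalling mentioned above.

```python
import shlex


def hardened_docker_command(image, tool_args, memory="4g", cpus="2"):
    """Build (but do not run) a `docker run` command with restrictive defaults."""
    command = [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access for the tool
        "--cap-drop", "ALL",                    # drop all Linux capabilities
        "--security-opt", "no-new-privileges",  # block privilege escalation
        "--read-only",                          # read-only root filesystem
        "--user", "1000:1000",                  # non-root user (assumed UID/GID)
        "--memory", memory,                     # cap memory use
        "--cpus", cpus,                         # cap CPU use
        "--pids-limit", "256",                  # cap the number of processes
        image,
    ]
    return command + list(tool_args)


if __name__ == "__main__":
    # "example/aligner:1.0" is a made-up image name.
    cmd = hardened_docker_command("example/aligner:1.0", ["--help"])
    print(shlex.join(cmd))
```

Returning the argument list rather than a shell string avoids quoting problems when tool arguments come from users.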
Enabling secure access
● Organizations have diverse
models of internal structure
and responsibilities
• Roles and authentication
models are very diverse
• Federated authentication
and SSO
Supporting federated login for controlled data
access
[Diagram labels: SAML, Verify, Approved Researchers, cron x 24hr, Metadata service, ELK stack, Error Message]
Enabling collaboration
• SBG Platform provides isolation
of resources at project level
• Users can share projects and
control access through roles
• The basic role provides read access
only; write/copy privileges are
separate from execution privileges
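A project-level role model of this kind can be sketched as follows. Only the idea that the basic role is read-only and that write/copy is separate from execution comes from the slide; the role names "contributor" and "runner" and the function can are hypothetical.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()
    COPY = auto()
    WRITE = auto()
    EXECUTE = auto()


# Hypothetical role names; only "basic role = read only" comes from the slide.
ROLES = {
    "basic": Permission.READ,
    "contributor": Permission.READ | Permission.COPY | Permission.WRITE,
    "runner": Permission.READ | Permission.EXECUTE,
}


def can(project_members, user, permission):
    """Check one permission for one user within a single project."""
    role = project_members.get(user)
    if role is None:
        return False  # users outside the project have no access at all
    return permission in ROLES[role]


if __name__ == "__main__":
    members = {"alice": "basic", "bob": "runner"}
    print(can(members, "alice", Permission.WRITE))
    print(can(members, "bob", Permission.EXECUTE))
```

Because membership is stored per project, the same user can hold different roles in different projects.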
One billing group per project
Multiple users and roles per project
Users participate in projects and can provide funding
Project-specific user roles
Multiple users per project
Clear funding/payment responsibility
Overall system security is enabled by
monitoring and testing
• Penetration testing
• Patch management
• Software and infrastructure vulnerability assessments
• Monitoring of platform performance and availability
• Pandora FMS/OSSEC/Sysdig
• Auditing and logs at a project and platform level
• Logs aggregated and available for inspection with ELK
stack
Putting it all together
1. User logs on to the platform
2. Platform creates a unique signed URL for the user
3. Using signed URL, data is uploaded to an encrypted Amazon S3 bucket
4. After the user starts a computation, the Seven Bridges Platform calculates the optimal execution plan and starts dedicated task worker instances
5. Worker instances securely pull data from Amazon S3
6. Worker instances are able to securely share intermediate data
7. Final results are uploaded to Amazon S3
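Steps 2 and 3 rely on a signed, time-limited URL. The sketch below shows the general idea with a plain HMAC signature. It is not the AWS Signature Version 4 algorithm that S3 presigned URLs actually use, and the secret, host name, and function names are all hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret-key"  # hypothetical; a real key is never hard-coded


def _signature(path, user, expires):
    """HMAC-SHA256 over the fields that must not be tampered with."""
    message = f"{path}\n{user}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_upload_url(base_url, user, ttl_seconds=900, now=None):
    """Return a URL that is valid for one user until the expiry time."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    path = urlparse(base_url).path
    query = urlencode({"user": user, "expires": expires,
                       "signature": _signature(path, user, expires)})
    return f"{base_url}?{query}"


def verify_upload_url(url, now=None):
    """Check the signature and the expiry of a URL built by sign_upload_url."""
    now = time.time() if now is None else now
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        user = params["user"][0]
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _signature(parsed.path, user, expires)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(signature.encode(), expected.encode()) and now <= expires
```

The URL carries everything the storage side needs to authorize one upload, so the user never handles long-lived credentials.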
[Diagram: the user, the Seven Bridges Platform, the encrypted Amazon S3 bucket, and the computation environment of Amazon EC2 instances, annotated with steps 1 to 7; step 6 marks data sharing between instances.]
Lessons learned from petabyte-scale security
• Isolate resources as much as possible
• Encrypt everything―it will make your life easier
• Understand the scale of the data
• Measure everything
• Leverage the infrastructure
Compliance
When we talk about compliance, we talk about
Building trust Shared language
Compliance frameworks
dbGaP: Protects against risk associated with release of genomes of individuals consenting to participate in research studies.
HIPAA: Protects against risk associated with release of Protected Health Information (PHI).
ISO 27001: Provides a framework for general security management of assets across the organization and is a general specification for an information security management system (ISMS).
Shared responsibility == compliance coordination
Stacked responsibility:
AWS: Facilities, Infrastructure, Virtualization, API and Service Endpoints
Seven Bridges Genomics: Data Security, Data Provenance, Application Monitoring, OS, Network, etc.
Researcher: Users | Groups | Projects | Applications
Auditor
Shared responsibility across frameworks
dbGaP
HIPAA
ISO 27001
Researcher | AWS | Seven Bridges
Securely integrating with platforms
Security and compliance in practice
Stacked responsibility:
Researcher: Users | Groups | Projects | Applications
Seven Bridges Genomics: Data Security, Data Provenance, Application Monitoring, OS, Network, etc.
Amazon Web Services: Facilities, Infrastructure, Virtualization, API and Service Endpoints
Horizontal responsibility: Researcher | Seven Bridges Genomics | Amazon Web Services
Use case: Analyze Personal Genome Project data
http://personalgenomes.org
VPC subnet
Dedicated instance
1000 Genomes
Strategies to follow
• Rely on the platform as much as possible
• Follow security best practices outlined in the AWS
documentation
• Have a checklist!
Compliance checklist
AWS security
VPC, security groups, encrypted storage
Protect AWS credentials
Protect platform credentials
SOPs for OS and application updates
Audit and logging of the activities outside of platform
Data provenance and lifecycle
AWS architecture
IAM instance role
VPC subnet
Security group
Virtual private cloud
• Access platforms via
Internet or VPC peering
• DevOps for instance and
application management
• Protect credentials with
AWS IAM and AWS KMS
Secure bootstrapping with instance UserData
AWS Command Line Interface
Secure and format local storage
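One way to secure and format local storage at boot is to pass a shell script as instance UserData. The sketch below only generates such a script as a string. The device name /dev/xvdb, the mapping name scratch, the mount point, and the cipher settings are assumptions, and the script itself would need cryptsetup installed and root privileges on the instance.

```python
def build_user_data(device="/dev/xvdb", mount_point="/scratch"):
    """Return a shell script that encrypts and formats local instance storage."""
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "# Encrypt the local volume with a random one-time key (dm-crypt, plain mode).",
        "# The key exists only in memory, so the data is unreadable after shutdown.",
        "cryptsetup open --type plain --cipher aes-xts-plain64 --key-size 256 "
        f"--key-file /dev/urandom {device} scratch",
        "# Create a filesystem on the encrypted mapping and mount it.",
        "mkfs.ext4 -q /dev/mapper/scratch",
        f"mkdir -p {mount_point}",
        f"mount /dev/mapper/scratch {mount_point}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_user_data())
```

Because the key is random and never stored, nothing written to the volume can be recovered once the instance stops, which suits scratch space for sensitive intermediate data.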
Compliance checklist
AWS security
VPC, security groups, encrypted storage
Protect AWS credentials
Protect platform credentials
SOPs for OS and application updates
❑ Audit and logging of the activities outside of platform
❑ Data provenance and lifecycle
Thank you!
Remember to complete
your evaluations!