Bio-IT Asia 2013: Informatics & Cloud - Best Practices & Lessons Learned
DESCRIPTION
Slides from a talk @ Bio-IT World Asia
TRANSCRIPT
Life Science Informatics & The Cloud: Best Practices & Lessons Learned
Tuesday, May 28, 13
I’m Chris.
I’m an infrastructure geek.
I work for the BioTeam.
Twitter: @chris_dag
BioTeam: who, what & why
‣ Independent consulting shop
‣ Staffed by scientists forced to learn IT, SW & HPC to get our own research done
‣ 12+ years bridging the “gap” between science, IT & high performance computing
‣ www.bioteam.net
Listen to me at your own risk. Seriously.
‣ Clever people find multiple solutions to common issues
‣ I’m fairly blunt, burnt-out and cynical in my advanced age
‣ Significant portion of my work has been done in demanding production Biotech & Pharma environments
‣ Filter my words accordingly
Other 2013 Presentations: Bio-IT World Boston
Bio-IT World Boston: “Multi-Tenant Research Clusters”
http://slideshare.net/chrisdag/
Bio-IT World Boston: “HPC Trends from the trenches.”
http://slideshare.net/chrisdag/
1. Intro
2. Meta: Why Cloud?
3. What the sales & marketing folks won’t tell you
4. Getting Practical
5. HPC Case Study
The big picture: why we need IaaS clouds ...
Why life science needs infrastructure clouds
Big Picture
‣ HUGE revolution in the rate at which lab platforms are being redesigned, improved & refreshed
• Example: CCD sensor upgrade on that confocal microscopy rig just doubled your storage requirements
• Example: That 2D ultrasound imager is now a 3D imager
• Example: Illumina HiSeq upgrade just doubled the rate at which you can acquire genomes. Massive downstream increase in storage, compute & data movement needs
The Central Problem Is ...
‣ Instrumentation & protocols are changing FAR FASTER than we can refresh our Research-IT & Scientific Computing infrastructure
• The science is changing month-to-month ...
• ... while our IT infrastructure only gets refreshed every 2-7 years
‣ We have to design systems TODAY that can support unknown research requirements & workflows over many years (gulp ...)
The Central Problem Is ...
‣ The easy period is over
‣ 5 years ago you could toss inexpensive storage and servers at the problem; even in a nearby closet or under a lab bench if necessary
‣ That does not work any more; real solutions required
And a related problem ...
‣ It has never been easier or cheaper to acquire vast amounts of data
‣ Growth rate of data creation/ingest exceeds rate at which the storage industry is improving disk capacity
‣ Not just a storage lifecycle problem. This data *moves* and often needs to be shared among multiple entities and providers
• ... ideally without punching holes in your firewall or consuming all available internet bandwidth
If you get it wrong ...
‣ Lost opportunity
‣ Missing capability
‣ Frustrated & very vocal scientific staff
‣ Problems in recruiting, retention, publication & product development
IaaS to the Rescue
IaaS solves the current critical “Research IT” dilemma
Why Cloud?
‣ IaaS clouds let us react and respond to scientific requirements that change far faster than we can refresh local datacenters and enterprise IT platforms
Image: shanelin via Flickr
Beyond capability and agility gains ...
Why Cloud?
‣ The economic benefits are real, inescapable and trending in the proper direction
‣ Internet-scale providers with millions of cores and exabytes of spinning disk spanning the globe leverage operational efficiencies you will never come close to matching internally
‣ ... be suspicious of people who claim otherwise
Also ...
Why Cloud?
‣ Clouds becoming a natural place for data exchange & access
‣ “scriptable everything” enables entirely new capabilities not possible internally*
‣ Finance people love converting CapEx to OpEx
1. Intro
2. Meta: Why Cloud?
3. What the sales & marketing folks won’t tell you
4. Getting Practical
5. HPC Case Study
What the salesfolk won’t tell you ...
‣ There is no one-size-fits-all research design pattern ...
‣ You are not going to toss everything and replace it with “Big Data”
‣ Very few of us have a single pipeline or workflow that we can devote endless engineering effort to
‣ We are not going to toss out hundreds of legacy codes and rewrite everything for GPUs or MapReduce
‣ For research HPC it’s all about the building blocks { and how we can effectively use/deploy them }
What the salesfolk won’t tell you
‣ Your organization actually needs THREE tested cloud design patterns:
‣ (1) To handle ‘legacy’ scientific apps & workflows
‣ (2) The special stuff that is worth re-architecting
‣ (3) Hadoop & big data analytics
Legacy HPC on the Cloud
Design Pattern #1 - Legacy
‣ There are many hundreds of existing algorithms and applications in the life science informatics space
‣ We’ll be running/using these codes for years to come
‣ Many can’t or will never be refactored or rewritten
‣ I call this the “legacy” design pattern
One Easy Solution.
StarCluster
Design Pattern #1 - Legacy
‣ MIT StarCluster
• http://web.mit.edu/star/cluster/
‣ Infinite awesomeness. Worth a talk by itself.
‣ This is your baseline
‣ Extend as needed
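To give a flavor of why StarCluster works as a baseline for the legacy pattern: a traditional SGE cluster on EC2 reduces to one INI file and a couple of commands. This is a sketch from memory, with illustrative placeholder values; check the StarCluster documentation for the authoritative section and key names.

```ini
; Minimal ~/.starcluster/config sketch (illustrative values only)
[aws info]
aws_access_key_id = <your key id>
aws_secret_access_key = <your secret key>
aws_user_id = <your account id>

[key mykey]
key_location = ~/.ssh/mykey.rsa

[cluster smallcluster]
keyname = mykey
cluster_size = 4
cluster_user = sgeadmin
node_image_id = ami-XXXXXXXX   ; any StarCluster-compatible AMI
node_instance_type = m1.large
```

With that in place, `starcluster start smallcluster` boots a working SGE cluster, `starcluster sshmaster smallcluster` drops you on the master node, and `starcluster terminate smallcluster` tears everything down.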
Design Pattern #2 - “Cloudy”
‣ Some of our research workflows are important enough to be rewritten for “the cloud” and the advantages that a truly elastic & API-driven infrastructure can deliver
‣ This is where you have the most freedom
‣ Many published best practices you can borrow
‣ Warning: cloud vendor lock-in potential is strongest here
Design Pattern #3 - Hadoop/BigData
‣ Hadoop and “big data” need to be on your radar
‣ Be careful though; you’ll need a gas mask to avoid the smog of marketing and vapid hype
‣ The utility is real and this does represent one “future path” for analysis of large data sets
Design Pattern #3 - Hadoop/BigData
‣ It’s going to be a MapReduce world; get used to it
‣ Little need to roll your own Hadoop in 2013
‣ ISV & commercial ecosystem already healthy
‣ Multiple providers today, both onsite & cloud-based
‣ Often a slam-dunk cloud use case
What you need to know
Design Pattern #3 - Hadoop/BigData
‣ “Hadoop” and “Big Data” are now general terms
‣ You need to drill down to find out what people actually mean
‣ We are still in the period where senior leadership may demand “Hadoop” or “Big Data” capability without any actual business or scientific need
What you need to know
Hadoop & “Big Data”
‣ In broad terms you can break “Big Data” down into two very basic use cases:
1. Compute: Hadoop can be used as a very powerful platform for the analysis of very large data sets. The Google search term here is “map reduce”
2. Data Stores: Hadoop is driving the development of very sophisticated “NoSQL” / non-relational databases and data query engines. The Google search terms include “nosql”, “couchdb”, “hive”, “pig”, “mongodb”, etc.
‣ Your job is to figure out which type applies for the groups requesting “Hadoop” or “BigData” capability
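To make the “Compute” use case concrete, the map/shuffle/reduce pattern can be sketched in a few lines of plain Python, using the canonical word-count example. This is purely illustrative: a real Hadoop job would ship the mapper and reducer out to the data nodes (via Hadoop Streaming or a framework such as mrjob) rather than run in a single process.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum the counts emitted for a single word
    return (word, sum(counts))

def mapreduce(lines):
    # "Shuffle" phase: collect and sort all intermediate pairs by key
    pairs = sorted(kv for line in lines for kv in mapper(line))
    # One reducer call per distinct key
    return dict(reducer(word, (count for _, count in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

print(mapreduce(["the cat sat", "the cat ran"]))
# {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

Hadoop’s value is that exactly this decomposition lets the framework parallelize the map and reduce steps across thousands of nodes, close to the data.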
Hadoop vs traditional Linux Clusters
High Throughput Science
‣ Hadoop is a very complex beast
‣ It’s also the way of the future, so you can’t ignore it
‣ Very tight dependency on moving the ‘compute’ as close as possible to the ‘data’
‣ Hadoop clusters are just different enough that they do not integrate cleanly with traditional Linux HPC systems
‣ Often treated as a separate silo or punted to the cloud
What you need to know
Hadoop & “Big Data”
‣ Hadoop in life science is being driven by a small group of academics writing and releasing open source informatics applications
‣ Your people will want to run these codes
‣ In some academic environments you may find people wanting to develop on this platform
1. Intro
2. Meta: Why Cloud?
3. What the sales & marketing folks won’t tell you
4. Getting Practical
5. HPC Case Study
Strategy
Practical Advice
‣ Research-oriented IT organizations need a cloud strategy today, or they risk being bypassed by their own employees
Design Patterns
Practical Advice
‣ Remember the three design patterns on the cloud:
• Legacy HPC systems (replicate traditional clusters in the cloud)
• Hadoop
• Cloudy (when you rewrite something to fully leverage cloud capability)
Policies and Procedures
Practical Advice
‣ Cloud technology bits are easy. Cloud Process and Policy discussions take forever
‣ Start these conversations sooner rather than later!
Core services that take time and advance planning
Practical Advice
‣ A few key foundational cloud services take time and advance planning to deploy properly:
‣ VPNs & subnet schemes
‣ Identity Management & Access Control
‣ Data Movement
Data Movement
Practical Advice
‣ A few words & pictures on data movement ...
Physical data movement station 1
Physical data movement station 2
“Naked” Data Movement
“Naked” Data Archive
Cloud Data Movement
‣ Things changed pretty definitively in 2012
‣ And the next image shows why ...
March 2012
Cloud Data Movement: Network vs. Physical
‣ With a 1GbE internet connection ...
‣ ... and using Aspera software ...
‣ We sustained 700 Mb/sec for more than 7 hours freighting genomes into Amazon Web Services
‣ This is fast enough for many use cases, including genome sequencing core facilities
‣ Chris Dwan’s webinar on this topic: http://biote.am/7e
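The arithmetic behind that claim is worth sanity-checking. Note the rate only makes physical sense in megabits per second, since a 1GbE link tops out near 125 megabytes/sec; the back-of-envelope below is my calculation, not a figure from the slides.

```python
RATE_MBIT_S = 700   # sustained transfer rate from the slide, in megabits/sec
HOURS = 7           # duration of the sustained transfer

bytes_per_sec = RATE_MBIT_S * 1e6 / 8           # megabits -> bytes: 87.5 MB/sec
total_tb = bytes_per_sec * HOURS * 3600 / 1e12  # total volume in terabytes

print(f"{bytes_per_sec / 1e6:.1f} MB/sec sustained")  # 87.5 MB/sec sustained
print(f"{total_tb:.1f} TB moved in {HOURS} hours")    # 2.2 TB moved in 7 hours
```

Roughly 2 TB in a working day over a single commodity link is what moved the needle toward network transfer for sequencing-scale data.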
Cloud Data Movement: Network vs. Physical
‣ Results like this mean we now favor network-based data movement over physical media movement
‣ Large-scale physical data movement carries a high operational burden and consumes non-trivial staff time & resources
Cloud Data Movement: three ways to do network data movement ...
‣ Buy software from Aspera and be done with it
‣ Attend the annual SuperComputing conference, see which student group wins the bandwidth challenge contest, and use their code
‣ Get GridFTP from the Globus folks
SysAdmin vs Programmer
Practical Advice
‣ Recognize the blurring line between IT / Informatics / SW Engineer
‣ ... and how it may mix up your org chart
Very blurry lines in 2013 for all of these roles
Scientist/SysAdmin/Programmer
‣ Radical change in the last ~2 years in how IT is provisioned, delivered, managed & supported
‣ Root cause (technology): virtualization & cloud
‣ Root cause (operations): configuration mgmt, systems orchestration & infrastructure automation
‣ SysAdmins & IT staff need to re-skill and retrain to stay relevant
Very blurry lines in 2013 for all of these roles
Scientist/SysAdmin/Programmer
‣ When everything has an API ...
‣ ... anything can be ‘orchestrated’ or ‘automated’ remotely
‣ And by the way ...
‣ The APIs (‘knobs & buttons’) are accessible to all
Very blurry lines in 2013 for all of these roles
Scientist/SysAdmin/Programmer
‣ IT jobs, roles and responsibilities are undergoing rapid upheaval
‣ SysAdmins must learn to program in order to harness automation tools
‣ Programmers & Scientists can now self-provision and control sophisticated IT resources
Very blurry lines in 2013 for all of these roles
Scientist/SysAdmin/Programmer
‣ My take on the future ...
‣ Far more control is going into the hands of the research end user
‣ IT support roles will radically change -- no longer owners or gatekeepers
‣ IT will handle policies, procedures, reference patterns, security & best practices
‣ Researchers will control the “what”, “when” and “how big”
Cloud HPC Case Study (time permitting ...)
Next Generation Nuclear Magnetic Resonance
NMR Probehead Simulation on AWS
‣ CAE simulation project
‣ via www.hpcexperiment.com
‣ Software: CST Studio 2012
‣ My role: volunteer HPC mentor
Simulating next-generation NMR probeheads
Why this was an interesting project
‣ Frontend interface is graphics heavy and requires Windows
‣ Studio ‘solvers’ run Linux or Windows; support GPUs and MPI task distribution
‣ Simultaneous use of local and cloud-based solvers actually works
‣ FLEXlm license server involved
‣ Non-trivial security and geo-location requirements
When we ran at modest scale ...
16 large compute nodes + 22 GPU nodes
$30/hour on the AWS Spot Market.
HPC on the cloud is real.
Design Attempt #1
‣ Hybrid Linux/Windows cloud running in AWS EU Region
‣ Failure:
• No GPU nodes in EU at the time
• No cc2.4xlarge at the time
Design Attempt #2
‣ Move hybrid Linux/Windows system to US-EAST
‣ ... with synthetic test data
‣ Best-practices VPC isolation & VPN access
‣ It looked like this ...
Architecture #2
Design Attempt #2
‣ Attempt #2 failed:
‣ The CST front-end controller running at the end-user site could not tolerate the NAT translation used by the solvers
‣ No GPU nodes available within VPC at that time
Design Attempt #3
‣ Design #3 finally works
‣ VPC shrunk to a single license server running in US-EAST
‣ All Windows/Linux/GPU solver nodes running in the EU
‣ No NAT, no VPC for solvers
‣ Extensive use of AWS spot instances
At experiment end it looked like this ...
Non Trivial HPC on the Cloud
16 large compute nodes + 22 GPU nodes
$30/hour on the AWS Spot Market.
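For a sense of scale, the spot pricing above works out as follows. The per-node average is my arithmetic, not a figure from the slides.

```python
compute_nodes = 16    # "large compute nodes" from the slide
gpu_nodes = 22        # GPU nodes from the slide
cost_per_hour = 30.0  # USD/hour for the whole cluster on the spot market

nodes = compute_nodes + gpu_nodes
avg_per_node = cost_per_hour / nodes

print(f"{nodes} nodes at ~${avg_per_node:.2f} per node-hour")
# 38 nodes at ~$0.79 per node-hour
```

Under a dollar per node-hour, with GPU nodes included, is the kind of number that makes the spot market hard to ignore for bursty HPC.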
Why this work was ‘easy’ on Amazon AWS ...
Nightmare on any other cloud
‣ Let’s discuss why this simulation workload would be much, much harder to do on some other cloud platform ...
Why this work was ‘easy’ on Amazon AWS ...
Nightmare on any other cloud
‘Brand X’ Cloud:
1. Virtual Servers
2. Block Storage
3. Object Storage
4. ... and maybe some other stuff if I’m lucky

AWS:
‣ EC2, S3, EBS, RDS, SNS, SQS, SWF, GPUs, SSDs, CloudFormation, VPC, ENIs, Security Groups, 10GbE, Direct Connect, Reserved Instances, Import/Export, Spot Market
‣ And ~25 other products and service features, with more added monthly
One very specific example: easy on AWS; much harder elsewhere
‣ The widely used FLEXlm license server uses NIC MAC addresses when generating license keys
‣ Different MAC? Science stops. Screwed.
‣ VPC ENIs allow separation of MAC address from Network Interface. Badass.
Why this work was ‘easy’ on Amazon AWS ... a few other examples
‣ VPC: Incredibly powerful. Actually useful. Approachable even if you are not an IPSEC or BGP routing god.
‣ Spot Market: Compelling economics. Once you start you’ll likely never run anywhere else.
‣ cc* & cg* EC2 instance types: The competition can’t compete. Fat nodes with bidirectional 10GbE bandwidth. And don’t get me started on SSD or Provisioned IOPS EBS volumes.