Zero to Production in Crazy Time: Adobe’s Transformation
DESCRIPTION
Adobe has quickly scaled from nothing to a huge presence in the AWS cloud. This is the story from the trenches: how we screwed up, learned and evolved our use of Chef to help get us to today. Taming Chef to work in the AWS cloud while trying to build a platform at a large scale was not as easy as we originally planned, and we’re consistently trying to make it better. We’ll share some tips and tricks from our experience.
TRANSCRIPT
Zero to Prod in Crazy Time
John Martinez | Adobe Cloud Services
About Me
• Currently working as a Cloud Operations Engineer at Adobe
• I get to figure out new stuff, and make really old stuff work in AWS
• 20+ years doing UNIX/Linux work
• Learned about cloud computing at Netflix
• Working at Adobe feeds my habit - photography
About Ops People
• Some people see us as Ninjas; I really see us as Storm Troopers
Cloud Platforms @ Adobe
• Creative Cloud
• Marketing Cloud
• Digital Publishing Suite
• PhoneGap
• Typekit
• Acrobat.com
• EchoSign
• Revel
• ...and growing...
How We Got Started
• Creative Cloud went live in late April 2012
• AWS from the start
• We needed to do SOMETHING
• Yes, it was really that scientific of a decision
• Chef vs. Puppet
• That learning curve
#EPICFAIL #1
• Not socializing the need for Chef to the dev team
• Once sold, keep momentum going
• The “let’s make this more complicated than it needs to be” syndrome
• Start with easy stuff first, then graduate
• Ops guy admits: the dev people know how to use software engineering methods for creating and maintaining infrastructure code: USE IT
Tweaking Knobs
• EC2 AMIs: bake or configure?
• Baking positive: fast boot times
• Baking negative: too static
• Configure positive: very dynamic
• Configure negative: can take forever to boot
• We settled on a mostly dynamic configuration, with some static baking
• knife-ec2 is great, but what about autoscale?
• The CloudFormation connection
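One way to picture the CloudFormation connection: knife-ec2 bootstraps instances you launch by hand, but an Auto Scaling group launches instances on its own, so the bootstrap has to travel with the launch configuration as user-data. The sketch below builds a minimal template as a Python dictionary. The resource names, AMI ID, instance type, profile name and group sizes are placeholders for illustration, not the values used at Adobe.

```python
import json


def build_template(ami_id, instance_type, instance_profile, user_data):
    """Return a minimal CloudFormation template as a dict (illustrative only)."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # Hypothetical logical name: describes how each instance starts.
            "ChefLaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": ami_id,
                    "InstanceType": instance_type,
                    "IamInstanceProfile": instance_profile,
                    # The bootstrap script rides along as user-data.
                    "UserData": {"Fn::Base64": user_data},
                },
            },
            # Hypothetical logical name: the group that launches the instances.
            "ChefAutoScaleGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "LaunchConfigurationName": {"Ref": "ChefLaunchConfig"},
                    "MinSize": "1",
                    "MaxSize": "4",
                },
            },
        },
    }


template = build_template(
    "ami-00000000",            # placeholder AMI ID
    "m1.small",                # placeholder instance type
    "chef-node-profile",       # placeholder IAM instance profile name
    "#!/bin/bash\n# bootstrap script goes here\n",
)
print(json.dumps(template, indent=2))
```

Keeping the bootstrap in the launch configuration means every instance the group starts, including replacements for killed ones, configures itself the same way.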
#EPICFAIL #2
• Get Chef, don’t actually use it
• Back to that learning curve (Hint: Training)
• Issue with compressed timelines and small staff
• In the heat of deploying prod, doing stupid things
• Losing track of what got deployed where
• Who’s doing what?
• Not sleeping sucks
Out of the Rubble
• Now that we’re live: refactor time (a.k.a. Fix all the broken stuff)
• Chef development for reals
• OMG: WINDOWS?!?!
• Not a lot of expertise in-house or outside
• Ops guy admits: learned to love dev tools like Jenkins and Git
It’s Alive!
• Did it gradually over time
• Started with simple recipes, graduated to more complicated ones
• Using Environments to deploy the right thing in the right place
• It’s AWS, stupid: you SHOULD kill your instances
• CloudFormation to AutoScale to Chef Client
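As a rough illustration of using Environments to deploy the right thing in the right place, the sketch below generates two Chef environment documents that pin different cookbook versions. The environment names, the "webapp" cookbook and the version numbers are invented for the example.

```python
import json


def make_environment(name, description, cookbook_pins):
    """Build a Chef environment document with exact cookbook version pins."""
    return {
        "name": name,
        "description": description,
        "json_class": "Chef::Environment",
        "chef_type": "environment",
        # "= x.y.z" pins the environment to exactly that cookbook version.
        "cookbook_versions": {
            cookbook: "= " + version for cookbook, version in cookbook_pins.items()
        },
        "default_attributes": {},
        "override_attributes": {},
    }


# Hypothetical example: staging runs a newer cookbook than prod.
staging = make_environment("staging", "Pre-production testing", {"webapp": "1.3.0"})
prod = make_environment("prod", "Live traffic", {"webapp": "1.2.0"})

for env in (staging, prod):
    print(json.dumps(env, indent=2, sort_keys=True))
```

A document like this can be saved to a file and loaded with `knife environment from file`, so a cookbook change reaches prod only when its pin is bumped.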
It’s Alive (v1)
[Diagram: manual workflow. Labels: Editor (vi), Perforce, cfn-create-stack, Hosted Chef (Cookbooks, Environment, Roles, Data bags), S3 Bucket (validator key), CloudFormation, Auto Scale Group, EC2 Instances, Data Bag Key, Recipes. Numbered steps: 0. Manual; 1. knife upload; 2; 3; 4. Chef Client Bootstrap.]
More Automation (v2)
[Diagram: automated workflow. Labels: Git, Jenkins, Jenkins CFN, Hosted Chef (Cookbooks, Environment, Roles, Data bags), S3 Bucket (validator key), CloudFormation, Auto Scale Group, EC2 Instances, Data Bag Key, Recipes. Numbered steps: 0. Automated; 1. knife upload; 2; 3; 4. Chef Client Bootstrap.]
On Bootstrapping EC2 Instances
• Biggest issue with Chef in AWS: straying from knife-ec2
• Read the bootstrap document and reverse engineer it
• http://wiki.opscode.com/display/chef/Client+Bootstrap+Fast+Start+Guide
• http://wiki.opscode.com/display/chef/EC2+Bootstrap+Fast+Start+Guide
• user-data is your friend
• Use it for node identity
• Resist the devil: don’t send any API keys or passwords or embarrassing things via user-data!!!
• Windows works this way, too, but learn PowerShell
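To make the user-data advice concrete, here is a minimal sketch that renders a Linux bootstrap script. It assumes the AWS CLI and chef-client are already on the AMI and that the instance has an IAM role allowed to read the validator key from S3. The server URL, validator name, bucket name and run list are placeholders. Only node identity goes into user-data; no keys or passwords do.

```python
import json

# EC2 instance metadata endpoint for the instance ID.
METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"


def render_user_data(chef_server_url, validation_client, environment,
                     run_list, key_bucket):
    """Render a bash bootstrap script that carries identity but no secrets."""
    first_boot = json.dumps({"run_list": run_list})
    lines = [
        "#!/bin/bash",
        "set -e",
        "mkdir -p /etc/chef",
        "# Node identity comes from the instance itself.",
        "INSTANCE_ID=$(curl -s %s)" % METADATA_URL,
        "# The validator key is fetched with the instance's IAM role,",
        "# so no credentials are embedded in user-data.",
        "aws s3 cp s3://%s/validation.pem /etc/chef/validation.pem" % key_bucket,
        "cat > /etc/chef/client.rb <<EOF",
        'chef_server_url "%s"' % chef_server_url,
        'validation_client_name "%s"' % validation_client,
        'node_name "%s-${INSTANCE_ID}"' % environment,
        "EOF",
        "cat > /etc/chef/first-boot.json <<'EOF'",
        first_boot,
        "EOF",
        "chef-client -j /etc/chef/first-boot.json -E %s" % environment,
    ]
    return "\n".join(lines) + "\n"


script = render_user_data(
    "https://api.opscode.com/organizations/example",  # placeholder organization
    "example-validator",                              # placeholder validator name
    "prod",
    ["role[base]"],                                   # placeholder run list
    "example-chef-bootstrap",                         # placeholder bucket name
)
print(script)
```

The rendered script is what would be passed as user-data in a launch configuration; on Windows the same idea applies, with PowerShell in place of bash.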
#EPICFAIL #3
Oh crap, Opscode is DOWN!!!
• Failing to architect for failure (double BAM)
• Even though we built a hot AWS architecture, we still got bit
• What does it mean when Hosted Chef is down for us?
• Talk to Opscode...really, talk to them, they want to help
How We’re Trying to Improve
• Mostly around availability
• Augment Hosted Chef with Private Chef
• Mostly around security
• Use the tools at your disposal
• IAM policies for EC2 roles and S3 bucket security
• Mostly around performance
• Refactoring AWS-related code to use AWS SDK for Ruby
• AMI factory from base Amazon Linux or Ubuntu AMIs (bonus points for Windows)
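On the security point, a narrowly scoped IAM policy attached to an EC2 role can stand in for credentials shipped to the instance. The sketch below builds a policy document that allows reading a single S3 object, such as a validator key. The bucket and key names are placeholders, and a real setup would likely need more statements than this.

```python
import json


def validator_key_policy(bucket, key):
    """Return an IAM policy document allowing read access to one S3 object."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Read-only: the instance can fetch the object, nothing else.
                "Action": ["s3:GetObject"],
                "Resource": "arn:aws:s3:::%s/%s" % (bucket, key),
            }
        ],
    }


# Placeholder bucket and key names.
policy = validator_key_policy("example-chef-bootstrap", "validation.pem")
print(json.dumps(policy, indent=2))
```

Scoping the resource to one object keeps a compromised instance from listing or reading anything else in the bucket.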
The End
• Operational scripts, template examples and other bits
• https://github.com/Adobe-CloudOps
• Contact me:
• @johnmartinez
• Questions? Suggestions? Come talk to me after!