continuous deployment: the dirty details
DESCRIPTION
Presented at ALM Summit 3 in Redmond, WA. January 2013. Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: [email protected]. http://www.etsy.com/careersTRANSCRIPT
CONTINUOUS DEPLOYMENT
The Dirty Details
“OK, sounds cool. But I have some questions...”
CD 100- & 200-levels
- CI environment for automated tests- Committing to trunk- Branching in code- Config flags (a.k.a. feature flags)- DevOps mentality- Metrics and alerting- Automated deploy scripts
credit: photobookgirl (flickr)
CD 100- & 200-levels
- CI environment for automated tests- Committing to trunk- Branching in code- Config flags (a.k.a. feature flags)- DevOps mentality- Metrics and alerting- Automated deploy scripts
- Deploys vs. releases- Decoupled systems, schema changes- How we work: Arch. & Process- Integration and Operations
CD 300 level
credit: photobookgirl (flickr)
www. .com
GROSS MERCHANDISE SALES
http://www.etsy.com/blog/news/2013/notes-from-chad-2012-year-in-review/
DECEMBER 20121.5 Billion page views$117 Million of goods sold6 Million items sold
http://www.etsy.com/blog/news/2013/etsy-statistics-december-2012-weather-report/Items by anjaysdesigns, betwixxt, OneStarLeatherGoods, mediumcontrol, TheDesignPallet
175+ Committers, everyone deploys
credit: martin_heigan (flickr)
Very end of 2009Today
DEPLOYMENTS PER DAY
Continuous delivery is a pattern language in growing use in software development to improve the process of software delivery.
~wikipedia
Techniques such as automated testing, continuous integration, and continuous deployment allow software to be developed to a high standard and easily packaged and deployed to test environments, resulting in the ability to rapidly, reliably and repeatedly push out enhancements and bug fixes to customers at low risk and with minimal manual overhead.
credit: Stewart, redgen (flickr)
Linux, Apache, MySQL, PHPArchitecture Stack
Memcache, Gearman, Postgresql, Solr, Java, Traffic Server, Hadoop, HBase
credit: Saire Elizabeth (flickr)
Git, Jenkins
2010-today2009
Then Now
Just before we started using CD
15 mins6-14 hours1 person“Deployment Army”
Then Now
Part of everydayworkflow
Special event andhighly orchestrated
Blocked for15 minutes.
15 minutes toredeploy
Blocked for6-14 hours.
6+ hours toredeploy
Then Now
Mainline,minimal linking
and building,rsync,site up
Release branch,database schemas,
data transforms,packaging,
rolling restarts,cache purging,
scheduled downtime
Then Now
1!" #$%Put your face on Etsy.com.
2&# #$%Complete tax, insurance, and
benefits forms.
credit: ktpupp (flickr)
WARNING
GROSS MERCHANDISE SALES
Continuous DeploymentSmall, frequent changes.
Constantly integrating into production.30+ deploys per day.
“Wow... 30 deploys a day.How do you build features so quickly?”
Software Deploy ≠ Product Launch
Deploys frequently gated by config flags(“dark” releases)
$cfg[‘new_search’] = array('enabled' => 'off');$cfg[‘sign_in’] = array('enabled' => 'on');$cfg[‘checkout’] = array('enabled' => 'on');$cfg[‘homepage’] = array('enabled' => 'on');
$cfg[‘new_search’] = array('enabled' => 'off');
$cfg[‘new_search’] = array('enabled' => 'off');
// Meanwhile...
# old and boring search$results = do_grep();
$cfg[‘new_search’] = array('enabled' => 'off');
// Meanwhile...
if ($cfg[‘new_search’] == ‘on’) { # New and fancy search $results = do_solr();} else { # old and boring search $results = do_grep();}
$cfg[‘new_search’] = array('enabled' => 'on');
// or...
$cfg[‘new_search’] = array('enabled' => 'staff');
// or...
$cfg[‘new_search’] = array('enabled' => '1%');
// or...
$cfg[‘new_search’] = array('enabled' => 'users', 'user_list' => 'mike,john,kellan');
Validate in production, hidden from public.
Small incremental changes to the applicationNew classes, methods, controllersGraphics, stylesheets, templatesCopy/content changes
Turning flags on, off, or % ramp up
What’s in a deploy?
Latent bugs and security holesTraffic management, load sheddingAdding and removing infrastructure
Tweaking config flags or releasing patches.
Low MTTR (response times)
http://www.flickr.com/photos/flyforfun/2694158656/
http://www.flickr.com/photos/flyforfun/2694158656/
OperatorConfig flags
Metrics
Favorites“on”
Favorites“off”
Many deploys eventually lead to a product launch.
“How do you continuously deploy database schema changes?”
Code deploysSchema changes
~15-20 minutes
Code deploysSchema changes
~15-20 minutesTHURSDAYS!
Our web application is largely monolithic.
Etsy.com, Support & Back-office tools, Developer API, Gearman (async work)
Our web application is largely monolithic.
Etsy.com, Support & Back-office tools, Developer API, Gearman (async work)
PHP, Apache, Memcache
External “services” are not deployed with the main application.
e.g. Databases, Search, Photo storage, Payments
External “services” are not deployed with the main application.
e.g. Databases, Search, Photo storage, Payments
MYSQL(schema changes)
SOLR, JVM(rolling restarts)
PROXY CACHE, FILERS, AMAZON S3
(specialized infra.)
PCI(controlled access)
For every config flag, there are two stateswe can support — present and future.
For every config flag, there are two stateswe can support — present and future.
... or past and present.
“Non-Breaking Expansions”Expose new version in a service interface;
support multiple versions in the consumer.
Example: Changing a Database SchemaMerging “users” and “users_prefs”
RULE OF THUMB:Prefer ADDs over ALTERs (non-breaking expansion)C
1. Write to both versions2. Backfill historical data3. Read from new version4. Cut-off writes to old version
0. Add new version to schema1. Write to both versions2. Backfill historical data3. Read from new version4. Cut-off writes to old version
0. Add new version to schemaSchema change to add prefs columns to “users” table.
“write_prefs_to_user_prefs_table” => “on”“write_prefs_to_users_table” => “off”“read_prefs_from_users_table” => “off”
1. Write to both versionsWrite code for writing prefs to the “users” table.
“write_prefs_to_user_prefs_table” => “on”“write_prefs_to_users_table” => “on”“read_prefs_from_users_table” => “off”
2. Backfill historical dataOffline process to sync existing data from “user_prefs”to new columns in “users”
3. Read from new versionData validation tests. Ensure consistency both internallyand in production.
“write_prefs_to_user_prefs_table” => “on”“write_prefs_to_users_table” => “on”“read_prefs_from_users_table” => “staff”
3. Read from new versionData validation tests. Ensure consistency both internallyand in production.
“write_prefs_to_user_prefs_table” => “on”“write_prefs_to_users_table” => “on”“read_prefs_from_users_table” => “1%”
3. Read from new versionData validation tests. Ensure consistency both internallyand in production.
“write_prefs_to_user_prefs_table” => “on”“write_prefs_to_users_table” => “on”“read_prefs_from_users_table” => “5%”
3. Read from new versionData validation tests. Ensure consistency both internallyand in production.
“write_prefs_to_user_prefs_table” => “on”“write_prefs_to_users_table” => “on”“read_prefs_from_users_table” => “on”
(“on” == “100%”)
4. Cut-off writes to old versionAfter running on the new table for a significant amount of time, we can cut off writes to the old table.
“write_prefs_to_user_prefs_table” => “off”“write_prefs_to_users_table” => “on”“read_prefs_from_users_table” => “on”
“Branch by Astraction”
Controller Controller
Users Model
“users” (old) “user_prefs” “users”
old schema new schema
(Abstraction)
http://paulhammant.com/blog/branch_by_abstraction.htmlhttp://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/
1. Write to both versions2. Backfill historical data3. Read from new version4. Cut-off writes to old version
“The Migration 4-Step”
“When do you clean up all of those config flags?
We might remove config flags for the old version when...
It is no longer valid for the business.It is no longer stable, maintained, or trusted.It has poor performance characteristics.The code is a mess, or difficult to read.We can afford to spend time on it.
Promote “dev flags” to “feature flags”
// Feature flag$cfg[‘mobilized_pages’] = array('enabled' => 'on');
// Dev flags$cfg[‘mobile_templates_seller_tools’] = array('enabled' => 'on');$cfg[‘mobile_templates_account_tools’] = array('enabled' => 'on');$cfg[‘mobile_templates_member_profile’] = array('enabled' => 'on');$cfg[‘mobile_templates_search’] = array('enabled' => 'off');$cfg[‘mobile_templates_activity_feed’] = array('enabled' => 'off');
...
if ($cfg[‘mobilized_pages’] == ‘on’ && $cfg[‘mobile_templates_search’] == ‘on’) { // ... // ...}
// Feature flags$cfg[‘search’] = array('enabled' => 'on');$cfg[‘developer_api’] = array('enabled' => 'on');$cfg[‘seller_tools’] = array('enabled' => 'on');
$cfg[‘the_entire_web_site’] = array('enabled' => 'on');
// Feature flags$cfg[‘search’] = array('enabled' => 'on');$cfg[‘developer_api’] = array('enabled' => 'on');$cfg[‘seller_tools’] = array('enabled' => 'on');
$cfg[‘the_entire_web_site’] = array('enabled' => 'on');$cfg[‘the_entire_web_site_no_really_i_mean_it’] = array('enabled' => 'on');
Architecture and Process
Deploying is cheap.
Releasing is cheap.
Some philosophies on product development...
Gathering data should be cheap, too.staff, opt-in prototypes, 1%
Treat first iterations as experiments.
Get into code as quickly as possible.
“Where a new system concept or new technology is used, one has to build a system to throw away, for even the best planning is not so omniscient as to get it right the first time. Hence plan to throw one away; you will, anyhow.”
~ Fred Brooks, The Mythical Man-Month
Architecture largely doesn’t matter.
Kill things that don’t work.
Your assumptions will be wrongonce you’ve scaled 10x.
“We don’t optimize for being right. We optimize for quickly detecting when we’re wrong.”
~Kellan Elliott-McCrea, CTO
Become really good at changingyour architecture.
Invest time in architecture by the2nd or 3rd iteration.
Integration and Operations
WARNING
REMEMBER THIS?
Continuous DeploymentSmall, frequent changes.
Constantly integrating into production.30 deploys per day.
Code review before commit
Automated tests before deploy
Why Integrate with Production?
Dev ≠ Prod
Verify frequently and in small batches.
Integrating with production is a test in itself.We do this frequently and in small batches.
More database servers in prod.Bigger database hardware in prod.More web servers.Various replication schemes.Different versions of server and OS software.Schema changes applied at different times.Physical hardware in prod.More data in prod.Legacy data (7 years of odd user states).More traffic in prod.Wait, I mean MUCH more traffic in prod.Fewer elves.Faster disks (SSDs) in prod.
Using a MySQL database in dev for an application that will be runningon Oracle in production: Priceless
Verify frequently and in small batches.
Dev ⇾ QA ⇾ Staging ⇾ Prod
Dev ⇾ QA ⇾ Staging ⇾ Prod
Dev ⇾ Pre-Prod ⇾ Prod
Test and integrate where you’ll see value.
Config flags (again)
off, on, staff, opt-in prototypes, user list, 0-100%
Config flags (again)
off, on, staff, opt-in prototypes, user list, 0-100%
“canary pools”
Automated tests after deploy
Real-time metrics and dashboardsNetwork & Servers, Application, Business
SERVER METRICS
Apache requests/sec, Busy processes,CPU utilization, Script exec time (med. & 95th)
APPLICATION METRICS
Logins, Registrations, Checkouts, Listings created, Forum posts
Time and event correlated.
Real humans reporting trouble!
“Theoretical” vs. “Practical” Engineering
http://www.flickr.com/photos/flyforfun/2694158656/
OperatorConfig flags
Metrics
Thanksgiving, “Black Friday,” “Cyber Monday” ➔ Christmas(~30 days)
Managing risk during Holiday Shopping season
Code Freeze?
“Code Slush”
DEPLOYMENTS PER DAY
Tighten your feedback cyclesIntegrate with production and validate early in cycle.Use tools that allow you to detect issues early.Optimize for quick response times.
Applied to both feature development and operability.
Thank you
Mike Brittain
... and questions?
These slides will be available later today at http://mikebrittain.com/talks
ENGINEERING DIRECTOR