Public Cloud Services using IBM Cloud and Netflix OSS
Jan 2014
Andrew Spyker – @aspyker
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
2
About me …
• IBM STSM, Performance Architecture and Strategy
• Eleven years in WebSphere performance
– Led the App Server Performance team for years
– Small sabbatical focused on IBM XML technology
– Work in the Emerging Technology Institute, CTO Office
– Now cloud service operations
• Email: [email protected]
– Blog: http://ispyker.blogspot.com/
– Linkedin: http://www.linkedin.com/in/aspyker
– Twitter: http://twitter.com/aspyker
– Github: http://www.github.com/aspyker
• RTP dad who enjoys technology as well as running, wine and poker
3
Develop or maintain a service today?
• Develop – yes
• Maintain – starting
• So far
– Multiple services inside of IBM
– Other services for use in our PaaS environment
4
http://www.flickr.com/photos/stevendepolo/
What qualifies me to talk?
• My monkey?
• Of the Cloud Prize's ~40 entrants
– Best example mash-up sample
• Nomination and win
– Best portability enhancement
• Nomination
• More on this coming …
• Other nominees - http://techblog.netflix.com/2013/09/netflixoss-meetup-s1e4-cloud-prize.html
• Other winners - http://techblog.netflix.com/2013/11/netflix-open-source-software-cloud.html
5
Seriously, how did I get here?
• Experience with performance and scale on standardized benchmarks (SPEC/TPC)
– Not representative of how to (web) scale
• Pinning, biggest monolithic DB “wins”, hand tuned for fixed size
– Out of date on modern architecture for mobile/cloud
• Created Acme Air
– http://bit.ly/acmeairblog
• Demonstrated that we could achieve (web) scale runs
– 4B+ mobile/browser requests/day
– With modern mobile and cloud best practices
6
What was shown?
• Peak performance and scale – You betcha!
• Operational visibility – Only during the run via nmon collection and post-run visualization
• True operational visibility – nope
• Devops – nope
• HA and DR – nope
• Manual and automatic elastic scaling – nope
7
What next?
• Went looking for the best industry practices around devops and high availability at web scale
– Many have documented theirs via research papers and on highscalability.com – Google, Twitter, Facebook, LinkedIn, etc.
• Why Netflix?
– Documented not only on their tech blog, but also released as working OSS on github
– Also, given their dependence on Amazon, they are a clear bellwether of web-scale public cloud availability
8
Steps to NetflixOSS understanding
• Recoded Acme Air application to make use of NetflixOSS runtime components
• Worked to implement a NetflixOSS devops and high availability setup around Acme Air (on EC2), run at previous levels of scale and performance on IBM middleware
• Worked to port NetflixOSS runtime and devops/high availability servers to IBM Cloud (SoftLayer) and RightScale
• Through public collaboration with the Netflix technical team
– Google groups, github and meetups
9
Why?
• To prove that an advanced cloud high availability and devops platform wasn’t “tied” to Amazon
• To understand how we can advance IBM cloud platforms for our customers
• To understand how we can host our IBM public cloud services better
10
Another Cloud Portability work of note
• In this presentation, focused on portability across public clouds
• What about applicability to private cloud?
• PayPal worked to port the cloud management system to OpenStack and Heat
– https://github.com/paypal/aurora
• Additional work required to port runtime aspects as we did in public cloud
Project Aurora
11
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
12
My view of Netflix goals
• As a business
– Be the best streaming media provider in the world
– Make the best content deals based on real data/analysis
• Technology wise
– Have the most availability possible
– “Stream starts per unit of time” is the KPI measured for the entire business
– Deliver features to customers first in market
• Requiring high velocity of IT change
– Do all of this at web scale
• Culture wise
– Create a high-performance delivery culture that attracts top talent
13
Standing on the shoulders of giants
• Public Cloud (Amazon)
– When adding streaming, Netflix decided they
• Shouldn’t invest in building data centers worldwide
• Had to plan for the streaming business to be very big
– Embraced cloud architecture, paying only for what they need
• Open Source
– Many parts of the runtime depend on open source
• Linux, Apache Tomcat, Apache Cassandra, etc.
• Requires top technical talent and OSS committers
– Realized that Amazon wasn’t enough
• Started a cloud platform on top that would eventually be open sourced – NetflixOSS
14
http://en.wikipedia.org/wiki/File:Andre_in_the_late_%2780s.jpg
NetflixOSS on Github
• “Technical indigestion as a service” – Adrian Cockcroft
• netflix.github.io
– 40+ OSS projects
– Expanding every day
• Focusing more on interactive mid-tier server technology today …
15
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
16
High Availability Thoughts
• Three of every part of your architecture
– EVERYTHING in your architecture (including IaaS components)
– Likely more via clustering/partitioning
– One = SPOF
– Two = slow active/standby recovery
– Three = where you get zero downtime when failures occur
• All parts of the application should fail independently
– No one part should take down the entire application
– When linked, the highest availability is limited to the lowest-availability component
– Apply the circuit breaker pattern to isolate systems
• If failure of a part of the system results in total end-user failure
– Use partitioning to ensure only some smaller percentage of users is affected
17
Failure
• What is failing?
– Underlying IaaS problems
• Instances, racks, availability zones, regions
– Software issues
• Operating system, servers, application code
– Surrounding services
• Other application services, DNS, user registries, etc.
• How is a component failing?
– Fails and disappears altogether
– Intermittently fails
– Works, but is responding slowly
– Works, but is causing users a poor experience
Inspiration
18
Overview of IaaS HA
• Launch instances into availability zones
– Instances of various sizes (compute, storage, etc.)
• Organized into regions and availability zones
– Availability zones are isolated from each other
– Availability zones are connected with low-latency links
– Regions contain availability zones
– Regions are independent of each other
– Regions have higher latency between each other
• This gives a high level of resilience to outages
– Outages are unlikely to affect multiple availability zones or regions
• Cloud providers require the customer to be aware of this topology to take advantage of its benefits within their application
[Diagram: Region (Dallas) and a second region, each containing three datacenters/availability zones]
19
Acme Air As A Sample
[Diagram: Internet → ELB → Web App Front End (REST services) → App Service (Authentication) → Data Tier]
20
Greatly simplified …
Micro-services architecture
• Decompose the system into isolated services that can be developed separately
• Why?
– They can fail independently vs. failing together monolithically
– They can be developed and released at different velocities by different teams
• To show this we created a separate “auth service” for Acme Air
• In a typical customer facing application any single front end invocation could spawn 20-30 calls to services and data sources
21
How do services advertise themselves?
• Upon web app startup, the Karyon server is started
– Karyon will configure (via Archaius) the application
– Karyon will register the location of the instance with Eureka (see the illustrative config below)
• Others can know of the existence of the service
• Lease based, so instances continue to check in, updating the list of available instances
– Karyon will also expose a JMX console and a healthcheck URL
• Devops can change things about the service via JMX
• The system can monitor the health of the instance
[Diagram: App Service (Authentication) running Karyon in the app server, configured from config.properties/auth-service.properties or remote Archaius stores, registering name, port, IP address and healthcheck URL with the Eureka server(s)]
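A minimal sketch of the registering side, assuming the standard eureka-client.properties conventions; the service name, port and Eureka URL below are hypothetical:

# illustrative eureka-client.properties (hypothetical values)
eureka.name=auth-service
eureka.vipAddress=auth-service
eureka.port=8080
eureka.preferSameZone=true
eureka.shouldUseDns=false
# where this client finds the Eureka server(s)
eureka.serviceUrl.default=http://eureka1.example.com:8080/eureka/v2/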
22
How do consumers find services?
• Service consumers query Eureka at startup and periodically to determine the location of dependencies
– Can query within an availability zone and across availability zones
[Diagram: Web App Front End (REST services) with an embedded Eureka client asking the Eureka server(s) what “auth-service” instances exist]
23
Demo
24
How does the consumer call the service?
• Protocol implementations have Eureka-aware load balancing support built in
– In-client load balancing – does not require a separate LB tier
• Ribbon – REST client (see the sketch below)
– Pluggable load balancing scheme
– Built-in failure recovery support (retry next server, mark instance as failing, etc.)
• Other Eureka-enabled clients
– Custom code in non-Java or Ribbon-enabled systems (Java or pure REST)
– More from Netflix
• Memcached (EVCache), Astyanax (Cassandra and Priam) coming
[Diagram: Web App Front End (REST services) using the Ribbon REST client and the Eureka client to call “auth-service” across the App Service (Authentication) instances]
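A minimal sketch (not the Acme Air code) of calling the auth service through Ribbon’s Eureka-aware RestClient; the named client, resource path and response handling are assumptions, and the exact API differs between Ribbon versions:

import java.net.URI;
import com.netflix.client.ClientFactory;
import com.netflix.client.http.HttpRequest;
import com.netflix.client.http.HttpResponse;
import com.netflix.niws.client.http.RestClient;

public class AuthServiceCaller {
    public String login() throws Exception {
        // "auth-service" is a Ribbon named client whose server list can be fed by Eureka
        RestClient client = (RestClient) ClientFactory.getNamedClient("auth-service");
        HttpRequest request = HttpRequest.newBuilder()
                .uri(new URI("/rest/api/login"))   // hypothetical resource path
                .build();
        // Ribbon picks an instance and retries the next server on failure
        HttpResponse response = client.executeWithLoadBalancer(request);
        return response.getEntity(String.class);
    }
}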
25
PS. This is a common pattern
• Same idea, but different implementations
– Airbnb.com’s SmartStack
• Zookeeper/Synapse/Nerve/HAProxy
– Parse.com’s clustering
• Zookeeper/Nginx
26
How to deploy this with HA?
Instances?
• Asgard deploys across AZs
• Using auto scaling groups managed by Asgard
• More on Asgard later
Eureka?
• DNS and Elastic IP trickery
• Deployed across AZs
• For clients to find Eureka servers
– A DNS TXT record for the domain lists the AZ TXT records
– The AZ TXT records have the list of Eureka servers
• For new Eureka servers
– Look for the list of Eureka server IPs for the AZ it’s coming up in
– Look for unassigned elastic IPs, grab one and assign it to itself
– Sync with the other already assigned IPs that likely are hosting Eureka server instances
• Simpler configurations with less HA are available
27
Protect yourself from unhealthy services
• Wrap all calls to services with the Hystrix command pattern (sketch below)
– Hystrix implements the circuit breaker pattern
– Executes the command using a semaphore or separate thread pool to guarantee return within finite time to the caller
– If an unhealthy service is detected, start to call the fallback implementation (broken circuit) and periodically check if the main implementation works (reset circuit)
• Hystrix also provides caching and request collapsing, with synchronous and asynchronous (reactive via RxJava) invocation
[Diagram: Web App Front End (REST services) executing the auth-service call through a Hystrix command that wraps the Ribbon REST client, with a fallback implementation when the circuit is broken]
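A minimal, hypothetical HystrixCommand sketch for the auth-service call (callAuthService stands in for the real Ribbon call; this is not the actual Acme Air code):

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class AuthServiceCommand extends HystrixCommand<String> {
    private final String token;

    public AuthServiceCommand(String token) {
        // commands in the same group share a thread pool by default
        super(HystrixCommandGroupKey.Factory.asKey("AuthService"));
        this.token = token;
    }

    @Override
    protected String run() throws Exception {
        // normal path: call the auth micro-service (e.g. through the Ribbon client)
        return callAuthService(token);
    }

    @Override
    protected String getFallback() {
        // used when the circuit is open, the call fails, or it times out
        return "anonymous";
    }

    private String callAuthService(String token) throws Exception {
        throw new UnsupportedOperationException("placeholder for the Ribbon call");
    }
}

// usage: String session = new AuthServiceCommand(token).execute();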
28
Denominator
• Most (simple) geographic (region) based disaster recovery depends on front end DNS traffic switching
• Java library and CLI for DNS configuration across providers
• Allows for common, quicker (than using the various DNS provider UIs) and automated DNS updates
• Plugins have been developed by various DNS providers
29
Augmenting the ELB tier – Zuul
• Originally developed to do cross-region routing for regional HA
– Advanced geographic (region) based disaster recovery
• Zuul also adds devops support in the front tier routing
– Stress testing (squeeze testing)
– Canary testing
– Dynamic routing
– Load shedding
– Debugging
• And some common function
– Authentication
– Security
– Static response handling
– Multi-region resiliency (DR for the ELB tier)
– Insight
• Through dynamically deployable filters (written in Groovy; see the sketch below)
• Eureka aware using Ribbon and Archaius, as shown in the runtime section
[Diagram: load balancers in Region 1 and Region 2 routing to Zuul, which applies filters before forwarding to the edge services]
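A minimal sketch of a Zuul filter; Netflix writes these in Groovy and hot-deploys them, but the ZuulFilter API looks the same from Java. The canary-routing decision and the "routeToCanary" attribute are purely illustrative:

import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

public class CanaryRoutingFilter extends ZuulFilter {
    @Override
    public String filterType() { return "pre"; }    // run before routing to the origin

    @Override
    public int filterOrder() { return 10; }

    @Override
    public boolean shouldFilter() {
        return RequestContext.getCurrentContext().getRequest() != null;
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        // illustrative: flag ~1% of requests for a canary cluster; a later routing filter would act on it
        if (Math.random() < 0.01) {
            ctx.set("routeToCanary", true);
        }
        return null;
    }
}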
30
HA in application architecture
• Stateless application design
– Legacy application design has state
– Temporal state should be pushed to caching servers
– Durable state should be pushed to partitioned data servers
– Trades off peak latency for uptime (sometimes no trade-off)
• Partitioned data servers
– Wealth of NoSQL servers available today
– Be careful of oversold “consistency” promises
• Look for third-party “Jepsen-like” testing
– Be ready to deal with compensation-based approaches
– Consider differences between system-of-record and interaction data stores
31
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
32
Automatic Recovery Thoughts
• Automatic recovery depends on elastic, ephemeral instance cluster design powered by “auto scaling”
• If something fails once, it will fail again
• No repeated failure should be a pager call
– Instead it should be an email with automated recovery information to be analyzed offline
• Test failure on your system before the system tests your failure
33
Auto Scaling (for the masses)
• For many, auto scaling is more about auto recovery
– Far more important to keep N instances running than to be able to scale automatically to 2N, 10N, 100N
• For many, automatic scaling isn’t appropriate
– First understand how the system can be elastically scaled manually with operator expertise
34
Asgard
• Asgard is the console for automatic scaling and recovery
[Diagram: Asgard telling the IaaS in the Dallas region to start Web App (REST Services) and App Service (Authentication) instances across three datacenters/availability zones and to keep that many instances running]
35
Asgard creates an “application”
• Enforces common practices for deploying code
– Common approach to linking auto scaling groups to launch configurations, load balancers, security groups, scaling policies and images
• Adds a missing concept to the IaaS domain model – “application”
– Application clustering and lifecycle vs. individually launched and managed images
• Example
– Application – app1
– Cluster – app1-env
– Asgard group version n – app1-env-v009
– Asgard group version n+1 – app1-env-v010
36
When to test recovery (and HA)?
• Failure is inevitable. Don’t try to avoid it!
• How do you know if your backup is good?
– Try to restore from your backup every so often
– Better to ensure the backup works before you have a crashed system and find out your backup is broken
• How do you know if your system is HA?
– Try to force failures every so often
– Better to force those failures during office hours
– Better to ensure HA before you have a down system and angry users
– Best to learn from failures and add automated tests
37
The Simian Army
• A bunch of automated “monkeys” that perform automated system administration tasks
• Anything that is done by a human more than once can and should be automated
• Absolutely necessary at web scale
38
Bad Monkeys
• Open sourced – Chaos Monkey (illustrative configuration sketch below)
– Used to randomly terminate instances
– Now can block network, burn CPU, kill processes, fail the Amazon API, fail DNS, fail DynamoDB, fail S3, introduce network errors/latency, detach volumes, fill disks, burn I/O
• Not yet open sourced
– Chaos Gorilla
• Kill datacenter/availability zone instances
– Chaos Kong
• Kill all instances in an entire region
– Latency Monkey
• Introduce latency into service calls directly (Ribbon server side)
– Split Brain Monkey
• Datacenters/availability zones continue to operate, but isolated from each other
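A short, illustrative chaos.properties fragment for running Chaos Monkey unleashed against one auto scaling group; the property names follow the SimianArmy sample configuration but should be treated as assumptions and checked against the project’s docs (the ASG name is hypothetical):

# illustrative chaos.properties fragment (hypothetical ASG name)
simianarmy.chaos.enabled = true
simianarmy.chaos.leashed = false
simianarmy.chaos.ASG.enabled = true
simianarmy.chaos.ASG.probability = 1.0
simianarmy.chaos.ASG.acmeair-auth-service.enabled = true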
39
http://www.flickr.com/photos/27261720@N00/132750805
Elastic Scale
• Basic elastic scale is required to achieve high availability
– To run three or more of any component
• Front tier specific considerations
– Will likely need to scale far higher than the micro-services
– Use distributed caching with TTLs where appropriate
– Otherwise the micro-service architecture could overload data servers
• Scaling larger (to web scale) will find bottlenecks that require changes to architecture and/or tuning
– Iterative process of improvement
40
Elastic scaling in application architecture
• Clusters that replicate data within the cluster must discover new peers (and timeout dead ones)
• Clusters that connect to other clusters must discover new dependency instances (and timeout dead ones)
• Many legacy architectures contain static cluster definitions that require “re-starts” to update information
– Code changes required to leverage dynamic connectivity
41
Full Auto Scaling
• Eventually web scale will require auto scaling based on policy
– Attach policies based on request latency, utilization, queue depth, etc. (illustrative sketch below)
• Words of caution – be careful to
– Design policies to be proactive on scale-up, or risk scaling that isn’t fast enough to keep up with demand
– Design policies to be generous on scale-down, or risk over-scaling down and an immediate need to scale up
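As a rough illustration (using the AWS Java SDK directly rather than Asgard, with hypothetical names and thresholds), a proactive scale-up policy attached to an ASG and driven by a CloudWatch alarm might look like:

import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyResult;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class ScaleUpPolicyExample {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoScaling = new AmazonAutoScalingClient();
        AmazonCloudWatchClient cloudWatch = new AmazonCloudWatchClient();

        // add two instances when the alarm fires
        PutScalingPolicyResult policy = autoScaling.putScalingPolicy(new PutScalingPolicyRequest()
                .withAutoScalingGroupName("acmeair-auth-service-v010")   // hypothetical ASG
                .withPolicyName("scale-up-on-cpu")
                .withAdjustmentType("ChangeInCapacity")
                .withScalingAdjustment(2)
                .withCooldown(300));

        // trigger well below saturation so new capacity arrives before demand outruns it
        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("acmeair-auth-service-high-cpu")
                .withNamespace("AWS/EC2")
                .withMetricName("CPUUtilization")
                .withDimensions(new Dimension()
                        .withName("AutoScalingGroupName")
                        .withValue("acmeair-auth-service-v010"))
                .withStatistic(Statistic.Average)
                .withPeriod(60)
                .withEvaluationPeriods(3)
                .withThreshold(60.0)
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions(policy.getPolicyARN()));
    }
}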
42
Scaling Continues to Evolve
• Reactive auto scaling is “easy” but naïve
– Instances fail
– Unexpected spikes in demand
• What if your traffic is “predictable”? Consider
– User population follows a daily pattern
– User population known to follow different patterns each day (work days vs. weekends)
– End-of-month influx of work
• Scryer is Netflix’s predictive analytics, so they don’t have to wait for reactive scaling
– Better end-user experience, less over-deployment (cheaper), more consistent utilization (cheaper)
– Not yet open sourced
43
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• How to grade public cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
44
Thoughts on Continuous Delivery
• Legacy waterfall habits are hard to break
– “Leaks” of the old world continue to show
– Especially if the product has to be released in “shrink-wrapped” form in parallel
• The Netflix approach and technology assist in breaking these habits
– Provide the tools and proof points and the organization will follow
Inspiration
45
Continuous Delivery Pipeline
• Developers
– Perform local testing before checking code into the continuous build
• Continuous build
– Builds code, tests code and flags any breaks for immediate attention
– Builds packages ready for image installation
• Image bakery
– Builds images for deployment that then show up in Asgard
• Continuous deployment
– Images deployed through Asgard
– Instances are given the image and environmental context from Asgard
• The same images should be used in production that are used in test
• Due to the micro-services (API as contract) approach
– No need to coordinate typical deployments across teams
46
Asgard devops procedures
• Fast rollback
• Canary testing
• Red/Black pushes
• More through REST interfaces
– Ad hoc processes allowed, enforced through the Asgard model
• More coming using Glisten and workflow services
47
Demo
48
Ability to reconfigure – Archaius
• Using dynamic properties, you can easily change properties across a cluster of applications, either
– NetflixOSS named props
• Hystrix timeouts, for example
– Custom dynamic props
• High throughput achieved by a polling approach
• HA of the configuration source depends on what source you use
– HTTP server, database, etc.
[Diagram: Archaius configuration hierarchy – application, library and container properties layered over runtime, URL, persisted DB and JMX/Karyon console sources]
DynamicIntProperty prop = DynamicPropertyFactory.getInstance()
        .getIntProperty("myProperty", DEFAULT_VALUE);
int value = prop.get(); // value will change over time based on configuration
49
Get baked!
• Caution: flame/troll bait ahead!!
– Criticism – “Netflix is ruining the cloud”
• Overhead of images for every code version
• Ties to Amazon AMIs (have proven this tie can be broken)
• Netflix takes the approach of baking images as part of the build such that
– Instance boot-up doesn’t depend on outside servers
– Instance boot-up only starts servers already set to run
– New code = new instances (never update instances in place)
• Why?
– Critical when launching hundreds of servers at a time
– Goal to reduce the failure points in places where dynamic system configuration doesn’t provide value
– Speed of elastic scaling – boot and go
– Discourages ad hoc changes to server instances
50
Aminator
• Starting image/volume
– Foundational image created (maybe via loopback); base AMI with common software created/tested independently
• Aminator running – bakery
– Bakery obtains a known EBS volume of the base image from a pool
– Bakery mounts the volume and provisions the application (apt/deb or yum/rpm)
– Bakery snapshots and registers the snapshot
• Recent work to add other provisioning, such as Chef, as plugins
51
Imaginator
• An implementation of Aminator
– For the IBM SoftLayer cloud
• Creates image templates
– Starts from a base OS and adds deb/rpm’s
• Snapshots images for later deployment
• Not yet open sourced
52
Good Monkeys
• Janitor Monkey
– Somewhat of a mitigation for the baking approach
– Will mark and sweep unused resources (instances, volumes, snapshots, ASGs, launch configs, images, etc.)
– Owners are notified, then resources are removed
• Conformity Monkey
– Checks that instances are conforming to rules around security, ASG/ELB, age, status/health check, etc.
53
http://www.flickr.com/photos/sonofgroucho/5852049290
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
54
Thoughts on Operational Visibility
• Programming model to expose metrics should be simple
• Systems need to expose internals in a way that is sensible to the owners and operators
• The tools that view the internals need to match the level of abstraction developers care about
• The tools must give sufficient context when viewing any single metric or alert
55
Monitoring - Servo
• Annotation-based publishing of application metrics through JMX (sketch below)
• Gauges, counters, and timers
• Filters, Observers, and Pollers to publish metrics
– Can export metrics to metric collection servers
• Netflix exposes their metrics to Atlas
– The entire Netflix monitoring infrastructure hasn’t been open sourced due to complexity and priority
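A minimal sketch of Servo’s annotation-based metrics (the class and metric names are hypothetical); Monitors.registerObject exposes the annotated fields through JMX so observers/pollers can pick them up:

import java.util.concurrent.atomic.AtomicInteger;
import com.netflix.servo.annotations.DataSourceType;
import com.netflix.servo.annotations.Monitor;
import com.netflix.servo.monitor.Monitors;

public class AuthServiceMetrics {
    @Monitor(name = "authRequestCount", type = DataSourceType.COUNTER)
    private final AtomicInteger requestCount = new AtomicInteger(0);

    @Monitor(name = "activeSessions", type = DataSourceType.GAUGE)
    private final AtomicInteger activeSessions = new AtomicInteger(0);

    public AuthServiceMetrics() {
        // makes the annotated metrics visible to JMX and to any registered metric observers/pollers
        Monitors.registerObject("authService", this);
    }

    public void onLogin() {
        requestCount.incrementAndGet();
        activeSessions.incrementAndGet();
    }
}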
56
Back to Hystrix
• The main reason for Hystrix is to protect yourself from dependencies, but …
• The same layer of indirection to services can provide visualization
• You can aggregate the view across clusters via Turbine (illustrative config below)
• Other alert systems and dashboards can read from Turbine
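As a rough illustration, a property-based Turbine discovery configuration might look like the following; the property names follow Turbine 1.x conventions and the cluster/host names are hypothetical:

# which cluster(s) to aggregate
turbine.aggregator.clusterConfig=auth-service
# where each instance exposes its Hystrix metrics stream
turbine.instanceUrlSuffix=:8080/hystrix.stream
# static instance list (Eureka-based instance discovery can be plugged in instead)
turbine.ConfigPropertyBasedDiscovery.auth-service.instances=host1.example.com,host2.example.com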
57
Edda
• IaaS does not typically provide
– Historical views of the state of the system
– All the views across components an operator might want to see
• Edda polls current state and stores the data in a queryable database
• Provides an ad hoc queryable view of all deployment aspects (example queries below)
• Provides a historical view
– For correlating problems to changes
– Becoming a more commonplace feature in clouds
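A couple of hedged examples of the kind of REST queries Edda supports (the hostname is hypothetical and the matrix-argument syntax should be checked against the Edda wiki):

GET http://edda.example.com/api/v2/view/instances;_pp
    current view of all instances, pretty printed
GET http://edda.example.com/api/v2/view/instances/i-0123456789;_diff;_all;_expand
    history of a single instance, shown as a diff across its recorded changes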
58
Ice
• Cloud spend and usage analytics
• Communicates with the billing API to give a bird’s-eye view of cloud spend, with drill-down to region, availability zone, and service team through application groups
• Watches differently priced instances and instance sizes to help optimize
• Not point-in-time
– Shows trends to help predict future optimizations
59
Agenda
• Blah, blah, blah
• How can I learn more?
• How do I play with this?
• Let’s write some code!
60
Want to play?
• NetflixOSS blog and github
– http://techblog.netflix.com
– http://github.com/Netflix
• NetflixOSS as ported to IBM Cloud
– https://github.com/EmergingTechnologyInstitute
– SoftLayer Image Templates coming soon
• Acme Air, NetflixOSS AMIs
– Try Asgard/Eureka with a real application
– http://bit.ly/aa-AMIs
• See what we ported to IBM Cloud (video)
– http://bit.ly/noss-sl-blog
• Fork and submit pull requests to Acme Air
– http://github.com/aspyker/acmeair-netflix
61
Thanks!
Questions?