Public Cloud Services using IBM Cloud and Netflix OSS
Jan 2014
Andrew Spyker – @aspyker
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
2
About me …
• IBM STSM, Performance Architecture and Strategy
• Eleven years in WebSphere performance
– Led the App Server Performance team for years
– Small sabbatical focused on IBM XML technology
– Work in the Emerging Technology Institute, CTO Office
– Now cloud service operations
• Email: [email protected]
– Blog: http://ispyker.blogspot.com/
– Linkedin: http://www.linkedin.com/in/aspyker
– Twitter: http://twitter.com/aspyker
– Github: http://www.github.com/aspyker
• RTP dad who enjoys technology as well as running, wine and poker
3
Develop or maintain a service today?
• Develop – yes
• Maintain – starting
• So far
– Multiple services inside of IBM
– Other services for use in our PaaS environment
4
http://www.flickr.com/photos/stevendepolo/
What qualifies me to talk?
• My monkey?
• Of the Cloud Prize's ~40 entrants
– Best example mash-up sample
• Nomination and win
– Best portability enhancement
• Nomination
• More on this coming …
• Other nominees - http://techblog.netflix.com/2013/09/netflixoss-meetup-s1e4-cloud-prize.html
• Other winners - http://techblog.netflix.com/2013/11/netflix-open-source-software-cloud.html
5
Seriously, how did I get here?
• Experience with performance and scale on standardized benchmarks (SPEC/TPC)
– Not representative of how to (web) scale
• Pinning, biggest monolithic DB “wins”, hand tuned for fixed size
– Out of date on modern architecture for mobile/cloud
• Created Acme Air
– http://bit.ly/acmeairblog
• Demonstrated that we could achieve (web) scale runs
– 4B+ mobile/browser requests/day
– With modern mobile and cloud best practices
6
What was shown?
• Peak performance and scale – You betcha!
• Operational visibility – Only during the run via nmon collection and post-run visualization
• True operational visibility – nope
• Devops – nope
• HA and DR – nope
• Manual and automatic elastic scaling – nope
7
What next?
• Went looking for the best industry practices around devops and high availability at web scale
– Many have documented theirs via research papers and on highscalability.com – Google, Twitter, Facebook, LinkedIn, etc.
• Why Netflix?
– Documented not only on their tech blog, but also released as working OSS on github
– Also, given their dependence on Amazon, they are a clear bellwether of web-scale public cloud availability
8
Steps to NetflixOSS understanding
• Recoded Acme Air application to make use of NetflixOSS runtime components
• Worked to implement a NetflixOSS devops and high availability setup around Acme Air (on EC2), run at previous levels of scale and performance on IBM middleware
• Worked to port NetflixOSS runtime and devops/high availability servers to IBM Cloud (SoftLayer) and RightScale
• Through public collaboration with the Netflix technical team
– Google groups, github and meetups
9
Why?
• To prove that an advanced cloud high availability and devops platform wasn’t “tied” to Amazon
• To understand how we can advance IBM cloud platforms for our customers
• To understand how we can host our IBM public cloud services better
10
Another Cloud Portability work of note
• In this presentation, focused on portability across public clouds
• What about applicability to private cloud?
• PayPal worked to port the cloud management system to OpenStack and Heat
– https://github.com/paypal/aurora
• Additional work required to port runtime aspects as we did in public cloud
Project Aurora
11
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
12
My view of Netflix goals
• As a business
– Be the best streaming media provider in the world
– Make the best content deals based on real data/analysis
• Technology wise
– Have the most availability possible
– “Stream starts per unit of time” is the KPI measured for the entire business
– Deliver features to customers first in market
• Requiring high velocity of IT change
– Do all of this at web scale
• Culture wise
– Create a high-performance delivery culture that attracts top talent
13
Standing on the shoulders of giants
• Public Cloud (Amazon)
– When adding streaming, Netflix decided they
• Shouldn’t invest in building data centers worldwide
• Had to plan for the streaming business to be very big
– Embraced cloud architecture, paying only for what they need
• Open Source
– Many parts of the runtime depend on open source
• Linux, Apache Tomcat, Apache Cassandra, etc.
• Requires top technical talent and OSS committers
– Realized that Amazon wasn’t enough
• Started a cloud platform on top that would eventually be open sourced – NetflixOSS
14
http://en.wikipedia.org/wiki/File:Andre_in_the_late_%2780s.jpg
NetflixOSS on Github
• “Technical indigestion as a service” – Adrian Cockcroft
• netflix.github.io
– 40+ OSS projects
– Expanding every day
• Focusing more on interactive mid-tier server technology today …
15
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
16
High Availability Thoughts
• Three of every part of your architecture
– EVERYTHING in your architecture (including IaaS components)
– Likely more via clustering/partitioning
– One = SPOF
– Two = slow active/standby recovery
– Three = where you get zero downtime when failures occur
• All parts of the application should fail independently
– No one part should take down the entire application
– When linked, the highest availability is limited to the lowest-availability component
– Apply the circuit breaker pattern to isolate systems
• If failure of a part of the system results in total end-user failure
– Use partitioning to ensure only some smaller percentage of users is affected
17
Failure
• What is failing?
– Underlying IaaS problems
• Instances, racks, availability zones, regions
– Software issues
• Operating system, servers, application code
– Surrounding services
• Other application services, DNS, user registries, etc.
• How is a component failing?
– Fails and disappears altogether
– Intermittently fails
– Works, but is responding slowly
– Works, but is causing users a poor experience
Inspiration
18
Overview of IaaS HA
• Launch instances into availability zones
– Instances of various sizes (compute, storage, etc.)
• Organized into regions and availability zones
– Availability zones are isolated from each other
– Availability zones are connected with low-latency links
– Regions contain availability zones
– Regions are independent of each other
– Regions have higher latency between each other
• This gives a high level of resilience to outages
– Outages are unlikely to affect multiple availability zones or regions
• Cloud providers require the customer to be aware of this topology to take advantage of its benefits within their application
[Diagram: Region (Dallas) and a second region, each containing three datacenters/availability zones]
19
Acme Air As A Sample
[Diagram: Internet → ELB → Web App Front End (REST services) → App Service (Authentication) → Data Tier]
20
Greatly simplified …
Micro-services architecture
• Decompose the system into isolated services that can be developed separately
• Why?
– They can fail independently vs. failing together monolithically
– They can be developed and released at different velocities by different teams
• To show this we created a separate “auth service” for Acme Air
• In a typical customer facing application any single front end invocation could spawn 20-30 calls to services and data sources
21
How do services advertise themselves?
• Upon web app startup, the Karyon server is started
– Karyon will configure (via Archaius) the application
– Karyon will register the location of the instance with Eureka (see the illustrative config below)
• Others can know of the existence of the service
• Lease based, so instances continue to check in, updating the list of available instances
– Karyon will also expose a JMX console and a healthcheck URL
• Devops can change things about the service via JMX
• The system can monitor the health of the instance
[Diagram: App Service (Authentication) running Karyon in the app server, configured from config.properties/auth-service.properties or remote Archaius stores, registering name, port, IP address and healthcheck URL with the Eureka server(s)]
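A minimal sketch of the registering side, assuming the standard eureka-client.properties conventions; the service name, port and Eureka URL below are hypothetical:

# illustrative eureka-client.properties (hypothetical values)
eureka.name=auth-service
eureka.vipAddress=auth-service
eureka.port=8080
eureka.preferSameZone=true
eureka.shouldUseDns=false
# where this client finds the Eureka server(s)
eureka.serviceUrl.default=http://eureka1.example.com:8080/eureka/v2/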
22
How do consumers find services?
• Service consumers query Eureka at startup and periodically to determine the location of dependencies
– Can query within an availability zone and across availability zones
[Diagram: Web App Front End (REST services) with an embedded Eureka client asking the Eureka server(s) what “auth-service” instances exist]
23
Demo
24
How does the consumer call the service?
• Protocol implementations have Eureka-aware load balancing support built in
– In-client load balancing – does not require a separate LB tier
• Ribbon – REST client (see the sketch below)
– Pluggable load balancing scheme
– Built-in failure recovery support (retry next server, mark instance as failing, etc.)
• Other Eureka-enabled clients
– Custom code in non-Java or Ribbon-enabled systems (Java or pure REST)
– More from Netflix
• Memcached (EVCache), Astyanax (Cassandra and Priam) coming
[Diagram: Web App Front End (REST services) using the Ribbon REST client and the Eureka client to call “auth-service” across the App Service (Authentication) instances]
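A minimal sketch (not the Acme Air code) of calling the auth service through Ribbon’s Eureka-aware RestClient; the named client, resource path and response handling are assumptions, and the exact API differs between Ribbon versions:

import java.net.URI;
import com.netflix.client.ClientFactory;
import com.netflix.client.http.HttpRequest;
import com.netflix.client.http.HttpResponse;
import com.netflix.niws.client.http.RestClient;

public class AuthServiceCaller {
    public String login() throws Exception {
        // "auth-service" is a Ribbon named client whose server list can be fed by Eureka
        RestClient client = (RestClient) ClientFactory.getNamedClient("auth-service");
        HttpRequest request = HttpRequest.newBuilder()
                .uri(new URI("/rest/api/login"))   // hypothetical resource path
                .build();
        // Ribbon picks an instance and retries the next server on failure
        HttpResponse response = client.executeWithLoadBalancer(request);
        return response.getEntity(String.class);
    }
}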
25
PS. This is a common pattern
• Same idea, but different implementations
– Airbnb.com’s SmartStack
• Zookeeper/Synapse/Nerve/HAProxy
– Parse.com’s clustering
• Zookeeper/Nginx
26
How to deploy this with HA?
Instances?
• Asgard deploys across AZs
• Using auto scaling groups managed by Asgard
• More on Asgard later
Eureka?
• DNS and Elastic IP trickery
• Deployed across AZs
• For clients to find Eureka servers
– A DNS TXT record for the domain lists the AZ TXT records
– The AZ TXT records have the list of Eureka servers
• For new Eureka servers
– Look for the list of Eureka server IPs for the AZ it’s coming up in
– Look for unassigned elastic IPs, grab one and assign it to itself
– Sync with the other already assigned IPs that likely are hosting Eureka server instances
• Simpler configurations with less HA are available
27
Protect yourself from unhealthy services
• Wrap all calls to services with the Hystrix command pattern (sketch below)
– Hystrix implements the circuit breaker pattern
– Executes the command using a semaphore or separate thread pool to guarantee return within finite time to the caller
– If an unhealthy service is detected, start to call the fallback implementation (broken circuit) and periodically check if the main implementation works (reset circuit)
• Hystrix also provides caching and request collapsing, with synchronous and asynchronous (reactive via RxJava) invocation
[Diagram: Web App Front End (REST services) executing the auth-service call through a Hystrix command that wraps the Ribbon REST client, with a fallback implementation when the circuit is broken]
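A minimal, hypothetical HystrixCommand sketch for the auth-service call (callAuthService stands in for the real Ribbon call; this is not the actual Acme Air code):

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public class AuthServiceCommand extends HystrixCommand<String> {
    private final String token;

    public AuthServiceCommand(String token) {
        // commands in the same group share a thread pool by default
        super(HystrixCommandGroupKey.Factory.asKey("AuthService"));
        this.token = token;
    }

    @Override
    protected String run() throws Exception {
        // normal path: call the auth micro-service (e.g. through the Ribbon client)
        return callAuthService(token);
    }

    @Override
    protected String getFallback() {
        // used when the circuit is open, the call fails, or it times out
        return "anonymous";
    }

    private String callAuthService(String token) throws Exception {
        throw new UnsupportedOperationException("placeholder for the Ribbon call");
    }
}

// usage: String session = new AuthServiceCommand(token).execute();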
28
Denominator
• Most (simple) geographic (region) based disaster recovery depends on front end DNS traffic switching
• Java library and CLI for DNS configuration across providers
• Allows for common, quicker (than using the various DNS provider UIs) and automated DNS updates
• Plugins have been developed by various DNS providers
29
Augmenting the ELB tier – Zuul
• Originally developed to do cross-region routing for regional HA
– Advanced geographic (region) based disaster recovery
• Zuul also adds devops support in the front tier routing
– Stress testing (squeeze testing)
– Canary testing
– Dynamic routing
– Load shedding
– Debugging
• And some common function
– Authentication
– Security
– Static response handling
– Multi-region resiliency (DR for the ELB tier)
– Insight
• Through dynamically deployable filters (written in Groovy; see the sketch below)
• Eureka aware using Ribbon and Archaius, as shown in the runtime section
[Diagram: load balancers in Region 1 and Region 2 routing to Zuul, which applies filters before forwarding to the edge services]
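A minimal sketch of a Zuul filter; Netflix writes these in Groovy and hot-deploys them, but the ZuulFilter API looks the same from Java. The canary-routing decision and the "routeToCanary" attribute are purely illustrative:

import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

public class CanaryRoutingFilter extends ZuulFilter {
    @Override
    public String filterType() { return "pre"; }    // run before routing to the origin

    @Override
    public int filterOrder() { return 10; }

    @Override
    public boolean shouldFilter() {
        return RequestContext.getCurrentContext().getRequest() != null;
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        // illustrative: flag ~1% of requests for a canary cluster; a later routing filter would act on it
        if (Math.random() < 0.01) {
            ctx.set("routeToCanary", true);
        }
        return null;
    }
}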
30
HA in application architecture
• Stateless application design
– Legacy application design has state
– Temporal state should be pushed to caching servers
– Durable state should be pushed to partitioned data servers
– Trades off peak latency for uptime (sometimes no trade-off)
• Partitioned data servers
– Wealth of NoSQL servers available today
– Be careful of oversold “consistency” promises
• Look for third-party “Jepsen-like” testing
– Be ready to deal with compensation-based approaches
– Consider differences between system-of-record and interaction data stores
31
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
32
Automatic Recovery Thoughts
• Automatic recovery depends on elastic, ephemeral instance cluster design powered by “auto scaling”
• If something fails once, it will fail again
• No repeated failure should be a pager call
– Instead it should be an email with automated recovery information to be analyzed offline
• Test failure on your system before the system tests your failure
33
Auto Scaling (for the masses)
• For many, auto scaling is more about auto recovery
– Far more important to keep N instances running than to be able to scale automatically to 2N, 10N, 100N
• For many, automatic scaling isn’t appropriate
– First understand how the system can be elastically scaled manually with operator expertise
34
Asgard
• Asgard is the console for automatic scaling and recovery
[Diagram: Asgard telling the IaaS in the Dallas region to start Web App (REST Services) and App Service (Authentication) instances across three datacenters/availability zones and to keep that many instances running]
35
Asgard creates an “application”
• Enforces common practices for deploying code
– Common approach to linking auto scaling groups to launch configurations, load balancers, security groups, scaling policies and images
• Adds a missing concept to the IaaS domain model – “application”
– Application clustering and lifecycle vs. individually launched and managed images
• Example
– Application – app1
– Cluster – app1-env
– Asgard group version n – app1-env-v009
– Asgard group version n+1 – app1-env-v010
36
When to test recovery (and HA)?
• Failure is inevitable. Don’t try to avoid it!
• How do you know if your backup is good?
– Try to restore from your backup every so often
– Better to ensure the backup works before you have a crashed system and find out your backup is broken
• How do you know if your system is HA?
– Try to force failures every so often
– Better to force those failures during office hours
– Better to ensure HA before you have a down system and angry users
– Best to learn from failures and add automated tests
37
The Simian Army
• A bunch of automated “monkeys” that perform automated system administration tasks
• Anything that is done by a human more than once can and should be automated
• Absolutely necessary at web scale
38
Bad Monkeys
• Open sourced – Chaos Monkey (illustrative configuration sketch below)
– Used to randomly terminate instances
– Now can block network, burn CPU, kill processes, fail the Amazon API, fail DNS, fail DynamoDB, fail S3, introduce network errors/latency, detach volumes, fill disks, burn I/O
• Not yet open sourced
– Chaos Gorilla
• Kill datacenter/availability zone instances
– Chaos Kong
• Kill all instances in an entire region
– Latency Monkey
• Introduce latency into service calls directly (Ribbon server side)
– Split Brain Monkey
• Datacenters/availability zones continue to operate, but isolated from each other
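A short, illustrative chaos.properties fragment for running Chaos Monkey unleashed against one auto scaling group; the property names follow the SimianArmy sample configuration but should be treated as assumptions and checked against the project’s docs (the ASG name is hypothetical):

# illustrative chaos.properties fragment (hypothetical ASG name)
simianarmy.chaos.enabled = true
simianarmy.chaos.leashed = false
simianarmy.chaos.ASG.enabled = true
simianarmy.chaos.ASG.probability = 1.0
simianarmy.chaos.ASG.acmeair-auth-service.enabled = true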
39
http://www.flickr.com/photos/27261720@N00/132750805
Elastic Scale
• Basic elastic scale is required to achieve high availability
– To run three or more of any component
• Front tier specific considerations
– Will likely need to scale far higher than the micro-services
– Use distributed caching with TTLs where appropriate
– Otherwise the micro-service architecture could overload data servers
• Scaling larger (to web scale) will find bottlenecks that require changes to architecture and/or tuning
– Iterative process of improvement
40
Elastic scaling in application architecture
• Clusters that replicate data within the cluster must discover new peers (and timeout dead ones)
• Clusters that connect to other clusters must discover new dependency instances (and timeout dead ones)
• Many legacy architectures contain static cluster definitions that require “re-starts” to update information
– Code changes required to leverage dynamic connectivity
41
Full Auto Scaling
• Eventually web scale will require auto scaling based on policy
– Attach policies based on request latency, utilization, queue depth, etc. (illustrative sketch below)
• Words of caution – be careful to
– Design policies to be proactive on scale-up, or risk scaling that isn’t fast enough to keep up with demand
– Design policies to be generous on scale-down, or risk over-scaling down and an immediate need to scale up
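As a rough illustration (using the AWS Java SDK directly rather than Asgard, with hypothetical names and thresholds), a proactive scale-up policy attached to an ASG and driven by a CloudWatch alarm might look like:

import com.amazonaws.services.autoscaling.AmazonAutoScalingClient;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyRequest;
import com.amazonaws.services.autoscaling.model.PutScalingPolicyResult;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class ScaleUpPolicyExample {
    public static void main(String[] args) {
        AmazonAutoScalingClient autoScaling = new AmazonAutoScalingClient();
        AmazonCloudWatchClient cloudWatch = new AmazonCloudWatchClient();

        // add two instances when the alarm fires
        PutScalingPolicyResult policy = autoScaling.putScalingPolicy(new PutScalingPolicyRequest()
                .withAutoScalingGroupName("acmeair-auth-service-v010")   // hypothetical ASG
                .withPolicyName("scale-up-on-cpu")
                .withAdjustmentType("ChangeInCapacity")
                .withScalingAdjustment(2)
                .withCooldown(300));

        // trigger well below saturation so new capacity arrives before demand outruns it
        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("acmeair-auth-service-high-cpu")
                .withNamespace("AWS/EC2")
                .withMetricName("CPUUtilization")
                .withDimensions(new Dimension()
                        .withName("AutoScalingGroupName")
                        .withValue("acmeair-auth-service-v010"))
                .withStatistic(Statistic.Average)
                .withPeriod(60)
                .withEvaluationPeriods(3)
                .withThreshold(60.0)
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions(policy.getPolicyARN()));
    }
}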
42
Scaling Continues to Evolve
• Reactive auto scaling is “easy” but naïve
– Instances fail
– Unexpected spikes in demand
• What if your traffic is “predictable”? Consider
– User population follows a daily pattern
– User population known to follow different patterns each day (work days vs. weekends)
– End-of-month influx of work
• Scryer is Netflix’s predictive analytics, so they don’t have to wait for reactive scaling
– Better end-user experience, less over-deployment (cheaper), more consistent utilization (cheaper)
– Not yet open sourced
43
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• How to grade public cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
44
Thoughts on Continuous Delivery
• Legacy waterfall habits are hard to break
– “Leaks” of the old world continue to show
– Especially if the product has to be released in “shrink-wrapped” form in parallel
• The Netflix approach and technology assist in breaking these habits
– Provide the tools and proof points and the organization will follow
Inspiration
45
Continuous Delivery Pipeline
• Developers
– Perform local testing before checking code into the continuous build
• Continuous build
– Builds code, tests code and flags any breaks for immediate attention
– Builds packages ready for image installation
• Image bakery
– Builds images for deployment that then show up in Asgard
• Continuous deployment
– Images deployed through Asgard
– Instances are given the image and environmental context from Asgard
• The same images should be used in production that are used in test
• Due to the micro-services (API as contract) approach
– No need to coordinate typical deployments across teams
46
Asgard devops procedures
• Fast rollback
• Canary testing
• Red/Black pushes
• More through REST interfaces
– Ad hoc processes allowed, enforced through the Asgard model
• More coming using Glisten and workflow services
47
Demo
48
Ability to reconfigure – Archaius
• Using dynamic properties, you can easily change properties across a cluster of applications, either
– NetflixOSS named props
• Hystrix timeouts, for example
– Custom dynamic props
• High throughput achieved by a polling approach
• HA of the configuration source depends on what source you use
– HTTP server, database, etc.
[Diagram: Archaius configuration hierarchy – application, library and container properties layered over runtime, URL, persisted DB and JMX/Karyon console sources]
DynamicIntProperty prop = DynamicPropertyFactory.getInstance()
        .getIntProperty("myProperty", DEFAULT_VALUE);
int value = prop.get(); // value will change over time based on configuration
49
Get baked!
• Caution: flame/troll bait ahead!!
– Criticism – “Netflix is ruining the cloud”
• Overhead of images for every code version
• Ties to Amazon AMIs (have proven this tie can be broken)
• Netflix takes the approach of baking images as part of the build such that
– Instance boot-up doesn’t depend on outside servers
– Instance boot-up only starts servers already set to run
– New code = new instances (never update instances in place)
• Why?
– Critical when launching hundreds of servers at a time
– Goal to reduce the failure points in places where dynamic system configuration doesn’t provide value
– Speed of elastic scaling – boot and go
– Discourages ad hoc changes to server instances
50
Aminator
• Starting image/volume
– Foundational image created (maybe via loopback); base AMI with common software created/tested independently
• Aminator running – bakery
– Bakery obtains a known EBS volume of the base image from a pool
– Bakery mounts the volume and provisions the application (apt/deb or yum/rpm)
– Bakery snapshots and registers the snapshot
• Recent work to add other provisioning, such as Chef, as plugins
51
Imaginator
• An implementation of Aminator
– For the IBM SoftLayer cloud
• Creates image templates
– Starts from a base OS and adds deb/rpm’s
• Snapshots images for later deployment
• Not yet open sourced
52
Good Monkeys
• Janitor Monkey
– Somewhat of a mitigation for the baking approach
– Will mark and sweep unused resources (instances, volumes, snapshots, ASGs, launch configs, images, etc.)
– Owners are notified, then resources are removed
• Conformity Monkey
– Checks that instances are conforming to rules around security, ASG/ELB, age, status/health check, etc.
53
http://www.flickr.com/photos/sonofgroucho/5852049290
Agenda
• How did I get here?
• Netflix overview, Netflix OSS teaser
• Cloud services
– High Availability
– Automatic recovery
– Continuous delivery
– Operational visibility
• Get started yourself
54
Thoughts on Operational Visibility
• Programming model to expose metrics should be simple
• Systems need to expose internals in a way that is sensible to the owners and operators
• The tools that view the internals need to match the level of abstraction developers care about
• The tools must give sufficient context when viewing any single metric or alert
55
Monitoring - Servo
• Annotation-based publishing of application metrics through JMX (sketch below)
• Gauges, counters, and timers
• Filters, Observers, and Pollers to publish metrics
– Can export metrics to metric collection servers
• Netflix exposes their metrics to Atlas
– The entire Netflix monitoring infrastructure hasn’t been open sourced due to complexity and priority
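A minimal sketch of Servo’s annotation-based metrics (the class and metric names are hypothetical); Monitors.registerObject exposes the annotated fields through JMX so observers/pollers can pick them up:

import java.util.concurrent.atomic.AtomicInteger;
import com.netflix.servo.annotations.DataSourceType;
import com.netflix.servo.annotations.Monitor;
import com.netflix.servo.monitor.Monitors;

public class AuthServiceMetrics {
    @Monitor(name = "authRequestCount", type = DataSourceType.COUNTER)
    private final AtomicInteger requestCount = new AtomicInteger(0);

    @Monitor(name = "activeSessions", type = DataSourceType.GAUGE)
    private final AtomicInteger activeSessions = new AtomicInteger(0);

    public AuthServiceMetrics() {
        // makes the annotated metrics visible to JMX and to any registered metric observers/pollers
        Monitors.registerObject("authService", this);
    }

    public void onLogin() {
        requestCount.incrementAndGet();
        activeSessions.incrementAndGet();
    }
}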
56
Back to Hystrix
• The main reason for Hystrix is to protect yourself from dependencies, but …
• The same layer of indirection to services can provide visualization
• You can aggregate the view across clusters via Turbine (illustrative config below)
• Other alert systems and dashboards can read from Turbine
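As a rough illustration, a property-based Turbine discovery configuration might look like the following; the property names follow Turbine 1.x conventions and the cluster/host names are hypothetical:

# which cluster(s) to aggregate
turbine.aggregator.clusterConfig=auth-service
# where each instance exposes its Hystrix metrics stream
turbine.instanceUrlSuffix=:8080/hystrix.stream
# static instance list (Eureka-based instance discovery can be plugged in instead)
turbine.ConfigPropertyBasedDiscovery.auth-service.instances=host1.example.com,host2.example.com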
57
Edda
• IaaS does not typically provide
– Historical views of the state of the system
– All the views across components an operator might want to see
• Edda polls current state and stores the data in a queryable database
• Provides an ad hoc queryable view of all deployment aspects (example queries below)
• Provides a historical view
– For correlating problems to changes
– Becoming a more commonplace feature in clouds
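A couple of hedged examples of the kind of REST queries Edda supports (the hostname is hypothetical and the matrix-argument syntax should be checked against the Edda wiki):

GET http://edda.example.com/api/v2/view/instances;_pp
    current view of all instances, pretty printed
GET http://edda.example.com/api/v2/view/instances/i-0123456789;_diff;_all;_expand
    history of a single instance, shown as a diff across its recorded changes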
58
Ice
• Cloud spend and usage analytics
• Communicates with the billing API to give a bird’s-eye view of cloud spend, with drill-down to region, availability zone, and service team through application groups
• Watches differently priced instances and instance sizes to help optimize
• Not point-in-time
– Shows trends to help predict future optimizations
59
Agenda
• Blah, blah, blah
• How can I learn more?
• How do I play with this?
• Let’s write some code!
60
Want to play?
• NetflixOSS blog and github
– http://techblog.netflix.com
– http://github.com/Netflix
• NetflixOSS as ported to IBM Cloud
– https://github.com/EmergingTechnologyInstitute
– SoftLayer Image Templates coming soon
• Acme Air, NetflixOSS AMIs
– Try Asgard/Eureka with a real application
– http://bit.ly/aa-AMIs
• See what we ported to IBM Cloud (video)
– http://bit.ly/noss-sl-blog
• Fork and submit pull requests to Acme Air
– http://github.com/aspyker/acmeair-netflix
61
Thanks!
Questions?