Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013


Upload: amazon-web-services

Post on 15-Jan-2015

DESCRIPTION

Increasingly, mobile and other connected devices are leveraging the scalability and capabilities of the cloud to deliver services to end users. However, connecting these devices to the cloud presents unique challenges. Resource constraints make it impossible to use many common frameworks and transport restrictions make it difficult to use dynamic cloud resources. In this session, learn how you can develop and deploy highly-scalable global solutions using Amazon Web Services (Amazon Virtual Private Cloud, Elastic IP addresses, Amazon Route 53, Auto Scaling) and tools like Puppet. Hear how Panasonic and Banjo architect their cloud infrastructure from both a start-up and enterprise perspective.

TRANSCRIPT

Page 1: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

KW Justin Leung, Banjo

November 15, 2013

Scaling From Zero

to 6 Million Mobile Users

Page 2: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Banjo

• Real-time location meets social data

• An engineering-focused company

• Events recommendation, alert, & discovery

• Top Developer and Editor’s Choice in Google Play

• Named Top 10 World Innovator in Local - Fast Company

Page 3: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Growth Factors

• Grew from 0 to 5+ million users in 2 years

• Indexed over 700 Million profiles

• Processing billions of location-based social posts

• Geospatial indexing for 500K+ posts per hour

• Categorized thousands of event types

• Over 50 Million background jobs processed daily
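To put the figures above in per-second terms, here is a small back-of-the-envelope calculation. It assumes the load is spread evenly over the period, which real traffic is not, so these are averages and peaks would be higher. The helper name `per_second` is purely illustrative.

```python
# Average rates derived from the slide's figures (even spread assumed).
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60


def per_second(count: int, period_seconds: int) -> float:
    """Average events per second for `count` events over a period."""
    return count / period_seconds


# 50 million background jobs per day.
jobs_per_second = per_second(50_000_000, SECONDS_PER_DAY)
# 500K posts geospatially indexed per hour.
posts_per_second = per_second(500_000, SECONDS_PER_HOUR)

print(f"background jobs: ~{jobs_per_second:.0f} per second")
print(f"geospatial indexing: ~{posts_per_second:.0f} posts per second")
```

On average that is roughly 579 background jobs and 139 indexed posts every second.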

Page 4: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

The Stack

• Amazon EC2 / Elastic Load Balancing / Amazon S3 / Elastic Beanstalk / Heroku

• Ruby on Rails

• MongoDB

• Redis

• Memcached

• Sidekiq

• NewRelic / PagerDuty / Papertrail / Graphite

Page 5: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

First 9 Months, from 0 to 1 Million

• Amazon EC2 deployment with Rubber

• No background jobs, frontend instances only

• Hosted MongoDB clusters

• 0 DevOp

Page 6: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Challenges @ 1M Users

• Limited engineering resources

• Not too agile with Rubber

• Outgrew hosted MongoDB limit

• No DevOp

Page 7: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Growing to 2M+ Users

• Migrated from EC2 instances to Heroku

• Delayed jobs: GirlFriday -> Qu -> Sidekiq

• In-house MongoDB clusters on EC2

• Social graph increased to 300M+ profiles

• 1 DBA / DevOp

Page 8: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Challenges @ 2M Users

• Explosion of social graph size

• Cost to process background jobs

• Latency to poll external social feeds

Page 9: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Banjo @ 4M+ Users

• 100x Heroku workers

• Social graph increased to 400M+ profiles

• Indexed one month of global location-based posts

• Tens of millions of background jobs processed daily

• Still 1 DBA / DevOp

Page 10: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Challenges @ 4M Users

• Heroku Dynos limited to 512MB of memory, slow CPU

• Heroku routing latency becomes obvious

• Bloated codebase, limited forking for concurrency

• Power users with large social graphs churn data

Page 11: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Now, 6 Million Users

• Social graph increased to 700M profiles

• Heroku -> Elastic Beanstalk

• Service-oriented architecture

• Unicorn -> Elastic Beanstalk with Nginx + Passenger

• 50 Million background jobs processed daily

• Hundreds of EC2 instances

• And... still 1 DBA/Dev-Op

Page 12: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Heroku PROS / CONS

• The Pros:

– Brainless deploy / rollback flow

– Instant availability of dynos and workers

– Zero setup & maintenance cost

– No Dev-Op need

• The Cons:

– Limited memory & CPU make it hard for concurrency

– Routing layer latency

– No built-in auto-scaling, limited available zones (US/EU)

– Not enough control, limited access when there are platform issues

Page 13: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Elastic Beanstalk - PROS / CONS

• The Pros:

– Choice of instance types, Availability Zones

– Increased concurrency with Passenger / Nginx, support for auto-scaling

– Low latency with Amazon Route 53 & Elastic Load Balancing

– Cost efficient

• The Cons:

– Initial setup cost for Beanstalk containers and environments

– Slow container updates - currently Ruby 1.9 / Passenger 3.0.17 + Nginx 1.23

– Time to spin up new instances for seamless deploys

– There’s some learning curve to Elastic Beanstalk scripts

Page 14: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Managed EC2 Instances

• MongoDB instances (DBA)

• Elastic Beanstalk managed environments (Eng)

• Heroku managed services (Eng)

• Elastic Beanstalk + Heroku can easily be managed by a small, agile engineering team

Page 15: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Recommendation for startups:

• Start prototyping on small scale PaaS services

• Add-ons are really helpful: Papertrail, NewRelic, hosted MongoDB/Redis/Memcached/Metrics

• Pager alerts with ScoutApp, Pingdom, PagerDuty

• Make use of health & metrics dashboards

• Deploy frequently & scale up along the way

Page 18: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Cloud Connected Devices On A Global Scale

Bryant Eastham, Panasonic

November 15, 2013

Page 19: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Two roads diverged in a wood, and I—

I took the one less traveled by …

Robert Frost

Page 20: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Understanding “Small” And “Cloud”

• What is “small”?

– Production scale that required minimal cost

– Devices that Moore forgot

– Speed in MHz, memory in KB, solution designed around resources

• What is “cloud”?

– HTML/XML for transport

– SSL for security

– Solutions typically don’t consider resource-constrained devices

Page 21: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Implications Of Small Devices

• Support for whitelisting

– Yes, it is still done

– No, “open up all outbound traffic” is not acceptable

• One-stop connectivity

– Not every protocol uses TCP

– UDP is still great for some things, and required for others (NTP)

Page 22: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Cloud System Requirements

• Support whitelisting (fixed IP address set)

• Support UDP as well as TCP

• Support Auto Scaling and Elastic Load Balancing

• Off-instance logging and monitoring

• AWS gets us 90% there – the last 10% is our focus today

Page 23: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Connectivity Using Elastic IP Addresses, With A Configuration Detour

Page 24: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Meeting The Requirements

• Reuse when possible, invent when necessary

• Reuse = Amazon Web Services:

– Amazon VPC

– EIP addresses

– Amazon Route 53

(IP1 .. IPn)

Page 25: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Application-managed IP Addresses

• “Standard” EIPs are not enough

– All addresses must be active all the time

– Addresses must move to adapt to scale changes

– Support multiple addresses per instance for low-scale periods

• Application-managed IP addresses fill the gaps

– All addresses can be active (assigned to an instance)

– API control of EIP assignment provides migration during scaling, and this can be done “cleanly” by the application

However, only VPC instances allow multiple EIP assignments
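One way to picture "all addresses active all the time" is a function that spreads a fixed address pool over however many instances are currently running. The sketch below is an illustration only, not the scheme used in the talk: round-robin assignment is an assumption, `distribute_addresses` is a hypothetical name, and the addresses come from the reserved documentation ranges.

```python
from typing import Dict, List


def distribute_addresses(pool: List[str], instances: List[str]) -> Dict[str, List[str]]:
    """Assign every address in the pool to some instance, round-robin.

    No address is left unassigned, so a whitelisted address always
    reaches a live instance, even when only a few instances are running.
    """
    if not instances:
        raise ValueError("at least one instance is required")
    assignment: Dict[str, List[str]] = {instance: [] for instance in instances}
    for index, address in enumerate(pool):
        owner = instances[index % len(instances)]
        assignment[owner].append(address)
    return assignment


pool = ["203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.4"]

# Low-scale period: two instances carry two addresses each.
low_scale = distribute_addresses(pool, ["i-a", "i-b"])

# After scaling out: four instances carry one address each.
high_scale = distribute_addresses(pool, ["i-a", "i-b", "i-c", "i-d"])
```

Holding several addresses on one instance during low-scale periods is the part that, per the slide, requires VPC instances.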

Page 26: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Chickens And Eggs

• Managed IP addresses require multiple EIPs and configuration

• VPC is required to allow multiple EIP management

• Configuration requires Puppet and AWS access

• Puppet and AWS require access to the network (from the VPC)

• Network access requires instance configuration and a VPC bridge

• Instance configuration is part of application configuration (managed IP address information)

• Rinse, lather, repeat…

Page 27: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Breaking The Cycle

• Each VPC requires a bridge for network access

• Putting a Puppet Master on this bridge breaks the cycle of network access/configuration

– Allows the use of VPC security groups to control access

– Use a lightweight instance

– Assign any EIP to allow external access

– Configure to support VPC/Internet bridging

• All VPC instances are configured to use the Puppet Master as their gateway (initially)

Page 28: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Too Many Puppet Masters

[Diagram: five VPCs (one per AZ), each with its own address set (IP1 .. IPn) and its own Bridge/Puppet Master]

Page 29: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

What About All Those Masters?

• World-wide support requires many VPCs

– Multiple Availability Zones

– Multiple regions

• Each VPC requires network access

– Each VPC requires a bridge

• We solved one problem, and introduced another

– Puppet Master configuration

Page 30: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Mastering The Puppet Masters

• Amazon S3 is an excellent choice for Puppet Master configuration

– Global, highly available

– Excellent security (access control) and logging

– Sharable between accounts

• One-way synchronization from Amazon S3 to distributed Puppet Masters solves the configuration problem
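The one-way synchronization can be sketched as a planning step: compare what the bucket holds with what a Puppet Master has locally, then download anything new or changed and delete anything that no longer exists upstream. Nothing is ever pushed back. This is a minimal sketch under assumptions: both sides are represented as plain dictionaries of object key to checksum, and the function name and file paths are hypothetical. A real deployment would fill the dictionaries from an S3 listing and a local directory scan.

```python
from typing import Dict, List, Tuple


def plan_one_way_sync(source: Dict[str, str], local: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (keys to download, keys to delete) so that local mirrors source.

    Both arguments map an object key to a content checksum. The source
    (the bucket) always wins; local changes are overwritten or removed.
    """
    to_download = sorted(
        key for key, checksum in source.items() if local.get(key) != checksum
    )
    to_delete = sorted(key for key in local if key not in source)
    return to_download, to_delete


# Hypothetical bucket contents and local Puppet Master state.
bucket = {"manifests/site.pp": "a1", "modules/eip/init.pp": "b2"}
on_master = {"manifests/site.pp": "a0", "modules/old/init.pp": "zz"}

downloads, deletions = plan_one_way_sync(bucket, on_master)
print("download:", downloads)
print("delete:", deletions)
```

Because every Puppet Master runs the same plan against the same bucket, they converge on identical configuration without talking to each other.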

Page 31: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

What About Performance?

• We cannot stop here – we don’t want all traffic to always go through a bridge

• So we do not stop here, we only configure here

– Access to the Puppet Master and the network allows access to our configuration

– Our configuration includes information about our EIP pool, as well as whether we need to acquire additional EIPs

– If we don’t need an EIP, then we continue to use the bridge/Puppet Master

Page 32: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Tools For Instance Configuration

• Instance metadata

– Instance ID, user data (always available)

– AWS (requires Internet access)

• Remember, instance data uses AWS API calls

• Puppet

– Configuration rules

– Unsecured files

• Amazon S3

– Secured files (use role-based API authentication)
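The "always available" items can be read from the EC2 instance metadata endpoint without Internet access. The sketch below reads the instance ID and user data over plain HTTP from the link-local address `169.254.169.254`. It is illustrative only: the function names are hypothetical, the fetcher is injectable so the logic can be exercised off-instance, and instances that enforce IMDSv2 session tokens need an extra token request that is omitted here.

```python
from typing import Callable, Dict
from urllib.request import urlopen

# Link-local metadata endpoint, reachable only from inside an instance.
METADATA_BASE = "http://169.254.169.254/latest"


def http_fetch(url: str) -> str:
    """Fetch a URL and return its body as text."""
    with urlopen(url, timeout=2) as response:
        return response.read().decode("utf-8")


def read_bootstrap_info(fetch: Callable[[str], str] = http_fetch) -> Dict[str, str]:
    """Collect the instance ID and raw user data for later configuration."""
    return {
        "instance_id": fetch(f"{METADATA_BASE}/meta-data/instance-id").strip(),
        "user_data": fetch(f"{METADATA_BASE}/user-data"),
    }
```

On an instance, `read_bootstrap_info()` uses the real endpoint; elsewhere a stub fetcher can stand in for it.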

Page 33: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

EIP Management

• DNS (Amazon Route 53) for address configuration

– Configure a master name that contains all EIPs (for configuration)

– Configure host name for regional EIPs (latency-based)

• Each instance knows the master name

• Use EIP APIs to intersect the master list with the EIP list of the instance’s VPC

• Instances find their neighbors and share the EIPs

• Each instance periodically checks itself
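The intersection step is a set operation: addresses published under the master name, narrowed to those allocated to the instance's own VPC. In the sketch below both lists are passed in as plain values. On a real instance the first could come from resolving the master name (for example with `socket.gethostbyname_ex`) and the second from the EC2 DescribeAddresses API; both of those sources, and the function name `regional_pool`, are assumptions rather than details from the talk.

```python
from typing import Iterable, List


def regional_pool(master_addresses: Iterable[str], vpc_addresses: Iterable[str]) -> List[str]:
    """Addresses that are both published under the master name and
    allocated to this instance's VPC, in a stable (sorted) order."""
    return sorted(set(master_addresses) & set(vpc_addresses))


# Hypothetical data: the master name lists every whitelisted address,
# while this VPC owns only some of them (plus one unrelated address).
master = ["198.51.100.10", "198.51.100.11", "203.0.113.5"]
vpc = ["203.0.113.5", "198.51.100.11", "192.0.2.77"]

print(regional_pool(master, vpc))
```

Sorting gives every instance the same view of the pool, which helps when neighbors need to agree on who holds which address.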

Page 34: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

EIP Pseudo-Code – Startup

Get a Primary Public IP – repeat until successful

Allocate Network Interfaces and Private IPs (based on instance type)

Notify application of all Public IPs acquired with a Network Interface
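The "repeat until successful" step can be sketched as a retry loop around whatever call actually acquires an address. In this sketch `try_acquire` is a hypothetical callable that returns an address or `None`; the fixed delay between attempts is an assumption, since the slide does not say how retries are paced.

```python
import time
from typing import Callable, Optional


def acquire_primary_ip(
    try_acquire: Callable[[], Optional[str]],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Keep asking for a primary public IP until one is granted."""
    while True:
        address = try_acquire()
        if address is not None:
            return address
        # Nothing free yet (for example, a neighbor has not released
        # an address); wait before asking again.
        sleep(delay_seconds)
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real waiting.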

Page 35: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

EIP Pseudo-Code – Periodic

Use DNS, EIP APIs to determine current pool for my region, intersect

Validate my Primary Public IP – get one if required

Validate configured Public IPs – release if no longer configured

Check Scale Group, determine address count per instance (ROUND UP)

Determine Public IP changes, and allocate/release with application’s help

Release a Public IP if I have too many (application determines which)

Allocate all required Public IPs if I have too few

If there are nodes without an address, give one up

Instances are ordered, and know who will give up an address

Application picks the least used address
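The counting part of the periodic check can be sketched as follows. The target per instance is the pool size divided by the scale group size, rounded up. Rounding up can leave the last instance with nothing, which is where the "give one up" rule comes in. The ordering rule used here (the lowest instance ID among those holding more than one address donates) is an assumption made for illustration; the slide only says that instances are ordered. All function names are hypothetical.

```python
import math
from typing import Dict, Optional


def addresses_per_instance(pool_size: int, group_size: int) -> int:
    """Target address count per instance, rounded up."""
    if group_size < 1:
        raise ValueError("scale group must contain at least one instance")
    return math.ceil(pool_size / group_size)


def address_delta(held: int, pool_size: int, group_size: int) -> int:
    """Positive: try to allocate this many more. Negative: release this many."""
    return addresses_per_instance(pool_size, group_size) - held


def pick_donor(holdings: Dict[str, int]) -> Optional[str]:
    """If some instance holds no address, name the instance that gives one up."""
    if all(count > 0 for count in holdings.values()):
        return None
    candidates = sorted(i for i, count in holdings.items() if count > 1)
    return candidates[0] if candidates else None


# Four addresses, three instances: the target is 2, so two instances
# can fill up and leave the third with none until one of them donates.
print(addresses_per_instance(4, 3))
print(pick_donor({"i-a": 2, "i-b": 2, "i-c": 0}))
```

A positive delta is only an upper bound: an allocation succeeds only if the pool still has a free address. Which address a donor releases is left to the application, which per the slide picks the least used one.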

Page 36: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

EIP Pseudo-Code – Shutdown

Release all additional Network Interfaces

Release all Public IPs except my Primary Public IP (for logging)

The instance then terminates, freeing the Primary Public IP
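The shutdown ordering matters because the primary address must stay attached until the very end so that logs can still be shipped. The sketch below only builds the ordered list of steps as text; it performs no AWS calls, and the function name and step wording are hypothetical.

```python
from typing import List


def shutdown_steps(primary_ip: str, public_ips: List[str], extra_interfaces: List[str]) -> List[str]:
    """Ordered shutdown plan that keeps the primary public IP until last."""
    # 1. Release the additional network interfaces.
    steps = [f"release interface {eni}" for eni in extra_interfaces]
    # 2. Release every public IP except the primary (still needed for logging).
    steps += [f"release address {ip}" for ip in public_ips if ip != primary_ip]
    # 3. Terminating the instance frees the primary address.
    steps.append(f"terminate instance (frees {primary_ip})")
    return steps


for step in shutdown_steps(
    primary_ip="203.0.113.1",
    public_ips=["203.0.113.1", "203.0.113.2"],
    extra_interfaces=["eni-1"],
):
    print(step)
```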

Page 37: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

EIP Management

[Diagram: a VPC (per AZ) with IP1 .. IPn and a Bridge/Puppet Master, shown with Route 53, Amazon S3, Instance Data, and User Data]

Page 38: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Global Scale, Global Services

Page 39: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Adding Back The 90%

• Our configured instances play nice with AWS

– Bootstrapping through AMI or Cloud Init

– Auto Scaling groups set user and instance data

– Load balancers managed with Auto Scaling groups

– Latency-based Route 53 address for TCP/HTTP

– Latency-based Route 53 address for UDP and whitelists

• Internet access for remote logging

• Amazon CloudWatch for monitoring
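A latency-based Route 53 name for the UDP and whitelist addresses amounts to one record set per region, each listing that region's EIPs. The sketch below builds a change batch in the shape accepted by the Route 53 ChangeResourceRecordSets API, without sending it. The host name, regions, addresses, TTL, and function names are all placeholders chosen for illustration.

```python
from typing import Dict, List


def latency_record_change(name: str, region: str, addresses: List[str], ttl: int = 60) -> Dict:
    """One UPSERT for a latency-based A record covering a region's addresses."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": region,  # must be unique per record set
            "Region": region,         # marks the record as latency-based
            "TTL": ttl,
            "ResourceRecords": [{"Value": address} for address in addresses],
        },
    }


def change_batch(name: str, pools: Dict[str, List[str]]) -> Dict:
    """Change batch with one latency record per region, in a stable order."""
    return {
        "Changes": [
            latency_record_change(name, region, addresses)
            for region, addresses in sorted(pools.items())
        ]
    }


batch = change_batch(
    "devices.example.com",
    {
        "us-east-1": ["203.0.113.5"],
        "eu-west-1": ["198.51.100.11", "198.51.100.12"],
    },
)
```

Devices resolve a single name and receive the address set of the lowest-latency region, and every address returned is one the whitelist already contains.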

Page 40: Cloud Connected Devices on a Global Scale (CPN303) | AWS re:Invent 2013

Goal Achieved

[Diagram: components shown are Elastic Load Balancing, AutoScale (per region), a VPC (per AZ) with IP1 .. IPn and a Bridge/Puppet Master, Route 53 latency-based lookups, Amazon S3, Instance Data, User Data, CloudWatch, and papertrail]
