(CMP406) Amazon ECS at Coursera: A general-purpose microservice

Post on 16-Apr-2017

  • © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

    Frank Chen, Coursera

    Brennan Saeta, Coursera

    October 2015

    CMP406

    Amazon ECS at Coursera

    Powering a general-purpose near-line execution microservice, while defending against untrusted code

  • What to Expect from the Session

    Techniques for a unified near-line, batch, and scheduled microservice powered by Amazon ECS

    Security vulnerabilities and countermeasures when running untrusted code in Docker with Amazon ECS

    Reasons to modify the Amazon ECS agent

  • Session Outline

    Introduction to Coursera

    Near-line, batch and scheduled job execution framework

    Motivations and background

    Amazon ECS benefits and limitations

    Iguazú and its architecture

    Evaluating programming assignments

    System requirements

    Security threat model

    Attacks and defenses

  • Education at Scale

    15 million learners worldwide

    2.5 million course completions

    1,300+ courses

    125+ partners

  • A unified execution framework

  • Batch Processing Enables

    Reporting

    Instructor Reports

    Grade exports

    Learner demographics

    Course progress statistics

    Internal Reports

    Business metrics

    Payments reconciliation

  • Scheduled Processing Enables

    Marketing

    Recommendation emails

    Targeted marketing / reactivation emails

  • Near-line Processing Enables

    Pedagogical Innovations

    Peer-review matching & analysis

    Auto-graded programming assignments

  • The early days

    January 2012

  • Bad Old Days of Batch Processing @ Coursera

    Cascade

    PHP-based job runner

    Originally ran in screen sessions

    Polled APIs for new jobs

    Forced restarts on a regular basis due to unidentified memory leaks

    Fragile and unreliable

    The early days

  • Bad Old Days of Batch Processing @ Coursera

    Saturn

    Scala scheduled batch job runner

    Powered by Quartz Scheduler library

    Better than Cascade, but all jobs ran on the same JVM, causing interference

    The not-so-early days?

  • Looking for something better

  • What We Wanted

    Reliable

    Easy Development

    Easy Deployment

    High Efficiency

    Low Ops Load

    Cost Effective

  • What Else Did We Look At?

    Home-grown Tech

    Tried, but proved to be unreliable

    Difficult to handle coordination and synchronization

    Powerful, but hard to productionize

    Needs developers with experience

    Designed for GCE first

    Not a managed service, higher Ops load

  • Amazon ECS to the Rescue

    AWS re:Invent 2014: Dr. Werner Vogels introducing Amazon ECS

    Screenshot from https://www.youtube.com/watch?v=LE5uBqNp2Ds by Amazon Web Services

  • Amazon ECS to the Rescue

    Little maintenance

    Integrated with rest of AWS

    Easy to develop for

  • However

    Amazon ECS is a great building block, but we still need to build tools around it for our purposes.

  • What We Built: Iguazú

    Marissa Strniste (https://www.flickr.com/photos/mstrniste/5999464924) CC-BY-2.0

    Batch Job Scheduler for Amazon ECS

    Immediately

    Deferred (run once at X time)

    Scheduled recurring (cron-like)

    Programmatically accessible internally via our standard APIs and clients

    Named for Iguazú Falls

    World's largest waterfall by volume

    We hope Iguazú handles a similar volume of jobs
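
    The three run modes listed above (immediate, deferred, recurring) can be modelled as a small data type. This is a minimal sketch, not Iguazú's actual implementation: the names Schedule, Immediately, Deferred, Recurring, and nextRun are hypothetical, and a fixed interval stands in for a real cron expression.

```scala
import java.time.{Duration, Instant}

object ScheduleSketch {
  // Hypothetical model of the three ways a job can be scheduled.
  sealed trait Schedule
  case object Immediately extends Schedule
  final case class Deferred(runAt: Instant) extends Schedule
  // Simplification: a fixed interval stands in for a cron expression.
  final case class Recurring(firstRun: Instant, every: Duration) extends Schedule {
    require(!every.isZero && !every.isNegative, "interval must be positive")
  }

  /** Next start time at or after `now`, or None if no future run exists. */
  def nextRun(schedule: Schedule, now: Instant): Option[Instant] = schedule match {
    case Immediately => Some(now)
    // A one-off job whose time has already passed has no future run.
    case Deferred(runAt) => if (runAt.isBefore(now)) None else Some(runAt)
    case Recurring(firstRun, every) =>
      if (!firstRun.isBefore(now)) Some(firstRun)
      else {
        val elapsed = Duration.between(firstRun, now).toMillis
        val period = every.toMillis
        val periods = (elapsed + period - 1) / period // ceiling division
        Some(firstRun.plusMillis(periods * period))
      }
  }
}
```

    A scheduler loop could call nextRun for each registered job and sleep until the earliest result; a production version would also need to track which deferred jobs have already run.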

  • Iguazú: Architecture

    [Architecture diagram. Components shown: Devs; Users; Services; Iguazú Admin; Iguazú Frontend; Iguazú Scheduler; Iguazú Backend; Cassandra; SQS; ECS API; ECS Workers]

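  • Developing Iguazú Jobs

    // AbstractJob: base class supplied by the job framework (not shown on the slide).
    // StrictLogging supplies the `logger` used below.
    class Job extends AbstractJob with StrictLogging {
      override val reservedCpu = 1024    // 1 CPU core
      override val reservedMemory = 1024 // 1 GB RAM

      // Job entry point; `parameters` carries the JSON arguments for this run.
      def run(parameters: JsValue) = {
        logger.info("I am running my job!")
        expensiveComputationHere()
      }
    }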
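
    The reservedCpu and reservedMemory values appear to line up with the cpu and memory fields of an ECS container definition, where 1024 CPU units equal one core and memory is given in MiB. The sketch below illustrates one way such a mapping could look. JobSpec is a hypothetical stand-in for AbstractJob (whose real definition is not in the slides), and the image name in the usage note is made up.

```scala
object JobSpecSketch {
  // Hypothetical stand-in for the slide's AbstractJob; the real class is not shown.
  trait JobSpec {
    def name: String
    def reservedCpu: Int    // ECS CPU units: 1024 units = 1 core
    def reservedMemory: Int // MiB
  }

  // A subset of the fields ECS accepts in a container definition.
  final case class ContainerDefinition(name: String, image: String, cpu: Int, memory: Int) {
    // Naive JSON rendering: assumes name and image need no escaping.
    def toJson: String =
      s"""{"name":"$name","image":"$image","cpu":$cpu,"memory":$memory}"""
  }

  def containerFor(job: JobSpec, image: String): ContainerDefinition =
    ContainerDefinition(job.name, image, job.reservedCpu, job.reservedMemory)

  // Example job with the same reservations as the slide.
  object ExportQuizGrades extends JobSpec {
    val name = "exportQuizGrades"
    val reservedCpu = 1024
    val reservedMemory = 1024
  }
}
```

    For example, JobSpecSketch.containerFor(JobSpecSketch.ExportQuizGrades, "registry.example/iguazu-jobs:latest").toJson yields a fragment that could sit inside the containerDefinitions list of a task definition.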

  • Running Jobs from Other Services

    // Invoking a job with one function call
    // from another service via the Naptime RPC/REST framework
    val invocationId = IguazuJobInvocationClient
      .create(IguazuJobInvocationRequest(
        jobName = "exportQuizGrades",
        parameters = quizParams))
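
    The slides do not show what happens behind create. One plausible reading of the architecture, given that SQS appears in the diagram, is that the request is assigned an ID and placed on a queue for a scheduler to pick up. The sketch below illustrates that idea only: InvocationQueue, JobInvocationRequest, and QueuedInvocation are hypothetical names, and an in-memory queue stands in for a durable one.

```scala
import java.util.UUID
import scala.collection.mutable

object InvocationSketch {
  final case class JobInvocationRequest(jobName: String, parameters: Map[String, String])
  final case class QueuedInvocation(id: String, request: JobInvocationRequest)

  // In-memory stand-in for a durable queue such as SQS (assumption for illustration).
  final class InvocationQueue {
    private val pending = mutable.Queue.empty[QueuedInvocation]

    /** Accepts a request and returns its invocation ID right away. */
    def create(request: JobInvocationRequest): String = {
      val id = UUID.randomUUID().toString
      pending.enqueue(QueuedInvocation(id, request))
      id
    }

    /** Called by the scheduler side: next invocation to run, if any. */
    def poll(): Option[QueuedInvocation] =
      if (pending.isEmpty) None else Some(pending.dequeue())
  }
}
```

    Returning the ID before the job runs keeps the caller non-blocking, which fits the near-line use case: the calling service can store the ID and check on the result later.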

  • Iguazú: Developer / Ops User Interface

  • Deploying Jobs

    Easy Deployment

    1. Developers merge into master. Done!

    Jenkins Build Steps:

    1. Builds zip package from master

    2. Prepares Docker image with zip file

    3. Pushes image into Docker registry

    4. Registers updated jobs with the Amazon ECS API
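
    The four build steps above form a chain in which each step depends on the previous one succeeding. The sketch below models that chain with Either so a failure stops the pipeline early. Every function here is a hypothetical stub that returns a canned value; none of them calls Jenkins, Docker, or ECS, and the artifact and image names are invented.

```scala
object DeploySketch {
  final case class Package(zip: String)
  final case class Image(tag: String)

  // Left carries an error message, Right the step's result.
  type Step[A] = Either[String, A]

  // Step 1: build a zip package from the given commit on master.
  def buildZip(commit: String): Step[Package] =
    if (commit.nonEmpty) Right(Package(s"jobs-$commit.zip")) else Left("no commit to build")

  // Step 2: prepare a Docker image containing the zip (stubbed).
  def buildImage(pkg: Package, commit: String): Step[Image] =
    if (pkg.zip.endsWith(".zip")) Right(Image(s"iguazu-jobs:$commit")) else Left("not a zip package")

  // Step 3: push the image to the registry (stubbed).
  def pushImage(image: Image): Step[Image] = Right(image)

  // Step 4: register each job against the new image (stubbed).
  def registerJobs(image: Image, jobs: List[String]): Step[List[String]] =
    Right(jobs.map(job => s"$job -> ${image.tag}"))

  // The for-comprehension short-circuits on the first Left.
  def deploy(commit: String, jobs: List[String]): Step[List[String]] =
    for {
      pkg        <- buildZip(commit)
      image      <- buildImage(pkg, commit)
      pushed     <- pushImage(image)
      registered <- registerJobs(pushed, jobs)
    } yield registered
}
```

    Tagging the image with the commit ID, as assumed here, makes each registered job traceable to the exact build it came from.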

  • Logs

    Logs are in /var/lib/docker/containers/*

    Uploaded to a log analysis service (Sumo Logic)

    Wrapper prints out job name and job ID at the start for easy searching

    Good enough for now
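
    The wrapper idea is simple enough to sketch: print one searchable header line, then run the job. The function name runWithHeader and the key=value header format below are assumptions for illustration; the slides do not show the real wrapper or its output format.

```scala
object LogWrapperSketch {
  /** Prints a searchable header line, then runs the job body and returns its result. */
  def runWithHeader[A](jobName: String, jobId: String, out: String => Unit = s => println(s))(body: => A): A = {
    // Hypothetical header format: key=value pairs are easy to search in a log service.
    out(s"job_name=$jobName job_id=$jobId starting")
    body
  }
}
```

    A call such as LogWrapperSketch.runWithHeader("exportQuizGrades", "inv-1") { job.run(params) } would put the identifying line first in the container's log, so a search for the job ID lands at the start of that run's output.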

  • Metrics

    Using a third-party metrics collector (Datadog)

    Metrics for both jobs and container instances

    So long as the worker machines can talk to the Internet, things will work out pretty well

  • Since April 2015

    65 jobs in production

    >1,000 runs per day

    44 different scheduled jobs

  • Evaluating Programming Assignments

  • Programming Assignments at Coursera

  • The Security Challenge

    Compiling and running untrusted, arbitrary code in Amazon EC2

    Would you like to compile and run C code from random people on the Internet on your servers?

  • 1st Generation System

    Class graders in separate AWS account

    Custom grader systems on cloud providers

    Course grader under the instructor's desk

    Learners

    Coursera Servers

    Queue Service

  • 1st Generation System: Weaknesses

    No Auto Scaling

    No standard security

    Graders crashed

  • Design Goals

    Cost Savings

    No Maintenance

    Near Real-time

    Secure Infrastructure

  • Threat Model

    Prevent submitted code from:

    impacting the evaluation of other submissions

    disrupting the grading environment (e.g., DoS)

    affecting the rest of the Coursera learning platform

    Additional goals:

    Minimize exfiltration of inform