Tapjoy OpenStack Summit Paris Breakout Session
TRANSCRIPT
Tapjoy & OpenStack: Delivering Billions of Requests Daily
Wes Jossey, Head of Operations @ Tapjoy
Tapjoy
● Global App-Tech Startup
● We Power For Mobile Developers:
○ Monetization
○ Analytics
○ User Acquisition
○ User Retention
● 450M+ Monthly Users Across 270k+ Apps
● Worldwide Presence
Technical Details
● Early AWS Adopter
● Grew Predominantly on AWS
● Over 1,100 AWS VMs Daily (10/2014)
● Active Regions in Asia, Europe, N.A.
● Over One Trillion Requests Handled Annually
Tech Philosophy
● Compute (EC2 & Nova) Driven Company
○ Operate Your Own Infrastructure
■ But Not Necessarily Built From Scratch
○ Zero Heart-Attack Nodes
■ All Nodes Are Ephemeral
■ Data Is Always Distributed
■ Failure Is Always Tolerated
■ Misbehaving Instances Are Terminated Quickly
Services We Use
● SQS
○ Simple, Inexpensive, Durable
○ Currently Building New Internal System Influenced by SQS, but with Different Guarantees
○ No Lock-In (See https://github.com/Tapjoy/chore)
● RDS
○ No Lock-In. Simple. Easy.
● Cloudwatch (but also statsd)
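The statsd side of that last bullet is just a fire-and-forget UDP datagram in a tiny text protocol (`name:value|type`, with an optional `|@rate` for sampling). A minimal sketch — the metric names and localhost address are illustrative, not Tapjoy's:

```python
import socket

def statsd_packet(name, value, metric_type, sample_rate=1.0):
    """Format a metric in the plain-text statsd wire protocol:
    <name>:<value>|<type>[|@<sample_rate>]
    where type is 'c' (counter), 'ms' (timer), or 'g' (gauge)."""
    packet = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        packet += f"|@{sample_rate}"
    return packet.encode("ascii")

def send_metric(sock, addr, name, value, metric_type):
    # UDP is deliberate: a dropped packet costs one data point,
    # never a blocked request path.
    sock.sendto(statsd_packet(name, value, metric_type), addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_metric(sock, ("127.0.0.1", 8125), "ads.requests", 1, "c")
print(statsd_packet("ads.latency", 42, "ms"))  # b'ads.latency:42|ms'
```

That lossiness is exactly why it pairs well with CloudWatch: statsd for cheap, high-volume application metrics, CloudWatch for the infrastructure-level numbers AWS already collects.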
Services We Use Cont.
● ELB
○ SSL Termination Only. Routing Handled Elsewhere.
● Auto-Scaling
○ Traffic Can Fluctuate 30% Peak to Valley
● S3
○ Where We Store ALL the Things
○ Still Price-Competitive for What It Provides. No Plans to Leave as of Today.
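A 30% peak-to-valley swing is large enough that static provisioning either wastes a third of the fleet or falls over at peak. One way to frame the sizing math an auto-scaling policy implies (the request rates and per-instance capacity below are made-up numbers for illustration, not Tapjoy's):

```python
import math

def desired_capacity(current_rps, rps_per_instance, headroom=0.3,
                     min_instances=2):
    """Size the fleet so a 30% jump from the current request rate
    still fits: capacity = ceil(rps * (1 + headroom) / per-instance),
    never dropping below a safe floor."""
    needed = math.ceil(current_rps * (1 + headroom) / rps_per_instance)
    return max(needed, min_instances)

# At 100k req/s with ~2k req/s per instance, keep 65 instances warm
# so the next 30% swing is already absorbed.
print(desired_capacity(100_000, 2_000))  # 65
```

In practice AWS Auto Scaling drives this from a tracked metric (CPU, request count) rather than an explicit formula, but the headroom trade-off is the same.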
Use Compute Everywhere
● Every Dev Has Access to Either AWS or Tapjoy-1 (Tapjoy’s OpenStack Deployment)
● Simulate Changes Against Useful Data
● Test Algorithms on Large Hadoop Clusters
● Practice for Failure with Access to Real Services (Not Mock Endpoints)
Going Hybrid
● We Spend in the Millions on AWS
● Picked Data-Science Infrastructure Because of Portability and the Ability to Leverage More Nodes
● Lower Risk than Tier-1 Production Services
● Wanted a Partner to Maintain OpenStack like Amazon ‘Maintains’ AWS
● We Want to Operate Apps
OpenStack Timeline
Vendors (It Matters)
● Metacloud
○ Verified Our Design
○ Deployed OpenStack
○ Provisioned Network
○ Allowed Us to Focus on Business Applications
● Equinix
○ Cooling & Power Design
○ Remote Hands
○ Went Above and Beyond on Numerous Occasions
Vendors: Full List
● Metacloud
● Equinix
● Quanta
● Cumulus
● Level3
● Newegg
Challenges
● Hardware Delays Killed Our Timelines
○ Blew Through Our Contingency Windows
○ Hurt Our Budgets
○ Delayed Subsequent Purchases
● Setting Up IP Transit Can Be Slow
● No Physical Presence in DC
○ Also a Pro
● No Internal Previous Success Story… So Lots of Skepticism
The Not So Glamorous Job
● Negotiations Can Be Exhausting
● If You’re an Engineer, the Turnaround Time Can Be Frustrating
● You Probably Need a Gantt Chart
● There’s Nothing Agile About Writing a Big Check
Tapjoy-1: Data Nodes
348 ‘Data’ All-Purpose Nodes
● Quanta S910-X31E: 12-Node Configuration
● Per Node
○ Intel 1265Lv3 @ 2.5GHz
○ 4x1TB 7200RPM
○ 32GB RAM
○ Dual 1Gig NIC
● ‘Recyclable’ for Other Tasks if We Evolve
Tapjoy-1: Management Nodes
12 ‘Management’ Nodes
● Quanta S180: 4-Node Configuration
● Per Node
○ Intel 2650v2 x2 @ 2.60GHz
○ 128GB RAM
○ 6x480GB SSD
○ Dual 10Gig NIC
Glamor Shot
Same Price, Different Outcome
Diagrams!
High-Level Request Flow Architecture
Detailed Flow
Data Pipeline
Tapjoy-1
Plan For Failure
● Hardware
○ I’m Not Saying You Shouldn’t Use CEPH…
■ But You’ll Notice It’s Absent Here
● Service Boundaries
○ Have Hardware & Software Contingencies
■ Backup Links
■ Temporary Cache(s)
○ Actually Test Failure in Production
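"Actually test failure in production" usually means deliberately killing live instances and watching the contingencies kick in, with one guard rail: never let the drill itself take the service below its safe capacity. A minimal sketch of that guard rail (the instance IDs and minimum are illustrative, not Tapjoy's tooling):

```python
import random

def pick_victim(instances, min_healthy, rng=random):
    """Choose one instance to terminate as a production failure
    drill, but refuse to act if doing so would drop the group
    below its safe minimum."""
    if len(instances) - 1 < min_healthy:
        return None  # drill would endanger the service; skip it
    return rng.choice(instances)

web = ["i-01", "i-02", "i-03", "i-04"]
victim = pick_victim(web, min_healthy=3)
print(victim)  # one of the four, chosen at random
```

The point of running this against production rather than mock endpoints is the same one made on the "Use Compute Everywhere" slide: the failure modes you rehearse are the real ones.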
Info
● Twitter! @dustywes
● Email: [email protected]