Tapjoy OpenStack Summit Paris Breakout Session
TRANSCRIPT
Tapjoy & OpenStack: Delivering Billions of Requests Daily
Wes Jossey, Head of Operations @ Tapjoy
Tapjoy
● Global App-Tech Startup
● We Power For Mobile Developers:
○ Monetization
○ Analytics
○ User Acquisition
○ User Retention
● 450M+ Monthly Users Across 270k+ Apps
● Worldwide Presence
Technical Details
● Early AWS Adopter
● Grew Predominantly on AWS
● Over 1,100 AWS VMs Daily (10/2014)
● Active Regions in Asia, Europe, N.A.
● Over One Trillion Requests Handled Annually
Tech Philosophy
● Compute (EC2 & Nova) Driven Company
○ Operate Your Own Infrastructure
■ But Not Necessarily Built From Scratch
○ Zero Heart-Attack Nodes
■ All Nodes Are Ephemeral
■ Data Is Always Distributed
■ Failure Is Always Tolerated
■ Misbehaving Instances Are Terminated Quickly
Services We Use
● SQS
○ Simple, Inexpensive, Durable
○ Currently Building New Internal System Influenced by SQS, but with Different Guarantees
○ No Lock-In (See https://github.com/Tapjoy/chore)
● RDS
○ No Lock-In. Simple. Easy.
● Cloudwatch (but also statsd)
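The statsd side of that last bullet is just a fire-and-forget UDP datagram in a tiny text protocol (`name:value|type`, with an optional `|@rate` for sampling). A minimal sketch — the metric names and localhost address are illustrative, not Tapjoy's:

```python
import socket

def statsd_packet(name, value, metric_type, sample_rate=1.0):
    """Format a metric in the plain-text statsd wire protocol:
    <name>:<value>|<type>[|@<sample_rate>]
    where type is 'c' (counter), 'ms' (timer), or 'g' (gauge)."""
    packet = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        packet += f"|@{sample_rate}"
    return packet.encode("ascii")

def send_metric(sock, addr, name, value, metric_type):
    # UDP is deliberate: a dropped packet costs one data point,
    # never a blocked request path.
    sock.sendto(statsd_packet(name, value, metric_type), addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_metric(sock, ("127.0.0.1", 8125), "ads.requests", 1, "c")
print(statsd_packet("ads.latency", 42, "ms"))  # b'ads.latency:42|ms'
```

That lossiness is exactly why it pairs well with CloudWatch: statsd for cheap, high-volume application metrics, CloudWatch for the infrastructure-level numbers AWS already collects.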
Services We Use Cont.
● ELB
○ SSL Termination Only. Routing Handled Elsewhere.
● Auto-Scaling
○ Traffic Can Fluctuate 30% Peak to Valley
● S3
○ Where We Store ALL the Things
○ Still Price-Competitive for What It Provides. No Plans to Leave as of Today.
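A 30% peak-to-valley swing is large enough that static provisioning either wastes a third of the fleet or falls over at peak. One way to frame the sizing math an auto-scaling policy implies (the request rates and per-instance capacity below are made-up numbers for illustration, not Tapjoy's):

```python
import math

def desired_capacity(current_rps, rps_per_instance, headroom=0.3,
                     min_instances=2):
    """Size the fleet so a 30% jump from the current request rate
    still fits: capacity = ceil(rps * (1 + headroom) / per-instance),
    never dropping below a safe floor."""
    needed = math.ceil(current_rps * (1 + headroom) / rps_per_instance)
    return max(needed, min_instances)

# At 100k req/s with ~2k req/s per instance, keep 65 instances warm
# so the next 30% swing is already absorbed.
print(desired_capacity(100_000, 2_000))  # 65
```

In practice AWS Auto Scaling drives this from a tracked metric (CPU, request count) rather than an explicit formula, but the headroom trade-off is the same.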
Use Compute Everywhere
● Every Dev Has Access to Either AWS or Tapjoy-1 (Tapjoy’s OpenStack Deployment)
● Simulate Changes Against Useful Data
● Test Algorithms on Large Hadoop Clusters
● Practice for Failure with Access to Real Services (Not Mock Endpoints)
Going Hybrid
● We Spend in the Millions on AWS
● Picked Data-Science Infrastructure Because of Portability and the Ability to Leverage More Nodes
● Lower Risk than Tier-1 Production Services
● Wanted a Partner to Maintain OpenStack like Amazon ‘Maintains’ AWS
● We Want to Operate Apps
OpenStack Timeline
Vendors (It Matters)
● Metacloud
○ Verified Our Design
○ Deployed OpenStack
○ Provisioned Network
○ Allowed Us to Focus on Business Applications
● Equinix
○ Cooling & Power Design
○ Remote Hands
○ Went Above and Beyond on Numerous Occasions
Vendors: Full List
● Metacloud
● Equinix
● Quanta
● Cumulus
● Level3
● Newegg
Challenges
● Hardware Delays Killed Our Timelines
○ Blew Through Our Contingency Windows
○ Hurt Our Budgets
○ Delayed Subsequent Purchases
● Setting Up IP Transit Can Be Slow
● No Physical Presence in DC
○ Also a Pro
● No Internal Previous Success Story… So Lots of Skepticism
The Not So Glamorous Job
● Negotiations Can Be Exhausting
● If You’re an Engineer, the Turnaround Time Can Be Frustrating
● You Probably Need a Gantt Chart
● There’s Nothing Agile About Writing a Big Check
Tapjoy-1: Data Nodes
348 ‘Data’ All-Purpose Nodes
● Quanta S910-X31E: 12-Node Configuration
● Per Node
○ Intel 1265Lv3 @ 2.5GHz
○ 4x1TB 7200RPM
○ 32GB RAM
○ Dual 1Gig NIC
● ‘Recyclable’ for Other Tasks if We Evolve
Tapjoy-1: Management Nodes
12 ‘Management’ Nodes
● Quanta S180: 4-Node Configuration
● Per Node
○ Intel 2650v2 x2 @ 2.60GHz
○ 128GB RAM
○ 6x480GB SSD
○ Dual 10Gig NIC
Glamor Shot
Same Price, Different Outcome
Diagrams!
High-Level Request Flow Architecture
Detailed Flow
Data Pipeline
Tapjoy-1
Plan For Failure
● Hardware
○ I’m Not Saying You Shouldn’t Use CEPH…
■ But You’ll Notice It’s Absent Here
● Service Boundaries
○ Have Hardware & Software Contingencies
■ Backup Links
■ Temporary Cache(s)
○ Actually Test Failure in Production
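"Actually test failure in production" usually means deliberately killing live instances and watching the contingencies kick in, with one guard rail: never let the drill itself take the service below its safe capacity. A minimal sketch of that guard rail (the instance IDs and minimum are illustrative, not Tapjoy's tooling):

```python
import random

def pick_victim(instances, min_healthy, rng=random):
    """Choose one instance to terminate as a production failure
    drill, but refuse to act if doing so would drop the group
    below its safe minimum."""
    if len(instances) - 1 < min_healthy:
        return None  # drill would endanger the service; skip it
    return rng.choice(instances)

web = ["i-01", "i-02", "i-03", "i-04"]
victim = pick_victim(web, min_healthy=3)
print(victim)  # one of the four, chosen at random
```

The point of running this against production rather than mock endpoints is the same one made on the "Use Compute Everywhere" slide: the failure modes you rehearse are the real ones.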
Info
● Twitter! @dustywes
● Email: [email protected]