Netflix Edge Engineering Open House Presentations - June 9, 2016
Daniel Jacobson (@daniel_jacobson)
Satish Gudiboina (@sgudiboina)
Suudhan Rangarajan (@suudhan)
Vasanth Asokan (@vasanthasokan)
Edge Engineering Open House - June 9, 2016
190 Countries (not China and a few others)
81+ Million Subscribers
1000+ Different Device Types
Over 42 Billion Hours Streamed in 2015
Streaming Hours Per Year in Billions
Over 42 Billion Successes!
Of Course, There Are Failures Too…
Two Primary Drivers Behind Our Successes
People Desire to Watch Netflix
Systems Scale to Meet Desires
Sign-Up
Discovery / Browse
Playback
Edge Engineering provides data and functionality to support these three experiences
Designing APIs
Enabling Playback
Scaling
Routing
Insights
DX
Resiliency
Tools
(Architecture diagram, built up over several slides)
DEVICES
ROUTING
API (multiple instances)
SERVICES: Recs, Member, Ratings, Playback Lifecycle, Authn/z, A/B, Search, Identity, Metadata, Playback Data, DRM
Owned by Edge Engineering (highlighted in the final build): API, Playback Lifecycle, Authn/z, Playback Data, DRM
INSIGHTS
TOOLS
DX
42 Billion Hours (2015)
200 Billion Hours (Future)
Netflix’s AWS Cloud Footprint by % (chart; legend includes "The rest of")
Talking About the Future of Edge Engineering
Satish Gudiboina: API and Upcoming Re-Architecture
Suudhan Rangarajan: Playback Experience
Vasanth Asokan: Developer Tools, Velocity and Experience
The Netflix API Platform for Server-Side Scripting
Current and the Future
Satish Gudiboina
The Netflix API
Streaming Hours Per Year in Billions
Scale is multi-faceted
Growing number of users ( → RPS)
Growing number of device types
Growing number of A/B tests
Growing number of languages
Growing number of countries
What we need to build for
Velocity
Resiliency
Other requirements:
Performance
Great developer experience
Operational insights
Tooling
Today’s architecture (diagram)
Clients: Client A, Client B, Client C, ... Client Y, Client Z
Network boundary
API Server JVM: script, script, script, ... script, on top of the SERVICE LAYER
Languages: JS (mostly), java
Netflix Microservices
Resiliency with Hystrix
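Hystrix is a Java library, and the slide does not show how it is wired into the API server. As a rough illustration of the circuit-breaker idea it is known for, here is a minimal Python sketch. The class name, parameters and thresholds are hypothetical and are not Hystrix's API.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, command, fallback):
        # While open, skip the dependency entirely until the cooldown has passed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Half-open: let one trial call through; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = command()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A caller wraps each dependency call, for example `breaker.call(fetch_ratings, lambda: [])`, where `fetch_ratings` is a placeholder. While the circuit is open, requests get the fallback immediately instead of waiting on a failing service.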
Developer Velocity: Decoupled deployments of versions
(Chart: Day 1 to Day 5; the API and device 1 to device 4 each move through their own version sequences, with labels n to n+3, i to i+4, j, j+1, k, k+1 and l)
Changing risk profile
Growing number of users ( → RPS)
Growing number of devices
Growing number of A/B tests
Growing number of languages
Growing number of countries
Growing number and complexity of scripts (scripts → apps)
Today’s system (T-3yrs) (diagram)
Clients: Client A, Client B, Client C, ... Client Y, Client Z
Network boundary
API Server JVM: script, script, ... script, script, on top of the SERVICE LAYER
Languages: JS (mostly), java
Netflix Microservices
few, small scripts
fewer uploads
Today’s system (T) (diagram)
Clients: Client A, Client B, Client C, ... Client Y, Client Z
Network boundary
API Server JVM: many scripts, on top of the SERVICE LAYER
Languages: JS (mostly), java
Netflix Microservices
hundreds of more complex scripts
10-50 uploads per day
What we need
Velocity
Resiliency?
Lack of process isolation is a growing risk.
Moving toward our ideal API: What will change
Scripts will run in containers
Scripts will call API remotely
The (near) future (diagram)
Clients: Client A, Client B, Client C, ... Client Y, Client Z
node script, node script, ... node script, node script (node.js, process isolation)
Network boundary
API Server JVM: SERVICE LAYER
Languages: JS (mostly), java
Netflix Microservices
node for device teams
Why containers?
Process isolation
Fast startup
Consistent developer experience across environments
Isolated failures: scripts don’t affect each other
API
device 1, device 2, device 3, device 4 (one marked: Temporarily unavailable!)
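To make the isolation point concrete, here is a small Python sketch that runs each script in its own OS process, so a crash in one leaves the others untouched. It is only an analogy for container-level isolation. The script contents and the `run_isolated` helper are made up for illustration.

```python
import subprocess
import sys

# Hypothetical per-device scripts; "device 3" crashes on purpose.
SCRIPTS = {
    "device 1": "print('ok from device 1')",
    "device 2": "print('ok from device 2')",
    "device 3": "raise SystemExit(1)",
    "device 4": "print('ok from device 4')",
}


def run_isolated(scripts):
    """Run each script in a separate process and report a per-script status."""
    status = {}
    for name, source in scripts.items():
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=30,
        )
        # A non-zero exit code affects only this script's entry.
        status[name] = "ok" if proc.returncode == 0 else "temporarily unavailable"
    return status
```

Calling `run_isolated(SCRIPTS)` marks only the failing script as unavailable. In a shared JVM, by contrast, one misbehaving script can affect every other script in the same process.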
Independent autoscaling
API
device 1 device 2 device 3 device 4
Fast startup
New API server: minutes
New container: seconds
Fast rollout, fast rollback, fast MTTR
The Netflix API
Edge Developer Experience
Translating developer productivity to Netflix customer delight
Developer Experience?
DEVELOP (rapidly)
DEPLOY (reliably)
OPERATE (effectively)
Experimentation driven innovation
~700 apps, dozens of pushes a day
15+ client teams, ~200 developers
~50 direct services, 100s of AB tests, dozens of new features
The Innovation Funnel
API
Devices
Netflix Services
Client Adaptor Applications
Why care about DevEx?
Developer Productivity
Product Innovation
Tools
Automation
Insights
Customer Satisfaction
App Development and Management
DEVELOP (rapidly)
DEPLOY (reliably)
OPERATE (effectively)
Developer Ergonomics (diagram)
API SERVER JVM: app, ... app, app
SERVICE LAYER
CLIENT LIBRARIES: Large / Complex
Languages: js, java
WAN Boundary
Netflix Microservices
Developer Ergonomics ... (diagram)
DOCKER CONTAINERS: app, ... app, app
REMOTE SERVICE LAYER
API SERVER JVM: app, SERVICE LAYER, CLIENT LIBRARIES
Languages: js, java, js
WAN Boundary
Netflix Microservices
Setup
Canary
Support
Prod Push
Pre-Prod
Metrics
Tracing
Lifecycle
Alerts
Build
Bootstrap
API Discovery
REPL
Unit Test
SDK Debug Logging
Profiling
Audits
Security
Custom Routing
Dependency Management
Client Application Development Critical Component!
Dx Developer Experience
$ newt init
Just bring your JavaScript business logic
NeWT: Netflix Workflow Toolkit
Continuous Integration
Deployment Pipelines
Autoscaling
Dashboards
Alerting
Logging
Lifecycle Management
Audits and Analytics
Container tooling
Canaries
Dependency Management
Titus
ATLAS
NeWT: Netflix Workflow Toolkit
Edge PaaS UI
$ newt auto-deploy -d
Node.js project
Docker Machine
node-inspector
Debugger
File watcher / live reload trigger
File watcher agent
NeWT: Local Container Development
Local Container
docker build / run
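The slides do not show how the file watcher agent works. One common approach is polling modification times, sketched below in Python. The function names are hypothetical and are not part of NeWT, and the `on_change` callback stands in for whatever triggers the container rebuild or live reload.

```python
import os
import time


def snapshot(root):
    """Map every file under root to its last-modified time in nanoseconds."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            state[path] = os.stat(path).st_mtime_ns
    return state


def changed_files(before, after):
    """Return paths that were added, removed, or modified between two snapshots."""
    paths = before.keys() | after.keys()
    return sorted(p for p in paths if before.get(p) != after.get(p))


def watch(root, on_change, polls, interval=1.0, sleep=time.sleep):
    """Poll root a fixed number of times and call on_change with any changed paths."""
    previous = snapshot(root)
    for _ in range(polls):
        sleep(interval)
        current = snapshot(root)
        changed = changed_files(previous, current)
        if changed:
            on_change(changed)
        previous = current
```

Polling is simple and portable. A production watcher would more likely use OS-level file notifications to avoid rescanning the tree.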
$ newt auto-deploy -d
Docker Machine
NeWT: Local Container Development
Local Container
Cloud Microservices
Cloud Proxy
Terminate security
Discovery Agent
Service Discovery
Local System
Cloud
App Operations and Insights
DEVELOP (rapidly)
DEPLOY (reliably)
OPERATE (effectively)
Mantis - Stream Processing Platform
• Low Latency, High throughput, Highly Efficient
• Handle bursty or large scale loads
• Extensible programming model
600 jobs in production, 8M messages/sec at peak, 100Gbps network throughput
Monitoring facets of aggregate application health, globally
Aggregate Insights
Analyze in real-time, requests matching a precise set of conditions
Surgical Insights
Surgical Insights - Real-time Stream Queries
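The idea of a real-time stream query, keeping only the requests that match a precise set of conditions, can be sketched as a lazy filter over an event stream. This is a generic Python illustration and not the Mantis query API. The event fields (`device`, `country`, `status`) are assumptions.

```python
def stream_query(events, **conditions):
    """Lazily yield events whose fields equal every given condition."""
    for event in events:
        if all(event.get(field) == value for field, value in conditions.items()):
            yield event


# Hypothetical request events, as they might arrive from API servers.
EVENTS = [
    {"device": "tv", "country": "US", "status": 200},
    {"device": "tv", "country": "BR", "status": 500},
    {"device": "phone", "country": "BR", "status": 500},
    {"device": "tv", "country": "BR", "status": 200},
]

# Only failing requests from TVs in Brazil.
matches = list(stream_query(EVENTS, device="tv", country="BR", status=500))
```

Because the function is a generator, it can consume an unbounded stream and emits matches as they arrive.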
Monitoring server side calling pattern and internal application profile
Session Tracing
Session Tracing - Request Profile
Session Tracing - Per Node Profile
Automatic monitoring of high cardinality data across multiple dimensions
Real-time Anomaly Detection
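The slides do not say which detection method is used. As one simple possibility, the sketch below keeps a rolling window per dimension combination and flags values that sit far from the recent mean (a z-score test). The class, thresholds and keys are hypothetical.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev


class AnomalyDetector:
    """Rolling z-score detector with an independent window per key."""

    def __init__(self, window=20, threshold=3.0, min_points=5):
        self.threshold = threshold
        self.min_points = min_points
        self.history = defaultdict(lambda: deque(maxlen=window))

    def observe(self, key, value):
        """Return True if value is anomalous for this key, then record it."""
        past = self.history[key]
        anomalous = False
        if len(past) >= self.min_points:
            mu, sigma = mean(past), pstdev(past)
            if sigma > 0:
                anomalous = abs(value - mu) / sigma > self.threshold
            else:
                # A perfectly flat history: any deviation counts.
                anomalous = value != mu
        past.append(value)
        return anomalous
```

Keys can be tuples such as `("tv", "US")`, so each combination of dimensions gets its own baseline. With high-cardinality data the number of keys grows quickly, which is why this has to be automated.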
In conclusion...
• Scaling developer productivity with business growth
• Provide fully managed PaaS experience to client developers
• Shift Left Insights to power smart development
• Curated, blended visualizations that simplify devops
Tech Soup
Scaling Playback Services
Suudhan Rangarajan, Senior Software Engineer, Playback Features
@suudhan
Playback Lifecycle
DECIDE
COLLECT & LEARN
AUTHORIZE
Decide
MANIFEST (Tracks and URLs)
Authorize
LICENSE
❏ Content usage / resolution policies
❏ Plan / device limits enforcement
❏ DRM / License generation
Collect & Learn
Bookmarks & Hours Watched
Streaming Errors and Metrics
Quality Of Experience metrics
Let’s look at Play Decisions
DECIDE
MANIFEST
AUTHORIZE
COLLECT & LEARN
LICENSE
SESSION
Huge number of Streams
Video: Resolutions - 720p, 1080p, 4K etc.; Codecs - H.264, HEVC etc.; Bitrates - 230, 780, 3000 etc.
Audio: Channels - Stereo, Surround Sound; Languages - English, French etc.
Text: Types - Subtitles, Closed Captions, Forced Narratives; Languages - English, French etc.
Streams to Tracks
- H.264 Main Profile, English 5.1 Audio, No Subtitle
- HEVC Dash Profile, French 2.0 Audio, English CC
- HDR Dash Profile, Spanish AAC Audio, English Forced Narrative
Decide & Filter
MANIFEST SERVICE
Many Many Dimensions
PLAYBACK MANIFEST
USER PREFERENCES
TITLE METADATA
COUNTRY
DEVICE
NETWORK
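As a rough picture of the "Decide & Filter" step, the Python sketch below narrows a title's video streams to those a device can play. The data model is invented for illustration and covers only two of the dimensions listed above (device and network). The real manifest service weighs many more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    codec: str    # e.g. "H.264" or "HEVC"
    height: int   # vertical resolution: 720, 1080, 2160 (4K)
    bitrate: int  # same figures as on the slide: 230, 780, 3000


@dataclass(frozen=True)
class Device:
    codecs: frozenset
    max_height: int


def filter_streams(streams, device, max_bitrate):
    """Keep streams the device can decode and the network can sustain."""
    return [
        s for s in streams
        if s.codec in device.codecs
        and s.height <= device.max_height
        and s.bitrate <= max_bitrate
    ]


# Hypothetical catalogue entry and device profile.
STREAMS = [
    Stream("H.264", 720, 230),
    Stream("H.264", 1080, 780),
    Stream("HEVC", 1080, 780),
    Stream("HEVC", 2160, 3000),
]
PHONE = Device(codecs=frozenset({"H.264"}), max_height=1080)
playable = filter_streams(STREAMS, PHONE, max_bitrate=1000)
```

Each extra dimension multiplies the number of possible outcomes. That is what makes the choice of cache key later in the talk a hard problem.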
Big Opportunity
Rich playback experiences
Tremendous increase in scale
Customer growth
Challenge: Efficient Scaling
Targeting sub-linear growth
# of Requests
Cloud Costs
Predictable Viewing Patterns
Key Insight
(Chart: PLAY REQUESTS vs. CONTENT RANK)
Also... Manifest Request for one title
(Chart: PLAY REQUESTS vs. TIME)
Current: Completely Real-time
Real-time manifest generation
With Caching
Real-time manifest generation
80% Cached, 20% Real-time
Challenges
How do we determine the optimal combination of attributes to cache on?
Cache Considerations:
● When to populate?
● When to bust?
● How to scale for cache-miss or failures?
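The questions above (which attributes to key on, when to populate, when to bust) can be made concrete with a small sketch. This is a hypothetical Python cache, not the Netflix implementation. The key fields, TTL and `generate` callback are all assumptions.

```python
import time


class ManifestCache:
    """Cache manifests on a chosen subset of request attributes."""

    def __init__(self, generate, key_fields, ttl=300.0, clock=time.monotonic):
        self.generate = generate            # real-time manifest generation
        self.key_fields = tuple(key_fields)
        self.ttl = ttl
        self.clock = clock
        self.entries = {}                   # key -> (stored_at, manifest)
        self.hits = 0
        self.misses = 0

    def get(self, request):
        key = tuple(request[f] for f in self.key_fields)
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now - entry[0] < self.ttl:
            self.hits += 1
            return entry[1]
        # Miss or expired: fall back to real-time generation and populate.
        self.misses += 1
        manifest = self.generate(request)
        self.entries[key] = (now, manifest)
        return manifest

    def bust(self, field, value):
        """Drop every entry whose key has the given value for one field."""
        idx = self.key_fields.index(field)
        for key in [k for k in self.entries if k[idx] == value]:
            del self.entries[key]
```

Requests that differ only in attributes outside `key_fields` share one cached manifest. A wider key is safer but lowers the hit rate, and a narrower key raises the hit rate but risks serving a manifest that does not fit the request.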
Potential Win
10x increase in requests with only 4x increase in costs
Optimize computation
Can we re-imagine our service processing to dramatically increase throughput?
Anatomy of a Playback Manifest Request
Metadata Access: 27%
Tracks Generation: 36%
Streams Filtering: 16%
Serialization: 21%
Potential Win
10x increase in requests with just 2x increase in service costs
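To see what the 10x / 2x target implies, the sketch below reads the request breakdown as Metadata Access 27%, Tracks Generation 36%, Streams Filtering 16% and Serialization 21%. It also assumes, as a simplification not stated on the slides, that service cost scales linearly with per-request processing time. The function names are illustrative.

```python
# Share of per-request processing time by stage (reading of the slide above).
SHARES = {
    "Metadata Access": 0.27,
    "Tracks Generation": 0.36,
    "Streams Filtering": 0.16,
    "Serialization": 0.21,
}


def relative_cost(speedups):
    """Per-request cost relative to today, given a speedup factor per stage."""
    return sum(share / speedups.get(stage, 1.0) for stage, share in SHARES.items())


def max_request_multiple(speedups, cost_budget=2.0):
    """How many times today's request volume fits within the cost budget."""
    return cost_budget / relative_cost(speedups)


# Doubling the speed of the largest stage alone only trims per-request cost to 0.82.
only_tracks = relative_cost({"Tracks Generation": 2.0})

# Serving 10x the requests at 2x the cost needs per-request cost near 0.2,
# e.g. a 5x speedup across every stage.
uniform_5x = max_request_multiple({stage: 5.0 for stage in SHARES})
```

Under these assumptions no single stage can deliver the target on its own, since even removing Tracks Generation entirely leaves 64% of the cost. This fits the slide's framing of re-imagining the whole processing path.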
Two-pronged Strategy to Scaling
Cache Manifests
Re-architect code to reduce processing time
Scaling Problems Across Services
Decide Authorize Collect & Learn
Playback Features
Playback Access
Playback Data Systems
Thanks!
@suudhan
Come Talk to Us!
Image Attribution
All images used are under Creative Commons or public domain license:
● Video icon - http://simpleicon.com/wp-content/uploads/video-camera-1.png
● Speaker icon - https://upload.wikimedia.org/wikipedia/commons/thumb/2/21/Speaker_Icon.svg/1024px-Speaker_Icon.svg.png
● Subtitle icon - https://thenounproject.com/term/subtitles/78795/
● Uptrend image - https://pixabay.com/en/chart-line-line-chart-diagram-trend-148256/
● Funnel image - https://commons.wikimedia.org/wiki/File:Funnel_Mech.svg
● Business Intelligence image - https://pixabay.com/static/uploads/photo/2015/04/14/23/17/it-business-722950_960_720.png
● Key icon - https://pixabay.com/static/uploads/photo/2014/04/03/10/55/key-311738_960_720.png
● Person icon - https://pixabay.com/static/uploads/photo/2015/12/22/04/00/photo-1103596_960_720.png
● Mobile icon - https://upload.wikimedia.org/wikipedia/commons/thumb/1/14/Mobile_phone_font_awesome.svg/1024px-Mobile_phone_font_awesome.svg.png
● Globe image - https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Simple_Globe.svg/1024px-Simple_Globe.svg.png
● Devices icon - https://upload.wikimedia.org/wikipedia/commons/thumb/6/60/Simple_Globe.svg/1024px-Simple_Globe.svg.png
● Wifi icon - https://pixabay.com/static/uploads/photo/2016/01/03/11/32/wireless-signal-1119306_960_720.png
● Cell tower - https://pixabay.com/static/uploads/photo/2012/04/13/00/23/tower-31235_960_720.png