SVE: Distributed Video Processing at Facebook Scale

Qi Huang1, Petchean Ang1, Peter Knowles1, Tomasz Nykiel1, Iaroslav Tverdokhlib1, Amit Yajurvedi1, Paul Dapolito IV1, Xifan Yan1, Maxim Bykov1, Chuen Liang1, Mohit Talwar1, Abhishek Mathur1,

Sachin Kulkarni1, Matthew Burke1,2,3, and Wyatt Lloyd1,2,4

1Facebook, Inc., 2University of Southern California, 3Cornell University, 4Princeton University
[email protected], [email protected]

ABSTRACT

Videos are an increasingly utilized part of the experience of the billions of people that use Facebook. These videos must be uploaded and processed before they can be shared and downloaded. Uploading and processing videos at our scale, and across our many applications, brings three key requirements: low latency to support interactive applications; a flexible programming model for application developers that is simple to program, enables efficient processing, and improves reliability; and robustness to faults and overload. This paper describes the evolution from our initial monolithic encoding script (MES) system to our current Streaming Video Engine (SVE) that overcomes each of the challenges. SVE has been in production since the fall of 2015, provides lower latency than MES, supports many diverse video applications, and has proven to be reliable despite faults and overload.

1 INTRODUCTION

Video is a growing part of the experience of the billions of people who use the Internet. At Facebook, we envision a video-first world with video being used in many of our apps and services. On an average day, videos are viewed more than 8 billion times on Facebook [28]. Each of these videos needs to be uploaded, processed, shared, and then downloaded. This paper focuses on the requirements that our scale presents, and how we handle them for the uploading and processing stages of that pipeline.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SOSP '17, October 28, 2017, Shanghai, China
© 2017 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-5085-3/17/10.
https://doi.org/10.1145/3132747.3132775

Processing uploaded videos is a necessary step before they are made available for sharing. Processing includes validating that the uploaded file follows a video format and then re-encoding the video into a variety of bitrates and formats. Multiple bitrates enable clients to continuously stream videos at the highest sustainable quality under varying network conditions. Multiple formats enable support for diverse devices with varied client releases.

There are three major requirements for our video uploading and processing pipeline: provide low latency, be flexible enough to support many applications, and be robust to faults and overload. Uploading and processing are on the path between when a person uploads a video and when it is shared. Lower latency means users can share their content more quickly. Many apps and services include application-specific video operations, such as computer vision extraction and speech recognition. Flexibility allows us to address the ever increasing quantity and complexity of such operations. Failure is the norm at scale and overload is inevitable due to our highly variable workloads that include large peaks of activity that, by their viral nature, are unpredictable. We therefore want our systems to be reliable.

Our initial solution for uploading and processing videos centered around a monolithic encoding script (MES). The MES worked when videos were nascent but did not handle the requirements of low latency, flexibility, or robustness well as we scaled. To handle those requirements we developed the Streaming Video Engine (SVE), which has been in production since the fall of 2015.

SVE provides low latency by harnessing parallelism in three ways that MES did not. First, SVE overlaps the uploading and processing of videos. Second, SVE parallelizes the processing of videos by chunking videos into (essentially) smaller videos and processing each chunk separately in a large cluster of machines. Third, SVE parallelizes the storing of uploaded videos (with replication for fault tolerance) with processing them. Taken together, these improvements enable SVE to reduce the time between when an upload completes and a video can be shared by 2.3×–9.3× over MES.


SVE provides flexibility with a directed acyclic graph (DAG) programming model where application programmers write tasks that operate on a stream-of-tracks abstraction. The stream-of-tracks abstraction breaks a video down into tracks of video, audio, and metadata that simplify application programming. The DAG execution model easily allows programmers to chain tasks sequentially, while also enabling them to parallelize tasks when it matters for latency.

Faults are inevitable in SVE due to the scale of the system, our incomplete control over the pipeline, and the diversity of uploads we receive. SVE handles component failure through a variety of replication techniques. SVE masks non-deterministic and machine-specific processing failures through retries. Processing failures that cannot be masked are simple to pinpoint in SVE due to automatic, fine-grained monitoring. SVE provides robustness to overload through a progressing set of reactions to increasing load. These reactions escalate from delaying non-latency-sensitive tasks, to rerouting tasks across datacenters, to time-shifting load by storing some uploaded videos on disk for later processing.

The primary contribution of this paper is the design of SVE, a parallel processing framework that specializes data ingestion (§4.1, §4.3), parallel processing (§4.2), the programming interface (§5), fault tolerance (§6), and overload control (§7) for videos at massive scale. In addition, the paper's contributions also include articulating the requirements for video uploading and processing at scale, describing the evolution from MES (§2.3) to SVE (§3), evaluating SVE in production (§4, §6.4, §7.3), and sharing experience from operating SVE in production for years (§8).

2 BACKGROUND

This section provides background on the full video pipeline, some production video applications, our initial monolithic encoding script solution, and why we built a new framework.

2.1 Full Video Pipeline

The full video pipeline takes videos from when they are captured by a person to when they are viewed by another. There are six steps: record, upload, process, store, share, stream. We describe the steps linearly here for clarity, but later discuss how and when they are overlapped.

Videos are uploaded after they are recorded. Either the original video or a smaller re-encoded version is uploaded, depending on a variety of factors discussed in Section 4.1. Video data is transferred from the user's device to front-end servers [27] within Facebook datacenters. There is high variance in upload times and some times are quite long. While the video is being uploaded, the front-end forwards it to the processing system (MES or SVE).

The processing system then validates and re-encodes the video. Validation includes verifying the video and repairing malformed videos. Verification ensures that an uploaded file is indeed a video as well as blocking malicious files, for example those that try to trigger a denial of service attack during encoding. Repairing malformed videos fixes metadata such as an incorrect frame count or a misalignment between the time-to-frame mappings in the video and audio tracks.

After the processing system validates a video it then re-encodes the video in a variety of bitrates. Re-encoding into different bitrates allows us to stream videos to user devices, with different network conditions, at the highest bitrates they can sustain. It also allows us to continuously adapt bitrates as the network conditions of individual users vary over time. Re-encoding is the most computationally intensive step.

Once the video is processed, we reliably store it in our binary large object (BLOB) storage system [6, 19] so that it remains available for future retrieval. Sharing a video is application specific, which we discuss below. Once a video is shared it can be streamed to the user's device.

This streaming is done by a video player on the user's device that downloads, buffers, and plays small chunks of the video from our CDN [13, 29]. The player can download each chunk in any bitrate with which we re-encoded the video. The player monitors network conditions and selects the highest bitrate that it can play back smoothly. Higher bitrates and less user-visible buffering are desirable, and encoding a video into multiple bitrates is necessary for this.

The focus of this paper is the pre-sharing pipeline, which is the part of the pipeline between when a video is recorded and when it can be shared. This includes the uploading, processing, and storing steps. Our goals for these steps are high reliability and low latency.

2.2 Production Video Applications

There are more than 15 applications registered with SVE that integrate video. The applications ingest over tens of millions of uploads per day and generate billions of tasks within SVE. Four well known examples are video posts, Messenger videos, Instagram stories, and Facebook 360. Video posts are created through Facebook's main interfaces and appear on a user's Timeline and in their friends' News Feeds along with a user post, thumbnails, and notifications to the target audience. This application has a complicated set of DAGs averaging 153 video processing tasks per upload. Messenger videos are sent to a specific friend or a group for almost real-time interaction. Due to a tight latency requirement, it has the simplest DAGs, averaging a little over 18 tasks per upload. Instagram stories are created by recording on Instagram and applying special filters through a stand-alone filter app interface.

Figure 1: The initial video processing architecture. Videos were first uploaded by clients to storage via the front-end tier as opaque blobs (1–2). Then a monolithic encoder sequentially re-encoded the video for storage (3–4) and ultimately sharing.

Its pipeline is slightly more complicated than the Messenger case, generating over 22 tasks per upload. 360 videos are recorded on 360 cameras and processed to be consumed by VR headsets or 360 players; their pipeline generates thousands of tasks per upload due to high parallelism. Each application strives for low latency and high reliability for the best user experience. Each application also has a diverse set of processing requirements, so creating reusable infrastructure at scale that makes it easy for developers to create and monitor video processing is helpful.

2.3 Monolithic Encoding Script

Our initial processing system was a Monolithic Encoding Script (MES). Figure 1 shows the architecture of the pre-sharing pipeline with MES. Under this design videos are treated as a single opaque file. A client uploads the video into storage via a front-end server. Once storage of the full video completes, the video is processed by an encoder within MES. The processing logic is all written in one large monolithic encoding script. The script runs through each of its processing tasks sequentially on a single server. There were many encoding servers, but only one handled a given video.

MES worked and was simple. But, it had three major drawbacks. First, it incurred high latency because of its sequential nature. Second, it was difficult to add new applications with appropriate monitoring into a single script. Third, it was prone to failure and fragile under overload. SVE addresses each of these drawbacks.

2.4 Why a New Framework?

Before designing SVE we examined existing parallel processing frameworks, including batch processing systems and stream processing systems. Batch processing systems like MapReduce [10], Dryad [14], and Spark [35] all assume the data to be processed already exists and is accessible. This requires the data to be sequentially uploaded and then processed, which we show in Section 4.1 incurs high latency.

Stream processing systems like Storm [5], Spark Streaming [36], and StreamScope [16] overlap uploading and processing, but are designed for processing continuous queries instead of discrete events. As a result, they do not support our video-deferring overload control (§7), generating a specialized DAG for each video (§5), or the specialized scheduling policies we are exploring. This made both batch processing and stream processing systems a mismatch for the low-level model of computation we were targeting.

A bigger mismatch, however, was the generality of these systems. The interfaces, fault tolerance, overload control, and scheduling for each general system are necessarily generic. In contrast, we wanted a system with a video-specific interface that would make it easier to program. We also wanted a system with specialized fault tolerance, overload control, and scheduling that takes advantage of the unique characteristics of videos and our video workload. We found that almost none of our design choices, e.g., per-task priority scheduling (§3) and a dynamically generated DAG per video (§5.2), were provided by existing systems. The next four sections describe our video-specific design choices.

3 STREAMING VIDEO ENGINE

This section provides an overview of the architecture of SVE, with the next four sections providing details. Section 4 explains how SVE provides low latency. Section 5 explains how SVE makes it simple to set up new processing pipelines. Sections 6 and 7 explain how SVE is robust to failure and overload, respectively.

Figure 2 shows a high-level overview of the SVE architecture. The servers involved in processing an individual video are shown. Each server type is replicated many times, though without failure only one front-end, preprocessor, and scheduler will handle a particular video.

There are four fundamental changes in the SVE architecture compared to the MES architecture. The first change is that the client breaks the video up into segments consisting of a group of pictures (GOP), when possible, before uploading the segments to the front-end. Each GOP in a video is separately encoded, so each can be decoded without referencing earlier GOPs. Segments split based on GOP alignment are, in essence, smaller stand-alone videos. They can be played or processed independently of one another. This segmenting reduces latency by enabling processing to start earlier. The second change is that the front-end forwards video chunks to a preprocessor instead of directly into storage. The third change is replacing the MES with the preprocessor, scheduler, and workers of SVE. Disaggregating the encoding into these three components enables low latency through parallelization, higher reliability through fault isolation, and better reaction to overload through more options to shift load.

Figure 2: SVE architecture. Clients stream GOP-split segments of the video that can be re-encoded independently (1–2). The preprocessor does lightweight validation of video segments, further splits segments if needed, caches the segments in memory, and notifies the scheduler (3). In parallel, the preprocessor also stores the original upload to disk for fault tolerance (3′). The scheduler then schedules processing tasks on the workers that pull data from the preprocessor tier (4) and then store processed data to disk (5).

And the fourth change is that SVE exposes pipeline construction in a DAG interface that allows application developers to quickly iterate on the pipeline logic while providing advanced options for performance tuning.

Preprocessor. The preprocessor does lightweight preprocessing, initiates the heavyweight processing, and is a write-through cache for segments headed to storage. The preprocessing includes the validation done in MES that verifies the upload is a video and fixes malformed videos. Some older clients cannot split videos into GOP segments before uploading. The preprocessor does the GOP splitting for videos from those clients. GOP splitting is lightweight enough to be done on path on a single machine. It requires only a single pass over the timestamp for each frame within the video data, finding the nearest GOP boundaries to the desired segment duration to divide the video apart.
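
To make the splitting pass concrete, the sketch below illustrates one way such a single pass could pick cut points: walk the keyframe timestamps and cut at whichever keyframe lies nearest to each multiple of the target segment duration. This is an illustrative Python sketch under assumed inputs (frame records as (timestamp, is_keyframe) pairs and a 10-second target), not SVE's actual data structures or code.

def split_on_gop_boundaries(frames, target_segment_sec=10.0):
    """Return segment start times chosen from keyframe timestamps (illustrative)."""
    keyframes = [t for t, is_key in frames if is_key]
    if not keyframes:
        return []                      # no keyframes: leave the video whole
    cuts = [keyframes[0]]
    next_boundary = keyframes[0] + target_segment_sec
    prev = keyframes[0]
    for t in keyframes[1:]:
        if t >= next_boundary:
            # Cut at whichever keyframe is nearest the desired boundary.
            cut = prev if next_boundary - prev < t - next_boundary else t
            if cut > cuts[-1]:
                cuts.append(cut)
            next_boundary = cut + target_segment_sec
        prev = t
    return cuts

# Example: a 30-second clip with a keyframe every 3 seconds.
frames = [(i * 0.5, i % 6 == 0) for i in range(60)]
print(split_on_gop_boundaries(frames))   # [0.0, 9.0, 18.0]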

Encoding, unlike preprocessing, is heavyweight. It requires a pixel-level examination of each frame within a video track and performs both spatial image compression and temporal motion compression across a series of frames. Thus, video encoding is performed on many worker machines, orchestrated by a scheduler. The processing tasks that are run on workers are determined by a DAG of processing tasks generated by the preprocessor. The DAG that is generated depends on the uploading application (e.g., Facebook or Instagram) and its metadata (e.g., normal or 360 video). We discuss how programmers specify the DAG in Section 5.

The preprocessor caches video segments after it preprocesses them. Concurrently with preprocessing, the preprocessor forwards the original uploaded segments to storage. Storing the segments ensures upload reliability despite the failures that are inevitable at scale. Caching the segments in the preprocessor moves the storage system, and its use of disks, off the critical path.

Scheduler. The scheduler receives the job to execute the DAG from the preprocessor as well as notifications of when different segments of the video have been preprocessed. The scheduler differentiates between low and high priority tasks, which are annotated by programmers when they specify the DAGs. Each cluster of workers has a high-priority and a low-priority queue that workers pull tasks from after they finish an earlier task. The scheduler greedily places tasks from the DAG into the appropriate queue as soon as all of their dependencies are satisfied. Workers will pull from both queues when cluster utilization is low. When utilization is high they will only pull tasks from the high-priority queue. SVE shards the workload by video ids among many schedulers. At any time a DAG job is handled by a single scheduler.
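
The two-queue, pull-based behavior described above can be sketched in a few lines of Python. The queue names, the utilization threshold, and the worker loop below are assumptions for illustration; the text does not specify how SVE measures cluster utilization.

import queue

# Two queues per worker cluster: tasks whose DAG dependencies are satisfied
# are placed into the queue matching their priority annotation.
high_pri = queue.Queue()
low_pri = queue.Queue()

def dispatch(task):
    (high_pri if task["latency_sensitive"] else low_pri).put(task)

def pull_next_task(cluster_utilization, threshold=0.8):
    """Workers pull greedily: both queues when utilization is low,
    only the high-priority queue when utilization is high."""
    try:
        return high_pri.get_nowait()
    except queue.Empty:
        pass
    if cluster_utilization < threshold:
        try:
            return low_pri.get_nowait()
        except queue.Empty:
            pass
    return None   # nothing eligible to run right now

dispatch({"name": "sd_encoding:segment_3", "latency_sensitive": True})
dispatch({"name": "video_classification", "latency_sensitive": False})
print(pull_next_task(cluster_utilization=0.9)["name"])   # sd_encoding:segment_3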

Worker. Many worker machines process tasks in the DAG in parallel. Worker machines receive their tasks from the scheduler. They then pull the input data for that task either from the cache in the preprocessor tier or from the intermediate storage tier. Once they complete their task they push the data to intermediate storage if there are remaining tasks, or to the storage tier if the DAG execution job has been completed.

Intermediate Storage. Different storage systems are used for storing the variety of data ingested and generated during the video encoding process. The choice of storage system depends on factors such as the read/write pattern, the data format, or the access frequency for a given type of data. Metadata that is generated for use in the application is written to a multi-tier storage system that durably stores the data and makes it available in an in-memory cache [7], because the metadata is read much more frequently than it is written. The internal processing context for SVE is written to a storage system that provides durability without an in-memory cache. That context is typically written many times and read once, so caching would not be beneficial. The video and audio data is written to a specially configured version of our BLOB storage system. That storage system, and the one for internal processing context, automatically free this data after a few days because it is only needed while the video is being processed. We have found that letting this data expire instead of manually freeing it is both simpler and less error-prone: in case a DAG execution job gets stuck, which happens when a user halts an upload without an explicit signal, or when a DAG execution job's state is lost due to system failure or bugs, letting the storage system manage freeing the data upper-bounds the excess capacity without requiring explicit tracing or job recovery.
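
The paragraph above amounts to a small routing policy keyed on the kind of data. The sketch below restates it schematically; the store names and the TTL values are placeholders, not SVE's actual configuration.

# Schematic routing of intermediate data, per the description above.
# Store names and TTLs are illustrative placeholders.
STORAGE_POLICY = {
    "application_metadata": {"store": "durable_with_memory_cache", "ttl_days": None},  # read-heavy
    "processing_context":   {"store": "durable_no_cache",          "ttl_days": 3},     # write-many, read-once
    "video_audio_segments": {"store": "blob_store_short_lived",    "ttl_days": 3},     # needed only during processing
}

def route(data_kind):
    policy = STORAGE_POLICY[data_kind]
    # A TTL bounds leaked capacity if a DAG job is abandoned or its state is lost,
    # without requiring explicit tracing or cleanup of every job.
    return policy["store"], policy["ttl_days"]

print(route("processing_context"))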

Figure 3: Logical diagrams of the pre-sharing latency of MES and SVE. Letters mark points in the diagram that are measured and reported in later figures.

4 LOW LATENCY PROCESSING

Low latency processing of videos makes applications more interactive. The earlier videos are processed, the sooner they can be shared over News Feed or sent over Messenger. This section describes how SVE provides low latency by overlapping uploading and processing (§4.1), processing the video in parallel (§4.2), and by overlapping fault-tolerant storage and processing (§4.3). Figure 3 shows the logical pre-sharing latency for the MES and SVE designs to provide intuition for why these choices lead to lower latency. This section also quantifies the latency improvement that SVE provides over MES (§4.4). All data is for News Feed uploads in a 6-day period in June 2017 unless otherwise specified.

4.1 Overlap Uploading and Encoding

The time required for a client to upload all segments of a video is a significant part of the pre-sharing latency. Figure 4a shows CDFs of upload times. Even for the smallest size class (< 1MB), approximately 10% of uploads take more than 10 seconds. For the 3–10MB size class, the percentage of videos taking more than 10 seconds jumps to 50%. For the large size classes of 30–100MB, 100–300MB, 300MB–1GB, and ≥ 1GB, more than half of the uploads take more than 1 minute, 3 minutes, 9 minutes, and 28 minutes, respectively. This demonstrates that upload time is a significant part of pre-sharing latency.

Uploads are typically bottlenecked by the bandwidth available to the client, which we cannot improve. This leaves us with two options for decreasing the effect of upload latency on pre-sharing latency: 1) upload less data, and 2) overlap uploading and encoding.

One major challenge we overcome in SVE is enabling these options while still supporting the large and diverse set of client devices that upload videos. Our insight is to opportunistically use client-side processing to enable faster sharing when it is possible and helpful, but to use cloud-side processing as a backup to cover all cases.

We decrease the latency for uploads through client-side re-encoding of the video to a smaller size when three conditions are met: the raw video is large, the network is bandwidth constrained, and the appropriate hardware and software support exists on the client device. We avoid re-encoding when a video is already appropriately sized or when the client has a high bandwidth connection because these uploads will already complete quickly. Thus, we prefer to avoid using client device resources (e.g., battery) since they will provide little benefit. Requiring all three conditions ensures we only do client-side re-encoding when it meaningfully decreases pre-sharing latency.

We decrease overall latency by overlapping uploading and server-side encoding so they can proceed mostly in parallel. This overlap is enabled by splitting videos into GOP-aligned segments. When there is client-side support for splitting, which is common, we do the splitting there because it is a lightweight computation. When there is not client-side support, the preprocessor splits the video to enable parallelizing uploading and processing for all videos. As a result, the combined upload and processing latency can be as low as the upload latency plus the last segment's processing latency.
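
As a rough model of that overlap: with GOP-aligned segments, encoding of segment i can begin as soon as segment i arrives, so the critical path shrinks from "full upload, then full processing" toward "full upload plus processing of the final segment." The arithmetic below is a toy illustration with invented numbers, not measured SVE data.

from itertools import accumulate

segment_upload_sec = [8, 8, 8, 8]      # four GOP-aligned segments (invented)
segment_encode_sec = [5, 5, 5, 5]

# MES-style: upload everything, then encode the whole video sequentially.
sequential = sum(segment_upload_sec) + sum(segment_encode_sec)           # 52

# SVE-style: each segment is encoded (on some worker) as soon as it arrives,
# so the video is ready when the slowest (arrival + encode) finishes.
arrivals = list(accumulate(segment_upload_sec))
overlapped = max(a + e for a, e in zip(arrivals, segment_encode_sec))     # 37

print(sequential, overlapped)   # 52 37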

4.2 Parallel Processing

The time required to process a video (D–E) is a significant part of the pre-sharing latency. Figure 4b shows CDFs of standard definition (SD) encoding time for different size classes of videos under MES. Unsurprisingly, there is a strong correlation between video size and encoding time. For the size classes smaller than 10MB, most videos can be encoded in fewer than 10 seconds. Yet, for even the smallest size class, more than 2% of videos take 10 or more seconds to encode. For large videos the encoding time is even more significant: 53% of videos in the 100–300MB size class take more than 1 minute, 13% of videos in the 300MB–1GB size class take more than 5 minutes, and 23% of videos larger than 1GB take over 10 minutes. This demonstrates that processing time is a significant part of pre-sharing latency.

Fortunately, segmenting a video along GOP boundaries makes processing of the video parallelizable. Each segment can be processed separately from, and in parallel with, each other segment. The challenges here are in selecting a segment size, enabling per-segment encoding, and ensuring the resulting video is still well formed.

Segment size controls a tradeoff between the compression within each segment and parallelism across segments.

Figure 4: Latency CDFs for the steps in the logical flow of MES broken down for size ranges from < 1MB to > 1GB: (a) upload times (A–B), (b) MES encoding time (D–E), (c) MES load time (C–D). The upload time (A–B) is the same for MES and SVE. Storage sync time (B–C) is described in §4.3. There is a single legend across the three figures.

Figure 5: Latency CDFs for the steps in the logical flow of SVE broken down for size ranges from < 1MB to > 1GB: (a) SVE encoding start delay (A–D), (b) SVE encoding time (D–E), (c) SVE per-video max fetch time. The upload time (A–B) is shown in Figure 4a. The storage time (B–C) is described in §4.3. Figure 5c shows the per-video max latency across all fetches by workers from the preprocessor caches (similar to C–D in MES). There is a single legend spread across the three figures.

Larger segments result in better compression because there is a larger window over which the compression algorithm can exploit temporal locality, but less parallelism because there are fewer segments. We use a segment size of 10 seconds for applications that prefer lower latency over the best compression ratio, e.g., messaging and the subset of encodings that drive News Feed notifications. We use a segment size of 2 minutes for high quality encodings of large video posts where a single-digit percentage improvement in compression is prioritized, as long as the latency does not exceed a product-specified limit.

Per-segment encoding requires converting processing that executes over the entire video to execute on smaller video segments. SVE achieves this by segmenting each video, with each segment appearing to be a complete video. For videos with constant frame rates and evenly distributed GOP boundaries, there is no need for additional coordination during encoding. But, for variable frame rate videos, SVE needs to adjust the encoding parameters for each segment based on the context of all earlier segments, e.g., their frame count and duration. Stitching the separately processed segments back together requires a sequential pass over the video, but fortunately this is lightweight and can be combined with a pass that ensures the resulting video is well formed.

The high degree of parallelism in SVE can sometimes lead to malformed videos when the original video has artifacts. For instance, some editing tools set the audio starting time to a negative value as a way to cut audio out, but our encoder behaves differently when processing the audio track alone than in the more typical case when it processes it together with the video track. Another example is missing frame information that causes our segmentation process to fail to generate the correct segment.


Ensuring SVE can handle such cases requires repair at the preprocessing and track-joining stages that fixes misaligned timestamps between video and audio tracks and/or refills missing or incorrect video metadata such as frame information. We observe that 3% of video uploads need to be repaired, mainly at the preprocessing stage. 68% of these repairs fix framerates that are too low or variable, 30% fix incomplete or missing metadata, and the remaining 2% are due to segmentation issues. Repairing videos at the preprocessing stage prevents overlapping upload and processing for this fraction of videos. We are investigating parallelizing the repair process with the upload to reduce end-to-end latency for such uploads.

4.3 Rethinking the Video Sync

The time required to durably store an uploaded video is sometimes a significant part of the overall latency. This syncing time (B–C) includes ensuring the video has been synced to multiple disks. This time is independent of the size of the video. It has a median of 200ms, a 90th percentile of 650ms, and a 99th percentile of 900ms. This demonstrates that the syncing time is significant for some videos.

To reduce the effect of the latency that comes from durably storing the video, we overlap it with processing the video. This is essentially rethinking the sync [23, 24] for storage from the MES design. The MES design stored the uploaded video before processing to ensure it would not be lost once uploaded. The SVE design stores the uploaded video in parallel with processing it and waits for both steps to complete before notifying the user. This decreases the pre-sharing latency with the same fault tolerance guarantees.
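
A minimal sketch of the "store in parallel with processing, notify only when both finish" behavior described above, using Python threads as stand-ins for the preprocessor's concurrent paths; the function names are illustrative, not SVE internals.

from concurrent.futures import ThreadPoolExecutor, wait

def store_original(segments):      # durably sync the upload (stand-in)
    return "stored"

def process(segments):             # run the DAG over cached segments (stand-in)
    return "encoded"

def notify_user_ready():
    print("video is ready to share")

def handle_upload(segments):
    with ThreadPoolExecutor(max_workers=2) as pool:
        store_f = pool.submit(store_original, segments)
        proc_f = pool.submit(process, segments)
        wait([store_f, proc_f])    # MES did these steps sequentially;
                                   # SVE overlaps them and waits for both
    notify_user_ready()

handle_upload(["seg0", "seg1"])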

After syncing the video to storage, the MES encoder loads the video from the storage system (C–D). Figure 4c shows CDFs of the time required to load videos. The 90th percentile of loading times for all size classes of videos is over 1.3 seconds. For most videos that are 100MB or larger, the loading time is over 6 seconds. This demonstrates that the loading time is significant for some videos. To reduce the latency required to fetch video segments, SVE caches them in memory in the preprocessor. This enables workers to fetch the segments without involving the storage system or disks.

4.4 SVE Latency Improvements

Figure 5 roughly parallels Figure 4 to show the improvement in the corresponding part of the logical flow from MES to SVE. Figure 5a shows the delay from when a video upload starts until when SVE can start processing it (A–D). This significantly reduces latency compared to MES, where a video needed to be fully uploaded (A–B), synced to storage (B–C), and then fetched from storage (C–D) before processing could begin.

Figure 6: Speedup of SVE over MES in post-upload latency (B–E) broken down by video size (buckets from < 1MB to ≥ 1GB).

Figure 5c shows the per-video maximum latency for workers fetching segments from the preprocessor. Comparing this maximum fetch-from-cache time in SVE to the fetch-from-storage time in MES (C–D) shows SVE eliminates the tail of high latency for all videos. Figure 5b shows the encoding latency for SVE (D–E) targeting the same videos as MES encodes in Figure 4b, with the same SD quality and 10 second segment size. Compared to the encoding times for MES, SVE delivers lower latency processing for all size classes, and especially for large videos. (MES appears to have lower tail latency for the largest size classes because it times out and aborts processing after 30 minutes; if it were to fully process those videos it would have much larger tail latencies than SVE.)

To get a better picture of SVE's improvement in latency over MES we tracked the post-upload latency (B–E) during our replacement of MES with SVE. Figure 6 shows the speedup that SVE provides over MES. The speedup ranges from 2.3× for < 3MB videos to 9.3× for ≥ 1GB videos. This demonstrates SVE provides much lower latency than MES.

5 DAG EXECUTION SYSTEM

There are an increasing number of applications at Facebook that process videos. Our primary goal for the abstraction that SVE presents is thus to make it as simple as possible to add video processing that harnesses parallelism (§5.1). In addition, we want an abstraction that enables experimentation with new processing (§5.2), allows programmers to provide hints to improve performance and reliability (§5.3), and makes fine-grained monitoring automatic (§5.4). The stream-of-tracks abstraction achieves all of these goals.

5.1 DAGs on Streams-of-Tracks

The DAG on a stream-of-tracks abstraction helps programmers balance the competing goals of simplicity and parallelism. Programmers write processing tasks that execute sequentially over their inputs and then connect tasks into a DAG. The vertices of the DAG are the sequential processing tasks. The edges of the DAG are the dataflow.

Figure 7: Simplified DAG for processing videos. Grayed tasks run for each segment of the video track. The original video is split into video, audio, and metadata tracks; tasks include SD and HD encoding, thumbnail generation, count segments, combine tracks, notification, analysis, and video classification.

The granularity of both tasks and their inputs controls the complexity/parallelism tradeoff and is specified as a subset of a stream-of-tracks.

The stream-of-tracks abstraction provides two dimensions of granularity that reflect the structure of videos. The first dimension is the tracks within a video, e.g., the video track and the audio track. Tasks can operate on either one track individually or all tracks together. Specifying a task to operate on an individual track enables SVE to extract some parallelism and is simple for programmers. For example, speech recognition only requires the audio track, while thumbnail extraction and facial recognition only require the video track. Specifying these tasks to operate on only their required track allows SVE to parallelize their execution without increasing the burden on the programmer, because the processing tasks do not need to be rewritten.

The second dimension of granularity is the stream of data within a track, e.g., GOP-based segments within a video track. This dimension exposes more parallelism, but increases complexity because it requires tasks that can operate at the granularity of individual segments. For instance, enabling re-encoding tasks to operate on segments required us to modify the ffmpeg commands we used and required us to add a new task that stitches together the segmented video into a single video. Computer vision based video classification is an example of a task that would be difficult to convert to operate at the segment level. Our classifier operates on the full video and does things like track objects across frames. Reengineering this classifier to operate across segments and then combine the different results would be complex.

Figure 7 shows a simplified version of the DAG for processing videos to be shared on Facebook. The initial video is split into tracks for video, audio, and metadata. The video and audio tracks are then copied n times, one for each of the n encoding bitrates (n = 2 for video, 1 for audio) in the figure.

pipeline = create_pipeline(video)
video_track = pipeline.create_video_track()
if video.should_encode_hd
    hd_video = video_track.add(hd_encoding)
                          .add(count_segments)
sd_video = video_track.add(
    {sd_encoding, thumbnail_generation},
).add(count_segments)
audio_track = pipeline.create_audio_track()
sd_audio = audio_track.add(sd_encoding)
meta_track = pipeline.create_metadata_track()
                     .add(analysis)
pipeline.sync_point(
    {hd_video, sd_video, sd_audio},
    combine_tracks,
).add(notify, "latency_sensitive")
 .add(video_classification)

Figure 8: Pseudo-code for generating the simplified DAG. Dependencies in the DAG are encoded by chaining tasks. Branches of the DAG can be merged with sync points. Tasks can also be annotated easily, e.g., specifying the notify task to be latency sensitive.

At this point, the re-encoding tasks, which are the most computationally intensive, are operating at the maximum parallelism: segments of individual tracks. Thumbnail generation, which is also moderately time consuming, is grouped inside the SD encoding task group to be executed at the segment level, without incurring an additional video track copy. The output segments of each track are checked after they finish encoding in parallel, by the "count segments" tasks as a synchronization point. Then all the tracks are joined for storage, before the user is notified their video is ready to be shared. Some processing on the full video typically happens after the notification, such as video classification.

The DAG in Figure 7 follows the typical pattern of our DAGs: split into tracks, segment, split into encodings, collect segments, join segments, and then join tracks. This structure enables the most parallelism for the most computationally intensive tasks, which are re-encodings. It also provides a simple way for programmers to add most tasks. Most tasks operate over the fully joined tracks, which is even simpler to reason about than one big script. This provides SVE with most of the best of both worlds of parallelism and simplicity: parallelism is enabled for the few tasks that dominate processing time, which gives us most of the benefits of parallelism without requiring programmers to reason about parallelism for more than a few tasks.


5.2 Dynamic DAG Generation

The DAG for processing an individual video is dynamically generated at the beginning of that video's upload by the preprocessor. The preprocessor runs code associated with the uploading application to generate the DAG. For example, Figure 8 shows pseudo-code for generating the DAG shown in Figure 7.

Dynamic generation of the DAG enables us to tailor a DAG to each video and provides a flexible way to tune performance and roll out new features. The DAG is tailored to each video based on specific video characteristics forwarded from the client or probed by the preprocessor. For instance, the DAG for a video uploaded at a low bitrate would not include tasks for re-encoding that video at a higher bitrate. As another example, the width of parallelism for tasks operating on segments of video depends on the length of the video and the specified segment size. For instance, a 120-second video with a segment size of 10 seconds would be split into 12 segments. Dynamic generation makes it simple to tune performance through testing different encoding parameters, e.g., encode a small fraction of uploads in a different way to see if they result in smaller segments on average. It also makes it simple to roll out new features, e.g., enable a new feature only if the uploading user is an employee.
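
The sketch below shows how per-video tailoring of this kind might look, following the examples in the paragraph above (skip HD re-encoding for low-bitrate uploads, derive segment-level parallelism from duration and segment size, gate experimental features by uploader). The thresholds, field names, and helper are assumptions for illustration, not SVE internals.

import math

def generate_dag(video, uploader):
    """Illustrative per-video DAG tailoring; thresholds are invented."""
    tasks = ["sd_encoding", "combine_tracks", "notification"]
    if video["bitrate_kbps"] >= 2000:          # skip HD for low-bitrate uploads
        tasks.insert(0, "hd_encoding")
    # Width of segment-level parallelism follows from duration and segment size.
    num_segments = math.ceil(video["duration_sec"] / video["segment_sec"])
    if uploader.get("is_employee"):            # roll out new features to employees first
        tasks.append("experimental_encoding_params")
    return {"tasks": tasks, "segments": num_segments}

print(generate_dag({"bitrate_kbps": 900, "duration_sec": 120, "segment_sec": 10},
                   {"is_employee": False}))
# {'tasks': ['sd_encoding', 'combine_tracks', 'notification'], 'segments': 12}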

5.3 DAG Execution and Annotations

Once a DAG is generated, the preprocessor forwards it to the scheduler, which tracks the dependencies, dispatches tasks to workers, and monitors progress updates from the workers. The preprocessor regularly updates the scheduler with the readiness of each segment and each track for a given stream/DAG to enable the scheduler to dispatch tasks as soon as a segment becomes available.

All workers are equipped with HHVM [12], a virtual machine supporting just-in-time compilation for all SVE tasks, which in turn are Hack [31] functions deployed continuously from our code repository. During execution, each task is wrapped within framework logic that communicates with SVE components to prepare input, propagate output, and report task results.

Our DAG specification language has three types of annotations that programmers can add to tasks to control execution. The first annotation is a task group, which is a group of tasks that will be scheduled and run together on a single worker. By default, each task is an independent task group. Combining multiple tasks into a group amortizes scheduling overhead and eliminates cross-machine dataflow among these tasks. This provides the same performance as running all these tasks within a single task. But, it also allows for finer-grained monitoring, fault identification, and fault recovery.

Component        Strategy
Client device    Anticipate intermittent uploads
Front-end        Replicate state externally
Preprocessor     Replicate state externally
Scheduler        Synchronously replicate state externally
Worker           Replicate in time
Task             Many retries
Storage          Replicate on multiple disks

Figure 9: Fault tolerance strategies for SVE.

The second annotation is whether or not a task is latency-sensitive, i.e., a user is waiting for the task to finish. Tasks that are on the path to alerting the user that their video is ready to share are typically marked latency-sensitive. Tasks that are not are typically marked not latency-sensitive. Task groups with at least one latency-sensitive task are treated as latency-sensitive. The ancestors of a latency-sensitive task are also treated as latency-sensitive. Thus, in Figure 8 only the notification task needs to be marked latency-sensitive. These annotations are used by the scheduler for overload control.
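
Marking only the notification task works because the property propagates to every ancestor in the DAG; the sketch below shows that propagation as a reverse traversal over a dependency map. The graph representation and task names are assumptions for illustration.

from collections import deque

def propagate_latency_sensitivity(parents, marked):
    """parents maps task -> list of tasks it depends on.
    Any ancestor of a latency-sensitive task is also treated as latency-sensitive."""
    sensitive = set(marked)
    frontier = deque(marked)
    while frontier:
        task = frontier.popleft()
        for parent in parents.get(task, []):
            if parent not in sensitive:
                sensitive.add(parent)
                frontier.append(parent)
    return sensitive

parents = {
    "notify": ["combine_tracks"],
    "combine_tracks": ["sd_encoding", "sd_audio"],
    "video_classification": ["notify"],
}
print(propagate_latency_sensitivity(parents, {"notify"}))
# contains notify, combine_tracks, sd_encoding, sd_audio (but not video_classification)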

5.4 Monitoring and Fault Identification

Separating processing out into tasks also improves monitoring and fault identification. On receiving a task group from the scheduler, a worker executes its framework runtime to understand the incoming task composition, trigger tasks in order, and automatically add monitoring to every task. Monitoring helps us make sure that DAG jobs are succeeding or, if not, it helps us quickly identify the task that is failing.

Monitoring is also useful for analysis across many jobs. For instance, which tasks take the longest time to complete? Or, which tasks fail the most? We have found that monitoring tasks, identifying failing tasks, and doing this type of analysis is far easier with the SVE design than it was with MES.

6 FAULT TOLERANCE

Our scale, our incomplete control over the system, and the diversity of our inputs conspire to make faults inevitable. This section describes how SVE tolerates these failures.

6.1 Faults From Scale

SVE is a large scale system and, as with any system at scale, faults are not only inevitable but common. Figure 9 summarizes the fault tolerance strategies for components in SVE. We provide details on our strategy for handling task failure because we found it to be the most surprising.

A worker detects task failure either through an exception or a non-zero exit value.


SVE allows programmers to configure the framework's response to failure according to exception type. For non-recoverable exceptions, such as the video being deleted due to user cancellation or site integrity revocation during an upload, SVE terminates the DAG execution job and marks it as canceled. Cancellation accounts for less than 0.1% of failed DAG execution jobs. (Most failed DAG execution jobs are due to corrupted uploads, e.g., those that do not contain a video stream.) In recoverable or unsure cases the worker retries the task up to 2 more times locally. If the task continues to fail, the worker will report failure to the scheduler. The scheduler will then reschedule the task on another worker, up to 6 more times. Thus, a single task can be executed as many as 21 times. If the last worker also fails to complete the task, then the task and its enclosing task group are marked as failed in the graph. If the task was necessary, this fails the entire DAG execution job.
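
The escalation above (up to 3 attempts on the first worker, then up to 6 more workers, for as many as 3 × 7 = 21 executions) can be sketched as two nested loops. The exception class and worker callables below are illustrative stand-ins, not SVE's actual error taxonomy.

class NonRecoverableError(Exception):
    """e.g., the video was deleted or revoked mid-upload."""

def run_with_retries(task, workers, local_attempts=3, max_workers=7):
    """Up to local_attempts tries per worker, on up to max_workers workers: 3 * 7 = 21."""
    for worker in workers[:max_workers]:
        for _ in range(local_attempts):
            try:
                return worker(task)
            except NonRecoverableError:
                raise                      # cancel the DAG job; do not retry
            except Exception:
                continue                   # recoverable/unsure case: retry locally
        # all local attempts failed: report to the scheduler, try another worker
    raise RuntimeError(f"task {task!r} failed on {max_workers} workers")

def flaky_worker(task):
    raise IOError("disk hiccup")           # always fails, to exercise the retries

try:
    run_with_retries("sd_encoding:segment_0", [flaky_worker] * 7)
except RuntimeError as e:
    print(e)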

We have found that this schedule of retries for failures helps mask and detect non-deterministic bugs. It masks non-deterministic bugs because multiple re-executions at increasing granularities make it less and less likely we will trigger the bug. At the same time, our logs capture that re-execution was necessary and we can mine the logs to find tasks with non-negligible retry rates due to non-deterministic bugs.

We have found that such a large number of retries does increase end-to-end reliability. Examining all video-processing tasks from a recent 1-day period shows that the success rate, excluding non-recoverable exceptions, on the first worker increases from 99.788% to 99.795% after 2 retries; and on different workers increases to 99.901% after 1 retry and 99.995% ultimately after 6 retries. Local retries have been particularly useful for masking failures when writing metadata and intermediate data, as the first local retry reduces such failures by over 50% (though the second retry only further reduces failures by 0.2%). Retries on new workers have been effective for overcoming failures localized to a worker, such as filesystem failures, out-of-memory failures, and task-specific processing timeouts triggered by CPU contention.

6.2 Faults from Incomplete Control

We do not have complete control over the pre-sharing pipeline because we do not control the client's device and its connectivity. The loss of a connection to a client for even a prolonged period of time (i.e., days) does not indicate an irrecoverable fault. As discussed in Section 4.1, some uploads take over a week to complete, with many client-initiated retries after network interruptions.

A very slowly uploaded video takes a different path than a normal video and is protected with a grace period that is many days long. Segments from the slowly uploading videos will fall out of the preprocessor cache after a few hours. To protect against this we store the original segments from videos that have not been fully uploaded for the grace period. The scheduler will also purge the DAG execution job associated with the upload after a few days. To protect against this, if and when the upload does finish, SVE detects this special case and schedules the job for execution again. If, on the other hand, after the grace period the upload has not succeeded, then the preprocessed and processed segments are deleted.
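
The slow-upload handling above boils down to a small decision keyed on how long the upload has been outstanding. The sketch below restates it; the text does not give exact durations, so "a few hours," "a few days," and "many days" are left as named placeholder constants.

from datetime import datetime, timedelta

CACHE_LIFETIME = timedelta(hours=6)   # placeholder for "a few hours"
JOB_PURGE_AGE = timedelta(days=3)     # placeholder for "a few days"
GRACE_PERIOD = timedelta(days=10)     # placeholder for "many days"

def slow_upload_action(started, finished, now):
    """Schematic decision for a slowly uploading video."""
    age = now - started
    if finished:
        # If the scheduler already purged the DAG job, detect that and reschedule it.
        return "reschedule_dag_job" if age > JOB_PURGE_AGE else "run_dag_job"
    if age > GRACE_PERIOD:
        return "delete_segments"            # give up; drop preprocessed/processed data
    if age > CACHE_LIFETIME:
        return "persist_original_segments"  # cache entries expired; keep originals on disk
    return "wait"

print(slow_upload_action(datetime(2017, 6, 1), False, datetime(2017, 6, 2)))
# -> persist_original_segments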

6.3 Faults from Diverse Inputs

There are many different client devices that record and upload videos, and many different types of segments. On an average day, SVE processes videos from thousands of types of mobile devices, which cover hundreds of combinations of video and audio codecs. This diversity, combined with our scale, results in an extreme number of different code paths being executed. This in turn leads to bugs from corner cases that were not anticipated in our testing. Fortunately, our monitoring identifies these bugs so we can fix them later.

6.4 Fault Tolerance in Action

SVE tolerates small scale failures on a daily basis. To demonstrate resilience to large scale failures we show two failure scenarios with regional scope. In SVE, a region is a logical grouping of one or more data centers within a geographic region of the world. For the first scenario we inject failure for 20% of the preprocessors in a region. For the second scenario, which happened naturally in production, 5% of the workers in a region are gradually killed.

The first scenario is shown in Figure 10. We kill 20% of the preprocessors in a region at 17:00 and then bring them back at 18:40. Figure 10a shows the error rate at front-ends and workers in this region. There is a spike in the error rate at front-ends when we kill the preprocessors because the front-end-to-preprocessor connections fail. The impact of these failures is that those front-ends must reconnect the uploads that were going to the failed preprocessors to new preprocessors. Further, the clients may need to retransmit segments for which they will not get end-to-end acknowledgments.

There is a small spike in error rate at the workers when the preprocessors are killed and when they recover. When the preprocessors are killed, ongoing segment fetches from them fail. Workers will then retry through a different preprocessor that will fetch the original segment from storage and re-preprocess it, if necessary. When the preprocessors recover, the shard mapping is updated and errors are thrown when a worker tries to fetch segments from a now stale preprocessor. Workers will then update their shard mapping and retry the fetch from the currently assigned preprocessor.

Figure 10b shows the rate that DAG execution jobs are started and finished in this region. The fluctuations in those rates are typical variations.

Figure 10: The effects of injecting a failure of 20% of the preprocessors in a region: (a) error rate, (b) throughput and success rate, (c) end-to-end latency.

Figure 11: The effects of a natural failure of 5% of workers within a 1k+ worker pool in a region: (a) scheduling error, (b) DAG throughput, (c) DAG completion time.


The second scenario is shown in Figure 11. In this scenario, 5% of workers in a 1000+ machine pool within a region gradually fail. The failures begin at 9:40 with a second spike of failures at 10:30. All workers recovered by 11:45. Figure 11a shows the error rate at schedulers in this region. Schedulers throw errors when they time out waiting for a worker to report completing an assigned task. The error rate closely tracks the failure of workers with a lag proportional to the max timeout for each job. These timeouts are set per task based on the expected processing time. This explains the gradual decrease in error rate: the timeouts are distributed around a mean of a few minutes, with a decreasing proportion at longer times.
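As an illustration, the per-task timeout logic can be sketched as follows; the multiplier, polling interval, and helper names are assumptions for the sketch, not SVE's actual values.

    import time

    TIMEOUT_MULTIPLIER = 3  # assumed safety factor over the expected duration

    def task_timeout(expected_secs):
        # Each task's timeout is derived from its expected processing time, so
        # cheap tasks fail fast while expensive encodes get more slack.
        return TIMEOUT_MULTIPLIER * expected_secs

    def wait_for_worker(task_id, expected_secs, poll_completion, on_timeout):
        # Wait for the worker to report completion; if the timeout expires,
        # surface an error and let the scheduler reassign the task.
        deadline = time.time() + task_timeout(expected_secs)
        while time.time() < deadline:
            if poll_completion(task_id):
                return True
            time.sleep(1.0)
        on_timeout(task_id)
        return False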

Figure 11b shows the rate that DAG execution jobs are started and finished in this region. The decrease in start rate is a natural fluctuation not associated with these failures. The close tracking of finish rate to start rate demonstrates the success rate and throughput of SVE are unaffected by this failure of 5% of workers. Figure 11c shows the median (P50) and 95th percentile (P95) end-to-end latency (A–E) in this region at this time. The resilience of SVE to large-scale worker failures is demonstrated by end-to-end latency being unaffected.

7 OVERLOAD CONTROL
SVE is considered overloaded when there is higher demand for processing than the provisioned throughput of the system. Reacting to overload and taming it allows us to continue to reliably process and share videos quickly. This section describes the three sources of overload, explains how we mitigate it with escalating reactions, and reviews production examples of overload control.

7.1 Sources of Overload
There are three primary sources of overload for SVE: organic, load-testing, and bugs. Organic traffic for sharing videos follows a diurnal pattern that is characterized by daily and weekly peaks of activity. Figure 12b shows load from 5/4–5/11. A daily peak is seen each day and the weekly peak is seen on 5/8. SVE is provisioned to handle the weekly peak.

Organic overload occurs when social events result in many people uploading videos at the same time at a rate much higher than the usual weekly peak. For instance, the ice bucket challenge resulted in organic overload. Some organic overload events are predictable; for instance, New Year's Eve 2015 saw a 3× increase in uploads over the daily peak. Other organic overload events are not; for instance, the coup attempt in Turkey on July 15, 2016, saw a 1.2× increase in uploads to the European data center as many people in Turkey uploaded videos or went live.


Figure 12: The effects of organic, load-test, and bug-induced overload events. (a) Organic overload in all DCs: relative traffic for DC-1 through DC-5, 12/30–1/7. (b) Load-test overload in DC-3: relative traffic for DC-2 and DC-3, 5/4–5/11, with DC-2 drained. (c) Bug-induced overload: preprocessor cache eviction age (×100 sec), 7/20–7/28, with preprocessor overload control active until the fix. [plots omitted]


Load-testing overload occurs when our load-testing framework Kraken [32] tests the entire Facebook stack including SVE. The load-testing framework regularly runs a disaster tolerance test that emulates a datacenter going offline, which results in traffic redirection to our other datacenters. This can result in some datacenters (typically the one nearest the drained datacenter) having demand higher than weekly peak load. Our load testing framework helps us know that our overload controls work by testing them regularly. We monitor throughput and latency under this load so we can back it off any time it negatively affects users.

Bug-induced overload occurs when a bug in SVE results in an overload scenario. One example is when a memory leak in the preprocessor tier caused the memory utilization to max out. This in turn resulted in evictions from the segment cache, which is how we detected the bug. Another example is when an ffmpeg version change suddenly resulted in workers becoming much slower in processing videos due to a change in ffmpeg's default behavior. In this scenario, we saw multiple types of overload control kick in until we were able to roll back the bad push.

7.2 Mitigating Overload
We measure load in each component of SVE and then trigger reactions to help mitigate it when a component becomes overloaded. We monitor load in the CPU-bound workers and memory-bound preprocessors. The front-end and storage tiers, which are used across many applications, separately manage and react to overload. While video processing is computation and memory intensive, the DAG complexity is relatively small. Given that our design keeps the scheduler separate from application data, it is unlikely to be overloaded before the workers and preprocessors.

The reactions to overload generally occur in the following order. First, SVE delays latency-insensitive tasks. Each scheduler monitors the CPU load on the workers in its region. When it schedules a task in its queue it checks the current load against the "moderate" threshold. If the current load is over the threshold and the task is marked as latency-insensitive, then the scheduler moves the task to the end of the queue again instead of scheduling it.
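A minimal sketch of this first reaction follows. The threshold value, queue representation, and dispatch callback are assumptions for illustration, not SVE's actual implementation.

    from collections import deque

    MODERATE_CPU_THRESHOLD = 0.75  # assumed value for the "moderate" load level

    def schedule_next(tasks, worker_cpu_load, dispatch):
        # Pop one task and either dispatch it to a worker or, if the region's
        # workers are over the moderate threshold and the task is marked
        # latency-insensitive, move it to the end of the queue instead of
        # scheduling it.
        if not tasks:
            return
        task_id, latency_sensitive = tasks.popleft()
        if worker_cpu_load > MODERATE_CPU_THRESHOLD and not latency_sensitive:
            tasks.append((task_id, latency_sensitive))
            return
        dispatch(task_id)

    # Example: under high load, the latency-insensitive task is deferred.
    tasks = deque([("reencode_360p", True), ("speech_recognition", False)])
    schedule_next(tasks, worker_cpu_load=0.9, dispatch=lambda t: print("run", t))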

If delaying latency-insensitive tasks does not alleviate the overload in a single region, then SVE will redirect a portion of new uploads to other regions. The precise trigger for this behavior is more complex than the logical trigger of worker CPU utilization being high. First, some latency-sensitive tasks will be pushed back by the scheduler to avoid thrashing workers. This in turn fires a regional overload alert to the on-call engineer. The on-call engineer then will typically manually redirect a fraction of uploads by updating a map of video traffic to datacenters. In addition, an automatic mechanism kicks in and redirects traffic when latency-sensitive tasks are being delayed at a more extreme rate. Redirecting uploads to other regions allows the overload to be handled by less loaded workers in those other regions. We redirect new uploads, instead of scheduling the tasks in other regions, to co-locate the workers for those videos with the preprocessors that will cache their segments. This co-location reduces latency for those uploads by avoiding cross-region fetching of segments by workers. The co-location also keeps the memory utilization on the preprocessor tier in the overloaded region from maxing out and triggering our final reaction to overload.
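The redirection mechanism can be illustrated with the following sketch, which assumes a hypothetical traffic map keyed by region; the region names, fractions, and helper functions are illustrative only.

    import random

    # Hypothetical map of new-upload traffic to regions: the fraction of uploads
    # each region should accept (values sum to 1.0). The on-call engineer, or
    # the automatic mechanism, shifts weight away from an overloaded region.
    traffic_map = {"region-a": 0.6, "region-b": 0.3, "region-c": 0.1}

    def redirect_uploads(overloaded_region, fraction, target_region):
        # Move `fraction` of the overloaded region's share of new uploads to
        # another region. Only new uploads move, so each video's workers stay
        # co-located with the preprocessor that caches its segments.
        moved = traffic_map[overloaded_region] * fraction
        traffic_map[overloaded_region] -= moved
        traffic_map[target_region] += moved

    def pick_region_for_upload():
        # Route a new upload according to the current traffic map.
        regions, weights = zip(*traffic_map.items())
        return random.choices(regions, weights=weights, k=1)[0]

    # Example: shed a quarter of region-a's new uploads to region-b.
    redirect_uploads("region-a", fraction=0.25, target_region="region-b")
    print(pick_region_for_upload())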

Our final reaction to overload is to delay processing newly uploaded videos entirely. If all regions are building queues of latency-sensitive videos then the memory in the preprocessor tier will fill up as it caches segments for videos at a rate greater than they can be processed and evicted from the cache. Videos are spread across preprocessors and each preprocessor detects and reacts to this form of overload. When a preprocessor nears its limit for memory it will start delaying DAG execution jobs. These jobs are not preprocessed or sent to the scheduler. Instead they are forwarded to storage where they will be kept until the preprocessor is less loaded.
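A minimal sketch of this preprocessor-side admission decision follows; the memory threshold and the two callbacks are assumed names for illustration, not SVE's actual code.

    MEMORY_SHED_THRESHOLD = 0.9  # assumed fraction of preprocessor memory

    def admit_new_upload(video_id, memory_utilization,
                         preprocess_and_schedule, park_in_storage):
        # Decide whether this preprocessor admits a newly uploaded video.
        # `preprocess_and_schedule` runs normal preprocessing and hands the DAG
        # execution job to the scheduler; `park_in_storage` forwards the upload
        # to storage until the preprocessor is less loaded.
        if memory_utilization >= MEMORY_SHED_THRESHOLD:
            park_in_storage(video_id)
            return False
        preprocess_and_schedule(video_id)
        return True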



Delaying newly uploaded videos is not ideal because we want our users to be able to share their videos quickly. However, given complete overload of the system it is inevitable that some uploads will be delayed. Delaying whole uploads allows the system to operate at its maximum throughput on the videos it does process. Not delaying whole videos, in contrast, would result in thrashing behavior in the preprocessor cache. This thrashing would decrease the throughput of the system and increase the latency of all videos.

While the reactions to overload are generally triggered in top-to-bottom order, this is not always the case. The trigger condition for each reaction is separately monitored and enforced. For instance, the memory leak bug triggered load shedding through delaying some newly uploaded videos without affecting scheduling.

7.3 Overload Control in Action
Figure 12 shows overload control in action for three different overload scenarios. Figure 12a shows a large spike in organic traffic during New Year's Eve 2015. The number of completed DAG execution jobs is shown relative to daily peak (1×). There are three spikes in the graph that correspond roughly to midnight in Australia and Eastern Asia, Europe, and the Americas. The first reaction, delaying latency-insensitive tasks, was triggered for datacenters hitting their max throughput. We anticipated this spike in traffic, so the uploads were already configured to be spread across more datacenters than is typical. This prevented the redirection overload reaction from kicking in by, in essence, anticipating that we would need it and proactively executing it. With the load spread across all datacenters and latency-insensitive tasks delayed, DAG execution jobs were processed fast enough that the memory on preprocessors did not hit the last threshold and we did not need to delay any newly uploaded videos. The effect of overload control is seen in the figure through the mix of traffic spikes in different datacenters. This demonstrates SVE is capable of handling large organic load spikes that exceed the provisioned capacity of the global system.

Figure 12b shows overload from load testing. Specifically, a disaster readiness test drained all traffic from datacenter 2. This caused the upload traffic for that datacenter to be redirected to other datacenters, with most of the traffic going to datacenter 3, which is the nearest. This demonstrates that SVE is capable of handling load spikes that exceed the provisioned capacity of a given datacenter. In addition, it demonstrates that SVE is resilient to a datacenter failure.

Figure 12c shows overload caused by a memory leak on preprocessors. Once the memory usage exceeded the threshold, preprocessors shed new uploads to storage, and the cache eviction age on the preprocessors dropped from its typical 30+ minutes to a few minutes. The job completion rate (not shown) is unaffected. This demonstrates that SVE is capable of handling large bug-induced load spikes.

8 PRODUCTION LESSONS
This section shares lessons learned from our experience operating SVE in production for over a year. We share these in the hope they will be helpful to the community in giving more context about design trade-offs and scaling challenges for practical video processing systems at massive scale.

8.1 Mismatch for Livestreaming
Livestreaming is streaming a video from one to many users as it is being recorded. Our livestream video application's requirements are a mismatch for what SVE provides.

The primary mismatch between SVE and livestreaming stems from the overlap of recording the video with all other stages of the full video pipeline in livestreaming. This paces the upload and processing for the video to the rate it is recorded: each second, only 1 second of video needs to be uploaded and only 1 second of video needs to be processed. As a result, upload throughput is not a bottleneck as long as the client's upstream bandwidth can keep pace with the live video. And parallel processing is unnecessary because there is only a small segment of video to process at a given time. Another mismatch is that the flexibility afforded through dynamic generation of a DAG for each video is unnecessary for livestreaming. A third mismatch is that SVE recovers from failures and processes all segments of video, while once a segment is no longer "live" it is unnecessary to process it. Each of these mechanisms in SVE that is unnecessary for livestreaming adds some latency, which is at odds with livestreaming's primary requirement.

As a result, livestreaming at Facebook is handled by a separate system, whose primary design consists of a flat tier of encoders that allocate a dedicated connection and computation slot for a given live stream. Reflecting on this design mismatch helped us realize the similarity between the live encoders and SVE's workers. We are currently exploring consolidating these two components to enable shared capacity management. An important challenge that remains is balancing the load between the two types of workloads: streamed dedicated live encoding and batched full video processing.

8.2 Failures from Global Inconsistency
When an upload begins it is routed from the client to a particular region, then a particular front-end, and from there to a particular preprocessor.


To improve locality, this preprocessor will typically handle all uploaded segments of a video. In all cases, once a video is bound to a preprocessor it will be processed entirely in the preprocessor's region. The binding of a video to a preprocessor/region was originally stored in an eventually-consistent geo-replicated data store. For the vast majority of uploads, where all segments are routed to the same region, this did not cause a problem. When the global load balancing system, however, routed some parts of the upload to a different region, the eventually-consistent storage did cause a problem.

If the binding from video to preprocessor/region had not yet replicated to this different region, then the front-end that received a part of the upload did not know where to send it. Once we determined the root cause we coped with it by introducing a retry loop that continues to check for the binding and delays the upload until the binding is replicated.
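The retry loop can be sketched as follows; the polling interval, wait bound, and lookup helper are assumptions for the sketch rather than SVE's actual parameters.

    import time

    def wait_for_binding(video_id, lookup_binding, max_wait_secs=30.0, poll_secs=0.5):
        # Poll the replicated store until the video's preprocessor/region binding
        # appears, then return it. `lookup_binding` is a hypothetical read against
        # the eventually-consistent store; it returns None until replication
        # catches up. If the lag exceeds the bound, the upload still fails, which
        # is the weakness discussed next.
        deadline = time.time() + max_wait_secs
        while time.time() < deadline:
            binding = lookup_binding(video_id)
            if binding is not None:
                return binding
            time.sleep(poll_secs)  # delay this part of the upload and retry
        raise TimeoutError("binding not replicated in time")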

This short-term solution is not ideal, however, because it leaves the system vulnerable to spikes in replication lag. Many parts of our system (e.g., timeouts on processing requests on front-end machines) are tuned to handle requests in a timely manner. When replication lag spikes, it can delay an upload an arbitrarily long time, which then breaks the expectations in our tuning and can still fail the upload. Thus, while our short-term fix handles the problem in the normal case, we have latent failures that will be exposed by slow replication. We are exploring a long-term solution that will avoid this problem by using strongly-consistent storage.

8.3 Failures from Regional Inconsistency
We also used a regional cache-only option of the same replicated store as part of our intermediate storage (discussed in §3) that is used to pass metadata between different tasks. We picked the cache-only option to avoid needing backing storage, which we thought was a good choice given the metadata is only needed while a DAG job is being executed. Both the consistency of the data store and the cache-only choice have caused us significant maintenance problems.

The data store provides read-after-write consistency within a cluster, which is a subset of a region. Once our pool of workers grew larger than a cluster, we started seeing exceptions as intermediate metadata would either not exist or be stale, which would fail the upload. We initially coped with this problem using retries, but this is fragile for reasons similar to those discussed above and did not solve the cache-only problem.

The cache-only option can, and occasionally does, lose data. When this happened it would also cause an upload to fail. We initially coped with this problem by rerunning earlier tasks to regenerate the missing metadata. This was not ideal, however, because rerunning those tasks could take seconds to minutes, increasing end-to-end latency and wasting worker resources. We have solved both problems by moving to a persistent database for the intermediate metadata storage. While this increases latency on the order of tens of milliseconds per operation, this small increase in latency is worth it to make our video uploads more robust.

8.4 Continuous Sandboxing
For security, we sandbox the execution of tasks within SVE using Linux namespaces. Initially we would create a new sandbox for each task. When we moved our messaging use case onto SVE we saw a considerable spike in latency due to setting up unique network namespaces for each execution. The videos for the messaging use case tend to be smaller and, as a result, there are typically more of them being concurrently processed on each worker. We found that setting up many sandboxes concurrently caused the spike in latency. We solved this problem by modifying our system to reuse pre-created namespaces for sandboxing across tasks, which we call continuous sandboxing. We are now investigating more potential efficiency improvements from making more of our system continuous.
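The reuse of pre-created sandboxes can be illustrated with the following sketch; it uses a generic pool abstraction with hypothetical create_sandbox and reset hooks standing in for actual Linux namespace setup and cleanup, and is not SVE's implementation.

    import queue

    class SandboxPool:
        # Hypothetical pool of pre-created sandboxes. In production the
        # sandboxes would be Linux namespaces; here `create_sandbox` and
        # `reset` are illustrative hooks for namespace setup and cleanup.
        def __init__(self, create_sandbox, size=8):
            self._idle = queue.Queue()
            for _ in range(size):
                self._idle.put(create_sandbox())  # pay the setup cost once, up front

        def run(self, task, *args):
            sandbox = self._idle.get()            # blocks if all sandboxes are busy
            try:
                return task(sandbox, *args)
            finally:
                sandbox.reset()                   # scrub state before reuse
                self._idle.put(sandbox)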

9 RELATED WORK
This section reviews three categories of related work: video processing at scale, batch processing systems, and stream processing systems. SVE occupies a unique point in the intersection of these areas because it is a production system that specializes data ingestion, parallel processing, its programming abstraction, fault tolerance, and overload control for videos at massive scale.

Video Processing at Scale. ExCamera [11] is a system for parallel video encoding that achieves very low latency through massively parallel processing of tiny segments of video that can still achieve high compression through state-passing. SVE is a much broader system than ExCamera; e.g., it deals with data ingestion, many different video codecs, and programming many video processing applications. SVE could potentially lower its latency further, especially for the highest quality and largest videos, by adopting an approach similar to ExCamera.

A few companies have given high-level descriptions of their video processing systems. Netflix has a series of blog posts [1, 33] that describes their video processing system. Their workload is an interesting counterpoint to ours: they have far fewer videos, with much less stringent latency requirements, and very predictable load. YouTube describes a method to reduce artifacts from stitching together separately processed video segments, which includes a high-level description of their parallel processing [17].


This paper provides a far more detailed description of a production video processing system than this prior work.

Other recent work focuses on efficiently using limited computational resources to improve video streaming or querying. Chess-VPS [30] is a popularity prediction service targeted at SVE and Facebook's video workload. Chess-VPS aims to guide the re-encoding of videos that will become popular to enable higher quality streaming for more users. VideoStorm [37] targets a different setting where it makes intelligent decisions on how to process queries over live videos in real time.

Batch Processing Systems. There has been a tremendous boom in batch processing systems and related research in the last fifteen years. MapReduce [10] is a parallel programming model, system design, and implementation that helped kick-start this resurgence. MapReduce's insights were to build the hard parts of distributed computation (fault tolerance, moving data, scheduling) into the framework so application programmers do not need to worry about them, and to have the programmer explicitly specify parallelism through their map and reduce tasks. SVE, and many other distributed processing systems, exploit these same insights.

Dryad [14, 34] generalized the map-reduce programming model to DAGs and introduced optimizations such as shared-memory channels between vertices to guarantee they would be scheduled on the same worker. SVE also has a DAG programming model, and SVE's task group annotation is equivalent to connecting that set of vertices with shared-memory channels in Dryad. Piccolo [25] is a batch processing system that keeps data in memory to provide lower latency for computations. SVE's caching of segments in preprocessor memory is similar and helps provide low latency.

CIEL [21] is a batch processing system that can execute iterative and recursive computations expressed in the Skywriting scripting language. Spark [35] is a batch processing system that uses resilient distributed datasets to keep data in memory and share it across computations, which results in much lower latency for iterative and interactive processing. Naiad [20] is a batch and stream processing system that introduced the timely dataflow model that allows iterative computations through cycles in its processing DAG. Computations in SVE are neither iterative, recursive, nor interactive.

Stream Processing Systems. There is a significant body of work on stream processing systems, starting with early systems like TelegraphCQ [8], STREAM [18], Aurora [3], and Borealis [2]. These systems, in contrast to batch processing systems, consider data ingestion latency implicitly in their model of computation. Their goal is to compute query results across streams of input as quickly as possible. It is thus natural for these systems to consider data ingestion latency.

More recent work [4, 9, 15, 16, 22, 26, 36] extends stream processing in new directions. Spark Streaming [36] focuses on faster failure recovery and straggler mitigation by moving from the continuous query model to modeling stream processing as (typically very small) batch jobs. JetStream [26] looks at wide-area stream processing and exploits aggregation and degradation to reduce scarce wide-area bandwidth. StreamScope [16] is a streaming system at Microsoft with some extra similarities to SVE as compared to typical stream processing systems. StreamScope is designed to make development and debugging of applications easy, as is SVE. Its evaluation of a click-fraud detection DAG evaluates end-to-end latency. This is a rare instance in the stream processing literature where the implicit consideration of data ingestion is made explicit. We suspect these additional similarities arose because both StreamScope and SVE are in production.

10 CONCLUSION
SVE is a parallel processing framework that specializes data ingestion, parallel processing, the programming interface, fault tolerance, and overload control for videos at massive scale. It provides better latency, flexibility, and robustness than the MES it replaced. SVE has been in production since the fall of 2015. While it has demonstrated its resilience to faults and overloads, our experience operating it provides lessons for building even more resilient systems.

ACKNOWLEDGMENTS
We are grateful to the anonymous reviewers of the SOSP program committee, our shepherd Geoff Voelker, Zahaib Akhtar, Daniel Suo, Haoyu Zhang, Avery Ching, and Kaushik Veeraraghavan, whose extensive comments substantially improved this work. We are also grateful to Eran Ambar, Gux Luxton, Linpeng Tang, Pradeep Sharma, and other colleagues at Facebook who made contributions to the project at different stages.

REFERENCES
[1] Anne Aaron and David Ronca. 2015. High Quality Video Encoding at Scale. http://techblog.netflix.com/2015/12/high-quality-video-encoding-at-scale.html. (2015).
[2] Daniel J Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Cetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, et al. 2005. The Design of the Borealis Stream Processing Engine. In Proceedings of the 2005 Conference on Innovative Data Systems Research.
[3] Daniel J Abadi, Don Carney, Ugur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. 2003. Aurora: a new model and architecture for data stream management. The VLDB Journal 12, 2 (2003), 120–139.
[4] Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-Tolerant Stream Processing at Internet Scale. Proceedings of the VLDB Endowment 6, 11 (2013), 1033–1044.

[5] Apache Storm 2017. http://storm.apache.org/. (2017).
[6] Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. Finding a Needle in Haystack: Facebook's Photo Storage. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation. USENIX.
[7] Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. 2013. TAO: Facebook's Distributed Data Store for the Social Graph. In Proceedings of the 2013 USENIX Annual Technical Conference. USENIX.
[8] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J Franklin, Joseph M Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R Madden, Fred Reiss, and Mehul A Shah. 2003. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM.
[9] Tyson Condie, Neil Conway, Peter Alvaro, Joseph M Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce Online. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation. USENIX.
[10] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation. USENIX.
[11] Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Vasuki Balasubramaniam, William Zeng, Rahul Bhalerao, Anirudh Sivaraman, George Porter, and Keith Winstein. 2017. Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation. USENIX.
[12] HipHop Virtual Machine 2017. http://hhvm.com/. (2017).
[13] Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. 2013. An Analysis of Facebook Photo Caching. In Proceedings of the 24th ACM Symposium on Operating System Principles. ACM.
[14] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In Proceedings of the 2nd ACM SIGOPS European Conference on Computer Systems. ACM.
[15] Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthikeyan Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM.
[16] Wei Lin, Zhengping Qian, Junwei Xu, Sen Yang, Jingren Zhou, and Lidong Zhou. 2016. StreamScope: Continuous Reliable Distributed Processing of Big Data Streams. In Proceedings of the 13th USENIX Symposium on Networked Systems Design and Implementation. USENIX.
[17] Yao-Chung Lin, Hugh Denman, and Anil Kokaram. 2015. Multipass Encoding for Reducing Pulsing Artifacts in Cloud Based Video Transcoding. In Proceedings of the 2015 IEEE International Conference on Image Processing. IEEE.
[18] Rajeev Motwani, Jennifer Widom, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Chris Olston, Justin Rosenstein, and Rohit Varma. 2003. Query Processing, Resource Management, and Approximation in a Data Stream Management System. In Proceedings of the 2003 Conference on Innovative Data Systems Research.
[19] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, et al. 2014. f4: Facebook's Warm BLOB Storage System. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation. USENIX.
[20] Derek Gordon Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the 24th ACM Symposium on Operating System Principles. ACM.
[21] Derek G Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation. USENIX.
[22] Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed Stream Computing Platform. In Proceedings of the 2010 IEEE International Conference on Data Mining Workshops. IEEE.
[23] Edmund B. Nightingale, Kaushik Veeraraghavan, Peter M. Chen, and Jason Flinn. 2006. Rethink the Sync. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation. USENIX.
[24] Edmund B Nightingale, Kaushik Veeraraghavan, Peter M Chen, and Jason Flinn. 2008. Rethink the Sync. ACM Transactions on Computer Systems (TOCS) 26, 3 (2008), 6.
[25] Russell Power and Jinyang Li. 2010. Piccolo: Building Fast, Distributed Programs with Partitioned Tables. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation. USENIX.
[26] Ariel Rabkin, Matvey Arye, Siddhartha Sen, Vivek S Pai, and Michael J Freedman. 2014. Aggregation and Degradation in JetStream: Streaming Analytics in the Wide Area. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation. USENIX.
[27] Vijay Rao and Edwin Smith. 2016. Facebook's new front-end server design delivers on performance without sucking up power. https://code.facebook.com/posts/1711485769063510. (2016).
[28] Alyson Shontell. 2015. Facebook is now generating 8 billion video views per day from just 500 million people – here's how that's possible. https://tinyurl.com/yc3jhxuu. (2015).
[29] Linpeng Tang, Qi Huang, Wyatt Lloyd, Sanjeev Kumar, and Kai Li. 2015. RIPQ: Advanced Photo Caching on Flash for Facebook. In Proceedings of the 13th USENIX Conference on File and Storage Technologies. USENIX.
[30] Linpeng Tang, Qi Huang, Amit Puntambekar, Ymir Vigfusson, Wyatt Lloyd, and Kai Li. 2017. Popularity Prediction of Facebook Videos for Higher Quality Streaming. In Proceedings of the 2017 USENIX Annual Technical Conference. USENIX.
[31] The Hack Programming Language 2017. http://hacklang.org/. (2017).
[32] Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song. 2016. Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation. USENIX.
[33] Rick Wong, Zhan Chen, Anne Aaron, Megha Manohara, and Darrell Denlinger. 2016. Chelsea: Encoding in the Fast Lane. http://techblog.netflix.com/2016/07/chelsea-encoding-in-fast-lane.html. (2016).
[34] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. 2008. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation. USENIX.
[35] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation. USENIX.


[36] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. In Proceedings of the 24th ACM Symposium on Operating System Principles. ACM.
[37] Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. 2017. Live Video Analytics at Scale with Approximation and Delay-Tolerance. In Proceedings of the 14th USENIX Symposium on Networked Systems Design and Implementation. USENIX.