


Agile Cold Starts for Scalable Serverless

Anup Mohan¹, Harshad Sane¹, Kshitij Doshi¹, Saikrishna Edupuganti¹, Naren Nayak¹, Vadim Sukhomlinov²

¹Intel Corp., ²Google Inc.

Abstract

The Serverless or Function-as-a-Service (FaaS) model capitalizes on lightweight execution by packaging code and dependencies together for just-in-time dispatch. Often a container environment has to be set up afresh, a condition called "cold start", and in such cases performance suffers and overheads mount, both deteriorating rapidly under high concurrency. Caching and reusing previously employed containers ties up memory and risks information leakage. Latency for cold starts is frequently due to work and wait times in setting up various dependencies, such as initializing networking elements. This paper proposes a solution that pre-crafts such resources and then dynamically reassociates them with baseline containers. Applied to networking, this approach demonstrates an order of magnitude gain in cold starts, negligible memory consumption, and flat startup time under rising concurrency.

1 Introduction

Containerization decouples developers from provisioning of hosting environments, simplifies software solutions, and facilitates frictionless evolution and delivery of services. In comparison with virtual machines (VMs), containers are a resource-efficient means of performing quick executions, and thus provide for highly responsive on-demand staging of solutions at an exceptional scale [17], notably in microservices-oriented and server-agnostic deployments. In particular, the serverless model (also known as Function-as-a-Service or FaaS) permits the developer to focus exclusively on the application logic [13], while freeing up the cloud service provider (CSP) to maximize productive utilization of infrastructure. FaaS has emerged as the natural choice for cloud- and container-based fulfillment [2, 5, 7, 8] as on-demand computation increasingly materializes as event-triggered tasks that arise unpredictably and then run to completion. FaaS introduces new software management considerations that CSPs must weigh [12, 15]; these include security, determinism, and startup overheads of container-based dispatch.

A function's execution must be preceded by the bring-up of its execution environment, which frequently involves a "cold start" wherein the container is started from scratch. Even though containers are several factors faster than VMs, their cold start time can still be a significant fraction of a typical function's execution time and rises sharply with increased concurrency of function triggers [6], as described in Section 3. Concurrent function executions are common, especially during rush hours (e.g., Uber) and with function chaining. Common workarounds to reduce cold starts tend to be costly and resource consuming, as explained in Section 2.

A detailed analysis of the time spent in various stages of a cold start, in Section 4, identifies network creation and initialization as the prime contributor to latency (also shown in [16]); this stage is common to almost all serverless frameworks. The solution proposed in this paper (and prototyped with Apache OpenWhisk) is to (i) pre-create and cache networking endpoints, (ii) bind them to function containers when created, and (iii) salvage them for reuse when function containers are dismantled. This is achieved by using pause containers [1], which are network-ready empty containers readily attachable to other containers, which then execute using the pause container's predefined network configuration. Our solution includes a management system around this concept, the Pause Container Pool Manager (PCPM), which reduces cold start latencies and their concurrency impact by nearly an order of magnitude; for example, at one hundred concurrent containers, the startup time drops by close to 80%, while consuming a negligible amount of system memory. Our solution, described in Section 5 and evaluated in Section 6, is not limited to OpenWhisk. Section 7 captures continuing and future work and concludes. Even though this paper focuses on pre-creating network resources, the idea of identifying FaaS performance bottlenecks and pre-creating or pre-fetching resources can be extended beyond network endpoints, as discussed in Section 7.

2 Related Work

The existing approaches revolve around ensuring that the environment for a function (runtime, dependencies and versions, known memory and network requirements) is pre-arranged for its invocation. One popular approach is pre-warmed containers [10], which saves the cost of launching a new container by pre-creating the container with the needed code and its dependencies and starting it up when invoked. A related approach, warm containers [3, 10], goes further and saves the startup time by caching a running copy of the container with the needed environment between successive executions; this approach usually ties up more resources. Both techniques are limited to using the containers for the specific environments for which they are created; the consequence is either squandering memory, swap space, and similar resources by overprovisioning, or degrading scalability and predictability of response time by undersubscribing them. Even slight under-provisioning can provoke cold-start bursts when arrival patterns shift suddenly, producing large startup times. A heuristic, periodic warming [4], consists of submitting dummy requests every so often to induce a CSP to keep containers warm; non-deterministic and concurrent invocations are not effectively mitigated by this technique. Akkus et al. [11] reduce container counts by running multiple functions from the same user inside a container, which is still prone to cold starts as user counts rise.

Figure 1: Impact of concurrent cold starts. (a) OpenWhisk: total time for different numbers of concurrent cold starts; total time increases with concurrency. (b) AWS Lambda: total time under concurrent cold starts; total time increases with concurrency, and Lambda employs additional virtual machines (VMs) to mitigate this problem. Multiple instances of the same function are used.

3 Impact of Concurrent Cold Starts

This section describes the cold start issue and its measured impact in detail. First, we show the cold start symptoms through experimentation using the open-source Apache OpenWhisk [8] framework. Here we trigger incrementally larger numbers of concurrent cold starts and use custom instrumentation to generate a latency breakdown of execution. Then, we confirm similar effects (but of different magnitudes) on AWS Lambda [2]. For both frameworks, we employ a simple, short-running, ALU-intensive function (i.e., light on caches and memory bandwidth) as the test case; this function calculates prime numbers within a fixed range using the sieve method [9]. Note that the cold start issue is independent of the function and related to the container startup required to run the function.
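For reference, the test function is conceptually as simple as the following Python sketch; the bound and the handler signature are illustrative assumptions, not the exact code used in the measurements.

    # Minimal sketch of a sieve-of-Eratosthenes FaaS handler.
    # The limit and the handler signature are illustrative only.
    def handler(event=None, context=None, limit=100000):
        sieve = [True] * (limit + 1)
        sieve[0] = sieve[1] = False
        for i in range(2, int(limit ** 0.5) + 1):
            if sieve[i]:
                for j in range(i * i, limit + 1, i):
                    sieve[j] = False
        primes = [i for i, is_prime in enumerate(sieve) if is_prime]
        return {"count": len(primes), "largest": primes[-1]}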

3.1 Apache OpenWhisk

Our Apache OpenWhisk setup uses a single Intel Xeon(R) Platinum 8180 server to both host the framework and execute functions. Since OpenWhisk is built upon Docker, we have used Docker's logging and eventing capabilities to trace the life of a function and generate a latency distribution. In Fig. 1(a), we plot the 99th percentile time attributed to various phases under different numbers of concurrent cold starts. For simplicity, in Fig. 1(a) we divide the vertical bars into three parts: (i) startup time, which is the wall-clock time to create and initialize the Docker container, (ii) run time, which is the elapsed time to execute the function (i.e., compute prime numbers), and (iii) the remaining time, which includes the framework's management overheads such as identifying and meeting various module dependencies, load balancing, user authentication, etc. We cross-verify the startup time measurements using Docker logs generated by the Docker daemon. The total time to execute functions increases dramatically with concurrency even though the server has sufficient cores to run these activities independently. More than 90% of this total time is spent in the startup time, indicating a scaling bottleneck. The issue is not with the OpenWhisk architecture but with the container network setup in the kernel, as explained in Section 4.
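As an illustration of this kind of tracing (the actual instrumentation is not reproduced here), a minimal sketch can stream the Docker daemon's event feed and derive per-phase gaps from event timestamps, assuming Docker's JSON event format (Type, Action, Actor, timeNano) emitted by docker events:

    import json
    import subprocess
    from collections import defaultdict

    # Stream Docker daemon events and collect per-object timestamps for the
    # lifecycle stages discussed later (create, network connect, start, die).
    proc = subprocess.Popen(
        ["docker", "events", "--format", "{{json .}}"],
        stdout=subprocess.PIPE, text=True)

    timeline = defaultdict(dict)  # object id -> {"type:action": time in seconds}
    for line in proc.stdout:
        ev = json.loads(line)
        key = ev.get("Actor", {}).get("ID", "")[:12]
        stamp = ev.get("timeNano", 0) / 1e9
        timeline[key]["%s:%s" % (ev.get("Type"), ev.get("Action"))] = stamp
        # When a container dies, report its startup (create -> start) duration.
        if ev.get("Type") == "container" and ev.get("Action") == "die":
            t = timeline[key]
            if "container:create" in t and "container:start" in t:
                print(key, "startup (s):",
                      t["container:start"] - t["container:create"])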

3.2 AWS Lambda

To ensure we are not merely confronting some Apache OpenWhisk artifact, we also characterized execution on AWS Lambda. Fig. 1(b) shows the total time to execute Lambda functions for concurrent function triggers. We use the approach described by Wang et al. [18] to determine when new virtual machines (VMs) and containers are used by Lambda to execute functions. Fig. 1(b) shows that the total 99th percentile time increases with concurrency, though at a gradual rate, aided by the multiple VMs used for concurrent function execution. The total time, however, increases significantly with increasing numbers of cold containers within the same VM. As reported by Wang et al. [18], the VMs used for Lambda are approximately 3 GB in size; thus, Lambda incurs a significant memory outlay to mitigate the concurrent cold start scaling bottleneck.


Figure 2: Container and network namespace creation. Left: the Docker container lifecycle, from trigger through service invocation, startup (container create, network create, network connect), run time (container start, execution), and cleanup (container kill, container die, network disconnect, container destroy). Right: the network setup steps, creating namespace NS, creating a veth pair (vethA, vethB), adding vethB to namespace NS, bringing vethA up and enslaving it to docker0 (the default bridge), and bringing vethB up after adding an IP address.

Figure 3: Timeline of 50 concurrent functions, with alternate containers shown. Startup time accounts for 90% of total time.


4 Role of Network in Cold Starts

Fig. 2, left, depicts the lifecycle of a Docker container (the illustration also applies to other container types). It shows that a container goes through four major stages: (i) service invocation, in which a function trigger reaches the Docker daemon, (ii) startup, which consists of creating a container, setting up its network, and connecting it to the network, (iii) run time, in which the function is executed, and (iv) cleanup, which includes stopping the container, disconnecting its network, and destroying it. The service invocation, startup, and run time fall directly in the critical path of the function's execution. The cleanup step can fall indirectly in the critical path of other functions, as it demands cycles from the Docker daemon.

To understand the scaling bottleneck, we arbitrarily pick the 50-concurrent-execution case from Fig. 1(a) and break down the total execution time of each function into the four steps listed above. Fig. 3 shows a Gantt chart of their execution timelines, which highlights the growth in startup time as the major factor limiting the scaling of performance. The increase in the comparatively modest service invocation time comes from the Docker daemon taking more time to process the concurrent requests. The run time remains steady across all the containers. The cleanup time is the second-highest contributor to the full execution time. The service invocation, startup, and cleanup are overheads for the FaaS provider.

Referring back to Fig. 2, the startup time is further broken down into container create, network create, and network connect times with the help of Docker's event tracing mechanism. Our analysis finds that the network create and network connect steps account for 90% of the startup time. To drill further into these network tasks, we emulate network namespace creation within the Linux kernel by using the Linux ip command to create and initialize the network in a manner similar to Docker. The steps, shown in Fig. 2, right, include creating a new network namespace, creating virtual Ethernet (veth) pairs, assigning an IP address, and enabling the connections. We create and initialize multiple networks concurrently to emulate the launching of multiple Docker containers. Table 1 shows that the time to create and initialize network namespaces increases with concurrency, and this is the main reason for the super-linear increase in startup time. The cleanup of a container includes deleting the namespace and veth pairs, and Table 1 shows that this too increases significantly with concurrency. A recent study [16] corroborates our findings and identifies that the scalability bottleneck of namespace creation is due to a single global lock. The same study [16] also explains that namespace cleanup happens in batches and is less time consuming than namespace creation. Virtually all serverless frameworks require creating network namespaces frequently, and therefore face this performance hurdle as of Linux kernel version 4.4.0-116 (Ubuntu).

Table 1: Network Namespace Creation and Removal Bottleneck under Concurrency

#Concurrent Namespaces   1      10     50     100
Create Time (s)          0.28   1.27   6.28   14.41
Cleanup Time (s)         0.20   0.71   3.24   7.77
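A minimal sketch of this emulation is shown below, assuming root privileges, an existing docker0 bridge, and an unused 172.17.1.x address range (all illustrative assumptions); it mirrors the steps on the right of Fig. 2 and times concurrent creation and cleanup as in Table 1.

    import subprocess
    import time
    from concurrent.futures import ThreadPoolExecutor

    def sh(cmd):
        subprocess.run(cmd, shell=True, check=True)

    # Emulate Docker's per-container network setup (Fig. 2, right) for one namespace.
    def create_ns(i):
        sh(f"ip netns add ns{i}")
        sh(f"ip link add vethA{i} type veth peer name vethB{i}")
        sh(f"ip link set vethB{i} netns ns{i}")          # move one end into the namespace
        sh(f"ip link set vethA{i} up")
        sh(f"ip link set vethA{i} master docker0")       # enslave host end to the bridge
        sh(f"ip netns exec ns{i} ip addr add 172.17.1.{i + 10}/16 dev vethB{i}")
        sh(f"ip netns exec ns{i} ip link set vethB{i} up")

    def delete_ns(i):
        sh(f"ip link delete vethA{i}")                   # removes the veth pair
        sh(f"ip netns delete ns{i}")

    if __name__ == "__main__":
        n = 50
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:  # concurrent creation
            list(pool.map(create_ns, range(n)))
        print("create time (s):", time.perf_counter() - start)
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:  # concurrent cleanup
            list(pool.map(delete_ns, range(n)))
        print("cleanup time (s):", time.perf_counter() - start)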

Figure 4: Container lifecycle under pause containers. The time-consuming network create and network connect steps are bypassed: startup consists of container create plus attaching to a pause container, and cleanup releases the pause container for reuse.


Figure 5: Different phases of the Pause Container Pool Manager (PCPM): (a) build phase, in which the invoker launches pause containers PC 1 through PC N into a free pool; (b) execution phase, in which a function container is attached to a pause container taken from the pool; (c) completion phase, in which the pause container is returned to the pool.

5 Pause Container Pool Manager (PCPM)

This section describes the solution this paper takes. It consists of pre-creating networks and connecting them to function containers, so that the time otherwise spent in the network create and network connect steps (Fig. 2) is removed from the critical path of startup. For pre-creation and initialization of networks, we borrow from Kubernetes [14] the concept of pause containers [1] and use it as explained in Section 5.1. We also develop a pool manager for connecting the appropriate function to a pause container and managing its lifecycle. We prototype the complete solution, termed PCPM, for evaluation under Apache OpenWhisk.

5.1 Network Pre-creation

A pause container (PC) is created ahead of time, with its initialization paused after the network creation step, at which point an IP address has been assigned to it. A normal Docker (or any other) container is subsequently attached to a PC when needed, thus effectively sharing its pre-established network namespace. Since PCs are part of the (Docker) network, any container attached to a PC is also part of the same network and can communicate with other containers that are part of that network.
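The attachment itself can be expressed with Docker's container network mode. A minimal sketch, assuming the Kubernetes pause image and an illustrative function image, is:

    import subprocess

    def sh(cmd):
        return subprocess.run(cmd, shell=True, check=True,
                              capture_output=True, text=True).stdout

    # 1. Pre-create a pause container (PC): a tiny container whose job is to hold
    #    a network namespace and an IP address. The image name is illustrative.
    sh("docker run -d --name pc1 k8s.gcr.io/pause:3.1")

    # 2. Later, a function container joins the PC's pre-created network namespace
    #    instead of creating its own, skipping the network create/connect steps.
    sh("docker run --rm --network container:pc1 python:3 "
       "python -c \"print('attached to the pre-created network')\"")

    # The IP assigned to the PC at pre-creation time is what the function uses.
    print(sh("docker inspect -f '{{.NetworkSettings.IPAddress}}' pc1").strip())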

Our solution incorporates a pool management system, the pause container pool manager (PCPM). The PCPM creates PCs and places them in a pool. When needed, they are taken from the pool and bound to function containers; when those application containers are ready to terminate, the PCs are detached and placed back in the pool. Fig. 4 shows the resulting new cold start process: the time-consuming network-create and network-connect steps of Fig. 2 are replaced by the simple step of attaching the newly created execution container to a PC taken from the pool.

When the function container terminates, the PC is simply detached and put back in the PC pool for reuse. The PCs are environment agnostic and can be used by any application container; thus, a PC used at one time by a Python function can later be reused for a Node.js function. Further, except for the network ID, the PCs are stateless, needing just a few kilobytes each, and they do not become conduits of any information between a previous attachment and a new one.

5.2 Pool Management

As just recounted, the PCPM manages a pool of PCs; it does so in three key phases. The first phase is the build phase, during which the FaaS framework is set up and initialized. Fig. 5(a) shows the build phase of a generic container-based framework. The module responsible for launching containers and executing tasks is referred to as the Invoker in OpenWhisk. As shown in Fig. 5(a), during the build phase a number of pause containers are launched. At this time, PCPM initializes a free pool with the identifiers corresponding to the PCs. The invoker has knowledge of the PCs through the PCPM.

Fig. 5(b) shows the second phase, namely the execution phase. During this phase, the invoker queries the pool manager and obtains the identifier of an available PC (e.g., PC 1 in Fig. 5(b)). Using this identifier, the invoker attaches the newly launched container to the corresponding PC (i.e., PC 1) and removes it from the free pool.

Fig. 5(c) depicts the third phase, namely the completion phase, when a function has completed and its execution container is terminated or otherwise recycled. At this point, the invoker contacts the pool manager, and the pool manager reclaims PC 1 and inserts the identifier back into the free pool. The point of insertion does not matter, as PCs have no data retention and therefore no locality benefits.

It is beneficial to keep a large number of PCs in the pool to support function execution at scale, and this is reasonable given their small memory footprint. The Linux kernel limits the number of veth pairs on a single network bridge (1024), but this is easily mitigated by creating multiple network bridges and linking them together to support a large number (>> 1024) of PCs. In the absence of free PCs, the default startup process is followed. A requirement for using PCs is that the network configuration of the function container match that of the PC; since OpenWhisk function containers are expected to have port 8080 exposed, the PCs expose port 8080 as well. Non-standard network requirements are unlikely, since common FaaS frameworks may not support them, but they are easy to accommodate when they arise by creating multiple varieties of PC pools under PCPM. Even though we built PCPM for evaluation within OpenWhisk, it can be implemented for any other FaaS framework, as the basic concept of pre-creating and initializing networks is common across frameworks.
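The following is a minimal single-host sketch of such a pool manager, not the OpenWhisk-integrated prototype; the pause image, pool size, and fallback behavior are illustrative assumptions. It walks through the build, execution, and completion phases of Fig. 5.

    import queue
    import subprocess

    def sh(cmd):
        return subprocess.run(cmd, shell=True, check=True,
                              capture_output=True, text=True).stdout.strip()

    class PCPM:
        """Minimal pause-container pool manager sketch (single host)."""

        def __init__(self, pool_size=10, pause_image="k8s.gcr.io/pause:3.1"):
            self.free = queue.Queue()
            # Build phase: pre-create pause containers, each owning a network namespace.
            for i in range(pool_size):
                name = f"pc{i}"
                sh(f"docker run -d --name {name} {pause_image}")
                self.free.put(name)

        def run_function(self, image, command=""):
            # Execution phase: bind a new function container to a free PC, if any.
            try:
                pc = self.free.get_nowait()
                net = f"--network container:{pc}"
            except queue.Empty:
                pc, net = None, ""   # no free PC: fall back to the default startup path
            try:
                return sh(f"docker run --rm {net} {image} {command}")
            finally:
                # Completion phase: return the PC to the pool for reuse.
                if pc is not None:
                    self.free.put(pc)

    if __name__ == "__main__":
        mgr = PCPM(pool_size=2)
        print(mgr.run_function("python:3",
                               "python -c \"print('warm network, cold container')\""))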


Figure 6: Reduction in cold start time with PCPM versus the baseline, for 1 to 100 concurrent cold functions. PCPM reduces the cold start execution time by up to 80%.

Figure 7: Timeline for 50 concurrent functions with PCPM, with alternate containers shown (cf. Fig. 3).


6 Results

We evaluate concurrent executions for our PCPM-based OpenWhisk, using unmodified OpenWhisk as the baseline. In this section we also compare with the pre-warm and warm container mitigations and show up to 80% savings in execution time (relative to cold starts) and memory savings of several orders of magnitude relative to pre-warm/warm containers.

6.1 Comparison with OpenWhisk

Fig. 6 compares PCPM-based execution time with that of the baseline, for concurrency levels ranging from one to one hundred. At fifty concurrent cold starts, execution time is cut by 75%; this improves further to a cut of 80% at one hundred concurrent cold starts. Fig. 7 shows a Gantt chart for fifty cold-start executions under PCPM-OpenWhisk for comparison with the baseline in Fig. 3. Note that the cleanup time is also reduced, as PCPM avoids deleting network namespaces and veth pairs during terminations. Work is in progress to identify the next-level bottlenecks and reduce the service invocation time.

6.2 Comparison with Pre-warm and Warm Containers

Table 2: PCPM vs Cold, Pre-warm, and Warm Containers

Container Type   Environment Dependency   Function Dependency   Mem Usage (MB) (50 Containers)   Total Time (s) (50 Functions)
Cold             NO                       NO                    0                                20.01
Warm             YES                      YES                   1600                             0.78
Pre-Warm         YES                      NO                    1500                             1.05
PCPM             NO                       NO                    2                                4.84

Table 2 compares PCPM-based cold starts with baseline executions that are pre-warmed, warmed, or neither. Cold container execution has no memory pre-committed (column 4). Warm execution caches containers and code dependencies, and thus has the highest memory footprint. The pre-warm case has the execution environment (e.g., Python) ready but must bring in the code dependencies, also causing a high footprint. The memory footprint for the PCPM-based execution is negligible. For our simple prime number function, the footprint and time of warm containers compare well with those of pre-warm containers. The gap between the PCPM and pre-warm times, while much smaller than for cold starts, reflects the dominance of the process setup work over the actual work in the function itself. However, for more complex work (e.g., neural network inference), one should expect significantly larger memory footprints but only modest time reductions for warm containers over pre-warm containers. Similarly, the gap in memory footprint between PCPM-based execution and pre-warm or warm containers can be expected to widen significantly. For such complex actions, the relatively small contribution of process setup should make the total time comparable between PCPM-based and pre-warmed activations.

7 Summary and Discussion

This paper presented the idea of pre-creating resources to improve FaaS performance. We focused on the container cold start issue and identified network creation and initialization as the major reason for the performance bottleneck. The proposed solution pre-creates networks (instead of pre-creating containers) and seamlessly attaches the pre-created networks to function containers. To pre-create networks, this paper uses pause containers, and to govern their lifecycle, it creates a pause container pool manager (PCPM) system. Evaluation (using an OpenWhisk-based prototype) demonstrates up to 80% reduction in execution time compared with cold containers, and several orders of magnitude reduction in memory footprint with a modest drop in performance compared with pre-warmed containers. This solution does not exclude the use of current approaches in which function containers may themselves be pre-created and/or cached for reuse; however, it makes it significantly less necessary to adopt such measures and tie down memory or clamp trigger rates.

With PCPM reusing pause containers, and thereby IP addresses, we would like to add multiple security and isolation policies over network namespaces to strengthen the security aspects of PCPM. The idea of pre-creating or pre-fetching resources can be extended beyond network entities. Some opportunities include pre-creating components of function environments and pre-fetching code dependencies with the help of the container orchestrator, to make FaaS agile with a minimal memory footprint.


References

[1] The almighty pause container. https://www.ianlewis.org/en/almighty-pause-container.

[2] AWS Lambda. http://aws.amazon.com/lambda/.

[3] Cold start / warm start with AWS Lambda. https://blog.octo.com/en/cold-start-warm-start-with-aws-lambda/.

[4] Dealing with cold starts in AWS Lambda. https://medium.com/thundra/dealing-with-cold-starts-in-aws-lambda-a5e3aa8f532.

[5] Google Cloud Functions. https://cloud.google.com/functions/.

[6] I'm afraid you're thinking about AWS Lambda cold starts all wrong. https://hackernoon.com/im-afraid-you-re-thinking-about-aws-lambda-cold-starts-all-wrong-7d907f278a4f.

[7] Microsoft Azure Functions. https://azure.microsoft.com/en-us/services/functions/.

[8] OpenWhisk. https://github.com/apache/incubator-openwhisk.

[9] Sieve of Eratosthenes. https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes.

[10] Squeezing the milliseconds: How to make serverless platforms blazing fast! https://medium.com/openwhisk/squeezing-the-milliseconds-how-to-make-serverless-platforms-blazing-fast-aea0e9951bd0.

[11] Istemi Ekin Akkus et al. SAND: Towards high-performance serverless computing. In USENIX Annual Technical Conference, pages 923–935, 2018.

[12] Ioana Baldini et al. Serverless computing: Current trends and open problems. In Research Advances in Cloud Computing, pages 1–20. Springer, 2017.

[13] Rajkumar Buyya et al. A manifesto for future generation cloud computing: Research directions for the next decade. ACM Computing Surveys (CSUR), 51(5):105, 2018.

[14] Kubernetes: Production-grade container orchestration. https://kubernetes.io/.

[15] Wes Lloyd et al. Serverless computing: An investigation of factors influencing microservice performance. In IEEE International Conference on Cloud Engineering (IC2E), pages 159–169. IEEE, 2018.

[16] Edward Oakes et al. SOCK: Rapid task provisioning with serverless-optimized containers. In USENIX Annual Technical Conference, pages 57–70, 2018.

[17] Stephen Soltesz et al. Container-based operating system virtualization: A scalable, high-performance alternative to hypervisors. In ACM SIGOPS Operating Systems Review, volume 41, pages 275–287. ACM, 2007.

[18] Liang Wang et al. Peeking behind the curtains of serverless platforms. In USENIX Annual Technical Conference, pages 133–146, 2018.