Copyright©2017 NTT Corp. All Rights Reserved.
Akihiro Suda <[email protected]>
NTT Software Innovation Center
Parallelizing CI using Docker Swarm-Mode
Open Source Summit Japan (June 1, 2017) Last update: May 17, 2017
https://github.com/AkihiroSuda
• Software Engineer at NTT Corporation
• Several talks in the FLOSS community
• FOSDEM 2016
• ApacheCon Core North America 2016, etc.
• Docker Moby Project core maintainer
Who am I
Docker transitioned into the Moby Project (April 2017)
A problem in Docker/Moby project: CI is slow
https://jenkins.dockerproject.org/job/Docker-PRs/buildTimeTrend
capture: March 3, 2017
120 min
red valley = test failed immediately
How about other FLOSS projects?
capture: March 3, 2017
• https://builds.apache.org/view/All/job/PreCommit-HDFS-Build/buildTimeTrend
• https://builds.apache.org/view/All/job/PreCommit-YARN-Build/buildTimeTrend
• https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/buildTimeTrend
200 min / 260 min / 240 min
• https://grpc-testing.appspot.com/job/gRPC_pull_requests_linux/buildTimeTrend
550 min
(for ease of visualization, picked up some projects that use Jenkins from various categories)
•Blocker for reviewing/merging patches
•Discourages developers from writing tests
•Discourages developers from enabling additional testing features
• e.g. the race detector (go test -race)
Why does slow CI matter?
Why does slow CI matter?
Poor implementation quality & slow development cycle
•go test -parallel N
•mvn --threads N
•parallel
• ...
Solution: Parallelize CI ?
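The first option above, `go test -parallel N`, caps how many tests that call t.Parallel() run at once inside one test binary. A conceptual model of that cap (not the real `go test` scheduler; `runBounded` and the task list are made up for illustration) is a buffered channel used as a counting semaphore:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runBounded runs all tasks concurrently but allows at most n of them
// to execute at the same time, mimicking the effect of `-parallel N`.
// It returns the peak concurrency actually observed.
func runBounded(tasks []func(), n int) int32 {
	sem := make(chan struct{}, n) // counting semaphore with n slots
	var wg sync.WaitGroup
	var cur, peak int32
	for _, task := range tasks {
		wg.Add(1)
		go func(task func()) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			c := atomic.AddInt32(&cur, 1)
			// record the highest concurrency seen so far
			for {
				p := atomic.LoadInt32(&peak)
				if c <= p || atomic.CompareAndSwapInt32(&peak, p, c) {
					break
				}
			}
			task()
			atomic.AddInt32(&cur, -1)
		}(task)
	}
	wg.Wait()
	return peak
}

func main() {
	tasks := make([]func(), 8)
	for i := range tasks {
		tasks[i] = func() {} // stand-in for a test function
	}
	fmt.Println("peak concurrency:", runBounded(tasks, 4))
}
```

The next slide explains why this kind of intra-process parallelism alone is not enough.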
But just doing parallelization is not enough
•No isolation
• Concurrent test tasks may race for shared resources (e.g. files under `/tmp`, TCP ports, ...)
•Poor scalability
• CPU/RAM resource limitation
• I/O parallelism limitation
Solution: Parallelize CI ?
•Docker provides isolation
•Swarm-mode provides scalability
Solution: Docker (in Swarm-mode)
Distribute across Swarm-mode
Parallelize & Isolate using Docker containers
✓Isolation ✓Scalability
• For ideal isolation, each of the test functions should be encapsulated into independent containers
• But this is not optimal in practice, due to redundant setup/teardown code
Challenge 1: Redundant setup/teardown
func TestMain(m *testing.M) {
setUp()
m.Run()
tearDown()
}
func TestFoo(t *testing.T)
func TestBar(t *testing.T)
redundantly executed:
  container 1: setUp(); testFoo.Run(); tearDown()
  container 2: setUp(); testBar.Run(); tearDown()
• Solution: execute a chunk of multiple test functions sequentially in a single container
Optimization 1: Chunking

Before (one container per test function):
  container 1: setUp(); testFoo.Run(); tearDown()
  container 2: setUp(); testBar.Run(); tearDown()

After (one container per chunk):
  container: setUp(); testFoo.Run(); testBar.Run(); tearDown()
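A chunk can be handed to `go test` by selecting its members with an anchored -run regexp. A minimal sketch of the chunking step (the helper names chunkNames and runRegexp are mine, not from the actual implementation):

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// chunkNames groups the sorted test names into fixed-size chunks; each
// container then runs one chunk sequentially, so setUp()/tearDown() is
// amortized across the whole chunk instead of paid per test function.
func chunkNames(names []string, size int) [][]string {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	var chunks [][]string
	for i := 0; i < len(sorted); i += size {
		end := i + size
		if end > len(sorted) {
			end = len(sorted)
		}
		chunks = append(chunks, sorted[i:end])
	}
	return chunks
}

// runRegexp builds a `go test -run`-style anchored regexp that selects
// exactly the functions in one chunk.
func runRegexp(chunk []string) string {
	escaped := make([]string, len(chunk))
	for i, n := range chunk {
		escaped[i] = regexp.QuoteMeta(n)
	}
	return "^(" + strings.Join(escaped, "|") + ")$"
}

func main() {
	names := []string{"TestBar", "TestFoo", "TestBaz"}
	for _, c := range chunkNames(names, 2) {
		fmt.Println(runRegexp(c))
	}
	// prints:
	// ^(TestBar|TestBaz)$
	// ^(TestFoo)$
}
```

(Docker's integration tests actually go through go-check's test filter rather than plain `go test -run`, but the chunk-to-regexp idea is the same.)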
• The test sequence is executed in lexicographic order (ABC...)
• Observation: long-running test functions concentrate in a certain portion of the sequence
• because they test similar scenarios
Challenge 2: Makespan non-uniformity
Execution order:
TestApple1, TestApple2, TestBanana1, TestBanana2, TestCarrot1, TestCarrot2
Challenge 2: Makespan non-uniformity
Example: Docker itself
[Graph: makespan in seconds (0 to 100) of each test function vs. execution order (0 to 1500); long makespans cluster in DockerSuite.TestBuild* and DockerSwarmSuite.Test*]
Makespans of the chunks are likely to be non-uniform
Challenge 2: Makespan non-uniformity

Sequence: Test1, Test2, Test3, Test4
Chunks for containers:
  container 1: Test1, Test2
  container 2: Test3, Test4
Speed-up over the sequence, but non-uniform makespans (wasted idle time for container 1)
Solution: shuffle the chunks
• No guarantee of an optimal schedule though
Optimization 2: Shuffling

Sequence: Test1, Test2, Test3, Test4
Chunks: {Test1, Test2}, {Test3, Test4}
Chunks (shuffled): the test names are shuffled before chunking, spreading the slow, lexicographically clustered tests across containers
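The shuffling step can be sketched as follows (the function name and explicit seeding are my own choices, not from the actual implementation); sharing a fixed seed makes the chunk order reproducible across runs and retries:

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffleChunks returns a randomly permuted copy of the chunk list.
// Shuffling is only a heuristic: it spreads the slow, clustered tests
// across workers, but gives no guarantee of an optimal schedule.
func shuffleChunks(chunks [][]string, seed int64) [][]string {
	out := append([][]string(nil), chunks...)
	r := rand.New(rand.NewSource(seed))
	r.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	chunks := [][]string{
		{"TestApple1", "TestApple2"},
		{"TestBanana1", "TestBanana2"},
		{"TestCarrot1", "TestCarrot2"},
	}
	for _, c := range shuffleChunks(chunks, 42) {
		fmt.Println(c)
	}
}
```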
• Based on Funker (github.com/bfirsh/funker)
• FaaS-like architecture
• Loads are automatically balanced via Docker's built-in LB
• No explicit task queue
• Could easily be ported to other container orchestrators
Implementation
[Diagram: a client (typically on a laptop) invokes Funker on the master; the master, running in a Docker Swarm-mode cluster (typically on cloud), dispatches calls to worker.1, worker.2, worker.3 through the built-in LB]
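The master/worker exchange can be sketched as follows. This is a hypothetical HTTP-based stand-in, not bfirsh/funker's actual API: each worker replica exposes an endpoint that runs a chunk, and the master posts a chunk and reads back the result. In a Swarm-mode cluster the master would dial the *service name* (e.g. "http://worker:8080") and Docker's built-in virtual-IP load balancer would pick a replica, which is why no explicit task queue is needed; here the worker listens on localhost to keep the sketch self-contained:

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"strings"
)

// startWorker starts a worker replica serving one FaaS-like endpoint.
func startWorker() (addr string, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/run", func(w http.ResponseWriter, r *http.Request) {
		chunk, _ := io.ReadAll(r.Body)
		// A real worker would exec the test binary with this chunk's
		// filter regexp here and stream back the result.
		fmt.Fprintf(w, "ran chunk: %s", chunk)
	})
	go http.Serve(ln, mux)
	return ln.Addr().String(), nil
}

// runChunk is what the master does per chunk: POST it to the worker
// service and collect the result. Under Swarm, addr would be the
// service name and the built-in LB would balance across replicas.
func runChunk(addr, chunk string) (string, error) {
	resp, err := http.Post("http://"+addr+"/run", "text/plain", strings.NewReader(chunk))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	addr, err := startWorker()
	if err != nil {
		panic(err)
	}
	out, err := runChunk(addr, "^(TestFoo|TestBar)$")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints: ran chunk: ^(TestFoo|TestBar)$
}
```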
•Evaluated my hack against the CI of Docker itself
• Of course, this hack is applicable to the CI of other software as well
•Target: Docker 16.04-dev (git:7fb83eb7)
• Contains 1,648 test functions
•Machine: Google Compute Engine, n1-standard-4 instances (4 vCPUs, 15 GB RAM)
Experimental result
Experimental result
• 1 node, traditional testing: 1h22m7s
• 1 node: 15m3s (cost x1, speed-up x5.5)
• 10 nodes: 4m10s (cost x10, speed-up x19.7)
•20 times faster at maximum with 10 nodes
• But effective even with a single node!
Detailed result

Containers running in parallel (chunk size in parentheses):

Nodes   1 (1648)                10 (165)   30 (55)   50 (33)   70 (24)
1       1h22m7s (=traditional)  15m3s      N/A (more than 30m)
2                               12m7s      10m12s    11m25s    13m57s
5                               10m16s     6m18s     5m46s     6m28s
10                              8m26s      4m31s     4m10s     4m20s

• Fastest configuration: 10 nodes, 50 containers: 4m10s (speed-up x19.7, BCR x2.0)
• Best BCR (benefit-cost ratio): 1 node, 10 containers: 15m3s (speed-up x5.5, BCR x5.5)
• Also: 2 nodes, 30 containers: x8.1 (BCR x4.0); 5 nodes, 50 containers: x14.2 (BCR x2.8)

Time: average of 5 runs; graph driver: overlay2
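The speed-up and BCR figures in the table follow from simple arithmetic: speed-up is the single-node sequential time divided by the parallel wall time, and BCR is the speed-up divided by the node count. A small sketch (the helper names are mine) using the fastest configuration, 1h22m7s = 4927 s vs 4m10s = 250 s on 10 nodes:

```go
package main

import "fmt"

// speedup is baseline (sequential) time over parallel wall time.
func speedup(baselineSec, parallelSec float64) float64 {
	return baselineSec / parallelSec
}

// bcr (benefit-cost ratio) is speed-up per node spent.
func bcr(s, nodes float64) float64 {
	return s / nodes
}

func main() {
	s := speedup(4927, 250) // 1h22m7s vs 4m10s
	fmt.Printf("speed-up x%.1f, BCR x%.1f\n", s, bcr(s, 10))
	// prints: speed-up x19.7, BCR x2.0
}
```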
What if no optimization techniques?

Nodes   Parallelize     Parallelize + Chunking   Parallelize + Chunking + Shuffling (= previous slide)
1       more than 30m   14m58s                   15m3s
2                       10m1s                    10m12s
5                       7m32s                    5m46s
10                      6m9s                     4m10s

Chunking: significantly faster; Shuffling: x1.5 faster
See the previous slide for the number of containers running in parallel
Scale-out vs Scale-up?
(with both chunking and shuffling)

            Nodes   Total vCPUs   Total RAM   Containers   Result
Scale-out   10      40            150 GB      50           4m10s
Scale-up    1       40            150 GB      50           19m17s

Scale-out wins
• Better I/O parallelism, less mutex contention, etc.
•Record and use the past execution history to optimize the schedule, rather than just shuffling
•Investigate in more depth why scale-out wins
Future work
PR (merged): https://github.com/docker/docker/pull/29775
The code is available!
$ cd $GOPATH/src/github.com/docker/docker
$ make build-integration-cli-on-swarm
$ ./hack/integration-cli-on-swarm/integration-cli-on-swarm \
    -replicas 50 \
    -shuffle \
    -push-worker-image your-docker-registry/worker:latest
•Yes, with just a few lines of modification
•Planning to split this hack into an independent generic test tool
Is it applicable to other software as well?
•Docker Swarm-mode is effective for parallelizing CI jobs
•Introduced some techniques for optimal scheduling
Conclusion
• 1 node, traditional testing: 1h22m7s
• 1 node: 15m3s (cost x1, speed-up x5.5)
• 10 nodes: 4m10s (cost x10, speed-up x19.7)