2017-04-18 scanner platformlab talks/will_crichton.pdfgoal: design a system that makes it easy to...
TRANSCRIPT
Scanner: Efficient Video Analysis at Scale
Will Crichton Alex Poms
Pat Hanrahan Kayvon Fatahalian
Emerging space of video-driven applications
CMU Panoptic studioCapture of 480 640x480 video streams 147 MPixel capture at 24 fps (3.5 GPixel/sec)
[Image Credit: Joo et al. 2015]
3D human pose reconstruction40-second sequence (captured human social interactions) Reconstruction time: hand-coded solution by MS grad student — 7 hours on a 4-GPU machine [Cao 2016]
[Joo et al. 2015]
Example: human pose detection
▪ Attempt #1: do it in Python
▪ Attempt #2: import multiprocessing
▪ Attempt #3: rewrite it in C++
▪ Attempt #4: use multiple GPUs (6 month checkpoint)
▪ Attempt #5: scale across a cluster
Video application challenges
▪ Video is huge! - Requires compression, decode (efficient pixel delivery) - Varying access patterns - 3-25x blowup in size for explode to frames - 1000x vs. raw pixels
▪ Execution requires heterogeneous hardware
▪ Stateful/stenciled computation patterns
No existing system well suited for the job
▪ Distributed data-analytics frameworks [MapReduce, Spark]
▪ Distributed machine learning frameworks [TensorFlow, MxNet]
▪ Array/raster databases [SciDB, RasDaMan, GIS databases]
Goal: design a system that makes it easy to rapidly author video analysis pipelines that scale.
Scanner: design goals / principlesDesign principle 1: keep it simple - Enable non-expert programmers - Focus on video applications
Design principle 2: be efficient - “Near-HW-peak single-node perf” then scale out - Utilize heterogenous hardware
Non goal: - Do not be a new kernel description language - Do not be a general purpose dataflow/streaming engine
Setupmyvideos/cam000.mp4myvideos/cam001.mp4myvideos/cam002.mp4…myvideos/cam479.mp4
I have a list of videos in a directory…
And I have a library of parallel pixel processing kernels for CPUs and GPUs:
Image crop/rescale (Halide)
Optical flow (OpenCV)
Video Tracker (CUDA)
Depth from disparity (Halide)
Eigen (C)
Caffe DNN EvalNVIDIA cuDNN
Face detection network
Human Pose estimation
Object detector
Depth/normal estimator
…
[Cao16]
[Redmon16]
[Bansal17]
[Hu17]
Represent videos as relations (tables)myvideos/cam000.mp4myvideos/cam001.mp4myvideos/cam002.mp4…myvideos/cam479.mp4
Scanner dataset: capture_session
table: cam000.mp4frame_id image
table: cam001.mp4frame_id image
table: cam479.mp4frame_id image
…
table: cam002.mp4frame_id image
Ingest into Scanner…
Computations: DAGs of image processing operations
Pipeline kernels map to heterogeneous resources: CPUs, GPUs, ASICs
......... ...Image Resize DNN Eval Estimate pose
Halide code Caffe lib C++ code
Resource: GPU
Resource: GPU
Resource: 4 CPU cores
640x480 frames
496x368 frames
62x46x15 activations
(x,y) pos for body parts
Scanner maps pipelines onto a stream of video frames from tables
Pipeline:
Input:
Output: pose_000
cam_000id frame
pose_000id 2dpose
taskPipeline: Input: cam_000 [0, 1, 2, 3, ...]Output: pose_000
0 1 2 3 ...
pose_000id 2dpose
task0 1 2 3 ...
pose_000id 2dpose
task0 1 2 3 ...
Sampling/joining input frames
cam_000id frame
pose_000id 2dpose
taskPipeline: Input: cam_000 [0, 2, 4, 6, ...]Output: pose_000
0 2 4 6...
cam_000id frame
depth_0_1id depth
taskPipeline: Input: cam_000 [0, 1, 2, 3, ...]
Output: depth_0_1 cam_001id frame
cam_001 [0, 1, 2, 3, ...]
0 1 2 3 ...
0 1 2 3 ...
Stride, gather, range, join, …
Stride by 2:
Join two tables:
cam_000id frame
pose_000id 2dpose
taskPipeline: Input: cam_000 [0, 2, 4, 6, ...]Output: pose_000
0 2 4 6...
cam_000id frame
depth_0_1id depth
taskPipeline: Input: cam_000 [0, 1, 2, 3, ...]
Output: depth_0_1 cam_001id frame
cam_001 [0, 1, 2, 3, ...]
0 1 2 3 ...
0 1 2 3 ...
Distributing tasks across a cluster
I/O
GPU HW Decoder
Resize Instance 0
DNN Eval Instance 0
Estimate Pose Instance 0
Node 0: multi-core CPU + GPU
GPU 0
Node 1: multi-core CPU + 2 GPUs
GPU 1
GPU 2
I/O
I/O
I/O
I/O
GPU HW Decoder
Resize Instance 1
DNN Eval Instance 1
Estimate Pose Instance 1I/O
I/O
I/O
GPU HW Decoder
Resize Instance 2
DNN Eval Instance 2
Estimate Pose Instance 2
Cluster
cam_000id frame
pose_000id 2dpose
taskPipeline: Input: cam_000 [0, 1, 2, 3, ...]Output: pose_000
0 1 2 3 ...
Demo
Video challenges
Storing video collections efficiently
table: cam_000frame_id frame
H.264 encoded bytestreamcam_000:frames
cam_000::meta0 864b 54 5478b 240 8194b
iframe # offset
Stored RepresentationLogical relation
Background: h264 Encoded Video
H.264 encoded bytestream
Stored Representation
cam_000:frames
Background: h264 Encoded Video
H.264 encoded bytestream
Stored Representation
GOP GOP G.. GOP GOP GOP
cam_000:frames
Background: h264 Encoded Video
H.264 encoded bytestream
Stored Representation
GOP GOP G.. GOP GOP GOPI P P B P
GOP
cam_000:frames
Background: h264 Encoded Video
H.264 encoded bytestream
Stored Representation
GOP GOP G.. GOP GOP GOPI P P B P
GOP
0 864b 54 5478b 240 8194b …. …
iframe # offset
cam_000:frames
cam_000::meta
Maintain keyframe index to assist parallel decode
frame 0 (keyframe)
byte 0
frame 120 (keyframe)
byte 4840
frame 270 (keyframe)
byte 6796
frame 340 (keyframe)
byte 12480
frame 310 (keyframe)
byte 11284
130 131 192 320
Kernels need more than one frame
▪ Goal #1: Temporal extent
▪ Goal #2: Sampling
▪ Goal #3: Batching
▪ Goal #4: Modularity
Two-level hierarchy on stream elements
Pipeline Instance 0
Pipeline Instance 1
span 0 frames 0-1023
span 1 frames 1024-2048
…
batch 0 batch 1
Pipeline batch size: Consecutive elements:
4 1024
Two-level hierarchy on stream elements
Pipeline Instance 0
Pipeline Instance 1
span 0 frames 0-1023
span 1 frames 1024-2048
…
batch 0 batch 1
Pipeline batch size: Consecutive elements:
4 1025
Warmup size = 1
Two-level hierarchy on stream elements
Pipeline Instance 0
Pipeline Instance 1
span 0 frames 0-1023
span 1 frames 1024-2048
…
batch 0 batch 1
Pipeline batch size: Consecutive elements:
4 1024
Warmup size = 1
cam_000id frame
pose_000id 2dpose
taskPipeline: Input: cam_000 [0, 2, 4, 6, ...]Output: pose_000
0 2 4 6...
Stride by 2:
Stenciling patterns Stencil
Stencil & stride
Stride & stencil
Stenciling patterns 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12 1 6 11
1 2 3 4 5 6 7 8 9 10 11 12 1 2 6 7 11 12
Solution: change kernel interface
1 2 3 4 5 6 7 8 9 10 11 12 1 2 6 7 11 12
1 6 6 111 2 3 4 5 6 7 8 9 10 11 12
Evaluation
Scanner has low overheadTh
roug
hput
HistogramCPU GPU
Optical Flow DNN Eval GatherCPU GPU CPU GPUGPU
1
0
ScannerBaseline
Single Node: Throughput Relative to Hand-Tuned Pipelines
Scaling to dense GPU nodesTh
roug
hput
(re
lativ
e to e
xper
t-tun
ed)
HistogramCPU GPU
Optical Flow DNN Eval GatherCPU GPU CPU GPUGPU
1
0
2
3
Scanner (4-GPU)
Multi-node scaling
Num nodes
Thro
ughp
ut (f
ps)
4K
8K
16K
12K
01 2 4 168 Num nodes
150
300
450
0Thro
ughp
ut (f
ps)
1 2 4 168
1.5K
3K
6K
4.5K
0
Num nodes1 2 4 168Num nodes
Thro
ughp
ut (f
ps)
4K
8K
16K
12K
01 2 4 168
Scanner (CPU only)Scanner (GPU only)
Histogram Optical Flow
DNN Eval
3D pose reconstruction
Grad student hand-tuned:Scanner:Scanner on cluster:
2.6 hrs (4 GPU node)38 mins (4 node x 4 GPU/node cluster)
7 hrs (4 GPU node)
Approaching viability for marker less motion capture.
Processing 40 seconds of video from CMU Panoptic studio
Shot segmentation (cinematography analysis)
608 feature length films (2.4 TB) 103M frames Histogram-based shot segmentation of all films: 4.7 hrs (4 node cluster, 4 GPUs/node)
What’s next?
▪ Scale out with Google Cloud
▪ 3 PB Internet Archive dataset
▪ Youtube 8M face database
▪ Esper frontend for Magic Grant
Thank you!https://github.com/scanner-research/scanner