2017-04-18 scanner platformlab talks/will_crichton.pdfgoal: design a system that makes it easy to...

Scanner: Efficient Video Analysis at Scale

Will Crichton Alex Poms

Pat Hanrahan Kayvon Fatahalian

Emerging space of video-driven applications

CMU Panoptic studioCapture of 480 640x480 video streams 147 MPixel capture at 24 fps (3.5 GPixel/sec)

[Image Credit: Joo et al. 2015]

3D human pose reconstruction40-second sequence (captured human social interactions) Reconstruction time: hand-coded solution by MS grad student — 7 hours on a 4-GPU machine [Cao 2016]

[Joo et al. 2015]

Example: human pose detection

▪ Attempt #1: do it in Python

▪ Attempt #2: import multiprocessing

▪ Attempt #3: rewrite it in C++

▪ Attempt #4: use multiple GPUs (6 month checkpoint)

▪ Attempt #5: scale across a cluster

Video application challenges

▪ Video is huge! - Requires compression, decode (efficient pixel delivery) - Varying access patterns - 3-25x blowup in size for explode to frames - 1000x vs. raw pixels

▪ Execution requires heterogeneous hardware

▪ Stateful/stenciled computation patterns

No existing system well suited for the job

▪ Distributed data-analytics frameworks [MapReduce, Spark]

▪ Distributed machine learning frameworks [TensorFlow, MxNet]

▪ Array/raster databases [SciDB, RasDaMan, GIS databases]

Goal: design a system that makes it easy to rapidly author video analysis pipelines that scale.

Scanner: design goals / principlesDesign principle 1: keep it simple - Enable non-expert programmers - Focus on video applications

Design principle 2: be efficient - “Near-HW-peak single-node perf” then scale out - Utilize heterogenous hardware

Non goal: - Do not be a new kernel description language - Do not be a general purpose dataflow/streaming engine

Setupmyvideos/cam000.mp4myvideos/cam001.mp4myvideos/cam002.mp4…myvideos/cam479.mp4

I have a list of videos in a directory…

And I have a library of parallel pixel processing kernels for CPUs and GPUs:

Image crop/rescale (Halide)

Optical flow (OpenCV)

Video Tracker (CUDA)

Depth from disparity (Halide)

Eigen (C)

Caffe DNN EvalNVIDIA cuDNN

Face detection network

Human Pose estimation

Object detector

Depth/normal estimator

…

[Cao16]

[Redmon16]

[Bansal17]

[Hu17]

Represent videos as relations (tables)myvideos/cam000.mp4myvideos/cam001.mp4myvideos/cam002.mp4…myvideos/cam479.mp4

Scanner dataset: capture_session

table: cam000.mp4frame_id image



…


Ingest into Scanner…

Computations: DAGs of image processing operations

Pipeline kernels map to heterogeneous resources: CPUs, GPUs, ASICs

......... ...Image Resize DNN Eval Estimate pose

Halide code Caffe lib C++ code

Resource: GPU

Resource: GPU

Resource: 4 CPU cores

640x480 frames

496x368 frames

62x46x15 activations

(x,y) pos for body parts

Scanner maps pipelines onto a stream of video frames from tables

Pipeline:

Input:

Output: pose_000

cam_000id frame

pose_000id 2dpose

taskPipeline: Input: cam_000 [0, 1, 2, 3, ...]Output: pose_000

0 1 2 3 ...

pose_000id 2dpose

task0 1 2 3 ...

pose_000id 2dpose

task0 1 2 3 ...

Sampling/joining input frames

cam_000id frame

pose_000id 2dpose


0 2 4 6...

cam_000id frame

depth_0_1id depth

taskPipeline: Input: cam_000 [0, 1, 2, 3, ...]

Output: depth_0_1 cam_001id frame

cam_001 [0, 1, 2, 3, ...]

0 1 2 3 ...

0 1 2 3 ...

Stride, gather, range, join, …

Stride by 2:

Join two tables:

cam_000id frame

pose_000id 2dpose


0 2 4 6...

cam_000id frame

depth_0_1id depth

taskPipeline: Input: cam_000 [0, 1, 2, 3, ...]

Output: depth_0_1 cam_001id frame

cam_001 [0, 1, 2, 3, ...]

0 1 2 3 ...

0 1 2 3 ...

Distributing tasks across a cluster

I/O

GPU HW Decoder

Resize Instance 0

DNN Eval Instance 0

Estimate Pose Instance 0

Node 0: multi-core CPU + GPU

GPU 0

Node 1: multi-core CPU + 2 GPUs

GPU 1

GPU 2

I/O

I/O

I/O

I/O

GPU HW Decoder

Resize Instance 1

DNN Eval Instance 1

Estimate Pose Instance 1I/O

I/O

I/O

GPU HW Decoder

Resize Instance 2

DNN Eval Instance 2

Estimate Pose Instance 2

Cluster

cam_000id frame

pose_000id 2dpose


0 1 2 3 ...

Video challenges

Storing video collections efficiently

table: cam_000frame_id frame

H.264 encoded bytestreamcam_000:frames

cam_000::meta0 864b 54 5478b 240 8194b

iframe # offset

Stored RepresentationLogical relation

Background: h264 Encoded Video

H.264 encoded bytestream

Stored Representation

cam_000:frames




GOP GOP G.. GOP GOP GOP

cam_000:frames




GOP GOP G.. GOP GOP GOPI P P B P

GOP

cam_000:frames




GOP GOP G.. GOP GOP GOPI P P B P

GOP

0 864b 54 5478b 240 8194b …. …

iframe # offset

cam_000:frames

cam_000::meta

Maintain keyframe index to assist parallel decode

frame 0 (keyframe)

byte 0

frame 120 (keyframe)

byte 4840


byte 6796


byte 12480


byte 11284

130 131 192 320

Kernels need more than one frame

▪ Goal #1: Temporal extent

▪ Goal #2: Sampling

▪ Goal #3: Batching

▪ Goal #4: Modularity

Two-level hierarchy on stream elements

Pipeline Instance 0

Pipeline Instance 1

span 0 frames 0-1023


…

batch 0 batch 1

Pipeline batch size: Consecutive elements:

4 1024


Pipeline Instance 0

Pipeline Instance 1



…

batch 0 batch 1


4 1025

Warmup size = 1


Pipeline Instance 0

Pipeline Instance 1



…

batch 0 batch 1


4 1024

Warmup size = 1

cam_000id frame

pose_000id 2dpose


0 2 4 6...

Stride by 2:

Stenciling patterns Stencil

Stencil & stride

Stride & stencil

Stenciling patterns 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12

1 2 3 4 5 6 7 8 9 10 11 12 1 6 11

1 2 3 4 5 6 7 8 9 10 11 12 1 2 6 7 11 12

Solution: change kernel interface

1 2 3 4 5 6 7 8 9 10 11 12 1 2 6 7 11 12

1 6 6 111 2 3 4 5 6 7 8 9 10 11 12

Evaluation

Scanner has low overheadTh

roug

hput

HistogramCPU GPU

Optical Flow DNN Eval GatherCPU GPU CPU GPUGPU

1

0

ScannerBaseline

Single Node: Throughput Relative to Hand-Tuned Pipelines

Scaling to dense GPU nodesTh

roug

hput

(re

lativ

e to e

xper

t-tun

ed)

HistogramCPU GPU

Optical Flow DNN Eval GatherCPU GPU CPU GPUGPU

1

0

2

3

Scanner (4-GPU)

Multi-node scaling

Num nodes

Thro

ughp

ut (f

ps)

4K

8K

16K

12K

01 2 4 168 Num nodes

150

300

450

0Thro

ughp

ut (f

ps)

1 2 4 168

1.5K

3K

6K

4.5K

0

Num nodes1 2 4 168Num nodes

Thro

ughp

ut (f

ps)

4K

8K

16K

12K

01 2 4 168

Scanner (CPU only)Scanner (GPU only)

Histogram Optical Flow

DNN Eval

3D pose reconstruction

Grad student hand-tuned:Scanner:Scanner on cluster:

2.6 hrs (4 GPU node)38 mins (4 node x 4 GPU/node cluster)

7 hrs (4 GPU node)

Approaching viability for marker less motion capture.

Processing 40 seconds of video from CMU Panoptic studio

Shot segmentation (cinematography analysis)

608 feature length films (2.4 TB) 103M frames Histogram-based shot segmentation of all films: 4.7 hrs (4 node cluster, 4 GPUs/node)

What’s next?

▪ Scale out with Google Cloud

▪ 3 PB Internet Archive dataset

▪ Youtube 8M face database

▪ Esper frontend for Magic Grant

Thank you!https://github.com/scanner-research/scanner

http://github.com/scanner-research/scanner

2017-04-18 scanner platformlab talks/will_crichton.pdfgoal: design a system that makes it easy to...

Documents