empowering live perception with mediapipe · 2019. 10. 30. · ming guang yong, google research...
TRANSCRIPT
Proprietary + Confidential
Ming Guang Yong, Google Research [link to presentation]
mediapipe.dev
MediaPipe
Empowering Live Perception with
Proprietary + Confidential
Outline
● Introduction
● Example Pipeline for Live Perception
● MediaPipe Toolkit
Proprietary + Confidential Proprietary + Confidential
Widely used at Google in research & products to process
and analyze video, audio and sensor data:
● Dataset preparation pipelines for ML training
● ML inference pipelines
● Media processing pipelines
MediaPipe is Google's open source
cross-platform framework for building perception pipelines
Introduction
Proprietary + Confidential Proprietary + Confidential
AR:
YouTube, ARCore, Duo
Mobile:
Visual Search, Lens
Cross-platform:
Motion Photos
Server:
Video Previews
Android
Server
Proprietary + Confidential Proprietary + Confidential
MediaPipe OSS
● Initial release to GitHub in June,
2019
● 3200+ stars on GitHub to date
● Platforms: Mobile/Desktop
○ Linux (C++ Native APIs)
○ iOS (Obj C APIs)
○ Android (Java APIs)
● ML Acceleration
○ EdgeTPU
○ Mobile GPU
○ Desktop GPU (mesa libraries)
mediapipe.dev (github.com/google/mediapipe)
Proprietary + Confidential Proprietary + Confidential
docs.mediapipe.dev viz.mediapipe.dev
Proprietary + Confidential Proprietary + Confidential
Mobile Examples
Proprietary + Confidential Proprietary + Confidential
Live Perception Example
Proprietary + Confidential Proprietary + Confidential
Goal: Hand Tracking
Proprietary + Confidential Proprietary + Confidential
ML Model to Localize Hand Landmarks
Model
Image/Video Landmarks
There's more to it
Proprietary + Confidential Proprietary + Confidential
Simple Perception Pipeline Video
Video
ImageTransform
IMAGE
TRANSFORMED_IMAGE
ImageToTensor
IMAGE
TENSOR
Inference
TENSORS
TENSORS
TensorToLandmar
ks
TENSORS
LANDMARKS
Renderer RENDERED_IMAGE
LANDMARKS IMAGE
Geometric transformation to resize and rotate etc
For a given ML model, runs ML inference
Renders landmarks onto associated images
Model
Video input, e.g., from live camera viewfinder
Decodes tensors into high-level metadata
Video output, e.g., display
Converts images to tensors
Proprietary + Confidential Proprietary + Confidential
Pipeline via MediaPipe Video
Video
ImageTransform
IMAGE
TRANSFORMED_IMAGE
ImageToTensor
IMAGE
TENSOR
Inference
TENSORS
TENSORS
TensorToLandmar
ks
TENSORS
LANDMARKS
Renderer RENDERED_IMAGE
LANDMARKS IMAGE
Model
A MediaPipe Graph represents a perception
pipeline
Each node in the pipeline is a MediaPipe Calculator
Two nodes can be connected by a Stream, which
carries a sequence of Packets with ascending
timestamps
Proprietary + Confidential Proprietary + Confidential
Calculator Video
Video
ImageTransform
IMAGE
TRANSFORMED_IMAGE
ImageToTensor
IMAGE
TENSOR
Inference
TENSORS
TENSORS
TensorToLandmar
ks
TENSORS
LANDMARKS
Renderer RENDERED_IMAGE
LANDMARKS IMAGE
Model
● Written in C++
● Declares input/output ports
○ Type of packet payload through the port
○ Ports can be declared as optional
● Implements Open/Process/Close methods, called by
framework
○ Open - Before a full graph run
○ Process - Repeatedly with incoming packet(s)
ready to be processed
○ Close - After a full graph run
Proprietary + Confidential Proprietary + Confidential
Built-in Calculators Video
Video
ImageTransform
IMAGE
TRANSFORMED_IMAGE
ImageToTensor
IMAGE
TENSOR
Inference
TENSORS
TENSORS
TensorToLandmar
ks
TENSORS
LANDMARKS
Renderer RENDERED_IMAGE
LANDMARKS IMAGE
Model
Family of image and media
processing calculators
Native integration with TensorFlow and TF Lite
for ML inference
Family of ML post-processing calculators for
common ML tasks, e.g., detection, segmentation
and classification
Family of utility calculators for, e.g., image
annotation, flow control
Proprietary + Confidential Proprietary + Confidential
Synchronization Video
Video
ImageTransform
IMAGE
TRANSFORMED_IMAGE
ImageToTensor
IMAGE
TENSOR
Inference
TENSORS
TENSORS
TensorToLandmar
ks
TENSORS
LANDMARKS
Renderer RENDERED_IMAGE
LANDMARKS IMAGE
Model
Synchronization handled by MediaPipe framework
to align time-series data
● By default all inputs to a calculator are
synchronized
● Process method in a calculator is invoked with
packets with the same timestamp across inputs
Proprietary + Confidential Proprietary + Confidential
Subgraph Video
Video
ImageTransform
IMAGE
TRANSFORMED_IMAGE
ImageToTensor
IMAGE
TENSOR
Inference
TENSORS
TENSORS
TensorToLandmar
ks
TENSORS
LANDMARKS
Renderer RENDERED_IMAGE
LANDMARKS IMAGE
Video
Video
HandLandmark
IMAGE
LANDMARKS
Renderer RENDERED_IMAGE
LANDMARKS IMAGE
A (partial) graph can be declared as a
subgraph and used as a regular calculator in
other graphs Model
Proprietary + Confidential Proprietary + Confidential
Simple Perception Pipeline Video
Video
HandLandmark
IMAGE
LANDMARKS
Renderer RENDERED_IMAGE
LANDMARKS IMAGE
Remaining issues:
● In practice, a hand can appear anywhere in an
image, at very different scales
● Takes a lot of model capacity to deal with the
large variation in location & scale
● The model is either large/slow or inaccurate
Proprietary + Confidential Proprietary + Confidential
"On-Device, Real-Time Hand Tracking with MediaPipe", Aug '19
Google AI Blog
Proprietary + Confidential Proprietary + Confidential
MediaPipe Toolkit
Proprietary + Confidential
MediaPipe Tech Stack
Graphs
Cross-platform Framework (C++)
TensorFlow,
OpenCV,
OpenGL/Metal,
Eigen etc
Graph
Construction
API
(Protobuf)
Calculator
API (C++)
Graph
Execution
API (C++,
Java, Obj-C)
Helper
Utils
(Android,
iOS)
Example
Applications
Example
Graphs
Built-in
Calculators Calculators
Graphs
Applications (Desktop/Server, Android, iOS, Embedded)
Proprietary + Confidential Proprietary + Confidential
Thank you!
mediapipe.dev