DeepPicar: A Low-cost Deep Neural Network-based
Autonomous Car
Michael Bechtel$, Elise McEllhiney$, Minje Kim^, Heechul Yun$
$ University of Kansas, ^ Indiana University Bloomington
End-to-End Deep Learning
• Produce control outputs directly from sensory inputs.
• Simplifies process by bypassing intermediary steps.
Adapted from http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_1_introduction.pdf
DAVE-2
• 2016 project done by NVIDIA.
• Used the End-to-End approach with a Convolutional Neural Network (CNN).
• Could successfully drive a car on public roads.
Source: https://devblogs.nvidia.com/deep-learning-self-driving-cars/
DAVE-2's CNN
• DAVE-2 used a 9-layer CNN to drive their car
  • ~250K weights
  • ~27M connections
  • 3MB large
• Relatively small by today's standards
  • More recent networks have millions of weights and are >100MB large
[Figure: DAVE-2 CNN]
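As a rough cross-check of the ~250K weights and ~27M connections quoted above, the totals can be recomputed from layer shapes. The slides give only the totals, so the layer shapes below are an assumption taken from NVIDIA's published DAVE-2 description (five unpadded convolutions followed by fully connected layers of 100, 50, and 10 units and a single output); `count_dave2` is a hypothetical helper name.

```python
# Layer shapes are an assumption based on NVIDIA's published DAVE-2
# description; the slides only state the totals.
CONV_LAYERS = [  # (out_channels, kernel_size, stride)
    (24, 5, 2),
    (36, 5, 2),
    (48, 5, 2),
    (64, 3, 1),
    (64, 3, 1),
]
FC_LAYERS = [100, 50, 10, 1]  # three fully connected layers plus the output


def count_dave2(height=66, width=200, channels=3):
    """Return (weights, connections, flattened_size) for the assumed layout."""
    weights = 0
    connections = 0
    for out_ch, k, stride in CONV_LAYERS:
        # Output size of an unpadded ("valid") convolution.
        height = (height - k) // stride + 1
        width = (width - k) // stride + 1
        per_output = k * k * channels            # inputs feeding one output value
        weights += per_output * out_ch + out_ch  # kernel weights plus biases
        connections += per_output * out_ch * height * width
        channels = out_ch
    flattened = height * width * channels
    fan_in = flattened
    for units in FC_LAYERS:
        weights += fan_in * units + units        # weights plus biases
        connections += fan_in * units
        fan_in = units
    return weights, connections, flattened


weights, connections, flattened = count_dave2()
print(f"weights: {weights:,}  connections: {connections:,}  flattened: {flattened}")
```

Under these assumptions the counts land close to the figures on the slide, which suggests the quoted totals refer to this layout.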
Outline
• Background
• DeepPicar Platform
• CNN Evaluation
• Shared Resource Isolation
• Embedded Platform Comparison
• Conclusions
DeepPicar
• A low-cost, small-scale replication of NVIDIA’s DAVE-2.
• Uses the exact same CNN.
• Runs on a Raspberry Pi 3 in real-time.
System Design
[Figure: system design. Components: camera, embedded computer, actuator, RC car, portable charger. Connections: USB, GPIO, jumper wires.]
Motor Control
• The RC car has two separate motors: steering and throttle
• Convert the steering angle to a PWM value
• Send a signal to the steering motor with the PWM value
[Figure: steering and throttle motors, connected with jumper wires]
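The angle-to-PWM conversion can be illustrated with a simple linear mapping. This is a minimal sketch, not the car's actual driver: the function name `angle_to_duty_cycle`, the 30 degree steering limit, and the 0 to 100% duty-cycle range are all hypothetical values chosen for illustration.

```python
def angle_to_duty_cycle(angle_deg, max_angle_deg=30.0, min_duty=0.0, max_duty=100.0):
    """Linearly map a steering angle in degrees to a PWM duty cycle in percent.

    All default values are hypothetical; a real car needs calibrated limits.
    """
    # Clamp so out-of-range predictions cannot exceed the actuator limits.
    clamped = max(-max_angle_deg, min(max_angle_deg, angle_deg))
    # 0.0 at the negative limit, 0.5 when centered, 1.0 at the positive limit.
    fraction = (clamped + max_angle_deg) / (2.0 * max_angle_deg)
    return min_duty + fraction * (max_duty - min_duty)


for angle in (-45.0, -30.0, 0.0, 15.0, 30.0):
    print(f"{angle:+6.1f} deg -> {angle_to_duty_cycle(angle):5.1f}% duty")
```

The resulting duty cycle would then be handed to whatever PWM interface drives the steering motor.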
CNN-Based Real-Time Control Loop
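The loop runs four stages once per control period: image collection, image preprocessing, CNN inferencing, and output handling. A minimal sketch of such a periodic loop follows. The four callables are hypothetical stand-ins for the real camera, resize, network, and motor code, and the 25 ms default period is an assumed value corresponding to 40 Hz.

```python
import time


def control_loop(read_frame, preprocess, infer, actuate, period_s=0.025, iterations=100):
    """Run the four stages once per period and return the number of overruns.

    The callables are hypothetical stand-ins for the camera, resize, CNN,
    and motor stages; period_s=0.025 corresponds to 40 Hz.
    """
    missed = 0
    next_deadline = time.monotonic() + period_s
    for _ in range(iterations):
        frame = read_frame()            # 1. image collection
        net_input = preprocess(frame)   # 2. image preprocessing
        angle = infer(net_input)        # 3. CNN inferencing
        actuate(angle)                  # 4. output handling
        remaining = next_deadline - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)       # wait for the next period to begin
        else:
            missed += 1                 # this iteration overran its period
        next_deadline += period_s
    return missed


# Dry run with dummy stages and a short period.
commands = []
missed = control_loop(
    read_frame=lambda: "frame",
    preprocess=lambda frame: frame,
    infer=lambda net_input: 0.0,
    actuate=commands.append,
    period_s=0.001,
    iterations=5,
)
print(len(commands), "iterations,", missed, "missed")
```

Counting overruns rather than silently drifting makes it easy to see when a stage no longer fits in the period.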
Image Collection
• Get images from the camera sensor using OpenCV
• Configured to return a 320x240x3 image frame
  • The network requires 66x200x3 input
[Figure: read()]
Image Preprocessing
• Transform the image's dimensions (also with OpenCV)
[Figure: resize() from 320x240x3 to 66x200x3]
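The resize step can be sketched as follows. Note that 320x240x3 reads as width x height x channels, so the frame array has shape (240, 320, 3), while 66x200x3 is already in height x width x channels order. The function name `preprocess` and the NumPy fallback are assumptions added so the sketch also runs where OpenCV is not installed; on the car, the frame would come from the `read()` call of an OpenCV `VideoCapture` object rather than from a zero array.

```python
import numpy as np

try:
    import cv2  # OpenCV, the library named on the slides
except ImportError:  # lets the sketch run where OpenCV is not installed
    cv2 = None

NET_HEIGHT, NET_WIDTH = 66, 200  # input size the network requires


def preprocess(frame):
    """Resize an HxWx3 camera frame to the 66x200x3 network input."""
    if cv2 is not None:
        # cv2.resize takes the target size as (width, height).
        return cv2.resize(frame, (NET_WIDTH, NET_HEIGHT))
    # Fallback: nearest-neighbour sampling with NumPy indexing.
    rows = np.arange(NET_HEIGHT) * frame.shape[0] // NET_HEIGHT
    cols = np.arange(NET_WIDTH) * frame.shape[1] // NET_WIDTH
    return frame[rows[:, None], cols]


# Stand-in for a camera frame: 240 rows, 320 columns, 3 channels.
frame = np.zeros((240, 320, 3), dtype=np.uint8)
print(preprocess(frame).shape)
```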
CNN Inferencing
• Feed the preprocessed image to the network
[Figure: network output is the steering angle (radians)]
Output Handling
• Convert network output to degrees
• Control car based on relative value
  • Angle > 15: turn left
  • Angle < -15: turn right
  • Else: go straight
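The thresholding above can be written as a small function. This is a sketch that follows the sign convention exactly as the slide states it; the function name `steering_command` and the string return values are hypothetical.

```python
import math


def steering_command(angle_rad, threshold_deg=15.0):
    """Map the network output (radians) to a discrete steering command."""
    angle_deg = math.degrees(angle_rad)
    if angle_deg > threshold_deg:
        return "left"
    if angle_deg < -threshold_deg:
        return "right"
    return "straight"


for angle_rad in (0.5, 0.1, -0.5):
    print(f"{angle_rad:+.2f} rad = {math.degrees(angle_rad):+6.1f} deg -> {steering_command(angle_rad)}")
```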
CNN on Raspberry Pi 3
• Pi 3 is able to run the CNN-based control loop at up to 40 Hz (25 ms period).
• CNN inferencing dominates the processing time (>80%).
[Figure: time breakdown]
Effect of Number of Cores Used
• Performance improves with more cores: from 20 Hz (1 core) to 40 Hz (4 cores).
• But scalability is limited (due to parallelization overhead).
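To put the limited scalability in numbers, the two rates on this slide can be turned into a speedup and a parallel efficiency. The helper name `scaling` is hypothetical; the inputs are the 20 Hz and 40 Hz figures quoted above.

```python
def scaling(rate_1core_hz, rate_ncore_hz, cores):
    """Return (speedup, parallel efficiency) for a multicore run."""
    speedup = rate_ncore_hz / rate_1core_hz
    return speedup, speedup / cores


speedup, efficiency = scaling(rate_1core_hz=20.0, rate_ncore_hz=40.0, cores=4)
print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")
```

Four cores deliver a 2x speedup, so each core is used at about half of its ideal contribution.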
Effect of Multiple Concurrent Models
• CNNs experience modest slowdown (due to interference).
Configurations: 1Nx1C: 1 CNN using 1 core; 1Nx2C: 1 CNN using 2 cores; 2Nx2C: 2 CNNs each using 2 cores; 4Nx1C: 4 CNNs each using 1 core
Effect of Memory-Intensive Co-runners
• CNN can suffer very high (up to 11.6X) slowdown.
• Likely caused by contention in shared hardware resources.
Co-runners: BwRead: 16MB 1D array read; BwWrite: 16MB 1D array write
[Figure: effect of co-runners]
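The access pattern of the two co-runners can be sketched as below. This shows the pattern only: the real co-runners are native programs that run continuously, and interpreted Python will not generate comparable memory pressure. The 64-byte cache-line stride, the reading of 16MB as 16 x 1024 x 1024 bytes, and the function names are assumptions.

```python
CACHE_LINE = 64                 # bytes; assumed cache-line size
ARRAY_SIZE = 16 * 1024 * 1024   # 16MB array, as on the slide


def bw_read(buf, stride=CACHE_LINE):
    """Read one byte per cache line and return the sum of the bytes read."""
    total = 0
    for i in range(0, len(buf), stride):
        total += buf[i]
    return total


def bw_write(buf, value=0xFF, stride=CACHE_LINE):
    """Write one byte per cache line and return the number of writes."""
    count = 0
    for i in range(0, len(buf), stride):
        buf[i] = value
        count += 1
    return count


buf = bytearray(ARRAY_SIZE)
print("lines written:", bw_write(buf))
print("checksum read:", bw_read(buf))
```

Touching one byte per line is enough to pull the whole line through the cache hierarchy, which is why a stride of one cache line is used instead of visiting every byte.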
Isolation Mechanisms
• L2 Cache Isolation: PALLOC (*)
  • Page-coloring based kernel-level memory allocator that partitions the cache by allocating memory pages to disjoint cache sets.
• DRAM Isolation: MemGuard (**)
  • Memory bandwidth reservation system that limits the bandwidth each core gets in a given interval (1 ms).
(*) H. Yun et al., “PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14
(**) H. Yun et al., “MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms.” RTAS’13
PALLOC
• L2 cache is partitioned using bits 13 and 14.
• Four partitions are created with 4, 3, 2, and 1 colors.
  • 100%, 75%, 50%, and 25% L2 cache space availability.
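Two bits give four colors, and a partition's share of the cache is the number of colors it owns divided by four. The sketch below illustrates that arithmetic, assuming bits 13 and 14 refer to physical address bits (the usual basis for page coloring). The function names are hypothetical, and this is not PALLOC's kernel code.

```python
COLOR_BITS = (13, 14)  # address bits named on the slide


def page_color(phys_addr, bits=COLOR_BITS):
    """Build a color index from the selected address bits (lowest bit first)."""
    color = 0
    for position, bit in enumerate(bits):
        color |= ((phys_addr >> bit) & 1) << position
    return color


def cache_share(colors_assigned, bits=COLOR_BITS):
    """Fraction of the L2 cache reachable with the given number of colors."""
    return colors_assigned / (1 << len(bits))


for addr in (0x0000, 0x2000, 0x4000, 0x6000, 0x8000):
    print(f"{addr:#07x} -> color {page_color(addr)}")
for n in (4, 3, 2, 1):
    print(f"{n} colors -> {cache_share(n):.0%} of L2")
```

Pages whose addresses differ in these bits map to disjoint cache sets, so restricting a task to a subset of colors restricts it to the matching fraction of the cache.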
PALLOC cont.
• The CNN workload is insensitive to cache space availability.
[Figure labels: DNN; Core1, Core2, Core3, Core4; LLC; DRAM]
PALLOC cont.
• Cache partitioning is ineffective in protecting the CNN.
  • Using PALLOC provides no benefits.
MemGuard
• CNN performance is sensitive to memory bandwidth.
  • At least 400 MB/s required for ideal performance.
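A per-core bandwidth limit enforced over a 1 ms interval amounts to a budget of memory transfers per interval. The toy model below illustrates the idea, using 400 MB/s purely as an example value. It is not MemGuard's implementation: the 64-byte transfer size, the reading of 1 MB as 10^6 bytes, and the names `budget_lines` and `Regulator` are assumptions.

```python
CACHE_LINE_BYTES = 64  # assumed size of one memory transfer


def budget_lines(bandwidth_mb_s, interval_us=1000):
    """Cache-line transfers allowed per interval, taking 1 MB as 10**6 bytes."""
    bytes_per_interval = bandwidth_mb_s * 1_000_000 * interval_us // 1_000_000
    return bytes_per_interval // CACHE_LINE_BYTES


class Regulator:
    """Toy per-core budget: accesses beyond the budget wait for the next interval."""

    def __init__(self, bandwidth_mb_s):
        self.budget = budget_lines(bandwidth_mb_s)
        self.used = 0

    def try_access(self):
        if self.used >= self.budget:
            return False  # core would be throttled until the interval ends
        self.used += 1
        return True

    def new_interval(self):
        self.used = 0  # budget is replenished at every interval boundary


reg = Regulator(bandwidth_mb_s=400)
granted = sum(reg.try_access() for _ in range(10_000))
print("budget per 1 ms:", reg.budget, "granted:", granted)
```

Capping co-runners this way bounds how much memory traffic they can inject in any one interval, which leaves the remaining bandwidth for the CNN.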
MemGuard cont.
• Performance improves when co-runner bandwidths are limited.
  • Using MemGuard is very beneficial.
[Figure annotations: solo; CNN bandwidth: 1000 MB/s]
Embedded Platform Comparison
• Does the CNN behave similarly on other platforms?
• We test the NVIDIA TX2 both with and without its GPU.
• Three experiments were replicated on the other platforms.
Comparison of Multicore Experiments
• The CNN scales on all platforms except for TX2 (GPU).
• Scalability is still limited.
Comparison of Multimodel Experiment
• All platforms experience some interference.
• Slowdown is tolerable on all platforms.
Configurations: 1Nx1C: 1 CNN using 1 core; 1Nx2C: 1 CNN using 2 cores; 2Nx2C: 2 CNNs each using 2 cores; 4Nx1C: 4 CNNs each using 1 core
Comparison of Co-runners
• Worst-case performance is poor on all platforms.
• The Pi 3 fares especially poorly with memory-write co-runners.
Conclusions
• DeepPicar Platform
  • Low-cost replication of the DAVE-2 autonomous car.
  • Runs the same CNN in real-time on a Raspberry Pi 3.
• Real-time CNN inferencing
  • Feasible on embedded multicore platforms.
  • Multiple CNNs can be co-scheduled.
  • Caution must be taken regarding interference.
• Shared Resource Isolation
  • L2 cache partitioning had no benefits.
  • Limiting core memory bandwidths was very effective.
Thank you
Disclaimer:
This research has been supported by the National Science Foundation (NSF) under grant number CNS 1815959, and the National Security Agency (NSA) Science of Security Initiative. The Titan Xp and Jetson TX2 used for this research were donated by the NVIDIA Corporation.