
DeepPicar: A Low-cost Deep Neural Network-based Autonomous Car

Michael Bechtel$, Elise McEllhiney$, Minje Kim^, Heechul Yun$

$ University of Kansas, ^ Indiana University Bloomington


End-to-End Deep Learning

• Produce control outputs directly from sensory inputs.

• Simplifies process by bypassing intermediary steps.

Adapted from http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_1_introduction.pdf

DAVE-2

• A 2016 project by NVIDIA.

• Used the End-to-End approach with a Convolutional Neural Network (CNN).

• Could successfully drive a car on public roads.

Source: https://devblogs.nvidia.com/deep-learning-self-driving-cars/

DAVE-2's CNN

• DAVE-2 used a 9-layer CNN to drive their car

  • ~250K weights

  • ~27M connections

  • 3 MB in size

• Relatively small by today's standards

  • More recent networks have millions of weights and are >100 MB in size
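The ~250K figure can be sanity-checked with a short count. This is a minimal sketch: the layer sizes below are an assumption taken from NVIDIA's published description of DAVE-2 (five convolutional layers followed by four fully connected layers, counted here as the nine layers) and do not appear on the slide.

```python
# Assumed DAVE-2 layout (from NVIDIA's published description, not from this slide).
CONV_LAYERS = [  # (output channels, kernel size, stride)
    (24, 5, 2), (36, 5, 2), (48, 5, 2), (64, 3, 1), (64, 3, 1),
]
FC_LAYERS = [100, 50, 10, 1]  # fully connected widths after flattening


def count_weights(height=66, width=200, channels=3):
    """Count trainable weights (biases included) for the assumed layout."""
    total = 0
    for out_ch, k, s in CONV_LAYERS:
        total += k * k * channels * out_ch + out_ch
        # Unpadded ("valid") convolution shrinks the feature map.
        height = (height - k) // s + 1
        width = (width - k) // s + 1
        channels = out_ch
    fan_in = height * width * channels  # length of the flattened feature vector
    for units in FC_LAYERS:
        total += fan_in * units + units
        fan_in = units
    return total
```

Under these assumptions `count_weights()` returns 252219, in line with the ~250K weights quoted above.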

[Figure: DAVE-2 CNN]

Outline

• Background

• DeepPicar Platform

• CNN Evaluation

• Shared Resource Isolation

• Embedded Platform Comparison

• Conclusions


DeepPicar

• A low cost, small scale replication of NVIDIA’s DAVE-2.

• Uses the exact same CNN.

• Runs on a Raspberry Pi 3 in real-time.


System Design

[Diagram: camera, embedded computer, actuator, RC car, and portable charger; connection labels: USB, GPIO, jumper]

Motor Control

• The RC car has two separate motors: steering and throttle

• Convert the steering angle to a PWM value

• Send a signal to the steering motor with the PWM value
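The angle-to-PWM conversion can be sketched as a clamped linear map. This is a minimal illustration only: the function name, the 30 degree steering range, and the 1000 to 2000 microsecond pulse range are hypothetical values chosen for the example, not DeepPicar's actual calibration.

```python
def angle_to_pwm(angle_deg, max_angle_deg=30.0, center_us=1500, span_us=500):
    """Map a steering angle in degrees to a PWM pulse width in microseconds.

    All defaults are hypothetical; a real car needs its own calibration.
    """
    # Clamp the request to the range the steering hardware can reach.
    clamped = max(-max_angle_deg, min(max_angle_deg, angle_deg))
    # Linear interpolation around the centre pulse width.
    return int(round(center_us + span_us * clamped / max_angle_deg))
```

With these defaults, 0 degrees maps to the centre pulse and out-of-range angles saturate at the end points.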

[Figure labels: Steering, Throttle, Jumper wires]

CNN-Based Real-Time Control Loop


Image Collection

• Get images from the camera sensor using OpenCV's read()

• Configured to return a 320x240x3 image frame

  • The network requires 66x200x3 input

Image Preprocessing

• Transform the image's dimensions (also with OpenCV)

resize(): 320x240x3 → 66x200x3
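The capture and resize steps can be sketched with OpenCV's Python bindings. This is a rough sketch under assumptions: the camera index and helper names are hypothetical, and the slides do not show the actual capture code. Note that `cv2.resize` takes its target size as (width, height), while array shapes are (height, width, channels).

```python
NET_INPUT_HW = (66, 200)  # (height, width) expected by the network, per the slides
CAMERA_WH = (320, 240)    # (width, height) requested from the camera, per the slides


def dsize_for(shape_hw):
    """Convert a (height, width) shape to the (width, height) order cv2.resize expects."""
    height, width = shape_hw
    return (width, height)


def open_camera(index=0):
    """Open a camera (index is an assumption) and request 320x240 frames."""
    import cv2  # imported lazily so the helpers above work without OpenCV installed
    cap = cv2.VideoCapture(index)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, CAMERA_WH[0])
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, CAMERA_WH[1])
    return cap


def grab_and_preprocess(cap):
    """Read one frame and resize it to the network's input dimensions."""
    import cv2
    ok, frame = cap.read()  # shape (240, 320, 3) if the camera honoured the request
    if not ok:
        raise RuntimeError("camera read failed")
    return cv2.resize(frame, dsize_for(NET_INPUT_HW))  # shape (66, 200, 3)
```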

CNN Inferencing

• Feed the preprocessed image to the network

Output: steering angle (radians)

Output Handling

• Convert network output to degrees

• Control car based on relative value

  • Angle > 15°: turn left

  • Angle < -15°: turn right

  • Else: go straight
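The output handling above can be sketched in a few lines. The function and constant names are hypothetical; the 15 degree threshold and the sign convention follow this slide as written.

```python
import math

THRESHOLD_DEG = 15.0  # dead band around straight ahead, per the slide


def steering_command(angle_rad):
    """Turn the network's output (radians) into a discrete steering command."""
    deg = math.degrees(angle_rad)
    if deg > THRESHOLD_DEG:
        return "left"
    if deg < -THRESHOLD_DEG:
        return "right"
    return "straight"
```

For example, an output of 0.1 rad is about 5.7 degrees, which falls inside the dead band and keeps the car going straight.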


CNN on Raspberry Pi 3

• Pi 3 is able to run the CNN-based control loop at up to 40 Hz (a 25 ms control period).

• CNN inferencing dominates the processing time (>80%).

[Figure: Time breakdown]

Effect of Number of Cores Used

• Performance improves with more cores: from 20 Hz (1 core) to 40 Hz (4 cores).

• But scalability is limited (due to parallelization overhead).
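The scaling numbers can be turned into a speedup and a parallel efficiency. This is a back-of-the-envelope sketch with hypothetical helper names; the Amdahl's-law estimate is an added assumption for illustration, since the slide attributes the limit to parallelization overhead rather than to a serial fraction.

```python
def period_ms(rate_hz):
    """Control period in milliseconds for a given loop rate."""
    return 1000.0 / rate_hz


def speedup(rate_1core_hz, rate_ncore_hz):
    """Throughput gain of the multi-core run over the single-core run."""
    return rate_ncore_hz / rate_1core_hz


def parallel_efficiency(rate_1core_hz, rate_ncore_hz, cores):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup(rate_1core_hz, rate_ncore_hz) / cores


def amdahl_serial_fraction(observed_speedup, cores):
    """Serial fraction f that would explain the speedup under Amdahl's law.

    Solves observed_speedup = 1 / (f + (1 - f) / cores) for f.
    """
    return (cores / observed_speedup - 1) / (cores - 1)
```

With the slide's numbers, going from 20 Hz to 40 Hz on four cores is a 2x speedup, which is 50% parallel efficiency; under the Amdahl assumption that corresponds to a serial fraction of one third.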


Effect of Multiple Concurrent Models

• CNNs experience modest slowdown (due to interference).

Legend:

• 1Nx1C: 1 CNN using 1 core

• 1Nx2C: 1 CNN using 2 cores

• 2Nx2C: 2 CNNs each using 2 cores

• 4Nx1C: 4 CNNs each using 1 core

Effect of Memory Intensive Co-runners

Co-runners:

• BwRead: 16 MB 1D array read

• BwWrite: 16 MB 1D array write

• CNN can suffer very high (up to 11.6X) slowdown.

• Likely caused by contention in shared hardware resources.
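The access pattern of the two co-runners can be illustrated as follows. This is a rough Python analogue only, not the benchmark used in the experiments: the 64-byte stride is an assumed cache-line size, the function names are hypothetical, and interpreter overhead means this code will not generate comparable memory traffic.

```python
LINE = 64  # assumed cache-line size in bytes


def bw_read(buf, line=LINE):
    """Touch one byte per cache line while reading through a 1D array."""
    total = 0
    for i in range(0, len(buf), line):
        total += buf[i]
    return total


def bw_write(buf, line=LINE):
    """Write one byte per cache line across a 1D array; returns lines touched."""
    touched = 0
    for i in range(0, len(buf), line):
        buf[i] = 0xFF
        touched += 1
    return touched
```

A 16 MB working set as on the slide would be `bytearray(16 * 1024 * 1024)`; a real co-runner would repeat the pass in a loop on its own core.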


Isolation Mechanisms

• L2 Cache Isolation: PALLOC (*)

  • Page-coloring-based kernel-level memory allocator that partitions the cache by allocating memory pages to disjoint cache sets.

• DRAM Isolation: MemGuard (**)

  • Memory bandwidth reservation system that limits the bandwidth each core gets in a given interval (1 ms).

(*) H. Yun et al., “PALLOC: DRAM Bank-Aware Memory Allocator for Performance Isolation on Multicore Platforms.” RTAS’14

(**) H. Yun et al., “MemGuard: Memory Bandwidth Reservation System for Efficient Performance Isolation in Multi-core Platforms.” RTAS’13
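The idea of a per-core bandwidth budget that is replenished every 1 ms can be sketched as a toy model. This is not MemGuard's implementation (which works inside the kernel); the class and method names are hypothetical, and 1 MB is assumed to mean 10^6 bytes.

```python
class BandwidthRegulator:
    """Toy model of a per-core memory bandwidth budget with periodic refill."""

    def __init__(self, budget_mb_per_s, period_ms=1.0):
        # Bytes the core may transfer within one regulation period.
        self.budget_bytes = int(budget_mb_per_s * 1e6 * period_ms / 1e3)
        self.remaining = self.budget_bytes

    def access(self, nbytes):
        """Return True if the access fits the remaining budget, else False (throttled)."""
        if nbytes > self.remaining:
            return False
        self.remaining -= nbytes
        return True

    def new_period(self):
        """Refill the budget at the start of the next period."""
        self.remaining = self.budget_bytes
```

For example, a 400 MB/s reservation with a 1 ms period gives a budget of 400,000 bytes per period; once a core has used it up, further accesses are refused until `new_period()` is called.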

PALLOC

• L2 cache is partitioned using bits 13 and 14.

• Four partitions are created with 4, 3, 2, and 1 colors.

  • 100%, 75%, 50%, and 25% L2 cache space availability.
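Two address bits give four page colors, which is where the four partition sizes come from. A minimal sketch of the arithmetic, with hypothetical helper names (the bit positions are the ones stated on the slide):

```python
COLOR_SHIFT = 13  # lowest physical address bit used for coloring, per the slide
COLOR_BITS = 2    # bits 13 and 14, giving 2**2 = 4 colors


def page_color(paddr):
    """Extract the page color from a physical address."""
    return (paddr >> COLOR_SHIFT) & ((1 << COLOR_BITS) - 1)


def cache_share(num_colors):
    """Fraction of the L2 cache available to a partition owning num_colors colors."""
    return num_colors / (1 << COLOR_BITS)
```

Partitions with 4, 3, 2, and 1 colors therefore get 100%, 75%, 50%, and 25% of the L2 cache, matching the figures above.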


PALLOC cont.

• The CNN workload is insensitive to cache space availability.

[Diagram labels: DNN; Core1, Core2, Core3, Core4; LLC; DRAM]

PALLOC cont.

• Cache partitioning is ineffective in protecting the CNN.

  • Using PALLOC provides no benefits.

MemGuard

• CNN performance is sensitive to memory bandwidth.

  • At least 400 MB/s required for ideal performance.

MemGuard cont.

• Performance improves when co-runner bandwidths are limited.

  • Using MemGuard is very beneficial.

[Figure annotations: solo; CNN bandwidth: 1000 MB/s]


Embedded Platform Comparison

• Does the CNN behave similarly on other platforms?

• We test the NVIDIA TX2 both with and without its GPU.

• Three experiments were replicated on the other platforms.


Comparison of Multicore Experiments

• The CNN scales on all platforms except for TX2 (GPU).

• Scalability is still limited.


Comparison of Multimodel Experiment

• All platforms experience some interference.

• Slowdown is tolerable on all platforms.

Legend:

• 1Nx1C: 1 CNN using 1 core

• 1Nx2C: 1 CNN using 2 cores

• 2Nx2C: 2 CNNs each using 2 cores

• 4Nx1C: 4 CNNs each using 1 core

Comparison of Co-runners

• Worst-case performance is bad on all platforms.

• The Pi 3 fares especially badly with memory-write co-runners.


Conclusions

• DeepPicar Platform

  • Low-cost replication of the DAVE-2 autonomous car.

  • Runs the same CNN in real-time on a Raspberry Pi 3.

• Real-time CNN inferencing

  • Feasible on embedded multicore platforms.

  • Multiple CNNs can be co-scheduled.

  • Caution must be taken regarding interference.

• Shared Resource Isolation

  • L2 cache partitioning had no benefits.

  • Limiting core memory bandwidths was very effective.

Thank you

Disclaimer:

This research has been supported by the National Science Foundation (NSF) under the grant number CNS 1815959, and the National Security Agency (NSA) Science of Security Initiative. The Titan Xp and Jetson TX2 used for this research were donated by the NVIDIA Corporation.
