
CAFFEINATED FPGAS: FPGA FRAMEWORK FOR TRAINING AND INFERENCE OF

CONVOLUTIONAL NEURAL NETWORKS WITH REDUCED PRECISION FLOATING-POINT

ARITHMETIC

by

Roberto DiCecco

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

The Edward S. Rogers Sr. Department of Electrical & Computer Engineering
University of Toronto

© Copyright 2018 by Roberto DiCecco

Page 2: CAFFEINATED FPGAS: FPGA F TRAINING AND INFERENCE OF ......Abstract Caffeinated FPGAs: FPGA Framework for Training and Inference of Convolutional Neural Networks With Reduced Precision

Abstract

Caffeinated FPGAs: FPGA Framework for Training and Inference of Convolutional Neural Networks With

Reduced Precision Floating-Point Arithmetic

Roberto DiCecco

Master of Applied Science

The Edward S. Rogers Sr. Department of Electrical & Computer Engineering

University of Toronto

2018

This thesis presents a framework for performing training and inference of Convolutional Neural Networks (CNNs) with reduced precision floating-point arithmetic. This work aims to provide a means for FPGA and machine learning researchers to use the customizability of FPGAs to explore the precision requirements of training CNNs with an open-source framework. This is accomplished through the creation of a High-Level Synthesis library with a Custom Precision Floating-Point data type that is configurable in both exponent and mantissa widths, with several standard operators and rounding modes supported. With this library, an FPGA CNN Training Engine (FCTE) has been created along with an FPGA CNN framework, FPGA Caffe, which is built on Caffe. FCTE has a peak performance of approximately 350 GFLOPs, and has been used to show that a mantissa width of 5 and an exponent width of 6 is sufficient for training several models targeting the MNIST and CIFAR-10 datasets.


Acknowledgements

I would like to thank my supervisor, Professor Paul Chow. Throughout the course of my MASc he has provided me with a lot of guidance, both with respect to completing my thesis and with my professional and personal development. Furthermore, it is certainly not a typical experience in an MASc to be reintroduced to playing hockey through your supervisor, which was a great and memorable experience that I am very grateful for.

I would like to thank my friends: Ian Hansen, Pratik Engineer, Rostislav Dobrin, Thomas Lin, and Tristan Pereira for all of the great times that we have had both before and during my MASc.

I would like to thank my siblings: Sante (and his wife Kaila), Krista, Alexa, and Liza for all of the love and support that they have provided throughout this journey. I would also like to thank my dad, Joe, for always being there for me and for all of my siblings. I would also like to thank my mom, Anne, who passed away before I started university; I think she would be just as happy as I am to see what I have accomplished.

I would also like to thank my colleagues, particularly Daniel Ly-Ma and Daniel Rozhko for all of the important procrastination that we took part in, but also all of the other members of our research group: Charles Lo, Eric Fukuda, Fernando Martin Del Campo, Jasmina Vasiljevic, Justin Tai, Nariman Eskandari, and Zhiqiang Liu.

Last, but certainly not least, I would like to thank my wonderful girlfriend Sarah Billingsley for supporting me during my MASc and always being patient with my strange student work schedule. Without her I would not be the person that I am today, and for that I am forever grateful.


Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
   1.1 Motivation
   1.2 Goal
   1.3 Contributions
   1.4 Overview

2 Background
   2.1 Field-Programmable Gate Arrays
      2.1.1 High-Level Synthesis
      2.1.2 SDAccel
   2.2 Convolutional Neural Networks
      2.2.1 Inference and Training
      2.2.2 Common Layers
      2.2.3 Convolution Algorithms
      2.2.4 Caffe Convolutional Neural Network Framework
   2.3 Floating-Point Number Representations
   2.4 Related Work
      2.4.1 FPGA-based CNNs
      2.4.2 Reduced Precision CNN Training

3 Custom-Precision Floating-Point
   3.1 Multiplication
   3.2 Addition
   3.3 Comparison Operators

4 FPGA-Caffe: Hardware
   4.1 SDAccel Bandwidth Analysis
   4.2 FPGA CNN Training Engine
   4.3 Winograd Convolution Engine
   4.4 Max Pooling Layer
   4.5 ReLU Layer
   4.6 Core Usage

5 FPGA-Caffe: Software
   5.1 OpenCL Brew
   5.2 OpenCL Memory Management and Synchronization
   5.3 Auxiliary Layers
      5.3.1 Pad Layer
      5.3.2 HWCN Layer
      5.3.3 CPFP Layer
   5.4 FPGA Layers
   5.5 Testbenches

6 Evaluation
   6.1 Custom-Precision Floating-Point Area Utilization and Operating Frequency
      6.1.1 CPFP Multiplier
      6.1.2 CPFP Adder
   6.2 Winograd Convolution Engine and FPGA CNN Training Engine Area Utilization and Operating Frequencies
   6.3 Training Accuracy
      6.3.1 MNIST
      6.3.2 CIFAR-10
   6.4 Inference Accuracy
   6.5 Training Throughput
   6.6 Inference Throughput

7 Future Work
   7.1 Solvers
   7.2 Mixed Data Representations
   7.3 Additional Layers and Models
   7.4 Multi-FPGA Implementations
   7.5 FCTE Improvements

8 Conclusion

Bibliography


List of Tables

2.1 Comparison of FPGA Inference Works
4.1 Description of configuration registers used for operating the Direct and Winograd convolution engines
4.2 Fully Connected Layer Register Settings
5.1 Memory Synchronization State Transitions
6.1 Convolution Engine Area Utilization (Utilization Percentage in Brackets) and Operating Frequency, Exponent Width of 6, Mantissa Width of 5
6.2 Convolution Engine Area Utilization (Utilization Percentage in Brackets) and Operating Frequency, Exponent Width of 6, Mantissa Width of 7
6.3 Training forward-backward throughput (images/second)
6.4 Inference throughput (images/second)


List of Figures

2.1 Stratix block layout [1]
2.2 Example of FPGA routing resources [1]
2.3 Loop pipelining example
2.4 Loop unrolling example
2.5 Memory partitioning using cyclic, block, and complete
2.6 Data packing example
2.7 SDAccel Platform
2.8 Max pooling forward computation with a 2×2 window
2.9 Max pooling backward computation with a 2×2 window
2.10 Winograd Input Tile Stencil
2.11 Caffe convolution model specification
3.1 CPFP multiplier, two operands
3.2 CPFP multiplier, three operands
3.3 CPFP adder
4.1 SDAccel burst size run-time analysis
4.2 Demonstration of unroll factors that can be used for the convolution forward pass
4.3 Processing element backward (BW) and forward paths. The forward path is in dashed red, showing the bypass logic used to only use a portion of the adder tree.
4.4 Overall CNN training architecture
4.5 Winograd processing element
4.6 Winograd input and weight transforms
4.7 Positive and negative Winograd output transforms
4.8 Winograd system diagram
4.9 Max pooling forward implementation
4.10 Max pooling backward implementation
5.1 High-Level View of the Brew Options in Caffe
5.2 Demonstration of memory layout before and after transforming the data from single-precision floating-point to CPFP. This example assumes that the CPFP data is 16-bits or less.
5.3 Demonstration of a CPFP variable stored within host memory
5.4 Auxiliary layer usage
5.5 OCRLCRHWCN layer description
6.1 FPGA, GPU, and CPU evaluation systems considered to measure accuracy and throughput
6.2 CPFP multiplier LUT utilization compared to Xilinx floating-point IP cores
6.3 CPFP multiplier FF utilization compared to Xilinx floating-point IP cores
6.4 CPFP multiplier clock period compared to Xilinx floating-point IP cores
6.5 CPFP adder LUT utilization compared to Xilinx floating-point IP cores
6.6 CPFP adder FF utilization compared to Xilinx floating-point IP cores
6.7 CPFP adder clock period compared to Xilinx floating-point IP cores
6.8 Test error rates for MNIST and CIFAR-10 when using different floating-point representations
6.9 NIN CIFAR-10 Error Rate
6.10 All Convolutional Network CIFAR-10 error rate
6.11 Network In a Network inference validation accuracy for different CPFP configurations
6.12 VGG16 validation accuracy for different CPFP configurations
6.13 AlexNet validation accuracy for different CPFP configurations
6.14 Convolution engine throughput comparisons


Chapter 1

Introduction

Deep learning has been at the forefront of many of the recent breakthroughs in image recognition, speech recognition, and natural language processing [2]. In particular, convolutional neural networks (CNNs) have played an important role in image recognition, sparking advancements in many other fields as well. Many of the recent advancements in image processing started in 2012 when AlexNet [3] was used to win the ImageNet competition [4], which involves classifying 100,000 images across 1,000 categories using over 1.2 million labelled images as training data. In that competition, the AlexNet CNN achieved 10% higher accuracy than the next best submission, significantly outperforming all other entries. Every year since then, the winner of the ImageNet competition has been an implementation using deep learning and CNNs [5–9], with classification accuracy now at 97% [9]. While deep learning and CNNs have been shown to achieve very high accuracy in classification tasks, they are very computationally intensive, with training a given network taking on the order of days [3]. Training networks is an iterative process, meaning that achieving desirable accuracy can take on the order of weeks as model parameters are modified. This calls for hardware that is flexible enough to accommodate model changes as the field advances, and that can provide high bandwidth and compute resources within a low power envelope to reduce the overall cost of training.

1.1 Motivation

Field-Programmable Gate Arrays (FPGAs) offer a compelling middle ground between Application Specific Integrated Circuits (ASICs) and Graphics Processing Units (GPUs) or Central Processing Units (CPUs) in terms of customizability and programmability. ASICs are tailored to a given application, while CPUs and GPUs have fixed architectures that are general enough to be used across many applications. FPGAs, on the other hand, can be configured with a custom architecture for a given application, removing the unnecessary logic that is found in the more general GPUs or CPUs. This allows for some of the customizability of ASICs, while retaining some of the generality of CPUs or GPUs. FPGAs have started to see adoption in the datacenter, leading to more widespread use. Microsoft has published several works regarding their adoption of FPGAs in their datacenters, detailing that they are deploying FPGAs in the majority of their new servers and using them to accelerate applications such as the Bing search engine [10, 11]. More recently, they have unveiled plans for Project Brainwave, an FPGA-based platform for deep learning applications in the cloud, with performance of 39.5 Teraflops using a custom floating-point representation [12]. Similarly, Amazon has started offering FPGA-based F1 instances, which allow developers to spawn virtual machines with up to 8 FPGAs attached, all of which can be programmed using OpenCL or other high-level languages [13].

While FPGA platforms have become more easily accessible through the work of Microsoft and Amazon, FPGAs remain a relatively difficult platform to target for researchers and developers without hardware backgrounds. Tools from the major vendors have greatly simplified the task of developing compute kernels for FPGAs through the use of High-Level Synthesis (HLS) and system generation tools such as the Intel FPGA SDK for OpenCL [14] and Xilinx SDAccel [15]; however, there are not many open-source libraries available for these platforms. The rapid advancement of deep learning has sparked many developments in software and hardware alike. With respect to software, there are now several frameworks that can be used to implement CNNs for both training and inference, including (but not limited to): Caffe [16], Tensorflow [17], Torch [18], MXNet [19], and CNTK [20]. Most machine learning workloads target NVIDIA GPUs due to their high-performance cuDNN library [21], with all of the frameworks listed above offering cuDNN support but no FPGA support. While GPUs are the current platform of choice for machine learning, it is not clear whether this will remain the case as many hardware and algorithmic advancements begin to take shape. In the case of NVIDIA’s Volta line of GPUs, machine-learning-specific half-precision floating-point Tensor Cores have now been placed on chip [22]. Similarly, Google has created its own ASIC targeting machine learning workloads, with the first iteration specifically targeting inference using 8-bit integer arithmetic [23]. There are also many FPGA-specific works targeting deep learning inference, such as the work done by Intel [24] using half-precision floating-point and the recently announced Microsoft Brainwave project [12], which uses a custom 8-bit floating-point representation. Through these developments it is clear that there are still many unknowns in terms of the architectural decisions involved in creating hardware suitable for running CNNs. FPGA-based implementations offer an advantage in this regard, as FPGA-based designs can be iterated at a much faster rate than ASICs, allowing them to potentially follow the pace of new algorithmic developments. However, with the lack of open-source libraries available, the only option for many researchers is to develop new hardware, and the software to support it, essentially from scratch.


1.2 Goal

The goal of this work is to provide framework support for FPGA-based CNNs targeting both inference and training using differing levels of precision. This allows researchers and designers alike to iterate on FPGA-based designs in an environment that is already in use for designing CNNs. Furthermore, it provides the opportunity for researchers to evaluate new precision schemes without having to design their own custom operators. A secondary goal is to conduct precision studies on several CNNs to determine what level of floating-point precision is sufficient for training them.

1.3 Contributions

This thesis makes the following contributions:

1. An open-source FPGA Winograd Convolution Engine (WCE) has been developed using the Xilinx SDAccel and Xilinx Vivado HLS tools.

2. An open-source FPGA CNN Training Engine (FCTE) capable of computing forward and backward passes of convolution, ReLU, max pooling, and fully connected layers has been developed using the Xilinx SDAccel and Xilinx Vivado HLS tools.

3. Software support has been added to the CNN framework Caffe [16] for seamlessly running training and inference of CNNs using FPGAs.

4. An open-source library for custom-precision floating-point operations, written for use with Vivado High-Level Synthesis, has been developed.

5. Various floating-point precisions and rounding modes for training CNNs have been explored, showing that for some models an exponent width of 6 and a mantissa width of 5 is sufficient, with round-to-zero enabled for all multipliers and round-to-nearest enabled for all adders.

Furthermore, through the development of this work, two publications have been accepted to the International Conference on Field Programmable Technology (FPT), in 2016 and 2017. These publications are:

1. Caffeinated FPGAs: FPGA Framework for Convolutional Neural Networks (FPT 2016)

2. FPGA-Based Training of Convolutional Neural Networks with a Reduced Precision Floating-Point Library (FPT 2017)


1.4 Overview

The rest of the thesis is organized as follows:

• Chapter 2 provides background related to FPGAs, Convolutional Neural Networks, floating-point number representations, and prior work related to FPGA CNN implementations and training with low precision arithmetic.

• Chapter 3 details the Custom-Precision Floating-Point (CPFP) core implementations.

• Chapter 4 details the Winograd Convolution Engine and FPGA CNN Training Engine hardware.

• Chapter 5 discusses FPGA-Caffe, the software infrastructure that has been created to accommodate FPGAs within the Caffe CNN framework.

• Chapter 6 presents the resource utilization and operating frequency of the CPFP cores, the FPGA CNN Training Engine, and the Winograd Convolution Engine. Accuracy and throughput results are also discussed for both training and inference for several models.

• Chapter 7 details potential future work.

• Chapter 8 concludes the thesis.


Chapter 2

Background

This chapter details background information related to Field Programmable Gate Arrays (FPGAs), the software tools that can be used for developing with FPGAs, convolutional neural networks (CNNs), and floating-point number representations. Afterward, related work pertaining to FPGA-based CNN implementations is discussed, followed by reduced-precision training efforts from various researchers.

2.1 Field-Programmable Gate Arrays

Field-Programmable Gate Arrays (FPGAs) are a type of integrated circuit with configurable routing and logic resources that can be used to implement arbitrary functions. Modern FPGAs are composed of several types of blocks: logic blocks, Digital Signal Processing units (DSPs), and Block RAM (BRAM), organized into columns on the device, with high-speed interconnects at the periphery of the device [1]. The logic blocks are composed of several Look-Up Tables (LUTs) and Flip-Flops (FFs), where the Look-Up Tables are SRAM cells that can be configured to implement any N-input logic function, with N typically being 5 or 6 in modern FPGAs [25, 26]. Figure 2.1 shows an example of an older Stratix architecture, illustrating the column-based layout of the LABs (the Intel/Altera equivalent of a logic block), memory blocks, and DSP blocks. To allow for high levels of flexibility, the FPGA blocks are surrounded by routing resources, where the connections between blocks are configurable through programmable SRAM cells connected to switches. An example routing architecture is shown in Figure 2.2, where each logic block is connected to two connection blocks, and each connection block is connected to two switch boxes which allow for connections between horizontal and vertical routing tracks. The flexibility of the logic blocks and routing resources of FPGAs allows them to implement complex functions provided that they fit within the resource budget of a given device. Given that routing resources are finite, designs also need to meet the routing constraints of the FPGA, which may result in lower performance designs, or designs that are not routable at all if they are heavily congested.

Figure 2.1: Stratix block layout [1]
Figure 2.2: Example of FPGA routing resources [1]

2.1.1 High-Level Synthesis

High-Level Synthesis tools provide the ability for developers to describe hardware in higher level languages such as C, C++, OpenCL, or SystemC rather than working in a hardware description language (HDL) such as Verilog or VHDL. The tools take the high level programs and automatically convert them into an HDL-based implementation that performs the same task [27]. This provides an easier entry point into FPGA-based design, as software developers that do not have backgrounds in HDL design can use HLS tools to develop hardware targeting FPGAs. Furthermore, aside from helping software developers access FPGAs, HLS offers a simpler path for hardware designers to iterate on their designs and to conduct design space explorations to determine which hardware architectures result in the best area-power-throughput trade-off [27].

There are many existing HLS tools in both academia and industry targeting ASIC and FPGA design. The two largest FPGA vendors both have HLS tools, the Intel HLS Compiler [28] and Vivado HLS [29], while in academia there is LegUp [30] (now also a commercial product), which is an open-source HLS tool. For all three tools, the high-level languages are untimed, meaning they do not have any concept of a system clock.

All work conducted in this thesis used the Vivado HLS tool targeting Xilinx FPGAs. Some of the most important Vivado HLS design techniques are detailed below, with their specific uses related to the CPFP cores, Winograd Convolution Engine, and FPGA CNN Training Engine described in Chapters 3 and 4.


Figure 2.3: Loop pipelining example

Operator Overloading

HLS supports the use of some C++ specific language features, one of which is operator overloading. This allows operators for custom data types to be designed in HLS code and instantiated within a design by using the standard C or C++ operators (e.g. +, *, /, etc.). With this feature, the data type used in a design can be changed with relatively little effort while maintaining the same overall architecture, by simply replacing the data types used by the operators.
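As an illustration, the sketch below shows a toy wrapper type with overloaded arithmetic operators; it simply wraps a float, whereas the CPFP type of Chapter 3 implements the operators at the bit level, so the type and function names here are purely illustrative.

struct my_num {
  float v;                       // illustrative: a real CPFP type would store raw bits
  my_num(float x = 0.0f) : v(x) {}
};

// The operator body is where the custom hardware (e.g. a reduced-precision
// adder or multiplier) would be described in HLS code.
inline my_num operator+(const my_num &a, const my_num &b) { return my_num(a.v + b.v); }
inline my_num operator*(const my_num &a, const my_num &b) { return my_num(a.v * b.v); }

// Swapping "float" for "my_num" leaves the surrounding arithmetic code unchanged.
inline my_num mac(my_num acc, my_num a, my_num b) { return acc + a * b; }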

Loop Pipelining

Loop pipelining refers to the ability to exploit pipeline parallelism in FPGA designs by transforming loops in high level languages (e.g. for or while loops) into a hardware implementation that can accept a new input every N clock cycles, where N is often referred to as the initiation interval (II) [29]. This method is shown in Figure 2.3, where the operations in the for loop are split up into different pipeline stages separated by registers. This results in higher latency designs, but can provide much higher throughput because loop iterations can be overlapped.
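A minimal sketch of how loop pipelining is requested in Vivado HLS is shown below; the function and array names are illustrative.

void vector_mac(const float *a, const float *b, float *acc, int n) {
  for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
    // Request an initiation interval of one: a new loop iteration is started
    // every clock cycle, overlapping the multiply and add of successive iterations.
    acc[i] += a[i] * b[i];
  }
}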

Loop Unrolling

Loop unrolling is used to more efficiently exploit the vast resources available in FPGAs by exploiting spatial parallelism. This involves replicating the operations within a loop by a given unroll factor F, such that if there are originally N loop iterations, there will be F copies of the loop operations, reducing the number of iterations to N/F. Figure 2.4 shows an example of unrolling a loop to produce several outputs simultaneously.

Figure 2.4: Loop unrolling example
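In Vivado HLS, unrolling is requested with a pragma, as in the short sketch below (unroll factor F = 4; names illustrative).

void scale(const float in[64], float out[64], float k) {
  for (int i = 0; i < 64; ++i) {
#pragma HLS UNROLL factor=4
    // Four copies of the loop body are instantiated in hardware, so four
    // outputs are produced per remaining loop iteration (N/F = 16 iterations).
    out[i] = in[i] * k;
  }
}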

Memory Partitioning

When arrays are defined in Vivado HLS, they are initially mapped to memory blocks with two ports for reading or writing. In designs employing loop unrolling and loop pipelining with array accesses, the memories frequently require more than two read/write ports to achieve the targeted initiation interval and throughput. This requires partitioning the arrays into several separate memories depending on the access pattern. In Vivado HLS, three different memory partitioning schemes are supported, as shown in Figure 2.5: cyclic partitioning, block partitioning, and complete partitioning [29].

Figure 2.5: Memory partitioning using cyclic, block, and complete
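The sketch below shows how cyclic partitioning can be combined with a wider read pattern; the pragmas are standard Vivado HLS directives, while the buffer name and sizes are illustrative.

void accumulate(const float in[64], float *result) {
  float buf[64];
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4 dim=1
  // Copy the input into on-chip memory.
  for (int i = 0; i < 64; ++i) {
#pragma HLS PIPELINE II=1
    buf[i] = in[i];
  }
  // Cyclic partitioning spreads consecutive elements across four physical
  // memories, so the four reads below can be serviced in the same cycle.
  float acc = 0.0f;
  for (int i = 0; i < 64; i += 4) {
#pragma HLS PIPELINE II=1
    acc += buf[i] + buf[i + 1] + buf[i + 2] + buf[i + 3];
  }
  *result = acc;
}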

Data Packing

When writing software targeting CPUs, the widths of external memory interfaces are abstracted from the software developer. On the other hand, exploiting the width of the memory interfaces in FPGA-based designs is essential for achieving high bandwidth utilization when moving data from on-board memory to on-chip memory. The default interface width is the width of the data type that is used (if using a single-precision floating-point number, the width will be 32 bits); however, interface widths can be as high as 512 bits (or higher). Therefore, to achieve high bandwidth utilization, reads and writes to external memory should use the full bus width. This requires packing input and output data into large words corresponding to the interface width. Figure 2.6 shows an example of this, where the memory interface width is 512 bits and the data width is 32 bits. Without data packing this results in 6.25% bandwidth utilization. With data packing, sixteen 32-bit words can be accessed simultaneously, drastically improving bandwidth utilization.

Figure 2.6: Data packing example
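A minimal sketch of unpacking a 512-bit bus word into sixteen single-precision values using the Vivado HLS arbitrary-precision types is shown below; the function name is illustrative.

#include "ap_int.h"

void unpack_bus_word(const ap_uint<512> &word, float out[16]) {
  for (int i = 0; i < 16; ++i) {
#pragma HLS UNROLL
    // Slice out bits [32*i+31 : 32*i] and reinterpret them as a float.
    ap_uint<32> bits = word.range(32 * i + 31, 32 * i);
    union { unsigned int u; float f; } conv;
    conv.u = bits.to_uint();
    out[i] = conv.f;
  }
}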

2.1.2 SDAccel

Xilinx SDAccel [15] is an OpenCL development framework for Xilinx FPGAs. As discussed in the previous section, HLS can be used to ease the development of FPGA applications. SDAccel, and FPGA OpenCL frameworks in general, further ease development by providing a platform to be used with HLS-based kernels. In the SDAccel platform, interfaces to on-board memory and for external communication, such as DDR and PCIe, are provided in a static region of the device, while the HLS-based kernels are placed inside a reconfigurable region of the FPGA platform provided by Xilinx. This platform takes advantage of partial reconfiguration of the device to allow kernels to be changed depending on the application, while the static interfaces that the kernels require remain unchanged. This programming model allows FPGA application developers to focus primarily on the application that they are developing rather than the interfaces required by the application, which tend to be similar for a given development board.

Using SDAccel to perform computations on an FPGA involves both host and kernel code. The host code is used for programming the FPGA, passing data between the host’s memory and the FPGA’s global memory (DDR memory), and launching the kernel on the FPGA. Typically, the host code is executed on an x86-based CPU connected to the FPGA board through PCIe. The kernel code is synthesized into hardware and configured into the programmable region of the FPGA. The synthesized kernel can contain one or more compute units (CUs), where a CU corresponds to the hardware unit responsible for the required computation. Instantiating multiple CUs can be used as one approach to exploit parallelism in the targeted application, as shown in Figure 2.7, with each CU handling an equally sized portion of the problem [15]. The portion of the problem handled by a single CU is referred to as the local work group size, while the size of the overall task to be completed is referred to as the global work group size. Each CU has its own local memory that is only accessible to that CU, while all CUs share the global off-chip memory of the platform. The local memory of a CU corresponds to the block RAMs of the FPGA device, while the global off-chip memory is typically DDR.

Figure 2.7: SDAccel Platform
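A condensed sketch of the host-side flow described above is given below, using the standard OpenCL C API; error checking is omitted, and the kernel name and xclbin handling are illustrative rather than taken from FPGA Caffe.

#include <CL/cl.h>
#include <vector>

void run_fpga_kernel(const std::vector<float> &in, std::vector<float> &out,
                     const unsigned char *xclbin, size_t xclbin_size) {
  cl_platform_id platform;  cl_device_id device;
  clGetPlatformIDs(1, &platform, nullptr);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
  cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

  // Program the FPGA: SDAccel kernels are loaded as pre-built binaries (.xclbin).
  cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &xclbin_size,
                                              &xclbin, nullptr, nullptr);
  cl_kernel kernel = clCreateKernel(prog, "example_kernel", nullptr);

  // Buffers live in the FPGA's global (DDR) memory.
  cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  in.size() * sizeof(float),  nullptr, nullptr);
  cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, out.size() * sizeof(float), nullptr, nullptr);

  // Host -> device transfer, kernel launch, then device -> host transfer.
  clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, in.size() * sizeof(float), in.data(), 0, nullptr, nullptr);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
  clEnqueueTask(q, kernel, 0, nullptr, nullptr);   // launch a single compute unit
  clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, out.size() * sizeof(float), out.data(), 0, nullptr, nullptr);
  clFinish(q);
}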

2.2 Convolutional Neural Networks

CNNs are a type of feedforward network represented as a directed acyclic graph where each node consists of a different set of computations applied to the input data [2]. The most common computational layers of CNNs are: convolution, pooling (max or average), activations (ReLU, Sigmoid, etc.), and fully connected (also called inner product). Each layer has two sets of operations, forward and backward propagation, with forward being used for both inference and training and backward used only during training.

2.2.1 Inference and Training

A simplified view of a feedforward network is shown in Equation 2.1, where f^(1) to f^(n) represent the computations required by layers 1 to n [2].

$f(x) = f^{(n)}(\cdots f^{(2)}(f^{(1)}(x)) \cdots) \qquad (2.1)$

When performing inference (or classification), Equation 2.1 is computed on an input or a set of inputs to arrive at the final classification result. Each f^(m) may have a set of learnable parameters θ. The goal of training a feedforward network is to adjust the model parameters such that the model is able to correctly classify inputs that it has never seen. To accomplish this, the model is used to classify inputs with known labels, with the quality of the output determined by a cost function. The most common cost function used in deep learning applications is shown in Equation 2.2, which is the negative log-likelihood, also described as the cross-entropy between the training data and the model distribution [2]. Improving the quality of the outputs of the model amounts to minimizing the negative log-likelihood, which requires back-propagation to compute the gradients of each f^(m) with respect to both the inputs and the parameters, since J(θ) is a function of all of the layers. These gradients can be computed by recursively applying the chain rule to compute the gradient of f(x) with respect to the inputs. The gradient of the loss with respect to the parameters of layer N depends on the inputs to that layer and the gradient of the cost function with respect to the inputs of layer N+1, requiring all of the later gradients (those of layer N+2 up to layer K, where K is the last layer) to be computed first. Back-propagation therefore involves propagating the gradient from layer M+1 to layer M to compute the overall gradient of the network with respect to the input, and also to compute the gradient of the cost function with respect to the parameters of the network.

$J(\theta) = -\mathbb{E}_{x,y \sim p_{\text{data}}} \log\left(p_{\text{model}}(y \,|\, x)\right), \quad p_{\text{model}} = f^{(n)}(\cdots f^{(2)}(f^{(1)}(x)) \cdots) \qquad (2.2)$

Where:

$J$ is the cost function;
$\theta$ is the set of parameters;
$x$ is the input data;
$y$ is the input label;
$\mathbb{E}_{x,y \sim p_{\text{data}}}$ denotes the expected value over the training data distribution;
$p_{\text{model}}$ is the output probability of the model.

To actually minimize the cost function, the parameters need to be adjusted, which requires a solver, the most common solver being stochastic gradient descent [2]. The update equation for gradient descent is shown in Equation 2.3, where λ is the learning rate, which adjusts how large an impact the gradient has on the updated parameter. Computing the gradients required by gradient descent requires all of the training data, which is usually prohibitively expensive to store in memory (e.g. ImageNet contains 1.2 million training images [4]); therefore, randomly sampled mini-batches of the training data can be used to compute an estimate of the gradient, which forms the basis of stochastic gradient descent.

$\theta^{t}_{m} = \theta^{t-1}_{m} - \lambda \nabla_{\theta_m} J(\theta) \qquad (2.3)$
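As a concrete illustration of Equation 2.3, the sketch below applies one stochastic gradient descent update to a flat parameter array; the gradient is assumed to have been estimated from a mini-batch, and the names are illustrative.

// One SGD step (Equation 2.3): theta <- theta - lr * grad, where grad is the
// mini-batch estimate of the gradient of J with respect to theta.
void sgd_update(float *theta, const float *grad, int num_params, float lr) {
  for (int m = 0; m < num_params; ++m)
    theta[m] -= lr * grad[m];
}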

2.2.2 Common Layers

CNNs are an active area of research, which means that they are frequently changing as new models and computations are shown to be effective. Recent works share several common layers across many of the high accuracy models that researchers have developed. These layers are detailed in the following sections, though this is by no means an exhaustive list of the layers currently in use.

Convolution

The convolution layer in CNNs implements Equations 2.4, 2.5, 2.6, and 2.7, with padding and stride not shown for simplicity. Equation 2.4 corresponds to the forward pass, while Equations 2.5, 2.6, and 2.7 correspond to the backward pass. Equations 2.4 and 2.5 have the same structure, each corresponding to a convolution, though in Equation 2.5 the inputs are the backpropagated gradients of the loss from the previous layer and the weights are rotated 180 degrees. Similarly, Equation 2.6 is also a convolution, though in this case the weights are the backpropagated gradients of the loss from the previous layer. Equation 2.7 corresponds to the gradient with respect to the bias. When the input and output have the same dimensions, Equations 2.4, 2.5, and 2.6 have similar computational complexity, while Equation 2.7 has comparably low complexity; therefore the backward pass requires approximately double the computations of the forward pass.

$Y^{i}_{n,o,h,w} = \sum_{c=0}^{C-1} \sum_{q=0}^{K_h-1} \sum_{p=0}^{K_w-1} X^{i}_{n,c,h+p,w+q} \cdot F^{i}_{o,c,p,q} + B^{i}_{o} \qquad (2.4)$

$\frac{\partial J}{\partial X^{i}_{n,c,h,w}} = \sum_{o=0}^{O-1} \sum_{q=0}^{K_h-1} \sum_{p=0}^{K_w-1} \frac{\partial J}{\partial X^{i+1}_{n,o,h+p,w+q}} \cdot F^{i}_{o,c,K_h-p-1,K_w-q-1} \qquad (2.5)$

$\frac{\partial J}{\partial F^{i}_{o,c,p,q}} = \sum_{n=0}^{N-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X^{i}_{n,c,h+p,w+q} \cdot \frac{\partial J}{\partial X^{i+1}_{n,o,h,w}} \qquad (2.6)$

$\frac{\partial J}{\partial B^{i}_{o}} = \sum_{n=0}^{N-1} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} \frac{\partial J}{\partial X^{i+1}_{n,o,h,w}} \qquad (2.7)$

Where:

$i$ is the current layer index;
$N$ is the batch size;
$O$ is the number of output channels (sometimes called feature maps);
$C$ is the number of input channels (sometimes called feature maps);
$H$ is the input height;
$W$ is the input width;
$Y$ is the output;
$K_w$ is the horizontal dimension of the kernel;
$K_h$ is the vertical dimension of the kernel;
$X$ is the input;
$B$ is the bias;
$F$ is the set of weights;
$J$ is the cost function.

Max Pooling

The forward pass for max pooling involves striding a window of size K × K across each input feature map and outputting the maximum value of the window at each position. For each window position a tag is saved corresponding to the index of the input with the maximum value. An example of a 2×2 pooling window with a stride of two is shown in Figure 2.8. In this figure the tag is T_n, which corresponds to one of the possible positions within the window, and the output contains the value at position T_n of the pooling window. In the backward pass, the tag for each output is used to compute the gradient of the loss for the layer. In non-overlapping max pooling layers (where the stride is greater than or equal to the kernel size), the gradient of the loss at the position of the tag is equal to the gradient of the loss from the previous layer, with all other window positions set to zero. This is shown in Figure 2.9 for a 2×2 pooling window with a stride of two; the corresponding tags are mapped to the output of the backward pass, with all other outputs set to zero. For overlapping max pooling, on the other hand, each input within a pooling window can potentially influence several output values, so the gradient of the loss is accumulated at the given tag position.
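The sketch below captures this behaviour for the non-overlapping 2×2, stride-2 case on a single feature map; the array layout and names are illustrative.

// Forward pass: take the maximum of each 2x2 window and remember its index (the tag).
void max_pool_forward(const float *in, int H, int W, float *out, int *tag) {
  for (int y = 0; y < H / 2; ++y)
    for (int x = 0; x < W / 2; ++x) {
      int best = (2 * y) * W + 2 * x;
      for (int p = 0; p < 2; ++p)
        for (int q = 0; q < 2; ++q) {
          int idx = (2 * y + p) * W + (2 * x + q);
          if (in[idx] > in[best]) best = idx;
        }
      out[y * (W / 2) + x] = in[best];
      tag[y * (W / 2) + x] = best;
    }
}

// Backward pass: route each incoming gradient to the tagged position; all other
// positions receive zero (non-overlapping windows).
void max_pool_backward(const float *grad_out, const int *tag, int H, int W, float *grad_in) {
  for (int i = 0; i < H * W; ++i) grad_in[i] = 0.0f;
  for (int i = 0; i < (H / 2) * (W / 2); ++i) grad_in[tag[i]] += grad_out[i];
}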

ReLU

Forward propagation for the ReLU layer involves a simple computation on each of its inputs, namely computing max(0, input[idx]). The backward pass corresponds to a multiplexer: the gradient of the loss for the layer is equal to the gradient of the loss of the previous layer if the input data was larger than zero, and is otherwise equal to zero.
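In code, both passes reduce to a few lines, as in the sketch below (names illustrative).

void relu_forward(const float *in, float *out, int n) {
  for (int i = 0; i < n; ++i) out[i] = in[i] > 0.0f ? in[i] : 0.0f;   // max(0, x)
}

// The backward pass selects between the incoming gradient and zero based on the input.
void relu_backward(const float *in, const float *grad_out, float *grad_in, int n) {
  for (int i = 0; i < n; ++i) grad_in[i] = in[i] > 0.0f ? grad_out[i] : 0.0f;
}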


Figure 2.8: Max pooling forward computation with a 2×2 window

Figure 2.9: Max pooling backward computation with a 2×2 window

Fully Connected

Fully connected layers are commonly used near the end of CNNs [3, 6, 31, 32]. These layers form the basic building block of multilayer perceptrons [2] and can be represented as a matrix multiplication between the input and a set of weights. Another way of interpreting fully connected layers is to view them as convolutions where the filter size is the size of the input, the filter depth is the same as the input depth, the stride is one, and the padding is set to zero. Under this interpretation, Equations 2.4, 2.5, 2.6, and 2.7 can be used for computing the forward and backward passes of fully connected layers.
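Viewed as a matrix multiplication, the forward pass can be sketched as follows for a batch of flattened inputs (row-major arrays; names illustrative).

// Forward pass of a fully connected layer as a matrix multiplication:
// Y[n][o] = sum_i X[n][i] * W[o][i] + B[o].
void fc_forward(const float *X, const float *W, const float *B, float *Y,
                int batch, int in_dim, int out_dim) {
  for (int n = 0; n < batch; ++n)
    for (int o = 0; o < out_dim; ++o) {
      float acc = B[o];
      for (int i = 0; i < in_dim; ++i)
        acc += X[n * in_dim + i] * W[o * in_dim + i];
      Y[n * out_dim + o] = acc;
    }
}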

Softmax

Softmax layers are typically used as the final layer in CNNs. The role of a Softmax layer is to produce a distribution of size N from an input of size N, representing the probability of the input belonging to each of the N categories [2]. The equation used for computing Softmax is shown in Equation 2.8, where N is typically small (for ImageNet, N is 1000 [4]).

$\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{N} \exp(x_j)} \qquad (2.8)$
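A small sketch of Equation 2.8 is shown below. Subtracting the maximum input before exponentiating is a common numerical-stability trick that leaves the result unchanged; it is an addition here, not part of the equation itself.

#include <cmath>

void softmax(const float *x, float *y, int n) {
  float max_x = x[0];
  for (int i = 1; i < n; ++i)
    if (x[i] > max_x) max_x = x[i];
  float sum = 0.0f;
  for (int i = 0; i < n; ++i) {
    y[i] = std::exp(x[i] - max_x);   // shift by the maximum to avoid overflow
    sum += y[i];
  }
  for (int i = 0; i < n; ++i)
    y[i] /= sum;                     // normalize so the outputs sum to one
}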

2.2.3 Convolution Algorithms

There are several techniques that may be used to implement the convolutions shown in Equations 2.4, 2.5, 2.6, and 2.7, including direct convolution, matrix multiply convolution, Winograd convolution, and others. Several of these techniques are detailed below.

Direct Convolution

In Code 2.1, a basic implementation of direct convolution is shown. This algorithm can be parallelized in many different ways by unrolling and reordering the loop levels. There are loop-carried dependencies across the two inner-most loops as well as the loop iterating over inChannels; therefore, the most suitable loops for parallelizing are the numImages, outChannels, outYDim, and outXDim loops. In the case of the convolution involved in computing the gradient of the loss with respect to the weights, the algorithm is somewhat different, with the loop-carried dependencies instead along the numImages, outYDim, and outXDim loops, as there is a reduction over the number of images and the output dimensions, meaning that the outChannels, inChannels, and ksize loops are the most suitable for parallelizing.

Code 2.1: Direct Convolution

direct_convolution(input, weights, bias, output) {
  for (int n = 0; n < numImages; ++n) {
    // Convolution: accumulate over input channels and the kernel window.
    for (int o = 0; o < outChannels; ++o) {
      for (int c = 0; c < inChannels; ++c) {
        for (int y = 0; y < outYDim; ++y) {
          for (int x = 0; x < outXDim; ++x) {
            for (int p = 0; p < ksize; ++p) {
              for (int q = 0; q < ksize; ++q) {
                int in_y = y * stride - pad + p;
                int in_x = x * stride - pad + q;
                if (in_y >= 0 && in_y < inYDim && in_x >= 0 && in_x < inXDim)
                  output[n][o][y][x] += input[n][c][in_y][in_x] * weights[o][c][p][q];
              }
            }
          }
        }
      }
    }
    // Add the per-output-channel bias to every output position.
    for (int o = 0; o < outChannels; ++o) {
      for (int y = 0; y < outYDim; ++y) {
        for (int x = 0; x < outXDim; ++x) {
          output[n][o][y][x] += bias[o];
        }
      }
    }
  }
}
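To make the reordering described above concrete, the sketch below shows the loop structure for the weight-gradient convolution of Equation 2.6, using flat row-major arrays; the reduction now runs over images and output positions, so each weight accumulates independently. Names and layout are illustrative.

void weight_gradient(const float *input, const float *top_grad, float *dW,
                     int numImages, int inChannels, int outChannels,
                     int inYDim, int inXDim, int outYDim, int outXDim,
                     int ksize, int stride, int pad) {
  for (int o = 0; o < outChannels; ++o)
    for (int c = 0; c < inChannels; ++c)
      for (int p = 0; p < ksize; ++p)
        for (int q = 0; q < ksize; ++q) {
          float acc = 0.0f;                 // independent accumulation per weight
          for (int n = 0; n < numImages; ++n)
            for (int y = 0; y < outYDim; ++y)
              for (int x = 0; x < outXDim; ++x) {
                int in_y = y * stride - pad + p;
                int in_x = x * stride - pad + q;
                if (in_y >= 0 && in_y < inYDim && in_x >= 0 && in_x < inXDim)
                  acc += input[((n * inChannels + c) * inYDim + in_y) * inXDim + in_x]
                       * top_grad[((n * outChannels + o) * outYDim + y) * outXDim + x];
              }
          dW[((o * inChannels + c) * ksize + p) * ksize + q] = acc;
        }
}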

Matrix Multiply Convolution

Convolution can be reformulated as a series of matrix multiplications between the input and the weights by using the im2col algorithm, which modifies the input image layout by reshaping and padding the input. This is the approach taken in the native Caffe implementation of convolution [16], as matrix multiplication maps well to existing BLAS libraries such as OpenBLAS [33] or CUBLAS [34]. While this approach maps well to existing BLAS libraries, it results in many unneeded computations compared to directly computing the convolution, as im2col adds zeros to the input matrices so that the convolution can be performed as a matrix multiplication. Furthermore, additional overhead is added by the im2col function itself, which must be executed before the matrix multiplication of the input and the weights.
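A sketch of im2col for a single image is shown below: each output column holds the (zero-padded) input patch under one kernel position, so the convolution becomes a matrix multiplication between an (outChannels) × (C·K·K) weight matrix and the (C·K·K) × (outH·outW) column matrix. Names and layout are illustrative.

// im2col for one image: input [C][H][W] -> col [(C*K*K)][(outH*outW)], row-major.
void im2col(const float *input, int C, int H, int W,
            int K, int stride, int pad, float *col) {
  int outH = (H + 2 * pad - K) / stride + 1;
  int outW = (W + 2 * pad - K) / stride + 1;
  for (int c = 0; c < C; ++c)
    for (int p = 0; p < K; ++p)
      for (int q = 0; q < K; ++q) {
        int row = (c * K + p) * K + q;           // row index in the column matrix
        for (int y = 0; y < outH; ++y)
          for (int x = 0; x < outW; ++x) {
            int in_y = y * stride - pad + p;
            int in_x = x * stride - pad + q;
            float v = 0.0f;                      // zeros inserted for the padding region
            if (in_y >= 0 && in_y < H && in_x >= 0 && in_x < W)
              v = input[(c * H + in_y) * W + in_x];
            col[row * (outH * outW) + y * outW + x] = v;
          }
      }
}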

Winograd Convolution

Winograd convolution exploits the Winograd minimal filtering algorithm [35] to implement convolution. This approach has been shown to reduce the number of required floating-point operations [36]. The Winograd convolution algorithm output is referred to as F(m × m, r × r). In this expression, m × m refers to the output tile size for a given input tile, meaning that m × m output values are produced for every instance of F(m × m, r × r), while the filter size is r × r. For a given filter size, many different values of m can be chosen, which changes the computational complexity of F(m × m, r × r) but does not impact the overall result. In [37], we implemented the F(2×2, 3×3) Winograd algorithm as an initial approach. In the case of 3×3 convolutions, F(2×2, 3×3) has been shown to provide significant performance gains for GPU implementations in [36], though larger tile sizes may produce additional gains. The equations for F(2×2, 3×3) are shown in Equations 2.9 to 2.11. An alternative approach is to use one-dimensional transforms instead, to reduce the size of the processing elements. In the case of a one-dimensional Winograd transform such as F(2, 3), Equations 2.9 to 2.11 are also applicable, though in each case the dimensions are one-dimensional rather than two-dimensional and the transform matrices are modified to accommodate the change in the number of dimensions. Equation 2.9 shows how the filter transformation is calculated.

$U = G g G^{T} \qquad (2.9)$

Where:

$U$ is an $(m+r-1) \times (m+r-1)$ transformed filter;
$g$ is an $r \times r$ filter;
$G$ is an $(m+r-1) \times r$ transform matrix, defined by the Winograd algorithm.

In the case of inference, the filter values (g) are known at compile time and remain constant during run-time. Therefore, to save resources at run-time, the Winograd transformation of the filter values, shown in Equation 2.9, can be executed at compile time on the CPU. This approach saves FPGA resources; however, pre-computing the filter values increases the memory storage requirement. For direct convolution, the 3×3 filters require O × C × 3 × 3 storage, where C is the number of input channels and O is the number of output channels. For Winograd-based convolution, after the transformation, each 3×3 filter becomes a 4×4 matrix, requiring O × C × 4 × 4 storage. Therefore the storage requirement is increased by 77.7%.

Equation 2.10 shows how the input transformation is calculated.

$V = B^{T} d B \qquad (2.10)$

Where:

$V$ is an $(m+r-1) \times (m+r-1)$ transformed data tile;
$d$ is an $(m+r-1) \times (m+r-1)$ input tile;
$B$ is an $(m+r-1) \times (m+r-1)$ transform matrix, defined by the Winograd algorithm.

For F(2×2, 3×3), the input tile (d) is 4×4 and is generated by a sliding window across the 2-D input feature data. As shown in Figure 2.10, the d window slides horizontally across the input data with a stride of two. After the Winograd transformation, the 4×4 V tiles are stored back into memory. With this storage strategy, the input tiles require approximately four times more storage than an equivalent direct convolution due to the overlapping nature of the tiles.

Equation 2.11 shows how the pre-computed transformed filter U and the run-time transformed input data V are used to calculate the final output, a 2×2 tile Y. Each Y tile corresponds to a 2×2 non-overlapping subsection of the overall convolution output.

$Y = F(m \times m, r \times r) = A^{T} [U \odot V] A \qquad (2.11)$

Where:

$Y$ is an $m \times m$ output tile;
$\odot$ denotes element-wise multiplication;
$A$ is an $(m+r-1) \times m$ transform matrix, defined by the Winograd algorithm.

Figure 2.10: Winograd Input Tile Stencil (the 4×4 input tile d slides across the input data with a stride of two)
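For reference, the sketch below lists one common choice of the F(2×2, 3×3) transform matrices (e.g. the form used in [36]) and applies Equation 2.9 to a single 3×3 filter; the function and array names are illustrative.

const float G[4][3]  = {{1.0f, 0.0f, 0.0f},
                        {0.5f, 0.5f, 0.5f},
                        {0.5f, -0.5f, 0.5f},
                        {0.0f, 0.0f, 1.0f}};
const float BT[4][4] = {{1, 0, -1, 0},
                        {0, 1,  1, 0},
                        {0, -1, 1, 0},
                        {0, 1,  0, -1}};
const float AT[2][4] = {{1, 1,  1,  0},
                        {0, 1, -1, -1}};

// Filter transform U = G g G^T (Equation 2.9), computed in two matrix products.
void filter_transform(const float g[3][3], float U[4][4]) {
  float Gg[4][3];
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 3; ++j) {
      Gg[i][j] = 0.0f;
      for (int k = 0; k < 3; ++k) Gg[i][j] += G[i][k] * g[k][j];
    }
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j) {
      U[i][j] = 0.0f;
      for (int k = 0; k < 3; ++k) U[i][j] += Gg[i][k] * G[j][k];   // multiply by G^T
    }
}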

2.2.4 Caffe Convolutional Neural Network Framework

The Caffe CNN framework is one of many existing frameworks that can be used for describing a CNN. Caffe natively supports CPU and GPU execution of CNNs, with the option to use either the Caffe CUDA [38] layer implementations or cuDNN [16, 21]. Other frameworks include Tensorflow [17], MXNet [19], CNTK [20], Torch [18], and many others. While most of these frameworks are open-source, Caffe has the benefit of being very modular and written in C++, which provides a relatively low-resistance path to augmenting it to suit the needs of a developer.

Caffe Brews

In Caffe, a Brew is a mode of operation that determines the target architecture on which CNN classification or training is executed. The original Brews are CPU and GPU, with the CPU Brew containing the C++ infrastructure required to define layers on a CPU, and the GPU Brew providing similar features for NVIDIA GPUs using CUDA [38] and cuDNN [16, 21]. For each Brew, Caffe contains test cases for every layer, allowing for fast determination of functional correctness and for benchmarking.

Caffe Memory Management

Data in Caffe is represented as a four-dimensional flattened array, with allocation, resizing, and synchroniza-

tion between CPU and GPU resources abstracted from its usage [16]. The memory management API in Caffe

handles synchronization between the host and GPU devices such that memory is only transferred back to

the host when necessary. To accomplish this, the state of the memory is stored as either HEAD_AT_GPU,

HEAD_AT_CPU, or SYNCED, where HEAD_AT_GPU indicates the data has been modified by the GPU, but not

synced to the CPU and vice-versa for HEAD_AT_CPU. This state is verified upon accessing the data; if the state

of the data is HEAD_AT_GPU and the host requests the data, a data transfer from the device to the host will be

issued and the state will change to SYNCED.
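A minimal sketch of this head-tracking policy is shown below, assuming a hypothetical SyncedBuffer class; Caffe's actual SyncedMemory implementation wraps real device allocations and transfers, so this is only an illustration of the state transitions, with a second host vector standing in for device memory.

```cpp
#include <cstddef>
#include <vector>

class SyncedBuffer {
 public:
  enum Head { HEAD_AT_CPU, HEAD_AT_GPU, SYNCED };

  explicit SyncedBuffer(std::size_t n) : cpu_(n), gpu_(n), head_(SYNCED) {}

  const float* cpu_data() {            // read-only host access
    if (head_ == HEAD_AT_GPU) {        // device copy is newer: transfer back
      cpu_ = gpu_;                     // stand-in for a device-to-host copy
      head_ = SYNCED;
    }
    return cpu_.data();
  }
  float* mutable_cpu_data() {          // host access that will modify the data
    cpu_data();                        // make sure the host copy is current
    head_ = HEAD_AT_CPU;               // device copy becomes stale
    return cpu_.data();
  }
  const float* gpu_data() {            // read-only device access
    if (head_ == HEAD_AT_CPU) {        // host copy is newer: transfer to device
      gpu_ = cpu_;                     // stand-in for a host-to-device copy
      head_ = SYNCED;
    }
    return gpu_.data();
  }
  float* mutable_gpu_data() {
    gpu_data();
    head_ = HEAD_AT_GPU;
    return gpu_.data();
  }
  Head head() const { return head_; }

 private:
  std::vector<float> cpu_, gpu_;
  Head head_;
};
```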

Caffe Models

Models are described in Caffe using Google Protocol Buffers [16, 39]. This format allows for layer specific pa-

rameters and the topology of the CNN to be described using text files. Furthermore, Caffe solver configurations

for training CNNs can also be specified in a similar manner through another text file using the same format.

The parameters that can be specified in the text files can easily be augmented by modifying the associated

Protocol Buffer configuration file in Caffe. Figure 2.11 shows an example of a convolution layer specified using

the Caffe model format. The first set of parameters correspond to general layer parameters such as the name,

bottom (input), top (output), and type, while the other parameters are specific to the convolution layer.

2.3 Floating-Point Number Representations

Most CNN frameworks support single-precision floating-point for their layer computations [16–20]. A floating-

point number is represented using a sign (S), exponent (E), and mantissa (M) where the number can be in-

terpreted through Equation 2.12, where the offset is usually taken to be the mid-point of the exponent's range when treated as an unsigned number.

Figure 2.11: Caffe convolution model specification

In the case of single-precision floating-point, the corresponding

exponent width is E=8 and the corresponding mantissa width is M=23 for a total bit-width of 32 bits [40]. For

half-precision the exponent width is E=5 and mantissa width M=10, for a total width of 16 bits [40]. In all cases,

some exponents are reserved for particular numbers. An exponent of all ones indicates that the number is ei-

ther infinity (if the mantissa is zero) or NaN (if the mantissa is non-zero). On the other hand, an exponent of

zero represents either zero (if the mantissa is zero) or a denormal number (if the mantissa is non-zero) [40].

A denormal number is a floating-point number in which the implicit one in the summation of Equation 2.12 is removed so that very small numbers can be represented. In the case of FPGAs, denormals are typically not supported

due to the large additional area and latency cost that they require, resulting in all denormals being treated as

zero [41, 42]. Each FPGA vendor provides support for HDL or HLS-based implementations of single-precision

floating-point [14,29,41,42] and there is an open-source project for creating FPGA IP blocks targeting custom-

precision floating-point: FloPoCo [43], though HDL blocks cannot be used in Xilinx SDAccel which is the tool

used in this work.

f = (−1)^S · 2^(E − offset) · (1 + Σ_{i=1}^{M} b_{M−i} · 2^{−i})    (2.12)
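For illustration, the sketch below decodes a word stored in such a format for given exponent and mantissa widths. It assumes an IEEE-style offset of 2^(E−1) − 1 and, as with the FPGA cores described later, flushes denormals to zero; it is not part of the framework itself.

```cpp
#include <cmath>
#include <cstdint>

// Interpret a (sign | exponent | mantissa) word according to Equation 2.12.
double decode_cpfp(uint32_t bits, int E, int M) {
  const uint32_t mant = bits & ((1u << M) - 1);
  const uint32_t exp  = (bits >> M) & ((1u << E) - 1);
  const uint32_t sign = (bits >> (M + E)) & 1u;
  const int offset = (1 << (E - 1)) - 1;        // assumed bias
  if (exp == 0) return 0.0;                     // zero / denormal -> zero
  double frac = 1.0;                            // implicit leading one
  for (int i = 1; i <= M; ++i)                  // b_{M-i} * 2^{-i}
    frac += ((mant >> (M - i)) & 1u) * std::pow(2.0, -i);
  return (sign ? -1.0 : 1.0) * std::ldexp(frac, static_cast<int>(exp) - offset);
}
```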


2.4 Related Work

With CNNs being a heavily active field of development, many recent works across several disciplines have

focused on advancing CNN throughput and reducing power consumption for inference and training of CNNs.

In the following section, several FPGA-based CNN works are discussed in relation to this thesis, as well as

several works discussing training with low-precision arithmetic.

2.4.1 FPGA-based CNNs

There have been numerous works in the FPGA community discussing methods for implementing inference

using FPGAs, but very little work regarding training. Of the prior works implementing training using FPGAs,

both make use of reprogramming the device between layer executions [44, 45], which does not scale to large

designs where reconfiguration times can take on the order of seconds. Furthermore, this implies that if a

new model is created, it may potentially require re-running synthesis and place and route, to accommodate

the new model, which would be a severe runtime penalty compared to more general architectures with no

reprogramming. When compared to training, FPGA-based inference works are much more plentiful, with

Table 2.1 detailing several high-performance implementations.

Table 2.1: Comparison of FPGA Inference Works (programming language; data type; device; technology node; clock frequency; DSP utilization; models considered; performance)

[24]: Intel OpenCL; shared-exponent FP16; Intel A10 1150; 20 nm; 303 MHz; 1,476 DSPs (97%); AlexNet; 1,382 GOP/s

[46]: SDAccel (Xilinx OpenCL); 16-bit fixed; Xilinx KU060; 20 nm; 200 MHz; 1,058 DSPs (38%); AlexNet, VGG16; 266 GOP/s

[47]: Intel OpenCL; 8- and 16-bit fixed; Altera Stratix-V GSD8; 28 nm; 120 MHz; 727 DSPs (37%); AlexNet, VGG16; 117.8 GOP/s

[48]: (not reported); 16-bit fixed; Xilinx Zynq XC7Z045; 28 nm; 150 MHz; (not reported); VGG16-SVD; 136.97 GOP/s

[49]: Intel OpenCL; 16-bit fixed; Intel A10 GX1150; 20 nm; 385 MHz; 1,378 DSPs (90.8%); VGG19; 1,790 GOP/s

[50]: Verilog; 8- and 16-bit fixed; Intel A10 GX1150; 20 nm; 150 MHz; 1,518 DSPs (100%); VGG16; 645.25 GOP/s

From Table 2.1, all of the prior FPGA works seem to have focused on only two networks for their evaluations, either AlexNet or VGG16. This can be attributed primarily to the ease of implementation of the two

networks chosen, with AlexNet [3] being a relatively small network (5 convolution layers) with no forking and

VGG16 being an extremely regular network, using only 4 layer primitives (Convolution, ReLU, Max Pooling

and Fully Connected) with the same hyper-parameters across the layers (uniform window sizes and padding

for all convolution and max pooling layers). The simplicity of the two networks allows for smaller design space

explorations as well as fewer trade-offs when considering other convolution algorithms (e.g. the Winograd Convolu-


tion discussed in Section 2.2.3), though it does not provide guarantees on the performance of the architectures

and techniques used when considering other networks.

Notably, the works in [24] and [49] achieve very high throughput while using Intel’s OpenCL solution while

targeting an Arria 10 device. In [24] they take advantage of the Winograd Convolution algorithm discussed in

Section 2.2.3 in conjunction with a systolic array of processing elements (PEs) for computing dot products.

The Winograd transform is F (4,3) along only the width, meaning that the transform is not applied to produce

2D tiled outputs, but since the PEs are organized as a systolic array the Winograd transform consumes rela-

tively little resources. Each PE implements Wvec dot products that are Cvec wide, where Wvec is the number

of Winograd outputs, and Cvec is the number of input feature maps processed per cycle. The outputs are then

accumulated into shift registers until all input feature maps have been processed. After processing all input

feature maps, the results are sent to downstream processing engines that handle transforming the outputs

from the Winograd domain and other common CNN functions. To reduce bandwidth and area requirements

for arithmetic, they use a shared exponent FP16 representation, where all inputs are stored in IEEE FP16 [40],

and prior to being sent to the PEs, the input feature maps and weights are adjusted such that they have a

common exponent, with a mantissa that is 18-bits. These techniques allow for their implementation to scale

well with the number of PEs, allowing them to have 48 PEs, each of which processes 8 (Cvec ) dot-products

which produce 6 (Wvec ) Winograd outputs. Given that the Winograd input transform is targeting 3×3 convo-

lutions, there are some efficiency losses with larger kernel sizes, though they show that for the AlexNet 5×5

convolution they achieve similar DSP efficiency to the other layers.

While [24] uses the Winograd Convolution algorithm, [49] converts convolutions into matrix multiplica-

tions to allow for a very regular structure to be used for all convolution layers. Their processing elements con-

sist of simple multiply-accumulate (MAC) blocks, with the number of processing elements inside of a compute

unit (CU) and the number of CUs being scalable. The main contribution of this work is to balance computa-

tion, on-chip and off-chip memory bandwidth through their PE array architecture which allows for higher

data-reuse by organizing the PEs into a grid, with one input broadcasting data to each PE row while the other

input broadcasts data to each PE column.

2.4.2 Reduced Precision CNN Training

Several works have focused on reduced precision training of CNNs, with [51,52] using fixed-point with stochas-

tic rounding, while [53] uses dynamic fixed-point. The work in [51] shows that in the case of the MNIST dataset

a network using 16-bit fixed-point can achieve an accuracy that is only degraded by 0.06% compared to single-

precision floating-point. Likewise for the CIFAR-10 dataset they show that with 16-bit fixed-point they can


achieve accuracy that is degraded by only 0.8%. However, for the CIFAR-10 results it was necessary to first

train with low-precision and then increase the precision for the last 15-20 epochs to obtain accuracy com-

parable to single-precision floating-point. The work in [52] further extends the work of [51], showing that in

some cases up to 8 bits can be sufficient to train CNN models. Similarly, the work in [53] shows that by us-

ing low-precision multipliers with high-precision accumulators and dynamic fixed-point, they can achieve

comparable accuracy to single-precision floating-point. With dynamic fixed-point of 10 bits for propagations

and 12 bits for the weight updates, they achieve accuracy that is only degraded by 0.08% of single-precision

in the MNIST dataset and 0.77% in the CIFAR-10 dataset. These works show that CNNs can be trained using

significantly reduced precision rather than single-precision floating point with minimal loss in accuracy.


Chapter 3

Custom-Precision Floating-Point

In the Vivado HLS flow, when a standard C or C++ operator is used it is mapped to a Xilinx IP core [29]. Half,

single, and double-precision floating-point are supported in Vivado HLS, each of which use the Xilinx floating-

point IP core [41], which provides IEEE compliance [40] with some minor restrictions. However, in the case

of applications where rounding does not have to be as strict or where standard floating-point precision is not

necessarily required, these operators allow for limited flexibility and customizability. In the case of CNNs,

as discussed in Section 2.4.2, many prior works have shown that CNNs do not necessarily need high preci-

sion to obtain high accuracy during training, meaning that reduced precision or less area intensive rounding-

modes can be used to improve area utilization of floating-point operations. Furthermore, the standard layers

discussed in Section 2.2.2 require only additions, multiplications, and comparisons, meaning that whether Not a Number (NaN) or Infinity values can ever appear is controlled entirely by the rounding logic utilized by the floating-

point core. This allows for the elimination of all checks for NaN and Infinity, which can further reduce area

requirements for a floating-point core.

To take advantage of the potential area savings through CPFP cores, addition, multiplication, and several

other auxiliary operators have been implemented. Each core is written entirely in HLS such that they can be

easily imported into SDAccel or other HLS projects through the use of a CPFP data type and operator overload-

ing. These cores are capable of using custom exponent and mantissa widths as well as two rounding modes:

round-to-zero or round-to-nearest. The focus of this work has been on low precision floating-point, with the

range of mantissa widths fixed from 2 to 14 bits, though it is easily extendable to larger mantissa widths.


3.1 Multiplication

CPFP multiplication is a relatively simple operation, requiring only a signed E-bit adder (with additional logic

to handle the offset), where E is the exponent width, an (M+1)-bit multiplier, where M is the mantissa width,

normalization logic, and rounding logic. Algorithm 1 demonstrates the key steps for floating-point multipli-

cation, while Figure 3.1 shows the architecture that is used. In Algorithm 1, two additional functions are used

to compute the final result: NORM and RND_NORM. NORM is a small logic block used to add the implicit

one of the floating-point numbers to the mantissa prior to performing the multiplication, while RND_NORM

is used to round the result and normalize the output of the multiplication.

The architecture used is based on the floating-point multiplier described in [54]. It differs in three respects: the result is clipped if the resulting exponent is above the maximum possible exponent, denormal numbers are not supported (allowing lower-latency and lower-area implementations), and the rounding mode is determined at build time. Furthermore, given that the exponent is clipped rather than

set to Infinity, invalid results cannot be input to the CPFP multiplier or output from the multiplier, removing

the need for all NaN and Infinity signalling logic. With respect to FPGA resource utilization for the core, the

(M+1)-bit multiplier uses a single DSP block, while the E-bit adder uses LUTs as the exponent size is generally

small enough to take advantage of the internal carry chains without any timing penalties.

Algorithm 1 Floating-Point Multiplication

1: exp = exp1 + exp2 − OFFSET;
2: mult = NORM(mantissa1) ∗ NORM(mantissa2);
3: mantissa, exp_rnd = RND_NORM(mult);
4: exp += exp_rnd;
5: sign = sign1 XOR sign2;
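A bit-level software sketch of Algorithm 1 is shown below. It assumes round-to-zero (truncation), a (sign | exponent | mantissa) packing, an offset of 2^(E−1) − 1, and exponent clipping instead of Infinity, mirroring the behaviour described above; the actual HLS cores differ in structure and pipelining.

```cpp
#include <cstdint>

template <int E, int M>
uint32_t cpfp_mul(uint32_t a, uint32_t b) {
  const uint32_t MANT_MASK = (1u << M) - 1;
  const uint32_t EXP_MASK  = (1u << E) - 1;
  const int OFFSET = (1 << (E - 1)) - 1;                          // assumed bias

  const uint32_t sign = ((a >> (E + M)) ^ (b >> (E + M))) & 1u;   // step 5
  const uint32_t ea = (a >> M) & EXP_MASK, eb = (b >> M) & EXP_MASK;
  if (ea == 0 || eb == 0) return sign << (E + M);                 // zero inputs

  // NORM: prepend the implicit one, then multiply the (M+1)-bit mantissas.
  const uint32_t ma = (1u << M) | (a & MANT_MASK);
  const uint32_t mb = (1u << M) | (b & MANT_MASK);
  const uint64_t prod = static_cast<uint64_t>(ma) * mb;           // step 2

  // RND_NORM: the product lies in [1, 4); renormalize if it is >= 2, truncate.
  const int shift = (prod >> (2 * M + 1)) ? 1 : 0;
  uint32_t mant = static_cast<uint32_t>(prod >> (M + shift)) & MANT_MASK;
  int exp = static_cast<int>(ea) + static_cast<int>(eb) - OFFSET + shift;

  if (exp <= 0) return sign << (E + M);           // underflow flushed to zero
  if (exp >= static_cast<int>(EXP_MASK)) {        // overflow: clip, no Infinity
    exp = EXP_MASK; mant = MANT_MASK;             // largest representable value
  }
  return (sign << (E + M)) | (static_cast<uint32_t>(exp) << M) | mant;
}
```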

Further optimizations can be made based on the requirements of the algorithm using the multiplication

operator and the FPGA DSP architecture. Figure 3.2 shows one example of a possible area saving optimization

that can be employed targeting the Xilinx UltraScale DSP architecture [55] based on the work in [56]. In this

multiplier architecture, two output values are produced from three operands, where FP0 and FP1 correspond

to the multipliers, and FP2 is the multiplicand. In the case of the Xilinx DSP architecture [55], the multiplier

is a 27-bit by 18-bit multiplier, meaning that this architecture can be used for mantissa widths up to 8 bits to

produce two outputs using a single DSP for the multiplication.


Figure 3.1: CPFP multiplier, two operands

Figure 3.2: CPFP multiplier, three operands


3.2 Addition

The approach taken to implement CPFP addition is to implement a dual path floating-point adder based on

the floating-point adder described in [54], which is shown in Figure 3.3. The algorithm for floating-point ad-

dition is described in Algorithm 2, where REORDER corresponds to swapping the inputs such that the largest

exponent is in exp1. In Algorithm 2, NORM provides the same functionality as in Algorithm 1, by simply adding

the implicit one of the floating-point number to the mantissa of each input. ABS is used for taking the abso-

lute value of the resulting subtraction, and RND_NORM also has the same functionality as in Algorithm 1, it

rounds the input and normalizes the result such that it matches the floating-point format.

The two paths are the close and far paths, where the close path corresponds to when the difference be-

tween exponents is zero or one and the operation is a subtraction, and the far path is all other cases. The close

path can potentially result in leading zeros, requiring both a leading one detector (LOD) and a barrel-shifter

to renormalize the output. The LOD determines the position of the first leading one in the output and gen-

erates a number corresponding to the number of bits that need to be shifted in the barrel shifter, such that

the resulting floating-point is normalized. To allow for CPFP addition with low LUT utilization, a library of

LODs is provided with the CPFP cores to use a particular LOD depending on the mantissa width. In the case

of an overflow the floating-point number is rounded to the largest exponent and mantissa size to guarantee

that Infinity and NaN cannot be generated. Likewise, in the case of an underflow, the floating-point number

is rounded to zero to remove the need for additional logic for denormal numbers.

Algorithm 2 Floating-Point Addition

1: sign0, sign1, exp0, exp1, mantissa0, mantissa1 = REORDER(in0, in1);
2: diff = exp0 − exp1
3: Close-Path:
4:   exp = exp0;
5:   sub = NORM(mantissa0) − NORM(mantissa1);
6:   mantissa = ABS(sub) << LOD(ABS(sub));
7:   sign = SIGN(sub) ? sign1 : sign0;
8: Far-Path:
9:   exp = exp0;
10:  res = NORM(mantissa0) +/− (NORM(mantissa1) >> diff)
11:  mantissa, exp_rnd = RND_NORM(res)
12:  exp += exp_rnd
13:  sign = sign0
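The sketch below illustrates the role of the LOD used in the close path: it reports how far the first set bit of the subtraction result is from the most significant position, which is the shift amount consumed by the barrel shifter. The width and naming are illustrative; the HLS library would fix the width per mantissa size.

```cpp
#include <cstdint>

// Returns the left-shift needed to move the leading one of a W-bit value to
// the top bit position (W if the value is zero).
template <int W>
int leading_one_shift(uint32_t x) {
  for (int i = 0; i < W; ++i)
    if ((x >> (W - 1 - i)) & 1u) return i;   // first set bit found i places down
  return W;                                  // all zeros: result is zero
}

// e.g. for W = 6, x = 0b000101 -> a shift of 3 renormalizes the mantissa.
```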


Figure 3.3: CPFP adder


3.3 Comparison Operators

The CPFP library also provides comparison operators. These operators include: less-than, less-than-or-equal, greater-than, greater-than-or-equal, equal, and max. Comparison of floating-

point numbers can take advantage of unsigned integer comparison operations, with additional logic to handle

the sign. Therefore, in this library, the comparison operations are mapped directly to the Xilinx comparison

operators provided with Vivado HLS.


Chapter 4

FPGA-Caffe: Hardware

To enable inference and training using FPGAs, several kernels needed to be developed and integrated with

the SDAccel environment. These kernels include: a direct convolution engine for forward and backward con-

volution operations, forward and backward ReLU implementations, and forward and backward max pooling

operations. These kernels allow for the FPGA CNN Training Engine (FCTE) to perform forward and backward

operations for Convolution, ReLU, Max Pooling, and Fully Connected layers. Aside from these kernels, support

is also added for a forward-only implementation of a Winograd Convolution Engine, targeting 1×1, 3×3, and

5×5 convolutions with unity stride. The following subsections detail the architectures used for each kernel

and how the systems are configured for different operations.

4.1 SDAccel Bandwidth Analysis

To determine the sizes of the input, weight, and output buffers required for both the FCTE and the Wino-

grad Convolution Engine, an analysis of the available bandwidth in the SDAccel platform for a given board

is required. This is done through a simple test kernel that reads and writes 65536 32-bit numbers from the

on-board DDR into a memory sized to fit the full data. The data is read from and written to the DDR with

differing burst sizes, and the process is done 100 times to demonstrate the full impact of different burst sizes.

The results of this experiment are shown in Figure 4.1, where the burst size is varied by powers of 2 starting

at 16 to match the interface size of 512-bits. Figure 4.1 shows that the run time begins to level off at a burst

size of 4096, where the difference in run time compared to reading all of the data is 11%. From this we can

infer that memory sizes above 4096 elements begin to have diminishing returns, therefore we use a minimum

buffer size of 4096 elements in all designs, though frequently larger buffer sizes are used to allow for caching

the inputs and weights.
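A rough sketch of such a bandwidth test kernel is given below. The 512-bit interface is modelled with a 16-word struct, the burst size is expressed in 512-bit beats, and it is assumed to divide the total transfer; the kernel name, ports, and pragmas are illustrative rather than the exact benchmark used in the thesis.

```cpp
// Read 65536 32-bit words from DDR in bursts, buffer them on chip, write back.
struct bus_t { float w[16]; };              // one 512-bit interface beat

extern "C" void bw_test(const bus_t* in, bus_t* out, int burst_beats) {
  const int TOTAL_BEATS = 65536 / 16;       // full data set in 512-bit beats
  static bus_t buf[65536 / 16];             // on-chip memory sized to the data
  for (int base = 0; base < TOTAL_BEATS; base += burst_beats) {
  read_burst:
    for (int i = 0; i < burst_beats; ++i) {
#pragma HLS PIPELINE II=1
      buf[base + i] = in[base + i];         // sequential reads infer a burst
    }
  write_burst:
    for (int i = 0; i < burst_beats; ++i) {
#pragma HLS PIPELINE II=1
      out[base + i] = buf[base + i];        // sequential writes infer a burst
    }
  }
}
```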


Figure 4.1: SDAccel burst size run-time analysis

The results of Figure 4.1 also show that there can be severe penalties when reading smaller burst sizes of

data. The use of data packing as discussed in Chapter 2.1.1 dictates that the inner-most dimension of the input

data has to be padded such that it is a multiple of 16 (or 32 if using 16-bit numbers). Considering the standard

data layout of NCHW, where N is the batch size, C is the channel size, H is the size of the height, and W is the

size of the width, if the inner-most dimension of the input data is not a multiple of 16 (or 32), handling the striding

operation would require an all-to-all crossbar of the inputs to their corresponding processing elements. On

the other hand, padding the width could result in a significant amount of extra data being transferred and

computed for the final layers of CNNs that can have sizes of 7×7 [31], which would require 2.28 times more data

along the width. Another alternative may be to use a different data layout such as NHWC, HWCN, or CHWN.

NHWC is a common layout used in image processing algorithms and has the advantage that most of the time

the channel size is typically divisible by powers of 2 [3, 6, 31, 57], though the channel size is relatively small

for many CNNs and would result in under utilization of the bandwidth. With this taken into consideration,

the data layout used in this work is HWCN, as the batch size is typically 64 to 256 [3, 6, 31, 57] images, and

the channel size typically ranges from 3 to 512 [3, 6, 31, 57], thus resulting in a very simple access pattern

for the FCTE and Winograd Convolution Engine. Furthermore, this data layout has an added benefit that

the Forward convolution pass requires a reduction across the channels, while the backward pass requires a

reduction across the images. This means that the forward and backward passes can share groups of processing

elements through additional multiplexing while achieving similar throughput.


Figure 4.2: Demonstration of unroll factors that can be used for the convolution forward pass

4.2 FPGA CNN Training Engine

Many of the architectures explored in recent works targeting inference have a similar structure for their pro-

cessing elements (PEs), typically consisting of a set of multipliers followed by an adder tree to implement a

dot-product. The dot-product is implemented by partially or fully unrolling the three summations of Equa-

tion 2.4. If the two inner-most summations are unrolled, then this corresponds to computing the entire con-

volution window in a single cycle. However, this approach can lead to under-utilization of the PE since the

kernel size is dependent entirely on the model. Another approach would be to partially unroll the outer-most

summation of Equation 2.4 by a factor of the input channels, Cfact, as many networks typically have a number

of input feature maps that are divisible by powers of two (aside from the input layers) [3, 6, 31], leading to re-

duced PE under-utilization. The PEs can then be replicated within output feature maps, across output feature

maps by a factor Ofact, or across output images by a factor Nfact. Figure 4.2 shows an example configuration,

where the dot product is unrolled by a factor of 9 (K ×K = 9 in this case) along the kernel dimensions, by Cfact

along the input channel, and then replicated across many outputs by a factor of Ofact, meaning there are Ofact

dot product engines.
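The loop structure below sketches this unrolling scheme for a 3×3 kernel, with Cfact and Ofact as template parameters; the buffer shapes and names are illustrative rather than the FCTE's actual code.

```cpp
// One pipelined iteration at a fixed output position: Ofact dot-product
// engines, each consuming Cfact input channels and a fully unrolled 3x3 window
// per cycle, accumulating partial sums into the output buffer.
template <int CFACT, int OFACT>
void conv_step(const float in[CFACT][3][3],
               const float wgt[OFACT][CFACT][3][3],
               float out[OFACT]) {
#pragma HLS PIPELINE II=1
  for (int o = 0; o < OFACT; ++o) {          // replication across output maps
    float acc = 0.0f;
    for (int c = 0; c < CFACT; ++c)          // partial unroll across channels
      for (int kh = 0; kh < 3; ++kh)         // unrolled kernel window
        for (int kw = 0; kw < 3; ++kw)
          acc += in[c][kh][kw] * wgt[o][c][kh][kw];
    out[o] += acc;                           // accumulate across channel groups
  }
}
```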

In the case of the backward pass, since the operation is also a convolution as discussed in Section 2.2.2, a

similar structure can be employed. For the derivative with respect to the input, the operation is essentially the


same as the forward pass therefore requiring no modifications. On the other hand, the derivative with respect

to the weights results in a reduction across the height and width of the input feature map, as well as across the

batch of images rather than across the convolution window and the input feature maps in the forward pass.

Therefore, if an architecture for the forward pass is replicated such that it processes Hfact ·Wfact ·Nfact inputs per

cycle, where Hfact, Wfact, and Nfact are partial unroll factors for the height, width, and batch size respectively;

then a corresponding backward pass requires the same number of multipliers as the forward pass, but an

additional Hfact ·Wfact ·Nfact-to-1 adder tree.

To address these needs, the PE structure shown in Figure 4.3 is used. Each PE contains 16 multipliers

and one 16-to-1 adder tree. The PEs are grouped into groups of four to allow for resource sharing between

the forward and backward passes. This PE structure corresponds to unrolling the batch size by a factor of 16

(Nfact = 16), while the number of PE groups corresponds to unrolling the number of input feature maps by a

factor of four (Cfact = 4). In the forward pass the PE processes 16 images across four input feature maps per

cycle, which requires the first two stages of the adder trees in the PE group as shown in Figure 4.3. Likewise, in

the backward pass the PE processes 16 inputs across four input feature maps per cycle, however in this case

the entire adder tree is used with no sharing between PEs in the PE-group as shown in Figure 4.3. Therefore

when considering a PE group of four, in the forward pass 16 outputs are produced from 64 inputs every clock

cycle, while in the backward pass four outputs are produced from 64 inputs every clock cycle. To maintain an

initiation interval of one, the input is partitioned into 64 memory partitions, each connected to one PE within

the PE group. Similarly for the outputs a bank of 16 buffers for each PE group is used. For the weights, in

the forward pass, four weights are required per cycle, while in the backward pass 16 are required, therefore

the weights are partitioned into 16 memories with additional muxing to handle the difference between the

forward and backward pass. To achieve higher overall throughput, this structure is replicated across output

feature maps until no more PE groups can fit in the SDAccel partial-reconfiguration region. The replication

across the output feature maps requires a broadcast of the inputs to each PE group, while each PE group will

have independent weights.

The overall architecture is shown in Figure 4.4. It consists of multiple PEs that are instantiated along with

ReLU and pooling layers. Not shown in Figure 4.4 is the logic used to fill each input bank. Given that Cfact ×Nfact elements are processed per cycle, bursts of multiples of Cfact ×Nfact are fetched from on-board DDR as

required. The input buffers are filled with the full convolution window, such that when the window is strided

across the input, only the new unseen input data is fetched. Furthermore, since the data is in an H ·W ·C ·N

layout, a regular access pattern is achieved while also allowing for skipping all zero-padding. Several counters

are used to determine how many of the window positions are non-zero, to reduce the amount of data to be

processed by the PEs. The backward portion of ReLU is implemented within the input stage of each PE given


that it is a relatively simple operation, while the forward portion is initiated at the end of the computation of

a set of outputs. The pooling layer forward and backward paths are connected directly to the AXI interface

to simplify the control logic for pooling, though this comes at the penalty of additional reads and writes to

external DDR.

Figure 4.3: Processing element backward (BW) and forward paths. The forward path is in dashed red, showing the bypass logic used to use only a portion of the adder tree.

4.3 Winograd Convolution Engine

The Winograd convolution engine uses the Winograd convolution algorithm as discussed in Chapter 2.2.3 to

implement 1×1, 3×3, and 5×5 convolutions. The architecture used is similar to the work done by myself and my

colleagues in [37], though it has been modified to use a one dimensional version of the Winograd transform

and to accommodate the modified data layout detailed in Section 4.1. The processing element structure that

is employed is shown in Figure 4.5, which produces two outputs every cycle from 16 inputs and 12 weights.

The Winograd PE is connected to four input transforms and four weight transforms, with the input transforms

shown in Figure 4.6(a) and the weight transforms shown in Figure 4.6(b). The output stage of the Winograd PE

uses two separate blocks for performing the output transform that are detailed in Figure 4.7, with each block

producing one of the outputs along the width of the convolution.

The overall system employed for the Winograd Convolution Engine is shown in Figure 4.8. It consists


Figure 4.4: Overall CNN training architecture

Figure 4.5: Winograd processing element


(a) Winograd input transform (b) Winograd weight transform

Figure 4.6: Winograd input and weight transforms

Figure 4.7: Positive and negative Winograd output transforms

of a similar overall structure to that of the FCTE discussed above, though the PEs are larger and there is no

support for backward operations. The PEs are scaled along the output feature maps by a factor of Ofact and

along the batch of images by a factor of Nfact, with the input transform being replicated Nfact times, and the

weight transform replicated Ofact times. This means that each output of the input transform is required to be

broadcast to Ofact of the PEs, while the output of the weight transform needs to be broadcast across Nfact PEs.

The Winograd PE is capable of processing two outputs per cycle with Cfact set to four and an equivalent Ffact of

three. If the FCTE were to be created with a similar number of outputs per cycle with the same Cfact and Ffact,

this would require 24 multipliers, and 24 adders, while the Winograd PE requires 24 multipliers and 46 adders.

However, if both architectures were to be scaled to produce the same number of results for Nfact images and

Ofact output channels, Equation 4.1 and 4.2 detail the number of multipliers required, while Equation 4.3 and

4.4 detail the number of adders required. From these equations it can be shown that for any Nfact larger than

one, the Winograd PE will use fewer multipliers. Furthermore, the Ofact term for the multipliers could be greatly

reduced by introducing custom logic for the constant multiplications by 0.5 in the weight transformation. In

terms of adder utilization, for small valued Nfact and Ofact the Winograd PE will use more adders due to the

trailing terms in Equation 4.3, while for larger Nfact and Ofact the FCTE will require more adders. In practice,

the FCTE adder utilization will be larger when Nfact is 8 or larger and Ofact is 4 or larger, or Ofact is 8 or larger and


Figure 4.8: Winograd system diagram

Nfact is 4 or larger. Therefore, for smaller valued Nfact or Ofact, the decision to use the Winograd PEs rather than

FCTE would have to be a trade off between multiplier utilization and adder utilization, along with flexibility

trade offs between the two cores.

N_Winograd_multipliers = Nfact · Ofact · 16 + Ofact · 8    (4.1)

N_FCTE_multipliers = Nfact · Ofact · 24    (4.2)

N_Winograd_adders = Nfact · Ofact · 18 + Nfact · 16 + Ofact · 12    (4.3)

N_FCTE_adders = Nfact · Ofact · 24    (4.4)

4.4 Max Pooling Layer

The max pooling layer shown in Figure 4.4 can be further broken down into two primary components, the

forward and backward paths. The forward path is shown in Figure 4.9. It consists of a max tree, that reduces

nine inputs to one. Each max block shown takes two inputs, where each input consists of a 16-bit tag as

discussed in Section 2.2.2 (only four bits are actually required, but 16-bits are used to share an interface with

other layers and to simplify programming), and a CPFP number. The max block outputs the CPFP number

with the maximum value as well as its corresponding tag. Max pooling in many CNNs is usually either a 2×2

or 3×3 window [3, 6, 32], therefore to accommodate both common sizes the max tree is sized to be nine inputs.

Figure 4.9: Max pooling forward implementation

In the case of a 2 · 2 pooling window, the remaining inputs are set to the smallest possible CPFP value. The

overall structure is replicated 16 times such that 16 input images are processed every cycle. The input buffers

are partitioned using complete partitioning along the window size. Furthermore the data is packed such that

16 values are stored at each array index, which implicitly partitions the input and output buffers with a cyclic

factor of 16.
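The tagged max reduction can be sketched as follows, with a float standing in for the CPFP value and an int for the 16-bit tag; the hardware builds this as a tree of two-input max blocks rather than a loop.

```cpp
#include <utility>

using TaggedVal = std::pair<float, int>;   // (value, window-position tag)

static TaggedVal max2(TaggedVal a, TaggedVal b) {
  return (b.first > a.first) ? b : a;      // keep the larger value and its tag
}

// Reduce a nine-element pooling window to the maximum value and its position.
TaggedVal max9(const float win[9]) {
  TaggedVal m{win[0], 0};
  for (int i = 1; i < 9; ++i)
    m = max2(m, TaggedVal{win[i], i});
  return m;
}
// For a 2x2 window the unused five inputs would be preloaded with the most
// negative representable CPFP value, as described above.
```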

The backward path for max pooling is shown in Figure 4.10. It consists of 16 CPFP adders, each requiring

two inputs: the gradient of the loss with respect to the input from the previous layer, and the stored gradient

output which is selected based on the tag value.

4.5 ReLU Layer

Typically in Caffe CNN model descriptions, ReLU is specified as a separate layer [16]. ReLU usually follows

convolution or fully connected layers and is computed in-place for the Caffe CPU implementation. In the

case of FPGA convolution engines described in Sections 4.3 and 4.2, the ReLU forward and backward paths

are added directly to the convolution pipeline.

For a given output in the forward pass, the convolution needs to iterate through every input feature map

and every value in the input convolution window prior to completion. Once this has been completed, the

ReLU core is enabled such that if the input to the core is negative then the output is zero, otherwise the input

is allowed through the core. If the output is non-zero, then a one-bit tag is set indicating that ReLU was active.


Figure 4.10: Max pooling backward implementation

The ReLU forward core processes 16 inputs in parallel, with 16 outputs and 16 1-bit tag outputs. This core is

replicated such that there is an equivalent amount of ReLU forward cores and PE groups.

In the backward pass, all of the tags for a given output are read along with the derivative of the loss with

respect to the data from the previous layer. If the given output tag is set, then the derivative of the loss with

respect to the data from the previous layer is passed through the ReLU backward layer into the next stage of

the pipeline, otherwise zero is passed through. The ReLU backward layer is used within the input stage of the

processing elements, and also replicated across the weight input stages for each PE such that the derivative of

the loss with respect to the data from the previous layer can be stored in either the input buffer or the weight

buffer depending on the needs of the layer.
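Functionally, the forward and backward ReLU paths reduce to the following, with a bool standing in for the one-bit tag; the hardware versions operate on 16 CPFP values in parallel and are embedded in the convolution pipeline as described above.

```cpp
// Forward: pass positive values through and record whether ReLU was active.
inline float relu_fw(float x, bool& tag) {
  tag = (x > 0.0f);
  return tag ? x : 0.0f;
}

// Backward: gate the incoming gradient with the stored tag.
inline float relu_bw(float grad_out, bool tag) {
  return tag ? grad_out : 0.0f;
}
```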

4.6 Core Usage

There are several registers that are used for configuring the operation of the convolution engines discussed in

Sections 4.2 and 4.3. Each register is described in Table 4.1, with detailed descriptions and example settings

described in the following subsections.

Table 4.1: Description of configuration registers used for operating the Direct and Winograd convolution engines

Register      | Description                                                                                                | Min Value | Max Value
inChannels    | Number of input feature maps                                                                               | 4         | -
outChannels   | Number of output feature maps                                                                              | 1         | -
burstChannels | Number of input feature maps to be burst transferred                                                       | 4         | 2048
rpo           | Number of input reads per output (RPO)                                                                     | 1         | -
rpofm         | Number of output reads required to cover all output feature maps                                           | 1         | -
burstoc       | Number of output feature maps to be burst transferred                                                      | 1         | (16 × 256) / numImages
yDim          | Input y dimension size                                                                                     | 1         | -
xDim          | Input x dimension size                                                                                     | 1         | -
kSize         | Convolution kernel size                                                                                    | 1         | 11
numGroups     | Number of convolution groups (only supported in the forward path to support the AlexNet pretrained model)  | 1         | 2
numImages     | Number of images to be batch processed                                                                     | 16        | 256
reluWeights   | Enables backward ReLU computation on the weight inputs                                                     | 0         | 1
relu          | Enables ReLU computation                                                                                   | 0         | 1
backward      | Sets forward mode (0), backward data mode (2), or backward weights mode (1, only applicable for FCTE)      | 0         | 2
stride        | Convolution stride                                                                                         | 1         | 15
pad           | Convolution padding                                                                                        | 0         | 15
engineOp      | If set, the kernel computes max pooling; otherwise it computes convolution                                 | 0         | 1
pkSize        | Pooling kernel size                                                                                        | 2         | 3

As discussed in Section 2.2.2, fully connected layers and convolution layers are essentially the same, thus


they can share the same operational models, though the core settings will differ largely depending on which operation is in use. In the case of the fully connected layer, the settings in Table 4.2 are used, with other registers configured the same as if the operation were convolution.

Table 4.2: Fully Connected Layer Register Settings

Register  | Setting
yDim      | 1
xDim      | 1
kSize     | 1
numGroups | 1
stride    | 1
pad       | 0
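As an illustration, a host-side configuration for a fully connected layer might be assembled as below, using the field names of Table 4.1; the structure and helper function shown are hypothetical, and the burst-related parameters, which would be derived from buffer sizes, are omitted.

```cpp
struct EngineConfig {
  int inChannels, outChannels, burstChannels;
  int rpo, rpofm, burstoc;
  int yDim, xDim, kSize, numGroups, numImages;
  int reluWeights, relu, backward, stride, pad, engineOp, pkSize;
};

// Treat each input neuron as a 1x1 feature map, per the Table 4.2 settings.
EngineConfig fully_connected_config(int inFeatures, int outFeatures, int batch) {
  EngineConfig c{};
  c.inChannels = inFeatures;
  c.outChannels = outFeatures;
  c.numImages = batch;
  c.yDim = 1;  c.xDim = 1;  c.kSize = 1;   // Table 4.2: 1x1 "image", 1x1 kernel
  c.numGroups = 1;  c.stride = 1;  c.pad = 0;
  c.backward = 0;                          // forward pass
  c.engineOp = 0;                          // convolution mode (not pooling)
  return c;
}
```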


Chapter 5

FPGA-Caffe: Software

To enable the hardware developments discussed in Chapters 3 and 4 to be deployed for inference and training

CNNs, software infrastructure is required. Rather than building a separate CNN framework for deploying these

hardware developments, the software has been built into the Caffe framework [16], allowing for the use of all

of the features of Caffe along with new FPGA specific features. The following sections detail the new features

present in FPGA-Caffe targeting the hardware developments in the previous chapters.

5.1 OpenCL Brew

This work extends the baseline Caffe framework to include the OCL (OpenCL) Brew, which provides support

for Xilinx FPGA-based CNNs and could easily be adapted to target Intel’s FPGA OpenCL [14] programming

environment as well. The user can choose between different Brews by building the framework using the cor-

responding Makefile flags and changing the Brew to OCL. Figure 5.1 shows an overview of the augmented

system with the OCL Brew, where inputs and outputs are the same as in the CPU and GPU Brews, but the un-

derlying hardware of the system is comprised of the CPU for host code and the FPGA for layer computations.

To perform a forward pass (inference) using the OCL Brew, an additional API call was added: forward_ocl().

Figure 5.1: High-Level View of the Brew Options in Caffe


The forward_ocl() API call is used as the forward operator on the condition that the Brew is OCL and the

function is defined, otherwise it defaults to the forward_cpu() call as in the baseline Caffe implementation

(provided that it exists). Similarly, there is an equivalent backward_ocl() API call that is used as the backward

operator for computing the gradients of a given layer. Similar to the baseline Caffe implementation, adding a

new FPGA-based layer involves defining the associated forward_ocl() and backward_ocl() API calls, as well as

layer specific setup functions.
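The dispatch rule can be summarized with the following sketch, assuming a hypothetical Layer base class; the real Caffe layer interface carries blobs and parameters and is more involved.

```cpp
enum class Brew { CPU, GPU, OCL };

struct Layer {
  virtual ~Layer() = default;
  virtual void forward_cpu() = 0;                    // baseline implementation
  virtual bool has_forward_ocl() const { return false; }
  virtual void forward_ocl() {}                      // FPGA implementation

  void Forward(Brew brew) {
    if (brew == Brew::OCL && has_forward_ocl())
      forward_ocl();       // run the FPGA version of the layer
    else
      forward_cpu();       // fall back to the baseline CPU path
  }
};
```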

5.2 OpenCL Memory Management and Synchronization

Support for memory synchronization between the host and the FPGA in the FPGA Caffe framework builds on

the memory synchronization features described in Section 2.2.4. To accomplish similar functionality, OpenCL

APIs are used with an additional object corresponding to the FPGA device memory object for each data struc-

ture. When data is passed from the host to the FPGA, the state of the memory changes to SYNCED from

HEAD_AT_CPU such that on subsequent accesses it will either stay in the device memory or be transferred

back to host memory. If the data is required by the host, a memory transfer will be issued from the device

to the host and the state of the memory will change to SYNCED. To access the FPGA memory object, calls to

either mutable_ocl_data() for modifying data (layer output data) or ocl_data() for static data (layer input data,

such as weights), are required. These two functions were added to Caffe to handle both the creation and syn-

chronization of the device and host memory, while maintaining transparency of memory manipulation as in

the baseline Caffe implementation. There are also similar functions for the gradients as in the baseline version

of Caffe, mutable_ocl_diff() and ocl_diff().

Table 5.1: Memory Synchronization State Transitions

Memory Transfer Call                    | Initial: SYNCED | Initial: HEAD_AT_CPU | Initial: HEAD_AT_OCL
ocl_data(), ocl_diff()                  | SYNCED          | SYNCED               | HEAD_AT_OCL
cpu_data(), cpu_diff()                  | SYNCED          | HEAD_AT_CPU          | SYNCED
mutable_ocl_data(), mutable_ocl_diff()  | HEAD_AT_OCL     | HEAD_AT_OCL          | HEAD_AT_OCL
mutable_cpu_data(), mutable_cpu_diff()  | HEAD_AT_CPU     | HEAD_AT_CPU          | HEAD_AT_CPU

To accommodate the custom-precision floating-point (CPFP) data type discussed in Chapter 3, dynamic

type casting is used on the host side to allow for CPFP layers to be used alongside single-precision floating-

point layers. Therefore, on the host-side for CPFP types that are 16-bits wide or less, the host memory con-

sumption is twice the amount that is required as shown in Figure 5.2. However, to reduce bandwidth con-

sumption for CPFP types 16-bits or less, optional parameters have been added to the mutable_ocl_data(),

mutable_ocl_diff(), mutable_cpu_data() and mutable_cpu_diff() to control whether the host-visible data size


is transferred to the device, or the actual CPFP size is transferred. This potentially halves the on-device memory footprint, where memory is less plentiful, and halves the amount of data that is required to be transferred from the host to the device.

Figure 5.2: Demonstration of memory layout before and after transforming the data from single-precision floating-point to CPFP. This example assumes that the CPFP data is 16 bits or less.

5.3 Auxiliary Layers

To facilitate the usage of FPGA-specific layers, several auxiliary layers were added to FPGA-Caffe. These layers

include: Pad layer, HWCN layer, and CPFP layer.

5.3.1 Pad Layer

The Pad layer allows for padding in arbitrary dimensions of the input data to the layer, such that the data is

in a shape that may be simpler to work with at the interface to a given FPGA layer. In the case of SDAccel,

as discussed in Section 2.1.2, the AXI interface can support data widths of up to 512 bits, meaning that for

32-bit numbers this can accommodate 16 values per cycle, and for 16-bit numbers it can accommodate 32

values per cycle. To take advantage of this with simple access patterns, the input data needs to be aligned

such that the inner most dimension of the input is a multiple of 16 (or 32), which is where padding in a given

dimension can be a useful utility. Another use case for padding is when a given input does not perfectly align

with the shape expectations of an FPGA-based CNN engine. This may be the case when a given FPGA-based

CNN engine processes a portion of the input in parallel, such as the input channels of an image. For the

most part, the channel counts of convolution layers in common CNN models are multiples of powers of two [3, 6, 31, 32], though the first layer usually has three input channels corresponding to the three colour channels of an image. Padding can be performed along the channel dimension to make the first layer more amenable to an FPGA engine that requires an even number of input channels.

Figure 5.3: Demonstration of a CPFP variable stored within host memory

5.3.2 HWCN Layer

The HWCN layer reorders the input data between the N × C × H × W format and the H × W × C × N format, where N is the number of images, C is the number of input feature maps, H is the height of the image, and W is the width of the image. This layer is required because the standard data layout in Caffe is N × C × H × W, whereas batches of data are simpler to access when the inner-most dimension is aligned to a power of two, as is frequently the case for the batch size; this avoids padding, which can lead to under-utilization of the FPGA processing elements.
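A straightforward sketch of the reordering is shown below (forward direction only, with an illustrative function name); note that the batch index N becomes the inner-most, contiguous dimension.

```cpp
#include <vector>

void nchw_to_hwcn(const std::vector<float>& src, std::vector<float>& dst,
                  int N, int C, int H, int W) {
  dst.resize(src.size());
  for (int n = 0; n < N; ++n)
    for (int c = 0; c < C; ++c)
      for (int h = 0; h < H; ++h)
        for (int w = 0; w < W; ++w)
          // NCHW index: ((n*C + c)*H + h)*W + w
          // HWCN index: ((h*W + w)*C + c)*N + n, so the batch is innermost
          dst[((h * W + w) * C + c) * N + n] = src[((n * C + c) * H + h) * W + w];
}
```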

5.3.3 CPFP Layer

The CPFP layer is used for converting input data to or from single or double-precision to the CPFP repre-

sentation that is described in Chapter 3. This layer allows for interfacing between layers that require CPFP

representation and those that do not have CPFP implementations. Currently, the FPGA layers are written us-

ing CPFP, but not all layers in the FPGA Caffe environment have FPGA support, meaning those that do not

would require the CPU for execution, in which case single-precision would be faster given that it has a dedi-

cated datapath in the CPU. Figure 5.3 shows how abnormal CPFP sizes are handled when converting to CPFP

data types. Additional padding is added to the most significant bits of the CPFP number to make it 16-bits,

such that it is properly aligned with the host memory.

Figure 5.4 shows an example use case for the HWCN and CPFP layers. This layer sequence may represent

a portion of a larger model, where an FPGA-based layer computation sends data back to the host CPU to


compute a non-supported layer, followed by sending the data back to the FPGA. The CPFP and HWCN layers

are used as intermediate steps in the data transfer such that the format of the data is as expected by the host

CPU.

Figure 5.4: Auxiliary layer usage

5.4 FPGA Layers

To enable the usage of the FPGA kernels, several additional layers were added to FPGA-Caffe. These layers in-

clude: XCLProgram layer¹, OCLCRHWCN layer, OCLHWCNInnerProduct layer, and OCLPoolingHWCN layer.

The XCLProgram layer is used for programming the FPGA with a specific kernel and can be inserted anywhere

in the model to allow for simple stand-alone kernel development prior to integrating into a full system. If

a single binary is used for the whole model, the XCLProgram layer can be enabled to only program the de-

vice once during the first forward pass. The other FPGA specific layers are derived from the CPU equivalent

layers and are used for configuring the registers of Table 4.1. The OCLCRHWCN layer, OCLHWCNInnerProd-

uct layer, and OCLPoolingHWCN layer are layers specifically used for launching FPGA-based layers for fused

Conv-ReLU, Inner Product, and Pooling respectively, where the data is laid out in HWCN order. Figure 5.5

shows an example model description using the OCL equivalent cores. In this figure, num_pe corresponds to

the number of PEs per group, while num_cu corresponds to the unroll factor Ofact discussed in Chapter 4.

5.5 Testbenches

Testing a layer in FPGA Caffe can be accomplished in two ways, depending on the stage of development. Layers can be tested using the individual test cases provided in Caffe. A test case in Caffe can be

used to test an FPGA implementation by changing the Brew to OCL and modifying parameters to suit the

layer. This allows for testing integration with the Caffe environment to determine whether the layer setup

is correct. Alternatively, the FPGA implementations can be tested through the use of standalone host code

by invoking only the host code required to launch the kernel. In either case, a layer can be tested using a

hardware or software emulation based implementation created with Xilinx SDAccel. Provided in FPGA-Caffe

are several standalone host test cases that utilize the Google Test framework for unit testing the convolution,

pooling, and fully connected functionality of the FPGA kernels for both forward and backward propagation

when applicable. Furthermore, unit tests exist for evaluating whether the CPFP operators are performing the

correct operations as well as unit tests for evaluating SDAccel bandwidth, which was discussed in Chapter 4.

Figure 5.5: OCLCRHWCN layer description

¹ This layer was developed principally by Griffin Lacey from University of Guelph in [37].


Chapter 6

Evaluation

To evaluate the performance of the CPFP cores as well as the FPGA CNN Training Engine (FCTE), several met-

rics must be considered. These metrics include: area utilization, operating frequency, inference throughput,

inference accuracy, training throughput, and training accuracy. In the case of FCTE and the Winograd Convo-

lution Engine (WCE), these metrics vary depending on the configuration of the CPFP cores, therefore in this

chapter several different configurations are evaluated.

Figure 6.1 shows the different system configurations that are considered in this chapter to evaluate through-

put and accuracy. The FPGA-based system shown in Figure 6.1(a) uses an Alpha-Data ADM-PCIE-8K5 card [58],

with a Xilinx Kintex Ultrascale XCKU115, connected to a host CPU through PCIe Gen 3. The host CPU in this

case is an Intel Xeon E5-2650 v4 running at 2.2 GHz. The Xilinx SDAccel version used to generate the hardware

systems is 2016.3. CPU-based testing shown in Figure 6.1(c) uses an Intel Xeon E5-2650 v4 running at 2.2 GHz,

with OpenBLAS [33] used as the linear algebra library, with eight threads enabled. Eight threads is chosen

as this results in the best CPU throughput for the networks considered in this chapter, after also testing each

network with 2, 4, 12, 16, and 24 threads enabled. Finally, the GPU-based system shown in Figure 6.1(b) con-

tains an NVIDIA M60 GPU [59] and uses the cuDNN [21] library for CNN specific operations. The NVIDIA M60

GPU contains two GPUs on board, however we use only one as using both would impact model convergence,

which would result in an invalid baseline comparison. For all of the systems shown in Figure 6.1, the FPGA

Caffe framework is used for testing. For both the FPGA and GPU based systems, Caffe is used with one thread

enabled.


(a) FPGA Evaluation System (b) GPU Evaluation System

(c) CPU Evaluation System

Figure 6.1: FPGA, GPU, and CPU evaluation systems considered to measure accuracy and throughput

6.1 Custom-Precision Floating-Point Area Utilization and Operating Frequency

In Figures 6.2, 6.3, 6.4, 6.5, 6.6, and 6.7, the LUT utilization, flip-flop (FF) utilization, and clock period of the CPFP cores are detailed in comparison to the Xilinx half and single-precision cores for both supported rounding modes. The half and single-precision floating-point results were gathered with Vivado HLS 2017.1 with several different configurations of DSP utilization, while the CPFP core results were gathered using Vivado HLS 2016.3 due to a change in the way DSPs are inferred in 2017.1¹. To evaluate resource utilization for each

core, synthesis and place and route is run on two modules: one that multiplies two values together and one

that adds two values together. For each synthesis and place and route run the target clock frequency is set to

250 MHz and the part to the Xilinx Kintex Ultrascale XCKU115. The primary reason for selecting a target clock

frequency of 250 MHz is to match the target clock frequency of the SDAccel platform, though a lower clock fre-

quency could be used to improve area utilization. Two tests are run for each core: round-to-nearest enabled

and round-to-zero enabled. In each case the numbers were gathered using only one run, so it is possible that

there may be some variation depending on the random seed.

1Vivado HLS 2017.1 was unable to infer DSP usage for smaller mantissa sizes in the multipliers resulting in higher LUT utilization.
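As a rough sketch, the two test modules can be thought of as minimal HLS top-level functions like the ones below; the cpfp type and header are placeholders for the library's custom-precision type, and the pragmas shown are only one reasonable configuration.

```cpp
// Sketch of the two synthesis test modules: one multiplies two CPFP values and
// one adds two CPFP values. The cpfp type and "cpfp.hpp" are placeholders.
#include "cpfp.hpp"

// Multiplier test module: pipelined so a new operation can start every cycle;
// the mantissa product is expected to map onto a DSP block.
void cpfp_mult_test(cpfp a, cpfp b, cpfp *out) {
#pragma HLS PIPELINE II=1
  *out = a * b;
}

// Adder test module: pipelined, implemented in LUTs and FFs (no DSPs).
void cpfp_add_test(cpfp a, cpfp b, cpfp *out) {
#pragma HLS PIPELINE II=1
  *out = a + b;
}
```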


Figure 6.2: CPFP multiplier LUT utilization compared to Xilinx floating-point IP cores

6.1.1 CPFP Multiplier

Figure 6.2 shows that with round-to-nearest (RTN) enabled, the CPFP multiplication core almost always has

higher LUT utilization than the equivalent single-DSP half-precision core (fp16, 1 DSP). However, with round-to-zero enabled, the CPFP core achieves comparable LUT utilization to the Xilinx half-precision cores whether one or two DSPs are in use, as shown in Figure 6.2. In this case, a mantissa width of 5-7 allows for LUT utilization improvements for almost every combination of exponent and mantissa compared to the Xilinx half-precision core with one DSP in use. In this range, a mantissa width of 5 gives the largest LUT utilization improvements: 54.3% for an exponent width of 4, 13% for an exponent width of 5, 32.6% for an exponent width of 6, and 6.5% for an exponent width of 7. Notably, the LUT utilization is relatively stable across the range of mantissas. Given that

a DSP block is used for the multiplication, most of the remaining logic is independent of the mantissa size and

thus remains stable across the range of mantissas.

On the other hand, Figure 6.3 shows that with either rounding mode the CPFP multiplication core usually

uses more FFs than the equivalent single DSP half-precision core for bitwidths larger than 11, though the

utilization is comparable with round-to-zero enabled. With round-to-zero the CPFP multiplier uses 59 FFs

with an exponent width of 6 and mantissa width of 5 compared to the 60 that are used by the half-precision

core, though if a half-precision CPFP multiplier is considered it uses 91 FFs compared to the 60 used by the

Xilinx half-precision core. The difference in FF utilization can largely be attributed to the heavily pipelined

implementation of the core synthesized by HLS, which results in a higher latency multiplier, but with a shorter

clock period as shown in Figure 6.4. The clock period of the CPFP multipliers is almost always shorter than

that of the Xilinx half-precision core with a single DSP, with the CPFP core achieving a two-fold reduction in

clock period compared to the Xilinx half-precision core with an exponent width of 6, mantissa width of 5, and

round-to-zero enabled. While the CPFP core is able to achieve a shorter clock period in these tests, the test circuits here are relatively small and may not expose clock period degradation that may be present in larger designs, though this is not explored in this work.

Figure 6.3: CPFP multiplier FF utilization compared to Xilinx floating-point IP cores

Figure 6.4: CPFP multiplier clock period compared to Xilinx floating-point IP cores

6.1.2 CPFP Adder

In Figure 6.5, when comparing against the Xilinx half-precision core using the same number of DSP blocks as

the CPFP adder (fp16 0 DSP in Figure 6.5), the CPFP adder always has lower LUT utilization when the mantissa

width is below 6 and round-to-nearest is enabled. When considering a mantissa width of 5, the LUT utilization

improvement is 14.9% for an exponent width of 4, 24.57% for an exponent width of 5, 21.7% for an exponent

width of 6, and 15.4% for an exponent width of 7. When round-to-zero is enabled, the LUT utilization is lower than that of the Xilinx half-precision core for mantissa widths below 8. Compared with round-to-nearest, the area improvement is less pronounced at the lower end of the mantissa range because the rounding logic is proportionally less complex with decreasing mantissa sizes, though for larger mantissa sizes the round-to-zero mode begins to provide significant area improvements over round-to-nearest because its LUT usage grows at a lower rate.

Figure 6.5: CPFP adder LUT utilization compared to Xilinx floating-point IP cores

Figure 6.6: CPFP adder FF utilization compared to Xilinx floating-point IP cores

In a similar manner to the CPFP multipliers, Figure 6.6 shows that the CPFP adders also have higher FF utilization than the Xilinx half-precision adder with no DSPs in use. This occurs for the same reason as in the CPFP multiplier: a higher accumulation latency due to more extensive pipelining by HLS. As with the CPFP multiplier, the CPFP adder also achieves clock period improvements over the Xilinx half-precision adder, whose clock period is 1.54 times longer than that of the CPFP adder for an exponent width of 6 and mantissa width of 5 with round-to-nearest enabled, as shown in Figure 6.7.


Figure 6.7: CPFP adder clock period compared to Xilinx floating-point IP cores

6.2 Winograd Convolution Engine and FPGA CNN Training Engine Area Utilization and Operating Frequencies

To evaluate the performance of FCTE, two different engines are synthesized: forward-only and the full training engine. The reason for this is to demonstrate the area utilization and frequency penalty that results from including the backward paths as well. The full FCTE includes all stages of the adder tree as shown in Figure 4.3. Table 6.1 shows the area utilization and frequency of the FCTE for both the forward-only and the full engine when using a CPFP representation with an exponent width of 6 and mantissa width of 5 with 64 PEs (Cfact = 4, Ofact = 16), while Table 6.2 shows the same configuration but with a mantissa width of 7. Also shown in Tables 6.1 and 6.2

is the area utilization and operating frequency of WCE with 64 PEs (Nfact = 16, Ofact = 4) for an exponent width

of 6 and mantissa widths of 5 and 7, respectively.

Notably, when comparing mantissa widths of 5 and 7, FF utilization increases by approximately 16% for

FCTE forward-only, 15% for FCTE, and 16% for WCE. This is roughly in line with the cost of the CPFP adders, which occupy nearly double the area of the CPFP multipliers and are used slightly more than the CPFP multipliers per PE due to the accumulators shown in Figure 4.3 (or Figure 4.5 for WCE). For similar reasons, LUT

utilization increases by approximately 12.6% for FCTE Forward-Only, 12.9% for FCTE, and 14.5% for WCE.

On the other hand, the increases in BRAM utilization are architecture-specific, with a 9.5% increase for FCTE

Forward-Only, 6.5% for FCTE, and 0% for WCE. This wide difference in BRAM utilization can be attributed to

how the various buffers in the engines are sized and how they are packed by Vivado HLS into BRAMs. From

these results, it can be inferred that the BRAMs are more densely packed in FCTE Forward-Only and FCTE

compared with WCE, as a 2-bit increase results in non-negligible increases in BRAMs. This points to potential

for further increasing the size of the WCE buffers, though given that WCE is not the primary focus of this work

it has not been as heavily optimized.


Table 6.1: Convolution Engine Area Utilization (Utilization Percentage in Brackets) and Operating Frequency, Exponent Width of 6, Mantissa Width of 5

Engine              FF                LUT               DSP             BRAM          Frequency
FCTE Forward-Only   295,242 (22.4%)   234,289 (35.6%)   1,109 (20.1%)   681 (31.5%)   181.4 MHz
FCTE                390,714 (29.6%)   313,507 (47.6%)   1,153 (20.9%)   954 (44.2%)   166.5 MHz
WCE                 394,160 (29.9%)   312,988 (47.5%)   1,152 (20.9%)   517 (23.9%)   198.3 MHz

When considering the clock frequency of the various designs across both mantissa width settings, it is

immediately apparent that there is a large degradation in frequency when moving from a mantissa width of 5 to 7. This can be attributed to two primary factors: (1) the wider data paths increase routing congestion, making timing more difficult to meet, and (2) the device itself is partitioned across two separate dies to increase capacity, which leaves limited routing resources at the boundary between the two dies. The penalty for the increased data path size is most visible in the FCTE Forward-Only engine, as this design is small enough to still fit on a single die, with a frequency degradation of 8%. This degradation could potentially be resolved if the CPFP cores were further customized through RTL and placement optimization, though it is generally the case that larger data paths will be more difficult to route given the finite routing resources of FPGAs. The penalty due to the physical partitioning of the two dies

is much more evident in FCTE and WCE, as these designs require over 50% of the LUTs on the device, with a

degradation of 33% for FCTE and 31% for WCE. If we assume that 8% of this degradation is due to the increased

data path sizes, this corresponds to an average frequency degradation of 24% between the two designs, which

significantly limits the performance as the designs are further scaled.

FCTE Forward-Only uses noticeably fewer resources than the full FCTE for the same number of PEs. The

number of FFs is 32% larger for FCTE, the number of LUTs is 33% larger, and the number of BRAMs is 40%

larger. The increase in both FFs and LUTs can be attributed to the additional adders in the adder tree required

by backward computations of FCTE shown in Figure 4.3 along with additional multiplexing required by the

data path, more complex control logic, and a deeper pipeline. The increase in BRAMs for FCTE compared to FCTE Forward-Only arises because the output buffers and weight buffers must be the same size, since the gradients of the output are stored in the weight buffer depending on the mode of operation. While WCE has a simi-

lar number of PEs, the WCE PEs are significantly larger as they produce 128 outputs per cycle when 64 PEs

are placed, while the FCTE Forward-Only produces 64 outputs per cycle. This leads to significantly higher

resource utilization compared to FCTE Forward-Only, but it also has a higher clock frequency due to lower

fan-out requirements on the inputs as a given input only needs to fan-out to Ofact of the PEs, which is four for

this configuration rather than 16 for FCTE Forward-Only.


Table 6.2: Convolution Engine Area Utilization (Utilization Percentage in Brackets) and Operating Frequency, Exponent Width of 6, Mantissa Width of 7

Engine              FF                LUT                DSP             BRAM            Frequency
FCTE Forward-Only   342,521 (26%)     264,449 (40.1%)    1,109 (20.1%)   745 (34.5%)     167 MHz
FCTE                448,167 (34%)     354,448 (53.78%)   1,153 (20.9%)   1,018 (47.1%)   111.5 MHz
WCE                 458,640 (34.8%)   358,573 (54.41%)   1,152 (20.9%)   517 (23.9%)     136.2 MHz

6.3 Training Accuracy

To evaluate the effectiveness of the CPFP cores, the CNN training engine is used to train models for MNIST and CIFAR-10. In each case, round-to-zero is employed for the CPFP multipliers

and round-to-nearest for the CPFP adders. This allows for a reduction in area for the floating-point multipli-

ers, while still allowing for accurate accumulations. Prior works used stochastic rounding when considering

low-precision arithmetic for training CNNs [51]. While this may be necessary for low-precision fixed-point

training, it is not necessary for low-precision floating-point as the dynamic range of the floating-point num-

bers is a function of the exponent width rather than the bit-width of the full number in fixed-point representa-

tions. This allows for very small gradients to still be represented (although less accurately) with low-precision

floating-point numbers, which in turn allows for simpler rounding schemes to be employed. In each model

that is trained, the CPFP cores are used for computing the forward pass and the gradients of the fully con-

nected, ReLU, max pooling, and convolution layers. The updates to the weights are applied on the host side

by converting the gradient to single-precision floating-point, applying the update, and then converting back to

reduced-precision floating-point. This approach is taken mostly for simplicity reasons, as the update compu-

tation has a relatively low computational complexity and if it were to require higher precision, custom blocks

could be used to compute it with little additional area. Furthermore, this approach allows for experimentation

with different solvers in Caffe without having to design a new solver for each experiment.
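A minimal sketch of this host-side update for plain SGD is shown below; the cpfp type, its float conversion, and the header name are placeholders, and the actual framework applies the update through Caffe's solver classes.

```cpp
// Sketch of the host-side weight update: widen to FP32, apply the update,
// narrow back to the reduced-precision format. The cpfp type, its float
// conversion, and "cpfp.hpp" are placeholders, not the actual framework API.
#include <cstddef>
#include <vector>
#include "cpfp.hpp"

void apply_sgd_update_on_host(std::vector<cpfp>& weights,
                              const std::vector<cpfp>& weight_gradients,
                              float learning_rate) {
  for (std::size_t i = 0; i < weights.size(); ++i) {
    float w = static_cast<float>(weights[i]);                 // widen weight to FP32
    const float g = static_cast<float>(weight_gradients[i]);  // widen gradient to FP32
    w -= learning_rate * g;                                   // apply the update in FP32
    weights[i] = cpfp(w);                                     // narrow back for the FPGA
  }
}
```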

6.3.1 MNIST

The MNIST dataset [60] consists of 60,000 training images and 10,000 test images with each image being black

and white with a size of 28× 28. To train the network the Adam Optimizer described in [61] is used with a

minibatch size of 256. The model used for this experiment is similar to LeNet-5 [60] and the model used

in [51]: two 5×5 convolutional layers with ReLU activations, each followed by a max pooling layer with stride

of two and window size of 2×2, with two fully connected layers followed by a 10-way softmax at the end of

the network. This model is larger than that of [51], with the first convolutional layer having 32 output feature

maps, the second having 64 output feature maps, and the first inner product layer having 512 outputs. The model is slightly larger to reflect that most CNNs tend to have a larger number of output feature maps, which can result in more numerical inaccuracies due to more accumulations across channels in the forward pass.

Figure 6.8: Test error rates for MNIST and CIFAR-10 when using different floating-point representations: (a) MNIST test error for different floating-point bit widths; (b) CIFAR-10 test error for different floating-point bit widths

Figure 6.8(a) shows the test error, which corresponds to the number of images that are incorrectly classi-

fied, for various settings of the exponent width and mantissa width along with single-precision floating-point

as a baseline. To determine the level of precision that is required, an accuracy degradation of 1% is considered

to be the maximum level of acceptable degradation. With this taken into consideration, for exponent widths of

5, 6, and 7 a mantissa width of 3 is required to be within the 1% accuracy degradation threshold. In the case of

an exponent width of 4, no mantissa setting resulted in acceptable accuracy, with the lowest achievable error

rate being 9%.

6.3.2 CIFAR-10

The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images, where each image is 32×32

with three colour channels [62]. Stochastic gradient descent is used with a minibatch size of 256 and a CNN

similar to [3, 51]: three 5×5 convolution layers with ReLU activations, each followed by a max pooling layer

with stride of two and window size of 3×3. The first two convolution layers have 32 output feature maps, while

the last has 64 output feature maps. At the end of the network there is a fully connected layer with 10 outputs

connected to a softmax layer.

Figure 6.8(b) shows the test error for various exponent and mantissa widths with single-precision floating-

point as a baseline. Notably, an exponent width of 5 results in a degradation in accuracy compared to single-

precision floating-point. This can be attributed to a lack of denormal support in the CPFP cores, meaning

that the lower valued loss gradients were rounded to zero due to low dynamic range. This was even more

apparent for exponent widths of 4, where the lowest error rate was 90%. Compared to the MNIST model, higher precision is required: a mantissa width of 5 is sufficient to achieve comparable accuracy to single-precision for exponent widths of 6 and 7.

Figure 6.9: NIN CIFAR-10 error rate

To further explore the impact of using CPFP, two larger CIFAR-10 models were also studied. The two models

are the Network In a Network (NIN) model [57], which requires 0.45 GFLOPs per forward execution, and the All Convolutional Network [63].

The NIN model contains 9 convolutions, two of which are 5×5 convolutions, one is a 3×3 convolution,

and the rest are 1×1 convolutions. Rather than using fully connected layers at the end of the network, NIN-style networks use an average pooling layer. When training this model, all forward and backward passes are computed

using CPFP with the exception of the last pooling layer, softmax, and dropout. The exponent width is set to

6 and mantissa width set to 5 as this was the smallest CPFP configuration that achieved reasonable accuracy

for the smaller CIFAR-10 model. Initially when using the default learning strategy used in [57], the model

began to diverge very early on. When switching from the standard SGD solver to the Adam optimizer [61], the

model’s accuracy matched that of FP32 throughout the whole training process (with FP32 also using the Adam

optimizer) as shown in Figure 6.9. The Adam optimizer uses a per parameter learning rate that is adjusted as

training progresses, which allows for more effective training with sparse updates [61].

The All Convolutional Net model contains 9 convolutions: seven 3×3 convolutions and two 1×1 convolutions. Rather than having a fully connected layer at the end, the last layer is simply a 1×1 convolution. Furthermore,

unlike most other networks, this network attempts to remove all non-convolution layers aside from activa-

tions, softmax, and dropout. When training this model, all forward and backward passes are computed using

CPFP with the exception of softmax and dropout. Similar to the NIN model, the exponent width is set to 6 and

mantissa width is set to 5. Unlike the NIN model, the All Convolutional Net model was able to converge using the exact same learning rate schedule and solver as outlined in the initial All Convolutional Net paper [63]. This can potentially be attributed to a lower likelihood of small gradients, as fewer of the gradients will be zero compared to networks with pooling layers, which increase the sparsity of the gradients at their inputs.

Figure 6.10: All Convolutional Network CIFAR-10 error rate

6.4 Inference Accuracy

The FCTE can be used to evaluate existing pre-trained models to determine the accuracy penalty in using the

CPFP cores rather than single-precision floating-point. To conduct these tests three models trained using the

ImageNet dataset [4] are considered. The models considered for these tests are the Network In a Network

(NIN) ImageNet model [57], the VGG16 model [6], and AlexNet [3]. In the case of the NIN ImageNet model the

CPFP cores are used for all layers except the last average pooling layer as this layer is not currently supported

as an FPGA layer. For the VGG16 model, all layers use the CPFP cores. For AlexNet, all layers use the CPFP

cores except for the two LRN layers as these are currently not supported. For layers that are not supported,

data is transferred back from the FPGA to the host CPU such that the computations can be run using the host

CPU.

The accuracy degradation associated with using the CPFP cores is shown in Figures 6.11, 6.12, and 6.13.

The NIN model has the least degradation in accuracy, with mantissa width of 5 allowing for accuracy within

1% of the single-precision baseline results. On the other hand, AlexNet and VGG16 require a mantissa width of

7 to achieve accuracy within 1% of single-precision floating-point. This can be attributed primarily to the dif-

ference in structure of VGG16 and AlexNet compared to the NIN model. VGG16 and AlexNet both have three

fully connected layers as the last layers within the two models.

Figure 6.11: Network In a Network inference validation accuracy for different CPFP configurations: (a) Top 1 and (b) Top 5 validation accuracy for different mantissa and exponent widths

Figure 6.12: VGG16 validation accuracy for different CPFP configurations: (a) Top 1 and (b) Top 5 validation accuracy for different mantissa and exponent widths

Fully connected layers require a large number of accumulations to reduce the input volume to the corresponding size. Considering one of the AlexNet fully

connected layers, it takes as input a blob that contains 9,216 elements, and produces an output that is 4,096

elements. This requires 9,216 multiplications and 9,215 additions per output. On the other hand, the NIN

model frequently employs 1x1 convolutions, with the largest having 1,024 input channels with 1,024 output

channels, which requires only 1,023 accumulations per output. This allows significantly fewer accumulation errors to propagate through the network compared to AlexNet or VGG16. To address this issue, there are two potential pathways: either adding additional bits for the accumulations and truncating the end result, or re-

training the network to fine-tune for the precision that is used rather than directly converting single-precision

weights to CPFP.

Figure 6.13: AlexNet validation accuracy for different CPFP configurations: (a) Top 1 and (b) Top 5 validation accuracy for different mantissa and exponent widths

6.5 Training Throughput

To evaluate the throughput of FCTE, FP12 (exponent width of 6, mantissa width of 5) is considered, as this configuration was sufficient for MNIST and CIFAR-10 training as well as inference with the NIN model. FCTE is then scaled until no additional PEs can fit

in the partial reconfiguration region of the device without significant frequency penalties. This corresponds to

the utilization discussed in Section 6.2, with 16 PE groups, for a total of 64 PEs. Table 6.3 shows the throughput

comparison between the CPU, GPU, and FPGA based implementations for MNIST, CIFAR-10 (the simpler

model discussed in Section 6.3), AlexNet [3], Overfeat [32], and VGG-A [6]. The last three models are modified

versions taken from the Soumith CNN benchmark suite. In each case, a batch size of 256 images is used, except

for VGG-A running on the GPU, as it could not fit in the GPU memory, resulting in a reduction to a batch size

of 64 images.

Table 6.3: Training forward-backward throughput (images/second)

-                                                    MNIST     CIFAR-10   AlexNet   VGG-A   Overfeat
FPGA (Exponent = 6, Mantissa = 5) Forward-Backward   1,172.7   687.4      26.6      3.6     10
CPU Single-Precision Forward-Backward                1,209     411        33.7      3.8     12
GPU Single-Precision Forward-Backward                16,753    10,640     735.5     69      210

From the results in Table 6.3, the FPGA-based training architecture has comparable performance to that of the CPU. The speedup of the FPGA over the CPU ranges from 0.79 to 1.67, though all of the larger models run faster on the CPU, with speedups ranging from 1.05 to 1.27 in the CPU's favour. While the throughput of the FPGA is comparable to that of the CPU, both offer significantly lower performance than the GPU

implementation. This disparity is partially due to the difference between the architectures of the FPGA com-

pared to the GPU, with the GPU having a large number of floating-point units which consequently increases

the GPU’s power envelope. The GPU’s max power consumption is 300W [59], though the measurements were

done using only one out of the two GPUs on board so its usage is most likely closer to 150W. In comparison,


the max power supplied by the PCIe slot used by the FPGA platform is 25W [58], meaning that the peak power consumption of the FPGA is six times lower than that of the GPU. Therefore, if performance per

watt is considered using these power estimates (though they are probably pessimistic for both the GPU and

the FPGA), the disparity is closer to 3.5 times better performance per watt for the GPU compared to the FPGA

implementation.

6.6 Inference Throughput

Several tests were conducted to evaluate inference throughput of WCE, FCTE, and FCTE forward-only with an

exponent width of 6 and mantissa width of 5 as in Section 6.5. The first set of tests involved executing many

different individual convolution layer configurations to determine the peak throughput of each engine with-

out considering external transfer times. Figure 6.14 shows the throughput of each engine in GFLOPs across

several different convolution input and output sizes with varying kernel sizes, where a FLOP is considered

to be one multiplication or addition operation (a multiply-accumulate is considered two FLOPs). For each

test, a batch size of 256 was used, and each test follows the convention conv{filter size}x{filter size}_{input

channels}x{input height}x{input width}_{output channels}x{output height}x{output width}. Notably, FCTE

forward-only is faster than WCE for all cases except for the case where the output size is 224×224

(conv3x3_64x224x224_64x224x224). This discrepancy with respect to the expectations of the Winograd convolution algorithm discussed in Section 2.2.3, despite FCTE's reduced clock frequency compared to WCE, is due primarily to the differences between the input stages of the two engines. As discussed in Section 4.2, the FCTE

input stage does not add any padding to the input and maintains counters to keep track of how many non-

padded computations are required. On the other hand, the WCE is only capable of doing this along the height

of the input as the width uses the Winograd convolution algorithm which does not allow for the zero padding

to be bypassed. This results in significant gains for FCTE and FCTE forward-only for small sized convolutions

with padding as many of the computations are by-passed. This is particularly true for 5x5 layers as the padding

increases the width and the height by 4 pixels each, resulting in a 38% increase in multiply-accumulates for an

8×8 output convolution. These gains are shown in Figure 6.14, where the conv5x5 layers for FCTE have sig-

nificantly higher throughput than WCE, but also higher throughput than the conv3x3 FCTE layers and the

conv1x1 FCTE layers. The wide swings in throughput in Figure 6.14 can be attributed to two primary factors

aside from the padding: 1. with larger filters the FCTE PEs are kept busy longer (compute to data transfer

ratio increases) and 2. larger inputs mean more data must be reread from DDR in the case where it cannot

be cached (compute to data transfer ratio decreases). On the other hand WCE is not as susceptible to these

swings in data size, but is highly susceptible to the filter size, as for filters that are not multiples of 3 the WCE PEs will not be fully utilized.

Figure 6.14: Convolution engine throughput comparisons
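The 38% figure quoted above follows directly from the layer dimensions (a 5×5 kernel with a padding of 2 producing an 8×8 output, counted per input/output channel pair):

\[
\underbrace{8 \cdot 8 \cdot 5 \cdot 5}_{\text{all positions, padding included}} = 1600
\qquad\text{versus}\qquad
\underbrace{(3+4+5+5+5+5+4+3)^2}_{\text{padded positions skipped}} = 34^2 = 1156,
\qquad
\frac{1600}{1156} \approx 1.38 .
\]

The row sum 3+4+5+5+5+5+4+3 counts the kernel rows that fall inside the unpadded input for each output row, and the same count applies to the columns.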

Furthermore, the WCE is at a clear disadvantage compared to the FCTE and FCTE forward-only engines

when doing non 3×3 convolutions, because the WCE is under-utilized for either 1×1 or 5×5 convolutions

as discussed in Section 4.3. The FCTE forward-only engine is 2 to 3.4 times faster than the WCE for 1× 1

convolutions and 1.4 to 1.68 times faster than the WCE for 5×5 convolutions. The highest forward throughput

is achieved by the FCTE forward-only engine for the conv5x5_128x8x8_256x8x8 test, achieving 396 GFLOPs

with zero-skipping, though this is not necessarily representative of a real workload.

To evaluate overall inference throughput, tests from the Soumith CNN benchmarks [64] are used as this

provides end-to-end throughput results. This benchmark suite consists of previous ImageNet [4] winners:

AlexNet [3], Overfeat [32], VGG-A [6], and GoogleNet V1 [31]. To maintain fairness in the tests, the GoogleNet

V1 model is excluded from the tests as there are many layers that are not supported by the FPGA implemen-

tation in the GoogleNet V1 model (though these could be run using a CPU-fallback). Furthermore, given that

the WCE cannot support fully connected layers or convolution layers with stride greater than one, it is not

considered in these tests, though the performance scaling can be roughly inferred from Figure 6.14. Table 6.4

shows the results for each network for the FCTE and FCTE forward-only implementations compared with CPU

and GPU single-precision implementations. From these results it is clear that both FPGA implementations

are fairly comparable with the CPU implementation as in the case for the training throughput measurements,

with FCTE forward-only achieving higher throughput (average of 1.23-fold faster) than the CPU implementa-

tion, while FCTE performs roughly the same as the CPU (average of 1.02-fold faster). As in the training case,


the GPU is significantly faster than the FPGA implementations, achieving throughput that is on average 18.8

times faster than the FCTE forward-only implementation. The difference in throughput between FCTE and

FCTE forward-only in this set of benchmarks and the benchmarks in Figure 6.14 is expected, as the FCTE

forward-only implementation is able to achieve a higher operating frequency due to removal of some of the

adder tree in Figure 4.3 and the associated by-pass multiplexing.

Table 6.4: Inference throughput (images/second)

-                                       AlexNet   Overfeat   VGG-A
FCTE forward-only (EXP=6, MANTISSA=5)   87.95     36         13.9
FCTE (EXP=6, MANTISSA=5)                74.2      31.5       10.75
CPU Single-Precision Forward            78.4      30.3       9.97
GPU Single-Precision Forward            2,125.4   624.8      207.1


Chapter 7

Future Work

The field of machine learning is advancing very rapidly, making it very difficult to cover all of the possible

algorithms and techniques within the scope of a single thesis. In this chapter, several additional topics are dis-

cussed which could be potentially explored as future work, building off of the ideas that have been presented

in the previous chapters.

7.1 Solvers

As discussed in Chapter 2, the weight update process requires a solver. The most standard solver is Stochas-

tic Gradient Descent (SGD), though many other solvers also exist. In each case the solvers that have been

developed do not take into consideration the data types of the weight updates which could lead to conver-

gence issues as shown in Section 6.3. The equation for the weight updates used for SGD with momentum is

shown in Equation 7.1, while the ADAM solver [61] weight update equation is shown in Equation 7.2. In each

equation, Wt corresponds to the current weights, Wt+1 corresponds to the updated weights, α corresponds

to the learning rate, L is the loss function, and β1, β2, and ε are constants. In the case of SGD, applying the updates can be sensitive to low precision. This is the case because the learning rate is fixed, resulting in

potentially stagnant weights if the learning rate is sufficiently small to cause the updates to fall into the region

of precision that is truncated after updating. The ADAM weight updates on the other hand have essentially a

per-weight dynamic learning rate [61] due to the estimated moment (mt ) and second moment (vt ), which will

adjust the learning rate accordingly to reflect whether the previous updates had an impact on moving toward

a local minimum.


\begin{aligned}
V_{t+1} &= \mu \cdot V_t - \alpha \cdot \nabla L(W_t) \\
W_{t+1} &= W_t + V_{t+1}
\end{aligned}
\tag{7.1}

\begin{aligned}
(m_t)_i &= \beta_1 (m_{t-1})_i + (1 - \beta_1)\,(\nabla L(W_t))_i \\
(v_t)_i &= \beta_2 (v_{t-1})_i + (1 - \beta_2)\,(\nabla L(W_t))_i^2 \\
(W_{t+1})_i &= (W_t)_i - \alpha \cdot \frac{\sqrt{1 - (\beta_2)^t}}{1 - (\beta_1)^t} \cdot \frac{(m_t)_i}{\sqrt{(v_t)_i} + \epsilon}
\end{aligned}
\tag{7.2}
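As an illustration of Equation 7.2, a minimal per-element implementation of the update (plain C++, single precision throughout, ignoring framework integration) could look as follows:

```cpp
// Per-element ADAM update following Equation 7.2; m and v hold the running
// first and second moment estimates and t is the current iteration (t >= 1).
#include <cmath>
#include <cstddef>
#include <vector>

void adam_update(std::vector<float>& w, const std::vector<float>& grad,
                 std::vector<float>& m, std::vector<float>& v, int t,
                 float alpha, float beta1, float beta2, float eps) {
  // Bias-correction factor sqrt(1 - beta2^t) / (1 - beta1^t) from Equation 7.2.
  const float correction =
      std::sqrt(1.0f - std::pow(beta2, static_cast<float>(t))) /
      (1.0f - std::pow(beta1, static_cast<float>(t)));
  for (std::size_t i = 0; i < w.size(); ++i) {
    m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
    v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
    // Dividing by sqrt(v) + eps acts as a per-weight dynamic learning rate.
    w[i] -= alpha * correction * m[i] / (std::sqrt(v[i]) + eps);
  }
}
```

In the low-precision setting described in Section 6.3, only the stored weights would be converted back to CPFP after this update; the moment estimates can remain in single precision on the host.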

As discussed in Section 6.3, when considering the NIN model, the SGD updates led quickly to divergence

while the ADAM updates were able to essentially match the single-precision model at all points while train-

ing. Unfortunately, training with ADAM results in an overall lower final accuracy (approximately 85%) compared to training with single-precision and SGD (89.6%). Therefore, while the ADAM updates allowed the model to converge, it reached a sub-optimal solution. This shows that there is room for further

improvements to the standard solvers used by CNN frameworks when considering training with low-precision

floating-point representations. One potential solution to this would be to add a per-weight dynamic learning

rate to be applied in tandem with the SGD updates to ensure that the weight updates can still have an impact if

the learning rate is set too low while still allowing for the SGD updates to dominate the overall weight update.

Another alternative may be to dynamically switch between solvers depending on the progress that has been

made toward minimizing the loss, though both cases would require maintaining the prior history of the ADAM

optimizer even when performing SGD. These methods could potentially allow for a higher overall quality of

result while still maintaining convergence.

7.2 Mixed Data Representations

The focus of this work has primarily been on using a uniform custom-precision floating-point (CPFP) repre-

sentation throughout the CNN models explored for both inference and training. This approach offers some

advantages by allowing for a simpler design, but it has been shown in many prior works that different fixed-

point representations can be used for different layers throughout a CNN without any loss in accuracy [65],

which allows for reduced data traffic across layers. A potential extension to this work may be to use a mix-

ture of floating-point and fixed-point representations for training CNNs. This would allow for potentially

higher compute density with lower data traffic between layers and for the weight updates. Enabling this would require a heterogeneous architecture, with the data path being capable of supporting a configurable

level of precision. The most recent NVIDIA GPUs have some of this heterogeneity built into their architec-


ture, with support for operations in int8, half-precision floating-point, or single-precision floating-point [22].

This architecture could allow them to exploit int8-based operations for inference of some layers, half-precision for other layers, or even single-precision if necessary. While this offers some flexibility, an FPGA-based implementation can go one step further by allowing for any level of customization of its data paths, with some portions used for int6, 12-bit CPFP, 14-bit CPFP, half-precision, single-precision, etc. To

enable this, future work would be required to extend the CPFP library to allow for multiple types of CPFP num-

bers to be supported. An ideal interface would be one where the data type is defined in a similar manner to the

fixed-point types in Vivado HLS [29], as cpfp<EXP, MANTISSA>, with operations that can be applied between

different cpfp precisions.
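To make the idea concrete, such an interface might be declared roughly as follows; this is only a sketch of a possible API (the storage layout, promotion rules, and member names are assumptions, and the conversion and arithmetic bodies are omitted), not a description of the existing library.

```cpp
// Sketch of a possible templated CPFP interface in the spirit of Vivado HLS
// ap_fixed types; names, storage, and semantics here are illustrative only.
#include <cstdint>

template <int EXP, int MANT>
class cpfp {
 public:
  cpfp() : bits_(0) {}
  explicit cpfp(float value);        // conversion from single precision (body omitted)
  explicit operator float() const;   // conversion back to single precision (body omitted)

  // Same-precision arithmetic; a full library would also define mixed-precision
  // overloads that promote to the wider of the two formats.
  cpfp operator+(const cpfp& rhs) const;
  cpfp operator*(const cpfp& rhs) const;

 private:
  // 1 sign bit + EXP exponent bits + MANT mantissa bits packed into an integer.
  uint32_t bits_ : (1 + EXP + MANT);
};

// Example aliases: the 12-bit format used throughout this work and a half-precision-like format.
using cpfp12 = cpfp<6, 5>;
using cpfp16 = cpfp<5, 10>;
```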

7.3 Additional Layers and Models

The number of layer types that can be used and that are supported by Caffe or other CNN frameworks is very high, which makes deciding which layers should be supported a difficult task. In this work, the layer support was constrained to Convolution, ReLU, Max Pooling, and Fully Connected layers, with all other layers being supported through falling back to the host CPU. This allows for simple development, though in practice, if a large number of layers need to be run on the CPU, the communication overhead will be too large for this approach to be practical. Therefore, additional layers should be considered in future works to allow

for both improved throughput, and perhaps also improved accuracy at lower bit-widths. Dropout [66] and

Batch Normalization [67] layers in particular would be well suited for hardened support in addition to the

supported layers as these have been proven to be essential in many modern CNN models to allow for better

generalization [3, 6, 31, 32, 67]. In particular, Batch Normalization [67] normalizes the inputs to a given layer such that they have zero mean and unit variance, which may allow for further reductions in the exponent bit-

widths or even the mantissa bit-widths when considering custom-precision floating-point representations.

Furthermore, this work focused primarily on testing low-precision models targeting relatively small datasets.

With support for more layers such as Dropout and Batch Normalization, it would be more feasible to train on

some of the larger datasets (such as ImageNet) using the CPFP cores, though more work is also required to

improve the throughput of FCTE. Currently, training AlexNet would require approximately 45 days, which makes it relatively difficult to evaluate the precision requirements of the model, and also to evaluate other models.


7.4 Multi-FPGA Implementations

While this work deals entirely with single-FPGA based training of CNNs, a natural extension would be to con-

sider using multiple FPGAs to scale the inference/training throughput. Many frameworks have the capability

of training CNNs using multiple GPUs, either in a single node [16, 17, 20], or across multiple nodes [17, 20].

Given that FPGAs have host connections through PCIe similar to those of many GPU cards, along with the

capability of using other high-speed interconnects such as 100GbE [68], FPGA-based implementations are

well poised for taking advantage of similar distributed training techniques as those employed by [16, 17, 20].

Though the interfaces are available, there is a significant amount of software and hardware infrastructure that

is required to enable this. NVIDIA has a software library called NVIDIA Collective Communications Library

(NCCL) that provides single and multi-node GPU support for: AllReduce, Broadcast, Reduce, AllGather, and

ReduceScatter [69]. NCCL allows for relatively simple integration of multi-GPU CNN implementations, with-

out concern for how to actually conduct the scaling across devices as it has been optimized specifically for

GPUs. The NCCL library is in some respects very similar to MPI, though it differs in the sense that it is targeted

specifically toward NVIDIA GPUs (and that it is proprietary). If a similar FPGA library were to be developed, it

would allow for relatively easy porting of this work to a multi-FPGA implementation, by taking advantage of

the existing GPU infrastructure.
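For context, the kind of collective that would be needed is the standard data-parallel gradient allreduce; the sketch below expresses it with plain MPI purely for illustration, since an FPGA-oriented library would expose an equivalent operation over PCIe or 100GbE rather than this exact API.

```cpp
// Sketch of the data-parallel gradient exchange such a collective library
// would provide, written with plain MPI for illustration only.
#include <mpi.h>
#include <vector>

// Average local weight gradients across all workers after each minibatch.
void allreduce_gradients(std::vector<float>& gradients) {
  int world_size = 1;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);
  // Sum the gradients element-wise across workers, in place.
  MPI_Allreduce(MPI_IN_PLACE, gradients.data(),
                static_cast<int>(gradients.size()),
                MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
  for (float& g : gradients) {
    g /= static_cast<float>(world_size);  // convert the sum into an average
  }
}
```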

7.5 FCTE Improvements

As highlighted in Sections 6.5 and 6.6, FCTE has comparable throughput to CPU implementations but still

under-performs significantly when compared with GPU implementations.

A large source of this performance degradation can be attributed to the lack of on-chip caching between

layers, which requires DDR transfers between each layer execution in both forward and backward passes. To

allow for on-chip caching between layers, some fundamental changes are required in FCTE, namely the output

banks would need to be removed, with the output fed directly into the input banks of FCTE. Furthermore, it

would be ideal to take advantage of more recent devices with larger on-chip memories such as the Xilinx

UltraScale+ devices with UltraRAM [70]. Each UltraRAM block contains 288 Kb of storage, which

could allow for a 3×224×224 image to be stored using only 7 UltraRAM blocks when using FP12. This amounts

to less than 1% of the available UltraRAM blocks of a VU9P [71], meaning that several batches could be cached

with limited storage overhead. Furthermore, if the weight updates were applied on the FPGA directly, this

would further improve throughput as there would be no need to transfer the weights to and from the host.
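The 7-block figure follows directly from the stated sizes (assuming 288 Kb = 288 × 1024 bits per UltraRAM block and 12 bits per FP12 value):

\[
3 \times 224 \times 224 \times 12\ \text{bits} = 1{,}806{,}336\ \text{bits},
\qquad
\left\lceil \frac{1{,}806{,}336}{288 \times 1024} \right\rceil = \lceil 6.125 \rceil = 7\ \text{UltraRAM blocks}.
\]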

Aside from improvements in communication overhead, there is also substantial room for improvement in


the architecture to allow for scaling the design. As shown in Section 4.2, the input is required to be broadcast

to every PE group, which would severely impact the operating frequency as the number of PE groups grows.

To mitigate this, a systolic array could be used to feed the inputs into the PEs such that connections are across

PEs, which would reduce the length of the required connections as well as the congestion at the input banks

as the number of PE groups grows. This would allow the number of PE groups to be scaled further depending

on the available device resources.


Chapter 8

Conclusion

The primary goal of this thesis was to provide framework support for both inference and training of CNNs with

FPGAs while using reduced precision. Through this work, it has been shown that a custom-precision floating-

point (CPFP) representation with a bit-width of 12 can be used both for inference with pre-trained state-of-the-art image recognition models and for training CNN models on the CIFAR-10 and MNIST datasets, through the use of an FPGA CNN Training Engine implemented entirely using Vivado HLS and Xilinx SDAccel. The CPFP data type is able to match single-precision floating-point in accuracy, while significantly reducing the memory and logic footprints of the FPGA design. As discussed in Chapter 7, there are many more avenues for research related to training CNNs using FPGAs and CPFP representations; this work provides the means for researchers to build upon it and easily reproduce its results by providing an open-source framework, FPGA Caffe, for accomplishing all of the goals in this thesis.


Bibliography

[1] Mark L. Chang. Chapter 1 - device architecture. In Scott Hauck and André Dehon, editors, Reconfig-

urable Computing, Systems on Silicon, pages 3 – 27. Morgan Kaufmann, Burlington, 2008.

[2] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural

Networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Infor-

mation Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[4] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej

Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale

Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[5] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. CoRR,

abs/1311.2901, 2013.

[6] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition.

CoRR, abs/1409.1556, 2014.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.

CoRR, abs/1512.03385, 2015.

[8] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui

Wang, Xiaogang Wang, and Wanli Ouyang. T-CNN: tubelets with convolutional neural networks for object

detection from videos. CoRR, abs/1604.02532, 2016.

[9] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks.

CoRR, abs/1707.01629, 2017.


[10] Andrew Putnam, Adrian Caulfield, Eric Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi

Esmaeilzadeh, Jeremy Fowers, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati,

Joo-Young Kim, Sitaram Lanka, Eric Peterson, Aaron Smith, Jason Thong, Phillip Yi Xiao, Doug Burger,

Jim Larus, Gopi Prashanth Gopal, and Simon Pope. A reconfigurable fabric for accelerating large-scale

datacenter services. In Proceedings of the 41st Annual International Symposium on Computer Architec-

ture (ISCA), pages 13–24. IEEE Press, June 2014.

[11] Adrian Caulfield, Eric Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen

Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael

Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. A cloud-scale acceleration

architecture. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture.

IEEE Computer Society, October 2016.

[12] Microsoft unveils Project Brainwave for real-time AI. [Online]. Available: https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/. Accessed: 2017-10-31.

[13] Amazon EC2 F1 Instance. [Online]. Available: https://aws.amazon.com/ec2/instance-types/f1/.

Accessed: 2017-10-31.

[14] Intel. Intel FPGA SDK for OpenCL, May 2017.

[15] Xilinx Inc. SDAccel Development Environment User Guide, June 2017.

[16] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio

Guadarrama, and Trevor Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv

preprint arXiv:1408.5093, 2014.

[17] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado,

Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irv-

ing, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan

Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit

Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas,

Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensor-

Flow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensor-

flow.org.

[18] Ronan Collobert, Samy Bengio, and Johnny Marithoz. Torch: A modular machine learning software li-

brary, 2002.


[19] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan

Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous

distributed systems. CoRR, abs/1512.01274, 2015.

[20] Frank Seide and Amit Agarwal. Cntk: Microsoft’s open-source deep-learning toolkit. In Proceedings of the

22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages

2135–2135, New York, NY, USA, 2016. ACM.

[21] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and

Evan Shelhamer. cuDNN: Efficient Primitives for Deep Learning. CoRR, abs/1410.0759, 2014.

[22] NVIDIA. NVIDIA Tesla V100 GPU Architecture. Technical report, 06 2017.

[23] Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah

Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark,

Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra

Gottipati, William Gulland, Robert Hagmann, Richard C. Ho, Doug Hogberg, John Hu, Robert Hundt,

Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch,

Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle

Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan,

Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy

Phelps, Jonathan Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham,

Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick

Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-datacenter

performance analysis of a tensor processing unit. CoRR, abs/1704.04760, 2017.

[24] Utku Aydonat, Shane O’Connell, Davor Capalija, Andrew C. Ling, and Gordon R. Chiu. An

OpenCL™ Deep Learning Accelerator on Arria 10. In Proceedings of the 2017 ACM/SIGDA International

Symposium on Field-Programmable Gate Arrays, FPGA ’17, pages 55–64, New York, NY, USA, 2017. ACM.

[25] Xilinx Inc. UltraScale Architecture Configurable Logic Block, November 2017.

[26] Intel. Intel® Stratix® 10 Logic Array Blocks and Adaptive Logic Modules User Guide, 2017.

[27] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S. Brown, F. Ferrandi, J. An-

derson, and K. Bertels. A survey and evaluation of FPGA high-level synthesis tools. IEEE Transactions on

Computer-Aided Design of Integrated Circuits and Systems, 35(10):1591–1604, Oct 2016.


[28] Intel. Product Brief: Intel® HLS Compiler.

[29] Xilinx Inc. Vivado Design Suite User Guide, High-Level Synthesis, April 2017.

[30] Andrew Canis, Jongsok Choi, Mark Aldham, Victor Zhang, Ahmed Kammoona, Jason H. Anderson,

Stephen Brown, and Tomasz Czajkowski. LegUp: High-level synthesis for FPGA-based processor/accel-

erator systems. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable

Gate Arrays, FPGA ’11, pages 33–36, New York, NY, USA, 2011. ACM.

[31] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Du-

mitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper with Convolutions. CoRR,

abs/1409.4842, 2014.

[32] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat:

Integrated Recognition, Localization and Detection using Convolutional Networks. CoRR, abs/1312.6229,

2013.

[33] Z. Xianyi, W. Qian, and Z. Yunquan. Model-driven Level 3 BLAS Performance Optimization on Loongson

3A Processor. In Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on,

pages 684–691, Dec 2012.

[34] NVIDIA. Dense linear algebra on gpus. [Online]. Available: https://developer.nvidia.com/cublas.

[35] S. Winograd. Arithmetic Complexity of Computations. CBMS-NSF Regional Conference Series in Applied

Mathematics. Society for Industrial and Applied Mathematics, 1980.

[36] Andrew Lavin. Fast Algorithms for Convolutional Neural Networks. CoRR, abs/1509.09308, 2015.

[37] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi. Caffeinated FPGAs: FPGA framework

For Convolutional Neural Networks. In 2016 International Conference on Field-Programmable Technology

(FPT), pages 265–268, Dec 2016.

[38] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with cuda.

Queue, 6(2):40–53, March 2008.

[39] Google. Protocol buffers-google’s data interchange format. [Online]. Available: https://github.com/

google/protobuf. Accessed: 2017-10-31.

[40] IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2008, pages 1–70, Aug 2008.

[41] Xilinx Inc. LogiCORE IP Floating-Point Operator v7.0 Product Guide, April 2014.


[42] Intel. Floating-Point IP Cores User Guide, December 2016.

[43] Florent de Dinechin and Bogdan Pasca. Designing custom arithmetic data paths with FloPoCo. IEEE

Design & Test of Computers, 28(4):18–27, July 2011.

[44] Wenlai Zhao, Haohuan Fu, W. Luk, Teng Yu, Shaojun Wang, Bo Feng, Yuchun Ma, and Guangwen Yang.

F-cnn: An fpga-based framework for training convolutional neural networks. In 2016 IEEE 27th Inter-

national Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 107–114,

July 2016.

[45] X. Han, D. Zhou, S. Wang, and S. Kimura. Cnn-merp: An fpga-based memory-efficient reconfigurable

processor for forward and backward propagation of convolutional neural networks. In 2016 IEEE 34th

International Conference on Computer Design (ICCD), pages 320–327, Oct 2016.

[46] C. Zhang, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pages 1–8, Nov 2016.

[47] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’16, pages 16–25, New York, NY, USA, 2016. ACM.

[48] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’16, pages 26–35, New York, NY, USA, 2016. ACM.

[49] J. Zhang and J. Li. Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, pages 25–34, New York, NY, USA, 2017. ACM.

[50] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, pages 45–54, New York, NY, USA, 2017. ACM.

[51] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 1737–1746. JMLR.org, 2015.

[52] C. De Sa, M. Feldman, C. Ré, and K. Olukotun. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 561–574, New York, NY, USA, 2017. ACM.

[53] M. Courbariaux, Y. Bengio, and J.-P. David. Low precision arithmetic for deep learning. CoRR, abs/1412.7024, 2014.

[54] M. D. Ercegovac and T. Lang. Chapter 8 - Floating-Point Representation, Algorithms, and Implementations. In M. D. Ercegovac and T. Lang, editors, Digital Arithmetic, The Morgan Kaufmann Series in Computer Architecture and Design, pages 396–487. Morgan Kaufmann, San Francisco, 2004.

[55] Xilinx Inc. UltraScale Architecture DSP Slice User Guide, June 2017.

[56] Xilinx Inc. Deep Learning With INT8 Optimizations on Xilinx Devices, April 2017.

[57] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.

[58] Alpha Data. ADM-PCIE-8K5 User Manual, September 2017.

[59] NVIDIA. NVIDIA Tesla M60 GPU Accelerator, August 2016.

[60] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.

[61] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[62] A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, April 2009.

[63] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.

[64] S. Chintala. Convnet-benchmarks. [Online]. Available: https://github.com/soumith/convnet-benchmarks.

[65] Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt, Natalie Enright Jerger, and Andreas Moshovos. Proteus: Exploiting numerical precision variability in deep neural networks. In Proceedings of the 2016 International Conference on Supercomputing, ICS ’16, pages 23:1–23:12, New York, NY, USA, 2016. ACM.

[66] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[67] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

[68] Alpha Data. ADM-PCIE-9V3 User Manual, June 2017.

[69] NVIDIA. NVIDIA Collective Communication Library (NCCL), December 2017.

[70] Xilinx Inc. UltraScale Architecture Memory Resources, February 2018.

[71] Xilinx Inc. UltraScale Architecture and Product Data Sheet: Overview, March 2018.