Accelerator Architectures for Deep Learning and Graph Processing
by
Linghao Song
Department of Electrical and Computer Engineering
Duke University
Date:
Approved:
Yiran Chen, Advisor
Hai Li, Co-Advisor
Kenneth Brown
Benjamin Lee
Jun Yang
Dissertation submitted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy
in the Department of Electrical and Computer Engineering
in the Graduate School of
Duke University
2020
ABSTRACT
Accelerator Architectures for Deep Learning and Graph Processing
by
Linghao Song
Department of Electrical and Computer Engineering
Duke University
Date:
Approved:
Yiran Chen, Advisor
Hai Li, Co-Advisor
Kenneth Brown
Benjamin Lee
Jun Yang
An abstract of a dissertation submitted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy
in the Department of Electrical and Computer Engineering
in the Graduate School of
Duke University
2020
Abstract
Deep learning and graph processing are two big-data applications that are widely
applied in many domains. Training, which is essential for deep learning inference, has not
yet been fully studied. With data forward, error backward and gradient calculation phases, deep
learning training is a more complicated process with higher computation and communication
intensity. Distributing computations over multiple heterogeneous accelerators to achieve high
throughput and balanced execution, however, remains challenging. In this dissertation, I
present AccPar, a principled and systematic method of determining the tensor partition for
multiple heterogeneous accelerators for efficient training acceleration. Emerging resistive
random access memory (ReRAM) is promising for processing in memory (PIM). For
high-throughput training acceleration in ReRAM-based PIM accelerators, I present PipeLayer,
an architecture for layer-wise pipelined parallelism. Graph processing is well-known for
poor locality and high memory bandwidth demand. In conventional architectures, graph
processing incurs a significant amount of data movement and energy consumption. I present
GraphR, the first ReRAM-based graph processing accelerator which follows the principle
of near-data processing and explores the opportunity of performing massive parallel analog
operations with low hardware and energy cost. Sparse matrix-vector multiplication (SpMV),
a subset of graph processing, is the key computation in iterative solvers for scientific
computing. Efficiently accelerating floating-point processing in ReRAM, however, remains a
challenge. In this dissertation, I present ReFloat, a data format and a supporting accelerator
architecture, for low-cost floating-point processing in ReRAM for scientific computing.
Acknowledgements
It is never easy to get a Ph.D. degree, but with the contributions and help of many people I
finally did it. Here, I thank them.
Professor Yiran Chen, my advisor, and Professor Hai Li, my co-advisor, have provided
invaluable advice and help for my research, professional development and life throughout my
Ph.D. program. In the first three years of the program, I struggled because none of my work
was accepted by a conference. It was a really hard time and I even planned to quit and
find an SDE job. But my advisors never gave up on me. They always encouraged me and
believed in me. They gave me every opportunity to talk to or have lunch with scholars
who visited our lab or collaborated with us, let me present at senior students' presentations,
and made every effort to let me feel that I was not alone or forgotten. Professor Chen also
gave me many suggestions on skills for social communication, career and team management.
I would like to thank my committee members – Professor Kenneth Brown, Professor
Benjamin Lee and Professor Jun Yang – and Professor Daniel Sorin on my qualifying exam
committee. The time and effort they gave out of their busy schedules helped me improve in the
preliminary and final exams.
I would like to thank Professor Xuehai Qian (USC) for his insightful and valuable
feedback on my research.
During my two summer research internships, my mentors Catherine Schuman (ORNL),
Steven Young (ORNL) and Ang Li (PNNL) broadened my knowledge of computing systems.
I thank my co-authors for their efforts – Paul Bogdan, Nagadastagiri Challapalle, Fan
Chen, Xiang Chen, Enes Eken, Yu Hua, Houxiang Ji, Bing Li, Xin Liu, Jiachen Mao,
Vijaykrishnan Narayanan, Kent Nixon, Gabriel Perdue, Thomas Potok, Ximing Qiao,
Sahithi Rampalli, Tianqi Tang, Chengning Wang, Yandan Wang, Yitu Wang, Wei Wen,
Wujie Wen, Chunpeng Wu, Yuan Xie, Cong Xu, Huanrui Yang, Jianlei Yang, Youwei Zhuo.
Contents
Abstract iv
Acknowledgements v
List of Figures ix
List of Tables xiii
1 Introduction 1
1.1 Tensor Partitioning for Deep Learning Accelerators . . . . . . . . . . . . 5
1.2 Layer-wise Pipelined Parallelism for Deep Learning Accelerators . . . . . 7
1.3 Accelerator Architecture for Graph Processing . . . . . . . . . . . . . . . 9
1.4 Low-Cost Floating-Point Processing in ReRAM . . . . . . . . . . . . . . 12
2 Tensor Partitioning for Deep Learning Accelerators 14
2.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Tensor Partitioning Space . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Partitioning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Layer-wise Pipelined Parallelism for Deep Learning Accelerators 41
3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 PIPELAYER Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Implementation of PIPELAYER . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4 Accelerator Architecture for Graph Processing 70
4.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 GraphR Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Mapping Algorithms in GE . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5 Low-Cost Floating-Point Processing in ReRAM 98
5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.2 Motivation and REFLOAT Ideas . . . . . . . . . . . . . . . . . . . . . . 103
5.3 REFLOAT Data Format . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.4 Accelerator Architecture for REFLOAT . . . . . . . . . . . . . . . . . . . 112
5.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6 Conclusion 127
Bibliography 128
Biography 157
List of Figures
1.1 An overview of my research. . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Three basic tensor partitioning types. . . . . . . . . . . . . . . . . . . . . 20
2.2 Inter-layer communication patterns between three basic tensor partitioning types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Layer-wise partitioning is determined by dynamic programming to minimize the computation cost and the communication cost. . . . . . . . . . 31
2.4 Dynamic programming on multi-paths. . . . . . . . . . . . . . . . . . . 31
2.5 The speedup of data parallelism (DP), “one weird trick” (OWT) [1], HYPAR
[2] and AccPar in a heterogeneous accelerator array. . . . . . . . . . . . 36
2.6 The speedup of data parallelism (DP), “one weird trick” (OWT) [1], HYPAR
[2] and AccPar in a homogeneous accelerator array. . . . . . . . . . . . . 38
2.7 The selected partitioning types for the weighted layers in Alexnet. The number of hierarchies is 7 and the batch size is 128. . . . . . . . . . . . 39
2.8 The speedup of data parallelism (DP), “one weird trick” (OWT) [1], HYPAR
[2] and AccPar under various partitioning hierarchies on Vgg19. . . . . . 40
3.1 Convolutional neural network (CNN). . . . . . . . . . . . . . . . . . . . 41
3.2 Data flows between two adjacent layers. . . . . . . . . . . . . . . . . . 44
3.3 Basics of ReRAM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 Data flows and dependencies in layer-wise pipeline. . . . . . . . . . . . . 47
3.5 A basic scheme for data input and kernel mapping. . . . . . . . . . . . . 50
3.6 A balanced scheme for data input and kernel mapping. . . . . . . . . . . 51
3.7 Layer-wise pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.8 Latency of PIPELAYER architecture. . . . . . . . . . . . . . . . . . . . . 53
3.9 Memory subarray design. . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.10 Overall architecture of PIPELAYER. . . . . . . . . . . . . . . . . . . . . 55
3.11 Error backward for (a) activation function, (b) max pooling layer and (c) convolution layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.12 Error backward for convolution layer by convolution of error of layer l and reordered kernels. . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.13 The computation of partial derivatives for kernels. . . . . . . . . . . . . . 60
3.14 Tradeoff between resolution and accuracy. . . . . . . . . . . . . . . . . . 61
3.15 Weight partition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.16 Speedups of networks in both training and inference. . . . . . . . . . . . 65
3.17 Energy savings for PIPELAYER. . . . . . . . . . . . . . . . . . . . . . . 66
3.18 Speedups vs. parallelism granularity. . . . . . . . . . . . . . . . . . . . . 68
3.19 Area vs. parallelism granularity. . . . . . . . . . . . . . . . . . . . . . . 68
4.1 Graph processing in vertex-centric programs. . . . . . . . . . . . . . . . 70
4.2 (a) Edge-centric processing and (b) dual sliding windows. . . . . . . . . . 70
4.3 (a) Sparse matrix and its compressed representations in: (b) CSC, (c) CSR, and (d) COO. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.4 (a) A directed graph and its representations in (b) adjacency matrix and (c) coordinate list. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5 Vertex Programming Model . . . . . . . . . . . . . . . . . . . . . . . . 75
4.6 GRAPHR Key insight: supporting graph processing with ReRAM crossbars. 76
4.7 GRAPHR accelerator architecture. . . . . . . . . . . . . . . . . . . . . . 78
4.8 Workflow of GRAPHR in out-of-core setting. . . . . . . . . . . . . . . . 80
4.9 Controller operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.10 Streaming-apply execution model. . . . . . . . . . . . . . . . . . . . . . 81
4.11 Preprocessing edge list. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.12 PageRank in vertex programming. . . . . . . . . . . . . . . . . . . . . . 85
4.13 SSSP in vertex programming. . . . . . . . . . . . . . . . . . . . . . . . . 85
4.14 The configuration of sALU to perform (a) add in PageRank and (b) min in SSSP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.15 The processing of (a) PageRank and (b) SSSP in GRAPHR . . . . . . . . 89
4.16 GRAPHR speedup normalized to CPU platform. . . . . . . . . . . . . . . 94
4.17 GRAPHR energy saving normalized to CPU platform. . . . . . . . . . . . 95
4.18 GRAPHR (a) performance and (b) energy saving compared to GPU platform. 96
4.19 GRAPHR (a) performance and (b) energy saving compared to PIM platform. 97
4.20 GRAPHR (a) performance and (b) energy saving v.s. dataset density. . . . 97
5.1 Fixed-point (integer) MVM in ReRAM. . . . . . . . . . . . . . . . . . . 99
5.2 The bit layout of (a) an 8-bit signed integer and (b) a 64-bit double-precision floating-point number. . . . . . . . . . . . . . . . . . . . . . . 100
5.3 The cycle number and crossbar number under various bit lengths. . . . . 104
5.4 The conversion of index and value in floating-point format to REFLOAT
format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.5 Comparison of a matrix block (a) in original full precision format and (b) in REFLOAT format. . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.6 The overall accelerator architecture for scientific computing in REFLOAT
format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.7 The architecture for (a) a processing engine for floating-point MVM on a matrix block and (b) a crossbar cluster for fixed-point MVM. . . . . . . 113
5.8 The row-major layout and block-major layout of a sparse matrix. . . . . . 116
5.9 The performance of GPU (in double and single precision), ESCMA and REFLOAT for CG solver. . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.10 The performance of GPU (in double and single precision), ESCMA and REFLOAT for BiCGSTAB solver. . . . . . . . . . . . . . . . . . . . . . 123
5.11 The convergence traces of CG solver of GPU (red line) and REFLOAT (blue line). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.12 The convergence traces of BiCGSTAB solver of GPU (red line) and REFLOAT (blue line). . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.13 Memory overhead of matrices in ESCMA and REFLOAT. . . . . . . . . . 125
List of Tables
2.1 Notations and descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Landscape of DNN accelerators (accelerators highlighted in cyan are designed for training). . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Rotational Symmetry of the Three Tensor Multiplications. . . . . . . . . 23
2.4 Intra-layer communication cost of the three basic tensor partitioning types. 26
2.5 Inter-layer communication cost between the three basic tensor partitioning types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6 The amount of floating point operations (FLOP) in the three multiplications. 29
2.7 The specifications of the accelerators. . . . . . . . . . . . . . . . . . . . 34
2.8 The comparison of flexibility of DP, OWT, HYPAR and ACCPAR. . . . . 40
3.1 Operations in a cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Cycles and cost in PIPELAYER architecture. . . . . . . . . . . . . . . . . 54
3.3 Hyper parameters of networks on MNIST. . . . . . . . . . . . . . . . . . 64
3.4 Configurations of the GPU platform. . . . . . . . . . . . . . . . . . . . . 64
3.5 Optimized parallelism granularity G configurations of each convolution layer in VGGs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.1 Comparison of different architectures for graph processing. . . . . . . . . 86
4.2 Property and operations of applications in GRAPHR. . . . . . . . . . . . 90
4.3 Graph datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4 Specifications of the CPU platform. . . . . . . . . . . . . . . . . . . . . 93
4.5 Specifications of the GPU platform. . . . . . . . . . . . . . . . . . . . . 93
5.1 The iteration numbers for convergence under various exp(onent) and fra(ction) bit configurations for matrix crystm03. . . . . . . . . . . . . . 107
5.2 List of symbols and descriptions. . . . . . . . . . . . . . . . . . . . . . . 108
5.3 Platform Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.4 Matrices in the evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.5 The bit number for exponent and fraction of matrix block and vector segment in REFLOAT. . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.6 The absolute number of iterations to reach convergence. . . . . . . . . . 126
Chapter 1
Introduction
Advances in deep learning (DL) have become the main drivers of revolutions in various
commercial and enterprise applications, such as computer vision [3–5], social network [6–8],
financial data analysis [9–11], healthcare [12–14] and scientific computing [15–17]. Due
to the high demand of computing power in DL applications, we have recently witnessed a
phenomenal trend in which the landscape of computing has shifted from general-purpose
processors to domain-specific architectures [1, 2, 18–117]. By sacrificing some flexibility,
such domain-specific accelerators are specialized for executing kernels of modern DL
algorithms and are therefore capable of delivering high performance with a low power budget.
While the latest advances are pushing the envelope of DL acceleration for higher
performance and energy efficiency, a recent study of chip specialization [118] has predicted
an ultimate accelerator wall. More specifically, due to the constraints in the exploration of
mapping computational problems (e.g., DL) onto hardware platforms with fixed hardware
resources, the optimization space of chip specialization is limited by a theoretical roofline.
Combined with recently slowing CMOS technology scaling, the gains from specialized
accelerator designs will gradually diminish and the execution efficiency will eventually hit an
upper bound. On the other hand, the computational demands of emerging DL applications continue
increasing in order to adapt to more complex models with deeper structures [119, 120] or
more sophisticated learning methods [121,122]. Efficient training acceleration of large-scale
deep learning is a challenge.
It has been widely accepted that Moore’s Law is ending soon [123], which means that
processor performance will no longer increase at the same pace as in recent decades. On the
other side, the volume of data that computer systems process has skyrocketed over the last
decade. In the conventional Von Neumann architecture, where computation and data storage
are separated, a large amount of data movement is also incurred due to the large number
of layers and millions of weights. Such data movement quickly becomes a performance
bottleneck due to limited memory bandwidth and, more importantly, an energy bottleneck.
A recent study [124] showed that a data movement between a CPU and off-chip memory
consumes two orders of magnitude more energy than a floating-point operation. In an era
of the soon-ending Moore's law and the increasing demand for deep learning, it is urgent to
reduce data movement and computing cost.
Processing-in-memory (PIM) is an efficient technique to reduce data movement in the
memory hierarchy. Meanwhile, the emerging non-volatile metal-oxide resistive random
access memory (ReRAM) [125] has been considered a promising candidate
for future memory architectures due to its high density, fast read access and low leakage
power. ReRAM-based PIM is a particularly appealing option because ReRAM provides
both computation and storage capability. Such a design incurs no increase in the cost per bit
implied by compute logic in traditional DRAM-based PIM. Recent work [88] demonstrated
that ReRAM-based PIM offers great acceleration of the inference of Convolutional
Neural Networks (CNNs) at low energy cost. However, how to accelerate training, which
is essential for deep learning inference, with PIM architectures remains challenging.
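The appeal of ReRAM for PIM can be made concrete with a functional model of a crossbar: a matrix is stored as cell conductances, an input vector is applied as word-line voltages, and the bit-line currents deliver the whole matrix-vector product in one analog step. The sketch below is an idealized model added for illustration; the quantization to a small number of conductance levels stands in for limited cell precision and is not the design of any particular accelerator:

```python
import numpy as np

def crossbar_mvm(weights, inputs, levels=16):
    """Idealized ReRAM crossbar MVM: weights are quantized to a few
    discrete conductance states, then one analog step yields all
    bit-line currents (Ohm's law per cell, Kirchhoff's law per column)."""
    w_max = np.abs(weights).max()
    # Map each weight onto `levels` conductance states (signed values
    # would use a pair of crossbars in practice).
    g = np.round(weights / w_max * (levels - 1)) / (levels - 1) * w_max
    # One column current per output, summing all row contributions at once.
    return g.T @ inputs

rng = np.random.default_rng(2)
W = rng.standard_normal((8, 4))   # 8 word lines x 4 bit lines
v = rng.standard_normal(8)        # input voltages
exact = W.T @ v
approx = crossbar_mvm(W, v)
```

The quantization error shrinks as the number of conductance levels grows, which is why low-precision fixed-point workloads map well onto crossbars while floating-point SpMV (Section 1.4) needs extra architectural support.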
Besides deep learning, graph processing is another big-data application. With the
explosion of data collected from massive sources, graph processing has received intensive
interest due to the increasing need to understand relationships in data. It has been applied in
many important domains including cyber security [126], social media [127], PageRank
citation ranking [128], natural language processing [129–131], system biology [132, 133],
recommendation systems [134–136] and machine learning [137–139]. There are several
ways to perform graph processing. Distributed systems [140–146] leverage ample
computing resources to process large graphs. However, they inherently suffer from
synchronization and fault-tolerance overhead [147–149] and load imbalance [150]. Alternatively,
disk-based single-machine graph processing systems [151–155] (a.k.a. out-of-core systems)
can largely eliminate all the challenges of distributed frameworks. The key principle of
such systems is to keep only a small portion of active graph data in memory and spill the
remainder to disks. The third approach is in-memory graph processing. The potential
of in-memory data processing has been exemplified in a number of successful projects,
including RAMCloud [156], Pregel [157], GraphLab [158], Oracle TimesTen [159], and
SAP HANA [160].
Graph processing is well known for its poor locality, caused by the random accesses
incurred in traversing the neighbor vertices, and for its high memory bandwidth requirement,
because the computations on the data fetched from memory are typically simple. In addition,
graph operations waste memory bandwidth because they use only a small portion of each
cache block. In conventional architectures, graph processing thus incurs a significant
amount of data movement and energy consumption.
Scientific computing, whose key SpMV computation is a subset of graph processing, is a
collection of tools, techniques and theories required to solve science and engineering
problems represented in mathematical systems [161]. Different from computer science,
which deals with discrete quantities, the underlying variables in scientific computing are
continuous in nature, such as time, temperature, distance, density, etc. One of the most
important aspects of scientific computing is obtaining solutions of partial differential
equation (PDE) models. These solutions
have a significant impact on the understanding of natural phenomena in science [162, 163]
and the design and decision-making of engineered systems [164, 165]. Unfortunately,
most problems in continuous mathematics cannot be solved accurately. In practice, they
are converted to a linear system Ax = b and then solved through an iterative solver that
ultimately converges to a numerical solution [166, 167]. To obtain an accurate answer,
intensive computing power [168, 169] is required to perform the sparse matrix-vector
multiplication (SpMV). In addition, a large memory footprint [170] is also essential for
storing and retrieving the input and output data, as well as data generated during the course
of the computation.

Figure 1.1: An overview of my research.
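The role of SpMV in such solvers can be made concrete with the conjugate gradient (CG) method, one such iterative solver, which spends essentially all of its per-iteration work in a single matrix-vector product. A minimal dense sketch follows; the small symmetric positive-definite system is chosen purely for illustration, and real workloads use a sparse A:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve Ax = b for symmetric positive-definite A; the dominant
    per-iteration cost is the single matrix-vector product A @ p."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                    # the SpMV kernel in sparse settings
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:     # converged to a numerical solution
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # tiny SPD system for illustration
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

Because every iteration repeats the same multiply against the same matrix, accelerating SpMV directly accelerates the whole solve.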
However, we are approaching the end of the scaling of Moore’s Law [171], and general-
purpose platforms, such as CPUs and GPUs, will no longer benefit from the integration of
cores [172]. Novel architecture paradigms must be proposed to improve performance and
efficiency on emerging application domains. Beyond conventional technology, emerging
non-volatile memory such as resistive random access memory (ReRAM) is considered
as a promising candidate for implementing processing-in-memory (PIM) accelerators
[88, 89, 91, 92] that can provide orders of magnitude improvement of computing efficiency
in the matrix-vector multiplication (MVM) operations. Note that most of these proposed
ReRAM-based accelerators are limited to machine learning applications, and in most
cases, machine learning applications can accept a low precision, e.g., less than 16-bit
fixed-point [173]. Since high-performance floating-point solvers are the norm in scientific
computing, it is both an opportunity and a challenge to leverage the power of parallel
in-situ processing in ReRAM to efficiently support floating-point SpMV and accelerate the
mathematical model solving process in the field of science and engineering.
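The gap between the two precision regimes can be made concrete: a 64-bit double carries a 52-bit fraction, and in a bit-sliced ReRAM mapping each extra fraction bit costs additional crossbars or cycles. The sketch below truncates a double's fraction field to a chosen width; it illustrates low-bit floating-point storage in general and is not the actual REFLOAT encoding:

```python
import struct

def truncate_fraction(x, f_bits):
    """Keep only the top f_bits of a double's 52-bit fraction field
    (the sign bit and 11-bit exponent are left untouched)."""
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    mask = ~((1 << (52 - f_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack('<d', struct.pack('<Q', bits & mask))
    return y

# With only 8 fraction bits the relative error is already below 2^-8,
# which is why an iterative solver can still converge (cf. Table 5.1).
approx_pi = truncate_fraction(3.14159, 8)
```

An iterative solver tolerates such truncation because later iterations correct earlier rounding, trading a few extra iterations for far fewer bit slices per SpMV.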
In this dissertation, I address the above-mentioned challenges in deep learning and
graph processing through architecture-level optimizations. Related research was published
at HPCA'20 [174], HPCA'19 [2], HPCA'18 [175] and HPCA'17 [92]. Another related
work [176] is under review. Figure 1.1 shows a logical overview of my research.
1.1 Tensor Partitioning for Deep Learning Accelerators
To address the mismatch between the diminishing performance gains in hardware accelera-
tors and the ever-growing computational demands, it is imperative to explore coarse-grained
parallelism among multiple performance-bounded accelerators to support large-scale DL
applications. In general, a deep neural network (DNN) model is a parametric function that
takes a high-dimensional input and makes useful predictions (i.e., inference), such as a
classification label. Model parameters, i.e., kernels or weights, are obtained through a large
number of iterations in training process involving data forward, error backward and gradient
calculation phases. The trained model can be used to perform inference function through
only the data forward phase. Compared to inference, training is much more complicated
and computational intensive. Hence, typically training is offloaded to high-end CPUs/GPUs
and then the trained models are deployed to end/user devices. It is a natural need to have
multi-accelerator architectures specialized for DNN training.
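The three training phases can be sketched for a single fully-connected layer. The NumPy toy below, an illustration with a squared-error loss rather than code from this dissertation, shows why training roughly triples the work of inference: each input triggers a forward product, a backward product for the error, and an outer product for the gradient:

```python
import numpy as np

def train_step(W, x, y_true, lr=0.01):
    """One training iteration for a single fully-connected layer with a
    squared-error loss: forward, error backward, gradient, update."""
    y = W @ x                 # data forward
    e = y - y_true            # dL/dy for L = 0.5 * ||y - y_true||^2
    e_prev = W.T @ e          # error backward: passed to the previous layer
    dW = np.outer(e, x)       # gradient calculation
    return W - lr * dW, e_prev

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
y_true = rng.standard_normal(4)
W_new, e_prev = train_step(W, x, y_true)
```

Inference needs only the first line; the two extra products and the weight write-back are what make training both more compute-intensive and harder to pipeline.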
Given the high complexity of modern DNN models, finding the best distribution of com-
putations on multiple accelerators is nontrivial. Moreover, due to the inherent performance
difference of accelerators and the discrepancy of the communication bandwidth between
them, ensuring high throughput and balanced execution is extremely challenging. The
problem can be formulated as partitioning the model and data tensors among accelerators to
enable parallel processing. There are two approaches: data parallelism, where each acceler-
5
ator replicates the model and processes different input data in parallel before applying the
calculated gradients to update the model; and model parallelism, where each accelerator
keeps part of the model and performs a part of computation based on the same input. The
choice between the two parallelism configurations affects the overall performance since it
incurs different communication patterns between the accelerators.
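The two configurations can be sketched for one fully-connected layer as two ways of slicing the same matrix product: data parallelism splits the batch dimension and replicates the weights, while model parallelism splits the output dimension and shares the batch. A minimal NumPy illustration (the shapes and the two-accelerator split are hypothetical):

```python
import numpy as np

def data_parallel(W, X, n_acc):
    """Each accelerator holds a full copy of W and a slice of the batch."""
    return [W @ xs for xs in np.array_split(X, n_acc, axis=1)]

def model_parallel(W, X, n_acc):
    """Each accelerator holds a slice of W's output rows and the full batch."""
    return [ws @ X for ws in np.array_split(W, n_acc, axis=0)]

rng = np.random.default_rng(1)
W = rng.standard_normal((6, 5))   # weights: 6 outputs, 5 inputs
X = rng.standard_normal((5, 8))   # batch of 8 input columns
dp = np.concatenate(data_parallel(W, X, 2), axis=1)   # join along batch
mp = np.concatenate(model_parallel(W, X, 2), axis=0)  # join along outputs
```

Both recover the full product W @ X; they differ in what must be communicated afterwards (gradients in one case, activations in the other), which is exactly what a cost model must capture.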
The current solutions to this problem are either purely empirical or incomplete — both
lacking optimality guarantees. For example, for a given DNN, “One Weird Trick” (OWT) [1]
empirically suggests using data parallelism for convolutional (CONV) layers and model
parallelism for fully-connected (FC) layers. HYPAR [2] proposes a principled approach
to search for the optimized parallelism configuration to minimize data communication.
Although it can achieve a much better result than OWT, HYPAR suffers from several
limitations: 1) the search is based on an incomplete design space; 2) it can only handle DNN
architectures with linear structure; 3) it lacks a cost model and uses only communication as
the proxy for performance optimization; and most importantly 4) it assumes a homogeneous
execution environment — the performance of each accelerator and the bandwidth between
them are all the same. A truly optimal solution to this critical problem does not yet exist.
The combination of model and data parallelism has been explored in deep learning accelerator
architectures [62, 92] and multi-GPU training systems [1, 177–180]. Recursive methods [76, 179]
have been proposed for tensor partitioning on multiple devices, and dynamic programming
methods [177, 179] for layer-wise tensor partitioning. Inspired by those previous
works [1, 76, 177–180], we present ACCPAR — a principled and systematic method of
determining the tensor partition among heterogeneous accelerator arrays. Our solution is
composed of several key innovations. First, we consider a complete tensor partition space
in all three dimensions: batch size, input data size, and output data size. Hence, our solution
is able to reveal previously unknown parallelism configurations. The completeness and
optimality of the search algorithm are also guaranteed. Second, in order to better optimize
performance, we propose a cost model considering both computation and communication
cost of a heterogeneous execution environment, instead of using communication as the proxy
for performance as in HYPAR. Third, ACCPAR offers flexible tensor partition ratio between
the accelerators to match their unique computing power and network bandwidth. Finally,
we propose a technique to handle the emerging multi-path patterns in modern DNNs such
as ResNet [120]. ACCPAR significantly outperforms the state-of-the-art solutions, offering
the first complete solution for tensor partitioning on heterogeneous accelerator arrays.
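The layer-wise search over partitioning types (cf. Figure 2.3) can be sketched as a standard dynamic program: each layer pays a computation cost for its chosen type plus a transition (communication) cost relative to the previous layer's type. The cost tables below are made-up numbers for illustration, not ACCPAR's actual cost model:

```python
def best_partition(comp_cost, trans_cost):
    """Pick one partitioning type per layer, minimizing per-layer
    computation cost plus inter-layer transition (communication) cost."""
    n_layers, n_types = len(comp_cost), len(comp_cost[0])
    best = list(comp_cost[0])        # best[t]: cost of layer 0 using type t
    back = []                        # back[l-1][t]: best previous type
    for l in range(1, n_layers):
        new_best, prevs = [], []
        for t in range(n_types):
            p = min(range(n_types), key=lambda q: best[q] + trans_cost[q][t])
            new_best.append(best[p] + trans_cost[p][t] + comp_cost[l][t])
            prevs.append(p)
        best, back = new_best, back + [prevs]
    t = min(range(n_types), key=lambda q: best[q])
    total, seq = best[t], [t]
    for prevs in reversed(back):     # backtrack the optimal type sequence
        t = prevs[t]
        seq.append(t)
    return total, seq[::-1]

comp_cost = [[1, 10], [10, 1]]       # comp_cost[layer][type], illustrative
trans_cost = [[0, 5], [5, 0]]        # trans_cost[prev_type][type], illustrative
total, seq = best_partition(comp_cost, trans_cost)
```

On the toy costs, the program correctly prefers switching types between the two layers (total cost 7) over staying with either single type throughout (cost 11).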
We simulate ACCPAR on a heterogeneous accelerator array composed of both TPU-v2
and TPU-v3 accelerators for training of large-scale DNN models such as Alexnet [119],
Vgg series [181] and Resnet series [120]. The average performance of “one weird trick”
(OWT) [1], HYPAR [2] and ACCPAR, normalized to the baseline data parallelism on the
heterogeneous accelerator array is 2.98×, 3.78×, 6.30×, respectively. For Vgg series,
ACCPAR can achieve a speedup up to 16.14×, while the highest speedup of OWT and
HYPAR are 8.24× and 9.46×, respectively. For Resnet series, ACCPAR can achieve
performance speedup from 1.92× to 2.20×, while the ranges of speedup achieved by OWT
and HYPAR are 1.22× to 1.38× and 1.03× to 1.04×, respectively.
1.2 Layer-wise Pipelined Parallelism for Deep Learning Accelerators
Despite the recent progress [88–90], the current ReRAM-based schemes lack important
features to efficiently execute complete deep learning applications. First, they focus only on
the testing (inference) phase of CNNs and do not support the more sophisticated and intensive
training (learning) phase. Second, ISAAC [89] uses a very deep pipeline to improve system
throughput. However, it is only beneficial when a large number of consecutive images can
be fed into the architecture. This is not true in the training phase, as only a limited number of
consecutive images can be processed before weight updates. Third, the deep pipeline in
ISAAC also introduces pipeline bubbles. Moreover, data organization and kernel mapping
are not clearly addressed in PRIME [88].
To close the gaps of ReRAM-based acceleration for complete deep learning, this work
proposes PIPELAYER, a ReRAM-based PIM accelerator that supports complete deep
learning applications. Compared to previous works, we make the following contributions.
(1) Accelerating both training and testing. Supporting the training phase is more sophisticated
and challenging because it involves weight updates and complex data dependencies. The
previous schemes assume that the weight values are only written to ReRAM once at the
start and never change. (2) A simple intra- and inter-layer pipeline designed for training.
To ensure high throughput for CNNs of many layers with weight updates, PIPELAYER
adopts a new pipelined architecture, different from [89], so that data can continuously
flow into the accelerator in consecutive cycles; this is essential to support a pipelined training
phase. Various data input and kernel mapping schemes can be exploited using our design
to balance data processing parallelism and hardware cost (i.e. the number of replicated
ReRAM arrays). (3) Spike-based data input and output. To eliminate the overhead of DACs
and ADCs, PIPELAYER uses a spike-based scheme, instead of voltage-level based scheme
for data input. Such a design requires more cycles to inject data; however, the drawback is
offset by the pipelined architecture for multiple layers. The data input scheme used in [89]
is similar to our spike-based input, both eliminating DACs, but our Integration and Fire
component eliminates ADCs while [89] keeps them.
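The throughput argument behind the pipeline can be sketched with a toy cycle model: input i occupies layer c - i at cycle c, so once the pipeline fills, a new input enters every cycle. This is only an illustrative schedule, not PIPELAYER's actual timing:

```python
def pipeline_schedule(n_layers, n_inputs):
    """Which (input, layer) pairs are active at each cycle of a
    layer-wise pipeline: input i is at layer c - i during cycle c."""
    total_cycles = n_layers + n_inputs - 1
    return [[(i, c - i) for i in range(n_inputs) if 0 <= c - i < n_layers]
            for c in range(total_cycles)]

sched = pipeline_schedule(n_layers=3, n_inputs=4)
# Serial execution would need 3 * 4 = 12 cycles; the pipeline finishes
# the same 12 (input, layer) work items in 3 + 4 - 1 = 6 cycles.
```

The schedule also makes the training constraint visible: a weight update forces the pipeline to drain and refill, so the speedup depends on how many consecutive inputs can be processed between updates.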
In the evaluation, we use ten networks. Six are popular large scale networks, AlexNet [119],
and VGG-A, VGG-B, VGG-C, VGG-D, VGG-E [181], which are based on ImageNet [182].
Four networks based on MNIST [183] are built by ourselves. To evaluate PIPELAYER, we
build a simulator based on NVSim [184]. The baseline is a platform with the newest
released GPU at the time, the GTX 1080. The experimental results show that PIPELAYER achieves an
average speedup of 42.45× over the GPU platform. The average energy saving of
PIPELAYER compared with the GPU implementation is 7.17×.
1.3 Accelerator Architecture for Graph Processing
A graph can be naturally represented as an adjacency matrix, and most graph algorithms
can be implemented as some form of matrix-vector multiplication. However, due to the
sparsity of graphs, graph data are stored in compressed sparse representations rather than
in matrix form. Graph processing on a sparse data representation involves: (1) bringing
data for computation from memory based on the compressed representation; and (2)
performing the computations on the loaded data. Due to the sparsity, the data accesses
in (1) may be random and irregular. In essence, (2) performs simple computations that
are parts of the matrix-vector multiplications, but only on non-zero operands. As a result,
each computing core alternates between long random memory access latencies and short
computations. This leads to the well-known challenges in graph processing and other issues
such as memory bandwidth waste [185].
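The two steps can be illustrated with a toy compressed-sparse-row (CSR) sparse matrix-vector multiplication in Python (a minimal sketch; the graph and values are made up). The inner loop's `x[col_idx[k]]` access is exactly the random, data-dependent access described in (1):

```python
import numpy as np

# Toy graph of 4 vertices; row i -> column j means an edge i -> j.
A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0]], dtype=float)

# CSR arrays: non-zero values, their column indices, and row pointers.
values  = A[A != 0]
col_idx = np.nonzero(A)[1]
row_ptr = np.concatenate(([0], np.cumsum((A != 0).sum(axis=1))))

def spmv_csr(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # col_idx[k] is data-dependent: the random, irregular access.
            y[i] += values[k] * x[col_idx[k]]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])
y = spmv_csr(values, col_idx, row_ptr, x)
```

Each iteration of the inner loop is one short multiply-accumulate on non-zero operands, so a core spends most of its time waiting on the irregular loads rather than computing.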
Current graph processing accelerators mainly optimize the memory accesses. Specifically,
Graphicionado [185] reduces memory access latency and improves throughput by
replacing random accesses to the conventional memory hierarchy with sequential accesses
to scratchpad memory, together with pipelining. Ozdal et al. [186] improve performance
and energy efficiency through latency tolerance and hardware support for dependence
tracking and consistency. TESSERACT [187] applies the principle of near-data processing
by placing compute logic (e.g., in-order cores) close to memory to exploit the high
internal bandwidth of the Hybrid Memory Cube (HMC) [188]. However, all of these
architectures change little in the compute unit: the simple computations are performed one
at a time by instructions or specialized units.
To perform matrix-vector multiplications, two approaches exist that reflect two ends
of a spectrum: (1) dense-matrix-based methods incur regular memory accesses and
perform computations with every element in the matrix/vector; (2) sparse-matrix-based
methods incur random memory accesses but only perform computations on non-zero
operands. In this work, we adopt an approach that can be considered a mid-point
between these two ends and that can potentially achieve better performance and energy
efficiency. Specifically, we propose to perform sparse matrix-vector multiplications on data
blocks of the compressed representation. The benefit is two-fold. First, the ratio of
computation to data movement increases, so the cost of bringing in data can be naturally
hidden by the larger amount of computation on a block of data. Second, inside a data
block, computations can be performed in parallel. The downside is that some hardware
and energy are wasted on useless multiplications with zeros.
This approach could in principle be applied to current GPUs or accelerators, but with
the same amount of compute resources (e.g., SMs in GPUs), it is unclear whether the gain
would outweigh the inefficiency caused by sparsity. Clearly, a key factor is the cost of the
compute logic: with a low-cost mechanism for matrix-vector multiplication,
the proposed approach is likely to be beneficial.
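The block-based mid-point can be sketched as follows (an illustrative NumPy model, not the accelerator hardware: `BLK` stands in for the crossbar size, and the density is arbitrary). All-zero blocks are skipped, while non-empty blocks are computed densely, as a crossbar would:

```python
import numpy as np

rng = np.random.default_rng(4)
N, BLK = 16, 4                    # matrix size and crossbar-sized block (4x4)
A = rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.05)  # ~95% sparse
x = rng.standard_normal(N)

y = np.zeros(N)
processed = 0
for bi in range(0, N, BLK):
    for bj in range(0, N, BLK):
        block = A[bi:bi+BLK, bj:bj+BLK]
        if not block.any():
            continue              # an all-zero subgraph block is skipped entirely
        # A non-empty block is computed densely, zeros included -- this is
        # the bounded waste inside one moderate-sized crossbar.
        y[bi:bi+BLK] += block @ x[bj:bj+BLK]
        processed += 1
```

Each processed block performs `BLK * BLK` multiply-accumulates per fetch, raising the computation-to-data-movement ratio, while the `continue` preserves the coarse-grained sparsity.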
We demonstrate that a non-volatile memory, metal-oxide resistive random access
memory (ReRAM) [125], can serve as the essential hardware building block to perform
matrix-vector multiplications in graph processing. Recent works [88, 89, 92] demonstrate
promising applications of ReRAM's efficient in-situ matrix-vector multiplication to
neural network acceleration. The analog computation is suitable for graph processing
because: (1) the iterative algorithms can naturally tolerate imprecise values; and (2)
both probability calculations (e.g., PageRank and Collaborative Filtering) and typical graph
algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. Due to its low cost
and energy efficiency, matrix-based computation in ReRAM does not incur significant
hardware waste from sparsity: such waste occurs only inside a ReRAM crossbar
of moderate size (e.g., 8 × 8). Moreover, the sparsity is not completely lost: if a small
subgraph contains all zeros, it does not need to be processed. As a result, the architecture
mostly enjoys the benefits of more parallelism in computation and a higher ratio of
computation to data movement. Performing computation in ReRAM also enables near-data
processing with reduced data movement: the data do not need to go through the
memory hierarchy as in conventional architectures or some accelerators.
Applying ReRAM to graph processing poses a few challenges. (1) Data representation:
a graph is stored in compressed format, not as an adjacency matrix, so to perform in-memory
computation, the data need to be converted to matrix format. (2) Large graphs: real-world
large graphs may not fit in memory. (3) Execution model: the order of subgraph processing
must be carefully determined, because it affects the hardware cost and correctness. (4)
Algorithm mapping: it is important to map various graph algorithms to ReRAM with good
parallelism.
We propose GRAPHR, a novel ReRAM-based accelerator for graph processing. It
consists of two key components: memory ReRAM and graph engines (GEs), both
based on ReRAM but with different functionality. The memory ReRAM stores the graph
data in compressed sparse representation. The GEs (ReRAM crossbars) perform the efficient
matrix-vector multiplications on the sparse matrix representation. We propose a novel
streaming-apply execution model and a corresponding preprocessing algorithm to ensure correct
processing order. We propose two algorithm mapping patterns, parallel MAC and parallel
add-op, to achieve good parallelism for different types of algorithms. GRAPHR can be used
as a drop-in accelerator for out-of-core graph processing systems.
In the evaluation, we compare GRAPHR with a software framework [151] on a high-end
CPU-based platform, and with GPU and PIM [187] implementations. The experimental
results show that GRAPHR achieves a 16.01× (up to 132.67×) speedup and a 33.82× energy
saving in geometric mean over the CPU baseline. Compared to the GPU, GRAPHR achieves
a 1.69× to 2.19× speedup and consumes 4.77× to 8.91× less energy. GRAPHR gains a
speedup of 1.16× to 4.12×, and is 3.67× to 10.96× more energy efficient, compared to the
PIM-based architecture.
1.4 Low-Cost Floating-Point Processing in ReRAM
Previous work [189] is arguably the first to realize scientific computing using ReRAM. To
accelerate the SpMV in an iterative solver, the whole matrix is partitioned into blocks and
the input vector into segments. Each matrix block is mapped to a cluster of
ReRAM crossbars, and the clusters process blocks of the matrix in parallel. For a floating-
point (64-bit) multiplication on a matrix block, the block is expanded to a 118-bit
fixed-point format (53 bits for the fraction, 64 for the exponent padding and 1 for the sign)
and mapped to 118 crossbars [189]. This mapping cost is high for ReRAM crossbars and
leads to low performance. Because the number of ReRAM crossbars on an accelerator is
limited [89, 175, 189], and one cluster may include many crossbars, the number of clusters
available to process the matrix in parallel is low. At the same time, a large number of
crossbars in a cluster leads to a large number of cycles consumed by the floating-point
multiplication of a matrix block with a vector segment.
In this work, we present REFLOAT, a data format and a supporting accelerator
architecture for low-cost floating-point processing in ReRAM for scientific computing. We
study the effects of using fewer bits in both the exponent and the fraction, for both the
matrix block and the vector segment, on (1) the number of ReRAM crossbars required in a
cluster to perform the floating-point multiplication on a matrix block, (2) the number of
cycles consumed by the floating-point multiplication on a matrix block, and (3) the
introduced accuracy loss. We then present the conversion scheme from the default double-
precision floating-point format to the REFLOAT format, and the computation in REFLOAT
format. The REFLOAT accelerator architecture is presented to support the low-cost
floating-point processing in REFLOAT format. In the evaluation, we show that REFLOAT
performs 21.88× better than a GPU baseline and 4.30× faster than the state-of-the-art
accelerator for scientific computing in ReRAM [189].
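The trade-off between bit width and accuracy can be sketched in Python. This is an illustrative quantizer, not the exact REFLOAT encoding (which is defined later in the dissertation); the function name `to_lowbit`, the shared per-block base exponent, and the bit widths are assumptions for illustration:

```python
import numpy as np

def to_lowbit(vals, frac_bits, exp_bits):
    """Sketch: re-encode doubles with a reduced (sign, exponent, fraction)
    budget relative to a shared per-block base exponent."""
    base = np.floor(np.log2(np.max(np.abs(vals))))   # per-block base exponent
    exps = np.floor(np.log2(np.abs(vals))) - base    # offsets from the base
    exps = np.clip(exps, -(2 ** exp_bits) + 1, 0)    # few exponent bits
    scale = 2.0 ** (base + exps)
    # Few fraction bits: round the significand to frac_bits places.
    fracs = np.round(vals / scale * (1 << frac_bits)) / (1 << frac_bits)
    return fracs * scale

v = np.array([3.14159, -0.001234, 42.0, 0.5])
approx = to_lowbit(v, frac_bits=6, exp_bits=3)
```

Values near the block's base exponent are kept accurately, while tiny values (here `-0.001234`) flush toward zero — the kind of accuracy loss an iterative solver can absorb over extra iterations.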
Chapter 2
Tensor Partitioning for Deep Learning Accelerators
2.1 Background and Motivation
2.1.1 DNN Training
Table 2.1: Notations and descriptions.

Notation   Description
Fl         Input feature map to Layer l (output feature map of Layer l−1).
El+1       Input error to Layer l (output error of Layer l+1).
Wl         Kernel of Layer l.
∆Wl        Gradient of the kernel in Layer l.
B          Mini-batch size.
Di,l       Input data size (channel number) of Layer l.
Do,l       Output data size (channel number) of Layer l.
ci         The computation density of Accelerator i.
pi,l       The partitioning of Accelerator i at Layer l.
bi         The network bandwidth of Accelerator i.
A(·)       Function to return the size of a tensor.
C(·)       Function to return the amount of computation to be performed by an accelerator.
α, β       Partitioning ratios.
DNN training involves three tensor computing phases at each layer: forward, backward
and gradient. The notations and descriptions are listed in Table 2.1. In the forward phase, at
layer l, the input feature map tensor (Fl) from the previous layer and the kernel/weight tensor
(Wl) are multiplied in fully-connected (FC) layers or convolved in convolutional (CONV)
layers to generate the output feature map tensor, which is used as the input feature map
tensor for the next layer (Fl+1). Usually a non-linear activation f(·) is applied to each
scalar of the feature map. Thus, the forward phase can be represented as Fl+1 = f(Fl ⊗ Wl),
where ⊗ is either multiplication or convolution. In the backward phase, at layer l, the error
tensor (El) is computed by El = (El+1 ⊗ Wl⊤) ⊙ f′(Fl), where El+1 is the error tensor from
layer l+1, ⊙ is an element-wise multiplication, and f′(·) is the derivative function of f(·).
In the gradient phase, the gradient of the kernel/weight is computed by ∆Wl = Fl⊤ ⊗ El+1.
The three tensor computation phases capture the common flow in many popular training
algorithms, such as Gradient Descent, Stochastic Gradient Descent, Mini-batch Gradient
Descent, Momentum [190] and Adaptive Moment Estimation (Adam) [191]. For example,
the Momentum method updates the parameters using vt = γ·vt−1 + η·∇θJ(θ), θ = θ − vt,
where θ is the parameter (weight), ∇θJ(θ) is the gradient of a loss function J(·) with
respect to θ, v is the velocity that records the historic gradients, γ is the momentum hyper-
parameter and η is the learning rate, respectively.
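For an FC layer, the three phases and a Momentum update can be sketched in NumPy (a minimal illustration with made-up sizes; `relu` stands in for the activation f(·)):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    return (x > 0.0).astype(x.dtype)

rng = np.random.default_rng(0)
B, D_i, D_o = 4, 8, 6                    # mini-batch, input and output sizes
F_l  = rng.standard_normal((B, D_i))     # input feature map of layer l
W_l  = rng.standard_normal((D_i, D_o))   # kernel of layer l
E_l1 = rng.standard_normal((B, D_o))     # error arriving from layer l+1

# Forward: F_{l+1} = f(F_l x W_l)
F_l1 = relu(F_l @ W_l)
# Backward: E_l = (E_{l+1} x W_l^T) elementwise-times f'(F_l)
E_l = (E_l1 @ W_l.T) * relu_grad(F_l)
# Gradient: dW_l = F_l^T x E_{l+1}
dW_l = F_l.T @ E_l1

# Momentum update: v_t = gamma*v_{t-1} + eta*grad;  theta = theta - v_t
gamma, eta = 0.9, 0.01
v = np.zeros_like(W_l)
v = gamma * v + eta * dW_l
W_l = W_l - v
```

The shapes match Table 2.1: `F_l1` is (B, Do,l), `E_l` is (B, Di,l), and `dW_l` is (Di,l, Do,l), which is what makes the three phases couple so tightly when tensors are partitioned.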
2.1.2 Deep Learning Accelerator Architectures
Domain-specific computing architectures [225–228] are considered a promising solution
to accommodate the ever-growing intensity of computing in various deep learning applications.
This view has also been confirmed by a flurry of DNN accelerators [1, 2, 18–110] that
have emerged in recent years. Compared with general-purpose CPUs/GPUs, these custom
architectures achieve better performance and higher energy efficiency.
As summarized in Table 2.2, many designs practice vertical integration across the
algorithm and hardware levels [73–86], where DNN models are typically optimized prior to
being deployed for inference. Some designs investigate the dataflow (or data reuse pattern)
in DNN workloads [53–61], among which Eyeriss [54–56] is a representative design that
explores many data reuse opportunities existing in DNN execution. Processing-in-memory
Table 2.2: Landscape of DNN accelerators (accelerators highlighted in cyan are designed for training).

Segment: DNN Accelerators & Architectures
Neuro Co-processor: SpiNNaker [18], Neuromorphic Acc. [19], TrueNorth [20–22], MT-spike [23], PT-spike [24]
DNN Co-processor: NPU [25], DianNao-family [26–29], Cambricon [30], Cambricon-x [31], TPU [32, 33], ScaleDeep [34], Stripes [35], Neural Cache [192, 193], Diffy [194]
FPGA: FPGA-DCNN [36], Embedded-FPGA-CNN [37], FPGA-Exploration [38], OpenCL-FPGA-CNN [39] [195], Caffeine [40], DeepBurning [41], FPGA-DPCNN [42], TABLA [43], DNNWEAVER [44], FP-DNN [45], FPGA-LSTM [46], ESE [47], FPGA-Dataflow [48], FPGA-BNN [49], FPGA-Utilization [50], FFT-CNN [51], VIBNN [52], iSwitch [196], Shortcut-Mining [197], FA3C [198], E-RNN [199]
Dataflow: Neuflow [53], Eyeriss [54–56], Flexflow [57], Fused-CNN [58], CNN-Partition [59], GANAX [60], UCNN [61], TANGRAM [200], Sparse-Systolic [201], MAERI [202]
PIM: Neurocube [62], XNOR-POP [63], DRISA [64], 3DICT [65], NAND-NET [203], SCOPE [204], Promise [205]
Light Models: EIE [66], SC-DCNN [67], SCNN [68], Escher [69], LookNN [70], Bit-Pragmatic DNN [71], Bit Fusion [72], Cnvlutin [101], TIE [206], Laconic [207], ADM-NN [208], Gist [209]
Co-Design: DPS-CNN [73], Minerva [74], MoDNN [75], MeDNN [76], AdaLearner [77], Stitch-X [78], PIM-DNN [79], Scalpel [80], CirCNN [81], CoSMIC [82], SnaPEA [83], OLAcce [84], Prediction-based DNN [85], PERMDNN [86], MnnFast [210], DNN Computation Reuse [211], vDNN [212], Compressing-DMA-Engine [213], AxTrain [214], Eager pruning [215], Bit-Tactical [216], GENESYS [217]
Emerging Tech.: TETRIS [87], RENO [90], PRIME [88], ISAAC [89], Memristive Boltzmann Machine [91], PipeLayer [92], Atomlayer [93], ReCom [94], ReGAN [95], ReRAM ACC. [96], EMAT [97], ReRAM-BNN [98], ZARA [99], SNrram [100], Sparse ReRAM Engine [218], RedEye [219], Quantum-SC-NN [220], FloatPIM [221], PUMA [222], FPSA [223]
Toolset/Framework: Data Parallelism [108], OWT [1], NEUTRAMS [102], Perform-ML [103], Adaptive-Classifier [104], DNNBuilder [105], Group Scissor [106], FFT-CNN [107], HyPar [2], NNest [224]
(PIM) based designs are also proposed to reduce costly off-chip memory accesses [62–65].
Lightweight models are also introduced [66–72] to reduce computational effort. Many
designs based on emerging memory technologies, such as resistive random access memory
(ReRAM) technology [88–100, 229] with 3D stacking technology [62, 65, 87], are also
proposed.
2.1.3 Motivation
Although domain-specific architectures have effectively addressed the challenges of the
ending of Moore’s law [230], the recent study of chip specialization [118] has clearly
demonstrated the diminishing specialization returns and the ultimate accelerator wall. In
other words, it is unlikely to further achieve fine-grained optimizations on a single ac-
celerator. To satisfy the computation and memory requirement for large DNN models
and datasets that typically cannot be satisfied by a single accelerator, a natural solution is
coarse-gained DNN execution on an accelerator array. On the other hand, as highlighted in
cyan in Table 2.2, only a few of the existing DNN accelerators are designed for training.
Among these designs, strict constraints are often applied to the models that can be supported.
For example, [1, 2, 108] only considered homogeneous platforms, where the computation
capability and the network bandwidth for each accelerator are identical. In reality, how-
ever, it is more important to explore solutions for an array of heterogeneous accelerators
with various computation capacity and network bandwidth. For example, though a more
powerful TPU-v3 was released, the early deployed TPU-v2 may not retire immediately
considering the deployment cost and the need for supporting various acceleration workloads.
They are in fact both available off-the-shelf [231]. It is important to optimize large-scale
DNN training acceleration when both TPU versions are used simultaneously. To achieve
high throughput and balanced execution, we need to efficiently distribute data and model
tensors among accelerators with the awareness of heterogeneous computation capability
and network bandwidth. A principled and systematic approach is needed to overcome
the challenge of handling the complexity of DNN models and heterogeneous hardware
execution environment.
2.2 Tensor Partitioning Space
Compared with DNN inference, DNN training is more complicated because of the three
computation phases involved in training, i.e., forward, backward and gradient. The tensors
and computations in the three phases are closely coupled. We first need to construct a
complete set of the basic tensor partitioning types.
2.2.1 Problem Statement
We first consider FC layers and later show that the solution can be naturally extended to
CONV layers. In FC layers, DNN training involves three tensor computing phases:

    Forward:  Fl+1 = f(Fl × Wl),
    Backward: El = (El+1 × Wl⊤) ⊙ f′(Fl),
    Gradient: ∆Wl = Fl⊤ × El+1.

Using the notations in Table 2.1, the shapes of the tensors in the above three phases are
as below. Here we do not include the element-wise multiplications in the shape relations,
since they can be performed in place.

    Forward:  (B, Do,l) ← (B, Di,l) × (Di,l, Do,l),
    Backward: (B, Di,l) ← (B, Do,l) × (Do,l, Di,l),
    Gradient: (Di,l, Do,l) ← (Di,l, B) × (B, Do,l).
For illustration purposes, this section considers the simple case of an array with two
accelerators. The problem is to exhaustively and systematically enumerate all possible
partitions of the tensors involved in the three phases between the two accelerators, and to
understand the corresponding data communication and replication requirements. This is
critical because the partition determines the communication between the accelerators
and affects overall training performance. We will also explain why the current solutions [1, 2]
fail to provide a complete and comprehensive solution.
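The three shape relations above can be checked mechanically: each phase is a matrix product whose inner dimensions must chain. A tiny self-check (sizes are arbitrary):

```python
# Each phase: (LHS of x, RHS of x, result). Inner dimensions must match and
# the outer dimensions give the result shape.
B, D_i, D_o = 32, 128, 64
phases = {
    "forward":  ((B, D_i), (D_i, D_o), (B, D_o)),
    "backward": ((B, D_o), (D_o, D_i), (B, D_i)),
    "gradient": ((D_i, B), (B, D_o), (D_i, D_o)),
}
for name, (lhs, rhs, out) in phases.items():
    assert lhs[1] == rhs[0], f"{name}: inner dimensions must match"
    assert (lhs[0], rhs[1]) == out, f"{name}: unexpected output shape"
```

Note that each of the three dimensions B, Di,l and Do,l appears as the contracted (inner) dimension in exactly one phase — a fact the partitioning analysis below exploits.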
2.2.2 Partitioning in Three Dimensions
We note that the two matrices in each of the pairs (Fl, El) and (Fl+1, El+1) have the same
shape. We assume that Fl and El (also Fl+1 and El+1) are partitioned in the same manner.
This constraint is intuitive: otherwise, additional communication would be unnecessarily
incurred, contradicting our goal of minimizing communication between the accelerators.
Only three dimensions appear in the three tensor computing phases: B (batch
size), Do,l (output data size of layer l), and Di,l (input data size of layer l). Therefore, we
can naturally focus on partitions in these three dimensions. For a partition in one
dimension, we assume that the same partition parameter is used for that dimension in every
tensor, to avoid additional communication.
Key observation: The dimensions are not independent. In fact, only one dimension can
be "free" in a partition.
We explain this observation with an example: consider the forward phase and a partition
in the B dimension. Since we will have only two partitions, for (B, Di,l) (i.e., Fl), the Di,l
dimension should not be partitioned. This in turn determines that (Di,l, Do,l) (i.e., Wl) should
not be partitioned in the Di,l dimension, otherwise the matrix multiplication cannot be
performed. The only case left is the Do,l dimension of Wl. Suppose we partition it: then
combining the products of the local partitions in each accelerator does not yield the complete
result Fl+1 of shape (B, Do,l). Specifically, depending on the partition, only the upper-left and
lower-right sub-matrices, or the upper-right and lower-left sub-matrices, are computed.
Therefore, the Do,l dimension of Wl cannot be partitioned. In fact, the whole Wl needs to be
replicated on the two accelerators to compute the complete Fl+1. The other scenarios can be
analyzed similarly.
With the assumption that Fl and El (also Fl+1 and El+1) use the same partition, and the
fact that only one dimension is free in a partition, there are only three partition types. We
discuss them one by one in the following.
[Figure 2.1 here: for each of the forward, backward and gradient phases, the tensors Fl, Wl,
Fl+1, El, Wl⊤, El+1, Fl⊤ and ∆Wl are drawn with their partitioned dimensions, for (a) Type-I
(partitioning the B dimension), (b) Type-II (partitioning the Di,l dimension) and (c) Type-III
(partitioning the Do,l dimension).]

The partitioning ratio for one accelerator is α and the partitioning ratio for the other accelerator is β = 1−α.
Shadow tensors are assigned to one accelerator and non-shadow tensors are assigned to the other accelerator.
⊕ denotes element-wise addition of two tensors.

Figure 2.1: Three basic tensor partitioning types.
Type-I: Partitioning B Dimension
In Type-I, we partition the B dimension in the three tensor computing phases, as shown in
Figure 2.1(a).
In the forward phase, Fl+1 = Fl × Wl. The element Fl+1[b, qo] in Fl+1 can be computed as

    Fl+1[b, qo] = Σ_{qi ∈ {1,...,Di,l}} Fl[b, qi] × Wl[qi, qo],    (2.1)

where b ∈ {1, 2, ..., B} and qo ∈ {1, 2, ..., Do,l}. We assume the ratio of computation
assigned to one accelerator is α, 0 ≤ α ≤ 1, and the ratio for the other is β = 1 − α.
The set {1, 2, ..., B} is partitioned into two subsets, {1, 2, ..., αB} and {αB+1, αB+2, ..., B}.
Specifically, Fl[1 : αB, :] is assigned to one accelerator and Fl[αB+1 : B, :] is assigned
to the other. The two accelerators process disjoint subsets of the batch, and perform the
computation indexed by the corresponding b of the two subsets.
As discussed before, to ensure the validity of the matrix multiplication and to obtain the
complete result Fl+1, Wl is replicated on the two accelerators. After the matrix multiplication,
each accelerator produces a portion of the results based on the same partition in the B dimen-
sion. Specifically, Fl+1[1 : αB, :] is produced by one accelerator and Fl+1[αB+1 : B, :] is
produced by the other.
In the backward phase, an element El[b, qi] in El can be computed as

    El[b, qi] = Σ_{qo ∈ {1,...,Do,l}} El+1[b, qo] × Wl⊤[qo, qi].    (2.2)

Due to the constraint that Fl and El (also Fl+1 and El+1) use the same partition, one
accelerator keeps El+1[1 : αB, :] and produces El[1 : αB, :] with the replicated Wl⊤. The other
accelerator handles the other portions: El+1[αB+1 : B, :] and El[αB+1 : B, :].
The common pattern in the forward and backward phases is that, after replicating Wl,
the accelerators can perform computation locally and produce disjoint parts of the result matrix,
which can be combined to obtain the complete result. However, this pattern does not exist in
the gradient phase, as the two accelerators are not able to complete the computation individually.
In the gradient phase, an element ∆Wl[qi, qo] in ∆Wl can be computed as

    ∆Wl[qi, qo] = Σ_{b ∈ {1,...,B}} Fl⊤[qi, b] × El+1[b, qo].    (2.3)

Based on the same partition of the B dimension in Fl⊤ and El+1, the two accelerators can
perform local matrix multiplications. Each of them produces a result matrix of shape
(Di,l, Do,l), the same as ∆Wl. To obtain the final results in Equation (2.3), element-wise
additions need to be performed to combine the partial results:

    ∆Wl[qi, qo] = Σ_{b ∈ {1,...,αB}} Fl⊤[qi, b] × El+1[b, qo]
                + Σ_{b ∈ {αB+1,...,B}} Fl⊤[qi, b] × El+1[b, qo].    (2.4)

The computation pattern in the gradient phase implies that communication is needed to
obtain the final result ∆Wl, because one of the accelerators needs to perform the partial-sum
accumulation. We call this intra-layer communication. We can see that such communication
happens in different phases for different partition types.
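The Type-I gradient phase can be sketched in NumPy (a toy two-accelerator model; the sizes and the ratio α are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
B, D_i, D_o = 8, 5, 3          # mini-batch and layer sizes (made up)
alpha = 0.25                   # ratio of accelerator i; beta = 1 - alpha
split = int(alpha * B)

F_l  = rng.standard_normal((B, D_i))   # F_l, partitioned along B
E_l1 = rng.standard_normal((B, D_o))   # E_{l+1}, partitioned the same way

# Each accelerator multiplies only its slice of the batch and accumulates
# locally, producing a partial sum of shape (D_i, D_o)...
dW_partial_i = F_l[:split].T @ E_l1[:split]    # accelerator i (ratio alpha)
dW_partial_j = F_l[split:].T @ E_l1[split:]    # accelerator j (ratio beta)

# ...and one element-wise addition combines the partial sums (Equation (2.4)).
dW = dW_partial_i + dW_partial_j
```

Only the (Di,l, Do,l)-shaped partial sum crosses accelerators, which is why the intra-layer cost below is A(Wl)/bi regardless of α.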
Type-II: Partitioning Di,l Dimension
The partition in the Di,l dimension is shown in Figure 2.1(b). To perform the matrix multiplication,
the Di,l dimension of Fl is partitioned in the same way. Based on this, in the forward phase, each
accelerator computes a result matrix of shape (B, Do,l). Similar to the case of ∆Wl in
Type-I, computing the complete Fl+1 requires element-wise additions in this partitioning:

    Fl+1[b, qo] = Σ_{qi ∈ {1,...,αDi,l}} Fl[b, qi] × Wl[qi, qo]
                + Σ_{qi ∈ {αDi,l+1,...,Di,l}} Fl[b, qi] × Wl[qi, qo].    (2.5)

Since Fl+1 and El+1 follow the same partition, in the backward phase El+1 is replicated on
the two accelerators. This allows each of the accelerators to produce a disjoint part of the result
El. The partition and replication in the gradient phase are similar. A key difference
between Type-I and Type-II is that the intra-layer communication occurs in the forward phase,
instead of the gradient phase.
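A Type-II sketch in NumPy (again a toy two-accelerator model with illustrative sizes) shows the partial-sum combination moving to the forward phase:

```python
import numpy as np

rng = np.random.default_rng(2)
B, D_i, D_o = 4, 6, 3
alpha = 0.5
split = int(alpha * D_i)

F_l = rng.standard_normal((B, D_i))
W_l = rng.standard_normal((D_i, D_o))

# Each accelerator holds a D_i-slice: columns of F_l and rows of W_l.
partial_i = F_l[:, :split] @ W_l[:split, :]    # shape (B, D_o)
partial_j = F_l[:, split:] @ W_l[split:, :]    # shape (B, D_o)

# Intra-layer communication now falls in the forward phase: the partial
# sums must be combined element-wise (Equation (2.5)).
F_l1_pre = partial_i + partial_j               # pre-activation F_{l+1}
```

Here the (B, Do,l)-shaped partial sum is the tensor that crosses accelerators, matching the A(Fl+1)/bi entry in the cost table below.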
Type-III: Partitioning Do,l Dimension
The partitioning of the Do,l dimension is shown in Figure 2.1(c). Fl needs to be replicated to
compute the complete Fl+1 in the forward phase. This is the case overlooked by all previous
solutions. Essentially, it means that the input feature maps of the same batch are replicated
on the two accelerators, instead of partitioning B. This may seem counterintuitive, since the
accelerators process the same data. However, we show that it is an important partition in the
design space, presenting the same kind of communication trade-off as Type-I and Type-II.
Similar to the gradient phase of Type-I and the forward phase of Type-II, the backward phase
of Type-III requires element-wise addition of partial results:

    El[b, qi] = Σ_{qo ∈ {1,...,αDo,l}} El+1[b, qo] × Wl⊤[qo, qi]
              + Σ_{qo ∈ {αDo,l+1,...,Do,l}} El+1[b, qo] × Wl⊤[qo, qi].    (2.6)
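Completing the trio, a Type-III sketch (same toy model, illustrative sizes) places the partial-sum combination in the backward phase:

```python
import numpy as np

rng = np.random.default_rng(3)
B, D_i, D_o = 4, 5, 6
alpha = 0.5
split = int(alpha * D_o)

W_l  = rng.standard_normal((D_i, D_o))
E_l1 = rng.standard_normal((B, D_o))

# Each accelerator holds a D_o-slice: columns of E_{l+1} and columns of
# W_l (i.e., rows of W_l^T).
partial_i = E_l1[:, :split] @ W_l.T[:split, :]   # shape (B, D_i)
partial_j = E_l1[:, split:] @ W_l.T[split:, :]   # shape (B, D_i)

# For Type-III the element-wise combination falls in the backward phase
# (Equation (2.6)); the multiplication by f'(F_l) would follow, in place.
E_l_partialsum = partial_i + partial_j
```

Each type thus pays its one intra-layer combination in a different phase — gradient for Type-I, forward for Type-II, backward for Type-III.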
2.2.3 Extension to CONV
In the previous sections, we used the matrix-matrix multiplications of FC layers to illustrate the three
types of partitions, which can be conceptually visualized in Figure 2.1. For CONV layers, the three
partitioning types are still valid. However, Fl[b, qi], Fl+1[b, qo], El[b, qi] and El+1[b, qo]
are no longer scalars but 2-dimensional matrices. Therefore, Fl, Fl+1, El and El+1 are
4-dimensional tensors, i.e., (batch, channel) × (width, height). Similarly, ∆Wl[qi, qo] and
Wl[qi, qo] are also 2-dimensional matrices rather than scalars. Thus, ∆Wl and Wl are 4-
dimensional tensors, i.e., (input channel, output channel) × (kernel width, kernel height). The
multiplications (×) in Equations (2.1)–(2.6) then become convolutions
(⊗). The additional dimensions (4D vs. 2D) and more complex operations (× vs. ⊗) only
imply a different amount of computation; they do not affect the partition types based on the existing
dimensions (B, Di,l, and Do,l).
2.2.4 Completeness
Table 2.3: Rotational symmetry of the three tensor multiplications.

Multiplication        LHS Shape      RHS Shapes                 Partition Dim   Psum Shape     Basic Type
Fl+1 = Fl × Wl        (B, Do,l)      (B, Di,l), (Di,l, Do,l)    Di,l            (B, Do,l)      Type-II
El = El+1 × Wl⊤       (B, Di,l)      (B, Do,l), (Di,l, Do,l)    Do,l            (B, Di,l)      Type-III
∆Wl = Fl⊤ × El+1      (Di,l, Do,l)   (B, Di,l), (B, Do,l)       B               (Di,l, Do,l)   Type-I
In the three phases, only three dimensions appear, and we have shown that only one
dimension can be partitioned at a time. Thus, the three types we derived constitute the
complete partition space. Table 2.3 summarizes the key features of the partitions. The LHS
Shape and RHS Shapes respectively indicate the shapes of the matrices on the left-hand and
right-hand sides of the equation for each phase. The Psum Shape is the shape of the matrices
containing partial results on the two accelerators that need to be combined using element-wise
additions; this occurs when the matrix of that shape appears on the LHS. It is also the shape of the
matrix that needs to be replicated when it appears on the RHS. From Table 2.3, we can observe
a rotational symmetry in each column.
2.2.5 Problems of “One Weird Trick” & HyPar
Two solutions were recently proposed to address the same problem that is addressed by this
work: communication and parallelism between accelerators. However, neither of these
two solutions is complete.
Krizhevsky [1] proposed "one weird trick" (OWT): configure CONV layers with data
parallelism and FC layers with model parallelism to obtain higher performance. This is certainly
better than just using data parallelism for all layers, but it provides no insight
into why the trick works or whether it is the best we can do. Therefore, this solution is
fundamentally empirical.
HyPar [2] is a more recent and principled approach that optimizes the parallelism configura-
tions, also by partitioning the layers based on the intra-layer and inter-layer communication.
However, it only considers the same two basic partitions as OWT: data parallelism and
model parallelism, which correspond to Type-I and Type-II in Figure 2.1, respectively.
Therefore, the parallelism setting in HyPar is not complete. Even though it explores the
partition space with a more systematic approach, it cannot find the optimal solution when
starting from an incomplete set of basic partition types. Specifically, HyPar misses one intra-layer
communication pattern (Type-III) and five inter-layer communication patterns (see more details in
Section 2.3.1). Moreover, HyPar always partitions the tensors equally, so it cannot capture
the performance heterogeneity among accelerators.
2.3 Cost Model
To search for the optimal partition of the layers, we propose a cost model for multiple accelerators.
We consider the computation on individual accelerators and the communication between
accelerators as the two major factors affecting DNN training performance. Compared to HYPAR [2],
which uses communication cost as a proxy for performance, the cost model of ACCPAR
takes both the communication cost Ecm and the computation cost Ecp into consideration. The
optimization target is to minimize the overall cost.
2.3.1 Communication Cost Model
Assuming the network bandwidth of accelerator i is bi, and T is the tensor that needs
to be transferred from one accelerator to the other, we define the communication cost Ecm for the
tensor transfer as

    Ecm = A(T) / bi.    (2.7)

The tensor size A(T) is defined as the product of the lengths of all dimensions. For
example, the size of a 4-by-5 matrix is 20, and the size of a kernel whose input channel
is 16, kernel window width is 3, kernel window length is 3 and output channel is 32, is
4,608 = 16 × 3 × 3 × 32. Next, we will determine what the remotely accessed tensor T is.
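The definitions above can be transcribed directly into a few lines of Python (Equation (2.7) and the two size examples; the helper names are ours, not part of ACCPAR):

```python
from math import prod

def tensor_size(shape):
    """A(T): product of the lengths of all dimensions."""
    return prod(shape)

def comm_cost(shape, bandwidth):
    """E_cm = A(T) / b_i (Equation (2.7))."""
    return tensor_size(shape) / bandwidth

# The two examples from the text:
size_matrix = tensor_size((4, 5))           # 4-by-5 matrix
size_kernel = tensor_size((16, 3, 3, 32))   # 16 in-ch, 3x3 window, 32 out-ch
```

A heterogeneous array is then modeled simply by giving each accelerator its own `bandwidth` when evaluating `comm_cost`.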
Intra-layer Communication Cost
As discussed in Section 2.2, for each of the three basic tensor partitioning types, there is
one and only one computation phase that requires remote accesses.
In Type-I, the gradient phase requires remote accesses (Equation (2.4)). For the ac-
celerator whose partitioning ratio is α, for each b ∈ {1, ..., αB}, the intermediate tensor
size is A(Fl⊤[:, b] × El+1[b, :]) = Di,l · Do,l = A(∆Wl) = A(Wl). Those intermediate ten-
sors (∀b ∈ {1, ..., αB}) are accumulated locally (Σ_{b ∈ {1,...,αB}}(·)) by accelerator i
to reduce the remote accesses by the other accelerator j. Likewise, accelerator j performs the
local accumulation Σ_{b ∈ {αB+1,...,B}}(·). Thus, the size of the tensor remotely accessed by
accelerator i from accelerator j is A(Wl) rather than (1−α) · B · A(Wl). With Equation
(2.7), the intra-layer communication cost for accelerator i to remotely access the
partial-sum tensor in accelerator j is A(Wl)/bi.
For Type-II and Type-III, the intra-layer communication costs can be derived following the
same idea. We list the intra-layer communication costs of the three basic tensor partitioning
types in Table 2.4.

Table 2.4: Intra-layer communication cost of the three basic tensor partitioning types.

Basic Type   Intra-layer Communication Cost
Type-I       A(Wl)/bi
Type-II      A(Fl+1)/bi
Type-III     A(El)/bi

Note that the intra-layer communication cost does not depend on the partitioning ratio α, because
intermediate results are accumulated locally and only the partial-sum tensors are accessed remotely.
Inter-layer Communication Cost
Table 2.5: Inter-layer communication cost between the three basic tensor partitioning types.

Layer l \ Layer l+1   Type-I                            Type-II                           Type-III
Type-I                0                                 (αβA(Fl+1) + αβA(El+1))/bi        βA(Fl+1)/bi
Type-II               βA(El+1)/bi                       βA(El+1)/bi                       0
Type-III              (αβA(Fl+1) + αβA(El+1))/bi        0                                 βA(Fl+1)/bi
Since each layer is assigned a basic tensor partitioning type, when switching from one
layer to the next, an accelerator may require remote accesses: this is the inter-layer
communication. Two tensor conversions may require remote accesses, i.e., (1) the
conversion of the output feature map tensor Fl+1 of layer l to the input feature map tensor
Fl+1 of layer l+1, and (2) the conversion of the output error tensor El+1 of layer l+1 to the
input error tensor El+1 of layer l. As we have three basic tensor partitioning types, there
are nine inter-layer communication patterns between the basic types, as shown in Figure 2.2.
Tensors in layer l are colored green and tensors in layer l+1 are colored blue.

[Figure 2.2 here: the nine inter-layer communication patterns for Fl+1 and El+1: (a) Type-I
to Type-I, (b) Type-I to Type-II, (c) Type-I to Type-III, (d) Type-II to Type-I, (e) Type-II to
Type-II, (f) Type-II to Type-III, (g) Type-III to Type-I, (h) Type-III to Type-II, and (i)
Type-III to Type-III.]

Shadow tensors are held by one accelerator (whose partitioning ratio is α) and non-shadow tensors are held
by the other accelerator (whose partitioning ratio is β).

Figure 2.2: Inter-layer communication patterns between the three basic tensor partitioning types.
(a) Type-I to Type-I, (f) Type-II to Type-III and (h) Type-III to Type-II.
In these three patterns, since the (green) tensors in layer l and the (blue) tensors in layer l+1 have the same partitioning, there is no conversion, and the inter-layer communication cost is 0.
(c) Type-I to Type-III, (d) Type-II to Type-I, (e) Type-II to Type-II and (i) Type-III to
Type-III.
We take Figure 2.2(c) as an example. In the tensor conversion from Type-I to Type-III, in the forward phase, the accelerator i (whose partitioning ratio is α) holds the green shadow tensor Fl+1 in layer l, but in the next layer l+1, the accelerator needs the whole blue shadow tensor Fl+1. The difference is the black part, and the black tensor incurs remote accessing to the other accelerator j (whose partitioning ratio is β). Thus the inter-layer communication amount is (βB) × Do,l = βA(Fl+1), and the inter-layer communication cost for accelerator i to remotely access the black tensor in accelerator j is βA(Fl+1)/bi. Conversely, the inter-layer communication cost for the accelerator j is αA(Fl+1)/bj in this case. Note that the inter-layer communication costs for (c) Type-I to Type-III, (d) Type-II to Type-I, (e) Type-II to Type-II and (i) Type-III to Type-III are the same, but the shapes of the conversion tensors are not.
(b) Type-I to Type-II and (g) Type-III to Type-I.
We take Figure 2.2(b) as an example. In the tensor conversion from Type-I to Type-II, in the forward phase, the accelerator i (whose partitioning ratio is α) holds the green shadow tensor Fl+1 of shape (αB, Do,l) in layer l, but in the next layer l+1, the accelerator needs the whole blue shadow tensor Fl+1 of shape (B, αDo,l). The difference is the black part, and again the black tensor incurs remote accessing to the other accelerator j (whose partitioning ratio is β). Thus the inter-layer communication amount is (βB) × αDo,l = αβA(Fl+1), and the inter-layer communication cost for accelerator i to remotely access the black tensor in accelerator j is αβA(Fl+1)/bi. Conversely, the inter-layer communication cost for the accelerator j is (1−α)(1−β)A(Fl+1)/bj = βαA(Fl+1)/bj in this case. Note that the inter-layer communication costs for (b) Type-I to Type-II and (g) Type-III to Type-I are the same, but the shapes of the conversion tensors are not.
We list the inter-layer communication cost for the nine patterns of the tensor conversion
between the three basic partitioning types in Table 2.5.
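The nine entries of Table 2.5 can be read off programmatically. The following sketch (hypothetical helper names; it assumes β = 1 − α, with A(Fl+1) and A(El+1) given as element counts) returns the cost paid by accelerator i for each transition:

```python
def inter_layer_cost(t_l, t_next, alpha, A_F, A_E, b_i):
    """Inter-layer communication cost (Table 2.5) paid by accelerator i
    when layer l uses type t_l and layer l+1 uses type t_next.

    alpha is accelerator i's partitioning ratio; beta = 1 - alpha.
    A_F = A(F_{l+1}), A_E = A(E_{l+1}); b_i is the effective bandwidth.
    """
    beta = 1.0 - alpha
    table = {
        ("Type-I", "Type-I"): 0.0,
        ("Type-I", "Type-II"): alpha * beta * (A_F + A_E),
        ("Type-I", "Type-III"): beta * A_F,
        ("Type-II", "Type-I"): beta * A_E,
        ("Type-II", "Type-II"): beta * A_E,
        ("Type-II", "Type-III"): 0.0,
        ("Type-III", "Type-I"): alpha * beta * (A_F + A_E),
        ("Type-III", "Type-II"): 0.0,
        ("Type-III", "Type-III"): beta * A_F,
    }
    return table[(t_l, t_next)] / b_i
```

The zero entries correspond to patterns (a), (f) and (h), where the two layers share the same partitioning.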
2.3.2 Computation Cost Model
We assume the tensor computation density of an accelerator i is ci and the amount of floating point operations to perform the multiplication of two tensors T1 × T2 is C(T1 × T2). For an accelerator with a partitioning ratio α, the effective amount of floating point operations performed is α · C(T1 × T2). We can define the computation cost Ecp for an accelerator i to perform the computation as

    Ecp = α · C(T1 × T2) / ci.    (2.8)
Table 2.6: The amount of floating point operations (FLOP) in the three multiplications.

    Multiplication          # FLOP
    Fl+1 = Fl × Wl          A(Fl+1) · (Di,l + Di,l − 1)
    El = El+1 × Wl⊤         A(El) · (Do,l + Do,l − 1)
    ∇Wl = Fl⊤ × El+1        A(Wl) · (B + B − 1)

The most important step in obtaining the computation cost is to calculate the number of floating point operations (FLOP) of a tensor multiplication. In the forward phase, to get the output tensor Fl+1, the number of FLOP is (B · Do,l) · (Di,l + Di,l − 1) = A(Fl+1) · (Di,l + Di,l − 1). The numbers of FLOP for the three multiplications are listed in Table 2.6.
2.3.3 Discussion on Convolutions
We can easily extend the communication and computation cost models from fully-connected layers to convolutional layers. In convolutions, Fl, Fl+1, El and El+1 are 4-dimensional tensors, i.e., (batch, channel, height, width). We can view the four-dimensional tensors as three-dimensional tensors in which the third and fourth dimensions form a meta dimension, i.e., (batch, channel, [height, width]). The kernel Wl is also a 4-dimensional tensor, i.e., (input channel, output channel, kernel height, kernel width), and we can likewise view it as a three-dimensional tensor whose second dimension is a meta dimension, i.e., (input channel, [kernel height, kernel width], output channel). Thus, the communication costs listed in Tables 2.4 and 2.5 retain the same form.
In a matrix-matrix multiplication MC = MA × MB, assume the shapes of MC, MA and MB are (MC, NC), (MC, P) and (P, NC), respectively. The idea for calculating the number of floating point operations is to multiply the number of output elements by the number of floating point operations per element. In the matrix-matrix multiplication, the number of output elements is MC × NC = A(MC). For each output element, the number of multiplications performed is P and the number of additions performed is P − 1, so the total number of FLOP is A(MC) · (P + P − 1). To find the number of FLOP for a convolutional layer, we only need to find the number of FLOP for the convolution producing one element of the output tensor, because the number of elements of a tensor Tout is always A(Tout) regardless of its dimensionality. The number of multiplications performed is (input channel) × (kernel height) × (kernel width), and the number of additions performed is ((input channel) × (kernel height) × (kernel width) − 1). Note that (input channel) is Di,l, Do,l or B in the three multiplications, respectively, and (kernel height) × (kernel width) is actually the 2D feature map or kernel size, i.e., the size of the feature map or kernel excluding the input and output channels. So for convolutional layers, the number of floating point operations is the corresponding entry in Table 2.6 multiplied by the 2D feature map or kernel size.
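The FLOP rules above can be sketched as follows (hypothetical helper names; the convolution count follows the per-output-element argument, with (input channel) × (kernel height) × (kernel width) multiplications and one fewer addition per output element):

```python
def matmul_flops(M, N, P):
    """FLOP count of an (M, P) x (P, N) matrix multiplication:
    M*N output elements, each needing P multiplications and P - 1 additions."""
    return M * N * (P + P - 1)

def conv_flops(B, C_in, C_out, H_out, W_out, K_h, K_w):
    """FLOP count of a convolutional layer's forward pass: each of the
    B*C_out*H_out*W_out output elements needs C_in*K_h*K_w multiplications
    and C_in*K_h*K_w - 1 additions."""
    per_elem = C_in * K_h * K_w
    return B * C_out * H_out * W_out * (per_elem + per_elem - 1)

# Forward pass F_{l+1} = F_l x W_l of a fully-connected layer,
# with B = 512, D_out = 1024, D_in = 4096:
print(matmul_flops(512, 1024, 4096))
```

The first function corresponds row-by-row to Table 2.6; the second shows the convolutional extension discussed above.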
2.4 Partitioning Algorithm
In this section, we explain the ACCPAR partitioning method. Like the recent work HyPar [2], we determine the partitioning for each layer in a DNN model by a layer-wise dynamic partitioning scheme. However, ACCPAR is much more general for three reasons: 1) the algorithm considers the complete search space discussed in Section 2.4; 2) it can be parameterized with an arbitrary partitioning ratio based on heterogeneous compute capability, communication cost and effective bandwidth between accelerator groups; 3) it can handle multiple paths in DNNs. As a result, we will see in Section 2.5 that ACCPAR achieves considerable speedups over HYPAR.
2.4.1 Layer-wise Partitioning
Figure 2.3: Layer-wise partitioning is determined by dynamic programming to minimize the computation cost and the communication cost.

Figure 2.4: Dynamic programming on multi-paths. (a) Layers Li and Li+1 are connected by path P1 and path P2. (b) Partitioning for (Li, t = Type-I) to (Li+1, t = Type-I), (c) partitioning for (Li, t = Type-II) to (Li+1, t = Type-I), and (d) partitioning for (Li, t = Type-III) to (Li+1, t = Type-I).
To find the best partitioning for each layer in a DNN to minimize communication and improve performance, an intuitive approach is to enumerate all possible configurations by brute force. Unfortunately, this results in O(3^N) complexity for a DNN with N layers — not a practical solution. Following the dynamic programming approaches [2, 177, 179], we reduce the search complexity to O(N).

Figure 2.3 illustrates the layer-wise partitioning procedure. For each layer, we determine the minimum cost based on the three basic partitioning types from the first layer up to the current layer. We denote the accumulative cost up to layer Li when it is in state (Li, t) as c(Li, t) — layer Li chooses a basic partitioning type t ∈ T = {Type-I, Type-II, Type-III}. Based on the cost model in Section 2.3, the accumulative cost given the partition choice t of the current layer Li+1, c(Li+1, t), can be calculated inductively from the cost of the previous layer Li, c(Li, tt):

    c(Li+1, t) = min_{tt ∈ T} { c(Li, tt) + Ecp(t) + Ecm(tt, t) }.    (2.9)

Here, Ecp(t) is the computation cost of the current layer Li+1 for a type t, and Ecm(tt, t) is the sum of the intra-layer communication cost for a type t and the inter-layer communication cost when transitioning from state (Li, tt) to state (Li+1, t). For each basic partitioning type, during the algorithm execution we record the path to a previous layer for backtracking after going through all layers, shown as the black arrows in Figure 2.3. In this manner, after we compute the accumulative cost of the last layer, we obtain the cost states of the last layer and all previous layers by backtracking. The algorithm starts by initializing c(L0, t) to 0 for the three basic partitioning types of layer L0.
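A compact sketch of this dynamic program follows (hypothetical function names; comp_cost(i, t) and comm_cost(i, tt, t) stand in for the cost models of Section 2.3):

```python
TYPES = ("Type-I", "Type-II", "Type-III")

def partition(num_layers, comp_cost, comm_cost):
    """Layer-wise dynamic program of Equation (2.9).

    Returns the minimum total cost and one optimal type per layer.
    comp_cost(i, t): computation cost of layer i with type t.
    comm_cost(i, tt, t): intra-layer cost of type t plus inter-layer
    cost of the transition from (L_{i-1}, tt) to (L_i, t).
    """
    c = {t: 0.0 for t in TYPES}   # c(L0, t) initialized to 0
    back = []                     # backtracking pointers (black arrows)
    for i in range(1, num_layers):
        new_c, ptr = {}, {}
        for t in TYPES:
            # Equation (2.9): minimize over the previous layer's type tt.
            tt = min(TYPES, key=lambda tt: c[tt] + comm_cost(i, tt, t))
            new_c[t] = c[tt] + comp_cost(i, t) + comm_cost(i, tt, t)
            ptr[t] = tt
        c, back = new_c, back + [ptr]
    # Backtrack from the cheapest final state.
    t = min(TYPES, key=lambda t: c[t])
    best, choice = c[t], [t]
    for ptr in reversed(back):
        t = ptr[t]
        choice.append(t)
    return best, list(reversed(choice))
```

Each layer examines 3 × 3 transitions, so the total work is linear in the number of layers, matching the O(N) claim above.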
We employ a hierarchical (recursive) partitioning for multiple hierarchies, similar to [2, 76, 179]. The idea is to apply the layer-wise partitioning recursively on a partitioned hierarchy to partition an accelerator array.
2.4.2 Handling Multiple Paths
Different from HYPAR, ACCPAR is able to determine the partition for the multi-path topologies that are common in Resnet [120]. Figure 2.4 shows an example where there are two paths between layers Li and Li+1. Path P1 consists of one weighted layer and path P2 consists of two weighted layers. The key idea for multi-path partitioning is to (1) enumerate the partition state of layer Li+1, (Li+1, t), t ∈ T, (2) enumerate the partition state of layer Li, (Li, tt), tt ∈ T, (3) perform individual layer-wise partitioning for each path between the two states (Li, tt) and (Li+1, t) for each combination, and (4) determine the lowest cost for the state (Li+1, t).
We need to determine the lowest cost for each state, i.e., enumerate the three colored circles at layer Li+1 in Figure 2.4. For example, (Li+1, t = Type-I) is one of the three possible states, shown as the yellow circle at layer Li+1 in Figure 2.4. We then enumerate the three colored circles at layer Li. Figure 2.4(b) starts from (Li, t = Type-I): we search the three paths in P1 and the three paths in P2 between the two states (Li, t = Type-I) and (Li+1, t = Type-I). We then select the lowest-cost path in P1 and the lowest-cost path in P2 to (Li+1, t = Type-I), and combine them as the lowest-cost path from (Li, t = Type-I) to (Li+1, t = Type-I). Similarly, we compute the lowest-cost path from (Li, t = Type-II) to (Li+1, t = Type-I) as shown in Figure 2.4(c), and the lowest-cost path from (Li, t = Type-III) to (Li+1, t = Type-I) as shown in Figure 2.4(d). With these three lowest-cost paths, we can finally determine the lowest cost to reach state (Li+1, t = Type-I) and record the lowest-cost path from layer Li to state (Li+1, t = Type-I). Following a similar procedure, we can determine the lowest-cost paths to states (Li+1, t = Type-II) and (Li+1, t = Type-III). Since the optimal states for layer Li+1 have been searched, the optimal states of the last layer of the DNN model can be obtained.
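Steps (1) to (4) can be sketched as follows (a simplified illustration with a hypothetical path_cost(p, tt, t) helper that returns the cost of the best layer-wise partitioning of path p between the fixed end states (Li, tt) and (Li+1, t)):

```python
TYPES = ("Type-I", "Type-II", "Type-III")

def multipath_cost(c_Li, paths, path_cost):
    """Lowest accumulative cost for every state (L_{i+1}, t), given the
    accumulative costs c_Li[tt] of the states (L_i, tt) and a list of
    parallel paths between the two layers."""
    c_next = {}
    for t in TYPES:                      # step (1): enumerate (L_{i+1}, t)
        c_next[t] = min(                 # step (4): keep the lowest cost
            # steps (2) and (3): enumerate (L_i, tt); with both end states
            # fixed, each path is optimized independently and summed.
            c_Li[tt] + sum(path_cost(p, tt, t) for p in paths)
            for tt in TYPES
        )
    return c_next
```

The key point is that once both end states are fixed, the paths decouple, so each one can reuse the single-path dynamic program.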
2.4.3 Partitioning Ratio
ACCPAR allows the partitioning ratio to be adjusted for heterogeneous accelerators to balance the communication and computation costs of the individual accelerators. For an accelerator with a partitioning ratio α, the computation and communication costs are both functions of α and a partitioning pi,l, i.e., Ecp(α, pi,l) and Ecm(α, pi,l). To calculate the partitioning ratio that achieves the best performance, we need to find the ratio that balances the sum of computation cost and communication cost between the two accelerator groups. From Equations (2.7) and (2.8) and Tables 2.4 and 2.5, we can see that the computation and communication costs are both linear with respect to the partitioning ratio: Ecp(α, pi,l) = Ecp(pi,l)|α and Ecm(α, pi,l) = Ecm(pi,l)|α. For the accelerator with a partitioning ratio β and a partitioning pj,l, we similarly get Ecp(pj,l)|β and Ecm(pj,l)|β. To determine the partitioning ratio, we just need to solve the linear equation

    Ecp(pi,l)|α + Ecm(pi,l)|α = Ecp(pj,l)|β + Ecm(pj,l)|β.    (2.10)
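Because both sides are linear in the ratio, Equation (2.10) has a closed-form solution. A minimal sketch (assuming β = 1 − α and collapsing each accelerator's per-unit-ratio computation-plus-communication cost into a single coefficient):

```python
def partition_ratio(k_i, k_j):
    """Solve k_i * alpha = k_j * beta with beta = 1 - alpha, where k_i (k_j)
    is the per-unit-ratio total cost of accelerator i (j).

    A slower accelerator (larger k) receives a smaller share of the work."""
    alpha = k_j / (k_i + k_j)
    return alpha, 1.0 - alpha

# Accelerator i costs 3.0 per unit of ratio, accelerator j costs 7.0,
# so i takes the larger share: alpha = 0.7, beta = 0.3.
alpha, beta = partition_ratio(3.0, 7.0)
```

At this ratio both accelerators finish at the same time, eliminating the idle time that equal partitioning would cause on heterogeneous hardware.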
2.5 Evaluation
2.5.1 Evaluation Setup
We use nine DNNs to evaluate ACCPAR: Lenet [232], Alexnet [119], Vgg11, Vgg13, Vgg16, Vgg19 [181], and Resnet18, Resnet34, Resnet50 [120]. We train Lenet on the MNIST [233] dataset, and the other eight DNN models are trained on ImageNet [234].
Table 2.7: The specifications of the accelerators.

                        TPU-v2      TPU-v3
    Cores               4×2         4×2
    FLOPS               180T        420T
    HBM Memory          64GB        128GB
    Memory Bandwidth    2400GB/s    4800GB/s
    # Accelerators      128         128
We build an in-house simulator to model the performance of the accelerator arrays of the tensor processing units TPU-v2 and TPU-v3. In the simulation, we derive the tensor accessing traces (loading and storing) and the partial sum computation traces (MULT and ADD), and then calculate the time consumed by the computation and data accessing. The trace granularity for FC layers is element-wise (i.e., 1) and for CONV layers is kernel-wise (e.g., 3×3). While TPU-v1 is designed for DNN inference [33], TPU-v2 and TPU-v3 target DNN training. Table 2.7 lists the specifications of these two accelerators. An accelerator is a board that holds the processing units — 2 cores per chip and 4 chips per board. The peak floating point operations per second (FLOPS) are 180T for TPU-v2 and 420T for TPU-v3, and the memories of the two accelerators are 64GB and 128GB of high bandwidth memory (HBM), respectively [231]. The memory bandwidth of TPU-v2 is 2400GB/s. Since the memory bandwidth of TPU-v3 is not publicly available, we assume a 4800GB/s memory bandwidth for TPU-v3. For the network, the maximum data rate per core is 2Gb/s [235]. We set the network data rate for TPU-v2 as 8Gb/s and that for TPU-v3 as 16Gb/s. The numbers of accelerators for TPU-v2 and TPU-v3 are both 128. The data format used in the DNN training is bfloat16, Google's 16-bit floating point format for training. We set the mini-batch size to 512.
To evaluate the effectiveness of ACCPAR, we compare it against data parallelism (DP) [108], "One Weird Trick" (OWT) [1] and HYPAR [2]. In DP, each accelerator maintains a local copy of the DNN model, while training samples are partitioned among the accelerators. In OWT, the CONV layers in a DNN model are configured for data parallelism, and the FC layers are configured for model parallelism. In HYPAR, both CONV and FC layers can be configured for data parallelism or model parallelism to minimize the overall communication. ACCPAR jointly considers data communication, operational computation and hardware heterogeneity for collaborative optimization. Here we use DP as the baseline, and the performance and training throughput of OWT, HYPAR and ACCPAR are all normalized to the DP design.
2.5.2 Heterogeneous Array
We first evaluate the performance of a heterogeneous accelerator array using the same number of accelerators of two types with different performance: 128 TPU-v2 accelerators and 128 TPU-v3 accelerators.
Figure 2.5: The speedup of data parallelism (DP), "one weird trick" (OWT) [1], HYPAR [2] and ACCPAR in a heterogeneous accelerator array.
The normalized performance of DP, OWT, HYPAR and ACCPAR is shown in Figure 2.5. The geometric means of speedup (throughput improvement) of DP, OWT, HYPAR and ACCPAR are 1.00×, 2.98×, 3.78× and 6.30×, respectively. For the Vgg series, ACCPAR achieves a speedup of up to 16.14×, while the highest speedups of OWT and HYPAR are 8.24× and 9.46×. For the Resnet series, ACCPAR achieves speedups from 1.92× to 2.20×, while the speedup ranges of OWT and HYPAR are 1.22× to 1.38× and 1.03× to 1.04×, respectively.

The speedups achieved by ACCPAR are significantly higher than those of OWT and HYPAR. In Figure 2.5, on the four Vgg series models, the speedups of OWT range from 5.33× to 8.25×, and the speedups of HYPAR range from 5.92× to 9.46×. However, the speedups of ACCPAR range from 9.75× to 16.14×, significantly higher than both OWT and HYPAR. The improvements of ACCPAR come from: (1) the complete partitioning type search space, which includes all three types of tensor partitioning to reduce communication; and (2) the joint optimization of communication and computation considering heterogeneity. With the communication and computation of different accelerators balanced by the partitioning ratio, the idle time due to heterogeneous communication/computation capability under the equal partitioning of OWT [1] and HYPAR [2] is greatly alleviated.
From Figure 2.5, we also notice that the speedups of the Resnet series are lower than those of the Vgg series. Between the two series, the main difference is that the model sizes of the Vgg series are larger than those of the Resnet series, while the computation densities of the Resnet series are higher than those of the Vgg series. Among the three basic tensor partitioning types, Type-II and Type-III partition the weights of layers, i.e., the model, while Type-I partitions the feature maps, i.e., the data. In data parallelism, all layers are configured as Type-I. Thus, for DNNs with a large model size, such as the Vgg series, Type-II and Type-III are favored due to the potentially greater reduction of communication when partitioning the model. On the other hand, for DNNs with higher computation density, Type-I is favored for the potentially greater reduction of communication when partitioning feature maps. Moreover, because the difference in model sizes among different layers is smaller than the difference in feature maps, the relative benefits achieved by Type-II and Type-III are smaller than those of Type-I. The above analysis explains why the speedups of the Resnet series are lower than those of the Vgg series: since DNNs in the former series mainly benefit from Type-II and Type-III — model partitioning — the trend holds not only for ACCPAR but also for HYPAR and OWT. However, even for the Resnet series, the speedups of ACCPAR are considerably higher than those of OWT and HYPAR: the highest speedups of OWT and HYPAR on the Resnet series are 1.38× and 1.04× respectively, but the highest speedup of ACCPAR is 2.20×. This means that ACCPAR is 59% better than OWT and 112% better than HYPAR.
2.5.3 Homogeneous Array
Figure 2.6: The speedup of data parallelism (DP), "one weird trick" (OWT) [1], HYPAR [2] and ACCPAR in a homogeneous accelerator array.
We then evaluate the performance of a homogeneous accelerator array, where the 128 accelerators employed are all of the same type, i.e., TPU-v3. The speedups of data parallelism (DP), "one weird trick" (OWT), HYPAR and ACCPAR are shown in Figure 2.6. The results are normalized to data parallelism. The geometric means of DP, OWT, HYPAR and ACCPAR are 1.00×, 2.94×, 3.51× and 3.86×, respectively. The DNNs of the Resnet series are very "deep" (i.e., the number of layers is very large), and all layers except the last fully-connected layer are convolutional layers. While OWT, HYPAR and ACCPAR all try to explore parallelism settings or partitions other than static data/model parallelism, we observe in the results that these three schemes eventually configure most layers of the Resnet series with data parallelism or the Type-I partition. This is why they achieve lower speedups on the Resnet series than on the Vgg series. In comparison, the DNNs of the Vgg series contain a variety of layers, so OWT, HYPAR and ACCPAR can effectively explore a larger search space with more diverse parallelism/partition settings. With the homogeneous accelerator array, for a specific DNN model, we observe increasing speedups from DP to OWT to HYPAR to ACCPAR, a direct consequence of their increasing flexibility.
Figure 2.7: The selected partitioning types (Type-I, Type-II or Type-III) for the weighted layers (cv1 to cv5, fc1 to fc3) in Alexnet. The number of hierarchies is 7 and the batch size is 128.
In Figure 2.7, we also show the partitioning types selected by ACCPAR for the weighted layers in Alexnet. In the three fully-connected layers, i.e., fc1, fc2 and fc3, Type-II and Type-III partitions are used to minimize communication by keeping a part of the weights locally and communicating the feature maps. In the convolutional layers, i.e., cv1 to cv5, the Type-I partition is mostly but not solely selected. ACCPAR allows all of Type-I, Type-II and Type-III to be selected to further reduce the total communication. In particular, as the hierarchy level increases, more layers are configured with Type-II or Type-III. This illustrates the importance of having a complete search space as in ACCPAR.
2.5.4 Scalability with Hierarchy Levels
In this section, we study the scalability of ACCPAR with various hierarchies on the heterogeneous accelerator array. Figure 2.8 shows the speedups of DP, OWT, HYPAR and ACCPAR on Vgg19 for hierarchy levels from h = 2 to h = 9. As the hierarchy level increases, we partition the tensors at a finer granularity.

Figure 2.8: The speedup of data parallelism (DP), "one weird trick" (OWT) [1], HYPAR [2] and ACCPAR under various partitioning hierarchies (h = 2 to h = 9) on Vgg19.

From Figure 2.8 we see that the speedups
of OWT and HYPAR tend to saturate, while the speedups of ACCPAR continue to increase with the number of partitioning hierarchies. For OWT, the convolutional layers are always configured with data parallelism and the fully-connected layers are always configured with model parallelism. While this provides more flexibility than configuring all layers with data parallelism as in DP, it is still a static configuration. HYPAR and ACCPAR are flexible schemes because they do not statically set a class of layers to a specific parallelism or partition; instead, they explore the possible configurations to minimize the communication or cost. Compared with HYPAR, ACCPAR benefits from a complete set of partitioning types and the heterogeneity-aware configuration. The flexibility of the four schemes from low to high is: DP ≺ OWT ≺ HYPAR ≺ ACCPAR. Table 2.8 shows the comparison of DP, OWT, HYPAR and ACCPAR.
Table 2.8: The comparison of flexibility of DP, OWT, HYPAR and ACCPAR.

                    DP        OWT       HYPAR     ACCPAR
    Configuration   Static    Static    Flexible  Flexible
    Flexibility     Low ────────────────────────────→ High
Chapter 3
Layer-wise Pipelined Parallelism for DeepLearning Accelerators
3.1 Background
3.1.1 Basics of Deep Neural Network
The core component of many deep learning applications (in computer vision [236–240], data mining [241–244], language processing [245–249], etc.) is the convolutional neural network (CNN).
Figure 3.1: Convolutional neural network (CNN), with convolution (CONV), pooling (POOL) and inner product (IP) layers.
Figure 3.1 illustrates the organization of an example CNN, which has three types of layers: convolution layers, pooling layers and inner product layers. In a convolution layer, a set of kernels is convolved with the data of the channels from the previous layer (layer l) to generate the data for the channels of the next layer (layer l+1). dl is a cube of data in a layer, and dl[x, y, c] is the value at a point in the three-dimensional data cube. We also denote the size of d in layer l as (Xl × Yl × Cl), so 0 ≤ x ≤ Xl − 1, 0 ≤ y ≤ Yl − 1, 0 ≤ c ≤ Cl − 1, and Cl is the number of channels. (xl, yl, cl) indicates a point in layer l's data cube. K is the kernel, composed of a set of weights, and Kl is the kernel used in the computation to generate data in layer l. A kernel represents four-dimensional data: the sizes of its dimensions are Kx, Ky, Cl and Cl+1, where Kx and Ky are determined by the algorithm (e.g., in LeNet [233], Kx and Ky are both 5).
dl+1 is computed as:

    dl+1[x, y, c] = ∑_{cl=0}^{Cl−1} ∑_{kx=0}^{Kx−1} ∑_{ky=0}^{Ky−1} Kl[kx, ky, cl, c] × dl[x + kx, y + ky, cl].    (3.1)

To perform Equation (3.1), in total (Xl+1 × Yl+1 × Cl+1 × Cl × Kx × Ky) multiplications and (Xl+1 × Yl+1 × Cl+1 × (Cl × Kx × Ky − 1)) additions are performed.
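A direct, naive implementation of Equation (3.1) can be sketched as follows (illustrative only, using nested Python lists indexed as dl[x][y][c] and Kl[kx][ky][cl][c]; names are hypothetical):

```python
def conv_forward(d_l, K_l, X_out, Y_out):
    """Naive convolution per Equation (3.1): each output point sums
    K_l[kx][ky][cl][c] * d_l[x+kx][y+ky][cl] over cl, kx and ky."""
    C_l = len(d_l[0][0])                  # input channels
    Kx, Ky = len(K_l), len(K_l[0])        # kernel height/width
    C_next = len(K_l[0][0][0])            # output channels
    out = [[[0.0] * C_next for _ in range(Y_out)] for _ in range(X_out)]
    for x in range(X_out):
        for y in range(Y_out):
            for c in range(C_next):
                acc = 0.0
                for cl in range(C_l):
                    for kx in range(Kx):
                        for ky in range(Ky):
                            acc += K_l[kx][ky][cl][c] * d_l[x + kx][y + ky][cl]
                out[x][y][c] = acc
    return out
```

The six nested loops make the operation counts stated above visible: Cl × Kx × Ky multiply-accumulates per output element, repeated for every (x, y, c).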
A pooling layer performs subsampling. Taking average pooling as an example, a window of data in layer l is averaged to get one data point in layer l+1 as follows:

    dl+1[x, y, c] = (1 / (Kx · Ky)) ∑_{kx=0}^{Kx−1} ∑_{ky=0}^{Ky−1} dl[Kx·x + kx, Ky·y + ky, c].    (3.2)

This average pooling operation performs (Xl+1 × Yl+1 × Cl+1 × (Kx × Ky − 1)) additions and (Xl+1 × Yl+1 × Cl+1) multiplications. The multiplication could be implemented as a shift operation if (Kx × Ky) is a power of 2. Max pooling is another variant, where the maximum value within a window in layer l is selected for layer l+1.
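Equation (3.2) with a non-overlapping Kx-by-Ky window can be sketched as (illustrative only; same list-of-lists data layout as above):

```python
def avg_pool(d_l, Kx, Ky):
    """Average pooling per Equation (3.2) with non-overlapping windows:
    one Kx*Ky-element sum and one division per output element."""
    X_out = len(d_l) // Kx
    Y_out = len(d_l[0]) // Ky
    C = len(d_l[0][0])
    out = [[[0.0] * C for _ in range(Y_out)] for _ in range(X_out)]
    for x in range(X_out):
        for y in range(Y_out):
            for c in range(C):
                s = 0.0
                for kx in range(Kx):
                    for ky in range(Ky):
                        s += d_l[Kx * x + kx][Ky * y + ky][c]
                out[x][y][c] = s / (Kx * Ky)
    return out
```

Replacing the inner sum with a running maximum (and dropping the division) yields the max pooling variant.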
In the inner product layer, the values in the data cubes of layers l and l+1 are treated as vectors (denoted d⃗l and d⃗l+1). If the previous layer is a convolution or pooling layer, the size of d⃗l is Xl × Yl × Cl. If the previous layer is also an inner product layer, then the size of d⃗l is the size of the output vector of layer l. d⃗l+1 is an n × 1 vector, where n is determined by the algorithm. Wl+1−l is a weight matrix of size (n × m), where m is the size of d⃗l, and b⃗ is a vector of biases.

The vector of layer l+1 is computed as:

    d⃗l+1 = Wl+1−l d⃗l + b⃗.    (3.3)
This inner product operation performs (n × (m − 1)) additions and (n × m) multiplications. The activation function is another important component. It is an element-wise operation and usually a nonlinear function, such as the sigmoid 1/(1 + e^{−x}) or the rectified linear unit (ReLU) max(0, x).
3.1.2 Data Forward and Backward in a Neural Network
Neural network has two phases: training (learning) phase and testing (inference) phase. In
testing phase, the weights of a neural network have been determined and the task is to use
the network on input samples, e.g. to recognize who is the person in an image. In testing
phase, input data flow through the layers consecutively in forward direction. It is shown in
Figure 3.2, dl−1 is the output of layer l−1 and the input of layer l. The forward process
can be represented as the two equations indicated above. The computation is the same as
what we discussed in Section 3.1.1.
Before a neural network for a specific application can be deployed, it needs to be trained to generate the weights. This is done in the training phase, where data move not only forward but also backward — to update the weights based on errors. In the training phase, a cost function is defined to quantitatively evaluate how well the outputs of the neural network compare to the standard labels. We use y and t to represent the output of the neural network and the standard label, respectively. The L2 norm loss function is defined as J(W, b) = (1/2)‖y − t‖₂², and J(W, b) = −∑_{i,j} 1(yi = tj) log p(yi = tj) is the softmax loss function.

The error δ for each layer is defined as δl ≜ ∂J/∂bl. If we use an L2 norm loss function, for the last (output) layer L, the error is δL = f′(uL) ⊙ (y − t), where ⊙ represents a Hadamard product, i.e., element-wise multiplication. For the other layers, excluding the output layer, the error is δl = (Wl+1)⊤ δl+1 ⊙ f′(ul).

¹ In this chapter, we use ∇X to denote the partial derivative of a loss function J(·) with respect to a variable X, i.e., ∇X = ∂J/∂X.
Figure 3.2: Data flows between two adjacent layers¹. Forward: ul = Wl dl−1 + bl, dl = f(ul). Backward: δl−1 = (Wl)⊤ δl ⊙ f′(ul−1), ∇Wl = dl−1 (δl)⊤, ∇bl = δl.
With a ReLU activation function, the error can be rewritten as δl = (Wl+1)⊤ δl+1 ⊙ f′(dl). The backward partial derivative with respect to Wl is ∂J/∂Wl = dl−1 (δl)⊤, and the backward partial derivative with respect to bl is ∂J/∂bl = δl. Now we can use the gradient descent method to update the weights of the neural network.
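The backward equations above can be sketched for a single fully-connected layer with ReLU (an illustrative NumPy sketch with hypothetical names, using the column-vector convention of this section):

```python
import numpy as np

def backward_layer(W_next, delta_next, u_l, d_prev):
    """One backward step through layer l with a ReLU activation.

    Given W_{l+1}, delta_{l+1}, the pre-activation u_l, and the previous
    layer's output d_{l-1}, return (delta_l, grad_W_l, grad_b_l) per
    the equations above.
    """
    relu_grad = (u_l > 0).astype(u_l.dtype)        # f'(u_l) for ReLU
    delta_l = (W_next.T @ delta_next) * relu_grad  # Hadamard product
    grad_W = d_prev @ delta_l.T                    # dJ/dW_l = d_{l-1} (delta_l)^T
    grad_b = delta_l                               # dJ/db_l = delta_l
    return delta_l, grad_W, grad_b
```

Chaining this function from the output layer down to the first layer propagates the error backward, after which a gradient descent step subtracts a learning-rate multiple of each gradient from the corresponding weights.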
We see that the training phase is more complex than the testing phase due to the weight updates. It also introduces more data dependencies (e.g., the weight update of a layer depends on the previous layer's error and the data of the earlier forward computation). Therefore, the training phase is time consuming and can take 14 to 21 days [181].
3.1.3 ReRAM Basics
The resistive random access memory (ReRAM) [125] is an emerging non-volatile memory
with appealing properties of high density, fast read access and low leakage power. ReRAM
has been considered as a promising candidate for future memory architecture. A primary
44
application of ReRAM is to be used as an alternate for main memory [250–252]. Figure 3.3
(a) demonstrates the metal-insulator-metal (MIM) structure of an ReRAM cell. It has a
top electrode, a bottom electrode and a metal-oxide layer sandwiched between electrodes.
By applying an external voltage across it, an ReRAM cell can be switched between a high
resistance state (HRS or OFF-state) and a low resistance state (LRS or On-state), which
are used to represent the logical ”0” and ”1”, respectively, as shown in Figure 3.3 (b). The
endurance of ReRAM could be up to 1012 [253, 254], alleviating the lifetime concern faced
by other non-volatile memory, such as PCM [255].
Figure 3.3: Basics of ReRAM. (a) The metal-insulator-metal structure of a ReRAM cell (top electrode, metal oxide, bottom electrode). (b) The current-voltage behavior: a SET operation switches the cell from the HRS ('0') to the LRS ('1'), and a RESET operation switches it back. (c) A ReRAM crossbar of wordlines and bitlines performing matrix-vector multiplication: bj = f(∑i ai · wij).
ReRAM features the capability to perform in-situ matrix-vector multiplication [256, 257], as shown in Figure 3.3(c), which utilizes the property of bitline current summation in ReRAM crossbars to enable computing with high performance and low energy cost. While conventional CMOS-based systems have shown success in neural network acceleration [27, 43, 101], recent works [88–90, 92] demonstrated that ReRAM-based architectures offer significant performance and energy benefits for computation- and memory-intensive neural network computing.
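An idealized functional model of Figure 3.3(c) can be sketched as follows (ignoring device non-idealities such as wire resistance and limited cell precision; wordline inputs a encode the vector, cell conductances w the matrix, and each bitline current sums ai · wij):

```python
def crossbar_mvm(a, w, f=lambda x: x):
    """Idealized in-situ matrix-vector multiplication on a crossbar:
    b_j = f(sum_i a_i * w[i][j]), where each bitline j analogously sums
    the currents contributed by all wordlines i in one step."""
    cols = len(w[0])
    return [f(sum(a[i] * w[i][j] for i in range(len(a)))) for j in range(cols)]

# Example: a 3x2 crossbar computing two dot products in parallel.
print(crossbar_mvm([1.0, 2.0, 3.0], [[1, 0], [0, 1], [1, 1]]))
```

In hardware, the whole sum per bitline happens in a single analog step, which is the source of the performance and energy benefits cited above.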
Recent works [88–90] demonstrated that ReRAM-based architectures can accelerate matrix-vector multiplication in neural computation. However, current works lack important features for efficiently executing complete deep learning applications. First, they do not support training, which involves weight updates and complex data dependencies. Second, the deep pipeline in ISAAC [89] is not beneficial to the training phase, where the number of consecutive images that can enter the pipeline is limited by the batch size. It is also vulnerable to pipeline bubbles.
3.2 PIPELAYER Architecture
In this section, we first analyze the general training procedure and outline the required mechanisms to support it. Then we present the ideas of the intra- and inter-layer pipelines that exploit parallelism and support training naturally.

The PIPELAYER architecture directly leverages ReRAM cells to perform computation without the need for extra processing units. Resistive random access memory (ReRAM or RRAM) is a type of non-volatile memory that stores information by changing the cell resistance. This work focuses on a subset of resistive memories, metal-oxide ReRAM, which uses metal oxide layers as the switching material. More details on the technology can be found in [88, 253–255, 258–262].

Our design partitions the ReRAM-based main memory into two regions: morphable subarrays (Morp) and memory subarrays (Mem). The morphable subarrays can perform both computation and data (weight) storage, and can be configured in either mode. In computation mode, the morphable subarrays perform matrix-vector multiplications. The memory subarrays are similar to conventional memory and have data storage capability. Due to space limits, readers can refer to [88] for details on the ReRAM basics.
3.2.1 Training Support
For simplicity, Figure 3.4 shows the configuration of PIPELAYER for training a 3-layer
CNN. Each rectangle represents one layer of morphable subarrays; each circle represents
memory subarrays that store intermediate results transferred between the morphable
subarrays of different layers.
Figure 3.4: Data flows and dependencies in layer-wise pipeline: (a) forward; (b) backward.
Data dependencies exist in both forward and backward computations; the computation
timing and dependencies are shown in Figure 3.4. Forward computation starts at cycle T0.
In cycle T1, the input (d0) enters A1 (morphable subarrays), which performs the
matrix-vector multiplication; at the end of T1, the results are written to a memory
subarray, d1. For clarity, red dashed lines mark the system state between two consecutive
cycles. Note that the cycles in Figure 3.4 (e.g. T0, T1, ...) are logical cycles; depending on
the implementation, each may take several physical clock cycles. We will show in
Section 3.2.2 that the number of physical cycles is determined by parallelism and
hardware resources. The term "cycle" from this point on refers to a logical cycle. The
solid lines between rectangles (morphable subarrays) and circles (memory subarrays)
indicate data dependencies. The remaining forward layers proceed similarly.
The results of the forward computation are eventually stored in d3 before backward
computation starts at T4. The backward computations generate errors (δl, where l is the
layer index) and partial derivatives with respect to b and W (∇bl and ∇Wl). As the first
step, the error for the last (third) layer (δ3) is computed in T4 and stored in memory
subarrays, serving as the input for computations in the next cycle.
In T5, two computations happen in parallel: (1) the partial derivatives (∇W3) are
computed from the previous results in d2 and δ3; (2) the errors (δ2) of the second layer
are computed from δ3. Both computations depend on δ3, which is computed in T4. ∇W3
is stored in memory subarrays and will later be used to update the weights in A3 and A32.
Note that (W3)∗ is a reordered form of W3 and will be discussed in Section 3.3.3. Finally,
in T7, the partial derivatives for the first layer (∇W1) are computed and stored in memory
subarrays.
In training, the batch size (denoted B) specifies the number of inputs that are processed
before a weight update. If B = 1, ∇W1, ∇W2 and ∇W3 directly update the weights in
A1, A2, A3 and A22, A32. If B > 1, ∇W1, ∇W2 and ∇W3 are stored in the buffers in a
layout that lets the average of all partial derivatives be computed automatically.
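The end-of-batch update can be sketched functionally as follows. This is an illustrative software model, not the hardware mechanism itself; the learning-rate step `lr` is our assumption, since the text specifies only that the averaged partial derivatives are applied.

```python
def batched_update(W, grads, lr=0.1):
    """Apply the average of B per-input gradients to weight matrix W.

    W: list-of-lists weight matrix; grads: list of B gradient matrices
    of the same shape. `lr` is a hypothetical learning rate.
    """
    B = len(grads)
    rows, cols = len(W), len(W[0])
    return [[W[i][j] - lr * sum(g[i][j] for g in grads) / B
             for j in range(cols)] for i in range(rows)]
```

With B = 1 this degenerates to an immediate per-input update, matching the first case above.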
The training phase is more complicated than the testing phase because it involves data
dependencies between the outputs of previous layers and the computation of partial
derivatives and errors for the weight update. Memory subarrays are used as buffers for
intermediate results, which greatly reduces data movement and energy consumption by
avoiding transfers across the memory hierarchy. When not necessary (e.g. with only
testing), some of them can be converted to morphable subarrays.
Within a cycle, different operations may be performed depending on the phase of
computation. Table 3.1 shows the operations in the four possible cases. In an
implementation, the cycle time must accommodate the longest sequence of operations.
Note that the cycle time is not a bottleneck, because memory is much slower than
processors.
3.2.2 Intra Layer Parallelism
Data Input and Kernel Mapping
We propose a basic scheme for data input and the mapping of kernels to a ReRAM array,
illustrated in Figure 3.5. In this example, the data of layer l, the kernels, and the data of
layer l+1 have sizes of 114×114×128, 3×3×128×256 and 112×112×256, respectively.
Table 3.1: Operations in a cycle.
L: number of layers; l: layer index.

            Memory Read  Spike Driver  Morp. Subarray           I&F  Activation  Mem. Subarray
Forward     ✓            ✓             ✓                        ✓    ✓           ✓
Backward 1  dL                                                       ∗           ∇bL
Backward 2  ∇bl          ✓             AL1(dl−1); AL2((Wl−1)∗)  ✓    ✓           ∇Wl
Update      ∇Wl          ✓                                                       Wl
For each channel in layer l+1, the corresponding kernel has a size of 3×3×128. That
means the input vector (yellow bar) entering the ReRAM array in one cycle has size
1152×1 (strictly 1153×1 including the bias, which we neglect for clarity). In the next
cycle, the kernel window shifts right or down, producing another 1152×1 input vector.
Consequently, input data enter the ReRAM array sequentially, and it takes 12544 cycles
to produce all outputs of layer l+1.
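The window arithmetic above can be sketched directly (assuming stride 1 and no padding, as in the example):

```python
def mapping_params(H, C_in, K, C_out):
    """Input-vector length, bit-line count and sequential cycle count
    for the basic mapping: one K*K*C_in kernel window enters per cycle
    and one output position is produced per cycle."""
    out = H - K + 1                 # output spatial size per side
    vec_len = K * K * C_in          # word lines needed
    cycles = out * out              # one window position per cycle
    return vec_len, C_out, cycles   # array size is vec_len x C_out
```

For the example layer, `mapping_params(114, 128, 3, 256)` reproduces the 1152 word lines, 256 bit lines and 12544 cycles quoted in the text.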
Mapping all kernels to the same ReRAM array is not realistic. The weights of one kernel
are mapped to the cells of one bit line; for example, the blue cuboid is mapped to the blue
bar in the array, and likewise for the red, green and remaining cuboids, resulting in 256
bit lines and 1152 word lines. Therefore, the ReRAM array would need a size of
1152×256. In Section 3.2.2 we present a more realistic design, but first let us examine
how the approach of ISAAC [89] improves on this naive design.
Pipeline in ISAAC [89]
The pipeline used in ISAAC [89] improves the throughput of the architecture by
computing small tiles of a layer and letting the next layer use the partial output as input
in the next cycle. The deep pipeline can notably improve throughput: if a large number of
inputs can be fed into the pipeline continuously, then after a (large) number of initial
cycles to fill the pipeline, results are generated every cycle.
Figure 3.5: A basic scheme for data input and kernel mapping (1152 = 3×3×128 word
lines; 256 bit lines; 12544 = 112×112 cycles).
Unfortunately, this assumption does not hold for the neural network training phase,
where weights are updated at the end of each batch. The inputs in the next batch must be
processed with the updated weights, so a large number of consecutive inputs cannot be
supplied to the pipeline. Therefore, the pipeline design in ISAAC does not fit the training
phase well.
Another issue is the vulnerability to pipeline bubbles, which stall execution. Consider a
network where all kernels have size 2×2×1. One point p50 in layer l5 depends on 4, 16,
64 and 256 points in layers l4, l3, l2 and l1, respectively, which means the computation of
p50 stalls when any of those 340 (= 4+16+64+256) points is delayed, which can further
stall computations depending on p50. Moreover, data dependencies are more complex in
the training phase: layer l may depend not only on layer l−1 but also on earlier layers
(e.g. δ1 in Figure 3.4 depends on δ2 and the earlier data d1), which further increases the
chance of stalls. [89] mentions the pipeline imbalance issue, but similar issues arise more
easily in the more complex training phase, and they are extremely hard to predict and
avoid.
Parallelism Granularity
Figure 3.6: A balanced scheme for data input and kernel mapping (G = 256 = 2×128;
1152 = 9×128; 49 = 12544/G; 294912 = G×1152).
The ReRAM array size can be made feasible by partitioning. We can decompose the
1152×256 matrix into a group of 18 (= 9×2) matrices and map each of them to a
128×128 ReRAM array (shown in the right part of Figure 3.6). The correct results are
obtained by collecting the array outputs horizontally and summing them vertically.
To improve performance, we define a metric called parallelism granularity, denoted G,
which indicates the number of duplicated copies of the ReRAM arrays that store the same
weights. If G = 1, the design is equivalent to the naive scheme; if G = 12544, the results
of a layer could be generated in a single cycle, but the hardware cost is prohibitive.
Essentially, parallelism granularity allows us to explore the trade-off between ReRAM
array hardware resources and performance. A good trade-off requires a carefully chosen
G. Figure 3.6 shows an example with G = 256.
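The trade-off can be quantified with the example's numbers (each copy of the weights is a 9×2 grid of 128×128 arrays, so 18 arrays per copy):

```python
from math import ceil

def granularity_tradeoff(G, vec_len=1152, c_out=256, outputs=12544,
                         array_dim=128):
    """Cycles per layer and 128x128 array count for granularity G.
    Each of the G copies is a ceil(vec_len/128) x ceil(c_out/128)
    grid of arrays; G copies process G kernel windows per cycle."""
    arrays_per_copy = ceil(vec_len / array_dim) * ceil(c_out / array_dim)
    return ceil(outputs / G), G * arrays_per_copy
```

`granularity_tradeoff(1)` gives (12544, 18) as in the naive scheme, `granularity_tradeoff(256)` gives the 49 cycles of Figure 3.6, and `granularity_tradeoff(12544)` shows the single-cycle but prohibitively large extreme.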
3.2.3 Inter Layer Parallelism
Figure 3.7: Layer-wise pipeline.
In training, input data are normally processed in batches. All inputs in the same batch
are processed with the weights in effect at the start of the batch; the weight updates due
to each input are stored and applied only at the end of the batch. Therefore, no
dependency exists among the inputs within a batch. We propose an architecture to
support pipelined training. The performance gain comes from the fact that B is normally
much larger than 1 (e.g. 64); otherwise, each input would need to be processed
sequentially.
Based on Figure 3.4, the pipelined execution is shown in Figure 3.7. Execution with
intra- and inter-layer pipelining achieves high throughput without the drawbacks of
previous work. During execution, a new input can enter the pipeline every cycle within a
batch. At the end of a batch, inputs belonging to the next batch cannot enter the pipeline
until all inputs of the previous batch are processed and the weights are updated. In other
words, a new batch has to wait for the "tail" of the previous batch to drain from the
pipeline.
Figure 3.8 (a) shows the latency of the baseline architecture without pipelining.
Assuming the neural network has L layers, the forward computation of one input takes L
cycles and the backward computation takes (L+1) cycles; at the end of a batch, one cycle
is dedicated to the weight update. Therefore, a batch takes (2L+1)B+1 cycles.
Figure 3.8: Latency of PIPELAYER architecture: (a) without pipeline; (b) with pipeline.
(B: batch size; L: number of layers.)
For a total of N inputs, the total number of training cycles is (2L+1)N + N/B.
Now consider the pipelined execution, whose latency is illustrated in Figure 3.8 (b).
Within a batch, a new input can enter every cycle. As in the baseline case, the first
weight update is generated after (2L+1) cycles; there are then (B−1) further cycles until
the end of the batch, and finally one cycle to apply all weight updates of the batch. Since
the forward and backward computations are intertwined, we do not distinguish forward
and backward cycles. From the above analysis, the total number of cycles to process N
inputs with L layers is (N/B)(2L+B+1).
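These cycle counts are easy to check in a few lines (assuming N is a multiple of B, as the formulas do):

```python
def training_cycles(L, B, N, pipelined):
    """Logical training cycles for N inputs, L layers, batch size B,
    matching the two formulas in the text."""
    if pipelined:
        return (N // B) * (2 * L + B + 1)
    return (2 * L + 1) * N + N // B
```

For example, with L = 3, B = 4 and N = 8, the baseline takes 58 cycles while the pipelined design takes 22.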
Figure 3.9: Memory subarray design ((2L−1) entries).
The major implication of pipelining is an increased requirement for memory subarrays,
illustrated in Figure 3.9. Consider the same example as in Figure 3.4: since A1 now
produces an output every cycle, more buffers are needed so that data to be used later are
not overwritten. Specifically, referring to Figure 3.4, d1 computed in T1 will be used 5
cycles later, in T6; therefore, 5 buffer entries are needed between A1 and A2. A1 logically
keeps a pointer to the buffer entry its output should be written to, incremented after each
output; in effect, we implement circular buffers. When the pointer reaches 5, it wraps
around to the first entry. When the output is written to the first entry again, the data
there (the d1 used to compute ∇W2) can be safely overwritten because ∇W2 has already
been computed. Accordingly, in Figure 3.4, A1 writes the 1st, 2nd, 3rd, 4th and 5th
entries of the buffer between A1 and A2 in T1, T2, T3, T4 and T5, respectively. Since
∇W2 is computed from the data in the 1st entry in T5, A1 can overwrite the 1st entry in
the next cycle. In general, the buffer requirement at the l-th layer (with L layers in total)
is 2(L−l)+1.
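The buffer-count formula and the wrap-around pointer described above can be sketched as follows (an idealized software model of the hardware buffers):

```python
def buffers_needed(L, l):
    """Buffer entries between layer l and l+1 so that d_l, produced
    every cycle, survives until its use 2(L-l)+1 cycles later
    (the formula from the text)."""
    return 2 * (L - l) + 1

class CircularBuffer:
    """Minimal circular buffer: writes wrap around and overwrite the
    oldest entry once the gradient consuming it has been computed."""
    def __init__(self, n):
        self.data, self.head = [None] * n, 0

    def write(self, x):
        self.data[self.head] = x
        self.head = (self.head + 1) % len(self.data)
```

For the 3-layer example, `buffers_needed(3, 1)` returns the 5 entries between A1 and A2.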
Another implication of pipelining is that a read and a write can target the same buffer in
the same cycle. To support this, we duplicate the single buffer; this is needed for the
buffers at d3, δ1, δ2 and δ3. Table 3.2 compares the number of cycles and the cost in
(groups of) arrays of the non-pipelined and pipelined architectures.
Table 3.2: Cycles and cost in PIPELAYER architecture.
G: parallelism granularity; L: number of layers; B: batch size; N: number of total input images.

                 Non-pipelined   Pipelined
Forward cycles   LN              (N/B)(2L+B+1) (forward and backward combined)
Backward cycles  (L+1)N + N/B
Morp subarrays   GL+G(2L−1)      GL+G(L−1)+BL
Mem subarrays    2L
3.3 Implementation of PIPELAYER
Figure 3.10: Overall architecture of PIPELAYER, with component details: (a) spike
driver, (b) integration-and-fire unit, (c) activation unit, (d) connection between
morphable and memory subarrays, (e) controller.
3.3.1 Overall Architecture
Figure 3.10 shows the architecture of PIPELAYER. Our design leverages ReRAM cells
to perform computation without the need for extra processing units. The design
partitions the ReRAM-based main memory into two regions: morphable subarrays and
memory subarrays. The morphable subarrays support both computation and data
(weight) storage, and the mode can be configured; in computation mode, they perform
matrix-vector multiplications. The memory subarrays are the same as conventional
memory and provide data storage; they hold intermediate results between layers and keep
data that will be used later. Figure 3.10 shows only the morphable subarrays and
memory subarrays.
a. Spike Driver. To reduce the area and energy overhead of the voltage-level-based input
scheme in [88], we use a weighted spike coding scheme similar to ISAAC [89]. The spike
driver converts the input to a sequence of weighted spikes. During weight update, the
spike driver serves as a write driver to tune the weights stored in the ReRAM array.
b. Integration and Fire. This is a required component of a spike-based scheme. It
integrates the input current and generates output spikes. The output spikes drive a
counter, so the analog currents are effectively converted into digital values. The data
input scheme in [89] is similar to our spike input in that both eliminate DACs, but our
integration-and-fire component also eliminates ADCs, which [89] keeps.
c. Activation. It implements the activation function defined in a CNN algorithm. When
morphable subarrays are configured as memory, it is bypassed.
d. Connection. This component connects morphable and memory subarrays. The
outputs from morphable subarrays (when they are in computation mode) need to be
written to memory subarrays so that they can be used as input for the morphable
subarrays of the next layer in the next cycle.
e. Control. It offloads computation from the host CPU and orchestrates the data
transfers between memory subarrays and morphable subarrays in training and testing,
based on the algorithm configuration (e.g. batch size).
3.3.2 Component Designs
Spike Driver
We use a weighted spike coding scheme. For an N-bit input value, we need N time slots
for potential spikes. The spike driver design is shown in Figure 3.10 (a). Inside the
driver, N levels of reference voltage, ranging from V0/2^(N−1) to V0/2, are generated. A
timing-control component shifts key K1 to connect one of the reference voltages in each
time slot. Note that the shifting is non-decreasing: the voltage of the output spike
increases as the time slots progress, so the data input is in Least Significant Bit First
(LSBF) order. K2 is controlled by the input data and decides whether the spike output is
generated.
Another function of the spike driver is to write or program data to a subarray: K1 is
connected to a programming voltage, and K2 tunes the resistance of a ReRAM cell under
the control of the input data, which serve as programming control sequences. Spike
drivers are reused between adjacent subarrays.
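A functional sketch of the LSB-first weighted coding is given below. The exact reference-voltage scaling here is our assumption (amplitudes doubling per bit position up to the V0/2 maximum named in the text); it is a software model of the encoding, not the driver circuit.

```python
def to_spikes(value, nbits=4, v0=1.0):
    """Encode an nbits-wide integer as LSB-first weighted spikes:
    slot i carries a spike of amplitude v0 * 2**i / 2**nbits iff bit i
    of `value` is 1 (assumed scaling; maximum amplitude is v0/2)."""
    return [(v0 * (1 << i) / (1 << nbits)) if (value >> i) & 1 else 0.0
            for i in range(nbits)]
```

The sum of the emitted amplitudes is proportional to the encoded value, which is what makes the bit-line current accumulation compute a weighted sum.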
Integration and Fire
Figure 3.10 (b) shows the design of the integration-and-fire circuit. A controlled current
source serves as a current follower, injecting into a capacitor a current of the same
strength as that on a bit line; as a result, the voltage on the capacitor accumulates.

When the capacitor voltage reaches the threshold (Vth) of the comparator, an output
spike is generated and counted by a counter. Note that the bit-line current is the
accumulation of products of spikes (input data) and ReRAM cell conductances (the
corresponding weights). For a fixed threshold Vth, a K-times-stronger current makes the
comparator generate K times as many output spikes; the number of spikes counted is
therefore the accumulated sum of products of input data and corresponding weights. A
group of integration-and-fire components is connected to the bit lines of a subarray,
generating an output vector in parallel.
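An idealized software model of this conversion (lossless capacitor, fixed threshold; the parameters are illustrative, not circuit values) shows the linear relation between current strength and spike count:

```python
def integrate_and_fire(currents, c=1.0, vth=1.0, dt=1.0):
    """Count output spikes: the capacitor voltage accumulates i*dt/c
    per step; each crossing of vth emits one spike and subtracts vth
    (idealized reset), so spike count tracks total injected charge."""
    v, spikes = 0.0, 0
    for i in currents:
        v += i * dt / c
        while v >= vth:
            spikes += 1
            v -= vth
    return spikes
```

Doubling the current doubles the spike count, which is exactly the property that turns the analog bit-line sum into a digital value at the counter.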
Activation Function
Figure 3.10 (c) shows the design of the activation component, which consists of a
subtractor and a look-up table (LUT). Since positive and negative weights are mapped to
two separate subarrays, we collect the result DP from the positive subarray and DN from
the negative subarray. The activation component is configurable through different LUTs
to realize the activation function defined by the specific algorithm. In this work, we
mainly focus on the rectified linear unit (ReLU), which is widely used in deep neural
networks. A register keeps the maximum value of a sequence, which realizes max
pooling.
3.3.3 Error Backward
Before weight updates can be generated from partial derivatives, the error for each layer
must be computed. In the forward pass, data propagate from layer l to layer l+1; in the
backward pass, errors propagate from layer l back to layer l−1. Figure 3.11 shows the
three kinds of error backward in convolutional neural networks.
Figure 3.11: Error backward for (a) activation function, (b) max pooling layer and (c)
convolution layer.
Error backward through the activation function is computed as δl−1 = f′(ul)δl, an
element-wise operation. The yellow squares in Figure 3.11 (a) represent an element of
error δl in layer l and of error δl−1 in layer l−1. Since the activation function used is
ReLU, f′(·) is either 0 or 1, so the backward step is to AND (&) 0 or 1 with δl, which is
done by the activation components (component c in Figure 3.10). Thanks to ReLU,
f′(ul) = f′(dl), so there is no need to store the intermediate data ul during the forward
pass.
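This element-wise step can be sketched in one line, recovering f′ from the stored forward output dl as the text describes:

```python
def relu_backward(delta_l, d_l):
    """Error backward through ReLU: pass each error element through
    where the forward output d_l is positive, zero it elsewhere
    (f'(u_l) recovered from d_l, so u_l need not be stored)."""
    return [d if out > 0 else 0.0 for d, out in zip(delta_l, d_l)]
```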
Figure 3.11 (b) shows the error backward for a pooling layer. Error elements in layer
l−1 are copied to the positions holding the maximum values within the corresponding
windows of the data in layer l. For example, the pink square with value 3 in layer l−1 is
copied to the lower-right position of the pink window in layer l, while the other three
squares in that window get 0. The four squares in layer l−1 are propagated to the
upper-left, lower-left, lower-right and upper-right positions of the four windows in layer
l, respectively. Mathematically, max pooling can be viewed as a kind of activation
function, and its error backward is also performed in the activation component
(component c in Figure 3.10). With dl already stored in the memory subarrays, the index
of the maximum element in each window can be found.
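This routing can be sketched as follows (a single-channel software model with a k×k window, stride k, and k dividing the input size; an idealized view of what the activation component performs):

```python
def maxpool_backward(delta, d, k=2):
    """Route each pooled error back to the arg-max position of its
    k x k window in the stored forward data d; all other positions
    receive 0. delta: pooled errors; d: pre-pooling forward data."""
    H = len(d)
    out = [[0.0] * H for _ in range(H)]
    for i in range(0, H, k):
        for j in range(0, H, k):
            # locate the maximum inside this window
            mi, mj = max(((a, b) for a in range(i, i + k)
                          for b in range(j, j + k)),
                         key=lambda p: d[p[0]][p[1]])
            out[mi][mj] = delta[i // k][j // k]
    return out
```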
For the error backward of a convolution layer, we can reuse the convolution logic. In
Figure 3.11 (c), a 3×3 kernel is used in the forward pass: to generate the nine elements in
the red window in layer l, all of the (light, medium and dark) blue squares in layer l−1
are convolved with the 3×3 kernel, and the dark blue square (with red edges) is
multiplied with every element of the kernel. In the backward pass, to obtain the error for
the dark blue square, we simply convolve the nine elements in the red window in layer l
with the kernel.
Figure 3.12: Error backward for a convolution layer by convolution of the error of layer
l and reordered kernels.
The error backward for a convolution layer is computed as δl−1 = conv2(δl, rot180(K),
'full'); zero padding handles elements at the edges. Figure 3.12 shows the error backward
for the convolution layer of Figure 3.5: the error of layer l is zero-padded by 2 on each
edge, so its length and width increase to 116, and the kernels are reordered. In this way,
we can perform the convolutions using the data input and kernel mapping scheme
discussed in Section 3.2.2.
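The equation can be checked with a small single-channel sketch. Full convolution with the 180°-rotated kernel is the same as full cross-correlation of the zero-padded error with the original kernel, which is how it is implemented below:

```python
def conv_backward_error(delta_l, kernel):
    """delta_{l-1} = conv2(delta_l, rot180(kernel), 'full') for one
    channel: zero-pad delta_l by k-1 on each edge, then cross-correlate
    with the original kernel (equivalent to convolving with rot180)."""
    k, n = len(kernel), len(delta_l)
    p = k - 1
    padded = [[0.0] * (n + 2 * p) for _ in range(n + 2 * p)]
    for i in range(n):
        for j in range(n):
            padded[i + p][j + p] = delta_l[i][j]
    m = n + p                                  # full-convolution size
    return [[sum(padded[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(m)] for i in range(m)]
```

For a 1×1 error, the result is exactly the rotated kernel, confirming the rot180 relation; for the layer in Figure 3.12, the 112×112 error padded by 2 yields the 114×114 error of layer l−1.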
3.3.4 Weight Update
Computing Partial Derivatives
The partial derivative with respect to the bias is the sum of all related errors: ∇b =
∑u,v(δl)u,v. It is computed by reading the bit line with input spikes representing 1. The
partial derivative with respect to a weight is the sum of element-wise products of the
error and patches/windows of the data: ∇W = ∑u,v(δl)u,v · (dl−1)u,v, which can be
converted to a convolution.
In Figure 3.13, the convolution of one blue channel of the error in layer l with all 128
channels of the data in layer l−1 gives the partial derivatives for the weights of the blue
kernel (of size 3×3×128); all other kernels (red, green and the rest) are handled similarly.
In this convolution, the data in layer l−1 (yellow slices) can be viewed as the convolution
kernels, while the error in layer l (dotted slices) can be viewed as the convolution data.
Therefore, the convolution is performed by mapping the yellow slices to ReRAM arrays
and sending the backward error to the arrays.
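For one input-channel/output-channel pair, this computation amounts to a valid cross-correlation of the stored forward data with the error. A single-channel software sketch:

```python
def conv_weight_grad(d_prev, delta):
    """Kernel gradient for one input/output channel pair:
    grad_W[a][b] = sum_{u,v} delta[u][v] * d_prev[u+a][v+b],
    i.e. the valid cross-correlation of the stored forward data with
    the error (the convolution view used in the text)."""
    n, m = len(d_prev), len(delta)
    k = n - m + 1                       # recovered kernel size
    return [[sum(delta[u][v] * d_prev[u + a][v + b]
                 for u in range(m) for v in range(m))
             for b in range(k)] for a in range(k)]
```

With 114×114 data and a 112×112 error, the output is the 3×3 kernel gradient, matching the sizes in Figure 3.13.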
Figure 3.13: The computation of partial derivatives for kernels.
Writing New Weights to Morphable Subarrays
With the partial derivatives for the weights (∇W) and bias (∇b), the old weights and bias
are read from the morphable subarrays; the new (updated) weights and bias are then
computed and written back to the morphable subarrays.
In weight updating, we need to compute the average of the B partial derivatives (B is the
batch size), which can be done with input spikes representing 1/B: by the
current-accumulation property of the bit line, the read-out data are the averaged partial
derivatives. At the same time, the old weights are read out and the new weights are
computed. The LUTs in the activation components are bypassed during this read. Unlike
a default read, the DP inputs of the subtractors are connected to the old weights and the
DN inputs to the averaged partial derivatives, so the output data are the new
weights/bias (the old weights/bias minus the averaged partial derivatives). Finally, the
new weights/bias are written to the corresponding morphable subarrays.
3.4 Discussion
3.4.1 Resolution and Accuracy
In practice, a ReRAM cell supports only limited precision; for example, [88] used 6-bit
ReRAM cells and [256] used 5-bit cells. This limitation does not affect some neural
networks, thanks to their inherent error tolerance.
Figure 3.14: Tradeoff between resolution and accuracy.
We conducted a set of experiments to understand the trade-off between the resolution of
ReRAM cells and application accuracy. We use five applications: M-1, M-2 and M-3 are
multilayer perceptron networks, while M-C and C-4 are convolutional neural networks.
Figure 3.14 shows the accuracy, normalized to the float-resolution original
implementation, for different numbers of bits. For M-1, M-2 and M-3, using 4 bits only
slightly decreases accuracy. However, accuracy is more sensitive to resolution for
convolutional neural networks: the accuracies of M-C and C-4 drop sharply from 3 bits,
and for C-4 the normalized accuracy is only around 0.2 even at 4-bit resolution.
When higher resolution is needed, we can combine multiple arrays of lower resolution,
similar to [89]. The default resolution of PIPELAYER is 16 bits, the same as [27, 89], and
the ReRAM cells used in PIPELAYER have 4-bit resolution.
Figure 3.15: Weight partition into four 4-bit groups (D0<<0, D1<<4, D2<<8, D3<<12):
(a) forwarding; (b) updating.
In the testing phase, we can obtain 16-bit results by adding four shifted 4-bit results, as
shown in Figure 3.15 (a): the same data are input to four groups of 4-bit arrays, which
store the 4-bit weight segments for bits 15..12, 11..8, 7..4 and 3..0, respectively, and
shifting and adding the results from the four groups yields the 16-bit result. In the
training phase, we first read the four old 4-bit weight segments and shift them to recover
the old weights; with the partial derivatives, we compute the new weights and write them
back to the four groups to finish the update, as shown in Figure 3.15 (b).
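The segment split and the shift-and-add recombination are simple bit arithmetic:

```python
def split4(w16):
    """Split a 16-bit weight into four 4-bit segments, least
    significant segment first (the per-array storage of Fig. 3.15)."""
    return [(w16 >> (4 * s)) & 0xF for s in range(4)]

def combine4(segs):
    """Shift-and-add four 4-bit partial results back into 16 bits."""
    return sum(d << (4 * s) for s, d in enumerate(segs))
```

Splitting and recombining is lossless, which is why four 4-bit arrays can emulate one 16-bit array.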
3.4.2 Application Programming Interface
Thanks to the layer-wise inter-layer pipeline of PIPELAYER, the system can be
configured layer by layer. Before discussing layer-wise function invocation, we provide
two functions, Copy_to_PL and Copy_to_CPU, to transfer data from CPU main memory
to PIPELAYER and back. To configure one layer, the topology-set function Topology_set
is invoked; it configures the connections and datapath of the G groups of arrays shown in
Figure 3.6, where G can be set by the programmer or optimized automatically by a
compiler. Next, the weight-load function Weight_load is invoked, which loads
pre-trained weights into the arrays for the testing phase, or initial weights for the
training phase. After all layers are configured, pipelined execution is enabled by the
function Pipeline_set. Finally, one of the running-mode functions, Train or Test, starts
the execution.
3.5 Evaluation
3.5.1 Benchmarks
The benchmark networks used are based on the MNIST [183] and ImageNet [182]
databases. MNIST is a classical database of handwritten digits for pattern recognition; it
consists of a training set of 60K images and an inference set of 10K images, with labels
ranging from 0 to 9. MNIST images are 28-by-28 grayscale images. ImageNet provides
14.2M images and 21.8K indexed synsets; most of the largest and best-performing neural
networks are based on ImageNet.
For ImageNet, we select six popular large-scale networks: AlexNet [119] and VGG-A,
VGG-B, VGG-C, VGG-D and VGG-E [181]. The topologies and hyperparameters of
these networks can be found in the references. For MNIST, we build four neural networks
as our benchmarks; their hyperparameters are shown in Table 3.3.
In Table 3.3, N1-N2 represents an inner-product layer with N1 neurons in layer l and N2
neurons in layer l+1 (e.g. 784-100 represents an inner-product layer with 784 neurons in
the first layer and 100 neurons in the second). ConvKxC represents a convolution layer
with a K-by-K kernel and C channels in the next layer.
Table 3.3: Hyperparameters of networks on MNIST.

Network  Hyperparameters
Mnist-A  784-100-10
Mnist-B  784-500-250-10
Mnist-C  784-1500-1000-500-10
Mnist-0  conv5x10-360-10
3.5.2 Experiment Setup
In our experiments, we compare PIPELAYER with a GPU-based platform. Benchmarks
on that platform use the widely used framework Caffe [263]. To run on the GPU
platform, we set the running mode in the Caffe prototxt to GPU. The GPU used is the
(then newly released) GTX 1080; Table 3.4 shows the parameters of our GPU platform.
Run times on the GPU platform are measured by Caffe, and energy costs are measured
with the nvidia-smi tool provided with NVIDIA CUDA.
Table 3.4: Configurations of the GPU platform.

CPU                 Intel Xeon E5-2630 V3, 8 cores, 2.40 GHz,
                    8×(32+32) KB L1 cache, 8×256 KB L2 cache, 20 MB L3 cache
Memory              128 GB
Storage             1 TB
Graphics card       NVIDIA GeForce GTX 1080
Architecture        Pascal
CUDA cores          2560
Base clock          1607 MHz
Compute capability  6.1
Graphics memory     8 GB GDDR5X
Memory bandwidth    320 GB/s
CUDA version        8
To evaluate PIPELAYER, we build a simulator based on NVSim [184]. The read/write
latencies and energy costs used in the simulator are 29.31 ns / 50.88 ns per spike and
1.08 pJ / 3.91 nJ per spike, as reported in [264]. The area model is based on data
reported in [252].
3.5.3 Performance Results
Figure 3.16 shows the performance comparison for the training and inference (testing)
phases. For each application, we report the execution time of the GPU, the non-pipelined
PIPELAYER and the pipelined PIPELAYER. The GPU implementation is used as the
baseline, and the running times of the benchmarks in the other settings are normalized
to it.
Figure 3.16: Speedups of networks in both training and inference.
Compared to the GPU platform, the geometric-mean speedups achieved by the pipelined
PIPELAYER architecture are 33.85x for training, 53.22x for inference, and 42.45x overall
(training + inference). Across all applications in both training and inference, the highest
speedup achieved by the non-pipelined PIPELAYER is 20.81x, while the pipelined
PIPELAYER reaches 146.58x.

The PIPELAYER architecture achieves lower speedups in the training phase than in the
inference phase because training requires more intermediate data processing and weight
updates. The pipelined PIPELAYER achieves a geometric-mean speedup of 42.45x,
significantly better than the non-pipelined variant, because computations and data
movements are performed in a highly parallel manner.
Intuitively, as the depth of a neural network increases, the PIPELAYER architecture
should achieve higher speedups over a conventional architecture. However, we found
that this is not always the case; for example, the training speedup of Mnist-C is larger
than that of AlexNet. This is because Mnist-C is a multilayer perceptron network, whose
weights are all matrices and can be directly mapped to ReRAM arrays.
3.5.4 Energy Results
Figure 3.17 shows the energy savings of the GPU platform and the pipelined
PIPELAYER. The higher the bars, the more energy-efficient the corresponding
architecture. We show results for both training and inference.
Figure 3.17: Energy savings for PIPELAYER.
The geometric means of energy saving compared to the GPU are 6.52x for training and
7.88x for inference, with an overall geometric mean of 7.17x. The highest energy savings
for training and inference are 27.03x (Mnist-C) and 70.03x (Mnist-A), respectively. The
energy saving of PIPELAYER in training is slightly lower than in inference, due to the
extra morphable and memory subarrays needed.
In summary, the results show that the PIPELAYER architecture provides significant
performance improvement with drastic energy reduction, for two reasons. First, the
computation cost is reduced: the matrix-vector multiplications common in CNNs are
performed by analog circuits, which are much more energy-efficient than a conventional
GPU implementation. Second, data movements in the memory hierarchy are greatly
reduced: in PIPELAYER, the data movements between different levels of a conventional
memory hierarchy are replaced by data movements within memory itself, from layer to
layer.
3.5.5 Sensitivity of Parallelism Granularity
This section analyzes the effect of parallelism granularity on PIPELAYER. Table 3.5 shows the default parallelism granularity for each convolution layer of the 5 VGG networks. We use a scalar λ to enlarge or reduce the parallelism granularity of a network by a factor of λ.
Table 3.5: Optimized parallelism granularity G configurations of each convolution layer in VGGs.

Layer    VGG-A   VGG-B   VGG-C   VGG-D   VGG-E
conv11   1024    1024    1024    1024    1024
conv12   -       1024    1024    1024    1024
conv21   256     256     256     256     256
conv22   -       256     256     256     256
conv31   64      64      64      64      64
conv32   64      64      64      64      64
conv33   -       -       64      64      64
conv34   -       -       -       -       64
conv41   16      16      16      16      16
conv42   16      16      16      16      16
conv43   -       -       16      16      16
conv44   -       -       -       -       16
conv51   4       4       4       4       4
conv52   4       4       4       4       4
conv53   -       -       4       4       4
conv54   -       -       -       -       4
In Figure 3.18, λ = 0 means the parallelism granularity of all layers is G = 1, and λ = ∞ means G is set to the maximum value for each layer. Figure 3.18 shows that the speedup (compared with GPU) increases monotonically with λ. However, increasing the parallelism granularity also increases the area, as shown in Figure 3.19. Therefore, choosing a suitable parallelism granularity that balances speedup against area is critical. We choose the balanced parallelism granularities shown in Table 3.5 as the default values for the networks.

Figure 3.18: Speedups vs. parallelism granularity.
Figure 3.19: Area vs. parallelism granularity.
3.5.6 Computation Efficiency Results
The area of PIPELAYER is 82.63mm². The computational efficiency of PIPELAYER is 1485 GOPS/s/mm², higher than both DaDianNao [27] (63.46 GOPS/s/mm²) and ISAAC [89] (479.0 GOPS/s/mm²). On the other hand, the power efficiency of PIPELAYER is 142.9 GOPS/s/W, lower than DaDianNao (286.4 GOPS/s/W) and ISAAC (380.7 GOPS/s/W). In the training phase, the intermediate data d are used as the kernels to compute partial derivatives. As discussed in Section 3.3.4, d is written to morphable subarrays for convenient computation of the partial derivatives, which means that groups of "storing" arrays are converted to "computing" arrays. For this reason, PIPELAYER gains a higher computational efficiency. However, its power efficiency is lower because we write all data to ReRAM arrays, while DaDianNao and ISAAC write to eDRAMs.
Chapter 4
Accelerator Architecture for Graph Processing
4.1 Background and Motivation
4.1.1 Graph Processing
Figure 4.1: Graph processing in vertex-centric programs.
Figure 4.2: (a) Edge-centric processing and (b) dual sliding windows.
Graph algorithms traverse vertices and edges to discover interesting graph properties based on their relationships. A graph can be naturally represented as an adjacency matrix, where the rows correspond to the vertices and the matrix elements represent the edges. Most graph algorithms can be mapped to matrix operations. In reality, however, graphs are sparse, which means that the adjacency matrix contains many zeros. This property wastes both storage and compute resources. Therefore, current graph processing systems use formats that are suitable for sparse graph data. Based on such data structures, graph processing can essentially be considered as implementing the matrix operations on the sparse matrix representation. In this case, individual (and simple) operations in the whole matrix computation are performed by the compute units (e.g., a core in a CPU or an SM in a GPU) after the data is fetched. In the following, we elaborate on the challenge of random accesses in various graph processing approaches. More details on sparse graph data representation are discussed in Section 4.1.3.
To provide an easy programming interface, the vertex-centric program featuring "Think Like a Vertex" (TLAV) [157] was proposed as a natural and intuitive way to reason about graph problems. Figure 4.1 (a) shows an example: an algorithm could first access and process the red vertex in the center together with all its neighbors through the red edges. Then it can move to one of the neighbors, the blue vertex on the top, accessing that vertex and its neighbors through the blue edges. After that, the algorithm can access another vertex (the green one on the left), which is one of the neighbors of the blue vertex.
In graph processing, vertex accesses lead to random accesses because neighboring vertices are not stored contiguously. Edge accesses are locally sequential, because the edges of a given vertex are stored contiguously, but globally random, because the algorithm needs to access the edges of different vertices. These concepts are shown in Figure 4.1 (b). The random accesses lead to long memory latency and, more importantly, bandwidth waste, because only a small portion of the data in a memory block is actually accessed. In a conventional hierarchical memory system, this leads to significant data movement.
Clearly, reducing random accesses is critical to improving the performance of graph processing. This is particularly crucial for disk-based single-machine graph processing systems (a.k.a. out-of-core systems [151–154]), because random disk I/O operations are much more detrimental to performance. In this context, memory is considered small and fast (and can therefore afford random accesses), while disk is considered large and slow (and should therefore only be accessed sequentially). The edge-centric model in X-Stream [153]
is a notable solution for reducing random accesses. Specifically, the edges of a graph partition are stored and accessed sequentially on disk, while the related vertices are stored and accessed randomly in memory. Such a setting is feasible because vertex data are typically much smaller than edge data. During processing, X-Stream generates Updates in the scatter phase, which incurs sequential writes, and then applies these Updates to vertices in the gather phase, which incurs sequential reads. These concepts are shown in Figure 4.2 (a).
A notable drawback of X-Stream is that the number of updates may be as large as the number of edges. GridGraph [151] proposed optimizations to further reduce the storage cost and data movement due to updates. The solution is based on dual sliding windows (shown in Figure 4.2 (b)), which partition edges into blocks and vertices into chunks. On visiting an edge block, the source vertex chunk (orange) is accessed and updates are directly applied to the destination vertex chunk (blue). Unlike X-Stream, this requires no temporary update storage. Edge blocks can be accessed in source-oriented order (shown in Figure 4.2 (b)) or destination-oriented order. Note that the dual sliding window mechanism is based on the edge-centric model.
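As a concrete software-only sketch of the edge-centric scatter/gather idea, the following Python fragment streams a hypothetical edge list sequentially, emits one update per edge, and then applies the updates. The multiply-as-processEdge and sum-as-reduce choices here are only illustrative, not X-Stream's actual code:

```python
def scatter(edges, vprop):
    """Stream edges sequentially; emit one update per edge (sequential writes)."""
    updates = []
    for src, dst, weight in edges:          # sequential read of the edge list
        updates.append((dst, weight * vprop[src]))
    return updates

def gather(updates, vprop, reduce_fn):
    """Stream updates sequentially; apply each to its destination vertex."""
    for dst, value in updates:              # sequential read of the updates
        vprop[dst] = reduce_fn(vprop[dst], value)
    return vprop

# Tiny illustrative graph: (source, destination, weight) triples.
edges = [(0, 1, 1.0), (0, 2, 2.0), (1, 2, 0.5)]
vprop = {0: 1.0, 1: 2.0, 2: 0.0}
vprop = gather(scatter(edges, vprop), vprop, lambda a, b: a + b)
```

Note that only the accesses to `vprop` are random; the edge and update lists are read strictly in order, which is the property X-Stream exploits for disk I/O.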
4.1.2 Graph Processing Accelerators
Due to the wide application of graph processing and its challenges, several hardware accelerators have recently been proposed. Ahn et al. [187] propose TESSERACT, the first PIM-based graph processing architecture. It defines a generic communication interface to map graph processing to HMC. At any time, each core can issue a Put to remote memory, and cores are interrupted to receive and execute the Put calls from other cores. Ozdal et al. [186] introduce an accelerator for asynchronous graph processing that features efficient hardware scheduling and dependence tracking. To use the system, programmers have to understand its architecture and modify existing code. Graphicionado [185] is a customized graph accelerator designed for high performance and energy efficiency based on off-chip DRAM and on-chip eDRAM. It modifies the graph data structure and data path to optimize graph access patterns and designs a specialized memory subsystem for higher bandwidth. These accelerators all optimize memory accesses, either reducing latency or better tolerating random access latency. [265] and [266] focus on the coherence of memory that can be accessed by instructions from both the host CPU and the accelerator, and [266] also considers atomic operations. Other PIM-based accelerators for graph processing include [267, 268].
4.1.3 Graph Representation
Processing the matrix-vector multiplications on small data blocks can increase the ratio between computation and data movement and reduce the pressure on the memory system. With its matrix-vector multiplication capability, ReRAM can naturally perform low-cost parallel dense operations on the sparse sub-matrices (subgraphs), enjoying these benefits without increasing hardware and energy cost.
The key insight of GRAPHR is to store the majority of the graph data in a compressed sparse matrix representation and to process subgraphs in an uncompressed sparse matrix representation. In the following, we review several commonly used sparse representations.
Figure 4.3: (a) Sparse matrix and its compressed representations in: (b) CSC, (c) CSR, and (d) COO.
The three major compressed sparse representations are compressed sparse column (CSC), compressed sparse row (CSR), and coordinate list (COO). They are illustrated in Figure 4.3. In the CSC representation, non-zeros are stored in column-major order as (row index, value) pairs in a list; the number of entries in the list is the number of non-zeros. Another list of column starting pointers indicates the starting index of each column in the (row, val) list. For example, in Figure 4.3 (b), the 4 in colptr indicates that the 4-th entry in the (row, val) list, i.e., (0,8), is the start of column 3. The number of entries in colptr is the number of columns + 1. Compressed sparse row (CSR) is similar to CSC, with the roles of rows and columns exchanged. In the coordinate list (COO), each entry is a tuple of (row index, column index, value) for a non-zero.
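As an illustration, the following Python sketch (not GRAPHR code) builds the three representations for the 4×4 example matrix of Figure 4.3 (a):

```python
# The example sparse matrix of Figure 4.3 (a).
dense = [
    [0, 0, 3, 8],
    [0, 0, 7, 0],
    [1, 0, 0, 0],
    [0, 4, 0, 2],
]

# COO: one (row, col, val) tuple per non-zero, scanned in row-major order.
coo = [(r, c, v) for r, row in enumerate(dense)
       for c, v in enumerate(row) if v != 0]

# CSR: (col, val) pairs in row-major order plus row starting pointers.
csr_pairs = [(c, v) for _, c, v in coo]
rowptr = [0]
for row in dense:
    rowptr.append(rowptr[-1] + sum(1 for v in row if v != 0))

# CSC: (row, val) pairs in column-major order plus column starting pointers.
csc = sorted(coo, key=lambda t: (t[1], t[0]))
csc_pairs = [(r, v) for r, _, v in csc]
colptr = [0]
for c in range(4):
    colptr.append(colptr[-1] + sum(1 for _, cc, _ in coo if cc == c))
```

Here `colptr[3] == 4` reproduces the example in the text: entry 4 of the CSC (row, val) list, namely (0, 8), begins column 3.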
Figure 4.4: (a) A directed graph and its representations in (b) adjacency matrix and (c) coordinate list.
In GRAPHR, we assume the coordinate list (COO) representation to store a graph. Given the graph in Figure 4.4¹ (a), its (sparse) adjacency matrix representation and COO representation (partitioned into four 4 × 4 subgraphs) are shown in Figure 4.4 (b) and (c), respectively. In this example, the coordinate list saves 61% of the storage space compared with the adjacency matrix representation. For real-world graphs with high sparsity, the saving is even higher: the coordinate list takes only 0.2% of the space of an adjacency matrix for WikiVote [269].

¹As shown in Section 4.1.3, each entry in a coordinate list should be a three-element tuple of (Source ID, Destination ID, Edge Weight). To simplify the example, we use an unweighted graph, so a two-element tuple of (Source ID, Destination ID) is sufficient to represent one edge.
4.2 GraphR Architecture
4.2.1 Sparse Matrix Vector Multiplication (SpMV) and Graph Processing
// Phase 1: compute edge values
for each edge E(V,U) from active vertex V do
    E.value = processEdge(E.weight, V.prop)
end for

// Phase 2: reduce and apply
for each edge E(U,V) to vertex V do
    V.prop = reduce(V.prop, E.value)
end for

Figure 4.5: Vertex programming model.
Figure 4.5 shows the general vertex programming model for graph processing. It follows the principle of "Think Like The Vertex" [157]. In each iteration, the computation is divided into two phases. In the first phase, each edge is processed with the processEdge function, which, based on the edge weight and the active source vertex V's property (V.prop), computes a value for that edge. In the second phase, each vertex reduces the values of all its incoming edges (E.value) and generates a property value, which is applied to the vertex as its updated property.
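The two phases can be sketched in plain Python; the function names mirror Figure 4.5, while the dictionary-based graph layout and the concrete processEdge/reduce functions below are only hypothetical choices for illustration:

```python
def iterate(edges, prop, process_edge, reduce_fn):
    """One iteration of the two-phase vertex programming model."""
    # Phase 1: compute a value for every edge from its source's property.
    evalue = {e: process_edge(w, prop[src])
              for e, (src, dst, w) in edges.items()}
    # Phase 2: each destination vertex reduces its incoming edge values.
    new_prop = dict(prop)
    for e, (src, dst, w) in edges.items():
        new_prop[dst] = reduce_fn(new_prop[dst], evalue[e])
    return new_prop

# Example: two edges into vertex 2; multiply as processEdge, sum as reduce.
edges = {0: (0, 2, 1.0), 1: (1, 2, 2.0)}
prop = iterate(edges, {0: 1.0, 1: 3.0, 2: 0.0},
               lambda w, p: w * p, lambda a, b: a + b)
```

Different algorithms plug different operator pairs into the same skeleton, which is exactly what makes the model a good hardware target.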
Figure 4.6 (a) shows the vertex program in graph view. V15, V7, V1, and V2 each execute a processEdge function, and the results are stored in the corresponding edges to V4. After all edges are processed, V4 executes the reduce function to generate its new property and update itself.
Figure 4.6: GRAPHR key insight: supporting graph processing with ReRAM crossbars.

In graph processing, the operation in the processEdge function is a multiplication, and to generate the new property for each vertex, what is essentially needed is the Multiply-Accumulate (MAC) operation. Moreover, the vertex program of each vertex is equivalent to a sparse matrix-vector multiplication (SpMV), shown in Figure 4.6 (b). Assume A is the sparse
adjacency matrix and x is a vector containing V.prop of all vertices; then the vertex program of all vertices can be computed in parallel in matrix view as A^T x. As shown in Figure 3.3 (c), a ReRAM crossbar can perform matrix-vector multiplication efficiently. Therefore, as long as a vertex program can be expressed in SpMV form, it can be accelerated by our approach. We discuss in Section 4.3 the different patterns for expressing algorithms in SpMV.
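A small self-contained check of this equivalence, using a made-up 3-vertex example with multiply as processEdge and sum as reduce (plain Python lists stand in for the sparse matrix and crossbar):

```python
A = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]          # A[i][j]: weight of edge i -> j
x = [1.0, 2.0, 3.0]      # V.prop of every vertex

# Vertex-program view: each vertex v reduces processEdge values of its in-edges.
vertex_view = [sum(A[u][v] * x[u] for u in range(3)) for v in range(3)]

# Matrix view: explicitly form A^T and compute the matrix-vector product.
AT = [[A[u][v] for u in range(3)] for v in range(3)]
matrix_view = [sum(AT[v][u] * x[u] for u in range(3)) for v in range(3)]
```

Both views produce the same vector, which is why a crossbar that stores A^T can execute the whole iteration as one analog SpMV.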
From Figure 4.6 (b), we see that the sizes of the matrix and vector are |V| × |V| and |V|, respectively, where |V| is the number of vertices in the graph. In the ideal case, if we were given a ReRAM crossbar (CB) of size |V| × |V|, the vertex program of every vertex could be executed in parallel and the new value (i.e., V.prop in Figure 4.5) could be computed in one cycle. More importantly, the memory that stores A^T could directly perform such in-memory computation without data movement. Unfortunately, this is unrealistic: the size of a CB is quite limited (e.g., 4 × 4 or 8 × 8), so to realize this idea, we need Graph Engines (GEs) composed of small CBs that process subgraphs in a certain manner. This is the ultimate design goal of the GRAPHR architecture. We face several key problems. First, as discussed in Section 4.1.3, the graph is stored in a compressed format, not as an adjacency matrix; to perform in-memory computation, the data needs to be converted to matrix format. Second, real-world large graphs may not fit in memory. Third, the order of subgraph processing needs to be carefully determined, because it affects the hardware cost of buffering the temporary results for reduction.
To solve these problems, Figure 4.6 (d) shows the high-level processing procedure. The GRAPHR architecture has both a memory module and a compute module. Each time, a block of a large graph is loaded into GRAPHR's memory module (in blue) in compressed sparse representation. If the memory module can fit the whole graph, there is only one block; otherwise, the graph is partitioned into several blocks. Inside GRAPHR, a number of GEs (in orange), each composed of a number of CBs, "scan" and process subgraphs (similar to a sliding window) in the streaming-apply model (Section 4.2.3). To enable in-memory computation, graph data are converted to matrix format by a controller. Such conversion is straightforward with a preprocessed edge list (Section 4.2.4). The intermediate results of subgraphs are stored in buffers, and simple computation logic is used to perform the reduction.
GRAPHR can be used in two settings: 1) multi-node: one can connect multiple GRAPHR nodes together to process large graphs. In this case, each block is processed by one GRAPHR node, and data movements happen between GRAPHR nodes. 2) out-of-core: one can use a single GRAPHR node as an in-memory graph processing accelerator to avoid using multiple threads and moving data through the memory hierarchy. In this case, all blocks are processed consecutively by a single GRAPHR node, and data movements happen between disk and the GRAPHR node. In this work, we assume the out-of-core setting, so we do not consider communication between nodes and its support; we leave this as future work. Next, we present the overall GRAPHR architecture and its workflow.
4.2.2 GraphR Architecture
Figure 4.7: GRAPHR accelerator architecture.
Figure 4.7 shows the architecture of one GRAPHR node. GRAPHR is a ReRAM-based memory module that performs efficient near-data parallel graph processing. It contains two key components: memory ReRAM and graph engines (GEs). Memory ReRAM stores graph data in the original compressed sparse representation. GEs perform efficient matrix-vector multiplications on the matrix representation. A GE contains a number of ReRAM crossbars (CBs), drivers (DRVs), and sample-and-hold (S/H) components placed in a mesh; they are connected with analog-to-digital converters (ADCs), shift-and-add units (S/A), and simple arithmetic and logic units (sALUs). The input and output registers (RegI/RegO) are used to buffer the data flow. We discuss the details of several components as follows.
Driver (DRV) is used to 1) load new edge data into the ReRAM crossbars for processing; and 2) feed input data into the ReRAM crossbars for matrix-vector multiplication.

Sample and Hold (S/H) is used to sample analog values and hold them before they are converted to digital form.

Analog to Digital Converter (ADC) converts analog values to digital format. Because ADCs have relatively high area and power consumption, an ADC is not attached to every bitline of the ReRAM crossbars in a GE, but is instead shared among bitlines. If the GE cycle is 64 ns, one ADC working at 1.0 GSps can convert all data from eight 8-bitline crossbars within one GE.
sALU is a simple customized arithmetic and logic unit. It performs operations that cannot be efficiently performed by ReRAM crossbars, such as comparison. The actual operation performed by the sALU depends on the algorithm and can be configured. We show more details when we discuss algorithms in Section 4.3.
Data Format. It is not practical for ReRAM cells to support a high resolution. Recent work [256] reported a 5-bit resolution for ReRAM programming. To alleviate the pressure on the driver, we conservatively assume 4-bit ReRAM cells. To support a higher computing resolution, e.g., 16 bits, the shift-and-add (S/A) unit is employed. A 16-bit fixed-point number M can be represented as M = [M3, M2, M1, M0], where each segment Mi is a 4-bit number. We can shift and add the results from four 4-bit-resolution ReRAM crossbars, i.e., D3 × 2^12 + D2 × 2^8 + D1 × 2^4 + D0, to get a 16-bit result.
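The shift-and-add recombination can be sketched in a few lines of Python; `split4` and `shift_and_add` are hypothetical helper names, not hardware primitives:

```python
def split4(m):
    """Split a 16-bit number into four 4-bit segments [M3, M2, M1, M0]."""
    return [(m >> s) & 0xF for s in (12, 8, 4, 0)]

def shift_and_add(d):
    """Recombine per-segment results: D3*2^12 + D2*2^8 + D1*2^4 + D0."""
    d3, d2, d1, d0 = d
    return (d3 << 12) + (d2 << 8) + (d1 << 4) + d0

# Round-tripping a value through the segments recovers it exactly.
assert shift_and_add(split4(0xBEEF)) == 0xBEEF
```

In the actual datapath each Di is the ADC output of one 4-bit-resolution crossbar, so the S/A unit only needs shifters and an adder tree rather than multipliers.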
When the sALU and S/A are bypassed, a graph engine can simply be considered a memory ReRAM mat. A similar scheme of reusing ReRAM crossbars for both computing and storing is employed in [88].
The I/O interface is used to load graph data and instructions into memory ReRAM and the controller, respectively. In GRAPHR, the controller is the software/hardware interface that executes simple instructions to: 1) coordinate graph data movements between memory ReRAM and GEs based on the streaming-apply execution model; 2) convert edges in the preprocessed coordinate list (assumed in this work, though other representations work as well) in memory ReRAM to sparse matrix format in the GEs; and 3) perform the convergence check.
Figure 4.8: Workflow of GRAPHR in the out-of-core setting. (C: CB size; N: number of CBs in a GE; G: number of GEs in a GRAPHR node; B: block size in number of vertices.)
while (true)
    load edges for next subgraph into GEs;
    process (processEdge) in GEs;
    reduce by sALU;
    if (check_convergence())
        break;

Figure 4.9: Controller operation.
Figure 4.8 shows the workflow of GRAPHR in an out-of-core setting. A GRAPHR node can be used as a drop-in in-memory graph processing accelerator. The loading of each block is performed in software by an out-of-core graph processing framework (e.g., GridGraph [151]). The integration is easy because such a framework already contains the code to load blocks from disk to DRAM; we just need to redirect the data to the GRAPHR node. Since the edge list is preprocessed in a certain order, loading each block involves only sequential disk I/O. Inside the GRAPHR node, the data is initially loaded into memory ReRAM, and the controller manages the data movements between GEs in a streaming-apply manner. The edge list preprocessing for GRAPHR needs to be carefully designed, and it is based on the architectural parameters.
Figure 4.9 shows the operations performed by controller.
4.2.3 Streaming-Apply Execution Model
Figure 4.10: Streaming-apply execution model.
In GRAPHR, all GEs process one subgraph at a time, and the order of processing is important, since it affects the hardware resource requirements. We present the streaming-apply execution model shown in Figure 4.10. The key insight is that subgraphs are processed in a streaming manner and the reductions are performed on the fly by the sALU. There are two variants of this model: column-major and row-major. During execution, RegI and RegO are used to store source vertices and updated destination vertices, respectively. In the column-major order in Figure 4.10 (a), subgraphs with the same destination vertices are processed together. The required RegO size equals the number of destination vertices in a subgraph. In the row-major order in Figure 4.10 (b), subgraphs with the same source vertices are processed together. The required RegO size is the total number of destination vertices of all subgraphs with the same source vertices. It is easy to see that row-major order requires a larger RegO. On the other hand, row-major order incurs fewer reads of RegI: only one read is needed for all subgraphs with the same source vertices. In GRAPHR, we use column-major order, since it requires fewer registers and, in ReRAM, the write cost is higher than the read cost.
Figure 4.10 (c) shows the global processing order in a complete out-of-core setting. We see that the whole graph is partitioned into four blocks; three of them are stored on disk in COO representation, and one is stored in GRAPHR's memory ReRAM, also in COO representation. Only the subgraph currently being processed by the GEs is in sparse matrix format. Due to the sparsity of graph data, if a subgraph is empty, the GEs can move on to the next subgraph. Therefore, sparsity only incurs waste inside a subgraph (e.g., when one GE has an empty matrix but others do not).
4.2.4 Graph Preprocessing
To support the streaming-apply execution model and the conversion from coordinate list representation to sparse matrix format, the edge list needs to be preprocessed so that the edges of consecutive subgraphs are stored together. This also ensures sequential disk and memory accesses when loading blocks and subgraphs. In the following, we explain the preprocessing procedure in detail.
Figure 4.11: Preprocessing the edge list.
Given the architectural parameters, the preprocessing is implemented in software and performed only once. Figure 4.11 illustrates these parameters and the subgraph order we should produce; this order is determined by the streaming-apply model. C is the size of a CB; N is the number of CBs in a GE; G is the number of GEs in a GRAPHR node; and B is the number of vertices in a block (i.e., the block size). Also, V is the number of vertices in the graph. Among the parameters, C, N, and G specify the architecture setting of a GRAPHR node, while B is determined by the size of the memory ReRAM in the node. In the example, the graph has 64 vertices (V = 64) and each block has 32 vertices (B = 32), so the graph is partitioned into 4 blocks, each of which is loaded from disk to the GRAPHR node in turn. Further, C = 4, N = 2, G = 2, so the subgraph size is C × (C×N×G) = 4 × (4×2×2) = 4 × 16. Therefore, each block is partitioned into 16 subgraphs, and the number on each is its global subgraph order. Our goal is to produce an ordered edge list according to this order so that edges can be loaded sequentially.
Before presenting the algorithm, we assume that the original edge list is first ordered by source vertex, and that all edges with the same source are then ordered by destination; in other words, edges are stored in row-major order in matrix view. We also assume that in the ordered edge list, the edges within a subgraph are stored in column-major order. Considering the problem in matrix view, for each edge (i, j) we first compute a global order ID I(i,j), giving a 3-tuple (i, j, I(i,j)). This global order ID takes all zeros into account; for example, if there are k zeros between two edges in the global order, the difference between their global order IDs is still k. All 3-tuples are then sorted by I(i,j), and the ordered edge list is obtained by outputting them in this order. The space and time complexity of this procedure are O(V) and O(V log V), respectively, with common methods. The key problem is to compute I(i,j) for each (i, j). Where there is no confusion, we simply denote I(i,j) as I. We show that I can be computed in a hierarchical manner.
Let BI denote the global block order of the block that contains (i, j). We assume that blocks are also processed in column-major order: B(0,0) → B(1,0) → B(0,1) → B(1,1). The coordinates of the block containing (i, j) are:

    B_i = ⌊i / B⌋,    B_j = ⌊j / B⌋    (4.1)

Based on the column-major order, BI is:

    BI = B_i + (V/B) × B_j    (4.2)

We assume that B divides V and, similarly, that C divides B and C×N×G divides B. Otherwise, we can simply pad zeros to satisfy these conditions; this does not affect the results, since the padded zeros do not correspond to actual edges. The block with order BI starts with the following global subgraph ID:

    start_global_ID(BI) = BI × B² / (C² × N × G) + 1    (4.3)
Next, we compute SI, the global subgraph ID of (i, j). The coordinates of the edge relative to the start of block BI = (B_i, B_j) are:

    i′ = i − B_i × B,    j′ = j − B_j × B    (4.4)

The coordinates of the subgraph relative to the start of the corresponding block are:

    SI_i′ = ⌊i′ / C⌋,    SI_j′ = ⌊j′ / (C×N×G)⌋    (4.5)

Then, we can compute SI:

    SI = (SI_i′ + SI_j′ × B/C) + start_global_ID(BI)
       = (SI_i′ + SI_j′ × B/C) + BI × B² / (C² × N × G) + 1    (4.6)

Finally, we compute SubI, the relative order of (i, j) within its subgraph SI. The coordinates relative to the subgraph are:

    SubI_i = i − (B × B_i) − (SI_i′ × C),
    SubI_j = j − (B × B_j) − (SI_j′ × C × N × G)    (4.7)

Since the edges in a subgraph are assumed to be stored in column-major order, SubI is:

    SubI = SubI_i + SubI_j × C + 1    (4.8)

With SI and SubI computed, we obtain I:

    I = (SI − 1) × (C² × N × G) + SubI    (4.9)
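Following this hierarchical block/subgraph decomposition, and assuming column-major order at every level, the whole computation of I can be sketched in Python with the parameters of Figure 4.11 (C = 4, N = 2, G = 2, B = 32, V = 64). This is a software model with a hypothetical helper name, not the controller's actual logic:

```python
def global_order_id(i, j, C=4, N=2, G=2, B=32, V=64):
    """Global order ID of matrix entry (i, j), counting zeros, 1-indexed."""
    W = C * N * G                        # subgraph width in columns
    Bi, Bj = i // B, j // B              # block coordinates
    BI = Bi + (V // B) * Bj              # column-major block order
    ip, jp = i - Bi * B, j - Bj * B      # offsets inside the block
    SIi, SIj = ip // C, jp // W          # subgraph coordinates in the block
    SI = SIi + SIj * (B // C) + BI * B * B // (C * C * N * G) + 1
    SubIi, SubIj = ip - SIi * C, jp - SIj * W   # offsets inside the subgraph
    SubI = SubIi + SubIj * C + 1         # column-major order inside subgraph
    return (SI - 1) * (C * C * N * G) + SubI
```

Sorting the (i, j, I) tuples by this ID yields an edge list whose consecutive runs are exactly the subgraphs, so loading them back is purely sequential.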
4.2.5 Discussion
// Phase 1: compute edge values
for each edge E(V,U) from active vertex V do
    E.value = r * V.prop / V.outdegree
end for

// Phase 2: reduce and apply
for each edge E(U,V) to vertex V do
    V.prop = sum(E.value) + (1-r) / Num_Vertex
end for

Figure 4.12: PageRank in vertex programming.
// Phase 1: compute edge values
for each edge E(V,U) from active vertex V do
    E.value = E.weight + V.prop
    activate(U)
end for

// Phase 2: reduce and apply
for each edge E(U,V) to vertex V do
    V.prop = min(V.prop, E.value)
end for

Figure 4.13: SSSP in vertex programming.
Table 4.1 compares different architectures for graph processing. GRAPHR improves over previous architectures due to two unique features. First, the computation is performed in an analog manner; the others use either instructions or specialized digital processing units.
Table 4.1: Comparison of different architectures for graph processing.

Process Edge — CPU: instruction; GPU: instruction; Tesseract [187]: instruction; GAA [186]: specialized AU; Graphicionado [185]: specialized unit; GRAPHR: ReRAM crossbar.
Reduce — CPU: instruction; GPU: instruction; Tesseract: instruction and inter-cube communication; GAA: specialized APU/SCU; Graphicionado: specialized unit; GRAPHR: ReRAM crossbar or sALU.
Processing model — CPU: sync/async; GPU: sync; Tesseract: sync; GAA: async; Graphicionado: sync; GRAPHR: sync.
Data movement — CPU: disk to memory (out-of-core), memory hierarchy; GPU: disk to memory, CPU/GPU memory, GPU memory hierarchy; Tesseract: between cubes (in-memory only); GAA: between memory and accelerator (in-memory only); Graphicionado: between modules in the memory pipeline, memory to SPM, SPM to/from processing units; GRAPHR: disk to memory (out-of-core) or between GRAPHR nodes (multi-node), between memory ReRAM and GEs (inside GRAPHR).
Memory access — CPU/GPU/Tesseract/GAA: random (vertex accesses and the start of each vertex's edge list), sequential (edge list); Graphicionado: reduced random accesses with SPM, pipelined memory access; GRAPHR: sequential edge list.
Generality — CPU/GPU: all algorithms; Tesseract/GAA/Graphicionado: vertex programs; GRAPHR: vertex programs in SpMV form.
This provides excellent energy efficiency. Second, all disk and memory accesses in GRAPHR are sequential, thanks to the preprocessing and the reduced flexibility in scheduling vertices. This is a good trade-off because performing parallel operations in ReRAM CBs is highly energy efficient. We believe that GRAPHR is the first architecture scheme using ReRAM CBs to perform graph processing, and this work presents a detailed and complete solution for integrating it as a drop-in accelerator for out-of-core graph processing systems. The architecture, the streaming-apply execution model, and the preprocessing algorithm are all novel contributions. Also, GRAPHR is general, because it can accelerate all vertex programs that can be expressed in SpMV form.
4.3 Mapping Algorithms in GE
In this section, we discuss two patterns for mapping algorithms to GEs: parallel MAC and parallel add-op. We use a typical example of each category (PageRank and SSSP, respectively) to explain the insights. More examples (but not all) of algorithms supported by GRAPHR are listed in Table 4.2. The first two follow the parallel MAC pattern and the last two follow the parallel add-op pattern.
4.3.1 Parallel MAC
In an algorithm, if the processEdge function performs a multiplication, which can be performed in each CB cell, we call it the parallel MAC pattern. The parallelization degree is roughly C×C×N×G (see the parameters in Figure 4.8).
PageRank [128] is an excellent example of this pattern. Figure 4.12 shows its vertex
program. It performs the following iterative computation: $\vec{PR}_{t+1} = r M \cdot \vec{PR}_t + (1-r)\vec{e}$,
where $\vec{PR}_t$ is the PageRank vector at iteration t, M is a probability transfer matrix, r is the probability of random
surfing and $\vec{e}$ is a vector of the probabilities of staying on each page.
We consider a small subgraph that can be processed by a 5×4 CB (the additional row
is used to perform addition). It contains at most 16 edges, represented by the intersection of 4 rows
and 4 columns in the sparse matrix (shown in Figure 4.15 (a)); thus, the block involves 8
vertices (i.e., i0∼i3, j0∼j3). The parameters for the 4×4-block PageRank
shown in Figure 4.15 (b1) are: M = [0, 1/2, 1, 0; 1/3, 0, 0, 1/2; 1/3, 0, 0, 1/2; 1/3, 1/2, 0, 0],
$\vec{e}$ = [1/4, 1/4, 1/4, 1/4]^T, r = 4/5.

We define $M_0 = rM$ and $\vec{e}_0 = (1-r)\vec{e}$, so that $\vec{PR}_{t+1} = M_0 \vec{PR}_t + \vec{e}_0$. In the CB in
Figure 4.15 (b2, b3), the values are already scaled by r. Figure 4.15 (b3) shows the
mapping of the PageRank algorithm to a CB. The additional row is used to implement the
addition of $\vec{e}_0$. The sALU is configured to perform the add operation in the reduce function to
add PageRank values (Figure 4.14 (a)). To check convergence, the new PageRank value is
compared with its value in the previous iteration; the algorithm has converged if the difference is
less than a threshold.
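As a sanity check of the iteration above, the 4×4 block can be simulated directly in software. The following sketch (plain Python, with illustrative variable names, not the GRAPHR hardware) iterates $\vec{PR}_{t+1} = r M \cdot \vec{PR}_t + (1-r)\vec{e}$ on the example values until the per-element change falls below a threshold.

```python
# Software simulation of the PageRank iteration on the 4x4 example block.
r = 4 / 5
M = [[0, 1/2, 1, 0],
     [1/3, 0, 0, 1/2],
     [1/3, 0, 0, 1/2],
     [1/3, 1/2, 0, 0]]          # column-stochastic transfer matrix
e = [1/4, 1/4, 1/4, 1/4]
pr = [1/4, 1/4, 1/4, 1/4]       # initial PageRank vector

while True:
    new = [r * sum(M[i][j] * pr[j] for j in range(4)) + (1 - r) * e[i]
           for i in range(4)]
    done = max(abs(new[i] - pr[i]) for i in range(4)) < 1e-12
    pr = new                    # convergence check against previous iteration
    if done:
        break
```

Because M is column-stochastic, the PageRank values remain a probability distribution (they sum to 1) throughout the iteration.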
Figure 4.14: The configuration of sALU to perform (a) add in PageRank and (b) min in SSSP.
4.3.2 Parallel Add-Op

In an algorithm, if the processEdge function performs an addition, which can be executed
for one CB row at a time, we call it the parallel add-op pattern. The op specifies the operation
in reduce that is implemented in the sALU. The parallelization degree is roughly (C×N×G).

Single Source Shortest Path (SSSP) [270] is a typical example of this pattern. Figure 4.13
shows the vertex program. We see that processEdge performs an addition and reduce
performs a min operation. Therefore, the sALU is configured to perform the min operation, as shown in
Figure 4.14 (b).
In SSSP, each vertex v is given a distance label dist(v) that maintains the length of the
shortest known path to vertex v from the source. The distance label is initialized to 0 at the
source and ∞ at all other vertices. The SSSP algorithm then iteratively applies the relaxation
operator, which is defined as follows: if (u,v) is an edge and dist(u)+w(u,v) < dist(v),
the value of dist(v) is updated to dist(u)+w(u,v). An active vertex is relaxed by applying
this operator to all the edges connected to it. Each relaxation may lower the distance label
of some vertex, and when no further lowering can be performed anywhere in the graph, the
resulting distance labels indicate the shortest distances from the source to each vertex,
regardless of the order in which the relaxations were performed. Breadth-first numbering of
a graph is a special case of SSSP where all edge labels are 1.
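The relaxation loop above can be written down directly. This is a minimal worklist-style software sketch in plain Python, not the crossbar implementation; the edge-list representation is illustrative.

```python
import math

def sssp(edges, num_vertices, source):
    """Iteratively apply the relaxation operator until no vertex is active.
    edges: dict mapping u -> list of (v, w(u, v)) pairs (illustrative layout)."""
    dist = [math.inf] * num_vertices
    dist[source] = 0
    active = {source}                    # the source is the only active vertex
    while active:
        u = active.pop()
        for v, w in edges.get(u, []):
            if dist[u] + w < dist[v]:    # relaxation operator
                dist[v] = dist[u] + w
                active.add(v)            # a lowered label re-activates v
    return dist
```

For example, `sssp({0: [(1, 1), (2, 5)], 1: [(2, 3), (3, 1)], 3: [(2, 1)]}, 4, 0)` returns `[0, 1, 3, 2]`, regardless of the order in which active vertices are popped.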
We explain the mapping of the SSSP algorithm to a CB using a small 8-vertex subgraph
Figure 4.15: The processing of (a) PageRank and (b) SSSP in GRAPHR.
corresponding to a 4×4 block in the sparse matrix, as shown in Figure 4.15 (c1). It can
be represented by the adjacency matrix W = [M, 1, 5, M; M, M, 3, 1; M, M, M, M; M, M, 1, M],
where M indicates that no edge connects the two vertices and is set to a reserved maximum
value for a memory cell in a CB. The values are stored in the CB as shown in Figure 4.15 (c2).
Given a vertex u and dist(u), the row in the adjacency matrix W for u indicates w(u,v),
so we can perform the relaxation operator (i.e., the addition) for u in parallel. Here, SpMV is
only used to select a row in the CB by multiplying with a one-hot vector.
The relaxation operator of u involves reading: 1) dist(u), which is computed iteratively from
the source (for the source vertex, the initial value is zero); 2) the vector of dist(v) before
the relaxation operator, which indicates the distances between the source and all other
vertices and is also computed iteratively from the source; in our example, the initial values for the source
vertices (i0, i1, i2, i3) are [4, 3, 1, 2]; 3) the vector of w(u,v), which holds the
distances from u to the destination vertices in the subgraph and can be obtained by reading a
row of the adjacency matrix W.
Figure 4.15 (c3) shows the process of performing SSSP in a 5×4 CB. The last row (green
squares) is set to a fixed value of 1, which is used to add dist(u) (the input to the last wordline)
to each w(u,v) in the relaxation operator. The initial values of dist(v) for the destination
vertices (j0, j1, j2, j3) are [7, 6, M, M]. The vector of w(u,v) is obtained by activating the
wordline associated with vertex u. In time slot (t = 1), wordline 0 (for source vertex i0) is
activated (the red square with input value 1) and the value 4 (the current dist(u) for
source i0) is input to the last wordline (the green box). The vector of dist(v)
for the destination vertices (j0, j1, j2, j3) is set as the output value at each bitline,
which is [7, 6, M, M]. With this mapping, the current summation in the bitlines in Figure 4.15
(c3-1) is [1×M+4×1, 1×1+4×1, 1×5+4×1, 1×M+4×1] = [M, 5, 9, M]. This is
dist(u)+w(u,v) computed in parallel, where u is the source vertex. The distance
from the source to each vertex v is then compared with the initial dist(v) ([7, 6, M, M]) by an array
of comparators, and the final bitline output is [7, 5, 9, M], the updated
dist(v) vector after time slot t = 1. The parallel comparisons are performed by vertex-related
operations; different algorithms may require different functions on the vertices.
Table 4.2: Property and operations of applications in GRAPHR.

Applications | Vertex Property      | processEdge()
SpMV         | Multiplication Value | E.value = V.prop / V.outdegree * E.weight
PageRank     | Page Rank Value      | E.value = r * V.prop / V.outdegree
BFS          | Level                | E.value = 1 + V.prop
SSSP         | Path Length          | E.value = E.weight + V.prop

Applications | reduce()                                   | Active Vertex List
SpMV         | V.prop = sum(E.value)                      | Not Required
PageRank     | V.prop = sum(E.value) + (1-r) / Num Vertex | Not Required
BFS          | V.prop = min(V.prop, E.value)              | Required
SSSP         | V.prop = min(V.prop, E.value)              | Required
After time slot t = 1, we move to the next vertex in Figure 4.15 (c3-2), where we i)
activate the second wordline; ii) change the input to the last wordline to 3 (the distance
label of source vertex i1); and iii) set the intermediate dist(v) to the final bitline output
of time slot t = 1, which is [7, 5, 9, M]. Similar to Figure 4.15 (c3-1), the current
summation in the bitlines is [M, M, 6, 4]; it is compared with [7, 5, 9, M], yielding the final
bitline output for time slot t = 2 as [7, 5, 6, 4]. We indicate the updated distance labels
using orange squares. The operations in time slots t = 3 and t = 4 are performed in a
similar manner.
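The four time slots of Figure 4.15 (c3) can be mimicked numerically. In this sketch, M is modeled as a large reserved constant (99 here, an arbitrary stand-in), and the bitline summation and the comparator array are modeled with ordinary arithmetic; saturation of M is not modeled, which does not affect the result because the min always discards those entries.

```python
M_VAL = 99                           # reserved "no edge" maximum value (M)
W = [[M_VAL, 1,     5,     M_VAL],
     [M_VAL, M_VAL, 3,     1],
     [M_VAL, M_VAL, M_VAL, M_VAL],
     [M_VAL, M_VAL, 1,     M_VAL]]
dist_src = [4, 3, 1, 2]              # dist(u) for the source vertices i0..i3
dist_dst = [7, 6, M_VAL, M_VAL]      # initial dist(v) for destinations j0..j3

for u in range(4):                   # time slots t = 1..4, one wordline each
    # bitline summation: the one-hot row select picks w(u, v), and the extra
    # all-ones row driven with dist(u) adds it, giving dist(u) + w(u, v)
    sums = [W[u][v] + dist_src[u] for v in range(4)]
    # the comparator array keeps the smaller label (the min in reduce)
    dist_dst = [min(dist_dst[v], sums[v]) for v in range(4)]
```

After the four slots, `dist_dst` is [7, 5, 3, 4], matching the updated labels marked in orange in the figure.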
Initially, before processing the block, the active indicator for each destination vertex
is set to FALSE. After the serial processing in the CB, the active indicators of all vertices
that have been updated (marked in orange in Figure 4.15 (c3)) are set to TRUE, indicating
that they are active in the next iteration. In our example, j1, j2, j3 are marked active.
Referring to Figure 4.15, this means that the distance labels of these vertices have been
updated. Because we process the block in the CB, the active indicator of a destination
vertex may be accessed multiple times, but if it is set to TRUE at least once, the
vertex is active in the next iteration. Globally, after all active source vertices and the corresponding
edges are processed in an iteration, the source vertex properties (values and active indicators)
that hold the old values are updated by the properties of the same vertices on the destination side.
The new source vertex properties are used in the next iteration. We can check whether there are
still active vertices to determine convergence.
4.4 Evaluation
4.4.1 Graph Datasets and Applications
Table 4.3 shows the datasets used in our evaluation. We use seven real-world graphs:
WikiVote (WV), Slashdot (SD), Amazon (AZ), WebGoogle (WG), LiveJournal (LJ), Orkut (OK)
and Netflix (NF). We run PageRank (PR), breadth-first search (BFS), single-source shortest
path (SSSP) and sparse matrix-vector multiplication (SpMV) on the first six datasets. On
Table 4.3: Graph datasets.
Dataset                | # Vertices               | # Edges
WikiVote (WV) [269]    | 7.0K                     | 103K
Slashdot (SD) [269]    | 82K                      | 948K
Amazon (AZ) [269]      | 262K                     | 1.2M
WebGoogle (WG) [269]   | 0.88M                    | 5.1M
LiveJournal (LJ) [269] | 4.8M                     | 69M
Orkut (OK) [271]       | 3.0M                     | 106M
Netflix (NF) [272]     | 480K users, 17.8K movies | 99M
Netflix (NF), we run collaborative filtering (CF) with a feature length of 32.
4.4.2 Experiment Setup
In our experiments, we compare GRAPHR with a CPU baseline platform, a GPU platform
and a PIM-based architecture [187]. PR, BFS, SSSP and SpMV on the CPU platform
are based on the software framework GridGraph [151], while collaborative filtering is
based on GraphChi [152]. PR, BFS, SSSP and SpMV on the GPU platform are based
on Gunrock [273], while CuMF SGD [274] is the GPU framework for CF. We evaluate the
PIM-based architecture on zSim [275], a scalable x86-64 multicore simulator. We modified
zSim with an HMC memory and interconnection model, heterogeneous compute units, an on-chip
network and other hardware features. The results are validated against NDP [276],
which also extends zSim for HMC simulation. In all experiments, the graph data fit in
memory. We also exclude the disk I/O time from the execution time of the CPU/GPU-based
platforms.
Specifications of the CPU and GPU platforms are shown in Table 4.4 and Table 4.5.
The CPU energy consumption is estimated from the Intel Product Specifications [277], while
the NVIDIA System Management Interface (nvidia-smi) is used to estimate the energy consumption
of the GPU. The execution times for the CPU/GPU platforms are measured in the computing
frameworks.
Table 4.4: Specifications of the CPU platform.
CPU     | Intel Xeon E5-2630 V3, 8 cores, 2.40 GHz
        | 8 × (32+32) KB L1 Cache
        | 8 × 256 KB L2 Cache
        | 20 MB L3 Cache
Memory  | 128 GB
Storage | 1 TB

2 CPUs, a total of 32 threads.
Table 4.5: Specifications of the GPU platform.
Graphic Card     | NVIDIA Tesla K40c
Architecture     | Kepler
# CUDA Cores     | 2880
Base Clock       | 745 MHz
Graphic Memory   | 12 GB GDDR5
Memory Bandwidth | 288 GB/s
CUDA Version     | 7.5
To evaluate GRAPHR, for the ReRAM part, we use NVSim [184] to estimate time and
energy consumption. The HRS/LRS resistances are 25 MΩ/50 KΩ, the read voltage (Vr) and
write voltage (Vw) are 0.7 V and 2 V respectively, and the LRS/HRS currents are 40 µA and
2 µA respectively. The read/write latencies and read/write energy costs used are 29.31 ns /
50.88 ns and 1.08 pJ / 3.91 nJ respectively, from the data reported in [264]. Programming a
bipolar ReRAM cell switches it from High to Low resistance or the inverse. For a multi-level cell,
programming changes the resistance to an intermediate state between High and Low, where
the intermediate state is determined by the length of the programming voltage pulse; the
full swing between High and Low is therefore the worst case. Note that [256, 262] describe a ReRAM
programming prototype which includes: 1) writing circuitry; 2) the ReRAM cell/array; and 3)
conversion circuitry, and demonstrate the possibility of 1% accuracy for multi-level cells.
However, in a real production system, only the writing circuitry and the ReRAM cell/array are
needed; there is no need for the conversion circuitry, as we just need to "acquiesce" a writing
precision. Therefore, this energy cost estimation for 4-bit cell programming is reasonable
and more conservative than two recent ReRAM-based accelerators [88, 89].
For on-chip registers, we use CACTI 6.5 [278] at 32 nm for modeling. For the
analog/digital converters, we use the data from [279]. The system performance is modeled by code instrumentation.
The ReRAM crossbar size S, the number of ReRAM crossbars per graph engine C and the
number of graph engines are 8, 32 and 64, respectively.
4.4.3 Performance Results
Figure 4.16 compares the performance of GRAPHR and the CPU platform. The CPU implementation
is used as the baseline and the execution times of applications on GRAPHR are normalized
to it. Compared to the CPU platform, the geometric mean of the speedups of the GRAPHR
architecture over all 25 executions is 16.01×. Among all applications and datasets, the highest
speedup achieved by GRAPHR is 132.67×, on SpMV on the WikiVote dataset.
The lowest speedup is 2.40×, on SSSP on the OK dataset. PageRank and
SpMV follow the parallel MAC pattern and achieve higher speedups over the CPU-based platform.
For BFS and SSSP, which follow the parallel add-op pattern, GRAPHR achieves lower speedups
because only the additions are parallelized.
Figure 4.16: GRAPHR speedup normalized to CPU platform.
Figure 4.17: GRAPHR energy saving normalized to CPU platform.
4.4.4 Energy Results
Figure 4.17 shows the energy savings of the GRAPHR architecture over the CPU platform. The
geometric mean of the energy savings of all applications compared to the CPU is 33.82×. The
highest energy saving achieved by GRAPHR is 217.88×, on sparse matrix-vector multiplication
on the SD dataset. The lowest energy saving, 4.50×, happens on SSSP on the
OK dataset. GRAPHR obtains its high energy efficiency from the non-volatility
of ReRAM and its in-situ computation capability.
4.4.5 Comparison to GPU Platform
GPUs take advantage of a large number of threads (CUDA cores) for high parallelism.
The GPU used in the comparison has 2880 CUDA cores, while GRAPHR has a
comparable number (2048 = 32×64) of crossbars. To compare with the GPU, we run PageRank
and SSSP on the LiveJournal dataset, and collaborative filtering (CF) on the Netflix dataset.

The performance and energy saving normalized to the CPU are shown in Figure 4.18 (a)
and (b). Overall, the performance of GRAPHR is higher than that of the GPU when the
data transfer time between CPU memory and GPU memory (an overhead GRAPHR
does not incur) is considered. GRAPHR achieves performance gains ranging from 1.69× to 2.19×
over the GPU. More importantly, GRAPHR consumes 4.77× to 8.91× less energy than the GPU.
Figure 4.18: GRAPHR (a) performance and (b) energy saving compared to GPU platform.
Figure 4.18 (a) shows that GRAPHR achieves higher speedups on PageRank and CF, where
MACs dominate the computation and are fully supported by both GRAPHR and the GPU. For SSSP,
the vertex-related traversal in GRAPHR requires accesses to main memory and storage,
whereas in the GPU a cache-based memory hierarchy better supports the random accesses, so the
speedup of GRAPHR is lower. The reason GRAPHR still has a gain on SSSP is its
sequential access pattern. For energy saving, GRAPHR is better than the GPU. Besides the energy
saved by in-situ computation in the GEs and the in-memory processing system design,
Fig. 17 in [185] shows that in a conventional CMOS system, static energy consumption by
eDRAM (memory) accounts for the majority of the energy consumption. As the technology node
scales down, leakage power dominates in CMOS systems. In contrast, ReRAM has almost zero
static energy leakage, so GRAPHR achieves higher energy saving.
4.4.6 Comparison to PIM Platform
We compare GRAPHR with a PIM-based architecture (i.e., Tesseract [187]). The performance
and energy saving normalized to the CPU are shown in Figure 4.19 (a) and (b). GRAPHR
gains a speedup of 1.16× to 4.12×, and is 3.67× to 10.96× more energy efficient,
compared to the PIM-based architecture.
Figure 4.19: GRAPHR (a) performance and (b) energy saving compared to PIM platform.
4.4.7 Sensitivity to Sparsity
We use #Edges/(#Vertices)² to represent the density of a dataset; as the density decreases,
the sparsity increases. Figure 4.20 (a) and (b) show the performance and energy saving of
GRAPHR (compared to the CPU platform) versus the density of the datasets. As the sparsity
increases, the performance and energy saving slightly decrease, because the number of
edge blocks to be traversed increases, which lengthens the edge access time and consumes
more energy.
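The density metric can be computed directly from the counts in Table 4.3 (the values below are the table's approximate figures):

```python
# Density = #Edges / (#Vertices)^2, using approximate counts from Table 4.3.
datasets = {
    "WV": (7.0e3, 103e3),
    "SD": (82e3, 948e3),
    "AZ": (262e3, 1.2e6),
    "WG": (0.88e6, 5.1e6),
    "LJ": (4.8e6, 69e6),
}
density = {name: edges / (vertices ** 2)
           for name, (vertices, edges) in datasets.items()}
```

The densities fall from roughly 2×10⁻³ (WV) down to 3×10⁻⁶ (LJ), the sparsity ordering used on the x-axis of Figure 4.20.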
Figure 4.20: GRAPHR (a) performance and (b) energy saving vs. dataset density.
Chapter 5
Low-Cost Floating-Point Processing in ReRAM
5.1 Background
5.1.1 MVM Acceleration in ReRAM
Resistive Random Access Memory (ReRAM) [125, 280] has recently demonstrated tremendous
potential for efficiently accelerating the computing kernel of machine learning
models, i.e., MVM. Conceptually, each element of a matrix M is mapped to the conductance state of
a ReRAM cell, while the input vector x is encoded as voltage levels applied on
the wordlines of the ReRAM crossbar. The current accumulation on the bitlines
is then proportional to the dot-product of the stored conductances and the wordline
voltages, representing the result y = M×x. Such in-situ computation significantly reduces the
costly memory accesses of MVM processing engines [256] and, most importantly, provides
massive opportunities to exploit the inherent parallelism of an N×N ReRAM crossbar.
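As a purely numerical illustration (not a circuit model), the bitline accumulation can be written as nested dot-products. The layout below stores row i of M along bitline i, so the outputs reproduce y = M × x; the function name is illustrative.

```python
def crossbar_mvm(M, x):
    """Ideal crossbar MVM: each cell conductance holds a matrix element, each
    wordline voltage holds an input, and bitline i accumulates the dot-product
    sum_j M[i][j] * x[j] (idealized current summation)."""
    n = len(x)
    return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]

y = crossbar_mvm([[1, 2], [3, 4]], [5, 6])   # [1*5+2*6, 3*5+4*6] = [17, 39]
```

Every multiply-accumulate in this double loop happens concurrently in the analog array, which is the source of the crossbar's parallelism.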
ReRAM-based MVM processing engines are fixed-point hardware in nature, because the
matrix and the vector are represented in discrete conductance states and voltage levels,
respectively [125]. If ReRAM is used to support floating-point MVM operations, a
large number of crossbars must be provisioned for mantissa alignment, resulting in a very high
hardware cost. Fortunately, fixed-point precision is acceptable for machine
learning applications thanks to low-precision and quantized neural network algorithms
[173, 281–285]. Many fixed-point accelerators [88, 89, 91, 92, 218, 222, 223, 286] are
built with ReRAM MVM processing engines and achieve reasonably good classification
accuracy.
Figure 5.1: Fixed-point (integer) MVM in ReRAM.
\begin{equation}
\begin{bmatrix} 368 \\ 354 \\ 207 \\ 387 \end{bmatrix}_d
=
\begin{bmatrix} 0 & 13 & 7 & 11 \\ 11 & 14 & 3 & 8 \\ 9 & 5 & 2 & 5 \\ 14 & 6 & 9 & 15 \end{bmatrix}_d^{T}
\times
\begin{bmatrix} 6 \\ 12 \\ 6 \\ 13 \end{bmatrix}_d
=
\begin{bmatrix} 0000 & 1101 & 0111 & 1011 \\ 1011 & 1110 & 0011 & 1000 \\ 1001 & 0101 & 0010 & 0101 \\ 1110 & 0110 & 1001 & 1111 \end{bmatrix}_b^{T}
\times
\begin{bmatrix} 0110 \\ 1100 \\ 0110 \\ 1101 \end{bmatrix}_b
\quad (5.1)
\end{equation}
Figure 5.1 shows an example of computing the fixed-point MVM in Eq. (5.1) using a
ReRAM-based MVM engine with single-bit precision. Before the computation, the decimal
integers in both the matrix and the vector are converted to binary, where the
precision of both the matrix and the input vector is set to 4 bits. The matrix is bit-sliced into four one-bit
matrices that are mapped to four crossbars, i.e., M-b3, M-b2, M-b1 and M-b0 in the figure,
and the input vector is bit-sliced into four one-bit vectors, i.e., V-b3, V-b2, V-b1 and V-b0.
The multiplication is performed in a pipelined manner. The accumulated result of each crossbar
is initialized with a zero vector S0. In the first cycle C1, the most significant bit (MSB)
vector V-b3 is applied on the wordlines of the four crossbars, and the multiplication results
of V-b3 with M-b3, M-b2, M-b1 and M-b0, denoted by O0, are obtained in parallel. In
cycle C2, S0 is shifted by 1 bit to get S1, and V-b2 is input to the crossbars to get the
multiplication results O1. Similar operations are performed in C3 and C4. After C4, we get
S4, the result of multiplying the input vector with the four bit-slices of the matrix. In the
following three cycles C5 to C7, we shift and add the S4 results from the four crossbars to get the
final multiplication result. For the fixed-point multiplication of an $N_M$-bit matrix with an
$N_v$-bit vector, the processing cycle count is $C_{int} = N_v + (N_M - 1)$.
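The pipeline of Figure 5.1 can be checked in software. This is a hedged sketch: it bit-slices a 4-bit matrix and vector and reproduces the shift-and-add schedule functionally (the cycle-level pipelining is collapsed into loops, and all names are illustrative).

```python
def bit_sliced_mvm(M, v, n_bits=4):
    """Compute M^T * v by bit-slicing, mirroring Figure 5.1: one crossbar per
    matrix bit-slice, one input cycle per vector bit (MSB first), then a final
    shift-and-add across the matrix bit-slices."""
    n = len(v)
    S = [[0] * n for _ in range(n_bits)]         # per-crossbar partial sums
    for b in range(n_bits - 1, -1, -1):          # cycles C1..C4: vector bits
        v_bit = [(x >> b) & 1 for x in v]
        for k in range(n_bits):                  # each matrix bit-slice
            O = [sum(((M[i][j] >> k) & 1) * v_bit[i] for i in range(n))
                 for j in range(n)]              # bitline sums of one crossbar
            S[k] = [2 * s + o for s, o in zip(S[k], O)]   # shift-and-add
    out = [0] * n
    for k in range(n_bits - 1, -1, -1):          # cycles C5..C7: reduce slices
        out = [2 * o + s for o, s in zip(out, S[k])]
    return out

M = [[0, 13, 7, 11], [11, 14, 3, 8], [9, 5, 2, 5], [14, 6, 9, 15]]
v = [6, 12, 6, 13]
result = bit_sliced_mvm(M, v)    # [368, 354, 207, 387], matching Eq. (5.1)
```

The outer loop corresponds to the $N_v$ input cycles and the final reduction to the $N_M - 1$ extra cycles, giving the $C_{int} = N_v + (N_M - 1)$ count stated above.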
5.1.2 Fixed-Point and Floating-Point
Figure 5.2: The bit layout of (a) an 8-bit signed integer and (b) a 64-bit double-precision floating-point number.
We use an 8-bit signed integer and an IEEE 754-2008 standard [287] 64-bit double-precision
floating-point number as examples to compare fixed-point and floating-point numbers.
These designs refer to the format used to store and manipulate the digital representation
of the data. As shown in Figure 5.2 (a), fixed-point numbers represent integers, i.e.,
positive and negative whole numbers, via a sign bit followed by multiple (e.g., i) value bits,
yielding a data range of $-2^i$ to $2^i - 1$. IEEE 754 double-precision floating-point numbers,
shown in Figure 5.2 (b), are designed to represent and manipulate rational numbers; a number
is represented with one sign bit (s), an 11-bit exponent (e), and a 52-bit fraction
($b_{51}b_{50}\ldots b_0$). The value of a double-precision floating-point number is interpreted as
$(-1)^s \times (1.b_{51}b_{50}\ldots b_0) \times 2^{e-1023}$, yielding a dynamic data range
from $\pm 2.2\times10^{-308}$ to $\pm 1.8\times10^{308}$.
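The field split can be inspected with the standard library. The small helper below (illustrative, using struct) extracts the three fields of Figure 5.2 (b) from a double's bit pattern.

```python
import struct

def decode_double(x):
    """Split a 64-bit double into (sign, biased exponent, fraction) per
    IEEE 754-2008, matching the layout of Figure 5.2 (b)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exp = (bits >> 52) & 0x7FF              # 11-bit biased exponent
    frac = bits & ((1 << 52) - 1)           # 52-bit fraction
    return sign, exp, frac

# value = (-1)^s * (1.frac) * 2^(exp - 1023)
fields = decode_double(1.0)    # (0, 1023, 0): +1.0 * 2^0
```

For instance, -2.0 decodes to sign 1, exponent 1024 and fraction 0, i.e., $-1 \times 1.0 \times 2^{1}$.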
Different data formats deliver different computational performance and hence different final
precision. From the data storage perspective, the storage characteristics and
space required for fixed-point and floating-point numbers also differ significantly.
The choice of an appropriate data format is influenced by both the application's computational
accuracy requirements and the available hardware storage resources. In general, however,
floating-point is the norm for high-precision computation because it supports a wider
range of data values and can precisely represent very small or very large numbers.
5.1.3 Scientific Computing
initiate x = x0
while (not converge) do
    // Step 1: compute the residual
    r = b - A * x
    // Step 2: compute the correction
    compute p
    // Step 3: update the current solution
    x = x + p
end while
Code 5.1: The iterative solver.
Scientific computing is an interdisciplinary field that solves computational problems
in a wide range of disciplines, including physics, mathematics, chemistry, biology, engineering,
and other natural sciences [288–290]. These complex computing problems are
normally modeled by systems of large-scale Partial Differential Equations (PDEs). Since
it is almost impossible to directly obtain the analytical solution of those PDEs, a common
practice is to discretize the continuous PDEs into a linear model Ax = b [166, 167] to be solved
by numerical methods. The numerical solution of this linear system is usually obtained by
an iterative solver [291–293]. Code 5.1 illustrates a typical computing process in iterative
methods. The vector x to be solved is first initialized, usually to an all-zero
vector x0, followed by three steps in the main body: (1) the residual (error) of the current
solution vector is calculated as r = b − Ax; (2) to improve the estimated
solution, a correction vector p is computed based on the current residual r; (3) the current
solution vector is improved by adding the correction vector, x = x + p, aiming to reduce
the residual in the next iteration. The iterative solver stops
when a defined convergence criterion is satisfied. Two widely used criteria
are (1) the iteration index reaching a preset threshold K, or (2) the L2 norm
of the residual (res = ||b − Ax||₂) falling below a preset threshold τ. Notably, all the values
involved in Code 5.1 are implemented as double-precision floating-point numbers to meet
the high-precision requirements of mainstream scientific applications.
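For concreteness, the three steps of Code 5.1 specialize to Conjugate Gradient roughly as follows. This is a textbook dense-matrix sketch in plain Python (not the accelerator implementation) for a symmetric positive-definite A; names and thresholds are illustrative.

```python
def cg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate Gradient following the residual/correction/update structure
    of Code 5.1; A is a dense SPD matrix given as nested lists."""
    n = len(b)
    matvec = lambda M, u: [sum(M[i][j] * u[j] for j in range(n)) for i in range(n)]
    x = [0.0] * n                                     # x = x0 (all zeros)
    r = [b[i] - Axi for i, Axi in enumerate(matvec(A, x))]  # step 1: residual
    p = r[:]                                          # first correction vector
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / sum(pi * Api for pi, Api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]      # step 3: update x
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]   # residual recurrence
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:                       # ||b - Ax||_2 < tau
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]  # step 2: new p
        rs = rs_new
    return x
```

For example, `cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` converges in two iterations to approximately [1/11, 7/11].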
The various iterative methods follow the above computational steps and differ only in
the calculation of the correction vectors. Among all candidate solutions, the Krylov subspace
approach is the standard method today. In this work, we focus on two representative Krylov
subspace solvers, Conjugate Gradient (CG) [294] and Stabilized BiConjugate Gradient
(BiCGSTAB) [295]. The computational kernel of both methods is large-scale sparse
floating-point matrix-vector multiplication, y = Ax, as aforementioned, which requires
support for floating-point computation and imposes significant challenges on the
underlying computing hardware.
5.2 Motivation and REFLOAT Ideas
5.2.1 Motivation
Given the huge success of ReRAM-based highly parallel MVM engines in machine learning
[88, 89, 91, 92, 221, 286], it is naturally attractive to design accelerators that extend
ReRAM-based computing to floating-point SpMV in scientific computing. Unfortunately,
existing ReRAM-based accelerators are limited to models that have an inherent
tolerance for low-precision fixed-point approximation (e.g., machine learning) and have rarely
been explored for floating-point processing in scientific computing. The most relevant work
in this direction is [189]. In that work, the sparse matrix in scientific computing is first
partitioned into multiple blocks. Each floating-point element in a block is converted to a
fixed-point number by adding approximately as many padding bits as effective data bits.
The fixed-point numbers with extended bits are then deployed onto fixed-point ReRAM
crossbars [189] for processing. To facilitate the quantitative assessment of system
performance and hardware overhead, we select the cycle number T and the crossbar number
C as two metrics: (1) T is directly related to performance, as a smaller T indicates higher
performance and vice versa, and (2) C directly reflects the ability to execute floating-point
MVMs in parallel under a limited number of on-chip ReRAM crossbars [89, 175, 189];
specifically, the smaller C is, the stronger the parallelism. The primary goal of this work is
to reduce the cost of floating-point processing in ReRAM.
Crossbar number. Suppose we are computing the multiplication of a matrix block M
with a vector segment v. In the matrix block M, the number of fraction bits is $f_M$ and the
number of exponent bits is $e_M$; in the vector segment v, the number of fraction bits is $f_v$
and the number of exponent bits is $e_v$. To map the matrix fraction to ReRAM crossbars,
we need $(f_M + 1)$ crossbars, because the fraction is normalized to a value with
a leading 1. For example, (52+1) crossbars are needed to represent the 52-bit fraction of
double floating-point precision in [189]. To map the matrix exponent to ReRAM crossbars,
we need $2^{e_M}$ crossbars for the $e_M$-bit exponent states, which is called padding in [189],
where 64-bit paddings are needed for $e_M = 6$. Thus, C is calculated as

$C = 4 \times (2^{e_M} + f_M + 1)$, (5.2)

where the leading multiplier 4 is contributed by the sign bits of the matrix block and the
vector segment.
Cycle number. Suppose the precision of the digital-analog converters is 1 bit, as in
[89, 189]. The number of value states in the vector segment is $(2^{e_v} + f_v + 1)$. For each input
state, we need $(2^{e_M} + f_M + 1)$ cycles to perform the shift-and-add that reduces the partial results from
the ReRAM crossbars. For higher computation efficiency, a pipelined input-and-reduce
scheme [89] is used. Thus, T is calculated as

$T = (2^{e_v} + f_v + 1) + (2^{e_M} + f_M + 1) - 1$. (5.3)
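Eqs. (5.2) and (5.3) are easy to tabulate. The two helpers below simply evaluate them; the (e_M, f_M) = (6, 52) point is the configuration attributed to [189].

```python
def crossbar_count(e_M, f_M):
    """Eq. (5.2): C = 4 * (2^e_M + f_M + 1)."""
    return 4 * (2 ** e_M + f_M + 1)

def cycle_count(e_v, f_v, e_M, f_M):
    """Eq. (5.3): T = (2^e_v + f_v + 1) + (2^e_M + f_M + 1) - 1."""
    return (2 ** e_v + f_v + 1) + (2 ** e_M + f_M + 1) - 1

C = crossbar_count(6, 52)        # 4 * (64 + 53) = 468 crossbars per block
T = cycle_count(6, 52, 6, 52)    # 117 + 117 - 1 = 233 cycles
```

The exponential term $2^{e_M}$ is what makes the exponent width the dominant cost, as the exploration in Figure 5.3 shows.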
Cycle/crossbar number vs. accuracy loss. The bit numbers (of the exponents and fractions in
the matrix blocks and vector segments) link the hardware cost to the processing accuracy.
A better configuration of the bit numbers may lead to lower hardware cost, i.e., fewer cycles
and crossbars, with acceptable accuracy.
Figure 5.3: The cycle number and crossbar number under various bit lengths: the cycle number for (a) various configurations of the exponent bit number for the vector segment and matrix block and (b) various configurations of the fraction bit number for the vector segment and matrix block, and (c) the crossbar number for various configurations of the exponent and fraction bit number for the matrix block.
Cost exploration. We explore the effect of the bit numbers on the cycle number and crossbar number,
as shown in Figure 5.3. From Figure 5.3 (a) we can see that as the exponent bit
number for the vector segment and matrix block increases, the cycle number increases exponentially. In
Figure 5.3 (b), the cycle number also increases with the fraction bit number for the vector segment and matrix
block, but not as fast as in Figure 5.3 (a). From Figure 5.3 (c) we
can see that increasing the exponent and fraction bit number for the matrix block requires more
crossbars, and the crossbar number increases faster with the exponent bit number
than with the fraction bit number. In conclusion, larger bit numbers for the matrix block and
vector segment lead to higher hardware cost.
11353 11293 1.83482439049246e-14
11356 11293 7.39097655807432e-14
11359 11293 1.85943343625644e-14
11731 11293 4.59696761249278e-15
11734 11293 1.85170679402315e-14
11737 11293 4.65849022690248e-15
11794 11293 1.83878704499711e-14
11797 11293 7.40682717609272e-14
11800 11293 1.86339609076098e-14
11857 11293 4.59696761249294e-15
11860 11293 1.85170679402323e-14
11863 11293 4.65849022690251e-15
11294 11294 2.95639062322968e-13
11297 11294 7.43773374502569e-14
11354 11294 1.83482439049246e-14
11357 11294 7.39097655807432e-14
List 5.1: A data slice of crystm03. Each line contains a row index, a column index and an edge value.
Locality in real-world data. List 5.1 shows a data slice of the crystm03 matrix from the SuiteSparse
matrix collection [296]. As we can see, the row and column indices of nearby entries are
close, e.g., 113xx, 117xx, 118xx, 1129x. The values of the entries
are also close, i.e., they are around 10⁻¹³ to 10⁻¹⁵. In fact, in this example we can use 11290
as a shared bias for the row and column indices, leaving only one decimal digit to represent
the column index, and we can use 10⁻¹⁴ as a shared exponent bias for the values. As a result,
the number of bits required to represent the data slice, and consequently the computation
cost, is significantly reduced. We call this phenomenon Index and Value Locality. Such index
and value similarity commonly exists because sparse matrices are collected from real-world
scenarios, so the elements of a matrix have spatial similarity with nearby elements. The index and
value similarity also provides us the opportunity for low-cost floating-point computing and
a reduced memory footprint for the matrix.
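The shared exponent bias can be observed directly. This sketch applies math.frexp to four of the values from List 5.1 and shows that the per-element exponent offsets from a block-shared bias fit in a few bits.

```python
import math

# Four values taken from the crystm03 slice in List 5.1.
vals = [1.83482439049246e-14, 7.39097655807432e-14,
        4.59696761249278e-15, 2.95639062322968e-13]
exps = [math.frexp(v)[1] for v in vals]   # binary exponents (v = m * 2^e)
bias = max(exps)                          # a block-shared exponent bias
offsets = [bias - e for e in exps]        # small non-negative offsets
```

The offsets come out to [4, 2, 6, 0], so 3 bits per element would suffice here instead of the full 11-bit double-precision exponent.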
5.2.2 Discussion
The bit numbers for the exponent and fraction of a matrix block determine the number C of crossbars required for one floating-point multiplication on a matrix block and the number of processing cycles T. The bit numbers for the exponent and fraction of a vector segment also contribute to the processing cycle number T. Fewer bits lead to lower cost for floating-point multiplication in ReRAM, so a naive approach is to truncate bits for low computing latency and resource demand. However, bit truncation may lead to significantly slower convergence and, more importantly, nonconvergence. We show the iteration numbers for convergence under various exponent and fraction bit configurations in Table 5.1. In default double precision, it takes 80 iterations to converge. If we fix the exponent bits and truncate fraction bits, a 21-bit fraction takes 27 additional iterations, and a fraction of fewer than 21 bits leads to nonconvergence. If we fix the fraction bits and truncate exponent bits, a 7-bit exponent increases the iteration number from 80 to 20620, and an exponent of fewer than 7 bits leads to nonconvergence. In the state-of-the-art ReRAM accelerator for scientific computing [189], to reduce the processing cost, Feinberg et al. assumed 6 bits for the exponent (64-bit padding). The fatal problem in [189] is that the solvers are not functionally correct because of nonconvergence.
It is a challenge to reduce bit numbers for low-cost processing while maintaining convergence at the same time. To tackle this challenge, in this work, for low-cost floating-point processing in ReRAM, we propose a data format to replace the default double-precision floating-point format and an accelerator architecture to support the proposed data format.
Table 5.1: The iteration numbers for convergence under various exp(onent) and fra(ction) bit configurations for matrix crystm03.

exp    11       11       11       11       11        11
frac   52       30       29       28       27        26
#ite   80       82(+2)   82(+2)   83(+3)   83(+3)    84(+4)

exp    11       11       11       11       11        11
frac   25       24       23       22       21        20
#ite   90(+10)  93(+13)  93(+13)  95(+15)  107(+27)  NC

exp    10   9    8    7              6
frac   52   52   52   52             52
#ite   80   80   80   20620(+256x)   NC

NC indicates nonconvergence.
5.3 REFLOAT Data Format
In this section, we introduce the REFLOAT data format and illustrate the conversion between
a coordinate (COO) format and REFLOAT. We also discuss how to perform computation on
REFLOAT format to achieve high performance with minimal hardware overhead.
5.3.1 ReFloat Overview
(a) A block in full precision. (b) A block in ReFloat.
Figure 5.4: The conversion of index and value in floating-point format to REFLOAT format.
In general, we use ReFloat(b, e, f) to represent the REFLOAT format, where b denotes that the matrix block size (the length and width of a square matrix block) is 2^b, and e and f respectively denote the number of bits for the exponent and the fraction. The symbols and corresponding descriptions related to REFLOAT are listed in Table 5.2.

Table 5.2: List of symbols and descriptions.

Symbol            Description
ReFloat(b,e,f)    ReFloat format notation.
2^b               The size of a square block.
e                 The number of exponent bits for a matrix block.
f                 The number of fraction bits for a matrix block.
A                 A sparse matrix.
b                 The bias vector for a linear system.
x                 The solution vector for a linear system.
r                 The residual vector for a linear system.
a                 A scalar of A.
(a)_e             The exponent of a, (a)_e in {0, 1, 2, ...}.
(a)_f             The fraction of a, (a)_f in [1, 2).
A_c               A block of the sparse matrix A.
(i, j)            The index of the block A_c.
(ii, jj)          The index of the scalar a in the block A_c.
(iii, jjj)        The index of the scalar a in the matrix A.
e_b               The number of bias bits for the exponents of scalars in the block A_c.
e_v               The number of exponent bits for a vector segment.
f_v               The number of fraction bits for a vector segment.
We use Figure 5.4 to illustrate the idea of REFLOAT. As shown in Figure 5.4 (a), each scalar is in 64-bit floating-point format, and it takes a 32-bit integer for the row index and a 32-bit integer for the column index to locate each element in the matrix block. Therefore, storing the eight scalars takes 8 × (32 + 32 + 64) = 1024 bits. In this example, assume we use the ReFloat(2,2,3) format depicted in Figure 5.4 (b). Then, (1) each scalar in the block is indexed by two 2-bit integers; (2) each element value is represented by a (1 + 2 + 3) = 6-bit floating-point number; (3) the block is indexed by two 30-bit integers; and (4) an 11-bit exponent bias e_b is also recorded. Therefore, we only use 8 × (2 + 2 + 6) + 2 × 30 + 11 = 151 bits to store the entire matrix block, which reduces the memory requirement by approximately 6.8× (151 vs. 1024 bits). This reduction in bit representation also reduces the number of ReRAM crossbars needed for computation in the hardware implementation. More specifically, the full-precision format consumes 118 crossbars as illustrated in previous work [189], while only 16 crossbars are needed in our design with REFLOAT reformatting. Hence, within the same chip area, our design can process more matrix blocks at a time.
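The storage arithmetic above can be checked mechanically. The helper below (an illustrative Python sketch, not part of the dissertation's toolchain) counts the storage bits per block in both formats:

```python
def coo_bits(nnz, index_bits=32, value_bits=64):
    # Full-precision COO: per element, a row index, a column index
    # and a double-precision value.
    return nnz * (2 * index_bits + value_bits)

def refloat_bits(nnz, b, e, f, block_index_bits=32, exp_bias_bits=11):
    # Per element: two b-bit in-block indices and a (1 + e + f)-bit
    # value (sign + exponent + fraction).
    per_elem = nnz * (2 * b + 1 + e + f)
    # Per block: two (32 - b)-bit block indices and one exponent bias.
    per_block = 2 * (block_index_bits - b) + exp_bias_bits
    return per_elem + per_block
```

For the eight-scalar block of Figure 5.4 with ReFloat(2,2,3), this reproduces the 1024-bit vs. 151-bit comparison in the text.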
5.3.2 Conversion to ReFloat
(a) The conversion of the row index and the column index. (b) The conversion of a floating-point value.
Figure 5.5: The conversion of (a) the row index and column index and (b) a floating-point value from the original full-precision format to the REFLOAT format.
To configure the original matrix in a ReFloat(b, e, f) format, three parameters need to be determined in advance. b defines how the indices of the input data are converted, and it is determined by the physical size of the ReRAM crossbars, i.e., a crossbar with 2^b wordlines and 2^b bitlines. Assume the crossbar size is 4 × 4, which means b = 2. As demonstrated in Figure 5.5 (a), the leading 30 bits, i.e., b31 to b2, of the index (iii, jjj) of a scalar in the matrix A are copied to the same bits of the index (i, j) of the block A_c. For each scalar in the block A_c, the index (ii, jj) of that scalar inside the block is copied from the last two bits of the index (iii, jjj). The block index (i, j) is shared by all scalars in the same block, and each scalar uses fewer bits for its index inside that block. Thus, we save memory space for indexing.
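The index split described above is just a shift and a mask. A minimal Python sketch (a generic illustration, with b the base-2 logarithm of the block size):

```python
def split_index(iii, jjj, b=2):
    """Split a (row, col) scalar index into the block index (i, j),
    shared by the whole block, and the in-block index (ii, jj)."""
    i, j = iii >> b, jjj >> b                  # leading bits: block index
    mask = (1 << b) - 1
    ii, jj = iii & mask, jjj & mask            # last b bits: in-block index
    return (i, j), (ii, jj)
```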
The hyperparameters e and f compress the floating-point values. Essentially, a floating-point number consists of three parts: (1) the sign bit, (2) the exponent bits, and (3) the fraction bits. In the conversion to REFLOAT, we first keep the sign bit. For the fraction, we keep only the leading f bits of the original fraction and remove the remaining bits, as shown in Figure 5.5 (b). For the exponent bits, we first need to determine the bias value e_b for the exponent. Since e sets the number of bits for the "swing" range, we need to find an optimal bias value e_b that fully utilizes the e bits. We formalize the problem as an optimization for the e_b that minimizes a loss target L, defined as

min_{e_b} L,   L = Σ_{a∈A_c} ( log_2( a / ((a)_f × 2^{e_b}) ) )^2 = Σ_{a∈A_c} ( (a)_e − e_b )^2.   (5.4)

Setting ∂L/∂e_b = 0, we get

e_b = [ (1/|A_c|) Σ_{a∈A_c} (a)_e ],   (5.5)

where [·] denotes rounding to the nearest integer. Thus, in the conversion, we subtract the optimal e_b from the original exponent to get an e-bit signed integer. This e-bit signed integer is the exponent in REFLOAT.
[ (−1)×1.1111×2^7   1.0101×2^8  ]   [ −248.0   336.0 ]
[ (−1)×1.0000×2^9   1.0001×2^7  ] = [ −512.0   136.0 ].   (5.6)

      [ (−1)×1.11×2^−1   1.01×2^0  ]   [ −224.0   320.0 ]
2^8 × [ (−1)×1.00×2^1    1.00×2^−1 ] = [ −512.0   128.0 ].   (5.7)
We use an example to illustrate the format conversion intuitively. The original floating-point values in Eq. 5.6 are converted to ReFloat(x,2,2) format in Eq. 5.7, where e_b = 8. From this example we can see that REFLOAT incurs a conversion loss relative to the original floating-point values. However, as mentioned in Sec. 5.2.1, REFLOAT is application-specific to scientific computing, where the errors in an iterative solver are gradually corrected; the errors introduced by the conversion are likewise corrected over the iterations. More importantly, with REFLOAT, we can achieve high-performance computation because of the low computation cost of the REFLOAT format. We will show the performance and convergence of the iterative solvers in REFLOAT format in Sec. 5.5.
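The conversion from Eq. 5.6 to Eq. 5.7 can be reproduced in a few lines. The sketch below (our own illustration; it omits the clamp of the per-scalar exponent offset to the e-bit signed range, which a full implementation needs) computes the shared bias via Eq. 5.5 and truncates fractions to f bits:

```python
import math

def refloat_block(values, f):
    """Convert a block of doubles to a shared-exponent-bias,
    f-bit-fraction representation (a sketch of Eqs. 5.5-5.7)."""
    # Unbiased exponents (a)_e: frexp mantissa is in [0.5, 1).
    exps = [math.frexp(abs(v))[1] - 1 for v in values]
    eb = round(sum(exps) / len(exps))           # Eq. 5.5
    out = []
    for v, e in zip(values, exps):
        frac = abs(v) / 2.0 ** e                # (a)_f in [1, 2)
        frac = math.floor(frac * 2 ** f) / 2 ** f   # keep leading f bits
        out.append(math.copysign(frac * 2.0 ** e, v))
    return eb, out
```

Running it on the matrix of Eq. 5.6 with f = 2 yields e_b = 8 and exactly the values of Eq. 5.7.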
5.3.3 Computation with ReFloat
The whole matrix A is partitioned into blocks. To compute the matrix-vector multiplication y = Ax, the input vector x and the output vector y are partitioned correspondingly into vector segments x_c and y_c. The size of each vector segment is 2^b × 1.

For the p-th output vector segment y_c(p), the computation is

y_c(p) = Σ_i A_c(i, p) x_c(i),   (5.8)

where A_c(i, p) is the matrix block indexed by (i, p) and x_c(i) is the input vector segment indexed by i. The matrix blocks in the p-th block column are multiplied with the input vector segments to produce partial sums, which are then accumulated. In the computation for each matrix block, the original matrix block A_c(i, p) is converted to

A_c(i, p) = 2^{e_b(i,p)} Ã_c(i, p),   (5.9)

and the original vector segment x_c(i) is converted to

x_c(i) = 2^{e_v(i)} x̃_c(i),   (5.10)

where Ã_c(i, p) and x̃_c(i) denote the block and the segment in the low-bit REFLOAT representation. Thus, the matrix-vector multiplication of the matrix block A_c(i, p) and the vector segment x_c(i) is computed as

A_c(i, p) x_c(i) = 2^{e_b(i,p) + e_v(i)} Ã_c(i, p) x̃_c(i).   (5.11)

The matrix-vector multiplication for the p-th output vector segment in Eq. 5.8 is then computed as

y_c(p) = Σ_i 2^{e_b(i,p) + e_v(i)} Ã_c(i, p) x̃_c(i).   (5.12)

Here we can see that, with the REFLOAT format, the block matrix multiplication in Eq. 5.8 is preserved. The original high-cost full-precision multiplication A_c x_c is replaced by a low-cost multiplication Ã_c x̃_c. The architecture for high-performance scientific computing in REFLOAT is presented in Sec. 5.4.
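Equation 5.12 says each block-segment product only needs a low-bit multiply plus one exponent addition. The NumPy sketch below (an illustrative model, with blocks stored as (e_b, low-bit block) pairs and segments as (e_v, low-bit segment) pairs, not the hardware dataflow itself) reproduces a dense y = Ax this way:

```python
import numpy as np

def scaled_block_spmv(blocks, segs, num_seg, b):
    """Compute y = A x from scaled blocks (Eqs. 5.9-5.12).
    blocks: {(i, p): (eb, Ab)} with A_c(i, p) = 2**eb * Ab
    segs:   {i: (ev, xv)}      with x_c(i)   = 2**ev * xv"""
    y = np.zeros(num_seg * 2 ** b)
    for (i, p), (eb, Ab) in blocks.items():
        ev, xv = segs[i]
        # Low-bit multiply, then one shared exponent add (Eq. 5.11).
        y[p * 2 ** b:(p + 1) * 2 ** b] += 2.0 ** (eb + ev) * (Ab @ xv)
    return y
```

Because all-zero blocks are simply absent from `blocks`, the same loop also models the sparse case.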
5.4 Accelerator Architecture for REFLOAT
In this section, we present an accelerator architecture for efficient processing of scientific computing in ReRAM in the REFLOAT format. We present the architecture of the floating-point SpMV engine, the streaming of the sparse matrix into the accelerator, the format conversion between REFLOAT and 64-bit double-precision floating point, and the mapping of the Krylov subspace solvers onto the proposed accelerator.
5.4.1 Accelerator Overview
Figure 5.6 shows the overall architecture of the proposed accelerator for scientific computing in ReRAM with REFLOAT. The accelerator is organized in a bank-based architecture. Within each bank, ReRAM crossbars are deployed for the processing of floating-point MVM on matrix blocks. The Input Vector (IV) and Output Vector (OV) buffers are used for buffering the input and output vectors respectively. A commit buffer collects the partial SpMV results from the ReRAM crossbars, and the Multiply-and-Accumulate (MAC) units are used to update the vectors. The scheduler is responsible for coordinating the processing. The Input and Output (IO) buffer is used for moving matrix data from the IO into the accelerator.

Figure 5.6: The overall accelerator architecture for scientific computing in REFLOAT format.
5.4.2 Processing Engine
Figure 5.7: The architecture of (a) a processing engine for floating-point MVM on a matrix block and (b) a crossbar cluster for fixed-point MVM.
The most important component of the accelerator is the processing engine for floating-point matrix-vector multiplication in REFLOAT format. A processing engine is a logical unit for the processing of floating-point SpMV in REFLOAT format. The processing engine consists of a set of ReRAM crossbars and the peripheral functional units. The architecture of the processing engine is shown in Figure 5.7. Assume we are processing the floating-point SpMV on a matrix block in the ReFloat(b, e, f) format.

The inputs to the processing engine are (1) a matrix block in ReFloat(b, e, f) format, (2) an input vector segment in floating point with e_v exponent bits and f_v fraction bits, whose length is 2^b, and (3) the exponent bias bits e_b for the matrix block. The output of a processing engine is a vector segment of the SpMV on the matrix block, and the output vector segment is in double-precision floating point.
Before the computation, the matrix block is mapped to the ReRAM crossbars. The mapping is shown in Figure 5.7 (b). The fraction part of the matrix block in ReFloat(b, e, f) represents a number 1.b_{f-1}...b_0, so we have (f + 1) bits to map. The e-bit exponent of the matrix block contributes 2^e padding bits for alignment, so we have another 2^e bits to map. Thus, we map the total (2^e + f + 1) bits to (2^e + f + 1) ReRAM crossbars, where the i-th bit of the matrix block is mapped to the i-th crossbar. (Here, we assume the cell precision of the ReRAM crossbars is 1 bit. For 2-bit cells, two consecutive bits are mapped to one crossbar.) For an input vector segment with e_v exponent bits and f_v fraction bits, the input ① to the driver has (2^{e_v} + f_v + 1) bits.
During the processing, a cluster of crossbars performs the fixed-point MVM of the fraction part of the input vector segment with the fraction part of the matrix block using the shift-and-add method, as in the example in Figure 5.1. The input bits are applied to the crossbars by the driver, and the output of each crossbar is buffered by a Sample-and-Hold (S/H) unit and then converted to digital by a shared Analog-to-Digital Converter (ADC). For each input bit to the driver (we assume a 1-bit DAC), since the crossbar size is 2^b, the bit number for the ADC to convert the multiplication results ② from the crossbars is f_x = b. We then shift-and-add the results ② from all (2^e + f + 1) crossbars to get the result ③ of the 1-bit multiplication of the vector with the matrix fraction. Thus, the number of bits of ③ is

f_c = 2^e + f + 1 + b.   (5.13)

Next, we sequentially input the bits of ① to the crossbars and shift-and-add the collected ③ for each of the (2^{e_v} + f_v + 1) bits to get ④, which is the result of the multiplication of the matrix block with the input ①. The number of bits of ④ is

f_g = f_c + 2^{e_v} + f_v + 1 + b.   (5.14)
In a processing engine, as shown in Figure 5.7 (a), there is a sign bit for the matrix block, so we need two crossbar clusters for the signed multiplication. The elements in the input vector segment also have a sign bit. Thus, we need four ④ results and subtract among them to get ⑤, the result of the multiplication of the matrix block with the vector segment. The number of bits of ⑤ is (f_g + 1). We also note that ⑤ is a signed number because of the subtraction.

In the next step, we convert ⑤ to a double-precision floating-point number ⑥. e_b ⑦ is the exponent bias of the matrix block and e_v ⑧ is the exponent of the vector segment. Thus, we add ⑦ and ⑧ to the exponent of ⑥ to get ⑨, the final result of the multiplication of the matrix block with the vector segment. ⑨ is in 64-bit double-precision floating-point format.
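The shift-and-add dataflow of a crossbar cluster can be modeled in a few lines. In the sketch below (an illustrative NumPy model of unsigned fixed-point MVM; it ignores ADC resolution limits and the sign handling described above), each matrix bit plane stands in for one crossbar and each input bit for one driver cycle:

```python
import numpy as np

def bitserial_mvm(M, x, m_bits, x_bits):
    """Fixed-point MVM via 1-bit planes: matrix bit plane i plays the
    role of one crossbar, input bit j one driver cycle; the partial
    sums are combined by shift-and-add."""
    acc = np.zeros(M.shape[0], dtype=np.int64)
    for i in range(m_bits):             # "crossbar" for matrix bit i
        plane = (M >> i) & 1
        for j in range(x_bits):         # input bit j applied by driver
            xb = (x >> j) & 1
            acc += (plane @ xb) << (i + j)
    return acc
```

Summed over all bit planes and input bits, the result equals the full-precision integer product M @ x.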
5.4.3 Streaming & Scheduling
The processing engine is designed for the computation of MVM on a matrix block. For the whole SpMV, the computation is partitioned into block MVMs. However, because of the sparsity of real-world large-scale matrices, the streaming of the matrix and the scheduling of the block MVMs need careful consideration.

The Matrix Market File Format [297] (MMFF) is a matrix format used in scientific computing. In MMFF, each entry line contains the row index, the column index and the value of an edge (a non-zero) in a sparse matrix, as shown in List 5.1. The entries are usually ordered in row major for processing on the original matrix, or in column major for processing on the transposed matrix. We show an example of the row-major data layout in Figure 5.8 (b) for the sparse matrix in Figure 5.8 (a). In the row-major layout, all the non-zeros in one row are stored contiguously.

(a) A sparse matrix. (b) Row-major layout. (c) Block-major layout.
Figure 5.8: The row-major layout and block-major layout of a sparse matrix.
For the computation in Eq. 5.8, which accumulates into the same destination vector segment, it is necessary to continuously access non-zeros from the same block column; for example, the red and yellow non-zeros are accessed together. But in the row-major layout of Figure 5.8 (b), these accesses are irregular, which wastes memory bandwidth. Thus, we instead perform the computation for the same source vector segment to produce partial results, and then accumulate the partial results into the destination vector segments.

However, there is another problem with the row-major layout. Assume the matrix block size is 4 × 4 and there are four available processing engines. To avoid bandwidth waste, the red and blue non-zeros in the first four rows need to be fetched. They are accessed contiguously, as shown in Figure 5.8 (b), but these non-zeros belong to only two matrix blocks, as shown in Figure 5.8 (a). The number of matrix blocks in the next four rows is unknown before all data are fetched, and it is possible that the next fetch yields four matrix blocks. At this point, the non-zeros in the next four rows cannot yet be accessed. Thus, only two of the four processing engines are busy while the other two are idle.
For full utilization of the processing engines, we propose a block-major layout, as shown in Figure 5.8 (c). Within a matrix block, the non-zeros are stored in row-major order, and the data of the blocks are stored contiguously. With the block-major layout, we keep fetching non-zeros into the accelerator until all available processing engines are occupied. To avoid conflicts when writing to the same vector segment, we use a commit buffer, as shown in Figure 5.6, to collect the partial results from the processing engines and then update the output vector with the MAC units.
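Reordering a COO stream into a block-major layout is a single sort. A minimal sketch (our own Python illustration; it orders blocks by block row and then block column, one plausible block order, and row-major within each block):

```python
def to_block_major(entries, b):
    """Sort COO entries (row, col, val) so that all non-zeros of one
    2^b x 2^b block are contiguous, row-major inside each block."""
    mask = (1 << b) - 1
    return sorted(entries, key=lambda e: (e[0] >> b, e[1] >> b,
                                          e[0] & mask, e[1] & mask))
```

After this sort, a streaming fetch hands each processing engine one complete block at a time, which is exactly what the scheduler needs to keep all engines busy.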
5.4.4 Accelerator Programming
Conjugate Gradient (CG) [294] and Stabilized BiConjugate Gradient (BiCGSTAB) [295]
are the two Krylov subspace methods used as the iterative solver in this work. We use CG
as an example to illustrate how we program the accelerator for the solver. The pseudo code
for the CG solver is listed in Code 5.2. In the code, we use double to indicate a variable in 64-bit (double) floating-point precision and refloat to indicate a variable in REFLOAT. A variable name with a _mat suffix indicates a matrix, a variable name with a _vec suffix indicates a vector, and the others are scalars.

1  refloat: e, f, ev, fv
2  double A_mat
3  refloat Ar_mat = (refloat) A_mat
4  double b_vec
5  double x_vec = x0_vec
6  double Ax = Ar_mat * x_vec
7  double r_vec = b_vec - Ax
8  double p_vec = r_vec
9  while (not converge) do
10   double Ap_vec = Ar_mat * (refloat) p_vec
11   double rr = r_vec.T * r_vec
12   double pAp = p_vec.T * Ap_vec
13   double alpha = rr / pAp
14   x_vec = x_vec + alpha * p_vec
15   r_vec = r_vec - alpha * Ap_vec
16   double rrnew = r_vec.T * r_vec
17   double beta = rrnew / rr
18   p_vec = r_vec + beta * p_vec
19 end while

Code 5.2: The pseudo code for the CG solver.

At the beginning (Line 1), the exponent bit number e and the fraction bit number f for the matrix block, and the exponent bit number ev and the fraction bit number fv for the vector segment, are set for refloat. Within the accelerator, the crossbars in a bank (subbank) are configured with these parameters as clusters for processing SpMV in refloat. Then the sparse matrix A_mat is converted to the refloat matrix Ar_mat and stored in the block-major layout (Sec. 5.4.3) at Lines 2 and 3. The bias vector b_vec of the linear system and the initial solution x_vec are loaded into the accelerator at Lines 4 and 5. At Line 6, the SpMV of Ar_mat and x_vec is processed: the matrix blocks of Ar_mat are streamed into the accelerator and the partial results are accumulated into a double vector Ax. At Line 7, the error vector r_vec is computed by the MACs. At Line 8, a precision correction vector p_vec is created. Lines 9 to 19 are the main body of the CG solver. At Line 10, the SpMV is performed on the matrix and the precision correction vector p_vec, which is converted to refloat before processing. From Line 11 to Line 13, a coefficient alpha is calculated; the dot product of two vectors is performed by the MACs. At Lines 14 and 15, the coefficient alpha is used to update the solution vector x_vec and the error vector r_vec element-wise. From Line 16 to Line 18, another coefficient beta is calculated to update the precision correction vector. The main body continues to run until a desired stopping condition is reached (such as the number of iterations exceeding a threshold or the residual rr falling below a threshold).
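Code 5.2 maps almost line-for-line onto a plain NumPy CG loop. The sketch below (our own reference implementation; everything stays in double and the refloat conversion of the SpMV operands is omitted) shows the host-side view of the solver, with the SpMV `A @ p` being the operation the accelerator offloads:

```python
import numpy as np

def cg_solve(A, b, x0, tol=1e-8, max_iter=1000):
    """Conjugate Gradient for SPD A, mirroring Code 5.2."""
    x = x0.copy()
    r = b - A @ x                 # error (residual) vector, Line 7
    p = r.copy()                  # precision correction vector, Line 8
    rr = r @ r
    for _ in range(max_iter):
        if np.sqrt(rr) < tol:     # stop once the residual is small
            break
        Ap = A @ p                # the SpMV offloaded to the engines
        alpha = rr / (p @ Ap)     # Lines 11-13
        x += alpha * p            # Line 14
        r -= alpha * Ap           # Line 15
        rr_new = r @ r            # Line 16
        p = r + (rr_new / rr) * p # Lines 17-18
        rr = rr_new
    return x
```

On a small SPD system, the loop reproduces the direct solution to within the residual tolerance.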
5.5 Evaluation
In this section, we evaluate the speedup of our proposed accelerator in REFLOAT format against the state-of-the-art ReRAM accelerator for scientific computing [189], denoted as ESCMA. We use a high-end Tesla V100 GPU as the baseline for comparison. We also present the error curves of the iterative solvers with REFLOAT to better understand the behavior of the low-cost floating-point computing.
5.5.1 Evaluation Setup
Table 5.3: Platform configuration.

GPU (Tesla V100 SXM2)
  Architecture: Volta            CUDA cores: 5120
  Memory: 16 GB HBM2             CUDA version: 10.2
ESCMA
  Banks: 128                     Crossbar size: 128×128
  Clusters/bank: 64              Precision: double
  Crossbars/cluster: 128         Comp. ReRAM: 17.1 Gb
ReFloat
  Banks: 128                     Crossbar size: 128×128
  Subbanks/bank: 128             Precision: refloat
  Crossbars/subbank: 64          Comp. ReRAM: 17.1 Gb
ADC
  10-bit pipelined SAR ADC @ 1.5 GS/s
ReRAM cells
  1-bit SLC, T_w = 50.88 ns, comp. latency = 107 ns @ (128×128)
We list the configurations of the baseline GPU platform, the state-of-the-art ReRAM accelerator for scientific computing (ESCMA) [189], and our REFLOAT in Table 5.3. For the baseline GPU, we use an NVIDIA Tesla V100 SXM2 GPU, which has 5120 CUDA cores and 16 GB of HBM2 memory. We use CUDA 10.2 and cuSPARSE routines in the iterative solvers for the processing and computing on sparse matrices, and we measure the running time of the solvers on the GPU. For the two ReRAM accelerators, i.e., ESCMA and REFLOAT, we simulate with the parameters in Table 5.3. Both ReRAM accelerators have 128 banks, and the crossbar size is 128 × 128. In ESCMA, we configure 64 clusters per bank, which is slightly larger than the 56 in the original work [189], with 128 crossbars per cluster. The precision in ESCMA is double-precision floating point. In REFLOAT, we configure 128 banks, 128 subbanks per bank, and 64 crossbars per subbank. The precision in REFLOAT is refloat with a default setting of e = 3, f = 3, e_v = 3 and f_v = 8. For both ReRAM accelerators, the equivalent computing ReRAM is 17.1 Gb, and the ADC and ReRAM cells have the same configurations. We use a 1.5 GS/s 10-bit pipelined SAR ADC [298] for conversion. The DAC is 1-bit, implemented by wordline activation. We use 1-bit SLC cells [264], and the write latency is 50.88 ns. The computing latency of one crossbar, including the ADC conversion, is 107 ns [189]. In both ReRAM accelerators, each bank has a 16 MB buffer memory and 128 floating-point MACs for vector-related processing.
The matrices used in the evaluation are listed in Table 5.4. We evaluate on 12 solvable matrices from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) [296]. The size, i.e., the number of rows, of the matrices ranges from 4,875 to 204,316, and the number of non-zero entries (NNZ) ranges from 105,339 to 1,660,579. NNZ/Row is a metric of sparsity; a smaller NNZ/Row indicates a sparser matrix. NNZ/Row ranges from 4.0 to 25.1. We also visualize the matrices in Table 5.4.
We apply the iterative solvers CG and BiCGSTAB to the matrices. The convergence criterion for the solvers is that the L2 norm of the residual vector (denoted by R2 and, for simplicity, called the "residual" in this section) is less than 10^-8. The error e_i = x_i − x̂_i of any scalar in the solution vector is bounded by the residual, because |e_i|^2 ≤ Σ_i |e_i|^2 = ||x − x̂||^2 = ||r||^2 / ||A||^2 = (1/||A||^2) R2, where ||A|| is a constant.
Table 5.4: Matrices in the evaluation.

ID    Name            #Rows    NNZ       NNZ/Row  κ
353   crystm01        4875     105339    21.6     4.21e+2
1313  minsurfo        40806    203622    5.0      8.11e+1
354   crystm02        13965    322905    23.1     4.49e+2
2261  shallow_water1  81920    327680    4.0      3.63e+0
1288  wathen100       30401    471601    15.5     8.24e+3
1311  gridgena        48962    512084    10.5     5.74e+5
1289  wathen120       36441    565761    15.5     4.05e+3
355   crystm03        24696    583770    23.6     4.68e+2
2257  thermomech_TC   102158   711558    6.9      1.23e+2
1848  Dubcova2        65025    1030225   15.84    1.04e+4
2259  thermomech_dM   204316   1423116   6.9      1.24e+2
845   qa8fm           66127    1660579   25.1     1.10e+2
5.5.2 Performance
CG Solver
We show the performance of the GPU in double and single precision, ESCMA [189], and REFLOAT for the CG solver in Figure 5.9. The performance p is defined as p = t_GPU(double) / t_x, where x is GPU(single), ESCMA or REFLOAT, and t is the processing time for the iterative solver to reach a residual below 10^-8. Overall, the geometric-mean (GMN) performance of GPU(single), ESCMA and REFLOAT is 0.78×, 6.07× and 18.71× respectively, which amounts to a 3.08× speedup of REFLOAT over ESCMA. GPU(single) benefits from faster SpMV on fewer bits, but the iteration count increases. The highest performance of GPU(single) is 1.22×, on matrix 2261, and for most matrices the gain from switching double to single is less than 1.2×. For matrix 1311, the GPU(single) performance is 0.015×, because it takes 1468 additional iterations. Even fewer bits, such as FP16 used in a straightforward way, do not converge. For most of the matrices, ESCMA and REFLOAT perform better than the baseline GPU. However, for matrices 2257 and 2259, the performance of REFLOAT is only 3.97× and 2.68× respectively, and the performance of ESCMA on these two matrices is even lower, at 0.72× and 0.51× respectively.
Figure 5.9: The performance of GPU (in double and single precision), ESCMA and REFLOAT for the CG solver.
BiCGSTAB Solver
We show the performance of the GPU in double and single precision, ESCMA [189], and REFLOAT for the BiCGSTAB solver in Figure 5.10. The geometric-mean (GMN) performance of GPU(single), ESCMA and REFLOAT is 1.15×, 5.09× and 21.88× respectively, which amounts to a 4.30× speedup of REFLOAT over ESCMA. Notice that matrix 1311 does not converge with GPU(single), so the 1.15× geomean excludes this matrix. The trend of the three platforms on the evaluated matrices is similar to that of the CG solver. In each iteration, the BiCGSTAB solver performs two SpMVs on the whole matrix, while the CG solver performs one. For ESCMA, more time is consumed for processing the two SpMVs, leading to a slightly lower performance than for the CG solver (5.09× vs. 6.07×) relative to the baseline GPU. From Table 5.6 we can see that the gap (the difference in the number of iterations to converge) for the BiCGSTAB solver is smaller than the gap for the CG solver for most matrices. For matrices 355, 2257, 2259 and 845, the gap is negative, which means it takes fewer iterations in refloat than in double. For REFLOAT, although more time is consumed for processing the two SpMVs in BiCGSTAB, the same scenario as in ESCMA, the gap is smaller, so the performance of REFLOAT for the BiCGSTAB solver is higher than for the CG solver, i.e., 21.88× vs. 18.71×.
Figure 5.10: The performance of GPU (in double and single precision), ESCMA and REFLOAT for the BiCGSTAB solver.
5.5.3 Accuracy
The Y axis is the residual and the X axis is the normalized iteration number.
Figure 5.11: The convergence traces of the CG solver on GPU (red line) and REFLOAT (blue line).
We show the convergence traces (the residual over each iteration) of the evaluated matrices in double and in refloat for the CG solver in Figure 5.11 and for the BiCGSTAB solver in Figure 5.12. The iteration number is normalized by the consumed time of the GPU baseline. The configurations of the bit numbers for the matrix block and the vector segment in refloat are listed in Table 5.5. The absolute (non-normalized) iteration numbers to reach convergence are listed in Table 5.6.

The Y axis is the residual and the X axis is the normalized iteration number.
Figure 5.12: The convergence traces of the BiCGSTAB solver on GPU (red line) and REFLOAT (blue line).
For the CG solver, Table 5.6 shows that refloat leads to more iterations to converge when the time consumed per iteration is not considered. From Figure 5.11 we can see that, with the low-bit representation, the residual curves are almost the same as the residual curves in default double. Most importantly, all the traces in refloat converge faster in time than the traces in double. For matrix 1288 and matrix 1848, the fraction bit number of the vector segment is 16, because the default 8 leads to nonconvergence.

For the BiCGSTAB solver, Table 5.6 shows that while refloat takes more iterations to converge for six of the matrices, the iteration counts for four matrices are even lower than those in double. We infer that this is because the lower-bit representation enlarges the changes in the correction term, which leads to fewer iterations. We also notice that spikes appear in the residual curves more frequently in refloat than in double, but the solvers still reach convergence.
Table 5.5: The bit numbers for the exponent and fraction of the matrix block and the vector segment in REFLOAT.

            CG              BiCGSTAB
ID      e  f  ev  fv      e  f  ev  fv
353     3  3  3   8       3  3  3   8
1313    3  3  3   8       3  3  3   8
354     3  3  3   8       3  3  3   8
2261    3  3  3   8       3  3  3   8
1288    3  3  3   16      3  3  3   16
1311    3  3  3   8       3  3  3   8
1289    3  3  3   8       3  3  3   8
355     3  3  3   8       3  3  3   8
2257    3  3  3   8       3  3  3   8
1848    3  3  3   16      3  3  3   16
2259    3  3  3   8       3  3  3   8
845     3  3  3   8       3  3  3   8
5.5.4 Memory Overhead
Figure 5.13: Memory overhead of matrices in ESCMA and REFLOAT.
In Figure 5.13, we compare the memory overhead of each matrix in refloat with that in double (used in ESCMA [189]). On average, refloat consumes 0.21× the memory of double. For all matrices except 2257 and 2259, refloat consumes less than 0.2× the memory of double; matrix 2257 and matrix 2259 consume 0.47× and 0.44×, respectively. For these two matrices, the average density within the matrix is relatively low, so more memory is spent on the matrix block indices and the exponent biases, as discussed in Sec. 5.3.1.
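The density effect can be made concrete with a rough storage model. The layout below is an assumption for illustration (per-nonzero sign + e + f bits, plus a per-block index and exponent bias), not the paper's exact accounting; the bit widths are hypothetical defaults.

```python
def refloat_bytes(nnz, num_blocks, e=3, f=3, idx_bits=32, bias_bits=11):
    """Rough storage model (assumed layout): each nonzero stores
    sign + e + f bits; each nonempty block additionally stores an
    index and an exponent bias."""
    value_bits = nnz * (1 + e + f)
    block_bits = num_blocks * (idx_bits + bias_bits)
    return (value_bits + block_bits) / 8

def double_bytes(nnz, idx_bits=32):
    """Baseline: a 64-bit value plus an index per nonzero."""
    return nnz * (64 + idx_bits) / 8
```

With the nonzero count fixed, spreading the nonzeros over more blocks (lower density per block) inflates the per-block overhead, which matches the behavior observed for matrices 2257 and 2259.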
Table 5.6: The absolute numbers of iterations to reach convergence.

ID    | CG: double refloat gap | BiCGSTAB: double refloat gap
353   | 68  85  +17            | 48  51  +3
1313  | 52  55  +3             | 30  69  +39
354   | 54  95  +41            | 52  79  +27
2261  | 11  11  0              | 7   7   0
1288  | 262 305 +43            | 195 205 +10
1311  | 1   1   0              | 1   1   0
1289  | 294 401 +107           | 216 317 +101
355   | 80  95  +15            | 55  52  -3
2257  | 55  56  +1             | 43  36  -7
1848  | 162 214 +52            | 117 145 +28
2259  | 57  58  +1             | 43  36  -7
845   | 53  54  +1             | 41  45  -4
Chapter 6
Conclusion
In this dissertation, I study architecture-level optimizations for the efficient acceleration of deep learning training and of graph processing. I present ACCPAR, a principled and systematic method of determining the tensor partition among heterogeneous accelerator arrays to achieve optimal performance. ACCPAR considers the complete tensor partition space and reveals a new, previously unknown parallelism configuration. It optimizes performance based on a cost model that considers both the computation and communication costs of the heterogeneous execution environment. I present PIPELAYER, a ReRAM-based PIM accelerator architecture for high-throughput deep learning training acceleration. I propose an efficient layer-wise pipeline to exploit intra- and inter-layer parallelism. PIPELAYER enables highly pipelined execution of both training and testing.
I present GRAPHR, the first ReRAM-based graph processing accelerator. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be performed efficiently by ReRAM crossbars. The core graph computations are performed in sparse matrix format in graph engines (ReRAM crossbars), and edges are partitioned and sequentially fetched into the graph engines. I present
REFLOAT, a data format and a corresponding accelerator architecture that support low-cost floating-point SpMV in ReRAM. REFLOAT is tailored for processing on ReRAM crossbars: the number of effective bits is greatly reduced to lower the crossbar cost and the cycle cost of floating-point multiplication on a matrix block, so the overall floating-point SpMV processing is accelerated.
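The GRAPHR insight, that a vertex program expressible as SpMV maps onto crossbars, can be illustrated with PageRank, whose iteration is a dense analogue of the SpMV step a crossbar performs. This is a minimal sketch of the algorithmic reformulation only; GRAPHR's actual mapping partitions the edges into crossbar-sized blocks and streams them through the graph engines.

```python
import numpy as np

def pagerank_spmv(adj, d=0.85, iters=20):
    """PageRank phrased as repeated SpMV. adj[i][j] = 1 iff there is
    an edge j -> i (illustrative, not GRAPHR's exact dataflow)."""
    a = np.asarray(adj, dtype=float)
    col_sums = a.sum(axis=0)
    col_sums[col_sums == 0] = 1.0        # guard against dangling vertices
    m = a / col_sums                     # column-stochastic transition matrix
    n = m.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (m @ r)    # the SpMV step a crossbar performs
    return r
```

Each iteration is one matrix-vector product plus a per-vertex update, which is exactly the pattern a ReRAM crossbar evaluates in analog in a single pass.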
Bibliography
[1] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” arXiv preprint arXiv:1404.5997, 2014.
[2] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “Hypar: Towards hybrid parallelism for deep learning accelerator array,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE, 2019.
[3] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.
[4] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision, pp. 21–37, Springer, 2016.
[5] X. Liu, H. Yang, Z. Liu, L. Song, H. Li, and Y. Chen, “Dpatch: An adversarial patch attack on object detectors,” arXiv preprint arXiv:1806.02299, 2018.
[6] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710, ACM, 2014.
[7] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[8] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864, ACM, 2016.
[9] T. Fischer and C. Krauss, “Deep learning with long short-term memory networks for financial market predictions,” European Journal of Operational Research, vol. 270, no. 2, pp. 654–669, 2018.
[10] X. Ding, Y. Zhang, T. Liu, and J. Duan, “Deep learning for event-driven stock prediction,” in Twenty-fourth international joint conference on artificial intelligence, 2015.
[11] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal representation and trading,” IEEE transactions on neural networks and learning systems, vol. 28, no. 3, pp. 653–664, 2016.
[12] R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley, “Deep learning for healthcare: review, opportunities and challenges,” Briefings in bioinformatics, vol. 19, no. 6, pp. 1236–1246, 2017.
[13] T. Ching, D. S. Himmelstein, B. K. Beaulieu-Jones, A. A. Kalinin, B. T. Do, G. P. Way, E. Ferrero, P.-M. Agapow, M. Zietz, M. M. Hoffman, W. Xie, G. L. Rosen, B. J. Lengerich, J. Israeli, J. Lanchantin, S. Woloszynek, A. E. Carpenter, A. Shrikumar, J. Xu, E. M. Cofer, C. A. Lavender, S. C. Turaga, A. M. Alexandari, Z. Lu, D. J. Harris, D. DeCaprio, Y. Qi, A. Kundaje, Y. Peng, L. K. Wiley, M. H. S. Segler, S. M. Boca, S. J. Swamidass, A. Huang, A. Gitter, and C. S. Greene, “Opportunities and obstacles for deep learning in biology and medicine,” Journal of The Royal Society Interface, vol. 15, no. 141, p. 20170387, 2018.
[14] O. Faust, Y. Hagiwara, T. J. Hong, O. S. Lih, and U. R. Acharya, “Deep learning for healthcare applications based on physiological signals: A review,” Computer methods and programs in biomedicine, vol. 161, pp. 1–13, 2018.
[15] L. Song, F. Chen, S. R. Young, C. D. Schuman, G. Perdue, and T. E. Potok, “Deep learning for vertex reconstruction of neutrino-nucleus interaction events with combined energy and time data,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3882–3886, IEEE, 2019.
[16] P. Baldi, P. Sadowski, and D. Whiteson, “Searching for exotic particles in high-energy physics with deep learning,” Nature communications, vol. 5, p. 4308, 2014.
[17] J. Wu, I. Yildirim, J. J. Lim, B. Freeman, and J. Tenenbaum, “Galileo: Perceiving physical object properties by integrating a physics engine with deep learning,” in Advances in neural information processing systems, pp. 127–135, 2015.
[18] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown, “Overview of the spinnaker system architecture,” IEEE Transactions on Computers, vol. 62, no. 12, pp. 2454–2467, 2013.
[19] Z. Du, D. D. Ben-Dayan Rubin, Y. Chen, L. He, T. Chen, L. Zhang, C. Wu, and O. Temam, “Neuromorphic accelerators: A comparison between neuroscience and machine-learning approaches,” in Proceedings of the 48th International Symposium on Microarchitecture, pp. 494–507, ACM, 2015.
[20] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha, “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, 2014.
[21] S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha, “Backpropagation for energy-efficient neuromorphic computing,” in Advances in Neural Information Processing Systems, pp. 1117–1125, 2015.
[22] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. d. Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha, “Convolutional networks for fast, energy-efficient neuromorphic computing,” Proceedings of the National Academy of Sciences (PNAS), p. 201604850, 2016.
[23] T. Liu, Z. Liu, F. Lin, Y. Jin, G. Quan, and W. Wen, “Mt-spike: A multilayer time-based spiking neuromorphic architecture with temporal error backpropagation,” in Computer-Aided Design (ICCAD), 2017 IEEE/ACM International Conference on, pp. 450–457, IEEE, 2017.
[24] T. Liu, L. Jiang, Y. Jin, G. Quan, and W. Wen, “Pt-spike: A precise-time-dependent single spike neuromorphic architecture with efficient supervised learning,” in Design Automation Conference (ASP-DAC), 2018 23rd Asia and South Pacific, pp. 568–573, IEEE, 2018.
[25] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Neural acceleration for general-purpose approximate programs,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 449–460, IEEE Computer Society, 2012.
[26] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning,” in ACM Sigplan Notices, pp. 269–284, ACM, 2014.
[27] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “Dadiannao: A machine-learning supercomputer,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 609–622, IEEE Computer Society, 2014.
[28] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shidiannao: Shifting vision processing closer to the sensor,” in ACM SIGARCH Computer Architecture News, vol. 43, pp. 92–104, ACM, 2015.
[29] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, and Y. Chen, “Pudiannao: A polyvalent machine learning accelerator,” in ACM SIGARCH Computer Architecture News, vol. 43, pp. 369–381, ACM, 2015.
[30] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, “Cambricon: An instruction set architecture for neural networks,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 393–405, IEEE, 2016.
[31] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Cambricon-x: An accelerator for sparse neural networks,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, IEEE, 2016.
[32] “Google supercharges machine learning tasks with tpu custom chip.” https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.
[33] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, (New York, NY, USA), pp. 1–12, ACM, 2017.
[34] S. Venkataramani, A. Ranjan, S. Banerjee, D. Das, S. Avancha, A. Jagannathan, A. Durg, D. Nagaraj, B. Kaul, P. Dubey, and A. Raghunathan, “Scaledeep: A scalable compute architecture for learning and evaluating deep networks,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 13–26, ACM, 2017.
[35] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, “Stripes: Bit-serial deep neural network computing,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pp. 1–12, IEEE, 2016.
[36] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks,” in Proceedings of the 2015 ACM/SIGDA international symposium on field-programmable gate arrays, pp. 161–170, 2015.
[37] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, Y. Wang, and H. Yang, “Going deeper with embedded fpga platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 26–35, ACM, 2016.
[38] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, “Design space exploration of fpga-based deep convolutional neural networks,” in 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 575–580, IEEE, 2016.
[39] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao, “Throughput-optimized opencl-based fpga accelerator for large-scale convolutional neural networks,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 16–25, ACM, 2016.
[40] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks,” in Computer-Aided Design (ICCAD), 2016 IEEE/ACM International Conference on, pp. 1–8, IEEE, 2016.
[41] Y. Wang, J. Xu, Y. Han, H. Li, and X. Li, “Deepburning: automatic generation of fpga-based learning accelerators for the neural network family,” in 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6, IEEE, 2016.
[42] C. Zhang, D. Wu, J. Sun, G. Sun, G. Luo, and J. Cong, “Energy-efficient cnn implementation on a deeply pipelined fpga cluster,” in Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pp. 326–331, 2016.
[43] D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K. Kim, and H. Esmaeilzadeh, “Tabla: A unified template-based framework for accelerating statistical machine learning,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 14–26, IEEE, 2016.
[44] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, “From high-level deep neural models to fpgas,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pp. 1–12, IEEE, 2016.
[45] Y. Guan, H. Liang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, and J. Cong, “Fp-dnn: An automated framework for mapping deep neural networks onto fpgas with rtl-hls hybrid templates,” in 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 152–159, IEEE, 2017.
[46] Y. Guan, Z. Yuan, G. Sun, and J. Cong, “Fpga-based accelerator for long short-term memory recurrent neural networks,” in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 629–634, IEEE, 2017.
[47] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “Ese: Efficient speech recognition engine with sparse lstm on fpga,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 75–84, 2017.
[48] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing loop operation and dataflow in fpga acceleration of deep convolutional neural networks,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 45–54, ACM, 2017.
[49] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang, “Accelerating binarized convolutional neural networks with software-programmable fpgas,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 15–24, 2017.
[50] Y. Shen, M. Ferdman, and P. Milder, “Overcoming resource underutilization in spatial cnn accelerators,” in Field Programmable Logic and Applications (FPL), 2016 26th International Conference on, pp. 1–4, IEEE, 2016.
[51] C. Zhang and V. Prasanna, “Frequency domain acceleration of convolutional neural networks on cpu-fpga shared memory system,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 35–44, 2017.
[52] R. Cai, A. Ren, N. Liu, C. Ding, L. Wang, X. Qian, M. Pedram, and Y. Wang, “Vibnn: Hardware acceleration of bayesian neural networks,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 476–488, ACM, 2018.
[53] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, “Neuflow: A runtime reconfigurable dataflow processor for vision,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pp. 109–116, IEEE, 2011.
[54] Y. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 367–379, 2016.
[55] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[56] Y.-H. Chen, J. Emer, and V. Sze, “Using dataflow to optimize energy efficiency of deep neural network accelerators,” IEEE Micro, vol. 37, no. 3, pp. 12–21, 2017.
[57] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks,” in High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, pp. 553–564, IEEE, 2017.
[58] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer cnn accelerators,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pp. 1–12, IEEE, 2016.
[59] Y. Shen, M. Ferdman, and M. Peter, “Maximizing cnn accelerator efficiency through resource partitioning,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 535–547, ACM, 2017.
[60] A. Yazdanbakhsh, K. Samadi, N. S. Kim, and H. Esmaeilzadeh, “Ganax: A unified mimd-simd acceleration for generative adversarial networks,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 650–661, IEEE, 2018.
[61] K. Hegde, J. Yu, R. Agrawal, M. Yan, M. Pellauer, and C. Fletcher, “Ucnn: Exploiting computational reuse in deep neural networks via weight repetition,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 674–687, IEEE, 2018.
[62] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, “Neurocube: A programmable digital neuromorphic architecture with high-density 3d memory,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 380–392, IEEE, 2016.
[63] L. Jiang, M. Kim, W. Wen, and D. Wang, “Xnor-pop: A processing-in-memory architecture for binary convolutional neural networks in wide-io2 drams,” in Low Power Electronics and Design (ISLPED), 2017 IEEE/ACM International Symposium on, pp. 1–6, IEEE, 2017.
[64] S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “Drisa: A dram-based reconfigurable in-situ accelerator,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 288–301, ACM, 2017.
[65] Q. Lou, W. Wen, and L. Jiang, “3dict: a reliable and qos capable mobile process-in-memory architecture for lookup-based cnns in 3d xpoint rerams,” in Proceedings of the International Conference on Computer-Aided Design, p. 53, ACM, 2018.
[66] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “Eie: efficient inference engine on compressed deep neural network,” in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 243–254, IEEE Press, 2016.
[67] A. Ren, Z. Li, C. Ding, Q. Qiu, Y. Wang, J. Li, X. Qian, and B. Yuan, “Sc-dcnn: Highly-scalable deep convolutional neural network using stochastic computing,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 405–418, ACM, 2017.
[68] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 27–40, ACM, 2017.
[69] Y. Shen, M. Ferdman, and P. Milder, “Escher: A cnn accelerator with flexible buffering to minimize off-chip transfer,” in Field-Programmable Custom Computing Machines (FCCM), 2017 IEEE 25th Annual International Symposium on, pp. 93–100, IEEE, 2017.
[70] M. S. Razlighi, M. Imani, F. Koushanfar, and T. Rosing, “Looknn: Neural network with no multiplication,” in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1775–1780, IEEE, 2017.
[71] J. Albericio, A. Delmas, P. Judd, S. Sharify, G. O’Leary, R. Genov, and A. Moshovos, “Bit-pragmatic deep neural network computing,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 382–394, ACM, 2017.
[72] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, V. Chandra, and H. Esmaeilzadeh, “Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 764–775, IEEE, 2018.
[73] T. Na and S. Mukhopadhyay, “Speeding up convolutional neural network training with dynamic precision scaling and flexible multiplier-accumulator,” in Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pp. 58–63, ACM, 2016.
[74] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernandez-Lobato, G.-Y. Wei, and D. Brooks, “Minerva: Enabling low-power, highly-accurate deep neural network accelerators,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 267–278, IEEE, 2016.
[75] J. Mao, X. Chen, K. W. Nixon, C. Krieger, and Y. Chen, “Modnn: Local distributed mobile computing system for deep neural network,” in 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1396–1401, IEEE, 2017.
[76] J. Mao, Z. Yang, W. Wen, C. Wu, L. Song, K. W. Nixon, X. Chen, H. Li, and Y. Chen, “Mednn: A distributed mobile system with enhanced partition and deployment for large-scale dnns,” in Proceedings of the 36th International Conference on Computer-Aided Design, pp. 751–756, IEEE Press, 2017.
[77] J. Mao, Z. Qin, Z. Xu, K. W. Nixon, X. Chen, H. Li, and Y. Chen, “Adalearner: An adaptive distributed mobile learning system for neural networks,” in Computer-Aided Design (ICCAD), 2017 IEEE/ACM International Conference on, pp. 291–296, IEEE, 2017.
[78] C.-E. Lee, Y. S. Shao, J.-F. Zhang, A. Parashar, J. Emer, S. W. Keckler, and Z. Zhang, “Stitch-x: An accelerator architecture for exploiting unstructured sparsity in deep neural networks,” in SysML, 2018.
[79] J. Liu, H. Zhao, M. A. Ogleari, D. Li, and J. Zhao, “Processing-in-memory for energy-efficient neural network training: A heterogeneous approach,” in MICRO, 2018.
[80] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, “Scalpel: Customizing dnn pruning to the underlying hardware parallelism,” in Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 548–560, ACM, 2017.
[81] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan, X. Ma, Y. Zhang, J. Tang, Q. Qiu, X. Lin, and B. Yuan, “Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 395–408, ACM, 2017.
[82] J. Park, H. Sharma, D. Mahajan, J. K. Kim, P. Olds, and H. Esmaeilzadeh, “Scale-out acceleration for machine learning,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 367–381, ACM, 2017.
[83] V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh, “Snapea: Predictive early activation for reducing computation in deep convolutional neural networks,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018.
[84] E. Park, D. Kim, and S. Yoo, “Energy-efficient neural network accelerator based on outlier-aware low-precision computation,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 688–698, IEEE, 2018.
[85] M. Song, J. Zhao, Y. Hu, J. Zhang, and T. Li, “Prediction based execution on deep neural networks,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 752–763, IEEE, 2018.
[86] C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian, and B. Yuan, “Permdnn: Efficient compressed dnn architecture with permuted diagonal matrices,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 189–202, IEEE, 2018.
[87] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “Tetris: Scalable and efficient neural network acceleration with 3d memory,” in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 751–764, ACM, 2017.
[88] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, “Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory,” in Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pp. 27–39, 2016.
[89] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 14–26, IEEE Press, 2016.
[90] X. Liu, M. Mao, B. Liu, H. Li, Y. Chen, B. Li, Y. Wang, H. Jiang, M. Barnell, Q. Wu, and J. Yang, “Reno: a high-efficient reconfigurable neuromorphic computing accelerator design,” in Design Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE, pp. 1–6, IEEE, 2015.
[91] M. N. Bojnordi and E. Ipek, “Memristive boltzmann machine: A hardware accelerator for combinatorial optimization and deep learning,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 1–13, IEEE, 2016.
[92] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined reram-based accelerator for deep learning,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 541–552, IEEE, 2017.
[93] X. Qiao, X. Cao, H. Yang, L. Song, and H. Li, “Atomlayer: a universal reram-based cnn accelerator with atomic layer computation,” in Proceedings of the 55th Annual Design Automation Conference, p. 103, ACM, 2018.
[94] H. Ji, L. Song, L. Jiang, H. H. Li, and Y. Chen, “Recom: An efficient resistive accelerator for compressed deep neural networks,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 237–240, IEEE, 2018.
[95] F. Chen, L. Song, and Y. Chen, “Regan: A pipelined reram-based accelerator for generative adversarial networks,” in Design Automation Conference (ASP-DAC), 2018 23rd Asia and South Pacific, pp. 178–183, IEEE, 2018.
[96] B. Li, L. Song, F. Chen, X. Qian, Y. Chen, and H. H. Li, “Reram-based accelerator for deep learning,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 815–820, IEEE, 2018.
[97] F. Chen and H. Li, “Emat: an efficient multi-task architecture for transfer learning using reram,” in Proceedings of the International Conference on Computer-Aided Design, p. 33, ACM, 2018.
[98] T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang, “Binary convolutional neural network on rram,” in Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific, pp. 782–787, IEEE, 2017.
[99] F. Chen, L. Song, H. H. Li, and Y. Chen, “Zara: A novel zero-free dataflow accelerator for generative adversarial networks in 3d reram,” in Proceedings of the 56th Annual Design Automation Conference 2019, pp. 1–6, 2019.
[100] P. Wang, Y. Ji, C. Hong, Y. Lyu, D. Wang, and Y. Xie, “Snrram: an efficient sparse neural network computation architecture based on resistive random-access memory,” in Proceedings of the 55th Annual Design Automation Conference, p. 106, ACM, 2018.
[101] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, “Cnvlutin: ineffectual-neuron-free deep neural network computing,” in Proceedings of the 43rd International Symposium on Computer Architecture, pp. 1–13, 2016.
[102] Y. Ji, Y. Zhang, S. Li, P. Chi, C. Jiang, P. Qu, Y. Xie, and W. Chen, “Neutrams: Neural network transformation and co-design under neuromorphic hardware constraints,” in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on, pp. 1–13, IEEE, 2016.
[103] A. Mirhoseini, B. D. Rouhani, E. M. Songhori, and F. Koushanfar, “Perform-ml: Performance optimized machine learning by platform and content aware customization,” in Proceedings of the 53rd Annual Design Automation Conference, p. 20, ACM, 2016.
[104] Z. Takhirov, J. Wang, V. Saligrama, and A. Joshi, “Energy-efficient adaptive classifier design for mobile systems,” in Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pp. 52–57, ACM, 2016.
[105] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “Dnnbuilder: an automated tool for building high-performance dnn hardware accelerators for fpgas,” in Proceedings of the International Conference on Computer-Aided Design, p. 56, ACM, 2018.
[106] Y. Wang, W. Wen, B. Liu, D. Chiarulli, and H. Li, “Group scissor: Scaling neuromorphic computing design to large neural networks,” in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6, IEEE, 2017.
[107] J. H. Ko, B. Mudassar, T. Na, and S. Mukhopadhyay, “Design of an energy-efficient accelerator for training of convolutional neural networks using frequency-domain computation,” in Proceedings of the 54th Annual Design Automation Conference 2017, p. 59, ACM, 2017.
[108] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583–598, 2014.
[109] Y. S. Shao, J. Clemons, R. Venkatesan, B. Zimmer, M. Fojtik, N. Jiang, B. Keller, A. Klinefelter, N. Pinckney, P. Raina, S. G. Tell, Y. Zhang, W. J. Dally, J. Emer, C. T. Gray, B. Khailany, and S. W. Keckler, “Simba: Scaling deep-learning inference with multi-chip-module-based architecture,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 14–27, 2019.
[110] M. Zhu, T. Zhang, Z. Gu, and Y. Xie, “Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern gpus,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 359–371, 2019.
[111] Q. Luo, J. Lin, Y. Zhuo, and X. Qian, “Hop: Heterogeneity-aware decentralized training,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 893–907, 2019.
[112] W. Niu, X. Ma, S. Lin, S. Wang, X. Qian, X. Lin, Y. Wang, and B. Ren, “Patdnn: Achieving real-time dnn execution on mobile devices with pattern-based weight pruning,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 907–922, 2020.
[113] X. Peng, X. Shi, H. Dai, H. Jin, W. Ma, Q. Xiong, F. Yang, and X. Qian, “Capuchin: Tensor-based gpu memory management for deep learning,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 891–905, 2020.
[114] Q. Luo, J. He, Y. Zhuo, and X. Qian, “Prague: High-performance heterogeneity-aware asynchronous decentralized training,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 401–416, 2020.
[115] X. Wang, R. Hou, B. Zhao, F. Yuan, J. Zhang, D. Meng, and X. Qian, “Dnnguard: An elastic heterogeneous dnn accelerator architecture against adversarial attacks,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 19–34, 2020.
[116] A. Li, T. Geng, T. Wang, M. Herbordt, S. L. Song, and K. Barker, “Bstc: a novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–30, 2019.
[117] X. Zhang, S. L. Song, C. Xie, J. Wang, W. Zhang, and X. Fu, “Enabling highly efficient capsule networks processing through a pim-based architecture design,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 542–555, IEEE, 2020.
[118] A. Fuchs and D. Wentzlaff, “The accelerator wall: Limits of chip specialization,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 1–14, IEEE, 2019.
[119] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[120] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[121] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[122] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[123] M. M. Waldrop, “The chips are down for moore’s law,” Nature News, vol. 530, no. 7589, p. 144, 2016.
[124] A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, “Nda: Near-dram acceleration architecture leveraging commodity dram devices and standard memory modules,” in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pp. 283–295, IEEE, 2015.
[125] H.-S. P. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee, F. T. Chen, and M.-J. Tsai, “Metal–oxide rram,” Proceedings of the IEEE, vol. 100, no. 6, pp. 1951–1970, 2012.
[126] G. Vigna and R. A. Kemmerer, “Netstat: A network-based intrusion detection approach,” in Computer Security Applications Conference, 1998. Proceedings. 14th Annual, pp. 25–34, IEEE, 1998.
[127] E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, “Finding high-quality content in social media,” in Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 183–194, ACM, 2008.
[128] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web,” 1999.
[129] C. Biemann, “Chinese whispers: An efficient graph clustering algorithm and its application to natural language processing problems,” in Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing, pp. 73–80, Association for Computational Linguistics, 2006.
[130] S. H. Corston, W. B. Dolan, L. H. Vanderwende, and L. Braden-Harder, “System for processing textual inputs using natural language processing techniques,” May 31, 2005. US Patent 6,901,399.
[131] R. Mihalcea and D. Radev, Graph-based natural language processing and information retrieval. Cambridge University Press, 2011.
[132] E. J. Chesler, L. Lu, S. Shou, Y. Qu, J. Gu, J. Wang, H. C. Hsu, J. D. Mountz, N. E. Baldwin, M. A. Langston, D. W. Threadgill, K. F. Manly, and R. W. Williams, “Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function,” Nature Genetics, vol. 37, no. 3, pp. 233–242, 2005.
[133] A. Conesa, S. Götz, J. M. García-Gómez, J. Terol, M. Talón, and M. Robles, “Blast2go: A universal tool for annotation, visualization and analysis in functional genomics research,” Bioinformatics, vol. 21, no. 18, pp. 3674–3676, 2005.
[134] G. Linden, B. Smith, and J. York, “Amazon.com recommendations: Item-to-item collaborative filtering,” IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
[135] J. B. Schafer, D. Frankowski, J. Herlocker, and S. Sen, “Collaborative filtering recommender systems,” in The Adaptive Web, pp. 291–324, Springer, 2007.
[136] F. E. Walter, S. Battiston, and F. Schweitzer, “A model of a trust-based recommendation system on a social network,” Autonomous Agents and Multi-Agent Systems, vol. 16, no. 1, pp. 57–74, 2008.
[137] B. J. Frey, Graphical models for machine learning and digital communication. MIT Press, 1998.
[138] M. J. Wainwright and M. I. Jordan, Graphical models, exponential families, and variational inference. Now Publishers Inc, 2008.
[139] K. P. Murphy, Machine learning: A probabilistic perspective. MIT Press, 2012.
[140] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: A framework for machine learning and data mining in the cloud,” Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716–727, 2012.
[141] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: Distributed graph-parallel computation on natural graphs,” in Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pp. 17–30, 2012.
[142] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica, “Graphx: A resilient distributed graph system on spark,” in First International Workshop on Graph Data Management Experiences and Systems, p. 2, ACM, 2013.
[143] R. Chen, J. Shi, Y. Chen, and H. Chen, “Powerlyra: Differentiated graph computation and partitioning on skewed graphs,” in Proceedings of the Tenth European Conference on Computer Systems, p. 1, ACM, 2015.
[144] M. LeBeane, S. Song, R. Panda, J. H. Ryoo, and L. K. John, “Data partitioning strategies for graph workloads on heterogeneous clusters,” in SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–12, Nov 2015.
[145] S. Song, M. Li, X. Zheng, M. LeBeane, J. H. Ryoo, R. Panda, A. Gerstlauer, and L. K. John, “Proxy-guided load balancing of graph processing workloads on heterogeneous clusters,” in 2016 45th International Conference on Parallel Processing (ICPP), pp. 77–86, Aug 2016.
[146] Y. Zhuo, J. Chen, Q. Luo, Y. Wang, H. Yang, D. Qian, and X. Qian, “Symplegraph: Distributed graph processing with precise loop-carried dependency guarantee,” in Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 592–607, 2020.
[147] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis, “Mizan: A system for dynamic load balancing in large-scale graph processing,” in Proceedings of the 8th ACM European Conference on Computer Systems, pp. 169–182, ACM, 2013.
[148] M. Randles, D. Lamb, and A. Taleb-Bendiab, “A comparative study into distributed load balancing algorithms for cloud computing,” in Advanced Information Networking and Applications Workshops (WAINA), 2010 IEEE 24th International Conference on, pp. 551–556, IEEE, 2010.
[149] Y. Zhao, K. Yoshigoe, M. Xie, S. Zhou, R. Seker, and J. Bian, “Lightgraph: Lighten communication in distributed graph-parallel processing,” in 2014 IEEE International Congress on Big Data, pp. 717–724, IEEE, 2014.
[150] P. Wang, K. Zhang, R. Chen, H. Chen, and H. Guan, “Replication-based fault-tolerance for large-scale graph processing,” in 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pp. 562–573, IEEE, 2014.
[151] X. Zhu, W. Han, and W. Chen, “Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning,” in 2015 USENIX Annual Technical Conference (USENIX ATC 15), pp. 375–386, 2015.
[152] A. Kyrola, G. Blelloch, and C. Guestrin, “Graphchi: Large-scale graph computation on just a pc,” in Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pp. 31–46, 2012.
[153] A. Roy, I. Mihailovic, and W. Zwaenepoel, “X-stream: Edge-centric graph processing using streaming partitions,” in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pp. 472–488, ACM, 2013.
[154] K. Vora, G. Xu, and R. Gupta, “Load the edges you need: A generic i/o optimization for disk-based graph processing,” in 2016 USENIX Annual Technical Conference (USENIX ATC 16), 2016.
[155] S. Song, X. Zheng, A. Gerstlauer, and L. K. John, “Fine-grained power analysis of emerging graph processing workloads for cloud operations management,” in 2016 IEEE International Conference on Big Data (Big Data), pp. 2121–2126, Dec 2016.
[156] D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum, “Fast crash recovery in ramcloud,” in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pp. 29–41, ACM, 2011.
[157] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph processing,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 135–146, ACM, 2010.
[158] Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, “Graphlab: A new framework for parallel machine learning,” arXiv preprint arXiv:1408.2041, 2014.
[159] T. Lahiri, M.-A. Neimat, and S. Folkman, “Oracle timesten: An in-memory database for enterprise applications,” IEEE Data Eng. Bull., vol. 36, no. 2, pp. 6–13, 2013.
[160] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, “Sap hana database: Data management for modern business applications,” ACM SIGMOD Record, vol. 40, no. 4, pp. 45–51, 2012.
[161] G. H. Golub and J. M. Ortega, Scientific computing: An introduction with parallel computing. Elsevier, 2014.
[162] P. Harrison and A. Valavanis, Quantum wells, wires and dots: Theoretical and computational physics of semiconductor nanostructures. John Wiley & Sons, 2016.
[163] F. Jensen, Introduction to computational chemistry. John Wiley & Sons, 2017.
[164] T. Chapman, P. Avery, P. Collins, and C. Farhat, “Accelerated mesh sampling for the hyper reduction of nonlinear computational models,” International Journal for Numerical Methods in Engineering, vol. 109, no. 12, pp. 1623–1654, 2017.
[165] M. S. Nobile, P. Cazzaniga, A. Tangherloni, and D. Besozzi, “Graphics processing units in bioinformatics, computational biology and systems biology,” Briefings in Bioinformatics, vol. 18, no. 5, pp. 870–885, 2017.
[166] M. Arioli, J. W. Demmel, and I. S. Duff, “Solving sparse linear systems with sparse backward error,” SIAM Journal on Matrix Analysis and Applications, vol. 10, no. 2, pp. 165–190, 1989.
[167] Y. Saad, Iterative methods for sparse linear systems, vol. 82. SIAM, 2003.
[168] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover, “Gpu cluster for high performance computing,” in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, (USA), p. 47, IEEE Computer Society, 2004.
[169] F. Song, S. Tomov, and J. Dongarra, “Enabling and scaling matrix computations on heterogeneous multi-core and multi-gpu systems,” in Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, (New York, NY, USA), pp. 365–376, Association for Computing Machinery, 2012.
[170] D. Luebke, “Cuda: Scalable parallel programming for high-performance scientific computing,” in 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 836–838, IEEE, 2008.
[171] M. M. Waldrop, “The chips are down for moore’s law,” Nature News, vol. 530, no. 7589, p. 144, 2016.
[172] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” in 2011 38th Annual International Symposium on Computer Architecture (ISCA), pp. 365–376, 2011.
[173] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International Conference on Machine Learning, pp. 1737–1746, 2015.
[174] L. Song, F. Chen, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “Accpar: Tensor partitioning for heterogeneous deep learning accelerators,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 342–355, IEEE, 2020.
[175] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen, “Graphr: Accelerating graph processing using reram,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 531–543, IEEE, 2018.
[176] L. Song, F. Chen, X. Qian, H. Li, and Y. Chen, “Refloat: Low-cost floating-point processing in reram,” under review, 2020.
[177] Z. Jia, S. Lin, C. R. Qi, and A. Aiken, “Exploring hidden dimensions in parallelizing convolutional neural networks,” arXiv preprint arXiv:1802.04924, 2018.
[178] M. Wang, C.-c. Huang, and J. Li, “Unifying data, model and hybrid parallelism in deep learning via tensor tiling,” arXiv preprint arXiv:1805.04170, 2018.
[179] M. Wang, C.-c. Huang, and J. Li, “Supporting very large models using automatic dataflow graph partitioning,” in Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1–17, 2019.
[180] Z. Jia, M. Zaharia, and A. Aiken, “Beyond data and model parallelism for deep neural networks,” arXiv preprint arXiv:1807.05358, 2018.
[181] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[182] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[183] Y. LeCun, C. Cortes, and C. J. Burges, “The mnist database of handwritten digits,” 1998.
[184] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, “Nvsim: A circuit-level performance, energy, and area model for emerging nonvolatile memory,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 7, pp. 994–1007, 2012.
[185] T. J. Ham, L. Wu, N. Sundaram, N. Satish, and M. Martonosi, “Graphicionado: A high-performance and energy-efficient accelerator for graph analytics,” in Proceedings of the 49th International Symposium on Microarchitecture, ACM, 2016.
[186] M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. Burns, and O. Ozturk, “Energy efficient architecture for graph analytics accelerators,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 166–177, IEEE, 2016.
[187] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in ACM SIGARCH Computer Architecture News, vol. 43, pp. 105–117, ACM, 2015.
[188] J. T. Pawlowski, “Hybrid memory cube (hmc),” in Hot Chips, vol. 23, 2011.
[189] B. Feinberg, U. K. R. Vengalam, N. Whitehair, S. Wang, and E. Ipek, “Enabling scientific computing on memristive accelerators,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 367–382, IEEE, 2018.
[190] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural Networks, vol. 12, no. 1, pp. 145–151, 1999.
[191] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[192] C. Eckert, X. Wang, J. Wang, A. Subramaniyan, R. Iyer, D. Sylvester, D. Blaauw, and R. Das, “Neural cache: Bit-serial in-cache acceleration of deep neural networks,” in Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 383–396, IEEE Press, 2018.
[193] X. Wang, J. Yu, C. Augustine, R. Iyer, and R. Das, “Bit prudent in-cache acceleration of deep convolutional neural networks,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 81–93, IEEE, 2019.
[194] M. Mahmoud, K. Siu, and A. Moshovos, “Diffy: A deja vu-free differential deep neural network accelerator,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 134–147, IEEE, 2018.
[195] J. Zhang and J. Li, “Improving the performance of opencl-based fpga accelerator for convolutional neural network,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 25–34, 2017.
[196] Y. Li, I.-J. Liu, Y. Yuan, D. Chen, A. Schwing, and J. Huang, “Accelerating distributed reinforcement learning with in-switch computing,” in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pp. 279–291, IEEE, 2019.
[197] A. Azizimazreah and L. Chen, “Shortcut mining: Exploiting cross-layer shortcut reuse in dcnn accelerators,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 94–105, IEEE, 2019.
[198] H. Cho, P. Oh, J. Park, W. Jung, and J. Lee, “Fa3c: Fpga-accelerated deep reinforcement learning,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 499–513, ACM, 2019.
[199] Z. Li, C. Ding, S. Wang, W. Wen, Y. Zhuo, C. Liu, Q. Qiu, W. Xu, X. Lin, X. Qian, and Y. Wang, “E-rnn: Design optimization for efficient recurrent neural networks in fpgas,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 69–80, IEEE, 2019.
[200] M. Gao, X. Yang, J. Pu, M. Horowitz, and C. Kozyrakis, “Tangram: Optimized coarse-grained dataflow for scalable nn accelerators,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 807–820, ACM, 2019.
[201] H. Kung, B. McDanel, and S. Q. Zhang, “Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 821–834, ACM, 2019.
[202] H. Kwon, A. Samajdar, and T. Krishna, “Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 461–475, 2018.
[203] H. Kim, J. Sim, Y. Choi, and L.-S. Kim, “Nand-net: Minimizing computational complexity of in-memory processing for binary neural networks,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 661–673, IEEE, 2019.
[204] S. Li, A. O. Glova, X. Hu, P. Gu, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “Scope: A stochastic computing engine for dram-based in-situ accelerator,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 696–709, IEEE, 2018.
[205] P. Srivastava, M. Kang, S. K. Gonugondla, S. Lim, J. Choi, V. Adve, N. S. Kim, and N. Shanbhag, “Promise: An end-to-end design of a programmable mixed-signal accelerator for machine-learning algorithms,” in Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 43–56, IEEE Press, 2018.
[206] C. Deng, F. Sun, X. Qian, J. Lin, Z. Wang, and B. Yuan, “Tie: Energy-efficient tensor train-based inference engine for deep neural network,” in Proceedings of the 46th International Symposium on Computer Architecture, pp. 264–278, 2019.
[207] S. Sharify, A. D. Lascorz, M. Mahmoud, M. Nikolic, K. Siu, D. M. Stuart, Z. Poulos, and A. Moshovos, “Laconic deep learning inference acceleration,” in Proceedings of the 46th International Symposium on Computer Architecture, pp. 304–317, ACM, 2019.
[208] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang, “Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 925–938, ACM, 2019.
[209] A. Jain, A. Phanishayee, J. Mars, L. Tang, and G. Pekhimenko, “Gist: Efficient data encoding for deep neural network training,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 776–789, IEEE, 2018.
[210] H. Jang, J. Kim, J.-E. Jo, J. Lee, and J. Kim, “Mnnfast: A fast and scalable system architecture for memory-augmented neural networks,” in Proceedings of the 46th International Symposium on Computer Architecture, pp. 250–263, 2019.
[211] M. Riera, J.-M. Arnau, and A. Gonzalez, “Computation reuse in dnns by exploiting input similarity,” in Proceedings of the 45th Annual International Symposium on Computer Architecture, pp. 57–68, IEEE Press, 2018.
[212] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler, “vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design,” in The 49th Annual IEEE/ACM International Symposium on Microarchitecture, p. 18, IEEE Press, 2016.
[213] M. Rhu, M. O’Connor, N. Chatterjee, J. Pool, Y. Kwon, and S. W. Keckler, “Compressing dma engine: Leveraging activation sparsity for training deep neural networks,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 78–91, IEEE, 2018.
[214] X. He, W. Lu, G. Yan, and X. Zhang, “Joint design of training and hardware towards efficient and accuracy-scalable neural network inference,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 4, pp. 810–821, 2018.
[215] J. Zhang, X. Chen, M. Song, and T. Li, “Eager pruning: Algorithm and architecture support for fast training of deep neural networks,” in Proceedings of the 46th International Symposium on Computer Architecture, pp. 292–303, ACM, 2019.
[216] A. Delmas Lascorz, P. Judd, D. M. Stuart, Z. Poulos, M. Mahmoud, S. Sharify, M. Nikolic, K. Siu, and A. Moshovos, “Bit-tactical: A software/hardware approach to exploiting value and bit sparsity in neural networks,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 749–763, ACM, 2019.
[217] A. Samajdar, P. Mannan, K. Garg, and T. Krishna, “Genesys: Enabling continuous learning through neural network evolution in hardware,” in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 855–866, IEEE, 2018.
[218] T.-H. Yang, H.-Y. Cheng, C.-L. Yang, I. Tseng, H.-W. Hu, H.-S. Chang, and H.-P. Li, “Sparse reram engine: Joint exploration of activation and weight sparsity in compressed neural networks,” in Proceedings of the 46th International Symposium on Computer Architecture, pp. 236–249, ACM, 2019.
[219] R. LiKamWa, Y. Hou, Y. Gao, M. Polansky, and L. Zhong, “Redeye: Analog convnet image sensor architecture for continuous mobile vision,” in 43rd International Symposium on Computer Architecture, ISCA 2016, pp. 255–266, Institute of Electrical and Electronics Engineers Inc., 2016.
[220] R. Cai, A. Ren, O. Chen, N. Liu, C. Ding, X. Qian, J. Han, W. Luo, N. Yoshikawa, and Y. Wang, “A stochastic-computing based deep learning framework using adiabatic quantum-flux-parametron superconducting technology,” in Proceedings of the 46th International Symposium on Computer Architecture, pp. 567–578, ACM, 2019.
[221] M. Imani, S. Gupta, Y. Kim, and T. Rosing, “Floatpim: In-memory acceleration of deep neural network training with high precision,” in Proceedings of the 46th International Symposium on Computer Architecture, pp. 802–815, ACM, 2019.
[222] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy, and D. S. Milojicic, “Puma: A programmable ultra-efficient memristor-based accelerator for machine learning inference,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 715–731, ACM, 2019.
[223] Y. Ji, Y. Zhang, X. Xie, S. Li, P. Wang, X. Hu, Y. Zhang, and Y. Xie, “Fpsa: A full system stack solution for reconfigurable reram-based nn accelerator architecture,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 733–747, ACM, 2019.
[224] L. Ke, X. He, and X. Zhang, “Nnest: Early-stage design space exploration tool for neural network inference accelerators,” in Proceedings of the International Symposium on Low Power Electronics and Design, p. 4, ACM, 2018.
[225] J. L. Hennessy and D. A. Patterson, “A new golden age for computer architecture: Domain-specific hardware/software co-design, enhanced security, open instruction sets, and agile chip development,” Turing Lecture, 2018.
[226] D. Patterson, “50 years of computer architecture: From the mainframe cpu to the domain-specific tpu and the open risc-v instruction set,” in 2018 IEEE International Solid-State Circuits Conference (ISSCC), pp. 27–31, IEEE, 2018.
[227] J. L. Hennessy and D. A. Patterson, “A new golden age for computer architecture,” Communications of the ACM, vol. 62, no. 2, pp. 48–60, 2019.
[228] J. Dean, D. Patterson, and C. Young, “A new golden age in computer architecture: Empowering the machine-learning revolution,” IEEE Micro, vol. 38, no. 2, pp. 21–29, 2018.
[229] Q. Yang, H. Li, and Q. Wu, “A quantized training method to enhance accuracy of reram-based neuromorphic systems,” in Circuits and Systems (ISCAS), 2018 IEEE International Symposium on, pp. 1–5, IEEE, 2018.
[230] G. E. Moore, “Cramming more components onto integrated circuits,” IEEE Solid-State Circuits Society Newsletter, 2006.
[231] “AI & machine learning products: Cloud TPU.” https://cloud.google.com/tpu/.
[232] Y. LeCun et al., “Lenet-5, convolutional neural networks,” URL: http://yann.lecun.com/exdb/lenet, 2015.
[233] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[234] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
[235] “VPC resource quotas.” https://cloud.google.com/vpc/docs/quota.
[236] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 609–616, ACM, 2009.
[237] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3642–3649, IEEE, 2012.
[238] D. C. Ciresan, U. Meier, J. Masci, L. Maria Gambardella, and J. Schmidhuber, “Flexible, high performance convolutional neural networks for image classification,” in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, p. 1237, 2011.
[239] P. Sermanet, S. Chintala, and Y. LeCun, “Convolutional neural networks applied to house numbers digit classification,” in Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 3288–3291, IEEE, 2012.
[240] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1717–1724, 2014.
[241] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
[242] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
[243] A. G. Howard, “Some improvements on deep convolutional neural network based image classification,” arXiv preprint arXiv:1312.5402, 2013.
[244] Y. Gong, Y. Jia, T. Leung, A. Toshev, and S. Ioffe, “Deep convolutional ranking for multilabel image annotation,” arXiv preprint arXiv:1312.4894, 2013.
[245] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th International Conference on Machine Learning, pp. 160–167, ACM, 2008.
[246] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, “Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pp. 4277–4280, IEEE, 2012.
[247] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modelling sentences,” arXiv preprint arXiv:1404.2188, 2014.
[248] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 8599–8603, IEEE, 2013.
[249] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649, IEEE, 2013.
[250] C. Xu, D. Niu, N. Muralimanohar, R. Balasubramonian, T. Zhang, S. Yu, and Y. Xie, “Overcoming the challenges of crossbar resistive memory architectures,” in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pp. 476–488, IEEE, 2015.
[251] T.-Y. Liu, T. H. Yan, R. Scheuerlein, Y. Chen, J. K. Lee, G. Balakrishnan, G. Yee, H. Zhang, A. Yap, J. Ouyang, T. Sasaki, A. Al-Shamma, C. Chen, M. Gupta, G. Hilton, A. Kathuria, V. Lai, M. Matsumoto, A. Nigam, A. Pai, J. Pakhale, C. H. Siau, X. Wu, Y. Yin, N. Nagel, Y. Tanaka, M. Higashitani, T. Minvielle, C. Gorla, T. Tsukamoto, T. Yamaguchi, M. Okajima, T. Okamura, S. Takase, H. Inoue, and L. Fasoli, “A 130.7-mm² 2-layer 32-gb reram memory device in 24-nm technology,” IEEE Journal of Solid-State Circuits, vol. 49, no. 1, pp. 140–153, 2014.
[252] R. Fackenthal, M. Kitagawa, W. Otsuka, K. Prall, D. Mills, K. Tsutsui, J. Javanifard, K. Tedrow, T. Tsushima, Y. Shibahara, and G. Hush, “19.7 A 16gb reram with 200mb/s write and 1gb/s read in 27nm technology,” in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 338–339, IEEE, 2014.
[253] M.-J. Lee, C. B. Lee, D. Lee, S. R. Lee, M. Chang, J. H. Hur, Y.-B. Kim, C.-J. Kim, D. H. Seo, S. Seo, U.-I. Chung, I.-K. Yoo, and K. Kim, “A fast, high-endurance and scalable non-volatile memory device made from asymmetric ta2o5-x/tao2-x bilayer structures,” Nature Materials, vol. 10, no. 8, pp. 625–630, 2011.
[254] C. Hsu, I. Wang, C. Lo, M. Chiang, W. Jang, C. Lin, and T. Hou, “Self-rectifying bipolar taox/tio2 rram with superior endurance over 10¹² cycles for 3d high-density storage-class memory,” in 2013 Symposium on VLSI Technology, pp. T166–T167, 2013.
[255] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali, “Enhancing lifetime and security of pcm-based main memory with start-gap wear leveling,” in Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 14–23, ACM, 2009.
[256] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, R. S. Williams, and J. Yang, “Dot-product engine for neuromorphic computing: Programming 1t1m crossbar to accelerate matrix-vector multiplication,” in 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6, IEEE, 2016.
[257] M. Hu, H. Li, Q. Wu, and G. S. Rose, “Hardware realization of bsb recall function using memristor crossbar arrays,” in Proceedings of the 49th Annual Design Automation Conference, pp. 498–503, ACM, 2012.
[258] S. Yu, H.-Y. Chen, Y. Deng, B. Gao, Z. Jiang, J. Kang, and H.-S. P. Wong, “3d vertical rram: Scaling limit analysis and demonstration of 3d array operation,” in VLSI Technology (VLSIT), 2013 Symposium on, pp. T158–T159, IEEE, 2013.
[259] C. Xu, P.-Y. Chen, D. Niu, Y. Zheng, S. Yu, and Y. Xie, “Architecting 3d vertical resistive memory for next-generation storage systems,” in Proceedings of the 2014 IEEE/ACM International Conference on Computer-Aided Design, pp. 55–62, IEEE Press, 2014.
[260] S. Yu, Y. Wu, and H.-S. P. Wong, “Investigating the switching dynamics and multilevel capability of bipolar metal oxide resistive switching memory,” Applied Physics Letters, vol. 98, no. 10, p. 103514, 2011.
[261] M.-C. Wu, W.-Y. Jang, C.-H. Lin, and T.-Y. Tseng, “A study on low-power, nanosecond operation and multilevel bipolar resistance switching in ti/zro2/pt nonvolatile memory with 1t1r architecture,” Semiconductor Science and Technology, vol. 27, no. 6, p. 065010, 2012.
[262] F. Alibart, L. Gao, B. D. Hoskins, and D. B. Strukov, “High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm,” Nanotechnology, vol. 23, no. 7, p. 075201, 2012.
[263] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the ACM International Conference on Multimedia, pp. 675–678, ACM, 2014.
[264] D. Niu, C. Xu, N. Muralimanohar, N. P. Jouppi, and Y. Xie, “Design trade-offs for high density cross-point resistive memory,” in Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 209–214, ACM, 2012.
[265] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “Pim-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,” in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on, pp. 336–348, IEEE, 2015.
[266] L. Nai, R. Hadidi, J. Sim, H. Kim, P. Kumar, and H. Kim, “Graphpim: Enabling instruction-level pim offloading in graph computing frameworks,” in High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, pp. 457–468, IEEE, 2017.
[267] Y. Zhuo, C. Wang, M. Zhang, R. Wang, D. Niu, Y. Wang, and X. Qian, “Graphq: Scalable pim-based graph processing,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 712–725, 2019.
[268] M. Zhang, Y. Zhuo, C. Wang, M. Gao, Y. Wu, K. Chen, C. Kozyrakis, and X. Qian, “Graphp: Reducing communication for pim-based graph processing with efficient data partition,” in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 544–557, IEEE, 2018.
[269] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection.” http://snap.stanford.edu/data, June 2014.
[270] T. H. Cormen, Introduction to Algorithms. MIT Press, 2009.
[271] R. A. Rossi and N. K. Ahmed, “The network data repository with interactive graph analytics and visualization,” in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[272] J. Bennett and S. Lanning, “The netflix prize,” in Proceedings of KDD Cup and Workshop, vol. 2007, p. 35, 2007.
[273] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens, “Gunrock: A high-performance graph processing library on the gpu,” in Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, p. 11, ACM, 2016.
[274] X. Xie, W. Tan, L. L. Fong, and Y. Liang, “Cumf sgd: Fast and scalable matrix factorization,” arXiv preprint arXiv:1610.05838, 2016.
[275] D. Sanchez and C. Kozyrakis, “Zsim: fast and accurate microarchitectural simulation of thousand-core systems,” in ACM SIGARCH Computer Architecture News, vol. 41, pp. 475–486, ACM, 2013.
[276] M. Gao, G. Ayers, and C. Kozyrakis, “Practical near-data processing for in-memory analytics frameworks,” in Parallel Architecture and Compilation (PACT), 2015 International Conference on, pp. 113–124, IEEE, 2015.
[277] “Intel xeon processor e5-2630 v3 (20m cache, 2.40 ghz).” http://ark.intel.com/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz.
[278] “CACTI.” http://quid.hpl.hp.com:9081/cacti/.
[279] B. Murmann, “Adc performance survey 1997-2016.” http://web.stanford.edu/~murmann/adcsurvey.html.
[280] H. Akinaga and H. Shima, “Resistive random access memory (reram) based on metal oxides,” Proceedings of the IEEE, vol. 98, no. 12, pp. 2237–2251, 2010.
[281] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, “Compression of deep convolutional neural networks for fast and low power mobile applications,” arXiv preprint arXiv:1511.06530, 2015.
[282] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2704–2713, IEEE, 2018.
[283] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.
[284] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
[285] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless cnns with low-precision weights,” arXiv preprint arXiv:1702.03044, 2017.
[286] D. Fujiki, S. Mahlke, and R. Das, “In-memory data parallel processor,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’18, pp. 1–14, ACM, 2018.
[287] I. S. Committee et al., “754-2008 ieee standard for floating-point arithmetic,” IEEE Computer Society Std, vol. 2008, p. 517, 2008.
[288] J. H. Ferziger and M. Peric, Computational Methods for Fluid Dynamics, vol. 3. Springer, 2002.
[289] A. Griewank and A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, vol. 105. SIAM, 2008.
[290] A. C. Antoulas, Approximation of Large-Scale Dynamical Systems, vol. 6. SIAM, 2005.
[291] J. H. Wilkinson, Rounding Errors in Algebraic Processes. Courier Corporation, 1994.
[292] C. B. Moler, “Iterative refinement in floating point,” Journal of the ACM (JACM), vol. 14, no. 2, pp. 316–321, 1967.
[293] J. W. Demmel, Applied Numerical Linear Algebra, vol. 56. SIAM, 1997.
[294] M. R. Hestenes and E. Stiefel, Methods of Conjugate Gradients for Solving Linear Systems, vol. 49. NBS Washington, DC, 1952.
[295] H. A. Van der Vorst, “Bi-cgstab: A fast and smoothly converging variant of bi-cg for the solution of nonsymmetric linear systems,” SIAM Journal on Scientific and Statistical Computing, vol. 13, no. 2, pp. 631–644, 1992.
[296] T. A. Davis and Y. Hu, “The university of florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 1, 2011.
[297] R. F. Boisvert, R. Pozo, K. Remington, R. F. Barrett, and J. J. Dongarra, “Matrix market: a web resource for test matrix collections,” in Quality of Numerical Software, pp. 125–137, Springer, 1997.
[298] L. Kull, D. Luu, C. Menolfi, M. Braendli, P. A. Francese, T. Morf, M. Kossel, H. Yueksel, A. Cevrero, I. Ozkaya, and T. Toifl, “28.5 a 10b 1.5 gs/s pipelined-sar adc with background second-stage common-mode regulation and offset calibration in 14nm cmos finfet,” in 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 474–475, IEEE, 2017.
[299] L. Song, Y. Wu, X. Qian, H. Li, and Y. Chen, “Rebnn: in-situ acceleration of binarized neural networks in reram using complementary resistive cell,” CCF Transactions on High Performance Computing, vol. 1, no. 3, pp. 196–208, 2019.
[300] Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang, “A survey of accelerator architectures for deep neural networks,” Engineering, vol. 6, no. 3, pp. 264–274, 2020.
[301] L. Song, F. Chen, Y. Chen, and H. Li, “Parallelism in deep learning accelerators,” in 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 645–650, IEEE, 2020.
[302] Y. Wang, F. Chen, L. Song, C.-J. R. Shi, H. H. Li, and Y. Chen, “Reboc: Accelerating block-circulant neural networks in reram,” in 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1472–1477, IEEE, 2020.
[303] N. Challapalle, S. Rampalli, L. Song, N. Chandramoorthy, K. Swaminathan, J. Sampson, Y. Chen, and V. Narayanan, “Gaas-x: Graph analytics accelerator supporting sparse data representation using crossbar architectures,” in Proceedings of the 47th International Symposium on Computer Architecture, 2020.
[304] Y. Wang, W. Wen, L. Song, and H. H. Li, “Classification accuracy improvement for neuromorphic computing systems with one-level precision synapses,” in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 776–781, IEEE, 2017.
[305] C. Wang, D. Feng, W. Tong, Y. Hua, J. Liu, B. Wu, W. Zhao, L. Song, Y. Zhang, J. Xu, X. Wei, and Y. Chen, “Improving multilevel writes on vertical 3d cross-point resistive memory,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020.
[306] P. Dai, J. Yang, X. Ye, X. Cheng, J. Luo, L. Song, Y. Chen, and W. Zhao, “Sparsetrain: Exploiting dataflow sparsity for efficient convolutional neural networks training,” in Proceedings of the 57th Annual Design Automation Conference 2020, pp. 1–6, 2020.
[307] F. Chen, L. Song, H. Li, and Y. Chen, “Parc: A processing-in-cam architecture for genomic long read pairwise alignment using reram,” in 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 175–180, IEEE, 2020.
[308] F. Chen, W. Wen, L. Song, J. Zhang, H. H. Li, and Y. Chen, “How to obtain and run light and efficient deep learning networks,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1–5, IEEE, 2019.
[309] P. Bogdan, F. Chen, A. Deshwal, J. R. Doppa, B. K. Joardar, H. Li, S. Nazarian, L. Song, and Y. Xiao, “Taming extreme heterogeneity via machine learning based design of autonomous manycore systems,” in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis Companion, pp. 1–10, 2019.
[310] F. Chen, L. Song, and H. Li, “Efficient process-in-memory architecture design for unsupervised gan-based deep learning using reram,” in Proceedings of the 2019 on Great Lakes Symposium on VLSI, pp. 423–428, 2019.
[311] E. Eken, L. Song, I. Bayram, C. Xu, W. Wen, Y. Xie, and Y. Chen, “Nvsim-vxs: an improved nvsim for variation aware stt-ram simulation,” in 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6, IEEE, 2016.
[312] C. Liu, B. Yan, C. Yang, L. Song, Z. Li, B. Liu, Y. Chen, H. Li, Q. Wu, and H. Jiang, “A spiking neuromorphic design with resistive crossbar,” in Design Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE, pp. 1–6, IEEE, 2015.
Biography
Linghao Song is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Duke University. His research interests lie in computer architecture, deep learning accelerators, ReRAM-based accelerators, and deep learning applications. He received an M.S. in Electrical Engineering from the University of Pittsburgh in 2017 and a B.S.E. in Information Engineering from Shanghai Jiao Tong University in 2014. He worked as a Research Intern at Oak Ridge National Laboratory (ORNL) in Summer 2018 and as a Research Intern at Pacific Northwest National Laboratory (PNNL) in Summer 2019.
Linghao’s work on computer architecture and accelerators has been published at major computer architecture conferences, including HPCA’20 [174], HPCA’19 [2], HPCA’18 [175], and HPCA’17 [92], and in the journal CCF THPC [299]. His survey of deep learning accelerators, on which he was the lead author, appeared in Elsevier Engineering [300]. His work on high-energy physics data analysis was published at ICASSP’19 [15]. He led an invited paper on parallelism in deep learning accelerators at ASP-DAC’20 [301]. He has mentored several junior and undergraduate students, whose work has been published at DAC’18 [93], DATE’18 [94], SafeAI@AAAI’19 [5], and DATE’20 [302]. He also has collaborative work published at ISCA’20 [303], ASP-DAC’17 (Best Paper Award) [304], ASP-DAC’18 (Best Paper Nomination) [95], in IEEE TCAD [305], and at DAC’20 [306], ASP-DAC’20 [307], DAC’19 [99], ICCAD’19 (Invited) [308], ESWEEK’19 (Invited) [309], GLVLSI’19 (Invited) [310], DATE’18 (Invited) [96], ICCAD’17 [76], DAC’16 [311], and DAC’15 [312].
Linghao served on the Artifact Evaluation Committee (AEC) of ASPLOS’20, the External Review Committee of DAC’20, the Shadow Program Committee of EuroSys’20, the AEC of LCTES’19, and the Program Committee of ICONS’18. He has also served as a reviewer for ACM JETC, Elsevier Neurocomputing, IEEE CAL, IEEE ESL, IEEE IoT Journal, IEEE Micro, IEEE TC, IEEE TCAD, IEEE TIE, IEEE TNNLS, IEEE TPDS, and IEEE TVLSI.