
FPGA-accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off

Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, Ce Zhang

What is the most efficient way of training dense linear models on an FPGA?

• FPGAs can handle floating-point arithmetic, but they are certainly not the best choice for it. We have tried: on par with a 10-core Xeon, not a clear win. Yet!

• How about some recent developments in machine learning research? In theory, low-precision data leads to EXACTLY the same end result.

Stochastic Gradient Descent (SGD)

while not converged do
    for i from 1 to M do
        gᵢ = (aᵢ · x − bᵢ) aᵢ
        x ← x − γ gᵢ
    end
end

Data set A (M samples × N features), model x (N × 1), labels b (M × 1): A x ≈ b.

[Diagram: data flows from a data source (DRAM, SSD, sensor) through memory (CPU cache, DRAM, FPGA BRAM) into the compute unit (CPU, GPU, FPGA), which streams rows Aᵣ, holds the model x, and computes the gradient via dot(Aᵣ, x).]
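For reference, a minimal software sketch of this SGD loop in NumPy; the convergence test is simplified to a fixed number of epochs and the function name is illustrative, not from the paper's code.

import numpy as np

def sgd_dense_linear(A, b, gamma=0.01, epochs=10):
    """Plain SGD for the dense linear model A x ~ b (M samples, N features)."""
    M, N = A.shape
    x = np.zeros(N)
    for _ in range(epochs):                   # "while not converged", simplified
        for i in range(M):                    # for i from 1 to M
            g_i = (A[i] @ x - b[i]) * A[i]    # g_i = (a_i . x - b_i) a_i
            x -= gamma * g_i                  # x <- x - gamma g_i
    return x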

Advantage of Low-Precision Data + FPGA

The same SGD loop and data path as on the previous slide, but the data set A (M samples × N features) is compressed before it is streamed to the compute unit; the model x (N × 1) and the labels b (M × 1) stay as they are.

[Diagram: same data path as before (data source, memory, compute unit), with the data set A compressed.]


Key Takeaway

[Plot: Training loss vs. iterations for naïve rounding, stochastic rounding, and full precision.]
[Plot: Training loss vs. execution time: faster, same quality.]

• Using an FPGA, we can increase the hardware efficiency while maintaining the statistical efficiency of SGD for dense linear models.

• In practice, the system opens up a multivariate trade-off space: data properties, FPGA implementation, SGD parameters, ...

Outline for the Rest…

1. Stochastic quantization and data layout
2. Target platform: Intel Xeon+FPGA
3. Implementation: from 32-bit to 1-bit SGD on the FPGA
4. Evaluation: main results and side effects
5. Conclusion

Stochastic Quantization [1]

Example: a value of 0.7 lying between the quantization levels 0 and 1.
Naive solution: nearest rounding (= 1) => converges to a different solution.
Stochastic rounding: 0 with probability 0.3, 1 with probability 0.7 => the expectation remains the same => converges to the same solution.

Objective:  minₓ ½ Σᵣ (aᵣ · x − bᵣ)²

Gradient:  gᵢ = (aᵢ · x − bᵢ) aᵢ

We need 2 independent samples:  gᵢ = (Q₁(aᵢ) · x − bᵢ) Q₂(aᵢ)
... and each iteration of the algorithm needs fresh samples.

[1] H. Zhang, K. Kara, J. Li, D. Alistarh, J. Liu, C. Zhang: ZipML: An End-to-end Bitwise Framework for Dense Generalized Linear Models, arXiv:1611.05402
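A minimal sketch of the quantized gradient, assuming feature values already normalized to [0, 1] and 1-bit levels for clarity; the function names are illustrative, not from the ZipML code base.

import numpy as np

def quantize_stochastic(a, rng):
    """Stochastically round each entry of a (assumed in [0, 1]) to 0 or 1.
    E[Q(a)] = a, so the quantization is unbiased."""
    return (rng.random(a.shape) < a).astype(np.float64)

def quantized_gradient(a_i, b_i, x, rng):
    """g_i = (Q1(a_i) . x - b_i) Q2(a_i): two *independent* samples, so that
    E[g_i] equals the full-precision gradient (a_i . x - b_i) a_i."""
    q1 = quantize_stochastic(a_i, rng)   # fresh sample for the dot product
    q2 = quantize_stochastic(a_i, rng)   # independent sample for the outer factor
    return (q1 @ x - b_i) * q2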

Before the FPGA: The Data Layout

Original layout: each sample stores one 32-bit float value per feature (feature 1 ... feature n, for samples 1 ... m).

Quantize to 4 bits. We need 2 independent quantizations, so each feature becomes a 4-bit | 4-bit pair => 4x compression!

For each iteration: a new quantization index! Generating one quantization index per iteration is currently slow.
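A sketch of the packed layout, assuming each feature has already been quantized twice to integer levels in [0, 15]; the nibble order (low nibble = first sample, high nibble = second) is our choice for illustration.

import numpy as np

def pack_q4_pairs(q1, q2):
    """Pack two independent 4-bit quantizations of a sample into one byte per
    feature (4-bit | 4-bit), i.e. 4x smaller than 32-bit floats."""
    q1 = q1.astype(np.uint8) & 0x0F
    q2 = q2.astype(np.uint8) & 0x0F
    return q1 | (q2 << 4)

def unpack_q4_pairs(packed):
    """Recover the two 4-bit quantization samples from the packed bytes."""
    return packed & 0x0F, packed >> 4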

Target Platform: Intel Xeon+FPGA

[Diagram: Intel Xeon CPU (Ivy Bridge, Xeon E5-2680 v2, 10 cores, 25 MB L3) with caches and a memory controller to 96 GB main memory at ~30 GB/s; an Altera Stratix V FPGA hosting the accelerator is attached through a QPI endpoint at 6.5 GB/s.]

Full Precision SGD on FPGA

[Pipeline diagram, three stages. Dot product: a 64 B cache line delivers 16 32-bit floats; 16 float multipliers, 16 float-to-fixed converters and an adder tree of 16 fixed adders accumulate a·x, with 1 fixed-to-float converter at the output; the rows a are buffered in a FIFO and the model x is loaded on chip. Gradient calculation: subtract b (b FIFO) and scale by γ (1 float adder, 1 float multiplier), then 16 float multipliers, 16 fixed adders and 16 converters in each direction form γ(ax − b)a. Model update: x ← x − γ(ax − b)a, applied once the batch size is reached.]

32-bit floating-point: 16 values per cache line. Processing rate: 12.8 GB/s.

Computation: custom logic. Storage device: FPGA BRAM. Data source: DRAM, SSD, sensor.


Scale out the design for low-precision data!

How can we increase the internal parallelism while maintaining the processing rate?


Challenge + Solution

[The full-precision pipeline from the previous slide, annotated with how many values each precision packs into one 64 B cache line: Q8 (8-bit): 32 values, Q4 (4-bit): 64 values, Q2 (2-bit): 128 values, Q1 (1-bit): 256 values.]

=> Scaling out is not trivial!

1) We can get rid of floating-point arithmetic.

2) We can further simplify integer arithmetic for the lowest-precision data.
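These parallelism requirements follow directly from the cache-line width; a quick check, assuming the pair layout from the data-layout slide (two independent b-bit samples per feature).

CACHE_LINE_BITS = 64 * 8                      # one 64 B cache line = 512 bits
for name, bits in [("Q8", 8), ("Q4", 4), ("Q2", 2), ("Q1", 1)]:
    values = CACHE_LINE_BITS // (2 * bits)    # two independent samples per feature
    print(name, values, "values per cache line")
# Q8: 32, Q4: 64, Q2: 128, Q1: 256 (matching the figure)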

Trick 1: Selection of Quantization Levels

1. Select the lower and upper bound [L, U].
2. Select the size of the interval Δ.

=> All quantized values are integers!
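A sketch of Trick 1, assuming K uniformly spaced levels over [L, U] with Δ = (U - L)/(K - 1); the stochastic rounding between neighboring levels mirrors the earlier slide, and the function names are illustrative.

import numpy as np

def quantize_to_levels(a, L, U, K, rng):
    """Quantize a (clipped to [L, U]) onto K uniform levels and return integer
    level indices in {0, ..., K-1}; in expectation the de-quantized value
    equals the clipped input."""
    delta = (U - L) / (K - 1)                  # interval size
    scaled = (np.clip(a, L, U) - L) / delta    # position measured in levels
    lower = np.floor(scaled)
    q = lower + (rng.random(a.shape) < scaled - lower)   # stochastic rounding
    return q.astype(np.int32)                  # all quantized values are integers

def dequantize(q, L, U, K):
    return L + q * (U - L) / (K - 1)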

This is enough for Q4 and Q8

[Pipeline diagram, derived from the full-precision design: one 64 B cache line now delivers K = 32 (Q8), 64 (Q4) or 128 (Q2) quantized values. Dot product: K fixed multipliers and K fixed adders compute Q(a)·x. Gradient calculation: subtract b (b FIFO) and scale by γ, then K fixed multipliers and K fixed adders form γ(Q(a)x − b)Q(a); the rows Q(a) are buffered in a FIFO and the model x is loaded on chip. Model update: x ← x − γ(Q(a)x − b)Q(a), using 1 fixed adder and 1 bit-shift, applied once the batch size is reached. No floating-point units or format converters are needed.]

Trick 2: Implement Multiplication Using a Multiplexer

Q2 multiplier: the 2-bit quantized value Q2value[1:0] drives a multiplexer that selects the result among 0, In[31:0], and In[31:0] << 1 when the levels are normalized to [0, 2], or among 0, In[31:0], and −In[31:0] when they are normalized to [−1, 1].

Q1 multiplier: the 1-bit quantized value Q1value[0] selects the result between 0 and In[31:0].
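A behavioral model of the two multiplexer-based multipliers; the level encodings ({0, 1, 2} and {-1, 0, 1} for Q2, {0, 1} for Q1) are assumptions for illustration, since the hardware works directly on the 2-bit / 1-bit select codes.

def q2_multiply(x, level, signed=False):
    """Q2 'multiplier' as a multiplexer: no real multiplier is instantiated.
    Levels normalized to [0, 2]: select 0, x, or x << 1 (a single bit-shift).
    Levels normalized to [-1, 1]: select -x, 0, or x."""
    if signed:
        return {-1: -x, 0: 0, 1: x}[level]
    return {0: 0, 1: x, 2: x << 1}[level]

def q1_multiply(x, bit):
    """Q1 'multiplier': a 2:1 mux selecting either 0 or x."""
    return x if bit else 0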

Support for all Qx

[Circuit for Q8, Q4 and Q2 SGD: one 64 B cache line delivers K = 32, 64 or 128 quantized values; K fixed multipliers and K fixed adders in the dot-product and gradient stages, and 1 fixed adder plus 1 bit-shift in the model update. Processing rate: 12.8 GB/s.]

[Circuit for Q1 SGD: one 64 B cache line delivers 256 quantized values and is split into 2 parts of K = 128 values each; 128 fixed multipliers and 128 fixed adders per stage. Processing rate: 6.4 GB/s.]
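A software model of one low-precision SGD step in pure integer arithmetic, assuming the model x is kept in a fixed-point format, the label b_i is pre-scaled to match the dot product, and the step size is a power of two (γ = 2^-shift) so that the scaling is a single bit-shift, as in the model-update stage above; this is a sketch of the arithmetic, not the actual RTL.

import numpy as np

def fixed_point_sgd_step(q_a, b_i_scaled, x_fixed, shift):
    """One SGD step on a quantized row q_a (integer level indices).

    x_fixed     : model in fixed-point representation (int64 array)
    b_i_scaled  : label already scaled to the dot product's fixed-point format
    shift       : step size gamma = 2**(-shift), applied as a right shift
    """
    q = q_a.astype(np.int64)
    dot = int(q @ x_fixed)               # K fixed multipliers + adder tree
    residual = dot - b_i_scaled          # subtract b (b FIFO in hardware)
    update = (residual * q) >> shift     # gamma * (Q(a).x - b) * Q(a), via bit-shift
    return x_fixed - update              # fixed adder in the model update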


FPGA resource utilization:

Data Type   Logic   DSP   BRAM
float       38%     12%   7%
Q8          35%     25%   6%
Q4          36%     50%   6%
Q2, Q1      43%     1%    6%

Data sets for evaluation

Name      Size     # Features   # Classes
MNIST     60,000   780          10
GISETTE   6,000    5,000        2
EPSILON   10,000   2,000        2
SYN100    10,000   100          Regression
SYN1000   10,000   1,000        Regression

Following the Intel legal guidelines on publishing performance numbers, we would like to make the reader aware that results in this publication were generated using preproduction hardware and software, and may not reflect the performance of production or future systems.

SGD Performance Improvement

[Plot: Training loss vs. time (s) for SGD on Synthetic100; float CPU 1-thread, float CPU 10-threads, float FPGA, Q4 FPGA. Q4 FPGA: 7.2x speedup. 4-bit works.]

[Plot: Training loss vs. time (s) for SGD on Synthetic1000; float CPU 1-thread, float CPU 10-threads, float FPGA, Q8 FPGA. Q8 FPGA: 2x speedup. 8-bit works.]

1-bit also works on some data sets

[Plot: Training loss vs. time (s) for SGD on MNIST, digit 7; float CPU 1-thread, float CPU 10-threads, float FPGA, Q1 FPGA. Q1 FPGA: 10.6x speedup. 1-bit works.]

MNIST multi-classification accuracy:

Precision        Accuracy   Training Time
float CPU-SGD    85.82%     19.04 s
1-bit FPGA-SGD   85.87%     2.41 s

Naïve Rounding vs. Stochastic Rounding

[Plot: Training loss vs. number of epochs for float, Q2/F2, Q4/F4, Q8/F8; the bias introduced by naïve rounding is marked.]

Stochastic quantization results in unbiased convergence.

Effect of Step Size

[Plot: Training loss vs. number of epochs; float with step size 1/2^9 vs. Q2 with step size 2/2^9, float with 1/2^12 vs. Q2 with 2/2^12, float with 1/2^15 vs. Q2 with 2/2^15.]

Large step size + full precision vs. small step size + low precision.


Conclusion

• Highly scalable, parametrizable FPGA-based stochastic gradient descent implementations for linear model training.

• Open source: www.systems.ethz.ch/fpga/ZipML_SGD

Key Takeaways:

1. The way to train linear models on an FPGA should be through the use of stochastically rounded, low-precision data.

2. Multivariate trade-off space: precision vs. end-to-end runtime, convergence quality, design complexity, data and system properties.


Backup

Why do dense linear models matter?

• Workhorse algorithm for regression and classification.

• Sparse, high-dimensional data sets can be converted into dense ones (transfer learning, random features).

Why is training speed important?

• Often a human-in-the-loop process (train, evaluate, train again, ...).

• Configuring parameters, selecting features, data dependency, etc.

Effect of Reusing Indexes

[Plot: Training loss vs. number of epochs; float, and Q2 with 64, 32, 16, 8, 4, and 2 quantization indexes.]

In practice, 8 indexes are enough!
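A sketch of index reuse, building on the quantize_to_levels sketch from the Trick 1 slide (an assumption, not the paper's tooling): precompute a few independently quantized copies of the data set ("indexes") and cycle through them across epochs instead of re-quantizing every epoch.

import numpy as np

def precompute_indexes(A, L, U, K, num_indexes=8, seed=0):
    """Generate num_indexes independently quantized copies of the data set,
    reusing quantize_to_levels from the earlier sketch."""
    rng = np.random.default_rng(seed)
    return [quantize_to_levels(A, L, U, K, rng) for _ in range(num_indexes)]

def index_for_epoch(indexes, epoch):
    """Cycle through the precomputed quantizations, one per epoch."""
    return indexes[epoch % len(indexes)]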