Kernel Approximation Methods for Speech Recognition

Avner May

Thesis Defense Committee: David Blei, Shih-Fu Chang, Michael Collins, Daniel Hsu, Brian Kingsbury

December 12, 2017


Rise of Deep Learning

Deep learning methods have dramatically improved state-of-the-art performance in various domains.

[Figure: Switchboard word error rate (%), 2000-2017. WER fell from 19.3 (2000) and 14.8 (2004) in the GMM era to 13.3 (2012) and 5.5 (2017) with deep learning, reaching human performance (MSFT, IBM).]

[Figure: ImageNet top-5 error rate (%) of challenge winners, 2010-2017: 28.2, 25.8, 16.4, 11.7, 6.7, 3.6, 3.0, 2.3. Deep learning entries from 2012 onward surpassed human-level error.]

Rise of Deep Learning

Problems:

• Optimization is non-convex.

• Models are hard to interpret.


Rise of Deep Learning

Open problem: Can other methods compete?


Motivation

Primary contribution of this dissertation: To demonstrate that kernel methods can be scaled to compete with deep neural networks in speech recognition.


Outline

I. Background

II. Random Fourier features (RFFs) for acoustic modeling

III. Improved RFFs through feature selection

IV. Nyström method vs. RFFs

V. Concluding remarks


Background: Kernel Methods

• Non-linear transformation of data (ϕ : X → H), followed by a linear learning algorithm.

• k(x, y) = ⟨ϕ(x), ϕ(y)⟩ → k(x, y) can be seen as a similarity score.

• Can use the “kernel trick” to avoid working in H, but at a cost of O(N²) computation (see the sketch below).
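To make the O(N²) cost concrete, here is a minimal kernel ridge regression sketch in NumPy (illustrative only; sizes and the regularizer are arbitrary): the N × N Gram matrix alone costs O(N²d) time and O(N²) memory.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

# Kernel ridge regression via the kernel trick: phi(x) is never formed.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.normal(size=500)
K = gaussian_kernel(X, X)                                    # O(N^2 d) time, O(N^2) memory
alpha = np.linalg.solve(K + 1e-2 * np.eye(len(X)), y)        # O(N^3) solve
predict = lambda X_test: gaussian_kernel(X_test, X) @ alpha  # O(Nd) per test point
```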


Background: DNNs

• Transforms the input using a composition of linear and non-linear functions.

• Feedforward network: f(x) = g_L(W_L · g_{L−1}(W_{L−1} · g_{L−2}(W_{L−2} ⋯ g_1(W_1 x))))

• Trained using gradient methods (backpropagation).


Background: DNNs vs. Kernels

                         DNNs     Kernels
Training Time            O(Np)    O(N²d)
Test Time                O(p)     O(Nd)
Universal Approximator   yes      yes
Convex                   no       yes

• N is the number of training points.

• p is the number of parameters in the DNN.

• d is the dimension of the input space.

• We assume k(x, y) can be computed in O(d) time.


Background: Kernel Approximation

• Feature expansion z : ℝ^d → ℝ^m such that z(x) · z(y) ≈ k(x, y).

• Learn a linear model over z(x).

• Scalability: gives large efficiency gains at both train and test time when m ≪ N.


Background: Kernel Approximation

“Random Fourier Features” (RFF) [1]:

z_i(x) = √(2/m) · cos(w_i · x + b_i),

where E[z(x) · z(y)] = k(x, y) if we draw w_i ∼ p_k(w) and b_i ∼ U(0, 2π).

Kernel       k(x, y)                    p_k(w)
Gaussian     exp(−‖x − y‖²₂ / 2σ²)      Normal(0_d, σ⁻² I_d)
Laplacian    exp(−λ‖x − y‖₁)            Cauchy(0_d, λ)

[1] A. Rahimi, B. Recht. Random Features for Large-Scale Kernel Machines. NIPS 2007.
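A minimal NumPy sketch of RFFs for the Gaussian kernel (not from the slides; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, sigma = 360, 2000, 1.0

# Gaussian kernel: w_i ~ Normal(0_d, sigma^{-2} I_d), b_i ~ Uniform(0, 2*pi).
W = rng.normal(scale=1.0 / sigma, size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

def z(X):
    """z_i(x) = sqrt(2/m) * cos(w_i . x + b_i), for each row x of X."""
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

# Sanity check: z(x) . z(y) concentrates around k(x, y) as m grows.
x = rng.normal(size=d)
y = x + 0.05 * rng.normal(size=d)   # a nearby point, so k(x, y) is not ~0
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma**2))
approx = (z(x[None, :]) @ z(y[None, :]).T).item()
print(approx, exact)
```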


Background: Speech Recognition


Outline

I. Background

II. Random Fourier features (RFFs) for acoustic modeling

III. Improved RFFs through feature selection

IV. Nyström method vs. RFFs

V. Concluding remarks


Random Fourier Features for Acoustic Modeling

We use RFFs z(x) in a log-linear model for p(y | x):

Model:   p(y | x; W) = exp(⟨w_y, z(x)⟩ + b_y) / Σ_{y′} exp(⟨w_{y′}, z(x)⟩ + b_{y′})

Here x is the vector of acoustic features, z(x) is its cosine transfer under the random projections, the b_y are bias units, and y ranges over the state labels.

Training:   argmax_W (1/N) Σ_{i=1}^{N} log p(y_i | x_i; W)
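For illustration, a toy version of this log-linear model over RFFs, fit with scikit-learn's logistic regression (an assumption for convenience; the thesis models are trained with SGD at much larger scale):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, m, n_classes = 360, 1000, 50      # toy sizes, not the thesis settings

# Random Fourier features, as above.
W_rff = rng.normal(size=(m, d))
b_rff = rng.uniform(0, 2 * np.pi, size=m)
z = lambda X: np.sqrt(2 / m) * np.cos(X @ W_rff.T + b_rff)

# Toy acoustic frames and state labels.
X = rng.normal(size=(5000, d))
y = rng.integers(0, n_classes, size=5000)

# Multinomial logistic regression is exactly the log-linear model above:
# p(y | x; W) = softmax(<w_y, z(x)> + b_y), fit by maximum likelihood.
clf = LogisticRegression(max_iter=200).fit(z(X), y)
print(clf.predict_proba(z(X[:2])).shape)   # (2, n_classes)
```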


Random Fourier Features for Acoustic Modeling

We use 2 tricks to improve our kernel models:

1. “Linear bottleneck” in output matrix [1].

2. Early stopping using a novel metric for improved WER performance [2].

[1] Low-rank Matrix Factorization for Deep Neural Network Training with High-dimensional Output Targets. T. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, B. Ramabhadran. ICASSP 2013.

[2] A Comparison Between DNNs and Kernel Acoustic Models for Speech Recognition. Z. Lu, D. Guo, A. Garakani, K. Liu, A. May, A. Bellet, L. Fan, M. Collins, B. Kingsbury, M. Picheny, F. Sha. ICASSP 2016.


Task Details

We use 4 speech recognition datasets:

• Bengali and Cantonese from IARPA Babel program (∼24 hours)

• TIMIT benchmark task (∼3.5 hours)

• 50-hour subset of Broadcast News dataset (BN50).

Dataset          Train   # Features   # Classes
Broadcast News   16M     360          5000
Bengali          7.7M    360          1000
Cantonese        7.5M    360          1000
TIMIT            2.3M    440          147


Results: Best DNN vs. Best Kernel

• Kernels: 100k random features (200k for TIMIT).

• DNNs: 4 layers, tanh non-linearity.

                 DNN    Kernel
Broadcast News   16.4   17.1
Bengali          70.2   71.4
Cantonese        67.1   67.1
TIMIT            18.6   18.6

Table 1: Development set results (WER %). Best DNN vs. best kernel.


Results: Role of Number of Random Features

[Figure: Broadcast News dev-set WER (%) as a function of the number of random features m (5k-100k), for the Laplacian and Gaussian kernels.]

Laplacian: k(x, y) = exp(−λ‖x − y‖₁)
Gaussian: k(x, y) = exp(−‖x − y‖²₂ / 2σ²)


Random Fourier Features for Acoustic Modeling

Primary contributions:

1. Effectively scaled RFFs to acoustic modeling, and compared them with DNNs.

2. Matched performance on two of four datasets.

Remaining problems:

1. We need a ton of random features.

2. We still lag well behind DNNs on two datasets.


Outline

I. Background

II. Random Fourier features (RFFs) for acoustic modeling

III. Improved RFFs through feature selection

IV. Nyström method vs. RFFs

V. Concluding remarks


Feature Selection: First try

Goal: Fast online algorithm for feature selection in multi-class setting.

• First try: Use “FOBOS,” Duchi and Singer’s Forward-Backward Splitting algorithm [1], with ℓ1/ℓ2 regularization.

• Problem: Attaining row-sparsity requires extremely strong regularization ⇒ the model is unable to learn.

[1] Efficient Online and Batch Learning Using Forward Backward Splitting. J. Duchi, Y. Singer. JMLR 2009.


Our Proposed Feature Selection Algorithm [1]

Say you want to select a total of 50k features...


And so on, until we build up to 50k random features (a rough sketch of the loop follows below).

[1] Compact Kernel Models for Acoustic Modeling via Random Feature Selection. A. May, M. Collins, D. Hsu, B. Kingsbury. ICASSP 2016.
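The intermediate slides are figures, so the algorithm's details are not in this transcript; the loop below is a rough sketch of the iterative scheme they illustrate, under my assumptions about the specifics (candidate features scored by the ℓ2 norm of their learned weight rows; all names and parameters are hypothetical):

```python
import numpy as np

def select_random_features(X, y, train_fn, n_total=50_000, n_batch=10_000,
                           n_keep=5_000, seed=0):
    """Hypothetical sketch of iterative random feature selection.

    Repeatedly: draw a batch of candidate random projections, train a cheap
    linear model on the selected + candidate features, and keep the
    candidates whose weight rows have the largest l2 norm across classes.
    (Phase offsets b_i are omitted here for brevity.)
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W_sel = np.empty((0, d))                    # projections kept so far
    while W_sel.shape[0] < n_total:
        W_cand = rng.normal(size=(n_batch, d))  # fresh candidate projections
        W_all = np.vstack([W_sel, W_cand])
        Z = np.sqrt(2.0 / len(W_all)) * np.cos(X @ W_all.T)
        coef = train_fn(Z, y)                   # shape (n_classes, n_features)
        scores = np.linalg.norm(coef[:, len(W_sel):], axis=0)
        top = np.argsort(scores)[-n_keep:]      # highest-scoring candidates
        W_sel = np.vstack([W_sel, W_cand[top]])
    return W_sel[:n_total]

# e.g. train_fn = lambda Z, y: LogisticRegression(max_iter=20).fit(Z, y).coef_
```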


Development Set Results: Word Error Rate∗

[Figure: development set WER results.]

∗ All of these models use bottleneck + ERP for LR decay.


Feature Selection: A Deeper Look

Why does the Laplacian kernel benefit so much from feature selection, while the Gaussian kernel does not?

• To approximate the Laplacian kernel, we draw from a Cauchy distribution. Fat-tailed!

• Projections are effectively “sparse”.

• We can construct sparse random projections explicitly → draw k = 5 non-zero weights from a Gaussian; the rest are 0.

• We call this the “Sparse Gaussian kernel” (sketched below).
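A minimal sketch of these sparse projections (assumed construction: exactly k non-zero Gaussian entries per projection, at uniformly random coordinates):

```python
import numpy as np

def sparse_gaussian_projections(m, d, k=5, seed=0):
    """m random projections in R^d, each with exactly k non-zero
    Gaussian-distributed weights (the 'Sparse Gaussian' construction)."""
    rng = np.random.default_rng(seed)
    W = np.zeros((m, d))
    for i in range(m):
        support = rng.choice(d, size=k, replace=False)  # non-zero coordinates
        W[i, support] = rng.normal(size=k)
    return W

W = sparse_gaussian_projections(m=1000, d=360)
# Used exactly like RFF projections: z(x) = sqrt(2/m) * cos(W @ x + b).
```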


Development Set Results: Word Error Rate∗

[Figure: development set WER results, including the Sparse Gaussian kernel.]

∗ All of these models use bottleneck + ERP for LR decay.


Results: Can we use even MORE random features?

            100k   400k
Laplacian   17.7   17.4
 + FS       16.7   16.4

Table 2: Broadcast News development set results (WER %).


Results: Best DNN vs. Best Kernel

                 DNN    Kernel
Broadcast News   11.7   11.6
Bengali          69.1   69.2
Cantonese        63.7   63.2
TIMIT            20.5   20.4

Table 3: Test set results (WER %). Best DNN vs. best kernel.


Results: Best DNN vs. Best Kernel

                 DNN    Kernel
Our Results      20.5   20.4
Prior Work [1]   20.5   21.3
Prior Work [2]   N/A    20.9

Table 4: Test set results on TIMIT (WER %).

[1] Kernel Methods Match Deep Neural Networks on TIMIT. P. S. Huang, H. Avron, T. Sainath, V. Sindhwani, B. Ramabhadran. ICASSP 2014.

[2] Efficient One-vs-One Kernel Ridge Regression for Speech Recognition. J. Chen, L. Wu, K. Audhkhasi, B. Kingsbury, B. Ramabhadran. ICASSP 2016.


Improved RFFs Through Feature Selection

Our contributions:

1. We propose a novel feature selection algorithm, and demonstrate its effectiveness for acoustic modeling.

2. We propose the new “Sparse Gaussian” kernel, which does well at acoustic modeling, both with and without feature selection.

3. We attain WER performance comparable to DNNs on all 4 datasets.


Outline

I. Background

II. Random Fourier features (RFFs) for acoustic modeling

III. Improved RFFs through feature selection

IV. Nyström method vs. RFFs

V. Concluding remarks


Motivation: Nyström vs. RFF

Previous work argues Nyström is better than RFF [1]:

• Better at approximating kernel.

• Better generalization bounds.

• Better empirical performance on regression/classification tasks.

→ Experiments: Use up to 1000 features on UCI data.

Awesome! Let’s try them on our tasks...

• Will Nyström be able to scale to our larger, harder tasks?

• Will it outperform RFF for a fixed computational budget?

[1] Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison. Yang et al. NIPS 2012.


Background: Nyström method

Idea: Use a low-rank approximation of the kernel matrix, K ≈ Zᵀ Z, to generate vectors z(x) for kernel approximation.

K = K K⁻¹ K = (K K^{−1/2})(K^{−1/2} K)

Now, letting K_{x_j} = [k(x₁, x_j), …, k(x_N, x_j)]ᵀ, we have:

k(x_i, x_j) = (K_{x_i}ᵀ K^{−1/2})(K^{−1/2} K_{x_j}) = z(x_i)ᵀ z(x_j),   with z(x_j) = K^{−1/2} K_{x_j}.

Computational cost?
• Pre-processing: O(N³) for the SVD.
• During training, for each data point:
  • O(Nd) to compute K_{x_j}.
  • O(N²) to multiply by K^{−1/2}.


Background: Nyström method

Nyström method: Use the representation z(x) corresponding to the kernel matrix K of a random subset x₁, …, x_m of the training data:

k(x_i, x_j) ≈ z(x_i)ᵀ z(x_j),   z(x_j) = K^{−1/2} K_{x_j},   K_{x_j} = [k(x₁, x_j), …, k(x_m, x_j)]ᵀ

We can visualize this as factoring the full N × N kernel matrix through the m landmarks:

K_full ≈ [K_{x_1}, K_{x_2}, …, K_{x_N}]ᵀ K^{−1/2} · K^{−1/2} [K_{x_1}, K_{x_2}, …, K_{x_N}]

Computing z(x_j) takes time O(md + m²).
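A minimal NumPy sketch of Nyström features with random landmarks (sizes and the eigenvalue clipping are my assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 40))
m = 200

# m random landmark points (k-means centers are the studied alternative).
landmarks = X[rng.choice(len(X), size=m, replace=False)]
K_mm = gaussian_kernel(landmarks, landmarks)

# K^{-1/2} via eigendecomposition, clipping tiny eigenvalues for stability.
vals, vecs = np.linalg.eigh(K_mm)
K_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T

def z(X_query):
    """z(x) = K^{-1/2} K_x: O(md) to form K_x, O(m^2) for the multiply."""
    return gaussian_kernel(X_query, landmarks) @ K_inv_sqrt

print(z(X[:3]).shape)   # (3, m)
```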


Task Details: Nyström vs. RFF

Datasets:

• Classification: TIMIT, Forest, CovType, Cod-RNA, Adult

• Regression: YearPred, Census, CPU

Training Details:

• Use up to 1.6M RFF features, and up to 20k Nyström features.

• Models trained using SGD, without any additional tricks.

• Nyström: choose the “landmark points” randomly, or with k-means [1].

• We use the Gaussian kernel for all experiments.

[1] Improved Nyström Low-rank Approximation and Error Analysis. K. Zhang, I. W. Tsang, J. T. Kwok. ICML 2008.


Results: Kernel Approximation Error

[Figure: TIMIT heldout kernel approximation error (MSE) vs. number of features (10³-10⁶), and vs. memory requirement (# floats), for RFF, Nyström, and Nyström K-means.]


Results: Heldout performance

[Figure: TIMIT heldout kernel approximation error (MSE) and heldout classification error, each vs. number of features (10³-10⁶) and vs. memory requirement (# floats), for RFF, Nyström, and Nyström K-means.]


Results: Heldout performance

Something funky is going on...


Results: Heldout performance

[Figure: TIMIT heldout classification error vs. mean squared kernel approximation error, for RFF, Nyström, and Nyström K-means.]

Results: Other datasets

[Figure: Heldout classification error vs. mean squared approximation error on FOREST, and heldout mean squared error vs. mean squared approximation error on YEARPRED, for RFF, Nyström, and Nyström K-means.]

Kernel Approximation Error: A Second Look

Hypothesis: These differences in heldout classification performance are explained by differences in the way these methods make errors approximating the kernel.


Kernel Approximation Error: A Second Look

k(x, y) ≤ 0.25 (94.6% of data): [figure]


k(x, y) ≤ 0.25 (94.6%)   k(x, y) ≥ 0.25 (5.4%): [figures]


Kernel Approximation Error: A Second Look

Now let’s look at the error approximating k(x, x):

1 = k(x, x) ≈ z(x)ᵀ z(x) = ‖z(x)‖²


Kernel Approximation Error: Summary

• Nyström is generally more accurate than RFF, but has more outliers.

• These outliers are concentrated in pairs of points with high true kernel value k(x, y), and generally correspond to underestimating the true kernel value.

• Nyström generally underestimates k(x, x) as well, often by a large margin.

• It appears these large and biased errors are costly in training models.


Nyström Error Analysis

Can we explain these empirical observations theoretically?


Nyström Error Analysis

Recall ⟨ϕ(x), ϕ(y)⟩_H = k(x, y).

Key Fact: z(x) is a projection of ϕ(x) onto A = span(ϕ(x₁), …, ϕ(x_m)).


Nyström Error Analysis

• Fact: ϕ(x) ∉ A ⇒ k(x, x) > ‖z(x)‖² (checked numerically below).

• Theorem 1: Given a point ϕ(x) ∉ A, for all ε > 0 and y ∈ X, ‖x − y‖ < R(x, ε) ⇒ k(x, y) − ⟨z(x), z(y)⟩ > ε.

• Theorems 2-3: A random x is likely to be far from A.

• Theorem 4: E_{X,Y}[k(X, Y) − ⟨z(X), z(Y)⟩] > 0, assuming E_X[ϕ(X)] ∉ A.
→ Thus, Nyström is a biased estimator of k(x, y).
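A quick numerical check of the Fact, reusing the Nyström sketch from above (sizes arbitrary): for the Gaussian kernel k(x, x) = 1, and since z(x) is a projection of ϕ(x), ‖z(x)‖² can only underestimate it.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
landmarks = X[rng.choice(len(X), size=100, replace=False)]

vals, vecs = np.linalg.eigh(gaussian_kernel(landmarks, landmarks))
K_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
Z = gaussian_kernel(X, landmarks) @ K_inv_sqrt   # Nystrom features z(x)

# ||z(x)||^2 is the squared norm of the projection of phi(x) onto the
# landmark span A, so it can only underestimate k(x, x) = 1.
norms = (Z**2).sum(axis=1)
print(norms.max(), norms.mean())   # max <= 1 (up to rounding); mean well below 1
```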


Outline

I. Background

II. Random Fourier features (RFFs) for acoustic modeling

III. Improved RFFs through feature selection

IV. Nyström method vs. RFFs

V. Concluding remarks


Summary of Contributions

• Scaled kernel approximation methods to large-scale speech recognition problems, benchmarking performance on four datasets.

• Proposed a feature selection algorithm to improve performance, along with a new kernel, matching performance with DNNs across all datasets.

• Compared RFFs to the Nyström method, uncovering some important differences between these two methods, and demonstrating the superior performance of RFFs under a computation budget.


Future Work

Theoretically sound methods for large-scale machine learning.

• Next steps on Nyström vs. RFF work:
  • Combining Nyström and RFF features: best of both worlds?
  • Proving large Nyström errors are costly for performance.

• Convolutional/recurrent kernels?

• Scalable training: How to best leverage parallelization, and modern/future hardware?

• Limitations of kernels, and how to overcome them.

• Alternative methods to deep learning, other than kernels?

• Better theoretical understanding of deep learning.


Thank You!

Questions??
