
Page 1: Use of CUDA for Continuous Space Language Model

Use of CUDA for Continuous Space Language Model

Elizabeth A. Thompson, Ph.D. (a)
Timothy R. Anderson, Ph.D. (b)

(a) Purdue University, Fort Wayne, Fort Wayne, IN, USA 46805
(b) Air Force Research Lab, Wright Patterson Air Force Base, Dayton, OH, USA 45433

Page 2: Use of CUDA for Continuous Space Language Model

Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions

Page 3: Use of CUDA for Continuous Space Language Model

Continuous-Space Language Models (CSLM)

This work was based on the article "Continuous-Space Language Models for Statistical Machine Translation" by Holger Schwenk of the University of Le Mans, France, published in the Prague Bulletin of Mathematical Linguistics, January 2010, and his corresponding open-source implementation.

Page 4: Use of CUDA for Continuous Space Language Model

CSLM (Cont'd)

Page 5: Use of CUDA for Continuous Space Language Model

CSLM (Cont'd)
The CSLM consists of a 3-layer neural network: projection layer, hidden layer, output layer.
Input: a 3-word sequence.
Output: the probability of every word in the vocabulary being the 4th word in the sequence.

Page 6: Use of CUDA for Continuous Space Language Model

Training of the CSLM
The neural network must be trained through a process of adaptive learning.
It is trained using a series of 63,070 4-grams, for example:
• Prague Stock Market falls
• Stock Market falls to
• Market falls to minus
• falls to minus by
In each 4-gram, the last word is the target word.

Page 7: Use of CUDA for Continuous Space Language Model

Training of the CSLM (Cont'd)
The text file vocab.txt contains the list of vocabulary terms.
Each of the 14,024 terms in vocab.txt is assigned a numerical index, which is used for training the neural network:

Index   Term
0       >
1       -
...
619     abandon
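As an illustration of how such an index might be built, here is a minimal C++ sketch (not taken from Schwenk's code; the one-term-per-line layout of vocab.txt is an assumption):

    // Hypothetical sketch: build a term-to-index map from vocab.txt,
    // assuming one term per line and that line order defines the index.
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::string, int> vocab;  // term -> numerical index
        std::ifstream in("vocab.txt");
        std::string term;
        for (int index = 0; std::getline(in, term); ++index)
            vocab[term] = index;                 // e.g. vocab["abandon"] == 619
        std::cout << vocab.size() << " terms loaded\n";  // expect 14024
        return 0;
    }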

Page 8: Use of CUDA for Continuous Space Language Model

Training the Neural Network

In the training stage, values are propagated in the forward direction through the neural network in order to assign weighting values to the input data, and then errors are propagated in the reverse direction to improve these weighting factors.
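In symbols, this is ordinary gradient-descent learning. A generic form of the weight update (the slides do not give Schwenk's exact learning-rate schedule, so the rate η here is a placeholder):

    w \leftarrow w - \eta \, \frac{\partial E}{\partial w}

where E is the error between the network output and the target value.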

Page 9: Use of CUDA for Continuous Space Language Model

Projection Layer
The projection layer maps each of the 3 input words to a unique 256-length sequence.
Initially, these sequences are generated as uniformly distributed random values, but their values change as the neural network is trained.
For each input word, the corresponding 256-length sequence is the output of the projection layer.

Page 10: Use of CUDA for Continuous Space Language Model

Projection Layer (Cont'd)
The projection layer consists of a lookup table, one row of 256 values per vocabulary term:

Index   Values
0       -0.100000    0.009774   ...
1       -0.099803    0.001762   ...
2       -0.091674   -0.081308   ...
3       ...
4       ...
...
14023   -0.079890   -0.067392   ...
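A minimal sketch of the lookup in C++ (illustrative only; names and layout are assumptions, not Schwenk's code). For each 4-gram, the 256-value row for each of the 3 context words is gathered and stacked into one 768-element column of the matrix fed to the hidden layer:

    // Illustrative sketch: gather the lookup-table row for each of the
    // 3 context words and stack them into one 768-element column.
    #include <vector>

    constexpr int kDim = 256;  // values per vocabulary term
    constexpr int kCtx = 3;    // context words per 4-gram

    // table: 14024 x 256 lookup table, row-major.
    // out:   one 768-element column of the projection-layer output.
    void project(const std::vector<float>& table,
                 const int word_idx[kCtx], float out[kCtx * kDim]) {
        for (int w = 0; w < kCtx; ++w)
            for (int d = 0; d < kDim; ++d)
                out[w * kDim + d] = table[word_idx[w] * kDim + d];
    }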

Page 11: Use of CUDA for Continuous Space Language Model

Hidden Layer

For the forward pass, the output of the projection layer is fed as input to the hidden layer:

    D = tanh(M C + B)

where M is the 192x768 weight matrix, C is the 768x128 output of the projection layer, and B is the 192x128 bias matrix.

Page 12: Use of CUDA for Continuous Space Language Model

Output Layer

For the forward pass, the output of the hidden layer is fed as input to the output layer:

    O = V D + K

where V is the 14024x192 weight matrix, D is the 192x128 output of the hidden layer, and K is the 14024x128 bias matrix.

After applying these weights and biases, a softmax normalization is applied.
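Written out, the softmax step is the standard definition, where o_j is the j-th entry of one 14024-element output column of O:

    P(w_j \mid \text{context}) = \frac{e^{o_j}}{\sum_{k=1}^{14024} e^{o_k}}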

Page 13: Use of CUDA for Continuous Space Language Model

Backward Pass for Training
The error of the output compared to the target value is propagated backward through the network.
Weights and biases in the output layer, and then the hidden layer, are updated.
Finally, the projection layer table is updated to reflect the results of the forward pass.

Page 14: Use of CUDA for Continuous Space Language Model

Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions

Page 15: Use of CUDA for Continuous Space Language Model

CUDA for CSLM
The GPU is specialized for compute-intensive, highly parallel computation.
All NVIDIA GPUs can support at least 768 concurrently active threads per multiprocessor.
However, there is an overhead associated with using the GPU.

Page 16: Use of CUDA for Continuous Space Language Model

GPU Overhead
To use the GPU, memory must be allocated on the host CPU as well as on the GPU.
Variables to be used in the computation must be transferred to the GPU.
The computation is then performed on the GPU.
The results must be transferred back to the host CPU.
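These four steps map directly onto the CUDA runtime API. A minimal sketch (illustrative only; scale_kernel is a stand-in for the real computation):

    // Illustrative sketch of the four overhead steps listed above.
    #include <cuda_runtime.h>

    __global__ void scale_kernel(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;                       // the GPU computation
    }

    int main() {
        const int n = 1 << 20;
        float *h_x = new float[n];                  // allocate on host CPU
        for (int i = 0; i < n; ++i) h_x[i] = 1.0f;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));        // allocate on GPU
        cudaMemcpy(d_x, h_x, n * sizeof(float),
                   cudaMemcpyHostToDevice);         // transfer to GPU
        scale_kernel<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);  // compute
        cudaMemcpy(h_x, d_x, n * sizeof(float),
                   cudaMemcpyDeviceToHost);         // transfer back to CPU
        cudaFree(d_x);
        delete[] h_x;
        return 0;
    }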

Page 17: Use of CUDA for Continuous Space Language Model

Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions

Page 18: Use of CUDA for Continuous Space Language Model

CUDA Architecture

[Figure: a GPU consists of streaming multiprocessors, each containing multiple processors (cores).]

Page 19: Use of CUDA for Continuous Space Language Model

CUDA Architecture (Cont'd)
The CUDA programmer defines functions, called kernels.
A kernel is executed as a grid of thread blocks.
The number of threads per block and threads per multiprocessor depends on the compute capability of the CUDA device.
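To make the kernel/grid/block relationship concrete, here is a generic sketch (not from the CSLM code) of a 2D grid of 16x16 thread blocks covering a matrix:

    // Illustrative: each thread handles one matrix element.
    #include <cuda_runtime.h>

    __global__ void mat_add(const float *a, const float *b, float *out,
                            int rows, int cols) {
        int r = blockIdx.y * blockDim.y + threadIdx.y;  // this thread's row
        int c = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's column
        if (r < rows && c < cols)
            out[r * cols + c] = a[r * cols + c] + b[r * cols + c];
    }

    void launch(const float *a, const float *b, float *out,
                int rows, int cols) {
        dim3 block(16, 16);  // 256 threads per block, within the 512-thread
                             // limit of the devices in the results table
        dim3 grid((cols + block.x - 1) / block.x,
                  (rows + block.y - 1) / block.y);
        mat_add<<<grid, block>>>(a, b, out, rows, cols);
    }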

Page 20: Use of CUDA for Continuous Space Language Model

Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions

Page 21: Use of CUDA for Continuous Space Language Model

Implementation of CSLM Using CUDA

The CSLM algorithm is highly computationally intensive and a good candidate for implementation with CUDA.

The matrix multiplications in the hidden and output layers, in both the forward and backward passes, are highly parallel.

Page 22: Use of CUDA for Continuous Space Language Model

CUBLAS Routines for CSLM
CUBLAS is a CUDA implementation of BLAS (Basic Linear Algebra Subprograms), which performs matrix operations.
CUBLAS provides matrix multiplication and handles all the overhead of thread programming: the programmer does not need to define kernels, grids, or thread blocks.

Page 23: Use of CUDA for Continuous Space Language Model

CUBLAS Implementation of CSLM
The matrix operations were replaced with the CUBLAS function cublasSgemm(), which performs the operation

    C = α A B + β C

where A, B, and C are matrices containing single-precision values (floats), and α and β are scalars.
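For example, the hidden layer's M C + B product maps onto a single cublasSgemm() call. A sketch using the CUBLAS v2 API (the slides do not show which API version the implementation used; here the bias matrix is preloaded into the result buffer so that β = 1 accumulates it):

    #include <cublas_v2.h>

    // d_M:  192x768 weight matrix
    // d_C:  768x128 projection-layer output
    // d_MB: in = 192x128 bias matrix B; out = M*C + B (column-major)
    void hidden_forward(cublasHandle_t handle, const float *d_M,
                        const float *d_C, float *d_MB) {
        const float alpha = 1.0f, beta = 1.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    192, 128, 768,        // m, n, k
                    &alpha, d_M, 192,     // A and its leading dimension
                    d_C, 768,             // B and its leading dimension
                    &beta, d_MB, 192);    // C and its leading dimension
        // The tanh nonlinearity is applied to d_MB in a separate step.
    }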

Page 24: Use of CUDA for Continuous Space Language Model

CUBLAS Implementation of CSLM (Cont'd)

Functions from the NVIDIA Performance Primitives (NPP) library were also used:
• nppsExp_32f_I performs an exponential operation "in place" on single-precision values.
• nppsMulC_32f_I performs "in-place" multiplication of a single-precision matrix by a constant.
These functions were used to implement the softmax normalization operations, as sketched below.
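A sketch of how one output column could be normalized with these calls (illustrative; the slides do not show how the column sum is obtained, so cublasSasum is used here as an assumption, which equals the plain sum because the exponentials are positive):

    #include <cublas_v2.h>
    #include <npps.h>

    // Normalize one 14024-element output column d_col in place on the GPU.
    void softmax_column(cublasHandle_t handle, float *d_col, int n) {
        nppsExp_32f_I(d_col, n);                 // o_j -> exp(o_j), in place
        float sum = 0.0f;
        cublasSasum(handle, n, d_col, 1, &sum);  // sum of the exponentials
        nppsMulC_32f_I(1.0f / sum, d_col, n);    // divide by the normalizer
    }

A production version would typically subtract the column maximum before exponentiating for numerical stability; the slides do not say whether the original code does so.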

Page 25: Use of CUDA for Continuous Space Language Model

Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions

Page 26: Use of CUDA for Continuous Space Language Model

CUBLAS CSLM on Various Platforms

Quadro FX 380 LP: compute capability 1.2, 2 multiprocessors, 16 CUDA cores,
512 max threads per block, 1024 max threads per MP.
CPU platform: HP Z200 SFF workstation, 4 Intel Core i3-530 processors, 2.93 GHz.
OS: Fedora 2.6.33.3-85.fc13.x86_64. Execution time per epoch: 3 min.

Quadro FX 2700M: compute capability 1.1, 6 multiprocessors, 48 CUDA cores,
512 max threads per block, 768 max threads per MP.
CPU platform: dual-core Intel T9600, 2.8 GHz.
OS: Scientific Linux 6.0. Execution time per epoch: 2.5 min.

Quadro FX 5800: compute capability 1.3, 30 multiprocessors, 240 CUDA cores,
512 max threads per block, 1024 max threads per MP.
CPU platform: HP Z800 workstation, 12 Intel Xeon X5660 processors, 2.8 GHz.
OS: CentOS Linux 2.6.32-71.29.1.el6.x86_64. Execution time per epoch: 1.33 min.

Page 27: Use of CUDA for Continuous Space Language Model

Comparison of the revised CUDA version (Quadro FX 5800) vs. the original Schwenk algorithm using MKL:

Algorithm                     Time per epoch (sec)
Original Schwenk using MKL    36
CUDA version                  26

Page 28: Use of CUDA for Continuous Space Language Model

Outline
I. CSLM Algorithm
II. Use of CUDA
III. CUDA Architecture
IV. CUDA Implementation of CSLM
V. Results
VI. Conclusions

Page 29: Use of CUDA for Continuous Space Language Model

Conclusions
A framework has been provided to introduce CUDA to the CSLM, and a time savings over the traditional CPU approach has been demonstrated.
The CUBLAS and NPP libraries provide a good starting point for the use of GPUs.
For best performance, avoid redundant uploading and downloading of interim results.

Page 30: Use of CUDA for Continuous Space Language Model

Conclusions (Cont'd)
GPUs provide a substantial performance benefit at relatively low cost, making high-performance computing accessible to the average user.
The availability of GPUs on laptops may make them more appealing and practical than a supercomputer for some applications.

Page 31: Use of CUDA for Continuous Space Language Model

Questions?