

    Computational Science and Engineering (Int. Master's Program)

    Technische Universität München

    Master's thesis in Computational Science and Engineering

    Preconditioning for Hessian-Free Optimization

    Author: Robert Seidl
    1st examiner: Prof. Dr. Thomas Huckle
    2nd examiner: Prof. Dr. Michael Bader
    Thesis handed in on: April 2, 2012

    I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references.

    April 2, 2012                                                  Robert Seidl


    Abstract

    Recently, Martens adapted the Hessian-free optimization method for the training of deep neural networks. One key aspect of this approach is that the Hessian is never computed explicitly; instead, the Conjugate Gradient (CG) algorithm is used to compute the new search direction by applying only matrix-vector products of the Hessian with arbitrary vectors. This can be done efficiently using a variant of the backpropagation algorithm. Recent algorithms use diagonal preconditioners to reduce the number of iterations needed by the CG algorithm; they are used because they are easy to compute and apply. Unfortunately, in later stages of the optimization these diagonal preconditioners are not as well suited for the inner iteration as they are in the earlier stages. This is mostly due to an increased number of elements of the dense Hessian having the same order of magnitude near an optimum.

    We construct a sparse approximate inverse preconditioner (SPAI) that is used to accelerate the inner iteration, especially in the later stages of the optimization. The quality of our preconditioner depends on a predefined sparsity pattern. We exploit knowledge of the pattern of the Gauss-Newton approximation of the Hessian to efficiently construct the pattern needed for our preconditioner, which can then be computed efficiently and fully in parallel using GPUs. This preconditioner is then applied to a deep auto-encoder test case using different update strategies.
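    As a minimal illustration of the matrix-free idea summarized above (a sketch, not Martens' actual implementation), the following Python snippet runs the CG iteration against nothing but a Hessian-vector product callable; the names cg and hvp as well as the toy 2x2 matrix standing in for the Hessian are illustrative assumptions.

        import numpy as np

        def cg(hvp, b, max_iter=50, tol=1e-6):
            """Conjugate Gradient using only matrix-vector products.

            hvp(v) returns H @ v (or the Gauss-Newton product G @ v); the
            matrix itself is never formed explicitly.
            """
            x = np.zeros_like(b)
            r = b - hvp(x)                 # residual of H x = b
            p = r.copy()
            rs = r @ r
            for _ in range(max_iter):
                Hp = hvp(p)
                alpha = rs / (p @ Hp)      # step length along p
                x = x + alpha * p
                r = r - alpha * Hp
                rs_new = r @ r
                if np.sqrt(rs_new) < tol:
                    break
                p = r + (rs_new / rs) * p  # next conjugate direction
                rs = rs_new
            return x

        # Toy usage: solve H d = -g for a Newton-like step, touching H only via products.
        H = np.array([[4.0, 1.0], [1.0, 3.0]])
        g = np.array([1.0, 2.0])
        d = cg(lambda v: H @ v, -g)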


    Contents

    Abstract

    Outline of the Thesis

    I.   Introduction

    II.  Theory

         1. Iterative Solution of Linear Equation Systems
            1.1. Stationary Iterative Methods
                 1.1.1. Jacobi Method
                 1.1.2. Gauss-Seidel Method
            1.2. Nonstationary Iterative Methods
                 1.2.1. Conjugate Gradient Method
                 1.2.2. GMRES Method

         2. Preconditioning
            2.1. Implicit preconditioners
            2.2. Explicit preconditioners

         3. Hessian-free Optimization
            3.1. Truncated Newton Methods
            3.2. Levenberg-Marquardt Method
            3.3. Implemented algorithm
            3.4. Hessian vs Gauss-Newton approximation
            3.5. Inner iteration of the HF method
            3.6. Preconditioning of inner iteration
            3.7. Peculiarities of Martens implementation

         4. Neural networks
            4.1. Feed-forward Networks
            4.2. Deep auto-encoder
            4.3. Network Training
            4.4. Error backpropagation
            4.5. Fast multiplication by the Hessian
            4.6. Hessian and Gauss-Newton Hessian for neural networks

    III. Construction of the SPAI Preconditioner

         5. Efficient pattern finding
            5.1. Gauss-Newton approximation
            5.2. Building the sparsity pattern
            5.3. Summary of the pattern finding method

    IV.  Case study

         6. Training of a deep auto-encoder
            6.1. Overview
            6.2. Pathological Curvature
            6.3. Analysis
            6.4. Results for different sparsity pattern
            6.5. Results for reduced models
            6.6. Results using CG for the inner iteration

    V.   Conclusions and Future Work

         7. Conclusions

         8. Future Work
            8.1. Parallelization of the method
            8.2. Termination of CG
            8.3. Application of a better approximation of the diagonal of the Hessian
            8.4. Using more information for the preconditioner
            8.5. Efficient calculation of particular entries of the Hessian

    Appendix

         A. Pattern finding idea
         B. Sparsity Patterns
         C. Results for different Hessians
         D. CG results

    Bibliography

    Outline of the Thesis

    Part I: Introduction

    This chapter presents an overview of the thesis and its purpose.

    Part II: Theory

    CHAPTER 1: ITERATIVE SOLUTION OF LINEAR EQUATION SYSTEMS
    This chapter presents the basic theory for the iterative solution of linear equation systems. The first part deals with stationary methods like the Jacobi and Gauss-Seidel methods. The second part deals with Krylov subspace methods, which are nonstationary methods that form the basis for the inner iteration of truncated Newton methods. Examples are the Conjugate Gradient method and GMRES.
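    As a concrete example of a stationary method (a minimal Python sketch under the usual diagonal-dominance assumption, not code taken from the thesis), the Jacobi iteration splits A into its diagonal D and off-diagonal remainder R and repeats x ← D⁻¹(b − Rx):

        import numpy as np

        def jacobi(A, b, num_iter=100):
            """Jacobi iteration x_{k+1} = D^{-1} (b - R x_k) for A x = b."""
            D = np.diag(A)                 # vector of diagonal entries
            R = A - np.diagflat(D)         # off-diagonal remainder
            x = np.zeros_like(b)
            for _ in range(num_iter):
                x = (b - R @ x) / D
            return x

        # Converges for this strictly diagonally dominant example.
        A = np.array([[4.0, 1.0], [2.0, 5.0]])
        b = np.array([1.0, 3.0])
        x = jacobi(A, b)                   # close to np.linalg.solve(A, b)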

    CHAPTER 2: PRECONDITIONING
    How fast iterative methods like CG find a solution depends on the spectral properties of the underlying matrix. It is often convenient to transform the equation system into a system that has the same solution but more favorable spectral properties. This is achieved by finding a suitable preconditioner. We will focus on a sparse preconditioner that is an approximation of the inverse of A. The main task is to find an appropriate sparsity pattern of A⁻¹ without explicit knowledge of A⁻¹.
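    One way to make the approximate-inverse idea concrete: minimizing ||AM − I||_F over a prescribed sparsity pattern decouples into one small least-squares problem per column of M, which is also why the construction parallelizes well. The dense Python sketch below is only an illustration of that formulation; the function name spai and the toy pattern are assumptions, not the routine used in the thesis.

        import numpy as np

        def spai(A, pattern):
            """Sparse approximate inverse M ~ A^{-1} for a prescribed column pattern.

            pattern[j] lists the row indices allowed to be nonzero in column j
            of M; each column solves the independent least-squares problem
            min_m || A[:, pattern[j]] @ m - e_j ||_2.
            """
            n = A.shape[0]
            M = np.zeros((n, n))
            for j in range(n):
                cols = pattern[j]
                e_j = np.zeros(n)
                e_j[j] = 1.0
                m, *_ = np.linalg.lstsq(A[:, cols], e_j, rcond=None)
                M[cols, j] = m
            return M

        # Toy usage: a purely diagonal pattern gives a Jacobi-like preconditioner;
        # richer patterns approximate A^{-1} more closely.
        A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
        M = spai(A, pattern=[[0], [1], [2]])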

    CHAPTER 3: HESSIAN-FREE OPTIMIZATION
    This chapter summarizes the Hessian-free or truncated Newton method used for the optimization of nonlinear functions. It presents the basic ideas and discusses peculiarities of Martens' implementation used for the training of deep neural networks. A particular focus lies on the stopping criteria of the inner iteration of the optimization method.
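    To sketch how the damped curvature matrix enters the inner iteration (an illustration for a squared-error loss, not Martens' code; J denotes the Jacobian of the network outputs with respect to the parameters and lam the Levenberg-Marquardt damping, both assumed names), the product (JᵀJ + λI)v can be formed without ever building JᵀJ:

        import numpy as np

        def damped_gauss_newton_product(J, lam):
            """Return v -> (J^T J + lam * I) v without forming J^T J explicitly."""
            def gvp(v):
                return J.T @ (J @ v) + lam * v
            return gvp

        # Toy usage: hand the product to a matrix-free CG solver to obtain a search direction.
        J = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]])
        gvp = damped_gauss_newton_product(J, lam=1.0)
        print(gvp(np.array([1.0, -1.0])))   # equals (J.T @ J + np.eye(2)) @ [1, -1]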

    CHAPTER 4: NEURAL NETWORKS
    This chapter presents the basics of neural networks that are needed to understand the theory behind the case study. It deals with the training of neural networks, the backpropagation algorithm used to compute the gradient information needed for training, and a variant known as the Pearlmutter trick, used to efficiently compute the matrix-vector product Hv of the Hessian with an arbitrary vector v.
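    The chapter derives the exact R-operator; as a simpler stand-in that makes the same point (Hv costs only a couple of gradient evaluations, no explicit Hessian), one can approximate Hv by a central difference of gradients. The names grad_fn, theta, and eps below are illustrative assumptions, and this approximation is mainly useful for checking an R-operator implementation.

        import numpy as np

        def hv_finite_difference(grad_fn, theta, v, eps=1e-5):
            """Approximate H v by a central difference of gradients; a stand-in
            for the exact Pearlmutter trick described in the text."""
            return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

        # Toy check on f(theta) = 0.5 * theta^T A theta, whose gradient is A theta
        # and whose Hessian is A.
        A = np.array([[2.0, 1.0], [1.0, 3.0]])
        grad_fn = lambda th: A @ th
        print(hv_finite_difference(grad_fn, np.array([1.0, -1.0]), np.array([0.0, 1.0])))  # ~ A @ [0, 1]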


    Part III: Construction of the SPAI Preconditioner

    This part describes the straightforward approach of Benzi to compute a feasible a priori sparsity pattern for a sparse approximate inverse preconditioner.
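    In that spirit, an a priori pattern can be sketched by thresholding: drop entries of A that are small in magnitude and take the nonzero positions of the sparsified matrix (or of one of its powers) as the allowed pattern for the approximate inverse. The Python sketch below is only a rough illustration of this kind of procedure; the parameters tau and power are assumptions chosen for the example.

        import numpy as np

        def threshold_pattern(A, tau=0.1, power=1):
            """Column-wise a priori pattern from a sparsified A (or a power of it)."""
            S = (np.abs(A) > tau * np.abs(A).max()).astype(int)
            np.fill_diagonal(S, 1)                      # always keep the diagonal
            P = np.linalg.matrix_power(S, power) > 0    # pattern of S^power
            return [list(np.nonzero(P[:, j])[0]) for j in range(A.shape[0])]

        # pattern[j] lists the rows allowed to be nonzero in column j of the
        # approximate inverse and can be fed to a SPAI routine as sketched earlier.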

    CHAPTER 5: PATTERN FINDING
    Because of the sparsification costs of the above method, it is not economical to apply it to find a suitable pattern for our preconditioner. This chapter presents an approach to come up with a suitable pattern by exploiting the known pattern of the Gauss-Newton Hessian, which is usually used as a substitute for the Hessian in the context of the Hessian-free optimization method.

    Part IV: Case study

    CHAPTER 6: TRAINING OF A DEEP AUTO-ENCODER
    In this case study, Martens' Matlab implementation of the Hessian-free optimization method is used to train a deep auto-encoder. This task is known to be hard for algorithms based only on gradient information, like the backpropagation algorithm, especially without pretraining. Then the changing structure of the Gauss-Newton Hessian in the different stages of the optimization is investigated, and results of applying different preconditioners to representatives of the different stages are presented. Furthermore, the final results of applying the SPAI preconditioner, using different update strategies in HF with GMRES or CG for the inner iteration and different stopping criteria based on the value of the quadratic approximation, are discussed.

    Part V: Conclusions and Future Work

    CHAPTER 7: CONCLUSIONS
    This chapter summarizes the thesis.

    CHAPTER 8: FUTURE WORK
    This chapter presents some possible topics for future work, such as studies dealing with the speedup of the computation of SPAI preconditioners on a graphics card, or a possibly more efficient way to compute the needed entries of the Hessian for the computation of our preconditioner.


    Part I.

    Introduction


    Introduction

    Recently there are a lot of algorithms
