yunshu_informationgeometry

Upload: semiring

Post on 02-Jun-2018

219 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/11/2019 Yunshu_InformationGeometry

    1/79

    Introduction to Information Geometry

    based on the book Methods of Information Geometry written by

    Shun-Ichi Amari and Hiroshi Nagaoka

    Yunshu Liu

    2012-02-17

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    2/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Outline

    1 Introduction to differential geometry

    Manifold and Submanifold

    Tangent vector, Tangent space and Vector field

    Riemannian metric and Affine connection

    Flatness and autoparallel

    2 Geometric structure of statistical models and statistical inference

    The Fisher metric and-connection

    Exponential familyDivergence and Geometric statistical inference

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 2 / 79

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    3/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Part I

    Introduction to differential geometry

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 3 / 79

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    4/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Basic concepts in differential geometry

    Basic concepts

    Manifold and SubmanifoldTangent vector, Tangent space and Vector field

    Riemannian metric and Affine connection

    Flatness and autoparallel

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 4 / 79

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    5/79

    I t d ti t diff ti l t G t i t t f t ti ti l d l d t ti ti l i f

  • 8/11/2019 Yunshu_InformationGeometry

    6/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Manifold

    Manifold S

    Definition: LetSbe a set, if there exists a set of coordinate systems A forSwhich satisfies the condition (1) and (2) below, we callSan n-dimensional

    C differentiable manifold.

    (1) Each element ofA is a one-to-one mapping from Sto some opensubset ofRn.

    (2) For all A, given any one-to-one mapping fromSto Rn, thefollowing hold:

    A 1

    is aC

    diffeomorphism.

    Here, by aC diffeomorphism we mean that 1 and its inverse 1are bothC(infinitely many times differentiable).

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 6 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    7/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Examples of Manifold

    Examples of one-dimensional manifold

    A straight line: a manifold in R1, even if it is given in Rk for k 2.Any open subset of a straight line: one-dimensional manifoldA closed subset of a straight line: not a manifold

    A circle: locally the circle looks like a line

    Any open subset of a circle: one-dimensional manifold

    A closed subset of a circle: not a manifold.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 7 / 79

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    8/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

  • 8/11/2019 Yunshu_InformationGeometry

    9/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Examples of Manifold: surface of a torus

    The torus in R3(surface of a doughnut):

    x(u, v) = ((a+b cos u)cos v, (a+b cos u)sin v, b sin u), 0 u, v< 2.wherea is the distance from the center of the tube to the center of the torus,

    andb is the radius of the tube. A torus is a closed surface defined as product

    of two circles: T2 =S1 S1.n-torus: T

    n

    is defined as a product ofn circles: T

    n

    = S

    1

    S1

    S1

    .

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 9 / 79

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    10/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

  • 8/11/2019 Yunshu_InformationGeometry

    11/79

    a g y a a a a a

    Coordinate systems for manifold

    Parametrization of unit hemisphere

    Parametrization: map of unit hemisphere intoR2

    (2) by stereographic projections.

    x(u, v) = ( 2u

    1+u2 +v2,

    2v

    1+u2 +v2,1 u2 v21+u2 +v2

    ) where u2 +v2 1 (2)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 11 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    12/79

    g y

    Examples of Manifold: colors

    Parametrization of color models

    3 channel color models: RGB, CMYK, LAB, HSV and so on.

    The RGB color model: an additive color model in which red, green, and

    blue light are added together in various ways to reproduce a broad array

    of colors.

    The Lab color model: three coordinates of Lab represent the lightness ofthe color(L), its position between red and green(a) and its position

    between yellow and blue(b).

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 12 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    13/79

    Coordinate systems for manifold

    Parametrization of color modelsParametrization: map of color into R3

    Examples:

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 13 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    14/79

    Submanifolds

    Submanifolds

    Definition: a submanifold M of a manifold S is a subset of S which itself hasthe structure of a manifold

    An open subset of n-dimensional manifold forms an n-dimensional

    submanifold.

    One way to construct m(

  • 8/11/2019 Yunshu_InformationGeometry

    15/79

    Submanifolds

    Examples: color models

    3-dimensional submanifold: any open subset;

    2-dimensional submanifold: fix one coordinate;

    1-dimensional submanifold: fix two coordinates.

    Note: In Lab color model, we set a and b to 0 and change L from 0 to100(from black to white), then we get a 1-dimensional submanifold.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 15 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    16/79

    Basic concepts in differential geometry

    Basic concepts

    Manifold and SubmanifoldTangent vector, Tangent space and Vector field

    Riemannian metric and Affine connection

    Flatness and autoparallel

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 16 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    17/79

    Curves and Tangent vector of Curves

    Curves

    Curve: ISfrom some interval I( R)to S.Examples: curve on sphere, set of probability distribution, set of linear

    systems.

    Using coordinate system {i} to express the point(t)on the curve(where t I): i(t) =i((t)), then we get (t) = [1(t), , n(t)].

    CCurves

    C: infinitely many times differentiable(sufficiently smooth).If(t)is C fortI, we callaC on manifold S.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 17 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    18/79

    Tangent vector of Curves

    A tangent vector is a vector that is tangent to a curve or surface at a given

    point.When S is an open subset ofRn, the range ofis contained within a singlelinear space, hence we consider the standard derivative:

    (a) = limh0

    (a+h)

    (a)

    h(3)

    In general, however, this is not true, ex: the range ofin a color modelThus we use a more general derivative instead:

    (a) =

    ni=1

    i(a)( i

    )p (4)

    wherei(t) =i (t), i(a) = ddt

    i(t)|t=aand( i )p is anoperatorwhichmapsf

    ( f

    i)p

    for given functionf :S

    R.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 18 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    19/79

    Tangent space

    Tangent space

    Tangent space atp: ahyperplane Tpcontaining all the tangents of curvespassing through the pointp S. (dim Tp(S)= dimS)

    Tp(S) ={n

    i=1

    ci(

    i)p|[c1, , cn] Rn}

    Examples: hemisphere and color

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 19 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    20/79

    Vector fields

    Vector fields

    Vector fields: a map from each point in a manifold S to a tangent vector.

    Consider a coordinate system {i} for a n-dimensional manifold, clearlyi =

    i

    are vector fields for i =1, , n.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 20 / 79

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    21/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

  • 8/11/2019 Yunshu_InformationGeometry

    22/79

    Riemannian Metrics

    Riemannian Metrics: an inner product of two tangent vectors(Dand D

    Tp(S)) which satisfy D,D p R, and the following condition hold:

    Linearity: aD+bD

    ,D

    p =aD,D

    p+ bD

    ,D

    pSymmetry: D,Dp =D ,DpPositive definiteness: If D=0then D,Dp >0

    The components

    {gij

    }of a Riemannian metric g w.r.t.the coordinate system

    {i} are defined bygij= i, j, wherei = i .

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 22 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    23/79

    Riemannian Metrics

    Examples of inner product:

    ForX= (x1, ,xn)and Y= (y1, ,yn), we can define inner productas X, Y1=X Y=

    n

    i=1xiyi, or X, Y2 =YMX, where M is anysymmetry positive-definite matrix.

    For random variables X and Y, the expected value of their product:

    X, Y= E(XY)For square real matrix,

    A,B

    = tr(ABT)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 23 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    24/79

    Riemannian Metrics

    For unit sphere:

    x(, ) = (sin cos , sin sin, cos), 0< < , 0

  • 8/11/2019 Yunshu_InformationGeometry

    25/79

    Affine connection

    Parallel translation along curves

    Let: [a, b]Sbe a curve inS,X(t)be a vector field mapping each point(t)to a tangent vector, if for all t[a, b]and the corresponding infinitesimaldt, the corresponding tangent vectors are linearly related, that is to say there

    exist a linear mappingp,p , such thatX(t+dt)=p,p (X(t))for t[a, b],we say X isparallel along, and call theparallel translation along.Linear mapping: additivity and scalar multiplication.

    Figure :Translation of a tangent vectoralonga curveYunshu Liu (ASPITRG) Introduction to Information Geometry 25 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    26/79

    Affine connection

    Affine connection: relationships between tangent space at different points.Recall:

    Natural basis of the coordinate system[i]:(i)p=( i

    )p: an operator

    which mapsf ( f i

    )pfor given functionf :SRat p.

    Tangent space:

    Tp(S) ={n

    i=1

    ci(

    i)p|[c1, , cn] Rn}

    Tangent vector(elements in Tangent space) can be represented as linearcombinations ofi.

    Tangent spaceTp Tangent vectorXp Natural basis(i)p=( i )p

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 26 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    27/79

    Affine connection

    If the difference between the coordinates of p andp

    are very small, that wecan ignore the second-order infinitesimals(di)(dj), wheredi =i(p

    ) i(p), then we can express difference betweenp,p

    ((j)p)and

    ((j)p )as a linear combination of{d1, , dn}:

    p,p ((j)p) = (j)p

    i,k

    (di(kij)p(k)p ) (7)

    where {(kij)p; i,j, k=1, , n} aren3 numbers which depend on the point p.FromX(t) =

    ni=1X

    i

    (t)(i)p andX(t+dt) =n

    i=1(Xi

    (t+dt)(i)p

    ), wehavep,p (X(t)) =

    i,j,k

    ({Xk(t) dti(t)Xj(t)(kij)p}(k)p ) (8)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 27 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    28/79

    Affine connection

    p,p

    ((j)p

    ) = (j)p

    i,k(di(k

    ij)p

    (k)p )

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 28 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    29/79

    Connection coefficients(Christoffels symbols):(kij)p

    Given a connection on the manifold S, the value of(kij)pare different fordifferent coordinate systems, it shows how tangent vectors changes on a

    manifold, thus shows how basis vectors changes.

    Inp,p ((j)p) = (j)p

    i,k

    (di(kij)p(k)p )

    if we letkij =0 fori,j, k=x,y, we will have

    p,p ((j)p) = (j)p

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 29 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    k

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    30/79

    Connection coefficients(Christoffels symbols):(kij)p

    Given a connection on a manifold S,ki,jdepend on coordinate system. Define

    a connection which makeski,jto be zero in one coordinate system, we willget non-zero connection coefficients in some other coordinate systems.

    Example: If it is desired to let the connection coefficients for Cartesian

    Coordinates of a 2D flat plane to be zero, kij

    =0 fori,j, k=x,y, we can

    calculate the connection coefficients for Polar Coordinates: r= r=

    1r

    ,

    r=r, andkij=0 for all others.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 30 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    k

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    31/79

    Connection coefficients(Christoffels symbols):(kij)p

    Example(cont.):

    Now if we want to let the connection coefficients for Polar Coordinates to be

    zero,kij=0 fori,j, k=r, , we can calculate the connection coefficients for

    Polar Coordinates:xxx = sin2 cos

    r ,yxx=

    sin(1+cos2 )r

    ,

    xxy

    = xyx

    = sin3

    r ,y

    xy= y

    yx= cos

    3

    r ,x

    yy= cos(1+sin

    2 )

    r , and

    yyy= sin cos2

    r .

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 31 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    32/79

    Affine connection

    Covariant derivative along curves

    Derivative: dX(t)

    dt =limdt0

    X(t+dt)X(t)dt

    , what ifX(t)and X(t+dt)lie indifferent tangent spaces? Xt(t+dt) = (t+dt),(t)(X(t+dt))

    X(t) = Xt(t+dt) X(t) = (t+dt),(t)(X(t+dt)) X(t)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 32 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Affi i

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    33/79

    Affine connection

    Covariant derivative along curves

    We call X(t)

    dt the covariant derivative ofX(t):

    X(t)dt

    = limdt0

    Xt(t+dt) X(t)dt

    = (t+dt),(t)(X(t+dt)) X(t)dt

    (9)

    (t+dt)(X(t+dt)) =i,j,k

    ({Xk(t+dt) +dti(t)Xj(t)(kij)(t)}(k)(t))(10)

    X(t)dt

    =i,j,k

    ({ Xk(t) + i(t)Xj(t)(kij)(t)}(k)(t)) (11)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 33 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Affi ti

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    34/79

    Affine connection

    Covariant derivative of any two tangent vector

    Covariant derivative ofYw.r.t. X, whereX=n

    i=1(Xii)and

    Y=n

    i=1(Yii):

    XY= i,j,k

    (Xi{iYk +Yjkij}k) (12)

    i j =n

    k=1

    kijk (13)

    Note:(XY)p =Xp Y Tp(S)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 34 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    E l f Affi ti

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    35/79

    Examples of Affine connection

    metric connection

    Definition: If for all vector fieldsX, Y,Z T(S),

    ZX, Y=ZX, Y + X, ZY.

    whereZ

    X, Y

    denotes the derivative of the function

    X, Y

    along this vector

    field Z, we say that is a metric connection w.r.t. g.Equivalent condition: for all basisi, j, k T(S),

    ki, j=ki, j + i, ki.

    Property: parallel translation on a metric connection preserves inner products,

    which means parallel transport is an isometry.

    (D1), (D2)q =D1,D2p.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 35 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Examples of Affine connection

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    36/79

    Examples of Affine connection

    Levi-Civita connection

    For a given connection, whenkij= k

    jihold for all i, j and k, we call it a

    symmetric connection or torsion-free connection.

    From i j=n

    k=1kijk, we know for a symmetric connection:

    i j =j iIf a connection is both metric and symmetric, we call it the Riemannian

    connection or the Levi-Civita connection w.r.t. g.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 36 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Basic concepts in differential geometry

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    37/79

    Basic concepts in differential geometry

    Basic concepts

    Manifold and Submanifold

    Tangent vector, Tangent space and Vector field

    Riemannian metric and Affine connection

    Flatness and autoparallel

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 37 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Flatness

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    38/79

    Flatness

    Affine coordinate systemLet {i} be a coordinate system forS, we call {i} an affine coordinate systemfor the connection if then basis vector fieldsi = i are all parallel onS.Equivalent conditions for a coordinate system to be an affine coordinate

    system:

    i j=n

    k=1(kijk) =0 for all i and j (14)

    kij =0 for all i,j and k (15)

    FlatnessS is flatw.r.t the connection : an affine coordinate system exist for theconnection .

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 38 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Flatness

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    39/79

    Flatness

    Examples:

    i j=n

    k=1(kijk) =0 for all i and j

    kij =0 for all i,j and k

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 39 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Flatness

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    40/79

    Flatness

    Curvature R and torsion T of a connection

    R(i, j)k=

    l

    (Rlijkl)and T(i, j) =

    k

    (Tkijk) (16)

    whereR

    l

    ijkandT

    k

    ijcan be computed in the following way:

    Rlijk=il

    jk jlik+ lihhjk ljhhik (17)Tkij =

    kij kji (18)

    If a connection is flat, then T=R=0;If T= 0,kij=0 for all i,j and k, we get the symmetry connnection.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 40 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Flatness

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    41/79

    Flatness

    Curvature

    CurvatureR = 0 iff parallel translation does not depend on curve choice.Curvature is independent of coordinate system, under Riemannian

    connection, we can calculate:

    Curvature of 2 dimensional plane: R= 0;

    Curvature of 3 dimensional sphere: R=

    2

    r2 .

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 41 / 79

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    42/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Autoparallel submanifold

  • 8/11/2019 Yunshu_InformationGeometry

    43/79

    Autoparallel submanifold

    GeodesicsGeodesics(autoparallel curves): A curve with tangent vector transported by

    parallel translation.

    Examples under Riemannian connection:

    2 dimensional flat plane: straight line

    3 dimensional sphere: great circle

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 43 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Autoparallel submanifold

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    44/79

    p

    Geodesics

    The geodesics with respect to the Riemannian connection are known tocoincide with the shortest curve joining two points.

    Shortest curve: curve with the shortest length.

    Length of a curve : [a, b]S:

    =

    b

    a

    ddt

    dt=

    b

    a

    gijijdt (22)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 44 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Part II

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    45/79

    Geometric structure of statistical models and statistical

    inference

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 45 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Motivation

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    46/79

    Motivation

    Consider the set of probability distributions as a manifold.

    Analysis the relationship between the geometric structure of the manifold andstatistical estimation.

    Introduce concepts like metric, affine connection on statistical models and

    studying quantities such as distance, the tangent space (which provides linear

    approximations), geodesics and the curvature of a manifold.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 46 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Statistical models

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    47/79

    Statistical models

    P(X) ={p:X R | p(x)> 0(x X),

    p(x)dx= 1} (23)

    Example Normal Distribution:

    X = R, n= 2, = [, ], ={[, ]| <

  • 8/11/2019 Yunshu_InformationGeometry

    48/79

    Basic concepts

    The Fisher metric and-connectionExponential familyDivergence and Geometric statistical inference

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 48 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    The Fisher information matrix

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    49/79

    Fisher information matrixG() = [gi,j()], and

    gi,j() =E[ij] =

    i(x; )j(x; )p(x; )dx

    where =(x; ) =log p(x; )and E denotes the expectation w.r.t. thedistributionp .

    Motivation:

    Sufficient statistic and Cramer-Rao bound

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 49 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    The Fisher information matrix

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    50/79

    Sufficient statistic

    Sufficient statistic: forY=F(X), given the distributionp(x; )ofX, we havep(x; ) = q(F(x); )r(x; ), ifr(x; )does not depend onfor allx, we saythatFis a sufficient statistic for the modelS. Then we can write

    p(x; ) = q(y; )r(x).A sufficient statistic is a function whose value contains all the information

    needed to compute any estimate of the parameter (e.g. a maximum likelihood

    estimate).

    Fisher information matrix and sufficient statistic

    LetG()be the Fisher information matrix ofS=p(x; ), andGF()be theFisher information matrix of the induced modelSF=q(y; ), then we haveGF() G()in the sense thatG() =GF() G()is positivesemidefinite.G() =0 iff. F is a sufficient statistic for S.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 50 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Cramer-Rao inequality

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    51/79

    Cramer-Rao inequality

    The variance of any unbiased estimator is at least as high as the inverse of the

    Fisher information.

    Unbiased estimator : E[(X)] =

    The variance-covariance matrixV[] = [vij ]where

    vij =E[(

    i(X) i)(j(X) j)]

    Thus Cramer-Rao inequality state thatV

    [] G()1, and an unbiased

    estimator satisfyingV[] =G()1 is called an efficient estimator.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 51 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    -connection

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    52/79

    -connection

    LetS={p} be an n-dimensional model, and consider the function()ij,kwhich maps each pointto the following value:

    (()ij,k) =E[(ij+

    1

    2 ij)(k)] (24)

    where is an arbitrary real number. We defined an affine connection ()which satisfy:

    ()

    i j, k

    =

    ()

    ij,k (25)

    whereg =, is the Fisher metric. We call () the-connection

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 52 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    -connection

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    53/79

    Properties of-connection

    -connection is a symmetric connection

    Relationship between-connection and-connection:

    ()ij,k =

    ()ij,k +

    2

    E[ijk]

    The 0-connection is the Riemannian connection with respect to the

    Fisher metric.

    ()

    ij,k =

    (0)

    ij,k+

    2 E[ijk]

    () = 1+2

    (1) + 1 2

    (1)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 53 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Geometric structure of statistical models and statistical inference

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    54/79

    Basic concepts

    The Fisher metric and-connectionExponential family

    Divergence and Geometric statistical inference

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 54 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Exponential family

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    55/79

    Exponential family

    p(x; ) =exp[C(x) +

    ni=1

    iFi(x) ()]

    [i]are called the natural parameters(coordinates), and is the potential

    function for[i], which can be calculated as

    () =log

    exp[C(x) +

    ni=1

    iFi(x)]dx

    The exponential families include many of the most common distributions,including the normal, exponential, gamma, beta, Dirichlet, Bernoulli,

    binomial, multinomial, Poisson, and so on.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 55 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Exponential family

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    56/79

    Exponential family

    Examples: Normal Distribution

    p(x; , ) = 1

    2e

    (x)2

    22 (26)

    whereC(x) =0, F1(x) =x, F2(x) =x2, and1 =

    2, 2 = 1

    22are the

    natural parameters, the potential function is :

    =(1

    )2

    42 +12

    log( 2 ) = 2

    22 +log(2) (27)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 56 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Mixture family

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    57/79

    Mixture family

    p(x; ) =C(x) +n

    i=1

    iFi(x)

    In this case we say thatSis a mixture family and[i]are called the mixtureparameters.

    e-connection and m-connection

    The natural parameters of exponential family form a 1-affine coordinate

    system((1)ij,k=0), which means the connection is 1-flat, we call the

    connection (1) the e-connection, and call exponential family e-flat.The mixture parameters of mixture family form a (-1)-affine coordinate

    system((1)ij,k =0), which means the connection is (-1)-flat, and we call the

    connection (1) the m-connection and call mixture family m-flat.Yunshu Liu (ASPITRG) Introduction to Information Geometry 57 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Dual connection

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    58/79

    Dual connection

    Definition: Let S be a manifold on which there is given a Riemannian metric

    g and two affine connection and . If for all vector fieldsX, Y,Z T(S),

    Z=+ < X, ZY> (28)

    hold, we say that and are duals of each other w.r.t. g and call one thedual connection of the other.

    Additional, we call the triple (g,

    ,) a dualistic structure on S.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 58 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Dual connection

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    59/79

    Properties

    For any statistical model, the-connection and the ()-connection aredual with respect to the Fisher metric.

    (D1), (D2)q=D1,D2p.whereand

    are parallel translation alongw.r.t. and .

    R= 0

    R =0

    whereR and R are the curvature tensors of and .

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 59 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Dually flat spaces and dual coordinate system

    http://goforward/http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    60/79

    Dually flat spaces

    Let(g, , )be a dualistic structure on a manifoldS, then we haveR=0 R =0, and if the connnection and are bothsymmetric(T=T =0), then we see that -flatness and -flatness areequivalent.We call(S, g, , )a dually flat space if both duals and are flat.Examples: Since-connections and -connections are dual w.r.t. Fishermetric and-connections are symmetry, we have for any statistical model Sand for any real number

    S is flat S is() flat (29)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 60 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Dually flat spaces and dual coordinate system

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    61/79

    Dual coordinate system

    For a particular -affine coordinate system[i], if we choose a corresponding-affine coordinate system[j]such that

    g=i, j=jiwherei =

    i

    andj = j

    .

    Then we say the two coordinate systems mutually dual w.r.t. metric g, and

    call one the dual coordinate system of the other.

    Existence of dual coordinate system

    A pair of dual coordinate system exist if and only if(S, g, , )is a duallyflat space.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 61 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Legendre transformations

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    62/79

    Consider mutually dual coordinate system[i]and [i]with functions:S R and : S R satisfy the following equations:

    i=i

    i= i

    gi,j =ij =ji =ij

    () =max{ii ()}

    () =max

    {i

    i ()}

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 62 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Legendre transformations

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    63/79

    Geometric interpretation forf(p) =maxx(px f(x)):A convex functionf(x)is shown in red, andthe tangent line at point(x0,f(x0))is shownin blue. The tangent line intersects the

    vertical axis at(0, f)and f is the value ofthe Legendre transformf(p0), wherep0 =f(x0). Note that for any other point onthe red curve, a line drawn through that point

    with the same slope as the blue line will have

    a y-intercept above the point(0, f),showing that is indeed a maximum.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 63 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Examples of Legendre transformations

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    64/79

    Examples:

    The Legendre transform off(x) = 1p|x|p (where 1

  • 8/11/2019 Yunshu_InformationGeometry

    65/79

    For distributionp(x; ) =exp[C(x) +n

    i=1iFi(x) ()],[i]are

    called the natural parameters

    If we definei =E[Fi] =

    Fi(x)p(x; )dx, we can verify[i]is a(-1)-affine coordinate system dual to[i], we call this[i]the expectationparameters or the dual parameters.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 65 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    The natural parameter and dual parameter of Exponential family

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    66/79

    Recall: Normal Distribution

    p(x; , ) = 12

    e (x)2

    22 (30)

    whereC(x) =0, F1(x) =x, F2(x) =x2, and1 =

    2, 2 = 1

    22are the

    natural parameters, the potential function is :

    =(1)2

    42 +

    1

    2log(

    2) =

    2

    22+log(

    2) (31)

    The dual parameter are calculated as1= 1

    = =

    1

    22,

    2 = 2

    =2 +2 = (1)2224(2)2

    , It has potential function:

    =12

    (1+log( 2

    )) =12

    (1+log(2)) +2log) (32)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 66 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Geometric structure of statistical models and statistical inference

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    67/79

    Basic concepts

    The Fisher metric and-connectionExponential family

    Divergence and Geometric statistical inference

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 67 / 79

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    68/79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Divergence, semimetrics and metrics

  • 8/11/2019 Yunshu_InformationGeometry

    69/79

    A distance satisfying positive-definiteness, symmetry and triangle

    inequality is called ametric;

    A distance satisfying positive-definiteness and symmetry is calledsemimetrics;

    A distance satisfying only positive-definiteness is called adivergence.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 69 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Kullback-Leibler divergence

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    70/79

    Discrete random variables p and q:

    DKL(pq) =

    i

    p(x)logp(x)q(x)

    (34)

    Continuous random variables p and q:

    DKL(pq) =

    p(x)logp(x)q(x)

    dx (35)

    Generally , we use Kullback-Leibler divergence to measurethe difference

    between two probability distributions p and q. KL measures the expected

    number of extra bits required to code samples from p when using a codebased on q, rather than using a code based on p. Typically p represents the

    true distribution of data, observations, or a precisely calculated theoretical

    distribution. The measure q typically represents a theory, model, description,

    or approximation of p.Yunshu Liu (ASPITRG) Introduction to Information Geometry 70 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Bregman divergence

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    71/79

    Bregman divergence associatedwithFfor pointsp, q is :BF(xy) =F(y) F(x) (y x), F(x),where F(x) is a convex function

    defined on a closed convex set

    .

    Examples:F(x) =x2, thenBF(xy) =x y2.More generally, ifF(x) = 1

    2xTAx, thenBF(xy) = 12 (x y)TA(x y).

    KL divergence:ifF=

    ix logx

    x, we get KullbackLeibler divergence.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 71 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Canonical divergence

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    72/79

    Canonical divergence(a divergence for dually flat space)

    Let(S, g, , )be a dually flat space, and {[i], [j]} be mutually dual affinecoordinate systems with potentials {, }, then the canonicaldivergence((g, ) divergence) is defined as:

    D(pq) =(p) +(q) i(p)j(q) (36)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 72 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Canonical divergence

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    73/79

    Properties:

    Relation between(g, ) divergenceand(g, ) divergence:D(pq) =D(qp)If M is a autoparallel submanifold w.r.t. either or , then the(gM,

    M)-divergenceDM= D

    |MMis given byDM(p

    q) =D(p

    q)

    If is a Riemannian connection(=) which is flat on S, there exista coordinate system which is self-dual(i =i), then== 1

    2

    i(

    i)2, then the canonical divergence is

    D(pq) = 1

    2{d(p, q)}2

    whered(p, q) =

    i{i(p) i(q)}2

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 73 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Canonical divergence

    http://find/http://goback/
  • 8/11/2019 Yunshu_InformationGeometry

    74/79

    Triangular relation

    Let {[i], [i]} be mutually dual affine coordinate systems of a dually flatspace(S, g,

    ,

    ), and let D be a divergence on S. Then a necessary and

    sufficient condition for D to be the(g, )-divergence is that for all p, q, rSthe following triangular relation holds:

    D(pq) +D(qr) D(pr) ={i(p) i(q)}{i(p) i(q)} (37)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 74 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Canonical divergence

    Pythagorean relation

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    75/79

    Pythagorean relation

    Let p, q, and r be three points in S. Let 1 be the

    -geodesic connecting p and

    q, and let2be the -geodesic connecting q and r. If at the intersection q thecurve1 and2are orthogonal(with respect to the inner product g), then wehave the following Pythagorean relation.

    D(p

    r) =

    D(p

    q) +

    D(

    q

    r)

    (38)

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 75 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Canonical divergence

    Projection theorem

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    76/79

    Projection theorem

    Let p be a point in S and let M be a submanifold of S which is

    -autoparallel. Then a necessary and sufficient condition for a point q in Mto satisfy

    D(pq) =minrMD(pr) (39)is for the

    -geodesic connecting p and q to be orthogonal to M at q.

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 76 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Canonical divergence

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    77/79

    Examples

    From the definition of exponential family and mixture family, the product of

    exponential family are still exponential family, the sum of mixture family are

    still mixture family.

    e-flat submanifold: set of all product distributions:

    E0 ={pX|pX(x1, ,xN) =N

    i=1pXi (xi)}m-flat submanifold: set of joint distributions with given marginals:

    M0 =

    {pX

    |X\ipX(x) =qi(xi)

    i

    {1,

    ,N

    }}

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 77 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Canonical divergence

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    78/79

    Examples

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 78 / 79

    Introduction to differential geometry Geometric structure of statistical models and statistical inference

    Thanks!

    http://find/
  • 8/11/2019 Yunshu_InformationGeometry

    79/79

    Thanks!

    Question?

    Yunshu Liu (ASPITRG) Introduction to Information Geometry 79 / 79

    http://find/